# Overview of Recipe Recommendation System

This project involves building a recipe recommendation system that suggests recipes to users based on their preferences, such as ingredients, dietary restrictions, cuisine, complexity, and rating. The recommendation system uses a variety of features extracted from the dataset, including:

- **Ingredients**: Recipes are classified based on the ingredients listed in the dataset. TF-IDF (Term Frequency-Inverse Document Frequency) is used to vectorize the ingredients for comparison.
- **Dietary Restrictions**: The system considers dietary restrictions such as vegetarian, vegan, gluten-free, etc., when recommending recipes.
- **Cuisine**: The cuisine type (e.g., Italian, Indian) plays a role in the recommendation, helping users find recipes within their preferred cuisine.
- **Complexity**: Recipes are categorized into different levels of complexity based on the number of steps in the recipe instructions.
- **Rating**: A rating score is used to adjust recommendations, with higher-rated recipes being favored.

The system encodes every recipe as a numeric feature vector and encodes the user's preferences (ingredients, dietary restrictions, complexity, etc.) the same way. It then ranks recipes by cosine similarity to the user's vector and returns the top N matches.
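As a minimal sketch of the matching idea described above, the example below vectorizes a few made-up ingredient strings with TF-IDF and ranks them against a query by cosine similarity. The recipe names and ingredient strings in `toy_recipes` are hypothetical and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy recipes; each value is a space-joined ingredient list.
toy_recipes = {
    "tomato pasta": "tomato pasta basil garlic",
    "cheese omelette": "egg cheese butter",
    "garlic bread": "bread butter garlic",
}

vectorizer = TfidfVectorizer()
recipe_vectors = vectorizer.fit_transform(list(toy_recipes.values()))

# Encode the user's ingredients with the same fitted vocabulary.
query_vector = vectorizer.transform(["garlic tomato basil"])

# One similarity score per recipe, in the same order as toy_recipes.
scores = cosine_similarity(query_vector, recipe_vectors).flatten()
ranked = sorted(zip(toy_recipes, scores), key=lambda pair: pair[1], reverse=True)
for title, score in ranked:
    print(f"{title}: {score:.3f}")
```

A recipe that shares no terms with the query scores exactly zero, which is why the omelette ranks last here.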


# Loading Libraries and Initial Setup
This section imports the libraries used throughout the notebook: pandas and NumPy for data manipulation, `ast` for parsing stringified lists, and scikit-learn for TF-IDF vectorization and cosine similarity.


In [1]:
import pandas as pd
import numpy as np
import ast
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics.pairwise import cosine_similarity

# Reading the Dataset
Here, we load the original dataset from the CSV file into a pandas DataFrame and display the first few rows to get an overview of the data.


In [2]:
original_data=pd.read_csv('../data/full_dataset.csv')

# Data Preprocessing
In this step, we perform various preprocessing tasks:
- Dropping irrelevant columns
- Checking for unique values in columns
- Handling missing values
- Sampling a subset of the data for faster processing and analysis.
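The steps listed above can be sketched on a tiny stand-in DataFrame. The rows in `toy` are hypothetical, and dropping untitled recipes with `dropna` is an assumption: the notebook inspects the one untitled row, but the cells shown do not remove it.

```python
import pandas as pd

# Hypothetical miniature stand-in for the full dataset.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "title": ["Beer Bread", None, "Creamy Corn", "Pecan Pralines", "Pow!"],
    "NER": ['["flour", "beer"]', '["bacon"]', '["corn"]', '["sugar"]', '["broth"]'],
})

# Drop the leftover index column and any recipe without a title.
cleaned = toy.drop(columns=["Unnamed: 0"]).dropna(subset=["title"])

# A fixed random_state makes the sample reproducible; resetting the index
# keeps positional and label-based lookups in agreement afterwards.
sample = cleaned.sample(frac=0.5, random_state=55).reset_index(drop=True)
print(sample)
```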


In [3]:
original_data.head()

Unnamed: 0.1,Unnamed: 0,title,ingredients,directions,link,source,NER
0,0,No-Bake Nut Cookies,"[""1 c. firmly packed brown sugar"", ""1/2 c. eva...","[""In a heavy 2-quart saucepan, mix brown sugar...",www.cookbooks.com/Recipe-Details.aspx?id=44874,Gathered,"[""brown sugar"", ""milk"", ""vanilla"", ""nuts"", ""bu..."
1,1,Jewell Ball'S Chicken,"[""1 small jar chipped beef, cut up"", ""4 boned ...","[""Place chipped beef on bottom of baking dish....",www.cookbooks.com/Recipe-Details.aspx?id=699419,Gathered,"[""beef"", ""chicken breasts"", ""cream of mushroom..."
2,2,Creamy Corn,"[""2 (16 oz.) pkg. frozen corn"", ""1 (8 oz.) pkg...","[""In a slow cooker, combine all ingredients. C...",www.cookbooks.com/Recipe-Details.aspx?id=10570,Gathered,"[""frozen corn"", ""cream cheese"", ""butter"", ""gar..."
3,3,Chicken Funny,"[""1 large whole chicken"", ""2 (10 1/2 oz.) cans...","[""Boil and debone chicken."", ""Put bite size pi...",www.cookbooks.com/Recipe-Details.aspx?id=897570,Gathered,"[""chicken"", ""chicken gravy"", ""cream of mushroo..."
4,4,Reeses Cups(Candy),"[""1 c. peanut butter"", ""3/4 c. graham cracker ...","[""Combine first four ingredients and press in ...",www.cookbooks.com/Recipe-Details.aspx?id=659239,Gathered,"[""peanut butter"", ""graham cracker crumbs"", ""bu..."


In [4]:
original_data.drop(['Unnamed: 0'], axis=1, inplace=True)

In [6]:
original_data.nunique()

title          1312870
ingredients    2226362
directions     2211644
link           2231142
source               2
NER            2133496
dtype: int64

In [7]:
original_data['source'].unique()

array(['Gathered', 'Recipes1M'], dtype=object)

In [8]:
notitle=original_data[original_data['title'].isnull()]
notitle

Unnamed: 0,title,ingredients,directions,link,source,NER
1394448,,"[""2 pieces bacon""]","[""Slice bacon into lardons, place in nonstick ...",food52.com/recipes/57431-none,Gathered,"[""bacon""]"


In [9]:
subset_data = original_data.sample(frac=0.005, random_state=55)  

### Save a sample of the data for faster processing; the full dataset can be used later

In [10]:
subset_data.to_csv('../output/subset_data.csv',index=False)

# Load the sample data

In [11]:
subset_data=pd.read_csv('../output/subset_data.csv')

In [12]:
subset_data.sample(3)

Unnamed: 0,title,ingredients,directions,link,source,NER
10641,Pow!,"[""2 (10 1/2 oz. each) cans condensed beef brot...","[""Heat all ingredients to simmering."", ""Serve ...",www.cookbooks.com/Recipe-Details.aspx?id=888766,Gathered,"[""condensed beef broth"", ""tomato juice"", ""wate..."
9097,Beer Bread,"[""3 c. self-rising flour"", ""3 Tbsp. sugar"", ""1...","[""Mix together."", ""Put in greased and floured ...",www.cookbooks.com/Recipe-Details.aspx?id=508644,Gathered,"[""flour"", ""sugar"", ""beer"", ""butter""]"
8380,Pecan Pralines,"[""1 1/2 c. sugar"", ""1/2 c. buttermilk"", ""1 Tbs...","[""Combine sugar, buttermilk, syrup and soda."",...",www.cookbooks.com/Recipe-Details.aspx?id=1079353,Gathered,"[""sugar"", ""buttermilk"", ""white corn syrup"", ""b..."


In [13]:
subset_data.shape

(11156, 6)

In [14]:
subset_data.dtypes

title          object
ingredients    object
directions     object
link           object
source         object
NER            object
dtype: object

In [15]:
subset_data.drop(['source',"link"], axis=1, inplace=True)

# Exploring Unique Ingredients
This part of the notebook extracts the unique ingredients present in the recipes, which will be used in subsequent steps for dietary classification and recommendations.


In [16]:
subset_data.columns

Index(['title', 'ingredients', 'directions', 'NER'], dtype='object')

In [17]:
subset_data['NER'].unique()

array(['["mayonnaise", "chili sauce", "relish", "eggs", "chives", "mustard", "Salt", "sauce"]',
       '["butter", "flour", "beet juice", "vinegar", "heavy cream", "salt", "pepper", "sugar", "beets"]',
       '["pastry", "egg", "salt", "water", "Cajun seasoning", "Parmesan cheese"]',
       ..., '["butter", "sumac", "fresh mint", "shallot"]',
       '["ground beef", "flour", "salt", "water"]',
       '["butter", "shortening", "sugar", "eggs", "flour", "baking powder", "vanilla", "milk"]'],
      dtype=object)

In [None]:

subset_data['NER'] = subset_data['NER'].apply(ast.literal_eval)

# Flatten the lists in the 'NER' column into a single list of ingredients
all_ingredients = [ingredient for sublist in subset_data['NER'] for ingredient in sublist]

# Convert to a set to get unique ingredients
unique_ingredients = set(all_ingredients)

print(unique_ingredients)


{'red chili pods', 'arrowroot powder', 'Whipping cream', 'buttermilk baking mix', 'Bay Seasoning', 'goat cheese', 'whole coriander seeds', 'water chestnuts', 'lamb rack', 'macaroni', 'full fat coconut milk', 'chocolate-covered raisins', 'carob powder', 'lean boneless chuck', 'Parmesan cheese', 'black sesame seed', 'velveeta', 'rosemary', 'agar agar', 'unflavored gelatin', 'cocoa', 'cold rice', 'turkey meatballs', 'mayonnais', 'Hi-C punch', 'tomato condensed', 'sazon goya with azafran', 'extra light vegetable oil', 'cilantro stems', 'iceberg lettuce', 'liquid egg substitute', 'sauerkraut Vlasic', 'brownies', 'Whipping Cream', 'Brownie Mix', 'ground almond', 'broth', 'fresh gingerFor serving', 'fruit apples', '# hot ground pork sausage', 'vanilla chocolate chips', 'lids', 'Provolone cheese', 'Shredded lettuce', 'Margine', 'Cheddar Cream Cheese', 'Lard', 'Sauvignon', 'reposada tequila', 'rabbit carcass', 'rigatoni pasta', 'Rosemary Ham', 'Italian Blend', 'mashed strawberries', 'swiss chee

In [19]:
len(unique_ingredients)

8271
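The printed set contains entries that differ only in capitalization, such as 'Whipping cream' and 'Whipping Cream', so the count of 8271 probably overstates the number of distinct ingredients. A small sketch of normalizing before deduplicating, using hypothetical lists in `ner_lists`:

```python
# Hypothetical NER lists with case and whitespace variants.
ner_lists = [
    ["Whipping cream", "sugar"],
    ["Whipping Cream", "Salt"],
    ["salt", " sugar "],
]

# Without normalization every spelling variant counts separately.
raw_unique = {ingredient for sublist in ner_lists for ingredient in sublist}

# Strip whitespace and lowercase so variants collapse into one entry.
normalized_unique = {
    ingredient.strip().lower() for sublist in ner_lists for ingredient in sublist
}

print(len(raw_unique), len(normalized_unique))
```

The dietary classifier further down already lowercases ingredients before matching, so this mainly affects the unique-ingredient count.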

In [20]:
subset_data.head()

Unnamed: 0,title,ingredients,directions,NER
0,Shrimp Lamaze Sauce (Warwick Hotel) Recipe,"[""1 pt. mayonnaise"", ""1 pt. chili sauce"", ""1/2...","[""Mix all ingredients together."", ""Add in salt...","[mayonnaise, chili sauce, relish, eggs, chives..."
1,Sweet And Sour Beets,"[""2 Tbsp. butter"", ""2 Tbsp. flour"", ""1/2 c. be...","[""Melt the butter."", ""Blend in the flour."", ""S...","[butter, flour, beet juice, vinegar, heavy cre..."
2,Crispy Crisps,"[""1 box puff pastry"", ""1 egg"", ""1 tsp. salt"", ...","[""Preheat oven to 375\u00b0."", ""Thaw pastry."",...","[pastry, egg, salt, water, Cajun seasoning, Pa..."
3,Chicken French Bread Pizza,"[""1 loaf (1 pound) French bread"", ""1/2 cup but...","[""Cut bread in half lengthwise, then in half w...","[bread, butter, cheddar cheese, Parmesan chees..."
4,Butterfinger Delight,"[""2 1/2 cups crushed graham crackers"", ""1/2 cu...","[""Mix 2 Cups Crushed Graham Cracker and melted...","[graham crackers, butter, instant chocolate pu..."


# Dietary Restriction Classification
Here, we define and apply a function to classify dietary restrictions (e.g., vegetarian, vegan) based on the ingredients found in the recipes.


In [21]:
dietary_restrictions = {
    "vegetarian": {"carrot", "milk", "cheese", "potato", "flour", "butter", "bread", "chives"},
    "vegan": {"carrot", "broccoli", "quinoa"},
    "gluten_free": {"rice", "quinoa", "corn"},
    "keto": {"butter", "cheese", "egg", "bacon",'eggs'},
    "non_vegetarian": {"chicken", "fish", "beef", "pork", "lamb", "shrimp", "crab", "egg",'eggs'},
    "eggitarian": {"carrot", "milk", "cheese", "potato", "egg",'eggs'}
}

def classify_dietary_restriction(ingredients):
    """Classify dietary restrictions for a list of ingredients without considering keto."""
    categories = set()

    # Normalize ingredient by making it lowercase
    normalized_ingredients = [ingredient.strip().lower() for ingredient in ingredients]

    # Check for non-vegetarian and eggitarian categories first
    non_veg_or_eggitarian = False
    for ingredient in normalized_ingredients:
        if ingredient in dietary_restrictions["non_vegetarian"]:
            non_veg_or_eggitarian = True
            categories.add("non_vegetarian")
        elif ingredient in dietary_restrictions["eggitarian"]:
            non_veg_or_eggitarian = True
            categories.add("eggitarian")
    
    # If the recipe is non-veg or eggitarian, it can't be vegetarian or other conflicting categories
    if non_veg_or_eggitarian:
        return list(categories)  # Return only non-veg or eggitarian categories
    
    # Now check for other dietary restrictions excluding keto
    for ingredient in normalized_ingredients:
        for category, allowed_items in dietary_restrictions.items():
            if category != "keto" and ingredient in allowed_items:
                categories.add(category)

    # Ensure vegetarian is included if no non-veg or eggitarian ingredients are found
    if "non_vegetarian" not in categories and "eggitarian" not in categories:
        if any(ingredient in dietary_restrictions["vegetarian"] for ingredient in normalized_ingredients):
            categories.add("vegetarian")

    return list(categories) if categories else ["unknown"]


In [22]:
subset_data["dietary_restriction"] = subset_data["NER"].apply(classify_dietary_restriction)

In [23]:
subset_data["dietary_restriction"].sample(10)

10666                       [unknown]
3345                 [non_vegetarian]
5128     [eggitarian, non_vegetarian]
1209                        [unknown]
3216                        [unknown]
5325                 [non_vegetarian]
4131                        [unknown]
2962                 [non_vegetarian]
2687                 [non_vegetarian]
10203                       [unknown]
Name: dietary_restriction, dtype: object

In [24]:
subset_data["dietary_restriction"].value_counts()

dietary_restriction
[unknown]                           4475
[non_vegetarian]                    2770
[vegetarian]                        1432
[eggitarian]                        1178
[eggitarian, non_vegetarian]         642
[non_vegetarian, eggitarian]         361
[gluten_free]                        173
[vegan]                               46
[gluten_free, vegetarian]             39
[vegan, gluten_free]                  28
[vegan, vegetarian]                    7
[gluten_free, vegan, vegetarian]       2
[vegetarian, vegan, gluten_free]       2
[vegan, gluten_free, vegetarian]       1
Name: count, dtype: int64
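The counts above list the same combination more than once, for example `[eggitarian, non_vegetarian]` and `[non_vegetarian, eggitarian]`, because the labels come from a Python set whose iteration order is not fixed. One way to merge them, sketched on a hypothetical `labels` Series, is to sort each list into a canonical form before counting:

```python
import pandas as pd

# Hypothetical label lists showing the ordering problem.
labels = pd.Series([
    ["eggitarian", "non_vegetarian"],
    ["non_vegetarian", "eggitarian"],
    ["vegetarian"],
])

# Sorting gives each combination a single canonical spelling; joining into
# a string also makes the value hashable for value_counts.
canonical = labels.apply(lambda cats: ", ".join(sorted(cats)))
counts = canonical.value_counts()
print(counts)
```

Returning `sorted(categories)` from `classify_dietary_restriction` would have a similar effect at the source.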

In [25]:
dietary_dummies = subset_data['dietary_restriction'].apply(lambda x: pd.Series(1, index=x)).fillna(0).astype(int)

# Concatenate the dummy columns to the original DataFrame
subset_data = pd.concat([subset_data, dietary_dummies], axis=1)
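As an alternative sketch, scikit-learn's `MultiLabelBinarizer` builds the same kind of 0/1 indicator columns from a column of label lists without creating one `pd.Series` per row. The DataFrame `toy` below is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical frame with a list-valued label column.
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "dietary_restriction": [
        ["vegetarian"],
        ["eggitarian", "non_vegetarian"],
        ["unknown"],
    ],
})

# fit_transform returns one indicator column per distinct label.
mlb = MultiLabelBinarizer()
dummies = pd.DataFrame(
    mlb.fit_transform(toy["dietary_restriction"]),
    columns=mlb.classes_,
    index=toy.index,
)
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

The indicator columns come out in sorted label order, which differs from the first-seen order produced by the `apply(pd.Series)` approach.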


In [26]:
subset_data.sample(5)

Unnamed: 0,title,ingredients,directions,NER,dietary_restriction,non_vegetarian,vegetarian,unknown,eggitarian,gluten_free,vegan
2603,Redcurrant And Custard Cupcakes,"[""2 cups all purpose flour"", ""3 tsp baking pow...","[""Preheat the oven to 400\u00b0F. Line a 12-ho...","[flour, baking powder, vanilla pod, olive oil,...","[eggitarian, non_vegetarian]",1,0,0,1,0,0
5755,"Easy ""Blondies"" Recipe","[""1 pkg. yellow cake mix"", ""1/4 c. salad oil"",...","[""To cake mix, add oil, milk, egg and nuts or ...","[yellow cake mix, salad oil, milk, egg, nuts]","[non_vegetarian, eggitarian]",1,0,0,1,0,0
5954,Baked Vegetable Lasagna,"[""3 tablespoons olive oil, divided"", ""1/2 cup ...","[""Preheat oven to 375\u00b0."", ""Heat 2 tablesp...","[olive oil, white onion, garlic, kosher salt, ...","[non_vegetarian, eggitarian]",1,0,0,1,0,0
7428,Fresh Apple Cake,"[""1 1/2 c. cooking oil"", ""2 c. white sugar"", ""...","[""Combine oil, sugar and eggs."", ""Add apples, ...","[cooking oil, white sugar, eggs, peeled apples...",[non_vegetarian],1,0,0,0,0,0
7486,Caramelized Onion Dip,"[""2 onions cut in half and sliced medium-thin""...","[""Heat olive oil in large skillet on medium he...","[onions, olive oil, butter, salt, pepper, sour...",[vegetarian],0,1,0,0,0,0


In [27]:
# The dataset has no rating column, so every recipe gets the same placeholder value
subset_data['rating'] = 3

# Cuisine Classification
In this section, we create a mapping of common ingredients to specific cuisines (Italian, Indian, Mexican, etc.). A function is then applied to classify the cuisine for each recipe based on its ingredients.


In [28]:
subset_data.head()

Unnamed: 0,title,ingredients,directions,NER,dietary_restriction,non_vegetarian,vegetarian,unknown,eggitarian,gluten_free,vegan,rating
0,Shrimp Lamaze Sauce (Warwick Hotel) Recipe,"[""1 pt. mayonnaise"", ""1 pt. chili sauce"", ""1/2...","[""Mix all ingredients together."", ""Add in salt...","[mayonnaise, chili sauce, relish, eggs, chives...",[non_vegetarian],1,0,0,0,0,0,3
1,Sweet And Sour Beets,"[""2 Tbsp. butter"", ""2 Tbsp. flour"", ""1/2 c. be...","[""Melt the butter."", ""Blend in the flour."", ""S...","[butter, flour, beet juice, vinegar, heavy cre...",[vegetarian],0,1,0,0,0,0,3
2,Crispy Crisps,"[""1 box puff pastry"", ""1 egg"", ""1 tsp. salt"", ...","[""Preheat oven to 375\u00b0."", ""Thaw pastry."",...","[pastry, egg, salt, water, Cajun seasoning, Pa...",[non_vegetarian],1,0,0,0,0,0,3
3,Chicken French Bread Pizza,"[""1 loaf (1 pound) French bread"", ""1/2 cup but...","[""Cut bread in half lengthwise, then in half w...","[bread, butter, cheddar cheese, Parmesan chees...",[vegetarian],0,1,0,0,0,0,3
4,Butterfinger Delight,"[""2 1/2 cups crushed graham crackers"", ""1/2 cu...","[""Mix 2 Cups Crushed Graham Cracker and melted...","[graham crackers, butter, instant chocolate pu...",[vegetarian],0,1,0,0,0,0,3


In [29]:
cuisine_map = {
    'Italian': ['cheese', 'tomato', 'pasta', 'olive oil', 'basil'],
    'Indian': ['spices', 'lentils', 'rice', 'curry', 'garam masala'],
    'Mexican': ['corn', 'chili', 'beans', 'tortilla', 'lime'],
    'American': ['bacon', 'beef', 'cheddar', 'potato', 'mustard'],
    'French': ['butter', 'cream', 'garlic', 'wine']
}

# Function to classify cuisine based on ingredients
def classify_cuisine(ingredients, cuisine_map):
    cuisines = []
    for cuisine, items in cuisine_map.items():
        if any(item in ingredients for item in items):
            cuisines.append(cuisine)
    return cuisines if cuisines else ['Unknown']

# Apply classification to each recipe
subset_data['Cuisine'] = subset_data['NER'].apply(lambda x: classify_cuisine(x, cuisine_map))
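
Because `ingredients` is a list, `item in ingredients` only matches when a keyword equals a whole ingredient string, so 'cheddar cheese' does not count as 'cheese'. The sketch below contrasts that rule with a hypothetical word-level variant, `classify_cuisine_tokens`; the reduced `cuisine_map_toy` and the sample recipe are illustrative only, and multi-word keywords such as 'olive oil' would need separate handling.

```python
# Reduced, hypothetical keyword map for illustration.
cuisine_map_toy = {
    "Italian": ["cheese", "tomato", "pasta"],
    "French": ["butter", "cream", "wine"],
}

def classify_cuisine_exact(ingredients, cuisine_map):
    """Same rule as above: a keyword must equal a whole ingredient string."""
    found = [c for c, items in cuisine_map.items()
             if any(item in ingredients for item in items)]
    return found or ["Unknown"]

def classify_cuisine_tokens(ingredients, cuisine_map):
    """Hypothetical variant: a keyword matches any single word of an ingredient."""
    words = {w for ing in ingredients for w in ing.lower().split()}
    found = [c for c, items in cuisine_map.items()
             if any(item in words for item in items)]
    return found or ["Unknown"]

recipe = ["bread", "butter", "cheddar cheese", "Parmesan cheese"]
exact_result = classify_cuisine_exact(recipe, cuisine_map_toy)
token_result = classify_cuisine_tokens(recipe, cuisine_map_toy)
print(exact_result, token_result)
```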


In [30]:
subset_data['Cuisine'].value_counts()

Cuisine
[Unknown]                               5984
[French]                                2704
[Italian, French]                        636
[Italian]                                538
[American]                               405
[American, French]                       189
[Mexican]                                148
[Indian]                                 145
[Italian, American]                       69
[Mexican, French]                         69
[Indian, French]                          63
[Italian, American, French]               63
[Italian, Mexican]                        28
[Italian, Mexican, French]                25
[Mexican, American]                       23
[Italian, Indian]                         20
[Italian, Indian, French]                 18
[Mexican, American, French]                9
[Indian, American]                         7
[Italian, Mexican, American, French]       4
[Indian, Mexican, French]                  3
[Indian, American, French]                 2
[I

In [31]:
cuisine_dummies = subset_data['Cuisine'].apply(lambda x: pd.Series(1, index=x)).fillna(0).astype(int)

# Concatenate the dummy columns to the original DataFrame
subset_data = pd.concat([subset_data, cuisine_dummies], axis=1)

In [32]:
subset_data.sample(5)

Unnamed: 0,title,ingredients,directions,NER,dietary_restriction,non_vegetarian,vegetarian,unknown,eggitarian,gluten_free,vegan,rating,Cuisine,American,French,Unknown,Italian,Indian,Mexican
10590,"Endive, Apple And Chicken Salad","[""2 boneless, skinless chicken breast halves"",...","[""Season chicken on both sides with salt and p...","[chicken, Salt, apple cider, apple cider vineg...",[non_vegetarian],1,0,0,0,0,0,3,"[Italian, American]",1,0,0,1,0,0
54,Brunswick Stew,"[""1 whole chicken, cut up"", ""1 onion, quartere...","[""Place chicken in Dutch oven and add enough w...","[chicken, onion, celery, salt, pepper, white c...",[non_vegetarian],1,0,0,0,0,0,3,[French],0,1,0,0,0,0
4005,Ww 1 Point Weight Watchers Macaroni Salad,"[""2 cups whole wheat elbow macaroni"", ""1/2 cup...","[""Cook macaroni according to package direction...","[whole wheat elbow macaroni, mayonnaise, nonfa...",[unknown],0,0,1,0,0,0,3,[Unknown],0,0,1,0,0,0
5334,Potato-Leek Pancakes,"[""5 medium potatoes, peeled, grated, and squee...","[""Combine all ingredients in a large bowl, and...","[potatoes, leeks, eggs, freshly ground pepper,...",[non_vegetarian],1,0,0,0,0,0,3,[Unknown],0,0,1,0,0,0
3864,Bacon Wrapped Chicken,"[""chicken cutlets"", ""bacon"", ""Monterey Jack ch...","[""You will need toothpicks.""]","[chicken cutlets, bacon, cheese, soy sauce, on...",[eggitarian],0,0,0,1,0,0,3,"[Italian, American]",1,0,0,1,0,0


# Recipe Complexity Classification
The complexity of each recipe is classified based on the number of directions provided in the recipe. Recipes are categorized as easy (1), medium (2), or hard (3).


In [33]:
def classify_complexity(directions_list):
    num_directions = len(directions_list)
    
    # Categorize based on number of directions (this threshold is just an example)
    if num_directions <= 3:
        return 1 #easy
    elif 4 <= num_directions <= 7:
        return 2 #medium
    else:
        return 3 #hard

# Apply the function to the DataFrame
subset_data['complexity'] = subset_data['directions'].apply(classify_complexity)
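
One caveat: `subset_data` was reloaded from CSV and only `NER` was passed through `ast.literal_eval`, so `directions` appears to still hold strings at this point. If so, `len` counts characters rather than steps and nearly every recipe lands in class 3, which matches the samples shown in this notebook. The sketch below parses the value first; `count_steps` and `classify_complexity_parsed` are hypothetical names, and the thresholds are copied from the function above.

```python
import ast

def count_steps(directions):
    """Count recipe steps, parsing the value first if it is a stringified list."""
    if isinstance(directions, str):
        directions = ast.literal_eval(directions)
    return len(directions)

def classify_complexity_parsed(directions):
    """Same thresholds as classify_complexity, applied to the parsed step count."""
    num_steps = count_steps(directions)
    if num_steps <= 3:
        return 1  # easy
    elif num_steps <= 7:
        return 2  # medium
    return 3  # hard

# A stringified list, as it would look after a CSV round trip.
raw = '["Mix together.", "Put in greased and floured pan.", "Bake."]'
print(len(raw), count_steps(raw), classify_complexity_parsed(raw))
```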

In [34]:
subset_data.sample(5)

Unnamed: 0,title,ingredients,directions,NER,dietary_restriction,non_vegetarian,vegetarian,unknown,eggitarian,gluten_free,vegan,rating,Cuisine,American,French,Unknown,Italian,Indian,Mexican,complexity
5270,Cookie Cake,"[""2 c. sugar"", ""2 c. flour"", ""1/2 tsp. salt"", ...","[""Bring margarine, cocoa and milk to a boil."",...","[sugar, flour, salt, soda, eggs, sour milk, va...","[eggitarian, non_vegetarian]",1,0,0,1,0,0,3,[Unknown],0,0,1,0,0,0,3
1056,Veal And Gravy,"[""6 veal tenderloins"", ""1 large can Carnation ...","[""Salt and pepper tenderloin slices."", ""Roll i...","[veal, Carnation milk]",[unknown],0,0,1,0,0,0,3,[Unknown],0,0,1,0,0,0,3
4773,Zesty Spaghetti Frittata,"[""2 cups cooked spaghetti"", ""1 cup frozen peas...","[""Preheat oven to 350 degrees F. Toss spaghett...","[frozen peas, Italian Dressing, eggs, milk, gr...","[eggitarian, non_vegetarian]",1,0,0,1,0,0,3,"[Italian, American]",1,0,0,1,0,0,3
1846,Honeyed Sweet Potatoes Recipe,"[""6 cooked sweet potatoes (yams), halved lengt...","[""Arrange cooked potatoes in baking dish."", ""H...","[potatoes, honey, orange juice]",[unknown],0,0,1,0,0,0,3,[Unknown],0,0,1,0,0,0,3
8344,Cheesy Brunch Pie,"[""4 eggs"", ""1 c. dairy sour cream"", ""8 oz. bac...","[""In medium bowl, whisk together eggs and sour...","[eggs, sour cream, bacon, Cheddar cheese, pars...",[non_vegetarian],1,0,0,0,0,0,3,[American],1,0,0,0,0,0,3


In [63]:
visualization_data = subset_data[['title', 'NER', 'rating','dietary_restriction', 'Cuisine', 'complexity']]
visualization_data.to_csv('../output/visualization_data.csv',index=False)

# Creating Processed Data for Modeling
We select relevant columns, including the dietary restriction and cuisine dummy variables, to create the final dataset used for model building and recommendations.


In [35]:
# Take an explicit copy so that columns added later do not modify a slice of subset_data
processed_data = subset_data[['title', 'NER', 'rating', 'complexity', 'vegetarian', 'vegan', 'gluten_free', 'non_vegetarian', 'eggitarian', 'American', 'French', 'Indian', 'Italian', 'Mexican']].copy()


In [36]:
processed_data

Unnamed: 0,title,NER,rating,complexity,vegetarian,vegan,gluten_free,non_vegetarian,eggitarian,American,French,Indian,Italian,Mexican
0,Shrimp Lamaze Sauce (Warwick Hotel) Recipe,"[mayonnaise, chili sauce, relish, eggs, chives...",3,3,0,0,0,1,0,1,0,0,0,0
1,Sweet And Sour Beets,"[butter, flour, beet juice, vinegar, heavy cre...",3,3,1,0,0,0,0,0,1,0,0,0
2,Crispy Crisps,"[pastry, egg, salt, water, Cajun seasoning, Pa...",3,3,0,0,0,1,0,0,0,0,0,0
3,Chicken French Bread Pizza,"[bread, butter, cheddar cheese, Parmesan chees...",3,3,1,0,0,0,0,0,1,0,0,0
4,Butterfinger Delight,"[graham crackers, butter, instant chocolate pu...",3,3,1,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11151,Sour Cream Pound Cake,"[butter, sugar, baking soda, sour cream, vanil...",3,3,0,0,0,1,0,0,1,0,0,0
11152,Kathie'S Seasoning For Rib-Eye Roast,"[garlic, salt, cracked black pepper, thyme]",3,3,0,0,0,0,0,0,1,0,0,0
11153,Sumac-Mint Butter,"[butter, sumac, fresh mint, shallot]",3,3,1,0,0,0,0,0,1,0,0,0
11154,Ground Beef In Gravy,"[ground beef, flour, salt, water]",3,3,1,0,0,0,0,0,0,0,0,0


# Text Vectorization with TF-IDF
In this section, we apply the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer to the list of ingredients (NER) for each recipe. This converts the ingredients into a numerical format that can be used for similarity calculation.


In [None]:
processed_data["title"] = processed_data["title"].str.lower()


# Flatten the list of ingredients for TF-IDF
processed_data["NER_text"] = processed_data["NER"].apply(lambda x: " ".join(x))  # Convert lists to space-separated strings
tfidf = TfidfVectorizer()
ner_tfidf_matrix = tfidf.fit_transform(processed_data["NER_text"])

# Create a DataFrame from the TF-IDF matrix
tfidf_df = pd.DataFrame(ner_tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())




In [39]:
# Concatenate all processed columns
processed_df = pd.concat([processed_data.drop(columns=["NER"]),
                          tfidf_df], axis=1)

processed_df.sample(5)


Unnamed: 0,title,rating,complexity,vegetarian,vegan,gluten_free,non_vegetarian,eggitarian,American,French,...,your,za,zest,zested,zesty,zinfandel,zucchini,zucchinis,½oz,árbol
8623,chocolate peanut butter balls,3,3,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8451,gluten-free peanut butter chocolate chip cookies,3,3,0,0,0,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6486,sugar cookies,3,3,0,0,0,1,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3591,crab dip,3,3,1,0,0,0,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10472,milky way brownies,3,3,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Building the Feature Matrix
The TF-IDF columns are concatenated with the rating, complexity, dietary, and cuisine columns. After the non-numeric `title` and `NER_text` columns are dropped, the result is the feature matrix used to calculate similarity between recipes.

# Testing the Model with its own data

In [None]:


print(processed_df.dtypes)

# Ensure all columns except 'title' are numeric
feature_matrix = processed_df.drop(columns=["title"]).select_dtypes(include=[np.number]).values

# Define the recommendation function
def recommend_recipes(input_features, n_recommendations=5):
    """
    Recommends recipes based on input features.
    
    Parameters:
        input_features (list or array): Input features matching the columns of `feature_matrix`.
        n_recommendations (int): Number of recipes to recommend.
        
    Returns:
        DataFrame: Top N recommended recipes with their titles and similarity scores.
    """
    # Ensure input_features is a numpy array
    input_features = np.array(input_features).reshape(1, -1)
    
    # Compute cosine similarity
    similarities = cosine_similarity(input_features, feature_matrix).flatten()
    
    # Get top N indices
    top_indices = np.argsort(similarities)[::-1][:n_recommendations]
    
    # Retrieve corresponding recipes
    recommendations = processed_data.iloc[top_indices][["title"]].copy()
    recommendations["similarity_score"] = similarities[top_indices]
    
    return recommendations

# Example Usage
# Create a valid input feature vector (must match the numeric columns in `processed_df`)
sample_input = feature_matrix[7]  # Using the recipe at position 7 as an example query

# Get recommendations
recommended_recipes = recommend_recipes(sample_input, n_recommendations=5)
print(recommended_recipes)


title          object
rating          int64
complexity      int64
vegetarian      int32
vegan           int32
               ...   
zinfandel     float64
zucchini      float64
zucchinis     float64
½oz           float64
árbol         float64
Length: 2857, dtype: object
                                 title  similarity_score
7     thai sweet chili & garlic burger          1.000000
5573                             pizza          0.966539
5681                    beef enchilada          0.961883
3114           spicy chicken casserole          0.961124
866                      chops o'brien          0.961085
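
The scores above all sit near 0.96 even for quite different dishes. A likely reason is scale: each TF-IDF row has unit length, while `rating` and `complexity` are both 3 for these recipes, so those two columns dominate the cosine. The sketch below shows the effect on a hypothetical four-column layout and one possible remedy, rescaling the two columns to the range [0, 1].

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical layout: [rating, complexity, tfidf_0, tfidf_1].
# The two recipes share no ingredients but have identical rating/complexity.
recipes = np.array([
    [3.0, 3.0, 1.0, 0.0],
    [3.0, 3.0, 0.0, 1.0],
])

raw_scores = cosine_similarity(recipes[[0]], recipes).flatten()

# Rescale rating (1 to 5) and complexity (1 to 3) to [0, 1].
scaled = recipes.copy()
scaled[:, 0] = (scaled[:, 0] - 1) / 4
scaled[:, 1] = (scaled[:, 1] - 1) / 2
scaled_scores = cosine_similarity(scaled[[0]], scaled).flatten()

print(raw_scores, scaled_scores)
```

Rescaling reduces the inflation but does not remove it; weighting the ingredient block more heavily, or using diet and cuisine as hard filters, are other options.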


# Recommendation Function
The `recommend_recipes` function defined above takes a feature vector with the same layout as a row of `feature_matrix` and returns the recipes with the highest cosine similarity to it.

# User Input-Based Recipe Recommendation
This section introduces a new function, `recommend_from_user_input`, which allows a user to input their preferred ingredients, dietary restrictions, complexity level, and cuisine. Based on this input, the function recommends the most similar recipes.



In [61]:
def user_input_to_features(ingredients, dietary_restriction, complexity, rating, cuisine, tfidf, cuisine_map):
    """
    Convert the user input into a feature vector for recommendation.
    
    Parameters:
        ingredients (list): List of ingredients provided by the user.
        dietary_restriction (list): List of dietary restrictions selected by the user.
        complexity (int): The complexity level selected by the user (1: easy, 2: medium, 3: hard).
        rating (int): Rating selected by the user (e.g., 1 to 5).
        cuisine (list): List of cuisines selected by the user.
        tfidf (TfidfVectorizer): Fitted TF-IDF vectorizer.
        cuisine_map (dict): The mapping of cuisines to ingredients.
    
    Returns:
        np.array: A feature vector for the user input.
    """
    # Convert dietary restrictions to a binary vector (same as in your original dataset)
    dietary_features = ['vegetarian', 'vegan', 'gluten_free', 'non_vegetarian', 'eggitarian']
    dietary_vector = [1 if diet in dietary_restriction else 0 for diet in dietary_features]
    
    # Convert cuisine to a binary vector
    cuisine_vector = [1 if cuisin in cuisine else 0 for cuisin in cuisine_map.keys()]
    
    # Convert ingredients to a string and apply TF-IDF transformation
    ingredients_text = " ".join(ingredients)
    ingredients_tfidf = tfidf.transform([ingredients_text]).toarray().flatten()
    
    # Convert complexity and rating to numerical values
    complexity_vector = [complexity]  # single value (easy=1, medium=2, hard=3)
    rating_vector = [rating]  # single value (1 to 5 rating)
    
    # Concatenate all vectors to form the feature vector
    user_features = np.concatenate([ingredients_tfidf, dietary_vector, cuisine_vector, complexity_vector, rating_vector])
    
    return user_features


def recommend_from_user_input(ingredients, dietary_restriction, complexity, rating, cuisine, tfidf, cuisine_map, processed_data, feature_matrix, n_recommendations=5):
    """
    Recommends recipes based on user input.
    
    Parameters:
        ingredients (list): List of ingredients provided by the user.
        dietary_restriction (list): List of dietary restrictions selected by the user.
        complexity (int): The complexity level selected by the user.
        rating (int): Rating selected by the user.
        cuisine (list): List of cuisines selected by the user.
        tfidf (TfidfVectorizer): Fitted TF-IDF vectorizer.
        cuisine_map (dict): The mapping of cuisines to ingredients.
        processed_data (DataFrame): The processed data containing recipe details.
        feature_matrix (ndarray): The feature matrix from the recipes dataset.
        n_recommendations (int): Number of recipes to recommend.
    
    Returns:
        DataFrame: Top N recommended recipes with their titles and similarity scores.
    """
    # Convert user input to a feature vector
    user_features = user_input_to_features(ingredients, dietary_restriction, complexity, rating, cuisine, tfidf, cuisine_map)
    
    # Ensure the feature vector is a 2D array for cosine similarity computation
    user_features = user_features.reshape(1, -1)
    
    # Compute cosine similarity between user features and recipe features
    similarities = cosine_similarity(user_features, feature_matrix).flatten()
    
    top_indices = np.argsort(similarities)[::-1][:n_recommendations]
    
    recommendations = processed_data.iloc[top_indices][["title"]].copy()
    recommendations["similarity_score"] = similarities[top_indices]
    
    return recommendations

# Example Usage
ingredients = ['carrot', 'milk', 'cheese']  # Example ingredients
dietary_restriction = ['vegetarian']  # Example dietary restrictions
complexity = 1  # Easy
rating = 4  # Past rating given by the user out of 5
cuisine = ['Italian']  # Example cuisine

# Get recommendations
recommended_recipes = recommend_from_user_input(ingredients, dietary_restriction, complexity, rating, cuisine, tfidf, cuisine_map, processed_data, feature_matrix)
print(recommended_recipes)


                             title  similarity_score
9375  spicy honey-glazed parsnips           0.112397
1426               citrus soufflé          0.036641
7337       magic chocolate pudding          0.030295
3209                  rabbit stock          0.020913
5590    chocolate orange mud cakes          0.018332
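
The low scores above may come from a column-order mismatch. The columns of `feature_matrix` follow `processed_df`: rating, complexity, the dietary flags, the cuisine flags in the order American, French, Indian, Italian, Mexican, and then the TF-IDF terms. `user_input_to_features` instead concatenates the TF-IDF terms first, then the dietary flags, the cuisines in `cuisine_map` order, complexity, and rating. The lengths agree, so no error is raised, but the positions do not line up. The sketch below builds the vector from an explicit column list; `META_COLUMNS` and `build_user_vector` are hypothetical names, and the shortened column list is for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, shortened column order for the non-text part of the matrix.
META_COLUMNS = ["rating", "complexity", "vegetarian", "vegan", "Italian", "French"]

def build_user_vector(ingredients, diets, cuisines, complexity, rating, tfidf):
    """Build a vector laid out as META_COLUMNS followed by the TF-IDF columns."""
    meta = {"rating": rating, "complexity": complexity}
    meta.update({diet: 1 for diet in diets})
    meta.update({cuisine: 1 for cuisine in cuisines})
    # Looking values up by column name keeps the order tied to META_COLUMNS.
    meta_vector = np.array([meta.get(col, 0) for col in META_COLUMNS], dtype=float)
    text_vector = tfidf.transform([" ".join(ingredients)]).toarray().ravel()
    return np.concatenate([meta_vector, text_vector])

toy_tfidf = TfidfVectorizer().fit(["carrot milk cheese", "beef flour salt"])
vector = build_user_vector(["carrot", "cheese"], ["vegetarian"], ["Italian"], 1, 4, toy_tfidf)
print(dict(zip(META_COLUMNS, vector[: len(META_COLUMNS)])))
```

In the notebook, the column list could be taken from the numeric columns of `processed_df` so the two layouts cannot drift apart.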
