# Content-based recommendation

Content-based recommendation focuses on suggesting recipes based on the features of the recipes themselves, such as ingredients, dietary preferences, and nutritional values. By analyzing these features, the system can recommend similar recipes to users based on their previous interactions or preferences, helping to personalize suggestions effectively. Now that we have preprocessed the data, we will build a content-based recommendation system.

Let us start by importing the necessary libraries.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

Now let us load the pickled dataset.

In [4]:
# pickle is a module, not a function: open the file in binary mode and call pickle.load
with open("C:/Users/pd006/Desktop/internship_search/machine_learning/Recipe-Recommender-System/data/food.pkl", "rb") as f:
    data = pickle.load(f)

In [5]:
len(data)

Since we have around 196k data points, we will use 25% of the data to speed up processing during the early stages of model development. This enables faster iterations without needing the full dataset.

In [8]:
sampled_data = data.sample(frac=0.25, random_state=0)

We will now need the following libraries for text preprocessing.

In [9]:
import string
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pd006\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We will use the PorterStemmer for stemming. It is computationally cheap, which makes it suitable for large datasets, and its rule-based approach is simpler than more complex stemming algorithms. Because it applies a fixed set of rules, the same word always maps to the same stem.

In [10]:
from nltk.stem import PorterStemmer

In [11]:
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))  # a set makes the membership test in the tokenizer fast
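
To make the "fixed rules, consistent outputs" point concrete, here is a minimal sketch that stems a few cooking-related words. The word list is made up for illustration; only `PorterStemmer` and its `stem` method come from NLTK.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample words, chosen only for illustration
words = ["running", "cooking", "wings", "recipes"]

# Each inflected form is reduced to a stem by the same fixed rules
stems = [stemmer.stem(word) for word in words]
print(stems)

# Stemming is deterministic: a second pass gives exactly the same stems
assert stems == [stemmer.stem(word) for word in words]
```

Note that a stem is not always a dictionary word: "cooking" becomes "cook", but "recipes" is cut down to "recip". That is fine for matching purposes, as long as every text is passed through the same stemmer.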

In [13]:
def recipe_tokenizer(sentence):
    # Replace each punctuation character with a space and lowercase the sentence.
    # Join with "" (not " "), otherwise every character becomes its own token.
    sentence = "".join(char if char not in string.punctuation else " " for char in sentence).lower()
    # Split the cleaned sentence into a list of words (tokens)
    words = sentence.split()
    # Stem each word and exclude stopwords from the result
    return [stemmer.stem(word) for word in words if word not in stop_words]
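
The punctuation step is the easiest one to get subtly wrong, so here is a small standalone sketch of just the cleaning and splitting stages. It uses `str.translate` instead of a character loop, and a tiny hypothetical stopword set (`demo_stop_words`) in place of the NLTK list so that it runs without any downloads. Stemming is left out, and `clean_and_split` is an illustrative helper, not part of the notebook.

```python
import string

# Map every punctuation character to a space
punct_to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))

# Hypothetical stand-in for the NLTK stopword list
demo_stop_words = {"with", "of", "my", "one"}

def clean_and_split(sentence):
    # Replace punctuation with spaces, lowercase, then split on whitespace
    words = sentence.translate(punct_to_space).lower().split()
    # Drop stopwords; the real tokenizer would also stem each word
    return [word for word in words if word not in demo_stop_words]

tokens = clean_and_split("Black-pepper chicken wings, with herbed blue dip!")
print(tokens)
# ['black', 'pepper', 'chicken', 'wings', 'herbed', 'blue', 'dip']
```

Replacing punctuation with a space (rather than deleting it) is what splits hyphenated tags such as `low-carb` into two separate tokens.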


Next, we import the libraries for the modelling steps: gensim's Word2Vec for training word embeddings, TfidfVectorizer for turning text into numerical vectors, and cosine_similarity for comparing those vectors.

In [14]:
import gensim
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
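
As a rough illustration of how the last two imports fit together, the sketch below vectorizes a tiny made-up corpus with TF-IDF and compares the documents with cosine similarity. The three sentences are hypothetical stand-ins for recipe text, and the built-in English stopword list of scikit-learn is used here only to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus standing in for recipe text
docs = [
    "spicy chicken wings with blue cheese dip",
    "baked chicken wings with herbs",
    "chocolate cake with vanilla frosting",
]

# Turn each document into a TF-IDF weighted term vector
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Pairwise cosine similarity; entry [i, j] compares docs[i] with docs[j]
sim = cosine_similarity(tfidf)

print(sim.round(2))
```

The two chicken wing texts share terms and therefore score above zero against each other, while the cake text shares no non-stopword terms with the first one and scores zero.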

In [18]:
# Function for word embedding using Word2Vec
def word_embedding(data, column):
    # Tokenize text data and train Word2Vec model
    model = Word2Vec(data[column].apply(recipe_tokenizer), vector_size=150, window=5, min_count=1)
    
    # Return word embeddings
    return {word: model.wv[word] for word in model.wv.index_to_key}
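
`word_embedding` returns a plain dictionary that maps each token to its vector. One common way to use such a dictionary (an assumption here, not something the notebook has done yet) is to average the vectors of a recipe's tokens into a single recipe vector. The sketch below uses a tiny hand-made dictionary (`toy_embeddings`) and a hypothetical helper `average_vector`, so it runs without training a model.

```python
import numpy as np

# Hand-made 3-dimensional vectors standing in for trained Word2Vec output
toy_embeddings = {
    "chicken": np.array([1.0, 0.0, 0.0]),
    "pepper": np.array([0.0, 1.0, 0.0]),
    "salt": np.array([0.0, 0.0, 1.0]),
}

def average_vector(tokens, embeddings, size):
    # Keep only tokens that have a vector; unknown tokens are skipped
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        # No known tokens: fall back to a zero vector of the right size
        return np.zeros(size)
    return np.mean(vectors, axis=0)

# "basil" has no vector, so only "chicken" and "pepper" are averaged
recipe_vector = average_vector(["chicken", "pepper", "basil"], toy_embeddings, size=3)
print(recipe_vector)  # values: 0.5, 0.5, 0.0
```

With the real model the `size` argument would match `vector_size=150`.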

In [None]:
def precompute_embeddings(data):
    # Get word embeddings for ingredients
    embeddings = word_embedding(data, "ingredients")
    # Combine 'name', 'tags', and 'description' into one lowercase text column
    data['text_data'] = data[['name', 'tags', 'description']].astype(str).agg(' '.join, axis=1).str.lower()
    # Assumption: return both results so that later steps can reuse them
    return embeddings, data

In [40]:
sampled_data[['name', 'tags', 'description']].head(1)

| | name | tags | description |
|---|---|---|---|
| 21015 | black pepper chicken wings with herbed blue dip | [60-minutes-or-less, time-to-make, course, pre... | one of my favorite chicken wing recipes. |


In [33]:
sampled_data[['name', 'tags', 'description']].head(2)["tags"][21015]

['60-minutes-or-less',
 'time-to-make',
 'course',
 'preparation',
 'very-low-carbs',
 'main-dish',
 'dietary',
 'one-dish-meal',
 'low-carb',
 'low-in-something']

In [34]:
sampled_data[['name', 'tags', 'description']].head(2).astype(str)["tags"][21015]

"['60-minutes-or-less', 'time-to-make', 'course', 'preparation', 'very-low-carbs', 'main-dish', 'dietary', 'one-dish-meal', 'low-carb', 'low-in-something']"

In [38]:
sampled_data[['name', 'tags', 'description']].head(2).astype(str).agg(' '.join, axis=1)[21015]
#sampled_data[['name', 'tags', 'description']].astype(str).agg(' '.join, axis=1).str.lower()

"black pepper chicken wings with herbed blue dip ['60-minutes-or-less', 'time-to-make', 'course', 'preparation', 'very-low-carbs', 'main-dish', 'dietary', 'one-dish-meal', 'low-carb', 'low-in-something'] one of my favorite chicken wing recipes."

In [50]:
z = sampled_data[['name', 'description']].head(2).astype(str).agg(' '.join, axis=1)[21015]
z

'black pepper chicken wings with herbed blue dip one of my favorite chicken wing recipes.'

In [51]:
zz = " ".join(sampled_data[['tags']].head(1)["tags"].values[0])
zz

'60-minutes-or-less time-to-make course preparation very-low-carbs main-dish dietary one-dish-meal low-carb low-in-something'

In [52]:
z + " " + zz

'black pepper chicken wings with herbed blue dip one of my favorite chicken wing recipes. 60-minutes-or-less time-to-make course preparation very-low-carbs main-dish dietary one-dish-meal low-carb low-in-something'
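
The experiments above suggest joining the tag list with spaces rather than casting it with `astype(str)`, which keeps the brackets and quotes in the text. A minimal sketch of applying that idea to a whole DataFrame follows. The two-row frame `demo` is made-up data, and the sketch assumes that `tags` holds Python lists, as in the output shown earlier.

```python
import pandas as pd

# Made-up rows with the same column layout as the recipe data
demo = pd.DataFrame({
    "name": ["black pepper chicken wings", "lemon cake"],
    "tags": [["main-dish", "low-carb"], ["desserts", "easy"]],
    "description": ["One of my favorite recipes.", "Quick and zesty."],
})

# Join each tag list into one space-separated string (no brackets or quotes)
tags_text = demo["tags"].apply(" ".join)

# Concatenate the three text fields with explicit spaces, then lowercase
text_data = (demo["name"] + " " + tags_text + " " + demo["description"]).str.lower()

print(text_data[0])
# black pepper chicken wings main-dish low-carb one of my favorite recipes.
```

The explicit `" "` separators matter: without them the last word of one field fuses with the first word of the next.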

In [54]:
sampled_data[['name', 'description']].head(2).astype(str).agg(' '.join, axis=1)[21015]

'black pepper chicken wings with herbed blue dip one of my favorite chicken wing recipes.'

In [60]:
sampled_data["tags"] = sampled_data["tags"].astype(str)

In [64]:
sampled_data[['name', 'tags', 'description']].head(2).astype(str).agg(' '.join, axis=1).str.lower()[21015]

"black pepper chicken wings with herbed blue dip ['60-minutes-or-less', 'time-to-make', 'course', 'preparation', 'very-low-carbs', 'main-dish', 'dietary', 'one-dish-meal', 'low-carb', 'low-in-something'] one of my favorite chicken wing recipes."