# Data Pre-processing and Modeling

I tried two approaches. The first, an item-based collaborative filter, needed more memory than my machine had available, so I pivoted to approach #2, a neural network.

***Approach #1: Item-based collaborative filter***

Here's my general tactic for creating this recommendation system.  Since most users have written fewer than two reviews, there is very little rating history per user to compare users with one another, so I believe the best approach is an item-based collaborative filter.  This analyzes the recipes for similarity and recommends recipes that are similar to those a user has rated highly.  To do this, I will perform the following steps:

1. **Construct a user-recipe interaction table:**  Each user will have a row and each recipe will be a feature. The values will be the ratings that the recipe was given by that user.
   
2. **Calculate item similarities:** I will explore different ways to calculate item similarities.  Options include cosine similarity, Pearson correlation, or a KNN model to find similar items.

3. **Calculate predicted ratings of recipes:** Using KNN, I can select a number of nearest neighbors for each recipe and use a weighted average to calculate the predicted rating of that particular recipe for that particular user.

4. **Generate recommendations:** Select the highest predicted ratings for the recipes that the user has not interacted with yet as recommendations.
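
The four steps above can be sketched on a toy matrix. This is a minimal illustration, not the code I ran: the ratings are made up, cosine similarity is just one of the candidate metrics, and the helper names (`item_cosine_similarity`, `predict_rating`, `recommend`) are hypothetical.

```python
import numpy as np

# Step 1: toy user x recipe rating matrix (rows: users, columns: recipes).
# The values are invented; 0 means "not rated" in this sketch.
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

def item_cosine_similarity(r):
    """Step 2: cosine similarity between recipe columns."""
    norms = np.linalg.norm(r, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for unrated recipes
    normalized = r / norms
    return normalized.T @ normalized

def predict_rating(r, sim, user, item, k=2):
    """Step 3: similarity-weighted average over the k most similar rated recipes."""
    rated = np.flatnonzero(r[user] > 0)
    rated = rated[rated != item]
    if rated.size == 0:
        return 0.0
    top = rated[np.argsort(sim[item, rated])[::-1][:k]]
    weights = sim[item, top]
    if weights.sum() == 0:
        return 0.0
    return float(weights @ r[user, top] / weights.sum())

def recommend(r, sim, user, n=2, k=2):
    """Step 4: highest predicted ratings among recipes the user has not rated."""
    unseen = np.flatnonzero(r[user] == 0)
    scored = [(int(i), predict_rating(r, sim, user, i, k)) for i in unseen]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

similarity = item_cosine_similarity(ratings)
recommendations = recommend(ratings, similarity, user=0)
```

At the scale of this dataset the matrix would have to be sparse, which is where the memory problem described below comes in.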


***Approach #2: Neural Network***
1. **Preprocess data:** convert categorical features in both recipe and user dfs into numerical values using embeddings
2. **Split the data:** split the data into training and test sets
3. **Define the model architecture:** Create input layers for both user interactions and recipes, define how many hidden layers we want, which activation function to use, the loss function to optimize, which optimizer we want to use
4. **Train the model:** Train the model on the training and validation data
5. **Evaluate the model:** Evaluate the model on the test set with a pre-defined evaluation metric such as RMSE
6. **Optimize the model:** Tune the parameters to improve evaluation metrics
7. **Deploy the model:** Deploy the model to be able to recommend recipes to new users

In [1]:
# Import necessary modules
import pandas as pd
import numpy as np
import csv
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# First import the dfs
recipe_df = pd.read_csv('../Data/Cleaned/recipes_df_cleaned.csv')
interactions_df = pd.read_csv('../Data/Cleaned/interactions_df_cleaned.csv')
recipe_df.head()

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",13
3,alouette potatoes,59389,45,68585,2003-04-14,"['60-minutes-or-less', 'time-to-make', 'course...","[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,['place potatoes in a large pot of lightly sal...,"this is a super easy, great tasting, make ahea...","['spreadable cheese with garlic and herbs', 'n...",11
4,amish tomato ketchup for canning,44061,190,41706,2002-10-25,"['weeknight', 'time-to-make', 'course', 'main-...","[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,['mix all ingredients& boil for 2 1 / 2 hours ...,my dh's amish mother raised him on this recipe...,"['tomato juice', 'apple cider vinegar', 'sugar...",8


In [3]:
interactions_df.head()

Unnamed: 0,user_id,recipe_id,date,rating,review
0,38094,40893,2003-02-17,4,Great with a salad. Cooked on top of stove for...
1,1293707,40893,2011-12-21,5,"So simple, so delicious! Great for chilly fall..."
2,8937,44394,2002-12-01,4,This worked very well and is EASY. I used not...
3,126440,85009,2010-02-27,5,I made the Mexican topping and took it to bunk...
4,57222,85009,2011-10-01,5,"Made the cheddar bacon topping, adding a sprin..."


In [4]:
interactions_df['rating'].value_counts()

rating
5    816364
4    187360
0     60847
3     40855
2     14123
1     12818
Name: count, dtype: int64

### Attempt at approach #1

In [5]:
# I used the code below to try to create a user-recipe matrix, but the memory usage became too large for my machine.
# I will pivot to a neural network approach, which should take less memory

'''# We can use the pivot table functionality to create a user-recipe interaction table. Since this is a massive table, we'll break it down into chunks
from tqdm import tqdm

chunk_size = 10000
chunks = []

for chunk in tqdm(pd.read_csv('../Data/Cleaned/interactions_df_cleaned.csv', chunksize=chunk_size)):
    chunk_pivot = chunk.pivot_table(index = 'user_id',columns = 'recipe_id', values = 'rating', fill_value = -1, aggfunc = 'mean')
    chunks.append(chunk_pivot)
    
for i in tqdm(range(len(chunks))):
    chunks[i] = chunks[i].fillna(-1)
    chunks[i] = chunks[i].astype(pd.SparseDtype('float', -1))
    
user_recipe_matrix = pd.concat(chunks, axis = 0)
user_recipe_matrix = user_recipe_matrix.groupby(user_recipe_matrix.index).max()   

# Ensure the final DataFrame is sparse
user_recipe_matrix = user_recipe_matrix.astype(pd.SparseDtype("float", -1))

'''

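
One alternative I did not run here is to skip the pivot entirely and build a sparse matrix straight from the (user, recipe, rating) triplets, so that only the observed ratings are ever stored. The sketch below assumes SciPy is installed and uses a toy frame with the same column names as `interactions_df`. Storing `rating + 1` is my own workaround, since the data contains genuine ratings of 0 and a sparse matrix treats 0 as "missing".

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for interactions_df (same column names, made-up rows).
toy = pd.DataFrame({
    "user_id":   [38094, 1293707, 8937, 38094],
    "recipe_id": [40893, 40893, 44394, 44394],
    "rating":    [4, 5, 4, 3],
})

# Map the raw IDs to consecutive row/column positions (order of first appearance).
user_codes, user_index = pd.factorize(toy["user_id"])
recipe_codes, recipe_index = pd.factorize(toy["recipe_id"])

# Shift ratings by +1 so a genuine rating of 0 is stored as 1 and
# an implicit 0 in the sparse matrix always means "no interaction".
values = toy["rating"].to_numpy(dtype="float32") + 1.0

# Note: csr_matrix sums duplicate (user, recipe) pairs, so repeated reviews
# would need to be aggregated (e.g. averaged) before this step.
user_recipe = csr_matrix(
    (values, (user_codes, recipe_codes)),
    shape=(len(user_index), len(recipe_index)),
)
```

`user_index` and `recipe_index` keep the original IDs, so a row or column position can be translated back to a real user or recipe.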

### Neural Network Approach

#### Prepare each dataframe:

In [6]:
# Import necessary modules
import tensorflow as tf
import ast
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences



In [7]:
# Let's choose which features we want to use as inputs. For the recipe df, I think the tags and ingredients columns are the most important categorical features
# to maintain.  I will drop the others.
recipe_df = recipe_df.drop(columns = ['name', 'submitted', 'steps', 'description'])

In [8]:
# Now let's convert the tags and ingredients columns into space-separated strings

# The tags and ingredients columns hold string representations of lists. This converts them to actual lists
recipe_df['tags'] = recipe_df['tags'].apply(lambda x: ast.literal_eval(x))
recipe_df['ingredients'] = recipe_df['ingredients'].apply(lambda x: ast.literal_eval(x))

# Convert the lists to strings:
recipe_df['tags'] = recipe_df['tags'].apply(lambda x: ' '.join(x))
recipe_df['ingredients'] = recipe_df['ingredients'].apply(lambda x: ' '.join(x))
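
To make the effect of these two steps concrete, here is a small stand-alone sketch on one illustrative ingredients value (the sample string is an assumption modeled on the first row above). It also shows a side effect worth keeping in mind: after joining with spaces, multi-word items are no longer distinguishable from separate items when the text is split on whitespace.

```python
import ast

# Illustrative raw cell value: a string that looks like a Python list.
raw = "['winter squash', 'mexican seasoning', 'mixed spice']"

# Step 1: parse the string into a real list of items.
items = ast.literal_eval(raw)

# Step 2: join the items into one space-separated string.
joined = " ".join(items)

# Splitting on whitespace now yields words, not the original items.
words = joined.split()

print(len(items), "items ->", len(words), "words")
```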

In [9]:
recipe_df.head()

Unnamed: 0,id,minutes,contributor_id,tags,nutrition,n_steps,ingredients,n_ingredients
0,137739,55,47892,60-minutes-or-less time-to-make course main-in...,"[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,winter squash mexican seasoning mixed spice ho...,7
1,31490,30,26278,30-minutes-or-less time-to-make course main-in...,"[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,prepared pizza crust sausage patty eggs milk s...,6
2,112140,130,196586,time-to-make course preparation main-dish chil...,"[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,ground beef yellow onions diced tomatoes tomat...,13
3,59389,45,68585,60-minutes-or-less time-to-make course main-in...,"[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,spreadable cheese with garlic and herbs new po...,11
4,44061,190,41706,weeknight time-to-make course main-ingredient ...,"[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,tomato juice apple cider vinegar sugar salt pe...,8


In [10]:
# Now I need to convert the nutrition column into floats and separate the listed values into the appropriate columns

# Convert the nutrition column from string representation to an actual list of floats:
recipe_df['nutrition'] = recipe_df['nutrition'].apply(lambda x: ast.literal_eval(x))

# Expand the nutrition information into separate columns
# Per the dataset documentation, the columns are ['calories', 'total_fat', 'sugar', 'sodium', 'protein', 'saturated_fat', 'carbohydrates']
nutrition_df = pd.DataFrame(recipe_df['nutrition'].tolist(), columns = ['calories', 'total_fat', 'sugar', 'sodium', 'protein', 'saturated_fat', 'carbohydrates'])

In [11]:
# Concatenate the nutritional information with the recipe_df
recipe_df = pd.concat([recipe_df, nutrition_df], axis = 1)
recipe_df.head()

Unnamed: 0,id,minutes,contributor_id,tags,nutrition,n_steps,ingredients,n_ingredients,calories,total_fat,sugar,sodium,protein,saturated_fat,carbohydrates
0,137739,55,47892,60-minutes-or-less time-to-make course main-in...,"[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,winter squash mexican seasoning mixed spice ho...,7,51.5,0.0,13.0,0.0,2.0,0.0,4.0
1,31490,30,26278,30-minutes-or-less time-to-make course main-in...,"[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,prepared pizza crust sausage patty eggs milk s...,6,173.4,18.0,0.0,17.0,22.0,35.0,1.0
2,112140,130,196586,time-to-make course preparation main-dish chil...,"[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,ground beef yellow onions diced tomatoes tomat...,13,269.8,22.0,32.0,48.0,39.0,27.0,5.0
3,59389,45,68585,60-minutes-or-less time-to-make course main-in...,"[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,spreadable cheese with garlic and herbs new po...,11,368.1,17.0,10.0,2.0,14.0,8.0,20.0
4,44061,190,41706,weeknight time-to-make course main-ingredient ...,"[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,tomato juice apple cider vinegar sugar salt pe...,8,352.9,1.0,337.0,23.0,3.0,0.0,28.0


In [12]:
# Drop the original nutrition column, since that information is now contained in separate columns
recipe_df.drop(columns = ['nutrition'], inplace = True)
recipe_df.head()

Unnamed: 0,id,minutes,contributor_id,tags,n_steps,ingredients,n_ingredients,calories,total_fat,sugar,sodium,protein,saturated_fat,carbohydrates
0,137739,55,47892,60-minutes-or-less time-to-make course main-in...,11,winter squash mexican seasoning mixed spice ho...,7,51.5,0.0,13.0,0.0,2.0,0.0,4.0
1,31490,30,26278,30-minutes-or-less time-to-make course main-in...,9,prepared pizza crust sausage patty eggs milk s...,6,173.4,18.0,0.0,17.0,22.0,35.0,1.0
2,112140,130,196586,time-to-make course preparation main-dish chil...,6,ground beef yellow onions diced tomatoes tomat...,13,269.8,22.0,32.0,48.0,39.0,27.0,5.0
3,59389,45,68585,60-minutes-or-less time-to-make course main-in...,11,spreadable cheese with garlic and herbs new po...,11,368.1,17.0,10.0,2.0,14.0,8.0,20.0
4,44061,190,41706,weeknight time-to-make course main-ingredient ...,5,tomato juice apple cider vinegar sugar salt pe...,8,352.9,1.0,337.0,23.0,3.0,0.0,28.0


In [13]:
# For the interactions_df, I'll drop the review and date columns. These features won't be used for the recommendation system
interactions_df.drop(columns = ['date', 'review'], inplace = True)
interactions_df.head()

Unnamed: 0,user_id,recipe_id,rating
0,38094,40893,4
1,1293707,40893,5
2,8937,44394,4
3,126440,85009,5
4,57222,85009,5


#### Merge the two dataframes:

In [14]:
# Merge the two dataframes together.  The common column is 'recipe_id' in the interactions_df and 'id' in the recipe_df
merged_df = interactions_df.merge(recipe_df, left_on='recipe_id', right_on= 'id', how='left')

# Drop the redundant 'id' column, which duplicates 'recipe_id' after the merge
merged_df.drop(columns = ['id'], inplace = True)

# Check to see if the result is as expected:
merged_df.head()

Unnamed: 0,user_id,recipe_id,rating,minutes,contributor_id,tags,n_steps,ingredients,n_ingredients,calories,total_fat,sugar,sodium,protein,saturated_fat,carbohydrates
0,38094,40893,4,495,1533,weeknight time-to-make course main-ingredient ...,4,great northern beans yellow onion diced green ...,9,204.8,5.0,9.0,26.0,24.0,2.0,10.0
1,1293707,40893,5,495,1533,weeknight time-to-make course main-ingredient ...,4,great northern beans yellow onion diced green ...,9,204.8,5.0,9.0,26.0,24.0,2.0,10.0
2,8937,44394,4,20,56824,30-minutes-or-less time-to-make course main-in...,5,devil's food cake mix vegetable oil eggs reese...,4,132.3,11.0,39.0,5.0,4.0,11.0,5.0
3,126440,85009,5,10,64342,15-minutes-or-less time-to-make course main-in...,3,mayonnaise salsa cheddar cheese refried beans ...,13,2786.2,342.0,134.0,290.0,161.0,301.0,42.0
4,57222,85009,5,10,64342,15-minutes-or-less time-to-make course main-in...,3,mayonnaise salsa cheddar cheese refried beans ...,13,2786.2,342.0,134.0,290.0,161.0,301.0,42.0


In [15]:
# Save the final dataframe to access later
# merged_df.to_csv('../Data/Preprocessed/NN_Input_Data.csv')

#### Select input and the target variable

In [16]:
X = merged_df[['user_id', 'recipe_id', 'minutes', 'contributor_id', 'tags',
       'n_steps', 'ingredients', 'n_ingredients', 'calories', 'total_fat',
       'sugar', 'sodium', 'protein', 'saturated_fat', 'carbohydrates']]

# y = merged_df['rating']

In [17]:
# Create a binary target column:
# binary_rating = 1 if rating == 5, otherwise binary_rating = 0

merged_df['binary_rating'] = (merged_df['rating']==5).astype('int')

y = merged_df['binary_rating']
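
Because the target is now binary, it helps to know how unbalanced it is. The sketch below reuses the rating counts printed earlier in this notebook to compute the share of 5-star ratings. That share is also the accuracy a model would get by always predicting the majority class, so it is a useful reference point for the accuracy reported during training.

```python
# Rating counts copied from the value_counts() output above.
rating_counts = {5: 816364, 4: 187360, 0: 60847, 3: 40855, 2: 14123, 1: 12818}

total = sum(rating_counts.values())

# Share of rows where binary_rating would be 1 (rating == 5).
positive_share = rating_counts[5] / total

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(positive_share, 1 - positive_share)

print(f"positive share: {positive_share:.4f}, baseline accuracy: {baseline_accuracy:.4f}")
```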

#### Split the data into train and test sets

In [18]:
from sklearn.model_selection import train_test_split

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
print("Shape of training set:", X_train.shape)
print('Shape of test set:', X_test.shape)

Shape of training set: (792656, 15)
Shape of test set: (339711, 15)


#### Prepare the input data for the model

In [20]:
# Now let's tokenize the tags and ingredients columns

# Tokenization for tags
tag_tokenizer = Tokenizer()
tag_tokenizer.fit_on_texts(X_train['tags'])
tag_train_sequences = tag_tokenizer.texts_to_sequences(X_train['tags'])
tag_test_sequences = tag_tokenizer.texts_to_sequences(X_test['tags'])

# Tokenization for ingredients
ingredient_tokenizer = Tokenizer()
ingredient_tokenizer.fit_on_texts(X_train['ingredients'])
ingredient_train_sequences = ingredient_tokenizer.texts_to_sequences(X_train['ingredients'])
ingredient_test_sequences = ingredient_tokenizer.texts_to_sequences(X_test['ingredients'])

In [21]:
# Now I'll pad the sequences to ensure uniform input size

max_tag_length = max(len(seq) for seq in tag_train_sequences)
max_ingredient_length = max(len(seq) for seq in ingredient_train_sequences)

tag_train_padded = pad_sequences(tag_train_sequences, maxlen=max_tag_length, padding='post')
tag_test_padded = pad_sequences(tag_test_sequences, maxlen=max_tag_length, padding='post')

ingredient_train_padded = pad_sequences(ingredient_train_sequences, maxlen=max_ingredient_length, padding='post')
ingredient_test_padded = pad_sequences(ingredient_test_sequences, maxlen=max_ingredient_length, padding='post')
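
The tokenize-then-pad steps can be hard to picture, so here is a simplified pure-Python stand-in. It is not the Keras implementation, and the helper names (`fit_word_index`, `texts_to_sequences`, `pad_post`) are hypothetical. It mirrors three behaviors that matter for this notebook as I understand the Keras defaults: index 0 is reserved for padding, words unseen during fitting are dropped when no out-of-vocabulary token is set, and sequences longer than `maxlen` lose their leading entries.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign indices 1, 2, ... to words, most frequent first (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are silently dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_post(sequences, maxlen):
    """Append zeros up to maxlen (maxlen >= 1); longer sequences keep their last maxlen entries."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append(seq + [0] * (maxlen - len(seq)))
    return padded

train_texts = ["salt pepper salt", "sugar salt"]
test_texts = ["salt saffron sugar pepper salt"]  # "saffron" never appears in training

word_index = fit_word_index(train_texts)
train_seqs = texts_to_sequences(train_texts, word_index)
test_seqs = texts_to_sequences(test_texts, word_index)

maxlen = max(len(seq) for seq in train_seqs)
train_padded = pad_post(train_seqs, maxlen)
test_padded = pad_post(test_seqs, maxlen)
```

Fitting on the training texts only, as done above, keeps test vocabulary from leaking into the model, at the cost of dropping test-only words.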

In [22]:
# Renumber user_id, recipe_id, and contributor_id as consecutive integers starting from 0,
# so that each embedding table only needs one row per distinct ID
# Work on copies so the assignments below don't trigger pandas' SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

user_id_mapping = {old_id: new_id for new_id, old_id in enumerate(X['user_id'].unique())}
X_train['user_id'] = X_train['user_id'].map(user_id_mapping)
X_test['user_id'] = X_test['user_id'].map(user_id_mapping)

recipe_id_mapping = {old_id: new_id for new_id, old_id in enumerate(X['recipe_id'].unique())}
X_train['recipe_id'] = X_train['recipe_id'].map(recipe_id_mapping)
X_test['recipe_id'] = X_test['recipe_id'].map(recipe_id_mapping)

contributor_id_mapping = {old_id: new_id for new_id, old_id in enumerate(X['contributor_id'].unique())}
X_train['contributor_id'] = X_train['contributor_id'].map(contributor_id_mapping)
X_test['contributor_id'] = X_test['contributor_id'].map(contributor_id_mapping)

In [23]:
# Separate inputs for user_id, recipe_id, contributor_id, nutrition, and prep
X_user = X_train['user_id'].values
X_recipe = X_train['recipe_id'].values
X_contributor = X_train['contributor_id'].values
X_nutrition = X_train[['calories', 'total_fat', 'sugar', 'sodium', 'protein', 'saturated_fat', 'carbohydrates']].values
X_prep = X_train[['n_steps', 'n_ingredients']].values

# Do the same for the test set:
X_user_test = X_test['user_id'].values
X_recipe_test = X_test['recipe_id'].values
X_contributor_test = X_test['contributor_id'].values
X_nutrition_test = X_test[['calories', 'total_fat', 'sugar', 'sodium', 'protein', 'saturated_fat', 'carbohydrates']].values
X_prep_test = X_test[['n_steps', 'n_ingredients']].values

#### Build the Neural Network Model:

In [24]:
from tensorflow.keras.layers import Input, Dense, concatenate, Embedding, Flatten
from tensorflow.keras.models import Model

In [25]:
# Initiate the input layers
user_input = Input(shape=(1,), name='user_id')
recipe_input = Input(shape=(1,), name = 'recipe_id')
contributor_input = Input(shape=(1,),name= 'contributor_id')
nutrition_input = Input(shape=(X_nutrition.shape[1],), name = 'nutrition')
prep_input = Input(shape= (X_prep.shape[1],), name = 'prep')
tags_input = Input(shape = (max_tag_length,), name = 'tags')
ingredients_input = Input(shape = (max_ingredient_length,), name = 'ingredients')

In [26]:
# Create embedding layers for the categorical ID inputs (user_id, etc. are numerical, but the numbers don't denote order)
# input_dim comes from the full ID mappings so that IDs appearing only in the test set still have an embedding row
user_embedding = Embedding(input_dim=len(user_id_mapping), output_dim=5)(user_input)
user_embedding = Flatten()(user_embedding)

recipe_embedding = Embedding(input_dim=len(recipe_id_mapping), output_dim=5)(recipe_input)
recipe_embedding = Flatten()(recipe_embedding)

contributor_embedding = Embedding(input_dim=len(contributor_id_mapping), output_dim=5)(contributor_input)
contributor_embedding = Flatten()(contributor_embedding)
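
One thing to check when sizing the ID embeddings: the ID mappings were built from the full dataset, so the test split can contain indices that never occur in the training split. The sketch below uses made-up IDs to compare two ways of choosing the table size. Sizing from the largest training index can leave some test indices without a row, while sizing from the mapping covers every ID.

```python
# Made-up raw IDs, in order of first appearance in the full dataset.
all_ids = [38094, 1293707, 8937, 126440, 57222]
train_ids = [38094, 8937, 126440]
test_ids = [1293707, 57222]

# Same renumbering idea as above: consecutive integers starting from 0.
mapping = {old: new for new, old in enumerate(dict.fromkeys(all_ids))}
train_codes = [mapping[i] for i in train_ids]
test_codes = [mapping[i] for i in test_ids]

# Option A: size the table from the largest index seen in training.
size_from_train = max(train_codes) + 1

# Option B: size the table from the number of distinct IDs overall.
size_from_mapping = len(mapping)

# Test indices that would fall outside the table under option A.
out_of_range = [code for code in test_codes if code >= size_from_train]
```

IDs that only appear in the test set still get an untrained (randomly initialized) row under option B, so predictions for them rely on the other features.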

In [27]:
# Process the tags and ingredients
tags_embedding = Embedding(input_dim = len(tag_tokenizer.word_index)+1, output_dim=10)(tags_input)
tags_embedding = Flatten()(tags_embedding)

ingredients_embedding = Embedding(input_dim=len(ingredient_tokenizer.word_index)+1, output_dim=10)(ingredients_input)
ingredients_embedding = Flatten()(ingredients_embedding)

In [28]:
# Combine all the features
combined_inputs = concatenate([user_embedding, recipe_embedding, contributor_embedding, nutrition_input, prep_input, tags_embedding, ingredients_embedding])

# # Combine inputs excluding the user, recipe, and contributor embeddings
# combined_inputs = concatenate([nutrition_input, prep_input, tags_embedding, ingredients_embedding])

In [29]:
# Build hidden layers:
dense1 = Dense(64, activation='relu')(combined_inputs)
dense2 = Dense(64, activation='relu')(dense1)
output = Dense(1, activation = 'sigmoid')(dense2)

In [30]:
# Create the model
model = Model(inputs = [user_input, recipe_input, contributor_input, nutrition_input, prep_input, tags_input, ingredients_input], outputs = output)

# # Model without user, recipe, contributor
# model = Model(inputs = [nutrition_input, prep_input, tags_input, ingredients_input], outputs = output)


# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001, beta_1 = 0.9)
model.compile(optimizer = optimizer, loss= 'binary_crossentropy', metrics = ['accuracy'])

# Show the summary of the model
model.summary()

#### Train the model

In [31]:
# Train the model
# history = model.fit([X_user, X_recipe, X_contributor, X_nutrition, X_prep, tag_train_padded, ingredient_train_padded], y_train, epochs=3, validation_split=0.1)

# # Train the model excluding user, recipe, contributor
# history = model.fit([X_nutrition, X_prep, tag_train_padded, ingredient_train_padded], y_train, epochs=10, validation_split=0.1)


Epoch 1/3
22294/22294 ━━━━━━━━━━━━━━━━━━━━ 544s 24ms/step - accuracy: 0.7165 - loss: 0.5967 - val_accuracy: 0.7366 - val_loss: 0.5290
Epoch 2/3
22294/22294 ━━━━━━━━━━━━━━━━━━━━ 509s 23ms/step - accuracy: 0.7850 - loss: 0.4750 - val_accuracy: 0.7200 - val_loss: 0.5608
Epoch 3/3
22294/22294 ━━━━━━━━━━━━━━━━━━━━ 464s 21ms/step - accuracy: 0.8240 - loss: 0.4138 - val_accuracy: 0.7103 - val_loss: 0.5778


In [32]:
# # Save the model to evaluate later
# model.save('../Models/model_binary_crossentropy1.keras')
# history_df = pd.DataFrame(history.history)
# history_df.to_csv('../Models/Model_Metrics/model_binary_crossentropy1_history.csv')
# history_df.head()

Unnamed: 0,accuracy,loss,val_accuracy,val_loss
0,0.728069,0.556022,0.736558,0.529034
1,0.784404,0.476702,0.719981,0.56077
2,0.8241,0.410864,0.710279,0.577826
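
A quick way to read this history table is to find the epoch with the lowest validation loss and to track the gap between training and validation accuracy. The sketch below copies the values from the table above (epochs are zero-indexed, as in the table). A gap that widens with each epoch usually suggests overfitting, though three epochs is a small sample to judge from.

```python
# Values copied from the history table above (index = epoch).
history = {
    "accuracy":     [0.728069, 0.784404, 0.8241],
    "val_accuracy": [0.736558, 0.719981, 0.710279],
    "val_loss":     [0.529034, 0.56077, 0.577826],
}

# Epoch with the lowest validation loss.
val_loss = history["val_loss"]
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

# Training accuracy minus validation accuracy, per epoch.
gaps = [round(train - val, 4)
        for train, val in zip(history["accuracy"], history["val_accuracy"])]

print("best epoch by val_loss:", best_epoch)
print("train/val accuracy gap per epoch:", gaps)
```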
