# Recipe Recommendation System: Data Preparation and Initial EDA

This project aims to build a personalized recipe recommendation system using the `Food.com Recipes and Interactions` dataset from Kaggle. The dataset contains detailed information about recipes, including ingredients, nutritional values, and user ratings. 

This notebook focuses on the first step in the pipeline: preparing and exploring the data. It is the foundation for creating clean and feature-rich datasets to be used in subsequent analysis and modeling. This is the first of multiple notebooks in the project.

### Objectives:
1. Preprocess and clean the dataset.
2. Handle outliers and convert data to appropriate formats.
3. Perform initial exploratory data analysis (EDA).
4. Generate a cleaned dataset and an additional feature-enriched dataset for further analysis.

### Outline:
- Importing libraries and loading the dataset.
- Cleaning and preprocessing the data:
  - Handling outliers.
  - Converting object data types to appropriate formats.
  - Adding new features from existing columns.
- Conducting initial exploratory data analysis (EDA).
- Generating and saving two datasets:
  - A cleaned version of the original dataset.
  - A feature-enriched dataset for use in subsequent analysis and modeling.


In [1]:
# import the necessary libraries
import pandas as pd
import ast
import matplotlib.pyplot as plt

In [2]:
# load the dataset into memory
recipe_df = pd.read_csv("C:/Users/pd006/Desktop/internship_search/machine_learning/Recipe-Recommender-System/data/RAW_recipes.csv")
users_df = pd.read_csv("C:/Users/pd006/Desktop/internship_search/machine_learning/Recipe-Recommender-System/data/RAW_interactions.csv")

Let us start by exploring `recipe_df`.

In [3]:
recipe_df.head(3)

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",13


Right off the bat, we can see that the columns hold text, numbers, dates, and list-like collections of these. Now let us look at the dimensions of the dataset.

In [4]:
# get the shape of the dataframe
recipe_df.shape    # (num_rows, num_cols)

(231637, 12)

We see that the dataset has more than 231,000 rows and only 12 columns. A sample of this size should represent the population of recipes reasonably well and gives downstream models plenty of examples to learn from, although size alone does not guarantee better predictions.

The next thing we can do is check whether there are any null values in the dataset. We will use the pandas `isnull()` method.

In [5]:
# check for null values
recipe_df.isnull().sum()

name                 1
id                   0
minutes              0
contributor_id       0
submitted            0
tags                 0
nutrition            0
n_steps              0
steps                0
description       4979
ingredients          0
n_ingredients        0
dtype: int64

We have null values in `name` and `description` columns. 

In [6]:
# Percentage of the missing values
missing_name_percent = recipe_df["name"].isnull().mean()*100
missing_description_percent = recipe_df["description"].isnull().mean()*100

print(f"The name column has {missing_name_percent:.4f}% missing value and the description column has {missing_description_percent:.4f}% missing values.")

The name column has 0.0004% missing value and the description column has 2.1495% missing values.
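
The count and the percentage can also be gathered into a single table, which scales better when more than two columns have gaps. The sketch below is a minimal illustration on a made-up frame (`toy_df` and `missing_summary` are hypothetical names, not part of the notebook); the same two expressions would apply to `recipe_df`.

```python
import pandas as pd

# Toy stand-in for recipe_df; the real frame comes from RAW_recipes.csv.
toy_df = pd.DataFrame({
    "name": ["soup", None, "pie", "stew"],
    "description": ["warm", "tasty", None, None],
    "minutes": [10, 20, 30, 40],
})

# Count and percentage of missing values per column, in one table.
missing_summary = pd.DataFrame({
    "n_missing": toy_df.isnull().sum(),
    "pct_missing": toy_df.isnull().mean() * 100,
})

# Keep only the columns that actually have gaps, worst first.
missing_summary = (
    missing_summary[missing_summary["n_missing"] > 0]
    .sort_values("pct_missing", ascending=False)
)
print(missing_summary)
```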


Let us start with the `name` column. We will use boolean masking to build a True/False Series that marks the entry with the null value.

In [7]:
# condition
null_value_in_name_col = recipe_df["name"].isnull()

Now we will filter the dataframe based on the condition above.

In [8]:
recipe_df[null_value_in_name_col]

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
721,,368257,10,779451,2009-04-27,"['15-minutes-or-less', 'time-to-make', 'course...","[1596.2, 249.0, 155.0, 0.0, 2.0, 112.0, 14.0]",6,"['in a bowl , combine ingredients except for o...",-------------,"['lemon', 'honey', 'horseradish mustard', 'gar...",10


Let us drop this entry from the dataframe.

In [9]:
# drop the null value
recipe_df.drop(index=[721], inplace = True)

In [10]:
# Sanity check
recipe_df.isnull().sum()

name                 0
id                   0
minutes              0
contributor_id       0
submitted            0
tags                 0
nutrition            0
n_steps              0
steps                0
description       4979
ingredients          0
n_ingredients        0
dtype: int64

We have successfully dropped the entry with null value from our dataframe. Now we will perform the boolean masking operation again to grab entries that have `NaN` values in the `description` column in the dataframe.

In [11]:
# condition
null_values_in_description_col = recipe_df["description"].isnull()

In [12]:
# filter, get the index and drop the entries.
idx_for_null_in_description_col = recipe_df[null_values_in_description_col].index
recipe_df.drop(index = idx_for_null_in_description_col, inplace = True)

In [13]:
# Sanity check
recipe_df.isnull().sum()

name              0
id                0
minutes           0
contributor_id    0
submitted         0
tags              0
nutrition         0
n_steps           0
steps             0
description       0
ingredients       0
n_ingredients     0
dtype: int64

So at this point we have dealt with the `NaN` values in the dataset.
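
For reference, the mask-then-drop pattern used above has a shorter equivalent in `dropna(subset=...)`. The sketch below compares the two on a made-up frame (`toy_df` is a hypothetical stand-in for `recipe_df`); it is an illustration of the equivalence, not a change to the steps already run.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "name": ["soup", None, "pie", "stew"],
    "description": ["warm", "tasty", None, "hearty"],
})

# Approach used above: build a mask, take its index, drop those labels.
mask = toy_df["name"].isnull() | toy_df["description"].isnull()
dropped_by_index = toy_df.drop(index=toy_df[mask].index)

# Equivalent one-liner: drop rows with a null in any of the listed columns.
dropped_by_dropna = toy_df.dropna(subset=["name", "description"])

print(dropped_by_dropna)
```

Both approaches keep the original index labels, so the surviving rows are still addressed by their pre-drop labels.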

Now let us check the dataset one more time using both `head()` and `info()` methods.

In [14]:
# head method
recipe_df.head(3)

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",13


In [15]:
# info method
recipe_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 226657 entries, 0 to 231636
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   name            226657 non-null  object
 1   id              226657 non-null  int64 
 2   minutes         226657 non-null  int64 
 3   contributor_id  226657 non-null  int64 
 4   submitted       226657 non-null  object
 5   tags            226657 non-null  object
 6   nutrition       226657 non-null  object
 7   n_steps         226657 non-null  int64 
 8   steps           226657 non-null  object
 9   description     226657 non-null  object
 10  ingredients     226657 non-null  object
 11  n_ingredients   226657 non-null  int64 
dtypes: int64(5), object(7)
memory usage: 22.5+ MB


Notice that we have some columns that have incorrect datatypes. For instance, the `submitted` column is of type object when it should be of type datetime.

Moreover, if we look at the first entry of the `tags` column, we see that it is of type `str` instead of type `list`. 

In [16]:
type(recipe_df["tags"][0])

str

The same problem affects the `nutrition`, `steps` and `ingredients` columns as well.

In [17]:
#print(f"Data type for first entry of `nutrition` column:{type(recipe_df["nutrition"][0])} and for first entry of 'steps' column:{type(recipe_df["steps"][0])}")

print(f"Data type of the first entry in 'nutrition' column: {type(recipe_df['nutrition'][0])}.")
print(f"Data type of the first entry in 'steps' column: {type(recipe_df['steps'][0])}.")
print(f"Data type of the first entry in 'ingredients' column: {type(recipe_df['ingredients'][0])}.")
print()


Data type of the first entry in 'nutrition' column: <class 'str'>.
Data type of the first entry in 'steps' column: <class 'str'>.
Data type of the first entry in 'ingredients' column: <class 'str'>.



So let us convert all the aforementioned columns to their correct types. To convert `submitted` to datetime we will use the `to_datetime()` function from pandas. To convert the stringified lists to real lists we will apply `literal_eval()` from Python's `ast` (Abstract Syntax Trees) module to each entry with the `apply()` method.

In [None]:
# converting submitted column to datetime
recipe_df['submitted'] = pd.to_datetime(recipe_df['submitted'])

# convert the stringified lists into real Python lists
recipe_df["tags"] = recipe_df["tags"].apply(ast.literal_eval)
recipe_df["nutrition"] = recipe_df["nutrition"].apply(ast.literal_eval)
recipe_df["steps"] = recipe_df["steps"].apply(ast.literal_eval)
recipe_df["ingredients"] = recipe_df["ingredients"].apply(ast.literal_eval)

In [None]:
# sanity check
print(f"Data type of 'submitted' column: {recipe_df['submitted'].dtype}")
print(f"Data type of the first entry in 'nutrition' column: {type(recipe_df['nutrition'][0])}.")
print(f"Data type of the first entry in 'steps' column: {type(recipe_df['steps'][0])}.")
print(f"Data type of the first entry in 'ingredients' column: {type(recipe_df['ingredients'][0])}.")
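
One practical caveat with this conversion: `ast.literal_eval()` expects a string, so re-running the conversion cell after the columns already hold lists raises an error. The sketch below shows one way to make the step safe to re-run, on a made-up frame. The helper `parse_list_cell` and the frame `toy_df` are hypothetical names introduced for illustration.

```python
import ast
import pandas as pd

def parse_list_cell(cell):
    """Hypothetical helper: parse a stringified list, leave real lists alone."""
    if isinstance(cell, list):
        return cell            # already converted, e.g. the cell was re-run
    return ast.literal_eval(cell)

toy_df = pd.DataFrame({
    "tags": ["['easy', 'vegan']", "['dessert']"],
    "nutrition": ["[51.5, 0.0, 13.0]", "[173.4, 18.0, 0.0]"],
})

for col in ["tags", "nutrition"]:
    toy_df[col] = toy_df[col].apply(parse_list_cell)

# Running the conversion a second time is now harmless.
for col in ["tags", "nutrition"]:
    toy_df[col] = toy_df[col].apply(parse_list_cell)

print(type(toy_df["tags"][0]), toy_df["nutrition"][0])
```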

### Detecting and Handling Outliers

We now move on to handling outliers in the dataset.

To this end, let us look at the `minutes` column. We will use the `describe()` method.

In [None]:
recipe_df["minutes"].describe()

The provided summary statistics suggest there are likely outliers in the dataset. 
* The standard deviation (std) is extremely high (4.5e+06) compared to the mean (9.6e+03), indicating that the data values vary significantly, which often points to the presence of extreme values.
* The maximum value is 2.147484e+09, which is extraordinarily large compared to the upper quartile (75%, which is 65). This disparity suggests the presence of extreme outliers.
* The difference between the mean (9.6e+03) and the median (50%, which is 40) indicates a right-skewed distribution. Skewed distributions often contain outliers on the higher end.
* Since the focus is on entries with positive values for time, zero time can also be considered an outlier.
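
The checks in the list above can be turned into a few numbers computed directly from `describe()`. The sketch below does this on a made-up series (the values are invented to mimic the pattern described, not taken from the dataset), and the ratio names are hypothetical.

```python
import pandas as pd

# Toy cooking times with one absurd entry, mimicking the pattern described above.
minutes = pd.Series([0, 10, 20, 30, 40, 50, 60, 70, 2_147_483_647], name="minutes")

stats = minutes.describe()

# Mean far above the median suggests right skew driven by large values.
mean_to_median = stats["mean"] / stats["50%"]

# Maximum far above the upper quartile suggests extreme high-end values.
max_to_q3 = stats["max"] / stats["75%"]

# Share of entries with a non-positive (implausible) cooking time.
pct_non_positive = (minutes <= 0).mean() * 100

print(f"mean/median: {mean_to_median:.1f}, max/Q3: {max_to_q3:.1f}, "
      f"non-positive: {pct_non_positive:.2f}%")
```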

In [None]:
# Percentage of entries that take zero minutes
(recipe_df["minutes"] == 0).mean() * 100

Since only 0.19% of entries have a time equal to zero, we will modify the dataframe to only include positive values for time.
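
A minimal sketch of that filtering step, on a made-up frame rather than the real data (`toy_df` and `positive_df` are hypothetical names):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "name": ["instant tea", "soup", "placeholder", "stew"],
    "minutes": [0, 25, 0, 90],
})

# Percentage of rows whose cooking time is exactly zero.
pct_zero = (toy_df["minutes"] == 0).mean() * 100

# Keep only rows with a strictly positive cooking time.
# .copy() gives an independent frame, so later column assignments on it
# do not trigger pandas' chained-assignment warnings.
positive_df = toy_df[toy_df["minutes"] > 0].copy()

print(f"{pct_zero:.1f}% zero-minute rows removed; {len(positive_df)} rows remain")
```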

In [None]:
plt.title("Distribution of minutes (Full Range)", fontsize=16)
plt.xlabel("Minutes", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
recipe_df["minutes"].hist()

Since the `minutes` column is heavily right-skewed, we will restrict the histogram to values at or below the 95th percentile of the original data.

In [None]:
plt.title("Distribution of minutes (up to the 95th percentile)", fontsize=16)
plt.xlabel("Minutes", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
# get the 95th percentile
percentile_minutes_95 = recipe_df["minutes"].quantile(.95)
# keep only values at or below the 95th percentile and plot them
recipe_df[recipe_df["minutes"] <= percentile_minutes_95]["minutes"].hist(bins=30)

Data points above the 95th percentile are treated as outliers; this is a design choice, made for the following reasons:

* Removing the extreme outliers (above 255 minutes) makes the dataset's summary statistics more representative of the majority of the data.
* The dataset contains exceptionally high values, such as 2,147,483,647 minutes. These values lie so far beyond the general distribution that they would distort any analysis that kept them, and a simple percentile cutoff removes them directly.
* The data is heavily skewed to the right, which makes a percentile-based threshold (e.g., the 95th percentile) a practical choice. Tukey's Rule can sometimes underperform when data is not symmetrically distributed.
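
To make the comparison between the two rules concrete, the sketch below computes both cutoffs on a small made-up series. The values are invented for illustration and say nothing about where the cutoffs fall in the real dataset; `tukey_upper` and the `flagged_by_*` names are hypothetical.

```python
import pandas as pd

# Toy right-skewed cooking times with one absurd value.
minutes = pd.Series([5, 10, 15, 20, 25, 30, 35, 40, 45, 50,
                     60, 75, 90, 120, 180, 240, 300, 480, 720, 2_147_483_647])

# Tukey's Rule: upper fence = Q3 + 1.5 * IQR.
q1, q3 = minutes.quantile(0.25), minutes.quantile(0.75)
tukey_upper = q3 + 1.5 * (q3 - q1)

# Percentile rule: upper cutoff = 95th percentile.
p95 = minutes.quantile(0.95)

flagged_by_tukey = minutes[minutes > tukey_upper]
flagged_by_p95 = minutes[minutes > p95]

print(f"Tukey fence: {tukey_upper:.2f} -> {len(flagged_by_tukey)} flagged")
print(f"95th percentile: {p95:.2f} -> {len(flagged_by_p95)} flagged")
```

On this toy series both rules flag the absurd maximum; they differ in how many of the ordinary long recipes they flag alongside it, which is the trade-off to weigh when choosing between them.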

In [None]:
# minutes outliers
minutes_outlier = recipe_df[recipe_df["minutes"] > percentile_minutes_95]

Next, the summary statistics below suggest that there are likely outliers in the `n_steps` and `n_ingredients` columns as well.

In [None]:
recipe_df[["n_steps", "n_ingredients"]].describe()

We will follow a similar procedure to handle outliers in the `n_steps` and `n_ingredients` columns of the dataframe.

In [None]:
# only include positive values
recipe_df = recipe_df[recipe_df["n_steps"] > 0]

In [None]:
plt.title("Distribution of n_steps", fontsize=16)
plt.xlabel("Number of Steps", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
recipe_df["n_steps"].hist(bins=30)

In [None]:
percentile_num_steps_95 = recipe_df["n_steps"].quantile(0.95)

# n_steps outliers
n_steps_outlier = recipe_df[recipe_df["n_steps"] > percentile_num_steps_95]

In [None]:
plt.title("Distribution of n_ingredients", fontsize=16)
plt.xlabel("Number of Ingredients", fontsize=12)
plt.ylabel("Frequency", fontsize=12)
recipe_df["n_ingredients"].hist(bins=20)

In [None]:
percentile_n_ingredients_95 = recipe_df["n_ingredients"].quantile(0.95)

# n_ingredients outliers
n_ingredients_outlier = recipe_df[recipe_df["n_ingredients"] > percentile_n_ingredients_95]

With the outlier entries identified, the next step is to remove them. We'll first collect the indices of these outliers and then exclude them from the original DataFrame.

In [None]:
# Combine the indices of outliers from 'minutes', 'n_steps', and 'n_ingredients' columns
combined_indices = minutes_outlier.index.union(n_steps_outlier.index).union(n_ingredients_outlier.index)

# Drop the outlier rows; errors="ignore" skips labels already removed by the n_steps > 0 filter
recipe_df = recipe_df.drop(index=combined_indices, errors="ignore")
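
One subtlety of this step: `minutes_outlier` was computed before the `n_steps > 0` filter, so the combined index can contain labels that no longer exist in `recipe_df`, and `drop()` raises a `KeyError` on missing labels by default. The sketch below illustrates the union-and-drop pattern on a made-up frame, with a deliberately stale label; the thresholds and names are arbitrary and hypothetical.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"minutes": [10, 500, 20, 30], "n_steps": [3, 4, 40, 5]},
    index=[0, 1, 2, 3],
)

# Rows flagged per column (thresholds here are arbitrary, for illustration).
minutes_flagged = toy_df[toy_df["minutes"] > 100].index      # label 1
steps_flagged = toy_df[toy_df["n_steps"] > 20].index         # label 2

# A label that is no longer present, e.g. removed by an earlier filter.
stale = pd.Index([99])

combined = minutes_flagged.union(steps_flagged).union(stale)

# errors="ignore" skips labels that are missing instead of raising KeyError.
cleaned = toy_df.drop(index=combined, errors="ignore")
print(cleaned)
```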