# Recipe Review Analysis 
## Part I: Introduction and Exploratory Data Analysis

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/chillyssa/NLP-with-Deep-Learning-Project/blob/main/project_part1.ipynb)

## Introduction 

### Motivation 

As a foodie and avid home chef, I am always searching for new recipes to try out to expand my repertoire! The problem I face when turning to the internet to look for a recipe is the vast amount of information available online. There are many professional websites as well as personal blogs, both of which have their own seemingly countless recipes, including different variations of virtually the same dish. Each recipe then has a set of reviews, and it's baffling to sift through the reviews of each recipe to determine whether or not I should ultimately test out the dish. I need a tool or method to quickly analyze a set of recipe reviews and give me some insight into the reviews and, potentially, the underlying reason why the recipe is recommended or not. Enter natural language processing and sentiment analysis!

### Objective

Luckily, data is everywhere today, including the food world. The intention of this project is to harness the power of natural language processing by way of sentiment analysis to examine a set of [recipe review data](https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions) from Food.com's online recipe generator. This data set comes from Kaggle and was originally gathered for the research cited below.

Generating Personalized Recipes from Historical User Preferences
Bodhisattwa Prasad Majumder*, Shuyang Li*, Jianmo Ni, Julian McAuley
EMNLP, 2019
https://www.aclweb.org/anthology/D19-1613/

## Exploratory Data Analysis

### Loading Data
First up, we will load in the data! The data provided comes with a few different sets. For sentiment analysis, I will explore the `RAW_interactions.csv` file, which includes the recipe reviews as they were written by the users.


In [6]:
# mount google drive to import data files - only have to run this once. 
# from google.colab import drive
# drive.mount('/content/drive')

# import all of the python modules/packages you'll need here
import pandas as pd
import numpy as np

pd.set_option('display.max_colwidth', 0)
path = '/content/drive/MyDrive/NLP-F22/data/RAW_interactions.csv'
df = pd.read_csv(path)
df.head(10)

# ...

Unnamed: 0,user_id,recipe_id,date,rating,review
0,38094,40893,2003-02-17,4,Great with a salad. Cooked on top of stove for 15 minutes.Added a shake of cayenne and a pinch of salt. Used low fat sour cream. Thanks.
1,1293707,40893,2011-12-21,5,"So simple, so delicious! Great for chilly fall evening. Should have doubled it ;)<br/><br/>Second time around, forgot the remaining cumin. We usually love cumin, but didn't notice the missing 1/2 teaspoon!"
2,8937,44394,2002-12-01,4,This worked very well and is EASY. I used not quite a whole package (10oz) of white chips. Great!
3,126440,85009,2010-02-27,5,I made the Mexican topping and took it to bunko. Everyone loved it.
4,57222,85009,2011-10-01,5,"Made the cheddar bacon topping, adding a sprinkling of black pepper. Yum!"
5,52282,120345,2005-05-21,4,very very sweet. after i waited the 2 days i bought 2 more pints of raspberries and added them to the mix. i'm going to add some as a cake filling today and will take a photo.
6,124416,120345,2011-08-06,0,"Just an observation, so I will not rate. I followed this procedure with strawberries instead of raspberries. Perhaps this is the reason it did not work well. Sorry to report that the strawberries I did in August were moldy in October. They were stored in my downstairs fridge, which is very cold and infrequently opened. Delicious and fresh-tasting prior to that, though. So, keep a sharp eye on them. Personally I would not keep them longer than a month. This recipe also appears as #120345 posted in July 2009, which is when I tried it. I also own the Edna Lewis cookbook in which this appears."
7,2000192946,120345,2015-05-10,2,"This recipe was OVERLY too sweet. I would start out with 1/3 or 1/4 cup of sugar and jsut add on from there. Just 2 cups was way too much and I had to go back to the grocery store to buy more raspberries because it made so much mix. Overall, I would but the long narrow box or raspberries. Its a perfect fit for the recipe plus a little extra. I was not impressed with this recipe. It was exceptionally over-sweet. If you make this simple recipe, MAKE SURE TO ADD LESS SUGAR!"
8,76535,134728,2005-09-02,4,Very good!
9,273745,134728,2005-12-22,5,Better than the real!!


After the data is loaded in and we take a look at the first few rows, we see that there are 5 columns coming from the data:

```
user_id, recipe_id, date, rating, review 
```
Before diving deeper into the reviews, I will look at some summary values for the data set as a whole. First, I will make sure the data doesn't have any null cells where there should be a review. Since this data set is specifically for reviews, it is unlikely we will have many null review values, but this step makes sure none remain.
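Before dropping anything, it can help to count how many reviews are actually missing. Here is a minimal sketch of that check on a made-up toy frame (the values below are placeholders for illustration, not rows from the Food.com data):

```python
import pandas as pd

# Toy stand-in for the interactions table (made-up values, for illustration only)
toy = pd.DataFrame({
    "recipe_id": [1, 1, 2, 3],
    "rating": [5, 4, 0, 3],
    "review": ["Great!", None, "Too sweet.", float("nan")],
})

# Count missing reviews before removing anything
n_missing = int(toy["review"].isna().sum())

# Keep only the rows that actually have review text
toy_clean = toy.dropna(subset=["review"])

print(f"missing: {n_missing}, kept: {len(toy_clean)} of {len(toy)}")
```

`dropna(subset=["review"])` has the same effect as filtering with `~df["review"].isnull()`; which one to use is a matter of taste.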


In [7]:
# Remove any rows with a null review
df = df[~df["review"].isnull()]

# Rows of data 
print('Individual Recipe Reviews: ' + str(len(df)))

# Reviews of each recipe 
recipe_count = df['recipe_id'].value_counts()
print('Review Counts by Recipe:')
print(recipe_count)


Individual Recipe Reviews: 1132198
Review Counts by Recipe:
2886      1609
27208     1601
89204     1579
39087     1448
67256     1322
          ... 
155682    1   
154055    1   
252960    1   
144013    1   
386618    1   
Name: recipe_id, Length: 231630, dtype: int64


From this query I can see that there are 1,132,198 reviews in this data set! The largest number of reviews a single recipe has is 1,609; however, looking at the summary of reviews by recipe, there seem to be many recipes with a low review count. For recipes with a lower review count it is not really plausible to draw any meaningful conclusions, so let's see what the data set holds for recipes with at least 25 reviews.

In [9]:
df = df[df.groupby('recipe_id')["recipe_id"].transform('size') >= 25]
# Rows of data 
print('Individual Recipe Reviews: ' + str(len(df)))

# Reviews of each recipe 
recipe_count = df['recipe_id'].value_counts()
print('Review Counts by Recipe:')
print(recipe_count)

Individual Recipe Reviews: 380316
Review Counts by Recipe:
2886      1609
27208     1601
89204     1579
39087     1448
67256     1322
          ... 
210258    25  
227912    25  
65503     25  
297254    25  
79222     25  
Name: recipe_id, Length: 5841, dtype: int64
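The filter above relies on `groupby(...).transform('size')`, which gives every row the review count of its own recipe so the result can be used directly as a boolean mask. A small sketch on made-up data shows what that intermediate Series looks like, alongside an equivalent `value_counts` + `isin` formulation. The toy frame and the `MIN_REVIEWS` name are assumptions for illustration only:

```python
import pandas as pd

# Toy data: recipe 10 has 3 reviews, recipe 20 has 2, recipe 30 has 1
toy = pd.DataFrame({
    "recipe_id": [10, 10, 10, 20, 20, 30],
    "rating": [5, 4, 5, 3, 2, 5],
})
MIN_REVIEWS = 3  # hypothetical threshold for the toy data; the notebook uses 25

# transform('size') labels each row with the number of reviews for its recipe
per_row_count = toy.groupby("recipe_id")["recipe_id"].transform("size")
kept_transform = toy[per_row_count >= MIN_REVIEWS]

# Equivalent approach: find the popular recipe ids first, then filter with isin
counts = toy["recipe_id"].value_counts()
popular_ids = counts[counts >= MIN_REVIEWS].index
kept_isin = toy[toy["recipe_id"].isin(popular_ids)]

print(per_row_count.tolist())
print(kept_transform)
```

Both versions keep exactly the same rows; the `transform` version is a one-liner, while the `value_counts` version leaves the list of qualifying recipe ids available for later use.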


Now the data is 5,841 unique recipes, all with 25 or more reviews, giving a total of 380,316 reviews! Let's take a look at a few of the reviews to see what our actual text data looks like right now. I have chosen 5 reviews at random, all from different recipes, just to see what kind of text we are dealing with.

In [10]:
df.iloc[[3,32,654,7000,18000]]

Unnamed: 0,user_id,recipe_id,date,rating,review
45,349752,79222,2007-04-07,5,"Tasty, low fat, fast & easy! I added 1tsp. of Old Bay Seasoning and 1 tsp. of Cajun Seasoning with a 14 oz. can of crab. We all like the extra spice. This recipe would satisfy DH & DS without any side dishes, but I offered saltines & a salad on the side just in case. Definitely a keeper!"
180,2100189,195977,2011-12-12,5,"This is the same recipe that I used in St. Louis about 35 years ago, at age 10. :-) I am from Affton/Kirkwood/Glendale area & now reside in SE Florida. I lost my recipe & am thankful that you posted. Today I made small tartlets with this recipe & put almond filling & cherry pie filling in the bottom, spooned the cheese mix over the top & added sliced almonds. Baking time is about 20 minutes or so, depending on your oven. I also used Mexican vanilla. It is a great vanilla for baking use. The aluminum cupcake cups with the wax paper liners work great for sharing, once the tartlet is removed from the tartlet pans. I topped off with powdered sugar. The ladies @ the Cookie Exchange party I made them for were all a buzz with curiousity & they went like lightning. This is an excellent recipe and is really easy to make."
2135,1060667,8507,2016-09-22,5,"Please, please, please you only fresh mozzarella in this! The cheap waxy stuff will not be the same. This is a delicious appy and of course I chose to drizzle with the optional balsamic, who wouldn't? :)"
18736,323134,51997,2007-08-26,5,"These were ""awesome"" blueberry muffins! I loved using fresh berries and loved the topping. Thank you for sharing this easy and delicious muffin recipe."
50246,1040507,79944,2009-06-17,5,"Oh my! We loved these! I skipped the butter. I didn't have the flavored cream cheese, so I chopped a bunch of fresh chives and mixed them with a cube of cream cheese. I liked how crispy the bacon got and how juicy the chicken was! We will be making this again! Thanks for the great recipe!"


## Preprocessing

Our data set is currently in its raw form, that is, straight from the source. That means that all these reviews still contain punctuation, capitalization, potential misspellings, etc. It's the _'Wild West'_ of textual data that a machine learning model will not know how to interpret, so we need to take a few steps to clean up, or preprocess, the reviews and convert them into data that a model can analyze to predict the sentiment behind each review.

A good first step is to make sure that we are working with the same data type for all reviews. It is likely that these are all text strings anyway, but in the event they are not, this will ensure all review data is of type string. Additionally, several review entries have emoji-like text and punctuation that the human eye can read as emotion, but that isn't beneficial for this text sentiment analysis, so I will also remove all special characters and punctuation. The last thing I will do is convert all text to lower case to standardize the data even further.
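To make these steps concrete, here is a minimal sketch that applies them to a single string using Python's built-in `re` module. It is a simplified stand-in for illustration, not the notebook's own cleaning function: `clean_review` is a hypothetical helper, and it skips the HTML-related handling.

```python
import re

def clean_review(text):
    """Strip emoticons and punctuation, lower-case, and collapse whitespace."""
    text = str(text)                        # guard against non-string values
    text = re.sub(r":-?[()]", "", text)     # emoticons such as :-) :) :-( :(
    text = re.sub(r"[^\w\s]", "", text)     # any remaining punctuation or symbols
    return " ".join(text.lower().split())   # lower-case and normalize spacing

sample = "Tasty, low fat, fast & easy! Definitely a keeper :-)"
print(clean_review(sample))
# tasty low fat fast easy definitely a keeper
```

Note that the punctuation pattern `[^\w\s]` would remove the emoticon characters on its own; the separate emoticon step is kept here only to mirror the order of the steps described above.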



In [None]:
import re 

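# Convert all review data to type string
df['review'] = df['review'].astype(str)

# Remove special characters
def clean(txt):
    # regex= is passed explicitly because pandas >= 2.0 treats patterns as literals by default
    txt = txt.str.replace(r':-\)', '', regex=True)
    txt = txt.str.replace(r'(<a).*(>).*()', '', regex=True)
    txt = txt.str.replace('\xa0', ' ', regex=False)
    txt = txt.str.replace('&amp', '', regex=False)
    txt = txt.str.replace('&gt', '', regex=False)
    txt = txt.str.replace('&lt', '', regex=False)
    txt = txt.str.replace('!', '', regex=False)
    txt = txt.str.replace('?', '', regex=False)
    txt = txt.str.replace(r':-\(', '', regex=True)
    txt = txt.str.replace(r':\)', '', regex=True)
    txt = txt.str.replace(r':\(', '', regex=True)
    return txt

df['review'] = clean(df['review'])
# Drop every remaining character that is not a word character or whitespace
df['review'] = df['review'].str.replace(r'[^\w\s]', '', regex=True)

# Convert all text to lower case and collapse repeated whitespace
df['review'] = df['review'].apply(lambda review: " ".join(word.lower() for word in review.split()))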

df['review'].head()

Now that we have cleaned up the data a bit, let's look at those same five reviews again to see what they look like now.

In [27]:
df.iloc[[3,32,654,7000,18000]]

Unnamed: 0,user_id,recipe_id,date,rating,review
45,349752,79222,2007-04-07,5,tasty low fat fast easy i added 1tsp of old bay seasoning and 1 tsp of cajun seasoning with a 14 oz can of crab we all like the extra spice this recipe would satisfy dh ds without any side dishes but i offered saltines a salad on the side just in case definitely a keeper
180,2100189,195977,2011-12-12,5,this is the same recipe that i used in st louis about 35 years ago at age 10 i am from afftonkirkwoodglendale area now reside in se florida i lost my recipe am thankful that you posted today i made small tartlets with this recipe put almond filling cherry pie filling in the bottom spooned the cheese mix over the top added sliced almonds baking time is about 20 minutes or so depending on your oven i also used mexican vanilla it is a great vanilla for baking use the aluminum cupcake cups with the wax paper liners work great for sharing once the tartlet is removed from the tartlet pans i topped off with powdered sugar the ladies the cookie exchange party i made them for were all a buzz with curiousity they went like lightning this is an excellent recipe and is really easy to make
2135,1060667,8507,2016-09-22,5,please please please you only fresh mozzarella in this the cheap waxy stuff will not be the same this is a delicious appy and of course i chose to drizzle with the optional balsamic who wouldnt
18736,323134,51997,2007-08-26,5,these were awesome blueberry muffins i loved using fresh berries and loved the topping thank you for sharing this easy and delicious muffin recipe
50246,1040507,79944,2009-06-17,5,oh my we loved these i skipped the butter i didnt have the flavored cream cheese so i chopped a bunch of fresh chives and mixed them with a cube of cream cheese i liked how crispy the bacon got and how juicy the chicken was we will be making this again thanks for the great recipe
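Eyeballing five rows is a good start, but a quick programmatic check can confirm the cleaning held across a whole column. The sketch below is one possible sanity check, run on a small hand-made Series (`cleaned` is a stand-in for `df['review']`, an assumption made for illustration):

```python
import pandas as pd

# Stand-in for the cleaned review column
cleaned = pd.Series([
    "very good",
    "better than the real",
    "i used not quite a whole package 10oz of white chips",
])

# Flag any review that still contains punctuation or other special characters
has_punct = cleaned.str.contains(r"[^\w\s]", regex=True)

# Flag any review that changes when lower-cased, i.e. still has capital letters
has_upper = cleaned != cleaned.str.lower()

# Flag reviews that became empty after cleaning (e.g. punctuation-only reviews)
is_empty = cleaned.str.strip() == ""

print(int(has_punct.sum()), int(has_upper.sum()), int(is_empty.sum()))
```

All three counts should be zero for a fully cleaned column, except possibly the empty count, which is worth inspecting since a review made up only of symbols would be reduced to an empty string.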
