# Recipe Review Analysis 
## Part I: Introduction and Exploratory Data Analysis

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sgeinitz/cs39aa_project/blob/main/project_part1.ipynb)

## Introduction 

### Motivation 

As a foodie and avid home chef, I am always searching for new recipes to try out to expand my repertoire! The problem I face when turning to the internet to look for a recipe is the vast amount of information available online. There are many professional websites as well as personal blogs, each with its own seemingly countless recipes, including different variations of virtually the same dish. Each recipe then has a set of reviews, and it is overwhelming to sift through them to decide whether or not I should ultimately test out the dish. I need a tool or method to quickly analyze a set of recipe reviews and give me some insight into the reviews and, potentially, the underlying reason why the recipe is recommended or not. Enter natural language processing and sentiment analysis!

### Objective

Luckily, data is everywhere today, including the food world. The intention of this project is to harness the power of natural language processing, by way of sentiment analysis, to examine a set of [recipe review data](https://www.kaggle.com/datasets/shuyangli94/food-com-recipes-and-user-interactions) from Food.com, an online recipe site. This data set comes from Kaggle and was originally gathered for the research cited below.

Generating Personalized Recipes from Historical User Preferences
Bodhisattwa Prasad Majumder*, Shuyang Li*, Jianmo Ni, Julian McAuley
EMNLP, 2019
https://www.aclweb.org/anthology/D19-1613/

## Exploratory Data Analysis

First up, we will load in the data! The data provided comes as a few different files. For sentiment analysis I will explore the `RAW_interactions.csv` file, which includes the recipe reviews as they were written by the users.


In [26]:
# mount google drive to import data files - only have to run this once. 
# from google.colab import drive
# drive.mount('/content/drive')

# import all of the python modules/packages you'll need here
import pandas as pd
import numpy as np

path = '/content/drive/MyDrive/NLP-F22/data/RAW_interactions.csv'
df = pd.read_csv(path)
df.head()

# ...

| | user_id | recipe_id | date | rating | review |
|---|---|---|---|---|---|
| 0 | 38094 | 40893 | 2003-02-17 | 4 | Great with a salad. Cooked on top of stove for... |
| 1 | 1293707 | 40893 | 2011-12-21 | 5 | So simple, so delicious! Great for chilly fall... |
| 2 | 8937 | 44394 | 2002-12-01 | 4 | This worked very well and is EASY. I used not... |
| 3 | 126440 | 85009 | 2010-02-27 | 5 | I made the Mexican topping and took it to bunk... |
| 4 | 57222 | 85009 | 2011-10-01 | 5 | Made the cheddar bacon topping, adding a sprin... |


Once the data is loaded and we take a look at the first few rows, we see that the data set has 5 columns:

```
user_id, recipe_id, date, rating, review 
```
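
A quick way to confirm the column names, their types, and whether any reviews are missing is sketched below. This is only a minimal illustration on a toy frame: the name `toy` and all of its values are made up for the example and are not taken from the real data.

```python
import pandas as pd

# Toy stand-in for the interactions table; every value here is invented for illustration.
toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    "recipe_id": [10, 10, 20],
    "date": ["2003-02-17", "2011-12-21", "2002-12-01"],
    "rating": [4, 5, 4],
    "review": ["Great with a salad.", None, "Easy and tasty."],
})

# Column names and the dtypes pandas inferred for them
print(toy.columns.tolist())
print(toy.dtypes)

# Missing values per column; empty reviews would matter for sentiment analysis
missing = toy.isna().sum()
print(missing)

# Dates arrive as strings; convert them to timestamps if they will be used later
toy["date"] = pd.to_datetime(toy["date"])
```

The same three checks could be run on `df` directly; the toy frame just keeps the example self-contained.
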
Before diving deeper into the reviews, I will look at a few summary values for the data set as a whole.


In [27]:
# Rows of data 
print('Individual Recipe Reviews: ' + str(len(df)))

# Reviews of each recipe 
recipe_count = df['recipe_id'].value_counts()
print('Review Counts by Recipe:')
print(recipe_count)


Individual Recipe Reviews: 1132367
Review Counts by Recipe:
2886      1613
27208     1601
89204     1579
39087     1448
67256     1322
          ... 
155682       1
154055       1
252960       1
144013       1
386618       1
Name: recipe_id, Length: 231637, dtype: int64


From this query I can see that there are 1,132,367 reviews in this data set, spread across 231,637 recipes! The largest number of reviews for a single recipe is 1,613. However, looking at the summary of reviews by recipe, there seem to be many recipes with a low review count. For recipes with only a handful of reviews it is not really possible to draw any meaningful conclusions, so let's see what the data set holds for recipes with at least 25 reviews.
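
Before committing to a cutoff, it can help to see how much data a given threshold would keep. The sketch below is one possible way to do that, shown on a toy series rather than the real data; `toy_ids`, `threshold`, and the two `share_*` names are hypothetical and introduced only for this example.

```python
import pandas as pd

# Hypothetical recipe_id column: recipe 1 has 30 reviews, recipe 2 has 3, recipe 3 has 1
toy_ids = pd.Series([1] * 30 + [2] * 3 + [3] * 1, name="recipe_id")

# Number of reviews per recipe, largest first
counts = toy_ids.value_counts()

# Summary statistics (min, quartiles, max) of reviews per recipe
print(counts.describe())

threshold = 25

# Fraction of recipes that have at least `threshold` reviews
share_recipes = (counts >= threshold).mean()

# Fraction of all reviews that belong to those recipes
share_reviews = counts[counts >= threshold].sum() / counts.sum()

print(f"Recipes kept: {share_recipes:.1%}, reviews kept: {share_reviews:.1%}")
```

On the real data, `toy_ids` would be replaced by `df['recipe_id']`.
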

In [28]:
df = df[df.groupby('recipe_id')["recipe_id"].transform('size') >= 25]
# Rows of data 
print('Individual Recipe Reviews: ' + str(len(df)))

# Reviews of each recipe 
recipe_count = df['recipe_id'].value_counts()
print('Review Counts by Recipe:')
print(recipe_count)

Individual Recipe Reviews: 380460
Review Counts by Recipe:
2886      1613
27208     1601
89204     1579
39087     1448
67256     1322
          ... 
5274        25
174348      25
245231      25
100481      25
79222       25
Name: recipe_id, Length: 5843, dtype: int64
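
Comparing the two outputs, the filter keeps 380,460 of the 1,132,367 reviews (about 34%) and 5,843 of the 231,637 recipes (about 2.5%). To make the `groupby(...).transform('size')` step easier to follow, here is a minimal sketch of the same idea on a toy frame. The frame `toy`, the lower cutoff `min_reviews = 2`, and the alternative `value_counts` formulation are assumptions added for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy data: recipe 1 has 3 reviews, recipe 2 has 2, recipe 3 has 1 (values are made up)
toy = pd.DataFrame({
    "recipe_id": [1, 1, 1, 2, 2, 3],
    "rating": [5, 4, 5, 3, 4, 5],
})

min_reviews = 2  # the notebook uses 25; a small value keeps the toy example readable

# transform('size') returns one value per row: the size of the group that row belongs to
group_sizes = toy.groupby("recipe_id")["recipe_id"].transform("size")
print(group_sizes.tolist())

# Keep only the rows whose recipe has enough reviews
kept = toy[group_sizes >= min_reviews]

# An equivalent formulation: count reviews per recipe, then keep rows for the frequent ids
counts = toy["recipe_id"].value_counts()
kept_alt = toy[toy["recipe_id"].isin(counts[counts >= min_reviews].index)]

print(kept)
print(kept.equals(kept_alt))
```

Both versions select the same rows; the `transform` version avoids building the intermediate list of ids, while the `value_counts` version may read more naturally next to the counts printed above.
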
