# ✅Step 3: Data Cleaning

This notebook cleans up the `posts_with_comments.json` file by:
1. Removing posts with non-English titles
2. Removing irrelevant columns
3. Converting relevant values to the correct data types

## 0. 🎯Setup

In [37]:
import sys

import pandas as pd
from datetime import datetime, timedelta

from tqdm import tqdm
tqdm.pandas()

# Import our own modules
sys.path.append("../scripts/")
import chadtools

from tabulate import tabulate
from pprint import pprint

### 0.1. Load JSON file

Load the dataframe from the JSON file generated in step 02.

In [21]:
df = pd.read_json('../data/posts_with_comments.json', orient='records')
df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,num_crossposts,media,is_video,is_gallery,media_metadata,gallery_data,poll_data,crosspost_parent_list,crosspost_parent,author_cakeday
0,,recipes,,t2_9mmv4,False,,0,False,Buffalo Chicken Tenders,"[{'e': 'text', 't': 'Recipe'}]",...,0,,False,,,,,,,
1,,recipes,,t2_s92gwguui,False,,0,False,Prawn Katsu Baos,"[{'e': 'text', 't': 'Recipe'}]",...,0,,False,,,,,,,
2,,recipes,,t2_2elyzmmv,False,,0,False,Cinnamon Rolls,"[{'e': 'text', 't': 'Recipe'}]",...,0,,False,,,,,,,
3,,recipes,Cast Iron goodness with a pair of eggs sunny s...,t2_ub96nnb4,False,,0,False,Bacon Jalapeño Sweet Potato Hash,"[{'e': 'text', 't': 'Recipe'}]",...,0,,False,,,,,,,
4,,recipes,,t2_9mmv4,False,,0,False,Mushroom-Taleggio Risotto,"[{'e': 'text', 't': 'Recipe'}]",...,0,,False,,,,,,,


## 1. 🧹Simple Data Cleanup

Some posts are not formatted properly or have been deleted. We will remove them from our dataframe by checking for newlines, which are present in all properly formatted recipes.

In [22]:
# na=False treats missing comments as "no newline", so they are dropped too
df = df[df['ingredient_comment'].str.contains("\n", na=False)]

df.shape

(1747, 119)
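
As a small illustration of the newline filter, the sketch below runs the same kind of check on three invented comments rather than the real posts. It assumes missing comments should be dropped, which is why `na=False` is passed explicitly.

```python
import pandas as pd

# Toy comments (invented for illustration): one multi-line recipe,
# one single-line stub, and one missing value.
comments = pd.Series(["Ingredients\n- flour\n- eggs", "[deleted]", None])

# regex=False searches for a literal newline; na=False maps missing
# values to False so the result is a clean boolean mask.
mask = comments.str.contains("\n", regex=False, na=False)

kept = comments[mask]
print(kept.tolist())
```

Only the multi-line comment survives; the single-line stub and the missing value are both filtered out.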

Next, we drop the irrelevant columns and keep only the ones we need.

In [23]:
desired_columns = ['id',
                   'title',
                   'score',
                   'num_comments',
                   'ingredient_comment',
                   'created_utc', 
                   'upvote_ratio',
                   'link_flair_text',
                   'author',
                   'url',
                   'comment_link',
                   'permalink']

df_filtered = df.loc[:, desired_columns]
df_filtered.head()

Unnamed: 0,id,title,score,num_comments,ingredient_comment,created_utc,upvote_ratio,link_flair_text,author,url,comment_link,permalink
0,19d0wfc,Buffalo Chicken Tenders,202,12,**Recipe here originally:** [**Buffalo Chicken...,1705944195,0.96,Recipe,BushyEyes,https://i.redd.it/qtwisr8gz0ec1.jpeg,https://oauth.reddit.com/r/recipes/comments/19...,https://reddit.com/r/recipes/comments/19d0wfc/...
1,1998zka,Prawn Katsu Baos,272,11,This one is high impact and a showstopper for ...,1705528588,0.95,Recipe,TheLuckiestDragon,https://i.redd.it/q81uyef4o2dc1.jpeg,https://oauth.reddit.com/r/recipes/comments/19...,https://reddit.com/r/recipes/comments/1998zka/...
2,18zcqmd,Cinnamon Rolls,256,21,"# Homemade Cinnamon Rolls\n\nFor full recipe, ...",1704476711,0.96,Recipe,pangibear,https://i.redd.it/7uef78dbsnac1.jpeg,https://oauth.reddit.com/r/recipes/comments/18...,https://reddit.com/r/recipes/comments/18zcqmd/...
3,18xxyl1,Bacon Jalapeño Sweet Potato Hash,122,9,**Bacon Jalapeño Sweet Potato Hash**\n\n**Ingr...,1704325340,0.98,Recipe,zorionek0,https://i.redd.it/jcbdya99abac1.jpeg,https://oauth.reddit.com/r/recipes/comments/18...,https://reddit.com/r/recipes/comments/18xxyl1/...
4,18vf164,Mushroom-Taleggio Risotto,184,6,**Recipe here originally:** [**Mushroom-Talegg...,1704050722,0.98,Recipe,BushyEyes,https://i.redd.it/qc5akriilo9c1.jpeg,https://oauth.reddit.com/r/recipes/comments/18...,https://reddit.com/r/recipes/comments/18vf164/...


We then drop rows that are missing a value in any of the essential columns.

In [24]:
important_cols = ['id', 'title', 'ingredient_comment', 'permalink']
df_filtered = df_filtered.dropna(axis=0, subset=important_cols)
df_filtered.shape

(1747, 12)

### 1.1 Filter out posts with non-English titles

In [25]:
# filter the english posts by applying custom function
df_filtered = df_filtered[df_filtered['title'].apply(chadtools.is_english, threshold_rank=5)]

df_filtered.shape

(1635, 12)
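
The implementation of `chadtools.is_english` is not shown in this notebook. Purely to illustrate the filtering pattern, the sketch below uses a naive stand-in, `looks_english`, which is hypothetical and only checks the share of ASCII letters in a title. Its `min_ascii_ratio` parameter is also an assumption and is unrelated to `threshold_rank`.

```python
import pandas as pd

def looks_english(title: str, min_ascii_ratio: float = 0.9) -> bool:
    """Naive stand-in: True when most letters in the title are ASCII."""
    letters = [ch for ch in title if ch.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(ch.isascii() for ch in letters)
    return ascii_letters / len(letters) >= min_ascii_ratio

# The first two titles come from the data above; the third is invented.
titles = pd.Series([
    "Cinnamon Rolls",
    "Bacon Jalapeño Sweet Potato Hash",
    "Рецепт борща",
])

# Extra keyword arguments to Series.apply are forwarded to the function,
# the same mechanism used to pass threshold_rank above.
mask = titles.apply(looks_english, min_ascii_ratio=0.9)
english_titles = titles[mask]
print(english_titles.tolist())
```

A threshold below 1.0 keeps titles with the occasional accented letter, such as "Jalapeño", while dropping titles written in another script.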

### 1.2 Convert data to more appropriate types 

In [26]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1635 entries, 0 to 2065
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  1635 non-null   object 
 1   title               1635 non-null   object 
 2   score               1635 non-null   int64  
 3   num_comments        1635 non-null   int64  
 4   ingredient_comment  1635 non-null   object 
 5   created_utc         1635 non-null   int64  
 6   upvote_ratio        1635 non-null   float64
 7   link_flair_text     1635 non-null   object 
 8   author              1635 non-null   object 
 9   url                 1635 non-null   object 
 10  comment_link        1635 non-null   object 
 11  permalink           1635 non-null   object 
dtypes: float64(1), int64(3), object(8)
memory usage: 166.1+ KB


We will convert the following columns to more suitable types:
- `created_utc` to datetime
- `score` to int16
- `num_comments` to int16
- `upvote_ratio` to float16

Note that int16 tops out at 32,767, so a post with a higher score would need int32 instead.

#### 1.2.1 Convert created_utc to a datetime object

In [27]:
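# Interpret the values as seconds since the Unix epoch, in UTC.
# Unlike datetime.fromtimestamp, this does not depend on the machine's local timezone.
df_filtered['created_utc'] = pd.to_datetime(df_filtered['created_utc'], unit='s')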
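
A quick check of the epoch-to-datetime conversion, using the first two `created_utc` values shown in the table above. This is a minimal sketch that assumes the timestamps are seconds since the Unix epoch in UTC, as the column name suggests.

```python
from datetime import datetime, timezone

import pandas as pd

# First two created_utc values from the table above
epochs = pd.Series([1705944195, 1705528588])

# Vectorized conversion; unit="s" means seconds since 1970-01-01 UTC
converted = pd.to_datetime(epochs, unit="s")

# Cross-check against the standard library with an explicit UTC timezone
reference = datetime.fromtimestamp(1705944195, tz=timezone.utc)

print(converted.iloc[0])  # 2024-01-22 17:23:15
print(reference)
```

Passing an explicit timezone to `datetime.fromtimestamp` matters: without it, the result depends on the timezone of the machine running the notebook.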

Starting from 31 Aug 2020, r/recipes imposed strict rules on the format of new posts, which made the format substantially more consistent. Hence, we will only keep posts from 31 Aug 2020 onwards for ease of data collection.

In [28]:
cutoff_datetime = pd.to_datetime("2020-08-31 10:59:00")

# Filter out rows where 'created_utc' is before the cutoff datetime
df_filtered = df_filtered[df_filtered['created_utc'] >= cutoff_datetime]

df_filtered.head()

Unnamed: 0,id,title,score,num_comments,ingredient_comment,created_utc,upvote_ratio,link_flair_text,author,url,comment_link,permalink
0,19d0wfc,Buffalo Chicken Tenders,202,12,**Recipe here originally:** [**Buffalo Chicken...,2024-01-22 17:23:15,0.96,Recipe,BushyEyes,https://i.redd.it/qtwisr8gz0ec1.jpeg,https://oauth.reddit.com/r/recipes/comments/19...,https://reddit.com/r/recipes/comments/19d0wfc/...
1,1998zka,Prawn Katsu Baos,272,11,This one is high impact and a showstopper for ...,2024-01-17 21:56:28,0.95,Recipe,TheLuckiestDragon,https://i.redd.it/q81uyef4o2dc1.jpeg,https://oauth.reddit.com/r/recipes/comments/19...,https://reddit.com/r/recipes/comments/1998zka/...
2,18zcqmd,Cinnamon Rolls,256,21,"# Homemade Cinnamon Rolls\n\nFor full recipe, ...",2024-01-05 17:45:11,0.96,Recipe,pangibear,https://i.redd.it/7uef78dbsnac1.jpeg,https://oauth.reddit.com/r/recipes/comments/18...,https://reddit.com/r/recipes/comments/18zcqmd/...
4,18vf164,Mushroom-Taleggio Risotto,184,6,**Recipe here originally:** [**Mushroom-Talegg...,2023-12-31 19:25:22,0.98,Recipe,BushyEyes,https://i.redd.it/qc5akriilo9c1.jpeg,https://oauth.reddit.com/r/recipes/comments/18...,https://reddit.com/r/recipes/comments/18vf164/...
5,18v7m3w,Cinnamon Oatmeal Chocolate Chip Cookies (Recipe),211,16,[RECIPE LINK](https://www.sarahfreia.com/blog/...,2023-12-31 13:19:31,0.95,Recipe,sarahfreia,https://i.redd.it/aki9a36yrm9c1.jpeg,https://oauth.reddit.com/r/recipes/comments/18...,https://reddit.com/r/recipes/comments/18v7m3w/...


#### 1.2.2 Convert the numeric columns to more compact types

In [29]:
df_filtered['score'] = df_filtered['score'].astype('int16')
df_filtered['num_comments'] = df_filtered['num_comments'].astype('int16')   
df_filtered['upvote_ratio'] = df_filtered['upvote_ratio'].astype('float16')
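
Downcasting is only safe when every value fits the target type. The sketch below shows one way to check this, using two scores from the data plus an invented outlier of 40000. It also shows the rounding that float16 introduces.

```python
import numpy as np
import pandas as pd

# 202 and 272 appear in the data above; 40000 is an invented outlier
scores = pd.Series([202, 272, 40000])

int16_info = np.iinfo(np.int16)  # range is -32768 .. 32767
fits_int16 = scores.between(int16_info.min, int16_info.max).all()

# pd.to_numeric picks the smallest integer dtype that holds every value
safe_scores = pd.to_numeric(scores, downcast="integer")

# float16 keeps roughly three significant decimal digits
ratio16 = np.float16(0.96)

print(fits_int16, safe_scores.dtype, float(ratio16))
```

Here the outlier forces int32, and 0.96 is stored as 0.9599609375, which is why `upvote_ratio` values print as 0.959961 after conversion.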

In [30]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 0 to 1990
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   id                  1000 non-null   object        
 1   title               1000 non-null   object        
 2   score               1000 non-null   int16         
 3   num_comments        1000 non-null   int16         
 4   ingredient_comment  1000 non-null   object        
 5   created_utc         1000 non-null   datetime64[ns]
 6   upvote_ratio        1000 non-null   float16       
 7   link_flair_text     1000 non-null   object        
 8   author              1000 non-null   object        
 9   url                 1000 non-null   object        
 10  comment_link        1000 non-null   object        
 11  permalink           1000 non-null   object        
dtypes: datetime64[ns](1), float16(1), int16(2), object(8)
memory usage: 84.0+ KB


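The reported memory usage drops from 166.1+ KB to 84.0+ KB, roughly 50%. Most of that comes from the date filter, which cut the dataframe from 1635 to 1000 rows. Per row, the smaller dtypes reduce the footprint from about 104 to 86 bytes (around 17%). The `+` indicates that the contents of the object columns are not counted.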
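
To separate the effect of the dtype changes from the effect of dropping rows, memory can be compared on a frame with a fixed number of rows. The sketch below does this on synthetic numeric columns. The object columns are left out on purpose, since their reported size depends on whether `deep=True` is used.

```python
import numpy as np
import pandas as pd

n = 1000  # fixed row count, so only the dtypes change

# Synthetic stand-ins for the four non-object columns
before = pd.DataFrame({
    "score": np.arange(n, dtype="int64"),
    "num_comments": np.arange(n, dtype="int64"),
    "upvote_ratio": np.linspace(0.5, 1.0, n),
    "created_utc": pd.to_datetime(np.arange(n), unit="s"),
})

after = before.astype({
    "score": "int16",
    "num_comments": "int16",
    "upvote_ratio": "float16",
})

# index=False excludes the index so only column storage is compared
bytes_before = before.memory_usage(index=False).sum()
bytes_after = after.memory_usage(index=False).sum()

print(bytes_before / n, bytes_after / n)  # bytes per row
```

The three downcast columns shrink from 8 bytes to 2 bytes each, while the datetime column stays at 8 bytes per row.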

### Save df_filtered as a JSON file

In [31]:
df_filtered.to_json('../data/cleaned_posts.json', indent=4, orient='records')
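
One thing to be aware of when saving: with default settings, `to_json` has historically written datetime columns as epoch milliseconds rather than readable strings. If readable timestamps are preferred in the JSON file, `date_format="iso"` is an option. The sketch below uses a one-row toy frame, not the real data.

```python
import json

import pandas as pd

# One-row toy frame with the first post's id and timestamp
df = pd.DataFrame({
    "id": ["19d0wfc"],
    "created_utc": pd.to_datetime([1705944195], unit="s"),
})

# date_format="iso" writes ISO 8601 strings instead of epoch numbers
iso_json = df.to_json(orient="records", date_format="iso")
records = json.loads(iso_json)

print(records[0]["created_utc"])
```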

Save the first five rows as HTML for embedding in Index.md.

In [54]:
df_filtered.head().to_html('../docs/df_filtered.html')


## Extract Ingredients

We will use regular expressions to extract the ingredients portion of each comment and discard the cooking instructions for simplicity. We will then use GPT-3.5 to identify the individual ingredients.

### Extract the ingredients portion of the comments

In [33]:
df_filtered['ingredient_comment_truncated'] = df_filtered['ingredient_comment'].progress_apply(chadtools.extract_ingredients_text)

100%|██████████| 1000/1000 [00:00<00:00, 8845.46it/s]
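
The body of `chadtools.extract_ingredients_text` is not shown in this notebook. As a rough idea of what regex-based extraction could look like, the sketch below defines a hypothetical `extract_ingredients_sketch`. The heading words it looks for ("Ingredients", "Instructions", "Directions", "Method", "Steps") are assumptions, and it returns the comment unchanged when no ingredients heading is found.

```python
import re

# Capture everything after an "Ingredient(s)" heading, up to the next
# instructions-style heading or the end of the text.
SECTION_RE = re.compile(
    r"ingredients?\b(.*?)(?=\b(?:instructions?|directions?|method|steps)\b|\Z)",
    flags=re.IGNORECASE | re.DOTALL,
)

def extract_ingredients_sketch(comment: str) -> str:
    """Return the ingredients section, or the whole comment if none is found."""
    match = SECTION_RE.search(comment)
    if match is None:
        return comment
    # Trim markdown emphasis, heading marks, and whitespace at both ends
    return match.group(1).strip(" \n*:#")

# Invented example comment in the style of the posts above
comment = "**Ingredients**\n\n- 2 eggs\n- 1 cup flour\n\n**Instructions**\n\nMix and bake."
print(extract_ingredients_sketch(comment))
```

A pattern like this is fragile: an ingredient line that happens to contain the word "method" or "steps" would cut the section short, so real comments need more careful handling.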


### Set up GPT Client

In [34]:
my_client = chadtools.setup_gpt_client()

KeyError: 'openai_api_key'
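
The `KeyError` above indicates that no API key was found. How `chadtools.setup_gpt_client` looks up the key is not shown here. The sketch below is one possible way to fail with a clearer message, using a hypothetical `read_api_key` helper and assuming the key is stored in an `OPENAI_API_KEY` environment variable.

```python
import os

def read_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from a mapping (os.environ by default).

    Hypothetical helper; the variable name is an assumption.
    """
    env = os.environ if env is None else env
    key = env.get(name)
    if not key:
        raise RuntimeError(
            f"Set the {name} environment variable before creating the GPT client."
        )
    return key

# Demonstration with an explicit mapping instead of the real environment
demo_key = read_api_key({"OPENAI_API_KEY": "sk-example"})
print(demo_key)
```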

In [None]:
df_filtered['ingredient_comment_truncated'].head()

0    **Recipe here originally:** [**Buffalo Chicken...
1    This one is high impact and a showstopper for ...
2    # Homemade Cinnamon Rolls\n\nFor full recipe, ...
4    **Recipe here originally:** [**Mushroom-Talegg...
5    [RECIPE LINK](https://www.sarahfreia.com/blog/...
Name: ingredient_comment_truncated, dtype: object

### Use GPT to identify ingredients

WARNING: This step calls the OpenAI GPT API, which incurs costs. Make sure your account has sufficient credit before running it.

In [None]:
df_filtered['gpt_ingredients'] = df_filtered['ingredient_comment_truncated'].progress_apply(chadtools.get_ingredient_list, client=my_client)
df_filtered['gpt_ingredients'].head()
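
Since every row triggers a paid API call, it can help to avoid sending the same text twice. The sketch below shows a simple in-memory cache around any per-text function. `apply_with_cache` and `fake_ingredient_lookup` are hypothetical names, the stub stands in for the real GPT call, and whether duplicate comments actually occur in this dataset has not been checked.

```python
import pandas as pd

def apply_with_cache(texts: pd.Series, fn) -> pd.Series:
    """Apply fn to each text, calling it only once per distinct text."""
    cache = {}

    def cached(text):
        if text not in cache:
            cache[text] = fn(text)
        return cache[text]

    return texts.apply(cached)

# Stub that records its calls; it stands in for the paid API request
calls = []

def fake_ingredient_lookup(text):
    calls.append(text)
    return text.split(", ")

texts = pd.Series(["flour, eggs", "butter", "flour, eggs"])
result = apply_with_cache(texts, fake_ingredient_lookup)

print(len(calls))  # number of times the stub was actually invoked
```

The stub is called twice for three rows, because the repeated text is served from the cache.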

In [None]:
df_filtered.head()

Unnamed: 0,id,title,score,num_comments,ingredient_comment,created_utc,upvote_ratio,link_flair_text,author,url,comment_link,permalink,ingredient_comment_truncated,gpt_ingredients
0,19d0wfc,Buffalo Chicken Tenders,202,12,**Recipe here originally:** [**Buffalo Chicken...,2024-01-22 17:23:15,0.959961,Recipe,BushyEyes,https://i.redd.it/qtwisr8gz0ec1.jpeg,https://oauth.reddit.com/r/recipes/comments/19...,https://reddit.com/r/recipes/comments/19d0wfc/...,**Recipe here originally:** [**Buffalo Chicken...,"[chicken tenderloins, flour, garlic powder, eg..."
1,1998zka,Prawn Katsu Baos,272,11,This one is high impact and a showstopper for ...,2024-01-17 21:56:28,0.950195,Recipe,TheLuckiestDragon,https://i.redd.it/q81uyef4o2dc1.jpeg,https://oauth.reddit.com/r/recipes/comments/19...,https://reddit.com/r/recipes/comments/1998zka/...,This one is high impact and a showstopper for ...,"[kewpie, plain yoghurt, dill pickles, capers, ..."
2,18zcqmd,Cinnamon Rolls,256,21,"# Homemade Cinnamon Rolls\n\nFor full recipe, ...",2024-01-05 17:45:11,0.959961,Recipe,pangibear,https://i.redd.it/7uef78dbsnac1.jpeg,https://oauth.reddit.com/r/recipes/comments/18...,https://reddit.com/r/recipes/comments/18zcqmd/...,"# Homemade Cinnamon Rolls\n\nFor full recipe, ...","[cinnamon roll dough, warm milk, instant yeast..."
4,18vf164,Mushroom-Taleggio Risotto,184,6,**Recipe here originally:** [**Mushroom-Talegg...,2023-12-31 19:25:22,0.97998,Recipe,BushyEyes,https://i.redd.it/qc5akriilo9c1.jpeg,https://oauth.reddit.com/r/recipes/comments/18...,https://reddit.com/r/recipes/comments/18vf164/...,**Recipe here originally:** [**Mushroom-Talegg...,"[dry porcini mushrooms, boiling water, cremini..."
5,18v7m3w,Cinnamon Oatmeal Chocolate Chip Cookies (Recipe),211,16,[RECIPE LINK](https://www.sarahfreia.com/blog/...,2023-12-31 13:19:31,0.950195,Recipe,sarahfreia,https://i.redd.it/aki9a36yrm9c1.jpeg,https://oauth.reddit.com/r/recipes/comments/18...,https://reddit.com/r/recipes/comments/18v7m3w/...,[RECIPE LINK](https://www.sarahfreia.com/blog/...,"[unsalted butter, brown sugar, granulated suga..."


### Reorder, filter, save as JSON

Finally, filter and reorder the columns, and save the dataframe as a JSON file for use in the next step.

In [None]:
ordered_cols = ["id", "title", "gpt_ingredients", "ingredient_comment_truncated", "score", "upvote_ratio", "link_flair_text", "author", "created_utc", "url", "permalink"]
df_filtered = df_filtered.loc[:, ordered_cols]
df_filtered.to_json('../data/cleaned_posts_with_ingredient_list.json', orient='records', indent=4)