# 🧹 02 Data Cleaning

This notebook cleans up the initially scraped posts by:
1. Removing posts with non-English titles
2. Removing irrelevant columns
3. Converting relevant columns to the correct data types

## 0. 🎯 Import libraries

In [12]:
import sys
import json
import requests as r

import numpy as np
import pandas as pd
from datetime import datetime, timedelta

import spacy
import re

from pprint import pprint
from tqdm import tqdm

# Import our own modules
sys.path.append("../scripts/")
import chadtools

### 0.1. Load JSON file

Load the dataframe from the JSON file generated from 02.

In [13]:
df = pd.read_json('../data/posts_with_comments.json', orient='records')
df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,num_crossposts,media,is_video,is_gallery,media_metadata,gallery_data,poll_data,crosspost_parent_list,crosspost_parent,author_cakeday
0,,recipes,,t2_9mmv4,False,,0,False,Buffalo Chicken Tenders,"[{'e': 'text', 't': 'Recipe'}]",...,0,,False,,,,,,,
1,,recipes,,t2_s92gwguui,False,,0,False,Prawn Katsu Baos,"[{'e': 'text', 't': 'Recipe'}]",...,0,,False,,,,,,,
2,,recipes,,t2_2elyzmmv,False,,0,False,Cinnamon Rolls,"[{'e': 'text', 't': 'Recipe'}]",...,0,,False,,,,,,,
3,,recipes,Cast Iron goodness with a pair of eggs sunny s...,t2_ub96nnb4,False,,0,False,Bacon Jalapeño Sweet Potato Hash,"[{'e': 'text', 't': 'Recipe'}]",...,0,,False,,,,,,,
4,,recipes,,t2_9mmv4,False,,0,False,Mushroom-Taleggio Risotto,"[{'e': 'text', 't': 'Recipe'}]",...,0,,False,,,,,,,


## 1. 🧹 Simple Data Cleanup

Some posts are not formatted properly or have been deleted. We will remove them from our dataframe by checking for newlines, which are present in all properly formatted recipes.

In [14]:
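# Keep only rows whose comment contains a newline; na=False treats missing comments as non-matching
df = df[df['ingredient_comment'].str.contains("\n", na=False)]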

df.shape

(1747, 119)
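
A minimal sketch of how the newline check behaves on deleted or missing comments. The toy frame and its values are made up for illustration; only the column name `ingredient_comment` comes from the notebook.

```python
import pandas as pd

# Toy stand-in for the scraped data; the values are made up for illustration.
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "ingredient_comment": ["Ingredients:\n* 1 onion", "[deleted]", None],
})

# na=False makes missing comments count as "no newline" instead of NaN,
# so the result is a clean boolean mask.
has_newline = toy["ingredient_comment"].str.contains("\n", na=False)
kept = toy[has_newline]

print(kept["id"].tolist())  # ['a']
```

Only the multi-line comment survives; the single-line placeholder and the missing value are both dropped.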

We then keep only the columns that are relevant for the analysis.

In [15]:
desired_columns = ['id',
                   'title',
                   'score',
                   'num_comments',
                   'ingredient_comment',
                   'created_utc', 
                   'upvote_ratio',
                   'link_flair_text',
                   'author',
                   'url',
                   'comment_link',
                   'permalink']

df_filtered = df.loc[:, desired_columns]
df_filtered.head()

Unnamed: 0,id,title,score,num_comments,ingredient_comment,created_utc,upvote_ratio,link_flair_text,author,url,comment_link,permalink
0,19d0wfc,Buffalo Chicken Tenders,202,12,**Recipe here originally:** [**Buffalo Chicken...,1705944195,0.96,Recipe,BushyEyes,https://i.redd.it/qtwisr8gz0ec1.jpeg,https://oauth.reddit.com/r/recipes/comments/19...,https://reddit.com/r/recipes/comments/19d0wfc/...
1,1998zka,Prawn Katsu Baos,272,11,This one is high impact and a showstopper for ...,1705528588,0.95,Recipe,TheLuckiestDragon,https://i.redd.it/q81uyef4o2dc1.jpeg,https://oauth.reddit.com/r/recipes/comments/19...,https://reddit.com/r/recipes/comments/1998zka/...
2,18zcqmd,Cinnamon Rolls,256,21,"# Homemade Cinnamon Rolls\n\nFor full recipe, ...",1704476711,0.96,Recipe,pangibear,https://i.redd.it/7uef78dbsnac1.jpeg,https://oauth.reddit.com/r/recipes/comments/18...,https://reddit.com/r/recipes/comments/18zcqmd/...
3,18xxyl1,Bacon Jalapeño Sweet Potato Hash,122,9,**Bacon Jalapeño Sweet Potato Hash**\n\n**Ingr...,1704325340,0.98,Recipe,zorionek0,https://i.redd.it/jcbdya99abac1.jpeg,https://oauth.reddit.com/r/recipes/comments/18...,https://reddit.com/r/recipes/comments/18xxyl1/...
4,18vf164,Mushroom-Taleggio Risotto,184,6,**Recipe here originally:** [**Mushroom-Talegg...,1704050722,0.98,Recipe,BushyEyes,https://i.redd.it/qc5akriilo9c1.jpeg,https://oauth.reddit.com/r/recipes/comments/18...,https://reddit.com/r/recipes/comments/18vf164/...


Next, we drop any columns that contain no values at all.

In [16]:
df_filtered = df_filtered.dropna(axis=1, how='all')
df_filtered.tail()

Unnamed: 0,id,title,score,num_comments,ingredient_comment,created_utc,upvote_ratio,link_flair_text,author,url,comment_link,permalink
2060,czh22m,Vegetable bhaji recipe,13,2,MAKES\n\n18\n\nPREP TIME\n\n10MINS\n\nCOOK TIM...,1567578230,0.8,Fruit\Vegetarian,mark30322,https://i.redd.it/sprijosdtik31.jpg,https://oauth.reddit.com/r/recipes/comments/cz...,https://reddit.com/r/recipes/comments/czh22m/v...
2061,cz7pe8,Eggplant Chickpea Dip,10,1,Recipe: [https://edaqaskitchen.com/recipe/eggp...,1567530528,0.86,Fruit\Vegetarian,mortoray,https://imgur.com/gjUBUU7,https://oauth.reddit.com/r/recipes/comments/cz...,https://reddit.com/r/recipes/comments/cz7pe8/e...
2062,cymk0i,End-Of-Summer Sesame Slaw,22,4,I just started a blog about food and homestead...,1567414819,0.76,Fruit\Vegetarian,sweetpotatofamily,https://i.redd.it/8nehck9hb5k31.jpg,https://oauth.reddit.com/r/recipes/comments/cy...,https://reddit.com/r/recipes/comments/cymk0i/e...
2064,cwwfal,Restaurant Style Phool Gobhi Masala Recipe,20,1,"Ingredients\n\n12 Cauliflower (gobi)\t, cut to...",1567055721,0.88,Fruit\Vegetarian,mark30322,https://i.redd.it/ycwjgo0pnbj31.jpg,https://oauth.reddit.com/r/recipes/comments/cw...,https://reddit.com/r/recipes/comments/cwwfal/r...
2065,csv234,Celery and Soy Stuffed Butternut Squash,7,1,Recipe: [https://edaqaskitchen.com/recipe/cele...,1566290287,0.74,Fruit\Vegetarian,mortoray,https://imgur.com/OyakVfz,https://oauth.reddit.com/r/recipes/comments/cs...,https://reddit.com/r/recipes/comments/csv234/c...


### 1.1 Filter out posts with non-English titles

In [17]:
# load the English language model into spacy
nlp = spacy.load("en_core_web_sm")

# filter the english posts by applying custom function
df_filtered = df_filtered[df_filtered['title'].apply(chadtools.is_english, model=nlp)]

df_filtered.tail()

Unnamed: 0,id,title,score,num_comments,ingredient_comment,created_utc,upvote_ratio,link_flair_text,author,url,comment_link,permalink
2060,czh22m,Vegetable bhaji recipe,13,2,MAKES\n\n18\n\nPREP TIME\n\n10MINS\n\nCOOK TIM...,1567578230,0.8,Fruit\Vegetarian,mark30322,https://i.redd.it/sprijosdtik31.jpg,https://oauth.reddit.com/r/recipes/comments/cz...,https://reddit.com/r/recipes/comments/czh22m/v...
2061,cz7pe8,Eggplant Chickpea Dip,10,1,Recipe: [https://edaqaskitchen.com/recipe/eggp...,1567530528,0.86,Fruit\Vegetarian,mortoray,https://imgur.com/gjUBUU7,https://oauth.reddit.com/r/recipes/comments/cz...,https://reddit.com/r/recipes/comments/cz7pe8/e...
2062,cymk0i,End-Of-Summer Sesame Slaw,22,4,I just started a blog about food and homestead...,1567414819,0.76,Fruit\Vegetarian,sweetpotatofamily,https://i.redd.it/8nehck9hb5k31.jpg,https://oauth.reddit.com/r/recipes/comments/cy...,https://reddit.com/r/recipes/comments/cymk0i/e...
2064,cwwfal,Restaurant Style Phool Gobhi Masala Recipe,20,1,"Ingredients\n\n12 Cauliflower (gobi)\t, cut to...",1567055721,0.88,Fruit\Vegetarian,mark30322,https://i.redd.it/ycwjgo0pnbj31.jpg,https://oauth.reddit.com/r/recipes/comments/cw...,https://reddit.com/r/recipes/comments/cwwfal/r...
2065,csv234,Celery and Soy Stuffed Butternut Squash,7,1,Recipe: [https://edaqaskitchen.com/recipe/cele...,1566290287,0.74,Fruit\Vegetarian,mortoray,https://imgur.com/OyakVfz,https://oauth.reddit.com/r/recipes/comments/cs...,https://reddit.com/r/recipes/comments/csv234/c...
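
The implementation of `chadtools.is_english` is not shown in this notebook. Purely as an illustration of what a title filter can look like, here is a crude character-based heuristic. The function `looks_english` and its `min_ascii_ratio` threshold are hypothetical and are not the method used above; the heuristic cannot tell English apart from other languages written in the Latin alphabet.

```python
def looks_english(title: str, min_ascii_ratio: float = 0.9) -> bool:
    """Crude check: most alphabetic characters are ASCII letters."""
    letters = [ch for ch in title if ch.isalpha()]
    if not letters:
        # Titles without any letters carry no signal, so reject them.
        return False
    ascii_letters = sum(ch.isascii() for ch in letters)
    return ascii_letters / len(letters) >= min_ascii_ratio


print(looks_english("Bacon Jalapeño Sweet Potato Hash"))  # True
print(looks_english("Борщ зелёный"))                      # False
```

Using a ratio rather than requiring pure ASCII keeps titles with an occasional accented letter, such as "Jalapeño".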


### 1.2 Change data to more appropriate types

In [18]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1747 entries, 0 to 2065
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  1747 non-null   object 
 1   title               1747 non-null   object 
 2   score               1747 non-null   int64  
 3   num_comments        1747 non-null   int64  
 4   ingredient_comment  1747 non-null   object 
 5   created_utc         1747 non-null   int64  
 6   upvote_ratio        1747 non-null   float64
 7   link_flair_text     1747 non-null   object 
 8   author              1747 non-null   object 
 9   url                 1747 non-null   object 
 10  comment_link        1747 non-null   object 
 11  permalink           1747 non-null   object 
dtypes: float64(1), int64(3), object(8)
memory usage: 177.4+ KB


We will convert the following columns to more appropriate types:
- `score` to int16
- `num_comments` to int16
- `created_utc` to datetime
- `upvote_ratio` to float16

The `title` column is left as `object`.

#### 1.2.1 Convert created_utc to a datetime object

In [19]:
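# Interpret the integer timestamps as seconds since the Unix epoch (UTC),
# independent of the local time zone of the machine running the notebook
df_filtered['created_utc'] = pd.to_datetime(df_filtered['created_utc'], unit='s')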

In [20]:
cutoff_datetime = pd.to_datetime("2020-08-31 10:59:00")

# Filter out rows where 'created_utc' is before the cutoff datetime
df_filtered = df_filtered[df_filtered['created_utc'] >= cutoff_datetime]

df_filtered.head()

Unnamed: 0,id,title,score,num_comments,ingredient_comment,created_utc,upvote_ratio,link_flair_text,author,url,comment_link,permalink
0,19d0wfc,Buffalo Chicken Tenders,202,12,**Recipe here originally:** [**Buffalo Chicken...,2024-01-22 17:23:15,0.96,Recipe,BushyEyes,https://i.redd.it/qtwisr8gz0ec1.jpeg,https://oauth.reddit.com/r/recipes/comments/19...,https://reddit.com/r/recipes/comments/19d0wfc/...
1,1998zka,Prawn Katsu Baos,272,11,This one is high impact and a showstopper for ...,2024-01-17 21:56:28,0.95,Recipe,TheLuckiestDragon,https://i.redd.it/q81uyef4o2dc1.jpeg,https://oauth.reddit.com/r/recipes/comments/19...,https://reddit.com/r/recipes/comments/1998zka/...
2,18zcqmd,Cinnamon Rolls,256,21,"# Homemade Cinnamon Rolls\n\nFor full recipe, ...",2024-01-05 17:45:11,0.96,Recipe,pangibear,https://i.redd.it/7uef78dbsnac1.jpeg,https://oauth.reddit.com/r/recipes/comments/18...,https://reddit.com/r/recipes/comments/18zcqmd/...
3,18xxyl1,Bacon Jalapeño Sweet Potato Hash,122,9,**Bacon Jalapeño Sweet Potato Hash**\n\n**Ingr...,2024-01-03 23:42:20,0.98,Recipe,zorionek0,https://i.redd.it/jcbdya99abac1.jpeg,https://oauth.reddit.com/r/recipes/comments/18...,https://reddit.com/r/recipes/comments/18xxyl1/...
4,18vf164,Mushroom-Taleggio Risotto,184,6,**Recipe here originally:** [**Mushroom-Talegg...,2023-12-31 19:25:22,0.98,Recipe,BushyEyes,https://i.redd.it/qc5akriilo9c1.jpeg,https://oauth.reddit.com/r/recipes/comments/18...,https://reddit.com/r/recipes/comments/18vf164/...
...,...,...,...,...,...,...,...,...,...,...,...,...
1986,jcgb7j,Bitter gourd yogurt curry....with no bitternes...,7,6,Recipe.....\n\n[Short Video](https://youtu.be/...,2020-10-16 20:18:12,0.65,Fruit\Vegetarian,PassionateHobbies,https://i.redd.it/bpootodgbit51.jpg,https://oauth.reddit.com/r/recipes/comments/jc...,https://reddit.com/r/recipes/comments/jcgb7j/b...
1987,jb5peu,Punjabi Aloo Samosa,38,1,For video instruction follow this link: [http...,2020-10-14 18:51:34,0.96,Fruit\Vegetarian,Pakladies,https://i.redd.it/9kndhfs2m3t51.jpg,https://oauth.reddit.com/r/recipes/comments/jb...,https://reddit.com/r/recipes/comments/jb5peu/p...
1988,iz12pg,Ottolenghi's Baked Orzo w/Mozzarella,22,5,Ingredients:\n\n* 7 Tablespoons olive oil\n* ...,2020-09-24 17:59:05,0.83,Fruit\Vegetarian,BrinaElka,https://i.redd.it/l7osuhkcm4p51.jpg,https://oauth.reddit.com/r/recipes/comments/iz...,https://reddit.com/r/recipes/comments/iz12pg/o...
1989,iw3wli,Mushroom Barley Stew with Crispy Oyster Mushrooms,2694,41,**Recipe here originally:** [**Easy Mushroom B...,2020-09-20 01:27:07,0.98,Fruit\Vegetarian,BushyEyes,https://i.redd.it/511qxuct57o51.jpg,https://oauth.reddit.com/r/recipes/comments/iw...,https://reddit.com/r/recipes/comments/iw3wli/m...
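
Reddit's `created_utc` values are Unix timestamps, that is, seconds since the epoch in UTC. The sketch below uses only the standard library to show how one of the timestamps above maps to a UTC datetime and how it compares with the cutoff. The variable names are illustrative.

```python
from datetime import datetime, timezone

ts = 1705944195  # created_utc of the first post shown above

# Passing tz=timezone.utc pins the result to UTC regardless of the machine.
as_utc = datetime.fromtimestamp(ts, tz=timezone.utc)
print(as_utc.isoformat())  # 2024-01-22T17:23:15+00:00

# Dropping tzinfo gives a naive value comparable with a naive cutoff.
naive_utc = as_utc.replace(tzinfo=None)
cutoff = datetime(2020, 8, 31, 10, 59)
print(naive_utc >= cutoff)  # True
```

Without an explicit time zone, `datetime.fromtimestamp` converts to the local time of the machine, so the same timestamp can land on either side of a cutoff depending on where the notebook runs.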


#### 1.2.2 Change columns to a more size-efficient integer/float type

In [21]:
int_cols = df_filtered.select_dtypes(include='int64').columns
df_filtered[int_cols].head()

Unnamed: 0,score,num_comments
0,202,12
1,272,11
2,256,21
3,122,9
4,184,6


Convert the numeric columns to more size-efficient types.

In [22]:
df_filtered['score'] = df_filtered['score'].astype('int16')
df_filtered['num_comments'] = df_filtered['num_comments'].astype('int16')   
df_filtered['upvote_ratio'] = df_filtered['upvote_ratio'].astype('float16')
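
Casting to a narrower integer type is only safe when every value fits. An `int16` holds values up to 32767, and larger values can wrap around silently. Below is a small sketch of a range check before downcasting. The helper `fits` and the sample scores (including the outlier) are made up for illustration.

```python
import numpy as np
import pandas as pd


def fits(series: pd.Series, dtype: str) -> bool:
    """Return True if every value lies within the integer dtype's range."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)


scores = pd.Series([202, 272, 2694, 40000])  # 40000 is a made-up outlier

print(fits(scores, "int16"))  # False
print(fits(scores, "int32"))  # True

# pd.to_numeric picks the smallest integer dtype that holds the data.
print(pd.to_numeric(scores, downcast="integer").dtype)  # int32
```

Letting `pd.to_numeric(..., downcast="integer")` choose the dtype avoids hard-coding a width that a future scrape might exceed.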

In [23]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1090 entries, 0 to 1990
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   id                  1090 non-null   object        
 1   title               1090 non-null   object        
 2   score               1090 non-null   int16         
 3   num_comments        1090 non-null   int16         
 4   ingredient_comment  1090 non-null   object        
 5   created_utc         1090 non-null   datetime64[ns]
 6   upvote_ratio        1090 non-null   float16       
 7   link_flair_text     1090 non-null   object        
 8   author              1090 non-null   object        
 9   url                 1090 non-null   object        
 10  comment_link        1090 non-null   object        
 11  permalink           1090 non-null   object        
dtypes: datetime64[ns](1), float16(1), int16(2), object(8)
memory usage: 91.5+ KB


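As seen, the reported memory usage dropped from 177.4+ KB to 91.5+ KB, roughly 48%. Part of that reduction comes from the date filter, which cut the dataframe from 1747 to 1090 rows, rather than from the smaller dtypes alone.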
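
To isolate the effect of the dtype change from any change in row count, one option is to measure the same rows before and after casting. The sketch below does this on a made-up frame that reuses the three numeric column names. The data and the row count `n` are assumptions for illustration.

```python
import pandas as pd

# Made-up frame with the same numeric columns as df_filtered.
n = 1000
toy = pd.DataFrame({
    "score": pd.Series(range(n), dtype="int64"),
    "num_comments": pd.Series([5] * n, dtype="int64"),
    "upvote_ratio": pd.Series([0.96] * n, dtype="float64"),
})

# Bytes used by the column data only (index excluded).
before = toy.memory_usage(index=False, deep=True).sum()

small = toy.astype({"score": "int16", "num_comments": "int16", "upvote_ratio": "float16"})
after = small.memory_usage(index=False, deep=True).sum()

print(before, after)  # 24000 6000
```

The numeric columns shrink to a quarter of their size, but in the real dataframe the eight `object` columns are unaffected, so the overall saving from dtypes alone is smaller.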

### Save df_filtered as JSON and CSV files

In [24]:
df_filtered.to_json('../data/cleaned_posts.json', indent=4, orient='records')

In [25]:
df_filtered.to_csv('../data/cleaned_posts.csv', index=False)
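
CSV files do not store dtypes, so the compact types chosen earlier have to be restored when the file is read back. The following is a sketch of such a round trip using an in-memory buffer instead of `../data/cleaned_posts.csv`. The two-row frame is made up for illustration.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "score": pd.Series([202, 272], dtype="int16"),
    "created_utc": pd.to_datetime(["2024-01-22 17:23:15", "2024-01-17 21:56:28"]),
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without hints, read_csv infers a 64-bit integer for score.
buffer.seek(0)
plain = pd.read_csv(buffer)
print(plain["score"].dtype)  # int64

# With hints, the compact integer and the datetime column are restored.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"score": "int16"}, parse_dates=["created_utc"])
print(restored["score"].dtype)  # int16
```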

In [26]:
test_comment_1 = """One of my favorite Ukrainian recipes is the lesser known green version of the famous borshch. This one replaces the beets with sorrel.

It is also eaten in other ex-PLC countries like Poland, belarus, and Lithuania!

## [Ukrainian Green Borshch](https://cookingtoentertain.com/green-borscht/)

**INGREDIENTS**
  
• 500 grams Pork Ribs

• 500 grams Young Potatoes cubed

• 200 grams Sorrel fresh

• 1 Onion

• 1 Carrot

• 5 Eggs 4 hardboiled

• 1 tbsp Sour Cream or Smetana if you can find it


**INSTRUCTIONS**
 
1. In a pot add the pork ribs along with salt and pepper and the bay leaves. Add water up to 60% of the pot. Bring to a boil, then lower to a simmer and cover with a lid for one hour.
Add in the potatoes and bring back up to a boil. Let cook for 10 minutes.

2. While the potatoes are cooking, quickly fry some grated onion and carrot in a pan with a bit of oil. Add to the borshch and give everything a stir. Also chop up the hard boiled eggs and add that in.

3. In a small bowl beat together an egg and the sour cream. Swirl the pot of boiling borshch and slowly pour in the egg mixture so it cooks immediately as it hits the soup.

4. Turn off the heat and add in the chopped sorrel. Give everything a good stir and let sit for a few minutes before serving. Taste for salt and pepper and adjust as needed.
"""

test_comment_2 = """Recipe here originally: Leftover Turkey Soup

Stock (optional to make; can use chicken broth instead):

1 turkey carcass

Water

Salt

Soup:

1 tablespoon extra virgin olive oil

1 yellow onion, peeled and diced

4 carrots, peeled and diced

4 ribs celery, trimmed and diced

1 fennel bulb, trimmed, cored, and thinly sliced

5 cloves garlic, peeled and minced

5 sprigs thyme, bundled together with kitchen twine

6-7 cups prepared stock from above or use chicken broth

4 cups chopped or shredded leftover turkey; use in addition to any meat you pull off the turkey carcass

¾ cup pastina or ditalini

1 lemon, juiced

½ cup fresh parsley, minced

Big pinch of fennel fronds, minced

Crushed red pepper to taste

Salt and pepper

Make the stock:

Place the turkey carcass in a large stockpot and cover with 12 cups water. You may need more depending on the size of the carcass. Try your best to immerse the bird with water, but if your pot isn’t big enough, it’s ok if the back bone sticks out a bit. Add a big pinch of salt to the water.

Bring to a boil and then simmer for 2-3 hours. You may wish to flip the bird once during simmering. The liquid should reduce by almost half.

Cover the pot (with foil, if the turkey is sticking out) and transfer to the refrigerator overnight.

The next day, remove the carcass from the stock. Pick off any remaining meat and set it aside in a bowl to be added to the soup. Discard the carcass.

If the stock is very gelatinous, place it on the heat over medium-high just until the gelatin melts, and the stock returns to a liquid. Turn off the heat and strain through a fine-mesh sieve.

Give the pot a quick rinse and wipe it out. Return it to the stovetop.

Cook the soup aromatics:

Heat 1 tablespoon olive oil over medium heat. Add the onion, carrots, celery, and fennel. Season with salt and pepper. Cook for 8-10 minutes.

Add the garlic and cook for 1 minute until fragrant. Add the bundle of thyme.

Simmer the soup:

Pour in the prepared stock and the chopped turkey. Add salt, pepper, and crushed red pepper. Bring to a boil. Reduce heat and simmer for 30 minutes. Remove and discard the thyme.

Finish the soup:

Return the soup to a boil. Add the pastina and cook for 3-4 minutes. Taste and add salt and pepper.

Finish the soup by adding parsley, fennel fronds, and lemon juice.

To serve:

Ladle the soup into bowls and serve with lemon wedges and minced parsley on the side. Enjoy!"""


In [27]:
def extract_ingredients(comment):
    """Return the ingredient lines between the **INGREDIENTS** and **INSTRUCTIONS** headers, or None."""
    # Capture everything after **INGREDIENTS** up to **INSTRUCTIONS** (or the end of the comment)
    ingredients_pattern = re.compile(r'\*\*INGREDIENTS\*\*([\s\S]*?)(?:\*\*INSTRUCTIONS\*\*|$)')

    # Find the ingredients section in the comment using the pattern
    matches = ingredients_pattern.search(comment)

    # No **INGREDIENTS** header: nothing to extract
    if not matches:
        return None

    # Drop the bullet and surrounding whitespace from each line, then discard empty lines
    lines = (line.strip().lstrip('•').strip() for line in matches.group(1).splitlines())
    return [line for line in lines if line]

# Test the function with the provided comments
ingredients_test_comment_1 = extract_ingredients(test_comment_1)
ingredients_test_comment_2 = extract_ingredients(test_comment_2)

# Print the results
print("Ingredients from Test Comment 1:")
print(ingredients_test_comment_1)

print("\nIngredients from Test Comment 2:")
print(ingredients_test_comment_2)

Ingredients from Test Comment 1:
['500 grams Pork Ribs', '500 grams Young Potatoes cubed', '200 grams Sorrel fresh', '1 Onion', '1 Carrot', '5 Eggs 4 hardboiled', '1 tbsp Sour Cream or Smetana if you can find it']

Ingredients from Test Comment 2:
None
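
The second test comment has no `**INGREDIENTS**` header, so a header-based pattern returns `None` for it. One possible fallback, sketched here as an assumption rather than as part of the pipeline, is to keep lines that start with a quantity. The names `QUANTITY_LINE` and `quantity_lines` are hypothetical, and the heuristic misses unquantified items such as "Salt and pepper".

```python
import re

# Hypothetical pattern: optional bullet, then a number, range, or fraction,
# followed by whitespace. "1." (a numbered step) does not match.
QUANTITY_LINE = re.compile(r"^\s*(?:[•*-]\s*)?(?:\d+(?:[-/.]\d+)?|[¼½¾])\s+\S")


def quantity_lines(comment: str) -> list[str]:
    """Return the stripped lines that appear to start with a quantity."""
    return [line.strip() for line in comment.splitlines() if QUANTITY_LINE.match(line)]


sample = """Soup:

1 tablespoon extra virgin olive oil

6-7 cups prepared stock from above or use chicken broth

¾ cup pastina or ditalini

Salt and pepper

1. Heat the oil over medium heat."""

print(quantity_lines(sample))
```

On this sample the function keeps the three quantified lines and skips the section label, the unquantified seasoning, and the numbered instruction.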


In [28]:
# ingredients_list is local to extract_ingredients, so print the returned list instead
pprint(ingredients_test_comment_1)

In [None]:
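# Unit words that separate the quantity from the ingredient name
separators = ['grams', 'tbsp']

def extract_ingredient_names(ingredient):
    """Return the text after the first unit word; items without a listed unit are returned unchanged."""
    for sep in separators:
        if sep in ingredient:
            return ingredient.split(sep, 1)[1].strip()

    return ingredient

# Use the list extracted from the first test comment
ingredient_names = [extract_ingredient_names(ingredient) for ingredient in ingredients_test_comment_1]

ingredient_names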
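
Splitting on a fixed list of unit words leaves entries without a listed unit, such as `1 Onion`, unchanged, quantity included. A possible alternative is to strip a leading quantity and an optional unit with one regular expression. This is a sketch: the `UNITS` tuple, `QUANTITY_PREFIX`, and `strip_quantity` are illustrative names, and the unit list is far from exhaustive.

```python
import re

UNITS = ("grams", "gram", "tbsp", "tsp", "cups", "cup")  # assumed, not exhaustive

# Leading quantity (digits, fractions, ranges), then an optional known unit.
QUANTITY_PREFIX = re.compile(r"^\s*[\d¼½¾./-]+\s*(?:(?:" + "|".join(UNITS) + r")\b\s*)?")


def strip_quantity(ingredient: str) -> str:
    """Remove a leading quantity and unit, leaving the ingredient name."""
    return QUANTITY_PREFIX.sub("", ingredient).strip()


for item in ["500 grams Pork Ribs", "1 Onion", "1 tbsp Sour Cream", "Salt"]:
    print(strip_quantity(item))
# Pork Ribs
# Onion
# Sour Cream
# Salt
```

The word boundary after the unit prevents a partial match inside a longer word, so `2 cupcakes` keeps `cupcakes` intact.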

In [None]:
# Extract the ingredients section (everything between **INGREDIENTS** and the next bold marker)
ingredients_section = re.search(r'\*\*INGREDIENTS\*\*(.*?)\*\*', test_comment_1, re.DOTALL)

# Default to an empty list so the name exists even when no section is found
ingredients_list = []
if ingredients_section:
    # Split the section on bullet points and drop empty entries
    ingredients_list = [ingredient.strip() for ingredient in re.split(r'\n\s*•\s*', ingredients_section.group(1)) if ingredient.strip()]
else:
    print("Ingredients section not found.")

In [None]:
ingredients_list