# Goal: visualize text data

As an Excel user, I always felt that textual and other non-numeric data were beyond my reach within the tiny rectangular cells of a spreadsheet. Today, we explore how text can be tokenized in Python to generate numerical insights from the text elements of 40,000 food recipes scraped from [Allrecipes](https://www.allrecipes.com).

# Overview of Setup

## Docker Environments

To replicate the environment used to perform this analysis:
1. Fork the GitHub [repository](https://github.com/andrewyewcy/recipe_classifier) and clone it onto a machine with Docker and Docker Compose installed.
2. Run the line below in a terminal from within the repository's directory.

In [None]:
# Run below line in terminal within the folder that contains the forked repo
docker-compose -f dev_setup.yaml up

Instructions on how to access Jupyter will be generated in the Terminal. Detailed installation documentation can be found within the README of the repository. 

## Import Packages and Define Functions

The [Natural Language Toolkit (NLTK)](https://www.nltk.org/) and the [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class in `scikit-learn` were used to process the recipe text and transform it into tokens, which can then be counted for numerical analysis.
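To make the idea concrete, the sketch below counts tokens across two made-up recipe titles using only the standard library. It illustrates the kind of document-by-token count table that `CountVectorizer` produces, not scikit-learn's actual implementation; the titles and the whitespace-based splitting are assumptions for illustration.

```python
from collections import Counter

# Two made-up recipe titles (illustrative only, not taken from the dataset)
titles = ["Corned Beef Roast", "Beef Stew"]

# Tokenize: lower-case each title and split on whitespace
tokenized = [title.lower().split() for title in titles]

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Count matrix: one row per title, one column per vocabulary token
counts = []
for tokens in tokenized:
    token_counts = Counter(tokens)  # missing tokens count as 0
    counts.append([token_counts[token] for token in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```

Each row is now a numeric representation of a title, which is what makes text usable for counting, plotting, and modelling.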

In [160]:
# Packages for general data processing
import numpy as np
import pandas as pd
import joblib       # Loading data
import time         # Measuring time
import ast          # Reading lists within pandas DataFrame

# Packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from notebook_functions.plot import plot_histogram # Custom function to plot distribution

# Packages for pre-processing text
import nltk                       # Natural Language Tool Kit
nltk.download('stopwords')        # For processing stop words (words too common to hold significant meaning)
from nltk.corpus import stopwords # Import above downloaded stopwords
import re                         # Regular Expression
import string                     # For identifying punctuation

# Packages for processing text
from sklearn.feature_extraction.text import CountVectorizer

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Import Data From AWS S3

The data was stored in an Amazon Simple Storage Service (S3) bucket, a cloud storage service that integrates well with other AWS services such as EC2 and EMR.

The data contains the various numerical and textual elements of 40,000 recipes web-scraped from Allrecipes between February and March 2023.

### Insert image of data here

In [128]:
df = joblib.load(
    "../data/raw_data_df.pkl"
)

In [129]:
print(f"There are {df.shape[0]} recipes with {df.shape[1]} columns.")

There are 40001 recipes with 18 columns.


The main columns of interest are the recipe labels (`label`), titles (`title`), average ratings (`rating_average`, 1.0-5.0), and number of ratings (`rating_count`).

In [132]:
columns_of_interest = [
    "recipe_url",
    "title",
    "label",
    "rating_average",
    "rating_count"
]

# Filter for columns of interest 
df = df.loc[:,columns_of_interest]

# Visually examine the filtered data
df.head(1).T

Unnamed: 0,0
recipe_url,https://www.allrecipes.com/recipe/83646/corned...
title,Corned Beef Roast
label,"['Recipes', 'Main Dishes', 'Beef', 'Corned Bee..."
rating_average,4.4
rating_count,68


In [141]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40001 entries, 0 to 40000
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   recipe_url      40001 non-null  object 
 1   title           39994 non-null  object 
 2   label           39994 non-null  object 
 3   rating_average  36420 non-null  float64
 4   rating_count    36420 non-null  object 
dtypes: float64(1), object(4)
memory usage: 1.5+ MB


| Column | Type | Description |
|---|---|---|
| `recipe_url` | hyperlink | The uniform resource locator (URL) that leads to the recipe hosted on Allrecipes.com |
| `title` | text | The title of each recipe |
| `label` | text | The labels (tags) for each recipe. Used by Allrecipes.com as a method of organizing recipes |
| `rating_average` | float | The rounded average as found on each recipe on Allrecipes.com |
| `rating_count` | integer | The number of ratings for each recipe |

Issues with the loaded data:
1. The `rating_count` column was found to be of `object` type instead of `int`.
2. Null values were observed for `title`, `label`, `rating_average`, and `rating_count`.

# Data Cleaning

## Null `titles`

In [145]:
# Examine null titles
cond = df['title'].isnull()

print(f"Number of null titles: {cond.sum()}, {np.round(cond.sum()/df.shape[0]*100,2)}% of recipes")

# Visually examine null titles
df.loc[cond]

Number of null titles: 7, 0.02% of recipes


Unnamed: 0,recipe_url,title,label,rating_average,rating_count
22785,https://www.allrecipes.com/cook/thedailygourme...,,,,
23968,https://www.allrecipes.com/recipe/14759/pork-d...,,,,
26010,https://www.allrecipes.com/recipe/218445/lenge...,,,,
26389,https://www.allrecipes.com/recipe/cajun-spice-...,,,,
26848,https://www.allrecipes.com/recipe/herman-sourd...,,,,
27388,https://www.allrecipes.com/recipe/mustard-pork...,,,,
33171,https://www.allrecipes.com/recipe/biga/detail....,,,,


As it was not well understood how closely each recipe adheres to a standard structure, it was assumed that these recipes may have been published with some error, and they were thus excluded from analysis.

In [None]:
# Drop rows with null titles; copy so later column assignments modify an independent DataFrame
df = df.loc[~cond].copy()

In [147]:
# Examine null values again
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 39994 entries, 0 to 40000
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   recipe_url      39994 non-null  object 
 1   title           39994 non-null  object 
 2   label           39994 non-null  object 
 3   rating_average  36420 non-null  float64
 4   rating_count    36420 non-null  object 
dtypes: float64(1), object(4)
memory usage: 1.8+ MB


## Null `rating_count`

In [152]:
# Try to convert the `rating_count` column to integer
try:
    df['rating_count'].astype('int')
except ValueError as e:
    print(f"{str(e)}")

invalid literal for int() with base 10: '2,630'


It would appear that thousands are separated with a comma, leading to the error in type conversion.

In [154]:
# Remove comma from `rating_count`, then fill NaN with 0 for conversion to integer
try:
    df['rating_count'] = df['rating_count'].str.replace(",","").fillna(0).astype('int')
except ValueError as e:
    print(f"{str(e)}")

In [155]:
# Examine the converted rating_count
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 39994 entries, 0 to 40000
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   recipe_url      39994 non-null  object 
 1   title           39994 non-null  object 
 2   label           39994 non-null  object 
 3   rating_average  36420 non-null  float64
 4   rating_count    39994 non-null  int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 1.8+ MB


Note that recipes with previously null rating counts are now filled with 0, which makes sense because a recipe with no ratings has a rating count of 0.

The significance of a recipe's rating depends on the number of ratings behind it. Thus, to ensure some standard of quality, recipes with fewer than 2 ratings were excluded from analysis. The threshold of 2 was chosen on the assumption that a second, independent rating raises the validity of the recipe rating.

Before removing the recipes with fewer than 2 ratings, the distributions were examined visually to understand what changes, if any, the removal would cause.

In [161]:
for col in ["rating_average","rating_count"]:
    plot_histogram(df, col, col)

NameError: name 'pd' is not defined
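The exclusion rule described above (dropping recipes with fewer than 2 ratings) can be sketched on a toy DataFrame as follows. The values are made up, and the names `toy_df` and `MIN_RATINGS` are hypothetical; only the `rating_count` column name comes from the data above.

```python
import pandas as pd

# Toy data standing in for the cleaned recipes (values are made up)
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "rating_count": [0, 1, 2, 68],
})

MIN_RATINGS = 2  # threshold described in the text

# Keep only recipes with at least MIN_RATINGS ratings
filtered = toy_df.loc[toy_df["rating_count"] >= MIN_RATINGS].reset_index(drop=True)

print(f"Kept {filtered.shape[0]} of {toy_df.shape[0]} recipes.")
```

Keeping the threshold in a named constant makes it easy to check how sensitive the later analysis is to this choice.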

# Breaking Down Text Columns

In [33]:
df.loc[:,["title", "label"]].head()

Unnamed: 0,title,label
0,Corned Beef Roast,"['Recipes', 'Main Dishes', 'Beef', 'Corned Bee..."
1,Stout-Braised Lamb Shanks,"['Cuisine', 'European', 'UK and Ireland', 'Iri..."
2,Chicken Al Pastor,"['Mexican', 'Main Dishes', 'Tacos', 'Chicken']"
3,Mississippi Chicken,"['Recipes', 'Meat and Poultry', 'Chicken']"
4,Lasagna Flatbread,"['Recipes', 'Bread', 'Quick Bread Recipes']"


In [38]:
ast.literal_eval(df.loc[0,"label"])

['Recipes', 'Main Dishes', 'Beef', 'Corned Beef Recipes']

In [73]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk # Natural Language Tool Kit
nltk.download('stopwords')
from nltk.corpus import stopwords
import string

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
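def tokenizer(sentence):
    # Remove punctuation and set to lower case
    for punctuation_mark in string.punctuation:
        sentence = sentence.replace(punctuation_mark, "").lower()

    # Remove numerical digits in text
    sentence = sentence.translate(str.maketrans("", "", string.digits))

    # Return the cleaned sentence (splitting into tokens is handled further below)
    return sentence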


In [41]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [68]:
sentence = df.loc[0,"label"]

# Print sentence
sentence

"['Recipes', 'Main Dishes', 'Beef', 'Corned Beef Recipes']"

In [69]:
for punctuation_mark in string.punctuation:
    sentence = sentence.replace(punctuation_mark, "").lower()

In [70]:
sentence

'recipes main dishes beef corned beef recipes'

In [71]:
def my_tokenizer(sentence):
    """
    Custom tokenizer for preprocessing of text columns.
    
    Usage
    -----
    Used as the tokenizer hyperparameter for vectorizers such as CountVectorizer() and TfidfVectorizer()
    
    Input
    -----
    string, raw document.
    
    Output
    ------
    list of strings, cleaned & stemmed tokens
    """
    # remove punctuation and set to lower case
    for punctuation_mark in string.punctuation:
        sentence = sentence.replace(punctuation_mark,'').lower()
    
    # remove all numerical digits in the text
    # https://stackoverflow.com/questions/12851791/removing-numbers-from-string
    remove_digits = str.maketrans('','', string.digits)
    sentence = sentence.translate(remove_digits)

    # split sentence into words
    listofwords = sentence.split(' ')
    listofstemmed_words = []
    
    # define English stop words
    ENGLISH_STOP_WORDS = stopwords.words('english')
    # add in additional stop words that are specific to cooking
    ENGLISH_STOP_WORDS.extend(['degree','degrees','c','f', 'recipe', 'recipes', 'minute', 'minutes'])
    
    # define stemmer
    stemmer = nltk.stem.PorterStemmer()
    
    # remove stopwords and any tokens that are just empty strings
    for word in listofwords:
        if word not in ENGLISH_STOP_WORDS and word != '':
            # Stem words
            stemmed_word = stemmer.stem(word)
            listofstemmed_words.append(stemmed_word)

    return listofstemmed_words

In [79]:
sentence = df.loc[0,"label"]

# Print sentence
print(f"{sentence}")

print(f"{my_tokenizer(sentence)}")

['Recipes', 'Main Dishes', 'Beef', 'Corned Beef Recipes']
['main', 'dish', 'beef', 'corn', 'beef']


In [77]:
for punctuation_mark in string.punctuation:
    sentence = sentence.replace(punctuation_mark, "").lower()

print(f"{sentence}")

recipes main dishes beef corned beef recipes


In [91]:
import re

# Raw strings keep the regex escapes (\s) from being treated as string escapes
re.sub(r"([a-zA-Z]+)\s+([a-zA-Z]+)", r"\1_\2", sentence, count=0)

"['Recipes', 'Main_Dishes', 'Beef', 'Corned_Beef Recipes']"

In [None]:
import re

# `s` stands for any string of whitespace-separated numbers
re.sub(r"(\d+)\s+(\d+)", r"\1,\2", s)

In [89]:
re.sub(r"\s+", "_", sentence)

"['Recipes',_'Main_Dishes',_'Beef',_'Corned_Beef_Recipes']"

In [92]:
test_sentence = "123 234 235"

In [97]:
re.sub(r"\s", ",", test_sentence)

'123,234,235'

In [100]:
sentence = df.loc[0,"label"]
# Print sentence
print(f"{sentence}")

['Recipes', 'Main Dishes', 'Beef', 'Corned Beef Recipes']


In [118]:
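# Temporarily replace the ", " between labels with "+" to protect label boundaries
s1 = re.sub(r",\s", "+", sentence)
print(s1)

# Join the words within each multi-word label with underscores
s2 = re.sub(r"\s", "_", s1)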
print(s2)

s3 = s2.replace("+",", ")
print(s3)

['Recipes'+'Main Dishes'+'Beef'+'Corned Beef Recipes']
['Recipes'+'Main_Dishes'+'Beef'+'Corned_Beef_Recipes']
['Recipes', 'Main_Dishes', 'Beef', 'Corned_Beef_Recipes']


In [123]:
for punctuation_mark in [punctuation for punctuation in string.punctuation if punctuation != "_"]:
    s3 = s3.replace(punctuation_mark,"").lower()

s3

'recipes main_dishes beef corned_beef_recipes'

In [124]:
remove_digits = str.maketrans('','', string.digits)
s3 = s3.translate(remove_digits)

In [125]:
s3

'recipes main_dishes beef corned_beef_recipes'

In [126]:
# split sentence into words
listofwords = s3.split(' ')
listofstemmed_words = []

# define English stop words
ENGLISH_STOP_WORDS = stopwords.words('english')
# add in additional stop words that are specific to cooking
ENGLISH_STOP_WORDS.extend(['degree','degrees','c','f', 'recipe', 'recipes', 'minute', 'minutes'])

# define stemmer
stemmer = nltk.stem.PorterStemmer()

# remove stopwords and any tokens that are just empty strings
for word in listofwords:
    if word not in ENGLISH_STOP_WORDS and word != '':
        # Stem words
        stemmed_word = stemmer.stem(word)
        listofstemmed_words.append(stemmed_word)

In [127]:
listofstemmed_words

['main_dish', 'beef', 'corned_beef_recip']
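The exploratory steps above (joining multi-word labels with underscores, stripping punctuation and digits, lower-casing) can be condensed into one function. The sketch below is one possible arrangement, assuming every `label` value is a stringified Python list as in the sample rows; the name `label_tokenizer` is hypothetical, and stop-word removal and stemming are left out to keep the example free of NLTK downloads.

```python
import ast
import string

# Characters to strip from each label: all punctuation and digits
STRIP_TABLE = str.maketrans("", "", string.punctuation + string.digits)

def label_tokenizer(raw_label):
    """Turn a stringified list of labels into one underscore-joined token per label."""
    labels = ast.literal_eval(raw_label)  # "['A', 'B C']" -> ['A', 'B C']
    tokens = []
    for label in labels:
        cleaned = label.lower().translate(STRIP_TABLE)
        token = "_".join(cleaned.split())  # "main dishes" -> "main_dishes"
        if token:  # skip labels that are empty after cleaning
            tokens.append(token)
    return tokens

print(label_tokenizer("['Recipes', 'Main Dishes', 'Beef', 'Corned Beef Recipes']"))
```

Parsing the list with `ast.literal_eval` first avoids the temporary `+` placeholder, because each label is cleaned on its own before being joined.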