# Preprocessing and Tokenization

Rodrigo Becerra Carrillo

https://github.com/bcrodrigo

## Introduction

This notebook performs preprocessing and tokenization on the Amazon Fine Food Reviews dataset.

The dataset was sourced from [here](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews/data).

## Data Dictionary


| Column Name            | Description                                                               | Data Type |
| ---------------------- | ------------------------------------------------------------------------- | --------- |
| Id                     | Row ID                                                                    | int64     |
| ProductId              | Unique identifier for Product                                             | object    |
| UserId                 | Unique identifier for User                                                | object    |
| ProfileName            | Profile name of the user                                                  | object    |
| HelpfulnessNumerator   | Number of users who found the review helpful                              | int64     |
| HelpfulnessDenominator | Number of users who indicated whether they found the review helpful       | int64     |
| Score                  | Rating between 1 and 5                                                    | int64     |
| Time                   | Timestamp for the review                                                  | int64     |
| Summary                | Brief summary of the review                                               | object    |
| Text                   | Full review                                                               | object    |


An earlier EDA showed that there were no missing values and that `Score` has a class imbalance. From the table above, we'll use only `Text` as the feature and `Score` as the target variable.

## Import Custom Modules

In [9]:
import sys
sys.path

['/Users/rodrigo/anaconda3/envs/nlp_env/lib/python311.zip',
 '/Users/rodrigo/anaconda3/envs/nlp_env/lib/python3.11',
 '/Users/rodrigo/anaconda3/envs/nlp_env/lib/python3.11/lib-dynload',
 '',
 '/Users/rodrigo/anaconda3/envs/nlp_env/lib/python3.11/site-packages']

In [10]:
sys.path.append('..')

In [11]:
from src.preprocessing import preprocess_dataset

In [12]:
preprocess_dataset?

Signature: preprocess_dataset(csv_filename, rebalance=True)
Docstring:
Function to preprocess a reviews dataset in csv into a dataframe with score and text.

Parameters
----------
csv_filename : str
    Path to the csv file containing the data. Note the file is expected to be compressed using gzip.

rebalance : bool, optional
    Optional flag indicating whether to balance the number of reviews per score.

Returns
-------
tuple
    Pandas DataFrames (df_orig, df_rebalanced), each with two columns: text and review score.

    if rebalance is False
        df_orig : contains all records
        df_rebalanced : is an empty dataframe

    if rebalance is True
        df_orig : contains all records minus those used to rebalance the review score
        df_rebalanced : contains all records used to balance the number of reviews by score

    Note that in either case pd.concat([df_orig,df_

## Import Libraries and Load DataFrame

In [13]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

In [14]:
file_path = '../data/Reviews.csv.gz'

In [15]:
dforig, dfnew = preprocess_dataset(file_path,rebalance=True)

In [16]:
dforig.shape

(440534, 2)

In [17]:
dfnew.shape

(127920, 2)

In [18]:
dfnew.shape[0] + dforig.shape[0]

568454

In [19]:
dfnew['Score'].value_counts()

Score
0    42640
1    42640
2    42640
Name: count, dtype: int64

In [20]:
dforig['Score'].value_counts()

Score
2    401137
0     39397
Name: count, dtype: int64
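The counts above are consistent with downsampling each class to the size of the smallest one (42,640 reviews) and setting the leftover rows aside. The internals of `preprocess_dataset` are not shown in this notebook, so the following is only a sketch of that idea on a toy frame; `split_balanced`, its `seed` argument, and the toy data are assumptions for illustration, not the code in `src.preprocessing`.

```python
import pandas as pd

def split_balanced(df, target="Score", seed=0):
    """Hypothetical helper: downsample every class to the smallest class size.

    Returns (remaining, balanced), mirroring the (df_orig, df_rebalanced)
    order described in the docstring above.
    """
    n_min = df[target].value_counts().min()
    # draw n_min rows from each class; original index labels are preserved
    balanced = df.groupby(target).sample(n=n_min, random_state=seed)
    # everything that was not drawn stays in the remaining frame
    remaining = df.drop(balanced.index)
    return remaining, balanced

toy = pd.DataFrame({
    "Score": [0, 0, 1, 2, 2, 2, 2],
    "Text": ["a", "b", "c", "d", "e", "f", "g"],
})
remaining, balanced = split_balanced(toy)
print(balanced["Score"].value_counts().sort_index())
print(len(remaining) + len(balanced) == len(toy))
```

As in the real output, the smallest class (here `Score == 1`) is used up entirely by the balanced frame and does not appear in the remaining one.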

We'll now calculate the average review length for each score.

In [None]:
df[['Score','Text']].head()

In [None]:
df['review_n_char'] = df['Text'].str.len()

In [None]:
df[['Score','review_n_char']]

In [None]:
agg_df = df[['Score','review_n_char']].groupby('Score').aggregate('mean')

In [None]:
agg_df.plot(kind = 'bar')
plt.title('Average number of characters')
plt.xlabel('Review Score')
plt.ylabel('Review Length (# of characters)')
plt.grid()
ticks = plt.xticks(rotation = 0)

From the graph above, the average number of characters per review does not differ significantly across scores.

The reviews with the highest score (5) seem to have the fewest characters.

# Preprocessing

In this section we'll tokenize the contents of `dfnew`.

Our first approach is the Bag-of-Words model with Scikit-Learn. We need to:

- Instantiate `CountVectorizer`
- Define a tokenizer that removes punctuation and stop words, and performs either stemming or lemmatization
- Use spaCy to define a custom tokenizer

In [42]:
# get the first 5 rows of the Text column
dftest = dfnew['Text'].head()

In [27]:
# get the first 5 reviews of the dataset
first5_rev = dftest.values.tolist()

In [36]:
import spacy

# load the model once; reloading it on every call is slow
nlp = spacy.load("en_core_web_sm")

def custom_spacy_tokenizer(sentence):
    """Return spaCy Token objects, skipping stop words and punctuation."""
    document = nlp(sentence)

    # keep tokens that are neither punctuation nor stop words
    token_list = [token for token in document if not token.is_punct and not token.is_stop]

    return token_list

In [43]:
import spacy
nlp = spacy.load("en_core_web_sm")

def custom_spacy_tokenizer_lemma(sentence, nlp):
    """Return lemmas as strings, skipping stop words and punctuation."""
    document = nlp(sentence)

    # keep the lemmas of tokens that are neither punctuation nor stop words
    token_list = [token.lemma_ for token in document if not token.is_punct and not token.is_stop]

    return token_list

In [44]:
# tokenizer
custom_spacy_tokenizer(first5_rev[0])

[drink,
 lot,
 sugar,
 free,
 beverages,
 TERRIBLE,
 brew,
 cup,
 smells,
 like,
 melted,
 butter,
 taste,
 good,
 waste,
 money]

In [45]:
# tokenizer with lemmatization
custom_spacy_tokenizer_lemma(first5_rev[0],nlp)

['drink',
 'lot',
 'sugar',
 'free',
 'beverage',
 'terrible',
 'brew',
 'cup',
 'smell',
 'like',
 'melt',
 'butter',
 'taste',
 'good',
 'waste',
 'money']

In [46]:
# show the full review
print(first5_rev[1])

I have been giving my dog this treat for a long time,I found it here on Amazon and it is much cheaper! Now I learned that ALL chicken treats for dogs (and also cats I believe) that are MADE IN CHINA are being investigated  by the FDA because some dogs have died after consuming them. These treats and all Dogswell treats are made in China,I researched all over the web about this matter and bottom line is: why take the risk? A few sites say it is OK, most say to be cautious and others say don't buy.I threw out all the ones I bought and got new ones made in USA.I wish Amazon gave us the choice of "made in USA", for now I recommend everyone that has a pet to read the labels of the treats and food. Sorry this product, I don't recommend.


In [47]:
# show the tokenized review
print(custom_spacy_tokenizer(first5_rev[1]))

[giving, dog, treat, long, time, found, Amazon, cheaper, learned, chicken, treats, dogs, cats, believe, CHINA, investigated,  , FDA, dogs, died, consuming, treats, Dogswell, treats, China, researched, web, matter, line, risk, sites, OK, cautious, buy, threw, ones, bought, got, new, ones, USA.I, wish, Amazon, gave, choice, USA, recommend, pet, read, labels, treats, food, Sorry, product, recommend]


In [48]:
# show the tokenized review with lemmatization
print(custom_spacy_tokenizer_lemma(first5_rev[1],nlp))

['give', 'dog', 'treat', 'long', 'time', 'find', 'Amazon', 'cheap', 'learn', 'chicken', 'treat', 'dog', 'cat', 'believe', 'CHINA', 'investigate', ' ', 'FDA', 'dog', 'die', 'consume', 'treat', 'dogswell', 'treat', 'China', 'research', 'web', 'matter', 'line', 'risk', 'site', 'ok', 'cautious', 'buy', 'throw', 'one', 'buy', 'get', 'new', 'one', 'USA.I', 'wish', 'Amazon', 'give', 'choice', 'USA', 'recommend', 'pet', 'read', 'label', 'treat', 'food', 'sorry', 'product', 'recommend']


## Comments

- Uppercase words should be converted to lowercase; 'CHINA' and 'China' are currently distinct tokens
- Whitespace tokens need handling: in the output of `custom_spacy_tokenizer_lemma(first5_rev[1],nlp)` above, a `' '` token appears between 'investigate' and 'FDA'
- Test the tokenizer with CountVectorizer