# Preprocessing and Tokenization

Rodrigo Becerra Carrillo

https://github.com/bcrodrigo

## Introduction

This notebook performs preprocessing and tokenization on a dataset of Amazon food reviews.

The dataset was sourced from [here](https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews/data).

## Data Dictionary


| Column Name            | Description                                                               | Data Type |
| ---------------------- | ------------------------------------------------------------------------- | --------- |
| Id                     | Row ID                                                                    | int64     |
| ProductId              | Unique identifier for Product                                             | object    |
| UserId                 | Unique identifier for User                                                | object    |
| ProfileName            | Profile name of the user                                                  | object    |
| HelpfulnessNumerator   | Number of users who found the review helpful                              | int64     |
| HelpfulnessDenominator | Number of users who indicated whether they found the review helpful or not | int64     |
| Score                  | Rating between 1 and 5                                                    | int64     |
| Time                   | Timestamp for the review                                                  | int64     |
| Summary                | Brief summary of the review                                               | object    |
| Text                   | Full review                                                               | object    |


Previously, we performed EDA and found no missing values, as well as a class imbalance in `Score`. From the table above, we'll use only `Text` as the feature and `Score` as the target variable.

## Import Custom Modules

In [1]:
import sys
sys.path

['/Users/rodrigo/anaconda3/envs/pytorch_env/lib/python312.zip',
 '/Users/rodrigo/anaconda3/envs/pytorch_env/lib/python3.12',
 '/Users/rodrigo/anaconda3/envs/pytorch_env/lib/python3.12/lib-dynload',
 '',
 '/Users/rodrigo/anaconda3/envs/pytorch_env/lib/python3.12/site-packages',
 '/Users/rodrigo/Documents/Github/medium_articles/packages_and_modules/example_package']

In [2]:
sys.path.append('..')

In [3]:
from src.preprocessing import preprocess_dataset

In [4]:
preprocess_dataset?

Signature: preprocess_dataset(csv_filename, rebalance=True)
Docstring:
Function to preprocess a reviews dataset in csv into a dataframe with score and text.

Parameters
----------
csv_filename : str
    Path to the csv file containing the data. Note the file is expected to be compressed using gzip.

rebalance : bool, optional
    Optional flag indicating whether to balance the number of reviews by score.

Returns
-------
tuple
    Pandas DataFrames (df_orig, df_rebalanced), each with two columns: text and review score.

    if rebalance is False
        df_orig : contains all records
        df_rebalanced : is an empty dataframe

    if rebalance is True
        df_orig : contains all records minus those used to rebalance the review score
        df_rebalanced : contains all records used to balance the number of reviews by score

    Note that in either case pd.concat([df_orig,df_
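The docstring describes a split into a balanced subset and a remainder, but the implementation itself isn't shown here. One way such a split could work is undersampling every class to the size of the smallest one. The sketch below is a minimal illustration under that assumption; `split_balanced`, its parameters, and the toy data are hypothetical and not taken from `src.preprocessing`.

```python
import pandas as pd

def split_balanced(df, label_col="Score", seed=0):
    """Hypothetical helper: undersample each class to the size of the smallest one."""
    # size of the least frequent class
    n_min = int(df[label_col].value_counts().min())
    # draw n_min rows from every class
    df_rebalanced = df.groupby(label_col).sample(n=n_min, random_state=seed)
    # everything that was not drawn stays in the remainder
    df_rest = df.drop(index=df_rebalanced.index)
    return df_rest, df_rebalanced

# toy data, made up for illustration: class 1 is the smallest
toy = pd.DataFrame({
    "Text": [f"review {i}" for i in range(9)],
    "Score": [0, 0, 1, 2, 2, 2, 2, 2, 2],
})

rest, balanced = split_balanced(toy)
print(balanced["Score"].value_counts().sort_index())
print(rest["Score"].value_counts().sort_index())
```

With this approach the smallest class ends up entirely in the balanced subset, so it no longer appears in the remainder.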

## Import Libraries and Load DataFrame

In [5]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

In [6]:
file_path = '../data/Reviews.csv.gz'

In [7]:
dforig, dfnew = preprocess_dataset(file_path,rebalance=True)

In [8]:
dforig.shape

(440534, 2)

In [9]:
dfnew.shape

(127920, 2)

In [10]:
dfnew.shape[0] + dforig.shape[0]

568454

In [11]:
dfnew['Score'].value_counts()

Score
0    42640
1    42640
2    42640
Name: count, dtype: int64

In [12]:
dforig['Score'].value_counts()

Score
2    401137
0     39397
Name: count, dtype: int64

We'll now calculate the average review length for each score.

In [None]:
df[['Score','Text']].head()

In [None]:
df['review_n_char'] = df['Text'].str.len()

In [None]:
df[['Score','review_n_char']]

In [None]:
agg_df = df[['Score','review_n_char']].groupby('Score').aggregate('mean')

In [None]:
agg_df.plot(kind = 'bar')
plt.title('Average number of characters')
plt.xlabel('Review Score')
plt.ylabel('Review Length (# of characters)')
plt.grid()
ticks = plt.xticks(rotation = 0)

From the graph above we see no large difference in the average number of characters across review scores.

The reviews with the highest score (5) seem to have the fewest characters.

## Preprocessing

In this section we'll tokenize the contents of `dfnew`.

Our first approach uses the Bag-of-Words model with scikit-learn.
We need to:

- Create an instance of `CountVectorizer`
- Define a tokenizer that removes punctuation, stop words, and performs either stemming or lemmatization

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [1]:
import nltk

In [5]:
import string

# download the nltk stopwords
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# a set makes the membership check below fast
ENGLISH_STOP_WORDS = set(stopwords.words('english'))

# assumption: the original cell never defines `stemmer`; a Porter stemmer is used here
stemmer = PorterStemmer()

def custom_tokenizer(sentence):
    # set to lower case and remove punctuation
    sentence = sentence.lower()
    for punctuation_mark in string.punctuation:
        sentence = sentence.replace(punctuation_mark, '')

    # split sentence into words
    listofwords = sentence.split(' ')
    listofstemmed_words = []

    # remove stopwords and any tokens that are just empty strings
    for word in listofwords:
        if (word not in ENGLISH_STOP_WORDS) and (word != ''):
            # stem words
            stemmed_word = stemmer.stem(word)
            listofstemmed_words.append(stemmed_word)

    return listofstemmed_words

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rodrigo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
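The tokenizer calls `stemmer.stem(word)` on every remaining token. As a minimal sketch of what that step does to individual words, the example below assumes NLTK's `PorterStemmer`; the notebook does not show which stemmer it actually uses, and the word list is made up for illustration.

```python
from nltk.stem import PorterStemmer

# Assumption: a Porter stemmer; the notebook does not show which stemmer it uses
stemmer = PorterStemmer()

# illustrative words, not taken from the dataset
words = ["reviews", "running", "flavors"]
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

Stemming maps inflected forms to a common root, which shrinks the vocabulary the vectorizer has to learn. The roots are not always dictionary words, which is the main trade-off against lemmatization.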
