# Data Cleaning

The following steps were performed to clean the reviews:

1. Convert text to lower case
2. Remove HTML tags and URLs from reviews using regex substitutions
3. Expand contractions using the [`contractions`](https://pypi.org/project/contractions/) library
4. Remove non-alphabetical characters using regex substitutions
5. Substitute multiple white spaces with one white space

Step 3 was performed before step 4 because removing non-alphabetical characters first would strip the apostrophe from contractions such as "won't", leaving fragments that can no longer be expanded.
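The effect of the ordering can be seen in a small sketch. It uses only the standard library, and the three-entry `CONTRACTIONS` map is a hypothetical stand-in for the `contractions` library, which covers far more cases.

```python
import re

# Hypothetical, tiny contraction map for illustration only; the notebook
# relies on the `contractions` library instead.
CONTRACTIONS = {"won't": "will not", "i've": "i have", "it's": "it is"}
NON_ALPHA = re.compile(r"[^a-z]")
SPACES = re.compile(r"\s+")


def expand(text):
    """Replace each known contraction with its expanded form."""
    return " ".join(CONTRACTIONS.get(word, word) for word in text.split())


def strip_non_alpha(text):
    """Replace non-alphabetic characters with spaces, then collapse whitespace."""
    return SPACES.sub(" ", NON_ALPHA.sub(" ", text)).strip()


review = "it won't smudge"

expand_first = strip_non_alpha(expand(review))  # order used in this report
strip_first = expand(strip_non_alpha(review))   # reversed order

print(expand_first)  # it will not smudge
print(strip_first)   # it won t smudge
```

With the reversed order, the apostrophe is gone before the lookup runs, so "won" and "t" survive as two separate tokens.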

* Average length of reviews (in characters) before cleaning: 271.9
* Average length of reviews (in characters) after cleaning: 262.4

# Preprocessing

Preprocessing of the text involves:

1. Removing stop words
2. Lemmatizing words
    * Tokenize words
    * Part of speech tagging
    * Lemmatizing using the POS tags
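The POS-tagging step produces Penn Treebank tags, whereas the WordNet lemmatizer accepts only four part-of-speech codes. The sketch below shows one way to bridge the two, mirroring the `get_wordnet_pos` helper in the notebook. The `(word, tag)` pairs are hand-written stand-ins for tagger output, not real NLTK results.

```python
# The first letter of a Treebank tag selects the WordNet class;
# anything that is not a noun, verb, adjective or adverb falls back to noun.
TREEBANK_TO_WORDNET = {"n": "n", "v": "v", "j": "a", "r": "r"}


def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag such as 'VBD' to a WordNet POS code."""
    return TREEBANK_TO_WORDNET.get(treebank_tag[0].lower(), "n")


# Hand-written (word, tag) pairs standing in for the tagger's output.
tagged = [("creams", "NNS"), ("was", "VBD"), ("greasy", "JJ"),
          ("really", "RB"), ("the", "DT")]

mapped = [(word, to_wordnet_pos(tag)) for word, tag in tagged]
print(mapped)
# [('creams', 'n'), ('was', 'v'), ('greasy', 'a'), ('really', 'r'), ('the', 'n')]
```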

All the models performed better when stop words were not removed. This could be because keeping them increases the number of features extracted from the review body. Results from both settings are included in the Results section of the report.

Because the models trained on text **without stop word removal** performed better, that configuration was submitted for final evaluation.
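To make the feature-count argument concrete, here is a small standard-library sketch that counts the distinct unigram and bigram features of one review with and without stop words. The `STOP_WORDS` set is a hypothetical subset chosen for the example, not the NLTK list, and the review text is made up.

```python
def ngrams(tokens, max_n=2):
    """Return the set of all 1..max_n word n-grams in a token list."""
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


# Hypothetical subset of stop words, for illustration only.
STOP_WORDS = {"this", "is", "not", "a", "the", "for", "my"}

review = "this is not a good cream for my skin".split()
kept = [w for w in review if w not in STOP_WORDS]

with_stop = ngrams(review)   # 9 unigrams + 8 bigrams
without_stop = ngrams(kept)  # 3 unigrams + 2 bigrams

print(len(with_stop), len(without_stop))  # 17 5
```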

* Average length of reviews (in characters) before preprocessing: 262.4
* Average length of reviews (in characters) after preprocessing: 251.5

# Feature Extraction

We used `sklearn.feature_extraction.text.TfidfVectorizer` to extract features from the reviews.

* Grid search with cross-validation was used to select the TF-IDF parameters.
* The grid search showed that including n-grams yields better results across all models, so we set the `ngram_range` parameter to `(1, 2)`. This extracts both one-word and two-word features, which increases the number of features.
* The grid search also showed that keeping all the features extracted by TF-IDF gave the highest average precision.
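As a rough illustration of what `ngram_range=(1, 2)` and the TF-IDF weighting produce, the sketch below reimplements the idea in plain Python. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, which is what scikit-learn documents for its default settings. The whitespace tokeniser and the three toy documents are simplifications, so this is not a drop-in replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter


def ngrams(tokens, ngram_range=(1, 2)):
    """List every n-gram whose length lies within ngram_range (inclusive)."""
    lo, hi = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf(docs, ngram_range=(1, 2)):
    """Return one {feature: weight} dict per document, L2-normalised."""
    counts = [Counter(ngrams(doc.split(), ngram_range)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each feature.
    df = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for c in counts:
        weights = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


docs = ["great cream", "great price", "bad cream smell"]  # toy corpus
rows = tfidf(docs)

print(sorted(rows[0]))  # ['cream', 'great', 'great cream']
```

In this toy corpus the bigram "great cream" occurs in a single document, so it receives a larger weight than the unigram "great", which occurs in two.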

# Perceptron

**Hyperparameters**

* `eta0`: 0.001
* `penalty`: l2
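`eta0` is the constant step size applied on every weight update. The sketch below is a from-scratch binary perceptron that shows where `eta0` enters. It is not scikit-learn's implementation, it leaves out the `penalty` term, and the two-feature data set is made up for the example.

```python
def train_perceptron(samples, labels, eta0=0.001, epochs=10):
    """Binary perceptron; labels are +1 or -1 and weights change only on mistakes."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified, or exactly on the boundary
                weights = [w + eta0 * y * xi for w, xi in zip(weights, x)]
                bias += eta0 * y
    return weights, bias


def predict(weights, bias, x):
    """Return +1 if the sample falls on the positive side, else -1."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


# Made-up features: [contains "love", contains "awful"]
samples = [[1, 0], [0, 1], [1, 0], [0, 1]]
labels = [1, -1, 1, -1]

weights, bias = train_perceptron(samples, labels, eta0=0.001)
print(predict(weights, bias, [1, 0]), predict(weights, bias, [0, 1]))  # 1 -1
```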

# Support Vector Machine

**Hyperparameters**

* `C`: 1.0

# Logistic Regression

**Hyperparameters**: Default parameters

# Multinomial Naive Bayes

**Hyperparameters**: Default parameters


# Results

## With Stopwords removal

![](with_stopwords.png)

## Without Stopwords removal
![](without_stopwords_removal.png)

* As the results above show, all the models perform better when stop words are not removed.
* Logistic regression performs best on this dataset: without stop word removal it reaches a weighted-average F1 score of about 0.742, ahead of the SVM (0.733), Multinomial Naive Bayes (0.726) and the Perceptron (0.698).

# References

* [Treebank Part of Speech Tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
import re
import pandas as pd
import numpy as np
import contractions
from tqdm import tqdm
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from bs4 import BeautifulSoup

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /home/aditya/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/aditya/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/aditya/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/aditya/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [3]:
! pip install bs4 # in case you don't have it installed

# Dataset: https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Beauty_v1_00.tsv.gz



## Read Data

In [4]:
data_og = pd.read_csv('amazon_reviews_us_Beauty_v1_00.tsv',
                      delimiter='\t',
                      usecols=['star_rating', 'review_body'],
                      on_bad_lines='skip',)

In [5]:
data_og.head()

Unnamed: 0,star_rating,review_body
0,5,"Love this, excellent sun block!!"
1,5,The great thing about this cream is that it do...
2,5,"Great Product, I'm 65 years old and this is al..."
3,5,I use them as shower caps & conditioning caps....
4,5,This is my go-to daily sunblock. It leaves no ...


## Keep Reviews and Ratings


In [6]:
# Performed when reading the file

## We form three classes and select 20000 reviews randomly from each class.

In [7]:
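def bin_column(column, bins, labels):
    """
    Bin a numeric column into labelled, right-closed intervals.

    :param column: pd.Series of numeric values
    :param bins: list of upper bin edges in increasing order
    :param labels: list of category names, one per bin
    :return: pd.Series of categorical labels
    """

    # Prepend -inf as the lowest edge without mutating the caller's list
    bins = [-float('inf'), *bins]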

    # Use pd.IntervalIndex to create bins to split the data
    bins = pd.IntervalIndex.from_breaks(bins)

    print(bins)

    x = pd.cut(column, bins=bins, include_lowest=True)
    x = x.cat.rename_categories(labels)

    return x


def prepare_data(dataframe):
    # Convert ratings to numeric
    # Ignore ratings that are not numerals
    dataframe['star_rating_numeric'] = pd.to_numeric(dataframe.star_rating, errors='coerce')

    # Drop NaN
    dataframe.dropna(inplace=True)

    # Bin ratings into 3 classes
    # 1 and 2   class_1
    # 3         class_2
    # 4 and 5   class_3

    dataframe['target'] = bin_column(dataframe.star_rating_numeric,
                                     [2, 3, 5],
                                     labels=['class_1', 'class_2', 'class_3'])

    # In the interest of computational simplicity,
    # keep only 20000 instances of each class

    tiny_df = pd.DataFrame(
        columns=['star_rating', 'review_body', 'star_rating_numeric'])

    for cls in dataframe.target.unique():
        tiny_df = pd.concat([
            tiny_df,
            dataframe[dataframe.target == cls].sample(20000, random_state=42)
        ])

    return tiny_df
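The binning rule in `prepare_data` uses right-closed intervals, so ratings 1 and 2 map to `class_1`, 3 to `class_2`, and 4 and 5 to `class_3`. A plain-Python sketch of the same rule, using `bisect` instead of `pd.cut`, is shown below. The helper name `rating_to_class` is hypothetical and does not appear in the notebook.

```python
from bisect import bisect_left

UPPER_EDGES = [2, 3, 5]  # right-closed upper edges, as passed to bin_column
LABELS = ["class_1", "class_2", "class_3"]


def rating_to_class(rating):
    """Map a star rating in [1, 5] to its class label."""
    # bisect_left returns the index of the first edge >= rating,
    # which is exactly the right-closed interval the rating falls in.
    return LABELS[bisect_left(UPPER_EDGES, rating)]


classes = [rating_to_class(r) for r in (1, 2, 3, 4, 5)]
print(classes)
# ['class_1', 'class_1', 'class_2', 'class_3', 'class_3']
```

A rating above 5 would raise an `IndexError` in this sketch.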

In [8]:
tiny_df = prepare_data(data_og.copy())
tiny_df.head(10)

IntervalIndex([(-inf, 2.0], (2.0, 3.0], (3.0, 5.0]], dtype='interval[float64, right]')


Unnamed: 0,star_rating,review_body,star_rating_numeric,target
1797264,4.0,Around 25% less than buying at the store + fre...,4.0,class_3
2951025,5.0,"Love it! Clean, fresh, seems greasy at first, ...",5.0,class_3
2821987,5.0,This color is beautiful and amazing for summer...,5.0,class_3
2666293,5.0,I've been on the hunt for the perfect lotion. ...,5.0,class_3
2776503,5.0,my son has eczema since he was born and his sk...,5.0,class_3
3551303,5.0,This is the classiest polish I have ever seen....,5.0,class_3
1459989,4.0,it's not very pigmented but that was why I bou...,4.0,class_3
725454,5.0,"Yay, my new favorite cleanser! I've been using...",5.0,class_3
2080912,5.0,"I ordered this on 12 Nov 2014, and received it...",5.0,class_3
2282098,5.0,I have purchased so many new styling tools tr...,5.0,class_3


# Data Cleaning

In [9]:
# Raw strings keep the regex backslashes from being read as string escapes
CLEAN_HTML = re.compile(r'<.*?>')            # Regex to match HTML tags
CLEAN_URL = re.compile(r'(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})')
CLEAN_SPACES = re.compile(r'\s+')            # Regex to match runs of whitespace
CLEAN_NON_ALPHA = re.compile(r'[^a-zA-Z]')   # Regex to match non-alphabetic characters


def clean_text(text):
    # Convert to lower case
    text = text.lower()

    # Remove HTML tags and URLs from text
    text = re.sub(CLEAN_HTML, ' ', text)
    text = re.sub(CLEAN_URL, ' ', text)

    # Expand contractions (must happen before apostrophes are stripped)
    text = contractions.fix(text)

    # Remove non-alphabetic characters
    text = re.sub(CLEAN_NON_ALPHA, ' ', text)

    # Remove additional spaces
    text = re.sub(CLEAN_SPACES, ' ', text)

    return text

tqdm.pandas()
print(f"Average length of review before cleaning: {tiny_df.review_body.apply(len).mean()}")
tiny_df['review_body_pp'] = tiny_df.review_body.progress_apply(clean_text)
print(f"Average length of review after cleaning: {tiny_df.review_body_pp.apply(len).mean()}")

Average length of review before cleaning: 271.9501


100%|██████████| 60000/60000 [00:02<00:00, 29004.61it/s]

Average length of review after cleaning: 262.4223





In [10]:
tiny_df.head(10)

Unnamed: 0,star_rating,review_body,star_rating_numeric,target,review_body_pp
1797264,4.0,Around 25% less than buying at the store + fre...,4.0,class_3,around less than buying at the store free ship...
2951025,5.0,"Love it! Clean, fresh, seems greasy at first, ...",5.0,class_3,love it clean fresh seems greasy at first but ...
2821987,5.0,This color is beautiful and amazing for summer...,5.0,class_3,this color is beautiful and amazing for summer...
2666293,5.0,I've been on the hunt for the perfect lotion. ...,5.0,class_3,i have been on the hunt for the perfect lotion...
2776503,5.0,my son has eczema since he was born and his sk...,5.0,class_3,my son has eczema since he was born and his sk...
3551303,5.0,This is the classiest polish I have ever seen....,5.0,class_3,this is the classiest polish i have ever seen ...
1459989,4.0,it's not very pigmented but that was why I bou...,4.0,class_3,it is not very pigmented but that was why i bo...
725454,5.0,"Yay, my new favorite cleanser! I've been using...",5.0,class_3,yay my new favorite cleanser i have been using...
2080912,5.0,"I ordered this on 12 Nov 2014, and received it...",5.0,class_3,i ordered this on nov and received it november...
2282098,5.0,I have purchased so many new styling tools tr...,5.0,class_3,i have purchased so many new styling tools try...


# Pre-processing

## Remove the stop words

## Perform lemmatization

In [11]:
lemmatizer = WordNetLemmatizer()


def get_wordnet_pos(treebank_tag):
    treebank_to_wordnet = {
        'n': 'n', # Noun
        'v': 'v', # Verb
        'j': 'a', # Adjective
        'r': 'r', # Adverb
    }
    wordnet_tag = treebank_to_wordnet.get(treebank_tag[0].lower())
    return wordnet_tag if wordnet_tag is not None else 'n'


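def preprocess_text(text):
    """
    * Remove stop words (disabled for the final submission)
    * Perform lemmatization
    :param text: str
    :return: str
    """
    # Get a list of all the words
    words = text.split(' ')

    # Stop word removal is commented out because the models performed
    # better with stop words kept; uncomment to reproduce the other run
    # words = [w for w in words if w not in stopwords.words('english')]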

    # Tokenize the rest of the text using NLTK
    text = ' '.join(words)
    text = nltk.word_tokenize(text)

    # Get part of speech tags and use it for lemmatization
    pos_tags = nltk.pos_tag(text)
    # print(pos_tags)
    # print([get_wordnet_pos(v) for _, v in pos_tags])
    words = [lemmatizer.lemmatize(word=w, pos=get_wordnet_pos(v)) for w, v in pos_tags]
    return ' '.join(words)

tqdm.pandas()
print(f"Average length of review before preprocessing: {tiny_df.review_body_pp.apply(len).mean()}")
tiny_df['review_body_pp'] = tiny_df.review_body_pp.progress_apply(preprocess_text)
print(f"Average length of review after preprocessing: {tiny_df.review_body_pp.apply(len).mean()}")

Average length of review before preprocessing: 262.4223


100%|██████████| 60000/60000 [01:14<00:00, 808.63it/s] 

Average length of review after preprocessing: 251.48775





In [12]:
tiny_df.head(10)

Unnamed: 0,star_rating,review_body,star_rating_numeric,target,review_body_pp
1797264,4.0,Around 25% less than buying at the store + fre...,4.0,class_3,around less than buying at the store free ship...
2951025,5.0,"Love it! Clean, fresh, seems greasy at first, ...",5.0,class_3,love it clean fresh seem greasy at first but o...
2821987,5.0,This color is beautiful and amazing for summer...,5.0,class_3,this color be beautiful and amaze for summer t...
2666293,5.0,I've been on the hunt for the perfect lotion. ...,5.0,class_3,i have be on the hunt for the perfect lotion t...
2776503,5.0,my son has eczema since he was born and his sk...,5.0,class_3,my son have eczema since he be bear and his sk...
3551303,5.0,This is the classiest polish I have ever seen....,5.0,class_3,this be the classy polish i have ever see it l...
1459989,4.0,it's not very pigmented but that was why I bou...,4.0,class_3,it be not very pigmented but that be why i buy...
725454,5.0,"Yay, my new favorite cleanser! I've been using...",5.0,class_3,yay my new favorite cleanser i have be use an ...
2080912,5.0,"I ordered this on 12 Nov 2014, and received it...",5.0,class_3,i order this on nov and receive it november da...
2282098,5.0,I have purchased so many new styling tools tr...,5.0,class_3,i have purchase so many new styling tool try t...


# TF-IDF Feature Extraction

In [13]:
tfidf_features = TfidfVectorizer(ngram_range=(1, 2))
tfidf_features.fit(tiny_df.review_body_pp.values)

# Training data after TF-IDF feature extraction
X = tfidf_features.transform(tiny_df.review_body_pp.values)
y = tiny_df.target.to_numpy()

In [14]:
tfidf_features.get_feature_names_out()

array(['aa', 'aa battery', 'aa be', ..., 'zy event', 'zytaze',
       'zytaze really'], dtype=object)

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [16]:
def fit_model_cv(pipeline, parameter_grid, training_data, target_labels):
    cv = GridSearchCV(
        estimator=pipeline,
        param_grid=parameter_grid,
    )
    cv.fit(training_data, target_labels)
    return cv


def get_model_metrics(pipeline, testing_data, testing_labels, print_output=True):
    y_pred = pipeline.predict(testing_data)
    clf_report = classification_report(testing_labels, y_pred, output_dict=True)

    if print_output:
        for cls in set(testing_labels):
            print(f"{clf_report[cls]['precision']}, "
                  f"{clf_report[cls]['recall']}, "
                  f"{clf_report[cls]['f1-score']}")
        print(f"{clf_report['weighted avg']['precision']}, "
              f"{clf_report['weighted avg']['recall']}, "
              f"{clf_report['weighted avg']['f1-score']}")

    # return clf_report
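`fit_model_cv` leaves `cv` unset, so `GridSearchCV` falls back to its default of 5-fold cross-validation and tries every combination in the parameter grid. The standard-library sketch below enumerates the combinations for the Perceptron grid used in the next cell, to give a feel for how many fits a search involves. It only counts candidates and does not train anything.

```python
from itertools import product

# Same grid as the Perceptron search in this notebook.
param_grid = {"penalty": ["l1", "l2"], "eta0": [1e-4, 0.001, 0.01, 0.1, 1]}
n_folds = 5  # GridSearchCV's default when `cv` is not given

names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))            # 10
print(len(candidates) * n_folds)  # 50 fits, before the final refit on all training data
```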

# Perceptron

In [17]:
print("Perceptron Model")
perceptron = fit_model_cv(Perceptron(),
                          {'penalty': ['l1', 'l2'],
                           'eta0': [1e-4, 0.001, 0.01, 0.1, 1]
                           },
                          X_train, y_train)

get_model_metrics(perceptron, X_test, y_test)

Perceptron Model
0.602112676056338, 0.5950782997762863, 0.5985748218527316
0.7889536356726627, 0.7857683573050719, 0.7873577749683945
0.7027225901398086, 0.7137518684603886, 0.7081942899517982
0.6974709192305503, 0.69775, 0.697588104198317


# SVM

In [18]:
print("Support Vector Machine")
svc = fit_model_cv(LinearSVC(),
                   {'C': np.arange(1, 1.5, 0.2),},
                   X_train, y_train)

get_model_metrics(svc, X_test, y_test)

Support Vector Machine
0.6535742340926944, 0.6204325130499627, 0.6365723029839327
0.807205452775073, 0.836739843552864, 0.8217073472927766
0.7353302234225386, 0.7461385151968112, 0.7406949425003091
0.7316583224933828, 0.7339166666666667, 0.7325421742851563


# Logistic Regression

In [19]:
print("Logistic Regression Model")
logistic_reg = fit_model_cv(LogisticRegression(),
                            {},
                            # {'C': np.arange(1, 2, 0.2),},
                            X_train, y_train)

get_model_metrics(logistic_reg, X_test, y_test)

Logistic Regression Model
0.6606917445089624, 0.6505095699726572, 0.655561122244489
0.822772027265842, 0.8223568004037345, 0.8225643614336193
0.7442373712604218, 0.7561036372695565, 0.750123578843302
0.7421647700377851, 0.7425833333333334, 0.7423450837190022


# Naive Bayes

In [20]:
print("Multinomial Naive Bayes Model")
naive_bayes = fit_model_cv(MultinomialNB(),
                           {},
                           X_train, y_train)

get_model_metrics(naive_bayes, X_test, y_test)


Multinomial Naive Bayes Model
0.604603370324702, 0.7312950534427044, 0.6619417257284285
0.8702266376901583, 0.707292455210699, 0.7803452115812917
0.7446971633018145, 0.7259591429995017, 0.735208780118582
0.7391868281229881, 0.7215833333333334, 0.7255523066248427
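Each output row above lists precision, recall and F1 score, in that order, with one row per class followed by the weighted average. As a sanity check on how to read the rows, the sketch below recomputes F1 as the harmonic mean of precision and recall for the first Perceptron row. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# First row of the Perceptron output: precision, recall, reported F1.
precision, recall, reported_f1 = 0.602112676056338, 0.5950782997762863, 0.5985748218527316

matches = abs(f1_from(precision, recall) - reported_f1) < 1e-9
print(matches)  # True
```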
