# Text classification (sentiment analysis)
Task: Predict sentiment of Amazon reviews
Dataset: Amazon US customer reviews, Major Appliances category (`amazon_reviews_us_Major_Appliances_v1_00.tsv`)

## 1. Loading dataset & basic preprocessing
- removal of reviews of 5 characters or fewer
- mapping of star ratings 1-5 -> sentiment 0, 1, 2 (1-2 -> 0, 3 -> 1, 4-5 -> 2)
- subsampling to 80,000 rows without replacement, random state 42

In [79]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from IPython.display import display
import re
import matplotlib.pyplot as plt

In [80]:
df = pd.read_csv('datasets/amazon_reviews_us_Major_Appliances_v1_00.tsv', sep='\t', on_bad_lines='skip')

In [81]:
# drop rows with a missing review body
df.dropna(axis=0, subset=['review_body'], inplace=True)

In [82]:
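def remove_tags(review):
    # replace HTML tags such as <br /> with a space
    return re.sub(pattern=r'<.*?>', repl=' ', string=review)

def remove_subs(review):
    # replace the SUB control character (U+001A) with a space
    return re.sub(pattern='\x1a', repl=' ', string=review)

def keep_alnum(review):
    # replace everything except ASCII letters, digits, whitespace, and colons
    return re.sub(pattern=r'[^A-Za-z\d\s:]', repl=' ', string=review)

def strip_spaces(review):
    # collapse runs of two or more whitespace characters into one space
    return re.sub(pattern=r'\s{2,}', repl=' ', string=review)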

def lowercase(review):
    return review.lower()

In [83]:
df['review_body'] = df['review_body'].apply(remove_tags)   # removes HTML tags
df['review_body'] = df['review_body'].apply(remove_subs)   # removes the SUB control character
df['review_body'] = df['review_body'].apply(keep_alnum)    # keeps only letters, digits, whitespace, and colons
df['review_body'] = df['review_body'].apply(strip_spaces)  # collapses repeated whitespace
df['review_body'] = df['review_body'].apply(lowercase)     # lowercases the text
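
To see what the five cleaning steps do together, the sketch below chains them on a made-up review string. The `clean_review` helper and the sample text are illustrative and not part of the notebook; it also assumes that `remove_subs` targets the SUB control character (U+001A), as the cell comment suggests.

```python
import re

def clean_review(review):
    """Hypothetical helper chaining the notebook's cleaning steps in order."""
    review = re.sub(r'<.*?>', ' ', review)            # HTML tags -> space
    review = re.sub('\x1a', ' ', review)              # SUB char (assumed target) -> space
    review = re.sub(r'[^A-Za-z\d\s:]', ' ', review)   # keep letters, digits, whitespace, colons
    review = re.sub(r'\s{2,}', ' ', review)           # collapse whitespace runs
    return review.lower()

sample = '<br />Great   stove!!\x1a Works: 100%'      # made-up example
cleaned = clean_review(sample)
print(repr(cleaned))
```

A single leading or trailing space can survive, because the last substitution only collapses runs of two or more whitespace characters. Calling `.str.strip()` on the column afterwards would be one way to remove it if that matters downstream.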

In [84]:
df['review_body']

0        what a great stove what a wonderful replacemen...
1                                             worked great
2        part exactly what i needed saved by purchasing...
3        love my refrigerator keeps everything cold wil...
4        no more running to the store for ice works per...
                               ...                        
96829    this is a pretty good dishwasher for the price...
96830    i bought this for our office and was extremely...
96831    when i saw this small dishwasher i thought it ...
96832    probably the best small refrigerator on the ma...
96833    this is just a normal mid sized refrigerater b...
Name: review_body, Length: 96827, dtype: object

In [85]:
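df = df[['review_body', 'star_rating']]
# keep reviews longer than 5 characters; copy to avoid SettingWithCopyWarning below
df = df[df['review_body'].str.len() > 5].copy()
# map star ratings to sentiment: 1-2 -> 0, 3 -> 1, 4-5 -> 2
df.loc[df['star_rating'] < 3, 'sentiment'] = 0
df.loc[df['star_rating'] == 3, 'sentiment'] = 1
df.loc[df['star_rating'] > 3, 'sentiment'] = 2
df.drop('star_rating', axis=1, inplace=True)
# subsample 80,000 rows without replacement
df = resample(df, n_samples=80000, random_state=42, replace=False)
print(df.shape)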

(80000, 2)
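
The three `.loc` assignments produce a float column, which is why the labels display as `2.0` and `0.0`. As an alternative sketch, the same 1-5 to 0/1/2 mapping can be written as a dictionary lookup that yields integer labels. The toy frame and the name `STAR_TO_SENTIMENT` are made up for illustration, and the sketch assumes `star_rating` holds the integers 1 through 5.

```python
import pandas as pd

# Hypothetical lookup table: 1-2 stars -> 0, 3 stars -> 1, 4-5 stars -> 2
STAR_TO_SENTIMENT = {1: 0, 2: 0, 3: 1, 4: 2, 5: 2}

# Toy data standing in for the real reviews
toy = pd.DataFrame({
    'review_body': ['broke fast', 'bad fit', 'it is ok', 'works well', 'love it'],
    'star_rating': [1, 2, 3, 4, 5],
})

# Ratings missing from the dictionary would become NaN, so check before casting
toy['sentiment'] = toy['star_rating'].map(STAR_TO_SENTIMENT)
assert toy['sentiment'].notna().all()
toy['sentiment'] = toy['sentiment'].astype(int)
print(toy['sentiment'].tolist())
```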


In [86]:
df

| | review_body | sentiment |
|---|---|---|
| 2579 | just what i needed for my camping trip | 2.0 |
| 52110 | a little noisy but makes ice quickly nice grap... | 2.0 |
| 54656 | ugh 3rd agitator in 6 months plastic paddles c... | 0.0 |
| 13607 | so far so good works like oem for a lesser price | 2.0 |
| 75019 | i m a huge fan of newcastle so thought i would... | 0.0 |
| ... | ... | ... |
| 9509 | this was awesome | 2.0 |
| 24482 | it fir perfectly and replaced my factory filte... | 2.0 |
| 20160 | the knobs on this range get excessively hot wh... | 0.0 |
| 1020 | this looks great in my kitchen it s a little l... | 2.0 |


## 2. Exploratory Data Analysis

In [87]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    df['review_body'], df['sentiment'], random_state=42, test_size=0.1, stratify=df['sentiment']
)
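
Passing `stratify=df['sentiment']` asks `train_test_split` to keep the class proportions roughly equal in both splits. A minimal sketch on made-up labels (the 20/10/70 class mix is an assumption, not the notebook's actual distribution) shows one way to check this:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels: 20 negative (0), 10 neutral (1), 70 positive (2)
labels = [0] * 20 + [1] * 10 + [2] * 70
texts = [f'review {i}' for i in range(len(labels))]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.1, random_state=42, stratify=labels
)

def class_shares(y):
    """Return the fraction of samples per class, sorted by label."""
    counts = Counter(y)
    return {label: counts[label] / len(y) for label in sorted(counts)}

train_shares = class_shares(y_tr)
test_shares = class_shares(y_te)
print(train_shares)
print(test_shares)
```

The same `class_shares` check could be applied to `y_train` and `y_test` from the cell above to confirm that the minority classes are represented in the test set.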