# Text classification (sentiment analysis)
Task: Predict sentiment of Amazon reviews
Dataset: Amazon US customer reviews, Major Appliances category (amazon_reviews_us_Major_Appliances_v1_00.tsv)

## 1. Loading dataset & basic preprocessing
- removal of reviews with 5 or fewer characters
- mapping of star ratings from 1-5 to three classes: 1-2 -> 0 (negative), 3 -> 1 (neutral), 4-5 -> 2 (positive)
- subsampling to 80 000 rows without replacement (random state 42)

In [105]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from IPython.display import display
import re
import matplotlib.pyplot as plt

In [106]:
df = pd.read_csv('datasets/amazon_reviews_us_Major_Appliances_v1_00.tsv', sep='\t', on_bad_lines='skip')

In [107]:
# remove nas and duplicate reviews
df.dropna(axis=0, subset=['review_body'], inplace=True)
df.drop_duplicates(subset=['review_body'], inplace=True)

In [108]:
df

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,16199106,R203HPW78Z7N4K,B0067WNSZY,633038551,"FGGF3032MW Gallery Series 30"" Wide Freestandin...",Major Appliances,5,0,0,N,Y,"If you need a new stove, this is a winner.",What a great stove. What a wonderful replacem...,2015-08-31
1,US,16374060,R2EAIGVLEALSP3,B002QSXK60,811766671,Best Hand Clothes Wringer,Major Appliances,5,1,1,N,Y,Five Stars,worked great,2015-08-31
2,US,15322085,R1K1CD73HHLILA,B00EC452R6,345562728,Supco SET184 Thermal Cutoff Kit,Major Appliances,5,0,0,N,Y,Fast Shipping,Part exactly what I needed. Saved by purchasi...,2015-08-31
3,US,32004835,R2KZBMOFRMYOPO,B00MVVIF2G,563052763,Midea WHS-160RB1 Compact Single Reversible Doo...,Major Appliances,5,1,1,N,Y,Five Stars,Love my refrigerator! ! Keeps everything cold...,2015-08-31
4,US,25414497,R6BIZOZY6UD01,B00IY7BNUW,874236579,Avalon Bay Portable Ice Maker,Major Appliances,5,0,0,N,Y,Five Stars,No more running to the store for ice! Works p...,2015-08-31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96829,US,37431087,R3CYIDM3UEY5PA,B00005O64S,222987122,Haier HDT18PA Space Saver Compact Countertop D...,Major Appliances,4,37,43,N,N,Pretty good dishwasher for small apartment,This is a pretty good dishwasher for the price...,2002-07-14
96830,US,44686434,R1PLFLGSA6N9WU,B00005O64T,802734810,Haier America HSE02-WNAWW 1.8-Cubic-Foot Capac...,Major Appliances,1,33,39,N,N,Does not last long,I bought this for our office and was extremely...,2002-06-03
96831,US,36739731,RBPARLMOY6ZU5,B00005O64S,222987122,Haier HDT18PA Space Saver Compact Countertop D...,Major Appliances,5,6,45,N,N,Rave review for space saver,When I saw this small dishwasher I thought it ...,2002-05-05
96832,US,50744080,RSS5TDZOGUEB6,B00004SACT,344802997,Sanyo Two-Door 2.9 Cubic Foot Refrigerator,Major Appliances,4,71,71,N,N,Sanyo compact refrigerator,Probably the best small refrigerator on the ma...,2000-09-29


In [109]:
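def remove_tags(review):
    # replace HTML tags such as <br /> with a space
    return re.sub(pattern=r'<.*?>', string=review, repl=' ')

def remove_subs(review):
    # replace the SUB control character (\x1a) with a space
    return re.sub(pattern='\x1a', string=review, repl=' ')

def keep_alnum(review):
    # keep only letters, digits, whitespace and colons
    return re.sub(pattern=r'[^A-Za-z\d\s:]', string=review, repl=' ')

def strip_spaces(review):
    # collapse runs of two or more whitespace characters into one space
    return re.sub(pattern=r'\s{2,}', string=review, repl=' ')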

def lowercase(review):
    return review.lower()

In [110]:
df['review_body'] = df['review_body'].apply(remove_tags) # removes html tags
df['review_body'] = df['review_body'].apply(remove_subs) # removes sub unicode char
df['review_body'] = df['review_body'].apply(keep_alnum) # keeps letters, digits, whitespace and colons
df['review_body'] = df['review_body'].apply(strip_spaces) # collapses repeated whitespace
df['review_body'] = df['review_body'].apply(lowercase) # converts to lowercase
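
The cleaning steps above can be checked on a single string before running them over the whole column. The sketch below chains the same regular expressions in the same order; the sample review is made up, and remove_subs is assumed to target the ASCII SUB control character (\x1a).

```python
import re

def clean_review(review):
    """Apply the notebook's cleaning steps, in order, to one review string."""
    review = re.sub(r'<.*?>', ' ', review)           # HTML tags -> space
    review = re.sub('\x1a', ' ', review)             # SUB control char -> space (assumed pattern)
    review = re.sub(r'[^A-Za-z\d\s:]', ' ', review)  # keep letters, digits, whitespace, colons
    review = re.sub(r'\s{2,}', ' ', review)          # collapse runs of whitespace
    return review.lower()

# Hypothetical sample review, not taken from the dataset
sample = 'Works <br />GREAT!!\x1a Love it: 10/10'
print(clean_review(sample))  # works great love it: 10 10
```

Note that colons survive the filter and that a single leading or trailing space is not removed, so a final .strip() may be worth considering.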

In [111]:
df['review_body']

0        what a great stove what a wonderful replacemen...
1                                             worked great
2        part exactly what i needed saved by purchasing...
3        love my refrigerator keeps everything cold wil...
4        no more running to the store for ice works per...
                               ...                        
96829    this is a pretty good dishwasher for the price...
96830    i bought this for our office and was extremely...
96831    when i saw this small dishwasher i thought it ...
96832    probably the best small refrigerator on the ma...
96833    this is just a normal mid sized refrigerater b...
Name: review_body, Length: 93446, dtype: object

In [112]:
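df = df[['review_body', 'star_rating']]
# keep reviews longer than 5 characters; copy to avoid SettingWithCopyWarning
df = df[df['review_body'].str.len() > 5].copy()
# map star ratings to sentiment classes: 1-2 -> 0, 3 -> 1, 4-5 -> 2
df.loc[df['star_rating'] < 3, 'sentiment'] = 0
df.loc[df['star_rating'] == 3, 'sentiment'] = 1
df.loc[df['star_rating'] > 3, 'sentiment'] = 2
df.drop('star_rating', axis=1, inplace=True)
# subsample 80 000 rows without replacement
df = resample(df, n_samples=80000, random_state=42, replace=False)
print(df.shape)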

(80000, 2)
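
The rating-to-sentiment mapping can also be written as a single dictionary lookup. This is a minimal sketch on a toy Series (the values are illustrative, not from the dataset) using Series.map; unlike the .loc assignments, this variant yields integer labels rather than floats.

```python
import pandas as pd

# Toy ratings standing in for df['star_rating']
ratings = pd.Series([1, 2, 3, 4, 5], name='star_rating')

# 1-2 stars -> 0, 3 stars -> 1, 4-5 stars -> 2
rating_to_sentiment = {1: 0, 2: 0, 3: 1, 4: 2, 5: 2}
sentiment = ratings.map(rating_to_sentiment)

print(sentiment.tolist())  # [0, 0, 1, 2, 2]
```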


In [113]:
df

Unnamed: 0,review_body,sentiment
39391,i bought these parts from parts fast through a...,0.0
89072,we bought this unit as a gift for a family mem...,0.0
63534,my dishwasher came to a standstill and i calle...,0.0
70388,i purchased this dryer 16 months ago and it no...,0.0
63091,this was installed by our contractor 3 weeks a...,2.0
...,...,...
57692,looks good but suction is poor has a nice clea...,0.0
52810,all the parts have worked well so far except t...,0.0
30634,we replaced a 14 year old ge jp model very sim...,2.0
8420,it has been a month since i am using it powerf...,2.0


## 2. Exploratory data analysis

In [115]:
df['sentiment'].value_counts()

sentiment
2.0    52946
0.0    21348
1.0     5706
Name: count, dtype: int64
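
The counts above show a strong class imbalance. As a quick sketch, the class shares and the accuracy of an always-predict-the-majority baseline follow directly from these numbers (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above
counts = {2: 52946, 0: 21348, 1: 5706}
total = sum(counts.values())  # 80000

# Share of each class in the subsample
shares = {label: n / total for label, n in counts.items()}
for label, share in sorted(shares.items()):
    print(f"class {label}: {share:.1%}")

# A classifier that always predicts the largest class reaches this accuracy
majority_class = max(counts, key=counts.get)
print(f"majority baseline: class {majority_class}, accuracy {shares[majority_class]:.1%}")
```

With about two thirds of the reviews in class 2 and about 7% in class 1, plain accuracy can look good while class 1 is mostly missed, so per-class metrics are worth checking.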

In [114]:
# Stratified train-test split (90% train, 10% test)
X_train, X_test, y_train, y_test = train_test_split(
    df['review_body'], df['sentiment'], random_state=42, test_size=0.1, stratify=df['sentiment']
)
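
Because stratify is set, each split should keep roughly the class shares of the full sample. A small self-contained check on synthetic data illustrates this; the label counts are made up to mimic the imbalance, and toy_texts is a placeholder for the review column.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic labels mimicking the imbalance: 66% class 2, 27% class 0, 7% class 1
labels = [2] * 660 + [0] * 270 + [1] * 70
toy_texts = [f"review {i}" for i in range(len(labels))]  # placeholder texts

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, labels, random_state=42, test_size=0.1, stratify=labels
)

# Class counts per split; with stratification the proportions should match
print(Counter(y_tr))
print(Counter(y_te))
```

The same comparison can be run on the real split by calling value_counts(normalize=True) on y_train and y_test.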