# Text classification (sentiment analysis)
Task: Predict sentiment of Amazon reviews
Dataset: Amazon US customer reviews, Major Appliances subset (`amazon_reviews_us_Major_Appliances_v1_00.tsv`)

## 1. Loading dataset & basic preprocessing
- removal of reviews of 5 characters or fewer
- mapping of star ratings 1-5 to sentiment classes: 1-2 -> 0 (negative), 3 -> 1 (neutral), 4-5 -> 2 (positive)
- subsampling to 80,000 rows without replacement (random state 42)

In [27]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from IPython.display import display
import re
import matplotlib.pyplot as plt

In [22]:
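# Load the raw reviews; malformed rows are skipped
df = pd.read_csv('datasets/amazon_reviews_us_Major_Appliances_v1_00.tsv', sep='\t', on_bad_lines='skip')
df = df[['review_body', 'star_rating']]
# Keep reviews longer than 5 characters (rows with a missing body are dropped as well);
# .copy() avoids a SettingWithCopyWarning on the assignments below
df = df[df['review_body'].str.len() > 5].copy()
# Map star ratings to sentiment classes: 1-2 -> 0, 3 -> 1, 4-5 -> 2
df.loc[df['star_rating'] < 3, 'sentiment'] = 0
df.loc[df['star_rating'] == 3, 'sentiment'] = 1
df.loc[df['star_rating'] > 3, 'sentiment'] = 2
df.drop('star_rating', axis=1, inplace=True)
# Subsample 80,000 rows without replacement
df = resample(df, n_samples=80000, random_state=42, replace=False)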
print(df.shape)

(80000, 2)
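
As a compact alternative to the three `df.loc` assignments, the same 1-5 -> 0,1,2 mapping can be sketched with `pd.cut`. The toy frame below is hypothetical and only stands in for the `star_rating` column. The sketch assumes every rating is an integer between 1 and 5, and it produces integer labels rather than the float labels shown in the notebook output.

```python
import pandas as pd

# Hypothetical toy ratings standing in for the `star_rating` column.
toy = pd.DataFrame({"star_rating": [1, 2, 3, 4, 5]})

# Bins are right-inclusive: (0, 2] -> 0, (2, 3] -> 1, (3, 5] -> 2.
toy["sentiment"] = pd.cut(
    toy["star_rating"], bins=[0, 2, 3, 5], labels=[0, 1, 2]
).astype(int)

print(toy)
```

Ratings outside the bins would become missing values, and the `astype(int)` call would then fail, which makes unexpected ratings easy to spot.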


In [23]:
df

| | review_body | sentiment |
|---|---|---|
| 2578 | Not very quiet. | 1.0 |
| 34744 | Not many hours on the new ignitor but so far i... | 2.0 |
| 69710 | This one replaced an 20 year old Rangaire. Th... | 2.0 |
| 50218 | it looks like the OEM i pulled out and fit jus... | 2.0 |
| 78859 | Looks great. My son loves it.<br />Keeps cans ... | 2.0 |
| ... | ... | ... |
| 88974 | When stove was set at 350 degrees, handle very... | 0.0 |
| 96409 | This thing works great! Don't put in too much ... | 2.0 |
| 58911 | Started ok, but after 2 months use, I started ... | 0.0 |
| 77313 | I bought the white version of this refrigerato... | 0.0 |


In [28]:
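def remove_tags(review):
    # Replace each HTML tag (e.g. <br />) with a single space
    return re.sub(pattern=r'<.*?>', string=review, repl=' ')

def remove_subs(review):
    # Replace the SUB control character (assumed to be U+001A) with a space;
    # an empty pattern would instead insert a space between every character
    return re.sub(pattern='\x1a', string=review, repl=' ')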

In [29]:
df['review_body'] = df['review_body'].apply(remove_tags) # replaces HTML tags with a space
df['review_body'] = df['review_body'].apply(remove_subs) # replaces the SUB control character with a space
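
To see what the two cleaning steps do, here is a minimal sketch on a single made-up string. The sample text is hypothetical, and the SUB character is assumed to be U+001A (`'\x1a'`), since the notebook only describes it as the "sub unicode char".

```python
import re

def remove_tags(review):
    # Replace each HTML tag with a single space
    return re.sub(pattern=r'<.*?>', string=review, repl=' ')

def remove_subs(review):
    # Replace the SUB control character (assumed U+001A) with a space
    return re.sub(pattern='\x1a', string=review, repl=' ')

# Hypothetical review containing an HTML line break and a SUB character
sample = 'Looks great. My son loves it.<br />Keeps cans cold.\x1aDone'
cleaned = remove_subs(remove_tags(sample))
print(repr(cleaned))
```

Both patterns substitute a space rather than an empty string, so words on either side of a tag do not get glued together.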

In [30]:
df

| | review_body | sentiment |
|---|---|---|
| 2578 | Not very quiet. | 1.0 |
| 34744 | Not many hours on the new ignitor but so far i... | 2.0 |
| 69710 | This one replaced an 20 year old Rangaire. Th... | 2.0 |
| 50218 | it looks like the OEM i pulled out and fit jus... | 2.0 |
| 78859 | Looks great. My son loves it. Keeps cans prett... | 2.0 |
| ... | ... | ... |
| 88974 | When stove was set at 350 degrees, handle very... | 0.0 |
| 96409 | This thing works great! Don't put in too much ... | 2.0 |
| 58911 | Started ok, but after 2 months use, I started ... | 0.0 |
| 77313 | I bought the white version of this refrigerato... | 0.0 |


## 2. Exploratory data analysis

In [10]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    df['review_body'], df['sentiment'], random_state=42, test_size=0.1, stratify=df['sentiment']
)
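
The `stratify` argument keeps the class proportions the same in both splits, which matters when the sentiment classes are imbalanced. A minimal sketch on hypothetical toy labels (the 10/20/70 class mix is made up, not measured from the reviews) shows the effect:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10% class 0, 20% class 1, 70% class 2
labels = pd.Series([0] * 10 + [1] * 20 + [2] * 70)
texts = pd.Series([f"review {i}" for i in range(len(labels))])

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, random_state=42, test_size=0.1, stratify=labels
)

# Class shares should match in the train and test splits
print(y_tr.value_counts(normalize=True).sort_index())
print(y_te.value_counts(normalize=True).sort_index())
```

Running the same check on `y_train` and `y_test` from the cell above is a quick way to confirm that the real split preserved the class shares.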