# Sentiment Classification of Product Reviews Using RNNs

## Read and Explore the dataset

1. Explore the dataset
2. Data manipulation and processing
3. Clean the review text
4. Tokenize the reviews
5. Convert the tokens into a tensor (array)
6. Pad short reviews
7. Build and train a model
8. Predict the sentiments


In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
import pandas as pd
import numpy as np
import re

In [4]:
# from google.colab import drive
# drive.mount("/content/drive/")

In [5]:
%%writefile get_data.sh
if [ ! -f amazon_product_reviews_mod.csv ]; then
  # Quote the URL so the shell does not treat '&' as a background operator
  wget -O amazon_product_reviews_mod.csv "https://www.dropbox.com/scl/fi/imdwnlb4b617hfrepik8c/amazon_product_reviews_mod.csv?rlkey=3oafw8jsifa74v8hsvsuhcf77&dl=0"
fi

Writing get_data.sh


In [6]:
!bash get_data.sh

--2024-02-29 14:24:24--  https://www.dropbox.com/scl/fi/imdwnlb4b617hfrepik8c/amazon_product_reviews_mod.csv?rlkey=3oafw8jsifa74v8hsvsuhcf77
Resolving www.dropbox.com (www.dropbox.com)... 162.125.4.18, 2620:100:6019:18::a27d:412
Connecting to www.dropbox.com (www.dropbox.com)|162.125.4.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://ucdfdd01adc6482fad248650d857.dl.dropboxusercontent.com/cd/0/inline/CONZSZhMWwGzPdWZrlEBgsYuT1nxicC5FIZwPTWp984Vnxr2LGHGjNp2tlqqN5vJCkkvKU6-xnQGgVd5c-862Jf36Df4zpNKdePwAl52qq5qV0urtb50wQ-QXWJ9LEu5ZfvO4cHuME3O9p1_amMXbJ-n/file# [following]
--2024-02-29 14:24:25--  https://ucdfdd01adc6482fad248650d857.dl.dropboxusercontent.com/cd/0/inline/CONZSZhMWwGzPdWZrlEBgsYuT1nxicC5FIZwPTWp984Vnxr2LGHGjNp2tlqqN5vJCkkvKU6-xnQGgVd5c-862Jf36Df4zpNKdePwAl52qq5qV0urtb50wQ-QXWJ9LEu5ZfvO4cHuME3O9p1_amMXbJ-n/file
Resolving ucdfdd01adc6482fad248650d857.dl.dropboxusercontent.com (ucdfdd01adc6482fad248650d857.dl.dropboxusercontent.com)... 16

In [7]:
#cd 'drive/My Drive'

We already downloaded the dataset from [here](https://www.kaggle.com/datasets/datafiniti/consumer-reviews-of-amazon-products?select=1429_1.csv) and saved it as a CSV file which we'll read now using Pandas.

In [8]:
# Load the downloaded reviews into a DataFrame
df = pd.read_csv('amazon_product_reviews_mod.csv')

In [9]:
df.shape

(28332, 4)

In [10]:
df.iloc[10]

reviews.rating                                                      5
reviews.text        I find amazon basics batteries to be equal if ...
reviews.title       ... find amazon basics batteries to be equal i...
reviews.username                                             ByTXcust
Name: 10, dtype: object

In [11]:
print(df.iloc[1000]["reviews.rating"])
print(df.iloc[1000]["reviews.text"])

1
These are terrible. Don't last Put then in various items and their life is only 1/2 half of a Duracell


In [12]:
print(df.iloc[27162]["reviews.rating"])
print(df.iloc[27162]["reviews.text"])

5
Had to chose between Fire 10 and Fire 8. For the money, you can't go wrong with the Fire 8. Screen is good for " older eyes".


## Data Manipulations

In [13]:
# Work on an explicit copy so later column assignments do not touch a view of df
reviews_df = df[["reviews.text", "reviews.rating"]].copy()
reviews_df.columns = ["review", "rating"]

In [14]:
reviews_df.head(10)

| | review | rating |
|---|---|---|
| 0 | I order 3 of them and one of the item is bad q... | 3 |
| 1 | Bulk is always the less expensive way to go fo... | 4 |
| 2 | Well they are not Duracell but for the price i... | 5 |
| 3 | Seem to work as well as name brand batteries a... | 5 |
| 4 | These batteries are very long lasting the pric... | 5 |
| 5 | Bought a lot of batteries for Christmas and th... | 5 |
| 6 | ive not had any problame with these batteries ... | 5 |
| 7 | Well if you are looking for cheap non-recharge... | 5 |
| 8 | These do not hold the amount of high power jui... | 3 |
| 9 | AmazonBasics AA AAA batteries have done well b... | 4 |


In [15]:
reviews_df.isnull().sum()

review    0
rating    0
dtype: int64

In [16]:
reviews_df.dropna(inplace=True)

In [17]:
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a sentiment label."""
    if (rating == 5) or (rating == 4):
        return "positive"
    elif rating == 3:
        return "neutral"
    elif (rating == 2) or (rating == 1):
        return "negative"

In [18]:
# Derive a sentiment label from each rating
reviews_df["sentiment"] = reviews_df["rating"].apply(rating_to_sentiment)

In [19]:
reviews_df.sample(10, random_state = 86, ignore_index=True)

| | review | rating | sentiment |
|---|---|---|---|
| 0 | perfect for plane rides. the product operates ... | 5 | positive |
| 1 | This suited my exact needs which was a mobile ... | 5 | positive |
| 2 | Seem to last a long time | 4 | positive |
| 3 | I never owned a Kindle before. This was easy t... | 5 | positive |
| 4 | good batteries for the price. i will order again | 5 | positive |
| 5 | I realized after purchasing that the reason yo... | 1 | negative |
| 6 | I do a lot of international travelling, so I h... | 5 | positive |
| 7 | Bought one of these two weeks before on Amazon... | 5 | positive |
| 8 | I reccomend this tabley yo every one. I purcha... | 5 | positive |
| 9 | My granddaughter loved her tablet and the colo... | 5 | positive |


In [20]:
sentiments = reviews_df["sentiment"].values
reviews = reviews_df["review"].values
print(sentiments[0:5])
print("\n")
print(reviews[0:5])

['neutral' 'positive' 'positive' 'positive' 'positive']


['I order 3 of them and one of the item is bad quality. Is missing backup spring so I have to put a pcs of aluminum to make the battery work.'
 'Bulk is always the less expensive way to go for products like these'
 'Well they are not Duracell but for the price i am happy.'
 'Seem to work as well as name brand batteries at a much better price'
 'These batteries are very long lasting the price is great.']


## Convert text categories into one-hot encoded vector

We import the Keras utilities here; they are needed to convert the labels to one-hot encoded vectors. We'll use the `to_categorical` function from this module. To learn more about this function, see the [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical).

Popular deep learning frameworks include:

1. TensorFlow
2. PyTorch
3. Keras

If TensorFlow is not installed yet, install it first:

%pip install tensorflow

In [21]:
import tensorflow.keras.utils as ku


The Keras `to_categorical` function takes numerical class ids, so we first need to convert the text labels to numbers. To do this, we define the following function.

In [22]:
def encode_sentiments(sentiment):
    if sentiment == "negative":
        return 0
    elif sentiment == "neutral":
        return 1
    elif sentiment == "positive":
        return 2

In [23]:
label_encoding = {0: 'negative', 1: 'neutral', 2: 'positive'}

In [24]:
sentiments_encoded = [encode_sentiments(sentiment) for sentiment in sentiments]

In [25]:
print(sentiments[0:10])
print(sentiments_encoded[0:10])

['neutral' 'positive' 'positive' 'positive' 'positive' 'positive'
 'positive' 'positive' 'neutral' 'positive']
[1, 2, 2, 2, 2, 2, 2, 2, 1, 2]


In [26]:
labels = ku.to_categorical(sentiments_encoded, num_classes = 3)

In [27]:
print(labels[0:20])
sentiments[0:20]

[[0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]


array(['neutral', 'positive', 'positive', 'positive', 'positive',
       'positive', 'positive', 'positive', 'neutral', 'positive',
       'positive', 'neutral', 'positive', 'positive', 'positive',
       'positive', 'negative', 'negative', 'positive', 'neutral'],
      dtype=object)
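
As a rough illustration of what `to_categorical` produces, the following NumPy sketch builds the same kind of one-hot matrix by indexing an identity matrix. The helper name `one_hot` is hypothetical and is not part of Keras; it assumes integer class ids in the range 0 to `num_classes - 1`.

```python
import numpy as np

def one_hot(class_ids, num_classes):
    """Hypothetical stand-in for to_categorical: one row per label."""
    class_ids = np.asarray(class_ids, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype="float32")[class_ids]

encoded = one_hot([1, 2, 2, 0], num_classes=3)
print(encoded)
# Taking argmax along each row recovers the original integer labels.
print(encoded.argmax(axis=1))
```

The `argmax` call is the same trick used at prediction time to turn a probability vector back into a class id.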

## Cleaning of the review texts

In [28]:
test_str = "I am really really impressed by this product."
print(test_str)
old_pattern = "really"
new_pattern = "very"
new_str = re.sub(old_pattern, new_pattern, test_str)
print(new_str)

I am really really impressed by this product.
I am very very impressed by this product.


In [29]:
test_str = "Python 3.8"
print(test_str)
old_pattern = r'\d'
new_pattern = '<digit>'
new_str = re.sub(old_pattern, new_pattern, test_str)
print(new_str)

Python 3.8
Python <digit>.<digit>


### Remove hyperlinks

In [30]:
test_str = "Visit https://www.amazon.com for more information on this."
test_pattern = r'http\S+'
print(test_str)
new_str = re.sub(test_pattern, " ", test_str)
print(new_str)

Visit https://www.amazon.com for more information on this.
Visit   for more information on this.


In [31]:
def remove_hyperlinks(text):
    pattern_for_hyperlink = r'http\S+'
    return re.sub(pattern_for_hyperlink, " ", text)

In [32]:
reviews = [remove_hyperlinks(review) for review in reviews]

In [33]:
reviews[0:5]

['I order 3 of them and one of the item is bad quality. Is missing backup spring so I have to put a pcs of aluminum to make the battery work.',
 'Bulk is always the less expensive way to go for products like these',
 'Well they are not Duracell but for the price i am happy.',
 'Seem to work as well as name brand batteries at a much better price',
 'These batteries are very long lasting the price is great.']

### Expand contracted words

In [34]:
def remove_contracted_words(text):
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"n\'t", " not", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'t", " not", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'m", " am", text)
    return text

In [35]:
test_str = "I can't use this product. I don't recommend this product to anyone."
print(test_str)
print(remove_contracted_words(test_str))

I can't use this product. I don't recommend this product to anyone.
I can not use this product. I do not recommend this product to anyone.


In [36]:
reviews = [remove_contracted_words(review) for review in reviews]

In [37]:
reviews[0:5]

['I order 3 of them and one of the item is bad quality. Is missing backup spring so I have to put a pcs of aluminum to make the battery work.',
 'Bulk is always the less expensive way to go for products like these',
 'Well they are not Duracell but for the price i am happy.',
 'Seem to work as well as name brand batteries at a much better price',
 'These batteries are very long lasting the price is great.']

### Remove everything other than letters of alphabet

In [38]:
def remove_non_letters(text):
    antipattern = r'[^A-Za-z]+'
    return re.sub(antipattern, " ", text)

In [39]:
reviews = [remove_non_letters(review) for review in reviews]
reviews[0:5]

['I order of them and one of the item is bad quality Is missing backup spring so I have to put a pcs of aluminum to make the battery work ',
 'Bulk is always the less expensive way to go for products like these',
 'Well they are not Duracell but for the price i am happy ',
 'Seem to work as well as name brand batteries at a much better price',
 'These batteries are very long lasting the price is great ']

In [40]:
test_string = "I recommend      resolution with   GB of RAM     "
test_string_list = test_string.split()
test_string_list

['I', 'recommend', 'resolution', 'with', 'GB', 'of', 'RAM']

In [41]:
' '.join(test_string_list)

'I recommend resolution with GB of RAM'

In [42]:
test_string = "       I do not like this product     "
test_string.strip()

'I do not like this product'

In [43]:
test_string = "I did not like Deliver Package provided by AMAZON"
test_string.lower()

'i did not like deliver package provided by amazon'

In [44]:
def remove_spaces_and_convert_to_Lowercase(text):
    return ' '.join(text.split()).strip().lower()

In [45]:
reviews = [remove_spaces_and_convert_to_Lowercase(review) for review in reviews]
reviews[0:5]

['i order of them and one of the item is bad quality is missing backup spring so i have to put a pcs of aluminum to make the battery work',
 'bulk is always the less expensive way to go for products like these',
 'well they are not duracell but for the price i am happy',
 'seem to work as well as name brand batteries at a much better price',
 'these batteries are very long lasting the price is great']
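
The four cleaning steps above can also be chained into a single helper, which makes it easier to apply exactly the same preprocessing to new text at prediction time. This is only a sketch: `clean_review` and `CONTRACTIONS` are hypothetical names, and the substitutions simply repeat the patterns from the cells above in the same order.

```python
import re

# Same substitutions as in the cells above, applied in the same order.
CONTRACTIONS = [
    (r"won't", "will not"), (r"can't", "can not"), (r"n't", " not"),
    (r"'re", " are"), (r"'s", " is"), (r"'d", " would"),
    (r"'ll", " will"), (r"'t", " not"), (r"'ve", " have"), (r"'m", " am"),
]

def clean_review(text):
    """Hypothetical helper chaining the four cleaning steps."""
    text = re.sub(r"http\S+", " ", text)        # remove hyperlinks
    for pattern, replacement in CONTRACTIONS:    # expand contractions
        text = re.sub(pattern, replacement, text)
    text = re.sub(r"[^A-Za-z]+", " ", text)     # keep letters only
    return " ".join(text.split()).lower()        # tidy spaces, lowercase

print(clean_review("I can't use this! Visit https://www.amazon.com NOW"))
# prints: i can not use this visit now
```

The order of the contraction rules matters: the specific patterns (`won't`, `can't`) have to run before the generic `n't` rule.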

## Tokenization

In [46]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [47]:
VOCAB_SIZE = 30000
UNK_TOK = '<UNK>'

In [48]:
tokenizer = Tokenizer(num_words = VOCAB_SIZE, oov_token=UNK_TOK)

In [49]:
tokenizer.fit_on_texts(reviews)

In [50]:
sequences = tokenizer.texts_to_sequences(reviews)

The tokenizer converts each sentence into a sequence of integer word indices.

In [51]:
print(sequences[0])
print(sequences[3])
print(reviews[0])
print(reviews[3])

[3, 313, 14, 38, 4, 42, 14, 2, 252, 9, 350, 92, 9, 1348, 1410, 2172, 28, 3, 17, 5, 177, 8, 3575, 14, 3945, 5, 240, 2, 64, 55]
[138, 5, 55, 16, 53, 16, 104, 96, 15, 47, 8, 82, 88, 25]
i order of them and one of the item is bad quality is missing backup spring so i have to put a pcs of aluminum to make the battery work
seem to work as well as name brand batteries at a much better price
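
To give a feel for how such sequences come about, here is a simplified pure-Python sketch of a frequency-ranked word index with an out-of-vocabulary token. It is not the Keras implementation: `build_word_index` and `to_sequence` are hypothetical names, and the sketch leaves out the `num_words` cutoff and the punctuation filtering that the real `Tokenizer` performs.

```python
from collections import Counter

def build_word_index(texts, oov_token="<UNK>"):
    """Toy word index: 1 is reserved for unknown words, then by frequency."""
    counts = Counter(word for text in texts for word in text.split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def to_sequence(text, index, oov_token="<UNK>"):
    # Words never seen while building the index map to the OOV id.
    return [index.get(word, index[oov_token]) for word in text.split()]

toy_reviews = ["the price is great", "the battery is bad", "great battery"]
toy_index = build_word_index(toy_reviews)
print(toy_index)
print(to_sequence("the screen is great", toy_index))
```

More frequent words get smaller ids, which is why common words such as "i", "to" and "the" show up as small numbers in the sequences printed above.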


## Padding

In [52]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [53]:
MAX_LEN = 32  # maximum number of tokens kept per review

In [54]:
padded_sequences = np.array(pad_sequences(sequences,
                                          maxlen=MAX_LEN,
                                          padding='post',
                                          truncating='post'))

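All inputs must have the same length, so shorter sequences are padded with zeros, either at the end (`padding='post'`) or at the beginning (`padding='pre'`), and longer ones are truncated to `MAX_LEN`.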

In [55]:
print(padded_sequences[0])
print(padded_sequences[1])
print(padded_sequences[3])

[   3  313   14   38    4   42   14    2  252    9  350   92    9 1348
 1410 2172   28    3   17    5  177    8 3575   14 3945    5  240    2
   64   55    0    0]
[532   9 181   2 216 209 167   5 137   7 296  51  30   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
[138   5  55  16  53  16 104  96  15  47   8  82  88  25   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
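
The effect of post-padding and post-truncation can be reproduced with a few lines of NumPy. This is a toy sketch for illustration only; `pad_post` is a hypothetical helper, not the Keras `pad_sequences` function.

```python
import numpy as np

def pad_post(sequences, max_len, pad_value=0):
    """Toy version of post-padding with post-truncation."""
    padded = np.full((len(sequences), max_len), pad_value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[:max_len]              # drop tokens beyond max_len
        padded[row, :len(kept)] = kept    # left-align, zeros fill the rest
    return padded

print(pad_post([[3, 313, 14], [5, 9, 2, 7, 8, 1]], max_len=4))
```

The first toy sequence is shorter than `max_len` and gets a trailing zero; the second is longer and loses its last two tokens.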


The padded token sequences above are the input to our model.

Word embeddings map each token to a dense vector of numbers. Words that appear in similar contexts (e.g. "king" and "queen") end up with similar vectors.
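
Conceptually, an embedding layer is a lookup table with one row per token id. The sketch below uses a small random table in place of trained weights, so the similarity value it prints carries no meaning; the toy sizes and the `cosine_similarity` helper are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4   # toy sizes; the notebook uses 30000 and 16

# An embedding layer is essentially a trainable lookup table:
# one row of `embedding_dim` numbers per token id.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

token_ids = np.array([3, 5, 0, 0])      # a padded toy sequence
vectors = embedding_table[token_ids]    # shape: (4, embedding_dim)
print(vectors.shape)

def cosine_similarity(a, b):
    # Close to 1 when two vectors point in nearly the same direction.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embedding_table[3], embedding_table[5]))
```

After training, rows belonging to words with related meanings tend to have a higher cosine similarity than rows of unrelated words.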

## Create the model

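The output is a multi-class classification (negative, neutral, positive), so the output layer uses softmax, which returns one probability between 0 and 1 per class, with the probabilities summing to 1. A sigmoid also returns values between 0 and 1, but it is the usual choice for binary outputs.

Hidden dense layers in a neural network typically use a non-linear activation such as ReLU, while recurrent layers such as `SimpleRNN` use tanh by default. The output layer uses sigmoid or softmax, depending on the task. The input layer does not use an activation function.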
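
For three mutually exclusive classes, the model in this notebook ends in a `softmax` layer. The sketch below compares softmax with an element-wise sigmoid on made-up scores; the numbers are illustrative and do not come from the trained model.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-1.0, 0.5, 3.0])     # made-up raw scores for neg, neutral, pos
probs = softmax(scores)
print(np.round(probs, 3), probs.sum())  # class probabilities sum to 1
print(np.round(sigmoid(scores), 3))     # independent values; need not sum to 1
```

Because the softmax outputs form a probability distribution over the classes, they pair naturally with one-hot labels and the `categorical_crossentropy` loss.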

In [56]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, SimpleRNN, Flatten, Dense

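The embedding dimension is 16: each token becomes a 16-dimensional vector, which is the input size of the first RNN layer. In these notes the RNN size (the `64` in `SimpleRNN(64, ...)`) is picked as a multiple of the embedding dimension (2x, 3x, 4x, etc.).

Each `SimpleRNN` layer has 64 units, i.e. 64 neurons.

`Bidirectional` wraps an RNN so that it reads the sequence in both directions and concatenates the two results.

Flattening converts an n-dimensional tensor into a one-dimensional vector per sample.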

In [60]:
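model = Sequential()
# Map each of the VOCAB_SIZE token ids to a trainable 16-dimensional vector
model.add(Embedding(VOCAB_SIZE, 16, input_length=MAX_LEN))
# First bidirectional RNN returns the full sequence so the next RNN can consume it
model.add(Bidirectional(SimpleRNN(64, return_sequences=True)))
# Second bidirectional RNN returns only the last output of each direction (2 * 64 = 128 values)
model.add(Bidirectional(SimpleRNN(64)))
model.add(Flatten())  # the output is already 1-D per sample, so this changes nothing here
model.add(Dense(24, activation='relu'))
# One probability per class: negative, neutral, positive
model.add(Dense(3, activation='softmax'))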

In [61]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [59]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 32, 16)            480000    
                                                                 
 bidirectional (Bidirection  (None, 32, 128)           10368     
 al)                                                             
                                                                 
 bidirectional_1 (Bidirecti  (None, 128)               24704     
 onal)                                                           
                                                                 
 flatten (Flatten)           (None, 128)               0         
                                                                 
 dense (Dense)               (None, 24)                3096      
                                                                 
 dense_1 (Dense)             (None, 3)                 7
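
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for a simple RNN cell (input weights, recurrent weights and a bias per unit) and for a dense layer; the helper names are hypothetical.

```python
def simple_rnn_params(input_dim, units):
    # input weights + recurrent weights + bias
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    return input_dim * units + units    # weights + bias

embedding = 30000 * 16                          # VOCAB_SIZE * embedding dim
birnn_1 = 2 * simple_rnn_params(16, 64)         # forward + backward RNN
birnn_2 = 2 * simple_rnn_params(128, 64)        # input is 2 * 64 = 128 wide
dense_hidden = dense_params(128, 24)
dense_out = dense_params(24, 3)

print(embedding, birnn_1, birnn_2, dense_hidden, dense_out)
print("total:", embedding + birnn_1 + birnn_2 + dense_hidden + dense_out)
```

Most of the parameters sit in the embedding table, which is typical for text models with a large vocabulary and small recurrent layers.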

## Model training

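The data is split into a training set (80%) and a validation set (20%).

The validation set is held out from training. Comparing training and validation metrics helps detect overfitting, i.e. an overtrained model. A test set, by contrast, is kept aside for a final evaluation only.

With `validation_split`, the parameters are estimated on the training portion, and loss and accuracy are also reported on the validation portion after each epoch.

Training accuracy generally increases with the number of epochs, while validation accuracy may level off or drop once the model starts to overfit.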

In [62]:
model.fit(padded_sequences, labels, epochs=10, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7f486873c4f0>
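
As far as the Keras documentation describes it, `validation_split=0.2` holds out the last 20% of the samples, taken before any shuffling. The sketch below mimics that behaviour on made-up data; `split_last_fraction` is a hypothetical helper, not a Keras function.

```python
import math
import numpy as np

def split_last_fraction(x, y, validation_split=0.2):
    """Toy split: the last fraction of the rows becomes the validation set."""
    split_at = math.floor(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

x = np.arange(20).reshape(10, 2)   # 10 made-up samples
y = np.arange(10)
(train_x, train_y), (val_x, val_y) = split_last_fraction(x, y)
print(len(train_x), len(val_x))
print(val_y)
```

If the rows of a dataset are ordered in some systematic way, it may be worth shuffling them before calling `fit` so that the held-out tail resembles the rest of the data.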

## Sentiment prediction using trained model

In [70]:
positive_sample = 'i gave this as a christmas gift to my inlaws husband and uncle they \
loved it and how easy they are to use with fantastic features'

negative_sample = 'if ads dont bother you then this may be a decent device purchased this \
for my kid and it was loaded down with so much spam it kept loading it up making \
it slow and laggy plus the carrasoul loadout makes it hard to navigate for kids \
not very kid friendly oh you can pay to remove the ads but it wont remove them all \
buy the samsung better everything'

neutral_sample='it is ok'

In [71]:
sample_sequence = tokenizer.texts_to_sequences([neutral_sample])[0]
sample_sequence_padded = pad_sequences([sample_sequence],
                                       maxlen=MAX_LEN,
                                       padding='post',
                                       truncating='post')

In [72]:
label_encoding={0:'Negative', 1:'Neutral', 2:'Positive'}

In [75]:
# Predict class probabilities for the padded sample (shape: 1 x 3)
predictions = model.predict(sample_sequence_padded, verbose=0)
print(np.round(predictions, 3))
predicted_label = np.argmax(predictions, axis=1)[0]
print("Review:", neutral_sample)
print("Sentiment:", label_encoding[predicted_label])

[[0.    0.033 0.967]]
Review: it is ok
Sentiment: Positive
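
The decoding step can be wrapped in a small helper that also returns the probability of the winning class. This is a sketch only: `decode_prediction` is a hypothetical name, and the probabilities are copied from the output above instead of being computed by the model.

```python
import numpy as np

label_encoding = {0: "Negative", 1: "Neutral", 2: "Positive"}

def decode_prediction(probabilities, labels=label_encoding):
    """Hypothetical helper: pick the most probable class and its probability."""
    class_id = int(np.argmax(probabilities))
    return labels[class_id], float(probabilities[class_id])

# Probabilities copied from the output above for the review "it is ok".
probs = np.array([0.0, 0.033, 0.967])
label, confidence = decode_prediction(probs)
print(label, confidence)
```

Returning the probability alongside the label makes it easier to spot cases such as this one, where a lukewarm review is classified as positive with high confidence.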
