<h1>
    Sentiment Analysis with LSTM
</h1>

In [2]:
import pandas as pd
import numpy as np
import plotly.express as px
from torchtext.vocab import build_vocab_from_iterator
import torchtext.transforms as T
import torch.nn as nn
import torch
from torch.optim import Adam
from tqdm.auto import tqdm, trange
from imblearn.under_sampling import RandomUnderSampler
from gensim.utils import simple_preprocess
from sklearn.model_selection import train_test_split
from sklearn.utils import gen_batches, shuffle
from nltk.stem import WordNetLemmatizer
import joblib

<h3>
    Setting the plotly renderer to iframe so that interactive plots show up correctly in nbviewer
</h3>

In [4]:
import plotly.io as pio
pio.renderers.default = "iframe"

<h2>
    We are using the Yelp Dataset for our Sentiment Analysis
</h2>
<h4>
    Loading the Yelp restaurant review dataset downloaded from Kaggle. We are interested in only two columns: "stars" and "text". The "stars" column contains the rating on a scale of 1 to 5, and the "text" column contains the review text.
    <br>
    Since this is a very large dataset, we use a random sample of 300,000 reviews for our analysis.
</h4>

In [6]:
df = pd.read_csv("yelp_review.csv",usecols=["stars","text"]).sample(300000,random_state=42)
df

Unnamed: 0,stars,text
4528116,3,Airport Wendy's. You curbed my hunger. That wa...
3097267,5,I stumbled across this store on my way to Nest...
2290314,3,Pizza was decent. Very disappointed in the del...
1146971,3,My first time: the bartenders were so cute [an...
3184541,3,I was in las vegas staying at the Paris hotel ...
...,...,...
371430,4,Great sports bar with great bar food. The wing...
1572295,1,I went to Heart Attack Grill after seeing it i...
1811562,4,Island Flavor is just as ono as the one on Dur...
3767168,5,Five stars for our dinner service last night! ...


<h4>
    Mapping the stars to sentiment. We use three sentiment levels: {"negative": 0, "neutral": 1, "positive": 2}
    <br>
    Reviews with a rating of 3 are considered "neutral", ratings above 3 are "positive", and ratings below 3 are "negative".
</h4>

In [8]:
def stars_to_sentiment(stars):
    if stars < 3:
        return 0
    elif stars == 3:
        return 1
    else:
        return 2

In [9]:
df["sentiment"] = df["stars"].apply(stars_to_sentiment)
df

Unnamed: 0,stars,text,sentiment
4528116,3,Airport Wendy's. You curbed my hunger. That wa...,1
3097267,5,I stumbled across this store on my way to Nest...,2
2290314,3,Pizza was decent. Very disappointed in the del...,1
1146971,3,My first time: the bartenders were so cute [an...,1
3184541,3,I was in las vegas staying at the Paris hotel ...,1
...,...,...,...
371430,4,Great sports bar with great bar food. The wing...,2
1572295,1,I went to Heart Attack Grill after seeing it i...,0
1811562,4,Island Flavor is just as ono as the one on Dur...,2
3767168,5,Five stars for our dinner service last night! ...,2


<h4>
    Counting the sentiment values shows a strong class imbalance in our data: positive reviews outnumber neutral ones by more than five to one.
</h4>

In [11]:
df["sentiment"].value_counts()

sentiment
2    198149
0     66851
1     35000
Name: count, dtype: int64

<h4>
    This is a problem for a classifier, which can score well just by favouring the majority class. So let's focus on balancing our data in the next step.
</h4>

In [13]:
px.histogram(df, x = "sentiment", color = "sentiment",width=1000)

<h4>
    One way to address class imbalance is to under-sample the larger classes down to the size of the smallest one. This discards data, so it works best for large datasets, which is the case for our Yelp sample.
    <br>
    I have used RandomUnderSampler from imblearn to resample our dataset.
</h4>

In [15]:
X, Y = RandomUnderSampler(random_state=42).fit_resample(df[["text"]],df["sentiment"])
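
<h4>
    To make the idea concrete, here is a minimal plain-Python sketch of random under-sampling on toy labels. The function name random_undersample and the toy data are hypothetical; this illustrates the technique and is not imblearn's implementation.
</h4>

```python
import random
from collections import Counter, defaultdict

def random_undersample(labels, seed=42):
    """Return row positions so that every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    positions_by_class = defaultdict(list)
    for position, label in enumerate(labels):
        positions_by_class[label].append(position)

    # Every class is cut down to the size of the smallest class
    target = min(len(positions) for positions in positions_by_class.values())

    kept = []
    for label in sorted(positions_by_class):
        kept.extend(rng.sample(positions_by_class[label], target))
    return sorted(kept)

# Toy labels with a skew similar in spirit to the review sample (hypothetical data)
toy_labels = [2] * 6 + [0] * 3 + [1] * 2
kept_positions = random_undersample(toy_labels)
balanced_counts = Counter(toy_labels[i] for i in kept_positions)
print(balanced_counts)
```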

<h4>
    After under-sampling, every class has the same number of reviews. The model gets an equal opportunity to learn each class, and accuracy is no longer inflated by simply predicting the majority class.
</h4>

In [17]:
Y.value_counts()

sentiment
0    35000
1    35000
2    35000
Name: count, dtype: int64

In [18]:
px.histogram(Y, x = "sentiment", color = "sentiment",width=1000)

<h4>
    The next step is to preprocess the review text. I have used simple_preprocess from gensim for the first step. It lowercases the text, strips punctuation, removes accents (deacc=True) and tokenizes the text. By default it also drops tokens shorter than 2 or longer than 15 characters.
</h4>

In [20]:
X = X.apply(lambda x: simple_preprocess(x["text"],deacc=True),axis=1)
X

2893361    [was, in, las, vegas, with, some, friends, las...
5247916    [should, have, done, some, research, or, looke...
56197      [this, place, was, great, our, fam, night, eve...
14430      [thanksgiving, dinner, was, so, much, better, ...
3238309    [just, had, to, take, to, some, kind, of, soci...
                                 ...                        
4645274    [if, you, want, fine, dining, this, place, isn...
4813124    [this, was, celebration, dinner, and, it, tota...
2513110    [the, food, is, amazing, the, garlic, ramen, w...
1994163    [love, this, place, not, only, do, we, buy, al...
1846833    [came, in, for, lunch, and, had, an, amazing, ...
Length: 105000, dtype: object

<h4>
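    Next, I lemmatize the tokens to reduce them to a root form, so that variants of the same word share one vocabulary entry. Note that WordNetLemmatizer treats every token as a noun unless a part-of-speech tag is passed, which is why "was" becomes "wa" in the vocabulary shown later.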
</h4>

In [22]:
lemmatizer = WordNetLemmatizer()
X = X.apply(lambda x: [lemmatizer.lemmatize(i) for i in x])

<h4>
    Creating the vocabulary. I have also specified a few special tokens that mark the start of text, the end of text, unknown words (not in the vocabulary) and padding.
    <br>
    max_tokens is set to 10,000 so that the model focuses on the most frequent words and is not distracted by rare ones. min_freq=2 additionally drops tokens that occur only once.
</h4>

In [24]:
pad_token = '<pad>'
start_token = '<sos>'
end_token = '<eos>'
unknown_token = '<unk>'
max_tokens = 10000
vocab = build_vocab_from_iterator(X,min_freq=2,specials=[pad_token,start_token,end_token,unknown_token],special_first=True,max_tokens=max_tokens)

In [25]:
vocab.get_itos()[:10]

['<pad>', '<sos>', '<eos>', '<unk>', 'the', 'and', 'to', 'wa', 'it', 'of']

<h4>
    Setting the default index to the unknown token, so that any out-of-vocabulary word is mapped to it.
</h4>

In [27]:
vocab.set_default_index(vocab[unknown_token])
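
<h4>
    The following plain-Python sketch approximates what the vocabulary does: special tokens come first, remaining tokens are ranked by frequency, min_freq and max_tokens trim the list, and unknown words fall back to the unknown index. The names build_vocab and lookup are hypothetical, and the alphabetical tie-breaking is an assumption; this is not torchtext's implementation.
</h4>

```python
from collections import Counter

def build_vocab(token_lists, specials, min_freq=1, max_tokens=None):
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Most frequent first; ties broken alphabetically (assumed ordering)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    words = [word for word, freq in ranked if freq >= min_freq]
    if max_tokens is not None:
        # The special tokens count towards the max_tokens budget
        words = words[: max_tokens - len(specials)]
    itos = list(specials) + words
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi

toy_reviews = [
    ["the", "food", "wa", "great"],
    ["the", "service", "wa", "slow"],
    ["great", "food"],
]
specials = ["<pad>", "<sos>", "<eos>", "<unk>"]
itos, stoi = build_vocab(toy_reviews, specials, min_freq=2, max_tokens=7)

def lookup(token):
    # Out-of-vocabulary tokens map to the unknown index
    return stoi.get(token, stoi["<unk>"])

print(itos)
print(lookup("food"), lookup("slow"))
```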

<h4>
    Creating device-agnostic code. The device is set to the GPU when CUDA is available and to the CPU otherwise.
</h4>

In [29]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

<h4>
    Creating the label tensor
</h4>

In [31]:
Y = torch.tensor(Y.to_numpy(),device=device)
Y

tensor([0, 0, 0,  ..., 2, 2, 2], device='cuda:0')

<h4>
    Splitting our data in training and testing set
</h4>

In [33]:
train_x, test_x, train_y, test_y = train_test_split(X.to_list(),Y,random_state=42,test_size=.3)

In [34]:
len(train_x)

73500

In [35]:
len(train_y)

73500

In [36]:
len(test_x)

31500

In [37]:
len(test_y)

31500

<h4>
    Creating our Text Transformation sequence.
</h4>

In [39]:
max_sequence_len = 512

text_transform = T.Sequential(
    # Convert the sentences to indices based on the given vocabulary
    T.VocabTransform(vocab=vocab),
    # Add start_token at the beginning of each sentence.
    T.AddToken(vocab[start_token], begin=True),
    # Crop the sentence if it is longer than the specified max length
    T.Truncate(max_seq_len=max_sequence_len),
    # Add end_token at the end of each sentence.
    T.AddToken(vocab[end_token], begin=False),
    # Convert the list of lists to a tensor. Shorter sentences are padded with pad_token
    # up to the longest sentence in the current batch, so all rows have the same length.
    T.ToTensor(padding_value=vocab[pad_token])
)
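
<h4>
    As an illustration, the steps of this pipeline can be mimicked in plain Python on a toy batch. The function transform_batch and the toy vocabulary are hypothetical. Because the end token is added after truncation, the longest possible row has max_seq_len + 1 entries.
</h4>

```python
def transform_batch(token_lists, stoi, max_seq_len,
                    sos="<sos>", eos="<eos>", pad="<pad>", unk="<unk>"):
    rows = []
    for tokens in token_lists:
        ids = [stoi.get(token, stoi[unk]) for token in tokens]  # tokens -> indices
        ids = [stoi[sos]] + ids                                  # prepend start token
        ids = ids[:max_seq_len]                                  # truncate
        ids = ids + [stoi[eos]]                                  # append end token
        rows.append(ids)
    # Pad every row to the longest row in this batch
    width = max(len(ids) for ids in rows)
    return [ids + [stoi[pad]] * (width - len(ids)) for ids in rows]

toy_stoi = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3,
            "great": 4, "food": 5, "slow": 6}
batch = transform_batch(
    [["great", "food", "today", "really"], ["slow"]],
    toy_stoi,
    max_seq_len=3,
)
print(batch)
```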

<h3>
    Defining the neural network class. It consists of the following layers:
    <h4>
    <ol>
        <li>Embedding Layer - learns a dense vector (word embedding) for each token in our vocabulary.</li>
        <li>LSTM Layers - process the embedded tokens in order. LSTMs are a good fit for NLP and other sequential data such as speech or time series.</li>
        <li>Fully Connected Layer - a final linear layer that maps the LSTM output at each time step to one score per sentiment class.</li>
    </ol>
        The helper function init_hidden creates zero-filled hidden and memory (cell) state tensors for a given batch_size on the given device.
    </h4>
</h3>

In [41]:
class SentimentAnalysis(nn.Module):
    def __init__(self,vocab_size, embedding_dim, num_lstm_layers, hidden_dim, output_dim,dropout):
        super().__init__()
        self.num_lstm_layers = num_lstm_layers
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size,embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=self.hidden_dim, num_layers=self.num_lstm_layers, batch_first=True, dropout=dropout)
        self.fc = nn.Linear(self.hidden_dim,output_dim)
    
    def forward(self,input_batch,hidden_in,mem_in):
        output = self.embed(input_batch)
        output, _ = self.lstm(output,(hidden_in,mem_in))
        return self.fc(output)

    def init_hidden(self,batch_size,device):
        hidden = torch.zeros(self.num_lstm_layers,batch_size,self.hidden_dim).to(device)
        memory = torch.zeros(self.num_lstm_layers,batch_size,self.hidden_dim).to(device)
        return (hidden, memory)
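
<h4>
    The forward pass returns one score vector per time step, and the training loop further down reads the last one (pred[:,-1,:]). Since padding is appended at the end, the last step of a shorter review in a batch is a padding position. The sketch below uses toy sizes and hypothetical names to show the output shape and one possible variation: reading the output at the last non-padding token. It assumes the padding index is 0 and is not what this notebook does.
</h4>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

pad_index = 0  # assumption: the padding token has index 0
batch = torch.tensor([
    [1, 5, 6, 2],   # full-length review
    [1, 7, 2, 0],   # shorter review, padded at the end
])

embed = nn.Embedding(10, 4)
lstm = nn.LSTM(input_size=4, hidden_size=8, num_layers=2, batch_first=True)
fc = nn.Linear(8, 3)

# When no initial state is passed, nn.LSTM starts from zeros
output, _ = lstm(embed(batch))
logits = fc(output)                      # shape: (batch, seq_len, num_classes)

last_step = logits[:, -1, :]             # scores at the final position of every row

# Variation: pick the scores at the last real (non-padding) token of each row
lengths = (batch != pad_index).sum(dim=1)
rows = torch.arange(batch.size(0))
last_real = logits[rows, lengths - 1, :]

print(logits.shape, lengths.tolist())
```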

<h3>
    Initializing our LSTM model with hyperparameters chosen after several experimental runs. They give good accuracy, as we will see in later stages.
</h3>

In [43]:
num_lstm_layers = 2 
embedding_dim = 64 
hidden_dim = 256 
output_dim = 3
dropout = 0.5

sentiment_classifier = SentimentAnalysis(vocab_size=len(vocab),embedding_dim=embedding_dim,num_lstm_layers=num_lstm_layers,hidden_dim=hidden_dim,output_dim=output_dim,dropout=dropout).to(device)

print(sentiment_classifier)

SentimentAnalysis(
  (embed): Embedding(10000, 64)
  (lstm): LSTM(64, 256, num_layers=2, batch_first=True, dropout=0.5)
  (fc): Linear(in_features=256, out_features=3, bias=True)
)


<h4>
    For the loss function we use cross-entropy loss, and for the optimizer we use Adam, which works well with LSTM models.
</h4>

In [45]:
loss_fun = nn.CrossEntropyLoss()

# Used the default learning rate of 0.001 which provided good result in our analysis
optimizer = Adam(sentiment_classifier.parameters())

<h4>
    Our model has approximately 15 lakh (about 1.5 million) parameters, which are going to be tuned during the training epochs.
</h4>

In [47]:
# Let's see how many Parameters our Model has!
# numel() returns the number of elements in each parameter tensor
num_model_params = sum(param.numel() for param in sentiment_classifier.parameters())

print(f"This Model Has {num_model_params} (Approximately {round(num_model_params/100000,2)} Lakhs) Parameters!")

This Model Has 1496835 (Approximately 14.97 Lakhs) Parameters!
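
<h4>
    The reported total can be reproduced by hand from the layer sizes. The sketch below assumes the standard PyTorch LSTM layout of four gates, each with input weights, recurrent weights and two bias vectors. The helper name lstm_layer_params is hypothetical.
</h4>

```python
vocab_size, embedding_dim, hidden_dim, output_dim, num_lstm_layers = 10000, 64, 256, 3, 2

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

def lstm_layer_params(input_size, hidden_size):
    # 4 gates x (input weights + recurrent weights) plus two bias vectors per gate
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

# The first layer reads embeddings; later layers read the previous layer's hidden state
lstm_params = sum(
    lstm_layer_params(embedding_dim if layer == 0 else hidden_dim, hidden_dim)
    for layer in range(num_lstm_layers)
)

# Linear layer: weight matrix plus bias
fc_params = hidden_dim * output_dim + output_dim

total_params = embedding_params + lstm_params + fc_params
print(total_params)  # 1496835
```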


<h4>
    Using gen_batches from sklearn to create the batch slices that are used in later stages to train our model in batches.
    <br>
    Choosing a suitable batch size is important for LSTM models. For our analysis we found that batches of 64 work well.
</h4>

In [49]:
batch_size = 64
train_batches = list(gen_batches(len(train_y),batch_size))
test_batches = list(gen_batches(len(test_y),batch_size))
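
<h4>
    Each batch is a slice object that can index a list or a tensor directly. A plain-Python sketch of the same slicing, with a hypothetical helper batch_slices, shows how many batches the 73,500 training and 31,500 test reviews produce and that the last batch is smaller than the rest.
</h4>

```python
def batch_slices(n, batch_size):
    # Consecutive slices covering range(n); the last one may be shorter
    return [slice(start, min(start + batch_size, n)) for start in range(0, n, batch_size)]

train_slices = batch_slices(73500, 64)
test_slices = batch_slices(31500, 64)

last_train = train_slices[-1]
print(len(train_slices), len(test_slices))
print(last_train, last_train.stop - last_train.start)
```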

<h4>
    Initializing Train and Test Loss and Accuracy loggers, which are going to be used later to plot and track our model progress.
</h4>

In [51]:
train_loss_logger = list()
test_loss_loger = list()
train_accuracy_logger = list()
test_accuracy_logger = list()

<h3>
    Training and testing in batches for several epochs. The number of epochs is important for getting good accuracy. Here we train for 9 epochs, which was enough to reach good accuracy.
</h3>
<h4>
    I have also used a tqdm progress bar with an informative postfix to track our model's progress on the go.
</h4>

In [53]:
total_epochs = 9
clip = 5

pbar = trange(total_epochs,desc="Epoch")

train_acc = 0
test_acc = 0
cur_train_loss = 0
cur_test_loss = 0

for epoch in pbar:

    # Shuffling the Train and Test Dataset after each epoch for better training and validation
    
    train_x, train_y = shuffle(train_x, train_y)
    test_x, test_y = shuffle(test_x, test_y)
    
    pbar.set_postfix_str(f"Train Accuracy: {round(train_acc*100,2)}% | Test Accuracy: {round(test_acc*100,2)}% | Train Loss: {round(cur_train_loss,4)} | Test Loss: {round(cur_test_loss,4)}")
    sentiment_classifier.train()
    
    train_acc = 0
    train_losses = list()
    for train_batch in tqdm(train_batches,desc="Training in Batches",leave=False):
        text = train_x[train_batch]
        label_tensor = train_y[train_batch]
        text_tensor = text_transform(text).to(device)
        
        hidden, memory = sentiment_classifier.init_hidden(len(label_tensor),device)

        optimizer.zero_grad()
        
        pred = sentiment_classifier(text_tensor,hidden,memory)

        loss = loss_fun(pred[:,-1,:],label_tensor)

        loss.backward()

        # 'clip_grad_norm' helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(sentiment_classifier.parameters(), clip)
        
        optimizer.step()

        train_losses.append(loss.item())
        
        train_acc += (pred[:,-1,:].argmax(1) == label_tensor).sum()

    cur_train_loss = np.mean(train_losses)
    train_loss_logger.append(cur_train_loss)
    
    train_acc = (train_acc/len(train_y)).item()
    train_accuracy_logger.append(train_acc)


    sentiment_classifier.eval()
    test_acc = 0
    test_losses = list()
    with torch.inference_mode():
        for test_batch in tqdm(test_batches,desc="Testing in Batches",leave=False):
            text = test_x[test_batch]
            label_tensor = test_y[test_batch]

            text = text_transform(text).to(device)

            hidden, memory = sentiment_classifier.init_hidden(len(label_tensor),device)
            
            pred = sentiment_classifier(text,hidden,memory)

            loss = loss_fun(pred[:,-1,:],label_tensor)

            test_losses.append(loss.item())

            test_acc += (pred[:,-1,:].argmax(1) == label_tensor).sum()

    cur_test_loss = np.mean(test_losses)
    test_loss_loger.append(cur_test_loss)
    test_acc = (test_acc/len(test_y)).item()
    test_accuracy_logger.append(test_acc)
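
<h4>
    The gradient clipping step in the loop rescales all gradients together whenever their combined norm exceeds the clip value. A plain-Python sketch of that idea on toy numbers follows. The function clip_by_global_norm is hypothetical and only mirrors the behaviour of clip_grad_norm_; it is not PyTorch's code.
</h4>

```python
import math

def clip_by_global_norm(gradients, max_norm, eps=1e-6):
    # Norm over all gradient values of all parameters taken together
    total_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    # Scale down only when the norm exceeds max_norm; never scale up
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in grad] for grad in gradients], total_norm

toy_gradients = [[3.0, 4.0], [12.0]]          # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(toy_gradients, max_norm=5)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))

small, _ = clip_by_global_norm([[0.3, 0.4]], max_norm=5)  # already below the limit
print(norm_before, round(norm_after, 4), small)
```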
            


In [54]:
model_performance = pd.DataFrame({"Epochs": list(range(1,len(train_loss_logger)+1)),"Train Loss": train_loss_logger,"Test Loss": test_loss_loger,"Train Accuracy": train_accuracy_logger, "Test Accuracy": test_accuracy_logger})
model_performance

Unnamed: 0,Epochs,Train Loss,Test Loss,Train Accuracy,Test Accuracy
0,1,1.097343,1.097328,0.336354,0.339016
1,2,1.099918,1.100476,0.338204,0.365365
2,3,1.097102,1.096292,0.33985,0.340444
3,4,1.095729,1.098802,0.339116,0.337587
4,5,1.009755,0.76844,0.437932,0.637683
5,6,0.652685,0.569506,0.711592,0.753619
6,7,0.529243,0.563355,0.775061,0.74946
7,8,0.471331,0.52491,0.802299,0.777429
8,9,0.421404,0.529671,0.82668,0.777683
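
<h4>
    During the first four epochs the loss stays near 1.097 and accuracy near 34%. That is what a model guessing uniformly among three classes would produce, since the cross-entropy of a uniform prediction is ln 3 ≈ 1.0986. A small worked example in plain Python, with a hypothetical cross_entropy helper:
</h4>

```python
import math

def cross_entropy(logits, target):
    # Numerically stable log-softmax followed by the negative log-likelihood
    highest = max(logits)
    shifted = [x - highest for x in logits]
    log_sum_exp = math.log(sum(math.exp(x) for x in shifted))
    return -(shifted[target] - log_sum_exp)

uniform_loss = cross_entropy([0.0, 0.0, 0.0], target=1)    # no preference for any class
confident_loss = cross_entropy([0.0, 0.0, 4.0], target=2)  # strong, correct preference

print(round(uniform_loss, 4), round(math.log(3), 4))
print(round(confident_loss, 4))
```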


<h4>
    Plotting Train and Test Loss
</h4>

In [56]:
px.line(model_performance,x="Epochs",y=["Train Loss","Test Loss"],markers=True,height=500)

<h4>
    Plotting Train and Test Accuracy
</h4>

In [58]:
px.line(model_performance,x="Epochs", y = ["Train Accuracy","Test Accuracy"],markers=True,height=500)

<h4>
    Our model reaches a best test accuracy of about 78%.
</h4>

In [60]:
max(test_accuracy_logger)

0.7776825428009033

<h4>
    Creating a helper function to use our model to perform sentiment analysis
</h4>

In [62]:
def predict_sentiment(review_data):
    sentiment_map = {0: "negative", 1: "neutral", 2: "positive"}
    sentiment_classifier.eval()
    with torch.inference_mode():
        for index, row in review_data.iterrows():
            original_sentiment = sentiment_map[stars_to_sentiment(row["stars"])]
            # Apply the same preprocessing as in training (including deacc=True)
            review_tokens = [lemmatizer.lemmatize(i) for i in simple_preprocess(row["text"],deacc=True)]
            # text_transform expects a batch, so wrap the single review in a list
            review_tensor = text_transform([review_tokens]).to(device)
            hidden, memory = sentiment_classifier.init_hidden(1,device)
            pred = sentiment_classifier(review_tensor,hidden,memory)
            pred_sentiment = sentiment_map[pred[:,-1,:].argmax(1).item()]
            # Softmax turns the scores into probabilities; report the highest one
            pred_sentiment_probability = pred[:,-1,:].softmax(1).max().item()

            print("Review Text:- ")
            print(row["text"])
            print("========================================================================================================")
            # Single quotes inside the f-string keep this valid on Python versions before 3.12
            print(f"Rating:- {row['stars']}")
            print("========================================================================================================")
            print(f"Actual Sentiment:- {original_sentiment}")
            print("========================================================================================================")
            print(f"Predicted Sentimen:- {pred_sentiment}")
            print(f"Prediction Probability:- {pred_sentiment_probability}")
            print("#########################################################################################################")

<h4>
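    Let's draw a fresh random sample of reviews and perform sentiment analysis on it. Note that the sample comes from the same CSV file, so it is not guaranteed to exclude reviews used in training.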
</h4>

In [120]:
testing_data = pd.read_csv("yelp_review.csv",usecols=["stars","text"]).sample(100)

In [65]:
predict_sentiment(testing_data[testing_data["stars"] < 3].sample(2))

Review Text:- 
I wrote the below review on Elegant Smile and soon after learned they changed their name to Gentle Dental.  This place is dishonest.  I recently took my Plan of Care from this office to my new dentist and was told a lot of the work Elegant Smile told me I needed was not warranted.  It's been a few years, I've had none of the work they listed done and have zero problems.  As I stated in my original review below they charged me over $400 for a bill they told me would only be two $50 copays (I have a contract stating this is what would be charged) and have been sent to collections.  I still refuse to pay and it has not hurt my credit whatsoever.  It's a matter of principal.  You had me sign a form acknowledging what I owe and then you charge me four times that amount?  There is very little I dislike more than a dishonest dentist.  

Original Review for Elegant Smile:
I was told I would have a $50 copay each visit, two visits total for a deep clean,  After my second visit I 

In [121]:
predict_sentiment(testing_data[testing_data["stars"] == 3].sample(2))

Review Text:- 
I am still searching for the perfect Thai meal so met up with a buddy for lunch at this place.  The place is neutrally decorated and was fairly empty as it was Labor Day.  We both ordered our favorites: pad thai and yellow curry.  This place runs a lunch special with rice, salad, and won tons.The food was OK, nothing exciting but definitely Americanized.  I wish them well and think they have the potential to do a brisk lunch business due to their location and pricing.  As for me, I'm continuing my search.
Rating:- 3
Actual Sentiment:- neutral
Predicted Sentimen:- neutral
Prediction Probability:- 0.849919855594635
#########################################################################################################
Review Text:- 
This was my first time at the Wicked Spoon Buffet and I was super excited to see what it was all about. $35 bucks for dinner during the weekend. 

Definitely 5 stars for the decor, ambiance, food presentation and service at this place - it is 

In [67]:
predict_sentiment(testing_data[testing_data["stars"] > 3].sample(2))

Review Text:- 
Love it! Great beer and awesome service. Glad to have this gem within walking distance.
Rating:- 5
Actual Sentiment:- positive
Predicted Sentimen:- positive
Prediction Probability:- 0.9656999707221985
#########################################################################################################
Review Text:- 
Awesome brunch option! Will go back for dinner for sure! Came here with the whole fam and everyone was impressed. I'm not a huge brunch guy (we went for lunch) but when I looked at the brunch menu I was intrigued...and I'm glad I tried it...really cool twist on eggs Benedict with Arepas in place of the English muffins. Can't explain how good this was. Kids had lunch quesadillas and wife had a Chimichanga everything was awesome...we'll be back for sure.
Rating:- 5
Actual Sentiment:- positive
Predicted Sentimen:- positive
Prediction Probability:- 0.9520800709724426
#############################################################################################

<h3>
    Finally, with the above results we can conclude our analysis. Our model reached good accuracy on the test set and also correctly identified the sentiment of the sample reviews shown above.
</h3>

<h4>
    Saving our trained model for future use.
</h4>

In [124]:
joblib.dump(sentiment_classifier,"LSTM_Sentiment_Classifier.pkl")

['LSTM_Sentiment_Classifier.pkl']
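
<h4>
    joblib.dump pickles the whole model object, which ties the file to this exact class definition. A common alternative, and the approach the PyTorch documentation recommends, is to save only the state_dict and load it into a freshly built model. The sketch below uses a small nn.Linear as a stand-in for the trained classifier and a hypothetical file name.
</h4>

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 3)  # stand-in for the trained classifier

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "model_state.pt")  # hypothetical file name

    # Save only the learned tensors, not the Python object
    torch.save(model.state_dict(), path)

    # Rebuild the same architecture, then load the saved weights into it
    restored = nn.Linear(4, 3)
    restored.load_state_dict(torch.load(path))
    restored.eval()

weights_match = all(
    torch.equal(saved, loaded)
    for saved, loaded in zip(model.state_dict().values(), restored.state_dict().values())
)
print(weights_match)
```

<h4>
    For the classifier in this notebook, the restored model would have to be constructed with the same hyperparameters (vocabulary size, embedding_dim, num_lstm_layers, hidden_dim, output_dim) before calling load_state_dict.
</h4>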