<a href="https://colab.research.google.com/github/axel-sirota/implement-nlp-word-embedding/blob/main/module3/Module3_Demo2_Analysing_Sentiment_With_OHE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysing Sentiment

Let's first import everything we need and load the dataset.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob, Word
import nltk
import torch
from torch import nn
import seaborn as sns
nltk.download('punkt')

%matplotlib inline
sns.set(rc={'figure.figsize':(20,20)})
import warnings
warnings.filterwarnings('ignore')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
%%writefile get_data.sh
if [ ! -f yelp.csv ]; then
  wget https://raw.githubusercontent.com/axel-sirota/implement-nlp-word-embedding/main/module3/data/yelp.csv
fi

Overwriting get_data.sh


In [3]:
!bash get_data.sh


In [4]:
path = './yelp.csv'
yelp = pd.read_csv(path)
# Create a new DataFrame that only contains the 5-star and 1-star reviews.
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# Define X and y.
X = yelp_best_worst.text
y = yelp_best_worst.stars.map({1:0, 5:1})
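
To make the filtering and label mapping concrete, here is a minimal sketch on a made-up DataFrame. The `toy` frame and its contents are assumptions for illustration and are not taken from `yelp.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for yelp.csv (not the real reviews).
toy = pd.DataFrame({
    "stars": [5, 3, 1, 4, 5],
    "text": ["great", "ok", "awful", "good", "loved it"],
})

# Keep only the extreme ratings, as in the cell above.
toy_best_worst = toy[(toy.stars == 5) | (toy.stars == 1)]

# Map 1-star -> class 0 (negative) and 5-star -> class 1 (positive).
toy_y = toy_best_worst.stars.map({1: 0, 5: 1})
print(toy_y.tolist())  # [1, 0, 1]
```

Dropping the 2- to 4-star reviews turns the task into a clean binary classification problem.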


## Splitting into train and test sets and defining the model

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

In [6]:
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
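
The vectorizer learns its vocabulary from the training text only and reuses it on the test text. A small sketch on made-up sentences (the `toy_*` names are illustrative assumptions) shows what the resulting document-term matrix looks like:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["good food good service", "bad food"]
toy_test = ["good pizza"]  # "pizza" never appears in the training text

toy_vect = CountVectorizer()
toy_train_dtm = toy_vect.fit_transform(toy_train)  # learns the vocabulary
toy_test_dtm = toy_vect.transform(toy_test)        # reuses it, no refitting

# Vocabulary terms ordered by their column index.
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(vocab)                    # ['bad', 'food', 'good', 'service']
print(toy_train_dtm.toarray())  # [[0 1 2 1], [1 1 0 0]]
print(toy_test_dtm.toarray())   # [[0 0 1 0]]
```

Note that the default settings store word counts ("good" gets a 2 in the first row), and that words unseen during fitting, such as "pizza", are silently dropped. If strict presence/absence features are wanted, `CountVectorizer(binary=True)` is one option.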

In [7]:
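# Densify the sparse document-term matrices and move them to the device as float32.
X_train_tensor = torch.tensor(X_train_dtm.toarray(), dtype=torch.float32, device=device)
X_test_tensor = torch.tensor(X_test_dtm.toarray(), dtype=torch.float32, device=device)
# nll_loss expects integer class indices, so the labels are stored as int64.
y_train = torch.tensor(y_train.values, dtype=torch.long, device=device)
y_test = torch.tensor(y_test.values, dtype=torch.long, device=device)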

In [8]:
model = nn.Sequential(
  nn.Linear(X_train_tensor.shape[1], 2),
  nn.LogSoftmax(dim = 1)
).to(device)

In [9]:
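def forward(X):
  # The model and its inputs already live on `device`, so no extra transfer is needed.
  return model(X)

def loss(y_pred, y):
  # Negative log-likelihood on log-probabilities (the model ends in LogSoftmax).
  return nn.functional.nll_loss(y_pred, y)

def metric(y_pred, y):  # -> accuracy
  # Fraction of samples whose most likely class matches the label.
  return (1 / len(y)) * ((y_pred.argmax(dim = 1) == y).sum())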
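
The model ends in `nn.LogSoftmax` and the loss is `nll_loss`; together these should be equivalent to cross-entropy on the raw scores. A minimal check on made-up values (`logits` and `targets` are assumptions for illustration only):

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 2)            # hypothetical raw scores for 4 reviews
targets = torch.tensor([0, 1, 1, 0])  # hypothetical labels

# Two-step version, mirroring the model (LogSoftmax) plus loss (nll_loss).
log_probs = nn.functional.log_softmax(logits, dim=1)
nll = nn.functional.nll_loss(log_probs, targets)

# One-step version on the raw scores.
xent = nn.functional.cross_entropy(logits, targets)

print(torch.allclose(nll, xent))  # True
```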


## Let's verify the metric makes sense

In [10]:
y_train_pred = model(X_train_tensor).to(device)
y_train_pred.argmax(dim=1)

tensor([0, 0, 1,  ..., 1, 1, 1], device='cuda:0')

In [11]:
(y_train_pred.argmax(dim = 1) == y_train).sum()

tensor(1619, device='cuda:0')

In [12]:
metric(y_train_pred, y_train)

tensor(0.4954, device='cuda:0')
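
An accuracy close to 0.5 is what we would expect from an untrained two-class model. To see the argmax-and-compare logic on numbers small enough to check by hand, here is a sketch with made-up predictions (`toy_pred` and `toy_y` are illustrative assumptions):

```python
import torch

# Hypothetical log-probabilities for 4 samples and 2 classes.
toy_pred = torch.log(torch.tensor([[0.9, 0.1],
                                   [0.2, 0.8],
                                   [0.6, 0.4],
                                   [0.3, 0.7]]))
toy_y = torch.tensor([0, 1, 1, 1])

predicted_classes = toy_pred.argmax(dim=1)  # tensor([0, 1, 0, 1])
accuracy = (predicted_classes == toy_y).float().mean()
print(accuracy.item())  # 0.75
```

Three of the four predictions match the labels, hence 0.75.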

In [13]:
del y_train_pred

## The training routine

In [14]:
optimizer = torch.optim.AdamW(model.parameters())

In [15]:
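epochs = 1000
for i in range(epochs):
  y_pred = forward(X_train_tensor)       # full-batch forward pass
  xe = loss(y_pred, y_train)             # cross-entropy (NLL on log-probabilities)
  accuracy = metric(y_pred, y_train)
  xe.backward()                          # accumulate gradients
  if i % 100 == 0:
    print("Loss: ", xe, " Accuracy ", accuracy.data.item())
  optimizer.step()                       # update the weights
  optimizer.zero_grad()                  # reset gradients for the next epoch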

Loss:  tensor(0.6942, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.4954100549221039
Loss:  tensor(0.1341, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9914320707321167
Loss:  tensor(0.0742, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9951040744781494
Loss:  tensor(0.0502, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9963280558586121
Loss:  tensor(0.0372, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9972460269927979
Loss:  tensor(0.0293, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9972460269927979
Loss:  tensor(0.0239, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9981640577316284
Loss:  tensor(0.0200, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9984700083732605
Loss:  tensor(0.0171, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9987760186195374
Loss:  tensor(0.0149, device='cuda:0', grad_fn=<NllLossBackward0>)  Accuracy  0.9987760186195374
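
The same forward, backward, step cycle can be tried on a tiny problem. The sketch below uses made-up, linearly separable data; the `toy_*` names and the learning rate of 0.1 are assumptions chosen so that a few steps are enough, not values from the notebook above.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical toy data: the label is 1 when feature 0 is larger than feature 1.
toy_X = torch.tensor([[2., 0.], [3., 1.], [0., 2.], [1., 3.]])
toy_y = torch.tensor([1, 1, 0, 0])

toy_model = nn.Sequential(nn.Linear(2, 2), nn.LogSoftmax(dim=1))
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=0.1)  # lr is an assumption

losses = []
for step in range(50):
    toy_optimizer.zero_grad()                                   # clear old gradients
    toy_loss = nn.functional.nll_loss(toy_model(toy_X), toy_y)  # forward pass + loss
    toy_loss.backward()                                         # compute gradients
    toy_optimizer.step()                                        # update the weights
    losses.append(toy_loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Calling `zero_grad()` at the start of each iteration or right after `step()` has the same effect, as long as gradients are cleared once per update.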


In [16]:
y_test_pred = forward(X_test_tensor)
print(f'Model accuracy is {metric(y_test_pred, y_test)}')

Model accuracy is 0.8948655724525452
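
Test accuracy (about 0.89) is noticeably lower than the final training accuracy (about 0.999), which suggests the model overfits the training reviews to some degree. When only evaluating, it is also common to switch off gradient tracking. A minimal sketch, using a stand-in model and random data (all `toy_*` names are assumptions):

```python
import torch
from torch import nn

torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(5, 2), nn.LogSoftmax(dim=1))  # stand-in model
toy_X = torch.rand(8, 5)           # hypothetical features
toy_y = torch.randint(0, 2, (8,))  # hypothetical labels

toy_model.eval()       # no effect for this model; matters with dropout or batch norm
with torch.no_grad():  # skip building the autograd graph during evaluation
    toy_pred = toy_model(toy_X)
    toy_acc = (toy_pred.argmax(dim=1) == toy_y).float().mean().item()

print(toy_pred.requires_grad)  # False
```

This does not change the predictions; it only saves memory and makes the intent explicit.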


## Some manual validation

In [20]:
review = np.array(["This place was fantastic"])
vectorized_review = torch.Tensor(vect.transform(review).toarray()).to(device)

In [22]:
prediction = forward(vectorized_review)
prediction.argmax(dim = 1)

tensor([1], device='cuda:0')

The model outputs class 1, so it correctly classifies this review as positive.
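
Because the model returns log-probabilities, exponentiating its output gives class probabilities, which can be more informative than the bare class index. A small sketch on a made-up output (`toy_log_probs` is an assumption, not the model's actual output for this review):

```python
import torch

# Hypothetical log-softmax output for one review (not taken from the model above).
toy_log_probs = torch.log_softmax(torch.tensor([[-1.0, 2.0]]), dim=1)

toy_probs = toy_log_probs.exp()  # probabilities; each row sums to 1
toy_class = toy_probs.argmax(dim=1).item()
toy_label = {0: "negative", 1: "positive"}[toy_class]

print(toy_label)  # positive
```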