<a href="https://colab.research.google.com/github/abhiraman/data_mining/blob/master/Sentiment_Analysis_Sequence_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Sentiment analysis is a sequence classification problem. Here the labels are positive and negative.

Data Set : https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv

https://gist.github.com/HarshTrivedi/f4e7293e941b17d19058f6fb90ab0fec

In [1]:
import nltk
from nltk.corpus import stopwords
import pandas as pd
import regex as re
from sklearn.model_selection import train_test_split
import plotly.express as px
from nltk.stem.wordnet import WordNetLemmatizer
from pprint import pprint
from collections  import Counter
import torch
import torch.nn as nn
from torch.utils.data import Dataset,DataLoader
from torch.nn.utils.rnn import pack_padded_sequence,pad_packed_sequence
import torch.optim as optim
from sklearn.metrics import accuracy_score
from tqdm import tqdm

In [2]:
nltk.download(["stopwords","wordnet"])
%cd /root/nltk_data/corpora/stopwords
stop_Words = stopwords.words("english")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
/root/nltk_data/corpora/stopwords


In [3]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [4]:
%cd /gdrive/MyDrive/IMDB_Senti_Analysis
!ls

/gdrive/MyDrive/IMDB_Senti_Analysis
'IMDB Dataset.csv'   IMDB_Words_Vocab.csv   model.pt


In [5]:
isCuda = torch.cuda.is_available()
if isCuda:
  Mydevice = torch.device("cuda")
else:
  Mydevice = torch.device("cpu")

In [6]:
main_df = pd.read_csv('IMDB Dataset.csv')

In [7]:
main_df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


# Split Data

In [8]:
## Converting positive -> 1 and negative -> 0
main_df["sentiment"] = main_df["sentiment"].map({"positive":1,"negative":0}) ## One assignment avoids pandas chained-assignment issues

In [9]:
main_df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [10]:
main_df["review"][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [11]:
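sentiCounts = main_df["sentiment"].value_counts()
sentiLabels = ["Positive Review" if k==1 else "Negative Review" for k in sentiCounts.index] ## Label each bar from its own class id
fig = px.bar(x=sentiLabels,y=sentiCounts.values)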
fig.show()

In [12]:
X,Y = main_df["review"].values,main_df["sentiment"].values ## Converting pd.series -> np array
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.4,stratify=Y)
print(X_train.shape,X_test.shape)

(30000,) (20000,)



---
# Cleaning Data - Tokenization
***



In [13]:
def _string_cleanUp(arrOf_strs):
  listOf_Strs = []
  for e_str in arrOf_strs:
    e_str = e_str.lower()   ## Lowercase the entire string
    e_str = re.sub(r'<[^>]*>','',e_str) ## Remove HTML tags
    e_str = re.sub(r"[^\w\s]", ' ', e_str) ## Replace special characters with a space
    e_str = re.sub(r"\d", '', e_str) ## Remove digits from the string
    listOf_Strs.append(e_str)
  return listOf_Strs

Cleaned_Sentences = _string_cleanUp(X_train)
for e_line in Cleaned_Sentences[0:5]:
  print(e_line)
  print("\n")

red sonja is a career step in the wrong direction for arnold schwarzenegger  having made a couple of sword  n  sorcery films  as conan  he had moved onto slightly more serious acting roles in films like the terminator and commando  only to make a mystifying return to the sword  n  sorcery genre for this  debacle  it s hard to figure out why he bothered  as this is weaker than both conan films in every conceivable department  allegedly  this was to have been the third conan film  but for one reason or another the emphasis was shifted onto the leading female character  the titular red head  leaving poor old arnold to play an incredibly dull supporting role  spare a thought  too  for director richard fleischer who had given the world classics like   leagues under the sea  fantastic voyage  the boston strangler and  rillington place  in this   his penultimate film   fleischer also has taken a gigantic career step backwards evil queen gedren  sandahl bergman  wants to rule the world  and sh
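To see what each cleaning step does, here is a minimal sketch that applies the same four substitutions to one made-up review. It uses the standard-library `re` module rather than `regex`, and the sample string and the `clean_review` name are illustrative assumptions, not part of the notebook.

```python
import re

def clean_review(text):
    """Apply the same four cleaning steps as _string_cleanUp to one string."""
    text = text.lower()                    # lowercase everything
    text = re.sub(r"<[^>]*>", "", text)    # drop HTML tags such as <br />
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation becomes a space
    text = re.sub(r"\d", "", text)         # digits are removed outright
    return text

sample = "Great movie!<br />10/10, would watch again."   # made-up review
cleaned = clean_review(sample)
tokens = cleaned.split()   # split() with no argument collapses runs of spaces
print(tokens)              # ['great', 'movie', 'would', 'watch', 'again']
```

Splitting on a single space instead, as `_token_StringList` does, leaves empty strings wherever punctuation or digits were removed, which is why that function skips empty tokens.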

In [14]:
def _token_StringList(StrList,lemObj):
  stopSet = set(stop_Words) ## Set lookup is much faster than scanning the list for every word
  wordList = []
  for eLine in StrList:
    for eWord in eLine.split(" "):
      if eWord == '' or eWord in stopSet:continue ## Skip empty strings and stop words
      wordList.append(lemObj.lemmatize(eWord))
  return wordList

wl = WordNetLemmatizer()
wordToken = _token_StringList(Cleaned_Sentences,wl)

#wordToken = {ind:word for ind,word in enumerate(spl_strs+wordList)}

In [None]:
wordDict = Counter(wordToken)
print(wordDict)

In [None]:
def _return_most_recurringVocab(worDict):
  spl_strs = ["<pad>"]
  vList = [x[0] for x in sorted(worDict.items(),key=lambda x:x[1],reverse=True)[:1000]]
  return {word:ind for ind,word in enumerate(spl_strs+vList)}

# Train & Test Data(Indexed Vocab)

In [None]:
trainVocab = _return_most_recurringVocab(wordDict)
print(trainVocab)
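The vocabulary step keeps the most frequent words and reserves index 0 for `<pad>`. A minimal sketch of the same idea on toy tokens, using `Counter.most_common` in place of the explicit sort; the `build_vocab` name and the toy data are assumptions for illustration.

```python
from collections import Counter

def build_vocab(tokens, max_words):
    """Map the max_words most frequent tokens to ids 1..max_words; id 0 is <pad>."""
    most_common = [word for word, _ in Counter(tokens).most_common(max_words)]
    return {word: idx for idx, word in enumerate(["<pad>"] + most_common)}

toy_tokens = ["good", "movie", "good", "bad", "movie", "good", "plot"]
toy_vocab = build_vocab(toy_tokens, max_words=2)
print(toy_vocab)   # {'<pad>': 0, 'good': 1, 'movie': 2}

# Words outside the vocabulary ("bad" here) are dropped when indexing.
ids = [toy_vocab[w] for w in ["good", "bad", "movie"] if w in toy_vocab]
print(ids)         # [1, 2]
```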

In [None]:
## Similar activity for Test Data ##
test_Cleaned_Sentences =  _string_cleanUp(X_test)
testWordToken = _token_StringList(test_Cleaned_Sentences,wl)


In [None]:
testVocab = _return_most_recurringVocab(Counter(testWordToken))
print(testVocab)
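Word ids only mean something relative to the vocabulary that produced them, and the embedding table learns against the ids in `trainVocab`. The sketch below, on made-up tokens with a hypothetical `build_vocab` helper, shows that two vocabularies built independently can give the same review different ids, which is a reason to index test reviews with the training vocabulary.

```python
from collections import Counter

def build_vocab(tokens, max_words):
    """Most frequent tokens get ids 1..max_words; id 0 is reserved for <pad>."""
    most_common = [word for word, _ in Counter(tokens).most_common(max_words)]
    return {word: idx for idx, word in enumerate(["<pad>"] + most_common)}

train_tokens = ["good", "good", "good", "bad", "bad", "movie"]   # toy training words
test_tokens = ["bad", "bad", "bad", "movie", "movie", "good"]    # toy test words

train_vocab = build_vocab(train_tokens, 3)   # good -> 1, bad -> 2, movie -> 3
test_vocab = build_vocab(test_tokens, 3)     # bad -> 1, movie -> 2, good -> 3

review = ["good", "movie"]
ids_train = [train_vocab[w] for w in review]
ids_test = [test_vocab[w] for w in review]
print(ids_train, ids_test)   # [1, 3] [3, 2]
```

A model trained on `train_vocab` would read `[3, 2]` as "movie bad", not "good movie".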

# Custom Data Loader 

In [None]:
## Create a custom dataset loader ## 
class _reviews_loader(Dataset):
  def __init__(self,X,Y):
    super().__init__()
    self.X,self.Y = X,Y
    
  
  def __len__(self):
    #d_frame = pd.read_csv(csv_file_name)
    return len(self.X)
  
  def __getitem__(self,idx):
    returnDict = (self.X[idx],self.Y[idx])
    return returnDict


In [None]:
class MyCollateClass():
  def __init__(self,vocabDict = None):
    self.vocabDict = vocabDict

  @staticmethod
  def _string_cleanUp(arrOf_strs):
    ## Same cleaning steps as the module-level _string_cleanUp; static because it needs no instance state
    listOf_Strs = []
    for e_str in arrOf_strs:
      e_str = e_str.lower()   ## Lowercase the entire string
      e_str = re.sub(r'<[^>]*>','',e_str) ## Remove HTML tags
      e_str = re.sub(r"[^\w\s]", ' ', e_str) ## Replace special characters with a space
      e_str = re.sub(r"\d", '', e_str) ## Remove digits from the string
      listOf_Strs.append(e_str)
    return listOf_Strs

  def _return_indexList(self,OneSentance):
    vocabIndexes = []
    for eWord in OneSentance.split(" "):
      if eWord in self.vocabDict: ## Direct dict lookup; words outside the vocabulary are dropped
        vocabIndexes.append(self.vocabDict[eWord])
    idx_Tensor = torch.LongTensor(vocabIndexes)
    return idx_Tensor
  
  def _stack_Sentance_info(self,max_sentence_len = None,batch_size=None,device='cpu'):
    tensorList,updatedTensorList,seqLengths = [],[],[]
    for ind,eLine in enumerate(self.cleanedList):
      retTensor = self._return_indexList(eLine)
      tensorList.append(retTensor)
    maxTensorSize = max(list((e_Tensor.size()[0] for e_Tensor in tensorList)))
    for e_tensor in tensorList:
      seqLengths.append(e_tensor.size()[0])
      if e_tensor.size()[0]<maxTensorSize:
        diff = maxTensorSize - e_tensor.size()[0]
        newTensor = torch.cat([e_tensor,torch.zeros(diff,dtype=torch.long)]) ## Pad with index 0 (<pad>), keeping the integer dtype
        updatedTensorList.append(newTensor)
      else:updatedTensorList.append(e_tensor)
    finalTensor = torch.stack(updatedTensorList).type(torch.LongTensor).to(device)
    return finalTensor,seqLengths

  def PadCollate(self,batch):
    def _get_max_sentance_len(SentanceList):
      return max(list((len(esentance.split(' ')) for esentance in SentanceList)))
    def _convert_senti_to_int(SentList,device='cpu'):
      sTensor = torch.LongTensor(SentList)
      return sTensor
    batch_Dict = {}
    revList = list((eTuple[0] for eTuple in batch))
    sentiList = list((eTuple[1] for eTuple in batch))
    stacked_senti_tensor = _convert_senti_to_int(sentiList,device=Mydevice).to(Mydevice)
    self.cleanedList = _string_cleanUp(revList)
    maxLen_sentance = _get_max_sentance_len(self.cleanedList)
    stacked_vocab_tensor,seqLengths = self._stack_Sentance_info(maxLen_sentance,len(batch),device=Mydevice)
    batch_Dict = {"Vocab":stacked_vocab_tensor,"Senti":stacked_senti_tensor,"Seqlen":seqLengths}
    return batch_Dict


  def __call__(self,batch):
    return self.PadCollate(batch)

review_dataset = _reviews_loader(X_train,Y_train)
dataloader1 = DataLoader(review_dataset,batch_size = 10,shuffle=True, num_workers=0,collate_fn=MyCollateClass(trainVocab))

In [None]:
for ind,data in enumerate(dataloader1):
  if ind>3:break
  print(data["Vocab"].device)
  print(data["Senti"])
  print(data["Vocab"].shape)
  print(data["Senti"].shape)
  print("seq length",data["Seqlen"])
  print('*'*75)
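Each batch carries `Seqlen`, and the notebook imports `pack_padded_sequence`, but the model never uses either, so the LSTM also reads the padded zeros. As a hedged sketch of how the lengths could be used, the example below pads three toy index sequences and packs them before the LSTM; all sizes and names here are arbitrary assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three toy "reviews" as word ids (0 is reserved for <pad>)
seqs = [torch.tensor([4, 7, 2]), torch.tensor([5]), torch.tensor([3, 9])]
lengths = torch.tensor([len(s) for s in seqs])   # lengths must live on the CPU

padded = pad_sequence(seqs, batch_first=True, padding_value=0)   # shape (3, 3)

embed = nn.Embedding(10, 4, padding_idx=0)
lstm = nn.LSTM(input_size=4, hidden_size=6, batch_first=True)

# enforce_sorted=False lets the batch stay in its original order
packed = pack_padded_sequence(embed(padded), lengths, batch_first=True, enforce_sorted=False)
packed_out, (hid, cell) = lstm(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(padded)        # zeros fill the short rows
print(out.shape)     # torch.Size([3, 3, 6])
print(out_lengths)   # tensor([3, 1, 2])
```

With packing, `hid` holds the state at each sequence's last real token rather than at the last padded position.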


MODEL
---

In [None]:
class SentiClassify_Model(nn.Module):
  def __init__(self,vocabLen,dims,hidden_size,batchSize,numLayers,output_size=2):
    super().__init__()
    #output_size =  2
    self.hidden_size = hidden_size
    self.batchSize = batchSize
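    self.numLayers = numLayers
    self.num_dirns = 2 ## The LSTM below is always bidirectional; forward() needs this even if init_hiddenlayer() was never called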
    self.embed = nn.Embedding(vocabLen,dims)
    self.lstm_cell = nn.LSTM(input_size=dims,hidden_size=hidden_size,batch_first =True,
                             num_layers=self.numLayers,bidirectional=True)
    self.lf = nn.Linear(self.hidden_size,output_size)
    self.dp_out = nn.Dropout(p=0.3)
    self.F = nn.ReLU(inplace=False)
    
    

  def forward(self,input,hidden =None,bsize = None,verbose=False):
    embeds = self.embed(input)
    output,(hid,cell) = self.lstm_cell(embeds,hidden)
    if bsize!=None:
      hid = hid.view(self.numLayers*self.num_dirns,bsize, self.hidden_size)
    else:
      hid = hid.view(self.numLayers*self.num_dirns,self.batchSize, self.hidden_size)
    # Keep only the last entry of the hidden state: with bidirectional=True this is the backward direction of the last layer
    hid = hid[-1]
    # hid is already (batch, hidden_size) here, so squeeze(0) only has an effect when batch == 1
    hid = self.dp_out(hid.squeeze(0))
    linear = self.lf(hid)
    if verbose:
      print("input shape",input.shape)
      print('embed shape',embeds.shape)
      print("Last hidden state",hid.size())
      print("After Fully conn layer :",linear.size())
    return linear

  def init_hiddenlayer(self,num_dirns=1,device='cpu'):
    self.num_dirns = num_dirns
    return (torch.zeros(self.numLayers*num_dirns,self.batchSize,self.hidden_size,device=device),torch.zeros(self.numLayers*num_dirns,self.batchSize ,self.hidden_size,device=device))
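In `forward`, `hid[-1]` selects a single slice of the final hidden state. In PyTorch's layout for a bidirectional LSTM that slice is the backward direction of the last layer. The sketch below checks this on random data and shows one common alternative, concatenating both directions. That alternative would need `nn.Linear(2*hidden_size, output_size)`, so it is an illustration under assumed sizes, not a drop-in change to the model above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_layers, batch, seq_len, dims, hidden = 2, 3, 5, 4, 6   # arbitrary toy sizes

lstm = nn.LSTM(input_size=dims, hidden_size=hidden, num_layers=num_layers,
               batch_first=True, bidirectional=True)
x = torch.randn(batch, seq_len, dims)
out, (hid, cell) = lstm(x)          # hid: (num_layers * 2, batch, hidden)

# Separate the layer and direction axes: (layers, directions, batch, hidden)
hid_by_layer = hid.reshape(num_layers, 2, batch, hidden)
last_fwd = hid_by_layer[-1, 0]      # forward direction, last layer
last_bwd = hid_by_layer[-1, 1]      # backward direction, last layer

# Concatenate both directions into one feature vector per review
both = torch.cat([last_fwd, last_bwd], dim=1)   # (batch, 2 * hidden)

print(hid.shape, both.shape)
print(torch.equal(hid[-1], last_bwd))   # hid[-1] is the backward state only
```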



In [None]:
def computeAccuracy(target,source):
  sf_max_obj = nn.Softmax(dim=1)
  sf_max = sf_max_obj(source)
  sf_max = torch.argmax(sf_max,dim=1)
  fintensor = torch.where(sf_max==1,1,0) ## 1-> positive ,0->Negative
  score = accuracy_score(target.tolist(),fintensor.tolist())
  return score


def infer(dataLoader,net,device,batchSize=None):
  net.eval().to(device)
  allScores = []
  if torch.is_tensor(dataLoader):
    source = net(dataLoader,bsize=batchSize)
    sf_max_obj = nn.Softmax(dim=1)
    sf_max = sf_max_obj(source)
    sf_max = torch.argmax(sf_max,dim=1)
    fintensor = torch.where(sf_max==1,1,0)
    return fintensor
    
  with torch.no_grad():   ## Gradients are not needed during evaluation
    for ind,data in enumerate(dataLoader):
      wordInput,seqLengths,targets = data["Vocab"],data["Seqlen"],data["Senti"]
      if wordInput.size()[0] != batchSize:continue ## Skip an incomplete final batch
      source = net(wordInput)
      allScores.append(computeAccuracy(targets,source))
  net.train().to(device)
  return sum(allScores)/len(allScores)   ## Mean accuracy over all full test batches ##
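`computeAccuracy` and `infer` apply softmax before `argmax`. Softmax preserves the ordering of the scores, so the predicted class is the same with or without it. A minimal pure-Python sketch on made-up logits; the helper names and values are assumptions for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)

logits = [[0.45, 0.5], [2.0, -1.0], [-3.0, -0.5]]   # made-up model outputs
targets = [1, 0, 0]                                 # made-up labels

preds_from_logits = [argmax(r) for r in logits]
preds_from_probs = [argmax(softmax(r)) for r in logits]
accuracy = sum(p == t for p, t in zip(preds_from_logits, targets)) / len(targets)

print(preds_from_logits, preds_from_probs)   # [1, 0, 1] [1, 0, 1]
print(round(accuracy, 3))                    # 0.667
```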


In [None]:
def _trainLoader(model=None,Train_dataset=None,Test_Loader = None, batchSize =None,vocabList = None,optimFn =None,loss_func=None,epochs=1,device='cpu',lr=0.005,num_dirns=2):
  ## num_dirns is an added argument; it defaults to 2 because the model's LSTM is bidirectional
  minLoss = float("inf")
  dataloader1 = DataLoader(Train_dataset,batch_size = batchSize,shuffle=True, num_workers=0,collate_fn=MyCollateClass(vocabList))

  loss_per_epoch,train_accuracy,Mean_testAccuracy = 0,None,None
  ## Batch Optimization ##
  for i in range(epochs):
    cummLoss = 0
    for ind,data in enumerate(tqdm(dataloader1)):
      if data["Vocab"].size()[0]!=batchSize:continue ## Skip an incomplete final batch
      wordInput,seqLengths,targets = data["Vocab"],data["Seqlen"],data["Senti"]
      hidden = model.init_hiddenlayer(num_dirns = num_dirns,device=device)
      optimFn.zero_grad()
      source = model(wordInput,hidden)
      loss = loss_func(source,targets)
      loss.backward()
      optimFn.step()
      cummLoss+=loss.item()*batchSize ## Batch mean loss scaled by the batch size

    loss_per_epoch = cummLoss/batchSize ## Sum of the per-batch mean losses in this epoch
    train_accuracy = computeAccuracy(targets,source) ## Accuracy on the last full training batch only
    if loss_per_epoch<minLoss:
      minLoss = loss_per_epoch
      torch.save({
          'epoch': i,
          'model_state_dict': model.state_dict(),
          'optimizer_state_dict': optimFn.state_dict(),
          'loss': loss_per_epoch,
          }, "model.pt")
    if Test_Loader is not None:
      Mean_testAccuracy = infer(Test_Loader,model,device,batchSize)
    print("Loss per Epoch : {} , Training Accuracy : {}, Mean Test Accuracy : {}".format(loss_per_epoch,train_accuracy,Mean_testAccuracy))

In [26]:
### Hyperparameters ##
embed_dims = 20
hidden_size = 30
num_LSTMLayers = 2
num_dirns = 2
batchSize = 1000
lr = 0.005
#################################################
review_dataset = _reviews_loader(X_train,Y_train)
test_data =  _reviews_loader(X_test,Y_test)
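## Index the test reviews with trainVocab so word ids match the trained embedding table (the run logged below used testVocab)
testLoader = DataLoader(test_data,batch_size = batchSize,shuffle=True, num_workers=0,collate_fn=MyCollateClass(trainVocab))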
modObj = SentiClassify_Model(len(trainVocab),embed_dims,hidden_size,batchSize,num_LSTMLayers).to(Mydevice)
lossFn = nn.CrossEntropyLoss()
optimm = optim.Adam(modObj.parameters(),lr=lr)
_trainLoader(model=modObj,Train_dataset=review_dataset,Test_Loader = testLoader,batchSize=batchSize, vocabList = trainVocab,loss_func=lossFn,optimFn = optimm, epochs=10,device=Mydevice)

100%|██████████| 30/30 [02:28<00:00,  4.96s/it]
  0%|          | 0/30 [00:00<?, ?it/s]

Loss per Epoch : 19.894886016845703 , Training Accuracy : 0.698, Mean Test Accuracy : 0.56165


100%|██████████| 30/30 [02:28<00:00,  4.95s/it]
  0%|          | 0/30 [00:00<?, ?it/s]

Loss per Epoch : 16.560502350330353 , Training Accuracy : 0.754, Mean Test Accuracy : 0.5803


100%|██████████| 30/30 [02:29<00:00,  4.98s/it]
  0%|          | 0/30 [00:00<?, ?it/s]

Loss per Epoch : 14.02003327012062 , Training Accuracy : 0.799, Mean Test Accuracy : 0.6012000000000001


100%|██████████| 30/30 [02:30<00:00,  5.01s/it]
  0%|          | 0/30 [00:00<?, ?it/s]

Loss per Epoch : 12.18629252910614 , Training Accuracy : 0.836, Mean Test Accuracy : 0.59285


100%|██████████| 30/30 [02:29<00:00,  4.98s/it]
  0%|          | 0/30 [00:00<?, ?it/s]

Loss per Epoch : 11.10161018371582 , Training Accuracy : 0.844, Mean Test Accuracy : 0.6045999999999998


100%|██████████| 30/30 [02:29<00:00,  4.97s/it]
  0%|          | 0/30 [00:00<?, ?it/s]

Loss per Epoch : 10.306888580322266 , Training Accuracy : 0.853, Mean Test Accuracy : 0.6073000000000001


100%|██████████| 30/30 [02:29<00:00,  4.99s/it]
  0%|          | 0/30 [00:00<?, ?it/s]

Loss per Epoch : 9.815760970115662 , Training Accuracy : 0.859, Mean Test Accuracy : 0.6043


100%|██████████| 30/30 [02:30<00:00,  5.00s/it]
  0%|          | 0/30 [00:00<?, ?it/s]

Loss per Epoch : 10.048907577991486 , Training Accuracy : 0.862, Mean Test Accuracy : 0.6012000000000002


100%|██████████| 30/30 [02:29<00:00,  4.98s/it]
  0%|          | 0/30 [00:00<?, ?it/s]

Loss per Epoch : 9.276146978139877 , Training Accuracy : 0.889, Mean Test Accuracy : 0.5977999999999999


100%|██████████| 30/30 [02:29<00:00,  4.98s/it]


Loss per Epoch : 8.978687703609467 , Training Accuracy : 0.88, Mean Test Accuracy : 0.6009
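The printed "Loss per Epoch" is the sum of the per-batch mean losses, because `cummLoss` is divided by `batchSize` and not by the number of samples. Dividing by the 30 batches shown in the progress bar gives the mean cross-entropy per sample. The values below are copied from the log above; the variable names are illustrative.

```python
import math

# "Loss per Epoch" values copied from the training log, epochs 1..10
logged = [19.894886016845703, 16.560502350330353, 14.02003327012062,
          12.18629252910614, 11.10161018371582, 10.306888580322266,
          9.815760970115662, 10.048907577991486, 9.276146978139877,
          8.978687703609467]
batches_per_epoch = 30          # from the tqdm bar: 30/30

mean_loss = [value / batches_per_epoch for value in logged]
chance_level = math.log(2)      # cross-entropy of a 50/50 guess on two classes

for epoch, value in enumerate(mean_loss, start=1):
    print(f"epoch {epoch}: mean loss per sample = {value:.4f}")
print(f"chance level = {chance_level:.4f}")
```

Read this way, the first epoch sits just under the chance level of about 0.693, and the last epoch reaches roughly 0.299.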


In [27]:
def _get_max_sentance_len(SentanceList):
  return max(list((len(esentance.split(' ')) for esentance in SentanceList)))

def _return_indexList(OneSentance,vocabDict):
  vocabIndexes = []
  for eWord in OneSentance.split(" "):
    if eWord in vocabDict: ## Words outside the vocabulary are dropped
      vocabIndexes.append(vocabDict[eWord])
  idx_Tensor = torch.LongTensor(vocabIndexes)
  return idx_Tensor

def _stack_Sentance_info(cleanedList,vocabDict,max_sentence_len = None,batch_size=None,device='cpu'):
  tensorList,updatedTensorList,seqLengths = [],[],[]
  for ind,eLine in enumerate(cleanedList):
    retTensor = _return_indexList(eLine,vocabDict)
    tensorList.append(retTensor)
  maxTensorSize = max(list((e_Tensor.size()[0] for e_Tensor in tensorList)))
  for e_tensor in tensorList:
    seqLengths.append(e_tensor.size()[0])
    if e_tensor.size()[0]<maxTensorSize:
      diff = maxTensorSize - e_tensor.size()[0]
      newTensor = torch.cat([e_tensor,torch.zeros(diff,dtype=torch.long)]) ## Pad with index 0 (<pad>), keeping the integer dtype
      updatedTensorList.append(newTensor)
    else:updatedTensorList.append(e_tensor)
  finalTensor = torch.stack(updatedTensorList).type(torch.LongTensor).to(device)
  return finalTensor,seqLengths

In [87]:
checkpoint = torch.load('model.pt',map_location=Mydevice) ## Load onto the current device, even if the checkpoint was saved on a GPU
modObj.load_state_dict(checkpoint['model_state_dict'])
optimm.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']

In [93]:
sample_reviews = ["This movie was just awesome!","It was a good movie","His acting was top class and twas a great  movie !","His tastes is utterly bad"]
cleanList = _string_cleanUp(sample_reviews)
maxLen = _get_max_sentance_len(cleanList)
wordInputs = _stack_Sentance_info(cleanList,trainVocab,max_sentence_len=maxLen,batch_size=len(cleanList),device=Mydevice)[0]
pred_output = infer(wordInputs,modObj,Mydevice,batchSize=len(wordInputs))
print(pred_output)

tensor([1, 1, 1, 0], device='cuda:0')


In [None]:
sample_tens = torch.tensor([[0.45,0.5],
                            [0.5,0.48]])
print(sample_tens)
bb = torch.argmax(sample_tens,dim=1)
print(bb)
bb = torch.where(bb==1,1,0)
print(bb)

tensor([[0.4500, 0.5000],
        [0.5000, 0.4800]])
tensor([1, 0])
tensor([1, 0])
