
Sentiment analysis is a sequence classification problem. Here the labels are positive and negative.

Data Set : https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv

https://gist.github.com/HarshTrivedi/f4e7293e941b17d19058f6fb90ab0fec

In [2]:
import nltk
from nltk.corpus import stopwords
import pandas as pd
import regex as re
from sklearn.model_selection import train_test_split
import plotly.express as px
from nltk.stem.wordnet import WordNetLemmatizer
from pprint import pprint
from collections  import Counter
import torch
import torch.nn as nn
from torch.utils.data import Dataset,DataLoader
from torch.nn.utils.rnn import pack_padded_sequence,pad_packed_sequence
import torch.optim as optim
from sklearn.metrics import accuracy_score

In [3]:
nltk.download(["stopwords","wordnet"])
%cd /root/nltk_data/corpora/stopwords
stop_Words = stopwords.words("english")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
/root/nltk_data/corpora/stopwords


In [4]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [5]:
%cd /gdrive/MyDrive/IMDB_Senti_Analysis
!ls

/gdrive/MyDrive/IMDB_Senti_Analysis
'IMDB Dataset.csv'   IMDB_Words_Vocab.csv


In [6]:
isCuda = torch.cuda.is_available()
if isCuda:
  device = torch.device("cuda")
else:
  device = torch.device("cpu")

print(device)

cpu


In [7]:
main_df = pd.read_csv('IMDB Dataset.csv')

In [8]:
main_df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


# Split Data

In [9]:
## Converting positive -> 1 and negative -> 0
main_df["sentiment"] = main_df["sentiment"].map({"positive":1,"negative":0})

In [10]:
main_df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [11]:
main_df["review"][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [12]:
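senti_counts = main_df["sentiment"].value_counts()
## Look counts up by label so each bar matches its caption regardless of value_counts() ordering
fig = px.bar(x=["Positive Review","Negative Review"],y=[senti_counts.loc[1],senti_counts.loc[0]])
fig.show()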

In [13]:
X,Y = main_df["review"].values,main_df["sentiment"].values ## Converting pd.series -> np array
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3,stratify=Y)
print(X_train.shape,X_test.shape)

(35000,) (15000,)



---
# Cleaning Data - Tokenization
***



In [14]:
def _string_cleanUp(arrOf_strs):
  listOf_Strs = []
  for e_str in arrOf_strs:
    e_str = e_str.lower()   ## Lower-casing the entire string
    e_str = re.sub(r'<[^>]*>','',e_str) ## Removing HTML tags
    e_str = re.sub(r"[^\w\s]", ' ', e_str) ## Replace special characters with a space
    e_str = re.sub(r"\d", '', e_str) ## Remove digits from the string
    listOf_Strs.append(e_str)
  return listOf_Strs

Cleaned_Sentences = _string_cleanUp(X_train)
for e_line in Cleaned_Sentences[0:5]:
  print(e_line)
  print("\n")

i initially bought this dvd because it had srk and aishwarya rai on the cover and i thought  hey  another film starring aishu and shah rukh  little did i know that aishwarya would only appear in an item number in the last quarter of the film in a song which she shares with srk and helps introduce his character who is in the film for about just  minutes  shakti is a film about a mother s love and endurance  it s a film about transformations  ignorance  coming of age  stepping into the know and embracing the harsh realities of life  the item number in which srk and aishu appear in has nothing to do with the movie  it s actually a dream sequence that occurs while srk s drunken character is knocked unconscious by booze  he dreams that aishwarya rai is this sexy street girl who shows up at his favourite hangout spot one day  dressed scantily and begins to seduce him  the title of the song is  ishq kamina   loosely translated as  love s a bitch    and it is just plain smoking hot  don t miss
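
As a quick check of what each cleaning step does, here is a minimal sketch that applies the same four steps to one made-up review. It uses the standard-library `re` module rather than `regex`, and the helper name `clean_review` and the sample string are assumptions for illustration only.

```python
import re

def clean_review(text):
    """Apply the same four cleaning steps to a single string."""
    text = text.lower()                    # lower-case everything
    text = re.sub(r"<[^>]*>", "", text)    # strip HTML tags such as <br />
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with a space
    text = re.sub(r"\d", "", text)         # drop digits
    return text

sample = "Great film!<br />Rated 10/10."
cleaned = clean_review(sample)
print(cleaned.split())
```

Because punctuation is replaced by a space while digits are simply deleted, the cleaned string can contain runs of spaces; splitting on whitespace afterwards removes them.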

In [15]:
def _token_StringList(StrList,lemObj):
  stopSet = set(stop_Words) ## Set lookup is much faster than scanning the list for every word
  wordList = []
  for eLine in StrList:
    for eWord in eLine.split(): ## split() with no argument also drops empty strings
      if eWord in stopSet:continue ## Skipping stop words
      wordList.append(lemObj.lemmatize(eWord))
  return wordList

wl = WordNetLemmatizer()
wordToken = _token_StringList(Cleaned_Sentences,wl)

#wordToken = {ind:word for ind,word in enumerate(spl_strs+wordList)}

In [16]:
wordDict = Counter(wordToken)
print(wordDict)



In [17]:
def _return_most_recurringVocab(worDict):
  spl_strs = ["<pad>"]
  vList = [x[0] for x in sorted(worDict.items(),key=lambda x:x[1],reverse=True)[:1000]]
  return {word:ind for ind,word in enumerate(spl_strs+vList)}
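
The same kind of vocabulary can also be built with `Counter.most_common`, which avoids sorting the items by hand. The following is a minimal sketch on a toy token list; `build_vocab`, `top_k`, and the tokens are illustrative names and data, not part of the notebook.

```python
from collections import Counter

def build_vocab(tokens, top_k=1000, specials=("<pad>",)):
    """Index the top_k most frequent tokens, reserving the first slots for specials."""
    counts = Counter(tokens)
    frequent = [word for word, _ in counts.most_common(top_k)]
    return {word: idx for idx, word in enumerate(list(specials) + frequent)}

tokens = ["movie", "film", "movie", "good", "film", "movie"]
vocab = build_vocab(tokens, top_k=2)
print(vocab)  # {'<pad>': 0, 'movie': 1, 'film': 2}
```

Keeping `<pad>` at index 0 means a zero-filled padding position never collides with a real word.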

# Train & Test Data(Indexed Vocab)

In [18]:
trainVocab = _return_most_recurringVocab(wordDict)
print(trainVocab)

{'<pad>': 0, 'movie': 1, 'film': 2, 'one': 3, 'like': 4, 'time': 5, 'good': 6, 'character': 7, 'story': 8, 'even': 9, 'would': 10, 'get': 11, 'make': 12, 'see': 13, 'really': 14, 'well': 15, 'scene': 16, 'much': 17, 'bad': 18, 'people': 19, 'great': 20, 'also': 21, 'first': 22, 'show': 23, 'way': 24, 'thing': 25, 'made': 26, 'life': 27, 'could': 28, 'think': 29, 'go': 30, 'watch': 31, 'know': 32, 'love': 33, 'plot': 34, 'actor': 35, 'two': 36, 'many': 37, 'year': 38, 'seen': 39, 'end': 40, 'say': 41, 'acting': 42, 'never': 43, 'look': 44, 'best': 45, 'little': 46, 'man': 47, 'ever': 48, 'better': 49, 'take': 50, 'come': 51, 'work': 52, 'still': 53, 'part': 54, 'something': 55, 'director': 56, 'find': 57, 'back': 58, 'want': 59, 'lot': 60, 'give': 61, 'real': 62, 'guy': 63, 'watching': 64, 'performance': 65, 'woman': 66, 'play': 67, 'old': 68, 'though': 69, 'funny': 70, 'another': 71, 'actually': 72, 'role': 73, 'nothing': 74, 'u': 75, 'going': 76, 'new': 77, 'day': 78, 'every': 79, 'gi

In [19]:
## Similar activity for Test Data ##
test_Cleaned_Sentences =  _string_cleanUp(X_test)
testWordToken = _token_StringList(test_Cleaned_Sentences,wl)


In [20]:
testVocab = _return_most_recurringVocab(Counter(testWordToken))
print(testVocab)

{'<pad>': 0, 'movie': 1, 'film': 2, 'one': 3, 'like': 4, 'time': 5, 'good': 6, 'character': 7, 'story': 8, 'get': 9, 'even': 10, 'would': 11, 'see': 12, 'make': 13, 'really': 14, 'scene': 15, 'well': 16, 'much': 17, 'bad': 18, 'great': 19, 'people': 20, 'also': 21, 'first': 22, 'show': 23, 'way': 24, 'made': 25, 'thing': 26, 'could': 27, 'life': 28, 'think': 29, 'know': 30, 'go': 31, 'watch': 32, 'love': 33, 'seen': 34, 'two': 35, 'many': 36, 'actor': 37, 'plot': 38, 'say': 39, 'never': 40, 'year': 41, 'end': 42, 'acting': 43, 'little': 44, 'look': 45, 'best': 46, 'ever': 47, 'man': 48, 'better': 49, 'take': 50, 'come': 51, 'work': 52, 'still': 53, 'find': 54, 'something': 55, 'director': 56, 'part': 57, 'give': 58, 'want': 59, 'back': 60, 'lot': 61, 'real': 62, 'performance': 63, 'watching': 64, 'funny': 65, 'woman': 66, 'play': 67, 'guy': 68, 'old': 69, 'though': 70, 'another': 71, 'u': 72, 'actually': 73, 'going': 74, 'nothing': 75, 'role': 76, 'every': 77, 'new': 78, 'world': 79, '
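
Note that the two printed dictionaries assign different indices to the same word: for example, index 9 is `'even'` in `trainVocab` but `'get'` in `testVocab`. An embedding layer trained with one mapping cannot interpret the other, so a common approach is to encode the test reviews with the training vocabulary. A minimal sketch of that idea, using a toy vocabulary and a hypothetical `encode` helper:

```python
def encode(sentence, vocab):
    """Map each known word to its index; words outside the vocab are skipped."""
    return [vocab[word] for word in sentence.split() if word in vocab]

# Toy stand-in for a vocabulary built from the training split only
train_vocab = {"<pad>": 0, "movie": 1, "film": 2, "good": 3}

test_sentence = "good movie with unknownword"
print(encode(test_sentence, train_vocab))  # [3, 1]
```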

# Custom Data Loader 

In [21]:
## Create a custom dataset loader ## 
class _reviews_loader(Dataset):
  def __init__(self,csv_file_name = None):
    super().__init__()
    self.d_frame = pd.read_csv(csv_file_name)
  
  def __len__(self):
    return len(self.d_frame)
  
  def __getitem__(self,idx):
    ## Returns a (review text, sentiment label) tuple
    return (self.d_frame["review"][idx],self.d_frame["sentiment"][idx])
review_dataset = _reviews_loader(csv_file_name='IMDB Dataset.csv')    


In [22]:
from torch.nn.utils.rnn import pad_sequence

class MyCollateClass():
  def __init__(self,vocabDict = None):
    self.vocabDict = vocabDict

  def _return_indexList(self,OneSentance):
    ## Words outside the vocabulary are dropped
    vocabIndexes = [self.vocabDict[eWord] for eWord in OneSentance.split() if eWord in self.vocabDict]
    return torch.LongTensor(vocabIndexes)

  def _stack_Sentance_info(self,cleanedList):
    tensorList = [self._return_indexList(eLine) for eLine in cleanedList]
    seqLengths = [e_tensor.size(0) for e_tensor in tensorList]
    ## Right-pad every sequence with 0 (the <pad> index) up to the longest one in the batch
    finalTensor = pad_sequence(tensorList,batch_first=True,padding_value=0)
    return finalTensor,seqLengths

  def _convert_senti_to_int(self,SentList):
    return torch.tensor([1.0 if i=="positive" else 0.0 for i in SentList],dtype=torch.float32)

  def PadCollate(self,batch):
    revList = [eTuple[0] for eTuple in batch]
    sentiList = [eTuple[1] for eTuple in batch]
    stacked_senti_tensor = self._convert_senti_to_int(sentiList)
    cleanedList = _string_cleanUp(revList) ## Module-level helper defined in the cleaning section
    stacked_vocab_tensor,seqLengths = self._stack_Sentance_info(cleanedList)
    return {"Vocab":stacked_vocab_tensor,"Senti":stacked_senti_tensor,"Seqlen":seqLengths}

  def __call__(self,batch):
    return self.PadCollate(batch)

## Note: review_dataset wraps the full CSV, not only the training split
dataloader1 = DataLoader(review_dataset,batch_size = 50,shuffle=True, num_workers=0,collate_fn=MyCollateClass(trainVocab))
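
The padding step inside the collate function can be sketched without PyTorch. This is a minimal illustration on plain lists of word indices; `pad_batch` and the toy batch are hypothetical and only show the idea of right-padding to the longest sequence while remembering the true lengths.

```python
def pad_batch(sequences, pad_idx=0):
    """Right-pad each index list to the longest one; also return the original lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths

batch = [[5, 2, 9], [7], [4, 4]]
padded, lengths = pad_batch(batch)
print(padded)   # [[5, 2, 9], [7, 0, 0], [4, 4, 0]]
print(lengths)  # [3, 1, 2]
```

The lengths are kept because `pack_padded_sequence` needs them to tell real tokens from padding.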

In [23]:
## Test The data from data loader ##
for ind,data in enumerate(dataloader1):
  if ind>3:break
  print(data["Vocab"])
  print(data["Senti"])
  print(data["Vocab"].shape)
  print(data["Senti"].shape)
  print("seq length",data["Seqlen"])
  print('*'*75)



tensor([[247, 165, 959,  ...,   0,   0,   0],
        [117,  47,  62,  ...,   0,   0,   0],
        [765, 709, 292,  ...,   0,   0,   0],
        ...,
        [212, 271,  60,  ...,   0,   0,   0],
        [  1,  48,  39,  ...,   0,   0,   0],
        [160, 816, 259,  ..., 655, 206, 699]])
tensor([0., 1., 1., 0., 0., 1., 1., 0., 1., 1., 0., 0., 1., 1., 1., 1., 1., 1.,
        0., 1., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 1., 1., 0., 1., 1., 0.,
        1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.])
torch.Size([50, 179])
torch.Size([50])
seq length [71, 61, 33, 45, 48, 80, 101, 58, 60, 44, 36, 57, 30, 25, 28, 40, 53, 107, 46, 71, 110, 68, 86, 39, 81, 66, 85, 59, 29, 12, 22, 56, 152, 82, 54, 34, 77, 26, 37, 51, 71, 124, 152, 63, 42, 39, 64, 33, 47, 179]
***************************************************************************
tensor([[  6,  49,   1,  ...,   0,   0,   0],
        [ 79,   3,  13,  ...,   0,   0,   0],
        [ 72,  12,  97,  ...,   0,   0,   0],
        ...,


# Model

In [162]:
class SentiClassify_Model(nn.Module):
  def __init__(self,vocabLen,dims,hidden_size):
    super().__init__()
    self.hidden_size = hidden_size
    self.embed = nn.Embedding(vocabLen,dims,padding_idx=0) ## Index 0 is <pad>
    self.lstm_cell = nn.LSTM(input_size=dims,hidden_size=hidden_size,batch_first=True)
    ## Classify from the final LSTM hidden state. The earlier version flattened all
    ## time steps, whose size changes with the longest sequence in every batch.
    self.lf = nn.Linear(hidden_size,1)

  def forward(self,input,seqLengths,hidden=None,verbose=False):
    embeds = self.embed(input) ## (batch, seq, dims)
    ## pack_padded_sequence rejects zero lengths, so treat empty reviews as one <pad> step
    lengths = torch.as_tensor(seqLengths).clamp(min=1)
    packedSeq = pack_padded_sequence(embeds,lengths,batch_first=True,enforce_sorted=False)
    _,(h_n,_) = self.lstm_cell(packedSeq,hidden) ## hidden=None starts from zeros
    logits = self.lf(h_n[-1]).squeeze(1) ## (batch,) raw scores, no sigmoid/argmax here
    if verbose:
      print("input shape",input.shape)
      print('embed shape',embeds.shape)
      print("Final hidden state",h_n.size())
      print("After Fully conn layer :",logits.size())
    return logits

  def init_hiddenlayer(self,batch_size):
    return (torch.zeros(1,batch_size,self.hidden_size),torch.zeros(1,batch_size,self.hidden_size))



In [163]:
def _trainLoader(model=None,dataloader=None,vocabList = None,optimm =None,loss_func=None,epochs=1):
  ## Hyperparameters ##
  dims = 10
  hidden_size = 20

  ## Build the model once so that its weights persist across batches ##
  modObj = model(len(vocabList),dims,hidden_size)
  if optimm is None:
    print("entering only once")
    optimm = optim.SGD(modObj.parameters(),lr=0.05)

  ## Batch Optimization ##
  for i in range(epochs):
    cummLoss,allTargets,allPreds = 0,[],[]
    for ind,data in enumerate(dataloader):
      wordInput,seqLengths,targets = data["Vocab"],data["Seqlen"],data["Senti"]
      optimm.zero_grad() ## Clear gradients left over from the previous batch
      logits = modObj(wordInput,seqLengths)
      loss = loss_func(logits,targets)
      loss.backward()
      optimm.step()
      cummLoss+=loss.item()
      ## Hard 0/1 predictions are only used for the accuracy metric
      preds = (torch.sigmoid(logits)>0.5).long()
      allPreds.extend(preds.tolist())
      allTargets.extend(targets.long().tolist())

    train_accuracy = accuracy_score(allTargets,allPreds)
    loss_per_epoch = cummLoss/len(dataloader) ## Average loss over the batches
    print("Loss per Epoch : {} , Training Accuracy : {}".format(loss_per_epoch,train_accuracy))
  return modObj



In [164]:
lossFn = nn.BCEWithLogitsLoss()
_trainLoader(model=SentiClassify_Model,dataloader=dataloader1,vocabList = trainVocab,loss_func=lossFn)

entering only once


KeyboardInterrupt: ignored

In [59]:
sample_tens = torch.tensor([[0.45,0.5],
                            [0.5,0.48]])
print(sample_tens)
bb = torch.argmax(sample_tens,dim=1)
print(bb)
bb = torch.where(bb==1,1,0)
print(bb)

tensor([[0.4500, 0.5000],
        [0.5000, 0.4800]])
tensor([1, 0])
tensor([1, 0])
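
The `argmax` above yields hard 0/1 labels, which is useful for reporting accuracy, whereas `nn.BCEWithLogitsLoss` expects one raw score (logit) per example. As a sketch of the arithmetic only, the binary cross-entropy on a logit can be written in pure Python. The function names below are illustrative, and the numerically stable form is a standard rewriting rather than something taken from the notebook.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(x, y):
    """Numerically stable binary cross-entropy for a raw logit x and target y in {0, 1}."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

def bce_naive(x, y):
    """Same quantity computed directly from the sigmoid probability."""
    p = sigmoid(x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

for logit, target in [(2.0, 1), (2.0, 0), (-1.5, 1)]:
    print(logit, target, bce_with_logits(logit, target), bce_naive(logit, target))
```

Both functions agree for moderate logits; the first form avoids taking the logarithm of a probability that has rounded to 0 or 1.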
