<a href="https://colab.research.google.com/github/deekshakoul/Sentiment-Analysis-for-movie-reviews/blob/master/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Build a classifier on IMDB movie dataset using a TF-IDF reresentation and logistic regression(and Naive Bayes)

Taken data from: http://ai.stanford.edu/~amaas/data/sentiment/ 


In [None]:
import os
import csv

dirpath1 = ' ' #path where negative reviews are stored for training
dirpath2 = ' ' #path where positive reviews are stored for training
output = 'output_file.csv'
with open(output, 'w') as outfile:
    csvout = csv.writer(outfile)
    csvout.writerow(['senti', 'review'])

    files = os.listdir(dirpath1)

    for filename in files:
        with open(dirpath1 + '/' + filename) as afile:
            csvout.writerow([0, afile.read()])
            afile.close()

    files = os.listdir(dirpath2)

    for filename in files:
        with open(dirpath2 + '/' + filename) as afile:
            csvout.writerow([1, afile.read()])
            afile.close()

    outfile.close()

# **Text - processing**
* convert to lower case
* get rid of ascents
* tokenization
* create vocab to index
* replace words into numbers: Encoding the reviews
* Encode the labels  -- already done as per script combine.py
    
    [‘positive’ as 1 and ‘negative’ as 0]

**Note:** 

To convert text docs to numbers we can use foll. techniques:

1.   CountVectorizer
2.   TfidfVectorizer
3.   HashingVectorizer 

In this I have used TfidfVectorizer from sklearn, this module takes care of removal of stop words, ascents, lower case and also tokenization.


In [None]:
#!pip install -U nltk==3.4
import random
import nltk
nltk.download('movie_reviews')
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]
#movie_reviews.fileids("pos") --> pos/cv937_9811.txt
#documents is  a list, each element in list is a tuple
#tuple : ( list1 , string) where list1 is a list of strings(words in review)
#A clear eample is documents[0][0][0] --> first review, first word 
# documents[0][1] --> pos or neg of first review
# /content/drive/My Drive/iisc/summer/datasets/IMDB Dataset.csv              

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv('/content/drive/My Drive/iisc/summer/datasets/train.csv')
df = df.sample(frac=1).reset_index(drop=True) #shuffle rows

In [3]:
df_test = pd.read_csv('/content/drive/My Drive/iisc/summer/datasets/test.csv')
df_test = df_test.sample(frac=1).reset_index(drop=True) #shuffle rows

**Note**: No tokenization has been done as that will be handled by tf-idf




**TF-IDF** : converting text to words to integers

**TF:**The number of times a word appears in a document divded by the total number of words in the document. Every document has its own term frequency.
![alt text](https://miro.medium.com/proxy/1*HM0Vcdrx2RApOyjp_ZeW_Q.png)


**IDF:**The log of the number of documents divided by the number of documents that contain the word w. It determines the weight of rare words across all documents in the corpus.

![alt text](https://miro.medium.com/proxy/1*A5YGwFpcTd0YTCdgoiHFUw.png)


Lastly, the TF-IDF is simply the TF multiplied by IDF.
![alt text](https://miro.medium.com/proxy/1*nSqHXwOIJ2fa_EFLTh5KYw.png)

**TfidfVectorizer** class provided by **sklearn** : Convert a collection of raw documents to a matrix of TF-IDF features. It will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents.

This is equivalent to CountVectorizer followed by TfidfTransformer.


**NOTE:** to make sure that both train and test have same dimemsnions for tfidf matrix we need to ::
[SO](https://stackoverflow.com/questions/39170169/identical-dimensions-for-train-and-test-matrices-in-text-analysis)

train_X = vectorizer.fit_transform(train)

test_X = vectorizer.transform(test)

or

vectorizer.fit_transform([train, test])

Using transform instead of fit_transform preserves the vocabulary created from fit_transform in the previous line, and ensures identical columns for these matrices.

In [4]:
#Without going into the math, TF-IDF are word frequency scores that try to highlight
# words that are more interesting, e.g. frequent in a document but not across documents.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer_model = TfidfVectorizer(lowercase=True, strip_accents='ascii', analyzer='word',stop_words='english') #matrix returned is by default float64 

tv_train_reviews=vectorizer_model.fit_transform(df["review"])
#transformed test reviews
tv_test_reviews=vectorizer_model.transform(df_test["review"])
#Now document-term matrix "tv_train_reviews" has the TF-IDF values of all the documents in the corpus. This is a big sparse matrix.


Simple logistic regression technique using TFIDF representation

In [5]:
print(tv_train_reviews.shape) 
#MODELLING 
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)
lr.fit(tv_train_reviews, df['senti'])

#predict using test data
test_predicted = lr.predict(tv_test_reviews)
score = accuracy_score(df_test["senti"],test_predicted)
print(score) #87

(25000, 74515)
0.87928


Multinomial Naive Bayes using TFIDF representation

In [6]:
#training the model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb_tf = nb.fit(tv_train_reviews, df['senti'])
#predict using test data
test_predicted = nb.predict(tv_test_reviews)
score = accuracy_score(df_test["senti"],test_predicted)
print(score)  #83

0.82992


**Do Text Cleaning on my own to verify results:**

*   lower case
*   Stopwords removal
*   special chars removal
*   text stemming



In [7]:
import nltk

#lower cases
df['review'] = df['review'].str.lower()
df_test['review'] = df_test['review'].str.lower()

#Punctuations
import string
def rem_punctuation(text):
  translator = str.maketrans('','',string.punctuation) #!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
  modified_text = text.translate(translator)
  return modified_text;

#remove special charcters
import re
def rem_special_characters(text, remove_digits=True):
    text = re.sub(r'<.*?>',"",text) #remove html tags/urls
    pattern=r'[^a-zA-z0-9\s]' #remove evrythig except
    text = re.sub(pattern,'',text) 
    return text

#remove stopwords    
nltk.download('punkt')
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
from nltk.corpus import stopwords
#set stopwords to english
stop=set(stopwords.words('english'))
def remove_stopwords(text):
  tokens = word_tokenize(text)
  nostop_tokens =  [i for i in tokens if i not in stop];
  text  = " ".join(nostop_tokens)
  return text


#Text Stemming
from nltk.stem.snowball import SnowballStemmer
#from nltk.stem.porter import PorterStemmer
def text_stemmer(text):
  stemmer = SnowballStemmer(language='english')
  tokens = word_tokenize(text)
  text = " ".join([stemmer.stem(token) for token in tokens]) 
  return text  

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [8]:
df['review'] = df['review'].apply(text_stemmer)  
df['review'] = df['review'].apply(rem_special_characters)
df['review'] = df['review'].apply(remove_stopwords)
df_test['review'] = df_test['review'].apply(text_stemmer)  
df_test['review'] = df_test['review'].apply(rem_special_characters)
df_test['review'] = df_test['review'].apply(remove_stopwords)

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tv_train_reviews=vectorizer.fit_transform(df["review"])
tv_test_reviews=vectorizer.transform(df_test["review"])

In [10]:
print(tv_train_reviews.shape) 

from sklearn.linear_model import LogisticRegression
lr=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)
lr.fit(tv_train_reviews, df['senti'])

#predict using test data
test_predicted = lr.predict(tv_test_reviews)
score = accuracy_score(df_test["senti"],test_predicted)
print(score)

(25000, 75527)
0.88132



---
