# Sentiment Analysis Model

**Data description:** The data contains 50K movie reviews posted on the IMDB website. It is divided into two folders, `train` and `test`, each containing 25K observations. Each of these is further divided into two folders, `pos` (positive observations) and `neg` (negative observations), each containing 12.5K observations. While consolidating the data into one file per split (one file with all `train` data and another with all `test` data), I put the 12.5K `pos` observations first, followed by the 12.5K `neg` observations.
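The consolidation step described above is not part of this notebook. A minimal sketch of how it could be done is given here, assuming each review sits in its own `.txt` file under `pos` and `neg`; the function name `consolidate` and the demo file names are hypothetical.

```python
import tempfile
from pathlib import Path


def consolidate(split_dir, out_path):
    """Write every review under split_dir/pos, then split_dir/neg, one per line."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for label in ("pos", "neg"):  # positives first, then negatives
            for review_file in sorted(Path(split_dir, label).glob("*.txt")):
                text = review_file.read_text(encoding="utf-8")
                # Collapse internal whitespace/newlines so each review stays on one line.
                out.write(" ".join(text.split()) + "\n")
                count += 1
    return count


# Tiny demonstration on a throwaway directory (file names and texts are made up).
with tempfile.TemporaryDirectory() as tmp:
    samples = [("pos", "0_9.txt", "Loved it."), ("neg", "0_2.txt", "Dull.\nVery dull.")]
    for label, name, text in samples:
        folder = Path(tmp, "train", label)
        folder.mkdir(parents=True, exist_ok=True)
        (folder / name).write_text(text, encoding="utf-8")
    out_file = Path(tmp, "full_train.txt")
    n_written = consolidate(Path(tmp, "train"), out_file)
    lines = out_file.read_text(encoding="utf-8").splitlines()

print(n_written, lines)  # 2 ['Loved it.', 'Dull. Very dull.']
```

Writing `pos` before `neg` is what later allows the labels to be generated purely from the line position.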


## Loading Data

In [20]:
#Flow >> reading each file line by line and stripping surrounding whitespace from each line.
def read_lines(path):
    with open(path, 'r') as f:  # context manager closes the file after reading
        return [line.strip() for line in f]
test_set = read_lines("./data/movie_data/full_test.txt")
train_set = read_lines("./data/movie_data/full_train.txt")

## Data Preprocessing

In [8]:
## required packages 
import re

In [None]:
#Flow >> iterating list, for each line removing all punctuation and brackets and replacing HTML line-break tags, hyphens and slashes with a space.
no_space = re.compile(r"[.;:!'?,\"()\[\]]")
space = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

preprocess = lambda data : [space.sub(" ", line) for line in [no_space.sub("", line.lower()) for line in data]]

In [None]:
test_set_clean = preprocess(test_set)
train_set_clean = preprocess(train_set)
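To make the effect of the two patterns concrete, here is a small self-contained sketch that applies the same cleaning to one made-up review. The sample text is an assumption for illustration only; `preprocess` is written as a plain function here but does the same thing as the lambda above.

```python
import re

# Same two patterns as in the preprocessing cell above.
no_space = re.compile(r"[.;:!'?,\"()\[\]]")
space = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")


def preprocess(data):
    # Lowercase and drop punctuation first, then turn the separators into spaces.
    return [space.sub(" ", no_space.sub("", line.lower())) for line in data]


# Made-up review used only for illustration.
sample = ['This movie was GREAT!<br /><br />A must-see, 10/10.']
cleaned = preprocess(sample)
print(cleaned)  # ['this movie was great a must see 10 10']
```

Punctuation is deleted outright, while the double line-break tag, hyphens and slashes become spaces so that the words on either side stay separate tokens.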

## Text Processing

In [None]:
## required packages
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Removing stopwords

In [None]:
#Flow >> Iterating list, for each line removing all stop words present in it (e.g. in, of, at, a, the etc.)
eng_stop_words = set(stopwords.words('english'))  # a set makes the membership test fast
rem_stopwords = lambda x : [" ".join([word for word in line.split() if word not in eng_stop_words]) for line in x]

In [None]:
no_stop_train = rem_stopwords(train_set_clean)
no_stop_test = rem_stopwords(test_set_clean)
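As an illustration of the filtering step, the sketch below runs the same logic on two made-up lines. To stay self-contained it uses a small hand-picked stop-word set (`demo_stop_words`, an assumption) instead of the NLTK list, and passes the stop words in as an argument.

```python
# Hand-picked stand-in for NLTK's English stop-word list (an assumption for this demo).
demo_stop_words = {"the", "was", "a", "of", "in", "at", "it", "i"}


def rem_stopwords(lines, stop_words):
    # Keep only the tokens that are not stop words, then re-join with single spaces.
    return [" ".join(w for w in line.split() if w not in stop_words) for line in lines]


demo = ["the movie was a waste of time", "i loved it"]
result = rem_stopwords(demo, demo_stop_words)
print(result)  # ['movie waste time', 'loved']
```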

## Normalizing text



### Stemming

In [None]:
#required packages
from nltk.stem import PorterStemmer

In [None]:
stemmer = PorterStemmer()

In [None]:
#Flow >> Iterating list, for each word in line, reducing that word to its stem, which need not be a dictionary word (eg. sleeping -> sleep, movies -> movi)
stem_text = lambda x : [" ".join([stemmer.stem(word) for word in line.split()]) for line in x]

In [None]:
train_stem = stem_text(no_stop_train)
test_stem = stem_text(no_stop_test)
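A short sketch of what the stemmer does to individual tokens, assuming NLTK is installed (`PorterStemmer` itself needs no corpus download). The input words are made up for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_text(lines):
    # Stem every whitespace-separated token and re-join the line.
    return [" ".join(stemmer.stem(word) for word in line.split()) for line in lines]


stemmed = stem_text(["sleeping movies poorly boring"])
print(stemmed)  # ['sleep movi poorli bore']
```

Stems such as `movi` and `poorli` are not English words, which is why the feature names inspected later in the notebook look truncated.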

## Vectorization 

In [None]:
#required packages
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
c_vect = CountVectorizer(binary=True)

In [None]:
c_vect.fit(train_stem)
X = c_vect.transform(train_stem)
X_test_set = c_vect.transform(test_stem)
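The fit/transform split above can be pictured with a pure-Python stand-in for a binary bag-of-words encoding. This is only a sketch of the idea, not scikit-learn's implementation: the helper names are hypothetical, it tokenizes on whitespace (whereas `CountVectorizer` by default also drops single-character tokens), and it returns dense lists rather than a sparse matrix.

```python
def fit_vocabulary(docs):
    # Map each distinct token in the training documents to a column index.
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform_binary(docs, vocab):
    # One row per document; 1 if the token occurs at least once, else 0.
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in doc.split():
            if tok in vocab:  # tokens never seen during fitting are ignored
                row[vocab[tok]] = 1
        rows.append(row)
    return rows


train_docs = ["great great movi", "bore movi"]
test_docs = ["great plot"]

vocab = fit_vocabulary(train_docs)
X_demo = transform_binary(train_docs, vocab)
X_demo_test = transform_binary(test_docs, vocab)
print(vocab)        # {'bore': 0, 'great': 1, 'movi': 2}
print(X_demo)       # [[0, 1, 1], [1, 0, 1]]
print(X_demo_test)  # [[0, 1, 0]]
```

The vocabulary comes from the training text only, so a word that appears only in the test text (`plot` here) contributes nothing to its row.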

## Scaling data

In [None]:
#required packages
from sklearn.preprocessing import maxabs_scale

In [None]:
X = maxabs_scale(X)
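Because the vectorizer was created with `binary=True`, every column of `X` already has a maximum absolute value of 1, so max-abs scaling should leave the matrix unchanged. This also means the unscaled `X_test_set` stays on the same scale as `X`. The check below illustrates this on made-up documents, assuming scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import maxabs_scale

# Toy documents, made up for this check.
docs = ["great great movi", "bore movi", "great plot"]

X_bin = CountVectorizer(binary=True).fit_transform(docs)
X_scaled = maxabs_scale(X_bin)

# Every column already has a maximum absolute value of 1, so nothing changes.
unchanged = np.array_equal(X_bin.toarray(), X_scaled.toarray())
print(unchanged)  # True
```

With raw counts (`binary=False`) the scaling would matter, and the test matrix would then need the scaler fitted on the training data.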

## Building Model

The aim is to build a model that learns which words signal a positive review (e.g. great, good, excellent) and which signal a negative one (e.g. bad, worst, boring), and uses them to classify each review as positive or negative.

In [None]:
#required packages
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
#labels are the same for train and test: the first 12,500 reviews are positive (1) and the next 12,500 are negative (0), as described at the top
y = [1 if i < 12500 else 0 for i in range(25000)]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)

In [None]:
#parameter tuning of LogisticRegression model
#C - inverse of regularization strength (smaller C means stronger regularization)

score_dict = {}
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr_model = LogisticRegression(C=c)
    lr_model.fit(X_train, y_train)
    score_dict[c] = accuracy_score(y_test, lr_model.predict(X_test))  # argument order is (y_true, y_pred)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [None]:
max_acc = sorted(score_dict.items(), key=lambda tup: tup[1], reverse=True)

print(f"for c = {max_acc[0][0]} accuracy {max_acc[0][1]} is maximum.")

for c = 0.05 accuracy 0.8796363636363637 is maximum.
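The convergence warnings printed by the tuning loop suggest raising `max_iter`. The sketch below shows the same loop with a larger iteration budget on synthetic stand-in data, assuming scikit-learn is installed. The value `max_iter=1000` and the generated data are assumptions, and the scores it prints say nothing about the accuracy reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook would use its own X and y here.
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=101)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.33, random_state=101)

scores = {}
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    # max_iter=1000 is an arbitrary, more generous budget than the default of 100.
    model = LogisticRegression(C=c, max_iter=1000)
    model.fit(X_tr, y_tr)
    scores[c] = accuracy_score(y_val, model.predict(X_val))

best_c = max(scores, key=scores.get)
print(best_c, scores[best_c])
```

If the warnings disappear with the larger budget, the selected `C` may or may not change, so the comparison would need to be re-run on the real data before drawing conclusions.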


### Final model

In [None]:
lr_model_final = LogisticRegression(C=0.05)
lr_model_final.fit(X, y)

LogisticRegression(C=0.05, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Predicting and testing accuracy

In [None]:
pred = lr_model_final.predict(X_test_set)

In [None]:
#as said earlier y is same for both train and test
print(f"Accuracy for final model on test data is {round(accuracy_score(y, pred)*100 , 2)} %")

Accuracy for final model on test data is 87.64 %


### Sanity Check

In [None]:
# for a binary problem coef_ has shape (1, n_features):
# coef_ corresponds to outcome 1 (positive) and -coef_ corresponds to outcome 0 (negative).

# both lr_model_final and c_vect are fitted on train_stem,
# that's why the feature names in c_vect correspond to the coefficients in lr_model_final

# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use get_feature_names() instead
feat_coeff = dict(zip(c_vect.get_feature_names_out(), lr_model_final.coef_[0]))

In [None]:
pos_neg = sorted(feat_coeff.items(), key= lambda x : x[1], reverse=True)

**Top 5 positive words :**

In [None]:
for i,j in pos_neg[:5]:
    print(f"{i} :: {j}")

excel :: 0.9626706903838785
perfect :: 0.7808248316882564
favorit :: 0.7215479650741811
great :: 0.6526663723624709
refresh :: 0.6317573415665363


**Top 5 negative words :**

In [None]:
for i,j in pos_neg[-5:][::-1]:
    print(f"{i} :: {j}")

worst :: -1.3323651866026152
wast :: -1.1568194985169875
aw :: -1.069636256865342
poorli :: -0.8897462546755232
bore :: -0.828017672610324


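The output looks sane: the model assigns the largest positive weights to positive words and the most negative weights to negative words. Note that the words appear in their stemmed form (e.g. `favorit`, `poorli`).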

In [None]:
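# note: these raw strings are vectorized as-is, without the stop-word removal and stemming applied to the training text
rev = ["didnt really liked the movie","i really liked the movie it was great", "simran i love you like i hate you dammit", "movie was just okay"]
testing = c_vect.transform(rev)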
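The sample reviews are passed to `c_vect.transform` as raw strings, so unstemmed tokens such as `liked` or `movie` may not match the stemmed vocabulary. One way to keep new text consistent with the training text is to run it through the same steps first. The sketch below assumes NLTK is installed; the helper `prepare` and the small `demo_stop_words` set are hypothetical (the notebook itself uses `stopwords.words('english')`).

```python
import re
from nltk.stem import PorterStemmer

no_space = re.compile(r"[.;:!'?,\"()\[\]]")
space = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")
stemmer = PorterStemmer()

# Small hand-picked stop-word set so the sketch needs no corpus download (an assumption).
demo_stop_words = {"i", "the", "it", "was", "you"}


def prepare(reviews, stop_words):
    """Apply cleaning, stop-word removal and stemming in the same order as for training."""
    prepared = []
    for line in reviews:
        line = space.sub(" ", no_space.sub("", line.lower()))
        tokens = [w for w in line.split() if w not in stop_words]
        prepared.append(" ".join(stemmer.stem(w) for w in tokens))
    return prepared


ready = prepare(["I really liked the movie, it was great!"], demo_stop_words)
print(ready)  # ['realli like movi great']
```

The prepared strings could then be given to `c_vect.transform` in place of the raw ones; the resulting predictions may differ from those shown in this notebook.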

In [None]:
f=lr_model_final.predict(testing)

In [None]:
for i, j in zip(rev,f):
    print(f">>> {i} >>> {j}")

>>> didnt really liked the movie >>> 0
>>> i really liked the movie it was great >>> 1
>>> simran i love you like i hate you dammit >>> 1
>>> movie was just okay >>> 0
