<img src="https://rhyme.com/assets/img/logo-dark.png" align="center"> <h2 align="center">Logistic Regression: A Sentiment Analysis Case Study</h2>

### Introduction
___

- IMDB movie reviews dataset
- http://ai.stanford.edu/~amaas/data/sentiment
- Contains 25000 positive and 25000 negative reviews
<img src="https://i.imgur.com/lQNnqgi.png" align="center">
- Contains at most 300 reviews per movie to avoid any bias to a movie
- At least 7 stars out of 10 $\rightarrow$ positive (label = 1)
- At most 4 stars out of 10 $\rightarrow$ negative (label = 0)
- 50/50 train/test split
- Evaluation metric: accuracy

<b>Features: bag of 1-grams with TF-IDF values</b>:
- Extremely sparse feature matrix: close to 97% of the entries are zeros
- The feature matrix has 25000 rows (one per training review) and 75000 columns (one per unique word found in the 25000 training reviews). This vocabulary is built from the training data (fit) and is then applied unchanged to the test data (transform)
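
As a rough illustration of what "sparse" means here, the sketch below stores a small made-up matrix in SciPy's compressed sparse row format and measures the fraction of zero entries. The matrix values are assumptions chosen for the example, not data from the reviews.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up 3 x 4 feature matrix (illustrative values only)
dense = np.array([[0.0, 0.0, 0.4, 0.0],
                  [0.7, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.2]])
X_small = csr_matrix(dense)          # only the non-zero entries are stored

n_cells = X_small.shape[0] * X_small.shape[1]
sparsity = 1.0 - X_small.nnz / n_cells   # fraction of entries that are zero
print(f"{sparsity:.0%} of the entries are zeros")
```

The same two lines that compute `sparsity` should work on any SciPy sparse matrix, including the TF-IDF matrix built later in this notebook.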

 <b>Model: Logistic regression</b>
- $p(y = 1|x) = \sigma(w^{T}x)$
- Linear classification model
- Can handle sparse data
- Fast to train
- Weights can be interpreted
<img src="https://i.imgur.com/VieM41f.png" align="center" width=500 height=500>
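
A minimal sketch of the model formula above: it evaluates $\sigma(w^{T}x)$ for one feature vector. The weight vector `w` and the feature vector `x` are made-up values for illustration, not parameters of the model trained later.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and feature vector (illustrative values only)
w = np.array([1.5, -2.0, 0.0])
x = np.array([0.6, 0.1, 0.8])

score = w @ x                     # linear part: w^T x
p_positive = sigmoid(score)       # p(y = 1 | x)
label = int(p_positive >= 0.5)    # predict positive when the probability is at least 0.5
```

A positive weight pushes the prediction towards the positive class and a negative weight towards the negative class, which is what makes the weights interpretable.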

### Task 1: Loading the dataset
---

In [1]:
import pandas as pd

df=pd.read_csv('movie_data.csv')
df.head(10)



| Unnamed: 0 | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |
| 3 | hi for all the people who have seen this wonde... | 1 |
| 4 | I recently bought the DVD, forgetting just how... | 0 |
| 5 | Leave it to Braik to put on a good show. Final... | 1 |
| 6 | Nathan Detroit (Frank Sinatra) is the manager ... | 1 |
| 7 | To understand "Crash Course" in the right cont... | 1 |
| 8 | I've been impressed with Chavez's stance again... | 1 |
| 9 | This movie is directed by Renny Harlin the fin... | 1 |


In [2]:
df['review'][1]

"OK... so... I really like Kris Kristofferson and his usual easy going delivery of lines in his movies. Age has helped him with his soft spoken low energy style and he will steal a scene effortlessly. But, Disappearance is his misstep. Holy Moly, this was a bad movie! <br /><br />I must give kudos to the cinematography and and the actors, including Kris, for trying their darndest to make sense from this goofy, confusing story! None of it made sense and Kris probably didn't understand it either and he was just going through the motions hoping someone would come up to him and tell him what it was all about! <br /><br />I don't care that everyone on this movie was doing out of love for the project, or some such nonsense... I've seen low budget movies that had a plot for goodness sake! This had none, zilcho, nada, zippo, empty of reason... a complete waste of good talent, scenery and celluloid! <br /><br />I rented this piece of garbage for a buck, and I want my money back! I want my 2 hou

<h2 align="center">Bag of words / Bag of N-grams model</h2>

### Task 2: Transforming documents into feature vectors

Below, we will call the fit_transform method on CountVectorizer. This will construct the vocabulary of the bag-of-words model and transform the following three sentences into sparse feature vectors:
1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two

Applying `CountVectorizer` to a corpus of documents is the bag-of-words technique.
Here we treat the three sentences as three documents, so `CountVectorizer` produces a feature matrix with 3 rows and one column per unique word across all three documents. The element in row r and column c is the number of occurrences of word c in document r.

In [3]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count=CountVectorizer()
doc=np.array(['The sun is shining',
'The weather is sweet',
'The sun is shining, the weather is sweet, and one and one is two'])
bag=count.fit_transform(doc)  

#fit_transform first calls fit to build a dictionary of all unique words in doc, then calls transform to convert doc into a feature matrix using that dictionary.
#fit on its own only builds the dictionary from its input; it does not return a feature matrix.
#transform on its own converts its input into a feature matrix using the previously built dictionary and does not add new words to that dictionary.





In [4]:
count.vocabulary_   #dictionary mapping each word learned by fit to its column index in the feature matrix

{'the': 6,
 'sun': 4,
 'is': 1,
 'shining': 3,
 'weather': 8,
 'sweet': 5,
 'and': 0,
 'one': 2,
 'two': 7}

In [5]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
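
To make the difference between `fit` and `transform` concrete, the sketch below builds the vocabulary from the same three sentences and then transforms a new sentence. The new sentence is made up for this example; its unseen words (`moon`, `bright`) are simply ignored because `transform` never extends the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']

vectorizer = CountVectorizer()
vectorizer.fit(docs)                      # builds the vocabulary only

# Hypothetical new sentence: 'moon' and 'bright' were not seen during fit
new_doc = ['The moon is bright and the sun is shining']
new_vec = vectorizer.transform(new_doc).toarray()

vocab_size = len(vectorizer.vocabulary_)  # unchanged by transform
print(new_vec)
print(vocab_size)
```

This is the same mechanism that lets a vocabulary built on training reviews be reused on test reviews.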


Raw term frequencies: *tf (t,d)*—the number of times a term t occurs in a document *d*

### Task 3: Word relevancy using term frequency-inverse document frequency

$$\text{tf-idf}(t,d)=\text{tf (t,d)}\times \text{idf}(t,d)$$

$$\text{idf}(t,d) = \text{log}\frac{n_d}{1+\text{df}(d, t)},$$

where $n_d$ is the total number of documents, and df(d, t) is the number of documents d that contain the term t.

Raw word counts are not ideal features, because common words such as "a", "an" and "the" receive the highest counts even though they carry little information. TF-IDF addresses this: tf measures how often a word occurs in a document, and idf downscales words that appear in many documents of the corpus.
TF-IDF therefore gives more weight to informative words, i.e. words that are frequent within a document but rare across documents.

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer
np.set_printoptions(precision=2) #for a clearer array description set the np array precisions to 2 decimal places

tfidf=TfidfTransformer(use_idf=True,norm='l2',smooth_idf=True)

#use_idf=True multiplies tf by the idf factor instead of using tf alone; norm='l2' scales each row to unit L2 length; smooth_idf=True adds one to the document frequencies, which prevents division by zero.
#TfidfTransformer converts the CountVectorizer matrix to tf-idf form. TfidfVectorizer would do both steps at once, but we use the two-step route here to study the mathematics of the method.

print(tfidf.fit_transform(bag).toarray())
# in doc 3 the 2nd element corresponds to the word 'is': in the bag-of-words matrix it had the largest count (3), but because it is a common word appearing in all 3 docs its weight is scaled down relative to rarer words.


[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


With `smooth_idf=True`, the idf equation that is implemented in scikit-learn is:

$$\text{idf}(t,d) = \text{log}\frac{1 + n_d}{1 + \text{df}(d, t)}$$

The tf-idf equation that is implemented in scikit-learn is as follows:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$

With `norm='l2'`, each row of tf-idf values is then divided by its L2 norm.
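
The scikit-learn equations can be checked by hand. The sketch below applies them with NumPy to the count matrix printed earlier, followed by L2 normalisation of each row (assuming `norm='l2'` and the natural logarithm). Variable names such as `tfidf_manual` are introduced here for illustration only.

```python
import numpy as np

# Count matrix of the three example sentences (same values as bag.toarray())
counts = np.array([[0, 1, 0, 1, 1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 0, 1, 1, 0, 1],
                   [2, 3, 2, 1, 1, 1, 2, 1, 1]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)             # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df))     # smoothed idf
raw_tfidf = counts * (idf + 1)            # tf * (idf + 1)

# Divide each row by its L2 norm, as norm='l2' does
tfidf_manual = raw_tfidf / np.linalg.norm(raw_tfidf, axis=1, keepdims=True)

np.set_printoptions(precision=2)
print(tfidf_manual)
```

Rounded to two decimals, the result should agree with the `TfidfTransformer` output shown in this task.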

### Task 4: Data Preparation

In [7]:
# we will strip irrelevant information such as punctuation and HTML tags from the text, keeping emoticons because they carry sentiment

df.loc[0,'review'][-50:] #last 50 chars of the 1st review: we can see punctuation, HTML line-break tags, a colon, etc.

'is seven.<br /><br />Title (Brazil): Not Available'

In [8]:
import re
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text) #strip HTML tags by replacing them with an empty string
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text) #find all emoticons so they can be appended at the end of the review
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '') #lowercase, replace punctuation with spaces, then append the emoticons without their '-' noses
    return text

In [9]:
print(preprocessor(df.loc[0,'review'][-50:]))

is seven title brazil not available


In [10]:
#try an example with emoticons, an HTML tag and punctuation
print(preprocessor('</a>This :) is :( a Test :-) !'))

#all punctuation and HTML tags are stripped and the emoticons are moved to the end

this is a test :) :( :)


In [12]:
df['review']=df['review'].apply(preprocessor)

### Task 5: Tokenization of documents

In [13]:
from nltk.stem.porter import PorterStemmer

porter=PorterStemmer()



In [14]:
def tokenizer(text):
    return text.split()

In [15]:
def tokenizer_stemmer(text):
    return [porter.stem(word) for word in text.split()]

In [21]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [22]:
tokenizer_stemmer('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [23]:
#we can also remove stop words such as 'an', 'the', etc.
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ajinkeya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [24]:
from nltk.corpus import stopwords

stop=stopwords.words('english')
[w for w in tokenizer_stemmer('a runner likes running and so he runs a lot') if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

### Task 6: Transform Text Data into TF-IDF Vectors

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer(strip_accents=None,
                     lowercase=False,   #the text was already lowercased by the preprocessor function applied in Task 4
                     preprocessor=None,  #we have already applied the preprocessor function to our data
                     tokenizer=tokenizer_stemmer,
                     use_idf=True,  #multiply tf by the idf weighting term instead of using the raw frequency/count directly
                     norm='l2',     #normalize the values of each row with L2 normalization
                     smooth_idf=True #adds one to the document frequencies to avoid division by zero
                     )  
y=df.sentiment.values
X=tfidf.fit_transform(df.review)

### Task 7: Document Classification using Logistic Regression

In [26]:
from sklearn.model_selection import train_test_split

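#shuffle=False keeps the original row order: the first half is used for training and the second half for testing (random_state has no effect without shuffling)
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=1,test_size=0.5,shuffle=False)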
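
The introduction describes a vocabulary built from the training reviews only, whereas the cells above fit the vectorizer on the full dataset before splitting. One possible way to keep the vocabulary strictly train-only is a scikit-learn `Pipeline`. The sketch below shows the idea on a tiny made-up corpus; the texts, labels and default hyperparameters are assumptions for illustration, not the notebook's actual setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up corpus standing in for the preprocessed reviews
train_texts = ['great film loved it', 'wonderful acting great story',
               'terrible plot awful acting', 'awful film hated it']
train_labels = [1, 1, 0, 0]
test_texts = ['loved the wonderful story', 'hated the terrible plot']

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),       # vocabulary is learned from the training texts only
    ('clf', LogisticRegression()),
])
pipe.fit(train_texts, train_labels)     # fit_transform on train, then fit the classifier
predictions = pipe.predict(test_texts)  # test texts are only transformed, never fitted
print(predictions)
```

Because the whole pipeline is a single estimator, it can also be passed to cross-validation utilities so that each fold builds its own vocabulary.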

In [31]:
import pickle  #used to save our trained model to disk
from sklearn.linear_model import LogisticRegressionCV  #the CV variant has k-fold cross-validation built in as a parameter

clf=LogisticRegressionCV(cv=5,    #5-fold CV
                        scoring='accuracy',
                        random_state=0,
                        n_jobs=-1,  #number of CPU cores to dedicate to training; -1 means all of them (4 here)
                        verbose=3,
                        max_iter=300) #the default of 100 iterations may not be enough for the parameters to converge on such a large training set, so we set it to 300 to be safe
clf.fit(X_train,y_train)
with open('saved_model.sav','wb') as saved_model: #the trained model is saved under this name in the same directory as our notebook
    pickle.dump(clf,saved_model)  #write the trained model to the saved_model file on disk

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  3.1min remaining:  4.6min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  3.8min finished


### Task 8: Model Evaluation

In [32]:
#to load the model
filename='saved_model.sav'
with open(filename,'rb') as f:
    saved_clf=pickle.load(f) #we could keep using clf, but this is how a saved model is loaded and reused at a later time

In [33]:
saved_clf.score(X_test,y_test)



0.89608
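
The introduction notes that the weights of a logistic regression model can be interpreted. A minimal sketch of how that might look: fit a small model and list the terms with the largest positive and negative weights. The corpus below is made up for illustration; on the real model the same idea would use the coefficients of the trained classifier together with the vocabulary of the fitted vectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up corpus (illustrative only)
texts = ['great film loved it', 'wonderful acting great story',
         'terrible plot awful acting', 'awful film hated it']
labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
X_toy = vec.fit_transform(texts)
model = LogisticRegression().fit(X_toy, labels)

# Order the terms by column index so they line up with the coefficient vector
terms = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
order = np.argsort(model.coef_[0])             # ascending by weight
most_negative = terms[order[:3]].tolist()      # push towards label 0
most_positive = terms[order[-3:]].tolist()     # push towards label 1
print('negative:', most_negative)
print('positive:', most_positive)
```

Words that occur equally in both classes (here `film`, `it`, `acting`) end up with weights near zero, so they appear in neither list.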