 <h2 align="center">Logistic Regression: A Sentiment Analysis Case Study</h2>

### Introduction
___

- IMDB movie reviews dataset
- http://ai.stanford.edu/~amaas/data/sentiment
- Contains 25000 positive and 25000 negative reviews
<img src="https://i.imgur.com/lQNnqgi.png" align="center">
- Contains at most 30 reviews per movie, so that no single movie dominates the dataset
- At least 7 stars out of 10 $\rightarrow$ positive (label = 1)
- At most 4 stars out of 10 $\rightarrow$ negative (label = 0)
- 50/50 train/test split
- Accuracy is used as the evaluation metric

<b>Features: bag of 1-grams with TF-IDF values</b>:
- Extremely sparse feature matrix - close to 97% are zeros
- Sparse feature matrix of TF-IDF values is input for Logistic Regression
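
To make the idea of sparsity concrete, here is a minimal sketch in plain Python. The toy matrix and the helper names `sparsity` and `to_sparse` are assumptions made for illustration; they are not part of the notebook or of scikit-learn.

```python
def sparsity(matrix):
    """Fraction of entries in a dense list-of-lists matrix that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total

def to_sparse(matrix):
    """Keep only the non-zero entries as {(row, column): value}."""
    return {(i, j): v
            for i, row in enumerate(matrix)
            for j, v in enumerate(row)
            if v != 0}

# Hypothetical TF-IDF values for two documents and five terms
toy = [
    [0.0, 0.0, 0.6, 0.0, 0.8],
    [0.0, 1.0, 0.0, 0.0, 0.0],
]
toy_sparsity = sparsity(toy)     # share of zero entries
toy_entries = to_sparse(toy)     # only the non-zero cells are stored
print(toy_sparsity, toy_entries)
```

Storing only the non-zero cells is the reason a sparse representation stays small even when the vocabulary is large.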

 <b>Model: Logistic regression</b>
- $p(y = 1|x) = \sigma(w^{T}x)$
- Linear classification model
- Can handle sparse data
- Fast to train
- Weights can be interpreted
<img src="https://i.imgur.com/VieM41f.png" align="center" width=500 height=500>
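
The model equation $p(y = 1|x) = \sigma(w^{T}x)$ can be sketched in a few lines of plain Python. The weights and feature values below are hypothetical and chosen only to show how the sign of a weight can be read: a positive weight pushes the prediction towards label 1, a negative weight towards label 0.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, features):
    """p(y = 1 | x) = sigmoid(w^T x) for a single feature vector."""
    z = sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for three terms, e.g. "great", "boring", "movie"
weights = [2.0, -3.0, 0.0]
review_a = [0.7, 0.0, 0.3]   # contains "great"
review_b = [0.0, 0.8, 0.2]   # contains "boring"

p_a = predict_proba(weights, review_a)
p_b = predict_proba(weights, review_b)
print(round(p_a, 3), round(p_b, 3))
```

A review is assigned label 1 when the probability exceeds 0.5, which happens exactly when $w^{T}x > 0$.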

### Loading the dataset
---

In [1]:
import pandas as pd

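# on_bad_lines='warn' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0:
# malformed rows are skipped and reported instead of raising an error
df = pd.read_csv('movie_data.csv', engine='python', encoding='utf-8', on_bad_lines='warn')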
df.head(10)

Skipping line 10570: unexpected end of data


Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0
5,Leave it to Braik to put on a good show. Final...,1
6,Nathan Detroit (Frank Sinatra) is the manager ...,1
7,"To understand ""Crash Course"" in the right cont...",1
8,I've been impressed with Chavez's stance again...,1
9,This movie is directed by Renny Harlin the fin...,1


In [2]:
df['review'][1]

"OK... so... I really like Kris Kristofferson and his usual easy going delivery of lines in his movies. Age has helped him with his soft spoken low energy style and he will steal a scene effortlessly. But, Disappearance is his misstep. Holy Moly, this was a bad movie! <br /><br />I must give kudos to the cinematography and and the actors, including Kris, for trying their darndest to make sense from this goofy, confusing story! None of it made sense and Kris probably didn't understand it either and he was just going through the motions hoping someone would come up to him and tell him what it was all about! <br /><br />I don't care that everyone on this movie was doing out of love for the project, or some such nonsense... I've seen low budget movies that had a plot for goodness sake! This had none, zilcho, nada, zippo, empty of reason... a complete waste of good talent, scenery and celluloid! <br /><br />I rented this piece of garbage for a buck, and I want my money back! I want my 2 hou

<h2 align="center">Bag of words / Bag of N-grams model</h2>

### Transforming documents into feature vectors

Below, we will call the fit_transform method on CountVectorizer. This will construct the vocabulary of the bag-of-words model and transform the following three sentences (documents) into sparse feature vectors:
1. The sun is shining
2. The weather is sweet
3. The sun is shining, the weather is sweet, and one and one is two


The bag-of-words model takes sentences or documents and converts them into numeric feature vectors.


First, build the bag-of-words model for these three documents in Python.

In [3]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()

#load data as a list and store it as a numpy array

list1 = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']

docs = np.array(list1)

bag = count.fit_transform(docs)

In [4]:
print(count.vocabulary_)           # each unique word in the documents is mapped to a unique integer index in a Python dictionary

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [5]:
# print feature vectors created

print(bag.toarray())     # each column index corresponds to a vocabulary entry above; each value is the number of times that word occurs in the document (the term frequency)

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


Raw term frequencies: *tf(t,d)*, the number of times a term *t* occurs in a document *d*
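
The same counts can be reproduced with plain Python, which shows what the vectorizer does internally. This is a minimal sketch, not scikit-learn's implementation: the helper names `tokenize` and `count_vector` are made up here, and the regular expression is an assumption that imitates the vectorizer's default of lowercasing and keeping tokens of two or more characters.

```python
import re
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two']

def tokenize(doc):
    # Lowercase and keep tokens made of two or more word characters
    return re.findall(r'\b\w\w+\b', doc.lower())

# Vocabulary: every unique token mapped to its alphabetical index
all_terms = sorted({term for doc in docs for term in tokenize(doc)})
vocabulary = {term: index for index, term in enumerate(all_terms)}

def count_vector(doc):
    # Term frequency of every vocabulary entry in this document
    counts = Counter(tokenize(doc))
    return [counts[term] for term in all_terms]

bag_of_words = [count_vector(doc) for doc in docs]
for row in bag_of_words:
    print(row)
```

Each row has one entry per vocabulary term, so the vector length is fixed by the vocabulary and not by the length of the document.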

### Word relevancy using term frequency-inverse document frequency

$$\text{tf-idf}(t,d)=\text{tf}(t,d)\times \text{idf}(t,d)$$

$$\text{idf}(t,d) = \log\frac{n_d}{1+\text{df}(d, t)},$$

where $n_d$ is the total number of documents and $\text{df}(d, t)$ is the number of documents $d$ that contain the term $t$.


The `TfidfTransformer` takes the raw term frequencies from the `CountVectorizer` output and transforms them into tf-idf values.

tf-idf downweights terms that occur in many documents of the corpus, since such frequent terms are unlikely to carry much discriminatory information.



In [6]:
from sklearn.feature_extraction.text import TfidfTransformer
np.set_printoptions(precision = 2)
tfidf = TfidfTransformer(use_idf = True, norm = 'l2', smooth_idf = True )          # use_idf = True enables the idf weighting
# norm = 'l2' scales each document vector to unit Euclidean length (its squared elements sum to 1)
# smooth_idf = True adds 1 to the document counts in the idf formula, which prevents division by zero


print(tfidf.fit_transform(bag).toarray())  # compute the tf-idf values from the counts in bag and print them as a dense array


[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


The equation for the idf that is implemented in scikit-learn (with `smooth_idf = True`) is:

$$\text{idf}(t,d) = \log\frac{1 + n_d}{1 + \text{df}(d, t)}$$

The tf-idf equation that is implemented in scikit-learn is as follows:

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times (\text{idf}(t,d)+1)$$
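
As a check on these two equations, the following sketch recomputes the tf-idf matrix by hand from the raw counts printed earlier, using the natural logarithm and the `l2` normalization selected in the transformer. The variable and function names are illustrative assumptions, not scikit-learn internals.

```python
import math

# Raw term counts from the CountVectorizer output (rows = documents)
counts = [[0, 1, 0, 1, 1, 0, 1, 0, 0],
          [0, 1, 0, 0, 0, 1, 1, 0, 1],
          [2, 3, 2, 1, 1, 1, 2, 1, 1]]

n_docs = len(counts)
n_terms = len(counts[0])

# df(d, t): number of documents that contain each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf plus the "+1" from the tf-idf equation
idf_plus_one = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]

def l2_normalize(vec):
    # Scale the vector to unit Euclidean length
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

tfidf_rows = [l2_normalize([tf * w for tf, w in zip(row, idf_plus_one)])
              for row in counts]

for row in tfidf_rows:
    print([round(v, 2) for v in row])
```

Rounded to two decimals, the rows agree with the `TfidfTransformer` output shown above. Terms that appear in all three documents ("is", "the") end up with a lower weight than terms that appear in only one or two.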

### Data Preparation

Removal of irrelevant data such as HTML tags and punctuation; emoticons are kept.

In [7]:
df.loc[0, 'review'][-50:]                #printing the last 50 character of the first review

'is seven.<br /><br />Title (Brazil): Not Available'

The function below hard-codes the removal of this common irrelevant content (HTML tags and non-word characters).

Emoticons are extracted first and moved to the end of the review.

In [8]:
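import re
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)                                 # strip HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)       # collect emoticons such as :) ;-( =D
    # lowercase, replace runs of non-word characters with a space,
    # then append the emoticons (without the '-' nose) at the end
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text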

In [9]:
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

In [10]:
preprocessor("</a>This :) is a test :-)!")

'this is a test :) :)'

In [11]:
df['review'] = df['review'].apply(preprocessor)     # apply function of pandas is used for this

### Tokenization of documents

The goal of tokenization is to represent the data as a collection of words/tokens.

Word-level preprocessing such as stemming is applied as part of this step.

STEMMING:

Goal: To reduce the inflectional forms, and sometimes derivationally related forms, of a word to a common base form. E.g. organizer, organized, and organizing all share the same base.
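
The idea can be illustrated with a deliberately crude suffix stripper. This is a toy sketch under stated assumptions: the suffix list is hand-picked and hypothetical, and the function is not the Porter algorithm, which applies a much more careful sequence of rules.

```python
# Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ('ing', 'ed', 'er', 's')   # hypothetical, hand-picked list

def toy_stem(word):
    for suffix in SUFFIXES:
        # Strip the first matching suffix, keeping at least three characters
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

stems = [toy_stem(w) for w in ['organizer', 'organized', 'organizing']]
print(stems)
```

All three words collapse to the same token, so the classifier sees them as one feature instead of three. The stem does not have to be a dictionary word.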

In [12]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()              # initializes the Porter stemmer


In [13]:
def tokenizer(text):
    return text.split()                # returns the sentence split into words based on the space character

In [14]:
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]            # returns output of stemming process, this performs stemming plus tokenization

In [15]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [16]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

Stop words such as and, the, is, etc. are not useful for the analysis, so they are removed from the output of `tokenizer_porter`.

In [17]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\alroy
[nltk_data]     Lobo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [18]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop]     #performs stemming, splitting in tokens and also removes stop words

['runner', 'like', 'run', 'run', 'lot']

### Transform Text Data into TF-IDF Vectors

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents = None,
                   lowercase = False,
                   preprocessor = None,
                   tokenizer = tokenizer_porter,
                   use_idf = True,
                   norm = 'l2',
                   smooth_idf = True)    # preprocessor is set to None because preprocessing has already been performed; tokenizer is set to the tokenizer_porter function created in the previous step

y = df.sentiment.values
X = tfidf.fit_transform(df.review)


### Document Classification using Logistic Regression

In [20]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1, test_size = 0.5, shuffle = False)            # test_size = 0.5 gives a 50-50 train-test split; with shuffle = False the rows keep their original order and random_state has no effect
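
With shuffling switched off, the split simply cuts the data in two. The sketch below imitates that behaviour on a stand-in list; `ordered_split` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import math

def ordered_split(rows, test_size=0.5):
    """Split without shuffling: the first part trains, the rest tests."""
    n_test = math.ceil(len(rows) * test_size)
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]

rows = list(range(10))              # stand-in for 10 reviews
train_rows, test_rows = ordered_split(rows)
print(train_rows, test_rows)
```

An ordered split like this is only a fair evaluation when the rows of the file are already in random order, which is assumed here.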

In [21]:
import pickle                  # used to save the model to disk
from sklearn.linear_model import LogisticRegressionCV     # logistic regression with built-in cross-validation to tune the regularization strength


clf = LogisticRegressionCV(cv = 5,
                          scoring = 'accuracy',
                          random_state = 0,
                          n_jobs = -1,
                          verbose = 3, 
                          max_iter = 300).fit(X_train, y_train)


# cv = 5: 5-fold cross-validation is used to tune the hyperparameters
# scoring = 'accuracy': candidate models are compared by accuracy
# n_jobs = -1: use all available CPU cores
# verbose = 3: print progress output while the model is being fit
# max_iter = 300: maximum number of iterations of the optimization solver


# open the model file, dump the fitted model into it, and close the file
with open('saved_model.sav', 'wb') as saved_model:
    pickle.dump(clf, saved_model)



[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  1.1min remaining:  1.7min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.5min finished
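
The five parallel jobs in the log correspond to the five cross-validation folds. As a rough sketch of what a fold is, the hypothetical helper below cuts the sample indices into `k` contiguous validation blocks and trains on the rest. It is a simplification: for classifiers, scikit-learn uses stratified folds that also preserve the class balance.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)   # spread the remainder
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

folds = list(k_fold_indices(12, k=5))
for train, validation in folds:
    print(len(train), validation)
```

Every sample is used for validation exactly once, and the score for each candidate hyperparameter value is averaged over the folds.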


### Model Evaluation

In [22]:
filename = 'saved_model.sav'
with open(filename, 'rb') as f:          # load the saved model back from disk
    saved_clf = pickle.load(f)

In [23]:
saved_clf.score(X_test, y_test )

0.8629825889477668
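
The value above is the accuracy on the held-out test half, that is, the fraction of test reviews whose predicted label matches the true label. A minimal sketch of that computation, using hypothetical labels for eight reviews:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels for eight reviews (1 = positive, 0 = negative)
y_true = [1, 0, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
toy_accuracy = accuracy(y_true, y_pred)
print(toy_accuracy)
```

Accuracy is a reasonable metric here because the dataset contains the same number of positive and negative reviews.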