### Sentiment Analysis for IMDb reviews 

In this notebook, we will build a classification model to predict whether a movie review from IMDb is positive or negative. We will use the dataset named [IMDb Dataset of 50K Movie Reviews](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) from [Kaggle](https://en.wikipedia.org/wiki/Kaggle). 

In [1]:
# # If using Google Colab
# from google.colab import drive
# drive.mount('/content/drive')
# cd '/content/drive/Shareddrives/Machine Learning datasets'


In [8]:
import warnings 
warnings.filterwarnings("ignore")

import pandas as pd
df = pd.read_csv("IMDB Dataset.csv")

In [10]:
df.head()

| | review | sentiment |
|-|--------|-----------|
|0| One of the other reviewers has mentioned that ... | positive |
|1| A wonderful little production. <br /><br />The... | positive |
|2| I thought this was a wonderful way to spend ti... | positive |
|3| Basically there's a family where a little boy ... | negative |
|4| Petter Mattei's "Love in the Time of Money" is... | positive |


In [11]:
df.iloc[0, 0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

Each input example for the neural network is a vector (point) in $n$-dimensional space, where $n$ is the number of input nodes. Our input consists of text, which is a sequence of words, spaces, punctuation, emojis, etc. So we need to convert this input into an $n$-dimensional feature vector of numerical values. How can we do this?


We will use techniques from **natural language processing (NLP) for text classification**. This particular model is an example of **sentiment analysis**, which, as the name suggests, identifies the sentiment of a text.

### Bag Of Words (BOW)

A simple first step in vectorizing a text is to split it into a sequence of words. For example,
```
"It is sunny in Los Angeles." ->  ["It", "is", "sunny", "in", "Los", "Angeles", "."]
```


So now we have a vector, but its values are not numerical. One way to solve this is to create a vocabulary and then use it to build a feature vector by counting the occurrences of each word. For example,
```
Training text: ["I like to read in cafes.", "The walk in the park is nice."]
Vocabulary: ["I", "like", "to", "read", "in", "cafes", "the", "walk", "park", "is", "nice"]
```

```
New text: "I like the walk in the park."
```

|I| like| to| read| in| cafes|the|walk| park|is|nice|
|-|-----|---|-----|---|------|---|----|-----|--|----|
|1|  1  | 0 |  0  | 1 |  0   | 2 |  1 |  1  |0 | 0  |

```
Vectorization: "I like the walk in the park." -> [1, 1, 0, 0, 1, 0, 2, 1, 1, 0, 0]
``` 
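The counting step above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `bow_vector` and the choice to lowercase the text and the vocabulary before matching are assumptions of this sketch, not part of the original example.

```python
import re

# Vocabulary from the example above, lowercased for matching (an assumption of this sketch)
vocabulary = ["i", "like", "to", "read", "in", "cafes",
              "the", "walk", "park", "is", "nice"]

def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`."""
    # Lowercase the text and keep only runs of letters as tokens
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tokens.count(word) for word in vocabulary]

vector = bow_vector("I like the walk in the park.", vocabulary)
print(vector)  # [1, 1, 0, 0, 1, 0, 2, 1, 1, 0, 0]
```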

Suppose I know the vocabulary `["I", "like", "to", "read", "in", "cafes", "the", "walk", "park", "is", "nice"]` and I am given the vector `[1, 1, 0, 0, 1, 0, 2, 1, 1, 0, 0]` corresponding to this vocabulary. Can I retrieve the original sentence? If not, what is missing?

This technique is called **Bag of words (BOW)** as it disregards the order of the words. You can think of it as putting all the words from a sentence in a bag and thereby breaking the sequence of words completely.
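To make the loss of word order concrete, here is a minimal sketch using `collections.Counter` as the "bag". The helper `bag` and the two sentences are made up for illustration; they have opposite meanings but end up in the same bag.

```python
from collections import Counter

def bag(text):
    """Return the multiset of lowercase words in `text`."""
    return Counter(text.lower().replace(".", "").split())

a = bag("The dog chased the cat.")
b = bag("The cat chased the dog.")

# Same words with the same counts, so the two bags are indistinguishable
print(a == b)  # True
```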

In practice, the vocabulary needs to be very large, which means the feature matrix has many columns. More columns add to the complexity of the model, so you need many more rows (training examples) to keep overfitting in check.
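As a rough illustration of how quickly the columns grow, the sketch below builds a vocabulary from three short, made-up review snippets and measures how many entries of the resulting count matrix are zero. The snippets and variable names are assumptions for this example only.

```python
reviews = [
    "a wonderful little production",
    "basically there is a family",
    "i thought this was a wonderful way to spend time",
]

# One column per distinct word seen in any review
vocabulary = sorted({w for r in reviews for w in r.split()})

# One row per review, holding the count of each vocabulary word
matrix = [[r.split().count(w) for w in vocabulary] for r in reviews]

zeros = sum(row.count(0) for row in matrix)
total = len(matrix) * len(vocabulary)
print(len(vocabulary), "columns;", f"{zeros / total:.0%} of the entries are zero")
```

Even with three short texts, most entries are zero. With thousands of reviews the matrix becomes much wider and sparser.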


### CountVectorizer


The above process of vectorization can be performed using [`CountVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from [`scikit-learn`](https://scikit-learn.org/stable/) as follows. 

First we import and define the vectorizer.
```
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer() 
```

Then, we use only the training set to fit (train) the vectorizer. `fit_transform` fits the vectorizer and transforms the training set in a single call.
```
X_train_vectorized = vectorizer.fit_transform(X_train)
```
Lastly, we transform the validation set. Note that we do not use the validation set to fit/train the vectorizer.
```
X_valid_vectorized = vectorizer.transform(X_valid)
```

The variables `X_train_vectorized` and `X_valid_vectorized` thus obtained are sparse numerical matrices, with one row per review, that can be fed into the model.

Common words such as "the", "a", "is", "it", etc. carry little information about sentiment and can be conveniently removed. They are called **stopwords**.

```
vectorizer = CountVectorizer(stop_words="english", preprocessor=clean_text)                         
```

Since the vocabulary comes solely from the training set, the performance of our model depends on the training set being large and diverse enough to contain most of the vocabulary it will need.
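The effect of fitting on the training set only can be seen with a toy vectorizer. The class `TinyCountVectorizer` below is a hypothetical, simplified stand-in written for this illustration. It is not scikit-learn's implementation, and the example texts are made up.

```python
import re

class TinyCountVectorizer:
    """Hypothetical, simplified count vectorizer (not scikit-learn's code)."""

    def fit(self, texts):
        # Build the vocabulary from the given texts only
        words = sorted({w for t in texts for w in self._tokens(t)})
        self.vocabulary_ = {word: i for i, word in enumerate(words)}
        return self

    def transform(self, texts):
        rows = []
        for t in texts:
            row = [0] * len(self.vocabulary_)
            for w in self._tokens(t):
                if w in self.vocabulary_:  # words unseen during fit are dropped
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

    @staticmethod
    def _tokens(text):
        return re.findall(r"[a-z]+", text.lower())

train = ["a wonderful film", "a boring film"]
valid = ["a dreadful film"]

vec = TinyCountVectorizer().fit(train)
print(sorted(vec.vocabulary_))  # ['a', 'boring', 'film', 'wonderful']
print(vec.transform(valid))     # [[1, 0, 1, 0]]
```

The word "dreadful", which carries all of the sentiment in the validation text, never appeared in the training texts, so it leaves no trace in the vector.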

### Exercise:

Let's go ahead and try to build the model!

Guideline: 
* Divide the dataset into training and validation sets
* Define the function for cleaning text to be used in the next step
* Vectorize both the training and validation sets using [`CountVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Make sure to pass arguments for the `stop_words` and `preprocessor` keywords.
* Train a logistic regression classifier using [`LogisticRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) on the vectorized training set
* Predict the labels for the validation set and measure their accuracy
* Write a few reviews and test them to see if the model correctly predicts the sentiment labels (Optional)

In [16]:
from sklearn.model_selection import train_test_split
# default is 75% / 25% train-test split
X = df['review'] 
y = df['sentiment'].replace({'positive': 1, 'negative': 0})
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

In [17]:
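import re

def clean_text(text):
    """
    Applies some pre-processing on the given text.

    Steps :
    - Replacing HTML tags with a space
    - Removing punctuation and other characters
    - Lowercasing
    """

    # replace HTML tags such as <br /> with a space so neighbouring words stay separate
    text = re.sub(r'<.*?>', ' ', text)

    # remove punctuation and other characters
    # (the hyphen is escaped so it is treated literally, not as a range)
    text = re.sub(r"[,.:;?!@#$%^&*()\-+_=/{}]+", '', text)

    # remove the characters ['], ["], [[] and []]
    text = re.sub(r"['\"\[\]]", '', text)

    # a custom preprocessor replaces the vectorizer's built-in lowercasing,
    # so lowercase here
    return text.lower()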

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Vectorization: fit the vocabulary on the training set only
vectorizer = CountVectorizer(stop_words="english", preprocessor=clean_text)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_valid_vectorized = vectorizer.transform(X_valid)

# Model training
# max_iter is raised from the default of 100 to give the solver room to converge
LR_clf = LogisticRegression(max_iter=1000)
LR_clf.fit(X_train_vectorized, y_train)

# Evaluation (score needs the vectorized features, not the raw text)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(LR_clf.score(X_train_vectorized, y_train)))
print('Accuracy of Logistic regression classifier on validation set: {:.2f}'
     .format(LR_clf.score(X_valid_vectorized, y_valid)))

### TF-IDF Vectorizer

If we look only at individual words, as in Bag-of-Words (BOW), some words such as "wonderful" and "disgusting" are stronger indicators of a review's sentiment than words such as "watching", "become", "every" and "after". In the method above, words were weighted solely by their frequency in a review. Wouldn't it be useful to weigh rarer words more heavily than commonly occurring ones?

Term Frequency Inverse Document Frequency (TF-IDF)

$$ \text{TF-IDF} = \text{TF (Term Frequency)} \times \text{IDF (Inverse Document Frequency)} $$

Term Frequency (TF) is the same as above, namely the number of times a word occurs in a review. It is multiplied by the Inverse Document Frequency (IDF), which measures how rare the word is across all reviews. Rarer words have higher IDF values, so TF-IDF weights them more heavily, relative to their raw frequency, than commonly occurring words.

$$ \text{Inverse Document Frequency (IDF) for a word} = \log \Bigg( \frac{\text{Total number of reviews}}{\text{Number of reviews that contain this word}}\Bigg)$$
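The two formulas can be checked by hand on a tiny made-up corpus. The sketch below follows the definitions exactly as written here. The function names and the three reviews are assumptions for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant of IDF by default and normalizes each row, so its numbers will not match these exactly.

```python
import math

reviews = [
    "a wonderful film with a wonderful cast",
    "a boring film",
    "a film about a dog",
]
docs = [r.split() for r in reviews]

def idf(word):
    """log(total number of reviews / number of reviews containing the word)."""
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

def tf_idf(word, doc):
    """Term frequency in one review times the inverse document frequency."""
    return doc.count(word) * idf(word)

print(idf("film"))                             # 0.0, since "film" is in every review
print(round(tf_idf("wonderful", docs[0]), 3))  # 2 * log(3/1), about 2.197
```

A word that occurs in every review gets a weight of zero, while a word that is frequent in one review but rare elsewhere gets a large weight.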



Term Frequency Inverse Document Frequency (TF-IDF) vectorization is implemented in [`scikit-learn`](https://scikit-learn.org/stable/) as [`TfidfVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and has the same syntax as [`CountVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) above.

In [1]:
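from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression

# Vectorization (one possible solution; the names vectorizer2 and model2
# match the cells that follow)
vectorizer2 = TfidfVectorizer(stop_words="english", preprocessor=clean_text)
X_train_tfidf = vectorizer2.fit_transform(X_train)
X_valid_tfidf = vectorizer2.transform(X_valid)

# Model training
model2 = LogisticRegression(max_iter=1000)
model2.fit(X_train_tfidf, y_train)

# Evaluation
y_pred = model2.predict(X_valid_tfidf)
acc = accuracy_score(y_valid, y_pred)
print("Accuracy on the IMDb validation set: {:.2f}".format(acc))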

### Using n-grams

The above methods using Bag-Of-Words (BOW) technique are not good at detecting negation. Let's predict the sentiment for some of the reviews. Recall that $0$ corresponds to negative and $1$ corresponds to positive sentiment.

In [13]:
review1 = ["In and of itself it is not a bad film."]
vectorized_review1 = vectorizer2.transform(review1)
model2.predict(vectorized_review1)


In [13]:
review2 = ["""It plays on our knowledge and our senses, particularly with the scenes concerning
          Orton and Halliwell and the sets are terribly well done."""]
vectorized_review2 = vectorizer2.transform(review2)
model2.predict(vectorized_review2)

array([0])

In [14]:
review3 = ["""This show was not really funny anymore."""]
vectorized_review3 = vectorizer2.transform(review3)
model2.predict(vectorized_review3)

array([1])

An improvement would be to include phrases in the model instead of simply breaking the sentence into individual words. This is achieved using word $n$-grams: bigrams take two consecutive words at a time, trigrams take three, and so on. In the vectorizer, this is controlled by the `ngram_range` keyword:
```
vectorizer = TfidfVectorizer(stop_words="english",
                             preprocessor=clean_text,
                             ngram_range=(1, 3))
```

where
```
ngram_range: tuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
```
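To see what the vectorizer extracts for a given range, here is a minimal pure-Python sketch of word n-gram generation. The function `word_ngrams` and the small stop list are hypothetical and written only for this illustration. One thing worth checking in practice is whether the stop-word list in use contains "not" (for a fitted scikit-learn vectorizer, `vectorizer.get_stop_words()` returns the list), because stop words are removed before n-grams are formed.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all n-grams of `tokens` with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "this is not a bad film".split()
print(word_ngrams(tokens, 1, 2))
# unigrams followed by bigrams such as 'is not', 'not a', 'a bad'

# Hypothetical stop list for illustration: removing stop words first
# also removes the negation, so no phrase containing "not" can be formed.
stop_words = {"this", "is", "not", "a"}
kept = [t for t in tokens if t not in stop_words]
print(word_ngrams(kept, 1, 2))  # ['bad', 'film', 'bad film']
```

If the negation word is filtered out as a stop word, n-grams alone cannot recover it, which is one possible reason a review like "not a bad film" may still be misclassified.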


In [3]:
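from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression

# Vectorization with unigrams, bigrams and trigrams (one possible solution;
# the names vectorizer3 and model3 match the cells that follow)
vectorizer3 = TfidfVectorizer(stop_words="english",
                              preprocessor=clean_text,
                              ngram_range=(1, 3))
X_train_ngram = vectorizer3.fit_transform(X_train)
X_valid_ngram = vectorizer3.transform(X_valid)

# Model training
model3 = LogisticRegression(max_iter=1000)
model3.fit(X_train_ngram, y_train)

# Evaluation
y_pred = model3.predict(X_valid_ngram)
print("Accuracy on the validation set: {:.2f}".format(accuracy_score(y_valid, y_pred)))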

In [16]:
review1 = ["In and of itself it is not a bad film."]
vectorized_review1 = vectorizer3.transform(review1)
model3.predict(vectorized_review1)

array([0])

In [17]:
review2 = ["""It plays on our knowledge and our senses, particularly with the scenes concerning
          Orton and Halliwell and the sets are terribly well done."""]
vectorized_review2 = vectorizer3.transform(review2)
model3.predict(vectorized_review2)

array([1])

In [18]:
review3 = ["""This show was not really funny anymore."""]
vectorized_review3 = vectorizer3.transform(review3)
model3.predict(vectorized_review3)

array([1])

As you can see, the model correctly predicts the sentiment only for the second review; it still gets the other two wrong. Logistic regression can only draw linear decision boundaries, so we will next use a neural network with hidden layers on this dataset to see whether it improves the results. In the coming sessions, we will also study neural network architectures that are specifically designed to remember previous words in a sentence.

### Stemming and Lemmatization

Many languages are inflected: a word takes different forms, derived from a common root, depending on how it is used.


"In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. An inflection expresses one or more grammatical categories with a prefix, suffix or infix, or another internal modification such as a vowel change" [Wikipedia]


Stemming and lemmatization are both used to get the root form of inflected words.
* Stemming reduces the word to a stem, which may or may not be an actual word in the language.
* Lemmatization reduces the word to a lemma, which is an actual word in the language.

Stemming is faster than lemmatization because it applies rule-based heuristics to strip suffixes and prefixes, thereby reducing the word to its stem. Lemmatization identifies the part of speech of the word and then looks it up in a lexical database such as WordNet to find the corresponding lemma.
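The difference between the two approaches can be illustrated with a toy sketch. Both functions below are deliberately simplistic and hypothetical: `toy_stem` strips a handful of suffixes by rule, and `toy_lemmatize` uses a tiny hand-written table in place of a real dictionary lookup. Real stemmers and lemmatizers are considerably more careful.

```python
def toy_stem(word):
    """Strip a few common suffixes by rule; the result need not be a real word."""
    for suffix in ("ing", "ies", "ed", "es", "s"):
        # Only strip if at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical hand-written lemma table standing in for a dictionary lookup
LEMMAS = {"studies": "study", "running": "run", "better": "good", "was": "be"}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMAS.get(word, word)

for w in ["studies", "running", "better"]:
    print(w, toy_stem(w), toy_lemmatize(w))
# studies stud study
# running runn run
# better better good
```

The stems "stud" and "runn" are not words, while the lemmas are. Handling an irregular form such as "better" requires the lookup table, which is what makes lemmatization harder to build for a new language.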


Lemmatization is usually preferable to stemming, but it can be harder to implement for a new language.