<a href="https://colab.research.google.com/github/chege-kimaru/mlp-sentiment-analysis/blob/main/mlp_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Explore the IMDB dataset**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

# download the tsv file from the github repo
# upload it in your drive, copy the path and set it below
labeledTrainDataPath = '/content/drive/My Drive/colab-data/labeledTrainData.tsv'

Mounted at /content/drive


In [2]:
import re
import pandas as pd
 
def clean_review(text):
    # Strip HTML tags
    text = re.sub('<[^<]+?>', ' ', text)
 
    # Strip escaped quotes
    text = text.replace('\\"', '')
 
    # Strip quotes
    text = text.replace('"', '')
 
    return text
 
df = pd.read_csv(labeledTrainDataPath, sep='\t', quoting=3)
 
# Create a cleaned_review column
df['cleaned_review'] = df['review'].apply(clean_review)
 
# Check out how the cleaned review compares to the original one
print(df['review'][0])
print("\n\n")
print(df['cleaned_review'][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

**Bag of Words (BOW)**
With BOW, we first compute the vocabulary (the set of all known words). A text is then represented by a vector with one position per vocabulary word: 1 (or the number of occurrences) where the word appears in the text and 0 at all the other indices.
Below is a very simplified, unoptimized BOW transformer, but it captures the basic algorithm.

In [3]:
VOCABULARY = ['dog', 'cheese', 'cat', 'mouse']
TEXT = 'the mouse ate the cheese'
 
def to_bow(text):
    words = text.split(" ")
    return [1 if w in words else 0 for w in VOCABULARY]
 
print(to_bow(TEXT))  # [0, 1, 0, 1]

[0, 1, 0, 1]
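
The description above mentions that a BOW vector can hold the number of appearances instead of a 0/1 flag. A minimal sketch of that variant, assuming the same toy vocabulary and plain whitespace tokenization (the helper name `to_count_bow` and the example sentence are made up for illustration):

```python
from collections import Counter

VOCABULARY = ['dog', 'cheese', 'cat', 'mouse']

def to_count_bow(text, vocabulary=VOCABULARY):
    # Count every whitespace-separated token once,
    # then look up the count of each vocabulary word (0 if absent)
    counts = Counter(text.split())
    return [counts[w] for w in vocabulary]

print(to_count_bow('the mouse ate the cheese and more cheese'))  # [0, 2, 0, 1]
```

Words outside the vocabulary ('the', 'ate', ...) are simply ignored, just like in the binary version.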


We can now use Scikit-Learn's vectorizers to transform a text into its BOW representation.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
 
 
vectorizer = CountVectorizer(vocabulary=VOCABULARY)
vectorizer.transform([TEXT]).todense()  # matrix([[0, 1, 0, 1]])

matrix([[0, 1, 0, 1]])

**Logistic Regression**
- For each (numeric) feature, the LogisticRegression model has an associated weight.
- When classifying a feature vector, we multiply the features by their weights and sum the results (w * x). We then apply a non-linear function (the sigmoid) to this value, which maps the result into the **[0, 1]** range.
- If the resulting value is in the [0, 0.5) range, then the data point belongs to the first class. If it is in the (0.5, 1] range, then the data point belongs to the second class.
- The tricky part is figuring out the weights of the model. A common way to do this is the **Gradient Descent method**. This is called training the model. We optimize the weights so that the model's predictions disagree with the training labels as little as possible.
- Gradient Descent is an iterative algorithm. It is not a closed-form formula we can apply; it is an exploratory process that repeatedly adjusts the weights in search of the best-suited values. There are other training methods, but Gradient Descent and its variants are widely used in practice.
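
To make the bullets above concrete, here is a minimal pure-Python sketch of a logistic regression unit trained with plain batch gradient descent. It illustrates the idea only: Scikit-Learn's LogisticRegression uses other solvers by default, and the toy data, the two-word vocabulary, the learning rate and the number of epochs are all assumptions chosen for the example.

```python
import math

def sigmoid(z):
    # Maps any real number into the (0, 1) range
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    # Weighted sum of the features plus a bias, squashed by the sigmoid
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train(X, y, lr=0.5, epochs=200):
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, target in zip(X, y):
            # Gradient of the log loss with respect to the weighted sum
            error = predict_proba(weights, bias, x) - target
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        # Take a small step against the average gradient
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Toy count vectors over a hypothetical vocabulary ['good', 'bad']
X = [[1, 0], [2, 0], [0, 1], [0, 2]]
y = [1, 1, 0, 0]

weights, bias = train(X, y)
predictions = [1 if predict_proba(weights, bias, x) > 0.5 else 0 for x in X]
print(predictions)  # [1, 1, 0, 0]
```

After training, the weight for 'good' ends up positive and the weight for 'bad' negative, which is what pushes the output above or below the 0.5 threshold.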

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import train_test_split
 
# Shuffle the data and then split it, keeping 20% aside for testing
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_review'], df['sentiment'], test_size=0.2)
 
vectorizer = CountVectorizer(lowercase=True)
vectorizer.fit(X_train)
 
# classifier = Perceptron()
classifier = LogisticRegression()
classifier.fit(vectorizer.transform(X_train), y_train)
 
print("Score:", classifier.score(vectorizer.transform(X_test), y_test))  # around 0.88; varies with the random split

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Score: 0.8838


**Going from training a LogisticRegression model to training a NeuralNetwork is easy peasy with Scikit-Learn**
- Notice the change made: we use MLPClassifier instead of LogisticRegression.
- Let’s talk about the hidden_layer_sizes parameter. A neural network consists of layers. Every neural network has an input layer (size equal to the number of features) and an output layer (size determined by the number of classes). Between these two layers, there can be any number of hidden layers. The sizes of the hidden layers are a parameter. In this case we’ve only used a single hidden layer of 50 units. Layers are composed of hidden units (or neurons). Each hidden unit is basically a LogisticRegression unit (with some notable differences, but close enough). This means that there are 50 LogisticRegression units doing their own thing.
- A neural network is trained via backpropagation: the errors are propagated from the output layer back toward the input layer, and the weights are adjusted incrementally along the way.
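
The bullets above can be sketched in plain Python: a forward pass through one hidden layer of logistic units, followed by a single backpropagation step on one example. This is an illustration under assumptions, not what MLPClassifier does internally (its defaults are ReLU hidden units and the Adam optimizer). The tiny 2-2-1 network, the starting weights and the learning rate are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Each hidden unit: weighted sum of the inputs plus a bias, then sigmoid
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # Output unit: the same computation over the hidden activations
    p = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, p

def log_loss(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def backprop_step(x, y, W1, b1, W2, b2, lr=0.5):
    h, p = forward(x, W1, b1, W2, b2)
    d_out = p - y  # error at the output unit
    # Propagate the error back to each hidden unit through the output weights
    d_hidden = [d_out * w * hi * (1 - hi) for w, hi in zip(W2, h)]
    # Adjust every weight a little, against its gradient
    new_W2 = [w - lr * d_out * hi for w, hi in zip(W2, h)]
    new_b2 = b2 - lr * d_out
    new_W1 = [[w - lr * d * xi for w, xi in zip(row, x)]
              for row, d in zip(W1, d_hidden)]
    new_b1 = [b - lr * d for b, d in zip(b1, d_hidden)]
    return new_W1, new_b1, new_W2, new_b2

# One training example and hypothetical starting weights
x, y = [1.0, 0.0], 1
W1, b1 = [[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0]
W2, b2 = [0.5, -0.5], 0.0

_, p_before = forward(x, W1, b1, W2, b2)
W1, b1, W2, b2 = backprop_step(x, y, W1, b1, W2, b2)
_, p_after = forward(x, W1, b1, W2, b2)

# The loss on this example drops after the update
print(log_loss(p_before, y), log_loss(p_after, y))
```

Real training repeats this step over many examples and many passes through the data, which is what `classifier.fit` takes care of.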

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
 
# Shuffle the data and then split it, keeping 20% aside for testing
# X - features (review text), y - targets (sentiment labels)
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_review'], df['sentiment'], test_size=0.2)
 
vectorizer = CountVectorizer(lowercase=True)
vectorizer.fit(X_train)
 
# Play around with the number and sizes of the hidden layers to see how the score changes
classifier = MLPClassifier(hidden_layer_sizes=(50,))
classifier.fit(vectorizer.transform(X_train), y_train)
 
print("Score:", classifier.score(vectorizer.transform(X_test), y_test))  # around 0.88; varies with the random split



Score: 0.8786


**Test the model**

In [8]:
reviews = ['Best movie ever', 'I have watched better', 'This was an amazing movie']
classifier.predict(vectorizer.transform(reviews))

array([1, 0, 1])