# The Movie reviews sentiment analysis challenge

# Problem Statement

Sentiment analysis is one of the most common text classification tasks: it analyses an incoming message and tells whether the underlying sentiment is positive, negative or neutral. You can input a sentence of your choice and gauge the underlying sentiment by playing with the demo [here](https://www.paralleldots.com/sentiment-analysis).

We have a dataset with a few columns, one of which is a review (or description). The idea is to determine the sentiment from the review text, that is, to figure out how positive or negative it is. Because the label measures a degree of sentiment, the problem could be framed as regression, but in this notebook we treat it as classification.


There are two files, `train.tsv` and `test.tsv`. We will train our models on the `train.tsv` dataset: first we split it into training data and testing data to measure model accuracy, then we take the best model and run it on `test.tsv`. We combine the phrases from both the train and test files into a single collection only for the purpose of vectorizing them. The model is never trained on the combined set.


# About the dataset


The snapshot of the data you will be working on:

<img src="movie.png">




# Why solve it?

Solving it will help you apply the following skills:

- Text preprocessing techniques like
    - Tokenization 
    - Stopword removal
    - Count vectorizer
    - Tf-idf vectorizer

- Implementation of 
    - Random forest
    - Naive Bayes
    - Linear SVM

## Load and Preprocess the data

Transforming text into something an algorithm can digest is a complicated process. We cannot feed the data in as it is; some preprocessing needs to be done. In this task we will preprocess our data into a form that we can feed to our model.

## Instructions
* Load the data, which is in `.tsv` format, into a dataframe using `pd.read_csv()` with parameters `path_train`, `sep="\t"` and `index_col=0` for the training data and store it in a variable `df_train` (`pd.DataFrame.from_csv()` has been removed from recent pandas releases). Similarly, load the test data and store it in a variable `df_test`

* Concatenate the column `Phrase` from `df_train` and `df_test` and store it in a new dataframe `phrases`

* Convert the `Phrase` column to lower case and assign it to a pandas series called `all_text`
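The loading steps above can be sketched on a tiny scale. The snippet below is only an illustration: the two in-memory TSV strings are made-up stand-ins for `train.tsv` and `test.tsv`, the `_demo` variable names are hypothetical, and `pd.read_csv` with `index_col=0` is assumed as the way to read the files in a recent pandas release.

```python
import io

import pandas as pd

# Made-up miniature stand-ins for train.tsv and test.tsv (not the real data)
train_tsv = (
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "1\t1\tA series of escapades\t1\n"
    "2\t1\tA series\t2\n"
)
test_tsv = (
    "PhraseId\tSentenceId\tPhrase\n"
    "3\t2\tAn intermittently pleasing film\n"
)

# index_col=0 makes PhraseId the index of each dataframe
df_train_demo = pd.read_csv(io.StringIO(train_tsv), sep="\t", index_col=0)
df_test_demo = pd.read_csv(io.StringIO(test_tsv), sep="\t", index_col=0)

# Stack the Phrase columns of both frames, then lower-case the text
phrases_demo = pd.concat([df_train_demo[["Phrase"]], df_test_demo[["Phrase"]]])
all_text_demo = phrases_demo["Phrase"].str.lower()

print(all_text_demo.tolist())
```

Because `PhraseId` is the index, a phrase can be looked up by its id, for example `all_text_demo.loc[2]`.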

## Hints
* Use `phrases = pd.concat([df_train[["Phrase"]], df_test[["Phrase"]]])` to concat the `Phrase` column from train and test dataframe

## Test case
* Variable declaration `df_train`, `df_test`, `phrases` and `all_text`
    - df_train.shape==(156060, 3)
    - df_test.shape==(66292, 2)
    - phrases.Phrase[5]=='series'
    - all_text[7]=='of'

In [69]:
import os
os.chdir('/home/greyatom/Desktop/sentiment analysis/')
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
from sklearn.naive_bayes import MultinomialNB


# Code starts here

# Load the training data (PhraseId becomes the index)
df_train = pd.read_csv("train.tsv", sep="\t", index_col=0)

# Load the testing data
df_test = pd.read_csv("test.tsv", sep="\t", index_col=0)

# Concat the 'Phrase' column from train and test dataset
phrases = pd.concat([df_train[["Phrase"]], df_test[["Phrase"]]])

# Shape of the new dataframe
print (phrases.shape[0]==(df_train.shape[0]+df_test.shape[0]))

# Convert all the phrases to lower case
all_text = phrases["Phrase"].str.lower()


True


In [62]:
df_train.head()

| PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|
| 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
| 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
| 3 | 1 | A series | 2 |
| 4 | 1 | A | 2 |
| 5 | 1 | series | 2 |


## TF-IDF Vectorizer
Apart from the Count vectorizer, an alternative way to calculate word frequencies, and by far the most popular method, is TF-IDF. This is an acronym that stands for “Term Frequency – Inverse Document Frequency”, the two components of the score assigned to each word.

* Term Frequency: This summarizes how often a given word appears within a document.
* Inverse Document Frequency: This downscales words that appear a lot across documents.

TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.
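To make the inverse document frequency idea concrete, here is a small sketch on three made-up sentences (the corpus and the `demo_` names are assumptions for illustration only). It reads the learned IDF weights from `TfidfVectorizer` and shows that a word appearing in every document is weighted lower than a word appearing in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus: "the" and "was" occur in every document
demo_docs = [
    "the movie was good",
    "the movie was bad",
    "the plot was thrilling",
]

demo_tfidf = TfidfVectorizer()
demo_weights = demo_tfidf.fit_transform(demo_docs)  # sparse matrix, one row per document

# Map each term to its learned inverse document frequency
demo_idf = {
    term: demo_tfidf.idf_[column]
    for term, column in demo_tfidf.vocabulary_.items()
}

for term in ("the", "movie", "thrilling"):
    print(term, round(demo_idf[term], 3))
```

The word `the` occurs in all three documents, `movie` in two and `thrilling` in one, so their IDF weights increase in that order.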

## Instructions
* Instantiate `TfidfVectorizer(stop_words="english")` and store it in a variable `tfidf`

* Fit `tfidf` on `all_text` data

* Transform `all_text` data using `tfidf` and store it in a variable `X`

* Store the data up to index 156060 in a variable `X_train` and the data from index 156060 onward in a variable `X_test`

* Store the `Sentiment` column from `df_train` in a variable `y`

* Split the train data into `X_train_train`, `X_train_test`,`y_train_train` and `y_train_test` and pass the parameters as `X_train`,`y`,`test_size`=0.3 and `random_state = 42`

## Hints
* Use `tfidf.transform(all_text)` to vectorize the text; the result is a sparse matrix that scikit-learn models accept directly, so converting it with `.toarray()` is not required

* Use `X_train_train, X_train_test, y_train_train, y_train_test = train_test_split(X_train,y,test_size=0.3, random_state=42)` to split the training data into train and test part.
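The slicing and splitting steps can be illustrated on toy data. This is a sketch under assumptions: the phrases, labels and `toy_`/`part_` names are made up, and the row count `n_train` stands in for the fixed index 156060 used in this task.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up labelled phrases (stand-in for train.tsv) and unlabelled ones (test.tsv)
toy_train_texts = [
    "good film", "great acting", "fun plot", "lovely story", "brilliant cast",
    "bad film", "dull acting", "boring plot", "weak story", "awful cast",
]
toy_train_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_test_texts = ["great story", "boring cast"]

n_train = len(toy_train_texts)

# Fit the vectorizer on all phrases so both parts share one vocabulary
toy_vec = TfidfVectorizer()
X_all = toy_vec.fit_transform(toy_train_texts + toy_test_texts)

# Rows before n_train are labelled training rows, the rest are test rows
X_labelled = X_all[:n_train]
X_unlabelled = X_all[n_train:]

# Hold out 30% of the labelled rows to estimate accuracy
part_train, part_valid, y_part_train, y_part_valid = train_test_split(
    X_labelled, toy_train_labels, test_size=0.3, random_state=42
)

print(part_train.shape, part_valid.shape, X_unlabelled.shape)
```

The matrices stay sparse throughout, which keeps memory use manageable when the vocabulary is large.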

## Test case

Variable declaration `tfidf`,`X`,`y`,`X_train_train`,`X_train_test`,`y_train_train`,`y_train_test`

In [70]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Code starts here
# TF-IDF vectorize all text after removing the 'stopwords'
tfidf = TfidfVectorizer(stop_words="english")
tfidf.fit(all_text)

# Transform the text into a sparse TF-IDF matrix
X = tfidf.transform(all_text)
# Train and test data
X_train = X[:156060]
X_test = X[156060:]

# New column 'y' assigned with the `sentiment` column from training data
y = df_train["Sentiment"]

# X_train and y shape match
print (X_train.shape[0]==y.shape[0])

# Dividing the train data into training and testing data
X_train_train, X_train_test, y_train_train, y_train_test = train_test_split(X_train,y,test_size=0.3, random_state=42)

True


## Naive Bayes classifier

A Naive Bayes classifier is a probabilistic machine learning model used for classification tasks. The crux of the classifier is Bayes' theorem, which gives the probability of A happening given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made is that the predictors/features are independent, that is, the presence of one particular feature does not affect the others. Hence it is called naive. We will use a Naive Bayes classifier and see how it performs on our data.
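As a rough illustration of the idea, the sketch below trains `MultinomialNB` on four made-up phrases with made-up binary labels (1 for positive, 0 for negative; the real dataset uses a different label set). Each word count acts as independent evidence for a class, and `predict_proba` returns the resulting posterior probability of each class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy phrases and labels, for illustration only
toy_docs = [
    "good great fun",
    "great acting good plot",
    "bad boring dull",
    "boring plot bad acting",
]
toy_labels = [1, 1, 0, 0]

# Word counts are the features (the evidence)
toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(toy_docs)

toy_nb = MultinomialNB()
toy_nb.fit(toy_X, toy_labels)

# Score two unseen phrases built from words in the toy vocabulary
new_X = toy_vec.transform(["good fun", "dull boring"])
toy_pred = toy_nb.predict(new_X)
toy_proba = toy_nb.predict_proba(new_X)  # one row per phrase, one column per class

print(toy_pred)
print(toy_proba.round(3))
```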

## Instructions
* Instantiate the `MultinomialNB()` into a variable `nb`

* Fit `nb` on `X_train_train` and `y_train_train`

* Store the values predicted by `nb` on `X_train_test` in a variable `y_pred`

* Find the accuracy score using `accuracy_score()` and store it in a variable `nb_accuracy`

## Hints
* Use `nb.fit(X_train_train,y_train_train)` to fit the model on train data

## Test case


In [72]:
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Code starts here

# Instantiate NB classifier
nb = MultinomialNB()

# Fit the NB classifier on the training split
nb.fit(X_train_train,y_train_train)

# Predictions of the test data
y_pred = nb.predict(X_train_test)

# Accuracy of the model
nb_accuracy = accuracy_score(y_train_test,y_pred)

## Count vectorizer with Support vector classifier and Naive bayes algorithm

Here we will fit a Count vectorizer, which counts word frequencies, on the data and use its output with two machine learning models, 'Support vector classifier' and 'Naive Bayes'. We will check which of the two models performs better and use that model on our final test data to get the predictions.
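For intuition about what the Count vectorizer produces, here is a minimal sketch on two made-up phrases (the `cv_demo` names are hypothetical). Each row of the result is a document and each column holds the raw count of one vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy phrases
cv_demo_docs = ["good good movie", "bad movie"]

cv_demo = CountVectorizer()
# toarray() is fine for a toy example; keep the matrix sparse for real data
cv_demo_counts = cv_demo.fit_transform(cv_demo_docs).toarray()

# vocabulary_ maps each word to its column index
cv_demo_vocab = cv_demo.vocabulary_

print(sorted(cv_demo_vocab, key=cv_demo_vocab.get))
print(cv_demo_counts)
```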


## Instructions

* Instantiate `CountVectorizer(stop_words="english")` to a variable `cv`

* Fit Count vectorizer i.e `cv` on `all_text` data

* Transform `all_text` data using `cv` and store the result in a variable `X`

* Store the data up to index 156060 in a variable `X_train` and the data from index 156060 onward in a variable `X_test`

* Split the train data into `X_train_train`, `X_train_test`,`y_train_train` and `y_train_test` and pass the parameters as `X_train`,`y`,`test_size`=0.3 and `random_state = 42`

* Fit `nb` i.e the Naive Bayes classifier model on `X_train_train` and `y_train_train`

* Store the values predicted by the `nb` model on `X_train_test` in a variable `y_pred`

* Store the accuracy score using `accuracy_score()` in a variable `nb_accuracy` and print it as well

* Instantiate the Support vector classifier model `LinearSVC()` and store it in a variable `svc`

* Fit the `svc` model on `X_train_train` and `y_train_train` and store the fitted model in a variable `model`

* Store the predicted values by `model` on `X_train_test` in a variable `y_pred`

* Store the accuracy score using `accuracy_score()` in a variable `svc_accuracy` and print it as well

## Hints

* Use `X_train_train, X_train_test, y_train_train, y_train_test = train_test_split(X_train,y,test_size=0.3, random_state=42)` to split the data and store it in the respective variables

* Use `svc.fit(X_train_train,y_train_train)` to fit the `svc` model on training data
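One possible way to organise the comparison is to loop over the candidate models and keep their accuracies in a dictionary. The sketch below uses made-up phrases and labels, fits the vectorizer on the toy training phrases only for brevity, and introduces hypothetical names (`candidates`, `scores`, `best_name`) that are not part of this task.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Made-up toy data: 1 = positive, 0 = negative
fit_docs = [
    "good fun film", "great acting", "fun and great",
    "bad boring film", "dull acting", "boring and bad",
]
fit_y = [1, 1, 1, 0, 0, 0]
valid_docs = ["great fun", "dull and boring"]
valid_y = [1, 0]

vec = CountVectorizer()
X_fit = vec.fit_transform(fit_docs)
X_valid = vec.transform(valid_docs)

# Candidate models, keyed by a readable name
candidates = {
    "naive_bayes": MultinomialNB(),
    "linear_svc": LinearSVC(random_state=42),
}

# Fit each model on the same split and record its held-out accuracy
scores = {}
for name, clf in candidates.items():
    clf.fit(X_fit, fit_y)
    scores[name] = accuracy_score(valid_y, clf.predict(X_valid))

best_name = max(scores, key=scores.get)
print(scores, best_name)
```

Keeping the split fixed across models means the accuracies are directly comparable.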

## Test case


In [78]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


# Code starts here

# Instantiate Count vectorizer
cv = CountVectorizer(stop_words="english")

# Fit the data on count vectorizer
cv.fit(all_text)

# Transform the text into a sparse count matrix
X = cv.transform(all_text)

# Train and test data
X_train = X[:156060]
X_test = X[156060:]
print (X_train.shape[0]==y.shape[0])

# Split the data into train and test data
X_train_train, X_train_test, y_train_train, y_train_test = train_test_split(X_train,y,test_size=0.3, random_state=42)

# Fit NB classifier on train data
nb.fit(X_train_train,y_train_train)

# Predicted result of test data using NB classifier
y_pred_nb = nb.predict(X_train_test)

# Accuracy of model
nb_accuracy = accuracy_score(y_train_test,y_pred_nb)

# Instantiate the LinearSVC() model
svc = LinearSVC()

# Fit svc classifier on train data
model = svc.fit(X_train_train,y_train_train)

# Predicted result of test data using svc
y_pred_svc = model.predict(X_train_test)

# Accuracy of the svc model
svc_accuracy = accuracy_score(y_train_test,y_pred_svc)

True


## Using the best model on test data

In the previous tasks we trained and evaluated our models only on the training data, so the test data was never exposed to the final model. We also saw that Linear SVC with Count Vectorizer gives the best accuracy. We will now run this model on the test data and get the output.

## Instructions
* Store the values predicted by `model` on `X_test` in a variable `y_pred`

* Append a new column `predicted_sentiment` to the test data `df_test`

## Note : 
A code snippet to convert your predicted dataframe to an Excel sheet is shared with you. This will be helpful during hackathons and competitions.
```python
# The context manager saves and closes the file on exit
with pd.ExcelWriter('test_output.xlsx') as writer:
    df_test.to_excel(writer, sheet_name='Sheet1')
```



## Hints
* Use `df_test["predicted_sentiment"] = y_pred` to store the predicted values in a new column `predicted_sentiment`
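The column assignment can be tried on a toy dataframe first. In this sketch the phrases, the `PhraseId` values and the predictions are all made up, and the result is written as CSV to an in-memory buffer, which is an assumption chosen because writing Excel files requires an extra engine such as `openpyxl`.

```python
import io

import pandas as pd

# Made-up stand-in for df_test, indexed by PhraseId
df_demo = pd.DataFrame(
    {"Phrase": ["an intermittently pleasing film", "a dull plot"]},
    index=pd.Index([156061, 156062], name="PhraseId"),
)

# Made-up predictions, one per row, in the same order as the dataframe
demo_pred = [3, 1]
df_demo["predicted_sentiment"] = demo_pred

# Write to an in-memory buffer; pass a file path instead to save to disk
buffer = io.StringIO()
df_demo.to_csv(buffer)
csv_text = buffer.getvalue()

print(csv_text)
```

The assignment relies on the predictions being in the same row order as the dataframe, which holds when they come from `model.predict` on the matching rows.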

## Test case


In [81]:
# Run the best model on the test data
y_pred = model.predict(X_test)

# Append a column 'predicted_sentiment' to the test data
df_test["predicted_sentiment"] = y_pred

# Convert your dataframe into an excel sheet and save it


## END OF NOTEBOOK