# Movie Rating Prediction

Cody directed a movie that recently premiered, and it has received a wide variety of reviews from viewers. With so many viewers, it would be hard for Cody to analyse every review by hand, so he wants a sentiment analysis model that takes a viewer's review as input and outputs the sentiment associated with it:
- pos if the review expresses positive sentiment
- neg if the review expresses negative sentiment

Cody has already collected reviews of all his previous movies, and they have been labelled manually. The task is to train a model on this data and build a movie review sentiment analysis model.

The data consists of three files:
* Train.csv - the training set, containing both the review and its sentiment (pos/neg)
* Test.csv - the test set, containing only reviews; your model will be scored on this set
* Sample_submission.csv - shows the format of your submission. It has two columns, Id and label: Id is just an index number, and label is either pos or neg.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
data=pd.read_csv("train.csv")

In [3]:
data.columns

Index(['review', 'label'], dtype='object')

## Label Encoding of Y

In [4]:
from sklearn.preprocessing import LabelEncoder 

In [5]:
le=LabelEncoder()

In [6]:
data.label=le.fit_transform(data.label)

In [9]:
le.transform(['pos','neg'])  # LabelEncoder expects a 1-D sequence of labels

array([1, 0], dtype=int64)

In [8]:
le.inverse_transform([0,1])

array(['neg', 'pos'], dtype=object)
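
The mapping shown above (neg to 0, pos to 1) follows from `LabelEncoder` sorting the distinct labels before numbering them. A minimal hand-rolled sketch of the same idea, using made-up toy labels rather than the contents of Train.csv:

```python
# Toy labels for illustration only; these are not rows from Train.csv.
labels = ["pos", "neg", "neg", "pos"]

# Sort the distinct classes, then number them in that order.
classes = sorted(set(labels))
to_int = {c: i for i, c in enumerate(classes)}

encoded = [to_int[label] for label in labels]   # string labels -> integers
decoded = [classes[i] for i in encoded]         # integers -> string labels

print(to_int)    # {'neg': 0, 'pos': 1}
print(encoded)   # [1, 0, 0, 1]
```

Decoding is just a lookup in the sorted class list, which is why `le.inverse_transform` can recover pos/neg from the model's integer predictions later on.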

In [11]:
data.head()

| | review | label |
|---|---|---|
| 0 | mature intelligent and highly charged melodram... | 1 |
| 1 | http://video.google.com/videoplay?docid=211772... | 1 |
| 2 | Title: Opera (1987) Director: Dario Argento Ca... | 1 |
| 3 | I think a lot of people just wrote this off as... | 1 |
| 4 | This is a story of two dogs and a cat looking ... | 1 |


## Bag of Words

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
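from nltk.corpus import stopwords
# A set makes the membership test in filter_words fast.
sw=set(stopwords.words("english"))

from nltk.stem import PorterStemmer
ps=PorterStemmer()

from nltk.tokenize import RegexpTokenizer
# Keep runs of ASCII letters only; digits and punctuation are dropped.
regex=RegexpTokenizer("[a-zA-Z]+")

def filter_words(x):
    # Drop stopwords, then reduce each remaining token to its stem.
    return [ps.stem(i) for i in x if i not in sw]

def my_tokenizer(x):
    # Split the raw text into word tokens, then filter and stem them.
    y=regex.tokenize(x)
    return filter_words(y)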

In [13]:
my_tokenizer("you are my sunshine my only hope")

['sunshin', 'hope']
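
To see what the regular expression and the stopword filter each contribute, here is a rough standard-library sketch. The name `toy_tokenizer` and the tiny `toy_stopwords` set are hypothetical stand-ins for the NLTK pieces above, and stemming is left out, so the tokens keep their full spelling:

```python
import re

# Hypothetical miniature stopword set; the notebook uses NLTK's English list.
toy_stopwords = {"you", "are", "my", "only", "the", "a"}

def toy_tokenizer(text):
    # Lowercase first so the stopword check is not case-sensitive,
    # then keep only runs of ASCII letters (same pattern as above).
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return [t for t in tokens if t not in toy_stopwords]

print(toy_tokenizer("You are my sunshine, my only hope!"))
# ['sunshine', 'hope']
print(toy_tokenizer("Title: Opera (1987) Director: Dario Argento"))
# ['title', 'opera', 'director', 'dario', 'argento']
```

The second call shows that digits and punctuation such as "(1987)" never reach the vocabulary, since the pattern matches letters only.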

In [16]:
cv=CountVectorizer(tokenizer=my_tokenizer)

In [17]:
X=cv.fit_transform(data.review)

In [None]:
test=pd.read_csv("Test.csv")
X_test=cv.transform(test.review)
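
The cells above fit the vocabulary on the training reviews and only transform the test reviews. A small standard-library sketch of that idea, on a made-up two-document corpus (the names `docs`, `vocab` and `vectorize` are illustrative, not part of the notebook):

```python
from collections import Counter

# Toy training corpus, for illustration only.
docs = ["good movie good plot", "bad movie"]

# "Fit": the vocabulary is the sorted set of tokens seen in training.
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

def vectorize(text):
    # "Transform": count each vocabulary term; unseen words are ignored.
    counts = Counter(text.split())
    return [counts[term] for term in vocab]

print(vocab)                       # ['bad', 'good', 'movie', 'plot']
print([vectorize(d) for d in docs])  # [[0, 2, 1, 1], [1, 0, 1, 0]]
print(vectorize("great movie"))    # [0, 0, 1, 0]
```

Because the columns are fixed at fit time, a word that appears only in the test data ("great" here) contributes nothing, and train and test vectors stay the same width.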


## Tfidf Vectorizer

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [30]:
tfidf=TfidfVectorizer(tokenizer=my_tokenizer)

In [31]:
X=tfidf.fit_transform(data.review)

In [32]:
test=pd.read_csv("Test.csv")
X_test=tfidf.transform(test.review)
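
For intuition about how these weights differ from raw counts, the sketch below computes TF-IDF by hand on a toy corpus. It assumes scikit-learn's default settings (raw term counts, smoothed idf of ln((1 + n) / (1 + df)) + 1, then L2 normalisation of each row); the variable names are illustrative only:

```python
import math
from collections import Counter

# Toy pre-tokenized corpus, for illustration only.
docs = [["good", "movie"], ["bad", "movie"]]
vocab = sorted({t for d in docs for t in d})   # ['bad', 'good', 'movie']
n = len(docs)

# Document frequency: in how many documents each term appears.
df = {t: sum(t in d for d in docs) for t in vocab}

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    tf = Counter(doc)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty rows
    return [v / norm for v in raw]

rows = [tfidf_row(d) for d in docs]
print(idf)
print(rows)
```

"movie" occurs in every toy document, so its idf is the minimum value of 1.0, while "good" and "bad" receive a higher weight. That down-weighting of ubiquitous words is the main difference from the bag-of-words counts.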


In [25]:
Y=data.label.values
Y

array([1, 1, 1, ..., 0, 1, 1])

## Multinomial Naive Bayes

In [19]:
from sklearn.naive_bayes import MultinomialNB

In [20]:
mnb=MultinomialNB()

In [33]:
mnb.fit(X,Y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
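
The `alpha=1.0` and `fit_prior=True` in the output above correspond to Laplace smoothing of the per-class term counts and to class priors estimated from the training labels. A compact pure-Python sketch of that computation on toy counts follows; `fit_mnb` and `predict_mnb` are hypothetical helper names, and this is meant as an illustration of the technique rather than scikit-learn's actual implementation:

```python
import math

# Toy count matrix: rows are documents, columns are two vocabulary terms.
X_toy = [[2, 0], [1, 0], [0, 3]]
y_toy = [1, 1, 0]

def fit_mnb(X, y, alpha=1.0):
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_lik = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        # Empirical class prior.
        log_prior[c] = math.log(len(rows) / len(X))
        # Per-term counts within the class, smoothed by alpha.
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + alpha * n_features
        log_lik[c] = [math.log((t + alpha) / denom) for t in totals]
    return classes, log_prior, log_lik

def predict_mnb(model, x):
    classes, log_prior, log_lik = model
    # Score = log prior + sum of count-weighted log likelihoods.
    scores = {c: log_prior[c] + sum(n * w for n, w in zip(x, log_lik[c]))
              for c in classes}
    return max(scores, key=scores.get)

model = fit_mnb(X_toy, y_toy)
print(predict_mnb(model, [1, 0]))   # 1
print(predict_mnb(model, [0, 2]))   # 0
```

Without smoothing, a term never seen in a class would have probability zero and its log would be undefined; adding alpha to every count avoids that.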

In [34]:
Yp=mnb.predict(X_test)

In [35]:
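# Map the integer predictions back to the original pos/neg labels.
Ans=le.inverse_transform(Yp)
Main=pd.DataFrame(Ans)
# Write the submission: the row index becomes the Id column, the predictions the label column.
Main.to_csv("Ans.csv",header=["label"],index_label="Id")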
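
As a quick check of the submission layout (an Id column followed by a label column), the sketch below builds the same two-column CSV text in memory with the standard library. The `preds` list is a made-up stand-in for the decoded predictions:

```python
import csv
import io

# Stand-in predictions, for illustration only.
preds = ["pos", "neg", "pos"]

buffer = io.StringIO()   # in-memory file, so nothing is written to disk
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Id", "label"])
for i, label in enumerate(preds):
    writer.writerow([i, label])   # Id is just the running row index

submission_text = buffer.getvalue()
print(submission_text)
```

Comparing the first few lines of Ans.csv against Sample_submission.csv in the same way is a cheap guard against a misnamed column.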