# Film Reviews

### Goal: 
You are a Machine Learning Engineer at XYZ. You are given reviews of a set of films, and your goal is to build a classification model that predicts whether each review is negative or positive. The prediction serves as one of the deciding factors for customers choosing whether to watch a film.

### Data Description: 
The data consists of reviews of films from multiple genres and in different languages.

### Attribute Information:
- id: Unique identifier for each row.
- category: The review label. 0 represents a positive review and 1 represents a negative review.
- text: The review text, which has been tokenized (each review is a sequence of integer tokens) in both the training and testing sets.

### Objective Of The Problem: 
The objective is to predict the category of each “text” entry in the test file and write the predictions to a CSV (Comma Separated Values) file along with the “id” attribute. A one-to-one mapping exists between the attributes of each row, and it must be preserved. Upload the predicted solution file using the upload file field below and click on submit to get a score. See the sample submission file for how the solution file must be written; the headers of the uploaded solution file must match the headers of the sample submission file.

### Evaluation Criteria: 
The evaluation metric for this problem statement is precision-based accuracy. All scores are normalized to 100. If predictions are made for “y” tuples and “x” of them are correct, the score is (x/y * 100).
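The formula above amounts to the percentage of rows predicted correctly. A minimal sketch of that calculation follows; the function name `score` and the toy labels are illustrative assumptions, not part of the competition tooling.

```python
def score(y_true, y_pred):
    """Percentage of rows predicted correctly: x / y * 100."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot score an empty set of predictions")
    # x = number of matching predictions, y = number of rows
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true) * 100


# Toy labels (made up for illustration): 4 of the 5 predictions match.
example_true = [0, 0, 1, 0, 1]
example_pred = [0, 0, 1, 0, 0]
example_score = score(example_true, example_pred)
print(example_score)
```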



In [287]:
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt

In [288]:
train = pd.read_csv(r"D:\denis\training_nlp skillnza.csv")

test = pd.read_csv(r"D:\denis\test_nlp skillnza.csv")

In [289]:
print(train.category.value_counts())
print("----------------------")
print(train.shape)

0    3348
1     116
Name: category, dtype: int64
----------------------
(3464, 3)
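The counts above show a heavy class imbalance, so it helps to know what a model that always predicts the majority class would score. The sketch below computes that baseline from the printed counts; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 3348, 1: 116}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total * 100

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.2f}%")
```

Any model accuracy reported on this data is easier to interpret next to this baseline, which is already above 96%.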


In [290]:
train.head()

| | id | category | text |
|---|---|---|---|
| 0 | 959 | 0 | 5573 1189 4017 1207 4768 8542 17 1189 5085 5773 |
| 1 | 994 | 0 | 6315 7507 6700 4742 1944 2692 3647 4413 6700 |
| 2 | 995 | 0 | 5015 8067 5335 1615 7957 5773 |
| 3 | 996 | 0 | 2925 7199 1994 4647 7455 5773 4518 2734 2807 8... |
| 4 | 997 | 0 | 7136 1207 6781 237 4971 3669 6193 |


In [291]:
test.head()

| | id | text |
|---|---|---|
| 0 | 3729 | 2705 4888 5050 5815 2472 5157 652 2117 2110 32... |
| 1 | 3732 | 389 4978 315 5178 513 5249 5853 3267 315 3891 ... |
| 2 | 3761 | 4478 4231 4858 2638 4231 867 371 686 4888 4179... |
| 3 | 5 | 3015 1911 112 3905 825 337 315 1693 4677 825 5... |
| 4 | 7 | 5136 3918 5153 2023 3091 4159 315 3711 1409 27... |


In [292]:
from sklearn.feature_extraction.text import CountVectorizer

bow_transformer = CountVectorizer().fit(train['text'])

print(len(bow_transformer.vocabulary_))

8484
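One detail worth checking with integer-token text: the default `token_pattern` of `CountVectorizer` only keeps tokens of two or more characters, so any single-digit token IDs would be dropped from the vocabulary. Whether the real data contains such tokens is not shown here. The sketch below illustrates the behaviour on made-up toy reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy reviews made of integer tokens, in the style of the `text` column.
toy_docs = ["5 17 1189", "7 17 5773"]

# Default pattern: only tokens with two or more characters are kept.
default_vocab = set(CountVectorizer().fit(toy_docs).vocabulary_)

# Alternative pattern that also accepts single-character tokens.
full_vocab = set(
    CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(toy_docs).vocabulary_
)

print(sorted(default_vocab))
print(sorted(full_vocab))
```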


In [293]:
# Transform the training text into a document-term matrix
train_bow = bow_transformer.transform(train.text)
train_bow.shape


(3464, 8484)

In [294]:
bow_transformer.vocabulary_

{'5573': 4986,
 '1189': 208,
 '4017': 3288,
 '1207': 229,
 '4768': 4112,
 '8542': 8230,
 '17': 765,
 '5085': 4456,
 '5773': 5203,
 '6315': 5798,
 '7507': 7103,
 '6700': 6219,
 '4742': 4084,
 '1944': 1034,
 '2692': 1845,
 '3647': 2886,
 '4413': 3722,
 '5015': 4383,
 '8067': 7713,
 '5335': 4726,
 '1615': 674,
 '7957': 7593,
 '2925': 2101,
 '7199': 6763,
 '1994': 1089,
 '4647': 3980,
 '7455': 7046,
 '4518': 3838,
 '2734': 1891,
 '2807': 1972,
 '853': 8216,
 '2283': 1402,
 '753': 7128,
 '5107': 4481,
 '5922': 5368,
 '4355': 3658,
 '6054': 5514,
 '4608': 3938,
 '2199': 1311,
 '142': 460,
 '4211': 3502,
 '8103': 7754,
 '7747': 7365,
 '7136': 6695,
 '6781': 6307,
 '237': 1497,
 '4971': 4334,
 '3669': 2910,
 '6193': 5664,
 '6730': 6252,
 '3349': 2563,
 '2325': 1448,
 '7714': 7330,
 '5172': 4549,
 '6254': 5730,
 '4097': 3376,
 '6500': 6002,
 '6927': 6466,
 '5045': 4414,
 '3533': 2766,
 '3255': 2459,
 '3225': 2427,
 '6110': 5576,
 '6912': 6450,
 '7732': 7349,
 '3283': 2490,
 '2698': 1851,
 '220'

In [295]:
# Splitting into training and hold-out sets
from sklearn.model_selection import train_test_split

In [298]:
train_x, test_x, train_y, test_y = train_test_split(
    train_bow, train.category, test_size=0.25, random_state=40
)

In [299]:
print(train_x.shape)
print(test_x.shape)
print(train_y.shape)
print(test_y.shape)

(2598, 8484)
(866, 8484)
(2598,)
(866,)
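With so few class-1 rows, a purely random split can shift the class ratio between the two parts. In this notebook the hold-out set ends up with 19 class-1 rows out of 866 (see the support column of the classification report), about 2.2% against 3.3% in the full data. One option is the `stratify` argument of `train_test_split`, sketched below on synthetic data. The arrays here are made up for illustration, and stratifying is a suggestion rather than what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 3% of them labelled 1.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(1000, 20))
y = np.array([1] * 30 + [0] * 970)

# stratify=y keeps the class ratio (approximately) equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=40, stratify=y
)

print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in hold-out:", y_te.mean())
```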


## Model building with Naive Bayes

In [300]:
# Building Model using naive bayes
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB(alpha=0.0035, fit_prior=True, class_prior=None,)

In [301]:
model = nb.fit(train_x,train_y)

In [302]:
from sklearn.metrics import confusion_matrix

pred_value = nb.predict(test_x)

In [303]:
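# Note: arguments are passed as (predicted, true), so rows are predictions
# and columns are true labels (the transpose of scikit-learn's usual layout).
cm = confusion_matrix(pred_value, test_y)
cm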

array([[835,   2],
       [ 12,  17]], dtype=int64)
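scikit-learn documents the signature as `confusion_matrix(y_true, y_pred)`, with rows for true labels and columns for predictions. Because the call above passes the predictions first, the printed matrix is the transpose of that layout. The sketch below rebuilds label vectors from the four printed counts (the vectors themselves are reconstructed for illustration, not the real hold-out labels) and shows both orientations.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Label vectors rebuilt from the four counts printed above
# (in that output, rows are predictions and columns are true labels).
y_true = np.array([0] * 835 + [1] * 2 + [0] * 12 + [1] * 17)
y_pred = np.array([0] * 835 + [0] * 2 + [1] * 12 + [1] * 17)

cm_as_called = confusion_matrix(y_pred, y_true)   # argument order used in the notebook
cm_documented = confusion_matrix(y_true, y_pred)  # documented order: (y_true, y_pred)

print(cm_as_called)
print(cm_documented)
```

The diagonal, and therefore the accuracy, is the same in both layouts. Only the off-diagonal cells swap, which matters when reading false positives and false negatives.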

In [304]:
# Overall accuracy in percent
cm.diagonal().sum() * 100 / cm.sum()

98.38337182448036

In [305]:
from sklearn.metrics import classification_report

In [306]:
print(classification_report(test_y, pred_value))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99       847
           1       0.59      0.89      0.71        19

    accuracy                           0.98       866
   macro avg       0.79      0.94      0.85       866
weighted avg       0.99      0.98      0.99       866

## Model building with Decision Tree

In [307]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# `min_impurity_split` and `presort` are not passed: both parameters were
# removed from scikit-learn and raise a TypeError in current releases.
# random_state is left as None, so results can differ between runs.
dt = DecisionTreeClassifier(
    criterion='entropy',
    splitter='best',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=2,
    min_weight_fraction_leaf=0.0001,
    max_features=None,
    random_state=None,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    class_weight=None,
    ccp_alpha=0.00,
)

In [308]:
model1 = dt.fit(train_x, train_y)
pred_dt = dt.predict(test_x)

# Rows are predictions and columns are true labels, as in the Naive Bayes matrix.
tab_dt = confusion_matrix(pred_dt, test_y)
tab_dt

array([[843,   6],
       [  4,  13]], dtype=int64)

In [309]:
Accuracy = tab_dt.diagonal().sum() / tab_dt.sum()*100
Accuracy

98.84526558891456
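Accuracy alone hides how each model treats the rare class 1 (negative reviews). The sketch below takes the two confusion matrices exactly as printed in this notebook and derives precision, recall and F1 for class 1. The helper name `class1_metrics` is illustrative, and the Decision Tree figures come from a single unseeded run, so they may not reproduce exactly.

```python
import numpy as np

# Confusion matrices as printed in the notebook (rows = predicted, columns = true).
printed = {
    "naive_bayes": np.array([[835, 2], [12, 17]]),
    "decision_tree": np.array([[843, 6], [4, 13]]),
}


def class1_metrics(cm_pred_rows):
    """Accuracy plus precision, recall and F1 for class 1."""
    tp = cm_pred_rows[1, 1]
    fp = cm_pred_rows[1, 0]  # predicted 1, actually 0
    fn = cm_pred_rows[0, 1]  # predicted 0, actually 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = np.trace(cm_pred_rows) / cm_pred_rows.sum()
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


summary = {name: class1_metrics(cm) for name, cm in printed.items()}
for name, metrics in summary.items():
    print(name, {k: round(float(v), 3) for k, v in metrics.items()})
```

On these numbers the Decision Tree has the higher accuracy and class-1 precision, while Naive Bayes recovers more of the 19 negative reviews in the hold-out set.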

In [310]:
# Transform the test text into a document-term matrix,
# reusing the vocabulary fitted on the training data
test1 = bow_transformer.transform(test.text)
test1

<1360x8484 sparse matrix of type '<class 'numpy.int64'>'
	with 27169 stored elements in Compressed Sparse Row format>

In [311]:
test1.shape

(1360, 8484)

In [313]:
pred_value = dt.predict(test1)
pred_value

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [314]:
final_submission10 = pd.DataFrame({"id":test["id"],
                               "category":pred_value})

In [315]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
final_submission10.to_csv(r"D:\skillinza\problem1/final_submission10.csv", index=False)
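Since the problem statement requires a one-to-one mapping with the test ids and fixed headers, a quick check before uploading can catch mistakes. The sketch below is one way to do it. The helper `check_submission` is hypothetical, and the expected headers `id` and `category` are assumed from the DataFrame built above, since the sample submission file is not shown here.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, test_ids: pd.Series) -> None:
    """Raise ValueError if the submission does not line up with the test file."""
    # Assumed headers; adjust to match the actual sample submission file.
    if list(submission.columns) != ["id", "category"]:
        raise ValueError(f"unexpected headers: {list(submission.columns)}")
    if len(submission) != len(test_ids):
        raise ValueError("row count differs from the test file")
    if not (submission["id"].to_numpy() == test_ids.to_numpy()).all():
        raise ValueError("ids are missing or out of order")
    if not submission["category"].isin([0, 1]).all():
        raise ValueError("category must be 0 or 1")


# Toy data using the first three test ids shown earlier; predictions are made up.
toy_ids = pd.Series([3729, 3732, 3761], name="id")
toy_submission = pd.DataFrame({"id": toy_ids, "category": [0, 0, 1]})
check_submission(toy_submission, toy_ids)  # passes silently
```

In the notebook this would be called as `check_submission(final_submission10, test["id"])` before `to_csv`.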