# Data Description

In this project,we have a client who has a website where people write different reviews for technical products.Now they are adding a new feature to their website i.e. The reviewer will have to add stars(rating)as well with the review. The rating is out 5 stars and it only has 5 options available 1 star, 2 stars,3 stars, 4 stars, 5 stars. Now they want to predict ratings for the reviews which were written in the past and they don’t have a rating. 

So, we have to build an application which can predict the rating by seeing the review.

In [14]:
#Importing important libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [15]:
#Loading the dataset
df=pd.read_excel('Rating_prediction.xlsx')

In [16]:
#Checking the first 5 rows
df.head()

Unnamed: 0,Index,Reviews,Ratings
0,1,"'Very good product',",5
1,2,"'Great Birthday Gift...Loved it!',",5
2,3,"'FREAKING AWESOME',",5
3,4,"'Very Nice Phone',",5
4,5,"'Good phone for the price',",5


In [17]:
df.columns

Index([' Index', 'Reviews', 'Ratings'], dtype='object')

In [18]:
#Dropping the column 'Index'
df.drop(columns=[' Index'],axis=1,inplace=True)

In [19]:
df.head()

Unnamed: 0,Reviews,Ratings
0,"'Very good product',",5
1,"'Great Birthday Gift...Loved it!',",5
2,"'FREAKING AWESOME',",5
3,"'Very Nice Phone',",5
4,"'Good phone for the price',",5


In [20]:
df.shape #Checking the shape of the dataset

(2129, 2)

In [21]:
#New column for length of reviews 
df['length']=df.Reviews.str.len()
df.head()

Unnamed: 0,Reviews,Ratings,length
0,"'Very good product',",5,20
1,"'Great Birthday Gift...Loved it!',",5,35
2,"'FREAKING AWESOME',",5,20
3,"'Very Nice Phone',",5,19
4,"'Good phone for the price',",5,28


PREPROCESSING:

In [22]:
#Convert all reviews to lower case
df['Reviews']=df['Reviews'].str.lower()

In [23]:
df.head()

Unnamed: 0,Reviews,Ratings,length
0,"'very good product',",5,20
1,"'great birthday gift...loved it!',",5,35
2,"'freaking awesome',",5,20
3,"'very nice phone',",5,19
4,"'good phone for the price',",5,28


In [24]:
#Regular expressions

#Replace numbers with 'numbr'
df['Reviews']=df['Reviews'].str.replace(r'\d+(\.\d+)?','numbr')

In [25]:
#Remove punctuations
df['Reviews']=df['Reviews'].str.replace(r'[^\w\d\s]','')

#Replace white space between terms with a single space
df['Reviews']=df['Reviews'].str.replace(r'\s+','')

#Remove leading and trailing whitespace
df['Reviews']=df['Reviews'].str.replace(r'^\s+|\s+?$','')

In [26]:
df.head()

Unnamed: 0,Reviews,Ratings,length
0,verygoodproduct,5,20
1,greatbirthdaygiftlovedit,5,35
2,freakingawesome,5,20
3,verynicephone,5,19
4,goodphonefortheprice,5,28


In [28]:
#Remove stopwords
#Importing some libraries for NLP
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words=set(stopwords.words('english')+['u','ur','4','2','im','dont','doin','use'])
df['Reviews']=df['Reviews'].apply(lambda x:''.join(term for term in x.split() if term not in stop_words))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Neeti\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [29]:
#New column(clean_length)after punctuations,stopwords removal
df['clean_length']=df.Reviews.str.len()

In [30]:
df.head()

Unnamed: 0,Reviews,Ratings,length,clean_length
0,verygoodproduct,5,20,15
1,greatbirthdaygiftlovedit,5,35,24
2,freakingawesome,5,20,15
3,verynicephone,5,19,13
4,goodphonefortheprice,5,28,20


In [31]:
#Total length removal
print('origin length',df.length.sum())
print('clean length',df.clean_length.sum())

origin length 72233
clean length 53032


MODEL BUILDING

In [32]:
#Convert text into vectors using TF-IDF
#Instantiate MultinomialNB classifier
#Split feature and Ratings

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

tf_vec=TfidfVectorizer()
naive=MultinomialNB()
features=tf_vec.fit_transform(df['Reviews'])
X=features
y=df['Ratings']

In [33]:
#Train and predict
X_train,x_test,Y_train,y_test=train_test_split(X,y,random_state=42,)
naive.fit(X_train,Y_train)
y_pred=naive.predict(x_test)
print('Final score=>',accuracy_score(y_test,y_pred))

Final score=> 0.5984990619136961


In [34]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00        26
           2       0.67      0.15      0.25        53
           3       1.00      0.07      0.13        71
           4       0.55      0.28      0.37        92
           5       0.60      0.96      0.74       291

    accuracy                           0.60       533
   macro avg       0.56      0.29      0.30       533
weighted avg       0.62      0.60      0.51       533



In [35]:
#plot confusion matrix heatmap
conf_mat=confusion_matrix(y_test,y_pred)
conf_mat

array([[  0,   2,   0,   0,  24],
       [  0,   8,   0,   1,  44],
       [  0,   1,   5,   9,  56],
       [  0,   1,   0,  26,  65],
       [  0,   0,   0,  11, 280]], dtype=int64)