## Portfolio Part 3

In this Portfolio task, you will continue working with the dataset you used in Portfolio 2. The difference is that the rating column now contains like or dislike values. Your task is to train classification models to predict whether a user likes or dislikes an item.


The header of the csv file is shown below. 

| userId | timestamp | review | item | rating | helpfulness | gender | category |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
    
#### Description of Fields

* __userId__ - the user's id
* __timestamp__ - the timestamp indicating when the user rated the shopping item
* __review__ - the user's review comments of the item
* __item__ - the name of the item
* __rating__ - whether the user likes or dislikes the item
* __helpfulness__ - the average rating from other users of how helpful the review comment is (6 = helpful, 0 = not helpful)
* __gender__ - the gender of the user (F = female, M = male)
* __category__ - the category of the shopping item


Your high-level goal in this notebook is to build and evaluate predictive models for 'rating', that is, to predict the value of the __rating__ field from some of the other fields. More specifically, you need to complete the following major steps: 
1) Explore the data. Clean the data if necessary. For example, remove abnormal instances and replace missing values.
2) Convert object features into numeric features by using an encoder.
3) Study the correlation between these features. 
4) Split the dataset and train a logistic regression model to predict 'rating' based on other features. Evaluate the accuracy of your model.
5) Split the dataset and train a KNN model to predict 'rating' based on other features. You can set K in an ad hoc manner in this step. Evaluate the accuracy of your model.
6) Tune the hyper-parameter K in KNN to see how it influences the prediction performance.

Note 1: We did not provide any description of each step in the notebook. You should learn how to properly comment your notebook by yourself to make your notebook file readable. 

Note 2: You are not being evaluated on the ___accuracy___ of the model but on the ___process___ that you use to generate it. Please use both a ___Logistic Regression model___ and a ___KNN model___ to solve this classification problem, and discuss the performance of the two methods.

In [1]:
# import libraries
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import RFE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

#import matplotlib.pyplot as plt
#import seaborn as sns
#%matplotlib inline

#import warnings
#warnings.filterwarnings('ignore')

In [2]:
# read csv
df = pd.read_csv('Portfolio 3.csv')
df.head(5)

Unnamed: 0,userId,timestamp,review,item,rating,helpfulness,gender,category
0,4259,11900,"Finally, Something for (Relatively) Nothing",MyPoints.com,like,4,F,Online Stores & Services
1,4259,12000,Shocking!,Sixth Sense,like,4,F,Movies
2,4259,12000,Simply Shaggadelic!,Austin Powers: The Spy Who Shagged Me,like,4,F,Movies
3,4259,12000,Better Than The First!,Toy Story 2,like,3,F,Movies
4,4259,12000,Blair Witch made me appreciate this,Star Wars Episode I: The Phantom Menace,dislike,4,F,Movies


In [3]:
df.shape # size before cleaning data

(2899, 8)

In [4]:
# 1) Explore the data. Clean the data if necessary. For example, remove abnormal instances and replace missing values.

df.isna().sum() # count the null values in each column

userId         0
timestamp      0
review         0
item           0
rating         0
helpfulness    0
gender         0
category       0
dtype: int64

No column contains null values, so no missing values need to be replaced.

In [5]:
df['category'].unique() # list the distinct values to spot any abnormal entries in the column

array(['Online Stores & Services', 'Movies', 'Hotels & Travel', 'Games',
       'Personal Finance', 'Media', 'Kids & Family',
       'Restaurants & Gourmet', 'Books'], dtype=object)

The same check was run on the other relevant columns (review, item, rating, helpfulness and gender). No abnormal or incorrect values were found.
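The per-column check described above can be run in one pass rather than one cell per column. The following is a minimal sketch on a toy DataFrame; the toy values and the names `toy`, `categorical_cols` and `unique_values` are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the portfolio data (illustrative values only).
toy = pd.DataFrame({
    "rating": ["like", "dislike", "like", "like"],
    "gender": ["F", "M", "F", "M"],
    "category": ["Movies", "Games", "Movies", "Books"],
})

# Columns to inspect; in the real notebook this would be the object columns.
categorical_cols = ["rating", "gender", "category"]

# Collect the sorted distinct values of each column in a dictionary.
unique_values = {col: sorted(set(toy[col])) for col in categorical_cols}

for col, values in unique_values.items():
    print(f"{col}: {values}")
```

Unexpected entries such as a misspelled category or a third gender code would show up directly in the printed lists.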

In [6]:
# 2) Convert object features into digit features by using an encoder 

df.info() # check the column dtypes; five columns are of object type

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2899 entries, 0 to 2898
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   userId       2899 non-null   int64 
 1   timestamp    2899 non-null   int64 
 2   review       2899 non-null   object
 3   item         2899 non-null   object
 4   rating       2899 non-null   object
 5   helpfulness  2899 non-null   int64 
 6   gender       2899 non-null   object
 7   category     2899 non-null   object
dtypes: int64(3), object(5)
memory usage: 181.3+ KB


In [7]:
enc = OrdinalEncoder() # create the OrdinalEncoder object

In [8]:
# encode each object column and write the result back into the main dataset
df['review'] = enc.fit_transform(df[['review']])
df['item'] = enc.fit_transform(df[['item']])
df['rating'] = enc.fit_transform(df[['rating']])
df['gender'] = enc.fit_transform(df[['gender']])
df['category'] = enc.fit_transform(df[['category']])
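As an alternative sketch, a single `OrdinalEncoder` can be fitted on several columns at once, which keeps the learned categories of every column available through `categories_`. The toy DataFrame and the names `toy`, `cols`, `encoder`, `encoded` and `mapping` below are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the portfolio data (illustrative values only).
toy = pd.DataFrame({
    "rating": ["like", "dislike", "like", "like"],
    "gender": ["F", "M", "F", "M"],
    "category": ["Movies", "Games", "Movies", "Books"],
})

cols = ["rating", "gender", "category"]

# Fit one encoder on all columns; each column gets its own category list.
encoder = OrdinalEncoder()
encoded = pd.DataFrame(
    encoder.fit_transform(toy[cols]), columns=cols, index=toy.index
)

# categories_ holds one sorted array per column; the position of a value
# in that array is its encoded number (e.g. 'dislike' -> 0.0, 'like' -> 1.0).
mapping = {col: list(cats) for col, cats in zip(cols, encoder.categories_)}

print(encoded)
print(mapping["rating"])
```

Because the encoder sorts categories alphabetically, 'dislike' maps to 0 and 'like' maps to 1, which is useful to know when reading the model's predictions.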

In [9]:
df.info() # check converted types of data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2899 entries, 0 to 2898
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   userId       2899 non-null   int64  
 1   timestamp    2899 non-null   int64  
 2   review       2899 non-null   float64
 3   item         2899 non-null   float64
 4   rating       2899 non-null   float64
 5   helpfulness  2899 non-null   int64  
 6   gender       2899 non-null   float64
 7   category     2899 non-null   float64
dtypes: float64(5), int64(3)
memory usage: 181.3 KB


In [10]:
# 3) Study the correlation between these features.

# correlation between columns and rating
df_review = df['rating'].corr(df['review'])
df_item = df['rating'].corr(df['item'])
df_gender = df['rating'].corr(df['gender'])
df_category = df['rating'].corr(df['category'])

print("The correlations value:\n")
print("review and rating: ", df_review)
print("item and rating: ",df_item)
print("gender and rating: ",df_gender)
print("category and rating: ",df_category)

The correlations value:

review and rating:  -0.046934643586446896
item and rating:  0.013628997625434916
gender and rating:  0.022575696214408688
category and rating:  -0.11631209500485062
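The correlations above can also be computed for every feature in one call and ranked by absolute strength. This is a sketch on toy numeric data; the values and the names `toy`, `corr_with_rating` and `ranked` are assumptions for illustration. Note that for nominal features such as category, the Pearson correlation of ordinal codes depends on the arbitrary order of the codes, so it should be read with caution.

```python
import pandas as pd

# Toy, already-encoded data (illustrative values only).
toy = pd.DataFrame({
    "rating":      [1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    "gender":      [0.0, 1.0, 0.0, 1.0, 1.0, 0.0],
    "category":    [2.0, 1.0, 2.0, 0.0, 1.0, 2.0],
    "helpfulness": [4, 3, 4, 2, 3, 4],
})

# Pearson correlation of every column with 'rating', excluding 'rating' itself.
corr_with_rating = toy.corr()["rating"].drop("rating")

# Order the features from strongest to weakest absolute correlation.
order = corr_with_rating.abs().sort_values(ascending=False).index
ranked = corr_with_rating.reindex(order)

print(ranked)
```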


In [11]:
# 4) Split the dataset and train a logistic regression model to predict 'rating' based on other features. Evaluate the accuracy of your model. 

train_case, test_case = train_test_split(df, test_size=0.2, random_state=142) # Splitting data into 80% training and 20% testing data
print(train_case.shape)
print(test_case.shape)

(2319, 8)
(580, 8)


In [12]:
x_train = train_case[['review', 'item', 'gender', 'category']] # use these four encoded features as predictors
y_train = train_case['rating'] # target: like/dislike encoded as 1/0
x_test = test_case.drop(['userId', 'timestamp','rating','helpfulness'], axis = 1) # dropping these columns leaves the same four predictors
y_test = test_case['rating'] # target for the test set

print("x train size:", x_train.shape)
print("y train size:", y_train.shape)
print("x test size:", x_test.shape)
print("y test size:", y_test.shape)

x train size: (2319, 4)
y train size: (2319,)
x test size: (580, 4)
y test size: (580,)


In [13]:
# fitting logistic regression model
model = linear_model.LogisticRegression()
model.fit(x_train,y_train)

LogisticRegression()

In [14]:
# predict data of logistic regression
train_predict = model.predict(x_train)
test_predict = model.predict(x_test)

In [15]:
print("Accuracy score:\n")
print("train data: ", accuracy_score(y_train,train_predict))
print("test data: ", accuracy_score(y_test,test_predict))

Accuracy score:

train data:  0.6326002587322122
test data:  0.6413793103448275
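An accuracy figure is easier to interpret next to a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent label. The sketch below uses scikit-learn's `DummyClassifier` on small made-up label arrays; the arrays and the names `baseline` and `baseline_accuracy` are assumptions for illustration and say nothing about the class balance of the actual dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels (1 = like, 0 = dislike); the features are irrelevant here.
y_train = np.array([1, 1, 1, 0, 1, 0, 1, 1])
y_test = np.array([1, 0, 1, 1, 0])
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print("baseline accuracy:", baseline_accuracy)
```

A model is only informative to the extent that its test accuracy exceeds this baseline.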


In [16]:
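# recursive feature elimination (RFE) around a logistic regression model
# note: n_features_to_select=5 exceeds the 4 input features, so no feature is eliminated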
rfe = linear_model.LogisticRegression()
selector = RFE(rfe,n_features_to_select=5,step=1)
selector.fit(x_train,y_train)

RFE(estimator=LogisticRegression(), n_features_to_select=5)

In [17]:
# predict data of RFE model
train_fpredict = selector.predict(x_train)
test_fpredict = selector.predict(x_test)

In [18]:
print("Accuracy score:\n")
print("train data after RFE: ", accuracy_score(y_train,train_fpredict))
print("test data after RFE: ", accuracy_score(y_test,test_fpredict))

Accuracy score:

train data after RFE:  0.6326002587322122
test data after RFE:  0.6413793103448275


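The accuracy with and without RFE is identical (about 0.633 on the training data and 0.641 on the test data). This is expected: n_features_to_select=5 exceeds the four input features, so RFE eliminates nothing and fits the same logistic regression on all four features. The training and test accuracies differ by only about 0.01, which suggests that the model is not overfitting.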
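For RFE to actually remove features, `n_features_to_select` has to be smaller than the number of input features. The following sketch shows this on synthetic data generated with `make_classification`; the data and the parameter choices are assumptions for illustration and are not results from the portfolio dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 4 features, of which 2 are informative (assumed setup).
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# Keep 2 of the 4 features, removing one feature per iteration.
selector = RFE(LogisticRegression(), n_features_to_select=2, step=1)
selector.fit(X, y)

# support_ marks the kept features; ranking_ is 1 for kept features and
# increases for features that were eliminated earlier.
print("kept features:", selector.support_)
print("ranking:", selector.ranking_)
```

Inspecting `support_` after fitting is a quick way to confirm whether any feature was dropped at all.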

In [19]:
# 5) Split the dataset and train a KNN model to predict 'rating' based on other features. You can set K with an ad-hoc manner in this step
# and evaluate the accuracy of your model.

# using knn model
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(x_train,y_train)

KNeighborsClassifier(n_neighbors=15)

In [20]:
y_pred = knn.predict(x_test) # predict the ratings of the test data

accuracy = accuracy_score(y_test, y_pred) # compute the accuracy (true labels first, predictions second)
print('Testing accuracy is: ', accuracy)

Testing accuracy is:  0.6172413793103448


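With K = 15 chosen ad hoc, the KNN model reaches a test accuracy of about 0.617, slightly below the 0.641 obtained with logistic regression. The next step tunes K to see how it influences this result.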
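KNN classifies by distance, so features with large numeric ranges can dominate features with small ranges. The notebook above does not scale its features; the sketch below shows one possible variation, standardising the features inside a pipeline before KNN. It runs on synthetic data, and the data, the exaggerated feature scale and the name `scaled_knn` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one feature blown up to a much larger scale.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X[:, 0] *= 1000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=142
)

# The scaler is fitted on the training data only, then applied to the test data.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
scaled_knn.fit(X_train, y_train)

scaled_accuracy = scaled_knn.score(X_test, y_test)
print("test accuracy with scaling:", scaled_accuracy)
```

Whether scaling helps on the portfolio data would need to be checked by running both variants on it.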

In [21]:
# 6) Tune the hyper-parameter K in KNN to see how it influences the prediction performance

parameter_grid = {'n_neighbors': range(1, 50)}
knn_tuning = KNeighborsClassifier()
clf = GridSearchCV(knn_tuning, parameter_grid, scoring='accuracy', cv=10) # 10-fold cross-validation for each K
clf.fit(x_train, y_train)

# Identify the best parameters
print('Best K value: ', clf.best_params_['n_neighbors'])
print('The accuracy: ', clf.best_score_)

Best K value:  37
The accuracy:  0.6373413942379459


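Comparing steps 5 and 6: the grid search over K = 1 to 49 with 10-fold cross-validation selects K = 37 with a mean cross-validated accuracy of about 0.637, whereas the ad hoc K = 15 gave a test accuracy of about 0.617. The two figures are measured on different data (cross-validation folds of the training set versus the held-out test set), so they are not strictly comparable, but they suggest that the choice of K influences the prediction performance and that a larger K works somewhat better here. Both values are close to the 0.641 test accuracy of logistic regression, so neither model clearly outperforms the other on these features.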
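To compare a tuned K with an ad hoc K on the same footing, the tuned model can be scored on the held-out test set, and the mean cross-validated accuracy for every K can be read from `cv_results_`. The sketch below does this on synthetic data; the data, the grid range and the names `search`, `best_k`, `test_accuracy` and `mean_scores` are assumptions for illustration, not results from the portfolio dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumed setup).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=142
)

# Search K = 1..29 with 5-fold cross-validation on the training data only.
search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": list(range(1, 30))},
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]

# GridSearchCV refits the best model on the full training set by default,
# so predict() uses the tuned K.
test_accuracy = accuracy_score(y_test, search.predict(X_test))

# One mean cross-validated accuracy per candidate K, in grid order.
mean_scores = search.cv_results_["mean_test_score"]

print("best K:", best_k)
print("test accuracy with best K:", test_accuracy)
```

Plotting `mean_scores` against K is one way to see how strongly the choice of K affects performance.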