# Machine Learning Project for Python

Problem: 

A website sent advertisements by email to users who were interested in its products. Your task is to find a good model that predicts whether an advertisement will be clicked, using the given datasets.

user_features.csv - features describing our users
product_features.csv - features describing products shown in the advertisements.
click_history.csv - contains which products each user had previously seen and whether the user clicked on the advertisement

In [1]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn import metrics
from sklearn import linear_model, tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

Question 1: Data Understanding
Explore the basic information of the datasets.

In [16]:
#Question 1.)
#establishing file path of where input file is located
path=r"C:\Users\esbro\Desktop\DAAN862\Final Exam" #raw string so the backslashes are not read as escape sequences
os.chdir(path) #changing current working directory to the path listed above 
hClick=pd.read_csv("click_history.csv") #reading in input file from file path listed above
hshape=hClick.shape #structure of input file (how many rows & columns)
print("Click history has %5d rows & %1d columns\n" %(hshape[0],hshape[1]))
hsize=hClick.size #total number of elements (rows * columns)
print("This dataframe of the file data as a whole has %6d elements present\n" % hsize)
#data types in the input file
hCols=hClick.dtypes
print("Click history is made up of these data types:\n",hCols)
#unique users in the input file by user id
uniUserIDs=hClick.user_id.unique()
print("\nFirst 25 unique user_ids in click history are:\n",uniUserIDs[:25])
#unique product ids in the input file
uniProdIDs=hClick.product_id.unique()
print("\nFirst 25 unique product ids in click history are:\n",uniProdIDs[:25])
#summary statistics of values in click history file
dataset=hClick.describe()
print("\nSummary statistics of data values in click history dataset are:\n",dataset)
#how many click history rows were clicked; clicked is boolean, so the counts are looked up by True/False
countClick=hClick.clicked.value_counts()
print("The number of click history rows that were clicked was: ",countClick.get(True,0))
print("The number of click history rows that were not clicked was: ",countClick.get(False,0))

prodFeat=pd.read_csv("product_features.csv")#reading in input file from file path listed above
pFeatShape=prodFeat.shape #structure of input file (how many rows & columns)
print("\nProduct features has %5d rows & %1d columns\n" %(pFeatShape[0],pFeatShape[1]))
pFeatSize=prodFeat.size #total number of elements (rows * columns)
print("This dataframe of the file data as a whole has %4d elements present\n" % pFeatSize)
#data types in the input file
pFeatCols=prodFeat.dtypes
print("Product features is made up of these data types:\n",pFeatCols)
#unique product ids in the input file
pFeatUniID=prodFeat.product_id.unique()
print("\nFirst 25 unique product ids in product features are:\n",pFeatUniID[:25])
#unique product categories in product features
pFeatUniCat=prodFeat.category.unique()
print("\nThe unique product categories in product features are:\n",pFeatUniCat)
#how many products were on sale or not
pFeatSaleSat=prodFeat.on_sale.value_counts()
print("\nThe number of products that were on sale was: ",pFeatSaleSat[1])
print("The number of products that were not on sale was: ",pFeatSaleSat[0])
#summary statistics of values in product features file
dataset2=prodFeat.describe()
print("\nSummary statistics of data values in product features dataset are:\n",dataset2)

userFet=pd.read_csv("user_features.csv")#reading in input file from file path listed above
userShape=userFet.shape #structure of input file (how many rows & columns)
print("\nUser features has %5d rows & %1d columns" %(userShape[0],userShape[1]))
userSize=userFet.size #total number of elements (rows * columns)
print("This dataframe of the file data as a whole has %4d elements present" % userSize)
userCols=userFet.dtypes #data types in the input file
print("\nUser features is made up of these data types:\n",userCols)
#unique users in user features file based on their user ids
userUniIDs=userFet.user_id.unique()
print("\nFirst 25 unique user_ids in User features are:\n",userUniIDs[:25])
#distinct values of the number of clicks each user made before the ad was sent
userUniClick=userFet.number_of_clicks_before.unique()
print("\nDistinct values of the number of clicks users made on previous ads:\n",userUniClick)
#count of users who had or had not ordered before; ordered_before is boolean, so the counts are looked up by True/False
userFetCountOr=userFet.ordered_before.value_counts()
print("\nThe number of users that had ordered before was: ",userFetCountOr.get(True,0))
print("The number of users that had not ordered before was: ",userFetCountOr.get(False,0))
#different types of interests listed by each user which might relate to future products purchased by them
userFetUniInt=userFet.personal_interests.unique()
print("\nFirst 25 listed personal interests of each user:\n",userFetUniInt[:25])
#counting how many users listed no personal interests (an empty list is stored as the string "[]")
blankIntCount=(userFet.personal_interests=="[]").sum()
print("\nThe number of users that do not have personal interests listed is: ",blankIntCount)

Click history has 35990 rows & 3 columns

This dataframe of the file data as a whole has 107970 elements present

Click history is made up of these data types:
 user_id       int64
product_id    int64
clicked        bool
dtype: object

First 25 unique user_ids in click history are:
 [104863 108656 100120 104838 107304 106682 110052 100142 101559 110501
 109185 106343 107547 101696 111749 101080 111088 103033 100467 108385
 111168 106003 102163 107394 100901]

First 25 unique product ids in click history are:
 [1350 1321 1110 1443 1397 1246 1897 1843 1159 1268 1281 1670 1671 1228
 1445 1384 1245 1404 1108 1855 1349 1649 1058 1527 1143]

Summary statistics of data values in click history dataset are:
              user_id    product_id
count   35990.000000  35990.000000
mean   106017.080161   1500.232898
std      3483.480090    288.101984
min    100001.000000   1000.000000
25%    102976.500000   1250.000000
50%    106060.000000   1503.000000
75%    109049.000000   1749.000000
max    1119
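
The exploration cell repeats the same shape, size, and dtype checks for each of the three files. A minimal sketch of how those checks could be gathered into one helper is shown below. The function name `summarize` and the tiny demo table are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def summarize(df: pd.DataFrame, name: str) -> dict:
    """Collect the basic facts printed in the exploration cell for one table."""
    return {
        "name": name,
        "rows": df.shape[0],
        "columns": df.shape[1],
        "elements": df.size,  # rows * columns
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": int(df.isnull().sum().sum()),
    }

# Tiny stand-in for click_history.csv (values are made up for illustration).
demo = pd.DataFrame({
    "user_id": [100001, 100002, 100003],
    "product_id": [1000, 1001, 1002],
    "clicked": [True, False, False],
})
info = summarize(demo, "click_history (demo)")
print(info)
```

The same call could be repeated for each loaded dataframe, which keeps the printed summaries consistent across files.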

Question 2: Data Cleaning and Preprocessing
Clean and preprocess the datasets (such as missing values, outliers, dummy, merging etc.).

In [19]:
#Question 2.)
#the user features & click history datasets share the user_id column, so they can be merged on it
combined_data=pd.merge(userFet,hClick,on='user_id')
print("First few rows of new combined dataset of user features data and user click history data:\n",combined_data.head(15))
#now the merged dataset created has product id to add product features dataset on
combined_data2=pd.merge(combined_data,prodFeat,on='product_id')
print("\nFirst few rows of new combined dataset of user and product features data:\n",combined_data2.head(20))
#handling na/missing values in the number_of_clicks_before variable/column
combined_data2['number_of_clicks_before']=combined_data2['number_of_clicks_before'].fillna(0)
#handling values less than 0, which should not happen because review scores should be greater than or equal to 0
combined_data2.loc[combined_data2['avg_review_score']<0,'avg_review_score']=0
#number_of_clicks_before must be numeric, so the capped value '6+' is replaced with 6
combined_data2.loc[combined_data2['number_of_clicks_before']=='6+','number_of_clicks_before']=6
#all values of this variable are now numeric, so the data type can be changed to integer
combined_data2=combined_data2.astype({'number_of_clicks_before': 'int_'})
#ordered_before, clicked, and on_sale are cast to int8, which maps False/True to 0/1;
#after the cast there are no 'True'/'False' strings left to replace, so no further step is needed
combined_data2=combined_data2.astype({'ordered_before': 'int8','clicked': 'int8','on_sale': 'int8'})
#reindexing variables to make slicing easier to be able to drop certain variables that I create dummy variables for
combined_data2 = combined_data2.reindex(columns=
                            ['user_id','product_id','personal_interests','category',
                             'clicked','ordered_before','number_of_clicks_before','on_sale',
                             'number_of_reviews','avg_review_score'])
#getting dummy values for these categorical variables to be able to include them in our classification models
catDumm=pd.get_dummies(combined_data2['category'],prefix='category')
perDumm=pd.get_dummies(combined_data2['personal_interests'],prefix='personal_Interest',
                       prefix_sep=':')
#adding both categorical variables to our dataframe with numeric dummy values
combined_data2=combined_data2.join([catDumm,perDumm])
#dropping the original variables that we created dummy values for
combined_data2=combined_data2.drop(['personal_interests','category'],axis=1)
#making sure that neither the newly created variables nor the existing variables we modified have missing values
missingSum=combined_data2.isnull().values.sum()
print("\nThe number of missing values in the combined dataset of user features,product features, and click history is: ",
      missingSum)

First few rows of new combined dataset of user features data and user click history data:
     user_id number_of_clicks_before  ordered_before  \
0    104939                       2            True   
1    104939                       2            True   
2    104939                       2            True   
3    104939                       2            True   
4    104939                       2            True   
5    101562                       2            True   
6    101562                       2            True   
7    101562                       2            True   
8    101562                       2            True   
9    101562                       2            True   
10   102343                       2            True   
11   102343                       2            True   
12   102343                       2            True   
13   102343                       2            True   
14   102343                       2            True   
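
Calling `get_dummies` on the raw `personal_interests` strings creates one column per unique list of interests, which is why the feature matrix ends up with thousands of columns. An alternative that could be tried is a multi-hot encoding with one column per individual interest. The sketch below assumes the column stores Python-style list strings such as `"['hair']"` and `"[]"`, as the column names in the printed output suggest. The demo rows and the `interest_` prefix are made up for illustration.

```python
import ast
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up rows; the real column is assumed to hold strings like "['hair']" or "[]".
demo = pd.DataFrame({"personal_interests": ["['hair', 'hand']", "['hair']", "[]"]})

# Turn each string into a real Python list.
parsed = demo["personal_interests"].apply(ast.literal_eval)

# One column per individual interest rather than one per unique list.
mlb = MultiLabelBinarizer()
multi_hot = pd.DataFrame(
    mlb.fit_transform(parsed),
    columns=[f"interest_{c}" for c in mlb.classes_],
    index=demo.index,
)
print(multi_hot)
```

With this encoding a user with no listed interests is simply a row of zeros, and the number of columns grows with the number of distinct interests instead of the number of distinct combinations.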

                            

Question 3: Model Generation and Evaluation
Please split the data into train and test sets with a ratio of 0.7:0.3. Build and optimize classification models you learned in this course.

In [20]:
#Question 3.)
#outcome variable for classification needs to be the variable clicked
#because it is the variable that tells whether the user clicked on the ad,which we are trying to find out
y=combined_data2.clicked
#x/predictor variables are all remaining columns that could help predict whether a user will click on the ad:
#ordered_before, number_of_clicks_before, on_sale, number_of_reviews, avg_review_score, and the category/interest dummies
#dropping the identifier and outcome columns by name avoids hard-coding column positions
X=combined_data2.drop(columns=['user_id','product_id','clicked'])
#splitting the data into train and test sets with a ratio of 0.7:0.3
X_train,X_test,y_train,y_test=train_test_split(X,y,
                                               test_size=0.3,random_state=4)
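
If clicks are much rarer than non-clicks, a purely random split can leave the train and test sets with slightly different click rates. One option, shown here as a sketch on synthetic data rather than the notebook's dataframe, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. The arrays `X_demo` and `y_demo` are made-up stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 30% positives (made-up values).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 30 + [0] * 70)

# Same 0.7:0.3 ratio as above, but with the class balance preserved.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=4, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```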

In [21]:
#building classification models to predict whether a user will click on an advertisement sent by the company's website

#1.) Logistic regression model 
lr=linear_model.LogisticRegression()
lr.fit(X_train,y_train)
lr_train_pred=lr.predict(X_train)
lr_test_pred=lr.predict(X_test)

log_acc=metrics.accuracy_score(y_test,lr_test_pred)
log_f1=metrics.f1_score(y_test,lr_test_pred)

#2.) Decision tree model
DT=tree.DecisionTreeClassifier(max_depth=10,min_samples_split=5)
DT.fit(X_train,y_train)
DT_pred=DT.predict(X_test)

dt_acc=metrics.accuracy_score(y_test,DT_pred)
dt_f1=metrics.f1_score(y_test,DT_pred)

#making a dataframe to show the feature importance of each predictor variable in the model 
varImport=pd.DataFrame({'variable':X.columns[:],
              'importance':DT.feature_importances_})
varImport=varImport.sort_values(by='importance',ascending=False)
#showing top 30 variables that are the most important at predicting the outcome variable
varImport=varImport.head(30)
print("Variable Importance based on Decision tree model is:\n",varImport)

#3.) K-Nearest Neighbours model
#list of candidate parameter values to search for the best K-Nearest Neighbours model
tuned_parameters={'n_neighbors':[95,110]}
knn=KNeighborsClassifier(n_neighbors=5)
knn_optimizer=GridSearchCV(knn,tuned_parameters,scoring='accuracy',
                           return_train_score=False,verbose=2)
knn_optimizer.fit(X_train,y_train)
#for testing purposes, to see which parameter value produced the best k-nearest neighbours cross-validation score
results=pd.DataFrame(knn_optimizer.cv_results_)
knn_best_model=knn_optimizer.best_estimator_
knn_best_pred=knn_best_model.predict(X_test)

knn_acc=metrics.accuracy_score(y_test,knn_best_pred)
knn_f1=metrics.f1_score(y_test,knn_best_pred)

#4.) Random Forest model
rfc=RandomForestClassifier(n_estimators=100,max_features=10,
                           min_samples_split=5,random_state=2)
rfc.fit(X_train,y_train)
rfc_pred=rfc.predict(X_test)

#making a dataframe to show the feature importance of each predictor variable in the model 
varImport2=pd.DataFrame({'variable':X.columns[:],
              'importance':rfc.feature_importances_})
varImport2=varImport2.sort_values(by='importance',ascending=False)
#showing top 30 variables that are the most important at predicting the outcome variable
varImport2=varImport2.head(30)
print("\nVariable importance based on Random Forest model is:\n",varImport2)

rfc_acc=metrics.accuracy_score(y_test,rfc_pred)
rfc_f1=metrics.f1_score(y_test,rfc_pred)

#f1 & accuracy scores for each classification model
print("\nF1 Scores for the models are:")
print("Logistic Regression - ",log_f1)
print("Decision tree - ",dt_f1)
print("K-Nearest Neighbours - ",knn_f1)
print("Random Forest - ",rfc_f1)

print("\nAccuracy Scores for the models are:")
print("Logistic Regression - ",log_acc)
print("Decision tree - ",dt_acc)
print("K-Nearest Neighbours - ",knn_acc)
print("Random Forest - ",rfc_acc)

Variable Importance based on Decision tree model is:
                                                variable  importance
3                                     number_of_reviews    0.738610
0                                        ordered_before    0.053167
4                                      avg_review_score    0.043384
5189                               personal_Interest:[]    0.021092
2                                               on_sale    0.020310
1                               number_of_clicks_before    0.015174
5                                         category_body    0.005855
6                                         category_foot    0.005340
10                                      category_makeup    0.003972
8                                         category_hair    0.003769
1897                         personal_Interest:['hair']    0.003538
2393                         personal_Interest:['hand']    0.003282
12                                category_men_skincare    0.0

Question 4: Which model has the best performance? What have you learned from the models you built?

   The main objective of this project was to predict whether a user would click on an advertisement sent by email, based on user and product features. Before comparing models, I had to decide which evaluation metric mattered most. Did we want a model that treats false positives the same as false negatives (accuracy)? Did we want one judged by the share of predicted positives that were actually positive (precision)? Or did we want one judged by the share of actual positives that were correctly predicted (recall)? Each question points to a different score: accuracy, precision, recall, or F1, which combines precision and recall.
    In our case, the model needs to predict clicks and non-clicks evenly, so it should not lean toward either false positives or false negatives. The model therefore needs a relatively high accuracy or F1 score. Accuracy counts every misclassification equally, while F1 is the harmonic mean of precision and recall, so it penalizes both false positives and false negatives on the positive class. F1 is also more informative than accuracy when the classes are unevenly distributed.
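
To make the metrics discussed above concrete, here is a small worked example on made-up labels (not the project data). It derives accuracy, precision, recall, and F1 from the confusion-matrix counts and shows how accuracy can look acceptable while F1 stays low when positives are rare.

```python
from sklearn import metrics

# Made-up labels for illustration: 1 = clicked, 0 = not clicked.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted clicks that were clicks
recall = tp / (tp + fn)                      # share of actual clicks that were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Here the accuracy is 0.7 while the F1 score is only 0.4, because two of the three actual clicks were missed.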
    Once the metrics were chosen, I had to decide which classification models were likely to produce the highest accuracy and F1 scores. The candidates were Logistic Regression, Naïve Bayes, Stochastic Gradient Descent, K-Nearest Neighbors, Decision Tree, Random Forest, and Support Vector Machine. I eliminated three of them from the analysis: Naïve Bayes, Stochastic Gradient Descent, and Support Vector Machine.
    I eliminated the Naïve Bayes model because its probability estimates are often poor, which can lead to loose predictions of the outcome variable and low F1 and accuracy scores. Without smoothing, it also assigns zero probability to a categorical value that appears in the test set but was never seen in the training set. I wanted to avoid this issue because, with a random 0.70:0.30 split, there is no way of knowing whether every value of a factor variable will show up in the training set.
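
The zero-probability issue mentioned above is commonly handled with additive (Laplace) smoothing. The sketch below works through the arithmetic for a single categorical value. The function name and the counts are made up for illustration.

```python
def smoothed_likelihood(count, class_total, n_values, alpha=1.0):
    """Estimate P(value | class) with additive (Laplace) smoothing.

    count:       times the value was seen with this class in training
    class_total: number of training rows in this class
    n_values:    number of distinct values the feature can take
    alpha:       smoothing strength (0 means no smoothing)
    """
    return (count + alpha) / (class_total + alpha * n_values)

# A value never seen with this class among 50 training rows, feature with 3 values.
unsmoothed = smoothed_likelihood(0, 50, 3, alpha=0.0)
smoothed = smoothed_likelihood(0, 50, 3, alpha=1.0)
print(unsmoothed, smoothed)
```

Without smoothing the estimate is exactly zero, which wipes out the whole product of likelihoods for that class. With `alpha=1.0` it becomes 1/53, small but non-zero.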
    I removed the Stochastic Gradient Descent model because it is mainly efficient at fitting linear models. That is great when the data has solely linear relationships, but I did not know whether a linear modelling approach would suit this data. Another issue is that the model updates its weights one observation at a time, so it usually needs several passes over the data and is sensitive to feature scaling and to the learning rate. Because the dataframe being modeled had numerous rows and columns, I expected this model to converge slowly.
    I chose not to move forward with the Support Vector Machine model because of its computational inefficiency. I knew I would need to run a grid search on the kernel argument, which would take a long time to run because of the size of our dataframe, and training speed would drop considerably. For this model to be practical, my dataset would need to have fewer data points so that my computer's memory would not be taxed by storing all the support vectors. I also did not want to scale all my numeric variables, which this model needs in order to perform well.
    Once I eliminated the Naïve Bayes, Stochastic Gradient Descent, and Support Vector Machine models, I was left with the K-Nearest Neighbors, Decision Tree, Random Forest, and Logistic Regression classification methods. The Logistic Regression model made sense right away because it classifies unknown records quickly, is easy to implement, and makes no assumptions about the distribution of the factor variables in feature space. I needed a classification model that did not require intensive computing power because, as previously stated, the training dataset was relatively large: the dummy variables created from the original personal interests and product category variables greatly increased its size.
    The second model I added was the Decision Tree model. It can be used for both classification and regression problems, so it handles both continuous and discrete values, and the dataset had both. For example, the review score variable was continuous, while the number of clicks and the number of reviews a product had were discrete. The model also handles nonlinear relationships well, which was another benefit because not all variables in the dataset had linear relationships. For instance, I did not expect the dummy variables for product categories and user interests to relate linearly to whether a product was on sale or whether a user had clicked on a previous advertisement. Like the Logistic Regression model, the Decision Tree model is fast to fit, and it is also easy to interpret. Its feature_importances_ attribute showed this strength by displaying each variable's importance in predicting the outcome variable. Lastly, no normalization was needed when preprocessing the original dataset.
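
As a small illustration of how a decision tree's importance scores can be read, the sketch below fits a tree on synthetic data in which one column fully determines the label and the other is pure noise. The column names and data are made up, and the tree settings mirror the ones used in the notebook only for familiarity.

```python
import numpy as np
import pandas as pd
from sklearn import tree

rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "informative": rng.integers(0, 2, size=200),  # 0/1 flag
    "noise": rng.normal(size=200),                # unrelated to the label
})
y_demo = X_demo["informative"]  # label is fully determined by one column

dt = tree.DecisionTreeClassifier(max_depth=10, min_samples_split=5, random_state=0)
dt.fit(X_demo, y_demo)

# Importances sum to 1; higher means the column contributed more impurity reduction.
importance = pd.Series(dt.feature_importances_, index=X_demo.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Because a single split on `informative` separates the classes perfectly, that column receives all of the importance and `noise` receives none.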
    The K-Nearest Neighbors model was the third classification model I added, because with a large enough number of neighbors it is robust to noisy training data. This capability was needed because some of my categorical variables had hard-to-interpret values, mainly because I had to convert most of them to binary values. These variables included whether a user had ordered before, user interests, and product categories. Also, since K-Nearest Neighbors is a memory-based method, it adapts immediately to newly collected data. The main downside is that it is a lazy learner, so as the dataset grows, its prediction speed declines drastically. Therefore, I ranked this model behind the Logistic Regression and Decision Tree models.
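
Because K-Nearest Neighbors relies on distances, columns with large ranges (such as a review count) can dominate columns coded as 0/1. One way this could be addressed, sketched here on synthetic data rather than the project dataframe, is to put a scaler and the classifier in a `Pipeline` and tune `n_neighbors` with `GridSearchCV`. The candidate values and the dataset are assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the click dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=4)

# Scaling happens inside each cross-validation fold, so no test-fold statistics leak in.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = {"knn__n_neighbors": [5, 15, 25]}  # illustrative candidates only

search = GridSearchCV(pipe, grid, scoring="accuracy", cv=5)
search.fit(X_tr, y_tr)

best_k = search.best_params_["knn__n_neighbors"]
test_acc = search.score(X_te, y_te)
print(best_k, test_acc)
```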
    The final classification model I used was the Random Forest model. I added it mainly because it is a meta-estimator that controls over-fitting better than a single decision tree: it fits several decision trees on various sub-samples of the dataset and averages their predictions, which tends to make it more accurate than a Decision Tree model. However, it is more complex than a Decision Tree model, so it is harder to trace how it arrives at a prediction. This is why it is often considered a black-box approach to statistical modelling.
    At the end of my analysis, using the four classification models built above, I found that the Decision Tree model had the best all-around performance, with the highest F1 score (64.30%) and the highest accuracy score (75.22%). The K-Nearest Neighbors and Random Forest models came in 3rd and 4th, respectively. The Random Forest model probably came in last because I did not use the most suitable number of estimators and tree depth for this data. Increasing the number and size of the trees in the ensemble would have raised the computation cost dramatically. However, if computation cost were not a factor and I had more memory available, I would increase the depth and the number of estimators by at least 40 percent, which I would expect to bring the Random Forest model's accuracy and F1 scores to at least the level of the Decision Tree model's.
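
The Random Forest settings discussed above could be explored with a small grid search instead of fixed values. The sketch below runs on synthetic data, and the candidate values for `n_estimators` and `max_depth` are assumptions picked to keep the run short. It is not a tuned configuration for the click dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the click dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Small illustrative grid over the number of trees and their maximum depth.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, 10]}

search = GridSearchCV(
    RandomForestClassifier(random_state=2),
    param_grid,
    scoring="f1",  # same metric family used to compare the models above
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

On the real data the grid would need to be sized against the available memory and run time, since every added combination is refit once per fold.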