# MSDS 699 - Machine Learning Laboratory

## Final Project

Author: Tencent Inc.
Source: [KDD Cup](https://www.kddcup2012.org/) - 2012
Please cite:

This data set is the same as version 4, but has additional unlabeled data attached to it. This is meant for a machine learning challenge. The complete labeled version of this dataset is version 6 (but this version is kept private for the duration of the challenge).

This data is derived from the 2012 KDD Cup. The data is subsampled to 1% of the original number of instances, downsampling the majority class (click=0) so that the target feature is reasonably balanced (5 to 1).
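The source does not say exactly how the subsampling was carried out. As a minimal sketch of downsampling a majority class to a 5-to-1 ratio, the snippet below uses a synthetic frame; the helper name `downsample_majority`, the toy data, and the seed are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def downsample_majority(df, target="click", ratio=5, seed=0):
    """Keep every minority row and at most `ratio` majority rows per minority row."""
    counts = df[target].value_counts()
    minority_label = counts.idxmin()
    minority = df[df[target] == minority_label]
    majority = df[df[target] != minority_label]
    n_keep = min(len(majority), ratio * len(minority))
    kept = majority.sample(n=n_keep, random_state=seed)
    # Shuffle so the two classes are interleaved again
    return pd.concat([minority, kept]).sample(frac=1, random_state=seed).reset_index(drop=True)

# Synthetic stand-in for the raw log: roughly 4% of rows are clicks
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "click": (rng.random(10_000) < 0.04).astype(int),
    "depth": rng.integers(1, 4, size=10_000),
})

balanced = downsample_majority(toy)
print(balanced["click"].value_counts())
```

Only the majority class is sampled, so every click survives and the class ratio is fixed by construction.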

The data is about advertisements shown alongside search results in a search engine, and whether or not people clicked on these ads.
The task is to build the best possible model to predict whether a user will click on a given ad.

A search session contains information on user id, the query issued by the user, ads displayed to the user, and target feature indicating whether a user clicked at least one of the ads in this session. The number of ads displayed to a user in a session is called ‘depth’. The order of an ad in the displayed list is called ‘position’. An ad is displayed as a short text called ‘title’, followed by a slightly longer text called ‘description’, and a URL called ‘display URL’.
To construct this dataset each session was split into multiple instances. Each instance describes an ad displayed under a certain setting (‘depth’, ‘position’). Instances with the same user id, ad id, query, and setting are merged. Each ad and each user have some additional properties located in separate data files that can be looked up using ids in the instances.
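The merge step described above can be sketched with a pandas `groupby`. The per-impression log, its `clicked` column, and the aggregation rules (count rows for `impression`, take the maximum for `click`) are assumptions made for illustration; the original preprocessing code is not part of this notebook.

```python
import pandas as pd

# Hypothetical per-impression log: one row per ad shown in a session
log = pd.DataFrame({
    "user_id":  [7, 7, 7, 9],
    "ad_id":    [100, 100, 200, 100],
    "query_id": [1, 1, 1, 1],
    "depth":    [2, 2, 2, 1],
    "position": [1, 1, 2, 1],
    "clicked":  [0, 1, 0, 0],
})

# Rows sharing user, ad, query, and setting (depth, position) become one instance
keys = ["user_id", "ad_id", "query_id", "depth", "position"]
instances = (
    log.groupby(keys, as_index=False)
       .agg(impression=("clicked", "count"),  # number of merged rows
            click=("clicked", "max"))         # 1 if any merged row was clicked
)
print(instances)
```

Here the two rows for user 7 and ad 100 collapse into a single instance with `impression` equal to 2.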

The dataset has the following features:
* Click - binary variable indicating whether a user clicked on at least one ad.
* Impression - the number of search sessions in which the ad (AdID) was shown to the user (UserID) who issued the query.
* Url_hash - URL is hashed for anonymity
* AdID
* AdvertiserID - some advertisers consistently optimize their ads, so the title and description of their ads are more attractive than those of others’ ads.
* Depth - number of ads displayed to a user in a session
* Position - order of an ad in the displayed list
* QueryID - is the key of the data file 'queryid_tokensid.txt'. (follow the link to the original KDD Cup page, track 2)
* KeywordID - is the key of 'purchasedkeyword_tokensid.txt' (follow the link to the original KDD Cup page, track 2)
* TitleID - is the key of 'titleid_tokensid.txt'
* DescriptionID - is the key of 'descriptionid_tokensid.txt' (follow the link to the original KDD Cup page, track 2)
* UserID - is also the key of 'userid_profile.txt' (follow the link to the original KDD Cup page, track 2). 0 is a special value denoting that the user could not be identified.


In [5]:
import numpy as np
import pandas as pd
from scipy.io.arff import loadarff

# Imports
# Do NOT import anything else
from   category_encoders          import *
from   sklearn.compose            import *
from   sklearn.ensemble           import RandomForestClassifier, ExtraTreesClassifier, IsolationForest
from   sklearn.experimental       import enable_iterative_imputer
from   sklearn.impute             import *
from   sklearn.linear_model       import LogisticRegression, PassiveAggressiveClassifier, RidgeClassifier, SGDClassifier
from   sklearn.metrics            import balanced_accuracy_score # Evaluation metric 2.0
from   sklearn.metrics            import mean_absolute_error # Easier to interpret than MSE. We'll discuss it in detail later
from   sklearn.model_selection    import train_test_split
from   sklearn.pipeline           import Pipeline
from   sklearn.preprocessing      import *
from   sklearn.tree               import DecisionTreeClassifier, ExtraTreeClassifier

## Exploratory Data Analysis

In [80]:
raw_data_arff = loadarff('phpfGCaQC.arff')
df = pd.DataFrame(raw_data_arff[0])

In [81]:
df["click"] = df['click'].str.decode("utf-8").astype("category")
df["ad_id"] = df["ad_id"].astype(int)
df["advertiser_id"] = df["advertiser_id"].astype(int)
df["depth"] = df["depth"].astype(int)
df["position"] = df["position"].astype(int)
df["query_id"] = df["query_id"].astype(int)
df["keyword_id"] = df["keyword_id"].astype(int)
df["title_id"] = df["title_id"].astype(int)
df["description_id"] = df["description_id"].astype(int)
df["user_id"] = df["user_id"].astype(int)
df["impression"] = df["impression"].astype(int)
df["url_hash"] = df["url_hash"].astype("category")

In [68]:
# Alternative typing: treat the ID columns as categories instead of integers.
# Run this on a freshly loaded df in place of the previous cell;
# `.str.decode` expects the raw bytes returned by loadarff.
df["click"] = df['click'].str.decode("utf-8").astype("category")
df["ad_id"] = df["ad_id"].astype("category")
df["advertiser_id"] = df["advertiser_id"].astype("category")
df["depth"] = df["depth"].astype(int)
df["position"] = df["position"].astype("category")
df["query_id"] = df["query_id"].astype("category")
df["keyword_id"] = df["keyword_id"].astype("category")
df["title_id"] = df["title_id"].astype("category")
df["description_id"] = df["description_id"].astype("category")
df["user_id"] = df["user_id"].astype("category")
df["impression"] = df["impression"].astype(int)
df["url_hash"] = df["url_hash"].astype("category")

In [8]:
from pandas_profiling import ProfileReport
profile = ProfileReport(df, title="Pandas Profiling Report")
profile.to_widgets()
profile.to_file("Pandas Profiling Report.html")

The pandas-profiling report flags 22 duplicate rows in the dataset. These rows are dropped.
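As a quick cross-check of a duplicate count, `DataFrame.duplicated` can be summed directly. The toy frame below is an assumption for illustration. It also shows that the count depends on which columns are compared, which matters when columns are dropped before `drop_duplicates()` is called.

```python
import pandas as pd

toy = pd.DataFrame({
    "click": [0, 0, 1, 0],
    "ad_id": [10, 10, 20, 10],
    "depth": [1, 1, 2, 3],
})

# Rows that repeat an earlier row across all columns
n_dupes_all = int(toy.duplicated().sum())

# Comparing fewer columns can only find the same or more duplicates
n_dupes_subset = int(toy.duplicated(subset=["click", "ad_id"]).sum())

deduped = toy.drop_duplicates()
print(n_dupes_all, n_dupes_subset, len(deduped))
```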

In [91]:
df_new = df[['click', 'impression', 'url_hash', 'ad_id', 'advertiser_id', 'depth',
       'position', 'query_id', 'keyword_id', 'title_id', 'description_id',
       'user_id']]

Index(['click', 'impression', 'url_hash', 'ad_id', 'advertiser_id', 'depth',
       'position', 'query_id', 'keyword_id', 'title_id', 'description_id',
       'user_id'],
      dtype='object')

In [106]:
df = df[['click', 'impression', 'url_hash', 'ad_id', 'advertiser_id', 'depth',
       'position', 'keyword_id', 'description_id',
       'user_id']]

In [107]:
df = df.drop_duplicates()

In [83]:
df.head()

| | click | impression | url_hash | ad_id | advertiser_id | depth | position | query_id | keyword_id | title_id | description_id | user_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1.071003e+19 | 8343295 | 11700 | 3 | 3 | 7702266 | 21264 | 27892 | 1559 | 0 |
| 1 | 1 | 1 | 1.736385e+19 | 20017077 | 23798 | 1 | 1 | 93079 | 35498 | 4 | 36476 | 562934 |
| 2 | 0 | 1 | 8.915473e+18 | 21348354 | 36654 | 1 | 1 | 10981 | 19975 | 36105 | 33292 | 11621116 |
| 3 | 0 | 1 | 4.426693e+18 | 20366086 | 33280 | 3 | 3 | 0 | 5942 | 4057 | 4390 | 8778348 |
| 4 | 0 | 1 | 1.15726e+19 | 6803526 | 10790 | 2 | 1 | 9881978 | 60593 | 25242 | 1679 | 12118311 |


In [84]:
df["url_hash"].nunique()

6941

In [85]:
df.isnull().sum()

click             0
impression        0
url_hash          0
ad_id             0
advertiser_id     0
depth             0
position          0
query_id          0
keyword_id        0
title_id          0
description_id    0
user_id           0
dtype: int64

In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39926 entries, 0 to 39947
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   click           39926 non-null  category
 1   impression      39926 non-null  int64   
 2   url_hash        39926 non-null  category
 3   ad_id           39926 non-null  int64   
 4   advertiser_id   39926 non-null  int64   
 5   depth           39926 non-null  int64   
 6   position        39926 non-null  category
 7   query_id        39926 non-null  int64   
 8   keyword_id      39926 non-null  int64   
 9   title_id        39926 non-null  int64   
 10  description_id  39926 non-null  int64   
 11  user_id         39926 non-null  int64   
dtypes: category(3), int64(9)
memory usage: 3.6 MB


Feature selection ideas to try: a variance threshold (`VarianceThreshold`) and `SelectKBest`.
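A minimal sketch of how the two feature selection steps could be chained, using synthetic data rather than the click dataset. The feature layout, the `f_classif` scoring function, and `k=1` are assumptions chosen so the result is easy to check; they are not tuned for this project.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    np.ones(n),                      # constant column, zero variance
    y + rng.normal(0, 0.1, size=n),  # informative column
    rng.normal(size=n),              # noise
    rng.normal(size=n),              # noise
])

selector = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),      # drops constant columns
    ("kbest", SelectKBest(score_func=f_classif, k=1)),   # keeps the top-scoring column
])
X_selected = selector.fit_transform(X, y)

print(selector.named_steps["variance"].get_support())
print(selector.named_steps["kbest"].get_support())
```

Running the variance filter first avoids scoring constant columns, for which an F-test is undefined.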

In [108]:
y = df["click"]
X = df.iloc[:,1:]

In [109]:
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y)

In [121]:
hyperparameters = {"n_jobs": -1, "class_weight":"balanced_subsample", "max_depth":2, "max_leaf_nodes":10 }

In [123]:
from sklearn.model_selection import cross_val_score, KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=42)


In [None]:
from sklearn.model_selection import GridSearchCV

C = np.logspace(0, 4, 10)
penalty = ['l2', 'none']
hyperparameters = dict(C=C,             # inverse of regularization strength
                       penalty=penalty, # type of regularization
                       )

clf_grid = GridSearchCV(LogisticRegression(),
                        hyperparameters,
                        cv=5,
                        verbose=True,
)
clf_grid.fit(X_train, y_train);

In [None]:
clf_grid.best_estimator_.get_params()

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
clf_rand = RandomizedSearchCV(estimator=LogisticRegression(), 
                              param_distributions=hyperparameters, 
                              n_iter=13,
                              cv=5,
                              verbose=True)
clf_rand.fit(X_train, y_train) 

In [None]:
clf_rand.best_estimator_.get_params()

In [None]:

from sklearn import set_config

set_config(display='diagram')

pipe_dt 

In [124]:
results = cross_val_score(pipe, 
                          X_train,
                          y_train, 
                          cv=kfold
)

In [125]:
print(f"Mean 10-fold cross-validation accuracy: {results.mean():.4f}")

Mean 10-fold cross-validation accuracy: 0.8314


In [122]:
# missing_values=0: zeros are treated as missing and imputed
con_pipe = Pipeline([('imputer', SimpleImputer(missing_values=0 ,strategy='median', add_indicator=True)), 
                        ('scaler', MinMaxScaler())])

cat_pipe = Pipeline([('imputer', SimpleImputer(missing_values=0 , strategy='most_frequent')),
                        ('encoder', OneHotEncoder(handle_unknown='ignore'))])

# Note: no column of X_train has dtype `object` (they are int64 or category),
# so this mask is all False and every column goes through con_pipe.
preprocessing = ColumnTransformer([('categorical', cat_pipe,  (X_train.dtypes == object)), 
                                    ('continuous',  con_pipe, ~(X_train.dtypes == object))])

pipe = Pipeline([('preprocessing', preprocessing), 
                    ('lm', LogisticRegression())])

pipe.fit(X_train, y_train)

Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('categorical',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(missing_values=0,
                                                                                 strategy='most_frequent')),
                                                                  ('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  impression        False
url_hash          False
ad_id             False
advertiser_id     False
depth             False
position          False
keyword_id        False
description_id    False
user_id           False
dtype: bool),
                                                 ('continuous',
                                                  Pipeline(steps=[('imputer',
     

In [120]:
y_pred = pipe.predict(X_validation)
mean_absolute_error(y_validation, y_pred)

0.17003519356460534
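The fitted pipeline's repr shows the `dtypes == object` mask as all False, so no column reaches the categorical branch. One possible alternative, sketched on a toy frame, is to select columns by dtype with `make_column_selector`. The toy columns and the choice of which dtypes count as categorical are assumptions; this is not the configuration that produced the scores in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

toy = pd.DataFrame({
    "position":   pd.Series([1, 2, 1, 3], dtype="category"),
    "depth":      [1, 2, 2, 3],
    "impression": [1, 1, 4, 2],
})

# The object-dtype mask matches nothing when columns are category or int64
mask_object = (toy.dtypes == object)
print(mask_object.any())

preprocessing = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_include="category")),
    ("continuous", MinMaxScaler(),
     make_column_selector(dtype_include="number")),
])
X_t = preprocessing.fit_transform(toy)

# 3 one-hot columns for `position` plus 2 scaled numeric columns
print(X_t.shape)
```

One-hot encoding high-cardinality IDs such as `user_id` would produce a very wide sparse matrix, so which columns to cast to `category` remains a modelling decision.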

In [None]:
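from sklearn.decomposition import PCA

# Fit one scaler + PCA pipeline per number of components.
# Separate names keep `pipe` and `results` from earlier cells intact.
pca_results = {}
for n_components in range(1, 5):
    pca_pipe = Pipeline([('scaler', StandardScaler()),
                         ('pca',    PCA(n_components=n_components, random_state=42)),
                    ])
    pca_pipe.fit(X)
    pca_results[n_components] = pca_pipe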

In [116]:
from sklearn import datasets
from sklearn.decomposition import PCA
# Load the data
digits = datasets.load_digits()
# Standardize the feature matrix (separate name so the project's X is not overwritten)
X_digits = StandardScaler().fit_transform(digits.data)
# A float n_components keeps enough components to explain 99% of the variance
pca = PCA(n_components=0.99) 
# Conduct PCA
X_pca = pca.fit_transform(X_digits)
print(f"Original number of features: {X_digits.shape[1]}")
print(f"Reduced number of features:  {X_pca.shape[1]}")

Original number of features: 64
Reduced number of features:  54
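To see where a reduced feature count like this comes from, the cumulative explained variance can be inspected directly. The sketch below refits PCA on the same digits data; the variable names and the explicit `svd_solver="full"` are choices made here for reproducibility, not settings taken from the cell above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_digits = StandardScaler().fit_transform(load_digits().data)

# Fit all components, then accumulate the share of variance they explain
full = PCA(svd_solver="full").fit(X_digits)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 99%
n_needed = int(np.searchsorted(cumulative, 0.99) + 1)

# A float n_components asks PCA to make the same choice internally
pca_99 = PCA(n_components=0.99, svd_solver="full").fit(X_digits)
print(n_needed, pca_99.n_components_)
```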


In [88]:
con_pipe = Pipeline([('imputer', SimpleImputer(missing_values=0 ,strategy='median', add_indicator=True)), 
                        ('scaler', StandardScaler())])

cat_pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                        ('encoder', HelmertEncoder(handle_unknown='ignore'))])

preprocessing = ColumnTransformer([('categorical', cat_pipe,  (X_train.dtypes == object)), 
                                    ('continuous',  con_pipe, ~(X_train.dtypes == object))])

pipe = Pipeline([('preprocessing', preprocessing), 
                    ('lm', ExtraTreeClassifier())])

pipe.fit(X_train, y_train)

Pipeline(steps=[('preprocessing',
                 ColumnTransformer(transformers=[('categorical',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('encoder',
                                                                   HelmertEncoder(handle_unknown='ignore'))]),
                                                  impression        False
url_hash          False
ad_id             False
advertiser_id     False
depth             False
position          False
query_id          False
keyword_id        False
title_id          False
description_id    False
user_id           False
dtype: bool),
                                                 ('continuous',
                                                  Pipeline(steps=[('imputer',
                                                       

## Evaluation Metric

In [126]:
from sklearn.metrics import accuracy_score

In [100]:
y_pred = pipe.predict(X_validation)
mean_absolute_error(y_validation, y_pred)

0.16129032258064516

In [127]:
accuracy_score(y_validation, y_pred)

0.8299648064353947
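For 0/1 labels, mean absolute error is the misclassification rate, so it equals 1 minus accuracy; the 0.1700 and 0.8300 reported above sum to 1 for that reason. With a class ratio near 5 to 1, a model that always predicts "no click" would already score about 0.83 accuracy, which is why `balanced_accuracy_score` is the more informative metric here. The sketch below illustrates this on synthetic labels; the 5-to-1 label vector is an assumption standing in for the validation split.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, mean_absolute_error

# Synthetic labels with a 5-to-1 imbalance (assumed, not the real validation set)
y_true = np.array([0, 0, 0, 0, 0, 1] * 100)

# Baseline that always predicts the majority class
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)           # share of correct predictions
mae = mean_absolute_error(y_true, y_majority)      # share of wrong predictions for 0/1 labels
bal = balanced_accuracy_score(y_true, y_majority)  # mean of per-class recall

print(f"accuracy={acc:.4f}  mae={mae:.4f}  balanced accuracy={bal:.4f}")
```

Balanced accuracy averages recall over the two classes, so the majority-class baseline lands at 0.5 regardless of the imbalance.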