# Foundations of Data Mining: Assignment 2

Please complete all assignments in this notebook. You should submit this notebook, as well as a PDF version (See File > Download as).

In [1]:
%matplotlib inline
from preamble import *
plt.rcParams['savefig.dpi'] = 100 # This controls the size of your figures
# Comment out and restart notebook if you only want the last output of each cell.
InteractiveShell.ast_node_interactivity = "all" 

In [2]:
# This is a temporary read-only OpenML key. Replace with your own key later.
oml.config.apikey = '11e82c8d91c5abece86f424369c71590'

## A benchmark study (3 points (2+1))

A benchmark study is an experiment in which multiple algorithms are evaluated on multiple datasets. The end goal is to study whether one algorithm is generally better than the others. Meaningful benchmark studies can grow quite complex; here we do a simplified variant.

* Download OpenML datasets 37, 470, 1120, 1464 and 1471. They are sufficiently large (e.g., at least 500 data points) so that the performance estimation is trustworthy. Select at least three classifiers that we discussed in class, e.g. kNN, Logistic Regression, Random Forests, Gradient Boosting, SVMs, Naive Bayes. Note that some of these algorithms take longer to train. Evaluate all classifiers (with default parameter settings) on all datasets, using a 10-fold CV and AUC. Show the results in a table and interpret them. Which is the best algorithm in this benchmark?
    * Note that these datasets have categorical features, different scales, missing values, and (likely) irrelevant features. You'll need to build pipelines to correctly build all models. Also remove any row identifiers (see, e.g., https://www.openml.org/d/1120)
    * Hint: You can either compare the performances directly, or (better) use a statistical significance test, e.g. a pairwise t-test or (better) Wilcoxon signed ranks test, to see whether the performance differences are significant. This is covered in statistics courses. You can then count wins, ties and losses.
* Repeat the benchmark, but now additionally optimize the main hyperparameters of each algorithm in a grid or random search (explore at least 5 values per hyperparameter, where possible). Does this affect the ranking of the algorithms?
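
The significance test mentioned in the hint can be sketched as follows. This is a minimal illustration, not the required solution: it assumes SciPy is available, uses a synthetic dataset from `make_classification` as a stand-in for an OpenML dataset, and the names `X_demo`, `y_demo`, `folds` and `alpha` are made up for the example. Both classifiers are scored on the same folds so that the per-fold AUC values are paired. Note that CV folds share training data, so the resulting p-value should be read as a rough indication only.

```python
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption); replace with one of the OpenML datasets.
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)

# Reuse the same folds for both classifiers so the per-fold scores are paired.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
knn_auc = cross_val_score(make_pipeline(MinMaxScaler(), KNeighborsClassifier()),
                          X_demo, y_demo, cv=folds, scoring="roc_auc")
lr_auc = cross_val_score(make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000)),
                         X_demo, y_demo, cv=folds, scoring="roc_auc")

# Wilcoxon signed-rank test on the paired per-fold AUC values.
statistic, p_value = wilcoxon(knn_auc, lr_auc)

# Turn the test result into a win/tie/loss verdict at significance level alpha.
alpha = 0.05
if p_value >= alpha:
    outcome = "tie"
elif knn_auc.mean() > lr_auc.mean():
    outcome = "kNN wins"
else:
    outcome = "Logistic Regression wins"

print("kNN AUC: %.3f, LR AUC: %.3f, p = %.4f -> %s"
      % (knn_auc.mean(), lr_auc.mean(), p_value, outcome))
```

Repeating this for every pair of classifiers on every dataset gives the wins, ties and losses to count.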

In [3]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV 
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


In [4]:
diabetes_data = oml.datasets.get_dataset(37)
X_diabetes, y_diabetes, attributes_diabetes = diabetes_data.get_data(
    target=diabetes_data.default_target_attribute,
    return_attribute_names=True)

diabetes_df = pd.DataFrame(X_diabetes, columns=attributes_diabetes)
display(diabetes_df.describe())

Unnamed: 0,preg,plas,pres,skin,insu,mass,pedi,age
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.85,120.89,69.11,20.54,79.8,31.99,0.47,33.24
std,3.37,31.97,19.36,15.95,115.24,7.88,0.33,11.76
min,0.0,0.0,0.0,0.0,0.0,0.0,0.08,21.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24,24.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.37,29.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.63,41.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0


In [5]:

profb_data = oml.datasets.get_dataset(470)
X_raw_profb, y_profb, attributes_profb = profb_data.get_data(
    target=profb_data.default_target_attribute,
    return_attribute_names=True)

profb_df = pd.DataFrame(X_raw_profb, columns=attributes_profb)
profb_df = profb_df[['Favorite_Points','Underdog_Points',
                    'Pointspread','Favorite_Name','Underdog_name']]
display(profb_df.describe())
profb_df = pd.get_dummies(profb_df, columns=['Favorite_Name', 'Underdog_name'])
X_profb = profb_df.values


Unnamed: 0,Favorite_Points,Underdog_Points,Pointspread,Favorite_Name,Underdog_name
count,672.0,672.0,672.0,672.0,672.0
mean,22.95,16.86,5.31,13.48,13.52
std,9.97,9.27,3.31,8.23,7.93
min,0.0,0.0,0.0,0.0,0.0
25%,16.0,10.0,3.0,6.0,7.0
50%,23.0,16.5,5.0,14.0,13.0
75%,29.0,23.0,7.0,20.0,21.0
max,61.0,47.0,19.5,27.0,27.0


In [6]:
#1120
telescope_data = oml.datasets.get_dataset(1120)
X_telescope, y_telescope, attributes_telescope = telescope_data.get_data(
    target=telescope_data.default_target_attribute,
    return_attribute_names=True)

telescope_df = pd.DataFrame(X_telescope, columns=attributes_telescope)
display(telescope_df.describe())

Unnamed: 0,fLength:,fWidth:,fSize:,fConc:,...,fM3Long:,fM3Trans:,fAlpha:,fDist:
count,19020.0,19020.0,19020.0,19020.0,...,19020.0,19020.0,19020.0,19020.0
mean,53.25,22.18,2.83,0.38,...,10.55,0.25,27.65,193.82
std,42.36,18.35,0.47,0.18,...,51.0,20.83,26.1,74.73
min,4.28,0.0,1.94,0.01,...,-331.78,-205.89,0.0,1.28
25%,24.34,11.86,2.48,0.24,...,-12.84,-10.85,5.55,142.49
50%,37.15,17.14,2.74,0.35,...,15.31,0.67,17.68,191.85
75%,70.12,24.74,3.1,0.5,...,35.84,10.95,45.88,240.56
max,334.18,256.38,5.32,0.89,...,238.32,179.85,90.0,495.56


In [7]:
#1464
blood_data = oml.datasets.get_dataset(1464)
X_blood, y_blood, attributes_blood = blood_data.get_data(
    target=blood_data.default_target_attribute,
    return_attribute_names=True)

blood_df = pd.DataFrame(X_blood, columns=attributes_blood)
display(blood_df.describe())


Unnamed: 0,V1,V2,V3,V4
count,748.0,748.0,748.0,748.0
mean,9.51,5.51,1378.68,34.28
std,8.1,5.84,1459.83,24.38
min,0.0,1.0,250.0,2.0
25%,2.75,2.0,500.0,16.0
50%,7.0,4.0,1000.0,28.0
75%,14.0,7.0,1750.0,50.0
max,74.0,50.0,12500.0,98.0


In [8]:
#1471
eeg_data = oml.datasets.get_dataset(1471)
X_eeg, y_eeg, attributes_eeg = eeg_data.get_data(
    target=eeg_data.default_target_attribute,
    return_attribute_names=True)

eeg_df = pd.DataFrame(X_eeg, columns=attributes_eeg)
display(eeg_df.describe())

Unnamed: 0,V1,V2,V3,V4,...,V11,V12,V13,V14
count,14980.0,14980.0,14980.0,14980.0,...,14980.0,14980.0,14980.0,14980.0
mean,4321.9,4009.78,4264.03,4164.96,...,4202.45,4279.24,4615.21,4416.44
std,2492.02,45.94,44.43,5216.36,...,37.79,41.54,1208.36,5890.98
min,1030.77,2830.77,1040.0,2453.33,...,3273.33,2257.95,86.67,1366.15
25%,4280.51,3990.77,4250.26,4108.21,...,4190.26,4267.69,4590.77,4342.05
50%,4294.36,4005.64,4262.56,4120.51,...,4200.51,4276.92,4603.08,4354.87
75%,4311.79,4023.08,4270.77,4132.31,...,4211.28,4287.18,4617.44,4372.82
max,309231.0,7804.62,6880.51,642564.0,...,6823.08,7002.56,152308.0,715897.0


In [None]:
##feature engineering to:
## -> diabetes_data (scaling)
## -> profb_data (select important columns, encode categorical values)
## -> telescope_data (scaling)
## -> blood_data (scaling)
## -> eeg_data (scaling)
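
The preprocessing plan in the comments above could also be expressed as a single pipeline, so that imputation, scaling and encoding are fitted inside each CV fold. The sketch below is only an illustration under assumptions: the toy frame and its column names (`points`, `spread`, `team`) are hypothetical and not taken from any of the five datasets, and `ColumnTransformer` and `SimpleImputer` require scikit-learn 0.20 or later.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data with hypothetical columns: two numeric features and one categorical.
rng = np.random.RandomState(0)
n_rows = 200
toy_df = pd.DataFrame({
    "points": rng.normal(20, 5, n_rows),
    "spread": rng.uniform(0, 10, n_rows),
    "team": rng.choice(["A", "B", "C"], n_rows),
})
# The target depends on "spread" plus noise, so the model has something to learn.
toy_y = (toy_df["spread"] + rng.normal(0, 2, n_rows) > 5).astype(int)
# Inject some missing values into a numeric column.
toy_df.loc[::17, "points"] = np.nan

# Numeric columns: fill missing values with the median, then scale to [0, 1].
numeric_steps = Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", MinMaxScaler())])
# Categorical columns: fill missing values, then one-hot encode.
categorical_steps = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                              ("onehot", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([("num", numeric_steps, ["points", "spread"]),
                                ("cat", categorical_steps, ["team"])])
model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

toy_scores = cross_val_score(model, toy_df, toy_y, cv=5, scoring="roc_auc")
print("Mean AUC on toy data: %.3f" % toy_scores.mean())
```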

In [9]:
#run kNN, Logistic Regression and SVM on all datasets using pipelines
clf_scores = []

#run kNN over all datasets
knn_scores = {'clf': 'kNN'}
knn_pipe = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier())])

scores = cross_val_score(knn_pipe, X_diabetes, y_diabetes, cv=10, scoring='roc_auc')
knn_scores['diabetes'] = scores.mean()
scores = cross_val_score(knn_pipe, X_telescope, y_telescope, cv=10, scoring='roc_auc')
knn_scores['telescope'] = scores.mean()
scores = cross_val_score(knn_pipe, X_blood, y_blood, cv=10, scoring='roc_auc')
knn_scores['blood'] = scores.mean()
scores = cross_val_score(knn_pipe, X_eeg, y_eeg, cv=10, scoring='roc_auc')
knn_scores['eeg'] = scores.mean()

scores = cross_val_score(KNeighborsClassifier(), X_profb, y_profb, cv=10, scoring='roc_auc')
knn_scores['profb'] = scores.mean()
clf_scores.append(knn_scores)


#run Logistic Regression over all datasets
logistic_scores = {'clf': 'LogisticRegression'}
logistic_pipe = Pipeline([("scaler", MinMaxScaler()), ("logistic", LogisticRegression())])

scores = cross_val_score(logistic_pipe, X_diabetes, y_diabetes, cv=10, scoring='roc_auc')
logistic_scores['diabetes'] = scores.mean()
scores = cross_val_score(logistic_pipe, X_telescope, y_telescope, cv=10, scoring='roc_auc')
logistic_scores['telescope'] = scores.mean()
scores = cross_val_score(logistic_pipe, X_blood, y_blood, cv=10, scoring='roc_auc')
logistic_scores['blood'] = scores.mean()
scores = cross_val_score(logistic_pipe, X_eeg, y_eeg, cv=10, scoring='roc_auc')
logistic_scores['eeg'] = scores.mean()

scores = cross_val_score(LogisticRegression(), X_profb, y_profb, cv=10, scoring='roc_auc')
logistic_scores['profb'] = scores.mean()
clf_scores.append(logistic_scores)



#run SVM over all datasets
svm_scores = {'clf': 'SVM'}
svm_pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])

scores = cross_val_score(svm_pipe, X_diabetes, y_diabetes, cv=10, scoring='roc_auc')
svm_scores['diabetes'] = scores.mean()
scores = cross_val_score(svm_pipe, X_telescope, y_telescope, cv=10, scoring='roc_auc')
svm_scores['telescope'] = scores.mean()
scores = cross_val_score(svm_pipe, X_blood, y_blood, cv=10, scoring='roc_auc')
svm_scores['blood'] = scores.mean()
scores = cross_val_score(svm_pipe, X_eeg, y_eeg, cv=10, scoring='roc_auc')
svm_scores['eeg'] = scores.mean()

scores = cross_val_score(SVC(), X_profb, y_profb, cv=10, scoring='roc_auc')
svm_scores['profb'] = scores.mean()
clf_scores.append(svm_scores)

scores_df = pd.DataFrame(clf_scores)
scores_df = scores_df[['clf', 'profb', 'diabetes', 'telescope', 'blood', 'eeg']]
display(scores_df)

| | clf | profb | diabetes | telescope | blood | eeg |
|---|---|---|---|---|---|---|
| 0 | kNN | 0.6 | 0.78 | 0.88 | 0.51 | 0.47 |
| 1 | LogisticRegression | 0.77 | 0.83 | 0.84 | 0.95 | 0.47 |
| 2 | SVM | 0.66 | 0.83 | 0.87 | 0.78 | 0.42 |
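
The repeated blocks in the cell above could be collapsed into two nested loops. The following is a sketch only: the two datasets are synthetic stand-ins generated with `make_classification` (an assumption, so the numbers it prints are unrelated to the table above), and the names `demo_datasets`, `classifiers` and `demo_scores_df` are made up for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-ins (assumption); in the notebook these would be the
# (X, y) pairs loaded from OpenML.
demo_datasets = {
    "demo_a": make_classification(n_samples=300, n_features=8, random_state=0),
    "demo_b": make_classification(n_samples=300, n_features=8, random_state=1),
}
classifiers = {
    "kNN": KNeighborsClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}

rows = []
for clf_name, clf in classifiers.items():
    row = {"clf": clf_name}
    # Scaling lives inside the pipeline, so it is refitted on every training fold.
    pipe = Pipeline([("scaler", MinMaxScaler()), ("clf", clf)])
    for data_name, (X_d, y_d) in demo_datasets.items():
        fold_scores = cross_val_score(pipe, X_d, y_d, cv=10, scoring="roc_auc")
        row[data_name] = fold_scores.mean()
    rows.append(row)

demo_scores_df = pd.DataFrame(rows)
print(demo_scores_df.round(2))
```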


### Evaluate all classifiers (with default parameter settings)

The following classifiers (with default parameter settings) were evaluated over 5 datasets:
* kNN 
* Logistic Regression
* SVM

From the table, and counting ties, Logistic Regression achieves the best AUC on 4 datasets (profb, diabetes, blood and eeg), kNN on 2 (telescope and eeg) and SVM on 1 (diabetes). We are therefore inclined to think that Logistic Regression has the best performance; it also takes less time to train than SVM.
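
The count of best scores can be reproduced directly from the rounded values in the results table. This is a small sketch in plain Python; the dictionary name `default_auc` is made up for the example, and because the values are rounded to two decimals, classifiers that tie at that precision are both counted as best.

```python
# Mean AUC values copied from the results table (rounded to two decimals).
default_auc = {
    "kNN": {"profb": 0.60, "diabetes": 0.78, "telescope": 0.88, "blood": 0.51, "eeg": 0.47},
    "LogisticRegression": {"profb": 0.77, "diabetes": 0.83, "telescope": 0.84, "blood": 0.95, "eeg": 0.47},
    "SVM": {"profb": 0.66, "diabetes": 0.83, "telescope": 0.87, "blood": 0.78, "eeg": 0.42},
}
datasets = ["profb", "diabetes", "telescope", "blood", "eeg"]

# For each dataset, every classifier that reaches the top score gets one point.
best_counts = {name: 0 for name in default_auc}
for dataset in datasets:
    top_score = max(scores[dataset] for scores in default_auc.values())
    for name, scores in default_auc.items():
        if scores[dataset] == top_score:
            best_counts[name] += 1

print(best_counts)
# prints {'kNN': 2, 'LogisticRegression': 4, 'SVM': 1}
```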

In [19]:
def tune_parameters(clf, params, X, y):
    [X_train, X_test, y_train, y_test] = train_test_split(X,y,
                                                     test_size=0.2,
                                                    random_state=0,
                                                     stratify=y)
    grid_search = GridSearchCV(clf,
                           params,
                           scoring='roc_auc',
                           cv=5)
    grid_search.fit(X_train, y_train)
    return grid_search.best_params_

def tune_svm_parameters(clf, params, X, y, n_iter_search):
    # Hold out 20% of the data; the search only sees the training split.
    [X_train, X_test, y_train, y_test] = train_test_split(X,y,
                                                     test_size=0.2,
                                                    random_state=0,
                                                     stratify=y)
    best_score = 0.0
    best_params = None
    # Run one randomized search per kernel-specific grid and keep the best one.
    for param_grid in params:
        random_search = RandomizedSearchCV(clf,
                                       param_distributions=param_grid,
                                       n_iter=n_iter_search,
                                       scoring='roc_auc',
                                       cv=5)
        random_search.fit(X_train, y_train)
        if random_search.best_score_ > best_score:
            best_score = random_search.best_score_
            best_params = random_search.best_params_
    
    return best_params
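
One alternative to tuning on a single hold-out split is nested cross-validation with the scaler inside the tuned pipeline, so that the hyperparameters are selected on the same scaled features that are later evaluated. The sketch below is an illustration under assumptions: it uses synthetic data from `make_classification`, and the names `X_nested`, `y_nested`, `inner_search` and `nested_scores` are made up for the example. Parameters of a pipeline step are addressed with the `<step name>__<parameter>` convention.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption); replace with one of the OpenML datasets.
X_nested, y_nested = make_classification(n_samples=400, n_features=10, random_state=0)

# The scaler is part of the pipeline, so the grid search tunes kNN on scaled data.
pipe = Pipeline([("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11]}

# Inner loop: 5-fold grid search picks n_neighbors on each outer training set.
inner_search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)

# Outer loop: 10-fold CV estimates the AUC of the whole tuning procedure.
nested_scores = cross_val_score(inner_search, X_nested, y_nested,
                                cv=10, scoring="roc_auc")
print("Nested CV AUC: %.3f" % nested_scores.mean())
```

With this setup no single `best_params_` is reported, since each outer fold may select different values.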

In [12]:
##optimizing main hyperparameters for knn
## n_neighbors -> (3, 5, 7, 9, 11, 13, 15)
## (weights is not searched here and stays at its default, 'uniform')

clf_scores = []

knn_params = [{'n_neighbors': [3,5,7,9,11,13,15]}]

#run kNN over all datasets
knn_scores = {'clf': 'kNN'}

#### performance for diabetes ####
knn_best_params = tune_parameters(KNeighborsClassifier(),
                                 knn_params,
                                 X_diabetes,
                                 y_diabetes)

knn_pipe = Pipeline([("scaler", MinMaxScaler()),
                     ("knn", KNeighborsClassifier(**knn_best_params))])

scores = cross_val_score(knn_pipe, X_diabetes, y_diabetes, 
                         cv=10, scoring='roc_auc')
knn_scores['diabetes'] = scores.mean()
print(knn_best_params)

#### performance for telescope ####
knn_best_params = tune_parameters(KNeighborsClassifier(),
                                 knn_params,
                                 X_telescope,
                                 y_telescope)

knn_pipe = Pipeline([("scaler", MinMaxScaler()),
                     ("knn", KNeighborsClassifier(**knn_best_params))])

scores = cross_val_score(knn_pipe, X_telescope, y_telescope, 
                         cv=10, scoring='roc_auc')
knn_scores['telescope'] = scores.mean()
print(knn_best_params)

#### performance for blood ####
knn_best_params = tune_parameters(KNeighborsClassifier(),
                                 knn_params,
                                 X_blood,
                                 y_blood)

knn_pipe = Pipeline([("scaler", MinMaxScaler()),
                     ("knn", KNeighborsClassifier(**knn_best_params))])

scores = cross_val_score(knn_pipe, X_blood, y_blood, 
                         cv=10, scoring='roc_auc')
knn_scores['blood'] = scores.mean()
print(knn_best_params)

#### performance for eeg ####
knn_best_params = tune_parameters(KNeighborsClassifier(),
                                 knn_params,
                                 X_eeg,
                                 y_eeg)

knn_pipe = Pipeline([("scaler", MinMaxScaler()),
                     ("knn", KNeighborsClassifier(**knn_best_params))])

scores = cross_val_score(knn_pipe, X_eeg, y_eeg, 
                         cv=10, scoring='roc_auc')
knn_scores['eeg'] = scores.mean()
print(knn_best_params)

#### performance for profb ####
scores= cross_val_score(GridSearchCV(KNeighborsClassifier(),
                                     knn_params,
                                     scoring='roc_auc',
                                     cv=5),
                        X_profb, y_profb,
                        cv=10, scoring='roc_auc')
knn_scores['profb'] = scores.mean()
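# NOTE: the nested search above does not update knn_best_params,
# so this prints the parameters selected for eeg again.
print(knn_best_params)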

print(knn_scores)

clf_scores.append(knn_scores)




{'n_neighbors': 13}
{'n_neighbors': 15}
{'n_neighbors': 7}
{'n_neighbors': 5}
{'n_neighbors': 5}
{'diabetes': 0.80973646723646731, 'blood': 0.53439972480220166, 'telescope': 0.900310571597149, 'profb': 0.61405956801213712, 'clf': 'kNN', 'eeg': 0.47353510267537324}


In [21]:
##optimizing main hyperparameters for svm

svm_params = [
    {'kernel': ['poly'],
     'C': [0.001, 0.01, 0.1, 1, 10, 100],
     'degree': np.arange(3,10).tolist(),
     'coef0': [1.0/4, 1.0/2, 1.0, 2, 4, 8, 16]},
    {'kernel': ['rbf'], 
     'C': [0.001, 0.01, 0.1, 1, 10, 100],
     'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
]

n_iter_search = 5

#run SVM over all datasets
svm_scores = {'clf': 'SVM'}

#### performance for diabetes ####
svm_best_params = tune_svm_parameters(SVC(),
                                 svm_params,
                                 X_diabetes,
                                 y_diabetes,
                                     n_iter_search)

svm_pipe = Pipeline([("scaler", MinMaxScaler()),
                     ("svm", SVC(**svm_best_params))])

scores = cross_val_score(svm_pipe, X_diabetes, y_diabetes, 
                         cv=10, scoring='roc_auc')
svm_scores['diabetes'] = scores.mean()
print(svm_best_params)

#### performance for telescope ####
svm_best_params = tune_svm_parameters(SVC(),
                                 svm_params,
                                 X_telescope,
                                 y_telescope,
                                     n_iter_search)

svm_pipe = Pipeline([("scaler", MinMaxScaler()),
                     ("svm", SVC(**svm_best_params))])

scores = cross_val_score(svm_pipe, X_telescope, y_telescope, 
                         cv=10, scoring='roc_auc')
svm_scores['telescope'] = scores.mean()
print(svm_best_params)

#### performance for blood ####
svm_best_params = tune_svm_parameters(SVC(),
                                 svm_params,
                                 X_blood,
                                 y_blood,
                                     n_iter_search)

svm_pipe = Pipeline([("scaler", MinMaxScaler()),
                     ("svm", SVC(**svm_best_params))])

scores = cross_val_score(svm_pipe, X_blood, y_blood, 
                         cv=10, scoring='roc_auc')
svm_scores['blood'] = scores.mean()
print(svm_best_params)

#### performance for eeg ####
svm_best_params = tune_svm_parameters(SVC(),
                                 svm_params,
                                 X_eeg,
                                 y_eeg,
                                     n_iter_search)

svm_pipe = Pipeline([("scaler", MinMaxScaler()),
                     ("svm", SVC(**svm_best_params))])

scores = cross_val_score(svm_pipe, X_eeg, y_eeg, 
                         cv=10, scoring='roc_auc')
svm_scores['eeg'] = scores.mean()
print(svm_best_params)

#### performance for profb ####
# Random search over polynomial kernel parameters
poly_scores = cross_val_score(RandomizedSearchCV(SVC(),
                                     svm_params[0],
                                     n_iter = n_iter_search,
                                     scoring='roc_auc',
                                     cv=5),
                        X_profb, y_profb,
                        cv=10, scoring='roc_auc')

# Random search over rbf kernel parameters
rbf_scores = cross_val_score(RandomizedSearchCV(SVC(),
                                     svm_params[1],
                                     n_iter = n_iter_search,
                                     scoring='roc_auc',
                                     cv=5),
                        X_profb, y_profb,
                        cv=10, scoring='roc_auc')

if poly_scores.mean() > rbf_scores.mean():
    svm_scores['profb'] = poly_scores.mean()
else:
    svm_scores['profb'] = rbf_scores.mean()

clf_scores.append(svm_scores)


{'gamma': 0.01, 'C': 0.1, 'kernel': 'rbf'}
{'gamma': 0.001, 'C': 1, 'kernel': 'rbf'}
{'gamma': 0.001, 'C': 10, 'kernel': 'rbf'}
{'gamma': 0.001, 'C': 100, 'kernel': 'rbf'}


In [22]:
##optimizing main hyperparameters for logistic regression

logistic_params = {'C': [0.0001,0.001,0.01,0.1,1,10,100, 1000],
                   'penalty': ['l1', 'l2']}

#run Logistic Regression over all datasets
logistic_scores = {'clf': 'LogisticRegression'}

#### performance for diabetes ####
logistic_best_params = tune_parameters(LogisticRegression(),
                                 logistic_params,
                                 X_diabetes,
                                 y_diabetes)

logistic_pipe = Pipeline([("scaler", MinMaxScaler()),
                     ("logistic", LogisticRegression(**logistic_best_params))])

scores = cross_val_score(logistic_pipe, X_diabetes, y_diabetes, 
                         cv=10, scoring='roc_auc')
logistic_scores['diabetes'] = scores.mean()
print(logistic_best_params)

#### performance for telescope ####
logistic_best_params = tune_parameters(LogisticRegression(),
                                 logistic_params,
                                 X_telescope,
                                 y_telescope)

logistic_pipe = Pipeline([("scaler", MinMaxScaler()),
                     ("logistic", LogisticRegression(**logistic_best_params))])

scores = cross_val_score(logistic_pipe, X_telescope, y_telescope, 
                         cv=10, scoring='roc_auc')
logistic_scores['telescope'] = scores.mean()
print(logistic_best_params)

#### performance for blood ####
logistic_best_params = tune_parameters(LogisticRegression(),
                                 logistic_params,
                                 X_blood,
                                 y_blood)

logistic_pipe = Pipeline([("scaler", MinMaxScaler()),
                     ("logistic", LogisticRegression(**logistic_best_params))])

scores = cross_val_score(logistic_pipe, X_blood, y_blood, 
                         cv=10, scoring='roc_auc')
logistic_scores['blood'] = scores.mean()
print(logistic_best_params)

#### performance for eeg ####
logistic_best_params = tune_parameters(LogisticRegression(),
                                 logistic_params,
                                 X_eeg,
                                 y_eeg)

logistic_pipe = Pipeline([("scaler", MinMaxScaler()),
                     ("logistic", LogisticRegression(**logistic_best_params))])

scores = cross_val_score(logistic_pipe, X_eeg, y_eeg, 
                         cv=10, scoring='roc_auc')
logistic_scores['eeg'] = scores.mean()
print(logistic_best_params)

#### performance for profb ####
scores= cross_val_score(GridSearchCV(LogisticRegression(),
                                     logistic_params,
                                     scoring='roc_auc',
                                     cv=5),
                        X_profb, y_profb,
                        cv=10, scoring='roc_auc')
logistic_scores['profb'] = scores.mean()

clf_scores.append(logistic_scores)



{'penalty': 'l2', 'C': 1000}
{'penalty': 'l2', 'C': 100}
{'penalty': 'l1', 'C': 10}
{'penalty': 'l2', 'C': 10}


### Benchmark with optimized hyperparameters

In [23]:
scores_df = pd.DataFrame(clf_scores)
scores_df = scores_df[['clf', 'profb', 'diabetes', 'telescope', 'blood', 'eeg']]
display(scores_df)

| | clf | profb | diabetes | telescope | blood | eeg |
|---|---|---|---|---|---|---|
| 0 | kNN | 0.61 | 0.81 | 0.9 | 0.53 | 0.47 |
| 1 | SVM | 0.72 | 0.82 | 0.81 | 0.81 | 0.42 |
| 2 | LogisticRegression | 0.76 | 0.83 | 0.84 | 0.95 | 0.46 |


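From the table above, kNN achieves the best AUC on 2 datasets (telescope and eeg) and Logistic Regression on the other 3 (profb, diabetes and blood). We are therefore inclined to think that Logistic Regression performs better than the other classifiers in this benchmark.
Hyperparameter optimization slightly improved kNN on four of the five datasets. For SVM the effect was mixed: it improved on profb (0.66 to 0.72) and blood (0.78 to 0.81) but got worse on telescope (0.87 to 0.81), and it no longer ties for the best score on any dataset. The ranking of the algorithms was not affected.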
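
One simple way to check the ranking claim is to order the classifiers by their mean AUC over the five datasets, before and after tuning. The sketch below uses the rounded values from the two results tables; ranking by mean AUC is an assumption made for this illustration (mean ranks or win counts are alternatives), and the names `default_auc`, `tuned_auc` and `rank_by_mean` are made up for the example.

```python
# Rounded AUC values copied from the two results tables, in the order
# profb, diabetes, telescope, blood, eeg.
default_auc = {
    "kNN": [0.60, 0.78, 0.88, 0.51, 0.47],
    "LogisticRegression": [0.77, 0.83, 0.84, 0.95, 0.47],
    "SVM": [0.66, 0.83, 0.87, 0.78, 0.42],
}
tuned_auc = {
    "kNN": [0.61, 0.81, 0.90, 0.53, 0.47],
    "LogisticRegression": [0.76, 0.83, 0.84, 0.95, 0.46],
    "SVM": [0.72, 0.82, 0.81, 0.81, 0.42],
}

def rank_by_mean(table):
    """Return classifier names ordered from highest to lowest mean AUC."""
    means = {name: sum(scores) / len(scores) for name, scores in table.items()}
    return sorted(means, key=means.get, reverse=True)

default_ranking = rank_by_mean(default_auc)
tuned_ranking = rank_by_mean(tuned_auc)
print("default:", default_ranking)
print("tuned:  ", tuned_ranking)
# Both print ['LogisticRegression', 'SVM', 'kNN']
```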