**Machine Learning Basic Principles 2018 - Data Analysis Project Report**

*All the text in italics is instructions for filling the template - remove when writing the project report!*

# *Title* 

*The title should be concise and informative, and should describe the approach used to solve the problem. Some good titles from previous years:*

*- Comparing extreme learning machines and naive bayes’ classifier in spam detection*

*- Using linear discriminant analysis in spam detection*

*Some not-so-good titles:*

*- Bayesian spam filtering with extras*

*- Two-component classifier for spam detection*

*- CS-E3210 Term Project, final report*




## Abstract

*Precise summary of the whole report, previews the contents and results. Must be a single paragraph between 100 and 200 words.*



## 1. Introduction

*Background, problem statement, motivation, many references, description of
contents. Introduces the reader to the topic and the broad context within which your
research/project fits.*

*- What do you hope to learn from the project?*
*- What question is being addressed?*
*- Why is this task important? (motivation)*

*Keep it short (half to 1 page).*



## 2. Data analysis

*Briefly describe the data (class distribution, dimensionality) and how it will affect
classification. Visualize the data. Don’t focus too much on the meaning of the features,
unless you want to.*

*- Include histograms showing class distribution.*
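
As a starting point for the class-distribution histogram, the sketch below counts the samples per class and prints a simple text bar for each. The `labels` array is made-up illustrative data, not the project's labels; in this notebook it would presumably be replaced by `train_labels_df.values.ravel()`. `class_distribution` is a hypothetical helper name.

```python
import numpy as np

def class_distribution(labels):
    """Return (classes, counts, fractions) for a 1-D array of integer labels."""
    classes, counts = np.unique(labels, return_counts=True)
    fractions = counts / counts.sum()
    return classes, counts, fractions

# Illustrative labels only (assumption): replace with the real training labels.
labels = np.array([1, 1, 1, 1, 2, 2, 3, 1, 2, 1])
classes, counts, fractions = class_distribution(labels)

# One text bar per class; a plotting library could draw the same counts.
for c, n, f in zip(classes, counts, fractions):
    print(f"class {c}: {n:3d} samples ({f:.0%}) " + "#" * n)
```

The same counts can be passed to any plotting library to produce the bar chart for the report.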



In [1]:
# Import libraries
import pandas as pd
import numpy as np


(4362, 264)


Unnamed: 0,1040.7,2315.6,2839.1,2552.2,2290.4,1913.8,2152.6,1930.3,2079.3,1706.7,...,0.21649,0.36548,0.093584,0.16687,0.083426,0.11809,0.089792,0.074371,0.073162,0.059463
0,2309.4,4780.4,4055.7,3120.5,1979.9,2343.6,2634.2,3208.5,3078.0,3374.7,...,0.10067,0.14739,0.10256,0.21304,0.082041,0.080967,0.07645,0.052523,0.052357,0.055297
1,2331.9,4607.0,4732.3,5007.0,3164.9,3171.9,2915.7,3282.3,2400.0,1895.2,...,0.12676,0.36321,0.1142,0.22378,0.10077,0.18691,0.06727,0.061138,0.085509,0.049422
2,3350.9,6274.4,5037.0,4609.7,3438.8,3925.8,3746.4,3539.4,3053.7,3075.4,...,0.096479,0.2895,0.074124,0.20158,0.049032,0.13021,0.0458,0.080885,0.14891,0.042027
3,2017.6,3351.8,2924.9,2726.3,1979.9,1930.9,2083.4,1889.2,1695.4,1911.7,...,0.13834,0.38266,0.079402,0.063495,0.053717,0.08675,0.06209,0.048999,0.033159,0.070813
4,1229.8,3005.8,2818.4,2640.1,2329.1,2568.4,2772.1,3119.3,2505.8,2085.0,...,0.13729,0.065876,0.078278,0.058903,0.051245,0.049138,0.070669,0.067383,0.053383,0.037763


In [65]:
# Load the data and cleanup

# Read data
train_data_df = pd.read_csv("data/train_data.csv", header=None)
test_data_df = pd.read_csv("data/test_data.csv", header=None)
train_labels_df = pd.read_csv("data/train_labels.csv", header=None)

print(train_data_df.shape)
print(test_data_df.shape)
train_data_df.head()


(4363, 264)
(6544, 264)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,254,255,256,257,258,259,260,261,262,263
0,1040.7,2315.6,2839.1,2552.2,2290.4,1913.8,2152.6,1930.3,2079.3,1706.7,...,0.21649,0.36548,0.093584,0.16687,0.083426,0.11809,0.089792,0.074371,0.073162,0.059463
1,2309.4,4780.4,4055.7,3120.5,1979.9,2343.6,2634.2,3208.5,3078.0,3374.7,...,0.10067,0.14739,0.10256,0.21304,0.082041,0.080967,0.07645,0.052523,0.052357,0.055297
2,2331.9,4607.0,4732.3,5007.0,3164.9,3171.9,2915.7,3282.3,2400.0,1895.2,...,0.12676,0.36321,0.1142,0.22378,0.10077,0.18691,0.06727,0.061138,0.085509,0.049422
3,3350.9,6274.4,5037.0,4609.7,3438.8,3925.8,3746.4,3539.4,3053.7,3075.4,...,0.096479,0.2895,0.074124,0.20158,0.049032,0.13021,0.0458,0.080885,0.14891,0.042027
4,2017.6,3351.8,2924.9,2726.3,1979.9,1930.9,2083.4,1889.2,1695.4,1911.7,...,0.13834,0.38266,0.079402,0.063495,0.053717,0.08675,0.06209,0.048999,0.033159,0.070813


In [None]:
print(train_labels_df.shape)
print(train_labels_df.values.ravel().shape)  # .values is an attribute, not a method



In [15]:
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

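# Hold out 25% of the labelled data to score the pipeline that TPOT finds.
# .values replaces the deprecated DataFrame.as_matrix().
X_train, X_test, y_train, y_test = train_test_split(train_data_df, train_labels_df.values.ravel(),
                                                    train_size=0.75, test_size=0.25)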

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_pipeline.py')



Generation 1 - Current best internal CV score: 0.6307150918655987


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: GradientBoostingClassifier(input_matrix, learning_rate=0.1, max_depth=3, max_features=0.55, min_samples_leaf=8, min_samples_split=13, n_estimators=100, subsample=0.35)
0.648945921173


True

In [83]:
pd.DataFrame.from_dict(tpot.evaluated_individuals_)

Unnamed: 0,"BernoulliNB(OneHotEncoder(input_matrix, OneHotEncoder__minimum_fraction=0.1, OneHotEncoder__sparse=False, OneHotEncoder__threshold=10), BernoulliNB__alpha=1.0, BernoulliNB__fit_prior=False)","BernoulliNB(PolynomialFeatures(input_matrix, PolynomialFeatures__degree=2, PolynomialFeatures__include_bias=False, PolynomialFeatures__interaction_only=False), BernoulliNB__alpha=1.0, BernoulliNB__fit_prior=True)","BernoulliNB(PolynomialFeatures(input_matrix, PolynomialFeatures__degree=2, PolynomialFeatures__include_bias=False, PolynomialFeatures__interaction_only=False), BernoulliNB__alpha=100.0, BernoulliNB__fit_prior=True)","BernoulliNB(VarianceThreshold(input_matrix, VarianceThreshold__threshold=0.001), BernoulliNB__alpha=0.01, BernoulliNB__fit_prior=True)","BernoulliNB(input_matrix, BernoulliNB__alpha=0.01, BernoulliNB__fit_prior=True)","DecisionTreeClassifier(LinearSVC(input_matrix, LinearSVC__C=25.0, LinearSVC__dual=True, LinearSVC__loss=squared_hinge, LinearSVC__penalty=l2, LinearSVC__tol=1e-05), DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=4, DecisionTreeClassifier__min_samples_leaf=14, DecisionTreeClassifier__min_samples_split=18)","DecisionTreeClassifier(LinearSVC(input_matrix, LinearSVC__C=25.0, LinearSVC__dual=True, LinearSVC__loss=squared_hinge, LinearSVC__penalty=l2, LinearSVC__tol=1e-05), DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=8, DecisionTreeClassifier__min_samples_leaf=14, DecisionTreeClassifier__min_samples_split=18)","DecisionTreeClassifier(MaxAbsScaler(input_matrix), DecisionTreeClassifier__criterion=gini, DecisionTreeClassifier__max_depth=10, DecisionTreeClassifier__min_samples_leaf=6, DecisionTreeClassifier__min_samples_split=2)","DecisionTreeClassifier(Nystroem(input_matrix, Nystroem__gamma=0.0, Nystroem__kernel=laplacian, Nystroem__n_components=5), DecisionTreeClassifier__criterion=gini, DecisionTreeClassifier__max_depth=3, DecisionTreeClassifier__min_samples_leaf=11, 
DecisionTreeClassifier__min_samples_split=7)","DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=8, DecisionTreeClassifier__min_samples_leaf=3, DecisionTreeClassifier__min_samples_split=5)",...,"GradientBoostingClassifier(input_matrix, GradientBoostingClassifier__learning_rate=0.5, GradientBoostingClassifier__max_depth=10, GradientBoostingClassifier__max_features=0.45, GradientBoostingClassifier__min_samples_leaf=18, GradientBoostingClassifier__min_samples_split=16, GradientBoostingClassifier__n_estimators=100, GradientBoostingClassifier__subsample=0.95)","GradientBoostingClassifier(input_matrix, GradientBoostingClassifier__learning_rate=1.0, GradientBoostingClassifier__max_depth=8, GradientBoostingClassifier__max_features=0.45, GradientBoostingClassifier__min_samples_leaf=12, GradientBoostingClassifier__min_samples_split=12, GradientBoostingClassifier__n_estimators=100, GradientBoostingClassifier__subsample=0.35)","LinearSVC(BernoulliNB(input_matrix, BernoulliNB__alpha=0.001, BernoulliNB__fit_prior=True), LinearSVC__C=15.0, LinearSVC__dual=False, LinearSVC__loss=squared_hinge, LinearSVC__penalty=l1, LinearSVC__tol=1e-05)","LinearSVC(BernoulliNB(input_matrix, BernoulliNB__alpha=0.01, BernoulliNB__fit_prior=True), LinearSVC__C=15.0, LinearSVC__dual=False, LinearSVC__loss=squared_hinge, LinearSVC__penalty=l1, LinearSVC__tol=1e-05)","LinearSVC(ExtraTreesClassifier(input_matrix, ExtraTreesClassifier__bootstrap=True, ExtraTreesClassifier__criterion=entropy, ExtraTreesClassifier__max_features=0.3, ExtraTreesClassifier__min_samples_leaf=13, ExtraTreesClassifier__min_samples_split=4, ExtraTreesClassifier__n_estimators=100), LinearSVC__C=10.0, LinearSVC__dual=False, LinearSVC__loss=squared_hinge, LinearSVC__penalty=l2, LinearSVC__tol=0.01)","LinearSVC(ExtraTreesClassifier(input_matrix, ExtraTreesClassifier__bootstrap=True, ExtraTreesClassifier__criterion=entropy, ExtraTreesClassifier__max_features=0.3, 
ExtraTreesClassifier__min_samples_leaf=5, ExtraTreesClassifier__min_samples_split=4, ExtraTreesClassifier__n_estimators=100), LinearSVC__C=10.0, LinearSVC__dual=False, LinearSVC__loss=squared_hinge, LinearSVC__penalty=l2, LinearSVC__tol=0.01)","LinearSVC(input_matrix, LinearSVC__C=10.0, LinearSVC__dual=False, LinearSVC__loss=squared_hinge, LinearSVC__penalty=l2, LinearSVC__tol=0.01)","LinearSVC(input_matrix, LinearSVC__C=15.0, LinearSVC__dual=False, LinearSVC__loss=squared_hinge, LinearSVC__penalty=l1, LinearSVC__tol=0.01)","LinearSVC(input_matrix, LinearSVC__C=15.0, LinearSVC__dual=False, LinearSVC__loss=squared_hinge, LinearSVC__penalty=l1, LinearSVC__tol=1e-05)","LogisticRegression(input_matrix, LogisticRegression__C=10.0, LogisticRegression__dual=False, LogisticRegression__penalty=l2)"
crossover_count,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
generation,0,0,INVALID,INVALID,0,0,INVALID,0,0,INVALID,...,0,0,INVALID,INVALID,INVALID,0,INVALID,INVALID,0,INVALID
internal_cv_score,0.492854,0.421625,0.500776,0.494339,0.518495,0.552106,0.523355,0.505392,0.500776,0.493656,...,0.606209,0.0666809,0.614441,0.614438,0.555486,0.5567,0.5564,0.615668,0.6123,0.585104
mutation_count,0,0,1,1,0,0,1,0,0,1,...,0,0,1,1,0,0,1,0,0,1
operator_count,2,2,2,2,1,2,2,2,2,1,...,1,1,2,2,2,2,1,1,1,1
predecessor,"(ROOT,)","(ROOT,)","(BernoulliNB(PolynomialFeatures(input_matrix, ...","(BernoulliNB(input_matrix, BernoulliNB__alpha=...","(ROOT,)","(ROOT,)",(DecisionTreeClassifier(LinearSVC(input_matrix...,"(ROOT,)","(ROOT,)","(GaussianNB(input_matrix),)",...,"(ROOT,)","(ROOT,)","(LinearSVC(input_matrix, LinearSVC__C=15.0, Li...","(LinearSVC(input_matrix, LinearSVC__C=15.0, Li...","(LinearSVC(ExtraTreesClassifier(input_matrix, ...","(ROOT,)","(LinearSVC(ExtraTreesClassifier(input_matrix, ...","(LinearSVC(input_matrix, LinearSVC__C=15.0, Li...","(ROOT,)","(GaussianNB(input_matrix),)"


### Construct the accuracy Kaggle submission

In [79]:
y_pred = tpot.predict(test_data_df.values)  # Predicted labels are integers from 1 to 10
indices = np.arange(1, y_pred.shape[0] + 1)
labeled_y_pred = np.append(indices[:,None], y_pred[:,None], axis=1)
np.savetxt("data/tpot_accuracy_solution.csv", labeled_y_pred, delimiter=",", fmt="%d", 
           header="Sample_id,Sample_label",
           comments='')

### Construct the log loss Kaggle submission

In [69]:
from keras.utils import to_categorical
# Columns 0-10 (11 classes); fixing num_classes keeps 11 columns even if label 10 is never predicted
one_hot = to_categorical(y_pred, num_classes=11, dtype=int)
one_hot = one_hot[:, 1:]  # Drop the unused first column (class "0")
indices = np.arange(1, one_hot.shape[0] + 1)
labeled_one_hot = np.append(indices[:,None], one_hot, axis=1)
print(labeled_one_hot.shape)
np.savetxt("data/tpot_logloss_solution.csv", labeled_one_hot, delimiter=",", fmt="%d", 
           header="Sample_id,Class_1,Class_2,Class_3,Class_4,Class_5,Class_6,Class_7,Class_8,Class_9,Class_10",
           comments='')

(6544, 11)
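
The same submission layout can also be built without Keras. The following is a minimal sketch, assuming the labels are integers from 1 to 10 as noted in the comment above; `one_hot_from_labels` and `y_demo` are hypothetical names with made-up values.

```python
import numpy as np

N_CLASSES = 10  # assumption: labels run from 1 to 10

def one_hot_from_labels(y, n_classes=N_CLASSES):
    """Map integer labels 1..n_classes to rows of a 0/1 indicator matrix."""
    y = np.asarray(y, dtype=int)
    # Row k of the identity matrix is the indicator vector for class k + 1.
    return np.eye(n_classes, dtype=int)[y - 1]

y_demo = np.array([1, 10, 3])  # made-up predictions, not real model output
one_hot_demo = one_hot_from_labels(y_demo)

# Prepend 1-based sample ids, as in the submission file.
ids = np.arange(1, len(y_demo) + 1)
labeled_demo = np.column_stack([ids, one_hot_demo])
print(labeled_demo.shape)  # (3, 11)
```

Note that hard 0/1 rows are penalised heavily by log loss whenever the predicted class is wrong, so submitting class probabilities, where the chosen model can provide them, will usually score better than one-hot predictions.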


In [3]:
#Analysis of the input data
# ...

## 3. Methods and experiments

*- Explain your whole approach (you can include a block diagram showing the steps in your process).* 

*- Which methods/algorithms were used, and why were they chosen?*

*- Which evaluation methodology was used (cross-validation, etc.)?*
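
One possible evaluation methodology is k-fold cross-validation. The sketch below only generates the train/validation index splits; it is an illustration with a hypothetical helper `k_fold_indices`, an arbitrary sample count, and an arbitrary seed, not the procedure TPOT uses internally.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    permuted = rng.permutation(n_samples)   # shuffle once up front
    folds = np.array_split(permuted, k)     # k nearly equal-sized folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Arbitrary illustrative size (assumption), not the size of the real data set.
splits = list(k_fold_indices(n_samples=23, k=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Each sample lands in exactly one validation fold. For imbalanced classes, a stratified variant that preserves the class proportions in every fold is usually preferable.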



In [4]:
# Trials with ML algorithms

## 4. Results

*Summarize the results of the experiments without discussing their implications.*

*- Include both performance measures (accuracy and LogLoss).*

*- How does it perform on Kaggle compared to the training data?*

*- Include a confusion matrix.*
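
Both performance measures can be computed directly from predicted class probabilities. The sketch below is a minimal NumPy version under stated assumptions: labels are integers starting at 1, the probability rows sum to 1, and the clipping value `eps` is an arbitrary choice that may differ from the one Kaggle applies. All arrays are made-up examples.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: integer labels 1..K; proba: array of shape (n, K).
    """
    y_true = np.asarray(y_true, dtype=int)
    # Clip to avoid log(0) for over-confident wrong predictions.
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    true_class_proba = proba[np.arange(len(y_true)), y_true - 1]
    return float(-np.mean(np.log(true_class_proba)))

# Made-up labels and probabilities for three classes (assumption).
y_true = np.array([1, 2, 3, 1])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.4, 0.5, 0.1]])
y_pred = proba.argmax(axis=1) + 1  # most probable class, shifted to 1-based labels

print(accuracy(y_true, y_pred))
print(multiclass_log_loss(y_true, proba))
```

Accuracy only looks at the most probable class, while log loss also rewards well-calibrated probabilities, which is why the two measures can rank models differently.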



In [5]:
#Confusion matrix ...
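
A confusion matrix can be built with NumPy alone. The following is a sketch assuming integer labels from 1 to `n_classes`; `confusion_matrix` here is a hypothetical local helper, and the label arrays are made up.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes (labels 1..n_classes)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when an index pair repeats.
    np.add.at(cm, (np.asarray(y_true) - 1, np.asarray(y_pred) - 1), 1)
    return cm

# Made-up labels for three classes (assumption).
y_true = np.array([1, 1, 2, 3, 3, 3])
y_pred = np.array([1, 2, 2, 3, 1, 3])
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)
```

The diagonal holds the correctly classified samples, so `cm.trace() / cm.sum()` recovers the accuracy, and the off-diagonal entries show which classes are confused with each other.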

## 5. Discussion/Conclusions

*Interpret and explain your results.*

*- Discuss the relevance of the performance measures (accuracy and LogLoss) for
imbalanced multiclass datasets.*

*- How do the results relate to the literature?*

*- Suggestions for future research/improvement.*

*- Did the study answer your questions?*
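
When discussing the measures on an imbalanced dataset, trivial baselines are a useful reference point. The sketch below computes two of them from a label array: the accuracy of always predicting the majority class, and the log loss of always predicting the empirical class frequencies (which equals the entropy of the label distribution). The `labels` array is made up, and `baselines` is a hypothetical helper name.

```python
import numpy as np

def baselines(labels):
    """Return (majority-class accuracy, log loss of predicting class priors)."""
    _, counts = np.unique(labels, return_counts=True)
    priors = counts / counts.sum()
    majority_accuracy = float(priors.max())
    # Mean of -log(prior of the true class) over all samples.
    prior_log_loss = float(-np.sum(priors * np.log(priors)))
    return majority_accuracy, prior_log_loss

# Made-up imbalanced labels (assumption): 80% class 1, 10% each for classes 2 and 3.
labels = np.array([1] * 8 + [2] * 1 + [3] * 1)
acc, ll = baselines(labels)
print(acc, ll)
```

A model is only informative if it beats both numbers: on strongly imbalanced data, a high accuracy alone may just reflect the majority-class share.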



## 6. References

*List of all the references cited in the document*

## Appendix
*Any additional material needed to complete the report can be included here. For example, if you want to keep additional source code, additional images or plots, mathematical derivations, etc. The content should be relevant to the report and should help explain or visualize something mentioned earlier.* ***You can remove the whole Appendix section if there is no need for it.***