<a href="https://colab.research.google.com/github/yandexdataschool/MLatImperial2019/blob/master/03_lab/baseline_kaggle_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import numpy as np
import pickle
import os
import matplotlib.pyplot as plt
%matplotlib inline
import scipy
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
import h5py
import pandas as pd

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
DATA_PATH = "/content/gdrive/My Drive/kaggle_1"

# !Link to challenge!

# https://www.kaggle.com/t/590bfa6b77b54fdeb2a28b9cf43c3b16

### Metric

For binary classification with a true label y $\in \{0,1\}$ and a probability estimate p = $\operatorname{Pr}(y = 1)$, the log loss per sample is the negative log-likelihood of the classifier given the true label:
$$
L_{\log}(y, p) = -\log \operatorname{Pr}(y|p) = -(y \log (p) + (1 - y) \log (1 - p))
$$

This extends to the multiclass case as follows. Let the true labels for a set of samples be encoded as a 1-of-K binary indicator matrix Y, i.e., $y_{i,k} = 1$ if sample i has label k taken from a set of K labels. Let P be a matrix of probability estimates, with $p_{i,k} = \operatorname{Pr}(t_{i,k} = 1)$. Then the log loss of the whole set is

$$
L_{\log}(Y, P) = -\log \operatorname{Pr}(Y|P) = - \frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log p_{i,k}
$$

# Grading

Your task is to try as many techniques that you have learned this week as possible.

The outcome of your work should be a small table with results, i.e 
Method - parameters tuned with CV - score + features created on top of exiting ones. The table should be accompanied by a small report of your workflow and reasoning. Also, you need to send the code.

The archive with the files should be sent to mlaticl2019@yandex.ru with the topic: **Surname_name_kaggle_1**

The total amount of points is 10. You will get additional points based on your final ranking.

** 1 Point **

Try different linear and metric methods. Do all of them work? Why?

** 1 Point **

Search parameters for linear methods.

** 1 Point **

Try using PCA or SVD. Is it helpful? Why?

** 1-2 Points **

Create you own features, st they improve the score.

** 2 Points **

Try decision trees, forests, boosting. Grid search parameters

** 1 Point **

Use the discussed techniques to estimate feature importances. Try varying the features used.

** 1 Point **

Explain, which metric you used as target metric. Try using other metrics (for example, MSE). Is it helpful? Why?

** 1 Point **

Use stacking and blending of the models trained above? Does it improve your score?

** Bonus **

Beat medium baseline - + 2 bonus points.



## About

In this notebook we prepare a simple solution.

In [0]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

### Read training and test files

In [0]:
data = pd.read_csv(os.path.join(DATA_PATH, 'training.csv.zip'))
test = pd.read_csv(os.path.join(DATA_PATH, 'test.csv.zip'))

In [0]:
data.shape, test.shape

((1200000, 50), (1200000, 50))

In [0]:
data.head()

Unnamed: 0,TrackP,TrackNDoFSubdetector2,BremDLLbeElectron,MuonLooseFlag,FlagSpd,SpdE,EcalDLLbeElectron,DLLmuon,RICHpFlagElectron,EcalDLLbeMuon,...,TrackNDoF,RICHpFlagMuon,RICH_DLLbeKaon,RICH_DLLbeElectron,HcalE,MuonFlag,FlagMuon,PrsE,RICH_DLLbeMuon,RICH_DLLbeProton
0,74791.156263,15.0,0.232275,1.0,1.0,3.2,-2.505719,6.604153,1.0,1.92996,...,28.0,1.0,-7.2133,-0.2802,5586.589846,1.0,1.0,10.422315,-2.081143e-07,-24.8244
1,2738.489989,15.0,-0.357748,0.0,1.0,3.2,1.864351,0.263651,1.0,-2.061959,...,32.0,1.0,-0.324317,1.707283,-7e-06,0.0,1.0,43.334935,2.771583,-0.648017
2,2161.409908,17.0,-999.0,0.0,0.0,-999.0,-999.0,-999.0,0.0,-999.0,...,27.0,0.0,-999.0,-999.0,-999.0,0.0,0.0,-999.0,-999.0,-999.0
3,15277.73049,20.0,-0.638984,0.0,1.0,3.2,-2.533918,-8.724949,1.0,-3.253981,...,36.0,1.0,-35.202221,-14.742319,4482.803707,0.0,1.0,2.194175,-3.070819,-29.291519
4,7563.700195,19.0,-0.638962,0.0,1.0,3.2,-2.087146,-7.060422,1.0,-0.995816,...,33.0,1.0,25.084287,-10.272412,5107.55468,0.0,1.0,1.5e-05,-5.373712,23.653087


In [0]:
test.head()

Unnamed: 0,TrackP,TrackNDoFSubdetector2,BremDLLbeElectron,MuonLooseFlag,FlagSpd,SpdE,EcalDLLbeElectron,DLLmuon,RICHpFlagElectron,EcalDLLbeMuon,...,RICHpFlagMuon,RICH_DLLbeKaon,RICH_DLLbeElectron,HcalE,MuonFlag,FlagMuon,PrsE,RICH_DLLbeMuon,RICH_DLLbeProton,ID
0,55086.199233,18.0,-0.438763,0.0,1.0,3.2,-1.843821,-4.579244,1.0,-1.732886,...,1.0,18.674086,-1.355015,24510.990244,0.0,1.0,9.325265,-0.250015,35.408585,0
1,3393.820071,17.0,-0.554341,0.0,1.0,0.0,-0.883237,-6.203035,1.0,-0.097206,...,1.0,16.536804,-17.601196,778.675303,0.0,1.0,-6e-06,-6.646096,14.011904,1
2,18341.359361,12.0,-0.554339,0.0,1.0,0.0,-2.653786,-3.922639,1.0,0.936484,...,1.0,-1.306109,-4.536409,7915.21242,0.0,1.0,1.371346,-2.132609,-5.617409,2
3,27486.710933,7.0,-0.492411,1.0,1.0,3.2,-999.0,2.034453,1.0,-999.0,...,1.0,-4.222793,3.149207,-999.0,1.0,1.0,61.985428,0.946207,-8.657193,3
4,6842.249996,16.0,0.098706,0.0,1.0,3.2,2.644499,-1.471364,1.0,-2.90947,...,1.0,-3.425113,23.147387,-1.3e-05,0.0,1.0,2.468453,2.614987,-5.713513,4


### Look at the labels set

In [0]:
set(data.Label)

{'Electron', 'Ghost', 'Kaon', 'Muon', 'Pion', 'Proton'}

### Define training features

Exclude `Label` from the features set

In [0]:
features = list(set(data.columns) - {'Label'})
features

['GhostProbability',
 'RICHpFlagPion',
 'EcalE',
 'RICHpFlagMuon',
 'EcalShowerLongitudinalParameter',
 'RICH_DLLbeBCK',
 'RICHpFlagElectron',
 'RICHpFlagProton',
 'PrsE',
 'HcalDLLbeMuon',
 'FlagSpd',
 'MuonLLbeMuon',
 'TrackQualitySubdetector1',
 'Calo2dFitQuality',
 'DLLelectron',
 'TrackQualitySubdetector2',
 'DLLproton',
 'FlagHcal',
 'FlagBrem',
 'RICH_DLLbeProton',
 'FlagPrs',
 'BremDLLbeElectron',
 'HcalE',
 'RICH_DLLbeKaon',
 'HcalDLLbeElectron',
 'FlagRICH2',
 'TrackNDoFSubdetector2',
 'MuonLooseFlag',
 'MuonLLbeBCK',
 'TrackNDoF',
 'TrackPt',
 'TrackDistanceToZ',
 'Calo3dFitQuality',
 'TrackNDoFSubdetector1',
 'TrackP',
 'DLLmuon',
 'PrsDLLbeElectron',
 'DLLkaon',
 'FlagEcal',
 'RICHpFlagKaon',
 'SpdE',
 'EcalDLLbeElectron',
 'EcalDLLbeMuon',
 'FlagMuon',
 'RICH_DLLbeElectron',
 'MuonFlag',
 'RICH_DLLbeMuon',
 'TrackQualityPerNDoF',
 'FlagRICH1']

### Divide training data into 2 parts

In [0]:
training_data, validation_data = train_test_split(data, random_state=11, train_size=0.10)



In [0]:
len(training_data), len(validation_data)

(120000, 1080000)

### Simple logistic regression forest from `sklearn` training

train multiclassification model

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

In [0]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(training_data[features])

In [0]:
%%time

clf = LogisticRegression(penalty='l2', n_jobs=-1, solver='saga', multi_class='multinomial', random_state=42)
param_grid = {'C': [0.1, 1]}

gscv = GridSearchCV(clf, param_grid, scoring='neg_log_loss', cv=3, n_jobs=-1, verbose=1)
gscv.fit(X_train, training_data.Label)

Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:  1.8min finished


CPU times: user 38.8 s, sys: 187 ms, total: 39 s
Wall time: 2min 28s




In [0]:
gscv.cv_results_



{'mean_fit_time': array([34.65780274, 35.07985179]),
 'mean_score_time': array([0.41966661, 0.36462768]),
 'mean_test_score': array([-0.87774634, -0.87361161]),
 'mean_train_score': array([-0.87553757, -0.87138096]),
 'param_C': masked_array(data=[0.1, 1],
              mask=[False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 0.1}, {'C': 1}],
 'rank_test_score': array([2, 1], dtype=int32),
 'split0_test_score': array([-0.90848259, -0.90595279]),
 'split0_train_score': array([-0.90506437, -0.90247521]),
 'split1_test_score': array([-0.81828495, -0.81098105]),
 'split1_train_score': array([-0.81632204, -0.80901035]),
 'split2_test_score': array([-0.90647142, -0.90390092]),
 'split2_train_score': array([-0.9052263 , -0.90265731]),
 'std_fit_time': array([0.11267883, 0.18753866]),
 'std_score_time': array([0.05861189, 0.01642174]),
 'std_test_score': array([0.04205356, 0.04429441]),
 'std_train_score': array([0.04187175, 0.04410274])}

Train best model:

In [0]:
%%time
c = 1
clf = LogisticRegression(penalty='l2', C=c, n_jobs=-1, solver='saga', multi_class='multinomial')
clf.fit(X_train, training_data.Label)

CPU times: user 37.3 s, sys: 30.5 ms, total: 37.3 s
Wall time: 37.4 s




### Evaluate predictions on the validation sample

In [0]:
# predict each track
X_val = scaler.fit_transform(validation_data[features])
proba = clf.predict_proba(X_val)

### Log loss on the cross validation sample

In [0]:
log_loss(validation_data.Label, proba)

0.8708060229871727

## Prepare submission to kaggle

In [0]:
# predict test sample
X_test = scaler.fit_transform(test[features])
kaggle_proba = clf.predict_proba(X_test)
kaggle_ids = test.ID

In [0]:
from IPython.display import FileLink

def create_solution(ids, proba, names, filename='baseline.csv'):
    """saves predictions to file and provides a link for downloading """
    solution = pd.DataFrame({'ID': ids})
    
    for name in ['Ghost', 'Electron', 'Muon', 'Pion', 'Kaon', 'Proton']:
        solution[name] = proba[:, np.where(names == name)[0]]
    
    solution.to_csv(os.path.join(DATA_PATH, '{}'.format(filename)), index=False)
    return FileLink(os.path.join(DATA_PATH, '{}'.format(filename)))
    
create_solution(kaggle_ids, kaggle_proba, clf.classes_)

# Lets use kaggle API again to submit results

In [0]:
!mkdir /content/.kaggle
!cp /content/gdrive/My\ Drive/kaggle.json /content/.kaggle/
!chmod 600 /content/.kaggle/kaggle.json

mkdir: cannot create directory ‘/content/.kaggle’: File exists


In [0]:
%env KAGGLE_CONFIG_DIR=/content/.kaggle

env: KAGGLE_CONFIG_DIR=/content/.kaggle


In [0]:
!kaggle competitions submit -c icl2019-pid -f "{DATA_PATH}/baseline.csv" -m "Message"

100% 147M/147M [00:04<00:00, 36.1MB/s]
Successfully submitted to PID(ICL2018)