# Otto Group Product Classification Problem

This is a challenge from Kaggle.
The goal is to classify Otto's products into 9 categories; each product is described by a set of anonymized features. 
You can find the challenge here:  
https://www.kaggle.com/c/otto-group-product-classification-challenge

# Dataset

We have about 62,000 samples for training and 144,000 samples for testing.

Files:  
train.csv - the training set  
test.csv - the test set

Data fields:  
id - an anonymous id unique to a product  
feat_1, feat_2, ..., feat_93 - the various features of a product  
target - the class of a product

# Error Evaluation

Submissions are evaluated using the multi-class logarithmic loss.  

<img src="log-loss.png">

where N is the number of products in the test set, M is the number of class labels, log is the natural logarithm, yij is 1 if observation i is in class j and 0 otherwise, and pij is the predicted probability that observation i belongs to class j.
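
As a rough illustration of the metric described above, the following sketch computes the multi-class log loss in plain Python. The function name `multiclass_log_loss`, the clipping bound `eps` and the row renormalisation are assumptions made for this example; they are not taken from the challenge code.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of true class indices (0-based), one per observation.
    probs:  list of rows, each holding one predicted probability per class.
    eps:    clipping bound that keeps log() away from zero (assumed value).
    """
    total = 0.0
    for true_class, row in zip(y_true, probs):
        # Clip every probability, then renormalise so the row still sums to 1.
        clipped = [min(max(p, eps), 1 - eps) for p in row]
        row_sum = sum(clipped)
        # y_ij is 1 only for the true class, so only that term of the inner sum survives.
        total += math.log(clipped[true_class] / row_sum)
    return -total / len(y_true)

# A uniform guess over 9 classes scores ln(9), whatever the true labels are.
uniform = [[1 / 9] * 9 for _ in range(3)]
uniform_loss = multiclass_log_loss([0, 4, 8], uniform)
```

Since ln(9) is about 2.197, any useful model for this problem should score well below that value.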

# Solution

First, these are the libraries we will use: Pandas, NumPy, SciPy, Scikit-learn and XGBoost.  
Thanks to Anaconda (a Python distribution and package manager), all the needed libraries are included except XGBoost.  
To install XGBoost: https://xgboost.readthedocs.org/en/latest/


In [4]:
import numpy as np
import pandas as pd
from sklearn import ensemble
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
# grid_search and cross_validation exist only in older scikit-learn releases (removed in 0.20)
from sklearn import grid_search
from sklearn import cross_validation
import xgboost as xgb
from sklearn import neighbors
from sklearn import cluster
from sklearn.metrics import pairwise
from sklearn.externals import joblib
from sklearn import feature_selection
from sklearn import decomposition


### Loading training and testing data
Using the pandas library, we read the training file into a DataFrame called trainData,  
then separate it into two parts: one with the features and one with the correct output. 

In [5]:
trainData = pd.read_csv("train.csv")
correctClass = trainData['target']
classes = np.unique(correctClass)
trainFeatures = trainData.drop(['id', 'target'], axis=1)

Load the testing data.

In [6]:
testData = pd.read_csv("test.csv")
testFeatures = testData.drop(['id'], axis=1)
testIDs = testData['id']


Print some information about the training data.

In [7]:
print(trainFeatures.describe())

            feat_1        feat_2        feat_3        feat_4        feat_5  \
count  61878.00000  61878.000000  61878.000000  61878.000000  61878.000000   
mean       0.38668      0.263066      0.901467      0.779081      0.071043   
std        1.52533      1.252073      2.934818      2.788005      0.438902   
min        0.00000      0.000000      0.000000      0.000000      0.000000   
25%        0.00000      0.000000      0.000000      0.000000      0.000000   
50%        0.00000      0.000000      0.000000      0.000000      0.000000   
75%        0.00000      0.000000      0.000000      0.000000      0.000000   
max       61.00000     51.000000     64.000000     70.000000     19.000000   

             feat_6        feat_7        feat_8        feat_9       feat_10  \
count  61878.000000  61878.000000  61878.000000  61878.000000  61878.000000   
mean       0.025696      0.193704      0.662433      1.011296      0.263906   
std        0.215333      1.030102      2.255770      3.47482

Make sure that there are no missing values in the training/testing features.

In [11]:
# Count the rows that have fewer than 93 non-missing feature values.
print(sum(trainFeatures.count(axis=1) < 93))
print(sum(testFeatures.count(axis=1) < 93))


0
0
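
The same check can also be written with `isnull`, which does not depend on the number of columns. This is a small sketch on a made-up frame; the name `toy` and its values are hypothetical and only serve to show what a non-zero result looks like.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with exactly one missing entry.
toy = pd.DataFrame({"feat_1": [0, 2, 1], "feat_2": [3, np.nan, 0]})

missing_per_column = toy.isnull().sum()                   # NaN count per column
total_missing = int(missing_per_column.sum())             # NaN count in the whole frame
rows_with_missing = int(toy.isnull().any(axis=1).sum())   # rows with at least one NaN
```

On the real data, both `total_missing` and `rows_with_missing` would be expected to be 0, matching the output above.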


### Using XGBoost classifier

We will use the scikit-learn wrapper of XGBoost (installed with XGBoost).  
- Set the objective parameter to 'multi:softprob' because our problem is multiclass classification and we want to output a probability for each class.  
- Use cross-validation with 4 folds.  

First, try without parameter tuning.  

In [None]:
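XGBClassifier = xgb.XGBClassifier(objective='multi:softprob')
# 4-fold stratified cross-validation of the untuned classifier
crossValidator = cross_validation.StratifiedKFold(correctClass, n_folds=4)
scores = cross_validation.cross_val_score(XGBClassifier, trainFeatures, y=correctClass, cv=crossValidator, scoring='log_loss', verbose=10)
# The 'log_loss' scorer returns the negated loss, so flip the sign to report the error.
print('crossValidation error = ', -scores.mean())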


crossValidation error = 0.656282936478
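
To make the idea of a stratified split concrete, here is a minimal sketch of how sample indices could be dealt into 4 folds while keeping the class proportions of the full data. The helper `stratified_folds` is hypothetical and uses a simple round-robin assignment; it is not scikit-learn's implementation.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=4):
    """Assign each sample index to a fold so that every fold keeps
    roughly the same class proportions as the full label list."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        # Deal the indices of one class round-robin across the folds.
        for position, index in enumerate(by_class[label]):
            folds[position % n_folds].append(index)
    return folds

# Toy labels: 8 samples of one class and 4 of another (a 2:1 ratio).
labels = ["Class_1"] * 8 + ["Class_2"] * 4
folds = stratified_folds(labels, n_folds=4)
# Each fold serves once as the validation set; the other three form the training set.
```

In this toy case every fold holds two `Class_1` samples and one `Class_2` sample, so each validation set has the same 2:1 ratio as the whole list.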

### Parameter Tuning Using Grid Search
Use grid search to tune the parameters according to the best 'log_loss'.

In [None]:
XGBClassifier = xgb.XGBClassifier(objective='multi:softprob')
parameters = {
    'learning_rate': [0.1, 0.3],
    'n_estimators': [200, 400, 600],
    'max_depth': [9, 12],
    'gamma': [1],
    'subsample': [0.5],
    'colsample_bytree': [1],
}

classifier = grid_search.GridSearchCV(XGBClassifier, parameters, scoring='log_loss', cv=4)
classifier.fit(trainFeatures, correctClass)

# The 'log_loss' scorer is negated (higher is better), so the maximum is the lowest loss.
best_parameters, score, _ = max(classifier.grid_scores_, key=lambda x: x[1])
print('best cv score is ', -score)
print('best parameters are')
for param_name in sorted(best_parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
     

best cv score is  0.466627075491  
best parameters are  
colsample_bytree: 1  
gamma: 1  
learning_rate: 0.1  
max_depth: 9  
n_estimators: 400  
subsample: 0.5  
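
The cost of the search above follows directly from the grid: every combination of one value per parameter is a candidate, and each candidate is fitted once per fold. The sketch below enumerates the combinations with `itertools.product`; the names `candidates`, `n_candidates` and `n_fits` are introduced only for this illustration.

```python
from itertools import product

parameters = {
    'learning_rate': [0.1, 0.3],
    'n_estimators': [200, 400, 600],
    'max_depth': [9, 12],
    'gamma': [1],
    'subsample': [0.5],
    'colsample_bytree': [1],
}

names = sorted(parameters)
# Build one dict per combination of parameter values.
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

n_candidates = len(candidates)   # 2 * 3 * 2 * 1 * 1 * 1 combinations
n_fits = n_candidates * 4        # one fit per candidate per fold (cv=4)
```

This gives 12 candidates and 48 cross-validation fits, which is why adding values to several parameters at once makes the search slow very quickly.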



### Adding new features

Feature 1: the number of non-zero values in the feature vector.  
Feature 2: the sum of all features.  

In [None]:
# Compute both new columns from the original features before adding either one.
sumColumn = trainFeatures.sum(axis=1)
nonZerosColumn = (trainFeatures != 0).sum(axis=1)
trainFeatures['sum'] = sumColumn
trainFeatures['nonZero'] = nonZerosColumn

sumColumnT = testFeatures.sum(axis=1)
nonZerosColumnT = (testFeatures != 0).sum(axis=1)
testFeatures['sum'] = sumColumnT
testFeatures['nonZero'] = nonZerosColumnT
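
To see what the two new columns contain, here is a small sketch on a made-up frame with three features. The frame `toy` and its values are hypothetical; only the column names `sum` and `nonZero` come from the cell above.

```python
import pandas as pd

toy = pd.DataFrame({"feat_1": [0, 2, 0], "feat_2": [3, 0, 0], "feat_3": [1, 5, 0]})

# Compute both columns from the original features before attaching either,
# so that 'sum' is not itself counted as a non-zero feature.
row_sum = toy.sum(axis=1)
row_non_zero = (toy != 0).sum(axis=1)
toy["sum"] = row_sum
toy["nonZero"] = row_non_zero
```

The first row (0, 3, 1) gets a sum of 4 and 2 non-zero features, while the all-zero last row gets 0 for both.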


Running the grid search again with the new features gives:  

best cv score is  0.459499757896  
best parameters are  
colsample_bytree: 0.8  
colsample_bylevel: 0.8  
min_child_weight: 0.8  
gamma: 1  
learning_rate: 0.1  
max_depth: 9  
n_estimators: 400  
subsample: 0.8  
 

### Final Model

The model is now performing well. The next step is to train two models with the best parameters but different seeds, then average their predictions.


In [None]:
XGBClassifier1 = xgb.XGBClassifier(objective='multi:softprob', n_estimators=400, max_depth=9, learning_rate=0.1, subsample=0.8,
                                   colsample_bytree=0.8, colsample_bylevel=0.8, min_child_weight=0.8, gamma=1, seed=23)
XGBClassifier2 = xgb.XGBClassifier(objective='multi:softprob', n_estimators=400, max_depth=9, learning_rate=0.1, subsample=0.8,
                                   colsample_bytree=0.8, colsample_bylevel=0.8, min_child_weight=0.8, gamma=1, seed=100)

# Soft voting averages the predicted class probabilities of the two models with equal weights.
classifier = ensemble.VotingClassifier([('clf1', XGBClassifier1), ('clf2', XGBClassifier2)], voting='soft', weights=[1, 1])
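
Soft voting with equal weights amounts to averaging the class probabilities of the two models, sample by sample. The sketch below shows that arithmetic in plain Python; the helper `soft_vote` and the probability values are made up for illustration and are not part of scikit-learn.

```python
def soft_vote(prob_sets, weights):
    """Weighted average of class probabilities from several models.

    prob_sets: one list of rows per model; each row holds class probabilities.
    weights:   one non-negative weight per model.
    """
    total_weight = float(sum(weights))
    averaged = []
    for rows in zip(*prob_sets):            # the same sample across all models
        averaged.append([
            sum(w * p for w, p in zip(weights, column)) / total_weight
            for column in zip(*rows)        # the same class across all models
        ])
    return averaged

# Hypothetical predictions of two models for two samples and three classes.
model_1 = [[0.75, 0.25, 0.0], [0.25, 0.5, 0.25]]
model_2 = [[0.25, 0.25, 0.5], [0.25, 0.25, 0.5]]
blended = soft_vote([model_1, model_2], weights=[1, 1])
```

Each blended row still sums to 1, so the result can be used directly as a probability prediction.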


This ensemble results in a cross-validation error of 0.459256230407, which is better than the previous one.  
Let's train it on the full training data and make a submission on Kaggle.

In [None]:
classifier.fit(trainFeatures, correctClass)
testingPreds = classifier.predict_proba(testFeatures)
# One probability column per class, in the sorted order returned by np.unique.
submission = pd.DataFrame(data=testingPreds, columns=classes)
submission.insert(0, 'id', testIDs)
submission.to_csv("XGBoost.csv", index=False)
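
Before uploading, it can be worth checking the file's shape. The sketch below assumes the submission has an `id` column followed by `Class_1` to `Class_9`, and that each row of probabilities should sum to 1; the helper `check_submission` and the sample text are hypothetical.

```python
import csv
import io

def check_submission(csv_text, n_classes=9, tolerance=1e-6):
    """Return the number of data rows after verifying the header and that
    every row's class probabilities sum to 1 (within tolerance)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    expected = ["id"] + ["Class_%d" % k for k in range(1, n_classes + 1)]
    if header != expected:
        raise ValueError("unexpected header: %r" % header)
    n_rows = 0
    for row in reader:
        probabilities = [float(value) for value in row[1:]]
        if abs(sum(probabilities) - 1.0) > tolerance:
            raise ValueError("row %s does not sum to 1" % row[0])
        n_rows += 1
    return n_rows

# Tiny three-class sample, only to exercise the checks.
sample = "id,Class_1,Class_2,Class_3\n1,0.5,0.25,0.25\n2,0.0,1.0,0.0\n"
n_checked = check_submission(sample, n_classes=3)
```

For the real file, the text could be read with `open("XGBoost.csv").read()` and checked with the default `n_classes=9`.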

#### Kaggle score = 0.43902 and rank 478.