In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

import evaluate
import merge
import load



In [2]:
# Load evaluation data
test_columns = ['returnQuantity', 'articleID', 'productGroup', 'customerID', 'voucherID']
test_predictions = merge.merged_predictions(test=True, keep_columns=test_columns)
test_train = evaluate.test_complement(test_predictions)

# Load classification data
class_columns = ['articleID', 'productGroup', 'customerID', 'voucherID']
class_predictions = merge.merged_predictions(keep_columns=class_columns)
class_train = load.orders_train()

In [26]:
# Impute zeroes and convert confidences to std-distances
class_imputed = merge.impute_confidence(class_predictions)
test_imputed = merge.impute_confidence(test_predictions)

## Approach 4: Boost results using another classifier
Because the confidence values of the individual classifiers are not directly comparable, and because imputing missing confidence values is probably problematic, it is worth trying an approach in the spirit of boosting. We look at the results the individual classifiers gave and learn to vote for the one that is most likely correct in a given case. Posed as a machine learning problem, the feature vector is
$\boldsymbol{t}_k=(pred_k^A, pred_k^B, pred_k^C, conf_k^A, conf_k^B, conf_k^C, art_k, cust_k, voucher_k, prod_k)$, where the last four entries are binary and indicate whether the respective category value is known before evaluation. The target is again a binary class label, *returned* or *not returned*. Thus, we have $y=returned, y \in \{0,1\}$ as possible labels.
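
To make the feature layout concrete, here is a minimal sketch of how one row $\boldsymbol{t}_k$ could be assembled. The function name `feature_vector` and the dictionary layout are hypothetical and chosen only for illustration; the notebook's own `merge.boosting_features` may build the features differently.

```python
def feature_vector(preds, confs, known):
    """Assemble one meta-feature row t_k (hypothetical helper).

    preds: dict classifier name -> binary prediction (0 or 1)
    confs: dict classifier name -> confidence value
    known: dict category name -> True if the value was seen before evaluation
    """
    classifiers = ['A', 'B', 'C']
    categories = ['articleID', 'customerID', 'voucherID', 'productGroup']
    row = [preds[c] for c in classifiers]            # pred_k^A, pred_k^B, pred_k^C
    row += [confs[c] for c in classifiers]           # conf_k^A, conf_k^B, conf_k^C
    row += [int(known[cat]) for cat in categories]   # art_k, cust_k, voucher_k, prod_k
    return row


# Illustrative values only, not taken from the data set
t_k = feature_vector(
    preds={'A': 1, 'B': 0, 'C': 0},
    confs={'A': 0.8, 'B': -0.3, 'C': 1.2},
    known={'articleID': True, 'customerID': False,
           'voucherID': True, 'productGroup': True},
)
print(t_k)  # [1, 0, 0, 0.8, -0.3, 1.2, 1, 0, 1, 1]
```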

We already know from section **Highest possible precision** that the classifiers disagree in around 21% of the rows. All other rows are not interesting for this problem, so we neither touch them nor learn anything from them.

### Baseline - Only using the most precise classifier
Is there one result which we can always use to resolve disagreement? We have to look at different performances when classifiers disagree.

In [10]:
agree_mask = ((test_predictions.prediction.A == test_predictions.prediction.B) & 
              (test_predictions.prediction.A == test_predictions.prediction.C))
baseline = merge.precision(test_predictions.original.returnQuantity[agree_mask],
                           test_predictions.prediction.B[agree_mask])
baseline_weight = len(test_predictions[agree_mask])
baseline

0.69993014003494847

If the classifiers agree, the precision is close to 70%, which is extremely good in comparison to the single results. This shows that the roughly 21% of rows with disagreement make a huge difference in classification error.
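
The split into agreed and disagreed rows can be illustrated on a toy example in plain Python. All values below are made up, and the helpers `accuracy` and `select` are hypothetical stand-ins; the notebook itself uses pandas masks and `merge.precision`.

```python
def accuracy(y_true, y_pred):
    """Fraction of rows where the prediction equals the label."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def select(values, mask):
    """Keep only the entries whose mask value is True."""
    return [v for v, m in zip(values, mask) if m]


# Toy labels and predictions of three classifiers (illustrative only)
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
pred_a = [0, 1, 0, 0, 1, 1, 0, 1]
pred_b = [0, 1, 1, 0, 1, 0, 0, 0]
pred_c = [0, 1, 1, 0, 0, 0, 0, 1]

# A row counts as agreed only if all three classifiers give the same label
agree = [a == b == c for a, b, c in zip(pred_a, pred_b, pred_c)]
disagree = [not m for m in agree]

share_disagree = sum(disagree) / len(disagree)
agree_acc = accuracy(select(y_true, agree), select(pred_a, agree))
disagree_acc = {
    name: accuracy(select(y_true, disagree), select(pred, disagree))
    for name, pred in [('A', pred_a), ('B', pred_b), ('C', pred_c)]
}

print(share_disagree)  # 0.5
print(agree_acc)       # 0.75
print(disagree_acc)    # {'A': 0.5, 'B': 0.75, 'C': 0.75}
```

On agreed rows it does not matter which classifier's column is scored, since all three carry the same label there.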

In [11]:
disagree_mask = ((test_predictions.prediction.A != test_predictions.prediction.B) | 
                 (test_predictions.prediction.A != test_predictions.prediction.C))
y_labels = test_predictions.original.returnQuantity[disagree_mask]

#### A, B, C on disagreement (baseline for boosting on disagreed rows)

In [12]:
a = merge.precision(y_labels, test_predictions.prediction.A[disagree_mask])
b = merge.precision(y_labels, test_predictions.prediction.B[disagree_mask])
c = merge.precision(y_labels, test_predictions.prediction.C[disagree_mask])
a, b, c

(0.45814186362326992, 0.54310640457606896, 0.55259187116774799)

### Optimizing accuracy using DecisionTree and SVM with k-fold parameter optimization and k-fold cross validation for final estimation

In [6]:
from sklearn import preprocessing
from sklearn import svm
from sklearn import grid_search
from sklearn import cross_validation
from sklearn import pipeline
from sklearn import tree
from sklearn import metrics
from sklearn import ensemble
from operator import itemgetter

In [10]:
categories = ['articleID', 'productGroup', 'customerID', 'voucherID']
X, y = merge.boosting_features(test_train, test_imputed, categories)

### SVM

In [12]:
steps = [('scaling', preprocessing.StandardScaler()),
         ('svm', svm.SVC())]
clf = pipeline.Pipeline(steps)
svc_params = [
    dict(svm__C=[0.5, 1.0, 5.0], svm__kernel=['poly'], svm__gamma=[0.1, 0.01, 0.5], svm__degree=[1, 2, 3, 4]),
    dict(svm__C=[0.5, 1.0, 5.0], svm__kernel=['rbf'], svm__gamma=[0.1, 0.01, 0.5])]
gs1 = grid_search.RandomizedSearchCV(clf, svc_params[0], n_jobs=-1, n_iter=36)
gs1.fit(X[:2500], y[:2500])
x1 = gs1.score(X, y)
gs2 = grid_search.RandomizedSearchCV(clf, svc_params[1], n_jobs=-1, n_iter=9)
gs2.fit(X[:2500], y[:2500])
x2 = gs2.score(X, y)
x1, x2

(0.55259187116774799, 0.55260558840070784)

In [None]:
sorted(gs1.grid_scores_, key=itemgetter(1), reverse=True)[:10]

In [None]:
sorted(gs2.grid_scores_, key=itemgetter(1), reverse=True)[:10]

The parameters that stood out with a good mean and low variance across folds are the ones defined in the following pipeline:

In [13]:
clf = pipeline.Pipeline(steps=[
        ('scaling', preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)), 
        ('svm', svm.SVC(C=5.0, cache_size=200, class_weight=None, coef0=0.0,
                    decision_function_shape=None, degree=3, gamma=0.5, kernel='poly',
                    max_iter=-1, probability=False, random_state=None, shrinking=True,
                    tol=0.001, verbose=False))])
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size=0.07)
clf.fit(X_train, y_train)
merge.precision(clf.predict(X_test), y_test)

0.56017876633529018
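
Choosing parameters with "good mean and low variance per fold" can be turned into an explicit rule. The sketch below ranks candidates by mean minus standard deviation. This rule and the name `robust_score` are assumptions made for illustration, not the selection procedure used in this notebook, and the scores are made-up values rather than results from the searches above.

```python
# Made-up cross-validation summaries (illustrative, not notebook results)
candidates = [
    {'mean': 0.560, 'std': 0.012, 'params': {'C': 5.0, 'kernel': 'rbf'}},
    {'mean': 0.558, 'std': 0.002, 'params': {'C': 5.0, 'kernel': 'poly'}},
    {'mean': 0.551, 'std': 0.001, 'params': {'C': 0.5, 'kernel': 'poly'}},
]


def robust_score(candidate, penalty=1.0):
    """Mean fold score minus a penalty on the spread across folds."""
    return candidate['mean'] - penalty * candidate['std']


best = max(candidates, key=robust_score)
print(best['params'])  # {'C': 5.0, 'kernel': 'poly'}
```

With `penalty=0` the rule falls back to picking the highest mean; larger penalties favour configurations whose folds score more consistently.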

### Decision Tree

In [14]:
clfT = tree.DecisionTreeClassifier()
tree_params = dict(criterion=['gini', 'entropy'], max_features=[2, 4, 8, 10], 
                   max_depth=[2, 4, 8, 100], min_samples_split=[2, 4, 8, 30, 100],
                   min_samples_leaf=[1, 4, 32, 100])
gsT = grid_search.RandomizedSearchCV(clfT, tree_params, n_iter=200)
gsT.fit(X[:40000], y[:40000])
gsT.score(X, y)

0.55975226677274659

In [None]:
sorted(gsT.grid_scores_, key=itemgetter(1), reverse=True)[:10]

Parameters selected to guard against overfitting:
```
'criterion': 'gini', 'max_depth': 4, 'max_features': 8
```

In [15]:
clfTg = tree.DecisionTreeClassifier(criterion='gini', max_depth=4, max_features=8, min_samples_leaf=100)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size=0.7)
clfTg.fit(X_train, y_train)
merge.precision(clfTg.predict(X_test), y_test)

0.56013808554902722

#### Check that we are not overfitting

In [16]:
clfTg.fit(X, y)
merge.precision(clfTg.predict(X), y)

0.56200875159462838

### Random Forest

In [None]:
clfF = ensemble.RandomForestClassifier(n_jobs=-1)
tree_params = dict(criterion=['gini', 'entropy'], max_features=[2, 4, 'auto'], 
                   max_depth=[2, 4, 8, 100], min_samples_split=[2, 4, 8, 30, 100],
                   min_samples_leaf=[1, 4, 32, 100])
gsF = grid_search.RandomizedSearchCV(clfF, tree_params, n_iter=175)
gsF.fit(X[:40000], y[:40000])
gsF.score(X[40000:], y[40000:])

In [None]:
sorted(gsF.grid_scores_, key=itemgetter(1), reverse=True)[:10]

Optimized parameters for Random Forest using k-fold:
```
'min_samples_leaf': 32, 'min_samples_split': 2, 'criterion': 'gini', 'max_depth': 8, 'max_features': 4
```

In [119]:
clfF = ensemble.RandomForestClassifier(n_jobs=-1, min_samples_leaf=32, min_samples_split=2, criterion='gini',
                                       max_depth=8, max_features=4)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size=0.5)
clfF.fit(X_train, y_train)
merge.precision(clfF.predict(X_test), y_test)

0.56722130011933991

#### Again, no overfitting

In [19]:
clfF.fit(X, y)
merge.precision(clfF.predict(X), y)

0.5800263370872828

## Best performer: Random Forest
The random forest should now be trained on test_train and evaluated on test_predictions. After this evaluation and another cross-validation on test_train, the parameters can be used to predict the final binary labels on class_predictions after learning on class_train.
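
Cross-validation splits the rows into k folds and uses each fold once for evaluation while training on the rest. A minimal sketch of such a split in plain Python follows. `k_fold_indices` is a hypothetical helper written for illustration; the notebook itself relies on the cross-validation built into scikit-learn's `RandomizedSearchCV`.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Each row index appears in exactly one evaluation fold, so averaging the k scores uses every row once for testing. This sketch does not shuffle; if the rows are ordered in a meaningful way, shuffling the indices first would be advisable.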

In [15]:
clfF = ensemble.RandomForestClassifier(n_jobs=-1)
tree_params = dict(criterion=['gini'], max_features=[4, 'auto'], 
                   max_depth=[4, 8, 100], min_samples_split=[2, 8, 30, 100],
                   min_samples_leaf=[4, 32])
gsF = grid_search.RandomizedSearchCV(clfF, tree_params, n_iter=48, n_jobs=-1)
gsF.fit(X[75000:], y[75000:])
gsF.score(X[:75000], y[:75000]), merge.precision(gsF.predict(X[:75000]), y[:75000])

(0.56942666666666664, 0.56942666666666664)

In [16]:
sorted(gsF.grid_scores_, key=itemgetter(1), reverse=True)[:10]

[mean: 0.56590, std: 0.00064, params: {'max_depth': 8, 'criterion': 'gini', 'min_samples_split': 30, 'min_samples_leaf': 32, 'max_features': 4},
 mean: 0.56548, std: 0.00205, params: {'max_depth': 8, 'criterion': 'gini', 'min_samples_split': 100, 'min_samples_leaf': 4, 'max_features': 'auto'},
 mean: 0.56531, std: 0.00163, params: {'max_depth': 8, 'criterion': 'gini', 'min_samples_split': 8, 'min_samples_leaf': 32, 'max_features': 'auto'},
 mean: 0.56531, std: 0.00163, params: {'max_depth': 8, 'criterion': 'gini', 'min_samples_split': 30, 'min_samples_leaf': 32, 'max_features': 'auto'},
 mean: 0.56480, std: 0.00117, params: {'max_depth': 8, 'criterion': 'gini', 'min_samples_split': 100, 'min_samples_leaf': 32, 'max_features': 4},
 mean: 0.56470, std: 0.00372, params: {'max_depth': 4, 'criterion': 'gini', 'min_samples_split': 8, 'min_samples_leaf': 32, 'max_features': 4},
 mean: 0.56449, std: 0.00157, params: {'max_depth': 8, 'criterion': 'gini', 'min_samples_split': 2, 'min_samples_lea

```
[mean: 0.56837, std: 0.00299, params: {'min_samples_leaf': 32, 'min_samples_split': 30, 'criterion': 'gini', 'max_depth': 8, 'max_features': 'auto'},
 mean: 0.56736, std: 0.00163, params: {'min_samples_leaf': 32, 'min_samples_split': 8, 'criterion': 'gini', 'max_depth': 8, 'max_features': 'auto'},
 mean: 0.56722, std: 0.00162, params: {'min_samples_leaf': 32, 'min_samples_split': 2, 'criterion': 'gini', 'max_depth': 8, 'max_features': 'auto'},
 mean: 0.56709, std: 0.00244, params: {'min_samples_leaf': 4, 'min_samples_split': 30, 'criterion': 'gini', 'max_depth': 8, 'max_features': 'auto'},
 mean: 0.56699, std: 0.00237, params: {'min_samples_leaf': 4, 'min_samples_split': 2, 'criterion': 'gini', 'max_depth': 8, 'max_features': 4},
 mean: 0.56699, std: 0.00237, params: {'min_samples_leaf': 4, 'min_samples_split': 8, 'criterion': 'gini', 'max_depth': 8, 'max_features': 4},
 mean: 0.56696, std: 0.00400, params: {'min_samples_leaf': 4, 'min_samples_split': 100, 'criterion': 'gini', 'max_depth': 8, 'max_features': 'auto'},
 mean: 0.56688, std: 0.00190, params: {'min_samples_leaf': 32, 'min_samples_split': 100, 'criterion': 'gini', 'max_depth': 8, 'max_features': 4},
 mean: 0.56671, std: 0.00460, params: {'min_samples_leaf': 4, 'min_samples_split': 100, 'criterion': 'gini', 'max_depth': 8, 'max_features': 4},
 mean: 0.56665, std: 0.00322, params: {'min_samples_leaf': 32, 'min_samples_split': 100, 'criterion': 'gini', 'max_depth': 8, 'max_features': 'auto'}]
```

The following configuration combines a high mean with a low variance:

In [11]:
clfF = ensemble.RandomForestClassifier(n_jobs=-1, min_samples_leaf=32, min_samples_split=2, criterion='gini',
                                       max_depth=8, max_features=4)

In [12]:
clfF.fit(X[:70000], y[:70000])
merge.precision(clfF.predict(X[70000:]), y[70000:])

0.56970792327379227

Overall precision is calculated by the following rule: $prec = \frac{prec_{agree} \cdot size_{agree}}{size_{test}} + \frac{prec_{disagree} \cdot size_{disagree}}{size_{test}}$
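
As a small worked example of this rule, the sketch below computes the size-weighted average in plain Python. The function name `overall_precision` is hypothetical, and the row counts are round illustrative numbers (a 79% to 21% split with precisions rounded to 0.70 and 0.55), not the actual sizes of the test set.

```python
def overall_precision(prec_agree, size_agree, prec_disagree, size_disagree):
    """Size-weighted average of the precision on agreed and disagreed rows."""
    size_test = size_agree + size_disagree
    return (prec_agree * size_agree + prec_disagree * size_disagree) / size_test


# Illustrative split: 790 agreed rows, 210 disagreed rows
print(overall_precision(0.70, 790, 0.55, 210))
```

Because the disagreed rows carry only about a fifth of the weight, a gain in precision on those rows moves the overall precision by roughly a fifth of that gain.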

In [19]:
baseline * baseline_weight / len(test_predictions) + c * len(y_labels) / len(test_predictions)

0.66858996896942602

Assuming we can be as precise when predicting on the class set, we use the held-out precision of the random forest as our precision for disagreements:

In [21]:
prec_merge = merge.precision(clfF.predict(X[70000:]), y[70000:])
baseline * baseline_weight / len(test_predictions) + prec_merge * len(y_labels) / len(test_predictions)

0.67156004576754724

We would then obtain a result which is around 0.3 percentage points better.

# Merge predictions after training a classifier (estimated precision 0.672)

In [18]:
categories = ['articleID', 'productGroup', 'customerID', 'voucherID']
X_train, y_train = merge.boosting_features(test_train, test_imputed, categories)
X_class, class_dis, class_agr = merge.class_features(class_train, class_imputed, categories)

In [19]:
clfF.fit(X_train, y_train)
y_merged = clfF.predict(X_class)
class_dis['merged_prediction'] = y_merged

In [20]:
class_agr['merged_prediction'] = class_imputed.prediction.A

In [21]:
class_unified = pd.concat([class_dis, class_agr])

`class_unified` now contains the final prediction, obtained by learning on the test set which classifier's prediction to select and then applying that model to the class data. As shown above, the classifier is prevented from overfitting, and it is evaluated using CV.
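
The merge logic of the cells above can be summarised row by row: agreed rows keep the shared vote, disagreed rows take the prediction of the trained classifier. The following plain-Python sketch illustrates this. `unify_predictions` is a hypothetical helper, and the notebook instead concatenates two pandas frames returned by `merge.class_features`.

```python
def unify_predictions(pred_a, pred_b, pred_c, meta_pred):
    """Final label per row: the shared vote when A, B and C agree,
    otherwise the meta-classifier's prediction for that row."""
    final = []
    for a, b, c, m in zip(pred_a, pred_b, pred_c, meta_pred):
        final.append(a if a == b == c else m)
    return final


# Illustrative values; meta_pred is only needed on disagreed rows
pred_a = [0, 1, 1, 0]
pred_b = [0, 1, 0, 0]
pred_c = [0, 1, 0, 1]
meta_pred = [None, None, 0, 1]

print(unify_predictions(pred_a, pred_b, pred_c, meta_pred))  # [0, 1, 0, 1]
```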

**FINAL**

In [25]:
class_unified.merged_prediction

orderID   articleID  colorCode  sizeCode
a1744179  i1001147   1001       42          0
          i1001151   3082       42          0
          i1001461   2493       42          0
          i1001480   1001       42          0
          i1003229   2112       42          0
a1744181  i1003656   7178       29          0
a1744184  i1003520   7126       33          0
          i1003863   3001       38          0
                     3086       42          1
a1744187  i1003276   1111       36          1
                     1493       36          1
a1744195  i1003920   1092       36          1
a1744196  i1003248   1001       44          0
          i1003270   1082       44          0
a1744199  i1003190   1493       44          1
          i1003211   1493       42          1
a1744203  i1001165   1001       44          0
a1744204  i1001443   1001       38          0
a1744210  i1001281   1055       40          0
          i1003857   1078       40          1
          i1003870   1001       40     

## Test the same approach on the test set (sanity check)

In [13]:
categories = ['articleID', 'productGroup', 'customerID', 'voucherID']
X_traint, y_traint = merge.boosting_features(test_train, test_imputed, categories)
X_testt, class_dist, class_agrt = merge.class_features(test_train, test_imputed, categories)

In [14]:
clfF.fit(X_traint, y_traint)
y_mergedt = clfF.predict(X_testt)
class_dist['merged_prediction'] = y_mergedt

In [15]:
class_agrt['merged_prediction'] = test_imputed.prediction.A

In [16]:
class_unifiedt = pd.concat([class_dist, class_agrt])

In [17]:
class_unifiedt['prediction_true'] = test_imputed.original.returnQuantity.astype(int)
merge.precision(class_unifiedt.merged_prediction, class_unifiedt.prediction_true)

0.67416730249922319

This shows that the approach works. The value is optimistic, because the forest is evaluated here on the data it was trained on, but the previous cells are only meant to make sure everything works as expected.
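
Why a score measured on the training data is optimistic can be seen with a deliberately extreme toy model that simply memorises its training rows. The class `Memorizer` and all values below are hypothetical and serve only as an illustration; a random forest with restricted depth and leaf size overfits far less than this.

```python
from collections import Counter


class Memorizer:
    """Toy classifier that stores every training label by row key."""

    def fit(self, keys, labels):
        self.table = dict(zip(keys, labels))
        # Fall back to the most frequent training label for unseen keys
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, keys):
        return [self.table.get(k, self.default) for k in keys]


def accuracy(y_true, y_pred):
    """Fraction of rows where the prediction equals the label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


train_keys = list(range(10))
train_labels = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
test_keys = list(range(10, 14))
test_labels = [1, 0, 1, 1]

model = Memorizer().fit(train_keys, train_labels)
train_acc = accuracy(train_labels, model.predict(train_keys))
test_acc = accuracy(test_labels, model.predict(test_keys))
print(train_acc, test_acc)  # 1.0 0.25
```

The perfect training score says nothing about unseen rows, which is why the precision estimates earlier in this notebook are taken on held-out data.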