In [185]:
import merge

# Results

## Team Performance on Test Data

### Test Data
To compare team performance, we agreed on a common test set. We generated it by splitting the original training data into 70% training rows and 30% test rows. Together with the actual submission, each team handed in its predictions for this test set. This chapter evaluates these predictions.
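
As an illustration of such a 70/30 split, here is a minimal sketch. It assumes a random row-wise split; the text does not say whether the real split was random or, for example, by date. The helper `split_train_test`, the toy frame, and the column `returnQuantity` are hypothetical names used only for this example.

```python
import pandas as pd

def split_train_test(data, test_fraction=0.3, seed=0):
    """Randomly split rows into a training part and a held-out test part.

    Assumes the index of `data` is unique, so dropping the sampled
    index labels removes exactly the test rows.
    """
    test = data.sample(frac=test_fraction, random_state=seed)
    train = data.drop(test.index)
    return train, test

# Toy frame standing in for the original training data (hypothetical columns)
toy = pd.DataFrame({"orderID": range(10), "returnQuantity": [0, 1] * 5})
train, test = split_train_test(toy)
```

Fixing the seed keeps the split reproducible, which matters when several teams have to evaluate on exactly the same rows.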

### Mean Accuracy
All teams showed similar accuracies on the test set, around **67%**.

In [119]:
mean_accuracies = merge.mean_accuracies()
mean_accuracies

b        0.666572
c_big    0.668581
c_new    0.667811
c_old    0.667811
dtype: float64

### Split performance
All teams realised that their classifier's performance depends heavily on the amount of historical knowledge available, e.g., the number of products a customer has returned before. This led to an approach that divides the classification into *splits* of known and unknown values. The table below shows the size of each split and the accuracy of each team on it; the `best` column names the most accurate team.

In [147]:
splits = merge.evaluate()
splits

,articleID,productGroup,customerID,voucherID,b,c_big,c_new,c_old,size,best
0,False,False,False,False,,,,,0,
1,False,False,False,True,,,,,0,
2,False,False,True,False,1.0,1.0,1.0,1.0,1,b
3,False,False,True,True,,,,,0,
4,False,True,False,False,0.661214,0.666959,0.666002,0.666002,12533,c_big
5,False,True,False,True,0.653027,0.656597,0.656575,0.656575,138668,c_big
6,False,True,True,False,0.673185,0.670785,0.67103,0.67103,36666,b
7,False,True,True,True,0.667421,0.669908,0.66892,0.66892,209496,c_big
8,True,True,False,False,0.675924,0.678176,0.680616,0.680616,10658,c_new
9,True,True,False,True,0.661093,0.665166,0.665166,0.665166,106631,c_big
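
The per-split evaluation above can be sketched as follows. This is not the code behind `merge.evaluate()`, only a minimal illustration under two assumptions: a value counts as *known* if it also occurs in the training rows, and `True` marks a known value. The function `split_accuracy` and the toy data are hypothetical.

```python
import pandas as pd

KEYS = ["articleID", "productGroup", "customerID", "voucherID"]

def split_accuracy(train, test, prediction, target):
    """Accuracy and size for every observed combination of known/unknown keys.

    `prediction` and `target` are Series aligned with the index of `test`.
    """
    # Assumption: a value is "known" if it appears in the training data
    known = pd.DataFrame({key: test[key].isin(train[key]) for key in KEYS},
                         index=test.index)
    correct = (prediction == target).astype(float).rename("correct")
    grouped = pd.concat([known, correct], axis=1).groupby(KEYS)["correct"]
    return grouped.agg(accuracy="mean", size="count")

# Toy data: article a3 and customer c9 never occur in the training rows
train = pd.DataFrame({"articleID": ["a1", "a2"], "productGroup": [1, 1],
                      "customerID": ["c1", "c2"], "voucherID": ["v1", "v1"]})
test = pd.DataFrame({"articleID": ["a1", "a3", "a3", "a1"],
                     "productGroup": [1, 1, 1, 1],
                     "customerID": ["c1", "c9", "c9", "c1"],
                     "voucherID": ["v1", "v1", "v1", "v1"]})
target = pd.Series([0, 1, 0, 1])
prediction = pd.Series([0, 1, 1, 1])
result = split_accuracy(train, test, prediction, target)
```

Unlike the table above, this sketch lists only the combinations that actually occur in the test data, so empty splits do not appear.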


### Differences

The table below shows, for each pair of teams, the share of test rows on which their predictions differ.

In [184]:
merge.differences()

,b,c_big,c_new,c_old
b,0.0,0.111,0.112,0.112
c_big,0.111,0.0,0.063,0.063
c_new,0.112,0.063,0.0,0.0
c_old,0.112,0.063,0.0,0.0


In [180]:
import numpy as np

# Cross-check: share of rows on which c_old and c_big disagree
keys = ['orderID', 'articleID', 'colorCode', 'sizeCode']
df1 = team_predictions['c_old'].drop('confidence', axis=1)
df2 = team_predictions['c_big'].drop('confidence', axis=1)
combined = df1.merge(df2, on=keys)  # the shared 'prediction' column gets suffixes _x and _y
different = np.round((combined['prediction_x'] != combined['prediction_y']).mean(), 3)
different

0.063

## Team Performance on Target Data

### Merge Data

In [118]:
merged = merge.merge()
merged

,,,team,b,b,c_big,c_big,c_new,c_new,c_old,c_old
,,,result,confidence,prediction,confidence,prediction,confidence,prediction,confidence,prediction
orderID,articleID,colorCode,sizeCode,,,,,,,,
a1744178,i1002632,3097,I,0.701320,0,0.500000,0.0,0.570000,0.0,0.570000,0.0
a1744178,i1003278,1097,40,0.723813,0,0.570000,0.0,0.620000,0.0,0.620000,0.0
a1744178,i1003279,1114,40,0.714540,0,0.500000,0.0,0.610000,0.0,0.610000,0.0
a1744178,i1003279,1116,40,0.714540,0,0.620000,0.0,0.660000,0.0,0.660000,0.0
a1744178,i1003279,1117,40,0.714540,0,0.600000,0.0,0.600000,0.0,0.600000,0.0
a1744179,i1001147,1001,42,0.267293,1,0.595477,0.0,0.595477,0.0,0.595477,0.0
a1744179,i1001151,3082,42,0.336710,1,0.628530,0.0,0.628530,0.0,0.628530,0.0
a1744179,i1001160,1108,42,0.307641,1,0.580849,1.0,0.580849,1.0,0.580849,1.0
a1744179,i1001461,2493,42,0.315724,1,0.559468,0.0,0.559468,0.0,0.559468,0.0
a1744179,i1001480,1001,42,0.319979,1,0.523900,0.0,0.523900,0.0,0.523900,0.0


# Ensemble predictions

## Approach 1: Weighted Majority Vote

We compute the final $prediction_i$ for row $i$ as a weighted average of the individual predictions, where each $prediction_{c,i}$ of classifier $c \in C$ is weighted by $weight_{c,i}$.

$$
prediction_i = 
\dfrac
    {\sum_{c \in C} 
        prediction_{c,i} 
        \cdot weight_{c,i}} 
    {\sum_{c \in C} weight_{c,i}}
$$

The $weight_{c,i}$ takes into account the confidence of $c$ in row $i$, the mean confidence of $c$ over all rows, the overall accuracy of $c$, and the mean accuracy of all classifiers.

$$
weight_{c,i} = \dfrac{confidence_{c,i}}{confidence_{c,\varnothing}} \cdot \bigg(\dfrac{accuracy_c}{accuracy_\varnothing}\bigg)^2
$$

The confidences returned by the classifiers are not directly comparable to each other: they differ in meaning and in range. To counteract this, and to avoid favoring classifiers that are generally (perhaps mistakenly) confident, we divide each confidence by the classifier's mean confidence.

The accuracy, we believe, is the best predictor of a classifier's performance on the target set. We take it from the test set that each group used to evaluate its classifier, i.e., the 30% part of the 70/30 split of the training data. We compare each accuracy to the mean accuracy of all classifiers. Because the differences between accuracies are small in magnitude but highly informative, we square the ratio to amplify them.
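
The two formulas above can be sketched in a few lines of pandas. This is a minimal illustration, not the implementation of `merge.majority_vote`. The function `weighted_vote` and the toy confidences are hypothetical, and the final threshold of 0.5 is an assumption, since the text does not state how the weighted average is turned into a binary prediction.

```python
import pandas as pd

def weighted_vote(predictions, confidences, accuracies):
    """Weighted average of binary predictions.

    `predictions` and `confidences` have one row per item and one column
    per classifier; `accuracies` is a Series indexed by classifier name.
    """
    # confidence_{c,i} divided by the mean confidence of classifier c
    relative_confidence = confidences / confidences.mean(axis=0)
    # (accuracy_c / mean accuracy of all classifiers) squared
    relative_accuracy = (accuracies / accuracies.mean()) ** 2
    # The Series is aligned with the DataFrame columns (one factor per classifier)
    weights = relative_confidence * relative_accuracy
    return (predictions * weights).sum(axis=1) / weights.sum(axis=1)

# Toy input for two classifiers; accuracies taken from the table of mean accuracies
predictions = pd.DataFrame({"b": [0, 1], "c_big": [1, 1]})
confidences = pd.DataFrame({"b": [0.8, 0.4], "c_big": [0.6, 0.6]})
accuracies = pd.Series({"b": 0.666572, "c_big": 0.668581})

score = weighted_vote(predictions, confidences, accuracies)
# Assumption: scores of at least 0.5 are mapped to class 1
final = (score >= 0.5).astype(int)
```

Because every weight is positive, the score always lies between 0 and 1 and can be read as the weighted share of classifiers voting for class 1.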

In [None]:
merge.majority_vote(merge.merge(team_results), accuracies)