### 0. Helpers

> Helper functions used in upcoming sections

In [1]:
import os 
import re # RegEx

# Build a separator line as long as the given text, e.g. "text" -> "----"
getSeparator = lambda text: len(text) * "-"  # top/bottom separators for printed tables

"""
Converts the string representation of a list into a list of ints,
e.g. the string "[0, 1, 2, 3]" becomes the list [0, 1, 2, 3]:

1. Remove the square brackets.
2. Split the remaining string at the commas.
3. Convert each element from string to int.
"""
getIntListFromStringList = lambda stringList: [int(listElement) for listElement in re.sub(r"[\[\]]", "", stringList).split(',')]
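
As an aside, the same conversion can be sketched with the standard library's `ast.literal_eval`, which parses the string as a Python literal instead of stripping brackets by hand. The function name `parse_int_list` is only an illustration and is not used elsewhere in this notebook.

```python
import ast

def parse_int_list(text):
    """Parse a string such as "[5, 3, 0, 3]" into a list of ints."""
    values = ast.literal_eval(text)  # evaluates Python literals only, no arbitrary code
    return [int(value) for value in values]

print(parse_int_list("[5, 3, 0, 3]"))  # [5, 3, 0, 3]
```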

<hr>

## TRAINING 🏋️‍♀️

<hr>

### 1. Load the training data

In [2]:
import pandas as pd
pd.options.mode.chained_assignment = None

train_df = pd.read_csv('sample_data/train.csv')

train_df.head(10)

Unnamed: 0,transactionId,basket,customerType,totalAmount,returnLabel
0,7934161612,[3],existing,77.0,0
1,5308629088,"[5, 3, 0, 3]",existing,64.0,0
2,1951363325,"[3, 3, 1, 4]",new,308.0,1
3,6713597713,[2],existing,74.0,0
4,8352683669,"[4, 4, 4, 4]",new,324.0,1
5,7416526871,"[4, 3, 4, 4]",new,256.0,1
6,5297184153,"[3, 4]",existing,110.0,1
7,5634650486,"[4, 4, 1, 4, 4]",new,100.0,1
8,5788331730,"[1, 5, 2, 2]",existing,240.0,0
9,7800645047,"[5, 5, 0]",existing,138.0,0


### 2. Fill in the missing values in the training data

#### 2.1 Analysis of missing data

> Analyze which data is missing

In [3]:
# Analysis of missing data

heading = "Total number of missing training values:"

print(getSeparator(heading))
print(heading)
print(train_df.isnull().sum()) # Check which values are null
print(getSeparator(heading))

----------------------------------------
Total number of missing training values:
transactionId      0
basket             0
customerType     517
totalAmount      484
returnLabel        0
dtype: int64
----------------------------------------


The missing `customerType` values are categorical (`existing` or `new`).

Next, let's check whether it is worthwhile to replace the missing values:

In [4]:
# Calculate the percentage of cases in which the customer type is missing
# Use this value to decide whether these rows can be "safely" removed

# Note: count() skips NaN, so this is the number of non-missing values
customerDataSum = train_df["customerType"].count()
missingCustomerTypeData = train_df["customerType"].isnull().sum()

# Calculated values
customerType_isNull_percentage = round(missingCustomerTypeData * 100 / customerDataSum, 2)
missing_customerType_amount = round(customerDataSum*(customerType_isNull_percentage/100))

print(f'\nThe percentage of missing values in the {customerDataSum} big CustomerType dataset is only {str(customerType_isNull_percentage).replace(".", ",")} % = {str(missing_customerType_amount)} missing values')
print('==> Removing missing values should not have a big impact on the resulting model!')


The percentage of missing values in the 24483 big CustomerType dataset is only 2,11 % = 517.0 missing values
==> Removing missing values should not have a big impact on the resulting model!
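
A small worked check of the percentage: `count()` skips missing entries, so the 24483 above is the number of non-missing `customerType` values and the 2,11 % is relative to that count. Assuming the total number of rows is the sum of both figures (24483 + 517), the share relative to all rows comes out slightly lower:

```python
missing = 517        # missing customerType values (from the output above)
non_missing = 24483  # non-missing values, as returned by count()
total = missing + non_missing  # assumption: every row is either missing or not

share_of_non_missing = round(missing * 100 / non_missing, 2)
share_of_total = round(missing * 100 / total, 2)

print(share_of_non_missing)  # 2.11
print(share_of_total)        # 2.07
```

Either way the share is small, so the decision to drop these rows is not affected.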


#### 2.2 Filling and removing data

1. Remove rows with a missing `customerType`
2. Fill missing `totalAmount` values with the mean

In [5]:
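# Drop rows with a missing customerType (see the decision above)
# .copy() yields an independent dataframe, so the fill below does not act on a slice of train_df
train_cleaned = train_df[train_df['customerType'].notna()].copy()

# Total amount: fill missing values with the mean
totalAmount_mean = train_cleaned['totalAmount'].mean()
train_cleaned['totalAmount'] = train_cleaned['totalAmount'].fillna(totalAmount_mean)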

print(f'Total number of missing training values after cleaning up: {train_cleaned.isnull().sum().sum()}')

Total number of missing training values after cleaning up: 0


### 3. Transform the categorical features using one-hot encoding

1. Find the categorical features
2. One-hot encode the categorical features found in step 1

#### 3.1 Find the categorical columns

In [6]:
columns = train_cleaned.columns

# Columns with numerical data
num_cols = train_cleaned.select_dtypes(include='number').columns

# Subtract the numerical columns from all columns
categorical_columns = list(set(columns) - set(num_cols))

print("Categorical attributes found: ", *categorical_columns, sep="\n* ")

Categorical attributes found: 
* customerType
* basket


#### 3.2 One-hot encode the customer type

> Use the built-in pandas function

In [7]:
one_hot_customerType = pd.get_dummies(train_cleaned['customerType'])

In [8]:
print('One hot encoding for the customer type:\n\n')

one_hot_customerType.head()

One hot encoding for the customer type:




Unnamed: 0,existing,new
0,1,0
1,1,0
2,0,1
3,1,0
4,0,1


#### 3.3 One-hot encode the basket values

> We'll do this manually

<hr>
Get the max/min values in the basket feature lists
<hr>

In [9]:
# For each basket, get the min/max
# We get them as strings, so we use our helper function to convert them to int lists
min_basket_value = min(min(getIntListFromStringList(basket)) for basket in train_cleaned['basket'])
max_basket_value = max(max(getIntListFromStringList(basket)) for basket in train_cleaned['basket'])

print(f'The minimal value basket element is {min_basket_value} and the max basket value is {max_basket_value}')

The minimal value basket element is 0 and the max basket value is 5


<hr>
Create new features based on the elements in the basket:
<hr>

In [10]:
# List with basket elements
basketElements = list(range(min_basket_value, max_basket_value+1))

# Data frame with columns: 'b_0 | b_1 | ...' for each of our basket elements
one_hot_basket = pd.DataFrame([], columns=[f'b_{basketElement}' for basketElement in basketElements])

'''
Encode the basket feature.
For each category:
    1. Count how often the category occurs in the current basket
    2. Store that count in the column of the category
Note: this is a count encoding rather than a strict 0/1 one-hot encoding.
Counting on the raw string works here because all categories are single digits.
'''
for basketElement in basketElements:
    one_hot_basket[f'b_{basketElement}'] = train_cleaned['basket'].apply(lambda x: x.count(str(basketElement)))

one_hot_basket.head()

Unnamed: 0,b_0,b_1,b_2,b_3,b_4,b_5
0,0,0,0,1,0,0
1,1,0,0,2,0,1
2,0,1,0,2,1,0
3,0,0,1,0,0,0
4,0,0,0,0,4,0


#### 3.4 Concatenate it back into the original dataframe

> Use the built-in pandas function

1. Concatenate
2. Remove features that are no longer needed

In [11]:
train_encoded = pd.concat([train_cleaned, one_hot_customerType, one_hot_basket], axis=1)

# Keep basket, as it will be useful for the feature engineering;
# only convert it from a string to an int list for easier handling
train_encoded['basket'] = [getIntListFromStringList(basket) for basket in train_cleaned.basket]

# Clean up: Drop customerType as it is no longer needed
train_encoded = train_encoded.drop(columns=['customerType'])

### 4. Eliminate data attributes

> Remove data attributes that do not correlate with the target variable

To remove:

1. `transactionId`

In [12]:
train_encoded = train_encoded.drop(columns=['transactionId'])
train_encoded.head(10)

Unnamed: 0,basket,totalAmount,returnLabel,existing,new,b_0,b_1,b_2,b_3,b_4,b_5
0,[3],77.0,0,1,0,0,0,0,1,0,0
1,"[5, 3, 0, 3]",64.0,0,1,0,1,0,0,2,0,1
2,"[3, 3, 1, 4]",308.0,1,0,1,0,1,0,2,1,0
3,[2],74.0,0,1,0,0,0,1,0,0,0
4,"[4, 4, 4, 4]",324.0,1,0,1,0,0,0,0,4,0
5,"[4, 3, 4, 4]",256.0,1,0,1,0,0,0,1,3,0
6,"[3, 4]",110.0,1,1,0,0,0,0,1,1,0
7,"[4, 4, 1, 4, 4]",100.0,1,0,1,0,1,0,0,4,0
8,"[1, 5, 2, 2]",240.0,0,1,0,0,1,2,0,0,1
9,"[5, 5, 0]",138.0,0,1,0,1,0,0,0,0,2


### 5. Try to build features based on the `basket` attribute (e.g. how often each category occurs in the basket)

> Goal: features that help predict whether a customer returns an order

From the one-hot encoding:

- [x] How often each category occurs in the basket
- [x] New or existing customer

Additional features:

- [x] Most frequent category in the current basket
- [x] Number of books in the most frequent category
- [x] Orders with an above-average number of books in the basket
- [x] Whether all categories in the basket are the same
- [x] Relationship between the number of books bought and the total price paid

Conclusion: the additional features have only a minimal effect on precision.

#### 5.1 Most frequent category in the current basket

In [13]:
f_eng_basket = list(train_encoded.basket) # basket shortcut

def most_frequent(items):
    """Return the element that occurs most often (the first one on ties)."""
    counter = 0
    num = items[0]

    for i in items:
        curr_frequency = items.count(i)
        if curr_frequency > counter:
            counter = curr_frequency
            num = i

    return num

# 1. Find the highest category in the current basket
train_encoded["highestCategory"] = train_encoded.basket.map(lambda x: max(x))
# 2. Find the most frequent category in the current basket
train_encoded["maxCategory"] = train_encoded.basket.map(lambda x: most_frequent(x))

#### 5.2 Number of books in the most frequent category

In [14]:
f_eng_basket = list(train_encoded.basket) # basket shortcut

# 1. Most frequent category
maxBasketValues =  list(train_encoded['maxCategory'])

# 2. Count its occurrences and add the result as a new feature
train_encoded['maxCount'] = [basket.count(maxBasketValues[i]) for (i, basket) in enumerate(f_eng_basket) ]

train_encoded.head(15)

Unnamed: 0,basket,totalAmount,returnLabel,existing,new,b_0,b_1,b_2,b_3,b_4,b_5,highestCategory,maxCategory,maxCount
0,[3],77.0,0,1,0,0,0,0,1,0,0,3,3,1
1,"[5, 3, 0, 3]",64.0,0,1,0,1,0,0,2,0,1,5,3,2
2,"[3, 3, 1, 4]",308.0,1,0,1,0,1,0,2,1,0,4,3,2
3,[2],74.0,0,1,0,0,0,1,0,0,0,2,2,1
4,"[4, 4, 4, 4]",324.0,1,0,1,0,0,0,0,4,0,4,4,4
5,"[4, 3, 4, 4]",256.0,1,0,1,0,0,0,1,3,0,4,4,3
6,"[3, 4]",110.0,1,1,0,0,0,0,1,1,0,4,3,1
7,"[4, 4, 1, 4, 4]",100.0,1,0,1,0,1,0,0,4,0,4,4,4
8,"[1, 5, 2, 2]",240.0,0,1,0,0,1,2,0,0,1,5,2,2
9,"[5, 5, 0]",138.0,0,1,0,1,0,0,0,0,2,5,5,2
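
Both features, the most frequent category and its count, can be obtained in one step with `collections.Counter`. The sketch below is an alternative illustration (the name `most_frequent_with_count` is made up); on ties it returns the category seen first, which matches the rows shown above.

```python
from collections import Counter

def most_frequent_with_count(basket):
    """Return (category, count) for the most frequent category in a basket."""
    # most_common(1) yields the top entry; on ties the first-seen element wins
    category, count = Counter(basket).most_common(1)[0]
    return category, count

print(most_frequent_with_count([5, 3, 0, 3]))  # (3, 2)
print(most_frequent_with_count([3, 4]))        # (3, 1)
```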


#### 5.3 Orders with an above-average number of books in the basket

> Create a feature that indicates whether the number of books in the basket lies above, at or below the median

1. More books than the median: `1`
2. Exactly as many books as the median: `0`
3. Fewer books than the median: `-1`

In [15]:
from statistics import median

# Calculate median from the number of items in each basket
train_encoded['basketElementCount'] = [len(basket) for basket in train_encoded['basket']]
train_encoded['basketSizeMedian'] = median(list(train_encoded['basketElementCount']))

train_encoded[['returnLabel', 'basketElementCount', 'basketSizeMedian']].head(n=20)

aboveMedian = []

for index, row in train_encoded.iterrows():
    if (row['basketElementCount'] > row['basketSizeMedian']): 
        valueToAppend = 1
    elif (row['basketElementCount'] == row['basketSizeMedian']): 
        valueToAppend = 0
    else:
        valueToAppend = -1
        
    aboveMedian.append(valueToAppend)
         
train_encoded['aboveMedian'] = aboveMedian

train_encoded['aboveMedian'].value_counts()

 1    12189
-1     9127
 0     3167
Name: aboveMedian, dtype: int64
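
The three-way comparison above is the sign of the difference between basket size and median. A compact plain-Python sketch of the same idea, using hypothetical basket sizes and a made-up helper name:

```python
from statistics import median

def compare_to_median(sizes):
    """Map each basket size to 1 (above), 0 (equal) or -1 (below the median)."""
    mid = median(sizes)
    # (size > mid) - (size < mid) is the sign of the difference as an int
    return [(size > mid) - (size < mid) for size in sizes]

basket_sizes = [1, 4, 4, 1, 5]  # hypothetical basket sizes, median is 4
print(compare_to_median(basket_sizes))  # [-1, 0, 0, -1, 1]
```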

#### 5.4 Whether all categories in the basket are the same

In [16]:
def all_same(items):
    return all(x == items[0] for x in items)

train_encoded['same'] = train_encoded.basket.map(lambda x: 1 if all_same(x)  else 0)
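
An equivalent check can be sketched with a set: a basket contains a single category exactly when its set of categories has one element. The name `has_single_category` is only illustrative. Unlike `all_same`, this variant returns `False` for an empty basket.

```python
def has_single_category(items):
    """True if the basket is non-empty and contains one distinct category."""
    return len(set(items)) == 1

print(has_single_category([4, 4, 4, 4]))  # True
print(has_single_category([4, 3, 4, 4]))  # False
```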

#### 5.5 Relationship between the number of books bought and the total price paid

In [17]:
f_eng_basket = list(train_encoded.basket) # basket shortcut
train_encoded['averagePrice'] = [round(train_encoded.totalAmount.iloc[i]/len(basket), 2) for (i, basket) in enumerate(f_eng_basket) ]
train_encoded.head()

Unnamed: 0,basket,totalAmount,returnLabel,existing,new,b_0,b_1,b_2,b_3,b_4,b_5,highestCategory,maxCategory,maxCount,basketElementCount,basketSizeMedian,aboveMedian,same,averagePrice
0,[3],77.0,0,1,0,0,0,0,1,0,0,3,3,1,1,4,-1,1,77.0
1,"[5, 3, 0, 3]",64.0,0,1,0,1,0,0,2,0,1,5,3,2,4,4,0,0,16.0
2,"[3, 3, 1, 4]",308.0,1,0,1,0,1,0,2,1,0,4,3,2,4,4,0,0,77.0
3,[2],74.0,0,1,0,0,0,1,0,0,0,2,2,1,1,4,-1,1,74.0
4,"[4, 4, 4, 4]",324.0,1,0,1,0,0,0,0,4,0,4,4,4,4,4,0,1,81.0


### 6. Scale the features with a StandardScaler

> Use the sklearn scaler

1. Drop the target feature
2. Scale the remaining features

In [18]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Drop basket feature after using it in the feature eng. phase
train_encoded = train_encoded.drop(columns=['basket'])
# Drop our target
x = train_encoded.drop(columns=['returnLabel'])

# Our x/y values
X = scaler.fit_transform(x)
y = train_encoded['returnLabel'].values

### 7. Train the following classification models and try out the given hyperparameters using cross-validation:

For all models:

1. Use grid search to find best parameter values
2. Train model

In [19]:
# Helper func to find best params
from sklearn.model_selection import GridSearchCV

def findBestParams(model, params, X, y):
    '''
    Takes a model, possible parameters to test, X and y values.
    Determines the best possible values for the model
    and returns the best values.
    '''
     
    clf = GridSearchCV(estimator=model, param_grid=params, n_jobs=-1)
    clf = clf.fit(X, y)
    
    # Read the best value of every tuned parameter from the best estimator
    paramKeys = list(params[0].keys()) # Parameter names = keys
    paramsToDetermine = [getattr(clf.best_estimator_, param) for param in paramKeys]
    
    print(f'Best {model} values: {paramsToDetermine}')
    
    return paramsToDetermine
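
To make clear what the grid search does, here is a minimal plain-Python sketch: try every combination of parameter values and keep the one with the best score. The score function `toy_score` is a hypothetical stand-in for a cross-validated model score, and `grid_search` is a made-up helper, not the sklearn implementation. (In sklearn, the fitted `GridSearchCV` object also exposes the winning combination as `best_params_`.)

```python
from itertools import product

def grid_search(score, param_grid):
    """Return (best_params, best_score) over all combinations in param_grid."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score

# Hypothetical score function standing in for a cross-validated model score
def toy_score(n_estimators, max_depth):
    return -abs(n_estimators - 100) - 10 * abs(max_depth - 4)

grid = {"n_estimators": [60, 80, 100, 120, 140], "max_depth": [2, 3, 4, 5]}
print(grid_search(toy_score, grid))  # ({'n_estimators': 100, 'max_depth': 4}, 0)
```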

#### 7.1 Logistic regression: `C: [0.1, 1, 4, 5, 6, 10, 30, 100]` and `penalty: ["l1", "l2"]`

In [20]:
from sklearn.linear_model import LogisticRegression

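# Note: the default lbfgs solver does not support the l1 penalty, so the l1 candidates fail to fit and are scored as NaN
lr_C, lr_penalty = findBestParams(model=LogisticRegression(), params=[{'C': [0.1,1,4,5,6,10,30,100], 'penalty': ["l1", "l2"]}], X=X, y=y)
lr_model = LogisticRegression(max_iter=1000, random_state=0, penalty=lr_penalty, C=lr_C)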
lr_model.fit(X, y)

Best LogisticRegression() values: [1, 'l2']


LogisticRegression(C=1, max_iter=1000, random_state=0)

#### 7.2 Random forest: `n_estimators: [60, 80, 100, 120, 140]` and `max_depth: [2, 3, 4, 5]`

In [21]:
from sklearn.ensemble import RandomForestClassifier

rf_n_est, rf_max_depth = findBestParams(model=RandomForestClassifier(), params=[{'n_estimators': [60,80,100,120,140], 'max_depth': [2, 3, 4, 5]}], X=X, y=y)
rf_model = RandomForestClassifier(random_state=0, max_depth=rf_max_depth, n_estimators=rf_n_est)
rf_model.fit(X,y)

Best RandomForestClassifier() values: [60, 5]


RandomForestClassifier(max_depth=5, n_estimators=60, random_state=0)

#### 7.3 Gradient boosting tree: same hyperparameters as for the random forest

In [22]:
from sklearn.ensemble import GradientBoostingClassifier

gb_n_est, gb_max_depth = findBestParams(model=GradientBoostingClassifier(), params=[{'n_estimators': [60,80,100,120,140], 'max_depth': [2, 3, 4, 5]}], X=X, y=y)
gb_model = GradientBoostingClassifier(random_state=0, max_depth=gb_max_depth, n_estimators=gb_n_est)
gb_model.fit(X,y)

Best GradientBoostingClassifier() values: [140, 4]


GradientBoostingClassifier(max_depth=4, n_estimators=140, random_state=0)

<hr>

## TESTING 🧪

<hr>

### 8. Load the test data

> The procedure for the test data is almost the same. To tell them apart, each variable gets a `test_[...]` prefix.

In [23]:
test_df = pd.read_csv('sample_data/test.csv')

test_df.head()

Unnamed: 0,transactionId,basket,customerType,totalAmount,returnLabel
0,9605027322,"[4, 0, 3, 4, 1, 4, 3, 4]",new,80.0,1
1,8315649406,[4],existing,26.0,0
2,5151646801,"[1, 3, 5]",existing,147.0,0
3,8101967972,[3],existing,37.0,1
4,2887044104,"[0, 0, 2, 5, 2]",existing,375.0,0


### 9. Remove all rows with missing values

1. Analyze the missing data
2. Remove rows with a missing `customerType` and fill missing `totalAmount` values with the mean

In [24]:
# Analysis of missing data

test_heading = "Total number of missing test values:"

print(getSeparator(test_heading))
print(test_heading)
print(test_df.isnull().sum()) # Check which values are null
print(getSeparator(test_heading))

# Calculate the percentage of cases in which the customer type is missing
# Use this value to decide whether these rows can be "safely" removed

# Note: count() skips NaN, so this is the number of non-missing values
test_customerDataSum = test_df["customerType"].count()
test_missingCustomerTypeData = test_df["customerType"].isnull().sum()

# Calculated values
test_customerType_isNull_percentage = round(test_missingCustomerTypeData * 100 / test_customerDataSum, 2)
test_missing_customerType_amount = round(test_customerDataSum*(test_customerType_isNull_percentage/100))

print(f'\nThe percentage of missing test values in the {test_customerDataSum} big CustomerType dataset is only {str(test_customerType_isNull_percentage).replace(".", ",")} % = {str(test_missing_customerType_amount)} missing values')
print('==> Removing missing values should not have a big impact!')

------------------------------------
Total number of missing test values:
transactionId      0
basket             0
customerType     124
totalAmount      134
returnLabel        0
dtype: int64
------------------------------------

The percentage of missing test values in the 5876 big CustomerType dataset is only 2,11 % = 124.0 missing values
==> Removing missing values should not have a big impact!


In [25]:
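# Drop rows with a missing customerType (see the decision above)
# .copy() yields an independent dataframe, so the fill below does not act on a slice of test_df
test_cleaned = test_df[test_df['customerType'].notna()].copy()

# Total amount: fill missing values with the mean
test_totalAmount_mean = test_cleaned['totalAmount'].mean()
test_cleaned['totalAmount'] = test_cleaned['totalAmount'].fillna(test_totalAmount_mean)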

print(f'Total number of missing test values after cleaning up: {test_cleaned.isnull().sum().sum()}')

Total number of missing test values after cleaning up: 0


### 10. Transform the categorical features using one-hot encoding

1. Find the categorical features
2. One-hot encode the categorical features found in step 1

#### 10.1 Find the categorical columns

In [26]:
test_columns = test_cleaned.columns

# Columns with numerical data
test_num_cols = test_cleaned.select_dtypes(include='number').columns

# Subtract the numerical columns from all columns
test_categorical_columns = list(set(test_columns) - set(test_num_cols))

print("Categorical test attributes found: ", *test_categorical_columns, sep="\n* ")

Categorical test attributes found: 
* customerType
* basket


#### 10.2 One-hot encode the customer type

> Use the built-in pandas function

In [27]:
test_one_hot_customerType = pd.get_dummies(test_cleaned['customerType'])

In [28]:
print('One hot encoding for the customer type:\n\n')

test_one_hot_customerType.head()

One hot encoding for the customer type:




Unnamed: 0,existing,new
0,0,1
1,1,0
2,1,0
3,1,0
4,1,0


#### 10.3 One-hot encode the basket values

> We'll do this manually

<hr>
Get the max/min values in the basket feature lists
<hr>

In [29]:
# For each basket, get the min/max
# We get them as strings, so we use our helper function to convert them to int lists
test_min_basket_value = min(min(getIntListFromStringList(basket)) for basket in test_cleaned['basket'])
test_max_basket_value = max(max(getIntListFromStringList(basket)) for basket in test_cleaned['basket'])

print(f'The minimal test value basket element is {test_min_basket_value} and the max basket value is {test_max_basket_value}')

The minimal test value basket element is 0 and the max basket value is 5


<hr>
Create new features based on the elements in the test basket:
<hr>

In [30]:
test_basketElements = list(range(test_min_basket_value, test_max_basket_value+1))

# Data frame with columns: 'b_0 | b_1 | ...' for each of our basket elements
test_one_hot_basket = pd.DataFrame([], columns=[f'b_{test_basketElement}' for test_basketElement in test_basketElements])

'''
Encode the basket feature.
For each category:
    1. Count how often the category occurs in the current basket
    2. Store that count in the column of the category
Note: this is a count encoding rather than a strict 0/1 one-hot encoding.
Counting on the raw string works here because all categories are single digits.
'''
for test_basketElement in test_basketElements:
    test_one_hot_basket[f'b_{test_basketElement}'] = test_cleaned['basket'].apply(lambda x: x.count(str(test_basketElement)))

test_one_hot_basket.head()

Unnamed: 0,b_0,b_1,b_2,b_3,b_4,b_5
0,1,1,0,2,4,0
1,0,0,0,0,1,0
2,0,1,0,1,0,1
3,0,0,0,1,0,0
4,2,0,2,0,0,1


#### 10.4 Concatenate it back into the original dataframe

> Use the built-in pandas function

1. Concatenate
2. Remove features that are no longer needed

In [31]:
test_encoded = pd.concat([test_cleaned, test_one_hot_customerType, test_one_hot_basket], axis=1)

# Keep basket, as it will be useful for the feature engineering;
# only convert it from a string to an int list for easier handling
test_encoded['basket'] = [getIntListFromStringList(basket) for basket in test_cleaned.basket]

# Clean up: Drop customerType as it is no longer needed
test_encoded = test_encoded.drop(columns=['customerType'])

### 11. Eliminate test data attributes

> Remove test data attributes that do not correlate with the target variable

To remove:

1. `transactionId`

In [32]:
test_encoded = test_encoded.drop(columns=['transactionId'])
test_encoded.head()

Unnamed: 0,basket,totalAmount,returnLabel,existing,new,b_0,b_1,b_2,b_3,b_4,b_5
0,"[4, 0, 3, 4, 1, 4, 3, 4]",80.0,1,0,1,1,1,0,2,4,0
1,[4],26.0,0,1,0,0,0,0,0,1,0
2,"[1, 3, 5]",147.0,0,1,0,0,1,0,1,0,1
3,[3],37.0,1,1,0,0,0,0,1,0,0
4,"[0, 0, 2, 5, 2]",375.0,0,1,0,2,0,2,0,0,1


### Add all features (see the training part)

In [33]:
f_eng_basket = list(test_encoded.basket) # basket shortcut

# 1. Find the highest category in the current basket
test_encoded["highestCategory"] = test_encoded.basket.map(lambda x: max(x))
# 2. Find the most frequent category in the current basket
test_encoded["maxCategory"] = test_encoded.basket.map(lambda x: most_frequent(x))


In [34]:
f_eng_basket = list(test_encoded.basket) # basket shortcut

# 1. Most frequent category
maxBasketValues =  list(test_encoded['maxCategory'])

# 2. Count its occurrences and add the result as a new feature
test_encoded['maxCount'] = [basket.count(maxBasketValues[i]) for (i, basket) in enumerate(f_eng_basket) ]

test_encoded.head()

Unnamed: 0,basket,totalAmount,returnLabel,existing,new,b_0,b_1,b_2,b_3,b_4,b_5,highestCategory,maxCategory,maxCount
0,"[4, 0, 3, 4, 1, 4, 3, 4]",80.0,1,0,1,1,1,0,2,4,0,4,4,4
1,[4],26.0,0,1,0,0,0,0,0,1,0,4,4,1
2,"[1, 3, 5]",147.0,0,1,0,0,1,0,1,0,1,5,1,1
3,[3],37.0,1,1,0,0,0,0,1,0,0,3,3,1
4,"[0, 0, 2, 5, 2]",375.0,0,1,0,2,0,2,0,0,1,5,0,2


In [35]:
from statistics import median

# Calculate median from the number of items in each basket
test_encoded['basketSizeMedian'] = median([len(basket) for basket in test_encoded['basket']])
test_encoded['basketElementCount'] = [len(basket) for basket in test_encoded['basket']]
test_encoded[['returnLabel', 'basketElementCount', 'basketSizeMedian']].head(n=20)

aboveMedian = []

for index, row in test_encoded.iterrows():
    if (row['basketElementCount'] > row['basketSizeMedian']): 
        valueToAppend = 1
    elif (row['basketElementCount'] == row['basketSizeMedian']): 
        valueToAppend = 0
    else:
        valueToAppend = -1
        
    aboveMedian.append(valueToAppend)
         
test_encoded['aboveMedian'] = aboveMedian

test_encoded['aboveMedian'].value_counts()

-1    2896
 1    2224
 0     756
Name: aboveMedian, dtype: int64

In [36]:
def all_same(items):
    return all(x == items[0] for x in items)

test_encoded['same'] = test_encoded.basket.map(lambda x: 1 if all_same(x)  else 0)

In [37]:
f_eng_basket = list(test_encoded.basket) # basket shortcut

test_encoded['averagePrice'] = [round(test_encoded.totalAmount.iloc[i]/len(basket), 2) for (i, basket) in enumerate(f_eng_basket) ]
test_encoded.head()

Unnamed: 0,basket,totalAmount,returnLabel,existing,new,b_0,b_1,b_2,b_3,b_4,b_5,highestCategory,maxCategory,maxCount,basketSizeMedian,basketElementCount,aboveMedian,same,averagePrice
0,"[4, 0, 3, 4, 1, 4, 3, 4]",80.0,1,0,1,1,1,0,2,4,0,4,4,4,5.0,8,1,0,10.0
1,[4],26.0,0,1,0,0,0,0,0,1,0,4,4,1,5.0,1,-1,1,26.0
2,"[1, 3, 5]",147.0,0,1,0,0,1,0,1,0,1,5,1,1,5.0,3,-1,0,49.0
3,[3],37.0,1,1,0,0,0,0,1,0,0,3,3,1,5.0,1,-1,1,37.0
4,"[0, 0, 2, 5, 2]",375.0,0,1,0,2,0,2,0,0,1,5,0,2,5.0,5,0,0,75.0


### 12. Scale the test features with a StandardScaler

> Use the sklearn scaler

1. Drop the target feature
2. Scale the remaining features

In [38]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Backup for the very end
test_backup = test_encoded

# Drop basket feature after using it in the feature eng. phase
test_encoded = test_encoded.drop(columns=['basket'])
# Drop our target
x_test = test_encoded.drop(columns=['returnLabel'])

# Our x/y values
X_test = scaler.fit_transform(x_test)
y_test = test_encoded['returnLabel'].values
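
The cell above fits a new `StandardScaler` on the test data. A common alternative is to reuse the statistics learned on the training data, so that training and test features are scaled identically. In addition, the test part creates `basketSizeMedian` before `basketElementCount`, the reverse of the training part (compare the two `head()` outputs), and since the models are fitted on plain arrays, the column order matters. The sketch below illustrates both points in plain Python with hypothetical toy data and made-up helper names (`fit_standardizer`, `transform`); with sklearn this would roughly correspond to calling `transform` on the scaler fitted in step 6 after aligning the columns. The metrics reported further down were produced by the cell above, not by this variant.

```python
from statistics import mean, pstdev

def fit_standardizer(rows, columns):
    """Learn mean and population standard deviation per column from training rows."""
    stats = {}
    for column in columns:
        values = [row[column] for row in rows]
        std = pstdev(values)
        stats[column] = (mean(values), std if std > 0 else 1.0)  # avoid division by zero
    return stats

def transform(rows, columns, stats):
    """Scale rows with previously learned statistics, in the given column order."""
    return [[(row[c] - stats[c][0]) / stats[c][1] for c in columns] for row in rows]

# Hypothetical toy data; note the different key order in the test row
train_rows = [{"totalAmount": 10.0, "maxCount": 1}, {"totalAmount": 30.0, "maxCount": 3}]
test_rows = [{"maxCount": 2, "totalAmount": 20.0}]

columns = ["totalAmount", "maxCount"]          # column order taken from the training data
stats = fit_standardizer(train_rows, columns)  # fitted on the training data only
print(transform(test_rows, columns, stats))    # [[0.0, 0.0]]
```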

### 13. Make a prediction on the test data with all three models, each using the best hyperparameters from cross-validation

#### 13.1 Prediction with logistic regression

In [39]:
lr_prediction = lr_model.predict(X_test)

#### 13.2 Prediction with the random forest

In [40]:
rf_prediction = rf_model.predict(X_test)

#### 13.3 Prediction with the gradient boosting tree

In [41]:
gb_prediction = gb_model.predict(X_test)

### 14. Compute accuracy, precision and recall for each of the three models

In [42]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

#### 14.1 Accuracy, precision & recall for the logistic regression model

In [43]:
lr_accuracy = accuracy_score(y_test, lr_prediction)
print(f'Logistic regression accuracy: {lr_accuracy}')

lr_precision = precision_score(y_test, lr_prediction)
print(f'Logistic regression precision: {lr_precision}')

lr_recall = recall_score(y_test, lr_prediction)
print(f'Logistic regression recall: {lr_recall}')

Logistic regression accuracy: 0.9063989108236896
Logistic regression precision: 0.8997308209959624
Logistic regression recall: 0.7692750287686997


#### 14.2 Accuracy, precision & recall for the random forest model

In [44]:
rf_accuracy = accuracy_score(y_test, rf_prediction)
print(f'Random forest accuracy: {rf_accuracy}')

rf_precision = precision_score(y_test, rf_prediction)
print(f'Random forest precision: {rf_precision}')

rf_recall = recall_score(y_test, rf_prediction)
print(f'Random forest recall: {rf_recall}')

Random forest accuracy: 0.9014635806671205
Random forest precision: 0.9245421245421246
Random forest recall: 0.7261219792865362


#### 14.3 Accuracy, precision & recall for the gradient boosting model

In [45]:
gb_accuracy = accuracy_score(y_test, gb_prediction)
print(f'Gradient Boost accuracy: {gb_accuracy}')

gb_precision = precision_score(y_test, gb_prediction)
print(f'Gradient Boost precision: {gb_precision}')

gb_recall = recall_score(y_test, gb_prediction)
print(f'Gradient Boost recall: {gb_recall}')

Gradient Boost accuracy: 0.9106535057862492
Gradient Boost precision: 0.8977049180327868
Gradient Boost recall: 0.7876869965477561
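
For reference, the three metrics can be derived from the counts of true positives, false positives and false negatives. The following plain-Python sketch uses hypothetical labels and a made-up helper name; it only illustrates the definitions and is not how sklearn computes them internally.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # returns predicted as returns
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-returns predicted as returns
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # returns that were missed
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # hypothetical true labels
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # hypothetical predictions
print(classification_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75)
```

Read this way, the results above say that roughly 90 % of the predicted returns are real returns (precision), while between about 73 % and 79 % of the real returns are found (recall).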


#### 14.4 Accuracy summary

In [46]:
calcPerc = lambda acc: round(acc * 100, 2)
print(f'New accuracies:\n\n1. LR: {calcPerc(lr_accuracy)} %\n2. RF: {calcPerc(rf_accuracy)} %\n3. GB: {calcPerc(gb_accuracy)} %')

New accuracies:

1. LR: 90.64 %
2. RF: 90.15 %
3. GB: 91.07 %


<hr>

## ANALYSIS 🔍

<hr>

### 15. Examine how many data points in the test data were misclassified by all three models

#### 15.1 Determine, **for each of the three models**, the **indices** of the **test data points** that the **respective model misclassified**.

> Get the indices of the wrongly guessed data points. The process is similar to the prediction process.

In [47]:
# Helper function to find wrongly predicted test data points (see docstring)

def getWrongClassfiedPredictions(inputs, predictions, labels):
    """
    Takes the input values that were predicted on,
    the predictions made by the model
    and the true labels to compare the predictions against.

    Note: the inputs are currently unused, but could be useful
    for debugging purposes.

    Returns a set with the indices of the test data points
    that the model misclassified.
    """

    wrong_predictions = set()
    for i, (prediction, label) in enumerate(zip(predictions, labels)):
        if prediction != label:
            wrong_predictions.add(i)
    return wrong_predictions

#### 15.1.1 Misclassified test data points of the logistic regression model

In [48]:
lr_wrong_predictions = getWrongClassfiedPredictions(X_test, lr_prediction, y_test)

#### 15.1.2 Misclassified test data points of the random forest model

In [49]:
rf_wrong_predictions = getWrongClassfiedPredictions(X_test, rf_prediction, y_test)

#### 15.1.3 Misclassified test data points of the gradient boosting model

In [50]:
clf_wrong_predictions = getWrongClassfiedPredictions(X_test, gb_prediction, y_test)

#### 15.2 Use Python's set class to determine the number of data points that were misclassified by all three models.

In [51]:
allWrongDataPoints = lr_wrong_predictions & rf_wrong_predictions & clf_wrong_predictions

# Use this as metric to improve model
print(f'{len(allWrongDataPoints)} values found wrong categorized in common by all models')

391 values found wrong categorized in common by all models
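
The set logic can be checked on a tiny example. The sketch below uses hypothetical labels and predictions for three models; the `&` operator keeps only the indices that every model got wrong, while `|` would collect the indices that at least one model got wrong.

```python
def misclassified_indices(predictions, labels):
    """Return the set of positions at which prediction and label differ."""
    return {i for i, (p, l) in enumerate(zip(predictions, labels)) if p != l}

labels = [1, 0, 1, 0, 1]   # hypothetical true labels
model_a = [1, 1, 0, 0, 1]  # hypothetical predictions of three models
model_b = [0, 1, 0, 0, 1]
model_c = [1, 1, 0, 1, 1]

wrong_a = misclassified_indices(model_a, labels)
wrong_b = misclassified_indices(model_b, labels)
wrong_c = misclassified_indices(model_c, labels)

wrong_by_all = wrong_a & wrong_b & wrong_c  # intersection
wrong_by_any = wrong_a | wrong_b | wrong_c  # union

print(sorted(wrong_by_all))  # [1, 2]
print(sorted(wrong_by_any))  # [0, 1, 2, 3]
```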


In [52]:
# Inspect the baskets that all three models misclassified
wrongBaskets = [test_backup['basket'].iloc[index] for index in allWrongDataPoints]
wrongBaskets[:10] # Just the first 10

[[4, 1, 1, 4, 3],
 [4],
 [3],
 [1, 4, 0, 4, 3, 3],
 [4, 2, 4, 4, 4, 0],
 [3, 2, 1, 0, 4],
 [2],
 [1],
 [1],
 [4]]