### 0. Helpers

> Helper functions used in upcoming sections

In [64]:
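import re  # RegEx
import shutil

# Tiny print helpers:

# Print the statement centered  ¯\_(ツ)_/¯
# shutil.get_terminal_size() falls back to 80 columns when no terminal is attached
# (e.g. in a notebook), where os.get_terminal_size() raises "Inappropriate ioctl for device"
centeredPrint = lambda statement: print(statement.center(shutil.get_terminal_size().columns))
# Get a separator line matching the length of the statement, e.g. text -> ----
getSeparator = lambda text: len(text) * "-"  # table top/bottom separators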

"""
Converts the string representation of a list into a list of ints:
e.g. the string "[0, 1, 2, 3]" -> the list [0, 1, 2, 3]

Let's deconstruct this:

1. Turn the string into a list of string elements by:
    * Removing the square brackets
    * Splitting the string at the commas
2. Convert each element of that list from string to int

Happy to simplify this if it causes confusion during the presentation 😅

"""
getIntListFromStringList = lambda stringList: [int(listElement) for listElement in re.sub(r"[\[\]]", "", stringList).split(',')]
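The two steps described in the docstring can also be written as a named function. The following is a minimal sketch; the name `parse_basket` is hypothetical and not part of the notebook, and `ast.literal_eval` from the standard library is shown only as a possible alternative, not as the method used here.

```python
import ast
import re

def parse_basket(basket_string):
    """Hypothetical equivalent of getIntListFromStringList."""
    # Step 1: remove the square brackets and split at the commas
    elements = re.sub(r"[\[\]]", "", basket_string).split(",")
    # Step 2: convert every element from string to int (int() ignores surrounding spaces)
    return [int(element) for element in elements]

print(parse_basket("[5, 3, 0, 3]"))      # → [5, 3, 0, 3]
# Possible alternative: let Python parse the list literal directly
print(ast.literal_eval("[5, 3, 0, 3]"))  # → [5, 3, 0, 3]
```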

<hr>

## TRAINING 🏋️‍♀️

<hr>

### 1. Load the training data.

In [65]:
import pandas as pd
pd.options.mode.chained_assignment = None

# train_df = pd.read_csv('datasets/train.csv')
train_df = pd.read_csv('sample_data/train.csv')

train_df.head()

Unnamed: 0,transactionId,basket,customerType,totalAmount,returnLabel
0,7934161612,[3],existing,77.0,0
1,5308629088,"[5, 3, 0, 3]",existing,64.0,0
2,1951363325,"[3, 3, 1, 4]",new,308.0,1
3,6713597713,[2],existing,74.0,0
4,8352683669,"[4, 4, 4, 4]",new,324.0,1


### 2. Fill in the missing values in the training data

#### 2.1 Analysis of missing data

> Analyze which data is missing

In [66]:
# Analysis of missing data

heading = "Total number of missing training values:"

print(getSeparator(heading))
print(heading)
print(train_df.isnull().sum()) # Check which values are null
print(getSeparator(heading))

----------------------------------------
Total number of missing training values:
transactionId      0
basket             0
customerType     517
totalAmount      484
returnLabel        0
dtype: int64
----------------------------------------


The missing `customerType` values are categorical (`existing` or `new`).

In the following, let's check whether it's worthwhile to replace missing values:

In [67]:
# Calculate the percentage of cases where the customer type is null
# Use this value to decide whether these values can be "safely" removed
# NOTE: count() excludes NaN, so the percentage is relative to the non-missing rows

customerDataSum = train_df["customerType"].count()
missingCustomerTypeData = train_df["customerType"].isnull().sum()

# Calculated values
customerType_isNull_percentage = round(missingCustomerTypeData * 100 / customerDataSum, 2)
missing_customerType_amount = round(customerDataSum*(customerType_isNull_percentage/100))

print(f'\nThe percentage of missing values in the {customerDataSum} big CustomerType dataset is only {str(customerType_isNull_percentage).replace(".", ",")} % = {str(missing_customerType_amount)} missing values')
print('==> Removing missing values should not have a big impact on the resulting model!')


The percentage of missing values in the 24483 big CustomerType dataset is only 2,11 % = 517 missing values
==> Removing missing values should not have a big impact on the resulting model!
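As a cross-check, the share of missing values per column can also be read off in one step. This is a small sketch on made-up data (the frame `demo` is an assumption, not the real training set). Unlike the cell above, it relates the missing values to all rows, not only to the non-missing ones.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for train_df
demo = pd.DataFrame({
    "customerType": ["new", None, "existing", "existing"],
    "totalAmount": [10.0, 20.0, np.nan, 40.0],
})

# isnull() yields booleans; their mean is the fraction of missing entries per column
missing_share = demo.isnull().mean() * 100
print(missing_share)
```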


#### 2.2 Actual filling/removing of data

1. Remove rows with missing `customerType` values
2. Fill missing `totalAmount` values with the mean

In [68]:
# TODO maybe as external function due to test code duplication?

# Drop customerType NaNs -> see the decision above
# .copy() gives an independent frame, so the assignment below does not write into a view of train_df
train_cleaned = train_df[train_df['customerType'].notna()].copy()

# Total amount: fill missing values with the mean
totalAmount_mean = train_cleaned['totalAmount'].mean()
train_cleaned['totalAmount'] = train_cleaned['totalAmount'].fillna(totalAmount_mean)

print(f'Total number of missing training values after cleaning up: {train_cleaned.isnull().sum().sum()}')

Total number of missing training values after cleaning up: 0
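The TODO about code duplication could be addressed with one shared helper for training and test data. The sketch below is only one possible shape; `clean_missing` and its `amount_mean` parameter are hypothetical names, and the data is made up.

```python
import numpy as np
import pandas as pd

def clean_missing(df, amount_mean=None):
    """Hypothetical helper: drop rows without customerType, fill totalAmount gaps."""
    cleaned = df[df["customerType"].notna()].copy()
    if amount_mean is None:
        # No mean supplied: compute it from the rows that remain
        amount_mean = cleaned["totalAmount"].mean()
    cleaned["totalAmount"] = cleaned["totalAmount"].fillna(amount_mean)
    return cleaned, amount_mean

demo = pd.DataFrame({
    "customerType": ["new", None, "existing", "existing"],
    "totalAmount": [10.0, 20.0, np.nan, 30.0],
})
demo_cleaned, demo_mean = clean_missing(demo)
print(demo_cleaned.isnull().sum().sum())  # → 0
```

Returning the mean makes it possible to pass the training mean in again when cleaning the test data, if that is the desired behaviour.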


### 3. Transform the categorical features using one-hot encoding

1. Find categorical features
2. One-Hot Encode categorical features found in step 1

#### 3.1 Find the categorical columns

> As seen in: https://stackoverflow.com/questions/29803093/check-which-columns-in-dataframe-are-categorical

In [69]:
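columns = train_cleaned.columns

# Columns with numerical data (public API instead of the private _get_numeric_data())
num_cols = train_cleaned.select_dtypes(include='number').columns

# Everything that is not numerical is treated as categorical (column order is preserved)
categorical_columns = [column for column in columns if column not in num_cols]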

print("Categorical attributes found: ", *categorical_columns, sep="\n* ")

Categorical attributes found: 
* basket
* customerType


#### 3.2 One-hot encode the customer type

> Use the pandas built-in function

In [70]:
one_hot_customerType = pd.get_dummies(train_cleaned['customerType'])

In [71]:
# TODO fix Inappropriate ioctl for device
# centeredPrint('One hot encoding for the customer type:\n\n')
print('One hot encoding for the customer type:\n\n')

one_hot_customerType.head()

One hot encoding for the customer type:




Unnamed: 0,existing,new
0,1,0
1,1,0
2,0,1
3,1,0
4,0,1


#### 3.3 One-hot encode the basket values

> We'll do this manually

<hr>
Get the max/min values in the basket feature lists
<hr>

In [72]:
# For each basket, get the min/max
# We get them as strings, therefore we use our helper function to convert them to int lists
min_basket_value = min(min(getIntListFromStringList(basket)) for basket in train_cleaned['basket'])
max_basket_value = max(max(getIntListFromStringList(basket)) for basket in train_cleaned['basket'])

print(f'The minimal value basket element is {min_basket_value} and the max basket value is {max_basket_value}')

The minimal value basket element is 0 and the max basket value is 5


<hr>
Create new features based on the elements in the basket:
<hr>

In [73]:
# NOTE: We assume that every value between min and max actually occurs in some basket.
#       We did NOT test this! Maybe test this to be on the safe side...
# TODO check whether all values are actually in the list

# List with basket elements
basketElements = list(range(min_basket_value, max_basket_value+1))

# Data frame with columns: 'b_0 | b_1 | ...' for each of our basket elements
one_hot_basket = pd.DataFrame([], columns=[f'b_{basketElement}' for basketElement in basketElements])

'''
Encode the basket feature.
For each basket element:
    1. Count how often the element occurs in the current basket
    2. Store that count in the element's column (0 if it is absent)
Note: this yields counts rather than a strict 0/1 one-hot encoding.
'''
for basketElement in basketElements:
    # Count on the parsed int list, so multi-digit categories would not be miscounted
    one_hot_basket[f'b_{basketElement}'] = train_cleaned['basket'].apply(lambda x: getIntListFromStringList(x).count(basketElement))

one_hot_basket.head()

Unnamed: 0,b_0,b_1,b_2,b_3,b_4,b_5
0,0,0,0,1,0,0
1,1,0,0,2,0,1
2,0,1,0,2,1,0
3,0,0,1,0,0,0
4,0,0,0,0,4,0
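The same per-category counts can be sketched with `collections.Counter`. The baskets and the category range below are made up for illustration, and the helper name `count_categories` is hypothetical.

```python
from collections import Counter

import pandas as pd

# Made-up baskets in the same string format as the 'basket' column
baskets = pd.Series(["[3]", "[5, 3, 0, 3]", "[4, 4, 4, 4]"])
categories = range(0, 6)  # assumed category range 0..5

def count_categories(basket_string):
    """Hypothetical helper: one dict of per-category counts for a single basket."""
    items = [int(token) for token in basket_string.strip("[]").split(",")]
    counts = Counter(items)
    return {f"b_{category}": counts.get(category, 0) for category in categories}

basket_counts = pd.DataFrame([count_categories(b) for b in baskets], index=baskets.index)
print(basket_counts)
```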


#### 3.4 Concatenate it back into the original dataframe

> Use the pandas built-in function

1. Concatenate
2. Clean up features that are no longer needed

In [74]:
train_encoded = pd.concat([train_cleaned, one_hot_customerType, one_hot_basket], axis=1)

# We'll keep basket, as it'll probably be helpful for the feature engineering, 
# We'll only convert it to an int list(from string) for better handling
train_encoded['basket'] = [getIntListFromStringList(basket) for basket in train_cleaned.basket]

# Clean up: Drop customerType as it is no longer needed
train_encoded = train_encoded.drop(columns=['customerType'])

### 4. Eliminate data attributes

> Remove data attributes that do not correlate with the target variable

To remove:

1. `transactionId`

In [75]:
# TODO better naming to dataframes
train_encoded = train_encoded.drop(columns=['transactionId'])
train_encoded.head()

Unnamed: 0,basket,totalAmount,returnLabel,existing,new,b_0,b_1,b_2,b_3,b_4,b_5
0,[3],77.0,0,1,0,0,0,0,1,0,0
1,"[5, 3, 0, 3]",64.0,0,1,0,1,0,0,2,0,1
2,"[3, 3, 1, 4]",308.0,1,0,1,0,1,0,2,1,0
3,[2],74.0,0,1,0,0,0,1,0,0,0
4,"[4, 4, 4, 4]",324.0,1,0,1,0,0,0,0,4,0


### 5. Try to build features based on the basket attribute (e.g. how often each category occurs in the basket).

> Wanted: features that help predict whether a customer returns an order

1. How often each category occurs in the basket (default from the assignment)
2. Number of items ordered
3. Order value below the median
4. New or existing customer
5. Whether the customer ordered many books of the same category
6. Whether a lot or a little was ordered
7. TODO come up with more

Sources/Helpers:

1. University documents:

* Theory:
    - General overview: https://github.com/daniel-vera-g/ml-kurs/blob/master/7_ML-Projekt_Demo.ipynb
    - Feature eng. overview: https://github.com/daniel-vera-g/ml-kurs/blob/master/3_Feature_Engineering.ipynb
    - Feature eng. PDF: file:///home/dvg/Desktop/relevant-ml/ML_Ue3_Feature_Engineering_Loesung.pdf
* Practice:
    - Sex feature: https://github.com/daniel-vera-g/ml-kurs/blob/master/self/A2-titanic-logistic-regression.ipynb -> Refer solution
    - Embarked Feature: https://github.com/daniel-vera-g/ml-kurs/blob/master/3_Titanic_Dataset_Features.ipynb -> Refer Solution

2. Ideas:

> See: https://towardsdatascience.com/market-basket-analysis-with-pandas-246fb8ee10a5

* Plot to show relation between huge amount in basket and return value
* One common technique is association rule learning which is a machine learning method to discover relationships among variables. Apriori algorithm is a frequently used algorithm for association rule learning.
* One way is to create combinations of items in each row and count the occurrences of each combination. The itertools of python can be used to accomplish this task.
* Also: https://heartbeat.fritz.ai/a-practical-guide-to-feature-engineering-in-python-8326e40747c8

TODO use different diagrams & plots to visualize features

#### 5.0 Feature 0: Number of elements in the basket

Unnamed: 0,basket,totalAmount,returnLabel,existing,new,b_0,b_1,b_2,b_3,b_4,b_5,basketElementCount
0,[3],77.0,0,1,0,0,0,0,1,0,0,1
1,"[5, 3, 0, 3]",64.0,0,1,0,1,0,0,2,0,1,4
2,"[3, 3, 1, 4]",308.0,1,0,1,0,1,0,2,1,0,4
3,[2],74.0,0,1,0,0,0,1,0,0,0,1
4,"[4, 4, 4, 4]",324.0,1,0,1,0,0,0,0,4,0,4
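The code cell that produced this table is not part of the export. Judging by the corresponding test-side cell, it presumably maps each parsed basket to its length; the sketch below shows that idea on a made-up frame and is an assumption, not the original cell.

```python
import pandas as pd

# Made-up stand-in for train_encoded with already parsed baskets
demo = pd.DataFrame({"basket": [[3], [5, 3, 0, 3], [4, 4, 4, 4]]})

# Number of items per basket
demo["basketElementCount"] = demo["basket"].map(len)
print(demo)
```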


#### 5.1 Feature 1: Orders with an above-average number of books in the basket

> Create a feature that is positive when the number of books in the basket is above the median

1. Above-average number of books: '1'
2. Below-average number of books: '0'

In [77]:
# TODO before & after test -> It does not really pay off?

#from statistics import median
#
## TODO visualize, whether returned & above median
## Visualize relationship between number items in basket and return feature
#
## Calculate median from the number of items in each basket
#train_encoded['basketSizeMedian'] = median([len(basket) for basket in train_encoded['basket']])
#train_encoded[['returnLabel', 'basketElementCount', 'basketSizeMedian']].head(n=20)
#
#returnedAboveMedian = []
#
#for index, row in train_encoded.iterrows():
#     returnedAboveMedian.append(1 if (row['returnLabel'] == 0 and row['basketElementCount'] >= row['basketSizeMedian']) else 0)
#         
#train_encoded['returnedAboveMedian'] = returnedAboveMedian
#
#train_encoded['returnedAboveMedian'].value_counts()
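A vectorized variant of the median flag could look like the sketch below. Unlike the commented-out draft, which also looks at `returnLabel`, it uses only the basket size, so the target does not leak into the feature. The column name `aboveMedianBasket` and the data are made up.

```python
import pandas as pd

# Made-up basket sizes
demo = pd.DataFrame({"basketElementCount": [1, 4, 4, 1, 8]})

# Median over all basket sizes
median_size = demo["basketElementCount"].median()

# 1 if the basket holds more items than the median, else 0
demo["aboveMedianBasket"] = (demo["basketElementCount"] > median_size).astype(int)
print(demo)
```

Whether ties with the median should count as 1 (`>=`) or 0 (`>`) is a design choice; the sketch uses the strict comparison.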

### 6. Scale the features with a StandardScaler.

> Use sklearn scaler

1. Drop target feature
2. Scale the remaining features

In [78]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Drop basket feature after using it in the feature eng. phase
train_encoded = train_encoded.drop(columns=['basket'])
# Drop our target
x = train_encoded.drop(columns=['returnLabel'])

# Our x/y values
X = scaler.fit_transform(x)
y = train_encoded['returnLabel'].values

### 7. Train the following classification models and try the given hyperparameters using cross-validation:

TODO understand all copied from: https://github.com/daniel-vera-g/ml-kurs/blob/master/7_ML-Projekt_Demo.ipynb

For all models:

1. Use grid search to find best parameter values
2. Train model

In [79]:
# Helper func to find best params
from sklearn.model_selection import GridSearchCV

def findBestParams(model, params, X, y):
    '''
    Takes a model, possible parameters to test, X and y values.
    Determines the best parameter values for the model via a
    cross-validated grid search and returns them in the order
    of the keys in the parameter grid.
    '''

    clf = GridSearchCV(estimator=model, param_grid=params, n_jobs=-1)
    clf = clf.fit(X, y)

    # Read the winning values straight from the search result (no eval needed)
    paramKeys = list(params[0].keys()) # Get Param Names = Keys
    paramsToDetermine = [clf.best_params_[param] for param in paramKeys]

    print(f'Best {model} values: {paramsToDetermine}')

    return paramsToDetermine

#### 7.1 Logistic regression: `C: [0.1,1,4,5,6,10,30,100]` and `penalty: ["l1", "l2"]`

In [80]:
from sklearn.linear_model import LogisticRegression

lr_C, lr_penalty= findBestParams(model=LogisticRegression(), params=[{'C': [0.1,1,4,5,6,10,30,100], 'penalty': ["l1", "l2"]}], X=X, y=y)
lr_model = LogisticRegression(max_iter=1000, random_state=0, penalty=lr_penalty ,C=lr_C)
lr_model.fit(X, y)

Best LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False) values: [30, 'l2']


LogisticRegression(C=30, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
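One caveat with this grid: `LogisticRegression` defaults to the `lbfgs` solver, which supports only the `l2` penalty (or none), so the `l1` candidates presumably could not be fitted in the search above. A possible way to evaluate both penalties is a solver that supports them, for example `liblinear`. The sketch below uses synthetic data from `make_classification` and a shortened grid; it is an illustration under these assumptions, not a rerun of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Shortened grid; liblinear can fit both l1 and l2 penalties
param_grid = {"C": [0.1, 1, 10], "penalty": ["l1", "l2"]}
search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid=param_grid,
    cv=5,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```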

#### 7.2 Random Forest: `n_estimators: [60,80,100,120,140]` and `max_depth: [2, 3, 4, 5]`

In [81]:
from sklearn.ensemble import RandomForestClassifier

rf_n_est, rf_max_depth = findBestParams(model=RandomForestClassifier(), params=[{'n_estimators': [60,80,100,120,140], 'max_depth': [2, 3, 4, 5]}], X=X, y=y)
rf_model = RandomForestClassifier(random_state=0, max_depth=rf_max_depth, n_estimators=rf_n_est)
rf_model.fit(X,y)

Best RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False) values: [140, 5]


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=5, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=140,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

#### 7.3 Gradient Boosting Tree: same hyperparameters as for Random Forest

In [82]:
from sklearn.ensemble import GradientBoostingClassifier

gb_n_est, gb_max_depth = findBestParams(model=GradientBoostingClassifier(), params=[{'n_estimators': [60,80,100,120,140], 'max_depth': [2, 3, 4, 5]}], X=X, y=y)
gb_model = GradientBoostingClassifier(random_state=0, max_depth=gb_max_depth, n_estimators=gb_n_est)
gb_model.fit(X,y)

Best GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False) values: [100, 4]


GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=4,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=0, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

<hr>

## TESTING 🧪

<hr>

### 8. Load the test data

> The procedure for the test data is almost the same. To tell them apart, each variable gets a `test_[...]` prefix.

* TODO Export functions to only call them once & avoid code duplication between train & test procedure

In [83]:
# test_df = pd.read_csv('datasets/test.csv')
test_df = pd.read_csv('sample_data/test.csv')

test_df.head()

Unnamed: 0,transactionId,basket,customerType,totalAmount,returnLabel
0,9605027322,"[4, 0, 3, 4, 1, 4, 3, 4]",new,80.0,1
1,8315649406,[4],existing,26.0,0
2,5151646801,"[1, 3, 5]",existing,147.0,0
3,8101967972,[3],existing,37.0,1
4,2887044104,"[0, 0, 2, 5, 2]",existing,375.0,0


### 9. Remove all rows with missing values.

1. Analysis of missing data
2. Remove missing data

In [84]:
# Analysis of missing data

# TODO remove this duplicated code into own function(Reduce code duplication!)

test_heading = "Total number of missing test values:"

print(getSeparator(test_heading))
print(test_heading)
print(test_df.isnull().sum()) # Check which values are null
print(getSeparator(test_heading))

# Calculate the percentage of cases where the customer type is null
# Use this value to decide whether these values can be "safely" removed
# NOTE: count() excludes NaN, so the percentage is relative to the non-missing rows

test_customerDataSum = test_df["customerType"].count()
test_missingCustomerTypeData = test_df["customerType"].isnull().sum()

# Calculated values
test_customerType_isNull_percentage = round(test_missingCustomerTypeData * 100 / test_customerDataSum, 2)
test_missing_customerType_amount = round(test_customerDataSum*(test_customerType_isNull_percentage/100))

print(f'\nThe percentage of missing test values in the {test_customerDataSum} big CustomerType dataset is only {str(test_customerType_isNull_percentage).replace(".", ",")} % = {str(test_missing_customerType_amount)} missing values')
print('==> Removing missing values should not have a big impact!')

------------------------------------
Total number of missing test values:
transactionId      0
basket             0
customerType     124
totalAmount      134
returnLabel        0
dtype: int64
------------------------------------

The percentage of missing test values in the 5876 big CustomerType dataset is only 2,11 % = 124 missing values
==> Removing missing values should not have a big impact!


In [85]:
# TODO maybe as external function due to test code duplication?
# TODO ask/check whether filling would not be more effective/better
# TODO if filling better, create function to reduce code duplication

# Drop customerType NaNs -> see the decision above
# .copy() gives an independent frame, so the assignment below does not write into a view of test_df
test_cleaned = test_df[test_df['customerType'].notna()].copy()

# Total amount: fill missing values with the mean
test_totalAmount_mean = test_cleaned['totalAmount'].mean()
test_cleaned['totalAmount'] = test_cleaned['totalAmount'].fillna(test_totalAmount_mean)

print(f'Total number of missing test values after cleaning up: {test_cleaned.isnull().sum().sum()}')

Total number of missing test values after cleaning up: 0


### 10. Transform the categorical features using one-hot encoding

1. Find the categorical features
2. One-hot encode the categorical features found in step 1

#### 10.1 Find the categorical columns

> As seen in: https://stackoverflow.com/questions/29803093/check-which-columns-in-dataframe-are-categorical

In [86]:
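test_columns = test_cleaned.columns

# Columns with numerical data (public API instead of the private _get_numeric_data())
test_num_cols = test_cleaned.select_dtypes(include='number').columns

# Everything that is not numerical is treated as categorical (column order is preserved)
test_categorical_columns = [column for column in test_columns if column not in test_num_cols]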

print("Categorical test attributes found: ", *test_categorical_columns, sep="\n* ")

Categorical test attributes found: 
* basket
* customerType


#### 10.2 One-hot encode the customer type

> Use the pandas built-in function

In [87]:
test_one_hot_customerType = pd.get_dummies(test_cleaned['customerType'])

In [88]:
# TODO fix Inappropriate ioctl for device
# centeredPrint('One hot encoding for the customer type:\n\n')
print('One hot encoding for the customer type:\n\n')

test_one_hot_customerType.head()

One hot encoding for the customer type:




Unnamed: 0,existing,new
0,0,1
1,1,0
2,1,0
3,1,0
4,1,0
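Because the training and test encodings are computed independently, a category that is absent from the test data would silently produce a frame with fewer columns. One way to guard against that is to align the test columns to the training columns with `reindex`. The series below are made up for illustration.

```python
import pandas as pd

# Made-up customer types; the test sample happens to contain no "new" customers
train_dummies = pd.get_dummies(pd.Series(["existing", "new", "existing"]))
test_dummies = pd.get_dummies(pd.Series(["existing", "existing"]))

# Force the test frame onto the training columns; missing categories become 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```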


#### 10.3 One-hot encode the basket values

> We'll do this manually

<hr>
Get the max/min values in the basket feature lists
<hr>

In [89]:
# For each basket, get the min/max
# We get them as strings, therefore we use our helper function to convert them to int lists
test_min_basket_value = min(min(getIntListFromStringList(basket)) for basket in test_cleaned['basket'])
test_max_basket_value = max(max(getIntListFromStringList(basket)) for basket in test_cleaned['basket'])

print(f'The minimal test value basket element is {test_min_basket_value} and the max basket value is {test_max_basket_value}')

The minimal test value basket element is 0 and the max basket value is 5


<hr>
Create new features based on the elements in the test basket:
<hr>

In [90]:
# List with basket elements
# NOTE: We assume that every value between min and max actually occurs in some basket.
#       We did NOT test this! Maybe test this to be on the safe side...
# TODO check whether all values are actually in the list
test_basketElements = list(range(test_min_basket_value, test_max_basket_value+1))

# Data frame with columns: 'b_0 | b_1 | ...' for each of our basket elements
test_one_hot_basket = pd.DataFrame([], columns=[f'b_{test_basketElement}' for test_basketElement in test_basketElements])

'''
Encode the basket feature.
For each basket element:
    1. Count how often the element occurs in the current basket
    2. Store that count in the element's column (0 if it is absent)
Note: this yields counts rather than a strict 0/1 one-hot encoding.
'''
for test_basketElement in test_basketElements:
    # Count on the parsed int list, so multi-digit categories would not be miscounted
    test_one_hot_basket[f'b_{test_basketElement}'] = test_cleaned['basket'].apply(lambda x: getIntListFromStringList(x).count(test_basketElement))

test_one_hot_basket.head()

Unnamed: 0,b_0,b_1,b_2,b_3,b_4,b_5
0,1,1,0,2,4,0
1,0,0,0,0,1,0
2,0,1,0,1,0,1
3,0,0,0,1,0,0
4,2,0,2,0,0,1


#### 10.4 Concatenate it back into the original dataframe

> Use the pandas built-in function

1. Concatenate
2. Clean up features that are no longer needed

In [91]:
test_encoded = pd.concat([test_cleaned, test_one_hot_customerType, test_one_hot_basket], axis=1)

# We'll keep basket, as it'll probably be helpful for the feature engineering, 
# We'll only convert it to an int list(from string) for better handling
test_encoded['basket'] = [getIntListFromStringList(basket) for basket in test_cleaned.basket]

# Clean up: Drop customerType as it is no longer needed
test_encoded = test_encoded.drop(columns=['customerType'])

### 11. Eliminate test data attributes

> Remove test data attributes that do not correlate with the target variable

To remove:

1. `transactionId`

### 11. Add features

In [92]:
test_encoded['basketElementCount'] = test_encoded.basket.map(len)
test_encoded.head()

Unnamed: 0,transactionId,basket,totalAmount,returnLabel,existing,new,b_0,b_1,b_2,b_3,b_4,b_5,basketElementCount
0,9605027322,"[4, 0, 3, 4, 1, 4, 3, 4]",80.0,1,0,1,1,1,0,2,4,0,8
1,8315649406,[4],26.0,0,1,0,0,0,0,0,1,0,1
2,5151646801,"[1, 3, 5]",147.0,0,1,0,0,1,0,1,0,1,3
3,8101967972,[3],37.0,1,1,0,0,0,0,1,0,0,1
4,2887044104,"[0, 0, 2, 5, 2]",375.0,0,1,0,2,0,2,0,0,1,5


In [93]:
# TODO better naming to dataframes
test_encoded = test_encoded.drop(columns=['transactionId'])
test_encoded.head()

Unnamed: 0,basket,totalAmount,returnLabel,existing,new,b_0,b_1,b_2,b_3,b_4,b_5,basketElementCount
0,"[4, 0, 3, 4, 1, 4, 3, 4]",80.0,1,0,1,1,1,0,2,4,0,8
1,[4],26.0,0,1,0,0,0,0,0,1,0,1
2,"[1, 3, 5]",147.0,0,1,0,0,1,0,1,0,1,3
3,[3],37.0,1,1,0,0,0,0,1,0,0,1
4,"[0, 0, 2, 5, 2]",375.0,0,1,0,2,0,2,0,0,1,5


### 12. Scale the test features with a StandardScaler.

> Use sklearn scaler

1. Drop target feature
2. Scale the remaining features

In [94]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Backup for the very end
test_backup = test_encoded

# Drop basket feature after using it in the feature eng. phase
test_encoded = test_encoded.drop(columns=['basket'])
# Drop our target
x_test = test_encoded.drop(columns=['returnLabel'])

# Our x/y values
X_test = scaler.fit_transform(x_test)
y_test = test_encoded['returnLabel'].values
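Note that the cell above fits a fresh scaler on the test data. A common alternative is to fit the scaler on the training features only and reuse its mean and standard deviation for the test features, so both sets are scaled identically. The sketch below shows this on made-up arrays; `demo_scaler` and the values are assumptions, and the metrics further down were not produced this way.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices standing in for x and x_test
train_features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_features = np.array([[2.0, 40.0]])

demo_scaler = StandardScaler()
# Learn mean and standard deviation from the training data only
train_scaled = demo_scaler.fit_transform(train_features)
# Apply the training statistics to the test data (no refitting)
test_scaled = demo_scaler.transform(test_features)
print(test_scaled)
```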

### 13. Make a prediction on the test data with all three models, each using the best hyperparameters from cross-validation.

#### 13.1 Prediction with logistic regression

In [95]:
lr_prediction = lr_model.predict(X_test)

#### 13.2 Prediction with random forest

In [96]:
rf_prediction = rf_model.predict(X_test)

#### 13.3 Prediction with gradient boosting tree

In [97]:
gb_prediction = gb_model.predict(X_test)

### 14 Compute accuracy, precision and recall for each of the three models.

In [98]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

#### 14.1 Accuracy, precision & recall for the logistic regression model

In [99]:
# TODO Preeeettty sure something went wrong here...cause it's tooo good to be true ¯\_(ツ)_/¯

# TODO save old accuracy to compare
# lr_accuracy_old = lr_accuracy
lr_accuracy = accuracy_score(y_test, lr_prediction)
print(f'Logistic regression accuracy: {lr_accuracy}')

lr_precision = precision_score(y_test, lr_prediction)
print(f'Logistic regression precision: {lr_precision}')

lr_recall = recall_score(y_test, lr_prediction)
print(f'Logistic regression recall: {lr_recall}')

Logistic regression accuracy: 0.9108236895847516
Logistic regression precision: 0.9052069425901201
Logistic regression recall: 0.7802071346375143


#### 14.2 Accuracy, precision & recall for the random forest model

In [100]:
# TODO Preeeettty sure something went wrong here...cause it's tooo good to be true ¯\_(ツ)_/¯

# TODO save old accuracy to compare
# rf_accuracy_old = rf_accuracy
rf_accuracy = accuracy_score(y_test, rf_prediction)
print(f'Random forest accuracy: {rf_accuracy}')

rf_precision = precision_score(y_test, rf_prediction)
print(f'Random forest precision: {rf_precision}')

rf_recall = recall_score(y_test, rf_prediction)
print(f'Random forest recall: {rf_recall}')

Random forest accuracy: 0.89499659632403
Random forest precision: 0.9321511179645335
Random forest recall: 0.6956271576524741


#### 14.3 Accuracy, precision & recall for the gradient boosting model

In [101]:
# TODO Preeeettty sure something went wrong here...cause it's tooo good to be true ¯\_(ツ)_/¯

# TODO save old accuracy to compare
# gb_accuracy_old = gb_accuracy
gb_accuracy = accuracy_score(y_test, gb_prediction)
print(f'Gradient Boost accuracy: {gb_accuracy}')

gb_precision = precision_score(y_test, gb_prediction)
print(f'Gradient Boost precision: {gb_precision}')

gb_recall = recall_score(y_test, gb_prediction)
print(f'Gradient Boost recall: {gb_recall}')

Gradient Boost accuracy: 0.9138869979577944
Gradient Boost precision: 0.90473061760841
Gradient Boost recall: 0.7922899884925202
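The three metric cells repeat the same pattern, so they could be folded into one helper. The function name `report_metrics` is hypothetical, and the labels below are made up so the numbers can be checked by hand: 2 true positives, 1 false positive, 1 false negative and 1 true negative give an accuracy of 3/5 and a precision and recall of 2/3 each.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

def report_metrics(name, y_true, y_pred):
    """Hypothetical helper: print and return accuracy, precision and recall."""
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }
    for metric_name, value in metrics.items():
        print(f"{name} {metric_name}: {value}")
    return metrics

# Made-up labels and predictions
demo_metrics = report_metrics("Demo model", [1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```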


#### 14.4 Accuracy summary

In [102]:
calcPerc = lambda acc: round(acc * 100, 2)

# print(f'Old accuracies:\n\n1. LR: {calcPerc(lr_accuracy_old)} %\n2. RF: {calcPerc(rf_accuracy_old)} %\n3. GB: {calcPerc(gb_accuracy_old)} %')

# print(f'\n---\n')

print(f'New accuracies:\n\n1. LR: {calcPerc(lr_accuracy)} %\n2. RF: {calcPerc(rf_accuracy)} %\n3. GB: {calcPerc(gb_accuracy)} %')

# print(f'\n---\n')

# print(f'The improvement/decrease of accuracy is:\n\n1. LR: {calcPerc((lr_accuracy- lr_accuracy_old))} %\n2. RF: {calcPerc((rf_accuracy- rf_accuracy_old))} %\n3. GB: {calcPerc((gb_accuracy- gb_accuracy_old))} %')

New accuracies:

1. LR: 91.08 %
2. RF: 89.5 %
3. GB: 91.39 %


<hr>

## ANALYSIS 🔍

<hr>

### 15 Investigate how many data points in the test data were misclassified by all three models:

#### 15.1 Determine **for each of the three models** the **indices** of the **test data points** that the **respective model misclassified**.

> Get the indices of the wrongly guessed data points; the process is similar to the prediction process

In [103]:
# Helper function to find wrongly predicted test data points (see docstring)

def getWrongClassfiedPredictions(inputs, predictions, labels):
    """
    Takes in the input values to predict on,
    the predictions made by the model
    and the right labels to compare the predictions with.

    Returns a set with the indices of the test data points
    that the model guessed wrong.
    """

    wrong_predictions = set()
    # inputs is only kept for call compatibility; the comparison needs predictions and labels
    for i, (prediction, label) in enumerate(zip(predictions, labels)):
        if prediction != label:
            wrong_predictions.add(i)
    return wrong_predictions

#### 15.1.1 Misclassified test data points of the logistic regression model

In [104]:
lr_wrong_predictions = getWrongClassfiedPredictions(X_test, lr_prediction, y_test)

#### 15.1.2 Misclassified test data points of the random forest model

In [105]:
rf_wrong_predictions = getWrongClassfiedPredictions(X_test, rf_prediction, y_test)

#### 15.1.3 Misclassified test data points of the gradient boosting model

In [106]:
clf_wrong_predictions = getWrongClassfiedPredictions(X_test, gb_prediction, y_test)

#### 15.2 Use Python's set class to determine the number of data points that were misclassified by all three models.

In [107]:
allWrongDataPoints = lr_wrong_predictions & rf_wrong_predictions & clf_wrong_predictions

# TODO use this as metric to improve model -> Bring number down!
print(f'{len(allWrongDataPoints)} values found wrong categorized in common by all models')

395 values found wrong categorized in common by all models
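The same idea can be sketched with NumPy: compare predictions and labels element-wise, collect the positions that differ, and intersect the resulting sets. The arrays and the helper name `wrong_indices` are made up for illustration.

```python
import numpy as np

# Made-up labels and predictions of three models
y_true = np.array([0, 1, 1, 0, 1])
pred_a = np.array([0, 0, 1, 1, 1])
pred_b = np.array([1, 0, 1, 1, 1])
pred_c = np.array([0, 0, 0, 1, 0])

def wrong_indices(predictions, labels):
    """Hypothetical helper: positions where prediction and label differ."""
    return set(np.flatnonzero(predictions != labels).tolist())

# Data points that every model got wrong
common_wrong = (
    wrong_indices(pred_a, y_true)
    & wrong_indices(pred_b, y_true)
    & wrong_indices(pred_c, y_true)
)
print(sorted(common_wrong))  # → [1, 3]
```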


In [108]:
# TODO find better pattern & use this 

# TODO use this as metric to improve model -> Search patterns and feature engineer better model

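# Check out the baskets that all three models misclassified
# (plain loop instead of a list comprehension, so no list of None values is returned)
for index in allWrongDataPoints:
    print(test_backup['basket'].iloc[index])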

[4, 1, 1, 4, 3]
[4]
[3]
[1, 4, 0, 4, 3, 3]
[4, 2, 4, 4, 4, 0]
[3, 2, 1, 0, 4]
[2]
[4, 0, 3, 1, 4, 1, 4]
[3]
[1]
[1]
[4]
[2, 2, 4]
[2, 3, 4]
[0]
[0, 4, 3]
[2, 4, 0, 3, 4]
[1, 3, 1, 2, 4, 3]
[4]
[4, 1, 4, 3, 0, 2, 2]
[3]
[1, 4, 0]
[4, 4, 4, 3, 2]
[2, 0, 4]
[1, 2, 4, 2]
[3]
[4, 4, 4, 0, 4]
[3, 2, 4, 2, 4]
[3, 3, 0]
[1, 4, 2, 2]
[2, 2, 3, 4, 3, 2]
[3, 0]
[3]
[1, 4, 3, 1]
[3, 0, 1, 4]
[1]
[4]
[2]
[4]
[4, 1]
[1]
[4, 1, 3, 0]
[1]
[4]
[0]
[3]
[4]
[1, 3, 0]
[3]
[4]
[3, 2, 3, 3]
[1, 0, 2, 3, 3, 0, 4, 3]
[3]
[2, 0, 4, 4, 0]
[2, 4, 1, 0, 4, 0, 4]
[1, 4]
[4, 1, 4, 0]
[1, 3, 0, 3, 1, 4, 4, 4, 3]
[4]
[4, 1, 2, 2, 3]
[2, 4, 4, 2, 0, 3]
[4, 1]
[4]
[3, 3, 3, 2, 3]
[1, 2, 0, 2]
[3, 3, 3, 2, 2]
[4, 0, 4, 4, 1]
[3, 1, 0, 1, 0, 0]
[3, 1, 3, 1, 3]
[3, 0, 3]
[0, 4, 4, 1, 3, 0]
[3, 0]
[4, 3, 4, 4]
[4, 0]
[4, 1]
[3, 4, 3, 2, 1, 0]
[4, 2, 4, 3, 4, 4]
[3, 2]
[4]
[2, 3, 0]
[4]
[0, 4, 3, 4, 4]
[0]
[1]
[2]
[4]
[2, 3, 3, 1, 3]
[3]
[4]
[4, 3, 1, 0, 2]
[2, 3, 1, 2, 4, 0, 4, 1, 2]
[3]
[0]
[4, 2, 3, 2, 2, 3, 3, 3]
[4, 2,
