### 1. Load the training data.

Used for:

* Model training
* Cross-validation

Order data:

1. transactionId : Order ID randomly generated by our online system.
2. basket : Which books were ordered (see next slide).
3. customerType : Is the customer new, or have they ordered from us before?
4. totalAmount : What is the total value of the goods in the order?
5. returnLabel : Whether the item was returned (= 1) or kept (= 0).

In [52]:
import pandas as pd
pd.options.mode.chained_assignment = None

train_df = pd.read_csv('datasets/train.csv')

train_df.head() # See description above for column description

Unnamed: 0,transactionId,basket,customerType,totalAmount,returnLabel
0,7934161612,[3],existing,77.0,0
1,5308629088,"[5, 3, 0, 3]",existing,64.0,0
2,1951363325,"[3, 3, 1, 4]",new,308.0,1
3,6713597713,[2],existing,74.0,0
4,8352683669,"[4, 4, 4, 4]",new,324.0,1


### 2. Fill in the missing values in the training data
1. Analyze the missing data
2. Fill or remove the missing data

In [119]:
# Analysis
heading = "Total number of missing values:"
getSeparator = lambda text: len(text)*"-"

print(getSeparator(heading)) # Just print a nice separator line xD
print(heading)
print(train_df.isnull().sum())
print(getSeparator(heading))

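# Note: count() ignores NaN, so this is the ratio of missing to non-missing values (divide by len(train_df) for the share of all rows)
customerType_isNull_perc = str(round(train_df["customerType"].isnull().sum() * 100 / train_df["customerType"].count(), 2)).replace('.', ',')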
print(f'\nThe number of missing values in CustomerType is only {customerType_isNull_perc} % ==> Removing missing values should not have a big impact on the resulting model!')

-------------------------------
Total number of missing values:
transactionId      0
basket             0
customerType     517
totalAmount      484
returnLabel        0
dtype: int64
-------------------------------

The number of missing values in CustomerType is only 2,11 % ==> Removing missing values should not have a big impact on the resulting model!
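The percentage above divides by `count()`, which skips missing entries. A minimal sketch of the difference between that ratio and the share of all rows, using a small made-up frame (the `toy` data is hypothetical, not taken from `train.csv`):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from train.csv)
toy = pd.DataFrame({
    "customerType": ["new", None, "existing", "existing"],
    "totalAmount": [10.0, 20.0, np.nan, np.nan],
})

# isnull().mean() divides the number of missing entries by the total row count
missing_share = toy.isnull().mean() * 100

# count() skips NaN, so this is the ratio of missing to non-missing entries
missing_ratio = toy.isnull().sum() / toy.count() * 100

print(missing_share.round(2))
print(missing_ratio.round(2))
```

For small missing rates the two numbers are close, but they drift apart as more values are missing.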


In [86]:
# Clean up
# Customer type
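# Drop rows without a customer type; copy() avoids modifying a view of train_df
train_clean = train_df[train_df['customerType'].notna()].copy()

# Total amount: fill the gaps with the median of the remaining rows
totalAmount_median = train_clean['totalAmount'].median()
train_clean['totalAmount'] = train_clean['totalAmount'].fillna(totalAmount_median)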

print(f'Total number of missing values after replacing median: {train_clean.isnull().sum().sum()}')

Total number of missing values after replacing median: 0
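Whether the mean or the median is the better fill value depends on how skewed the order values are. A small sketch on made-up numbers (the `toy` frame is hypothetical) shows how a single large order shifts the mean but not the median, and how to fill without `inplace=True`:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one large order skews the mean
toy = pd.DataFrame({"totalAmount": [10.0, 20.0, 30.0, 1000.0, np.nan]})

mean_value = toy["totalAmount"].mean()      # pulled up by the large order
median_value = toy["totalAmount"].median()  # robust to the large order

# Assign the result back instead of using inplace=True on a slice
filled = toy.copy()
filled["totalAmount"] = filled["totalAmount"].fillna(median_value)

print(mean_value, median_value)
print(filled)
```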


### 3. Transform the categorical features using one-hot encoding

1. Identify the categorical features
2. One-hot encode the categorical features from step 1

In [135]:
# Get a list/Find out categorical columns
# As seen in: https://stackoverflow.com/questions/29803093/check-which-columns-in-dataframe-are-categorical

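columns = train_clean.columns
# Columns with numerical data
num_cols = train_clean.select_dtypes(include='number').columns
# Now subtract the numerical columns from all columns (Pretty cool :D)
# Sets are unordered, so sort (descending) to get a fixed order: customerType first, then basket
categorical_columns = sorted(set(columns) - set(num_cols), reverse=True)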

print("Categorical attributes found: ", *categorical_columns, sep="\n* ") # This printing also nice  ʘ‿ʘ 

Categorical attributes found: 
* customerType
* basket


In [136]:
# Transform with one hot encoding
# We'll use a Python list comprehension here, cause why not
# dtype=int keeps the 0/1 representation (newer pandas versions default to booleans)
one_hot_features = [pd.get_dummies(train_clean[attribute], dtype=int) for attribute in categorical_columns]

In [140]:
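# Just a tiny helper for getting the width and print the statement centered  ¯\_(ツ)_/¯ 
# shutil.get_terminal_size() falls back to a default width when no terminal is attached (e.g. in a notebook)
import shutil
centeredPrint = lambda statement: print(statement.center(shutil.get_terminal_size().columns))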

In [141]:
centeredPrint('One hot encoding for the customer type:\n\n')
one_hot_features[0].head()

                                                                  One hot encoding for the customer type:

                                                                  


Unnamed: 0,existing,new
0,1,0
1,1,0
2,0,1
3,1,0
4,0,1


In [142]:
centeredPrint('One hot encoding for the basket:\n\n')
one_hot_features[1].head()

                                                                      One hot encoding for the basket:

                                                                     


Unnamed: 0,"[0, 0, 0, 0, 1, 0]","[0, 0, 0, 0, 2, 1]","[0, 0, 0, 0, 4, 0]","[0, 0, 0, 0, 4, 2, 3, 2, 2]","[0, 0, 0, 0, 4, 5, 2, 2]","[0, 0, 0, 0, 5]","[0, 0, 0, 0]","[0, 0, 0, 1, 0, 4, 2, 3, 4]","[0, 0, 0, 1, 1, 1]","[0, 0, 0, 1, 1, 3]",...,"[5, 5, 5, 4]","[5, 5, 5, 5, 0, 2, 1, 3]","[5, 5, 5, 5, 2, 3, 2]","[5, 5, 5, 5, 4, 0, 5, 5]","[5, 5, 5, 5, 4, 3]","[5, 5, 5, 5, 4, 4, 5, 5]","[5, 5, 5, 5]","[5, 5, 5]","[5, 5]",[5]
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


1. Each column in the customer type encoding corresponds to one customer type
2. Each column in the basket one-hot encoding corresponds to one distinct basket representation
...You don't say... ఠ_ఠ
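Since every distinct value becomes its own column, it can help to check the cardinality of a feature before encoding it. A minimal sketch on made-up rows (the `toy` frame and the `customerType` prefix are illustrative assumptions):

```python
import pandas as pd

# Toy data (hypothetical values) mimicking the two categorical columns
toy = pd.DataFrame({
    "customerType": ["existing", "new", "existing", "new"],
    "basket": ["[3]", "[5, 3, 0, 3]", "[3, 3, 1, 4]", "[2]"],
})

# Number of distinct values = number of one-hot columns per feature
cardinality = toy.nunique()

# prefix keeps the source column visible in the new column names
customer_dummies = pd.get_dummies(toy["customerType"], prefix="customerType", dtype=int)

print(cardinality)
print(customer_dummies.columns.tolist())
```

A column such as `basket`, where almost every row has its own value, produces nearly one column per row and is usually a better candidate for hand-built features than for one-hot encoding.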

Now concatenate it back into the original dataframe:

In [143]:
train_all = pd.concat([train_clean, one_hot_features[0], one_hot_features[1]], axis=1)
train_all.head()

Unnamed: 0,transactionId,basket,customerType,totalAmount,returnLabel,existing,new,"[0, 0, 0, 0, 1, 0]","[0, 0, 0, 0, 2, 1]","[0, 0, 0, 0, 4, 0]",...,"[5, 5, 5, 4]","[5, 5, 5, 5, 0, 2, 1, 3]","[5, 5, 5, 5, 2, 3, 2]","[5, 5, 5, 5, 4, 0, 5, 5]","[5, 5, 5, 5, 4, 3]","[5, 5, 5, 5, 4, 4, 5, 5]","[5, 5, 5, 5]","[5, 5, 5]","[5, 5]",[5]
0,7934161612,[3],existing,77.0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,5308629088,"[5, 3, 0, 3]",existing,64.0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1951363325,"[3, 3, 1, 4]",new,308.0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,6713597713,[2],existing,74.0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,8352683669,"[4, 4, 4, 4]",new,324.0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 4. Try to build features based on the basket attribute (e.g., how often each category occurs in the basket).

In [None]:
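# Work in progress

# Note: basket is read from the CSV as a string, so len(x) would return the string length, not the number of items
#df["newFeature"] = df.basket.map(lambda x: len(x))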
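One possible way to build the per-category counts suggested in the heading is to parse each basket string into a list and count its entries. This is only a sketch under assumptions: the categories are taken to be the integers 0 to 5 (as in the sample rows), and the names `count_categories`, `basket_cat_*` and `basket_size` are made up for illustration.

```python
import ast
from collections import Counter

import pandas as pd

# Toy data (hypothetical rows in the same format as the basket column)
toy = pd.DataFrame({"basket": ["[3]", "[5, 3, 0, 3]", "[4, 4, 4, 4]"]})

# Assumption: categories are the integers 0..5, as in the sample rows
CATEGORIES = range(6)

def count_categories(basket_str):
    """Parse the basket string and count how often each category occurs."""
    counts = Counter(ast.literal_eval(basket_str))
    return pd.Series({f"basket_cat_{c}": counts.get(c, 0) for c in CATEGORIES})

# One row per order, one column per category
basket_counts = toy["basket"].apply(count_categories)

# Total number of items in the basket
basket_counts["basket_size"] = basket_counts.sum(axis=1)

print(basket_counts)
```

The resulting columns could then be joined to the cleaned training data with `pd.concat(..., axis=1)`, in the same way as the one-hot features above.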

In [None]:
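# x, y split
# Drop the target, the randomly generated ID, and the raw string columns that were encoded above
X_train = train_all.drop(columns=["returnLabel", "transactionId", "basket", "customerType"])
# The target is the returnLabel column
y_train = train_all["returnLabel"]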