## Child Phase 3 (Data Preparation): Equal Size Sampling, Bootstrap, SMOTE 

* Author: Anna (i3-Versicherung)
* Website: [Data Science Training - Chapter 16](https://data-science.training/kapitel-16/)
* Date: 23.03.2023

We create new versions of the training data to counteract the imbalanced distribution of the class attribute `Child`.

* Version 4: Equal Size Sampling, i.e. random undersampling
* Version 5: Bootstrap, i.e. random oversampling
* Version 6: SMOTE

In [4]:
# imbalanced-learn (0.10.1 => 0.11.0)
#!pip install --upgrade imbalanced-learn
# ...
# Successfully installed imbalanced-learn-0.11.0

In [5]:
# Import the pandas package
#  Data structures and data analysis, I/O
#  https://pandas.pydata.org/pandas-docs/stable/
import pandas as pd
# Import the NumPy package
#  Multidimensional data structures (vectors, matrices, tensors, arrays), linear algebra
#  https://numpy.org/doc/
import numpy as np
# Import classes from the imbalanced-learn modules
#  Handling imbalanced class distributions
#  https://imbalanced-learn.org/stable/introduction.html
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

In [6]:
# Load training, test, and application data from Excel files as pandas data frames (df)
#  (KNIME: "Excel Reader")
df_train = pd.read_excel('../../data/titanic/age/training_v3.xlsx')
df_test  = pd.read_excel('../../data/titanic/age/test_v3.xlsx')
df_app   = pd.read_excel('../../data/titanic/age/application_v3.xlsx')

In [7]:
# Count the rows per class of the class attribute (Child)
count_class = df_train['Child'].value_counts()
print(count_class)

Child
0    762
1     74
Name: count, dtype: int64
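
The counts above show how skewed the class attribute is. As a rough illustration, the following sketch derives a few summary numbers from them. The counts are copied by hand from the output, and all variable names are chosen here for illustration only; they are not part of the original notebook.

```python
# Class counts copied from the value_counts() output above
counts = {0: 762, 1: 74}

total = sum(counts.values())          # all training rows
majority = max(counts.values())
minority = min(counts.values())

imbalance_ratio = majority / minority     # majority rows per minority row
minority_share = minority / total         # fraction of rows with Child == 1
majority_baseline = majority / total      # accuracy of always predicting Child == 0

# Row counts to expect after balancing both classes
rows_after_undersampling = 2 * minority   # both classes shrink to the minority size
rows_after_oversampling = 2 * majority    # both classes grow to the majority size

print(f"total rows:            {total}")
print(f"imbalance ratio:       {imbalance_ratio:.1f} : 1")
print(f"minority share:        {minority_share:.1%}")
print(f"majority baseline:     {majority_baseline:.1%}")
print(f"rows (undersampling):  {rows_after_undersampling}")
print(f"rows (oversampling):   {rows_after_oversampling}")
```

With roughly ten non-children per child, a model that always predicts `Child == 0` would already reach about 91 % accuracy, which is why the class distribution is balanced before training.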


In [8]:
# Show information about the data frame
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 836 entries, 0 to 835
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   SibSp             836 non-null    int64  
 1   Parch             836 non-null    int64  
 2   Fare              836 non-null    float64
 3   TicketCount       836 non-null    int64  
 4   LogFare           836 non-null    float64
 5   FirstnameMeanAge  836 non-null    float64
 6   Pclass_2          836 non-null    int64  
 7   Pclass_3          836 non-null    int64  
 8   Sex_male          836 non-null    int64  
 9   Embarked_Q        836 non-null    int64  
 10  Embarked_S        836 non-null    int64  
 11  Title_Master      836 non-null    int64  
 12  Title_Miss        836 non-null    int64  
 13  Title_Mrs         836 non-null    int64  
 14  Title_Rare        836 non-null    int64  
 15  Child             836 non-null    int64  
dtypes: float64(3), int64(13)
memory usage: 104.6 KB

In [9]:
# Extract the descriptive attributes (columns 0 to 14, i.e. everything except Child)
X = df_train.iloc[:,0:15].values
# Extract the class attribute (Child, column 15)
y = df_train.iloc[:,15].values

In [10]:
#display(X)
#display(y)

In [11]:
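# v4: Random undersampling
#  (KNIME: "Equal Size Sampling")
# Randomly drop majority-class rows until both classes have the same size
rus = RandomUnderSampler(random_state=0)
X_res, y_res = rus.fit_resample(X, y)
# Append the class label as the last column and rebuild the data frame
Xy = np.append(X_res, y_res.reshape(-1,1), axis=1)
df_train_v4 = pd.DataFrame(data=Xy, columns=df_train.columns)
# The combined array is float, so cast the class label back to int
df_train_v4['Child'] = df_train_v4['Child'].astype('int')
#display(df_train_v4)
# Check the new class distribution
count_class = df_train_v4['Child'].value_counts()
print(count_class)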

Child
0    74
1    74
Name: count, dtype: int64
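
To make the idea behind this step concrete, here is a minimal pure-Python sketch of random undersampling. It is not the imbalanced-learn implementation: the function `random_under_sample` and the toy data are hypothetical and only mirror the class counts of the training set.

```python
import random
from collections import Counter

def random_under_sample(rows, labels, seed=0):
    """Shrink every class to the size of the smallest one (hypothetical helper)."""
    rng = random.Random(seed)
    # Group row indices by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = min(len(idxs) for idxs in by_class.values())
    keep = []
    for label in sorted(by_class):
        # Draw without replacement, so no row appears twice
        keep.extend(rng.sample(by_class[label], target))
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data with the same class counts as the training set (762 vs. 74)
labels = [0] * 762 + [1] * 74
rows = [[i] for i in range(len(labels))]   # one dummy feature per row

rows_res, labels_res = random_under_sample(rows, labels)
print(sorted(Counter(labels_res).items()))  # [(0, 74), (1, 74)]
```

The price of this approach is visible in the numbers: 688 of the 762 majority rows are discarded, so the balanced training set has only 148 rows.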


In [12]:
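# v5: Random oversampling
#  (KNIME: "Bootstrap Sampling")
# Randomly duplicate minority-class rows until both classes have the same size
ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X, y)
# Append the class label as the last column and rebuild the data frame
Xy = np.append(X_res, y_res.reshape(-1,1), axis=1)
df_train_v5 = pd.DataFrame(data=Xy, columns=df_train.columns)
# The combined array is float, so cast the class label back to int
df_train_v5['Child'] = df_train_v5['Child'].astype('int')
#display(df_train_v5)
# Check the new class distribution
count_class = df_train_v5['Child'].value_counts()
print(count_class)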

Child
0    762
1    762
Name: count, dtype: int64
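
The oversampling step can be sketched in the same spirit. The following pure-Python example is an illustration under assumptions, not the imbalanced-learn implementation: `random_over_sample` and the toy data are hypothetical and reuse the class counts from the training set.

```python
import random
from collections import Counter

def random_over_sample(rows, labels, seed=0):
    """Grow every class to the size of the largest one (hypothetical helper)."""
    rng = random.Random(seed)
    # Group row indices by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = max(len(idxs) for idxs in by_class.values())
    keep = list(range(len(labels)))    # every original row is kept
    for label in sorted(by_class):
        missing = target - len(by_class[label])
        # Draw with replacement, so rows of the smaller class are duplicated
        keep.extend(rng.choices(by_class[label], k=missing))
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data with the same class counts as the training set (762 vs. 74)
labels = [0] * 762 + [1] * 74
rows = [[i] for i in range(len(labels))]   # one dummy feature per row

rows_res, labels_res = random_over_sample(rows, labels)
print(sorted(Counter(labels_res).items()))  # [(0, 762), (1, 762)]
```

No information is lost here, but the 688 added minority rows are exact copies of the 74 original ones, so the balanced set contains no new feature combinations.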


In [13]:
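# v6: SMOTE (Synthetic Minority Over-sampling Technique)
#  (KNIME: "SMOTE")
# Generate synthetic minority-class rows until both classes have the same size
sm = SMOTE(random_state=0)
X_res, y_res = sm.fit_resample(X, y)
# Append the class label as the last column and rebuild the data frame
Xy = np.append(X_res, y_res.reshape(-1,1), axis=1)
df_train_v6 = pd.DataFrame(data=Xy, columns=df_train.columns)
# The combined array is float, so cast the class label back to int
df_train_v6['Child'] = df_train_v6['Child'].astype('int')
#display(df_train_v6)
# Check the new class distribution
count_class = df_train_v6['Child'].value_counts()
print(count_class)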

Child
0    762
1    762
Name: count, dtype: int64
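
SMOTE reaches the same class counts as random oversampling, but it creates new rows instead of copying existing ones: each synthetic row lies on the line segment between a minority row and one of its nearest minority neighbours. The following pure-Python sketch illustrates only this interpolation idea. The function `smote_like_samples`, its parameters, and the toy points are hypothetical and much simpler than the imbalanced-learn implementation.

```python
import math
import random

def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points by Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()   # random position between base (0) and neighbour (1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class: a numeric feature and a 0/1 dummy feature
minority = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]

synthetic = smote_like_samples(minority, n_new=6, k=2)
for point in synthetic:
    print([round(value, 2) for value in point])
```

Because the new values are interpolated, a 0/1 dummy column such as `Sex_male` or `Pclass_3` can end up with fractional values in the synthetic rows. It may be worth inspecting `df_train_v6` for this before modelling.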


In [14]:
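# Save the data as Excel files
#  (KNIME: "Excel Writer")
# Only the training data is resampled; the test and application data
# are written unchanged under each new version number
# v4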
df_train_v4.to_excel('../../data/titanic/age/training_v4.xlsx', index=False)
df_test.to_excel('../../data/titanic/age/test_v4.xlsx', index=False)
df_app.to_excel('../../data/titanic/age/application_v4.xlsx', index=False)
# v5
df_train_v5.to_excel('../../data/titanic/age/training_v5.xlsx', index=False)
df_test.to_excel('../../data/titanic/age/test_v5.xlsx', index=False)
df_app.to_excel('../../data/titanic/age/application_v5.xlsx', index=False)
# v6
df_train_v6.to_excel('../../data/titanic/age/training_v6.xlsx', index=False)
df_test.to_excel('../../data/titanic/age/test_v6.xlsx', index=False)
df_app.to_excel('../../data/titanic/age/application_v6.xlsx', index=False)