## Phase 6 (Deployment): Preparation (v4: Int)

* Author: Anna (i3-Versicherung)
* Website: [Data Science Training - Chapter 6](https://data-science.training/kapitel-6/)
* Date: 23.03.2023

The application data that we want to use in Phase 6 (Deployment) must be preprocessed in the same way as the data in Phase 3 (Data Preparation).

In [4]:
# Import the pandas package
#  Data structures and data analysis, I/O
#  https://pandas.pydata.org/pandas-docs/stable/
import pandas as pd

In [5]:
# Load the application data from a CSV file into a pandas DataFrame (df)
#  (KNIME: "CSV Reader")
df = pd.read_csv('../../data/titanic/original/application.csv')

In [6]:
# Convert data types automatically
df = df.convert_dtypes()
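
To make the effect of `convert_dtypes()` more concrete, here is a minimal sketch on a made-up frame. The `demo` frame and its values are illustrative assumptions and are not taken from `application.csv`. It suggests why a column such as `Age` can end up as nullable `Int64`: a float column that holds only whole numbers and missing values is converted to an integer dtype that still supports missing values.

```python
import pandas as pd

# Made-up demo data (assumption): not taken from application.csv
demo = pd.DataFrame({
    'Age': [22.0, None, 35.0],      # whole-number floats with one missing value
    'SibSp': [1, 0, 0],             # plain integers
    'Fare': [7.25, 71.2833, 8.05],  # real floats
})

converted = demo.convert_dtypes()

print(demo.dtypes)       # NumPy dtypes before the conversion
print(converted.dtypes)  # nullable pandas dtypes after the conversion
```

The missing `Age` value survives the conversion as `<NA>`, so later comparisons on that column return nullable booleans rather than plain `True`/`False`.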

In [7]:
# Check for missing values
df.isnull().sum()

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

In [8]:
# New feature: KnownCabin
#  (KNIME: "Rule Engine")
df['KnownCabin'] = (df['Cabin'].notna()).astype('int')

In [9]:
# New feature: Child
#  (KNIME: "Rule Engine")
df['Child'] = (df['Age'] < 12).fillna(False).astype('int')
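
The `Child` rule combines a comparison on a nullable column with `fillna(False)`. The following sketch, using a made-up `ages` series (an assumption for illustration), shows how the boundary value 12 and a missing age are treated.

```python
import pandas as pd

# Made-up ages (assumption), including the boundary value 12 and a missing value
ages = pd.Series([4, 11, 12, None, 40], dtype='Int64')

# Comparison on a nullable Int64 column yields a nullable boolean (<NA> for the missing age)
is_child = ages < 12

# Missing ages are treated as "not a child" before casting to 0/1
child = is_child.fillna(False).astype('int')

print(child.tolist())  # [1, 1, 0, 0, 0]
```

A passenger aged exactly 12 is therefore not flagged as a child, and a passenger with unknown age is flagged as 0 as well.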

In [10]:
# New feature: Title
#  (KNIME: "Cell Splitter", "Column Renamer", "Table Creator", "Cell Replacer")
df['Title'] = df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
df['Title'] = df['Title'].replace(['Ms', 'Mlle'], 'Miss')
df['Title'] = df['Title'].replace(['Mme', 'Lady', 'the Countess', 'Dona'], 'Mrs')
df['Title'] = df['Title'].replace(['Dr', 'Col', 'Major', 'Jonkheer', 'Capt', 'Sir', 'Don', 'Rev'], 'Rare')
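
The title extraction relies on the `Name` format "Surname, Title. Given names". The sketch below runs the same two-step split on a few example names in that format and compares it with a regular-expression variant. The regex and the `title_map` dictionary are illustrative alternatives, not the notebook's own method.

```python
import pandas as pd

# Example names in the format of the Titanic "Name" column (illustrative)
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'Sagesser, Mlle. Emma',
])

# Approach used in the notebook: split at ', ', then at '.'
by_split = names.str.split(', ', expand=True)[1].str.split('.', expand=True)[0]

# Alternative (assumption): capture everything between ', ' and the first '.'
by_regex = names.str.extract(r',\s*([^.]+)\.', expand=False)

# Hypothetical mapping covering only the titles in this example
title_map = {'Mlle': 'Miss', 'the Countess': 'Mrs'}
titles = by_regex.replace(title_map)

print(by_split.tolist())  # ['Mr', 'the Countess', 'Mlle']
print(titles.tolist())    # ['Mr', 'Mrs', 'Miss']
```

Titles that appear in none of the replacement lists stay unchanged, so an unexpected title in new application data would later produce an additional dummy column.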

In [11]:
# New feature: FamilySizeBinned
#  (KNIME: "Math Formula", "Table Creator", "Binner (Dictionary)")
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
bins   = [0, 2, 5, 99]
labels = ['No', 'Small', 'Large']
df['FamilySizeBinned'] = pd.cut(df['FamilySize'], bins, right=False, labels=labels)
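
With `right=False`, each bin includes its left edge and excludes its right edge, so the bins are `[0, 2)`, `[2, 5)` and `[5, 99)`. A short sketch with made-up family sizes (an assumption for illustration) shows where the boundary values 2 and 5 land.

```python
import pandas as pd

# Made-up family sizes (assumption), including the bin edges 2 and 5
family_size = pd.Series([1, 2, 4, 5, 11])

bins = [0, 2, 5, 99]
labels = ['No', 'Small', 'Large']

# right=False: intervals are closed on the left and open on the right
binned = pd.cut(family_size, bins, right=False, labels=labels)

print(binned.tolist())  # ['No', 'Small', 'Small', 'Large', 'Large']
```

A passenger travelling alone (`FamilySize` of 1) falls into `No`, families of 2 to 4 into `Small`, and families of 5 or more into `Large`.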

In [12]:
# New feature: FareBinned
#  (KNIME: "Table Creator", "Binner (Dictionary)")
bins   = [-1, 8, 16, 32, 1024]
labels = ['Low', 'Medium', 'High', 'VeryHigh']
df['FareBinned'] = pd.cut(df['Fare'], bins, right=False, labels=labels)
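
One property of `pd.cut` worth keeping in mind for new application data: values outside the outermost bin edges, and missing values, are assigned no bin at all. The sketch below uses made-up fares (an assumption for illustration) to show this, and how such rows look after one-hot encoding.

```python
import pandas as pd

# Made-up fares (assumption): two in-range edge cases, one out of range, one missing
fares = pd.Series([0.0, 7.99, 8.0, 32.0, 2000.0, None])

bins = [-1, 8, 16, 32, 1024]
labels = ['Low', 'Medium', 'High', 'VeryHigh']

binned = pd.cut(fares, bins, right=False, labels=labels)
print(binned.tolist())  # ['Low', 'Low', 'Medium', 'VeryHigh', nan, nan]

# Rows without a bin get 0 in every dummy column
dummies = pd.get_dummies(binned, dtype=int)
print(dummies.sum(axis=1).tolist())  # [1, 1, 1, 1, 0, 0]
```

In the application data of this notebook `Fare` has no missing values, so this case may never occur here. A check such as `binned.isna().any()` would make the assumption explicit.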

In [13]:
# Show the result of the feature engineering
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   PassengerId       23 non-null     Int64   
 1   Pclass            23 non-null     Int64   
 2   Name              23 non-null     string  
 3   Sex               23 non-null     string  
 4   Age               23 non-null     Int64   
 5   SibSp             23 non-null     Int64   
 6   Parch             23 non-null     Int64   
 7   Ticket            23 non-null     string  
 8   Fare              23 non-null     Float64 
 9   Cabin             23 non-null     string  
 10  Embarked          23 non-null     string  
 11  KnownCabin        23 non-null     int32   
 12  Child             23 non-null     int32   
 13  Title             23 non-null     string  
 14  FamilySize        23 non-null     Int64   
 15  FamilySizeBinned  23 non-null     category
 16  FareBinned        23 non-nul

In [14]:
# One-hot encoding => dummy variables
#  for Pclass, Sex, Embarked, Title, FamilySizeBinned, FareBinned
cols  = ['Pclass', 'Sex', 'Embarked', 'Title', 'FamilySizeBinned', 'FareBinned']
df = pd.get_dummies(df, columns=cols, dtype=int)
# Show the result
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   PassengerId             23 non-null     Int64  
 1   Name                    23 non-null     string 
 2   Age                     23 non-null     Int64  
 3   SibSp                   23 non-null     Int64  
 4   Parch                   23 non-null     Int64  
 5   Ticket                  23 non-null     string 
 6   Fare                    23 non-null     Float64
 7   Cabin                   23 non-null     string 
 8   KnownCabin              23 non-null     int32  
 9   Child                   23 non-null     int32  
 10  FamilySize              23 non-null     Int64  
 11  Pclass_1                23 non-null     int32  
 12  Pclass_2                23 non-null     int32  
 13  Pclass_3                23 non-null     int32  
 14  Sex_female              23 non-null     int3
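
`pd.get_dummies` only creates columns for categories that actually occur in the data it is given. For a small application set, a category that was present in training (for example a port of embarkation) may be absent, and a later `drop` of that column would then raise a `KeyError`. The notebook does not show an alignment step. The following is a minimal sketch of one possible safeguard, where `train_columns` is a hypothetical list of column names saved from the training run.

```python
import pandas as pd

# Hypothetical column list saved during training (assumption, not from the notebook)
train_columns = ['PassengerId', 'Embarked_C', 'Embarked_Q', 'Embarked_S']

# Made-up application data (assumption) in which no passenger embarked at 'C'
app = pd.DataFrame({'PassengerId': [1, 2, 3], 'Embarked': ['S', 'S', 'Q']})

app_dummies = pd.get_dummies(app, columns=['Embarked'], dtype=int)
print(app_dummies.columns.tolist())  # ['PassengerId', 'Embarked_Q', 'Embarked_S']

# Add missing training columns filled with 0 and enforce the training column order
aligned = app_dummies.reindex(columns=train_columns, fill_value=0)
print(aligned.columns.tolist())
```

Note that `reindex` also silently drops columns that are not in `train_columns`, such as a dummy for a category never seen in training. Whether that is acceptable depends on the model.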

In [15]:
# Clean-up: filter out attributes (manually)
#  (KNIME: "Column Filter")
# Attributes that were replaced by new features or are irrelevant to the analysis
df = df.drop(['Name', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'FamilySize', 'Ticket'], axis=1)
# Attributes that were filtered out because of strong correlations in the training and test data
df = df.drop(['Sex_female', 'Title_Mr', 'FamilySizeBinned_Small', 'Embarked_C', 'Pclass_1'], axis=1)
# Show the result
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   PassengerId             23 non-null     Int64
 1   KnownCabin              23 non-null     int32
 2   Child                   23 non-null     int32
 3   Pclass_2                23 non-null     int32
 4   Pclass_3                23 non-null     int32
 5   Sex_male                23 non-null     int32
 6   Embarked_Q              23 non-null     int32
 7   Embarked_S              23 non-null     int32
 8   Title_Master            23 non-null     int32
 9   Title_Miss              23 non-null     int32
 10  Title_Mrs               23 non-null     int32
 11  Title_Rare              23 non-null     int32
 12  FamilySizeBinned_No     23 non-null     int32
 13  FamilySizeBinned_Large  23 non-null     int32
 14  FareBinned_Low          23 non-null     int32
 15  FareBinned_Medium       2

In [16]:
display(df.head())

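|   | PassengerId | KnownCabin | Child | Pclass_2 | Pclass_3 | Sex_male | Embarked_Q | Embarked_S | Title_Master | Title_Miss | Title_Mrs | Title_Rare | FamilySizeBinned_No | FamilySizeBinned_Large | FareBinned_Low | FareBinned_Medium | FareBinned_High | FareBinned_VeryHigh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1310 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1311 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1312 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 1313 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 1314 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |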


In [17]:
# Save the data as an Excel file
#  (KNIME: "Excel Writer")
df.to_excel('../../data/titanic/new/application_v4.xlsx', index=False)