## Phase 6 (Deployment): Preparation (v5: Float)

* Author: Anna (i3-Versicherung)
* Website: [Data Science Training - Chapter 6](https://data-science.training/kapitel-6/)
* Date: 23.03.2023

The application data that we want to use in Phase 6 (Deployment) must also be preprocessed, following the same steps as in Phase 3 (Data Preparation).

In [4]:
# Import the pandas package
#  Data structures and data analysis, I/O
#  https://pandas.pydata.org/pandas-docs/stable/
import pandas as pd
# Import the NumPy package
#  Multidimensional data structures (vectors, matrices, tensors, arrays), linear algebra
#  https://numpy.org/doc/
import numpy as np

In [5]:
# Load the application data from a CSV file into a pandas DataFrame (df)
#  (KNIME: "CSV Reader")
df = pd.read_csv('../../data/titanic/original/application.csv')

In [6]:
# Convert data types automatically
df = df.convert_dtypes()
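
As a small illustration of what `convert_dtypes()` does, the following sketch uses a hypothetical toy frame (not the Titanic application data). It assumes a reasonably recent pandas version in which the nullable `Int64` and `Float64` extension types are available.

```python
import pandas as pd

# Hypothetical toy frame with default NumPy-backed dtypes (int64, float64, object)
toy = pd.DataFrame({"Age": [22, 38], "Fare": [7.25, 71.2833], "Name": ["A", "B"]})

# convert_dtypes() switches to the nullable extension types where possible
converted = toy.convert_dtypes()

# Integer and float columns are now Int64 / Float64, which can hold pd.NA
print(converted.dtypes)
```

This is why the later `df.info()` outputs list `Int64`, `Float64` and `string` rather than the NumPy defaults.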

In [7]:
# Check for missing values
df.isnull().sum()

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

In [8]:
# New feature: KnownCabin
#  (KNIME: "Rule Engine")
df['KnownCabin'] = (df['Cabin'].notna()).astype('int')

In [9]:
# New feature: Child
#  (KNIME: "Rule Engine")
df['Child'] = (df['Age'] < 12).fillna(False).astype('int')
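
The `fillna(False)` step matters as soon as an age is missing: with the nullable `Int64` type, a comparison against a missing value yields `<NA>` rather than `False`. A minimal sketch with hypothetical ages:

```python
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([5, 30, pd.NA], dtype="Int64")

# Comparison on a nullable column gives a nullable boolean: True, False, <NA>
is_child = ages < 12

# Treat an unknown age as "not a child" before converting to 0/1
child = is_child.fillna(False).astype("int")
print(child.tolist())
```

In the application data shown here no age is missing, so the step is only a safeguard.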

In [10]:
# New feature: Title
#  (KNIME: "Cell Splitter", "Column Renamer", "Table Creator", "Cell Replacer")
df['Title'] = df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
df['Title'] = df['Title'].replace(['Ms', 'Mlle'], 'Miss')
df['Title'] = df['Title'].replace(['Mme', 'Lady', 'the Countess', 'Dona'], 'Mrs')
df['Title'] = df['Title'].replace(['Dr', 'Col', 'Major', 'Jonkheer', 'Capt', 'Sir', 'Don', 'Rev'], 'Rare')
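
To make the two-stage split easier to follow, here is a sketch on three illustrative names in the `Surname, Title. First names` format. The names are examples chosen for illustration and are not taken from `application.csv`.

```python
import pandas as pd

# Illustrative names in the "Surname, Title. First names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Take the part after ", " and then the part before the first "."
titles = names.str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

# Map a rarer form of address to a common one, as in the cell above
titles = titles.replace(["Mme", "Lady", "the Countess", "Dona"], "Mrs")
print(titles.tolist())
```

Titles that appear in none of the replacement lists pass through unchanged, so it may be worth checking `df['Title'].unique()` on new application data.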

In [11]:
# New feature: FamilySize
#  (KNIME: "Math Formula")
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

In [12]:
# Auxiliary variable: TicketCount
#  (KNIME: "GroupBy", "Joiner", "Column Renamer")
ticketCount = df.groupby('Ticket', as_index=False)['PassengerId'].count()
ticketCount = ticketCount.rename(columns={'PassengerId': 'TicketCount'})
df = df.merge(ticketCount, how='left', on='Ticket')
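
The group-rename-merge sequence mirrors the KNIME nodes. In pandas the same count can also be obtained in one step with `groupby(...).transform('count')`. The following is an alternative sketch on a hypothetical toy frame, not the method used in this notebook:

```python
import pandas as pd

# Hypothetical toy data: three passengers share ticket "A1"
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Ticket": ["A1", "A1", "B2", "A1"],
})

# transform("count") returns one value per row, aligned with the original index
toy["TicketCount"] = toy.groupby("Ticket")["PassengerId"].transform("count")
print(toy["TicketCount"].tolist())
```

Both variants should yield the same `TicketCount` column. The merge version keeps the intermediate table, which can be useful for inspection.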

In [13]:
# New feature: LogFare
#  (KNIME: "Math Formula")
df['LogFare'] = np.log( 1 + df['Fare'] / df['TicketCount'] )
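
The feature is the natural logarithm of one plus the fare per person, so a fare of 0 maps to a `LogFare` of 0. A short sketch with hypothetical values, which also shows NumPy's equivalent `np.log1p`:

```python
import numpy as np
import pandas as pd

# Hypothetical fares and ticket counts
toy = pd.DataFrame({"Fare": [0.0, 3.0, 58.6], "TicketCount": [1, 2, 2]})

# Fare per person: the ticket price is shared by everyone on the ticket
per_person = toy["Fare"] / toy["TicketCount"]

# Same formula as in the cell above ...
log_fare = np.log(1 + per_person)
# ... and the numerically more robust equivalent for small values
log_fare_alt = np.log1p(per_person)

print(log_fare.round(6).tolist())
```

Whichever form is used, it should match the one used on the training data in Phase 3.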

In [14]:
# Show the result of the feature engineering
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  23 non-null     Int64  
 1   Pclass       23 non-null     Int64  
 2   Name         23 non-null     string 
 3   Sex          23 non-null     string 
 4   Age          23 non-null     Int64  
 5   SibSp        23 non-null     Int64  
 6   Parch        23 non-null     Int64  
 7   Ticket       23 non-null     string 
 8   Fare         23 non-null     Float64
 9   Cabin        23 non-null     string 
 10  Embarked     23 non-null     string 
 11  KnownCabin   23 non-null     int32  
 12  Child        23 non-null     int32  
 13  Title        23 non-null     string 
 14  FamilySize   23 non-null     Int64  
 15  TicketCount  23 non-null     Int64  
 16  LogFare      23 non-null     Float64
dtypes: Float64(2), Int64(7), int32(2), string(6)
memory usage: 3.2 KB


In [15]:
# One-hot encoding => dummy variables
#  for Pclass, Embarked, Title
cols = ['Pclass', 'Embarked', 'Title']
df = pd.get_dummies(df, columns=cols, dtype=int)
# Show the result
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 25 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   23 non-null     Int64  
 1   Name          23 non-null     string 
 2   Sex           23 non-null     string 
 3   Age           23 non-null     Int64  
 4   SibSp         23 non-null     Int64  
 5   Parch         23 non-null     Int64  
 6   Ticket        23 non-null     string 
 7   Fare          23 non-null     Float64
 8   Cabin         23 non-null     string 
 9   KnownCabin    23 non-null     int32  
 10  Child         23 non-null     int32  
 11  FamilySize    23 non-null     Int64  
 12  TicketCount   23 non-null     Int64  
 13  LogFare       23 non-null     Float64
 14  Pclass_1      23 non-null     int32  
 15  Pclass_2      23 non-null     int32  
 16  Pclass_3      23 non-null     int32  
 17  Embarked_C    23 non-null     int32  
 18  Embarked_Q    23 non-null     int32  
[output truncated]
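
One point to watch in deployment: `pd.get_dummies` only creates columns for categories that actually occur in the data. Here all expected dummy columns appear, but a small application set could lack a category. A possible safeguard, sketched under the assumption that the column list from training is available (`train_columns` is a hypothetical name), is to reindex against that list:

```python
import pandas as pd

# Hypothetical column list saved from the training run in Phase 3
train_columns = ["Embarked_C", "Embarked_Q", "Embarked_S"]

# Hypothetical application data in which no passenger embarked at "Q"
app = pd.DataFrame({"Embarked": ["S", "C", "S"]})
app = pd.get_dummies(app, columns=["Embarked"], dtype=int)

# Add missing dummy columns (filled with 0) and enforce the training column order
aligned = app.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

Note that `reindex` also drops columns that are not in `train_columns`, so categories unseen during training would be silently discarded.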

In [16]:
# Clean-up: filter out attributes (manually)
#  (KNIME: "Column Filter")
df = df.drop(['Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Ticket', 'TicketCount'], axis=1)
# Show the result
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   23 non-null     Int64  
 1   KnownCabin    23 non-null     int32  
 2   Child         23 non-null     int32  
 3   FamilySize    23 non-null     Int64  
 4   LogFare       23 non-null     Float64
 5   Pclass_1      23 non-null     int32  
 6   Pclass_2      23 non-null     int32  
 7   Pclass_3      23 non-null     int32  
 8   Embarked_C    23 non-null     int32  
 9   Embarked_Q    23 non-null     int32  
 10  Embarked_S    23 non-null     int32  
 11  Title_Master  23 non-null     int32  
 12  Title_Miss    23 non-null     int32  
 13  Title_Mr      23 non-null     int32  
 14  Title_Mrs     23 non-null     int32  
 15  Title_Rare    23 non-null     int32  
dtypes: Float64(1), Int64(2), int32(13)
memory usage: 1.9 KB


In [17]:
display(df.head())

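|   | PassengerId | KnownCabin | Child | FamilySize | LogFare | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_C | Embarked_Q | Embarked_S | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Rare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1310 | 1 | 0 | 1 | 0.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1311 | 1 | 0 | 1 | 3.411148 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1312 | 1 | 0 | 1 | 2.917771 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1313 | 1 | 0 | 1 | 0.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 1314 | 1 | 0 | 2 | 0.916291 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |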


In [18]:
# Save the data as an Excel file
#  (KNIME: "Excel Writer")
df.to_excel('../../data/titanic/new/application_v5.xlsx', index=False)