## Phase 3 (Data Preparation): v5 (Float)

* Author: Anna (i3-Versicherung)
* Website: [Data Science Training - Chapter 5](https://data-science.training/kapitel-5/)
* Date: 23.03.2023

In this version we carry out an improved data preparation:

* The Cabin value for PassengerId 873 is deleted
* New feature: KnownCabin (1 = cabin is known, 0 = cabin is unknown [Cabin: missing value, i.e. NaN])
* New feature: Child (Age < 12)
* Imputing missing Embarked values: mode 'S'
* Imputing missing Fare values: constant value 7.896
* New feature: Title (derived from Name)
* New feature: FamilySize (derived from SibSp and Parch)
* New feature: LogFare (derived from Fare and TicketCount, which is itself computed from Ticket beforehand)

The goal of version 5 is to create attributes (features) that are all numeric, i.e. of data type Int or Float.

As in version 4, we use one-hot encoding to create so-called dummy variables as new features.

* One-hot encoding: Pclass, Embarked, Title => binary dummy variables (Sex is not encoded; it is dropped later because Title already carries it)
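
As a minimal sketch of what one-hot encoding produces, the following toy example applies `pd.get_dummies` to a single nominal column. The frame `toy` and its values are made up for illustration and are not taken from the Titanic files.

```python
import pandas as pd

# Toy data (hypothetical), not taken from the Titanic files
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# Each category becomes its own 0/1 column named <column>_<category>;
# the original column is removed from the result
toy_encoded = pd.get_dummies(toy, columns=['Embarked'], dtype=int)
print(toy_encoded)
```

Each row has exactly one 1 across the three new columns, which is why they are called binary dummy variables.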

We skip the correlation analysis in this version.

In [4]:
# Import the pandas package
#  Data structures and data analysis, I/O
#  https://pandas.pydata.org/pandas-docs/stable/
import pandas as pd
# Import the NumPy package
#  Multidimensional data structures (vectors, matrices, tensors, arrays), linear algebra
#  https://numpy.org/doc/
import numpy as np

In [5]:
# Load training and test data from CSV files as pandas DataFrames (df)
#  (KNIME: "CSV Reader")
df_train = pd.read_csv('../../data/titanic/original/train.csv')
df_test  = pd.read_csv('../../data/titanic/original/test.csv')

In [6]:
# Concatenate training and test data
#  (KNIME: "Concatenate")
df = pd.concat([df_train, df_test], ignore_index=True)

In [7]:
# Convert data types automatically
df = df.convert_dtypes()

In [8]:
# Check for missing values
df.isnull().sum()

PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
dtype: int64

In [9]:
# Delete the incorrect cabin number
#  (KNIME: "Rule Engine")
display(df[df['Cabin'] == 'B51 B53 B55'])
display(df[df['PassengerId'] == 873])
df.loc[872, 'Cabin'] = np.nan
display(df[df['PassengerId'] == 873])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
679,680,1.0,1,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,0,1,PC 17755,512.3292,B51 B53 B55,C
872,873,0.0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0,B51 B53 B55,S
1234,1235,,1,"Cardeza, Mrs. James Warburton Martinez (Charlo...",female,58.0,0,1,PC 17755,512.3292,B51 B53 B55,C


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0,B51 B53 B55,S


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0,,S
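
The cell above clears the value through the row label 872, which only matches PassengerId 873 as long as the index was rebuilt with `ignore_index=True`. A minimal sketch of a variant that selects the row by its key column instead is shown below; the frame `toy` and its index values are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical index) whose row labels do not equal PassengerId - 1
toy = pd.DataFrame(
    {'PassengerId': [680, 873, 1235],
     'Cabin': ['B51 B53 B55', 'B51 B53 B55', 'B51 B53 B55']},
    index=[10, 11, 12],
).convert_dtypes()

# Boolean mask on the key column instead of a hard-coded row label
mask = toy['PassengerId'] == 873
toy.loc[mask, 'Cabin'] = pd.NA
print(toy)
```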


In [10]:
# New feature: KnownCabin
#  (KNIME: "Rule Engine")
df['KnownCabin'] = (df['Cabin'].notna()).astype('int')

In [11]:
# New feature: Child
#  (KNIME: "Rule Engine")
df['Child'] = (df['Age'] < 12).fillna(False).astype('int')
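
The `fillna(False)` step matters because Age has missing values: with the nullable `Float64` dtype, the comparison yields `<NA>` for those rows rather than `False`. The following sketch, using three made-up ages, illustrates the intermediate result.

```python
import pandas as pd

# Hypothetical ages, one of them missing
ages = pd.Series([4.0, None, 30.0], dtype='Float64')

# Comparison on a nullable column propagates the missing value
is_child = ages < 12
print(is_child)

# Passengers with unknown age are treated as "not a child"
child = is_child.fillna(False).astype('int')
print(child)
```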

In [12]:
# Handle (i.e. impute) missing values - part 1
#  (KNIME: "Missing Values")
# Embarked (nominal scale): 2 missing values => use the mode (most frequent value)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].dropna().mode()[0])
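
A small sketch of mode imputation on made-up values follows. Note that `Series.mode()` ignores missing values by default, so the explicit `dropna()` in the cell above is a safeguard rather than a requirement.

```python
import pandas as pd

# Hypothetical port codes with two missing entries
embarked = pd.Series(['S', 'C', None, 'S', 'Q', None])

# mode() returns a Series of the most frequent value(s); take the first one
most_frequent = embarked.mode()[0]
filled = embarked.fillna(most_frequent)
print(filled.tolist())
```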

In [13]:
# Handle (i.e. impute) missing values - part 2
#  (KNIME: "Missing Values")
# Fare (cardinal scale): 1 missing value => use the constant value 7.896
#df['Fare'] = df['Fare'].fillna(7.896)
display(df[df['Fare'].isna()])
df.loc[1043, 'Fare'] = 7.896
display(df[df['PassengerId'] == 1044])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,KnownCabin,Child
1043,1044,,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S,0,0


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,KnownCabin,Child
1043,1044,,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,7.896,,S,0,0


In [14]:
# Note: We drop the features Age and Cabin later.
# Therefore we do not impute missing values for these features.

In [15]:
# New feature: Title
#  (KNIME: "Cell Splitter", "Column Renamer", "Table Creator", "Cell Replacer")
df['Title'] = df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
df['Title'] = df['Title'].replace(['Ms', 'Mlle'], 'Miss')
df['Title'] = df['Title'].replace(['Mme', 'Lady', 'the Countess', 'Dona'], 'Mrs')
df['Title'] = df['Title'].replace(['Dr', 'Col', 'Major', 'Jonkheer', 'Capt', 'Sir', 'Don', 'Rev'], 'Rare')
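
The two chained splits rely on the name format "Surname, Title. Given names". As an alternative sketch, not the notebook's own method, the same extraction can be written with a single regular expression. The first two names appear in the outputs above; the last two are hypothetical, and the mapping covers only a subset of the replacements.

```python
import pandas as pd

names = pd.Series([
    'Carlsson, Mr. Frans Olof',
    'Storey, Mr. Thomas',
    'Doe, Mlle. Jane',                   # hypothetical name
    'Roe, the Countess. of (Example)',   # hypothetical name
])

# Capture everything between the first ", " and the following "."
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)

# Merge rare spellings into the main groups (subset of the mapping used above)
titles = titles.replace({'Mlle': 'Miss', 'the Countess': 'Mrs'})
print(titles.tolist())
```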

In [16]:
# New feature: FamilySize
#  (KNIME: "Math Formula")
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

In [17]:
# Auxiliary variable: TicketCount
#  (KNIME: "GroupBy", "Joiner", "Column Renamer")
ticketCount = df.groupby('Ticket', as_index=False)['PassengerId'].count()
ticketCount = ticketCount.rename(columns={'PassengerId': 'TicketCount'})
df = df.merge(ticketCount, how='left', on='Ticket')
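
The group-rename-merge sequence mirrors the KNIME nodes. In pandas, the same per-ticket count can also be obtained in one step with `groupby(...).transform('count')`, which returns a result aligned to the original rows. The sketch below uses a made-up frame.

```python
import pandas as pd

# Hypothetical passengers: three share ticket 'A1', one travels on 'B2'
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Ticket': ['A1', 'B2', 'A1', 'A1'],
})

# transform() broadcasts the group size back to every row of the group
toy['TicketCount'] = toy.groupby('Ticket')['PassengerId'].transform('count')
print(toy)
```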

In [18]:
# New feature: LogFare
#  (KNIME: "Math Formula")
df['LogFare'] = np.log( 1 + df['Fare'] / df['TicketCount'] )
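
LogFare is the natural logarithm of one plus the fare per person, where the fare per person is Fare divided by the number of passengers sharing the ticket. The sketch below evaluates this on two made-up rows and compares it with `np.log1p`, which computes the same quantity.

```python
import numpy as np
import pandas as pd

# Hypothetical fares and ticket counts
toy = pd.DataFrame({'Fare': [7.25, 30.0], 'TicketCount': [1, 3]})

# Fare per person: ticket price divided by the number of passengers on it
per_person = toy['Fare'] / toy['TicketCount']

# Same formula as in the cell above, and the equivalent log1p form
log_fare = np.log(1 + per_person)
log_fare_alt = np.log1p(per_person)
print(log_fare.round(6).tolist())
```

Adding 1 before taking the logarithm keeps the result defined and non-negative for a fare of 0.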

In [19]:
# Show the result of the feature engineering
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   Int64  
 1   Survived     891 non-null    Int64  
 2   Pclass       1309 non-null   Int64  
 3   Name         1309 non-null   string 
 4   Sex          1309 non-null   string 
 5   Age          1046 non-null   Float64
 6   SibSp        1309 non-null   Int64  
 7   Parch        1309 non-null   Int64  
 8   Ticket       1309 non-null   string 
 9   Fare         1309 non-null   Float64
 10  Cabin        294 non-null    string 
 11  Embarked     1309 non-null   string 
 12  KnownCabin   1309 non-null   int32  
 13  Child        1309 non-null   int32  
 14  Title        1309 non-null   string 
 15  FamilySize   1309 non-null   Int64  
 16  TicketCount  1309 non-null   Int64  
 17  LogFare      1309 non-null   Float64
dtypes: Float64(3), Int64(7), int32(2), string(6)
mem

In [20]:
# One-hot encoding => dummy variables
#  for Pclass, Embarked, Title
cols = ['Pclass', 'Embarked', 'Title']
df = pd.get_dummies(df, columns=cols, dtype=int)
# Show the result
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 26 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   1309 non-null   Int64  
 1   Survived      891 non-null    Int64  
 2   Name          1309 non-null   string 
 3   Sex           1309 non-null   string 
 4   Age           1046 non-null   Float64
 5   SibSp         1309 non-null   Int64  
 6   Parch         1309 non-null   Int64  
 7   Ticket        1309 non-null   string 
 8   Fare          1309 non-null   Float64
 9   Cabin         294 non-null    string 
 10  KnownCabin    1309 non-null   int32  
 11  Child         1309 non-null   int32  
 12  FamilySize    1309 non-null   Int64  
 13  TicketCount   1309 non-null   Int64  
 14  LogFare       1309 non-null   Float64
 15  Pclass_1      1309 non-null   int32  
 16  Pclass_2      1309 non-null   int32  
 17  Pclass_3      1309 non-null   int32  
 18  Embarked_C    1309 non-null 

### Interim Result

We have created many new features. Next, we filter out the attributes that served as the basis for these new features:

* Name   (replaced by Title)
* Sex    (replaced by Title: the title indirectly encodes the sex as well)
* Age    (replaced by Child and also has missing values)
* SibSp  (replaced by FamilySize)
* Parch  (replaced by FamilySize)
* Fare   (replaced by LogFare)
* Cabin  (replaced by KnownCabin and also has missing values)

Finally, the Ticket attribute has no meaning for the data analysis. It is therefore considered irrelevant and can be filtered out as well, together with the auxiliary variable derived from it:

* Ticket (irrelevant)
* TicketCount (auxiliary variable)

In [22]:
# Clean-up: filter out attributes (manually)
#  (KNIME: "Column Filter")
df = df.drop(['Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Ticket', 'TicketCount'], axis=1)
# Show the result
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   1309 non-null   Int64  
 1   Survived      891 non-null    Int64  
 2   KnownCabin    1309 non-null   int32  
 3   Child         1309 non-null   int32  
 4   FamilySize    1309 non-null   Int64  
 5   LogFare       1309 non-null   Float64
 6   Pclass_1      1309 non-null   int32  
 7   Pclass_2      1309 non-null   int32  
 8   Pclass_3      1309 non-null   int32  
 9   Embarked_C    1309 non-null   int32  
 10  Embarked_Q    1309 non-null   int32  
 11  Embarked_S    1309 non-null   int32  
 12  Title_Master  1309 non-null   int32  
 13  Title_Miss    1309 non-null   int32  
 14  Title_Mr      1309 non-null   int32  
 15  Title_Mrs     1309 non-null   int32  
 16  Title_Rare    1309 non-null   int32  
dtypes: Float64(1), Int64(3), int32(13)
memory usage: 112.6 KB
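
Since the stated goal of this version is a purely numeric feature set, a quick programmatic check can complement reading the `info()` output. The following is a minimal sketch on a made-up frame; `toy` and `non_numeric` are illustrative names.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame with one leftover text column
toy = pd.DataFrame({
    'FamilySize': [1, 2],
    'LogFare': [2.1, 3.6],
    'Title': ['Mr', 'Mrs'],
})

# Collect every column that is not of a numeric dtype
non_numeric = [col for col in toy.columns if not is_numeric_dtype(toy[col])]
print(non_numeric)
```

An empty list would indicate that every remaining column is numeric.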


In [23]:
# Split the data again
#  (KNIME: "Row Splitter")
df_train = df[df['Survived'].notna()]
df_test  = df[df['Survived'].isna()]
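
The split relies on Survived being missing exactly for the test rows (418 of the 1309 rows in the missing-value overview above). A small sketch on made-up data shows that the two masks partition the frame; the `.copy()` calls are an optional addition here, so that later changes to the parts do not refer back to the combined frame.

```python
import pandas as pd

# Hypothetical combined frame: the last row has no label
toy = pd.DataFrame({'PassengerId': [1, 2, 3], 'Survived': [0, 1, None]}).convert_dtypes()

# Rows with a label form the training part, the rest the test part
toy_train = toy[toy['Survived'].notna()].copy()
toy_test = toy[toy['Survived'].isna()].copy()

print(len(toy_train), len(toy_test))
```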

In [24]:
# Filter out irrelevant attributes
#  (KNIME: "Column Filter")
# Training data: PassengerId
df_train = df_train.drop(['PassengerId'], axis=1)
# Test data: Survived
df_test = df_test.drop(['Survived'], axis=1)

In [25]:
display(df_train.head())

Unnamed: 0,Survived,KnownCabin,Child,FamilySize,LogFare,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Rare
0,0,0,0,2,2.110213,0,0,1,0,0,1,0,0,1,0,0
1,1,1,0,2,3.601186,1,0,0,1,0,0,0,0,0,1,0
2,1,0,0,1,2.188856,0,0,1,0,0,1,0,1,0,0,0
3,1,1,0,2,3.316003,1,0,0,0,0,1,0,0,0,1,0
4,0,0,0,1,2.202765,0,0,1,0,0,1,0,0,1,0,0


In [26]:
display(df_test.head())

Unnamed: 0,PassengerId,KnownCabin,Child,FamilySize,LogFare,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Rare
891,892,0,0,1,2.178064,0,0,1,0,1,0,0,0,1,0,0
892,893,0,0,2,2.079442,0,0,1,0,0,1,0,0,0,1,0
893,894,0,0,1,2.369075,0,1,0,0,1,0,0,0,1,0,0
894,895,0,0,1,2.268252,0,0,1,0,0,1,0,0,1,0,0
895,896,0,0,3,1.966238,0,0,1,0,0,1,0,0,0,1,0


In [27]:
# Save the data as Excel files
#  (KNIME: "Excel Writer")
# Training data
df_train.to_excel('../../data/titanic/new/training_v5.xlsx', index=False)
# Test data
df_test.to_excel('../../data/titanic/new/test_v5.xlsx', index=False)