## Phase 3 (Data Preparation): v3 (string)

* Author: Anna (i3-Versicherung)
* Website: [Data Science Training - Kapitel 5](https://data-science.training/kapitel-5/)
* Date: 23.03.2023

We carry out an improved data preparation.

* The Cabin value for PassengerId 873 is deleted
* New feature: KnownCabin (1 = cabin is known, 0 = cabin is unknown [Cabin: missing value, i.e. NaN])
* New feature: Child (Age < 12)
* Imputing missing Embarked values: mode 'S'
* Imputing missing Fare values: constant value 7.896
* New feature: Title (from Name)
* New feature: FamilySizeBinned (from SibSp and Parch)
* New feature: FareBinned (from Fare)

The goal of version 3 is to create attributes (features) that are categorical (i.e. on a nominal or ordinal scale) and therefore of data type "string".

Finally, we use a correlation analysis to filter out attributes (features) whose pairwise correlation coefficient exceeds the threshold of 0.75. Above this threshold, the two attributes are strongly dependent on each other.
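
The custom module `dst_correlation_functions` used further down is not shown in this notebook, so the exact measure it computes is unknown. As an illustration only, the following sketch computes Cramér's V, one common association measure for two categorical attributes, without any bias or continuity correction. The function name `cramers_v` and the toy data are assumptions and are not taken from the module or from the Titanic files, so the values need not match those reported by the module.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Uncorrected Cramér's V for two categorical series (illustrative sketch)."""
    # Contingency table of observed counts; rows with a missing value are ignored
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    # Normalise by the smaller table dimension so the result lies in [0, 1]
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0


# Hypothetical toy data
toy = pd.DataFrame({
    'Sex':   ['male', 'female', 'male', 'female', 'male', 'female'],
    'Title': ['Mr',   'Mrs',    'Mr',   'Miss',   'Mr',   'Mrs'],
    'Port':  ['S',    'S',      'C',    'C',      'S',    'C'],
})

v_sex_title = cramers_v(toy['Sex'], toy['Title'])  # Title determines Sex here
v_sex_port = cramers_v(toy['Sex'], toy['Port'])    # only a weak association
print(round(v_sex_title, 3), round(v_sex_port, 3))
```

In the toy data every title belongs to exactly one sex, so the first value reaches the maximum, while the second stays well below the 0.75 threshold.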

In [4]:
# Import the pandas package
#  Data structures and data analysis, I/O
#  https://pandas.pydata.org/pandas-docs/stable/
import pandas as pd
# Import the NumPy package
#  Multidimensional data structures (vectors, matrices, tensors, arrays), linear algebra
#  https://numpy.org/doc/
import numpy as np
# Import custom modules
#  for computing the correlation coefficients
import sys
sys.path.append('../00_DST_Module/')
import dst_correlation_functions as cf

In [5]:
# Load training and test data from CSV files as pandas DataFrames (df)
#  (KNIME: "CSV Reader")
df_train = pd.read_csv('../../data/titanic/original/train.csv')
df_test  = pd.read_csv('../../data/titanic/original/test.csv')

In [6]:
# Concatenate training and test data
#  (KNIME: "Concatenate")
df = pd.concat([df_train, df_test], ignore_index=True)

In [7]:
# Convert data types automatically (to pandas' nullable dtypes)
df = df.convert_dtypes()

In [8]:
# Check for missing values
df.isnull().sum()

PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
dtype: int64

In [9]:
# Delete the incorrect cabin number
#  (KNIME: "Rule Engine")
display(df[df['Cabin'] == 'B51 B53 B55'])
display(df[df['PassengerId'] == 873])
df.loc[872, 'Cabin'] = np.nan
display(df[df['PassengerId'] == 873])

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 679 | 680 | 1.0 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C |
| 872 | 873 | 0.0 | 1 | Carlsson, Mr. Frans Olof | male | 33.0 | 0 | 0 | 695 | 5.0 | B51 B53 B55 | S |
| 1234 | 1235 | | 1 | Cardeza, Mrs. James Warburton Martinez (Charlo... | female | 58.0 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C |

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 872 | 873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33.0 | 0 | 0 | 695 | 5.0 | B51 B53 B55 | S |

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 872 | 873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33.0 | 0 | 0 | 695 | 5.0 | | S |
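
The cell above addresses the row by its label 872, which matches PassengerId 873 only as long as the concatenated frame keeps its default index. A slightly more defensive variant selects the row by the key value itself. The following is a minimal sketch on a hypothetical three-row excerpt (`demo` is not part of the notebook):

```python
import pandas as pd

# Hypothetical excerpt with only the two columns that matter here
demo = pd.DataFrame({
    'PassengerId': [680, 873, 1235],
    'Cabin': ['B51 B53 B55', 'B51 B53 B55', 'B51 B53 B55'],
}).convert_dtypes()

# Select the row by its key value instead of by a hard-coded row label
mask = demo['PassengerId'] == 873
demo.loc[mask, 'Cabin'] = pd.NA

print(demo)
```

Only the row with PassengerId 873 loses its cabin entry; the other two passengers keep theirs.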


In [10]:
# New feature: KnownCabin
#  (KNIME: "Rule Engine")
df['KnownCabin'] = (df['Cabin'].notna()).astype('int')

In [11]:
# New feature: Child
#  (KNIME: "Rule Engine")
df['Child'] = (df['Age'] < 12).fillna(False).astype('int')
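
Age has 263 missing values, so the comparison `Age < 12` is undefined for those passengers. The sketch below shows how the expression behaves on a nullable `Float64` column such as the one produced by `convert_dtypes()`. The series `ages` is hypothetical.

```python
import pandas as pd

# Hypothetical ages; pd.NA stands for a missing value
ages = pd.Series([4.0, 11.0, 12.0, 35.0, pd.NA], dtype='Float64')

# Nullable boolean result: True / False / <NA>
is_child = ages < 12
# Missing ages are counted as "not a child" before converting to 0/1
child = is_child.fillna(False).astype('int')

print(is_child.tolist())
print(child.tolist())
```

With this encoding a passenger of unknown age gets the same value as an adult, which is a modelling decision rather than a property of the data.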

In [12]:
# Handle (i.e. impute) missing values - part 1
#  (KNIME: "Missing Values")
# Embarked (nominal scale): 2 missing values => use the mode (most frequent value)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].dropna().mode()[0])
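
As a small illustration of the mode imputation above, the following sketch uses a hypothetical series of port codes (`embarked` is not the real column):

```python
import pandas as pd

# Hypothetical port codes with two missing entries
embarked = pd.Series(['S', 'C', pd.NA, 'S', 'Q', pd.NA, 'S'], dtype='string')

# mode() ignores missing values by default and returns a Series,
# because several values can tie for first place; [0] picks the first one
most_frequent = embarked.mode()[0]
filled = embarked.fillna(most_frequent)

print(most_frequent)
print(filled.tolist())
```

Because `mode()` already skips missing values, the explicit `dropna()` in the cell above is harmless but not strictly required.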

In [13]:
# Handle (i.e. impute) missing values - part 2
#  (KNIME: "Missing Values")
# Fare (metric scale): 1 missing value => use the constant value 7.896
#df['Fare'] = df['Fare'].fillna(7.896)
display(df[df['Fare'].isna()])
df.loc[1043, 'Fare'] = 7.896
display(df[df['PassengerId'] == 1044])

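| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | KnownCabin | Child |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1043 | 1044 | | 3 | Storey, Mr. Thomas | male | 60.5 | 0 | 0 | 3701 | | | S | 0 | 0 |

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | KnownCabin | Child |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1043 | 1044 | | 3 | Storey, Mr. Thomas | male | 60.5 | 0 | 0 | 3701 | 7.896 | | S | 0 | 0 |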


In [14]:
# Note: we drop the features Age and Cabin later on.
# We therefore do not impute missing values for these features.

In [15]:
# New feature: Title
#  (KNIME: "Cell Splitter", "Column Rename", "Table Creator", "Cell Replacer")
df['Title'] = df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
df['Title'] = df['Title'].replace(['Ms', 'Mlle'], 'Miss')
df['Title'] = df['Title'].replace(['Mme', 'Lady', 'the Countess', 'Dona'], 'Mrs')
df['Title'] = df['Title'].replace(['Dr', 'Col', 'Major', 'Jonkheer', 'Capt', 'Sir', 'Don', 'Rev'], 'Rare')
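
The two chained `str.split` calls above take the text between the comma and the first period. An alternative sketch with a single regular expression is shown below. The last two names and the shortened `mapping` are hypothetical and serve only to illustrate the idea.

```python
import pandas as pd

# Names in the "Surname, Title. Given names" layout used by the dataset
names = pd.Series([
    'Carlsson, Mr. Frans Olof',
    'Cardeza, Mr. Thomas Drake Martinez',
    'Storey, Mr. Thomas',
    'Example, the Countess. of Somewhere',   # hypothetical name
    'Example, Mlle. Jeanne',                 # hypothetical name
], dtype='string')

# Capture everything between ", " and the first "."
title = names.str.extract(r',\s*([^.]+)\.', expand=False)

# Same grouping idea as in the cell above, written as one mapping (shortened)
mapping = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs', 'the Countess': 'Mrs'}
title = title.replace(mapping)

print(title.tolist())
```

Titles that do not appear in the mapping, such as `Mr`, pass through unchanged.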

In [16]:
# New feature: FamilySizeBinned
#  (KNIME: "Math Formula", "Table Creator", "Binner (Dictionary)")
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
bins   = [0, 2, 5, 99]
labels = ['No', 'Small', 'Large']
df['FamilySizeBinned'] = pd.cut(df['FamilySize'], bins, right=False, labels=labels)

In [17]:
# New feature: FareBinned
#  (KNIME: "Table Creator", "Binner (Dictionary)")
bins   = [-1, 8, 16, 32, 1024]
labels = ['Low', 'Medium', 'High', 'VeryHigh']
df['FareBinned'] = pd.cut(df['Fare'], bins, right=False, labels=labels)
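
With `right=False`, `pd.cut` builds left-closed intervals, so a value that lies exactly on an edge falls into the higher bin. The sketch below reuses the bin edges and labels from the two cells above on hypothetical values chosen to sit on or next to the edges.

```python
import pandas as pd

# Bin edges and labels copied from the two cells above
family_bins, family_labels = [0, 2, 5, 99], ['No', 'Small', 'Large']
fare_bins, fare_labels = [-1, 8, 16, 32, 1024], ['Low', 'Medium', 'High', 'VeryHigh']

# Hypothetical values on or next to the bin edges
family_size = pd.Series([1, 2, 4, 5])
fare = pd.Series([7.896, 8.0, 15.99, 16.0, 512.3292])

# right=False makes every interval left-closed, e.g. [0, 2), [2, 5), [5, 99)
family_binned = pd.cut(family_size, family_bins, right=False, labels=family_labels)
fare_binned = pd.cut(fare, fare_bins, right=False, labels=fare_labels)

print(family_binned.tolist())
print(fare_binned.tolist())
```

A passenger travelling alone (family size 1) is labelled `No`, and the imputed fare of 7.896 falls into `Low`, whereas a fare of exactly 8 already counts as `Medium`.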

In [18]:
# Show the result of the feature engineering
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   PassengerId       1309 non-null   Int64   
 1   Survived          891 non-null    Int64   
 2   Pclass            1309 non-null   Int64   
 3   Name              1309 non-null   string  
 4   Sex               1309 non-null   string  
 5   Age               1046 non-null   Float64 
 6   SibSp             1309 non-null   Int64   
 7   Parch             1309 non-null   Int64   
 8   Ticket            1309 non-null   string  
 9   Fare              1309 non-null   Float64 
 10  Cabin             294 non-null    string  
 11  Embarked          1309 non-null   string  
 12  KnownCabin        1309 non-null   int32   
 13  Child             1309 non-null   int32   
 14  Title             1309 non-null   string  
 15  FamilySize        1309 non-null   Int64   
 16  FamilySizeBinned  1309 n

### Interim result

We have created many new features. We now filter out the attributes that served as the basis for these new features, namely:

* Name   (replaced by Title)
* Age    (replaced by Child; also has missing values)
* SibSp  (replaced by FamilySize)
* Parch  (replaced by FamilySize)
* Fare   (replaced by FareBinned)
* Cabin  (replaced by KnownCabin; also has missing values)

The same applies to one of the newly created features:

* FamilySize (replaced by FamilySizeBinned)

Finally, the attribute Ticket carries no meaning for the data analysis; it is considered irrelevant and can be filtered out as well:

* Ticket (irrelevant)

In [20]:
# Clean up: filter out attributes (manually)
#  (KNIME: "Column Filter")
df = df.drop(['Name', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'FamilySize', 'Ticket'], axis=1)
# Show the result
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   PassengerId       1309 non-null   Int64   
 1   Survived          891 non-null    Int64   
 2   Pclass            1309 non-null   Int64   
 3   Sex               1309 non-null   string  
 4   Embarked          1309 non-null   string  
 5   KnownCabin        1309 non-null   int32   
 6   Child             1309 non-null   int32   
 7   Title             1309 non-null   string  
 8   FamilySizeBinned  1309 non-null   category
 9   FareBinned        1309 non-null   category
dtypes: Int64(3), category(2), int32(2), string(3)
memory usage: 78.4 KB


In [21]:
# Version 3: data type string
df = df.astype('string')
df['PassengerId'] = df['PassengerId'].astype('int') # Exception: primary key attribute
# Show the result
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   PassengerId       1309 non-null   int32 
 1   Survived          891 non-null    string
 2   Pclass            1309 non-null   string
 3   Sex               1309 non-null   string
 4   Embarked          1309 non-null   string
 5   KnownCabin        1309 non-null   string
 6   Child             1309 non-null   string
 7   Title             1309 non-null   string
 8   FamilySizeBinned  1309 non-null   string
 9   FareBinned        1309 non-null   string
dtypes: int32(1), string(9)
memory usage: 97.3 KB
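
The `info()` output above still reports 891 non-null values for Survived, so the conversion to `string` keeps missing values as missing instead of turning them into text. A minimal sketch of this behaviour on a hypothetical two-column excerpt (`demo` is an assumption):

```python
import pandas as pd

# Hypothetical excerpt: nullable integer target and a categorical feature
demo = pd.DataFrame({
    'Survived': pd.array([0, 1, pd.NA], dtype='Int64'),
    'FareBinned': pd.Categorical(['Low', 'VeryHigh', 'Medium']),
})

as_text = demo.astype('string')

print(as_text.dtypes)
# The missing value stays missing; it does not become the text "<NA>"
print(as_text['Survived'].isna().tolist())
```

This matters for the later split into training and test data, which relies on `notna()` and `isna()` of the Survived column.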


In [22]:
# Correlations: categorical attributes
corr_matrix = cf.dst_categorical_correlation_matrix(df)
display(corr_matrix)
#
corr_measures = cf.dst_correlation_measures_filtered(corr_matrix)
display(corr_measures)

| | Survived | Pclass | Sex | Embarked | KnownCabin | Child | Title | FamilySizeBinned | FareBinned |
|---|---|---|---|---|---|---|---|---|---|
| Survived | 1.0 | 0.195107 | 0.445849 | 0.105109 | 0.262437 | 0.096575 | 0.330078 | 0.167747 | 0.167388 |
| Pclass | 0.195107 | 1.0 | 0.118532 | 0.276939 | 0.776096 | 0.11639 | 0.181315 | 0.178457 | 0.578515 |
| Sex | 0.445849 | 0.118532 | 0.998333 | 0.114465 | 0.134244 | 0.049993 | 0.997245 | 0.282279 | 0.220821 |
| Embarked | 0.105109 | 0.276939 | 0.114465 | 1.0 | 0.275706 | 0.031544 | 0.158759 | 0.139615 | 0.274292 |
| KnownCabin | 0.262437 | 0.776096 | 0.134244 | 0.275706 | 0.997805 | 0.041604 | 0.192596 | 0.210096 | 0.598643 |
| Child | 0.096575 | 0.11639 | 0.049993 | 0.031544 | 0.041604 | 0.99409 | 0.669007 | 0.384224 | 0.219413 |
| Title | 0.330078 | 0.181315 | 0.997245 | 0.158759 | 0.192596 | 0.669007 | 1.0 | 0.387857 | 0.199751 |
| FamilySizeBinned | 0.167747 | 0.178457 | 0.282279 | 0.139615 | 0.210096 | 0.384224 | 0.387857 | 1.0 | 0.388928 |
| FareBinned | 0.167388 | 0.578515 | 0.220821 | 0.274292 | 0.598643 | 0.219413 | 0.199751 | 0.388928 | 1.0 |


Survived    Survived      1.000000
Pclass      Pclass        1.000000
Sex         Sex           0.998333
KnownCabin  KnownCabin    0.997805
Sex         Title         0.997245
Child       Child         0.994090
Pclass      KnownCabin    0.776096
dtype: float64

### Conclusions

There is a strong association between the attributes (features) Sex and Title (0.997) and between Pclass and KnownCabin (0.776); both coefficients exceed the threshold of 0.75. We can therefore eliminate one feature from each pair. We choose to keep the original attributes (Sex and Pclass) and to filter out the new features (Title and KnownCabin).
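
The implementation of `cf.dst_correlation_measures_filtered` is not shown in this notebook. The following sketch illustrates one way to list the attribute pairs above a threshold with plain pandas and NumPy. Unlike the module's output, it skips the diagonal (each attribute paired with itself). The four-attribute matrix uses coefficients copied from the correlation matrix above; the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Four attributes and their coefficients, copied from the correlation matrix
cols = ['Pclass', 'Sex', 'KnownCabin', 'Title']
corr = pd.DataFrame(
    [[1.0,      0.118532, 0.776096, 0.181315],
     [0.118532, 0.998333, 0.134244, 0.997245],
     [0.776096, 0.134244, 0.997805, 0.192596],
     [0.181315, 0.997245, 0.192596, 1.0]],
    index=cols, columns=cols,
)

threshold = 0.75
# Keep only the cells strictly above the diagonal, so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()
strong = pairs[pairs > threshold].sort_values(ascending=False)

print(strong)
```

The two remaining pairs, (Sex, Title) and (Pclass, KnownCabin), are the ones discussed in the conclusions.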

In [24]:
# Clean up: filter out attributes (manually)
#  (KNIME: "Column Filter")
df = df.drop(['Title', 'KnownCabin'], axis=1)
# Show the result
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   PassengerId       1309 non-null   int32 
 1   Survived          891 non-null    string
 2   Pclass            1309 non-null   string
 3   Sex               1309 non-null   string
 4   Embarked          1309 non-null   string
 5   Child             1309 non-null   string
 6   FamilySizeBinned  1309 non-null   string
 7   FareBinned        1309 non-null   string
dtypes: int32(1), string(7)
memory usage: 76.8 KB


In [25]:
# Split the data again
#  (KNIME: "Row Splitter")
df_train = df[df['Survived'].notna()]
df_test  = df[df['Survived'].isna()]

In [26]:
# Filter out irrelevant attributes
#  (KNIME: "Column Filter")
# Training data: PassengerId
df_train = df_train.drop(['PassengerId'], axis=1)
# Test data: Survived
df_test = df_test.drop(['Survived'], axis=1)
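
The split relies on Survived being missing exactly for the rows that came from the test file. A minimal sketch of the same pattern with a small sanity check, on a hypothetical combined frame (`demo`, `train` and `test` are not the notebook's variables):

```python
import pandas as pd

# Hypothetical combined frame: the last two rows come from the test set
demo = pd.DataFrame({
    'PassengerId': [1, 2, 892, 893],
    'Survived': pd.array(['0', '1', pd.NA, pd.NA], dtype='string'),
    'Sex': pd.array(['male', 'female', 'male', 'female'], dtype='string'),
})

has_label = demo['Survived'].notna()
train = demo[has_label].drop(columns=['PassengerId'])
test = demo[~has_label].drop(columns=['Survived'])

# Every row must end up in exactly one of the two parts
assert len(train) + len(test) == len(demo)
print(train.shape, test.shape)
```

The training part keeps the target but not the key, while the test part keeps the key so that predictions can be matched to passengers later.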

In [27]:
display(df_train.head())

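| | Survived | Pclass | Sex | Embarked | Child | FamilySizeBinned | FareBinned |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | S | 0 | Small | Low |
| 1 | 1 | 1 | female | C | 0 | Small | VeryHigh |
| 2 | 1 | 3 | female | S | 0 | No | Low |
| 3 | 1 | 1 | female | S | 0 | Small | VeryHigh |
| 4 | 0 | 3 | male | S | 0 | No | Medium |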


In [28]:
display(df_test.head())

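| | PassengerId | Pclass | Sex | Embarked | Child | FamilySizeBinned | FareBinned |
|---|---|---|---|---|---|---|---|
| 891 | 892 | 3 | male | Q | 0 | No | Low |
| 892 | 893 | 3 | female | S | 0 | Small | Low |
| 893 | 894 | 2 | male | Q | 0 | No | Medium |
| 894 | 895 | 3 | male | S | 0 | No | Medium |
| 895 | 896 | 3 | female | S | 0 | Small | Medium |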


In [29]:
# Save the data as Excel files
#  (KNIME: "Excel Writer")
# Training data
df_train.to_excel('../../data/titanic/new/training_v3.xlsx', index=False)
# Test data
df_test.to_excel('../../data/titanic/new/test_v3.xlsx', index=False)