## Phase 3 (Data Preparation): v1

* Author: Anna (i3-Versicherung)
* Website: [Data Science Training - Kapitel 3](https://data-science.training/kapitel-3/)
* Date: 23.03.2023

We prepare the data for the subsequent phases. Missing values are imputed as follows:

* Embarked: mode
* Fare: median
* Age: mean
* Cabin: 'Unknown' (as a fixed value)
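
The notebook does not say why the median is used for Fare and the mean for Age. One common reason for preferring the median is robustness against a few very large values. The following minimal sketch illustrates this on hypothetical toy values (not taken from the Titanic data). It also shows that pandas skips missing values by default when computing these statistics.

```python
import pandas as pd

# Hypothetical toy values, not taken from the Titanic data
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 512.33, None])
embarked = pd.Series(["S", "C", "S", None])

# mean(), median() and mode() ignore missing values by default
fare_mean = fare.mean()          # pulled upward by the single large fare
fare_median = fare.median()      # stays close to the typical fare
embarked_mode = embarked.mode()[0]

print("mean  :", fare_mean)
print("median:", fare_median)
print("mode  :", embarked_mode)
```

In this toy series the single large fare moves the mean far away from the typical values, while the median is unaffected.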

In [4]:
# Import the pandas package
#  Data structures and data analysis, I/O
#  https://pandas.pydata.org/pandas-docs/stable/
import pandas as pd

In [5]:
# Load training, test, and application data from CSV files as pandas DataFrames (df)
#  (KNIME: "CSV Reader")
df_train = pd.read_csv('../../data/titanic/original/train.csv')
df_test  = pd.read_csv('../../data/titanic/original/test.csv')
df_app   = pd.read_csv('../../data/titanic/original/application.csv')

In [6]:
# Combine training and test data
#  (KNIME: "Concatenate")
df = pd.concat([df_train, df_test], ignore_index=True)

In [7]:
# Convert data types automatically
df = df.convert_dtypes()
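
As a rough illustration of what `convert_dtypes()` does, the sketch below applies it to a small hypothetical frame (the column values are made up). Columns that only hold whole numbers plus missing values become the nullable `Int64` type, which is why a label column such as `Survived` can stay an integer even though some rows have no value.

```python
import pandas as pd

# Hypothetical toy frame; the missing values force float64 at first
toy = pd.DataFrame({
    "Survived": [0.0, 1.0, None],
    "Name": ["A", "B", "C"],
    "Fare": [7.25, None, 8.05],
})

print(toy.dtypes)

# convert_dtypes() picks the best nullable dtype for each column
converted = toy.convert_dtypes()
print(converted.dtypes)
```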

In [8]:
# Check for missing values
df.isnull().sum()

PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
dtype: int64
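
Absolute counts are easier to judge next to the share of rows they represent. The following sketch, on a hypothetical toy frame, combines both into one summary table; the names `toy` and `missing` are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame with some missing entries
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Sex": ["male", "female", "female", "male"],
})

# isna().sum() counts missing values, isna().mean() gives their share
missing = pd.DataFrame({
    "count": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(1),
})

# Keep only columns that actually have missing values, largest first
missing = missing[missing["count"] > 0].sort_values("count", ascending=False)
print(missing)
```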

In [9]:
# Handle missing values
#  (KNIME: "Missing Values")
#
# Embarked (nominal scale): 2 missing values => use the mode (most frequent value)
embarked_mode  = df['Embarked'].dropna().mode()[0]
df['Embarked'] = df['Embarked'].fillna(embarked_mode)
print('Embarked : ', embarked_mode)
#
# Fare (cardinal scale): 1 missing value => use the median
fare_median = df['Fare'].dropna().median()
df['Fare']  = df['Fare'].fillna(fare_median)
print('Fare     : ', fare_median)
#
# Age (cardinal scale): 263 missing values => use the mean
age_mean  = df['Age'].dropna().mean()
df['Age'] = df['Age'].fillna(age_mean)
print('Age      : ', age_mean)
#
# Cabin (nominal scale): 1014 missing values => use the fixed value 'Unknown'
cabin_fix = 'Unknown'
df['Cabin'] = df['Cabin'].fillna(cabin_fix)
print('Cabin    : ', cabin_fix)

Embarked :  S
Fare     :  14.4542
Age      :  29.881137667304014
Cabin    :  Unknown
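
The four replacements can also be collected in a single dictionary and applied in one `fillna` call. This is only an alternative sketch on hypothetical toy data, not the notebook's own approach; the names `toy`, `fill_values` and `filled` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame with one missing value pattern per column
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", None],
    "Fare": [7.25, 8.05, None, 30.0],
    "Age": [22.0, None, 38.0, 30.0],
    "Cabin": [None, "C85", None, None],
})

# One replacement value per column, following the strategy list above
fill_values = {
    "Embarked": toy["Embarked"].mode()[0],  # most frequent value
    "Fare": toy["Fare"].median(),
    "Age": toy["Age"].mean(),
    "Cabin": "Unknown",                     # fixed value
}

# fillna accepts a dict that maps column names to replacement values
filled = toy.fillna(fill_values)
print(fill_values)
print(filled.isna().sum())
```

Keeping the values in one dictionary makes it easy to log them or to reuse the same values on another data set.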


In [10]:
# Check for missing values again
df.isnull().sum()

PassengerId      0
Survived       418
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin            0
Embarked         0
dtype: int64

In [11]:
# Split the data again
#  (KNIME: "Row Splitter")
df_train = df[df['Survived'].notna()]
df_test  = df[df['Survived'].isna()]
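
Splitting on `Survived` only recovers the original partition if every training row has a label and no test row does. The sketch below checks this on hypothetical toy frames; all names in it are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: the test frame has no label column
train = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
test = pd.DataFrame({"PassengerId": [4, 5]})

# After concatenation, Survived is missing exactly for the test rows
combined = pd.concat([train, test], ignore_index=True)
is_train = combined["Survived"].notna()

train_again = combined[is_train]
test_again = combined[~is_train]

# The split should give back frames of the original sizes
print(len(train_again) == len(train), len(test_again) == len(test))
```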

In [12]:
# Filter out irrelevant attributes
#  (KNIME: "Column Filter")
#
# Training data: PassengerId
df_train = df_train.drop(['PassengerId'], axis=1)
#
# Test data: Survived
df_test = df_test.drop(['Survived'], axis=1)

In [13]:
display(df_train)

| | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | Unknown | S |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | Unknown | S |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | Unknown | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | Unknown | S |
| 887 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 29.881138 | 1 | 2 | W./C. 6607 | 23.45 | Unknown | S |
| 889 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |


In [14]:
display(df_test)

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 891 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | Unknown | Q |
| 892 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | Unknown | S |
| 893 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | Unknown | Q |
| 894 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | Unknown | S |
| 895 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | Unknown | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1304 | 1305 | 3 | Spector, Mr. Woolf | male | 29.881138 | 0 | 0 | A.5. 3236 | 8.05 | Unknown | S |
| 1305 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9 | C105 | C |
| 1306 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.25 | Unknown | S |
| 1307 | 1308 | 3 | Ware, Mr. Frederick | male | 29.881138 | 0 | 0 | 359309 | 8.05 | Unknown | S |


In [15]:
# Save the data as Excel files
#  (KNIME: "Excel Writer")
#
# Training data
df_train.to_excel('../../data/titanic/new/training_v1.xlsx', index=False)
#
# Test data
df_test.to_excel('../../data/titanic/new/test_v1.xlsx', index=False)
#
# Application data
df_app.to_excel('../../data/titanic/new/application_v1.xlsx', index=False)