## Phase 6 (Deployment): Preparation (v3: string)

* Author: Anna (i3-Versicherung)
* Website: [Data Science Training - Kapitel 6](https://data-science.training/kapitel-6/)
* Date: 23.03.2023

The application data that we want to use in Phase 6 (Deployment) must also be preprocessed in the same way as in Phase 3 (Data Preparation).

#### Note

We will not create the features KnownCabin and Title at all, because they would be filtered out again later anyway (see Phase 3).

In [4]:
# Import the pandas package
#  Data structures and data analysis, I/O
#  https://pandas.pydata.org/pandas-docs/stable/
import pandas as pd

In [5]:
# Load the application data from a CSV file into a pandas DataFrame (df)
#  (KNIME: "CSV Reader")
df = pd.read_csv('../../data/titanic/original/application.csv')

In [6]:
# Convert data types automatically
df = df.convert_dtypes()

In [7]:
# Check for missing values
df.isnull().sum()

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
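
The check above reports no missing values for this particular file, but another application file may look different. A small guard can make that assumption explicit. The following is only a minimal sketch: the helper name `columns_with_missing_values` and the sample data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def columns_with_missing_values(df: pd.DataFrame) -> list[str]:
    """Return the names of all columns that contain at least one missing value."""
    missing_counts = df.isnull().sum()
    return missing_counts[missing_counts > 0].index.tolist()

# Hypothetical sample data (not the Titanic application file)
sample = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Age': [22, None, 35],
    'Embarked': ['S', 'C', None],
}).convert_dtypes()

incomplete = columns_with_missing_values(sample)
print(incomplete)
```

An empty list means the later steps can rely on complete data; otherwise the listed columns would need the same imputation that was used in Phase 3.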

In [8]:
# New feature: Child
#  (KNIME: "Rule Engine")
df['Child'] = (df['Age'] < 12).fillna(False).astype('int')
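
To see how this expression treats the boundary value and a missing age, it can be tried on a few values. The ages below are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical ages, including the boundary value 12 and a missing value
ages = pd.Series([5, 12, 30, pd.NA], dtype='Int64')

# Same expression as in the notebook: the comparison yields <NA> for the
# missing age, fillna(False) turns that into False, astype('int') into 0
child = (ages < 12).fillna(False).astype('int')
print(child.tolist())
```

A passenger aged exactly 12 is not counted as a child, and a passenger with an unknown age is treated as an adult.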

In [9]:
# New feature: FamilySizeBinned
#  (KNIME: "Math Formula", "Table Creator", "Binner (Dictionary)")
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
bins   = [0, 2, 5, 99]
labels = ['No', 'Small', 'Large']
df['FamilySizeBinned'] = pd.cut(df['FamilySize'], bins, right=False, labels=labels)
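
Because `right=False` is set, each bin includes its left edge and excludes its right edge. The short sketch below uses the same bins and labels on hypothetical family sizes to show where the boundaries fall.

```python
import pandas as pd

bins   = [0, 2, 5, 99]
labels = ['No', 'Small', 'Large']

# Hypothetical family sizes that sit on or next to the bin edges
family_sizes = pd.Series([1, 2, 4, 5, 11])

# right=False gives the intervals [0, 2), [2, 5) and [5, 99)
binned = pd.cut(family_sizes, bins, right=False, labels=labels)
print(binned.tolist())
```

A passenger travelling alone (FamilySize 1) falls into 'No', sizes 2 to 4 into 'Small', and sizes from 5 upwards into 'Large'.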

In [10]:
# New feature: FareBinned
#  (KNIME: "Table Creator", "Binner (Dictionary)")
bins   = [-1, 8, 16, 32, 1024]
labels = ['Low', 'Medium', 'High', 'VeryHigh']
df['FareBinned'] = pd.cut(df['Fare'], bins, right=False, labels=labels)
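
Values outside the outermost bin edges receive no label from `pd.cut` and end up as missing values. The sketch below uses the same bins on hypothetical fares to illustrate this; whether such fares can occur in real application data is an open assumption.

```python
import pandas as pd

bins   = [-1, 8, 16, 32, 1024]
labels = ['Low', 'Medium', 'High', 'VeryHigh']

# Hypothetical fares: values near the bin edges plus one on the last edge
fares = pd.Series([0.0, 7.99, 8.0, 31.99, 32.0, 1024.0])

fare_binned = pd.cut(fares, bins, right=False, labels=labels)

# Values outside [-1, 1024) receive no label and show up as missing
unbinned = fares[fare_binned.isna()]
print(fare_binned.tolist())
print(unbinned.tolist())
```

Checking `fare_binned.isna()` after binning is a cheap way to notice fares that the bin definition does not cover.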

In [11]:
# Show the result of the feature engineering
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   PassengerId       23 non-null     Int64   
 1   Pclass            23 non-null     Int64   
 2   Name              23 non-null     string  
 3   Sex               23 non-null     string  
 4   Age               23 non-null     Int64   
 5   SibSp             23 non-null     Int64   
 6   Parch             23 non-null     Int64   
 7   Ticket            23 non-null     string  
 8   Fare              23 non-null     Float64 
 9   Cabin             23 non-null     string  
 10  Embarked          23 non-null     string  
 11  Child             23 non-null     int32   
 12  FamilySize        23 non-null     Int64   
 13  FamilySizeBinned  23 non-null     category
 14  FareBinned        23 non-null     category
dtypes: Float64(1), Int64(6), category(2), int32(1), string(5)
memory usage: 2.9 KB

In [12]:
# Clean-up: filter out attributes (manually)
#  (KNIME: "Column Filter")
df = df.drop(['Name', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'FamilySize', 'Ticket'], axis=1)
# Show the result
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   PassengerId       23 non-null     Int64   
 1   Pclass            23 non-null     Int64   
 2   Sex               23 non-null     string  
 3   Embarked          23 non-null     string  
 4   Child             23 non-null     int32   
 5   FamilySizeBinned  23 non-null     category
 6   FareBinned        23 non-null     category
dtypes: Int64(2), category(2), int32(1), string(2)
memory usage: 1.4 KB
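
After filtering, a quick comparison against the expected column list can catch a forgotten or misspelled column name. The sketch below assumes that the seven columns shown in the output above are the expected ones; `EXPECTED_COLUMNS` and `check_columns` are hypothetical names that do not appear in the original notebook.

```python
import pandas as pd

# Assumption: the seven columns from the output above are the expected ones
EXPECTED_COLUMNS = ['PassengerId', 'Pclass', 'Sex', 'Embarked',
                    'Child', 'FamilySizeBinned', 'FareBinned']

def check_columns(df: pd.DataFrame, expected: list[str]) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) column names relative to the expected list."""
    missing = [c for c in expected if c not in df.columns]
    unexpected = [c for c in df.columns if c not in expected]
    return missing, unexpected

# Hypothetical frame that still carries 'Ticket' and lacks 'FareBinned'
sample = pd.DataFrame(columns=['PassengerId', 'Pclass', 'Sex', 'Embarked',
                               'Child', 'FamilySizeBinned', 'Ticket'])
missing, unexpected = check_columns(sample, EXPECTED_COLUMNS)
print(missing, unexpected)
```

Two empty lists indicate that the data frame has exactly the expected columns.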


In [13]:
# Version 3: data type string
df = df.astype('string')
df['PassengerId'] = df['PassengerId'].astype('int') # Exception: primary key attribute
# Show the result
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   PassengerId       23 non-null     int32 
 1   Pclass            23 non-null     string
 2   Sex               23 non-null     string
 3   Embarked          23 non-null     string
 4   Child             23 non-null     string
 5   FamilySizeBinned  23 non-null     string
 6   FareBinned        23 non-null     string
dtypes: int32(1), string(6)
memory usage: 1.3 KB
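
The effect of this conversion can be reproduced on a small example: integer codes and categories become text, and only the key column is converted back to an integer. The two-row frame below is hypothetical and only mirrors the dtypes used at this stage.

```python
import pandas as pd

# Hypothetical two-row frame with the dtypes used at this stage
sample = pd.DataFrame({
    'PassengerId': pd.Series([1310, 1311], dtype='Int64'),
    'Pclass': pd.Series([1, 2], dtype='Int64'),
    'FareBinned': pd.Categorical(['Low', 'High']),
})

# Convert everything to string, then restore the key column to an integer
sample = sample.astype('string')
sample['PassengerId'] = sample['PassengerId'].astype('int')

print(sample['Pclass'].tolist())
print(sample['FareBinned'].tolist())
print(sample['PassengerId'].tolist())
```

After the conversion, `Pclass` holds the text values '1' and '2' rather than numbers, while `PassengerId` remains numeric.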


In [14]:
display(df.head())

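| | PassengerId | Pclass | Sex | Embarked | Child | FamilySizeBinned | FareBinned |
|---|---|---|---|---|---|---|---|
| 0 | 1310 | 1 | male | C | 0 | No | Low |
| 1 | 1311 | 1 | male | C | 0 | No | High |
| 2 | 1312 | 2 | male | S | 0 | No | High |
| 3 | 1313 | 1 | male | S | 0 | No | Low |
| 4 | 1314 | 1 | male | S | 0 | Small | Low |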


In [15]:
# Save the data as an Excel file
#  (KNIME: "Excel Writer")
df.to_excel('../../data/titanic/new/application_v3.xlsx', index=False)