## Age Phase 3 (Data Preparation): Fare(s)

* Author: Anna (i3-Versicherung)
* Website: [Data Science Training - Kapitel 12](https://data-science.training/kapitel-12/)
* Date: 23.03.2023

We look at the cardinally scaled (metric) attribute Fare and derive several new features from it (feature engineering). We then analyze these features with respect to their linear correlation with the attribute Age.

In [4]:
# Import the pandas package
#  Data structures and data analysis, I/O
#  https://pandas.pydata.org/pandas-docs/stable/
import pandas as pd
# Import the NumPy package
#  Multidimensional data structures (vectors, matrices, tensors, arrays), linear algebra
#  https://numpy.org/doc/
import numpy as np

In [5]:
# Load training and test data from CSV files as pandas DataFrames (df)
#  (KNIME: "CSV Reader")
df_train = pd.read_csv('../../data/titanic/original/train.csv')
df_test  = pd.read_csv('../../data/titanic/original/test.csv')

In [6]:
# Concatenate training and test data
#  (KNIME: "Concatenate")
df = pd.concat([df_train, df_test], ignore_index=True)

In [7]:
# Convert data types automatically
df = df.convert_dtypes()

In [8]:
# Handle missing values (i.e. impute them)
#  (KNIME: "Missing Values")
# Fare (cardinal scale): 1 missing value => use the constant value 7.896
#df['Fare'] = df['Fare'].fillna(7.896)
# Show the affected row, set the value, then show the row again as a check
display(df[df['Fare'].isna()])
df.loc[1043, 'Fare'] = 7.896
display(df[df['PassengerId'] == 1044])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1043,1044,,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1043,1044,,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,7.896,,S
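
The assignment above relies on the hard-coded row label 1043. A slightly more robust variant, sketched here on a small made-up frame (the `demo` data and the name `missing_fare` are illustrative and not part of the Titanic files), selects the rows through the missing-value mask itself:

```python
import pandas as pd

# Toy data frame (hypothetical values) with one missing fare
demo = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Fare': [7.25, None, 8.05],
})

# Boolean mask of the rows whose Fare is missing
missing_fare = demo['Fare'].isna()

# Fill only those rows with the constant used in this notebook
demo.loc[missing_fare, 'Fare'] = 7.896

print(demo)
```

With this form the cell keeps working if the row order or the index of the combined data frame changes.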


In [9]:
# Auxiliary variable TicketCount: number of passengers sharing the same ticket
#  (KNIME: "GroupBy", "Joiner", "Column Rename")
ticketCount = df.groupby('Ticket', as_index=False)['PassengerId'].count()
ticketCount = ticketCount.rename(columns={'PassengerId': 'TicketCount'})
df = df.merge(ticketCount, how='left', on='Ticket')
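
As a possible shortcut, the group size can also be written back to every row directly with `transform`, which avoids the separate rename and merge steps. The following sketch uses a made-up three-row frame (`demo` is hypothetical) and compares both variants:

```python
import pandas as pd

# Toy data (hypothetical): two passengers share ticket 'A', one travels on 'B'
demo = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Ticket': ['A', 'B', 'A'],
})

# Variant 1: groupby + rename + merge, as in the cell above
counts = demo.groupby('Ticket', as_index=False)['PassengerId'].count()
counts = counts.rename(columns={'PassengerId': 'TicketCount'})
merged = demo.merge(counts, how='left', on='Ticket')

# Variant 2: transform broadcasts the group count back to every row
demo['TicketCount'] = demo.groupby('Ticket')['PassengerId'].transform('count')

print(merged['TicketCount'].tolist())
print(demo['TicketCount'].tolist())
```

Both variants should yield the same column; the merge-based version mirrors the KNIME workflow more closely.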

In [10]:
# New feature "FamilySize": siblings/spouses + parents/children + the passenger
#  (KNIME: "Math Formula")
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

In [11]:
# New features "FarePerTicketCount" and "FarePerFamilySize"
#  (KNIME: "Math Formula")
df['FarePerTicketCount'] = df['Fare'] / df['TicketCount']
df['FarePerFamilySize']  = df['Fare'] / df['FamilySize']

In [12]:
# New features "LogFare", "LogFareTC" and "LogFareFS"
#  (KNIME: "Math Formula")
df['LogFare']   = np.log( 1 + df['Fare']               )
df['LogFareTC'] = np.log( 1 + df['FarePerTicketCount'] )
df['LogFareFS'] = np.log( 1 + df['FarePerFamilySize']  )
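
The offset of 1 inside the logarithm keeps a fare of 0 finite, since log(0) would be minus infinity. NumPy offers `np.log1p` for exactly this expression; the short sketch below (with illustrative fare values, not taken from the data set) checks that both formulations agree:

```python
import numpy as np
import pandas as pd

# Illustrative fares, including a zero fare
fares = pd.Series([0.0, 7.896, 512.3292])

# Formulation used in the notebook
log_plain = np.log(1 + fares)

# Equivalent NumPy helper
log_1p = np.log1p(fares)

# Both results agree up to floating-point precision
same = np.allclose(log_plain, log_1p)

# A fare of 0 is mapped to 0 rather than to -inf
zero_case = log_1p.iloc[0]

print(same, zero_case)
```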

In [13]:
# Correlation matrix with Pearson's linear correlation coefficients
#  (KNIME: "Linear Correlation")
def dst_correlation_matrix(df):
    # Select numeric attributes only
    df1 = df.select_dtypes(include=[np.number])
    # Compute the correlation matrix
    corr_matrix = df1.corr(method='pearson')
    # Return the result
    return corr_matrix

In [14]:
# Compute the correlation matrix and display its Age column, sorted in ascending order
corr_matrix = dst_correlation_matrix(df)['Age'].sort_values()
display(corr_matrix)

Pclass               -0.408106
SibSp                -0.243699
FamilySize           -0.240229
TicketCount          -0.185284
Parch                -0.150917
Survived             -0.077221
PassengerId           0.028814
Fare                  0.177280
LogFare               0.192259
FarePerFamilySize     0.192356
LogFareFS             0.342118
FarePerTicketCount    0.359813
LogFareTC             0.411531
Age                   1.000000
Name: Age, dtype: float64
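
To make the coefficient more tangible, the sketch below recomputes Pearson's r by hand on a small made-up sample (the `demo` values are hypothetical) and compares it with the pandas result. It also illustrates that pandas leaves out rows with a missing value in either column, which matters here because Age is incomplete:

```python
import numpy as np
import pandas as pd

# Toy sample (hypothetical values) with one missing Age
demo = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0, 54.0],
    'LogFareTC': [2.1, 3.6, 2.2, 3.3, 3.9],
})

# pandas: rows with a missing value in either column are excluded
r_pandas = demo['Age'].corr(demo['LogFareTC'], method='pearson')

# Manual computation on the complete rows only
complete = demo.dropna()
x = complete['Age'].to_numpy()
y = complete['LogFareTC'].to_numpy()
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r_pandas, r_manual)
```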

### Results of the correlation analysis of the new features against the attribute Age

| Attribute          | R (Pearson) |
|--------------------|-------------|
| LogFareTC          | +0.412      |
| FarePerTicketCount | +0.360      |
| LogFareFS          | +0.342      |
| FarePerFamilySize  | +0.192      |
| LogFare            | +0.192      |
| Fare               | +0.177      |

Among the Fare-based features, the new feature LogFareTC shows the strongest linear correlation with the attribute Age (R = +0.412), more than twice that of the raw Fare (R = +0.177). We will therefore use LogFareTC.
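
The selection can also be expressed in code. The following sketch copies the Fare-related coefficients from the output of cell In [14] into a Series (the names `corr_with_age`, `ranked` and `best_feature` are introduced here for illustration only) and ranks them by absolute value, so that a strong negative correlation would be considered as well:

```python
import pandas as pd

# Correlations with Age, copied from the output of cell In [14]
corr_with_age = pd.Series({
    'Fare': 0.177280,
    'LogFare': 0.192259,
    'FarePerFamilySize': 0.192356,
    'LogFareFS': 0.342118,
    'FarePerTicketCount': 0.359813,
    'LogFareTC': 0.411531,
})

# Rank by absolute value, strongest correlation first
ranked = corr_with_age.abs().sort_values(ascending=False)

# Name of the feature with the strongest linear correlation to Age
best_feature = ranked.index[0]

print(best_feature)  # LogFareTC
```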