## Age Phase 3 (Data Preparation): Firstname, MeanAge, MedianAge

* Author: Anna (i3-Versicherung)
* Website: [Data Science Training - Kapitel 12](https://data-science.training/kapitel-12/)
* Date: 23.03.2023

We return to the nominally scaled attribute Name, or more precisely to one part of it: the feature Firstname, from which we derive two further features, MeanAge and MedianAge. We then analyze these features for linear correlation with the attribute Age.

In [4]:
# Import the pandas package
#  Data structures and data analysis, I/O
#  https://pandas.pydata.org/pandas-docs/stable/
import pandas as pd
# Import the NumPy package
#  Multidimensional data structures (vectors, matrices, tensors, arrays), linear algebra
#  https://numpy.org/doc/
import numpy as np

In [5]:
# Load the training and test data from CSV files as pandas data frames (df)
#  (KNIME: "CSV Reader")
df_train = pd.read_csv('../../data/titanic/original/train.csv')
df_test  = pd.read_csv('../../data/titanic/original/test.csv')

In [6]:
# Concatenate the training and test data
#  (KNIME: "Concatenate")
df = pd.concat([df_train, df_test], ignore_index=True)

In [7]:
# Convert the data types automatically
df = df.convert_dtypes()
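
The effect of this conversion is easiest to see on the Survived column, which has no values for the test rows. The following is a minimal sketch on two made-up toy frames (the values are hypothetical stand-ins for train.csv and test.csv), not the notebook's actual data:

```python
import pandas as pd

# Toy frames standing in for train.csv / test.csv (hypothetical values)
train = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1]})
test = pd.DataFrame({'PassengerId': [3]})  # no Survived column, like the test set

combined = pd.concat([train, test], ignore_index=True)
# The missing label forces Survived to a float column (0.0, 1.0, NaN)
print(combined['Survived'].dtype)

converted = combined.convert_dtypes()
# After conversion Survived uses the nullable integer dtype (0, 1, <NA>)
print(converted['Survived'].dtype)
```

This is consistent with the displayed tables in this notebook, where Survived appears as 0 and 1 rather than 0.0 and 1.0.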

In [8]:
# Handle missing values (i.e. estimate them)
#  (KNIME: "Missing Values")
# Fare (cardinal scale): 1 missing value => use the constant value 7.896
#df['Fare'] = df['Fare'].fillna(7.896)
display(df[df['Fare'].isna()])
df.loc[1043, 'Fare'] = 7.896
display(df[df['PassengerId'] == 1044])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1043,1044,,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1043,1044,,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,7.896,,S


In [9]:
# New feature Firstname: the second word after "Surname, " (the word following the title)
#  (KNIME: "Cell Splitter", "Column Rename", "Table Creator", "Cell Replacer")
df['Firstname'] = df['Name'].str.split(', ', expand=True)[1].str.split(' ', expand=True)[1]
df['Firstname'] = df['Firstname'].str.replace('(', '', regex=False)
df['Firstname'] = df['Firstname'].str.replace(')', '', regex=False)
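
To make the two chained splits easier to follow, here is a small sketch that applies the same steps to a handful of names. The first three names appear in the output further down; the fourth, 'Doe, Mrs. (Jane)', is an invented example that only serves to show why the parentheses are removed.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
    'Heikkinen, Miss. Laina',
    'Doe, Mrs. (Jane)',  # hypothetical name, not taken from the data set
])

# Part after "Surname, ", e.g. "Mr. Owen Harris"
after_surname = names.str.split(', ', expand=True)[1]
# Second space-separated token, i.e. the word following the title
firstname = after_surname.str.split(' ', expand=True)[1]
# Remove literal parentheses (regex=False treats them as plain characters)
firstname = firstname.str.replace('(', '', regex=False)
firstname = firstname.str.replace(')', '', regex=False)

print(firstname.tolist())
```

Note that for married women the extracted token can be the husband's first name, as in "Futrelle, Mrs. Jacques Heath (Lily May Peel)", which yields Jacques.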

In [10]:
# Compute the new features MeanAge and MedianAge
#  (KNIME: "GroupBy", "Joiner", "Missing Value", "Column Rename")
#
# Mean age per first name (missing ages are ignored)
mean_age = df.groupby('Firstname', as_index=False)['Age'].mean()
mean_age = mean_age.rename(columns={'Age': 'MeanAge'})
#display(mean_age)
#
# Median age per first name
median_age = df.groupby('Firstname', as_index=False)['Age'].median()
median_age = median_age.rename(columns={'Age': 'MedianAge'})
#display(median_age)
#
# Join both aggregates back onto every passenger row
df = df.merge(  mean_age, how='left', on='Firstname')
df = df.merge(median_age, how='left', on='Firstname')
#
# Optional (disabled): replace remaining missing values with -1
#df['MeanAge']   = df['MeanAge'].fillna(-1)
#df['MedianAge'] = df['MedianAge'].fillna(-1)
#
display(df)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Firstname,MeanAge,MedianAge
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Owen,20.0,20.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,John,36.178571,36.5
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Laina,26.0,26.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Jacques,36.0,36.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,William,32.127119,30.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,1305,,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.05,,S,Woolf,,
1305,1306,,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9,C105,C,Fermina,39.0,39.0
1306,1307,,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S,Simon,38.5,38.5
1307,1308,,3,"Ware, Mr. Frederick",male,,0,0,359309,8.05,,S,Frederick,34.875,34.0
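
The group-and-join pattern used above can also be written with `groupby(...).transform(...)`, which returns one value per original row and so needs no merge. This is an alternative sketch on made-up toy data (the ages are hypothetical), not the notebook's own method:

```python
import pandas as pd

# Toy data: one frequent name, one unique name, one name without any known age
toy = pd.DataFrame({
    'Firstname': ['William', 'William', 'William', 'William', 'Laina', 'Woolf'],
    'Age': [20.0, 30.0, 70.0, None, 26.0, None],
})

# transform() broadcasts the per-group statistic back to every row of the group
toy['MeanAge'] = toy.groupby('Firstname')['Age'].transform('mean')
toy['MedianAge'] = toy.groupby('Firstname')['Age'].transform('median')

print(toy)
```

The toy data mirrors three cases visible in the output above: a passenger with a missing age still receives the statistic of the name group (Frederick), a name with no known age at all stays missing (Woolf), and a name with a single known age simply reproduces that passenger's own age (Laina: 26.0 in all three columns). The last case is worth keeping in mind when reading the correlation with Age.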


In [11]:
# Correlation matrix with Pearson's linear correlation coefficients
#  (KNIME: "Linear Correlation")
def dst_correlation_matrix(df):
    # Select numeric attributes only
    df1 = df.select_dtypes(include=[np.number])
    # Compute the correlation matrix
    corr_matrix = df1.corr(method='pearson')
    # Return the result
    return corr_matrix

In [12]:
# Compute the correlation matrix, keep only the correlations with Age, sort ascending and display
corr_matrix = dst_correlation_matrix(df)['Age'].sort_values()
display(corr_matrix)

Pclass        -0.408106
SibSp         -0.243699
Parch         -0.150917
Survived      -0.077221
PassengerId    0.028814
Fare           0.177280
MedianAge      0.712820
MeanAge        0.726015
Age            1.000000
Name: Age, dtype: float64
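
As a reminder of what these coefficients measure, the sketch below computes Pearson's r once with pandas and once by hand (covariance divided by the product of the standard deviations). The toy values are invented for illustration; only the pandas and NumPy calls are real.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); one Age is missing on purpose
toy = pd.DataFrame({
    'Age':     [22.0, 38.0, 26.0, 35.0, None],
    'MeanAge': [20.0, 36.0, 26.0, 36.0, 30.0],
})

# pandas only uses rows where both values are present
r_pandas = toy['Age'].corr(toy['MeanAge'], method='pearson')

# The same coefficient computed manually on the complete rows
complete = toy.dropna()
x = complete['Age'].to_numpy()
y = complete['MeanAge'].to_numpy()
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r_pandas, r_manual)
```

Both values should agree up to floating-point rounding. Rows with a missing Age do not contribute to the coefficient.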

### Results of the correlation analysis of the new features with the attribute Age

| Attribute | R      |
|-----------|--------|
| MeanAge   | +0.726 |
| MedianAge | +0.713 |
