In [5]:
%load_ext autoreload
%autoreload 2

import fd_imputer
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score
import itertools
import matplotlib.pyplot as plt
from sklearn import metrics 

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Set up all paths and labels needed in this notebook

In [6]:
DATA_PATH = 'MLFD_fd_detection/backend/WEB-INF/classes/inputData/adult.csv'
SPLITS_PATH = 'MLFD_fd_detection/data/'
METANOME_DATA_PATH = 'MLFD_fd_detection/backend/WEB-INF/classes/inputData/'
FD_PATH = 'MLFD_fd_detection/results/HyFD-1.2-SNAPSHOT.jar2019-05-07T082200_fds'
DATA_TITLE = 'adult'

In [7]:
df_train = pd.read_csv(SPLITS_PATH+'test/'+DATA_TITLE+'_test.csv', header=None)
df_test = pd.read_csv(SPLITS_PATH+'train/'+DATA_TITLE+'_train.csv', header=None)
fds = fd_imputer.read_fds(FD_PATH)
impute_column = str(9)
df_test = df_test.replace('noValueSetHere123156456', np.nan)
df_train = df_train.replace('noValueSetHere123156456', np.nan)

In [64]:
df_test.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,11358,58,State-gov,123329,HS-grad,9,Never-married,Adm-clerical,Not-in-family,White,Female,0,0,16,United-States,<=50K
1,10859,23,Local-gov,23438,HS-grad,9,Never-married,Prof-specialty,Not-in-family,White,Female,0,0,40,United-States,<=50K
2,30948,41,Private,433989,Assoc-voc,11,Married-civ-spouse,Sales,Husband,White,Male,4386,0,60,United-States,>50K
3,29811,58,Private,183810,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,<=50K
4,18408,47,Self-emp-inc,181130,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,99999,0,50,United-States,>50K


In [63]:
# Waehle eine beliebige FD, auf der datawig regressiert
rhs = list(fds.keys())[0]
lhs = fds[rhs][0]
relevant_cols = lhs + [rhs]

In [62]:
df_train.iloc[:, relevant_cols].head()

Unnamed: 0,0,3
0,22855,113309
1,18912,412156
2,17454,139347
3,14966,223811
4,12289,48859


In [59]:
df_test.iloc[:, relevant_cols].head()

Unnamed: 0,0,3
0,11358,123329
1,10859,23438
2,30948,433989
3,29811,183810
4,18408,181130


Das Problem bzgl. eines geeigneten Fehlermaßes zeigt sich zeigt sich in den folgenden beiden Zellen.

In [55]:
# ml_imputer() ist ein wrapper fuer datawig.SimpleImputer()
df_imputed = fd_imputer.ml_imputer(df_train.iloc[:,relevant_cols], df_test.iloc[:,relevant_cols], rhs)

y_pred = df_imputed.loc[:, str(rhs)+'_imputed']
y_true = df_imputed.loc[:, str(rhs)]

2019-06-04 16:12:13,919 [INFO]  
2019-06-04 16:12:14,149 [INFO]  Epoch[0] Batch [0-184]	Speed: 13148.13 samples/sec	cross-entropy=0.478446	3-accuracy=0.000000
2019-06-04 16:12:14,356 [INFO]  Epoch[0] Train-cross-entropy=0.243092
2019-06-04 16:12:14,357 [INFO]  Epoch[0] Train-3-accuracy=0.000000
2019-06-04 16:12:14,358 [INFO]  Epoch[0] Time cost=0.436
2019-06-04 16:12:14,363 [INFO]  Saved checkpoint to "imputer_model/model-0000.params"
2019-06-04 16:12:14,394 [INFO]  Epoch[0] Validation-cross-entropy=0.002516
2019-06-04 16:12:14,395 [INFO]  Epoch[0] Validation-3-accuracy=0.000000
2019-06-04 16:12:14,622 [INFO]  Epoch[1] Batch [0-184]	Speed: 13176.01 samples/sec	cross-entropy=0.002311	3-accuracy=0.000000
2019-06-04 16:12:14,839 [INFO]  Epoch[1] Train-cross-entropy=0.001421
2019-06-04 16:12:14,840 [INFO]  Epoch[1] Train-3-accuracy=0.000000
2019-06-04 16:12:14,842 [INFO]  Epoch[1] Time cost=0.446
2019-06-04 16:12:14,848 [INFO]  Saved checkpoint to "imputer_model/model-0001.params"
2019-06-

In [60]:
df_imputed.head()

Unnamed: 0,0,3,3_imputed
0,11358,123329,123864.584355
1,10859,23438,24113.20915
2,30948,433989,433814.71468
3,29811,183810,183545.171394
4,18408,181130,181156.385755


In [56]:
{
'precision': metrics.precision_score(y_true, y_pred, average='weighted'),
'recall': metrics.recall_score(y_true, y_pred, average='weighted'),
'f1': metrics.f1_score(y_true, y_pred, average='weighted')
}

ValueError: Classification metrics can't handle a mix of multiclass and continuous targets

Das Problem ließe sich umgehen, indem Werte aus y_pred gerundet werden:

In [36]:
metrics.f1_score(y_true, y_pred.apply(lambda x: round(x)), average='weighted')

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


0.0011578136578136579

Aber das ist glaube ich nicht wirklich sinnvoll - bei Regression geht es ja nicht darum, einen exakten Wert auszugeben, sondern vielmehr darum, den mittleren Fehler der einzelnen Vorhersagen zu minimieren.
So würde ein gutes Modell, dessen Vorhersagen aber immer um ±0.5 neben dem korrekten Wert liegen, einen f1-Score von 0 haben.

Alternativ könnte man relative oder absolute Fehler berechnen.

In [54]:
average_rel_error = abs((y_true.mean() - y_pred.mean()) / y_true.mean())
print("{:10.4f}".format(100*average_rel_error)+'%')

    0.0663%
