<h1>Solution to exercise 2</h1>

In [None]:
# Importing necessary libraries
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.impute import SimpleImputer
import pandas as pd

# Use a fixed seed (42) so that the results are reproducible
RANDOM_VARIABLE = 42

In [None]:
# Loading train and test set
income_train = pd.read_csv("income.csv")
income_test = pd.read_csv("income_test.csv")

# Dropping examples with outlier values.
# Only 159 examples have capital-gain > 90999, 44 have capital-loss > 2500, 146 have hours-per-week > 85
# and 142 have fnlwgt > 6*10^5, out of 32561 examples in the training set.
indexNames = income_train[
    (income_train['capital-gain'] > 90999) |
    (income_train['capital-loss'] > 2500) |
    (income_train['hours-per-week'] > 85) |
    (income_train['fnlwgt'] > 6e+5)
].index
income_train.drop(indexNames, inplace=True, axis=0)

# Separate the income attribute to y_train and y_test.
X_train = income_train.drop("income", axis=1)
y_train = income_train["income"]

X_test = income_test.drop("income", axis=1)
y_test = income_test["income"]

<i>The following cell builds the whole preprocessing step with Pipeline, SimpleImputer, StandardScaler, OneHotEncoder and ColumnTransformer, so that the data is ready for a DecisionTreeClassifier.</i>

In [None]:
# Fill the missing numerical attributes with the median value and then scale them. 
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Fill the missing categorical attributes with the string "missing". Then encode the categorical attributes with
# OneHotEncoder, so DecisionTreeClassifier can run.
# Note: `sparse_output` replaces the `sparse` argument since scikit-learn 1.2 (`sparse` was removed in 1.4);
# on older versions use `sparse=False` instead.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Create a ColumnTransformer and assign the two previous transformers to numerical and categorical attributes respectively.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, list(X_train.select_dtypes(exclude="object").columns)),
        ('cat', categorical_transformer, list(X_train.select_dtypes(include="object").columns)),
    ])

# Final pipeline: preprocess the data, then fit a DecisionTreeClassifier
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('decision_tree', DecisionTreeClassifier(random_state=RANDOM_VARIABLE))
])
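
<i>To see what this kind of preprocessing produces, here is a minimal sketch on a hypothetical toy DataFrame. The column names "age" and "workclass", the values, and the names toy and toy_preprocessor are illustrative assumptions and are not taken from income.csv. The encoder is left at its default output type and the result is densified afterwards, so the sketch does not depend on the name of the sparse/sparse_output argument.</i>

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column, each with a gap.
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "workclass": pd.Series(["Private", np.nan, "State-gov", "Private"], dtype=object),
})

# Same structure as the preprocessor above, with the columns named explicitly.
toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["age"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["workclass"]),
])

toy_out = toy_preprocessor.fit_transform(toy)
# The result may be a sparse matrix; convert it to a dense array for inspection.
if hasattr(toy_out, "toarray"):
    toy_out = toy_out.toarray()

print(toy_out.shape)
print(toy_out)
```

<i>The output has four columns: the scaled numeric column, followed by one indicator column per category, including the "missing" placeholder introduced by the imputer.</i>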

<i>Finally, grid search helps us find the best parameters for the classifier. We try several values of the DecisionTreeClassifier parameters max_depth, min_samples_split and min_samples_leaf, and compute the accuracy of each combination with cross-validation. With grid_search.best_estimator_ we get the best estimator, which we use to predict on the test set. Finally, we calculate and print the scores.</i>
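
<i>As a rough sketch of how large this search is, the snippet below enumerates the combinations with plain Python. The parameter values and the number of folds are copied from the cell that follows; the names param_grid, cv_folds, candidates, n_candidates and n_fits are illustrative only.</i>

```python
from itertools import product

# Values copied from the grid used in the next cell.
param_grid = {
    "decision_tree__max_depth": [4, 8],
    "decision_tree__min_samples_split": [2, 4, 8],
    "decision_tree__min_samples_leaf": [4, 8, 16],
}
cv_folds = 3

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)

# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * cv_folds

print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

<i>With its default refit=True, GridSearchCV performs one additional fit of the best candidate on the whole training set after the search.</i>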

In [None]:
# Parameter combinations that grid_search will examine in order to find the best estimator.
testing_parameters = [{
    'decision_tree__max_depth': [4, 8],
    'decision_tree__min_samples_split': [2, 4, 8],
    'decision_tree__min_samples_leaf': [4, 8, 16],
}]

grid_search = GridSearchCV(
    pipe,
    testing_parameters,
    cv=3
)

# Fit grid_search on the training data and take the best estimator.
# GridSearchCV (refit=True by default) has already refitted best_estimator_ on the whole
# training set, so no extra call to fit is needed before predicting.
grid_search.fit(X_train, y_train)
best_esti = grid_search.best_estimator_
predictions = best_esti.predict(X_test)

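# Calculate and print the F1 score and the accuracy score that the best estimator achieves.
# scikit-learn metrics expect (y_true, y_pred) in that order.
# With average='micro' on a single-label problem, the F1 score equals the accuracy.
fScore = f1_score(y_test, predictions, average='micro')
accScore = accuracy_score(y_test, predictions)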
print('f1 score: ',fScore)
print('accuracy score: ', accScore)
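
<i>The two printed scores coincide, because for single-label classification the micro-averaged F1 score reduces to accuracy. The sketch below checks this in plain Python on hypothetical toy labels. The helper functions accuracy and micro_f1 and the label values are illustrative and do not come from the dataset.</i>

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all labels, then compute F1."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label:
                fp += 1
            elif t == label:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical toy labels, for illustration only.
toy_true = [">50K", "<=50K", "<=50K", ">50K", "<=50K"]
toy_pred = [">50K", "<=50K", ">50K", "<=50K", "<=50K"]

toy_acc = accuracy(toy_true, toy_pred)
toy_f1 = micro_f1(toy_true, toy_pred)
print(toy_acc, toy_f1)
```

<i>Every wrong prediction counts once as a false positive (for the predicted label) and once as a false negative (for the true label), which is why the pooled F1 equals the share of correct predictions. A per-class or macro-averaged F1 would give different information than accuracy.</i>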