## Exercises, Part 1: `Decision Tree`
1. Read the Scikit-learn documentation on decision trees [(see here)](https://scikit-learn.org/stable/modules/tree.html).
2. Train and test the [decision tree model from sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) on the iris dataset, checking how the different parameters affect performance. You should test at least the `criterion` and `max_depth` parameters.
3. Test the decision tree on the [UCI Car dataset](https://archive.ics.uci.edu/ml/datasets/car+evaluation).
4. `(Optional)` Test the decision tree on other datasets of your interest.
5. Compare the results obtained on the different datasets and with different parameters. You can fix the `random_state` parameter so that the different configurations are compared on the same split.
6. `(Optional)` Use train/test split and/or k-fold cross-validation to better evaluate the result of your model.

In [56]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [57]:
iris_dataset = load_iris()

In [58]:
dir(iris_dataset)

In [59]:
target_names_list = iris_dataset.target_names[iris_dataset.target]

In [60]:
ir_df = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)
ir_df['target'] = target_names_list
ir_df.shape

In [61]:
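# Hold out 90% of the rows for testing, so the tree is trained on only 15 samples.
(X_train, X_test, y_train, y_test) = train_test_split(ir_df.iloc[:,:-1], ir_df.iloc[:,-1], random_state=0, test_size=0.90)
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
# Accuracy: fraction of test samples classified correctly.
(dtc.predict(X_test) == y_test).sum()/y_test.shape[0]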

In [62]:
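# Same seed, but now 90% of the rows are used for training and the tree is limited to a single split.
(X_train, X_test, y_train, y_test) = train_test_split(ir_df.iloc[:,:-1], ir_df.iloc[:,-1], random_state=0, test_size=0.10)
dtc = DecisionTreeClassifier(max_depth=1)
dtc.fit(X_train, y_train)
# Accuracy: fraction of test samples classified correctly.
(dtc.predict(X_test) == y_test).sum()/y_test.shape[0]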
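
Exercise 2 asks for a comparison across `criterion` and `max_depth`. The two cells above change the split size and the depth at the same time, which makes the effect of each hard to separate. A minimal sketch of a more systematic sweep follows; the 70/30 stratified split and the particular depth values are assumptions chosen for illustration, not part of the original exercise.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Assumed split: 70% train / 30% test, stratified so every class
# appears in the same proportion in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

results = {}
for criterion, max_depth in product(["gini", "entropy"], [1, 2, 3, None]):
    # A fixed random_state keeps tie-breaking identical across configurations.
    clf = DecisionTreeClassifier(
        criterion=criterion, max_depth=max_depth, random_state=0
    )
    clf.fit(X_train, y_train)
    results[(criterion, max_depth)] = clf.score(X_test, y_test)

for (criterion, max_depth), acc in results.items():
    print(f"criterion={criterion:<8} max_depth={str(max_depth):<5} accuracy={acc:.3f}")
```

Because only one parameter changes between neighbouring rows, differences in accuracy can be attributed to that parameter. A tree with `max_depth=1` has two leaves, so it can predict at most two of the three iris classes.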

## UCI Car dataset

In [63]:
col_names = [
    "buying",
    "maint",
    "doors",
    "persons",
    "lug_boot",
    "safety",
    "class"
]
car_df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data', names=col_names, header=None, index_col=False)

# https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.c45-names
# https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data
# https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.names

In [64]:
car_dict = {
    "buying":   ["vhigh", "high", "med", "low"],
    "maint":    ["vhigh", "high", "med", "low"],
    "doors":    ["2", "3", "4", "5more"],
    "persons":  ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety":   ["low", "med", "high"],
    "class":    ["unacc", "acc", "good", "vgood"]
}

In [65]:
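def translate(dataframe, dictionary):
    """Return a copy of dataframe with each category replaced by its position in dictionary[column]."""
    new_df = dataframe.copy()
    for key, values in dictionary.items():
        # Map every category to its list index in one step. This avoids the chained
        # assignment new_df[key].loc[...] = ..., which can write to a temporary copy.
        mapping = {val: index for index, val in enumerate(values)}
        # Values missing from the mapping become NaN.
        new_df[key] = new_df[key].map(mapping)
    return new_df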
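
As an alternative to a hand-written helper, scikit-learn's `OrdinalEncoder` can apply the same ordered encoding. The sketch below is only an illustration: the three-row `sample` frame is hypothetical, and the category orders are copied from the `buying` and `safety` lists in `car_dict`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical three-row sample shaped like two of the car columns.
sample = pd.DataFrame({
    "buying": ["vhigh", "low", "med"],
    "safety": ["low", "high", "med"],
})

# One ordered category list per column, in column order.
orders = [
    ["vhigh", "high", "med", "low"],  # buying
    ["low", "med", "high"],           # safety
]

encoder = OrdinalEncoder(categories=orders)
# fit_transform returns a float array; cast to int for readability.
encoded = pd.DataFrame(
    encoder.fit_transform(sample), columns=sample.columns
).astype(int)
print(encoded)
```

By default the encoder raises an error when it meets a category that is not in the supplied lists, which can surface typos in the mapping early.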

In [66]:
car_numeric = translate(car_df, car_dict)

In [67]:
X = car_numeric.drop(["class"], axis=1)
y = car_numeric['class']

In [68]:
(X_train, X_test, y_train, y_test) = train_test_split(X, y, random_state=0, test_size=0.10)
dcf = DecisionTreeClassifier(random_state=12)
dcf = dcf.fit(X_train, y_train)
X_train.shape[0], X_test.shape[0], y_train.shape[0], y_test.shape[0]

(1555, 173, 1555, 173)

In [69]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

In [70]:
(np.array(y_train) == dcf.predict(X_train)).sum(), y_train.shape[0]

(1555, 1555)
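
The count above is measured on the training set. An unpruned decision tree can keep splitting until every training sample is classified correctly, so a perfect training score says little about generalisation. A minimal sketch of the train versus test contrast follows; it uses the iris data so that it runs without a download, and the 70/30 split is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

# No depth limit: the tree grows until its leaves are pure.
tree = DecisionTreeClassifier(random_state=12).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"depth={tree.get_depth()} leaves={tree.get_n_leaves()}")
```

The same comparison can be made for the car data by calling `dcf.score(X_test, y_test)` next to the training count.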

In [71]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state=12)
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=12, verbose=0,
                       warm_start=False)

In [72]:
(rfc.predict(X_test) == np.array(y_test)).sum(), y_test.shape[0]

(164, 173)
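
A single 90/10 split yields one number per model (164 of 173 here for the random forest), and that number depends on which rows happen to land in the test set. Exercise 6 suggests k-fold cross-validation as a more stable estimate. The sketch below compares the two model types with 5-fold stratified cross-validation. It uses the iris data so that it runs offline; the fold count is an assumption, and substituting the `X` and `y` built from the car data should work the same way.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Shuffled, stratified folds; the same folds are reused for both models.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=12),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=12),
}

cv_scores = {}
for name, model in models.items():
    # cross_val_score refits a fresh clone of the model on each fold.
    scores = cross_val_score(model, X, y, cv=cv)
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reporting the mean together with the standard deviation shows whether a gap between two models is larger than the fold-to-fold variation.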