<a href="https://colab.research.google.com/github/abelowska/dataPy/blob/main/Classes_04_DT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Nonlinear regressors: Decision Tree

Today we are going to use our own dataset.

The dataset consists of data on **personality** (Big Five assessed with [NEO FFI](https://sjdm.org/dmidi/NEO-FFI.html)) and **cognitive religious belief styles** ([The Post-Critical Belief Scale](https://theo.kuleuven.be/apps/press/ecsi/files/2019/03/4.-Pollefeyt-Bouwens-PCB-Melb-Vict-for-dummies-EN.pdf)) from 342 individuals. We want to find out whether cognitive religious belief style can be predicted from personality traits. Make sure you have downloaded the dataset from the GitHub repository [here](https://github.com/abelowska/dataPy/blob/main/data_neo-ffi_religion.csv) and uploaded it to Colaboratory *Files*.

Imports

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
import pandas as pd
import seaborn as sns
sns.set_theme(style="whitegrid", palette="deep")

import io
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import set_config
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import PredictionErrorDisplay, median_absolute_error
from sklearn.preprocessing import power_transform

plt.rcParams["figure.figsize"] = (10,7)

In [None]:
# constants
test_size = 0.2
random_state = 42

In [None]:
def compute_score(y_true, y_pred):
  '''
  Helper function for computing scores.

  Parameters:
  y_true: ndarray of y values from the original dataset.
  y_pred: ndarray of y values predicted with a given model.

  Return:
  dictionary with the R2 and median absolute error scores,
  each formatted as a string with three decimal places.
  '''
  return {
      "R2": f"{r2_score(y_true, y_pred):.3f}",
      "MedianAE": f"{median_absolute_error(y_true, y_pred):.3f}",
  }

In [None]:
def plot_prediction_error(y_test, y_pred, scores):
  _, ax = plt.subplots(figsize=(5, 5))

  y_test = y_test.to_numpy() if isinstance(y_test, pd.DataFrame) else y_test

  display_ = PredictionErrorDisplay.from_predictions(
      y_test,
      y_pred,
      kind="actual_vs_predicted",
      ax=ax,
      scatter_kwargs={"alpha": 0.5}
  )

  # generic title: this helper is reused for every model, not only linear ones
  ax.set_title("Actual vs. predicted values")
  for name, score in scores.items():
      ax.plot([], [], " ", label=f"{name}: {score}")
  ax.legend(loc="upper left")
  plt.tight_layout()

## Load dataset

In [None]:
df = pd.read_csv('data_neo-ffi_religion.csv')
df.head()

Inspect the dataset

In [None]:
df.describe()

## Decision Trees

Now, we are going to create our model

*Orthodoxy ~ Extraversion + Agreeableness + Openness + Neuroticism + Conscientiousness*

using decision trees and compare it with linear regression and KNN. Let's take a look at the simplest DT model.

In [None]:
X = df[[
    'Extraversion',
    'Agreeableness',
    'Conscientiousness',
    'Openness',
    'Neuroticism']]

y = df[['Orthodoxy']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

# create the DT estimator (a fixed random_state makes the result reproducible)
dt = DecisionTreeRegressor(random_state=random_state)

dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

scores = compute_score(y_test, y_pred)
scores

In [None]:
plot_prediction_error(y_test, y_pred, scores)

**Oops! Something is clearly wrong. Do you have any idea what happened?**

### Exercise 3

Try to plot the distribution of y. Does it look like a normal distribution, or something else?

Decision Trees (DTs) try to find the best dividing lines for the data by assessing the quality of these divisions using cost functions (which are based on the data variance). We need to "fix" our variances to make them more comparable, so we can realize the full potential of decision trees.
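To make the idea of variance-based splitting concrete, here is a minimal sketch on made-up numbers. The arrays and the helper `best_split` are hypothetical; they are not part of the dataset or of scikit-learn. The helper scans every candidate threshold on a single feature and keeps the one that gives the lowest weighted variance of y in the two resulting groups. This is roughly what a regression tree with a squared-error criterion does at each node.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted_variance) of the best single split on x."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_cost = None, np.inf
    # candidate thresholds: midpoints between consecutive distinct x values
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue
        threshold = (x_sorted[i] + x_sorted[i - 1]) / 2
        left, right = y_sorted[:i], y_sorted[i:]
        # weighted average of the within-group variances (lower is better)
        cost = (len(left) * left.var() + len(right) * right.var()) / len(y_sorted)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

# toy data (made up): y jumps once x exceeds 3
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([10.0, 11.0, 10.5, 20.0, 21.0, 19.5])

threshold, cost = best_split(x, y)
print(f"best threshold: {threshold}, weighted variance: {cost:.3f}")
```

A few extreme values of y can dominate such a variance-based cost, which is one reason the shape of the target distribution matters here.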

Create the model with the DT estimator, but before fitting, transform y to have a more Gaussian-like distribution. You can:

1.   Apply an appropriate transformation to y manually.
2.   Use built-in methods such as the [`power_transform()`](https://scikit-learn.org/1.5/modules/generated/sklearn.preprocessing.power_transform.html) function.
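As a hint for the second option, the sketch below shows what a power transformation does to a skewed variable. It uses synthetic log-normal data, not the Orthodoxy scores, and it swaps the `power_transform()` function for the closely related `PowerTransformer` class, because the class also offers `inverse_transform()`. All variable names are made up for illustration.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# synthetic right-skewed target (an assumption, NOT the Orthodoxy column)
y_demo = rng.lognormal(mean=0.0, sigma=1.0, size=500).reshape(-1, 1)

# Yeo-Johnson works for positive and non-positive values alike
pt = PowerTransformer(method="yeo-johnson")
y_demo_t = pt.fit_transform(y_demo)

print("skewness before:", skew(y_demo.ravel()))
print("skewness after: ", skew(y_demo_t.ravel()))

# values on the transformed scale can be mapped back to the original scale
y_back = pt.inverse_transform(y_demo_t)
```

If you fit the model on a transformed y, remember that its predictions are on the transformed scale too. One possible way to handle the round trip automatically is scikit-learn's `TransformedTargetRegressor`.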

In [None]:
# Your code here

In [None]:
plot_prediction_error(y_test, y_pred, scores)

### (Exercise 3.1)

Decision trees have many adjustable parameters. Especially interesting are `criterion`, `max_depth`, `min_samples_split`, and `min_samples_leaf`. Read about them in the documentation (and on the internet) and see how the model's performance changes as you vary them. You may want to plot performance against model complexity to see whether decision trees overfit easily.
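One possible starting point is a simple sweep over `max_depth`, comparing R2 on the training and test sets. The sketch below assumes synthetic data from `make_regression` as a stand-in for the real dataset, and all variable names are illustrative. A widening gap between the two scores as depth grows is the usual sign of overfitting.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic data standing in for the real dataset (an assumption)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=20.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

depths = range(1, 11)
train_r2, test_r2 = [], []
for depth in depths:
    model = DecisionTreeRegressor(max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)
    # .score() returns R2 for regressors
    train_r2.append(model.score(X_tr, y_tr))
    test_r2.append(model.score(X_te, y_te))

for depth, tr, te in zip(depths, train_r2, test_r2):
    print(f"max_depth={depth:2d}  train R2={tr:.3f}  test R2={te:.3f}")
```

The two lists can be passed to `plt.plot(depths, train_r2)` and `plt.plot(depths, test_r2)` to get the complexity curve. The same loop can be repeated for `min_samples_split` or `min_samples_leaf`.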

In [None]:
# Your code here

### Exercise 4

And now, the most interesting part! We can analyze the structure of our fitted decision tree. We have to save the tree to a `.dot` file, and then we can use the [WebGraphviz](http://www.webgraphviz.com) tool to visualize it. Copy the content of the `.dot` file (saved to the *Files* directory in Colab) into the input area on [WebGraphviz](http://www.webgraphviz.com).

In [None]:
from sklearn.tree import export_graphviz
# export the decision tree model to a tree_structure.dot file
# paste the contents of the file to webgraphviz.com
export_graphviz(
    dt,
    out_file='tree_structure.dot',
    feature_names=X.columns.to_numpy()
)
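If you only need a quick look at the rules, a plain-text dump can be an alternative to the `.dot` file. The sketch below uses `export_text` from `sklearn.tree` on a small tree fitted to synthetic data. The trait names are borrowed purely as column labels, and the numbers have nothing to do with the real dataset.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

# labels borrowed from the notebook; the data below is synthetic (an assumption)
feature_names = [
    "Extraversion",
    "Agreeableness",
    "Conscientiousness",
    "Openness",
    "Neuroticism",
]
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# a shallow tree keeps the printed rules short
small_tree = DecisionTreeRegressor(max_depth=2, random_state=42)
small_tree.fit(X_demo, y_demo)

tree_rules = export_text(small_tree, feature_names=feature_names)
print(tree_rules)
```

Each indented line is one split condition, and the `value` lines are the predictions stored in the leaves.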

Fit a simple linear regression to the same data and look at the estimated slopes. Do the conclusions drawn from the linear regression agree with those from the decision tree?
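One way to set up this comparison is to put the linear slopes next to the tree's `feature_importances_`. The sketch below does this on synthetic data in which only the first two predictors influence the target. The data and variable names are assumptions for illustration, not results from the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
# synthetic predictors: x0 and x1 drive the target, x2 is pure noise
X_demo = rng.normal(size=(300, 3))
y_demo = 3.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

lr = LinearRegression().fit(X_demo, y_demo)
tree = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_demo, y_demo)

for i in range(X_demo.shape[1]):
    print(
        f"x{i}: slope={lr.coef_[i]: .3f}  "
        f"importance={tree.feature_importances_[i]:.3f}"
    )
```

Keep in mind that slopes carry a sign and depend on the scale of each predictor, whereas tree importances are non-negative and sum to 1, so they tell you how much a feature is used but not in which direction it acts.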