In [None]:
#!git clone https://github.com/darioka/impactdeal-2022.git
#%cd impactdeal-2022
#!pip install -r requirements.txt
#!pip install .

# Decision Trees - EPC Rating

In this notebook you will fit a `DecisionTreeClassifier` on the EPC rating dataset. You will experience for yourself what it means to work with real-world data!

## Loading the data

As we did for the exploration, let's load the dataset `known_epc_ratings.csv.gz`. If you remember, there is a column with datetime values. It is useful to convert it to the correct datetime data type during loading. Take a look at the `parse_dates` and `infer_datetime_format` parameters of `read_csv` in the `pandas` documentation (note that `infer_datetime_format` is deprecated since `pandas` 2.0, where the format is inferred by default).
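
As a hint, here is a minimal sketch of how `parse_dates` behaves, using a tiny in-memory CSV rather than the real file. The column names (`PROPERTY_ID`, `ASSESSMENT_DATE`, `RATING`) are made up for illustration and are not necessarily the ones in the EPC dataset.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real file; the column names are made up.
csv_text = """PROPERTY_ID,ASSESSMENT_DATE,RATING
1,2015-03-01,D
1,2020-07-15,C
2,2018-11-30,B
"""

# parse_dates takes the names of the columns to convert to datetime64.
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["ASSESSMENT_DATE"])

print(sample_df.dtypes)
```

For the real file you would pass the path instead of the `StringIO` object. By default `read_csv` infers the compression from the `.gz` extension.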

In [None]:
# Write your code here!

# ...

# raw_df =

## The problem

Recall from the exploration that this is not a binary classification problem but a **multiclass classification** problem. Fortunately, this makes no difference in how we train a `DecisionTreeClassifier`. There will be some differences, though, in the way we interpret the performance metrics.

Let's now look more closely at the EPC rating dataset. There is a small discrepancy between what we want to do, that is, to predict the EPC rating *for each property*, and what the dataset is: a collection of data *for each EPC assessment*. In other words, **the dataset may contain multiple EPC ratings for each property**. There are several reasons why we must take that into account:
1. Older EPC ratings may have been updated and could contain incorrect data.
2. If a property has two EPC ratings, say A and B, what should the model output for that property? A or B?
3. If a property has two EPC ratings, when we split the data, one of the samples could end up in the training set and the other in the test set. This may lead to an overestimation of the performance of a model on the test set.

In the following cells, show that there are multiple EPC ratings for the same `BUILDING_REFERENCE_NUMBER`. Then create a new dataframe with only the most recent assessments for each property.
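
One possible approach, sketched on toy data: count the assessments per property, then sort by date and keep the last row for each property. Only `BUILDING_REFERENCE_NUMBER` comes from the dataset here; `INSPECTION_DATE` and `RATING` are hypothetical names, so check the actual column names before adapting this.

```python
import pandas as pd

# Toy data; INSPECTION_DATE and RATING are hypothetical column names.
toy = pd.DataFrame(
    {
        "BUILDING_REFERENCE_NUMBER": [101, 101, 202, 303, 303],
        "INSPECTION_DATE": pd.to_datetime(
            ["2012-05-01", "2019-09-10", "2016-01-20", "2014-03-03", "2021-12-01"]
        ),
        "RATING": ["E", "C", "D", "F", "D"],
    }
)

# Assessments per property; counts above 1 reveal repeated properties.
assessments_per_property = toy["BUILDING_REFERENCE_NUMBER"].value_counts()
n_repeated = int((assessments_per_property > 1).sum())

# Sort by date, then keep the last (most recent) row for each property.
latest = (
    toy.sort_values("INSPECTION_DATE")
    .drop_duplicates(subset="BUILDING_REFERENCE_NUMBER", keep="last")
    .reset_index(drop=True)
)
```

An equivalent alternative would be a `groupby` on the property identifier followed by selecting the row with the maximum date; the sort-and-deduplicate version is just shorter.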

In [None]:
# Write your code here!

# ...

# df =

## Cleaning

During exploration we saw that the dataset has some issues with data quality. Some cleaning functions have been provided for your convenience. You can find them in the `cleaning` module of the package `impactdeal`.

In [None]:
from impactdeal.config.column_names import TARGET, NUMERICAL, CATEGORICAL
from impactdeal import cleaning

# all missing values in the categorical columns will be replaced with np.nan
df = cleaning.normalize_missing(df, CATEGORICAL)

# specific cleaning steps for some of the features
df = cleaning.clean_age_band(df)
df = cleaning.clean_floor_level(df)
df = cleaning.clean_mainheat(df)

## Preprocessing

Here we start with the preprocessing steps that will lead us to model training.

### Train-test split

First of all, let's define `X` as a dataframe with the features we want to use, namely `NUMERICAL` and `CATEGORICAL`, and `y` as the series with the `TARGET`.

Then split `X` and `y` into train (70%) and test (30%) sets, stratifying on `y`.
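
A minimal sketch of a stratified 70/30 split with `train_test_split`, on made-up data. The variable names and the `random_state` value are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up features and a balanced two-class target.
features = pd.DataFrame(
    {"floor_area": range(20), "rooms": [i % 5 for i in range(20)]}
)
labels = pd.Series(["A"] * 10 + ["B"] * 10, name="rating")

# stratify=labels keeps the class proportions equal in both splits.
feat_train, feat_test, lab_train, lab_test = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=0
)

print(lab_train.value_counts())
print(lab_test.value_counts())
```

Stratification matters most when some classes are rare, as it prevents a class from being under-represented in either split.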

In [None]:
# Write your code here!

# ...

# X = 
# y = 

# X_train, X_test, y_train, y_test =

### Dropping columns

As we saw during exploration, some of the features convey hardly any information. This is the case with columns that mostly contain missing data. Additionally, we noticed that there are multiple variables with information about lighting in the property; they are correlated, and some of them have many missing values.

For simplicity, let's also discard all textual columns (the ones whose name ends with `DESCRIPTION`). Being free text, if we treated them as simple categories we could end up with a huge number of classes. This is not helpful for any model. There are methods to handle text, which we will see in future modules.
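
The selection logic can be sketched as follows on a toy dataframe. The column names and the 50% missing-value threshold are assumptions made for the example; the right threshold for the EPC data is a judgment call.

```python
import numpy as np
import pandas as pd

# Toy dataframe; all column names here are illustrative.
toy = pd.DataFrame(
    {
        "TOTAL_FLOOR_AREA": [50.0, 72.5, 64.0, 120.0],
        "MOSTLY_EMPTY": [np.nan, np.nan, np.nan, 1.0],
        "WALLS_DESCRIPTION": ["Cavity wall", "Solid brick", "Cavity wall", "Timber frame"],
        "PROPERTY_TYPE": ["House", "Flat", "Flat", "House"],
    }
)

# Fraction of missing values per column.
missing_fraction = toy.isna().mean()

# Assumed threshold: drop columns with more than half of the values missing.
mostly_missing = missing_fraction[missing_fraction > 0.5].index.tolist()

# Free-text columns are identified by their name suffix.
text_cols = [col for col in toy.columns if col.endswith("DESCRIPTION")]

kept_cols = [col for col in toy.columns if col not in mostly_missing + text_cols]
reduced = toy[kept_cols]
```

In the exercise, the missing fractions should be computed on the training set only, and the same columns then dropped from both train and test.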

In [None]:
# Write your code here!

# ...

# numerical_cols = [...]
# categorical_cols = [...]

# X_train = 
# X_test = 

### Missing values

In this dataset, missing values can be found in both numerical and categorical columns. We saw that a good way to encode missing values for categorical columns is to treat them as another category. For numerical columns, instead, our solution will be [imputation](https://en.wikipedia.org/wiki/Imputation_(statistics)).

A common imputation method is mean imputation, where one replaces missing data in a column with the mean value of that column. This is an assumption, of course, and sometimes better assumptions can be made by understanding the data collection process.

In our case, for example, if `MULTI_GLAZE_PROPORTION` is null it seems reasonable to think that the property has no multiple glazing, which would be equivalent to setting `MULTI_GLAZE_PROPORTION=0`. This could be a more conservative assumption compared to the mean.

In the following cells, decide for which features it would be better to use a constant value imputation and for which ones a mean imputation. Then apply the transformations using the `scikit-learn` [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html).
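
Here is a small sketch of the two strategies side by side on toy data. `MULTI_GLAZE_PROPORTION` is taken from the text above; `FLOOR_AREA` is a made-up column standing in for any feature where the mean is a reasonable guess.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame(
    {
        "MULTI_GLAZE_PROPORTION": [100.0, np.nan, 50.0, np.nan],
        "FLOOR_AREA": [60.0, 80.0, np.nan, 100.0],  # made-up column
    }
)
test = pd.DataFrame(
    {"MULTI_GLAZE_PROPORTION": [np.nan, 80.0], "FLOOR_AREA": [np.nan, 55.0]}
)

zero_cols = ["MULTI_GLAZE_PROPORTION"]
mean_cols = ["FLOOR_AREA"]

zero_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
mean_imputer = SimpleImputer(strategy="mean")

# Fit on the training set only, then reuse the fitted statistics on the test set.
train[zero_cols] = zero_imputer.fit_transform(train[zero_cols])
train[mean_cols] = mean_imputer.fit_transform(train[mean_cols])
test[zero_cols] = zero_imputer.transform(test[zero_cols])
test[mean_cols] = mean_imputer.transform(test[mean_cols])
```

Note that the missing `FLOOR_AREA` in the test set is filled with the training mean, not the test mean: computing statistics on the test set would leak information into the evaluation.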

In [None]:
# Write your code here!

# ...

# zero_imputed_cols = [...]
# mean_imputed_cols = [...]

# ...

# X_train[zero_imputed_cols] = 
# X_train[mean_imputed_cols] = 
# X_test[zero_imputed_cols] = 
# X_test[mean_imputed_cols] = 

# ...

### Categorical encoding

Now we will encode categorical variables. However...

One problem with categorical variables in real-world datasets is rare classes, for several reasons:
1. Why are there so few samples with those classes? Instead of valid classes, they could be just errors in the data collection process.
2. If only a few samples belong to a category, every machine learning model will struggle to learn anything from them. Having many rare classes is never beneficial for models.
3. Having many rare classes does not help interpretation.
4. If there are very few examples with a certain class, they might end up in just one of the train or test datasets.

For these reasons, before the `OrdinalEncoder` we will add one more preprocessing step that converts rare classes to another constant value, for example `"other"` or `np.nan`.

Read the docstring of the object `impactdeal.CategoryReducer`, then use it followed by `OrdinalEncoder` to encode all categorical columns in our dataset.
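
To illustrate the idea only, the sketch below groups rare classes with plain `pandas` and then applies `OrdinalEncoder`. It is a stand-in, not the `CategoryReducer` API, whose actual parameters are described in its docstring. The `FUEL` column, the `reduce_rare` helper and the `min_count` threshold are all hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"FUEL": ["gas"] * 6 + ["electricity"] * 3 + ["coal"]})
test = pd.DataFrame({"FUEL": ["gas", "wood", "electricity"]})

# Assumed threshold: categories seen fewer than 2 times in training are rare.
min_count = 2
counts = train["FUEL"].value_counts()
frequent = set(counts[counts >= min_count].index)


def reduce_rare(column: pd.Series) -> pd.Series:
    """Replace every value outside the frequent set with "other"."""
    return column.where(column.isin(frequent), "other")


train["FUEL"] = reduce_rare(train["FUEL"])
test["FUEL"] = reduce_rare(test["FUEL"])

# Unknown categories at transform time would be encoded as -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_encoded = encoder.fit_transform(train[["FUEL"]])
test_encoded = encoder.transform(test[["FUEL"]])
```

The frequent categories are learned from the training set alone, so a class that appears only in the test set (`"wood"` here) is mapped to `"other"` rather than breaking the encoder.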

In [None]:
# Write your code here!

# ...

# X_train[categorical_cols] = 
# X_test[categorical_cols] = 

# ...

## Model

Finally, we can train our decision tree. In the following cell, fit a `DecisionTreeClassifier`, then score the model on the test set and visualize the predictions with a confusion matrix.
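
The fit, score and confusion matrix steps can be sketched on synthetic three-class data as below. The data from `make_classification`, the `max_depth=4` setting and the variable names are arbitrary choices for the example, not recommendations for the EPC dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic multiclass data standing in for the preprocessed EPC features.
X_toy, y_toy = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_tr, y_tr)

# score() returns the mean accuracy on the given data.
accuracy = tree.score(X_te, y_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, tree.predict(X_te))

print(f"accuracy: {accuracy:.3f}")
print(cm)
```

For a plot instead of a printed array, recent `scikit-learn` versions provide `ConfusionMatrixDisplay.from_predictions`. In a multiclass problem the off-diagonal cells show which classes get confused with each other, which a single accuracy number hides.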

In [None]:
# Write your code here!

