# Mini Project: Tree-Based Algorithms

## The "German Credit" Dataset

### Dataset Details

This dataset has two classes (labels, in Machine Learning terms) that describe the creditworthiness of a personal loan: "Good" or "Bad". The predictors cover attributes such as checking account status, duration, credit history, purpose of the loan, amount of the loan, savings accounts or bonds, employment duration, installment rate as a percentage of disposable income, personal information, other debtors/guarantors, residence duration, property, age, other installment plans, housing, number of existing credits, job information, number of people the applicant is liable to provide maintenance for, telephone, and foreign worker status.

Many of these predictors are discrete and have been expanded into several 0/1 indicator variables (i.e., they have been one-hot encoded).
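
As a small illustration of what one-hot encoding does, here is a minimal sketch using `pandas.get_dummies`. The toy column name and its values are hypothetical and are not taken from the actual dataset file.

```python
import pandas as pd

# Hypothetical toy frame; the column name and values are illustrative only.
toy = pd.DataFrame({"Housing": ["Own", "Rent", "ForFree", "Own"]})

# Expand the single categorical column into one 0/1 indicator column per level.
encoded = pd.get_dummies(toy, columns=["Housing"], dtype=int)
print(encoded)
```

Each row ends up with exactly one indicator set to 1, which is why a single categorical predictor turns into several columns.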

This dataset has been kindly provided by Professor Dr. Hans Hofmann of the University of Hamburg, and can also be found on the UCI Machine Learning Repository.

## Decision Trees

As we have learned in the previous lectures, Decision Trees as a family of algorithms (irrespective of the particular implementation) are powerful algorithms that can produce models with higher predictive accuracy than linear models such as Linear or Logistic Regression. This is primarily because Decision Trees can model nonlinear relationships and have a number of tuning parameters that allow the practitioner to achieve the best possible model. An added bonus is the ability to visualize the trained Decision Tree model, which gives some insight into how the model has produced its predictions. One caveat to keep in mind is that, depending on the size of the dataset (both the number of records and the number of features), the visualization can become very large and complex, making it harder to interpret.

To give you a very good example of how Decision Trees can be visualized and interpreted, we strongly recommend that, before continuing with the problems in this Mini Project, you take the time to read this fantastic, detailed and informative blog post: http://explained.ai/decision-tree-viz/index.html

## Building Your First Decision Tree Model

So, now it's time to jump straight into the heart of the matter. Your first task is to build a Decision Tree model using the aforementioned "German Credit" dataset, which contains 1,000 records and 62 columns (one holds the labels, and the other 61 are the candidate features for the model).

For this task, you will be using the scikit-learn library, which comes pre-installed with the Anaconda Python distribution. If you're not using Anaconda, you can easily install it using pip.

Before embarking on creating your first model, we would strongly encourage you to read the short tutorial for Decision Trees in scikit-learn (http://scikit-learn.org/stable/modules/tree.html), and then dive a bit deeper into the documentation of the algorithm itself (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). 

Also, since you want to be able to present the results of your model, we suggest you take a look at the tutorial for accuracy metrics for classification models (http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report) as well as the more detailed documentation (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).

Finally, an *amazing* resource that explains the various classification model accuracy metrics, as well as the relationships between them, can be found on Wikipedia: https://en.wikipedia.org/wiki/Confusion_matrix
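
To make the relationship between these metrics concrete, here is a minimal sketch that derives accuracy, precision and recall from a confusion matrix. The label vectors are hypothetical and exist only to illustrate the calculation (1 stands for "Good", 0 for "Bad").

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels for illustration: 1 = "Good", 0 = "Bad".
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes, both ordered [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of the predicted "Good", how many really are
recall = tp / (tp + fn)           # of the true "Good", how many were found

print(cm)
print(accuracy, precision, recall)
print(classification_report(y_true, y_pred, target_names=["Bad", "Good"]))
```

The `classification_report` call prints the same precision and recall per class, so you can check the hand-computed values against it.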

(Note: as you've already learned in the Logistic Regression mini project, a standard practice in Machine Learning for achieving the best possible result when training a model is hyperparameter tuning through Grid Search and k-fold Cross Validation. We strongly encourage you to use it here as well, not just because it's standard practice, but also because it's not going to be too computationally intensive, given the size of the dataset that you're working with. Our suggestion is that you split the data into 70% training and 30% testing. Then do the hyperparameter tuning and Cross Validation on the training set, and afterwards do a final test on the testing set.)

### Now we pass the torch onto you! You can start building your first Decision Tree model! :)

In [None]:
# !pip install --upgrade matplotlib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

In [None]:
def to_bool(x):
    """Map the 'Class' label to 1 ('Good') or 0 ('Bad'); any other value maps to None."""
    if x == 'Good':
        return 1
    elif x == 'Bad':
        return 0
    return None

In [None]:
# Your code here! :)
# Load the data and separate the features from the 'Class' label.
GermanCredit = pd.read_csv("GermanCredit.csv.zip")
x = GermanCredit.drop(columns='Class')
y = GermanCredit['Class'].apply(to_bool)

# Hold out a test set (75% train / 25% test); tuning uses only the training portion.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(x.values, y.values, train_size=0.75, random_state=5)

In [None]:
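# Candidate hyperparameters: 5 x 5 x 2 = 50 combinations.
params = {
    'max_depth': [2, 3, 5, 10, 20],
    'min_samples_leaf': [5, 10, 20, 50, 100],
    'criterion': ["gini", "entropy"]
}
dt = DecisionTreeClassifier()
# 4-fold cross-validated grid search, run on the training set only.
cm2 = GridSearchCV(estimator=dt, param_grid=params, cv=4, n_jobs=-1, verbose=1, scoring="accuracy")

cm2.fit(Xtrain, Ytrain)
# By default GridSearchCV refits the best parameters on all of Xtrain, and predict() uses that model.
YPred = cm2.predict(Xtest)
accuracy = accuracy_score(Ytest, YPred)
print(accuracy)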



Fitting 4 folds for each of 50 candidates, totalling 200 fits
0.748


In [None]:
dt_best = cm2.best_estimator_

### After you've built the best model you can, now it's time to visualize it!

Remember that amazing blog post from a few paragraphs ago, which demonstrated how to visualize and interpret the results of a Decision Tree model? We've seen that this approach can work very well, but let's see how it does on the "German Credit" dataset, which is a bit larger than the one used by the blog authors.

First, we're going to need to install their package. If you're using Anaconda, this can be done easily by running:

In [None]:
# ! pip install dtreeviz

Collecting dtreeviz
  Downloading dtreeviz-1.3.3.tar.gz (61 kB)
Collecting graphviz>=0.9
  Downloading graphviz-0.19.1-py3-none-any.whl (46 kB)
Collecting colour
  Downloading colour-0.1.5-py2.py3-none-any.whl (23 kB)
Building wheels for collected packages: dtreeviz
  Building wheel for dtreeviz (setup.py) ... done
  Created wheel for dtreeviz: filename=dtreeviz-1.3.3-py3-none-any.whl size=67113 sha256=ef64d05da901f27b340bf749e34f998eb8a9419a92ab50b33bbcae99f1c1b06a
  Stored in directory: /root/.cache/pip/wheels/58/9d/65/e57deb90bf5440945d74bc4c19ebb14a0de2ed2b508c609673
Successfully built dtreeviz
Installing collected packages: graphviz, colour, dtreeviz
Successfully installed colour-0.1.5 dtreeviz-1.3.3 graphviz-0.19.1


If for any reason this way of installing doesn't work for you straight out of the box, please refer to the more detailed documentation here: https://github.com/parrt/dtreeviz

Now you're ready to visualize your Decision Tree model! Please feel free to use the blog post for guidance and inspiration!
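
Note that dtreeviz relies on the Graphviz `dot` executable being on your PATH. If that is not available, a plain-text rendering can serve as a fallback. The sketch below swaps in scikit-learn's `export_text`, which is a different technique from dtreeviz and needs no external executable. The data and feature names are synthetic stand-ins; in this notebook you would pass your own fitted tree and `list(x.columns)` instead.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption); replace with your own fitted tree and column names.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# export_text renders the learned split rules as indented plain text.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

The output is far less polished than a dtreeviz figure, but it shows the same split thresholds and leaf classes.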

In [None]:
# ! conda uninstall graphviz
# ! xcode-select --install
# ! brew reinstall graphviz
# ! rm ~/anaconda3/bin/dot
# ! pip install graphviz
# ! type pip

# ! conda install anaconda --all -y

ERROR: unknown command "reinstall" - maybe you meant "install"


In [None]:
# Your code here! :)
from dtreeviz.trees import dtreeviz

In [None]:
viz = dtreeviz(dt_best, Xtrain, Ytrain, target_name='Class', feature_names=list(x.columns))
viz

ExecutableNotFound: failed to execute 'dot', make sure the Graphviz executables are on your systems' PATH

<dtreeviz.trees.DTreeViz at 0x7f94948ded10>

## Random Forests

As discussed in the lecture videos, Decision Tree algorithms also have certain undesirable properties. Mainly, they have low bias, which is good, but tend to have high variance, which is *not* so good (more about this problem here: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).

Noticing these problems, the late Professor Leo Breiman, in 2001, developed the Random Forests algorithm, which mitigates these problems, while at the same time providing even higher predictive accuracy than the majority of Decision Tree algorithm implementations. While the curriculum contains two excellent lectures on Random Forests, if you're interested, you can dive into the original paper here: https://link.springer.com/content/pdf/10.1023%2FA%3A1010933404324.pdf.

In the next part of this assignment, you are going to use the same "German Credit" dataset to train, tune, and measure the performance of a Random Forests model. You will also see that this model, even though it's a bit of a "black box", offers certain functionality that provides some degree of interpretability.

First, let's build a Random Forests model, using the same best practices that you've used for your Decision Trees model. You can reuse the things you've already imported there, so no need to do any re-imports, new train/test splits, or loading up the data again.
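
The workflow described above can be sketched as follows. To keep the example self-contained it uses synthetic data from `make_classification` and a deliberately small, illustrative parameter grid; both are assumptions. In the notebook you would reuse your existing `Xtrain`, `Xtest`, `Ytrain` and `Ytest` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption); in the notebook, reuse the existing split.
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Illustrative grid, kept small so the search finishes quickly.
rf_params = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5],
}

# Tune with 4-fold cross-validation on the training portion only.
rf_search = GridSearchCV(RandomForestClassifier(random_state=0), rf_params, cv=4, scoring="accuracy")
rf_search.fit(X_tr, y_tr)

# Evaluate the refitted best model once on the held-out test set.
rf_best = rf_search.best_estimator_
rf_accuracy = accuracy_score(y_te, rf_best.predict(X_te))
print(rf_search.best_params_, rf_accuracy)
```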

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Your code here! :)


As mentioned, there are certain ways to "peek" into a model created by the Random Forests algorithm. The first, and most popular one, is the Feature Importance calculation functionality. This allows the ML practitioner to see an ordering of the importance of the features that have contributed the most to the predictive accuracy of the model. 

You can see how to use this in the scikit-learn documentation (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.feature_importances_). Now, if you tried this, you would just get an ordered table of not directly interpretable numeric values. Thus, it's much more useful to show the feature importance in a visual way. You can see an example of how that's done here: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-py
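
As a rough sketch of the idea, the block below fits a forest on synthetic data and sorts the `feature_importances_` values into a labelled pandas Series. The data and the `feature_i` names are hypothetical stand-ins for your own model and `x.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and names (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# feature_importances_ is impurity-based and sums to 1 across all features.
importances = pd.Series(forest.feature_importances_, index=names).sort_values(ascending=False)
print(importances)
```

With matplotlib installed, calling `importances.plot.barh()` on the resulting Series is one way to turn this table into a bar chart.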

Now you try! Let's visualize the importance of features from your Random Forests model!

In [None]:
# Your code here

A final method for gaining some insight into the inner workings of your Random Forests models is the so-called Partial Dependence Plot. The Partial Dependence Plot (PDP or PD plot) shows the marginal effect of a feature on the predicted outcome of a previously fit model. The prediction function is fixed at a few values of the chosen features and averaged over the other features. A partial dependence plot can show whether the relationship between the target and a feature is linear, monotonic, or more complex.
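
To show what "fixed at a few values and averaged over the other features" means in practice, here is a hand-rolled sketch of a one-feature partial dependence calculation. It is not the PDPbox API; the helper `partial_dependence_1d` is a hypothetical name introduced for illustration, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def partial_dependence_1d(model, X, feature_idx, grid):
    """Average predicted probability of class 1 with one feature forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # fix the chosen feature for every row
        # Averaging over rows marginalizes over the observed values of the other features.
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)


# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Evaluate the curve at 10 evenly spaced values across the range of feature 0.
grid = np.linspace(X_demo[:, 0].min(), X_demo[:, 0].max(), num=10)
pd_values = partial_dependence_1d(forest, X_demo, feature_idx=0, grid=grid)
print(np.column_stack([grid, pd_values]))
```

Plotting `pd_values` against `grid` gives the partial dependence curve for that feature.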

In scikit-learn, PDPs are implemented and available for certain algorithms, but at this point (version 0.20.0) they are not yet implemented for Random Forests. Thankfully, there is an add-on package called **PDPbox** (https://pdpbox.readthedocs.io/en/latest/) which adds this functionality to Random Forests. The package is easy to install through pip.

In [None]:
# ! pip install pdpbox

While we encourage you to read the documentation for the package (and reading package documentation in general is a good habit to develop), the authors of the package have also written an excellent blog post on how to use it, showing examples on different algorithms from scikit-learn (the Random Forests example is towards the end of the blog post): https://briangriner.github.io/Partial_Dependence_Plots_presentation-BrianGriner-PrincetonPublicLibrary-4.14.18-updated-4.22.18.html

So, armed with this new knowledge, feel free to pick a few features, and make a couple of Partial Dependence Plots of your own!

In [None]:
# Your code here!

## (Optional) Advanced Boosting-Based Algorithms

As explained in the video lectures, the next generation of algorithms after Random Forests (which use Bagging, a.k.a. Bootstrap Aggregation) was developed using Boosting, and the first of these was Gradient Boosted Machines, which are implemented in scikit-learn (http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting).
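
For reference, a minimal sketch of the scikit-learn Gradient Boosting classifier is shown below. The data is synthetic and the hyperparameter values are simply the library defaults written out explicitly, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Trees are added sequentially, each one fitted to correct the errors of the ensemble so far.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
gbm.fit(X_tr, y_tr)

gbm_accuracy = accuracy_score(y_te, gbm.predict(X_te))
print(gbm_accuracy)
```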

Still, in recent years, a number of variations on GBMs have been developed by different research and industry groups, all of them bringing improvements in speed, accuracy, and functionality to the original Gradient Boosting algorithms.

In no order of preference, these are:
1. **XGBoost**: https://xgboost.readthedocs.io/en/latest/
2. **CatBoost**: https://tech.yandex.com/catboost/
3. **LightGBM**: https://lightgbm.readthedocs.io/en/latest/

If you're using the Anaconda distribution, these are all very easy to install:

In [None]:
# ! conda install -c anaconda py-xgboost --all -y


In [None]:
# ! conda install -c conda-forge catboost --all -y



In [None]:
# ! conda install -c conda-forge lightgbm --all -y


Your task in this optional section of the mini project is to read the documentation of these three libraries, and apply all of them to the "German Credit" dataset, just like you did in the case of Decision Trees and Random Forests.

The final deliverable of this section should be a table (can be a pandas DataFrame) that shows the accuracy of all five algorithms taught in this mini project in one place.
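
One possible shape for such a table is sketched below. It uses synthetic data and only scikit-learn models so that it runs without extra installs; the accuracies it prints are for that synthetic data, not for "German Credit".

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); use your German Credit split in the notebook.
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Illustrative, untuned models; swap in your tuned estimators.
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit every model on the same training data and score it on the same test data.
rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rows.append({"model": name, "test_accuracy": accuracy_score(y_te, model.predict(X_te))})

results = pd.DataFrame(rows).sort_values("test_accuracy", ascending=False).reset_index(drop=True)
print(results)
```

XGBoost, CatBoost and LightGBM each ship a scikit-learn-style classifier (`XGBClassifier`, `CatBoostClassifier`, `LGBMClassifier`), so they can be added to the `models` dictionary in the same way once installed.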

Happy modeling! :)