# Part 2 - Nonlinear Models
## Goal:
1. Compare nonlinear models, including:
* Decision Tree Regressor
* Random Forest
* XGBoost
* Neural Network (if we have time)
2. Other models worth knowing that we will not cover today:
* LightGBM
* TensorFlow Decision Forests
3. Load all models and run a feature importance analysis using:
* Permutation importance [Link](https://scikit-learn.org/stable/modules/permutation_importance.html)
* SHAP package [Link](https://shap.readthedocs.io/en/latest/) (if we have time)

### Readings to review:
1. Intro to decision trees [Link](https://medium.com/@MrBam44/decision-trees-91f61a42c724#:~:text=A%20decision%20tree%20is%20a,Bagging%2C%20and%20Boosted%20Decision%20Trees.)

# Introduction to Decision Trees (in 5 seconds)
<p align="center">
<img src="../asset/decision_tree.PNG" alt="decision_tree" style="width:50%; border:0;">
</p>

* Root node: the node at the top of the decision tree, where the population first starts splitting according to the features.
* Decision nodes: the nodes produced by a split that are themselves split further.
* Leaf nodes: the nodes that are not split any further, also called terminal nodes.
* Branch/sub-tree: a subsection of the decision tree, just as a small portion of a graph is called a sub-graph.
* Pruning: removing some nodes to reduce overfitting.
* [Source](https://medium.com/@MrBam44/decision-trees-91f61a42c724#:~:text=A%20decision%20tree%20is%20aBagging%2C%20and%20Boosted%20Decision%20Trees)
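
To make the terms above concrete, here is a minimal sketch that fits two small regression trees with scikit-learn and prints the pruned one. The synthetic data, the feature names `x0` and `x1`, and the `max_depth` and `ccp_alpha` values are assumptions chosen for illustration only; they are not the dataset or settings used later in this tutorial.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data (illustrative assumption): the target jumps when x0 crosses 5,
# and x1 is pure noise, so the root node should split on x0.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] > 5, 20.0, 10.0) + rng.normal(0, 1, size=200)

# Unpruned tree: keeps splitting until no further split is possible,
# which tends to memorize the noise (overfitting).
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Pruned tree: depth is capped and cost-complexity pruning (ccp_alpha)
# removes splits that add little predictive value.
pruned_tree = DecisionTreeRegressor(
    max_depth=2, ccp_alpha=0.1, random_state=0
).fit(X, y)

print("Leaf nodes (unpruned):", full_tree.get_n_leaves())
print("Leaf nodes (pruned):  ", pruned_tree.get_n_leaves())

# Text view of the pruned tree: the first line is the root node's split,
# indented lines are decision nodes, and "value: ..." lines are leaf nodes.
print(export_text(pruned_tree, feature_names=["x0", "x1"]))
```

In the printed output, each indented block under a split is a branch (sub-tree). Comparing the two leaf counts shows how much pruning shrinks the tree.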

# Tutorials

### 1. Data Preprocessing
* Here we simply repeat the previous steps.

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import math
from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
import warnings
pd.set_option('display.max_columns', None)
warnings.filterwarnings("ignore")

In [None]:
# Load the training data, using the 'Id' column as the index
df_train = pd.read_csv('./data/train.csv', index_col='Id')
df_train.head()