# Architecture

## Decision Tree

Going with XGBoost Ensemble

**The steps for building a decision tree are as follows:**
- Start with all examples at the root node
- Calculate information gain for splitting on all possible features, and pick the one with the highest information gain
- Split the dataset according to the selected feature, and create left and right branches of the tree
- Keep repeating the splitting process until a stopping criterion is met
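
The information-gain step above can be sketched as follows. This is a minimal illustration, assuming binary (0/1) features and labels and entropy as the impurity measure; the helper names `entropy`, `information_gain`, and `best_split` and the toy arrays are made up for this example.

```python
import numpy as np


def entropy(y):
    """Entropy (in bits) of an array of 0/1 labels."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)


def information_gain(X, y, feature):
    """Entropy reduction from splitting on one binary feature."""
    left = X[:, feature] == 1
    right = ~left
    w_left = left.mean()
    w_right = 1 - w_left
    weighted = w_left * entropy(y[left]) + w_right * entropy(y[right])
    return entropy(y) - weighted


def best_split(X, y):
    """Index of the feature with the highest information gain, plus all gains."""
    gains = [information_gain(X, y, j) for j in range(X.shape[1])]
    return int(np.argmax(gains)), gains


# Toy data (hypothetical): feature 0 separates the labels perfectly,
# feature 1 carries no information about them.
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])

feature, gains = best_split(X, y)
print(feature, gains)
```

A full tree builder would apply `best_split` recursively to each branch until the stopping criterion is met (maximum depth, pure node, and so on).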
  

In [1]:
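import numpy as np
import matplotlib.pyplot as plt

# `public_tests` and `utils` are helper modules that were not importable here
# (ModuleNotFoundError: No module named 'public_tests'), so these imports are disabled.
# from public_tests import *
# from utils import *

%matplotlib inline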

## Decision Tree: `HistGradientBoostingClassifier`

For tree-based models, the handling of numerical and categorical variables is
simpler than for linear models:
* we do **not need to scale the numerical features**
* using an **ordinal encoding for the categorical variables** is fine even if
  the encoding results in an arbitrary ordering
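
To make the ordinal encoding concrete, here is a small sketch on made-up colour values (the data is hypothetical). It shows the arbitrary (alphabetical) integer codes and how a category unseen during `fit` maps to `-1`.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on three known categories; codes follow sorted order, not any real ranking
train = np.array([["red"], ["green"], ["blue"]], dtype=object)
encoder.fit(train)

# "purple" was never seen during fit, so it is encoded as -1
encoded = encoder.transform(np.array([["green"], ["purple"]], dtype=object))
print(encoder.categories_[0], encoded.ravel())
```

A tree can still isolate any single category with a couple of threshold splits, which is why the arbitrary ordering is usually acceptable here.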

In [None]:
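from sklearn.compose import ColumnTransformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Categories unseen during fit are encoded as -1 instead of raising an error
categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)

# `categorical_columns` is the list of categorical column names
# (built under "Separate Data Types"); numerical columns pass through unscaled
preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns)],
    remainder="passthrough")

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())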

## Logistic Regression

### Import Data

In [None]:
import pandas as pd

df = pd.read_csv("<PATH>")
# drop the duplicated column
df = df.drop(columns="<DUPLICATED>")

target_name = "<TARGET COLUMN>"
target = df[target_name]

data = df.drop(columns=[target_name])

### Separate Data Types

First, inspect the columns (for example with `data.dtypes`) to make sure none has the wrong type, such as a numerically coded category that was read as a number.

In [None]:
from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)
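
To sanity-check the selectors, here is a small sketch on a made-up frame. The column names and values are hypothetical, and the string column is given `object` dtype explicitly because newer pandas versions may infer a dedicated string dtype instead.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame: two numerical columns and one categorical column
toy = pd.DataFrame({
    "age": [25, 40],
    "city": pd.Series(["Oslo", "Lima"], dtype=object),
    "income": [3.5, 4.2],
})

num_cols = selector(dtype_exclude=object)(toy)
cat_cols = selector(dtype_include=object)(toy)

print(toy.dtypes)
print(num_cols, cat_cols)
```

Each selector returns a plain list of column names, which is what `ColumnTransformer` expects.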

### Setup `dtype` Specific Preprocessing

We define the preprocessing for each column depending on its data type:

* **one-hot encoding** is applied to categorical columns. We also use
  `handle_unknown="ignore"` to avoid errors caused by rare categories that
  only show up at prediction time.
* **numerical scaling** is applied to numerical features, which are standardized.

First, create the preprocessors for the numerical and categorical components.

In [5]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()

Now, create the transformer and associate each of these preprocessors with its respective data columns.

In [6]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard_scaler', numerical_preprocessor, numerical_columns)])


A `ColumnTransformer` does the following:

* It **splits the columns** of the original dataset based on the column names
  or indices provided. We will obtain as many subsets as the number of
  transformers passed into the `ColumnTransformer`.
* It **transforms each subset**. A specific transformer is applied to
  each subset: it will internally call `fit_transform` or `transform`. The
  output of this step is a set of transformed datasets.
* It then **concatenates the transformed datasets** into a single dataset.
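
The three steps above can be seen on a tiny example. This is a sketch on made-up data (the `city` and `age` columns are hypothetical), not the dataset used in these notes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one categorical and one numerical column
toy = pd.DataFrame({
    "city": pd.Series(["Oslo", "Lima", "Oslo"], dtype=object),
    "age": [20.0, 30.0, 40.0],
})

ct = ColumnTransformer([
    ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("standard_scaler", StandardScaler(), ["age"]),
])

# Split by column, transform each subset, then concatenate the results
out = ct.fit_transform(toy)
print(out.shape)
```

The two city categories become two one-hot columns and `age` becomes one scaled column, so the concatenated result has three columns.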

**Note:** you can also feed pipelines into the `ColumnTransformer`, for example if you need to impute values:

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Impute missing numerical values with the median, then standardize
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Categories not seen during fit are encoded as a vector of 0s
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Reuse the column lists built under "Separate Data Types"
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numerical_columns),
    ('cat', categorical_transformer, categorical_columns)
])

`ColumnTransformer` is like any other scikit-learn transformer. In particular, it can be combined with a classifier
in a `Pipeline`:

### Create Pipeline

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
model

Once this is set up, follow the standard process:

- the `fit` method is called to preprocess the data and then train the
  classifier on the preprocessed data;
- the `predict` method makes predictions on new data;
- the `score` method is used to predict on the test data and compare the
  predictions to the expected test labels to compute the accuracy.
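
A minimal end-to-end sketch of that `fit` / `predict` / `score` sequence, assuming synthetic numerical data from `make_classification` and a `StandardScaler` standing in for the full preprocessor (both are assumptions so the snippet runs on its own):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

toy_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

toy_model.fit(X_train, y_train)            # fit the scaler, then the classifier
predictions = toy_model.predict(X_test)    # scale with train statistics, then predict
accuracy = toy_model.score(X_test, y_test) # fraction of correct test predictions

print(f"{accuracy:.3f}")
```

Because the scaler lives inside the pipeline, its statistics are learned from the training split only and reused as-is on the test split.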

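**Alt** method of creating the pipeline
> `make_pipeline` names each step automatically after its lowercased class name (e.g. `logisticregression`), while `Pipeline` takes explicit `(name, estimator)` pairs.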

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])
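
A quick way to see how `make_pipeline` and `Pipeline` differ is to compare the step names they produce. In this sketch a `StandardScaler` stands in for the preprocessor (an assumption, so the snippet is self-contained).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline derives step names from the lowercased class names
auto = make_pipeline(StandardScaler(), LogisticRegression())

# Pipeline requires an explicit name for every step
explicit = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])

auto_names = list(auto.named_steps)
explicit_names = list(explicit.named_steps)
print(auto_names, explicit_names)
```

The step names matter later when addressing hyperparameters, for example `classifier__C` versus `logisticregression__C` in a parameter grid.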

#### Visualize Pipeline

In [None]:
from sklearn import set_config

set_config(display='diagram')
model

### Split Data

In [None]:
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42)

### Cross Validation

Cross-validation repeats the fit, predict, and score steps on several train/test splits of the data (5 folds here) and collects one score per split.

In [None]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target, cv=5)
cv_results
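
To show what `cross_validate` returns, here is a sketch on synthetic data (the dataset and the `toy_model` pipeline are assumptions for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
toy_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

# One fit and one score per fold; results come back as a dict of arrays
toy_cv_results = cross_validate(toy_model, X, y, cv=5)

print(sorted(toy_cv_results.keys()))
print(toy_cv_results["test_score"])
```

Each array has one entry per fold, so `test_score` holds five accuracies here.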

**Alt**

`ShuffleSplit` draws each test set at random, so the same sample can appear in the test set of several splits.

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=40, test_size=0.3, random_state=0)
# `model` is the classification pipeline built above, scored by accuracy
cv_results = cross_validate(
    model, data, target, cv=cv, scoring="accuracy")
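
The splits that `ShuffleSplit` generates can be inspected directly. A small sketch on ten made-up sample indices (the sizes are chosen only for illustration):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

samples = np.arange(10)
toy_cv = ShuffleSplit(n_splits=4, test_size=0.3, random_state=0)

# Each split is a pair of index arrays: (train indices, test indices)
splits = list(toy_cv.split(samples))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx), sorted(test_idx))
```

Within one split the train and test indices never overlap, but the test sets of different splits are drawn independently of each other.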

### Score

In [None]:
scores = cv_results["test_score"]
print("The mean cross-validation accuracy is: "
      f"{scores.mean():.3f} ± {scores.std():.3f}")

# Training

# Diagnostics

## Bias/Variance

[Andrew Ng Video](https://www.coursera.org/learn/advanced-learning-algorithms/lecture/L6SHx/diagnosing-bias-and-variance)

## Error Analysis

[Andrew Ng Video](https://www.coursera.org/learn/advanced-learning-algorithms/lecture/FaPgS/error-analysis)