<a href="https://colab.research.google.com/github/al025/Machine-Learning-Study-Notes/blob/master/c4_chap1_3_CART.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Application
- predict whether a tumor is benign or malignant 
- predict miles per gallon(mpg) for a car 

#### Objectives
- use DecisionTreeClassifier to infer class labels
- use DecisionTreeRegressor to predict continuous target values

## Classification and Regression Tree (CART)
- an internal node is split to maximize **information gain**
- information gain is measured using *gini index* or *entropy*
- sequence of if-else questions about individual features
###### Advantages:
- easy to interpret
- no need for scaling
###### Limitations:
- (classifier) othogonal decision boundary 
- without constraints may lead to high variance (overfitting)
- sensitive to small variations in the training datasets 

### Demo 1: use DecisionTreeClassifier to decide a tumor is benign or malignant

- the **decision boundary** where examples of different classes are separated into different subspace in the feature space is linear for linear models (such as LogisticRegression, LinearSVC) but **nonlinear** & **orthogonal** for DecisionTreeClassifier



In [1]:
from google.colab import drive
drive.mount('/content/drive')
dir = '/content/drive/My Drive/MachineLearningDatacampCareerTrack/c4_tree_based_models/data'

Mounted at /content/drive


In [38]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# load data 
fname = dir + '/wbc.csv'
df = pd.read_csv(fname)
# convert str class labels to numerical values ## this is unnecessary, scikit-learn accepts categorical values as targets (although it does not accept categorical features)
# num_label_dict = dict(zip(['M', 'B'], [1, 0])) 
# df.diagnosis = df['diagnosis'].map(num_label_dict, na_action='ignore')
# set up feature matrix and class label vector
X, y = df.drop('diagnosis', axis=1).values, df['diagnosis'].values
# replace nan with the mean of this feature
X = SimpleImputer().fit_transform(X)
# split into 80% training set, 20% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

parameters = {'max_depth':[3, 6, 9, 12]}
dt = DecisionTreeClassifier()
cv = GridSearchCV(dt, parameters, cv=5)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)

print("Tuned hyperparameters: {}".format(cv.best_params_))
print("Best cross validation score in training: {}".format(cv.best_score_))
print("Classification report in test:\n {}".format(classification_report(y_test, y_pred)))


Tuned hyperparameters: {'max_depth': 6}
Best cross validation score in training: 0.9472527472527472
Classification report in test:
               precision    recall  f1-score   support

           B       0.95      0.97      0.96        72
           M       0.95      0.90      0.93        42

    accuracy                           0.95       114
   macro avg       0.95      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114



In [17]:
unconstrained_dt = DecisionTreeClassifier()
unconstrained_dt.fit(X_train, y_train)
y_train_pred = unconstrained_dt.predict(X_train)
print("training set classification report\n {}".format(classification_report(y_train, y_train_pred)))
y_pred = unconstrained_dt.predict(X_test)
print("test set classification report\n {}".format(classification_report(y_test, y_pred)))
print("depth of the decision tree: {}".format(unconstrained_dt.get_depth()))
print("number of leaves of the decision tree: {}".format(unconstrained_dt.get_n_leaves()))
# overfitting ???

training set classification report
               precision    recall  f1-score   support

           B       1.00      1.00      1.00       288
           M       1.00      1.00      1.00       167

    accuracy                           1.00       455
   macro avg       1.00      1.00      1.00       455
weighted avg       1.00      1.00      1.00       455

test set classification report
               precision    recall  f1-score   support

           B       0.97      0.88      0.92        69
           M       0.84      0.96      0.90        45

    accuracy                           0.91       114
   macro avg       0.91      0.92      0.91       114
weighted avg       0.92      0.91      0.91       114

depth of the decision tree: 6
number of leaves of the decision tree: 14


### Demo 2: Use DecisionTreeRegressor to predict car mpg 

In [2]:
import pandas as pd 
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from scipy.stats import uniform, randint
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error as MSE
from sklearn.linear_model import LinearRegression

# load data
fname = dir + '/auto.csv'
df = pd.read_csv(fname)
# covert categorical feature origin to numerical values
df = pd.get_dummies(df, drop_first=True)
X, y = df.drop('mpg', axis=1).values, df.mpg
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# use RandomizedSearchCV to find the best hyperparameters for DecisionTreeRegressor
params = {'max_depth': randint(1,12),
      'min_samples_leaf': uniform(loc=0, scale=0.5)}
dt = DecisionTreeRegressor()
cv = RandomizedSearchCV(dt, params, cv=5)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
# report root of mean squared error of dt 
rmse_dt = MSE(y_test, y_pred) ** 0.5
print("RMSE of decision tree regressor: {:.2f}".format(rmse_dt))

# compare with a LinearRegression model
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

rmse_lr = MSE(y_test, y_pred_lr) ** 0.5
print("RMSE of linear regreesion: {:.2f}".format(rmse_lr))

# in this example, LinearRegression outperforms DecisionTreeRegressor even if 
# lr doesn't have data scaling and hyperparameter tunning

RMSE of decision tree regressor: 4.45
RMSE of linear regreesion: 4.40


In [8]:
unconstrained_dt = DecisionTreeRegressor()
unconstrained_dt.fit(X_train, y_train)
y_pred = unconstrained_dt.predict(X_train)
rmse_undt = MSE(y_train, y_pred)**0.5
print("Unconstrained dtr RMSE: {:.2f}".format(rmse_dt))

print(unconstrained_dt.tree_.node_count)

Unconstrained dtr RMSE: 4.45
537


### Chap 2 Objectives
- understand the bias-variance tradeoff when we talk about model complexity, their relationship with underfitting and overfitting
- use sklearn.ensemble.VotingClassifier

### ensemble
- different models trained on the same dataset
- let each model make their own prediction
- final prediction of the ensemble aggregates individual predictions, (hopefully) is more robust & less prone to errors

#### demo 3: ensembles outperforms all its component individual classifiers

In [28]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

# load data
fname = dir + '/indian_liver_patient/indian_liver_patient_preprocessed.csv'
df = pd.read_csv(fname)
X, y = df.drop('Liver_disease', axis=1), df.Liver_disease
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7, stratify=y)

# initiate each individual classifier 
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=7)
knn = KNeighborsClassifier(n_neighbors=27)
lr = LogisticRegression(random_state=7)

classifiers = [('decision tree', dt), ('KNN', knn), ('logistic regression', lr)]
for name, clf in classifiers:
  clf.fit(X_train, y_train)
  print("{:s} {:.3f}".format(name, clf.score(X_test, y_test)))

vc = VotingClassifier(estimators=classifiers)
vc.fit(X_train, y_train)
print("voting classifier {:.3f}".format(vc.score(X_test, y_test)))

### in my experiments, the ensemble did not outperform all individual classifiers every time

decision tree 0.690
KNN 0.713
logistic regression 0.672
voting classifier 0.678


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


## bagging
- one model, different subsets of training data (sampled with replacement from the training set)
- **bootstrap** aggregation
- on average, 63% of the training sets are sampled 
- the remaining called **out-of-bag (OOB)** samples can be used to evaluated the model's generalization error. This is known as OOB evaluation

In [37]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# load data 
fname = dir + '/wbc.csv'
df = pd.read_csv(fname)
# set up feature matrix and class label vector
X, y = df.drop('diagnosis', axis=1).values, df['diagnosis'].values
# replace nan with the mean of this feature
X = SimpleImputer().fit_transform(X)
# split into 80% training set, 20% test set
SEED = 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED, stratify=y)

dt = DecisionTreeClassifier(min_samples_leaf=8, random_state=SEED)
bc = BaggingClassifier(base_estimator=dt, n_estimators=50, oob_score=True, random_state=SEED)

bc.fit(X_train, y_train)
print("bagging classifier test set accuracy: {:.3f}".format(bc.score(X_test, y_test)))
print('bagging classifier oob score: {:.3f}'.format(bc.oob_score_))
### bagging classifier does not outperform an individual decision tree classifier why???
### oob score is not close to test accuracy when random_state =2, 5, 7... why??? 

bagging classifier test set accuracy: 0.942
bagging classifier oob score: 0.947
