<a href="https://colab.research.google.com/github/zerotodeeplearning/ztdl-masterclasses/blob/master/solutions_do_not_open/Classification_with_Scikit_Learn_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learn with us: www.zerotodeeplearning.com

Copyright © 2021: Zero to Deep Learning ® Catalit LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Documentation links:

- [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb)
- [NumPy](https://docs.scipy.org/doc/)
- [Pandas](https://pandas.pydata.org/docs/getting_started/index.html)
- [Pandas Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [Matplotlib](https://matplotlib.org/)
- [Matplotlib Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf)
- [Seaborn](https://seaborn.pydata.org/)
- [Scikit-learn](https://scikit-learn.org/stable/user_guide.html)
- [Scikit-learn Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf)
- [Scikit-learn Flow Chart](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

# Classification with Scikit-learn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Load data

In [None]:
url = "https://raw.githubusercontent.com/zerotodeeplearning/ztdl-masterclasses/master/data/"

In [None]:
df = pd.read_csv(url + 'geoloc_elev.csv')

### Exploration

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df['source'].value_counts()

In [None]:
df['target'].value_counts()

In [None]:
sns.pairplot(df, hue='target');

### Features and Labels

In [None]:
y = df['target']

In [None]:
raw_features = df.drop('target', axis=1)

# One-hot encode the categorical column "source"
X = pd.get_dummies(raw_features)

X.head()
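
To make the one-hot encoding step concrete, here is a minimal sketch on a toy frame. The column values below are hypothetical and only illustrate how `pd.get_dummies` treats numeric and categorical columns; they are not taken from `geoloc_elev.csv`.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one categorical column.
toy = pd.DataFrame({
    "elev": [0.1, 0.5, 0.9],
    "source": ["C", "Q", "C"],
})

# Numeric columns pass through unchanged; each category of "source"
# becomes its own indicator column named "<column>_<category>".
encoded = pd.get_dummies(toy)

print(list(encoded.columns))  # ['elev', 'source_C', 'source_Q']
print(encoded)
```

Each row has exactly one active indicator among the `source_*` columns, which is what lets a model consume the categorical column as numbers.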

### Model training and evaluation

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size = 0.3, random_state=0)

In [None]:
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)

pd.DataFrame(cm,
             index=["Miss", "Hit"],
             columns=['pred_Miss', 'pred_Hit'])

In [None]:
print(classification_report(y_test, y_pred))
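
The numbers in the classification report can be derived from the confusion matrix. The sketch below uses small made-up label arrays (not the notebook's data) and assumes the same convention as above: class 0 is a "Miss" and class 1 is a "Hit".

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()  # tn=3, fp=1, fn=2, tp=2

# Precision: of everything predicted "Hit", how much really was a hit?
precision = tp / (tp + fp)
# Recall: of all real hits, how many did the model find?
recall = tp / (tp + fn)

print("precision: {:.3f}, recall: {:.3f}".format(precision, recall))

# The hand-computed values agree with scikit-learn's own metrics.
assert np.isclose(precision, precision_score(y_true_demo, y_pred_demo))
assert np.isclose(recall, recall_score(y_true_demo, y_pred_demo))
```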

### Decision Boundary

In [None]:
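def plot_decision_boundary(model):
  """Shade the model's predicted class over the lat/lon plane."""
  # Grid of points covering [-2, 2] x [-2, 2]
  hticks = np.linspace(-2, 2, 101)
  vticks = np.linspace(-2, 2, 101)

  aa, bb = np.meshgrid(hticks, vticks)
  a_flat = aa.ravel()
  b_flat = bb.ravel()
  N = len(a_flat)

  # The model was trained on all columns of X, so pad the
  # 4 remaining feature columns with zeros
  zeros = np.zeros((N, 4))
  ab = np.c_[a_flat, b_flat, zeros]

  c = model.predict(ab)

  # Back to the grid shape so contourf can draw the predicted regions
  cc = c.reshape(aa.shape)
  plt.contourf(aa, bb, cc, cmap='bwr', alpha=0.2)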

In [None]:
df.plot(kind='scatter', c='target', x='lat', y='lon', cmap='bwr')

plot_decision_boundary(model)
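
The grid construction inside `plot_decision_boundary` is easier to follow at a small scale. This sketch uses a 3 x 3 grid and a hypothetical stand-in rule in place of `model.predict`, so it runs without a trained model or a plot.

```python
import numpy as np

# A 3 x 3 grid instead of 101 x 101, so the arrays are easy to inspect.
hticks = np.linspace(-2, 2, 3)  # [-2., 0., 2.]
vticks = np.linspace(-2, 2, 3)

# aa holds the horizontal coordinate of every grid node, bb the vertical one.
aa, bb = np.meshgrid(hticks, vticks)

# Flatten both grids and pair them up: one row per grid node.
points = np.c_[aa.ravel(), bb.ravel()]

# Pad with zero columns so each row has as many features as the model expects
# (4 extra columns here, mirroring the function above).
padded = np.c_[points, np.zeros((len(points), 4))]

# Stand-in for model.predict (an assumption for illustration):
# predict class 1 wherever the first coordinate is positive.
labels = (padded[:, 0] > 0).astype(int)

# Reshape the flat predictions back to the grid shape for contourf.
grid = labels.reshape(aa.shape)

print(points.shape, padded.shape, grid.shape)  # (9, 2) (9, 6) (3, 3)
print(grid)
```

Padding with zeros means the shaded regions show the model's prediction only for the slice of feature space where every other feature equals zero.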

### Exercise 1


Iterate and improve on the decision tree model. Now that you have a basic pipeline, how can you improve the score? Try some of the following:

1. Change some of the initialization parameters of the decision tree and re-run the code.
    - Does the score change?
    - Does the decision boundary change?
2. Try another model, such as Logistic Regression, Random Forest, SVM, Naive Bayes, or any other model you like from [here](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html).
3. What's the highest score you can get?

An easy way to achieve all of the above is to define a function that trains and evaluates the model like this one:


```python
def train_eval(model):
  # YOUR CODE HERE
  
```

and then loop over a list of models:

```python
models = [...]

for model in models:
  train_eval(model)
```

Bonus points if you also measure the time it takes for each model to train.

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from time import time

In [None]:
def pretty_cm(y_true, y_pred):
  cm = confusion_matrix(y_true, y_pred)

  cmdf = pd.DataFrame(cm,
                      index=["Miss", "Hit"],
                      columns=['pred_Miss', 'pred_Hit'])
  return cmdf

def train_eval(model):
  t0 = time()
  model.fit(X_train.values, y_train)
  t1 = time()

  y_pred_train = model.predict(X_train.values)
  y_pred_test = model.predict(X_test.values)

  train_acc = accuracy_score(y_train, y_pred_train)
  test_acc = accuracy_score(y_test, y_pred_test)

  cmdf = pretty_cm(y_test, y_pred_test)
  res = "{}: train: {:0.3}, test: {:0.3}, time (ms): {:0.3}".format(
      model.__class__.__name__, 
      train_acc,
      test_acc,
      1000*(t1 - t0))
  
  df.plot(kind='scatter', c='target', x='lat', y='lon', cmap='bwr')
  
  plot_decision_boundary(model)

  plt.title(res)
  
  plt.text(2, -2,
           str(cmdf),
           horizontalalignment='right',
           bbox={'facecolor':'white'})
  plt.show()


train_eval(model)

In [None]:
models = [DecisionTreeClassifier(max_depth=3),
          DecisionTreeClassifier(max_depth=6),
          RandomForestClassifier(),
          ExtraTreesClassifier(),
          GaussianNB(),
          LogisticRegression(),
          XGBClassifier(),
          SVC(),
          MLPClassifier(),
          ]

In [None]:
for model in models:
  train_eval(model)

### Exercise 2


- load the churn dataset `churn.csv`
- assign the `Churn` column to a variable called `y`
- assign the other columns to a variable called `features`
- select numerical columns with `features.select_dtypes` and assign them to a variable called `X`
- split data into train/test with `test_size=0.3` and `random_state=42`
- modify the `train_eval` function defined earlier to test and compare different models and hyperparameter combinations.

You can find a list of models available [here](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html).


In [None]:
df = pd.read_csv(url + 'churn.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
y = df['Churn'] == 'Yes'
features = df.drop('Churn', axis=1)

In [None]:
X = features.select_dtypes(include=['number'])
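
As a quick illustration of the label and feature selection steps above, here is a sketch on a toy frame. The column names and values are hypothetical placeholders, not the actual contents of `churn.csv`.

```python
import pandas as pd

# Hypothetical toy frame mixing numeric and text columns.
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "Churn": ["No", "No", "Yes"],
})

# Boolean label: True where the customer churned.
y_toy = toy["Churn"] == "Yes"

# Everything except the label is a candidate feature.
features_toy = toy.drop("Churn", axis=1)

# Keep only integer and float columns; text columns are dropped.
X_toy = features_toy.select_dtypes(include=["number"])

print(list(X_toy.columns))  # ['tenure', 'MonthlyCharges']
print(y_toy.tolist())       # [False, False, True]
```

Dropping the text columns keeps the exercise simple, at the cost of discarding information that one-hot encoding could have preserved.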

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

In [None]:
def train_eval(model):

  model.fit(X_train.values, y_train)
  y_pred_train = model.predict(X_train.values)
  y_pred_test = model.predict(X_test.values)

  train_acc = accuracy_score(y_train, y_pred_train)
  test_acc = accuracy_score(y_test, y_pred_test)

  print("{: <25} Train: {: 0.4} Test: {: 0.4}".format(
      model.__class__.__name__,
      train_acc,
      test_acc))
  

for model in models:
  train_eval(model)

### Exercise 3

Define a new function that also keeps track of the time required to train the model. Your new function will look like:

```python
def train_eval_time(model):
  # YOUR CODE HERE
  
  return model, train_acc, test_acc, dt
```

In [None]:
from time import time

In [None]:
def train_eval_time(model):
  t0 = time()
  model.fit(X_train.values, y_train)
  dt = time() - t0
  y_pred_train = model.predict(X_train.values)
  y_pred_test = model.predict(X_test.values)

  train_acc = accuracy_score(y_train, y_pred_train)
  test_acc = accuracy_score(y_test, y_pred_test)

  return model, train_acc, test_acc, dt
  

for model in models:
  model, train_acc, test_acc, dt = train_eval_time(model)
  print("{: <25} Train: {: 0.4}\tTest: {: 0.4}\tTrain Time: {: 0.4}".format(
      model.__class__.__name__,
      train_acc,
      test_acc,
      dt))
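
One possible way to compare many models is to collect the returned values in a DataFrame and sort it, rather than reading printed lines. The sketch below is self-contained and makes several assumptions: it uses synthetic data from `make_classification` instead of the churn dataset, `perf_counter` instead of `time`, and a hypothetical helper `train_eval_row` that returns a dict.

```python
from time import perf_counter

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the churn dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)


def train_eval_row(model):
  """Fit the model and return its scores and training time as a dict."""
  t0 = perf_counter()
  model.fit(X_tr, y_tr)
  dt = perf_counter() - t0
  return {
      "model": model.__class__.__name__,
      "train_acc": accuracy_score(y_tr, model.predict(X_tr)),
      "test_acc": accuracy_score(y_te, model.predict(X_te)),
      "train_time_s": dt,
  }


demo_models = [DecisionTreeClassifier(max_depth=3, random_state=0),
               LogisticRegression(max_iter=1000)]

# One row per model, best test accuracy first.
results = pd.DataFrame([train_eval_row(m) for m in demo_models])
results = results.sort_values("test_acc", ascending=False)
results = results.reset_index(drop=True)
print(results)
```

A large gap between `train_acc` and `test_acc` in such a table is a quick hint that a model is overfitting.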

In [None]:
%%timeit
train_eval_time(models[0])