## 1.05 Picking a Model

Choosing the right estimator/algorithm for our problem...

Scikit-learn uses *estimator* as another term for a machine learning model or algorithm.

* Classification - predicting whether a sample is one thing or another.
* Regression - predicting a number.
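As a minimal sketch of the two problem types, the example below fits one classifier and one regressor on synthetic data. The datasets from `make_classification` and `make_regression`, and the choice of `LogisticRegression` and `LinearRegression`, are illustrative assumptions rather than anything these notes prescribe. Both estimators share the same `fit`/`predict` interface; only the kind of target differs.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: the target is a category (here the label 0 or 1)
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=42)
clf = LogisticRegression().fit(X_clf, y_clf)
class_preds = clf.predict(X_clf[:5])   # array of class labels

# Regression: the target is a continuous number
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
reg = LinearRegression().fit(X_reg, y_reg)
number_preds = reg.predict(X_reg[:5])  # array of floats

print(class_preds)
print(number_preds)
```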

<img src='images/sklearn-ml-map.png' alt='' height='500'>

See [Interactive Sklearn ML Map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) for more details.

In [1]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

___
### Picking a machine learning model for a regression problem.

>Note: `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2 due to ethical concerns, so the Boston cells below only run on older versions. See [here](https://scikit-learn.org/1.0/modules/generated/sklearn.datasets.load_boston.html) for more info. These notes continue below with a [New Regression Model](#new-regression-model) that uses scikit-learn's California housing dataset.

In [None]:
# Import Boston housing dataset (built-in dataset from scikit-learn)
from sklearn.datasets import load_boston
boston = load_boston()
boston; # imports as a dictionary with keys 'data', 'target', and 'feature_names'

In [None]:
# Create a DataFrame from the boston dataset...

# Take data key from the boston dictionary and set the column names equal to the feature names
boston_df = pd.DataFrame(boston['data'], columns=boston['feature_names'])

# Create a target column (what we are trying to predict) by setting it to pd.Series and taking the target key from the boston dictionary
boston_df['target'] = pd.Series(boston['target'])

# View first five rows of the boston DataFrame
boston_df.head()

# pd.Series() is a one-dimensional array-like object containing a sequence of elements.
# pd.DataFrame() is a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

**About the dataset:**
* This dataset describes towns in the Boston area through a set of numeric features.
* Each row is a town and each column is a feature.
* The target column (price) is in thousands of dollars.

>Note: For more information on the different features in this dataset [click here](https://scikit-learn.org/stable/datasets/toy_dataset.html#boston-dataset)

#### We want to use the features to predict the median house price (regression problem).
Path on the map: over 50 samples > predicting a category? no > predicting a quantity? yes > fewer than 100K samples > few features should be important? not sure, so try Ridge regression first

<img src='images/sklearn-ml-map.png' alt='' height='500'>

See [Interactive Sklearn ML Map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) for more details.
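The path above can be written down as a small function. This is only a sketch of my reading of the regression branch of the map: `suggest_regressors` is a hypothetical helper, not part of scikit-learn, and the thresholds are the ones printed on the map (50 and 100K samples).

```python
def suggest_regressors(n_samples, few_features_important):
    """Hypothetical helper mirroring the regression branch of the scikit-learn map."""
    if n_samples <= 50:
        return ["get more data"]
    if n_samples >= 100_000:
        return ["SGDRegressor"]
    if few_features_important:
        return ["Lasso", "ElasticNet"]
    # Otherwise the map points to Ridge or a linear SVR first, and to an
    # RBF SVR or ensemble regressors if those do not work well enough
    return ["Ridge", "SVR(kernel='linear')"]

# A dataset with a few hundred rows and no strong prior about feature importance
print(suggest_regressors(500, few_features_important=False))
```

The map is a starting point rather than a rule, so the returned names are candidates to try, not a final answer.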

In [None]:
# How many samples are there?
len(boston_df)

In [None]:
# Let's try the Ridge Regression model
from sklearn.linear_model import Ridge

# Setup random seed
np.random.seed(42)

# Create the data
x = boston_df.drop('target', axis=1)
y = boston_df['target']

# Split into train and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instantiate Ridge model
model = Ridge()
model.fit(x_train, y_train)

# Check the score of the Ridge model on test data
model.score(x_test, y_test)

How do we improve this score?

What if Ridge wasn't working?

In [None]:
# Let's try the Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

# Setup random seed
np.random.seed(42)

# Create the data
x = boston_df.drop('target', axis=1)
y = boston_df['target']

# Split into train and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instantiate Random Forest model
rf = RandomForestRegressor()
rf.fit(x_train, y_train)

# Check the score of the Random Forest model on the test data
rf.score(x_test, y_test)

___
### New Regression Model

<span style="color:hotpink">Alternative dataset for picking a regression model...</span>

Let's use the [California Housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html).

In [2]:
# Get California Housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [3]:
# Let's add our data to a data frame
housing_df = pd.DataFrame(housing["data"])
housing_df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


In [4]:
# We need to include the feature names
housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
housing_df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


In [5]:
# Let's add our target column as well...
housing_df["MedHouseVal"] = housing["target"]
housing_df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
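The three cells above follow one pattern: take the `data` array, label its columns with `feature_names`, then append `target`. Here is a minimal offline sketch of that pattern. The `toy` object and its values are made up for illustration; it uses `sklearn.utils.Bunch`, the dictionary-like type that scikit-learn's dataset loaders return.

```python
import numpy as np
import pandas as pd
from sklearn.utils import Bunch

# Tiny stand-in for the object returned by a dataset loader (values are made up)
toy = Bunch(
    data=np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]),
    target=np.array([0.5, 1.5, 2.5]),
    feature_names=["feat_a", "feat_b"],
)

# Features become columns named after feature_names
toy_df = pd.DataFrame(toy["data"], columns=toy["feature_names"])

# The target is appended as one more column
toy_df["target"] = toy["target"]

print(toy_df)
```

`fetch_california_housing` also accepts `as_frame=True`, in which case the returned object's `frame` attribute holds the features and the target in a single DataFrame instead of `None`.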


We can gain more insight on the data by looking at the [user guide](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset).

* `MedInc`: median income in block group
* `HouseAge`: median house age in block group
* `AveRooms`: average number of rooms per household
* `AveBedrms`: average number of bedrooms per household
* `Population`: block group population
* `AveOccup`: average number of household members
* `Latitude`: block group latitude
* `Longitude`: block group longitude

The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts.

In [6]:
# Pick an estimator using the map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
# more than 50 samples --> not predicting a category --> predicting a quantity --> fewer than 100K samples --> few features should be important? (unclear)

# Let's try Ridge regression
from sklearn.linear_model import Ridge

# Setup random seed
np.random.seed(42)

# Let's create our data
X = housing_df.drop("MedHouseVal", axis=1)
y = housing_df["MedHouseVal"] # median house price in $100,000s

# Split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate and fit model (on training set)
model = Ridge()
model.fit(X_train, y_train)

# Check the score of the model (on the test set)
model.score(X_test, y_test) # returns coefficient of determination of the prediction (r^2).

0.5758549611440122
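For a regressor, `score` returns the coefficient of determination, R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below checks this by hand on synthetic data from `make_regression` (an assumption made so the example runs offline, so its numbers will not match the housing score above).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = Ridge().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2 computed by hand
ss_res = np.sum((y_test - y_pred) ** 2)         # squared errors of the model
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # squared errors of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

# All three values should agree
print(r2_manual, r2_score(y_test, y_pred), model.score(X_test, y_test))
```

A score of 1.0 means perfect predictions, 0.0 means the model does no better than always predicting the mean of the test targets, and negative values are possible for models that do worse than that.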

What if `Ridge` didn't work or the score didn't fit our needs?

Let's try an [ensemble model](https://scikit-learn.org/stable/modules/ensemble.html). An ensemble combines several smaller models so that together they make better predictions than any single one of them.

[Random Forest](https://scikit-learn.org/stable/modules/ensemble.html#random-forests-and-other-randomized-tree-ensembles) is a very common and powerful machine learning algorithm that uses the ensemble technique.

>Note: Here are some useful videos on [Random Forest](https://www.youtube.com/watch?v=v6VJ2RO66Ag) and [Decision Trees](https://www.youtube.com/watch?v=ZVR2Way4nwQ).
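To make the ensemble idea concrete, the sketch below fits a small forest on synthetic data (an assumption, so it runs without downloading anything) and checks that the forest's regression prediction is the average of the predictions of its individual trees, which are exposed through the fitted `estimators_` attribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# A small forest of 10 decision trees
forest = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Predictions of each individual tree for the first five samples
tree_preds = np.array([tree.predict(X[:5]) for tree in forest.estimators_])

# For regression, the forest prediction is the mean of the tree predictions
matches = np.allclose(tree_preds.mean(axis=0), forest.predict(X[:5]))

print(len(forest.estimators_), matches)
```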

In [7]:
# Import the RandomForestRegressor model class from the ensemble module
from sklearn.ensemble import RandomForestRegressor

# Setup random seed
np.random.seed(42)

# Create the data
X = housing_df.drop("MedHouseVal", axis=1)
y = housing_df["MedHouseVal"]

# Split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate and fit the model (on training set)
model = RandomForestRegressor()
model.fit(X_train, y_train)

# Check the score of the model (on the test set)
model.score(X_test, y_test)

0.8066196804802649
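The two cells above repeat the same split/fit/score steps for each model. One way to compare candidates on the exact same split is to loop over them, as in this sketch. It uses synthetic data and a fixed `random_state` (both assumptions for a self-contained example), so its scores say nothing about the housing data; on data generated by a linear model, `Ridge` will naturally look strong.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=42)

# One split shared by every candidate, so the scores are comparable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit each model on the training set and record its R^2 on the test set
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```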

___
### Choosing an estimator for a classification problem.

In [8]:
# Import heart disease dataset
heart_disease = pd.read_csv('data/heart-disease.csv')
heart_disease.head()

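| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |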


<img src='images/sklearn-ml-map.png' alt='' height='500'>

See [Interactive Sklearn ML Map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) for more details.

In [9]:
# Check length of dataset
len(heart_disease)

303

We want to use a patient's health parameters to predict whether they are likely to have heart disease (classification problem)...

Path on the map: over 50 samples > predicting a category > labeled data > fewer than 100K samples > Linear SVC

In [10]:
# Import the LinearSVC estimator class
from sklearn.svm import LinearSVC

# Setup random seed
np.random.seed(42)

# Make the data
x = heart_disease.drop('target', axis=1)
y = heart_disease['target']

# Split into train and test data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instantiate LinearSVC
clf = LinearSVC(max_iter=10000) # raising max_iter did not remove the convergence warning
clf.fit(x_train, y_train)

# Evaluate LinearSVC
clf.score(x_test, y_test)



0.8688524590163934
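The comment in the cell above notes that raising `max_iter` did not remove the convergence warning. A common cause of that warning with `LinearSVC` is features on very different scales, so one thing to try is standardizing the features first. The sketch below shows the idea with a `StandardScaler` pipeline on synthetic data from `make_classification`; this is an assumption-laden illustration, not something these notes tested on the heart disease data, so it may or may not change the warning or the score there.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in with 13 features, the same number of feature columns as the heart disease table
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The scaler is fit on the training data only, then applied before LinearSVC
scaled_svc = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
scaled_svc.fit(X_train, y_train)

# For classifiers, score returns mean accuracy on the given data
accuracy = scaled_svc.score(X_test, y_test)
print(accuracy)
```

Putting the scaler inside the pipeline keeps test-set statistics out of the fitting step, since `fit` only ever sees the training split.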

In [11]:
# Try using the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

x = heart_disease.drop('target', axis=1)
y = heart_disease['target']

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

clf = RandomForestClassifier()
clf.fit(x_train, y_train)

clf.score(x_test, y_test)

0.8524590163934426

Note (rules of thumb):
1. If you have structured data (like tables or data frames), try ensemble methods first.
2. If you have unstructured data (images, audio, text...), try deep learning or transfer learning.