### Why Scikit-Learn?
- Built on NumPy and Matplotlib (and Python)
- Has many built-in machine learning models
- Methods to evaluate your machine learning models
- Very well-designed API

### What are we going to cover?
1. Get data ready
2. Pick a model (to suit your problem)
3. Fit a model to the data and make a prediction
4. Evaluate the model
5. Improve through experimentation
6. Save and reload your trained model

### In-depth overview of this section 

- An end-to-end Scikit-Learn workflow
- Getting data ready (to be used with machine learning models)
- Choosing a machine learning model/estimator/algorithm for our problem
- Fitting a model/estimator/algorithm to the data (learning patterns)
- Making predictions with the model (using learned patterns)
- Evaluating model predictions
- Improving model predictions
- Saving and loading models
- Putting it all together!

### Introduction to Scikit-Learn (sklearn)
This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.

### 0. An end-to-end Scikit-Learn workflow 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import pickle

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

from sklearn.preprocessing import OneHotEncoder

from sklearn.compose import ColumnTransformer

from sklearn.impute import SimpleImputer

In [2]:
heart_disease = pd.read_csv('../data/heart-disease.csv')
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


#### Check for missing values 

In [3]:
heart_disease.isna().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

#### Check for data types

In [4]:
heart_disease.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

In [5]:
# Create X (features matrix)
X = heart_disease.drop('target', axis=1)

# Create y (labels)
y = heart_disease['target']

In [6]:
# 2. Choose the right model and hyperparameters
clf = RandomForestClassifier() # Keep the default hyperparameters for now

clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
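
The dictionary above lists every hyperparameter the estimator accepts. A minimal sketch of the two usual ways to change them: passing keyword arguments at creation, or calling `set_params()` later. The values 50, 5 and 200 are arbitrary choices for illustration, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters can be passed when the estimator is created...
clf_demo = RandomForestClassifier(n_estimators=50, max_depth=5)

# ...or changed afterwards with set_params()
clf_demo.set_params(n_estimators=200)

params = clf_demo.get_params()
print(params["n_estimators"], params["max_depth"])
```

Changing a hyperparameter does not retrain anything; the estimator has to be fitted again for the new value to take effect.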

In [7]:
# 3. Fit the model to the training data and make predictions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # 80% of the data will be used for training
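
Because `train_test_split` shuffles the rows at random, the scores further down change from run to run. A minimal sketch, using synthetic data from `make_classification` as a stand-in for the heart disease CSV, of how `random_state` pins the split and `stratify` keeps the class balance similar in both parts:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the heart disease features
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# random_state fixes the shuffle; stratify keeps the label ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Repeating the call with the same random_state gives the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print((X_te == X_te2).all())
```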

In [8]:
clf.fit(X_train, y_train);

In [9]:
y_preds = clf.predict(X_test)
y_preds, y_preds.shape

(array([0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1,
        1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1,
        0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0], dtype=int64),
 (61,))

In [10]:
y_test

243    0
81     1
234    0
244    0
212    0
      ..
58     1
204    0
123    1
261    0
176    0
Name: target, Length: 61, dtype: int64

In [11]:
# 4. Evaluate the model on the training data and test data
clf.score(X_train, y_train)

1.0

In [12]:
clf.score(X_test, y_test)

0.819672131147541

In [13]:
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.90      0.76      0.83        34
           1       0.75      0.89      0.81        27

    accuracy                           0.82        61
   macro avg       0.82      0.83      0.82        61
weighted avg       0.83      0.82      0.82        61



In [14]:
print(confusion_matrix(y_test, y_preds))

[[26  8]
 [ 3 24]]
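
The numbers in the classification report can be recomputed by hand from this confusion matrix. A short sketch, assuming Scikit-Learn's convention that rows are true labels and columns are predicted labels:

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions
cm = np.array([[26, 8],
               [3, 24]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # of the samples predicted as 1, how many really are 1
recall_1 = tp / (tp + fn)      # of the samples that really are 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```

The rounded values agree with the row for class `1` and the accuracy line in the report.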


In [15]:
print(accuracy_score(y_test, y_preds)) # same as clf.score(X_test, y_test)

0.819672131147541


In [16]:
# 5. Improve (tune) the model
np.random.seed(42)
for i in range(10, 110, 10):
    print(f'Trying RandomForestClassifier with {i} estimators...')
    clf = RandomForestClassifier(n_estimators=i)
    clf.fit(X_train, y_train)
    print(f'Model accuracy on test set: {clf.score(X_test, y_test)*100:.2f}%')
    print()

Trying RandomForestClassifier with 10 estimators...
Model accuracy on test set: 78.69%

Trying RandomForestClassifier with 20 estimators...
Model accuracy on test set: 81.97%

Trying RandomForestClassifier with 30 estimators...
Model accuracy on test set: 78.69%

Trying RandomForestClassifier with 40 estimators...
Model accuracy on test set: 80.33%

Trying RandomForestClassifier with 50 estimators...
Model accuracy on test set: 81.97%

Trying RandomForestClassifier with 60 estimators...
Model accuracy on test set: 80.33%

Trying RandomForestClassifier with 70 estimators...
Model accuracy on test set: 77.05%

Trying RandomForestClassifier with 80 estimators...
Model accuracy on test set: 81.97%

Trying RandomForestClassifier with 90 estimators...
Model accuracy on test set: 81.97%

Trying RandomForestClassifier with 100 estimators...
Model accuracy on test set: 83.61%



In [17]:
# 6. Save a model and load it
with open('random_forest_model_1.pkl', 'wb') as f: # wb means write binary
    pickle.dump(clf, f)

In [18]:
with open('random_forest_model_1.pkl', 'rb') as f: # rb means read binary
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

0.8360655737704918
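
A reloaded model should make exactly the same predictions as the one that was saved. A small sketch checking this round trip on synthetic data; the file name `demo_model.pkl` and the temporary directory are assumptions made only for this example.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
model_demo = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# A temporary directory keeps the example from leaving files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name

    with open(path, "wb") as f:   # the file is closed when the block ends
        pickle.dump(model_demo, f)

    with open(path, "rb") as f:
        reloaded = pickle.load(f)

same_predictions = bool((reloaded.predict(X_demo) == model_demo.predict(X_demo)).all())
print(same_predictions)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.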

## Deep dive into steps for a Machine Learning Project 

### 1. Getting the data ready to be used with machine learning
Three main things we have to do:
1. Splitting the data into features and labels (usually `X` and `y`)
2. Filling (also called imputing) or disregarding missing values
3. Converting non-numerical values to numerical values (also called feature encoding)

In [19]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [20]:
X = heart_disease.drop('target', axis=1)
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [21]:
y = heart_disease['target']
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [23]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

### 1.1 Make sure it's all numerical 

In [24]:
car_sales = pd.read_csv('../data/car-sales-extended.csv')

#### Check for missing values 

In [25]:
car_sales.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [26]:
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [27]:
len(car_sales)

1000

In [28]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [29]:
# Split into X,y
X = car_sales.drop('Price', axis=1)
y = car_sales['Price']

# Split into training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [30]:
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: could not convert string to float: 'Toyota'

#### The error above occurs because `RandomForestRegressor` expects numerical inputs, and `Make` and `Colour` contain strings

In [31]:
car_sales['Doors'].value_counts()

4    856
5     79
3     65
Name: Doors, dtype: int64

#### `Doors` is stored as an integer, but with only three distinct values (3, 4 and 5) it can be treated as a categorical variable

In [32]:
categorical_features = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot',
                                one_hot,
                                categorical_features)],
                               remainder='passthrough')

In [33]:
transformed_X_train = transformer.fit_transform(X_train)
transformed_X_train

array([[0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 1.0000e+00, 0.0000e+00,
        1.4645e+04],
       [0.0000e+00, 0.0000e+00, 1.0000e+00, ..., 1.0000e+00, 0.0000e+00,
        9.0110e+04],
       [0.0000e+00, 1.0000e+00, 0.0000e+00, ..., 1.0000e+00, 0.0000e+00,
        9.4941e+04],
       ...,
       [0.0000e+00, 1.0000e+00, 0.0000e+00, ..., 1.0000e+00, 0.0000e+00,
        9.8523e+04],
       [0.0000e+00, 1.0000e+00, 0.0000e+00, ..., 1.0000e+00, 0.0000e+00,
        2.2490e+05],
       [0.0000e+00, 1.0000e+00, 0.0000e+00, ..., 1.0000e+00, 0.0000e+00,
        1.8738e+05]])
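
The array above is hard to read at full size. A toy sketch on four hand-typed rows (taken from `car_sales.head()`, with `Colour` left out to keep it narrow) shows what the transformer produces: one 0/1 column per category, followed by the passthrough column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda", "Toyota"],
    "Doors": [4, 5, 4, 3],
    "Odometer (KM)": [35431, 192714, 84714, 154365],
})

transformer_demo = ColumnTransformer(
    [("one_hot", OneHotEncoder(), ["Make", "Doors"])],
    remainder="passthrough",
)

encoded = transformer_demo.fit_transform(toy)
if hasattr(encoded, "toarray"):       # the result may be a sparse matrix
    encoded = encoded.toarray()
encoded = np.asarray(encoded, dtype=float)

# 3 makes + 3 door counts + 1 passthrough column
print(encoded.shape)
print(encoded[0])
```

Categories are ordered alphabetically (or numerically), so the first row reads as "not BMW, Honda, not Toyota", then "not 3 doors, 4 doors, not 5 doors", then the odometer reading.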

In [34]:
transformed_X_test = transformer.transform(X_test)
transformed_X_test

array([[1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.02773e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 7.13060e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 5.13280e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.33450e+04],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 9.97610e+04],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.05837e+05]])

#### Always use `fit_transform` on `X_train` and `transform` on `X_test`. We must remain blind to the test set at all times to avoid bias.
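
One practical consequence of this rule: the test set can contain a category the encoder never saw during `fit`, and the default encoder raises an error in that case. A minimal sketch, on made-up colours, of one possible way around it using `handle_unknown="ignore"`:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Colour": ["White", "Blue", "White", "Red"]})
test = pd.DataFrame({"Colour": ["Blue", "Green"]})  # "Green" never appears in train

# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train).toarray()  # categories learned from train only
test_encoded = encoder.transform(test).toarray()        # the same categories reused on test

print(list(encoder.categories_[0]))
print(test_encoded)
```

Whether silently zeroing unknown categories is acceptable depends on the problem; it is shown here only as an option.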

In [35]:
np.random.seed(42)
model.fit(transformed_X_train, y_train)
model.score(transformed_X_test, y_test)

0.07644657901460727

### 1.2 What if there were missing values?
1. Fill them with some value (also known as imputing).
2. Remove the samples with missing data altogether.

In [36]:
# Import car sales missing data
car_sales_missing = pd.read_csv('../data/car-sales-extended-missing-data.csv')
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [37]:
car_sales_missing.isna().sum() # Shows number of missing values per column

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [38]:
# Dropping rows with missing labels
car_sales_missing.dropna(subset=['Price'], inplace=True)

In [39]:
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [40]:
# Create X and y
X = car_sales_missing.drop('Price', axis = 1)
y = car_sales_missing['Price']

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [42]:
# Let's try and convert our data to numbers
categorical_features = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot',
                                one_hot,
                                categorical_features)],
                               remainder='passthrough')

transformed_X_train = transformer.fit_transform(X_train)
transformed_X_train

ValueError: Input contains NaN

### `OneHotEncoder` cannot be applied here because the columns being encoded contain NaN values

### Dealing with missing data 

#### Option 1: Fill missing data with Pandas

In [43]:
car_sales_missing = pd.read_csv('../data/car-sales-extended-missing-data.csv')

In [44]:
# Remove rows with missing labels
car_sales_missing.dropna(subset=['Price'],inplace=True)

In [45]:
X = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']

In [46]:
X.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
dtype: int64

In [47]:
y.isna().sum()

0

In [48]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [49]:
# Fill with one value per column and assign the result back
# (inplace=True on a column of a split DataFrame raises SettingWithCopyWarning)
fill_values = {
    'Make': 'missing',
    'Colour': 'missing',
    'Doors': 4,
    'Odometer (KM)': X_train['Odometer (KM)'].mean()  # mean of the training set only
}

X_train = X_train.fillna(fill_values)
print(X_train.isna().sum())

# Notice that X_test is filled with the X_train mean; this is similar to applying fit_transform on train followed by transform on test
X_test = X_test.fillna(fill_values)
print(X_test.isna().sum())

Make             0
Colour           0
Odometer (KM)    0
Doors            0
dtype: int64
Make             0
Colour           0
Odometer (KM)    0
Doors            0
dtype: int64


In [50]:
categorical_features = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot',
                                one_hot,
                                categorical_features)],
                               remainder='passthrough')

transformed_X_train = transformer.fit_transform(X_train)
transformed_X_test = transformer.transform(X_test)

In [51]:
model.fit(transformed_X_train, y_train)
model.score(transformed_X_test,y_test)

0.24748582992543233

#### Option 2: Fill missing data with Scikit-Learn

#### Always deal with missing values first, then apply one-hot encoding

In [69]:
car_sales_missing = pd.read_csv('../data/car-sales-extended-missing-data.csv')
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [70]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [71]:
car_sales_missing.dtypes

Make              object
Colour            object
Odometer (KM)    float64
Doors            float64
Price            float64
dtype: object

In [72]:
# Remove rows that don't have target variable (label)
car_sales_missing.dropna(subset=['Price'], inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [73]:
# Split into X and y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing['Price']

In [74]:
X.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
dtype: int64

In [75]:
y.isna().sum()

0

In [76]:
X_train, X_test, y_train, y_test = train_test_split(X,y)

In [77]:
# Fill missing values with Scikit-Learn

# Fill categorical values with 'missing' and numerical values with mean
cat_imputer = SimpleImputer(strategy='constant', fill_value='missing')
door_imputer = SimpleImputer(strategy='constant', fill_value=4)
num_imputer = SimpleImputer(strategy='mean')

# Define columns
cat_features = ['Make', 'Colour']
door_feature = ['Doors']
num_feature = ['Odometer (KM)']

# Create a transformer that applies each imputer to its columns
transformer = ColumnTransformer([
    ('cat_imputer', cat_imputer, cat_features),
    ('door_imputer', door_imputer, door_feature),
    ('num_imputer', num_imputer, num_feature),
])


# Transform the data
filled_X_train = transformer.fit_transform(X_train)
filled_X_test = transformer.transform(X_test)
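
The imputers are fitted on the training split and then reused on the test split. A minimal sketch, on made-up odometer readings, showing that `SimpleImputer` stores the training mean in `statistics_` and fills test gaps with that same value:

```python
import numpy as np
from sklearn.impute import SimpleImputer

train_km = np.array([[10000.0], [20000.0], [np.nan], [60000.0]])
test_km = np.array([[np.nan], [500000.0]])

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train_km)  # learns the mean from train only
test_filled = imputer.transform(test_km)        # reuses that mean on test

print(imputer.statistics_)   # mean learned during fit
print(test_filled.ravel())   # the test gap is filled with the train mean
```

The large test value (500000) has no influence on the fill value, which is the point of fitting on the training data only.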

In [78]:
X_train_filled = pd.DataFrame(filled_X_train, columns=['Make', 'Colour', 'Doors', 'Odometer (KM)'])
X_train_filled.head()

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,Toyota,Black,4,86333
1,Honda,missing,4,108415
2,Toyota,Blue,4,126078
3,Honda,White,4,25729
4,Nissan,White,4,131542


In [79]:
X_test_filled = pd.DataFrame(filled_X_test, columns=['Make', 'Colour', 'Doors', 'Odometer (KM)'])
X_test_filled.head()

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,Honda,Green,4,130076
1,Toyota,White,4,189194
2,Nissan,Blue,3,146430
3,BMW,White,5,232696
4,Toyota,White,4,203804


In [80]:
print(X_train_filled.isna().sum())
print(X_train_filled.dtypes)

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64
Make             object
Colour           object
Doors            object
Odometer (KM)    object
dtype: object


In [81]:
print(X_test_filled.isna().sum())
print(X_test_filled.dtypes)

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64
Make             object
Colour           object
Doors            object
Odometer (KM)    object
dtype: object
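
Every column shows up as `object` because the imputing `ColumnTransformer` returns a single array that mixes strings and numbers. A minimal sketch, on a hand-made two-row array, of turning the numeric columns back into numbers with `pd.to_numeric`:

```python
import numpy as np
import pandas as pd

# Mixed string/number array, similar in spirit to the imputed output above
filled = np.array([["Toyota", "Black", 4.0, 86333.0],
                   ["Honda", "missing", 4.0, 108415.0]], dtype=object)
df = pd.DataFrame(filled, columns=["Make", "Colour", "Doors", "Odometer (KM)"])

before = df.dtypes.copy()
print(before)   # Doors and Odometer (KM) are not stored as numbers yet

for col in ["Doors", "Odometer (KM)"]:
    df[col] = pd.to_numeric(df[col])

print(df.dtypes)
```

The conversion is optional for the one-hot encoding step that follows, but it makes later numeric operations such as `.mean()` behave as expected.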


In [82]:
# Turn the categories into numbers

categorical_features = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot',
                                one_hot,
                                categorical_features)],
                               remainder='passthrough')

transformed_X_train = transformer.fit_transform(X_train_filled)
transformed_X_test = transformer.transform(X_test_filled)

In [83]:
model.fit(transformed_X_train, y_train)
model.score(transformed_X_test, y_test)

0.26526630805531304
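
The two stages above (impute, then encode) can also be chained so they are fitted together. A hedged sketch, not the notebook's own method, using `Pipeline` inside a single `ColumnTransformer`; the toy rows and step names such as `"impute"` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows with gaps, loosely modelled on the car sales data
X_toy = pd.DataFrame({
    "Make": pd.Series(["Honda", np.nan, "Toyota", "BMW"], dtype=object),
    "Colour": pd.Series(["White", "Blue", np.nan, "Blue"], dtype=object),
    "Doors": [4.0, 5.0, np.nan, 4.0],
    "Odometer (KM)": [35431.0, 192714.0, 84714.0, np.nan],
})

# Categorical columns: fill gaps first, then one-hot encode
cat_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
door_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=4)),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
# Numeric column: fill gaps with the mean
num_steps = Pipeline([("impute", SimpleImputer(strategy="mean"))])

preprocessor = ColumnTransformer([
    ("cat", cat_steps, ["Make", "Colour"]),
    ("door", door_steps, ["Doors"]),
    ("num", num_steps, ["Odometer (KM)"]),
])

X_ready = preprocessor.fit_transform(X_toy)
if hasattr(X_ready, "toarray"):   # the result may be a sparse matrix
    X_ready = X_ready.toarray()
X_ready = np.asarray(X_ready, dtype=float)

print(X_ready.shape)
print(np.isnan(X_ready).any())
```

With real data, `fit_transform` would be called on the training split and `transform` on the test split, as in the cells above.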

In [90]:
what_were_covering = [
    '0. An end-to-end Scikit-Learn workflow',
    '1. Getting the data ready',
    '2. Choose the right estimator/algorithm for our problems',
    '3. Fit the model/algorithm and use it to make predictions on our data',
    '4. Evaluating a model',
    '5. Improve a model',
    '6. Save and load a trained model',
    '7. Putting it all together'
]

### Choosing the right estimator/algorithm for our problem

Scikit-Learn uses estimator as another term for machine learning model or algorithm.

* Classification - predicting whether a sample is one thing or another
* Regression - predicting a number


* Step 1 - Refer to the ML map below

<img src='sklearn-ml-map.png'>

### 2.1 Picking a machine learning model for a regression problem 

In [95]:
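# Import Boston housing dataset
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version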
from sklearn.datasets import load_boston
boston = load_boston()
boston;

In [100]:
boston_df = pd.DataFrame(boston['data'], columns=boston['feature_names'])
boston_df['target'] = pd.Series(boston['target'])
boston_df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


In [101]:
# How many samples?
len(boston_df)

506

In [102]:
# Missing values?
boston_df.isna().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
target     0
dtype: int64

In [103]:
# Data types
boston_df.dtypes

CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD        float64
TAX        float64
PTRATIO    float64
B          float64
LSTAT      float64
target     float64
dtype: object

#### According to the ml-map, ridge regression looks promising 

In [104]:
# Let's try the Ridge Regression model
from sklearn.linear_model import Ridge

# Setup random seed
np.random.seed(42)

# Create the data
X = boston_df.drop('target', axis=1)
y = boston_df['target']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate Ridge model
model = Ridge()
model.fit(X_train, y_train)

# Check the score of the ridge model on test data
model.score(X_test, y_test)

0.6662221670168519
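
For regressors, `score()` returns the coefficient of determination, R^2, rather than accuracy. A short sketch on synthetic data from `make_regression` (an assumption, since the Boston data may not be available) that recomputes it by hand:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

ridge = Ridge().fit(X_tr, y_tr)
y_pred = ridge.predict(X_te)

# R^2 = 1 - (sum of squared residuals) / (total sum of squares around the mean)
ss_res = np.sum((y_te - y_pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, ridge.score(X_te, y_te))
```

A value of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean of `y_te`.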

How do we improve this score?

What if Ridge wasn't working?

Refer to the Scikit-Learn machine learning map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [106]:
# Let's try the Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

# Setup random seed
np.random.seed(42)

# Create the data
X = boston_df.drop('target', axis=1)
y = boston_df['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Instantiate Random Forest Regressor
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

# Evaluate the Random Forest Regressor
rf.score(X_test, y_test)

0.8471696005277883
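
The Ridge and random forest cells above use different test sizes (0.2 and the default 0.25), so their scores are not measured on the same rows. A hedged sketch, on synthetic data, of scoring several candidate estimators on one shared split:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

candidates = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)                    # same training rows for every model
    scores[name] = estimator.score(X_te, y_te)   # same test rows for every model
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

Which estimator wins depends on the data; on this synthetic, linear problem the ranking need not match the Boston result above.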

In [107]:
pd.read_csv('../data/heart-disease.csv')

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


### 2.2 Choosing an estimator for a classification problem 

In [109]:
heart_disease = pd.read_csv('../data/heart-disease.csv')
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [110]:
len(heart_disease)

303

### According to the Scikit-Learn ML map, we should try `LinearSVC` for the heart_disease dataset

In [111]:
# Import the LinearSVC estimator class
from sklearn.svm import LinearSVC

# Setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Instantiate LinearSVC
clf = LinearSVC()

# Fit LinearSVC to train data
clf.fit(X_train, y_train)

# Evaluate LinearSVC on test data
clf.score(X_test, y_test)



0.4605263157894737

#### Clearly, `LinearSVC` is not working well with its default settings: its mean accuracy is below 50% on a binary classification problem. We'll get into hyperparameter tuning later; for now, let's move on to the next model suggested by the Scikit-Learn ML map.

In [112]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

# Setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier()

# Fit RandomForestClassifier to train data
clf.fit(X_train, y_train)

# Evaluate RandomForestClassifier on test data
clf.score(X_test, y_test)

0.8289473684210527

#### Tidbit:
1. If you have structured data (tables), ensemble methods such as random forests are a good first choice.
2. If you have unstructured data (images, audio, text), deep learning or transfer learning is usually a better fit.

### 3. Fit the model/algorithm on our data and use it to make predictions 

#### 3.1 Fitting the model to the data

Different names for:
* `X` = features, feature variables, data
* `y` = labels, targets, target variables

In [None]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

# Setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier()

# Fit RandomForestClassifier to train data (find patterns in data)
clf.fit(X_train, y_train)

# Evaluate RandomForestClassifier on test data (use patterns learned during training)
clf.score(X_test, y_test)

In [115]:
X_train.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
287,57,1,1,154,232,0,0,164,0,0.0,2,1,2
282,59,1,2,126,218,1,1,134,0,2.2,1,1,1
197,67,1,0,125,254,1,1,163,0,0.2,1,2,3
158,58,1,1,125,220,0,1,144,0,0.4,1,4,3
164,38,1,2,138,175,0,1,173,0,0.0,2,4,2


In [116]:
y_train.head()

287    0
282    0
197    0
158    1
164    1
Name: target, dtype: int64

### 3.2 Make predictions using a machine learning model 