# 3. Building a baseline model

### Objectives

* Know why it is crucial to establish a baseline model before starting machine learning
* Know how to use a dummy estimator

## Taking a step back

We jumped the gun a bit during our first attempt at machine learning. Although we did build a simple, single-feature model, we could have built an even simpler one.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.max_columns = 100
%matplotlib inline

In [2]:
heart = pd.read_csv('../data/heart.csv')
heart.head()

Unnamed: 0,age,sex,chest_pain,rest_bp,chol,fbs,rest_ecg,max_hr,exang,old_peak,slope,ca,thal,disease
0,63,Male,typical,145,233,1,left ventricular hypertrophy,150,0,2.3,3,0.0,fixed,0
1,67,Male,asymptomatic,160,286,0,left ventricular hypertrophy,108,1,1.5,2,3.0,normal,1
2,67,Male,asymptomatic,120,229,0,left ventricular hypertrophy,129,1,2.6,2,2.0,reversable,1
3,37,Male,nonanginal,130,250,0,normal,187,0,3.5,3,0.0,normal,0
4,41,Female,nontypical,130,204,0,left ventricular hypertrophy,172,0,1.4,1,0.0,normal,0


In [3]:
pd.read_csv('../data/heart_data_dictionary.csv')

Unnamed: 0,column name,description
0,age,age in years
1,sex,Male or Female
2,chest_pain,"typical, asymptomatic, nonanginal, nontypical"
3,rest_bp,resting blood pressure (in mm Hg on admission ...
4,chol,serum cholestoral in mg/dl
5,fbs,(fasting blood sugar > 120 mg/dl) (1 = true; 0...
6,rest_ecg,"resting electrocardiographic results (0, 1, 2)"
7,max_hr,maximum heart rate achieved
8,exang,exercise induced angina (1 = yes; 0 = no)
9,old_peak,ST depression induced by exercise relative to ...


## The simplest classification model - Guessing the most common class 
You can't get much simpler than guessing the most common class. Let's find out what our accuracy would have been if we guessed the most common class for each observation.

In [4]:
heart['disease'].value_counts(normalize=True).round(2)

0    0.54
1    0.46
Name: disease, dtype: float64
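
The same calculation can be sketched without pandas. The snippet below is a minimal illustration on a small, made-up list of labels (the `labels` list is hypothetical and is not the heart dataset): it finds the most common class and computes the accuracy we would get by always guessing it.

```python
from collections import Counter

# Hypothetical labels standing in for heart['disease'] (0 = no disease, 1 = disease)
labels = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0]

# Find the most common class and how many times it occurs
most_common_class, count = Counter(labels).most_common(1)[0]

# Always guessing that class is correct exactly `count` times
baseline_accuracy = count / len(labels)

print(most_common_class, baseline_accuracy)
```

On the real data, this is the same number as the largest value returned by `value_counts(normalize=True)`.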

## Establish the baseline before doing machine learning
It is crucial to establish a baseline model before performing any machine learning. Without a baseline, an accuracy score has nothing to be compared against, so it is hard to determine how effective the machine learning actually is.

## Using the simplest model as a baseline
For classification, a very simple model, such as always predicting the most common class, is usually an effective baseline. We can compare future results against it.

## Our logistic regression vs the baseline
Just by guessing the most common class, we would have gotten 54% accuracy. Our model using `max_hr` as the lone feature produced an accuracy of 67%. This gives us evidence that our machine learning is producing some value beyond just the baseline.
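
One rough way to quantify that value is to look at the gap between the two accuracies. The sketch below uses only the two figures quoted above. The variable names and the "error reduction" ratio are illustrative choices, not something computed earlier in this notebook.

```python
baseline_accuracy = 0.54  # accuracy from always guessing the most common class
model_accuracy = 0.67     # accuracy of the single-feature logistic regression

# Improvement in accuracy over the baseline
absolute_gain = model_accuracy - baseline_accuracy

# Share of the baseline's mistakes that the model avoids
error_reduction = absolute_gain / (1 - baseline_accuracy)

print(round(absolute_gain, 2), round(error_reduction, 2))
```

The model improves on the baseline by about 13 percentage points, which is easier to interpret than the raw 67% on its own.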

# Building a Dummy estimator - a baseline model in scikit-learn

Scikit-Learn gives you the ability to build a baseline model with a dummy estimator. Let's do that now.

## Step 1. Import the Estimator

In [None]:
from sklearn.dummy import DummyClassifier

## Step 2. Instantiate the Estimator
For the `DummyClassifier`, several different strategies exist. In scikit-learn 0.24 and later, the default strategy is "prior", which, like "most_frequent", always predicts the most common class. In earlier versions, the default was "stratified", which makes random predictions based on the frequency of each class. For instance, on this dataset it would randomly guess 0 about 54% of the time and 1 about 46% of the time.

We will set the strategy explicitly to "most_frequent", which will always guess 0 for the heart disease dataset.

In [None]:
dc = DummyClassifier(strategy='most_frequent')
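
To see how the strategies differ, here is a small sketch on made-up data with the same 54/46 class split as the heart dataset. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, and `random_state=0` is set only so that the "stratified" guesses are repeatable.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical data: 54 observations of class 0 and 46 of class 1
y_demo = np.array([0] * 54 + [1] * 46)
X_demo = np.zeros((len(y_demo), 1))  # placeholder feature column

predictions = {}
for strategy in ["most_frequent", "prior", "stratified"]:
    clf = DummyClassifier(strategy=strategy, random_state=0)
    clf.fit(X_demo, y_demo)
    predictions[strategy] = clf.predict(X_demo)
    # Count how many times each class (0, 1) was predicted
    print(strategy, np.bincount(predictions[strategy], minlength=2))
```

"most_frequent" and "prior" predict 0 for every row, while "stratified" produces a random mix of 0s and 1s drawn according to the class frequencies.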

## Step 3. Train the Model
The input data, X, is ignored by dummy estimators. It must still be passed to `fit` so that the dummy estimator follows the same interface as every other scikit-learn estimator, but only the target variable is considered.

In [None]:
X = heart['max_hr'].values
X = X.reshape(-1, 1)
y = heart['disease'].values

dc.fit(X, y)

### Make a Prediction
By definition, our dummy model will predict the most common class. Let's verify this with a few test cases and then with the entire dataset.

In [None]:
a = np.array([100, 150, 200]).reshape(-1, 1)
a

In [None]:
dc.predict(a)

In [None]:
dc.predict(X)

### Score model
The accuracy should exactly equal the proportion of observations that are labeled 0. Let's verify this.

In [None]:
heart['disease'].value_counts(normalize=True)

In [None]:
dc.score(X, y)
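
The same check can be sketched on a tiny made-up dataset, where the numbers are easy to follow by hand. The names `X_demo`, `y_demo`, and `dc_demo` are hypothetical. The snippet compares `score` with the accuracy computed manually and with the proportion of 0 labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical data: six observations of class 0 and four of class 1
y_demo = np.array([0] * 6 + [1] * 4)
X_demo = np.arange(10).reshape(-1, 1)

dc_demo = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)

# Accuracy reported by scikit-learn
score = dc_demo.score(X_demo, y_demo)

# Accuracy computed by hand: fraction of predictions that match the labels
manual_accuracy = (dc_demo.predict(X_demo) == y_demo).mean()

# Proportion of observations labeled 0
proportion_zero = (y_demo == 0).mean()

print(score, manual_accuracy, proportion_zero)
```

All three values agree because every prediction is 0, so the model is correct exactly when the true label is 0.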

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Verify that the prediction will be the same regardless of which predictor variable is used for the dummy classifier.</span>

### Problem 2
<span  style="color:green; font-size:16px">What do you think the `predict_proba` method will return? Make a guess and then run the command.</span>