# Building a baseline model
We made a bit of a mistake in the last notebook: we jumped right into a model. It was a simple one, but we could have started with something even simpler.

# What is the simplest model you can think of for a classification problem?

....



....










....

# Guessing the most common class 
You can't get much simpler than guessing the most common class. Let's see how we would have fared if we made just that guess.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.max_columns = 100
%matplotlib inline

In [2]:
heart = pd.read_csv('data/heart.csv')
heart.head()

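|   | slope | thal | rest_bp | chest_pain | num_major_vessels | sugar | rest_ekg | chol | oldpeak | sex | age | max_heart_rate | angina | disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | normal | 128 | 2 | 0 | 0 | 2 | 308 | 0.0 | male | 45 | 170 | 0 | 0 |
| 1 | 2 | normal | 110 | 3 | 0 | 0 | 0 | 214 | 1.6 | female | 54 | 158 | 0 | 0 |
| 2 | 1 | normal | 125 | 4 | 3 | 0 | 2 | 304 | 0.0 | male | 77 | 162 | 1 | 1 |
| 3 | 1 | reversible_defect | 152 | 4 | 0 | 0 | 0 | 223 | 0.0 | male | 40 | 181 | 0 | 1 |
| 4 | 3 | reversible_defect | 178 | 1 | 0 | 0 | 2 | 270 | 4.2 | male | 59 | 145 | 0 | 0 |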


In [None]:
pd.read_csv('data/heart_dd.csv')

### By guessing all 0's (no heart disease)

In [3]:
heart['disease'].value_counts(normalize=True)
# normalize=True returns proportions instead of counts.
# These proportions matter: the classes are close to 50/50 (about 56/44), so always
# guessing 0 is right about 56% of the time. That is not much better than telling
# every other person they have heart disease. The question for any model is:
# are we better than this near-random prediction?

# When would the objective be to land below the "baseline"?
# - when we want to avoid something
# - when the event we want to predict is a rare event

0    0.555556
1    0.444444
Name: disease, dtype: float64
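To make the link between these proportions and a baseline score explicit, here is a minimal sketch on toy labels. The series `y_toy` is hypothetical, not the heart data; the point is only that the accuracy of "always guess the most common class" equals the largest class proportion.

```python
import pandas as pd

# Toy labels (hypothetical, not the heart data): 5 negatives, 4 positives
y_toy = pd.Series([0, 0, 0, 0, 0, 1, 1, 1, 1], name="disease")

# Class proportions, as with value_counts(normalize=True) above
proportions = y_toy.value_counts(normalize=True)
majority_class = proportions.idxmax()
baseline_accuracy = proportions.max()

# Predict the majority class for every row and measure accuracy directly
y_pred_toy = pd.Series(majority_class, index=y_toy.index)
manual_accuracy = (y_toy == y_pred_toy).mean()

print(majority_class, baseline_accuracy, manual_accuracy)
```

Both numbers agree, so reading the top line of `value_counts(normalize=True)` is enough to know the baseline accuracy.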

# What if we decided to split the data by sex?

In [4]:
f = heart['sex'] == 'male'
heart.loc[f, 'disease'].mean()

0.5564516129032258

In [5]:
heart.loc[~f, 'disease'].mean()
# The mean of a 0/1 column is the proportion of 1s, here the share of women with disease.

0.19642857142857142

### Let's use a groupby for this

In [6]:
heart.groupby('sex').agg({'disease': 'mean'})

| sex | disease |
|---|---|
| female | 0.196429 |
| male | 0.556452 |


It looks like we would get a higher score if we guessed no for women and yes for men.

In [7]:
y = heart['disease']

In [8]:
y_pred = (heart['sex'] == 'male').astype(int)

# This assigns a value of 1 to males and 0 to females.

In [9]:
(y == y_pred).mean()

0.6333333333333333
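The same "guess each group's more common class" rule can be derived from the group means rather than hard-coded. The sketch below uses a small hypothetical frame `toy` (not the heart data) and assumes a 0.5 threshold on the group disease rate.

```python
import pandas as pd

# Hypothetical toy data, not the heart dataset
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "disease": [1, 1, 0, 0, 0, 1],
})

# Disease rate per group, as in the groupby above
rate = toy.groupby("sex")["disease"].mean()

# Guess 1 for a group when more than half of its members have disease
group_guess = (rate > 0.5).astype(int)

# Map each row's group to that group's guess, then score the rule
y_pred_rule = toy["sex"].map(group_guess)
rule_accuracy = (toy["disease"] == y_pred_rule).mean()

print(group_guess.to_dict(), rule_accuracy)
```

Deriving the guess from `rate` means the rule adapts automatically if a different grouping column is used.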

## Hmmm... Should we keep splitting our data into different groups and just do manual machine learning?

In [10]:
heart.pivot_table(index='sex', columns='thal', values='disease')

| sex (rows) / thal (columns) | fixed_defect | normal | reversible_defect |
|---|---|---|---|
| female | NaN | 0.104167 | 0.75 |
| male | 0.5 | 0.3 | 0.757576 |


So for a male with normal thal we should guess 0, but for a female with a reversible defect we should guess 1.

### This is fun - let's keep going

In [11]:
heart.groupby(['sex', 'thal', 'sugar', 'angina']).agg({'disease': 'mean'})

| sex | thal | sugar | angina | disease |
|---|---|---|---|---|
| female | normal | 0 | 0 | 0.027778 |
| female | normal | 0 | 1 | 0.333333 |
| female | normal | 1 | 0 | 0.333333 |
| female | reversible_defect | 0 | 0 | 0.666667 |
| female | reversible_defect | 0 | 1 | 1.0 |
| female | reversible_defect | 1 | 1 | 1.0 |
| male | fixed_defect | 0 | 0 | 0.25 |
| male | fixed_defect | 0 | 1 | 1.0 |
| male | fixed_defect | 1 | 0 | 0.5 |
| male | fixed_defect | 1 | 1 | 1.0 |


So there is a subset of men with the reversible defect, sugar equal to 1, and angina equal to 0 that has only a 25% chance of disease.

# What algorithm am I manually doing? ~ Decision Tree
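As a rough illustration of that connection, the sketch below fits scikit-learn's `DecisionTreeClassifier` with a single split on a hypothetical toy frame (not the heart data). Under these assumptions it recovers the same rule as the manual split by sex: predict each group's more common class.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, not the heart dataset
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "disease": [1, 1, 0, 0, 0, 1],
})

# One-hot encode the categorical column so the tree can split on it
X_toy = pd.get_dummies(toy[["sex"]], dtype=int)
y_toy = toy["disease"]

# max_depth=1 allows exactly one split, like a single groupby
tree = DecisionTreeClassifier(max_depth=1, random_state=0)
tree.fit(X_toy, y_toy)

# Each leaf predicts the majority class of the rows that land in it
tree_pred = tree.predict(X_toy)
print(list(tree_pred))
```

Allowing a larger `max_depth` corresponds to grouping by more columns, which is what the multi-column groupby above does by hand.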

# Your Turn 
Run a few groupby or pivot tables on different variables to manually find pockets of low and high probability of disease.

In [None]:
# your code here

# Building a Dummy estimator - a baseline model in scikit-learn

Scikit-Learn gives you the ability to build a baseline model with a dummy estimator. Let's do that now.

## Step 1. Import the Estimator

In [12]:
from sklearn.dummy import DummyClassifier

## Step 2. Instantiate the Estimator

In [13]:
dc = DummyClassifier(strategy='most_frequent')

## Step 3. Train the Model

In [14]:
X = heart['max_heart_rate'].values
X = X.reshape(-1, 1)
y = heart['disease'].values

# Note: the values in X do not matter here. DummyClassifier ignores the features;
# X is passed only because fit(X, y) requires it.

In [15]:
dc.fit(X, y)

DummyClassifier(constant=None, random_state=None, strategy='most_frequent')

# Now we can predict and score

In [16]:
dc.predict([[100]])  # predict expects a 2D array: one row, one feature

array([0])

In [17]:
dc.predict(X)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [18]:
dc.score(X, y)

0.55555555555555558
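This score is the share of the most common class again. A minimal sketch on hypothetical toy labels (not the heart data) checks that `DummyClassifier(strategy='most_frequent')` scores exactly the majority-class proportion, whatever the features contain.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels: 5 negatives, 4 positives
y_toy = np.array([0] * 5 + [1] * 4)

# The feature values are ignored by the dummy; zeros are used as a placeholder
X_toy = np.zeros((len(y_toy), 1))

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_toy, y_toy)

# Accuracy of the dummy versus the share of the most common class
dummy_score = dummy.score(X_toy, y_toy)
majority_share = np.bincount(y_toy).max() / len(y_toy)

print(dummy_score, majority_share)
```

Any real model fitted on the same data should be compared against this number rather than against zero.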

# What will `predict_proba` return?

In [21]:
dc.predict_proba(X)

array([[ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,