# Fit the model/algorithm on our data and use it to make predictions


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn

In Scikit-Learn, the process of having a machine learning model learn patterns from a dataset involves calling the fit() method and passing it data, such as, fit(X, y).

Where X is a feature array and y is a target array.

Other names for X include:

* Data
* Feature variables
* Features

Other names for y include:

* Labels
* Target variable

For supervised learning there is usually an X and y.

For unsupervised learning, there's no y (no labels).

Let's revisit the example of using patient data (X) to predict whether or not they have heart disease (y).

## Classification Problem

### 1. Get the data

In [205]:
heart_disease = pd.read_csv("resources/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


### 2. Fitting the model to the data

In [207]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier
# Import train test split
from sklearn.model_selection import train_test_split

# Setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100)

# Fit the model to the data (training the machine learning model)
clf.fit(X_train, y_train)

# Evaluate the Random Forest Classifier (use the patterns the model has learned)
clf.score(X_test, y_test)

0.819672131147541

What's happening here?

Calling the fit() method will cause the machine learning algorithm to attempt to find patterns between X and y. Or if there's no y, it'll only find the patterns within X.

Let's see X.

In [113]:
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


And y.

In [117]:
y.tail()

298    0
299    0
300    0
301    0
302    0
Name: target, dtype: int64

Passing X and y to fit() will cause the model to go through all of the examples in X (data) and see what their corresponding y (label) is.

How the model does this is different depending on the model you use.

Explaining the details of each would take an entire textbook.

For now, you could imagine it similar to how you would figure out patterns if you had enough time.

You'd look at the feature variables, X, the age, sex, chol (cholesterol) and see what different values led to the labels, y, 1 for heart disease, 0 for not heart disease.

This concept, regardless of the problem, is similar throughout all of machine learning.

#### During training (finding patterns in data):

A machine learning algorithm looks at a dataset, finds patterns, tries to use those patterns to predict something and corrects itself as best it can with the available data and labels. It stores these patterns for later use.

#### During testing or in production (using learned patterns):

A machine learning algorithm uses the patterns its previously learned in a dataset to make a prediction on some unseen data.

### 3. Make predictions using a machine learning model

Now we've got a trained model, one which has hopefully learned patterns in the data, you'll want to use it to make predictions.

Scikit-Learn enables this in several ways.

Two of the most common and useful ways to make predictions are:

* predict()
* predict_proba()

In [214]:
X_test.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
231,57,1,0,165,289,1,0,124,0,1.0,1,3,3
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3
104,50,1,2,129,196,0,1,163,0,0.0,2,0,2
292,58,0,0,170,225,1,0,146,1,2.8,1,2,1
159,56,1,1,130,221,0,0,163,0,0.0,2,0,3


### 3.1 Make predictions using predict function

In [216]:
clf.predict(X_test)

array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0,
       1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1])

Given data in the form of X, the `predict()` function returns labels in the form of y.

> <b>Note</b>: For the `predict()` function to work, it must be passed X (data) in the same format the model was trained on. For example, if a model was trained on 10 features formatted in a certain way, predictions should be made on data with 10 features formatted in a certain way. Anything different and it will return an error.

In [125]:
np.array(y_test)

array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])

It's standard practice to save these predictions to a variable named something like `y_preds` for later comparison to `y_test` or `y_true` (usually the same as `y_test`, just another name).

In [127]:
# Compare predictions to truth labels to evaluate the model
# This is a manual process
y_preds = clf.predict(X_test)
np.mean(y_preds == y_test)

0.8524590163934426

In [129]:
# The above result is same as our eariler score
clf.score(X_test, y_test)

0.8524590163934426

Another way evaluating predictions (comparing them to the truth labels) is with Scikit-Learn's `sklearn.metrics module`.

Inside, you'll find method such as `accuracy_score()`, which is the default evaluation metric for classification problems.

In [131]:
# Another way to check this
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

0.8524590163934426

### 3.2 Make predictions using predict_proba function

* Make predictions with predict_proba() - use this if someone asks you "what's the probability your model is assigning to each prediction?"

In [134]:
X_test[:5]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
179,57,1,0,150,276,0,0,112,1,0.6,1,1,1
228,59,1,3,170,288,0,0,159,0,0.2,1,0,3
111,57,1,2,150,126,1,1,173,0,0.2,2,1,3
246,56,0,0,134,409,0,0,150,1,1.9,1,2,3
60,71,0,2,110,265,1,0,130,0,0.0,2,1,2


`predict_proba()` returns the probabilities (proba is short for probability) of a classification label.

In [136]:
# predict_proba() returns probabilities of a classification label 
clf.predict_proba(X_test[:5])

array([[0.89, 0.11],
       [0.49, 0.51],
       [0.43, 0.57],
       [0.84, 0.16],
       [0.18, 0.82]])

Let's see the difference.

In [138]:
# Let's predict() on the same data...
clf.predict(X_test[:5])

array([0, 1, 1, 0, 1])

`predict_proba()` returns an array of five arrays each containing two values.

Each number is the probability of a label given a sample.

In [140]:
heart_disease["target"].value_counts()

target
1    165
0    138
Name: count, dtype: int64

## Regression Problem

### 1. Get the data

In [167]:
# Get California Housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

# Getting it into a dataframe
housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


### 2. Fitting Model to Data

In [225]:
from sklearn.ensemble import RandomForestRegressor

# Add a target variable to the dataframe from our dataset
housing_df["MedHouseVal"] = housing["target"]
housing_df.head()

# Setup random seed
np.random.seed(42)

# Create the data
X = housing_df.drop("MedHouseVal", axis=1)
y = housing_df["MedHouseVal"] # median house price in $100,000s

# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create model instance
model = RandomForestRegressor()

# Fit the model to the data
model.fit(X_train, y_train)

# Evaluate the Random Forest Classifier (use the patterns the model has learned)
model.score(X_test, y_test)

0.8066196804802649

### 3. Make predictions using predict function

`predict()` can also be used for regression models.

In [173]:
# Make predictions
y_preds = model.predict(X_test)

In [175]:
y_preds[:10]

array([0.49384  , 0.75494  , 4.9285964, 2.54029  , 2.33176  , 1.6549701,
       2.34323  , 1.66182  , 2.47489  , 4.8344779])

In [179]:
np.array(y_test[:10])

array([0.477  , 0.458  , 5.00001, 2.186  , 2.78   , 1.587  , 1.982  ,
       1.575  , 3.4    , 4.466  ])

Now we can evaluate our regression model by using `sklearn.metrics.mean_absolute_error` which returns the average error across all samples.

In [181]:
# Compare the predictions to the truth
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_preds)

0.3265721842781009

** This means our prediction is 0.3265... different from the target value 

In [185]:
housing_df["MedHouseVal"]

0        4.526
1        3.585
2        3.521
3        3.413
4        3.422
         ...  
20635    0.781
20636    0.771
20637    0.923
20638    0.847
20639    0.894
Name: MedHouseVal, Length: 20640, dtype: float64

# Reference Notes - Random Forest model deep dive
These resources will help you understand what's happening inside the Random Forest models we've been using.

* <a href="https://en.wikipedia.org/wiki/Random_forest">Random Forest Wikipedia</a>
* <a href="https://simple.wikipedia.org/wiki/Random_forest">Random Forest Wikipedia (simple version)</a>
* <a href="https://www.kdnuggets.com/2016/12/random-forests-python.html">Random Forests in Python</a>
* <a href="https://willkoehrsen.github.io/machine%20learning/tutorial/an-implementation-and-explanation-of-the-random-forest-in-python/">An Implementation and Explanation of the Random Forest in Python by Will Koehrsen</a>


# Different names for:

* X = features, features variables, data
* y = labels, targets, target variables

Now we've seen how to get a model, how to find patterns in data using the fit() function and make predictions using what its learned using the `predict()` and `predict_proba()` functions, it's time to evaluate those predictions.