# Fit the model to the data and use it to make predictions

In machine learning, model fitting refers to the process of optimizing an algorithm's parameters using a training dataset. This dataset contains input features (`X`) and corresponding desired outputs (`y`). The algorithm learns the underlying relationship between `X` and `y` by adjusting its parameters to minimize a loss function, which quantifies the difference between the model's predictions and the actual values in `y`.

Once fitted, the model can be used for prediction. By feeding in new, unseen data points containing the same features as the training data, the model generates predictions for the desired outputs based on the learned relationship. The accuracy of these predictions is often evaluated using metrics like mean squared error for regression tasks or classification accuracy for categorical prediction problems.

- [Making predictions using a machine learning model](#making-predictions-using-a-machine-learning-model)
  - [Making predictions on Classification models](#making-predictions-on-classification-models)
  - [Making predictions on Regression models](#making-predictions-on-regression-models)
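
The fit-then-predict workflow described in the introduction can be sketched on a small synthetic dataset. Everything in this block is an assumption made for illustration: the data is randomly generated, `LinearRegression` stands in for any estimator, and names such as `X_demo` and `demo_model` do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only): y depends linearly on two features plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows as unseen data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# fit() adjusts the coefficients to minimise squared error on the training set
demo_model = LinearRegression().fit(X_tr, y_tr)

# predict() applies the learned relationship to the held-out rows
demo_preds = demo_model.predict(X_te)

# Mean squared error, computed by hand and with scikit-learn
mse_manual = np.mean((y_te - demo_preds) ** 2)
mse_sklearn = mean_squared_error(y_te, demo_preds)

print(demo_model.coef_)         # should be close to [3, -2]
print(mse_manual, mse_sklearn)  # the two values should agree
```

The learned coefficients should land near the values used to generate the data, which is what "learning the relationship between `X` and `y`" means in this simple case.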


In [39]:
# Importing packages
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

In [3]:
# Importing heart disease data
heart_disease = pd.read_csv('../datasets/heart-disease.csv')
heart_disease.head()

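| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |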


In [32]:
# Importing California housing data
housing = fetch_california_housing(as_frame=True)

## Making predictions using a machine learning model

When making predictions with a fitted machine learning model, two methods are the most relevant: `predict()` and `predict_proba()`.

### `predict()`

This method takes unseen data (`X_test`) and returns one prediction per row (`y_pred`). For a classifier, the model computes a score or probability for each category from `X_test` and its learned parameters, and the category with the highest score is assigned as the predicted label, so the result is a single label for each data point in `X_test`. For a regressor, `predict()` returns a continuous value for each data point instead of a label.
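
As a rough illustration of the "highest score wins" rule, the sketch below checks that, for a random forest classifier, the labels from `predict()` match the column with the largest value in `predict_proba()`. The synthetic three-class dataset from `make_classification` and the names `demo_clf`, `X_new`, and `labels_from_proba` are assumptions introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class problem (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Train on the first 240 rows and treat the remaining 60 as unseen data
demo_clf = RandomForestClassifier(n_estimators=50, random_state=0)
demo_clf.fit(X_demo[:240], y_demo[:240])
X_new = X_demo[240:]

# One label per row
labels = demo_clf.predict(X_new)

# Pick the highest-probability column and map it back to a class label
proba = demo_clf.predict_proba(X_new)
labels_from_proba = demo_clf.classes_[np.argmax(proba, axis=1)]

print(labels[:10])
print(np.array_equal(labels, labels_from_proba))
```

Other estimators may derive their scores differently (for example from a decision function), so this equivalence should be read as a property of the model used here rather than a universal rule.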

### `predict_proba()`

This method, available on classifiers, takes unseen data (`X_test`) and returns the estimated probability of each possible category. It follows the same steps as `predict()`, but instead of selecting the highest-scoring category it returns the normalized scores themselves. How the scores are normalized depends on the estimator: a multinomial logistic regression applies a softmax function, whereas a `RandomForestClassifier` averages the class probabilities predicted by its individual trees. In both cases the probabilities for each data point sum to 1, providing a distribution of likelihood across all categories.

It outputs a 2D array in which each row corresponds to a data point in `X_test` and each column holds the probability of belonging to a particular category, so there is one column per category. In binary classification this gives two columns, one for each of the two categories. The columns follow the order of the model's `classes_` attribute.
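
The shape and normalization of the `predict_proba()` output can be checked with a short sketch. It assumes a synthetic binary dataset from `make_classification` and uses hypothetical names (`demo_clf`, `X_new`, `tree_mean`). It also compares the forest's output with the average of its individual trees' probabilities, which is how scikit-learn documents `RandomForestClassifier.predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary problem (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)

# Train on the first 150 rows and treat the remaining 50 as unseen data
demo_clf = RandomForestClassifier(n_estimators=25, random_state=1)
demo_clf.fit(X_demo[:150], y_demo[:150])
X_new = X_demo[150:]

proba = demo_clf.predict_proba(X_new)
print(proba.shape)        # (50, 2): one row per sample, one column per class
print(demo_clf.classes_)  # column order of the probabilities

# Each row is a probability distribution over the classes
row_sums = proba.sum(axis=1)

# Average the per-tree class probabilities and compare with the forest output
tree_mean = np.mean(
    [tree.predict_proba(X_new) for tree in demo_clf.estimators_], axis=0
)
print(np.allclose(row_sums, 1.0), np.allclose(proba, tree_mean))
```

The probability of the positive class is in the second column, `proba[:, 1]`, which is the column to use when applying a custom decision threshold.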


### Making predictions on Classification models


#### Fitting the model to the data


In [None]:
# Setting up random seed
np.random.seed(42)

# Making the data
X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiating classifier
clf = RandomForestClassifier()

# Fitting the model to the data (training machine learning model)
clf.fit(X_train, y_train)

# Evaluating model on the test set (score() returns mean accuracy for classifiers)
clf.score(X_test, y_test)

0.8524590163934426

#### `predict()`


In [6]:
# Making predictions with predict()
clf.predict(X_test)

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])

In [8]:
# Compare predictions to truth labels to evaluate the model
y_preds = clf.predict(X_test)
np.mean(y_preds == y_test)

0.8524590163934426

In [9]:
# Computing the same score with accuracy_score(y_true, y_pred)
accuracy_score(y_test, y_preds)

0.8524590163934426

#### `predict_proba()`


In [None]:
# Make predictions with predict_proba()
clf.predict_proba(X_test)

array([[0.89, 0.11],
       [0.49, 0.51],
       [0.43, 0.57],
       [0.84, 0.16],
       [0.18, 0.82],
       [0.14, 0.86],
       [0.36, 0.64],
       [0.95, 0.05],
       [0.99, 0.01],
       [0.47, 0.53],
       [0.26, 0.74],
       [0.7 , 0.3 ],
       [0.11, 0.89],
       [0.95, 0.05],
       [0.03, 0.97],
       [0.02, 0.98],
       [0.01, 0.99],
       [0.84, 0.16],
       [0.95, 0.05],
       [0.98, 0.02],
       [0.51, 0.49],
       [0.89, 0.11],
       [0.38, 0.62],
       [0.29, 0.71],
       [0.26, 0.74],
       [0.34, 0.66],
       [0.2 , 0.8 ],
       [0.22, 0.78],
       [0.83, 0.17],
       [0.15, 0.85],
       [0.94, 0.06],
       [0.92, 0.08],
       [0.96, 0.04],
       [0.62, 0.38],
       [0.46, 0.54],
       [0.89, 0.11],
       [0.44, 0.56],
       [0.16, 0.84],
       [0.33, 0.67],
       [0.08, 0.92],
       [0.13, 0.87],
       [0.17, 0.83],
       [0.18, 0.82],
       [0.38, 0.62],
       [0.32, 0.68],
       [0.77, 0.23],
       [0.39, 0.61],
       [0.  ,

### Making predictions on Regression models


#### Fitting the model to the data


In [34]:
# Setting the seed
np.random.seed(42)

# Creating the data
X = housing.data
y = housing.target

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiating regressor
model = RandomForestRegressor()

# Fitting model to the data (training model)
model.fit(X_train, y_train)

# Evaluating model on the test set (score() returns R^2 for regressors)
model.score(X_test, y_test)

0.8059809073051385
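
For scikit-learn regressors, `score()` returns the coefficient of determination (R^2) rather than an accuracy. The sketch below, which assumes a synthetic dataset from `make_regression` and hypothetical names such as `demo_reg`, compares `score()` with `r2_score` and with the formula written out by hand.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
demo_preds = demo_reg.predict(X_te)

# Three ways to obtain the same number
score_value = demo_reg.score(X_te, y_te)
r2_value = r2_score(y_te, demo_preds)

ss_res = np.sum((y_te - demo_preds) ** 2)  # residual sum of squares
ss_tot = np.sum((y_te - y_te.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(score_value, r2_value, r2_manual)
```

A value of 1.0 would mean perfect predictions, while 0.0 corresponds to a model that always predicts the mean of the test targets.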

#### `predict()`


In [38]:
# Making predictions using the model
y_preds = model.predict(X_test)
y_preds

array([0.49058  , 0.75989  , 4.9350165, ..., 4.8539888, 0.71491  ,
       1.66568  ])

In [40]:
# Calculating the mean absolute error of the model's predictions
mean_absolute_error(y_test, y_preds)

0.3270458119670544
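
Mean absolute error is the average of the absolute differences between the predictions and the true values. A minimal sketch with made-up numbers (the arrays `y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the housing data) shows the calculation by hand next to `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical targets and predictions, chosen only for illustration
y_true_demo = np.array([3.0, 0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

# Absolute errors are 0.5, 0.5, 0.0 and 1.0, so their mean is 0.5
mae_manual = np.mean(np.abs(y_true_demo - y_pred_demo))
mae_sklearn = mean_absolute_error(y_true_demo, y_pred_demo)

print(mae_manual, mae_sklearn)
```

Because MAE is expressed in the same units as the target, the value above for the housing model can be read directly as the typical size of a prediction error in the units of `housing.target`.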