## Scikit-Learn Exercise (Practice)

In [1]:
# standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### End-to-end Scikit-Learn classification workflow
1. Get dataset ready
2. Prepare a machine learning model to make predictions
3. Fit the model to the data and make a prediction
4. Evaluate the model's predictions
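
The four steps above can be sketched end to end on synthetic data. This is only a minimal sketch, not this notebook's actual workflow: `make_classification` stands in for the heart disease dataset, the `demo_` names are made up for illustration, and `random_state=42` is an arbitrary choice that makes the run repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Get a dataset ready (synthetic stand-in, NOT the heart disease data)
demo_X, demo_y = make_classification(n_samples=200, n_features=13, random_state=42)
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, random_state=42
)

# 2. Prepare a machine learning model
demo_clf = RandomForestClassifier(random_state=42)

# 3. Fit the model to the training data and make predictions on the test data
demo_clf.fit(demo_X_train, demo_y_train)
demo_preds = demo_clf.predict(demo_X_test)

# 4. Evaluate: for classifiers, score() returns the mean accuracy
demo_accuracy = demo_clf.score(demo_X_test, demo_y_test)
print(f"Accuracy on the held-out test set: {demo_accuracy:.2f}")
```

The rest of this notebook walks through the same steps one at a time on the real data.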

#### 1. Getting a dataset ready

In [3]:
# Import heart_disease dataset and save it to a variable
heart_disease = pd.read_csv('data/heart-disease.csv')

# View first 5 rows
heart_disease.head()

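| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |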


We want to build a machine learning model on all columns except `target` to predict `target`.
* **target variable** (also called y or labels) = `target` column
* **independent variables** (also called X or features) = all other columns

Since our target is going to be heart disease or not, we know this is a **classification** problem.
* **Classification** --> the target is one of a fixed set of categories (heart disease or not)
* **Regression** --> the target is a continuous numeric value (for example, a price)
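
One rough way to tell which kind of problem you have is to look at the distinct values in the target column. The sketch below uses a small made-up DataFrame (`toy_df` and its values are hypothetical, not the heart disease data). A handful of repeated labels suggests classification, while many distinct numeric values suggests regression. Treat this as a heuristic rather than a rule.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy_df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "target": [1, 1, 0, 1],              # two repeated labels -> classification
    "price": [10.5, 22.0, 13.75, 8.2],   # many distinct numbers -> regression
})

# Count the distinct values in each candidate target column
n_target_values = toy_df["target"].nunique()
n_price_values = toy_df["price"].nunique()

# value_counts() shows how often each label occurs
print(toy_df["target"].value_counts())
print(f"target: {n_target_values} distinct values, price: {n_price_values} distinct values")
```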

Knowing this, let's create `X` and `y` by splitting up our dataframe.

In [5]:
# Create X (all columns except target)
X = heart_disease.drop('target', axis=1) # (label name, corresponding axis)

# Create y (only the target column)
y = heart_disease['target']

Use Scikit-Learn to split the data into training and test sets.

In [11]:
# Import train_test_split from sklearn's model_selection module
from sklearn.model_selection import train_test_split

# Use train_test_split to split X & y into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y) # default test size is 0.25

In [9]:
# View the different shapes of the training and test datasets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((227, 13), (76, 13), (227,), (76,))

>Note: When neither `test_size` nor `train_size` is given, `train_test_split()` uses a test size of 0.25, meaning the test set is 25% of the data and the training set is the remaining 75%.
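
To see where the shapes above come from, here is a minimal sketch that passes `test_size` explicitly. The placeholder arrays of zeros only mimic the shape of the data (303 rows, 13 feature columns), and the `_demo` names and `random_state=42` are choices made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the features and labels (not real data)
X_placeholder = np.zeros((303, 13))
y_placeholder = np.zeros(303)

# test_size=0.25 matches the default; random_state makes the shuffle repeatable
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_placeholder, y_placeholder, test_size=0.25, random_state=42
)

split_shapes = (
    X_train_demo.shape, X_test_demo.shape, y_train_demo.shape, y_test_demo.shape
)
print(split_shapes)  # ((227, 13), (76, 13), (227,), (76,))
```

The test set size is rounded up (25% of 303 is 75.75, so 76 rows), and the training set gets the remaining 227 rows.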

Now we can build a machine learning model to find patterns in the training data and then make predictions on the test data.

To figure out which machine learning model to use, you can refer to [Scikit-Learn's machine learning map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).

Let's say you decide to use the [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)...

#### 2. Preparing a machine learning model

In [12]:
# Import the RandomForestClassifier from sklearn's ensemble module
from sklearn.ensemble import RandomForestClassifier

# Create an instance of RandomForestClassifier and save it as clf
clf = RandomForestClassifier()

Now that we have a `RandomForestClassifier` instance, we can fit it to the training data and make predictions on our test data.

What is an instance?

In object-oriented programming (OOP), an instance is a concrete object created from a class. The class defines what all of its objects have in common, and each instance carries its own specific values. Creating an instance is called instantiation.

The word is also used more broadly: each run of a program is an instance of that program. In languages that create objects from classes, such as Python, every object is an instance of some class, with concrete values in place of the attributes the class describes. In a non-programming context, you could think of "dog" as a class and your particular dog as an instance of that class.
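
The dog analogy can be written out in Python. This is a small illustrative sketch: the `Dog` class and the names `rex` and `bella` are invented for this example and have nothing to do with Scikit-Learn.

```python
class Dog:
    """A class: the blueprint that every dog shares."""

    def __init__(self, name, age):
        # Each instance stores its own values for these attributes
        self.name = name
        self.age = age

    def describe(self):
        return f"{self.name} is {self.age} years old"


# Instantiation: calling the class creates a new instance
rex = Dog("Rex", 3)
bella = Dog("Bella", 5)

print(rex.describe())        # Rex is 3 years old
print(isinstance(rex, Dog))  # True
print(rex is bella)          # False: two separate instances of the same class
```

`clf = RandomForestClassifier()` follows the same pattern: `RandomForestClassifier` is the class and `clf` is one instance of it.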

#### 3. Fitting a model and making predictions

In [13]:
# Fit the RandomForestClassifier to the training data
clf.fit(X_train, y_train)

RandomForestClassifier()

In [14]:
# Use the fitted model to make predictions on the test data
# save the predictions to a variable called y_preds
y_preds = clf.predict(X_test)