In [62]:
# Standard imports


## 0. An end-to-end Scikit-Learn workflow

Before we get in-depth, let's quickly check out what an end-to-end Scikit-Learn workflow might look like.

Once we've seen an end-to-end workflow, we'll dive into each step a little deeper.

**Note:** Since Scikit-Learn is such a vast library, capable of tackling many problems, the workflow we're using is only one example of how you can use it.

### Random Forest Classifier Workflow for Classifying Heart Disease

#### 1. Get the data ready

As an example dataset, we'll import `heart-disease.csv`. This file contains anonymised patient medical records and whether or not each patient has heart disease.

Here, each row is a different patient and all columns except `target` are different patient characteristics. `target` indicates whether the patient has heart disease (`target` = 1) or not (`target` = 0).

In [4]:
# Create X (all the feature columns)


# Create y (the target column)


In [5]:
# Split the data into training and test sets
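
The two steps above might look like the following minimal sketch. Since `heart-disease.csv` isn't bundled here, it assumes a small made-up DataFrame (random values, hypothetical column subset) with a `target` column standing in for the real file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for heart-disease.csv: 100 patients, 3 features, binary target
rng = np.random.default_rng(42)
heart_disease = pd.DataFrame({
    "age": rng.integers(30, 80, size=100),
    "chol": rng.integers(150, 350, size=100),
    "thalach": rng.integers(90, 200, size=100),
    "target": rng.integers(0, 2, size=100),
})

# Create X (all the feature columns)
X = heart_disease.drop("target", axis=1)

# Create y (the target column)
y = heart_disease["target"]

# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

With the real data you would replace the made-up DataFrame with `pd.read_csv()` pointed at your copy of the file.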






#### 2. Choose the model and hyperparameters
The model is often referred to as `model`, `clf` (short for classifier) or estimator (as in the Scikit-Learn documentation).

Hyperparameters are like knobs on an oven you can tune to cook your favourite dish.

In [6]:
# We'll use a Random Forest


In [7]:
# We'll leave the hyperparameters as default to begin with...
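
As a rough sketch of this step, the snippet below instantiates a Random Forest with its default hyperparameters and inspects them with `get_params()`. The variable name `clf` is only a convention.

```python
from sklearn.ensemble import RandomForestClassifier

# We'll use a Random Forest, leaving the hyperparameters as default
clf = RandomForestClassifier()

# get_params() returns the current hyperparameter settings as a dictionary
params = clf.get_params()
for name in ["n_estimators", "max_depth", "min_samples_split"]:
    print(name, "=", params[name])
```

These are the "knobs" you can later pass as keyword arguments, for example `RandomForestClassifier(n_estimators=50)`.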


#### 3. Fit the model to the data and use it to make a prediction
Fitting the model on the data involves passing it the data and asking it to figure out the patterns. 

If there are labels (supervised learning), the model tries to work out the relationship between the data and the labels. 

If there are no labels (unsupervised learning), the model tries to find patterns and group similar samples together.

#### Use the model to make a prediction

The whole point of training a machine learning model is to use it to make some kind of prediction in the future.

Once your model instance is trained, you can use the `predict()` method to predict a target value given a set of features. In other words, use the model, along with some unlabelled data, to predict the label.

**Note:** The data you predict on has to be in the same shape as the data you trained on (the same number of feature columns, in a 2D array).

In [8]:
# This doesn't work... incorrect shapes


In [9]:
# In order to predict a label, data has to be in the same shape as X_train


In [10]:
# Use the model to make a prediction on the test data (further evaluation)
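
The three cells above could be sketched as follows. This assumes synthetic data from `make_classification` in place of the heart disease data, so the values themselves are meaningless; the point is the shapes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data: 200 samples, 5 features
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# This doesn't work... a 1D array doesn't match the 2D (samples, features) shape
try:
    clf.predict(np.array([0, 2, 3, 4]))
    wrong_shape_failed = False
except ValueError as error:
    wrong_shape_failed = True
    print(f"Prediction failed with {type(error).__name__}")

# Data shaped like X_train (same number of columns) works
y_preds = clf.predict(X_test)
print(y_preds[:10])
```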


#### 4. Evaluate the model

Now we've made some predictions, we can start to use some more Scikit-Learn methods to figure out how good our model is. 

Each model or estimator has a built-in `score()` method. This method measures how well the model was able to learn the patterns between the features and labels. For a classifier like ours, it returns the mean accuracy on the data you pass it.

In [11]:
# Evaluate the model on the training set


In [12]:
# Evaluate the model on the test set
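
A minimal sketch of both evaluations, again assuming synthetic data rather than the heart disease file:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Evaluate the model on the training set (data it has already seen)
train_score = clf.score(X_train, y_train)

# Evaluate the model on the test set (data it hasn't seen)
test_score = clf.score(X_test, y_test)

print(f"Train accuracy: {train_score:.2f}, test accuracy: {test_score:.2f}")
```

The test score is usually the more honest of the two, since the model never saw those samples during training.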


There are also a number of other evaluation methods we can use for our models.

#### 5. Experiment to improve

The first model you build is often referred to as a baseline.

Once you've got a baseline model, like we have here, it's important to remember that this is often not the final model you'll use.

The next step in the workflow is to try and improve upon your baseline model.

There are two ways to look at this: from a model perspective and from a data perspective.

From a model perspective, this may involve things such as using a more complex model or tuning your model's hyperparameters.

From a data perspective, this may involve collecting more data or better quality data so your existing model has more of a chance to learn the patterns within.

If you're already working on an existing dataset, it's often easier to try a series of model perspective experiments first and then turn to data perspective experiments if you aren't getting the results you're looking for.

One thing you should be aware of is that if you're tuning a model's hyperparameters in a series of experiments, your results should always be cross-validated. Cross-validation trains and evaluates the model on multiple different training and test splits, which helps make sure the results you're getting are consistent rather than just luck because of how the original training and test sets were created.

* Try different hyperparameters
* All different parameters should be cross-validated 
    * **Note:** Beware of cross-validation for time series problems 
    
Different models you use will have different hyperparameters you can tune. For the case of our model, the `RandomForestClassifier()`, we'll start trying different values for `n_estimators`.

In [13]:
# Try different numbers of estimators (trees)... (no cross-validation)
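
One way this experiment might look is sketched below. It assumes synthetic data and an arbitrary choice of `n_estimators` values. For comparison it reports both a single train/test score (no cross-validation) and a 5-fold cross-validated score from `cross_val_score`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for n in [10, 50, 100]:
    clf = RandomForestClassifier(n_estimators=n, random_state=42)
    # Single train/test split (no cross-validation)
    test_accuracy = clf.fit(X_train, y_train).score(X_test, y_test)
    # 5-fold cross-validation: mean accuracy over 5 different splits
    cv_accuracy = cross_val_score(clf, X, y, cv=5).mean()
    results[n] = (test_accuracy, cv_accuracy)
    print(f"{n} trees: test accuracy {test_accuracy:.2f}, "
          f"cross-validated accuracy {cv_accuracy:.2f}")
```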


#### 6. Save a model for someone else to use

When you've done a few experiments and you're happy with how your model is doing, you'll likely want someone else to be able to use it.

This may come in the form of a teammate or colleague trying to replicate and validate your results or through a customer using your model as part of a service or application you offer.

Saving a model also allows you to reuse it later without having to retrain it, which is especially helpful when your training times start to increase.

You can save a Scikit-Learn model using Python's built-in `pickle` module.

In [14]:


# Save an existing model to file


In [15]:
# Load a saved model and make a prediction
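
A minimal sketch of saving and loading with `pickle`. The file name is hypothetical and the file is written to a temporary directory so the example cleans up easily; the model is trained on synthetic data.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X, y)

# Hypothetical file name, written to a temporary directory for this sketch
model_path = os.path.join(tempfile.mkdtemp(), "random_forest_model_1.pkl")

# Save an existing model to file
with open(model_path, "wb") as f:
    pickle.dump(clf, f)

# Load a saved model and make a prediction
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

same_predictions = np.array_equal(loaded_model.predict(X), clf.predict(X))
print(same_predictions)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.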


## 1. Getting the data ready
<img src = "1.png"/>

Data doesn't always come ready to use with a Scikit-Learn machine learning model.

Three of the main steps you'll often have to take are:
* Splitting the data into features (usually `X`) and labels (usually `y`)
* Filling (also called imputing) or disregarding missing values
* Converting non-numerical values to numerical values (also called feature encoding)

Let's see an example.

In [16]:
# Splitting the data into X & y


In [17]:
# Splitting the data into training and test sets


In [18]:
# 80% of data is being used for the training set


### 1.1 Make sure it's all numerical
We want to turn the `"Make"` and `"Colour"` columns into numbers.


In [19]:
# Import car-sales-extended.csv


In [20]:
# Split into X & y and train/test


Now let's try and build a model on our `car_sales` data.

In [21]:
# Try to predict with random forest on price column (doesn't work)


Oops... this doesn't work, we'll have to convert it to numbers first.

In [22]:
# Turn the categories (Make and Colour) into numbers


<img src = "2.png"/>

In [None]:
transformed_X[0]

In [None]:
X.iloc[0]

In [23]:
# Another way... using pandas and pd.get_dummies()
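
Both encoding approaches could be sketched as below. The DataFrame is a made-up miniature of the car sales features (the values are illustrative, not taken from `car-sales-extended.csv`), and treating `Doors` as a category is an assumption of this sketch.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the car sales features
X = pd.DataFrame({
    "Make": ["Toyota", "Honda", "BMW", "Toyota", "Honda", "Nissan"],
    "Colour": ["White", "Red", "Blue", "White", "Black", "Red"],
    "Odometer (KM)": [35431, 192714, 84714, 154365, 181577, 42652],
    "Doors": [4, 4, 5, 4, 4, 3],
})

# Option 1: Scikit-Learn. One-hot encode the categorical columns,
# pass the remaining columns through untouched.
categorical_features = ["Make", "Colour", "Doors"]
transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(), categorical_features)],
    remainder="passthrough",
)
transformed_X = transformer.fit_transform(X)
print(transformed_X.shape)

# Option 2: pandas. get_dummies() creates one indicator column per category
dummies = pd.get_dummies(X, columns=["Make", "Colour"])
print(dummies.columns.tolist())
```

The `ColumnTransformer` route has the advantage that the fitted transformer can be reused on new data with `transform()`.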


In [24]:
# Let's refit the model


### 1.2 What if there were missing values?

Many machine learning models don't work well when there are missing values in the data.

There are two main options when dealing with missing values.

1. Fill them with some given value. For example, you might fill missing values of a numerical column with the mean of all the other values. The practice of filling missing values is often referred to as imputation.
2. Remove them. If a row has missing values, you may opt to remove it from your sample completely. However, this potentially results in using less data to build your model.

**Note:** How you deal with missing values varies from problem to problem, and there's often no single best way to do it.

In [25]:
# Import car sales dataframe with missing values


In [26]:
# Let's convert the categorical columns to one hot encoded (code copied from above)
# Turn the categories (Make and Colour) into numbers


Ahh... this doesn't work. We'll have to either fill or remove the missing values.

Let's see what values are missing again.

### 1.2.1 Fill missing data with pandas

What we'll do is fill the rows where categorical values are missing with `"missing"`, fill the missing numerical values with the column mean (or with 4 for `Doors`) and drop the rows where `Price` is missing.

We could fill Price with the mean, however, since it's the target variable, we don't want to be introducing too many fake labels.

**Note:** The practice of filling missing data is called **imputation**. And it's important to remember there's no perfect way to fill missing data. The methods we're using are only one of many. The techniques you use will depend heavily on your dataset. A good place to look would be searching for "data imputation techniques".

In [27]:
# Fill the "Make" column


In [28]:
# Fill the "Colour" column


In [29]:
# Fill the "Odometer (KM)" column


In [30]:
# Fill the "Doors" column


In [31]:
# Check our dataframe


In [32]:
# Remove rows with missing Price labels
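
The filling and dropping steps above might look like this sketch. It assumes a tiny made-up DataFrame with the same column names as the car sales data, with `np.nan` marking the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the car sales data with missing values
car_sales_missing = pd.DataFrame({
    "Make": ["Toyota", np.nan, "BMW", "Honda", "Nissan"],
    "Colour": ["White", "Red", np.nan, "Blue", "Black"],
    "Odometer (KM)": [10000.0, 20000.0, np.nan, 30000.0, 40000.0],
    "Doors": [4.0, np.nan, 5.0, 4.0, 3.0],
    "Price": [15000.0, 9000.0, 22000.0, np.nan, 7000.0],
})

# Fill the categorical columns with the string "missing"
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column with its mean
odometer_mean = car_sales_missing["Odometer (KM)"].mean()
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(odometer_mean)

# Fill the "Doors" column with 4
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)

# Remove rows with missing Price labels
car_sales_missing = car_sales_missing.dropna(subset=["Price"])

print(car_sales_missing.isna().sum())
```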


We've removed the rows with missing Price values. Now there's less data, but there are no more missing values.

In [33]:
# Now let's one-hot encode the categorical columns (copied from above)


### 1.2.2 Filling missing data and transforming categorical data with Scikit-Learn

Now we've filled the missing columns using pandas functions, you might be thinking, "Why pandas? I thought this was a Scikit-Learn introduction?".

Not to worry, Scikit-Learn provides a class called [`SimpleImputer()`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) which allows us to do a similar thing.

`SimpleImputer()` transforms data by filling missing values with a given strategy.

And we can use it to fill the missing values in our DataFrame as above.

At the moment, our DataFrame has no missing values.

Let's reimport it so it has missing values and we can fill them with Scikit-Learn.

In [35]:
# Reimport the DataFrame


In [36]:
# Drop the rows with missing in the "Price" column


In [37]:
# Split into X and y


# Split data into train and test


**Note:** We split the data into training and test sets first so we can fill the missing values in each set separately.

In [38]:
# Fill categorical values with 'missing' & numerical with mean


In [39]:
# Define different column features


**Note:** We use `fit_transform()` on the training data and `transform()` on the testing data. In essence, we learn the patterns in the training set and transform it via imputation (fit, then transform). Then we take those same patterns and fill the test set (transform only).
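
The fit-on-train, transform-on-test pattern can be sketched as follows. The data is a made-up miniature of the car sales features, the ordered `iloc` split is only for illustration, and the imputer names (`cat_imputer` and so on) are arbitrary labels.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical miniature of the car sales features, with gaps (np.nan)
X = pd.DataFrame({
    "Make": ["Toyota", np.nan, "BMW", "Honda", "Nissan", np.nan],
    "Colour": ["White", "Red", np.nan, "Blue", "Black", "Red"],
    "Doors": [4, 4, np.nan, 4, 3, np.nan],
    "Odometer (KM)": [10000.0, 20000.0, 30000.0, np.nan, 50000.0, np.nan],
})
X_train, X_test = X.iloc[:4], X.iloc[4:]  # simple ordered split, for illustration only

# Define different column features
cat_features = ["Make", "Colour"]
door_feature = ["Doors"]
num_features = ["Odometer (KM)"]

# Fill categorical values with "missing", doors with 4 and numerical with the mean
imputer = ColumnTransformer([
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="missing"), cat_features),
    ("door_imputer", SimpleImputer(strategy="constant", fill_value=4), door_feature),
    ("num_imputer", SimpleImputer(strategy="mean"), num_features),
])

# Fill train and test values separately: learn from train, reuse on test
filled_X_train = imputer.fit_transform(X_train)
filled_X_test = imputer.transform(X_test)

# Get the transformed arrays back into DataFrames
columns = cat_features + door_feature + num_features
X_train_filled = pd.DataFrame(filled_X_train, columns=columns)
X_test_filled = pd.DataFrame(filled_X_test, columns=columns)
print(X_test_filled)
```

Note that the missing odometer reading in the test set is filled with the mean of the training set, not the mean of the test set.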

In [40]:


# Fill train and test values separately


# Check filled X_train


In [41]:
# Get our transformed data arrays back into DataFrames


# Check missing data in training set


In [42]:
# Check to see the original... still missing values


In [43]:
# Now let's one hot encode the features with the same code as before 

# Fill train and test values separately

# Check transformed and filled X_train


In [44]:
# Now we've transformed X, let's see if we can fit a model


# Make sure to use transformed (filled and one-hot encoded X data)


If this looks confusing, don't worry, we've covered a lot of ground very quickly. And we'll revisit these strategies in a future section in a way which makes a lot more sense.

For now, the key takeaways to remember are:
* Most datasets you come across won't be in a form ready to immediately start using them with machine learning models. And some may take more preparation than others to get ready to use.
* For most machine learning models, your data has to be numerical. This will involve converting whatever you're working with into numbers. This process is often referred to as **feature engineering** or **feature encoding**.
* Some machine learning models aren't compatible with missing data. The process of filling missing data is referred to as **data imputation**.

## 2. Choosing the right estimator/algorithm for your problem

Once you've got your data ready, the next step is to choose an appropriate machine learning algorithm or model to find patterns in your data.

Some things to note:
* Sklearn refers to machine learning models and algorithms as estimators.
* Classification problem - predicting a category (heart disease or not).
    * Sometimes you'll see `clf` (short for classifier) used as a classification estimator instance's variable name.
* Regression problem - predicting a number (selling price of a car).
* Unsupervised problem - clustering (grouping unlabelled samples with other similar unlabelled samples).

If you know what kind of problem you're working with, one of the next places you should look at is the [Scikit-Learn algorithm cheatsheet](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).

This cheatsheet gives you a bit of an insight into the algorithm you might want to use for the problem you're working on.

It's important to remember, you don't have to explicitly know what each algorithm is doing on the inside to start using them. If you do start to apply different algorithms but they don't seem to be working, that's when you'd start to look deeper into each one.

Let's check out the cheatsheet and follow it for some of the problems we're working on.

<img src="../images/sklearn-ml-map.png" width=700/>

You can see it's split into four main categories: regression, classification, clustering and dimensionality reduction. Each has its own purpose, but the Scikit-Learn team has designed the library so the workflows for each are relatively similar.

Let's start with a regression problem. We'll use the [Boston housing dataset](https://scikit-learn.org/1.0/modules/generated/sklearn.datasets.load_boston.html) built into Scikit-Learn's `datasets` module. Note that `load_boston` was removed in Scikit-Learn 1.2, so following along with this dataset requires an older version.

### 2.1 Picking a machine learning model for a regression problem

In [45]:
# Import the Boston housing dataset



Since it's in a dictionary, let's turn it into a DataFrame so we can inspect it better.

In [46]:
# How many samples?


Beautiful. Our goal here is to use the feature columns, such as `CRIM` (the per capita crime rate by town) and `AGE` (the proportion of owner-occupied units built prior to 1940), to predict the `target` column, which is the median house price.

In essence, each row is a different town in Boston (the data) and we're trying to build a model to predict the median house price (the label) of a town given a series of attributes about the town.

Since we have data and labels, this is a supervised learning problem. And since we're trying to predict a number, it's a regression problem.

Knowing these two things, how do they line up on the Scikit-Learn machine learning algorithm cheat-sheet?

<img src="../images/sklearn-ml-map-cheatsheet-boston-housing-ridge.png" width=700/>

Following the map through, knowing what we know, it suggests we try [ridge regression](https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression), which Scikit-Learn implements as the `Ridge` class. Let's check it out.

In [47]:
# Import the Ridge model class from the linear_model module


# Setup random seed


# Create the data


# Split into train and test sets


# Instantiate and fit the model (on the training set)


# Check the score of the model (on the test set)
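
The steps in this cell could be sketched as below. Because `load_boston` is not available in recent Scikit-Learn versions, this assumes synthetic regression data from `make_regression` (13 features, like the Boston data) as a stand-in, so the score will not match the one from the original dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Setup random seed
np.random.seed(42)

# Create the data (synthetic stand-in: 500 samples, 13 numerical features)
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate and fit the model (on the training set)
model = Ridge()
model.fit(X_train, y_train)

# Check the score of the model (on the test set)
ridge_score = model.score(X_test, y_test)
print(f"R^2 on the test set: {ridge_score:.3f}")
```

For regression estimators, `score()` returns the coefficient of determination (R^2), where 1.0 is a perfect fit.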


What if `Ridge` didn't work? Or what if we wanted to improve our results?

<img src="../images/sklearn-ml-map-cheatsheet-boston-housing-ensemble.png" width=700/>

Following the diagram, the next step would be to try [ensemble regressors](https://scikit-learn.org/stable/modules/ensemble.html). An ensemble is multiple models put together to make a decision.

One of the most common and useful ensemble methods is the [Random Forest](https://scikit-learn.org/stable/modules/ensemble.html#forest), known for its fast training and prediction times and adaptability to different problems.

The basic premise of the Random Forest is to combine a number of different decision trees, each built with some randomness so it differs from the others, and make a prediction on a sample by averaging the results of the decision trees.

An in-depth discussion of the Random Forest algorithm is beyond the scope of this notebook but if you're interested in learning more, [An Implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76) by Will Koehrsen is a great read.

Since we're working with regression, we'll use Scikit-Learn's [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).

We can use the exact same workflow as above. Except for changing the model.

In [48]:
# Import the RandomForestRegressor model class from the ensemble module


# Setup random seed


# Create the data


# Split into train and test sets


# Instantiate and fit the model (on the training set)


# Check the score of the model (on the test set)
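
To illustrate how little changes when you swap models, here is a sketch that runs the same workflow for `Ridge` and `RandomForestRegressor`. It assumes the synthetic, non-linear `make_friedman1` dataset rather than the Boston data, so the scores (and which model comes out ahead) may differ from the results discussed in this notebook.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic non-linear regression problem (a stand-in, not the Boston data)
X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
}

# Same fit/score workflow for every model, only the estimator changes
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: R^2 = {scores[name]:.3f}")
```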


Woah, we get a boost in score on the test set of almost 0.2 with a change of model.

At first, the diagram can seem confusing. But once you get a little practice applying different models to different problems, you'll start to pick up which sorts of algorithms do better with different types of data.

### 2.2 Picking a machine learning model for a classification problem
Now, let's check out the choosing process for a classification problem.

Say you were trying to predict whether or not a patient had heart disease based on their medical records.

The dataset in `../data/heart-disease.csv` contains data for just that problem.

In [49]:
# How many samples are there?


Similar to the Boston housing dataset, here we want to use all of the available data to predict the target column (1 for if a patient has heart disease and 0 for if they don't).

So what do we know?

We've got 303 samples (1 row = 1 sample) and we're trying to predict whether or not a patient has heart disease.

Because we're trying to predict whether each sample is one thing or another, we've got a classification problem.

Let's see how it lines up with our [Scikit-Learn algorithm cheat-sheet](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).

<img src="../images/sklearn-ml-map-cheatsheet-heart-disease-linear-svc.png" width=700/>

Following the cheat-sheet we end up at [`LinearSVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC) which stands for Linear Support Vector Classifier. Let's try it on our data. 

In [61]:
# Import LinearSVC from the svm module


# Setup random seed


# Split the data into X (features/data) and y (target/labels)


# Split into train and test sets


# Instantiate and fit the model (on the training set)


# Check the score of the model (on the test set)
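
A sketch of this cell, assuming synthetic data from `make_classification` in place of `heart-disease.csv`. The synthetic features are already on a similar scale, so `LinearSVC` may score better here than the 47% reported for the real data. The `max_iter=10000` setting is an assumption to give the solver room to converge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Setup random seed
np.random.seed(42)

# Synthetic stand-in for the heart disease data: 300 samples, 13 features
X, y = make_classification(n_samples=300, n_features=13, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate and fit the model (on the training set)
clf = LinearSVC(max_iter=10000)
clf.fit(X_train, y_train)

# Check the score of the model (on the test set)
svc_score = clf.score(X_test, y_test)
print(f"LinearSVC accuracy: {svc_score:.2f}")
```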


Straight out of the box (with no tuning or improvements) the model scores 47% accuracy, which with 2 classes (heart disease or not) is as good as guessing.

With this result, we'll go back to our diagram and see what our options are.

<img src="../images/sklearn-ml-map-cheatsheet-heart-disease-ensemble.png" width=700/>

Following the path (and skipping a few, don't worry, we'll get to this) we come up to [ensemble methods](https://scikit-learn.org/stable/modules/ensemble.html) again. Except this time, we'll be looking at ensemble classifiers instead of regressors.

Remember our [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) from above? Well, it has a dance partner, [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), which is an ensemble-based machine learning model for classification. You might be able to guess what we can use it for.

Let's try.

In [70]:
# Import the RandomForestClassifier model class from the ensemble module


# Setup random seed


# Split the data into X (features/data) and y (target/labels)


# Split into train and test sets


# Instantiate and fit the model (on the training set)


# Check the score of the model (on the test set)


0.8524590163934426

Using the `RandomForestClassifier` we get almost double the score of `LinearSVC`.

One thing to remember is that both models are yet to receive any hyperparameter tuning. Hyperparameter tuning is a fancy term for adjusting some settings on a model to try and make it better. It usually happens once you've found a decent baseline result you'd like to improve upon.

In this case, we'd probably take the `RandomForestClassifier` and try and improve it with hyperparameter tuning (which we'll see later on).

### What about the other models?

Looking at the cheat-sheet and the examples above, you may have noticed we've skipped a few.

Why?

The first reason is time. Covering every single one would take a fair bit longer than what we've done here. And the second one is the effectiveness of ensemble methods.

A little tidbit for modelling in machine learning is:
* If you have structured data (tables or dataframes), use ensemble methods, such as a Random Forest.
* If you have unstructured data (text, images, audio, things not in tables), use deep learning or transfer learning.

For this notebook, we're focused on structured data, which is why the Random Forest has been our model of choice.

If you'd like to learn more about the Random Forest and why it's the war horse of machine learning, check out these resources:
* [Random Forest Wikipedia](https://en.wikipedia.org/wiki/Random_forest)
* [Random Forests in Python](http://blog.yhat.com/posts/random-forests-in-python.html) by yhat
* [An Implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76) by Will Koehrsen

### Experiment until something works

The beautiful thing is, the way the Scikit-Learn API is designed, once you know the way with one model, using another is much the same.

And since a big part of being a machine learning engineer or data scientist is experimenting, you might want to try out some of the other models on the cheat-sheet and see how you go. The more you can reduce the time between experiments, the better.

## 3. Fitting the model to data and using it to make predictions

Now you've chosen a model, the next step is to have it learn from the data so it can be used for predictions in the future.

If you've followed through, you've seen a few examples of this already.

### 3.1 Fitting a model to data

In Scikit-Learn, the process of having a machine learning model learn patterns from a dataset involves calling the `fit()` method and passing it data, such as `fit(X, y)`, where `X` is a feature array and `y` is a target array.

Other names for `X` include:
* Data
* Feature variables
* Features

Other names for `y` include:
* Labels
* Target variable

For supervised learning there is usually an `X` and `y`. For unsupervised learning, there's no `y` (no labels).

Let's revisit the example of using patient data (`X`) to predict whether or not they have heart disease (`y`).

In [50]:
# Import the RandomForestClassifier model class from the ensemble module


# Setup random seed


# Split the data into X (features/data) and y (target/labels)

# Split into train and test sets


# Instantiate the model

# Call the fit method on the model and pass it training data


# Check the score of the model (on the test set)


What's happening here?

Calling the `fit()` method will cause the machine learning algorithm to attempt to find patterns between `X` and `y`. Or if there's no `y`, it'll only find the patterns within `X`.

Let's see `X`.

And `y`.

Passing `X` and `y` to `fit()` will cause the model to go through all of the examples in `X` (data) and see what their corresponding `y` (label) is.

How the model does this is different depending on the model you use.

Explaining the details of each would take an entire textbook. 

For now, you could imagine it similar to how you would figure out patterns if you had enough time. 

You'd look at the feature variables, `X`, such as `age`, `sex` and `chol` (cholesterol), and see what different values led to the labels, `y`: `1` for heart disease, `0` for no heart disease.

This concept, regardless of the problem, is similar throughout all of machine learning.

**During training (finding patterns in data):**

A machine learning algorithm looks at a dataset, finds patterns, tries to use those patterns to predict something and corrects itself as best it can with the available data and labels. It stores these patterns for later use.

**During testing or in production (using learned patterns):**

A machine learning algorithm uses the patterns it has previously learned in a dataset to make a prediction on some unseen data.

### 3.2 Making predictions using a machine learning model
Now we've got a trained model, one which has hopefully learned patterns in the data, you'll want to use it to make predictions.

Scikit-Learn enables this in several ways. Two of the most common and useful are [`predict()`](https://github.com/scikit-learn/scikit-learn/blob/5f3c3f037/sklearn/multiclass.py#L299) and [`predict_proba()`](https://github.com/scikit-learn/scikit-learn/blob/5f3c3f037/sklearn/linear_model/_logistic.py#L1617).

Let's see them in action.

In [51]:
# Use a trained model to make predictions


Given data in the form of `X`, the `predict()` function returns labels in the form of `y`.

It's standard practice to save these predictions to a variable named something like `y_preds` for later comparison to `y_test` or `y_true` (usually same as `y_test` just another name).

In [52]:
# Compare predictions to truth


Another way of doing this is with Scikit-Learn's [`accuracy_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) function.
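
As a quick sketch of these comparisons, the snippet below computes accuracy three ways: by hand with NumPy, with the model's `score()` method and with `accuracy_score()`. It assumes synthetic data in place of the heart disease data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Use a trained model to make predictions
y_preds = clf.predict(X_test)

# Compare predictions to truth: fraction of predictions that match the labels
manual_accuracy = np.mean(y_preds == y_test)
score_accuracy = clf.score(X_test, y_test)
metric_accuracy = accuracy_score(y_test, y_preds)

print(manual_accuracy, score_accuracy, metric_accuracy)
```

All three numbers should agree, since each is the proportion of correct predictions.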

**Note:** For the `predict()` method to work, it must be passed `X` (data) in the same format the model was trained on. Anything different and it will raise an error.

`predict_proba()` returns the probabilities of a classification label.

In [53]:
# Return probabilities rather than labels


Let's see the difference.

In [54]:
# Return labels


`predict_proba()` returns an array of five arrays each containing two values.

Each number is the probability of a label given a sample.

In [55]:
# Find prediction probabilities for 1 sample


This output means that for the sample `X_test[:1]`, the model is predicting label 0 (index 0) with a probability score of 0.9.

Because the score is over 0.5, when using `predict()`, a label of 0 is assigned.

In [56]:
# Return the label for 1 sample


Where does 0.5 come from?

Because our problem is a binary classification task (heart disease or not heart disease), predicting a label with 0.5 probability every time would be the same as a coin toss (guessing). Therefore, once the prediction probability of a sample passes 0.5, for a certain label, it's assigned that label.
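
The relationship between `predict_proba()` and `predict()` can be sketched like this, assuming a `RandomForestClassifier` trained on synthetic binary data. Picking the class with the highest probability for each sample gives the same labels `predict()` returns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Return probabilities rather than labels: one row per sample, one column per class
probabilities = clf.predict_proba(X_test[:5])

# Return labels
labels = clf.predict(X_test[:5])

# Picking the most probable class for each sample reproduces predict()
labels_from_probabilities = clf.classes_[np.argmax(probabilities, axis=1)]

print(probabilities)
print(labels, labels_from_probabilities)
```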

`predict()` can also be used for regression models.

In [58]:
# Import the RandomForestRegressor model class from the ensemble module


# Setup random seed


# Create the data


# Split into train and test sets


# Instantiate and fit the model (on the training set)


# Make predictions


In [60]:
# Compare the predictions to the truth
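
A sketch of making and checking regression predictions, assuming synthetic data from `make_regression` as a stand-in. Mean absolute error is used here as one common way to compare predictions to the truth; it is the average absolute difference between each prediction and its true value.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Create the data (synthetic stand-in with 13 numerical features)
X, y = make_regression(n_samples=300, n_features=13, noise=10.0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Instantiate and fit the model (on the training set)
model = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# Make predictions
y_preds = model.predict(X_test)

# Compare the predictions to the truth
mae = mean_absolute_error(y_test, y_preds)
manual_mae = np.mean(np.abs(y_test - y_preds))
print(f"Mean absolute error: {mae:.2f}")
```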
