# Supervised Learning

You can consult the solution for this live training in `notebook-solution.ipynb`.

## Predicting values of a *target variable* given a set of *features*

* For example, predicting if a customer will buy a product *(target)* based on their location and last five purchases *(features)*.

### Regression

* Predicting the values of a continuous variable e.g., house price.

### Classification

* Predicting a categorical outcome, e.g., whether a customer churns (a binary outcome).

# Data Dictionary

The data has the following fields:

|Column name | Description |
|------------|-------------|
| `loan_id`  | Unique loan id |
| `gender`   | Gender - `Male` / `Female` |
| `married`  | Marital status - `Yes` / `No` |
| `dependents` | Number of dependents |
| `education` | Education - `Graduate` / `Not Graduate` |
| `self_employed` | Self-employment status - `Yes` / `No` |
| `applicant_income` | Applicant's income |
| `coapplicant_income` | Coapplicant's income |
| `loan_amount` | Loan amount (thousands) |
| `loan_amount_term` | Term of loan (months) |
| `credit_history` | Credit history meets guidelines - `1` / `0` |
| `property_area` | Area of the property - `Urban` / `Semi Urban` / `Rural` | 
| `loan_status` | Loan approval status (target) - `1` / `0` |

In [1]:
# Import required libraries


In [2]:
# Read in the dataset


# Preview the data
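
A minimal sketch of reading and previewing the data with pandas. The file name of the real dataset is not given here, so this example reads three made-up rows from an in-memory string instead; the rows and the `loans` variable name are hypothetical and only follow the column layout of the data dictionary.

```python
import io

import pandas as pd

# Hypothetical rows shaped like the data dictionary; NOT the real dataset
csv_text = """loan_id,gender,married,dependents,education,self_employed,applicant_income,coapplicant_income,loan_amount,loan_amount_term,credit_history,property_area,loan_status
LP001,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,0
LP002,Male,Yes,0,Graduate,Yes,3000,0,66,360,1,Urban,1
LP003,Female,No,0,Not Graduate,No,2583,2358,120,360,1,Urban,1
"""

# pd.read_csv accepts a file path or any file-like object
loans = pd.read_csv(io.StringIO(csv_text))

# Preview the first rows
print(loans.head())
```

With a real file, the `io.StringIO(...)` argument would be replaced by the path to the CSV.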


# Exploratory Data Analysis

We can't just dive straight into machine learning!
We need to understand and format our data for modeling.
What are we looking for?

## Cleanliness

* Are columns set to the correct data type?
* Do we have missing data?
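
Both cleanliness questions can be checked with a couple of pandas calls. The sketch below uses a small made-up DataFrame (the values are hypothetical) to show one way of inspecting data types and counting missing values.

```python
import numpy as np
import pandas as pd

# Made-up sample with a few missing values; not the real dataset
df = pd.DataFrame({
    "gender": ["Male", "Female", None, "Male"],
    "loan_amount": [128.0, np.nan, 120.0, 141.0],
    "credit_history": [1.0, 1.0, 0.0, 1.0],
})

# Data type of each column; text columns usually need encoding before modeling
print(df.dtypes)

# Number of missing values per column
missing_counts = df.isna().sum()
print(missing_counts)
```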

## Distributions

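* Some models, including linear ones such as logistic regression, can behave better when numeric features are not heavily skewed and sit on comparable scales. How is each feature distributed?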
* Do we have outliers (extreme values)?
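
One possible way to quantify skew and flag outliers is sketched below, using made-up income values. The 1.5 × IQR rule and the log transform are common conventions assumed here for illustration, not something the training prescribes.

```python
import numpy as np
import pandas as pd

# Made-up incomes with one very high earner
income = pd.Series([2500, 3000, 3200, 4100, 4600, 5000, 81000],
                   name="applicant_income")

# Positive skewness indicates a long right tail
print(income.skew())

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]
print(outliers)

# A log transform compresses the right tail (log1p also handles zeros)
log_income = np.log1p(income)
print(log_income.skew())
```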

## Relationships

* If a feature is strongly correlated with the target variable, it might be a good feature for predictions!
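
As a rough illustration, the correlation of each numeric column with the target can be ranked as below. The data is made up, and a linear correlation is only a first hint: it can miss non-linear relationships.

```python
import pandas as pd

# Made-up sample; not the real dataset
df = pd.DataFrame({
    "credit_history": [1, 1, 0, 1, 0, 1, 0, 1],
    "loan_amount": [120, 100, 150, 130, 110, 90, 140, 125],
    "loan_status": [1, 1, 0, 1, 0, 1, 0, 0],
})

# Pearson correlation of every column with the target,
# sorted by absolute strength
target_corr = (
    df.corr()["loan_status"]
    .drop("loan_status")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```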

## Feature Engineering

* Do we need to modify any data, e.g., convert it to a different data type (most ML models expect numeric input) or extract part of it?

In [3]:
# Remove the loan_id to avoid accidentally using it as a feature


In [4]:
# Counts and data types per column


In [5]:
# Distributions and relationships


In [6]:
# Correlation between variables


In [7]:
# Target frequency


In [8]:
# Class frequency by loan_status
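
A sketch of how the target frequency and the per-class frequency of a categorical column might be computed. The rows are made up, and `property_area` is just one example column from the data dictionary.

```python
import pandas as pd

# Made-up sample; not the real dataset
df = pd.DataFrame({
    "property_area": ["Urban", "Rural", "Urban", "Semi Urban", "Rural", "Urban"],
    "loan_status": [1, 0, 1, 1, 0, 0],
})

# Share of each target class; a large imbalance makes accuracy misleading
target_share = df["loan_status"].value_counts(normalize=True)
print(target_share)

# Counts of each category split by loan_status
area_by_status = pd.crosstab(df["property_area"], df["loan_status"])
print(area_by_status)
```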


# Modeling

In [20]:
# First model using loan_amount

# Split into training and test sets

# Previewing the training set
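
A minimal sketch of the split step with scikit-learn's `train_test_split`, using ten made-up rows. The `test_size`, `random_state`, and `stratify` settings are assumptions chosen for illustration, not values taken from the training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up sample; not the real dataset
df = pd.DataFrame({
    "loan_amount": [66, 95, 110, 120, 128, 141, 158, 168, 200, 267],
    "loan_status": [1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})

# Features must be 2-D (a DataFrame); the target is 1-D (a Series)
X = df[["loan_amount"]]
y = df["loan_status"]

# Hold out 30% for testing; stratify keeps the class balance in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Preview the training set
print(X_train.head())
```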

In [21]:
# Instantiate a logistic regression model

# Fit to the training data

# Predict test set values

# Check the model's first five predictions
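
The instantiate, fit, and predict steps might look like the following sketch. It generates synthetic data (the approval rule is invented so the example is self-contained), and `max_iter=1000` is an assumption to give the solver room to converge on an unscaled feature.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: one feature standing in for loan_amount
rng = np.random.default_rng(0)
X = rng.normal(loc=140, scale=40, size=(200, 1))
# Invented rule: smaller loans are approved more often, with noise
y = (X[:, 0] + rng.normal(scale=30, size=200) < 150).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Instantiate a logistic regression model
clf = LogisticRegression(max_iter=1000)

# Fit to the training data
clf.fit(X_train, y_train)

# Predict test set values
y_pred = clf.predict(X_test)

# Check the model's first five predictions
print(y_pred[:5])
```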


# Classification Metrics

&nbsp;

## Accuracy

![accuracy_formula](accuracy_formula.png)
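
If the image does not render: accuracy is the share of all predictions that the model gets right. Written with the confusion-matrix counts defined in the next section (this is the standard definition, assumed here to match the image):

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$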

&nbsp;

## Confusion Matrix

**True Positive (TP)** = # Correctly predicted as positive

**True Negative (TN)** = # Correctly predicted as negative

**False Positive (FP)** = # Incorrectly predicted as positive (actually negative)

**False Negative (FN)** = # Incorrectly predicted as negative (actually positive)

&nbsp;

|        | **Predicted: Negative** | **Predicted: Positive** |
|--------|---------------------|---------------------|
|**Actual: Negative** | True Negative | False Positive |
|**Actual: Positive** | False Negative | True Positive |

&nbsp;

### Confusion Matrix Metrics

$Precision = \frac{TP}{TP + FP}$

$Recall = \frac{TP}{TP + FN}$
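
To make the formulas concrete, the sketch below computes them from a confusion matrix on eight made-up labels and predictions. For binary labels `0` and `1`, scikit-learn's `confusion_matrix` uses the same layout as the table above (rows are actual, columns are predicted).

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels and predictions for illustration
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes, ordered [0, 1]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Apply the formulas by hand
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)

# The built-in helpers compute the same quantities
print(accuracy_score(y_true, y_pred),
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred))
```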

In [22]:
# Accuracy

In [23]:
# Confusion matrix


# Feature Engineering

In [24]:
# Convert categorical features to binary

# Previewing the new DataFrame
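
One common way to convert categorical columns to binary indicators is `pd.get_dummies`, sketched below on made-up rows. Using `drop_first=True` (an assumption, not necessarily the choice made in the training) drops one level per column so the indicators are not redundant.

```python
import pandas as pd

# Made-up sample; not the real dataset
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "married": ["Yes", "No", "Yes"],
    "property_area": ["Urban", "Rural", "Semi Urban"],
    "loan_amount": [128, 66, 120],
})

# Text columns become 0/1 indicator columns; numeric columns pass through
encoded = pd.get_dummies(df, drop_first=True, dtype=int)

# Preview the new DataFrame
print(encoded.head())
```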

In [25]:
# Resplit into features and targets

# Split into training and test sets

In [26]:
# Instantiate logistic regression model

# Fit to the training data

# Predict test set values

In [27]:
# Accuracy

In [28]:
# Confusion matrix


In [29]:
# Finding the importance of features
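
For logistic regression, one rough proxy for feature importance is the size of each fitted coefficient, provided the features are on the same scale. The sketch below assumes that approach and uses synthetic data in which the outcome is deliberately driven by `credit_history`; the training itself may use a different method.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data; not the real dataset
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    "credit_history": rng.integers(0, 2, n),
    "loan_amount": rng.normal(140, 40, n),
})
# Invented rule: the outcome depends mainly on credit_history
y = ((2.0 * X["credit_history"] + rng.normal(0, 0.5, n)) > 1.0).astype(int)

# Standardize so coefficient magnitudes are comparable across features
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression().fit(X_scaled, y)

# Coefficients sorted by absolute size; the sign shows the direction of effect
importance = pd.Series(clf.coef_[0], index=X.columns).sort_values(
    key=abs, ascending=False
)
print(importance)
```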


In [19]:
# Illustrate feature importance



# Split into training and test sets


# Instantiate logistic regression model


# Fit to the training data


# Predict test set values


# Accuracy


# Confusion matrix


# How might we improve model performance?

* Further [preprocessing](https://app.datacamp.com/learn/courses/preprocessing-for-machine-learning-in-python):
	- Log transformations for skewed distributions.
	- Scale feature values. 
	- Remove outliers e.g., high earners.
* Try a different model e.g., [Decision trees](https://app.datacamp.com/learn/courses/machine-learning-with-tree-based-models-in-python).
* Gather more data.
	- Train new models on incorrect predictions (may need more data and/or a holdout set).
* [Further feature engineering](https://app.datacamp.com/learn/courses/feature-engineering-for-machine-learning-in-python).
* [Hyperparameter tuning](https://app.datacamp.com/learn/courses/hyperparameter-tuning-in-python).
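
Several of these ideas can be combined in a single scikit-learn pipeline. The sketch below is one possible arrangement on synthetic skewed data: a log transform, scaling, and a small grid search over the regularization strength `C`. The grid values and `cv=5` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic right-skewed features; not the real dataset
rng = np.random.default_rng(2)
X = rng.lognormal(mean=8, sigma=1, size=(200, 2))
# Invented rule: the outcome depends on the log of the first feature
y = (np.log(X[:, 0]) + rng.normal(0, 0.5, 200) > 8).astype(int)

# Log-transform, then scale, then fit the classifier
pipe = make_pipeline(
    FunctionTransformer(np.log1p),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

# make_pipeline names each step after its lowercased class name
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}

# 5-fold cross-validated search over C
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Keeping the preprocessing inside the pipeline means it is refit on each training fold, which avoids leaking information from the validation folds.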