# Overview
As we've seen, most of the machine learning process doesn't involve machine learning. A lot of our effort goes towards wrangling, visualizing, and cleaning up the data. But now we have our dataset in the format we need it and we're finally ready to do some prediction!

In [None]:
import sklearn
import pandas as pd
import numpy as np
from IPython.display import Image
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set()

Let's start by loading in the data which we saved in our last notebook:

In [None]:
import pickle
with open("diabetes_data.pkl", "rb") as f:
    (X_train, X_test, y_train, y_test) = pickle.load(f)

In [None]:
X_train.head()

In [None]:
print(X_train.shape)
print(X_test.shape)

# Modeling the Data

## Fitting a model
To train a machine learning model, we **fit** it to the data. In scikit-learn, every model has the same `fit(X, y)` method.

Once we fit a model, we can then use it to **predict** on new data.

This **fit -> predict -> evaluate** pattern is one of the core components of machine learning. You should always fit using the training data and then predict and evaluate on the testing data.
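To make the pattern concrete before we apply it to our data, here is a minimal sketch on a small synthetic dataset. Everything in it is an assumption for illustration: the data comes from `make_classification` rather than the diabetes file, the `toy_` variable names are made up so they don't overwrite our real variables, and a decision tree stands in for whichever model you end up choosing.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the diabetes dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

toy_clf = DecisionTreeClassifier(random_state=0)
toy_clf.fit(X_toy_train, y_toy_train)                # fit: training data only
toy_pred = toy_clf.predict(X_toy_test)               # predict: unseen test rows
toy_accuracy = accuracy_score(y_toy_test, toy_pred)  # evaluate: compare to labels
print(toy_accuracy)
```

Notice that the test rows are never passed to `fit`. That separation is what lets the score tell us how the model might behave on data it hasn't seen.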

We'll start by training a [**Random Forest**](https://en.wikipedia.org/wiki/Random_forest) model. This tends to be a high-performing model on many datasets, so it's a good starting point. We won't go into too much detail with this model, but an image below shows the general process. A random forest consists of many decision trees (another type of ML model) working together to make a prediction. [This Medium article](https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d) gives a high-level overview of this model.


In [None]:
Image("./random_forest.png")

First, we'll import the model class from scikit-learn. Then we'll instantiate it and save it to a variable called `clf` (short for **"classifier"**).

### TODO
Import `RandomForestClassifier` from `sklearn.ensemble`. Then instantiate it and save it to a variable called `clf`.

In [None]:
from sklearn.ensemble import ____

In [None]:
____ = ____()

Then we'll **fit** our model to the data by calling `clf.fit()` and passing in our training data.

### TODO
Call the random forest's fit method and pass in `X_train` and `y_train` as arguments.

In [None]:
clf.____(____, ____)

## Evaluating a model
Once we have trained a model, we need to know how well it does. To achieve this, we'll see if it can correctly predict the classes of the testing data. 

First, we'll **predict** on `X_test` using `clf.predict(X_test)`. This will give us an array with one prediction per row of `X_test`, which we'll compare with the true labels for those rows.

### TODO
Create a variable called `pred` which is the output of `clf.predict`:

In [None]:
____ = clf.____(X_test)

Now, let's compare the labels and predictions. Let's start by doing it visually. We'll print out the first 25 labels and predictions and see if we can spot any differences:

In [None]:
print("Truth:", list(y_test[:25]))
print("Pred: ", list(pred[:25]))

### Discussion
A classifier is correct when the predicted label in `pred` is the same as the actual label in `y_test`. What does it mean when `y_test` and `pred` are different in these situations?
- 1 is predicted but 0 is actual
- 0 is predicted but 1 is actual

We want a numeric score that tells us how the model is doing on the dataset as a whole. The simplest metric is **accuracy**, which tells us what proportion of the time our model makes the correct prediction. We can compute this by using the `accuracy_score` function from scikit-learn and passing in the labels and predictions.
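As a quick sanity check on what accuracy means, the sketch below computes it by hand on a tiny made-up example and compares the result with scikit-learn. The arrays `toy_truth` and `toy_guess` are hypothetical values chosen for illustration, not output from our model.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions (made up for illustration)
toy_truth = np.array([0, 1, 1, 0, 1, 0, 0, 1])
toy_guess = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Accuracy is the fraction of positions where prediction equals label
manual_accuracy = np.mean(toy_truth == toy_guess)
library_accuracy = accuracy_score(toy_truth, toy_guess)
print(manual_accuracy, library_accuracy)  # both are 0.75 (6 of 8 correct)
```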

### TODO
Compute the accuracy score by passing in `y_test` and `pred`.

In [None]:
from sklearn.metrics import accuracy_score
____(y_test, pred)

# Choosing an algorithm
There are many different types of algorithms that can be used for machine learning classification. Each one works a little differently and some work better for a specific problem. We started by using a single model, but in practice you want to try out a few different models and train each of them, then analyze and compare the results.

We won't go into the details about each classifier, but we'll try out each of these 6 classifiers.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

We'll save each of these models in a dictionary with the keys as an abbreviated name so we can keep track of it.

In [None]:
models = {"LR": LogisticRegression(),
          "NB": GaussianNB(),
          "KNN": KNeighborsClassifier(),
          "DT": DecisionTreeClassifier(),
          "RFC": RandomForestClassifier(),
          "SVM": SVC()
         }


Next, we'll loop through each of the models, fit them to our training data, predict on our testing data, and calculate the accuracy. We'll then save the name and accuracy score so we can analyze performance later.

Since we're repeating steps for each model, it makes sense to turn these steps into a **function**.

### TODO
Finish defining the function below. Give it an informative name and two parameters: `clf` and `name`.

In [None]:
def ____(____, ____):
    print("Training {}".format(name))
    clf.____(X_train, y_train)
    
    # Predict on the test set
    pred = clf.____(X_test)
    # Measure the accuracy
    accuracy = accuracy_score(____, ____)
    print("Accuracy: {}".format(accuracy))
    print()
    return accuracy

Now, let's call our function on each of the models:

In [None]:
model_names_scores = []
for name, clf in models.items():
    accuracy = train_model(clf, name)
    
    # Append to this list to analyze later
    model_names_scores.append((name, accuracy))

### Evaluation and Analysis
Let's see how our classifiers did on our test set. Let's start by sorting the scores by accuracy and plotting their performance.

In [None]:
sorted_name_scores = sorted(model_names_scores, key=lambda x:x[1], reverse=True)
sorted_names, sorted_scores = zip(*sorted_name_scores)

In [None]:
x_plot = range(len(sorted_names))
fig, ax = plt.subplots()

ax.bar(x_plot, sorted_scores)
ax.set_ylim(0.5, 1.0)

ax.set_xticks(x_plot)
_ = ax.set_xticklabels(sorted_names)

## Closer analysis
Now, let's take the best-performing model and look at more details. While training, we only looked at the **accuracy** of the classifier. But an accuracy score is sometimes insufficient. 

One thing to consider is **false positives** vs. **false negatives**. **False positives** occur when we predict a positive case that is actually negative. In our scenario, this would mean saying a patient has diabetes when they don't. **False negatives** are the reverse: we predict negative for a case that is actually positive. For example, saying a patient doesn't have diabetes when they do.
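One way to count these two kinds of errors is scikit-learn's `confusion_matrix`. The sketch below uses small made-up arrays (`toy_truth` and `toy_guess` are hypothetical, not our model's output) so the counts are easy to verify by eye.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (made up for illustration)
toy_truth = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1])
toy_guess = np.array([0, 1, 0, 0, 1, 0, 1, 1, 0, 0])

# With labels=[0, 1], rows are actual classes and columns are predicted classes,
# so flattening gives: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(toy_truth, toy_guess, labels=[0, 1]).ravel()
print("False positives:", fp)  # actual 0, predicted 1
print("False negatives:", fn)  # actual 1, predicted 0
```

In this toy example there is 1 false positive and there are 2 false negatives.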

**Discussion**
- Can you think of some reasons why accuracy might be insufficient?
- Are false positives or false negatives worse?

To supplement this, we'll look at three more metrics: **precision**, **recall**, and **f1-score**. Here is a blog post that explains the difference between these metrics: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9

In summary, here's what each metric tells us:
- **Precision**: If our classifier says a patient has diabetes, how likely is it that our classifier is correct? **A system which produces many *false positives* will have *lower* precision**
- **Recall**: If our classifier is given a patient with diabetes, how likely is it that our classifier will correctly predict that? **A system which has many *false negatives* will have *lower* recall**
- **F1**: The harmonic mean of precision and recall, which is high only when both are high
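To see how these definitions turn into numbers, here is a small sketch that computes each metric from raw counts and compares it with scikit-learn's built-in functions. The arrays are hypothetical values made up for illustration, with 1 as the positive class.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels and predictions (made up for illustration)
toy_truth = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1])
toy_guess = np.array([0, 1, 0, 0, 1, 0, 1, 1, 0, 0])

tp = np.sum((toy_truth == 1) & (toy_guess == 1))  # true positives
fp = np.sum((toy_truth == 0) & (toy_guess == 1))  # false positives
fn = np.sum((toy_truth == 1) & (toy_guess == 0))  # false negatives

precision = tp / (tp + fp)  # of the predicted positives, how many were right?
recall = tp / (tp + fn)     # of the actual positives, how many did we find?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, precision_score(toy_truth, toy_guess))
print(recall, recall_score(toy_truth, toy_guess))
print(f1, f1_score(toy_truth, toy_guess))
```

Here precision is 0.75 and recall is 0.6, so this toy classifier misses more positive cases than it falsely flags.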

Let's calculate these more detailed metrics with our highest-performing model. 

### TODO
Look at the model accuracies you calculated above. Pick the highest-performing model, instantiate it, and retrain it. Then predict on the test set.

In [None]:
best_clf = ____()
best_clf

In [None]:
best_clf.fit(X_train, y_train) # Retrain
pred = best_clf.predict(X_test)

Now, let's calculate the precision, recall, and F1-score. Scikit-learn offers a useful function called `classification_report` which prints these values out for us. Since this is a binary task, we only really care about how well the model performs on the positive class, so we'll pass `labels=[1]` to restrict the report to the class where the label is **1**.

### TODO
Call `classification_report` and compare `y_test` and `pred`.

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test, pred, labels=[1]))

### Discussion
- Does our model's performance look as good when looking at these metrics as it did when we looked at accuracy? What accounts for the difference?
- Does our model get more **false positives** or **false negatives**?

# Improving model performance
There are many more steps you could take to improve the performance of your model. These steps include:
- Scaling your data
- Using cross-validation instead of train/test split
- Hyperparameter tuning (see [this Medium article](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74) for an example of this)

These steps are out of scope for this module.
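For the curious, here is a rough sketch of how scaling, cross-validation, and hyperparameter tuning can fit together in scikit-learn. It is only one possible setup, not a recommended recipe: the synthetic data, the choice of `SVC`, and the candidate values for `C` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; this is not the diabetes dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Scaling lives inside the pipeline so it is re-fit on each cross-validation fold
pipeline = make_pipeline(StandardScaler(), SVC())

# Candidate hyperparameter values (arbitrary choices for illustration)
param_grid = {"svc__C": [0.1, 1.0, 10.0]}

# 5-fold cross-validation scores every candidate and keeps the best one
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

With our own data, you would fit the search on `X_train` and `y_train` and keep `X_test` aside for one final evaluation.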

This [xkcd comic](https://xkcd.com/1838/) illustrates the general process of improving your system:

In [None]:
Image("./xkcd_model.png")

# Additional Resources
We just scratched the surface of machine learning today! There's so much to learn about this field, and the best way to learn is by doing hands-on projects and examples. If you are interested in learning more, here are some more resources:

- [DataCamp Machine Learning Scientist with Python](https://www.datacamp.com/tracks/machine-learning-scientist-with-python): Here are 23 DataCamp courses which offer a fairly comprehensive overview of machine learning
- [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/): This is the first textbook I used to learn about machine learning. It is available on O'Reilly, which you can access through the UVU library
- [Andrew Ng's Machine Learning Coursera course](https://www.coursera.org/learn/machine-learning): A free and popular online class about machine learning

# Next Steps
This week's homework walks through the same steps we covered in class today, using another dataset on cardiovascular disease. Keep these notebooks open as examples: often the best way to complete a homework exercise is to copy the relevant code from the in-class notebooks and change the variables.

[./04-homework-master.ipynb](./04-homework-master.ipynb)