# Module 17 - Classification with K-Nearest Neighbors

**_Author: Mona Khalil_**

**_Revised: Jessica Cervi_**

**Expected time =  3 hours**

**Total points =  65 points**

## Assignment Overview


At this point in the course, you have learned about classification models such as Logistic Regression and K-Nearest Neighbors. In this assignment, you will learn methods of evaluating and selecting the best classification model for a binary classification problem. 

You will use a sample from the [Kaggle Airlines Delay Data Set](https://www.kaggle.com/giovamata/airlinedelaycauses) to train and evaluate classification models predicting whether or not a flight was delayed. 

You will be evaluating and comparing the performance of 3 models -- a logistic regression, K-Nearest Neighbors (KNN), and a Naïve Bayes algorithm in determining whether or not a flight was delayed. You will also evaluate the importance of individual features in predicting our outcome of interest.


This assignment is designed to build your familiarity and comfort coding in Python while also helping you review key topics from each module. As you progress through the assignment, answers will get increasingly complex. It is important that you adopt a data scientist's mindset when completing this assignment. **Remember to run your code from each cell before submitting your assignment.** Running your code beforehand will notify you of errors and give you a chance to fix your errors before submitting. You should view your Vocareum submission as if you are delivering a final project to your manager or client. 

***Vocareum Tips***
- Do not add arguments or options to functions unless you are specifically asked to. This will cause an error in Vocareum.
- Do not use a library unless you are explicitly asked to in the question. 
- You can download the Grading Report after submitting the assignment. This will include feedback and hints on incorrect questions. 


### Learning Objectives

- Use Logistic Regression, Naïve Bayes, and  K-Nearest Neighbors for prediction
- Compare the performance of different algorithms on the same data set  
- Examine algorithm performance and consider potential features to improve the model
- Build your acumen for model building



## Index: 

####  Module 17 - Classification with K-Nearest Neighbors

- [Question 1](#q1)
- [Question 2](#q2)
- [Question 3](#q3)
- [Question 4](#q4)
- [Question 5](#q5)
- [Question 6](#q6)
- [Question 7](#q7)
- [Question 8](#q8)
- [Question 9](#q9)
- [Question 10](#q10)
- [Question 11](#q11)
- [Question 12](#q12)
- [Question 13](#q13)



## Module 17 - Classification with K-Nearest Neighbors

### Getting and Preparing the Data



To begin, let's import a subset of the data set from a csv file. The original data set is over 65 megabytes (and 1.9 million records!), which would significantly slow down the assignment. Therefore, we've filtered the original data for our analysis.

Let's import the `numpy` and `pandas` libraries, and read in the csv file `airlines.csv` using `pandas`. 

In [1]:
import numpy as np
import pandas as pd

airlines = pd.read_csv('./data/airlines.csv')

Next, as usual, we explore the contents and columns of the `airlines` dataframe.

In [2]:
airlines.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,FlightNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut
0,33470,103250,2008,1,22,2,1302.0,1556.0,2302,174.0,94.0,68.0,116.0,36.0,CLE,LGA,418,5.0,101.0
1,33673,104002,2008,1,18,5,1444.0,1638.0,2485,114.0,105.0,84.0,23.0,14.0,LGA,CLE,418,8.0,22.0
2,34312,106260,2008,1,2,3,1442.0,1642.0,3139,120.0,105.0,74.0,27.0,12.0,LGA,CLE,418,7.0,39.0
3,34437,106726,2008,1,3,4,1258.0,1428.0,3138,90.0,95.0,67.0,28.0,33.0,CLE,LGA,418,7.0,16.0
4,34601,107285,2008,1,1,2,1735.0,2010.0,2607,155.0,97.0,80.0,93.0,35.0,CLE,LGA,418,7.0,68.0


In [3]:
airlines.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'Year', 'Month', 'DayofMonth',
       'DayOfWeek', 'DepTime', 'ArrTime', 'FlightNum', 'ActualElapsedTime',
       'CRSElapsedTime', 'AirTime', 'ArrDelay', 'DepDelay', 'Origin', 'Dest',
       'Distance', 'TaxiIn', 'TaxiOut'],
      dtype='object')

The data set has been pre-filtered to exclude all flights that did not take off or land at LaGuardia Airport (LGA), because we're interested in using information about the flights that predict whether or not an individual flight arrived late.

The majority of columns in the data set will be useful features, but we still need to complete some cleaning steps before we can compare our models. Let's drop the first two unnamed columns, as they are additional indexes and not needed for the analysis. 

[Back to top](#Index:) 
<a id='q1'></a>

### Question 1:

*5 points*

Drop the unnamed columns at the beginning of the data set.

In [4]:
### GRADED

airlines.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis=1, inplace=True)

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [5]:
airlines.columns

Index(['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'ArrTime',
       'FlightNum', 'ActualElapsedTime', 'CRSElapsedTime', 'AirTime',
       'ArrDelay', 'DepDelay', 'Origin', 'Dest', 'Distance', 'TaxiIn',
       'TaxiOut'],
      dtype='object')

### Preparing our Categorical Outcome

We have a number of features that we will use to determine the best classifier for predicting whether or not a flight is late. But first we have to determine which flights are delayed in a new column -- `IsArrDelayed` (is a flight arrival delayed)-- based on the `ArrDelay` column, which is a continuous variable counting the number of minutes a flight is behind schedule.

The Federal Aviation Administration (FAA) considers a flight _late_ if it is more than 15 minutes behind its scheduled landing time. Our new categorical outcome will reflect this standard. Ultimately, determining the threshold for a delayed flight is at the discretion of the person performing the analysis. We encourage you to explore other potential thresholds for flight delays that reflect the length of the flight, characteristics of the airport, etc.

In [9]:
airlines['IsArrDelayed'] = (airlines['ArrDelay'] > 15).astype(int)

In [13]:
airlines

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,FlightNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,IsArrDelayed
0,2008,1,22,2,1302.0,1556.0,2302,174.0,94.0,68.0,116.0,36.0,CLE,LGA,418,5.0,101.0,1
1,2008,1,18,5,1444.0,1638.0,2485,114.0,105.0,84.0,23.0,14.0,LGA,CLE,418,8.0,22.0,1
2,2008,1,2,3,1442.0,1642.0,3139,120.0,105.0,74.0,27.0,12.0,LGA,CLE,418,7.0,39.0,1
3,2008,1,3,4,1258.0,1428.0,3138,90.0,95.0,67.0,28.0,33.0,CLE,LGA,418,7.0,16.0,1
4,2008,1,1,2,1735.0,2010.0,2607,155.0,97.0,80.0,93.0,35.0,CLE,LGA,418,7.0,68.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68797,2008,12,13,6,1136.0,1427.0,1458,171.0,180.0,147.0,31.0,40.0,FLL,LGA,1076,5.0,19.0,1
68798,2008,12,13,6,1108.0,1356.0,1459,168.0,204.0,147.0,-28.0,8.0,LGA,FLL,1076,3.0,18.0,0
68799,2008,12,13,6,719.0,938.0,1491,199.0,219.0,179.0,-11.0,9.0,LGA,MSY,1183,5.0,15.0,0
68800,2008,12,13,6,753.0,1054.0,1503,181.0,188.0,143.0,11.0,18.0,LGA,TPA,1011,5.0,33.0,0


[Back to top](#Index:) 
<a id='q2'></a>

### Question 2:

*5 points*

Create a new categorical column, `IsArrDelayed`, based on the `ArrDelay` column. A flight should be considered _delayed_ (`1`) if it has a delay greater than 15 minutes. A flight, in this case, should be considered on time (_not delayed_ = `0`) if it has a delay of 15 minutes or less. 

**HINT:** You can accomplish this in a number of ways, such as using a list comprehension or by using a lambda function.
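
The approaches mentioned in the hint can be sketched on a small toy frame. This is only an illustration: the frame `toy` and its values are made up, and only the column name mirrors the assignment.

```python
import pandas as pd

# Hypothetical toy data; only the column name mirrors the assignment.
toy = pd.DataFrame({"ArrDelay": [116.0, 15.0, -28.0, 16.0]})

# 1) Vectorized comparison: True/False cast to 1/0.
vectorized = (toy["ArrDelay"] > 15).astype(int)

# 2) List comprehension over the column values.
comprehension = pd.Series([1 if d > 15 else 0 for d in toy["ArrDelay"]])

# 3) Lambda applied element by element.
with_lambda = toy["ArrDelay"].apply(lambda d: 1 if d > 15 else 0)

print(vectorized.tolist())
```

All three produce the same labels; note that a delay of exactly 15 minutes is labeled `0` because the comparison is strict.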

In [14]:
### GRADED

airlines['IsArrDelayed'] = (airlines['ArrDelay'] > 15).astype(int)

###
### YOUR CODE HERE
###


In [15]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Preparing Features

Great! We're ready to evaluate our potential features to see what can add value to our model, and what can be removed. Let's re-explore the data set using the `.head()` method.

In [16]:
airlines.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,ArrTime,FlightNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,IsArrDelayed
0,2008,1,22,2,1302.0,1556.0,2302,174.0,94.0,68.0,116.0,36.0,CLE,LGA,418,5.0,101.0,1
1,2008,1,18,5,1444.0,1638.0,2485,114.0,105.0,84.0,23.0,14.0,LGA,CLE,418,8.0,22.0,1
2,2008,1,2,3,1442.0,1642.0,3139,120.0,105.0,74.0,27.0,12.0,LGA,CLE,418,7.0,39.0,1
3,2008,1,3,4,1258.0,1428.0,3138,90.0,95.0,67.0,28.0,33.0,CLE,LGA,418,7.0,16.0,1
4,2008,1,1,2,1735.0,2010.0,2607,155.0,97.0,80.0,93.0,35.0,CLE,LGA,418,7.0,68.0,1


You'll notice that there are two columns -- `ActualElapsedTime` and `CRSElapsedTime`. The `ActualElapsedTime`  column represents the actual elapsed time for the flight, and `CRSElapsedTime` represents the _scheduled_ elapsed time. We already have a column, `ArrDelay`, representing the number of minutes that each flight is delayed. In combination with the `ActualElapsedTime`, we gain no new information from the `CRSElapsedTime` column.

We also have a column, `FlightNum`. While this is numeric information, the flight numbers are arbitrary and do not provide us with meaningful information. Finally, we can delete the `Year` column. The only year represented by this data set is 2008, so it offers no value to our predictive models.

[Back to top](#Index:) 
<a id='q3'></a>

### Question 3:

*5 points*

Drop the three columns discussed above (`CRSElapsedTime`, `FlightNum`, `Year`).

In [17]:
### GRADED

airlines.drop(['CRSElapsedTime', 'FlightNum', 'Year'], axis=1, inplace=True)

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


Finally, let's retrieve some useful information from the `Origin` and `Dest` columns. At the moment, these are categorical columns with too many values to reasonably `dummy code` in our model. However, we can create a new column identifying LaGuardia Airport as the origin or destination.

[Back to top](#Index:) 
<a id='q4'></a>

### Question 4:

*5 points*

Create a new column, `LGA_origin`, with a binary outcome identifying the origin airport as LaGuardia (LGA = `1`) or not LaGuardia (`0`). This will become a new feature in our classification models. You can do this using the same method you used to answer Question 2.

In [32]:
### GRADED

airlines['LGA_origin'] = (airlines['Origin'].str.upper() == 'LGA').astype(int)

###
### YOUR CODE HERE
###


In [33]:
airlines[['Origin','LGA_origin']]

Unnamed: 0,Origin,LGA_origin
0,CLE,0
1,LGA,1
2,LGA,1
3,CLE,0
4,CLE,0
...,...,...
68797,FLL,0
68798,LGA,1
68799,LGA,1
68800,LGA,1


In [34]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


<a id="split"></a>
## Splitting our Data for Modeling

Let's prepare our `X` dataframe and `y` series, separating the features we intend to use from our outcome variable. 

In [None]:
airlines.columns

We've created a new feature and dropped some potential features prior to analysis. We should also exclude the `Origin`, `Dest`, `ArrDelay`, `DepDelay`, and `IsArrDelayed` columns from our new dataframe, `X`, that will be used to potentially identify delayed flights. 

We will be excluding these columns for a number of reasons. First, the `Origin` and `Dest` columns cannot easily be dummy coded (there are too many categories to avoid the dummy variable trap, which you can read more about [here](https://towardsdatascience.com/one-hot-encoding-multicollinearity-and-the-dummy-variable-trap-b5840be3c41a)). The `ArrDelay` column is excluded because our outcome, `IsArrDelayed`, is computed directly from it, so including it would hand the models the answer. The `DepDelay` column is excluded because it offers little _predictive_ value. The departure delay is not determined until the plane departs, and has an obvious relationship to the arrival delay. Thus, we add little unique information by telling stakeholders that departure delays predict arrival delays.

Splitting the data into `X` and `y` is an important step when pre-processing data. Usually, `X` contains all the features *except* the one we are trying to predict, whereas `y` contains the label we are interested in.


[Back to top](#Index:) 
<a id='q5'></a>

### Question 5:

*5 points*

Split our data into `X` and `y`. 

- `X` should contain all features except those we don't intend to use (including `Origin`, `Dest`, `ArrDelay`, `DepDelay`). `X` also should not contain the outcome variable `IsArrDelayed`. 
- `y` should contain only the labels/outcome variable `IsArrDelayed` as a pandas series.

Finally, explicitly cast all datatypes within the `X` dataframe to `float64`. This is to avoid an implicit data conversion warning from `int64` when we use `StandardScaler`. See [pd.core.frame.DataFrame.astype](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html) for documentation.



In [44]:
X = airlines.drop(["Origin", "Dest", "ArrDelay", "DepDelay", "IsArrDelayed"], axis=1).astype("float64")
y = airlines['IsArrDelayed']

In [43]:
y

0        1
1        1
2        1
3        1
4        1
        ..
68797    1
68798    0
68799    0
68800    0
68801    0
Name: IsArrDelayed, Length: 68802, dtype: int64

In [45]:
### GRADED

X = airlines.drop(["Origin", "Dest", "ArrDelay", "DepDelay", "IsArrDelayed"], axis=1).astype("float64")
y = airlines['IsArrDelayed']

###
### YOUR CODE HERE
###


In [46]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Splitting into Training and Test Sets

Let's now perform a standard split into a training and test set. We'll use 70% of our data as a training set, and 30% as a test set.

[Back to top](#Index:) 
<a id='q6'></a>

### Question 6:

*5 points*

Import the function `train_test_split` from `scikit-learn` and split our `X` and `y` data into training and test sets (`X_train`, `X_test`, `y_train`, `y_test`). Use 30% of the data as a test set, and the remaining 70% for the training set. For reproducibility, set a random state of `1234`.

In [50]:
### GRADED

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=1234)

X_train = X_train
X_test_ans = X_test
y_train_ans = y_train
y_test_ans =  y_test

###
### YOUR CODE HERE
###


In [54]:
X_test.shape

(20641, 11)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


<a id="feature-scaling"></a>
## Feature Scaling

We have our training and test sets prepared. Are we ready for modeling?

_Not quite!_

An important step in many machine learning models is _**feature scaling**_. Feature scaling is an important data preprocessing step that transforms all data into the same _scale_. The most common method of doing this is standardization -- converting the mean to 0 and the standard deviation to 1. This ensures an accurate comparison of data across the same scale.

Why is this necessary? Many machine learning algorithms (including K-Nearest Neighbors and Logistic Regression) take into account the magnitude of the distance between points across features. This means that a column whose values differ by large amounts (e.g., arrival time) will carry a different weight than columns with smaller differences (e.g., day of the week). You can read more about the importance of feature scaling in the [scikit-learn documentation](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html).

Let's import the `StandardScaler` from `scikit-learn` and scale our training set. We perform this step _after_ splitting data into our training and test sets, because the process of scaling features is done in relation to _other points in the data set_. Fitting the scaler on the training set alone keeps information about the test set out of the training process.

In [55]:
from sklearn.preprocessing import StandardScaler

We then instantiate our standard scaler as `ss`. We will use the `fit_transform` method on our training set, and `transform` on our test set. This is not necessary to perform on `y_train` or `y_test`, as our outcome is a binary indicator. 

In [74]:
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
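
To make the `fit_transform` / `transform` distinction concrete, here is a minimal sketch on made-up numbers. The arrays `train` and `test` are hypothetical; the point is that the test set is scaled with the mean and standard deviation learned from the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values (two features on very different scales).
train = np.array([[1302.0, 2.0], [1444.0, 5.0], [1735.0, 3.0], [719.0, 6.0]])
test = np.array([[1108.0, 4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean/std from train only
test_scaled = scaler.transform(test)        # reuses the training mean/std

# The same computation by hand: z = (x - mean) / std (population std).
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test = (test - mean) / std

print(np.allclose(test_scaled, manual_test))
```

After scaling, each training column has mean 0 and standard deviation 1; the test columns generally do not, because they are expressed relative to the training statistics.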

In [77]:
y_train

52219    1
58122    0
28832    1
23716    0
15497    1
        ..
55985    0
32399    1
60620    1
34086    0
58067    1
Name: IsArrDelayed, Length: 48161, dtype: int64

<a id="training-models"></a>
## Training our Models

It's time to train our models and choose the best algorithm for predicting flight delays. Each of the 3 models we've chosen will give us the same evaluation metrics (accuracy score, confusion matrix) for direct comparison. Let's get started with our logistic regression.

<a id="lr"></a>
### Logistic Regression
_**Logistic Regression**_ is a predictive model built on the logistic function, which was developed in the 19th century to describe population growth. It's a powerful, commonly used model for predicting binary outcomes using 1 or more continuous predictors. The logistic regression function is similar to a linear regression, except that a logistic (sigmoid) _curve_ is used to predict a binary outcome.

Take a look at the image below.

![lr.png](lr.png)

The x-axis contains values for a binary output variable, and the y-axis contains the _probability value_ associated with the logit curve. The model attempts to find the best fit curve to correctly assign values to either of the binary outcomes (i.e., Yes/No, 1/0, Purchased/Not Purchased). Data points with a probability value above 50% of the curve (halfway up the axis) are associated with the outcome at the top; all others are associated with the outcome at the bottom.
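
The 50% rule described above can be sketched in a few lines. The intercept and coefficient below are hypothetical values chosen for illustration, not the ones fitted in this assignment.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and coefficient for a single standardized feature.
intercept, coef = 0.5, 2.0
x = np.array([-2.0, -0.5, 0.0, 1.0])

probabilities = sigmoid(intercept + coef * x)
predictions = (probabilities > 0.5).astype(int)  # above 50% -> class 1
print(probabilities.round(3), predictions)
```

The linear part (`intercept + coef * x`) works like a linear regression; the sigmoid then squeezes it into a probability that is compared against the 50% threshold.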

A logistic regression is a computationally inexpensive model to train and use in Python with `scikit-learn`. Let's import the `LogisticRegression` class and get started below.

In [65]:
from sklearn.linear_model import LogisticRegression

Similar to our linear regression assignment, we will instantiate the `LogisticRegression()` class using a random state of 0, and the `'lbfgs'` solver. Save the object as `lr`. We will then use the `fit` method to train the model on the training set.

In [66]:
lr = LogisticRegression(random_state = 0, solver = 'lbfgs')
lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

We can then create an array of predicted y-values based on the test set.

In [67]:
y_pred_lr = lr.predict(X_test)

Using the `y_pred_lr` and `y_test` values, we can create a _confusion matrix_ to evaluate the accuracy of our classifier.

A _**confusion matrix**_ is a matrix depicting the accuracy of your model's predictions. Data points are sorted into one of 4 categories, as shown below:

| Confusion Matrix      | **True (Predicted)** | **False (Predicted)** |
|-----------------------|-------------------|--------------------|
| **True (Actual)**  | True Positive     | False Negative     |
| **False (Actual)** | False Positive    | True Negative      |

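The percentage of data points accurately classified as _True Positive_ or _True Negative_ tells us the accuracy of the classification model. We can evaluate many types of binary classification models using a confusion matrix in `scikit-learn`. Note that `confusion_matrix` sorts the labels, so for our `0`/`1` outcome the output is laid out as `[[TN, FP], [FN, TP]]`, with rows for the actual classes and columns for the predicted classes. Let's import the necessary function below.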
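
As a quick illustration of how the four categories map onto the `scikit-learn` output, here is a small sketch with made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical). Because `scikit-learn` sorts the labels, the negative class (`0`) comes first, which differs from tables that list the positive class first.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = delayed, 0 = on time.
y_true_toy = [1, 1, 1, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 1]

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
# For binary 0/1 labels, scikit-learn orders the matrix as
# [[TN, FP],
#  [FN, TP]]  (rows = actual, columns = predicted, label 0 first).
tn, fp, fn, tp = cm_toy.ravel()
print(cm_toy)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```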

In [68]:
from sklearn.metrics import confusion_matrix

Finally, create the confusion matrix and save it as `cm_lr`. We will be comparing the logistic regression confusion matrix with those of the other classification models. 

In [69]:
cm_lr = confusion_matrix(y_test, y_pred_lr)
print(cm_lr)

[[ 1727  3778]
 [ 1341 13795]]


The logistic regression yielded a large number of false positives: 3,778 on-time flights were predicted to be delayed, more than twice the 1,727 on-time flights that were classified correctly ( _True Negative_ ). By contrast, only about 9% of the late flights (1,341 of 15,136) were predicted to be on time ( _False Negative_ ). Let's take a look at the model's overall accuracy score on the test set using the `score` method. 

In [72]:
lr_score = lr.score(X_test, y_test)
print(lr_score)

0.7519984496875152


The model is approximately 75% accurate at predicting whether or not a flight will be delayed. Therefore, this model may not be the best predictor of flight delays. Let's continue with our next model to compare the output.
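
The 75% figure can be reproduced directly from the confusion matrix printed above. The sketch below also computes a majority-class baseline, which is an addition for context rather than part of the original analysis; `cm_lr_values` is a hypothetical name holding a hard-coded copy of the matrix.

```python
import numpy as np

# Hard-coded copy of the logistic regression confusion matrix printed above.
cm_lr_values = np.array([[1727, 3778],
                         [1341, 13795]])

# Accuracy = correctly classified / all records (the diagonal over the total).
accuracy = np.trace(cm_lr_values) / cm_lr_values.sum()

# Baseline: accuracy of always predicting the larger actual class.
baseline = cm_lr_values.sum(axis=1).max() / cm_lr_values.sum()

print(round(accuracy, 4), round(baseline, 4))
```

Comparing the two numbers shows how much the model adds over simply predicting that every flight is delayed.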

<a id="knn"></a>
### K-Nearest Neighbors

The K-Nearest Neighbors algorithm is a straightforward algorithm that's easy to implement using Python. It works by calculating the distance between a new data point and the existing, labeled data from the training set, then classifying the new point based on the labels of its closest neighbors. It's considered a _lazy_ algorithm because it builds no explicit model during training: it stores the training data and defers the computation to prediction time.

K-Nearest Neighbors has two main parameters to specify in `scikit-learn` -- the value of K (the number of neighbors, `n_neighbors` in Python), and the distance metric used (e.g., Manhattan Distance with `p = 1` or Euclidean Distance with `p = 2`).
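
The idea can be sketched from scratch in a few lines. This is a simplified illustration, not how `scikit-learn` implements it internally; the function `knn_predict` and the toy arrays are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, p=1):
    """Classify x_new by majority vote among its k nearest training points.

    p=1 gives the Manhattan distance, p=2 the Euclidean distance.
    """
    distances = np.sum(np.abs(X_train - x_new) ** p, axis=1) ** (1.0 / p)
    nearest = np.argsort(distances)[:k]  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical, already-scaled toy data.
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2], [1.1, 0.8]])
y_toy = np.array([0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.1, 0.1]), k=3, p=1))
print(knn_predict(X_toy, y_toy, np.array([1.0, 0.9]), k=3, p=1))
```

Notice that no work happens until a new point arrives, which is the "lazy" behavior described above, and that every prediction requires distances to all training points.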

Let's import the `KNeighborsClassifier` class from `scikit-learn` and instantiate the class as `knn`. We'll use `n_neighbors = 10` and `p = 1` to select the Manhattan Distance metric.

In [71]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 10, p = 1)

[Back to top](#Index:) 
<a id='q7'></a>

### Question 7:

*5 points*

Use the `StandardScaler` class to standardize your features, and assign the scaler to the variable `ss`. Next, use `fit_transform` on the training set and `transform` on the test set. Then fit the `knn` object to `X_train` and `y_train`, in the same manner as the logistic regression above. You don't need to change the training sets at all -- just fit this new model to the data. 

Finally, follow the code above to predict the test set results from the trained `knn` object using `X_test` and save them to `y_pred_knn`.

In [83]:
### GRADED

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

knn.fit(X_train, y_train)

y_pred_knn = knn.predict(X_test)


###
### YOUR CODE HERE
###


In [82]:
y_pred_knn

array([1, 0, 1, ..., 1, 1, 0])

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q8'></a>

### Question 8:

*5 points*

Create the confusion matrix for the K-Nearest Neighbors model, in the same way you did for the logistic regression. Save the matrix as `cm_knn`.

In [91]:
### GRADED

cm_knn = confusion_matrix(y_test, y_pred_knn)
print(cm_knn)


###
### YOUR CODE HERE
###


[[ 2358  3147]
 [ 2029 13107]]


In [87]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


Let's compare the output of the KNN confusion matrix to the Logistic Regression model.

In [93]:
print('Logistic Regression:\n', cm_lr)
print('KNN:\n', cm_knn)

Logistic Regression:
 [[ 1727  3778]
 [ 1341 13795]]
KNN:
 [[ 2358  3147]
 [ 2029 13107]]


Finally, let's score the KNN model on the test set and compare it to the Logistic Regression model score.

Note, this cell might take a little bit of time to run.

In [89]:
knn_score = knn.score(X_test, y_test)
print('Logistic Regression:', lr_score)
print('KNN:', knn_score)

Logistic Regression: 0.7519984496875152
KNN: 0.7492369555738578


The overall score is consistent with the confusion matrices -- the two models perform almost identically, with the K-Nearest Neighbors algorithm scoring approximately 0.3 percentage points lower on the test set than the Logistic Regression. Let's move onto our third model so we can choose the best model to predict our data.

<a id="nb"></a>
### Naïve Bayes

The Naïve Bayes algorithm is part of a class of _probabilistic_ algorithms based on Bayes' Theorem. Based on a training set, each new record is given a _probability value_ for belonging to an outcome group. The algorithm is considered _naïve_ because it assumes that all predictors (features) in the data set are independent of one another. 

Given the assumption of independence among predictors, this may not be the ideal model of choice for our data set. Many of our predictors are likely related to one another (e.g., landing time is dependent on departure time). We will train the model on our data set for this assignment, but we strongly recommend considering this algorithm when you're confident your predictors are independent.

The Naïve Bayes algorithm is a part of `scikit-learn`. Specifically, we'll be using the `GaussianNB` algorithm. This assumes that each feature is _normally distributed_ (follows a Gaussian distribution) within each class. Note that standardizing our features changes only their scale, not the shape of their distributions, so this assumption holds only approximately for our data. Let's import `GaussianNB` below.
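
To show what the Gaussian and independence assumptions mean in practice, here is a hand-rolled sketch on made-up data. It is a simplified illustration of the idea (all names and values are hypothetical) and omits details such as the variance smoothing that `GaussianNB` applies.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return np.exp(-((x - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Hypothetical toy data: two features, two classes.
X_toy = np.array([[1.0, 2.1], [1.2, 1.9], [0.8, 2.0],
                  [3.0, 0.5], [3.2, 0.7], [2.8, 0.6]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
x_new = np.array([2.9, 0.8])

scores = {}
for label in np.unique(y_toy):
    rows = X_toy[y_toy == label]
    prior = len(rows) / len(X_toy)
    # "Naive" step: multiply per-feature likelihoods as if independent.
    likelihood = np.prod(gaussian_pdf(x_new, rows.mean(axis=0), rows.var(axis=0)))
    scores[int(label)] = prior * likelihood

# Normalize the scores so they sum to 1 (the posterior probabilities).
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
predicted = max(posteriors, key=posteriors.get)
print(predicted)
```

Each class is summarized by a per-feature mean and variance; the new record is assigned to the class with the larger posterior probability.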

In [97]:
from sklearn.naive_bayes import GaussianNB

From there, you can create an object, `nb`, to fit to the training set.

[Back to top](#Index:) 
<a id='q9'></a>

### Question 9:

*5 points*

Use `GaussianNB` to create an object called `nb`. Fit it to the training set. 

Afterward, create an object predicting the categories in the test set, save this to `y_pred_nb`.

In [104]:
### GRADED

nb = GaussianNB()

nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
###
### YOUR CODE HERE
###


In [105]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q10'></a>

### Question 10:

*5 points*

Create the confusion matrix for the Naïve Bayes algorithm. Save the results as `cm_nb`.

**HINT:** you can use the same function, `confusion_matrix`, as above.

In [108]:
### GRADED




cm_nb = confusion_matrix(y_test, y_pred_nb)
print(cm_nb)

###
### YOUR CODE HERE
###


[[3855 1650]
 [5450 9686]]


In [109]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


Let's evaluate our Naïve Bayes model compared to the other two.

In [110]:
print('Logistic Regression:\n', cm_lr)
print('KNN:\n', cm_knn)
print('Naïve Bayes:\n', cm_nb)

Logistic Regression:
 [[ 1727  3778]
 [ 1341 13795]]
KNN:
 [[ 2358  3147]
 [ 2029 13107]]
Naïve Bayes:
 [[3855 1650]
 [5450 9686]]


As you can see, the Naïve Bayes model more accurately classifies true negatives (flights that are actually on time) than the other two models. However, it is far less accurate in classifying the true positives (flights that are actually late): it labels 5,450 late flights as on time. Let's see how the model score compares to the previous two models.

In [111]:
nb_score = nb.score(X_test, y_test)
print('Logistic Regression:', lr_score)
print('KNN:', knn_score)
print('Naïve Bayes:', nb_score)

Logistic Regression: 0.7519984496875152
KNN: 0.7492369555738578
Naïve Bayes: 0.6560244174216365


Just as we suspected, the Naïve Bayes model has a much lower accuracy score overall. Therefore, it may not be the best choice of model!

<a id="eval"></a>
## Evaluating and Selecting a Model

Now that we have 3 model results, let's select the best model for our data. We'll use each model's _precision_ and _recall_ , which are two related measures of model accuracy. **Precision** identifies the proportion of positive identifications that were actually correct, and **recall** identifies the proportion of actual positives that were identified correctly. We can calculate the precision and recall for any model using `sklearn`. Let's import `precision_score` and calculate the precision for our logistic regression model.

In [112]:
from sklearn.metrics import precision_score

In [113]:
lr_precision = precision_score(y_test, y_pred_lr)
print(lr_precision)

0.7850110965685996


Our logistic regression model precision score tells us that the ratio of true positives to true + false positives is 0.785. Let's compare this to our recall score.

In [114]:
from sklearn.metrics import recall_score
lr_recall = recall_score(y_test, y_pred_lr)
print(lr_recall)

0.9114032769556025


Our recall is considerably higher, indicating that the logistic regression model correctly identifies about 91% of the flights that were actually delayed.
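
Both scores can also be derived by hand from the logistic regression confusion matrix printed earlier. In this sketch, `cm_lr_values` is a hypothetical name for a hard-coded copy of that matrix, read in `scikit-learn`'s `[[TN, FP], [FN, TP]]` order.

```python
import numpy as np

# Hard-coded copy of the logistic regression confusion matrix printed earlier.
cm_lr_values = np.array([[1727, 3778],
                         [1341, 13795]])

tn, fp, fn, tp = cm_lr_values.ravel()

precision = tp / (tp + fp)  # of the flights predicted delayed, how many were
recall = tp / (tp + fn)     # of the flights actually delayed, how many were caught

print(round(precision, 4), round(recall, 4))
```

The many false positives pull precision down, while the small number of false negatives keeps recall high.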

[Back to top](#Index:) 
<a id='q11'></a>

### Question 11:

*5 points*


Calculate the precision score and the recall score for the K-nearest neighbors model using the predictions generated from your KNN model earlier in this assignment. Assign the results to `knn_precision` and `knn_recall`.

In [115]:
### GRADED

knn_precision = precision_score(y_test, y_pred_knn)
knn_recall = recall_score(y_test, y_pred_knn)

# knn_precision = None
# knn_recall = None


###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q12'></a>

### Question 12:

*5 points* 

Calculate the precision score and recall score  for the Naïve Bayes model using the predictions generated from your NB model earlier in this assignment. Assign the results to `nb_precision` and `nb_recall`.

In [117]:
### GRADED

nb_precision = precision_score(y_test, y_pred_nb)
nb_recall = recall_score(y_test, y_pred_nb)


###
### YOUR CODE HERE
###


In [118]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


Let's compare our precision and recall scores from each model.

In [119]:
print('LR:', lr_precision, lr_recall)
print('KNN:', knn_precision, knn_recall)
print('NB:', nb_precision, nb_recall)

LR: 0.7850110965685996 0.9114032769556025
KNN: 0.8063861203396087 0.8659487315010571
NB: 0.8544460127028934 0.639931289640592


As you can see, the logistic regression model has the highest recall and the lowest precision. The Naïve Bayes model has the highest precision, but a much lower recall. The K-Nearest Neighbors model has scores above 0.80 for both precision and recall, giving it a relatively balanced performance.

[Back to top](#Index:) 
<a id='q13'></a>

### Question 13:

*5 points* 


Based on the confusion matrix, precision score, recall score, and overall model score results, which is the best model for our data?
- a) Logistic Regression
- b) K-Nearest Neighbors
- c) Naïve Bayes

Assign the letter corresponding to your answer to `best_model` as a string below.

In [121]:
### GRADED

best_model = 'b'


###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


<a id="limit"></a>
## Limitations and Next Steps

There are several key limitations with all of our classification models. 

It's first important to note that while two of our models had higher overall accuracy scores (about 75%), these scores are not very high overall. As a rough rule of thumb, a good classification model should have an accuracy above 80%, and a very good classification model that you intend to deploy should ideally be more than 90% accurate. Thus, while one model stood out as more balanced than the others, it is likely not an ideal choice for an airport looking to predict future delays. 

Second, it's always important to consider what _external_ factors might be incorporated into your datasets to improve the accuracy of your models. Strong predictors/features often require some additional work to come by, but they are worth the time you spend processing them. Some examples of additional features we can use to make our models more robust include:
- weather parameters, such as the temperature, and whether or not there was rain/snow on a given day
- details about the airline carriers, and their average frequency of delay
- details about the model of plane associated with the flight number (e.g., the plane's age, number of seats, etc.)

Third, you'll notice that all of the models tended to perform significantly better on one group of data (delayed or not delayed) than the other. This is a problem often created when we have _imbalanced classes_.

In [None]:
airlines.groupby(['IsArrDelayed'])['Origin'].count()

In [None]:
y_test.groupby(y_test).count()

There are significantly more delayed flights in the overall data set (and the test set) than flights which arrived on time. In fact, there are nearly _three times_ as many delayed flights as on-time flights. This regularly creates an issue when training classification models, which tend to be more accurate in classifying records from the larger group than from the smaller one. 

You can adjust your model to deal with imbalanced classes in one or more of the following ways:
- Test a greater number of classification models. For example, a tree-based model may perform better than the 3 we used in this exercise.
- Use a **resampling method**, which draws a new, balanced sample from your existing data. You can either _undersample_ the larger group, or _oversample_ the smaller group.
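
The two resampling options can be sketched with `pandas` alone. The frame `toy` below is made up for illustration; only the column names mirror the assignment.

```python
import pandas as pd

# Hypothetical imbalanced toy frame: 6 delayed flights, 2 on-time flights.
toy = pd.DataFrame({
    "TaxiOut": [101.0, 22.0, 39.0, 16.0, 68.0, 19.0, 18.0, 15.0],
    "IsArrDelayed": [1, 1, 1, 1, 1, 1, 0, 0],
})

majority = toy[toy["IsArrDelayed"] == 1]
minority = toy[toy["IsArrDelayed"] == 0]

# Undersample: keep only as many majority rows as there are minority rows.
undersampled = pd.concat([
    majority.sample(n=len(minority), random_state=1234),
    minority,
])

# Oversample: draw minority rows with replacement up to the majority size.
oversampled = pd.concat([
    majority,
    minority.sample(n=len(majority), replace=True, random_state=1234),
])

print(undersampled["IsArrDelayed"].value_counts().to_dict())
print(oversampled["IsArrDelayed"].value_counts().to_dict())
```

In practice, resampling would typically be applied to the training set only, so that the test set still reflects the real class balance.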

Now that you have a handle on some common classification algorithms, we suggest you continue applying additional classification methods to data sets like the Airlines Data Set, or exploring additional data sets on Kaggle in order to test your skills further.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt


In [1]:
from sklearn.datasets import load_breast_cancer


In [6]:
cancer = load_breast_cancer()

In [7]:
X = cancer.data
y = cancer.target

In [8]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [14]:
from sklearn.linear_model import LogisticRegression


In [15]:
lgr = LogisticRegression()

In [16]:
lgr.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [17]:
lgr.score(X_test, y_test)

0.9230769230769231

In [19]:
from sklearn.naive_bayes import GaussianNB

In [21]:
nbayes = GaussianNB()

In [22]:
nbayes.fit(X_train, y_train)
nbayes.score(X_test, y_test)

0.9090909090909091

In [24]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 10, p = 1)


In [25]:
knn.fit(X_train, y_train)
knn.score(X_test, y_test)

0.9090909090909091

In [26]:
from sklearn.metrics import confusion_matrix

In [27]:
# `preds` was never defined; assuming the KNN predictions were intended.
preds = knn.predict(X_test)
confusion_matrix(y_test, preds)

In [28]:
import scikitplot as skplot


In [29]:
# Assumption: the ROC curve is drawn from the KNN class probabilities.
skplot.metrics.plot_roc_curve(y_test, knn.predict_proba(X_test))