# Generative Classifiers: Naive Bayes

In this practical, we will use Naive Bayes to predict whether a flight will have a significant delay, using features such as the day of the week and the flight distance. The dataset requires some pre-processing, but the focus here is on the modeling.

Let's import the packages that we will use during the practical:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

## Data processing and exploration

Outline of what we will do as part of data processing:
- load the dataset 
- remove the column `Month`
- in the `CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay`, `LateAircraftDelay` columns replace `NaN` with `0`
- extract and remove the column `ArrDelay`
- define the vector `major_delay` as a binary variable

### Loading the dataset

In [None]:
# load the dataset into a dataframe called data

data = pd.read_csv("data/flights08.csv")

### First look at the data
Have a look at the data:
    
- Do the features make sense?
- What's the shape of the dataset?
- How many missing values are present?
- How many unique values are present per feature? What does that tell you?

In [None]:
print("Features: {}\n".format(list(data.columns)))
print("Shape of the data: {}\n".format(data.shape))
print("Missing values:\n")
# isna() also works on non-numeric columns, where np.isnan would raise
print(data.isna().sum())

print("\nNumber of unique values:\n")
print(data.nunique())

### Dealing with missing values
The previous step should have shown you two things:
1. Some features have a lot of missing values, in particular those associated with delay at departure (e.g. ``CarrierDelay``). From now on, we will assume that a missing value for delay amounts to no delay.
2. Some features don't have enough unique values to be informative (which ones?) and should probably be removed.

Based on this, we:
- fill the missing delay values with 0
- remove the feature(s) that don't have enough variability
- remove all remaining rows that have missing values
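
The three steps above can be tried out on a small frame first. The sketch below uses a hypothetical toy DataFrame (`toy`, with made-up values) rather than the flights data, and it assumes that "not enough variability" means at most one distinct value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the flights data
toy = pd.DataFrame({
    "Month": [1, 1, 1, 1],                        # a single unique value
    "Distance": [300.0, 1200.0, np.nan, 450.0],   # one missing value
    "CarrierDelay": [np.nan, 15.0, np.nan, 0.0],  # missing is read as "no delay"
})

# Columns with at most one distinct value carry no information
constant_cols = [col for col in toy.columns if toy[col].nunique() <= 1]

# Fill the delay column with 0, drop constant columns, then drop incomplete rows
delay_cols = ["CarrierDelay"]
cleaned = toy.copy()
cleaned[delay_cols] = cleaned[delay_cols].fillna(0)
cleaned = cleaned.drop(columns=constant_cols).dropna(axis=0)

print(constant_cols)
print(cleaned)
```

Assigning the filled columns back, instead of calling `fillna(..., inplace=True)` on a single column, avoids relying on in-place modification of a column selection.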

In [None]:
# Your code here...


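# Missing delay values are treated as "no delay".
# Assigning back avoids in-place fillna on a column selection, which newer pandas versions warn about.
delay_cols = ['CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
data[delay_cols] = data[delay_cols].fillna(0)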
    
del data['Month']

data.dropna(axis=0, inplace=True)


### Extracting the response
Our aim is to predict whether there will be a significant delay. The variable that encodes the delay is `ArrDelay`.

1. Start by having a look at the distribution of ``ArrDelay`` using ``seaborn`` (``histplot`` in recent versions; ``distplot`` is deprecated since seaborn 0.11).
2. Compute the delay threshold (`delay_threshold`) such that 70% of the positive delays are lower than that threshold. The function `np.percentile` might be useful here.
3. Form a response vector `major_delay` that is 1 when the delay is greater than or equal to `delay_threshold` and 0 otherwise.
4. Finally, remove the `ArrDelay` column from the dataset.
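
To see how the threshold and the labels relate, here is a minimal sketch on hypothetical delays (`toy_delays` is made up for illustration). It assumes the default linear interpolation of `np.percentile`.

```python
import numpy as np

# Hypothetical arrival delays in minutes; negative values are early arrivals
toy_delays = np.array([-10, -5, 0, 2, 4, 6, 8, 10, 20, 30, 60, 120, 240])

positive = toy_delays[toy_delays > 0]            # keep strictly positive delays
threshold = np.percentile(positive, 70)          # 70th percentile of those
labels = (toy_delays >= threshold).astype(int)   # 1 = major delay, 0 otherwise

# Fraction of positive delays that fall below the threshold
share_below = (positive < threshold).mean()

print(threshold, labels.sum(), share_below)
```

Note that the percentile is computed on the positive delays only, while the labels are assigned to every flight, so far fewer than 30% of all flights end up in the major-delay class.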

In [None]:
plt.figure(figsize=(8, 6))
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(data['ArrDelay'], kde=True, stat="density")

all_delays = data['ArrDelay']
positive_delays = all_delays[all_delays > 0]
delay_threshold = np.percentile(positive_delays, 70)

print("Percentage higher than threshold? {:.1f}pct.".format(
    100 * (positive_delays > delay_threshold).mean()))


major_delay = (all_delays >= delay_threshold).astype(int)

del data['ArrDelay']

Have a look at the value of the delay threshold:

In [None]:
delay_threshold

## Fit and evaluate a Naive Bayes model

### Train-test split
Next we split the data into training and testing sets.

- Use a random state of ``5175`` for comparable results
- Use 30% of the data for testing
- Stratify the training and testing sets using the `major_delay` vector to have a similar proportion of flights with a major delay in all sets
- Call the new sets `X_train`, `X_test`, `y_train`, `y_test`

In [None]:
from sklearn.model_selection import train_test_split
# Your code here...
X_train, X_test, y_train, y_test = train_test_split(data, major_delay, 
                                                    test_size=0.3, random_state=5175,
                                                    stratify=major_delay)
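
A quick way to check what stratification does is to compare class proportions before and after the split. The sketch below uses hypothetical toy labels (`y_toy`, 20% positives) instead of `major_delay`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy labels: 20 positives out of 100
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=5175, stratify=y_toy)

# With stratify, the share of positives is preserved in both subsets
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, the proportions in the two subsets would only match the full data on average, which matters more as the positive class gets rarer.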


### Fit a basic Gaussian Naive Bayes model
Create and fit a Gaussian Naive Bayes model to the training data:

In [None]:
from sklearn.naive_bayes import GaussianNB

# Your code here...
gnb = GaussianNB()

gnb.fit(X_train, y_train)
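
To make the generative idea concrete, here is a minimal from-scratch sketch of Gaussian Naive Bayes in NumPy: estimate a prior, a mean, and a variance per class and feature, then pick the class with the largest log prior plus summed per-feature Gaussian log densities. The function names and the toy data are hypothetical. The sketch is not scikit-learn's implementation, which among other things adds a small variance smoothing term (`var_smoothing`).

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class maximising log prior + sum of per-feature log densities."""
    # Broadcast to shape (n_samples, n_classes, n_features)
    diff = X[:, None, :] - means[None, :, :]
    log_density = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                          + diff ** 2 / variances[None, :, :])
    # Features are treated as independent given the class, so log densities add up
    log_joint = np.log(priors)[None, :] + log_density.sum(axis=2)
    return classes[np.argmax(log_joint, axis=1)]

# Hypothetical toy data: two well-separated classes in two dimensions
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                   rng.normal(5.0, 1.0, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

params = fit_gaussian_nb(X_toy, y_toy)
y_hat = predict_gaussian_nb(X_toy, *params)
accuracy = (y_hat == y_toy).mean()
print(accuracy)
```

Working in log space avoids numerical underflow when many small densities are multiplied together.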


### Make predictions and display the classification report
Make predictions on the test data and have a look at the classification report:

In [None]:
from sklearn.metrics import classification_report

# Your code here...
y_pred_gnb = gnb.predict(X_test)

print(classification_report(y_test, y_pred_gnb))
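
The precision, recall, and F1 columns of the report can be reproduced by hand for one class. The sketch below does so for class 1 on hypothetical labels and predictions (`y_true` and `y_hat` are made up).

```python
import numpy as np

# Hypothetical true labels and predictions (1 = major delay)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])
y_hat = np.array([1, 0, 1, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_hat == 1) & (y_true == 1))  # predicted delay, was a delay
fp = np.sum((y_hat == 1) & (y_true == 0))  # predicted delay, was not
fn = np.sum((y_hat == 0) & (y_true == 1))  # missed delay

precision = tp / (tp + fp)  # share of predicted delays that were real
recall = tp / (tp + fn)     # share of real delays that were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

With an imbalanced response such as `major_delay`, the recall of class 1 is usually more informative than the overall accuracy.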


### Look at the probabilities
Gaussian Naive Bayes gives probabilities indicating how confident the model is about each classification. Use `seaborn` (`histplot` in recent versions; `distplot` is deprecated since seaborn 0.11) to display the distribution of the modeled probabilities for class 1 (major delays).

Use `predict_proba` to get the probabilities (`score` returns the mean accuracy instead); you may also want to try `predict_log_proba`. Comment on the resulting graph.

In [None]:
# Your code here...
log_proba = gnb.predict_log_proba(X_test)

# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(log_proba[:, 1], kde=True, stat="density")
plt.show()
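
The relation between the two methods can be checked directly. The sketch below fits a model on hypothetical toy data (made up for illustration, not the flights data) and verifies that `predict_log_proba` is the logarithm of `predict_proba`.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: two overlapping classes with five features
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
                   rng.normal(1.0, 1.0, size=(100, 5))])
y_toy = np.array([0] * 100 + [1] * 100)

model = GaussianNB().fit(X_toy, y_toy)
proba = model.predict_proba(X_toy)          # shape (n_samples, n_classes)
log_proba = model.predict_log_proba(X_toy)  # same shape, values <= 0

# Each row of proba sums to one, and exp(log_proba) recovers proba
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)
logs_match = np.allclose(np.exp(log_proba), proba)
print(rows_sum_to_one, logs_match)
```

Naive Bayes probabilities often pile up near 0 and 1, since the independence assumption tends to make the model overconfident. Plotting the log probabilities spreads out the values crowded near 0, which can make the histogram easier to read.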
