# Module 4: Anomaly Detection
## Practice: Outlier Reduction for Linear Regression
In this session, we'll be fitting a linear regression model (scikit-learn's `Ridge`) on the `boston` dataset included in `scikit-learn`.  

Having already worked with this dataset,
you may remember it as a simple yet broadly representative linear regression problem.


## Getting started - imports

In [None]:
%matplotlib inline
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

## Loading dataset

First order of business is to load in the dataset.  
Run the following cell to load the boston dataset and get a description of it.

In [None]:
# Load boston housing dataset
boston = load_boston()
print(boston.DESCR)

### Some preparatory processing.

In [None]:
# Select out just a few of the features 
# NOX      nitric oxides concentration (parts per 10 million)
# AGE      proportion of owner-occupied units built prior to 1
# RAD      index of accessibility to radial highways
# PTRATIO  pupil-teacher ratio by town
# LSTAT    % lower status of the population
boston_X = boston.data[:,(4,6,8,10,12)]
boston_y = boston.target
dataset = pd.DataFrame(np.column_stack([boston_X, boston_y])).sample(frac=1).reset_index(drop = True)

# Here's how to do the same with pandas
# boston_X = pd.DataFrame(boston.data[:,(4,6,8,10,12)])
# boston_y = pd.DataFrame(boston.target)
# dataset = pd.concat([boston_X, boston_y], axis=1).sample(frac=1).reset_index(drop=True)

dataset.columns = ['NOX', 'AGE', 'RAD', 'PTRATIO', 'LSTAT', 'TARGET']
dataset.describe()

#### Pull columns from dataset into variables X (everything except TARGET) and y (TARGET).

In [None]:
# Split into X (every column except TARGET) and y (the TARGET column)

# Complete code below this comment  (Question #P4001)
# ----------------------------------
X = np.array(<placeholder>)
y = np.array(<placeholder>)

# Print out some basic shape data on the arrays
print("X, y shape:", X.shape, y.shape)

**Create training/validation split** with 30% data held out.

In [None]:
# Complete code below this comment  (Question #P4002)
# ----------------------------------
<placeholder> = train_test_split(<placeholder>)

# verify split shapes and contents
print("X_train.shape: ", X_train.shape)
print("y_train.shape: ", y_train.shape)
print("X_test.shape: ", X_test.shape)
print("y_test.shape: ", y_test.shape)
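
As a point of reference, here is a minimal sketch of a 30% hold-out split on synthetic stand-in data (not the Boston data). The names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, and `y_te` are hypothetical, and `random_state=0` is only there to make the sketch repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 5 features (not the Boston data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = rng.normal(size=100)

# test_size=0.3 holds out 30% of the rows; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The four outputs come back in the order: X train, X test, y train, y test
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

The rows of `X` and `y` are shuffled together, so each target stays paired with its feature row.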

Run cross validation on a linear ridge model.

In [None]:
naive_model = Ridge()

# Complete code below this comment  (Question #P4003)
# ----------------------------------
scores = cross_val_score(<placeholder>)
print("Scores: ", scores)
print("Mean score (%d folds): " % len(scores), np.mean(scores))
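
For comparison, a small sketch of cross validation with an explicit fold count, run on synthetic linear data rather than the Boston data. The coefficients in `true_coef` and the names `X_demo`, `y_demo`, and `demo_scores` are made up for illustration. Passing `cv=3` is an assumption here; if `cv` is omitted, the number of folds depends on the installed scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic linear data with noise (a stand-in, not the Boston data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 3.0])   # hypothetical coefficients
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=120)

# cv=3 requests three folds explicitly; with no scoring argument,
# a regressor is scored with its own score method, which is R^2
demo_scores = cross_val_score(Ridge(alpha=1.0), X_demo, y_demo, cv=3)

print("Per-fold R^2:", demo_scores)
print("Mean R^2 over", len(demo_scores), "folds:", demo_scores.mean())
```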

Fit this model on the training dataset.

In [None]:
# Complete code below this comment  (Question #P4004)
# ----------------------------------
naive_model.fit(<placeholder>)

Make some predictions from the testing dataset and plot them.

In [None]:
# Complete code below this comment  (Question #P4005)
# ----------------------------------
naive_predictions = naive_model.predict(<placeholder>)
print(X_test.shape, naive_predictions.shape)
plt.scatter(y_test, naive_predictions)

# Fit a trendline for visualization
z = np.polyfit(y_test, naive_predictions, 1)
p = np.poly1d(z)
plt.title("Predicted vs. actual target values")
plt.xlabel("Actual y value")
plt.ylabel("Model y value")
plt.plot(y_test, p(y_test), 'k--')

## Issues with the above model
It is worth noting that without outlier reduction / anomaly detection in the pipeline, 
performance is relatively low. 
The score reported for a regressor is $R^2$, so the baseline is 0 
(a model that always predicts the mean of the target), not the 50% expected from random guessing in a balanced two-class problem. 
A score of about 0.7 on the test set therefore means that roughly 30% of the variance in the target is left unexplained.
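
For regressors, `score` and `cross_val_score` report $R^2$ by default. The following sketch shows the two reference points on that scale, using a handful of made-up target values (`y_true`, `mean_pred`, and `perfect_pred` are hypothetical names).

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical target values, for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# A "no-skill" regressor that always predicts the mean of the targets
mean_pred = np.full_like(y_true, y_true.mean())

# A perfect regressor reproduces the targets exactly
perfect_pred = y_true.copy()

r2_mean = r2_score(y_true, mean_pred)
r2_perfect = r2_score(y_true, perfect_pred)

print("R^2, mean predictor:   ", r2_mean)
print("R^2, perfect predictor:", r2_perfect)
```

The mean predictor lands at 0 and the perfect predictor at 1. A model that does worse than the mean predictor gets a negative score.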

## Why a trendline?
This is mainly for illustrative purposes. 
The highest-error estimations are those farthest from the trendline, and ideally, 
the line of best fit would be `f(x) = x` 
(that is, the estimate and actual values would be perfectly equal in all cases).
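
To make the `f(x) = x` ideal concrete, here is a small sketch that fits a degree-1 trendline to two sets of made-up predictions. The arrays `actual`, `perfect`, and `shrunk` are hypothetical and do not come from the models above.

```python
import numpy as np

# Hypothetical actual values and two sets of predictions
actual = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
perfect = actual.copy()          # estimates equal the actual values
shrunk = 0.5 * actual + 15.0     # estimates pulled toward the middle of the range

# np.polyfit with degree 1 returns the coefficients as [slope, intercept]
slope_p, intercept_p = np.polyfit(actual, perfect, 1)
slope_s, intercept_s = np.polyfit(actual, shrunk, 1)

print("perfect: slope=%.3f intercept=%.3f" % (slope_p, intercept_p))
print("shrunk:  slope=%.3f intercept=%.3f" % (slope_s, intercept_s))
```

A slope near 1 with an intercept near 0 corresponds to the ideal line. A slope well below 1 suggests the model underestimates high targets and overestimates low ones.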

## What methods are available to us for outlier reduction?
We could try `KMeans` or an `EllipticEnvelope` again, but we're going to explore a few more options. 

In [None]:
from sklearn.ensemble import IsolationForest

# Construct IsolationForest 
iso_forest = IsolationForest(n_estimators=250,
                             bootstrap=True).fit(X, y)

In [None]:
help(IsolationForest)

Carefully read through the API documentation for Isolation Forest!

Pull **inliers** into variables X_iso and y_iso.

In [None]:
# Get labels from classifier and be ready to cull outliers
iso_outliers = iso_forest.predict(X)==-1

# Complete code below this comment  (Question #P4006)
# ----------------------------------
X_iso = <placeholder>
y_iso = <placeholder>
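
The general pattern of labelling rows and then filtering with a boolean mask can be sketched on synthetic data as follows. Everything here (`cluster`, `far_away`, `X_demo`, `y_demo`, `X_in`, `y_in`, and the choice of `n_estimators=100`) is an assumption made for illustration, not the notebook's own data or settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: a tight cluster plus a few far-away rows (stand-in data)
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_away = rng.uniform(low=8.0, high=10.0, size=(5, 2))
X_demo = np.vstack([cluster, far_away])
y_demo = rng.normal(size=len(X_demo))   # hypothetical targets

# IsolationForest is unsupervised, so fit() only needs X
forest = IsolationForest(n_estimators=100, random_state=0).fit(X_demo)

# predict() returns +1 for inliers and -1 for outliers
labels = forest.predict(X_demo)
outlier_mask = labels == -1

# Indexing X and y with the same inverted mask keeps the rows aligned
X_in = X_demo[~outlier_mask]
y_in = y_demo[~outlier_mask]

print("flagged as outliers:", outlier_mask.sum(), "of", len(X_demo))
```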

In [None]:
# We can of course run a train-test split on the separated data as well
X_train_iso, X_test_iso, y_train_iso, y_test_iso = train_test_split(X_iso, 
                                                                    y_iso, 
                                                                    test_size=0.3)
# Fit the new model using the IsolationForest training split
iso_model = Ridge()
iso_model.fit(X_train_iso, y_train_iso)

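# Cross validate a Ridge model on the inlier training split
# (cross_val_score clones and refits the estimator on each fold)
iso_scores = cross_val_score(estimator=iso_model, 
                             X=X_train_iso, y=y_train_iso)
print(iso_scores)
print("Mean CV score w/ IsolationForest:", np.mean(iso_scores))

# Predict on the original test split so the plot is comparable to the naive model
iso_predictions = iso_model.predict(X_test)

# Plot predictions against the actual test targets
plt.scatter(y_test, iso_predictions)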

# Fit a trendline for visualization
z = np.polyfit(y_test, iso_predictions, 1)
p = np.poly1d(z)
plt.title("Predicted vs. actual target values")
plt.xlabel("Actual y value")
plt.ylabel("Model y value")
plt.plot(y_test, p(y_test), 'k-')

## Alternatives to IsolationForest: OneClassSVM
If the `IsolationForest` results leave room for improvement, it's time to try something else.  
The code below will look very similar to the above, but using `OneClassSVM` in place of the `IsolationForest`:

In [None]:
from sklearn.svm import OneClassSVM

help(OneClassSVM)

In [None]:
# Construct OneClassSVM (kernel='rbf') and fit to full dataset
svm = OneClassSVM(kernel='rbf').fit(X, y)

#### Mark outliers.
Pull **inliers** into variables X_svm and y_svm.

In [None]:
# Complete code below this comment  (Question #P4007)
# ----------------------------------

# Get labels from classifier and mark outliers
svm_outliers = <placeholder>

# Pull inliers
X_svm = <placeholder>
y_svm = <placeholder>
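
One parameter worth knowing before running this step is `nu`, which is an upper bound on the fraction of training rows treated as errors (outliers) and defaults to 0.5. The sketch below uses synthetic data and an illustrative `nu=0.1`. That value and the names `detector`, `X_demo`, `y_demo`, `X_in`, and `y_in` are assumptions, not settings taken from this notebook.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-in data (not the Boston data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = rng.normal(size=200)   # hypothetical targets

# nu=0.1 is an illustrative choice; the library default is nu=0.5
detector = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_demo)

# predict() returns +1 for inliers and -1 for outliers
labels = detector.predict(X_demo)
inlier_mask = labels == 1

# The same mask is applied to X and y so the rows stay aligned
X_in, y_in = X_demo[inlier_mask], y_demo[inlier_mask]

print("kept", inlier_mask.sum(), "of", len(X_demo), "rows")
```

With the default `nu`, a large share of the rows may be discarded, so it is worth checking how many rows survive before fitting the regression model.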

In [None]:
# Train-test split
X_train_svm, X_test_svm, y_train_svm, y_test_svm = train_test_split(X_svm, y_svm, test_size=0.3)

svm_model = Ridge().fit(X_train_svm, y_train_svm)

# Cross validate a Ridge model on the inlier training split
svm_scores = cross_val_score(estimator=svm_model, 
                             X=X_train_svm, y=y_train_svm)
print(svm_scores)
print("Mean CV score w/ OneClassSVM:", np.mean(svm_scores))

# Make predictions with the fitted model on the original test split
svm_predictions = svm_model.predict(X_test)

# Plot predictions against the actual test targets
plt.scatter(y_test, svm_predictions)

# Fit a trendline for visualization
z = np.polyfit(y_test, svm_predictions, 1)
p = np.poly1d(z)
plt.title("Predicted vs. actual target values")
plt.xlabel("Actual y value")
plt.ylabel("Model y value")
plt.plot(y_test, p(y_test), 'k-')

## Summary Analysis

Of the anomaly detection algorithms used, 
which gave the largest improvement over the naive model? 
Also consider computational cost: which ones seemed to run fast, and which slow?
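
One rough way to put numbers on the cost question is to time each `fit` call. The sketch below does this on synthetic data with a hypothetical helper, `time_fit`. Wall-clock timings depend heavily on the machine and on the data size, so treat the output as indicative only.

```python
import time
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# Synthetic stand-in data (not the Boston data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))

def time_fit(detector, X):
    """Return the wall-clock seconds taken by detector.fit(X)."""
    start = time.perf_counter()
    detector.fit(X)
    return time.perf_counter() - start

# Time one fit of each detector on the same data
timings = {
    "IsolationForest": time_fit(IsolationForest(n_estimators=250, random_state=0), X_demo),
    "OneClassSVM": time_fit(OneClassSVM(kernel="rbf"), X_demo),
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

A single run is noisy. Repeating the measurement a few times (or using `%timeit` in the notebook) gives a steadier picture.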

## Going further: performance analysis w/ `scikit-learn` modules
Compute and display the following for the models produced by each anomaly detection method:
 1. Confusion Matrix
 1. Accuracy
 1. Precision
 1. $F_1$-Score
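
The four metrics above are classification metrics, so applying them to a regressor requires turning the continuous target into classes first. The sketch below assumes one possible way of doing that: labelling each value as above or below the median of the training targets. That thresholding rule, the synthetic data, and all names (`reg`, `threshold`, `true_cls`, `pred_cls`) are illustrative assumptions, not something this notebook prescribes.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score)
from sklearn.model_selection import train_test_split

# Synthetic regression data (a stand-in, not the Boston data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

reg = Ridge().fit(X_tr, y_tr)
pred = reg.predict(X_te)

# Assumption: class 1 means "above the training median", class 0 otherwise
threshold = np.median(y_tr)
true_cls = (y_te > threshold).astype(int)
pred_cls = (pred > threshold).astype(int)

cm = confusion_matrix(true_cls, pred_cls, labels=[0, 1])
acc = accuracy_score(true_cls, pred_cls)
prec = precision_score(true_cls, pred_cls, zero_division=0)
f1 = f1_score(true_cls, pred_cls, zero_division=0)

print("Confusion matrix:\n", cm)
print("Accuracy: %.3f  Precision: %.3f  F1: %.3f" % (acc, prec, f1))
```

Other discretizations (quartile bins, a domain-specific price cutoff) would give different numbers, so state the rule used alongside the metrics.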

In [None]:
# Add your code for the above tasks here:   (Question #P4008)
#  Ridge
# ----------------------------------------






In [None]:
# Add your code for the above tasks here:   (Question #P4009)
#  IsolationForest
# ----------------------------------------






In [None]:
# Add your code for the above tasks here:   (Question #P4010)
#  OneClassSVM
# ----------------------------------------






# Save your notebook!