# Import and Clean Data

In [1]:
from sklearn.datasets import load_iris
iris = load_iris()

In [2]:
%matplotlib inline

In [3]:
from src.makevis import *

In [4]:
iris_df = Dataframe.create_df(iris)
iris_df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species,species name
0,5.1,3.5,1.4,0.2,0,setosa
1,4.9,3.0,1.4,0.2,0,setosa
2,4.7,3.2,1.3,0.2,0,setosa
3,4.6,3.1,1.5,0.2,0,setosa
4,5.0,3.6,1.4,0.2,0,setosa


# Data Visualizations

We'll start by trying to reproduce the [scatter plot matrix of Fisher's Iris data](https://en.wikipedia.org/wiki/Iris_flower_data_set#/media/File:Iris_dataset_scatterplot.svg).

 <img src="imgs/Iris_dataset_scatterplot.svg">

In [None]:
p = Plots(iris_df)

In [None]:
p.create_scatter_matrix('species name', 'species','Iris data')

## RadViz

RadViz is a way of visualizing multivariate data. It is based on a simple spring-tension minimization algorithm. You set up a set of points in a plane; in our case they are equally spaced on a unit circle, and each point represents a single attribute. You then pretend that each sample in the data set is attached to each of these points by a spring whose stiffness is proportional to the numerical value of that attribute (values are normalized to the unit interval). The point in the plane where the sample settles (where the forces acting on it are in equilibrium) is where the dot representing that sample is drawn. Each dot is colored according to the class its sample belongs to.
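The equilibrium has a closed form: if the spring stiffness equals the normalized attribute value, the resting point is the stiffness-weighted average of the anchor positions. The following is a minimal NumPy sketch of that idea, not the code behind `Plots.make_radviz`; the function name `radviz_positions` is made up for illustration.

```python
import numpy as np

def radviz_positions(features):
    """Return the 2-D RadViz coordinates for each row of `features`."""
    features = np.asarray(features, dtype=float)
    n_attributes = features.shape[1]

    # Min-max normalize each attribute to the unit interval.
    mins = features.min(axis=0)
    ranges = features.max(axis=0) - mins
    ranges[ranges == 0] = 1.0  # avoid dividing by zero for constant columns
    weights = (features - mins) / ranges

    # One anchor per attribute, equally spaced on the unit circle.
    angles = 2 * np.pi * np.arange(n_attributes) / n_attributes
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])

    # Force balance sum_j w_j * (anchor_j - p) = 0 gives
    # p = sum_j w_j * anchor_j / sum_j w_j.
    totals = weights.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # a sample with all-zero weights sits at the origin
    return weights @ anchors / totals

# Toy data: one sample pulled only by the first attribute,
# one with no pull at all, and one pulled equally by every attribute.
toy = [[1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
positions = radviz_positions(toy)
print(positions)
```

A sample dominated by one attribute lands near that attribute's anchor, while a sample with balanced attributes lands near the center of the circle.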



In [None]:
Plots.make_radviz(iris_df, 'species name')

## Multinomial Logistic Regression with scikit-learn

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV as LRCV, LogisticRegression as LR

In [None]:
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y)

## Should we Cross-Validate?

Cross validation accomplishes two things:

* It further evaluates model performance (although a final hold-out test set also does this once training is done)
* It optimizes a hyperparameter (in this case the regularization coefficient)

Let's evaluate model performance both without and with cross validation and compare.
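Both uses can be sketched with scikit-learn directly. The snippet below is illustrative and separate from the cells in this notebook: `cross_val_score` estimates performance for a fixed regularization strength, while `LogisticRegressionCV` searches over several values of `C` (the inverse regularization strength) and keeps the best one. The values `random_state=0`, `cv=5` and `max_iter=1000`, and the `*_demo` variable names, are arbitrary choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score, train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, random_state=0
)

# Use 1: estimate the accuracy of a fixed model with 5-fold cross validation.
fold_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_tr_demo, y_tr_demo, cv=5
)
print("accuracy per fold:", fold_scores)
print("mean accuracy:", fold_scores.mean())

# Use 2: let cross validation pick the regularization strength.
cv_model = LogisticRegressionCV(cv=5, max_iter=1000).fit(X_tr_demo, y_tr_demo)
print("selected C:", cv_model.C_)

# The hold-out test set is only touched once, at the very end.
print("test accuracy:", cv_model.score(X_te_demo, y_te_demo))
```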

In [None]:
model = LR()
modelCV = LRCV()

In [None]:
model.fit(X_train, y_train)
modelCV.fit(X_train, y_train)

How should we evaluate model performance?

## Compute Confusion Matrix for each Iris Class 

A [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) counts instances by the actual and predicted values of the target. For a binary classifier it looks like this:

|                        | Actual positive | Actual negative |
|------------------------|-----------------|-----------------|
| **Predicted positive** | true positive   | false positive  |
| **Predicted negative** | false negative  | true negative   |


*True* and *false* refer to whether you are correct.

*Positive* and *negative* refer to the **predicted** result.

A *type-I error* is a false positive (which I remember because that phrase is more common than false negative); a *type-II error* is a false negative.
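With three classes the same bookkeeping applies one class at a time, treating that class as positive and the other two as negative. Here is a small sketch with made-up labels, using scikit-learn's `confusion_matrix`. Note that scikit-learn puts actual classes on the rows and predicted classes on the columns, which is the transpose of the table above. The helper name `one_vs_rest_counts` is made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for six samples and three classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# cm[i, j] = number of samples whose actual class is i and predicted class is j.
cm = confusion_matrix(y_true, y_pred)
print(cm)

def one_vs_rest_counts(cm, k):
    """Return (TP, FP, FN, TN) when class k is treated as the positive class."""
    tp = cm[k, k]
    fn = cm[k, :].sum() - tp   # actual k, predicted something else
    fp = cm[:, k].sum() - tp   # predicted k, actually something else
    tn = cm.sum() - tp - fn - fp
    return int(tp), int(fp), int(fn), int(tn)

for k in range(cm.shape[0]):
    print("class", k, "(TP, FP, FN, TN) =", one_vs_rest_counts(cm, k))
```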

Accuracy $= \frac{TP+TN}{TP+TN+FP+FN}$

Sensitivity = Recall = TPR $= \frac{TP}{TP+FN}$

FPR $= \frac{FP}{TN+FP}$

Specificity $= \frac{TN}{TN+FP}$

Precision = PPV $= \frac{TP}{TP+FP}$

NPV $= \frac{TN}{TN+FN}$
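As a worked example of the formulas above, here is a small helper applied to made-up counts TP = 1, FP = 1, FN = 1, TN = 3. The name `binary_metrics` is made up for illustration, and the helper assumes no denominator is zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute the rates defined above from the four confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),        # sensitivity, TPR
        "fpr": fp / (tn + fp),
        "specificity": tn / (tn + fp),   # equals 1 - FPR
        "precision": tp / (tp + fp),     # PPV
        "npv": tn / (tn + fn),
    }

metrics = binary_metrics(tp=1, fp=1, fn=1, tn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```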

 <img src="imgs/confusion_matrix.png">

## Vary the Acceptable Type I/II Error Threshold to Generate ROC and Precision/Recall Curves

In [None]:
prob = model.predict_proba(X_train)
probCV = modelCV.predict_proba(X_train)  # use the cross-validated model, not `model`

In [None]:
c = Curves(prob, y_train, iris)
cCV = Curves(probCV, y_train, iris)

In [None]:
prob.shape

Let's look at one of these dataframes to get a feel for what's going on as we vary our threshold.

In [None]:
y_train

In [None]:
setosa_df = c.calculate_threshold_values(0)
setosa_df.head(20)
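The internals of `Curves.calculate_threshold_values` are not shown in this notebook. As a rough idea of what such a table could contain, the sketch below sweeps the threshold over the predicted probabilities for one class and recounts TP, FP, FN and TN at each step. The function name `threshold_table`, its column names and the toy data are assumptions made for illustration, and the function assumes the labels contain both positives and negatives.

```python
import numpy as np
import pandas as pd

def threshold_table(probabilities, labels, class_index):
    """One row per threshold, treating `class_index` as the positive class."""
    scores = probabilities[:, class_index]
    is_positive = labels == class_index
    rows = []
    # Go from the strictest threshold to the most permissive one.
    for threshold in np.sort(np.unique(scores))[::-1]:
        predicted = scores >= threshold
        tp = int(np.sum(predicted & is_positive))
        fp = int(np.sum(predicted & ~is_positive))
        fn = int(np.sum(~predicted & is_positive))
        tn = int(np.sum(~predicted & ~is_positive))
        rows.append({
            "threshold": threshold,
            "tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "tpr": tp / (tp + fn),
            "fpr": fp / (fp + tn),
            "precision": tp / (tp + fp),
        })
    return pd.DataFrame(rows)

# Toy probabilities for four samples and two classes.
toy_prob = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]])
toy_labels = np.array([0, 0, 1, 1])
toy_table = threshold_table(toy_prob, toy_labels, class_index=0)
print(toy_table)
```

Lowering the threshold can only add predicted positives, so both the true positive rate and the false positive rate rise (or stay flat) as you read down the table.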

### Logistic Regression Curves

In [None]:
import matplotlib.pyplot as plt  # explicit import in case `src.makevis` does not re-export `plt`

num_classes = len(iris_df['species name'].unique())

for i in range(num_classes):
    fig, (ax0, ax1) = plt.subplots(1,2, figsize=(12,6))
    df = c.calculate_threshold_values(i)
    c.plot_roc(ax0, df, i)
    c.plot_precision_recall(ax1, df, i)

### Logistic Regression with CV Curves

In [None]:
for i in range(num_classes):
    fig, (ax0, ax1) = plt.subplots(1,2, figsize=(12,6))
    df = cCV.calculate_threshold_values(i)
    cCV.plot_roc(ax0, df, i)
    cCV.plot_precision_recall(ax1, df, i)