# In this notebook we will explore Quadratic Discriminant Analysis (QDA) in 2D using the Iris dataset

## Iris virginica
<img src="https://upload.wikimedia.org/wikipedia/commons/2/27/Southern_Blue_Flag_Iris_%28iris_virginica%29_-_Flickr_-_Andrea_Westmoreland.jpg" alt="Iris virginica" height="400"  width="400">
## Iris versicolor
<img src="https://upload.wikimedia.org/wikipedia/commons/4/41/Iris_versicolor_3.jpg" alt="Iris versicolor" height="400"  width="400">
## Iris setosa
<img src="https://upload.wikimedia.org/wikipedia/commons/1/11/Iris_setosa_2.jpg" alt="Iris setosa" height="400"  width="400">


## Let's download the dataset $\mathcal{D} = \{(\mathbf{x_i}, y_i) \}_{i=1}^{N}$, containing the features $\mathbf{x_i}$ and species labels $y_i$ of these flowers, from the UCI Machine Learning Repository

In [None]:
import pandas as pd # for doing exploratory data analysis
import seaborn as sns # statistical visualization
import matplotlib.pyplot as plt
from sklearn import model_selection
import numpy as np
# to make graphics inline
%matplotlib inline 
sns.set()

# Using pandas read_csv and naming the columns

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
iris_df = pd.read_csv(url, names=names)

# Let's randomly sample 5 observations

In [None]:
iris_df.sample(5)  # look at five random rows

# Statistical Summary and data sanity check

Please read about the pandas **isnull** and **any** functions.
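
As a small illustration of how `isnull` and `any` combine, here is a sketch on a toy frame. The frame `toy_df` and its values are made up for this example and are not part of the iris data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry in column "b".
toy_df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, np.nan, 6.0]})

# isnull() returns a boolean frame of the same shape;
# any() then reduces each column to a single True/False.
missing_per_column = toy_df.isnull().any()
print(missing_per_column)
```

A `True` entry means the column contains at least one missing value.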

In [None]:
# just to make sure values in different columns are not missing
iris_df.isnull().any()

## As per the above output, none of the columns has any null values

In [None]:
# Make sure the datatypes are as expected, so that arithmetic on the columns makes sense
iris_df.dtypes

The class column has type object (string). Later on, we need to convert it to integer labels.

### In any machine learning task we want to do well on future data, so let's split our data into train and test sets using sklearn. Last time we wrote the splitting logic ourselves, but these activities are so routine that sklearn has built-in functions for them

# Q 1 (1 point): Partition the data into train and test sets using the train_test_split function from sklearn. Use the following seed parameter too.
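
For orientation, here is a minimal sketch of how `train_test_split` behaves on a toy frame. The frame `toy_df` and the names `toy_train` and `toy_test` are assumptions for this illustration only; `test_size` is the held-out fraction and `random_state` plays the role of the seed.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for a dataset (hypothetical data, 100 rows).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({"feature": rng.normal(size=100),
                       "label": rng.integers(0, 3, size=100)})

# test_size is the held-out fraction; random_state fixes the shuffle
# so that the same split is produced on every run.
toy_train, toy_test = train_test_split(toy_df, test_size=0.32, random_state=3)
print(toy_train.shape, toy_test.shape)
```

Passing a DataFrame returns two DataFrames whose rows are disjoint and together cover the input.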

In [None]:
validation_size = 0.32
seed = 3
# write your code to complete the following line
train_df, test_df = 


In [None]:
print('total training observations {}'.format(train_df.shape))
print('total testing observations {}'.format(test_df.shape))
train_df.head()

# Hence, ignoring the class label, we have 4 features for each flower

## Let's see how many flowers there are per class

In [None]:
train_df.groupby('class')['class'].count()

## So we have almost equal numbers of examples in each class. Class imbalance is not a big issue here, but let's still account for it (we need to estimate the class prior probabilities)
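
As a rough sketch of what estimating the class prior means, the relative frequency of each label can be computed as below. The label array `toy_labels` is made up for this example.

```python
import numpy as np

# Hypothetical integer labels for 10 training examples and 3 classes.
toy_labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])

# bincount counts how often each label occurs;
# dividing by the number of examples gives the relative frequencies.
class_counts = np.bincount(toy_labels)
class_priors = class_counts / len(toy_labels)
print(class_priors)
```

The priors always sum to 1, which makes a handy sanity check.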

In [None]:
train_df.describe()


As per the above summary, most of the real-valued features are clustered around their mean values.

## Let's visualize the features to find the 2 most discriminative ones, as we want to model the class-conditional densities using a 2D Gaussian.

**We could use matplotlib to draw scatter plots (a kind of multivariate plot) to see which features are most discriminative, but seaborn draws attractive statistical graphics with less code, so we can spend more time on our main objective (finding the 2 most discriminative features).**

In [None]:

sns.pairplot(train_df, hue="class")

## pairplot draws scatter plots off the diagonal and the 1D distribution (histogram or density estimate) of each attribute on the diagonal.
## Read the corresponding row and column to see which pair of features a panel shows

A scatter plot gives only a vague idea of the 2D distribution of the corresponding features. Points get plotted on top of each other, so the distribution of the points may not be clear.

# It looks like no pair of raw features can classify the flowers perfectly

As we want to explore 2D QDA, let's choose **petal length and petal width (see the (3, 4) entry in the pair plot above)**


## The following figures give a much better idea of the density, although it is a fitted (estimated) density

In [None]:
for class_name, per_class_df in train_df.groupby('class'):
    print('{} \n'.format(class_name))
    sns.jointplot(x='petal_length', y='petal_width', kind='kde', data=per_class_df)
    plt.show()
#train_df.hist(by= 'class', figsize = (10, 10))

# As we want to work with only 2 features (petal_length, petal_width), we need to select only these columns from the training and test datasets

In [None]:
# most operations in pandas are performed on the underlying numpy array. We can get this array by using the values property
# Most of these operations (mean, variance, covariance matrix) can be performed using pandas, but let's use numpy
X_train = train_df[['petal_length', 'petal_width']].values
y_train = train_df['class']
X_test = test_df[['petal_length', 'petal_width']].values
y_test = test_df['class']

In [None]:
# Some debugging information
X_train.shape, y_train.shape, X_test.shape, y_test.shape, y_train.unique()

In [None]:
print(type(X_train), type(y_train))

In [None]:
train_df[['petal_length', 'petal_width']].head()

In [None]:
# this is how the first 5 training records look now
X_train[0:5]

## conversion to numpy looks good

<font color = 'red'> Make sure pandas train_df and X_train agree on values  </font>

For convenience let's convert these string labels to integers
    

In [None]:
# before mapping
y_train.head()

# Mapping class labels to integers in pandas
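
To illustrate the idea on something small, here is a sketch with a made-up label column. The names `toy_labels`, `toy_mapping`, and `toy_codes` are hypothetical and only serve this example.

```python
import pandas as pd

# Hypothetical label column; the real one holds the iris class names.
toy_labels = pd.Series(["setosa", "virginica", "setosa", "versicolor"])

# unique() keeps the order of first appearance,
# so the integer codes follow that order.
toy_mapping = {name: code for code, name in enumerate(toy_labels.unique())}
toy_codes = toy_labels.map(toy_mapping).values
print(toy_mapping, toy_codes)
```

Because the codes depend on the order in which labels first appear, the same dictionary must be reused for the test labels.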

In [None]:
# creating dictionary of mapping
mapping = {v:k for k, v in enumerate(y_train.unique())}
mapping

In [None]:
# after mapping
y_train = y_train.map(mapping)
y_train = y_train.values

In [None]:
y_train

In [None]:
# let's do the same for the test labels
y_test = y_test.map(mapping).values
y_test

# QDA steps
- Let's fit (learn the mean and covariance matrix of) a 2D Gaussian to the joint distribution of (petal_length, petal_width) for each flower category
- Once we have learned the per-class mean and covariance matrix, we can build a discriminant function for discriminating (classifying) future iris flowers

# The MLE of the per-class mean is the empirical mean, and the MLE of the covariance matrix is the empirical covariance matrix
<font size = 7>
$\mathbf{\mu_c} = \frac{1}{N_c} \sum_{i:\, y_i = c} \mathbf{x_i}$ 

$\Sigma_c = \frac{1}{N_c} \sum_{i:\, y_i = c} (\mathbf{x_i} - \mathbf{\mu_c}) (\mathbf{x_i} - \mathbf{\mu_c})^{T}$

<br>
$\pi_c = \frac{N_c}{N}$, where $N_c$ is the number of examples in class $c$ and $N$ is the total number of examples.

</font>
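
The estimates above can be sketched in numpy on a handful of made-up points. The array `X_c` and the total count `n_total` below are assumptions for illustration; the covariance divides by the number of examples in the class (the MLE), not by that number minus one.

```python
import numpy as np

# Hypothetical 2D points that all belong to a single class c.
X_c = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 3.0], [4.0, 7.0]])
n_total = 10  # assumed total number of training examples over all classes

mu_c = X_c.mean(axis=0)                     # empirical mean
centered = X_c - mu_c
sigma_c = centered.T @ centered / len(X_c)  # MLE covariance (divide by N_c)
pi_c = len(X_c) / n_total                   # class prior

# np.cov with bias=True uses the same 1/N_c normalization.
assert np.allclose(sigma_c, np.cov(X_c, rowvar=False, bias=True))
print(mu_c, sigma_c, pi_c)
```

Note that `np.cov` expects variables in rows by default, hence `rowvar=False` for data stored one example per row.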

# Q2 (3 = 1+1+1 points): calculate the mean, the inverse of the covariance matrix, and the class prior
We store the inverse of the covariance matrix because the discriminant function needs it.

In [None]:
per_class_mean_vector = []
per_class_covariance_matrix = []
per_class_prior = []
for cls_idx in mapping.values():
    #calculate cls_idx mean and covariance for each class data
    X_c = X_train[y_train == cls_idx]
    cls_prior = ## ??? write your code here
    cls_mean = ## ??? write your code here
    cls_cov = ## ??? write code here
    per_class_mean_vector.append(cls_mean)
    # store the inverse, since the discriminant function uses it (see below)
    # np.linalg.inv computes the matrix inverse
    per_class_covariance_matrix.append(np.linalg.inv(cls_cov))
    per_class_prior.append(cls_prior)
    


# Some debugging check before we jump into coding discriminant function.

In [None]:
per_class_prior

In [None]:
train_df.groupby('class')['class'].count()

<font size =5 color="red"> Make sure the per_class_prior output matches the relative class frequencies. In general it is a good habit to keep checking your calculations and code. It can save you a lot of time and trouble. </font>

# We want to classify (discriminate) examples in the test set, i.e. we want to evaluate the probability $P(y=c|x_{test})$. This is our discriminant function.
<font size = 5>
$P(y=c|x_{test}) = \frac{P(x_{test}|y=c) P(y=c)}{P(x_{test})}$
</font>

- As we discussed in the lecture, applying the same positive scaling, adding the same constant, or taking the log of every class's discriminant function doesn't change which class scores highest.
- On the right-hand side, we have modelled $P(x_{test}|y=c)$ with a 2D Gaussian density and have already estimated its parameters (mean, covariance) for each class using MLE.

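# Taking the log of $P(x_{test}|y=c) P(y=c)$ and dropping the terms that are the same for every class, the D=2 quadratic discriminant function for $\mathbf{x} \in \mathbb{R}^2$ looks like
<font size = 5>

$g_c(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}- \mathbf{\mu_c})^T \Sigma_c^{-1}( \mathbf{x}- \mathbf{\mu_c}) + \frac{1}{2}\log (\det(\Sigma_c^{-1})) + \log \pi_c $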

</font>
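
One possible way to evaluate such a score is sketched below. It computes the log of the Gaussian density times the prior, up to an additive constant shared by all classes, that is $-\frac{1}{2}(\mathbf{x}-\mu_c)^T\Sigma_c^{-1}(\mathbf{x}-\mu_c) + \frac{1}{2}\log\det(\Sigma_c^{-1}) + \log\pi_c$. The function name `qda_score` and the toy parameters are hypothetical.

```python
import numpy as np

def qda_score(mean, cov_inv, prior, x):
    """Log of Gaussian density times prior, up to an additive constant."""
    diff = x - mean
    # squared Mahalanobis distance: diff^T * cov_inv * diff
    mahalanobis_sq = np.dot(diff, np.dot(cov_inv, diff))
    log_det_inv = np.log(np.linalg.det(cov_inv))
    return -0.5 * mahalanobis_sq + 0.5 * log_det_inv + np.log(prior)

# Hypothetical class parameters: diagonal covariance with determinant 1.
toy_mean = np.array([1.0, 2.0])
toy_cov_inv = np.linalg.inv(np.array([[2.0, 0.0], [0.0, 0.5]]))

at_mean = qda_score(toy_mean, toy_cov_inv, 0.5, toy_mean)
away = qda_score(toy_mean, toy_cov_inv, 0.5, np.array([3.0, 2.0]))
print(at_mean, away)
```

The score is largest at the class mean and drops as the point moves away from it, which is the behaviour the arg max over classes relies on.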

# Q3 (4 points): Code the discriminant function, which takes the class mean, inverse covariance matrix, class prior, and a test example, and outputs the value of the discriminant function
 - 2 = 1+1 points for using the numpy dot function twice
 - 1 point for making sure you took the log
 - 1 point for making sure you took the determinant using linalg from numpy
 
We will use this discriminant function to predict class labels on the test samples. We will compare the predicted labels with the test labels to see how well we did.

In [None]:
def discriminant(mean, cov_inv,prior, x):
    '''
    args::
        mean: 1-d numpy mean  vector
        cov_inv: 2-d numpy covariance matrix inverse
        prior: class prior probability
        x: feature vector
    return:
            scalar discriminant score of x
    ''' 
    score =  ## ??? as per the above formula, write the code here
    return score

In [None]:
# for each example let's calculate this score and store in numpy array

score_mat = np.zeros((len(X_test), len(mapping)))
score_mat.shape

In [None]:
for idx, test_example in enumerate(X_test):
    for cls, (mean, cov_inv, prior) in enumerate(zip(per_class_mean_vector, per_class_covariance_matrix, per_class_prior)):
        score_mat[idx][cls] = discriminant(mean, cov_inv, prior, test_example)

**Please read about the Python built-in functions *enumerate* and *zip* if you don't know how they work**
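
As a quick refresher, the sketch below shows `zip` pairing up two lists element by element and `enumerate` attaching a running index. The lists are placeholders invented for this example.

```python
# Placeholder per-class values (hypothetical).
toy_means = ["mean_0", "mean_1"]
toy_priors = [0.4, 0.6]

pairs = []
# zip yields (mean, prior) tuples; enumerate adds the class index in front.
for idx, (m, p) in enumerate(zip(toy_means, toy_priors)):
    pairs.append((idx, m, p))
print(pairs)
```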

# Based on these discriminant values, let's try to predict class labels
See how the **numpy argmax** function gives the index of the largest value. Given the way we built the mapping and stored the scores, the indices 0, 1, 2 encode the class labels of the different flowers.
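
Here is a small sketch of the arg max step on a made-up score matrix, where rows are test examples and columns are classes. The arrays `toy_scores` and `toy_true` are hypothetical.

```python
import numpy as np

# Hypothetical scores: one row per test example, one column per class.
toy_scores = np.array([[-1.0, -3.0, -9.0],
                       [-7.0, -2.0, -4.0],
                       [-8.0, -5.0, -0.5],
                       [-2.0, -1.5, -6.0]])
toy_true = np.array([0, 1, 2, 0])

# axis=1 picks, for each row, the column index with the largest score.
toy_pred = np.argmax(toy_scores, axis=1)
# The mean of the boolean matches is the fraction of correct predictions.
toy_accuracy = np.mean(toy_pred == toy_true)
print(toy_pred, toy_accuracy)
```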

In [None]:
predicted_label = np.argmax(score_mat, axis=1)
# hence accuracy is
np.mean(predicted_label == y_test)

In [None]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Q4 (2 points): Fit QDA using QuadraticDiscriminantAnalysis and report the accuracy on the test set
<font color = 'red'>Your code should not be more than three lines. Hopefully your answer matches the previous value. </font>
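
For reference, the general fit and score pattern of the scikit-learn estimator is sketched below on synthetic data. The two Gaussian blobs and the names `X_toy`, `y_toy`, and `toy_clf` are assumptions for this example. scikit-learn may normalize the covariance slightly differently from the MLE used earlier, so tiny differences in scores are possible.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two well-separated hypothetical Gaussian blobs in 2D.
X0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(40, 2))
X1 = rng.normal(loc=[4.0, 4.0], scale=1.0, size=(40, 2))
X_toy = np.vstack([X0, X1])
y_toy = np.array([0] * 40 + [1] * 40)

toy_clf = QuadraticDiscriminantAnalysis()
toy_clf.fit(X_toy, y_toy)             # learns per-class means and covariances
toy_acc = toy_clf.score(X_toy, y_toy)  # mean accuracy on the given data
print(toy_acc, toy_clf.means_)
```

The fitted `means_` attribute has one row per class, ordered by sorted class label.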


In [None]:
# ??? write your  code here

# Side note: One can check the various parameters learned by QuadraticDiscriminantAnalysis. Let's compare its means with our mean vectors. They must match

In [None]:
clf.means_ # if you type clf. and press the tab key, you can see all the attributes and functions this class has

In [None]:
per_class_mean_vector

See this link http://scikit-learn.org/stable/auto_examples/classification/plot_lda_qda.html