# Section 07 - Classification
### Introduction to Data Science EN.553.436/EN.553.636 - Fall 2021

In [1]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
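from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, confusion_matrix, r2_score, roc_curve, auc
# Classifiers used in Q1-Q3
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB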

### Classification

Classification is the task of approximating a mapping function (f) from input variables (X) to **discrete output variables (y)**. The output values are often called `labels` or `categories`. A classification problem with two classes is called a binary classification problem; a problem with more than two classes is called a multi-class classification problem.

For example, an email can be classified as belonging to one of two classes: “spam” and “not spam”.

### Classification vs Regression 

Recap: Regression is the task of approximating a mapping function (f) from input variables (X) to a **continuous output variable (y)**. A continuous output variable takes real values, typically represented as floating-point numbers (integer-valued quantities can also be regression targets).

| Classification | Regression |
| --- | --- |
| The output is discrete. | The output is continuous. |
| In classification, we try to find the decision boundary that divides the dataset into different classes. | In regression, we try to find the best-fit line, which predicts the output as accurately as possible. |
| Classification predictions can be evaluated using accuracy, whereas regression predictions cannot. | Regression predictions can be evaluated using mean squared error, whereas classification predictions cannot. |
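
As a small illustration of the two evaluation metrics named in the table, the sketch below computes accuracy for a handful of made-up class labels and mean squared error for a handful of made-up continuous targets. All values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Classification: labels are discrete, so we count exact matches.
y_true_cls = np.array([0, 1, 1, 0, 1])
y_pred_cls = np.array([0, 1, 0, 0, 1])
accuracy = np.mean(y_true_cls == y_pred_cls)  # fraction of correct labels

# Regression: targets are continuous, so we average the squared errors.
y_true_reg = np.array([2.0, 3.5, 1.0, 4.0])
y_pred_reg = np.array([2.5, 3.0, 1.0, 5.0])
mse = np.mean((y_true_reg - y_pred_reg) ** 2)

print(f"accuracy = {accuracy:.2f}, MSE = {mse:.3f}")
```

The same quantities are available as `accuracy_score` and `mean_squared_error` in `sklearn.metrics`, both of which are imported at the top of this notebook.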

<img src=https://static.javatpoint.com/tutorial/machine-learning/images/regression-vs-classification-in-machine-learning.png alt="image info" style="width: 500px;"/>



### DataSet

Because of the spread of COVID-19, vaccines need to be developed as quickly as possible. The dataset we use in this notebook is for [B-cell](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&ved=2ahUKEwiDidjy38fzAhX5oHIEHVEOC0IQFnoECAwQAw&url=https%3A%2F%2Fwww.hindawi.com%2Fjournals%2Fjir%2F2017%2F2680160%2F&usg=AOvVaw1qjR4h3uzKaJu1p3eWWYIW) epitope prediction. An epitope is the portion of an antigen that binds to an immunoglobulin (antibody).


`input_bcell.csv` : this is our main training data. It has 14387 rows and 14 columns.

Column interpretations:
* `parent_protein_id`: parent protein ID
* `protein_seq`: parent protein sequence
* `start_position`: start position of peptide
* `end_position`: end position of peptide
* `peptide_seq`: peptide sequence
* `chou_fasman`: peptide feature, β turn
* `emini`: peptide feature, relative surface accessibility
* `kolaskar_tongaonkar`: peptide feature, antigenicity
* `parker`: peptide feature, hydrophobicity
* `isoelectric_point`: protein feature
* `aromaticity`: protein feature
* `hydrophobicity`: protein feature
* `stability`: protein feature
* `target`: antibody valence (target value containing 0 and 1)

### Q1 - KNN

***KNN*** (K-nearest neighbors) calculates the distance from a new data point to every training data point. Any distance metric can be used, e.g. Euclidean or Manhattan. It then selects the K nearest training points, where K is a positive integer, and assigns the new point to the class to which the majority of those K points belong.
<img src= https://s3.amazonaws.com/stackabuse/media/k-nearest-neighbors-algorithm-python-scikit-learn-2.png alt="image info" style="width: 300px;"/>

For instance, suppose we want to decide whether a new point 'X' belongs to the Blue class or the Red class. We can use KNN with the number of neighbors set to `3`, which means finding the `3 nearest points` to X. KNN first calculates the distance between X and all other points, then picks the 3 nearest points (circled above). Since there are two Red points and one Blue point inside the circle, we choose Red as the class of X.
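
The voting procedure described above can be sketched from scratch in a few lines. This is a minimal illustration, not a replacement for `KNeighborsClassifier`: the function name `knn_predict` and the toy coordinates are made up for this example, arranged so that two Red points and one Blue point are closest to X.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: the first three points lie close to X, the rest far away
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.3],
                    [4.0, 4.0], [4.2, 3.8], [3.9, 4.1]])
y_train = np.array(["Red", "Red", "Blue", "Blue", "Blue", "Red"])
x_new = np.array([1.0, 1.1])

print(knn_predict(X_train, y_train, x_new, k=3))
```

Swapping `np.linalg.norm` for a sum of absolute differences would give the Manhattan-distance variant.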

**Read the dataset, create a correlation matrix, split the data with 30% held out for testing, and apply KNN with the number of neighbors equal to 3. What is the prediction accuracy?**
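
One possible shape for this workflow is sketched below. Because the CSV file is not bundled with this text, the sketch builds a small synthetic DataFrame that borrows a few of the numeric column names; the values, the way `target` is generated, and `random_state=0` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# With the real file this step would be: df = pd.read_csv("input_bcell.csv")
# Here a synthetic stand-in is built instead (values are made up).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "chou_fasman": rng.normal(1.0, 0.1, n),
    "emini": rng.normal(1.0, 0.5, n),
    "parker": rng.normal(1.5, 1.0, n),
    "stability": rng.normal(40.0, 10.0, n),
})
df["target"] = (df["parker"] + rng.normal(0, 0.5, n) > 1.5).astype(int)

# Pairwise correlations between all numeric columns
corr = df.corr()

# Features and label, then a 70/30 train/test split
X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# KNN with 3 neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
acc = accuracy_score(y_test, knn.predict(X_test))
print(f"accuracy = {acc:.3f}")
```

The real file also contains text columns (`parent_protein_id`, `protein_seq`, `peptide_seq`), which would need to be dropped before computing correlations or fitting the classifier. The correlation matrix can be visualized with `sns.heatmap(corr)`.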

In [2]:
# read table

In [3]:
# correlation matrix

In [4]:
# prepare training and testing set

In [5]:
# do KNN

**Try different numbers of neighbors, such as [1,3,4,6,10,30,50], in the KNN classifier and plot the prediction accuracy. What do you observe? How should the number of neighbors in KNN be chosen?**
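
A rough sketch of such a sweep follows, using a from-scratch KNN on synthetic two-class data rather than the B-cell dataset. The helper name `knn_predict_all`, the data-generating parameters, and the tie-breaking rule are assumptions of this example.

```python
import numpy as np

def knn_predict_all(X_train, y_train, X_test, k):
    """Predict 0/1 labels for every row of X_test by majority vote of k neighbors."""
    # Pairwise Euclidean distances, shape (n_test, n_train)
    diffs = X_test[:, None, :] - X_train[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Labels of the k nearest training points for each test point
    nearest = np.argsort(distances, axis=1)[:, :k]
    neighbor_labels = y_train[nearest]
    # With 0/1 labels, the majority class is 1 when the mean vote exceeds 0.5
    # (ties, possible for even k, fall to class 0 in this sketch)
    return (neighbor_labels.mean(axis=1) > 0.5).astype(int)

# Synthetic two-class data (an assumption, not the B-cell dataset)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (150, 2)), rng.normal(1.5, 1.0, (150, 2))])
y = np.array([0] * 150 + [1] * 150)

# Shuffle, then hold out 30% for testing
order = rng.permutation(len(X))
X, y = X[order], y[order]
n_test = len(X) * 3 // 10
X_test, y_test = X[:n_test], y[:n_test]
X_train, y_train = X[n_test:], y[n_test:]

accuracies = {}
for k in [1, 3, 4, 6, 10, 30, 50]:
    y_pred = knn_predict_all(X_train, y_train, X_test, k)
    accuracies[k] = float(np.mean(y_pred == y_test))
    print(f"k={k:>2}: accuracy={accuracies[k]:.3f}")
```

The resulting dictionary can be turned into a line graph with `plt.plot(list(accuracies), list(accuracies.values()), marker="o")`. Choosing K by test accuracy alone risks tuning to the test set, so a separate validation split or cross-validation is the safer way to pick it.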

In [6]:
# see different number of neighbors

In [7]:
# generate a line graph with above result

Observation:

### Q2 - LDA

***LDA*** (linear discriminant analysis) is a `dimensionality reduction` technique that can also be used as a classifier. It reduces the number of dimensions (i.e. variables or features) in a dataset while retaining as much class-discriminating information as possible. LDA works by calculating summary statistics of the input features for each class label, such as the mean and standard deviation. It uses this information to create a new axis and `projects the data` onto that axis in such a way as to minimize the variance within each class and maximize the distance between the means of the two classes.
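
For two classes, the projection described above can be sketched with Fisher's formulation: the new axis is proportional to the inverse within-class scatter matrix applied to the difference of the class means. The example below is a minimal illustration on synthetic data, assuming both classes share a covariance matrix and have equal priors; it is not how `LinearDiscriminantAnalysis` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic two-class data with a shared covariance (an assumption of this sketch)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X1 = rng.multivariate_normal([2.0, 0.0], cov, size=200)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter: summed scatter of each class around its own mean
Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
# Fisher direction: large gap between projected means relative to within-class spread
w = np.linalg.solve(Sw, mu1 - mu0)

# Project both classes onto the new one-dimensional axis
z0, z1 = X0 @ w, X1 @ w
# Classify with the midpoint between the projected class means (equal priors assumed)
threshold = 0.5 * (mu0 @ w + mu1 @ w)
accuracy = (np.sum(z0 <= threshold) + np.sum(z1 > threshold)) / (len(z0) + len(z1))
print(f"direction = {w}, accuracy = {accuracy:.3f}")
```

Unlike PCA, the direction `w` depends on the class labels: it is chosen to separate the classes, not to capture the largest overall variance.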

***Comparing LDA and PCA:***
<img src=https://nirpyresearch.com/wp-content/uploads/2018/11/PCAvsLDA-1024x467.png alt="image info" style="width: 500px;"/>

**Apply LDA to do classification. What is the prediction accuracy?**

In [8]:
# create LDA model

### Q3 - Naive Bayes

***Naive Bayes*** is a classification algorithm based on Bayes' theorem. To decide between two labels with a Naive Bayes classifier, we compute the posterior probability of each label (or their ratio) and assign the new point to the most probable class. The method assumes `independence` among predictors given the class.

***General Steps:***
* Step 1: Calculate the prior probability of each class
* Step 2: Find the likelihood of each attribute for each class
* Step 3: Calculate the posterior probability using Bayes' theorem
* Step 4: Make a prediction by choosing the class with the highest posterior probability
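
Written out, the steps above correspond to the following formulas. The notation is introduced here for illustration: $x_1, \dots, x_d$ are the attributes of one data point and $y$ is a class label.

$$
P(y \mid x_1, \dots, x_d) = \frac{P(y)\,\prod_{j=1}^{d} P(x_j \mid y)}{P(x_1, \dots, x_d)},
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{j=1}^{d} P(x_j \mid y)
$$

The product over attributes is where the independence assumption enters. The denominator is the same for every class, so it can be dropped when only the most probable class is needed.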

**Can you manually compute predictions and the prediction accuracy using the Gaussian Naive Bayes method?**
* Step 1: calculate the mean and std of each feature for the two classes (target = 0 and target = 1)
* Step 2: compute the likelihood using the Gaussian density function
* Step 3: compute the posterior
* Step 4: compare the posterior probabilities
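
The four steps can be sketched as follows on synthetic data. The helper `gaussian_pdf`, the class means, and the sample sizes are assumptions made for this illustration; the posterior is left unnormalized because only the comparison between classes matters.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Gaussian density evaluated elementwise."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Synthetic two-class data with three features (not the B-cell dataset)
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, (100, 3)), rng.normal(2.0, 1.0, (100, 3))])
y_train = np.array([0] * 100 + [1] * 100)
X_test = np.vstack([rng.normal(0.0, 1.0, (30, 3)), rng.normal(2.0, 1.0, (30, 3))])
y_test = np.array([0] * 30 + [1] * 30)

posteriors = []
for c in (0, 1):
    Xc = X_train[y_train == c]
    # Step 1: per-feature mean and std for this class, plus the class prior
    mean, std = Xc.mean(axis=0), Xc.std(axis=0)
    prior = len(Xc) / len(X_train)
    # Step 2: likelihood = product of per-feature Gaussian densities
    # (this product is the independence assumption)
    likelihood = gaussian_pdf(X_test, mean, std).prod(axis=1)
    # Step 3: unnormalized posterior = prior * likelihood
    posteriors.append(prior * likelihood)

# Step 4: pick the class with the larger posterior
y_pred = np.argmax(np.column_stack(posteriors), axis=1)
accuracy = np.mean(y_pred == y_test)
print(f"accuracy = {accuracy:.3f}")
```

With many features, the product of densities can underflow to zero, so summing log-densities and comparing log-posteriors is the more robust variant.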

In [9]:
# compute mean and std

In [10]:
# compute likelihood

In [11]:
# compute posterior

In [12]:
# compare

**Apply Gaussian Naive Bayes to do classification. What is the prediction accuracy? Why is it higher/lower compared to the previous two models?**

In [13]:
# Gaussian Naive Bayes model

Reason:

### Q4 - Combine Normalization and PCA with the Classification Problem

Combine the methods we learned before with this classification problem. First apply normalization to the KNN and LDA models. Then apply both normalization and PCA to those two models. Does normalization or PCA help increase the accuracy score? Why?
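
The preprocessing part of this question can be sketched with plain NumPy, mirroring what `StandardScaler` and `PCA` do. The synthetic features, their scales, and the choice of two components are assumptions of this example; the classifier would then be fit on the transformed training data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features on very different scales (made-up data)
scales = np.array([1.0, 10.0, 100.0, 1000.0])
X_train = rng.normal(size=(100, 4)) * scales
X_test = rng.normal(size=(40, 4)) * scales

# Standardize using statistics from the training set only, so that no
# information from the test set leaks into the preprocessing
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

# PCA via SVD of the standardized (already centered) training data
_, singular_values, Vt = np.linalg.svd(X_train_std, full_matrices=False)
n_components = 2
components = Vt[:n_components]  # principal directions, shape (2, 4)
X_train_pca = X_train_std @ components.T
X_test_pca = X_test_std @ components.T

# Share of total variance captured by each principal direction
explained_variance_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
print(explained_variance_ratio)
```

Distance-based methods such as KNN are sensitive to feature scale, because a feature measured in thousands dominates the Euclidean distance unless the data is standardized first.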

In [14]:
# Normalize data 

In [15]:
# Normalize data and apply PCA

**Reference**

document:

https://www.javatpoint.com/regression-vs-classification-in-machine-learning

https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/

https://medium.com/machine-learning-algorithms-from-scratch/naive-bayes-classification-from-scratch-in-python-e3a48bf5f91a

data: 

https://future-architect.github.io/articles/20200801/

https://www.kaggle.com/futurecorporation/epitope-prediction?select=input_bcell.csv
