# Lab 05 - Comparing Classification Algorithms

This lab aims to develop an intuition for which types of classifiers work well on different datasets. To that end, we will explore a variety of simulated datasets and test the following classifiers on each:
- **Logistic regression** - models the probability of a sample being classified as $1$ by: $$\mathbb{P}\left(y=1\right) = \sigma\left(w^\top x\right)$$ where $\sigma$ denotes the sigmoid function.
- **Decision Tree** - partitions the domain space into a disjoint union of blocks. The prediction is then obtained by a majority vote within the block: $$y = \underset{y'\in \{0,1\}}{\operatorname{argmax}} \sum_{i} \unicode{x1D7D9}\left[y_i = y'\right] \unicode{x1D7D9}\left[x_i\in B\left(x\right)\right]$$, where $B\left(x\right)$ denotes the block to which the new sample $x$ belongs.
- **$k$-NN classifier** - a model-free classifier that predicts the label from the "neighborhood" of a given sample: $$\underset{y\in \{0,1\}}{\operatorname{argmax}} \sum_{i=1}^k \unicode{x1D7D9}\left[y_{\pi_{i}} = y\right]$$ where $\pi_1,\ldots,\pi_k$ are the indices of the $k$ training samples closest to the given sample.
- **SVM classifier** - learns a separating hyperplane, and predictions are made according to the sample's location relative to the hyperplane: $$y = \operatorname{sign} \left(w^\top x + b\right)$$
- **LDA and QDA** - assume the data of each class is generated from a different Gaussian. In the case of LDA the covariance matrices of the different Gaussians are identical. In the case of QDA each Gaussian can have a different covariance matrix.
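
To make the $k$-NN majority vote above concrete, here is a minimal NumPy sketch. The function name `knn_predict` and the toy arrays are hypothetical and used only for illustration; the lab itself relies on scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np


def knn_predict(X_train, y_train, x, k=5):
    """Predict the label of x by majority vote among its k nearest training samples.

    Hypothetical helper for illustration only (binary labels 0/1, Euclidean distance).
    """
    # Euclidean distance from x to every training sample
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices pi_1, ..., pi_k of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # Count the votes for each label and return the label with the most votes
    votes = np.bincount(y_train[nearest], minlength=2)
    return int(np.argmax(votes))


# Toy data (made up): class 0 clustered near the origin, class 1 near (3, 3)
X_toy = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
                  [3.0, 3.0], [3.2, 2.8], [2.7, 3.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.2, 0.3]), k=3))  # prints 0
print(knn_predict(X_toy, y_toy, np.array([2.9, 3.0]), k=3))  # prints 1
```

No parameters are learned here: the "model" is the training set itself, which is why $k$-NN is called model free.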

In [11]:
import sys 
sys.path.append("../")
from utils import *

In [12]:
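# Explicit imports for names used below (`utils` may also re-export some of these)
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn import metrics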
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import PolynomialFeatures
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.datasets import make_moons

models = [
    LogisticRegression(penalty=None),  # the string "none" is no longer accepted by recent scikit-learn versions
    DecisionTreeClassifier(max_depth=5), 
    KNeighborsClassifier(n_neighbors=5),
    SVC(kernel='linear', probability=True),
    LinearDiscriminantAnalysis(store_covariance=True),
    QuadraticDiscriminantAnalysis(store_covariance=True)    
]
model_names = ["Logistic regression", "Decision Tree (Depth 5)", "KNN", "Linear SVM", "LDA", "QDA"]

## Generate Data Scenario
To load the desired dataset, call the corresponding function.

In [3]:
np.random.seed(1)
m = 250
symbols = np.array(["circle", "x"])

def triangle():
    x = np.random.uniform(low=-5, high=5, size=(m, 2))
    return (x, np.array(x[:,0] < x[:,1], dtype=int)), "Linear Separation"
    
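def circular(radius=9):
    x = np.random.uniform(low=-5, high=5, size=(m, 2))
    # Note: `radius` is compared against the squared distance, so the default of 9 gives a circle of radius 3
    return (x, np.array(x[:,0]**2 + x[:,1]**2 <= radius, dtype=int)), "Circle"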

def rectangles():
    x = np.random.uniform(low=-5, high=5, size=(m, 2))
    return (x, np.array( ((-5 <= x[:,0]) & (x[:,0] <= 0) & (-4 <= x[:,1]) & (x[:, 1] <= 0)) |
                         ((1 <= x[:,0])  & (x[:,0] <= 5) & (-1 <= x[:,1]) & (x[:, 1] <= 4)), dtype=int)), \
    "Two Rectangles"

def gaussians(p=.5):
    mu, cov = np.array([[-2,-2], [2,1]]), np.array([[[.6, .4], [.4, .6]], [[1.4, -0.9], [-0.9, .6 ]]])
    
    x, y = [], []
    for _ in range(m):
        y.append(int(np.random.uniform() <= p))
        x.append(np.random.multivariate_normal(mu[y[-1]], cov[y[-1]]))
    return (np.array(x), np.array(y)), "Two Gaussians"


def gaussians_non_linear():
    mu = [np.array([-5,-5]), np.array([-5,5]), np.array([5,-5]), np.array([5,5])]
    
    x, y = [], []
    for _m in range(m):
        y.append(int(np.random.choice([0,1,2,3])))
        x.append(mu[y[-1]] + np.random.multivariate_normal([0,0], [[1,0],[0,1]]))
    
    y = np.array(y)   
    return (np.array(x), np.logical_or(y == 1, y == 2).astype(int)), "Four Gaussians Two Classes"

def moons():
    return make_moons(n_samples=m, noise=0.2, random_state=0), "Moons"


# To load a different dataset, replace `circular()` with one of the other dataset functions defined above
(X, y), title = circular()
lims = np.array([X.min(axis=0), X.max(axis=0)]).T + np.array([-.4, .4])

go.Figure(data=[go.Scatter(x=X[:,0], y=X[:,1], mode="markers", showlegend=False,
                           marker=dict(color=y, symbol=symbols[y], line=dict(color="black", width=1),
                                       colorscale=[custom[0], custom[-1]]))], 
          layout=go.Layout(title= rf"$\textbf{{(1) {title} Dataset}}$")).show()

## Execute Models

In [4]:
fig = make_subplots(rows=2, cols=3, subplot_titles=[rf"$\textbf{{{name}}}$" for name in model_names],
                    horizontal_spacing = 0.01, vertical_spacing=.03)
# Use `model` (not `m`) as the loop variable so the sample count `m` defined above is not overwritten
for i, model in enumerate(models):
    fig.add_traces([decision_surface(model.fit(X, y).predict, lims[0], lims[1], showscale=False),
                    go.Scatter(x=X[:,0], y=X[:,1], mode="markers", showlegend=False,
                               marker=dict(color=y, symbol=symbols[y], colorscale=[custom[0], custom[-1]], 
                                           line=dict(color="black", width=1)) )], 
                   rows=(i//3) + 1, cols=(i%3)+1)

fig.update_layout(title=rf"$\textbf{{(2) Decision Boundaries Of Models - {title} Dataset}}$", margin=dict(t=100))\
    .update_xaxes(visible=False).update_yaxes(visible=False)

## Evaluating Models

In [15]:
fig = go.Figure(layout=go.Layout(title=rf"$\textbf{{(3) ROC Curves Of Models - {title} Dataset}}$", margin=dict(t=100)))
for i, model in enumerate(models):
    # ROC curve of each fitted model, evaluated on the same samples it was trained on
    fpr, tpr, _ = metrics.roc_curve(y, model.predict_proba(X)[:, 1])
    fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines', name=model_names[i]))
fig.show()
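
The ROC curves above are computed on the same samples the models were fitted on, which tends to give an optimistic picture. As a rough sketch of an alternative, the following self-contained snippet holds out part of a circular dataset and reports the area under the ROC curve (AUC) on the held-out part. The names `X_demo`, `y_demo` and `auc`, the 70/30 split and the seeds are illustrative assumptions, and `LogisticRegression()` is used with its default settings rather than the unpenalized variant above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Regenerate a circular dataset (same rule as `circular()`, but with its own generator and seed)
rng = np.random.default_rng(0)
X_demo = rng.uniform(low=-5, high=5, size=(250, 2))
y_demo = (X_demo[:, 0] ** 2 + X_demo[:, 1] ** 2 <= 9).astype(int)

# Hold out 30% of the samples for evaluation (split ratio is an arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

auc = {}
for name, model in [("Logistic regression", LogisticRegression()),
                    ("KNN", KNeighborsClassifier(n_neighbors=5))]:
    # Fit on the training part only, then score the held-out part
    scores = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    auc[name] = roc_auc_score(y_test, scores)
    print(f"{name}: held-out AUC = {auc[name]:.3f}")
```

A single AUC number per model makes it easier to compare the six classifiers across datasets than reading the curves by eye.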

# Time To Think...

We tried out 6 different models over 3 datasets: a linear separation of $\mathbb{R}^2$, a circular dataset and two rectangles. Which learners performed better on each of the datasets? And what in the way each learner models the data (that is, in its assumptions regarding the behaviour of the data) made it succeed?

- Over the linear separation dataset, the decision tree and $k$NN models were outperformed by the rest of the tested models. The true boundary is the diagonal line $x_1 = x_2$, which a tree can only approximate with axis-aligned steps and which $k$NN can only approximate locally from nearby samples.
- In the case of the circular dataset, no model achieved a perfect classification. The logistic regression, SVM and LDA models failed on all blue (x's) samples. All three models assume there exists some linear function that separates the two classes, but this is not the case for such a dataset. The decision tree and $k$NN classifiers were able to provide a reasonable approximation of the correct decision boundary. The QDA model was also able to provide a good approximation of the correct decision boundary, though it underestimated the size of the blue (x's) Gaussian.
- Similar to the circular dataset, over the two rectangles dataset only the decision tree, $k$NN and QDA classifiers were able to distinguish between the two classes. Notice that the QDA is not able to capture the correct shape of the decision boundary but only the area in space that is occupied by each class. 
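
The linear models fail on the circular dataset because no straight line separates the classes in the original coordinates. One way to see that the limitation lies in the representation rather than in the learner is to give logistic regression degree-2 polynomial features, so that a boundary of the form $x_1^2 + x_2^2 = c$ becomes linear in the expanded feature space. The snippet below is only a sketch under assumptions: it regenerates its own circular data, uses scikit-learn's default (regularized) `LogisticRegression`, and the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Circular dataset generated with the same rule as `circular()` (own generator and seed)
rng = np.random.default_rng(1)
X_demo = rng.uniform(low=-5, high=5, size=(250, 2))
y_demo = (X_demo[:, 0] ** 2 + X_demo[:, 1] ** 2 <= 9).astype(int)

# Plain logistic regression: linear boundary in (x1, x2)
linear = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Logistic regression on (1, x1, x2, x1^2, x1*x2, x2^2): linear boundary in the expanded space
quadratic = make_pipeline(PolynomialFeatures(degree=2),
                          LogisticRegression(max_iter=1000)).fit(X_demo, y_demo)

# Training accuracy of both variants
acc_linear = linear.score(X_demo, y_demo)
acc_quadratic = quadratic.score(X_demo, y_demo)
print(f"linear features:    {acc_linear:.3f}")
print(f"quadratic features: {acc_quadratic:.3f}")
```

The same pipeline can be passed to the plotting loop in place of a plain model to inspect the resulting decision boundary.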

Next, follow the same process to analyze the results over the Gaussians and Moons datasets.
- Although all models managed to correctly classify all samples of the Gaussians dataset, why did the different models produce different decision boundaries? Why is the decision boundary of LDA a straight line, while that of QDA is oval?
- How would it influence each model if we increase the noise of the red (circles) Gaussian in the $y=x$ direction? 
- How would the decision boundary be influenced if we change the proportion of samples from each class (for example `p=0.2` or `p=0.8`)?