# Part 0
`pulsar.csv` ([source](https://archive.ics.uci.edu/ml/datasets/HTRU2)) contains summary statistics (mean, standard deviation, kurtosis, and skewness) of two signals recorded for each pulsar candidate: the integrated profile (IP) and the dispersion-measure signal-to-noise ratio (DM-SNR) curve. Our goal is to select the best pulsar classifier for this data, using cross-validation to tune each candidate model.

# Part 1
Load and examine the dataset, noting how many of the instances are pulsars.

In [1]:
import pandas as pd

data = pd.read_csv("pulsar.csv")
display(data)
X = data.iloc[:, :8]  # The eight signal statistics (features).
y = data.iloc[:, 8]   # Binary label: 1 = pulsar, 0 = not a pulsar.
print("Pulsars:", (y==1).sum(), "out of", y.shape[0])

| | IP_Mean | IP_SD | IP_Kurt | IP_Skew | DMSNR_Mean | DMSNR_SD | DMSNR_Kurt | DMSNR_Skew | Pulsar |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 140.562500 | 55.683782 | -0.234571 | -0.699648 | 3.199833 | 19.110426 | 7.975532 | 74.242225 | 0 |
| 1 | 102.507812 | 58.882430 | 0.465318 | -0.515088 | 1.677258 | 14.860146 | 10.576487 | 127.393580 | 0 |
| 2 | 103.015625 | 39.341649 | 0.323328 | 1.051164 | 3.121237 | 21.744669 | 7.735822 | 63.171909 | 0 |
| 3 | 136.750000 | 57.178449 | -0.068415 | -0.636238 | 3.642977 | 20.959280 | 6.896499 | 53.593661 | 0 |
| 4 | 88.726562 | 40.672225 | 0.600866 | 1.123492 | 1.178930 | 11.468720 | 14.269573 | 252.567306 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17893 | 136.429688 | 59.847421 | -0.187846 | -0.738123 | 1.296823 | 12.166062 | 15.450260 | 285.931022 | 0 |
| 17894 | 122.554688 | 49.485605 | 0.127978 | 0.323061 | 16.409699 | 44.626893 | 2.945244 | 8.297092 | 0 |
| 17895 | 119.335938 | 59.935939 | 0.159363 | -0.743025 | 21.430602 | 58.872000 | 2.499517 | 4.595173 | 0 |
| 17896 | 114.507812 | 53.902400 | 0.201161 | -0.024789 | 1.946488 | 13.381731 | 10.007967 | 134.238910 | 0 |


Pulsars: 1639 out of 17898
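
The count above shows that the classes are heavily imbalanced, which matters when reading accuracy scores later on. A quick sketch of the arithmetic, using only the two numbers printed above (the variable names are illustrative and not part of the notebook):

```python
# Counts copied from the printed output above.
n_pulsars, n_total = 1639, 17898

# Fraction of positive (pulsar) instances.
pulsar_fraction = n_pulsars / n_total

# Accuracy of a trivial classifier that always predicts "not a pulsar".
majority_baseline = 1 - pulsar_fraction

print(f"Pulsar fraction: {pulsar_fraction:.4f}")                     # 0.0916
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")  # 0.9084
```

Any classifier whose accuracy is close to 0.908 is therefore doing no better than always predicting the negative class.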


# Part 2
Split the data into training and test sets using [`sklearn.model_selection.StratifiedShuffleSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html). Verify that the stratified split was performed correctly.

In [5]:
from sklearn.model_selection import StratifiedShuffleSplit

# Single stratified split: one third of the data is held out for testing.
split = StratifiedShuffleSplit(n_splits=1, test_size=1/3, random_state=0)
train_idx, test_idx = next(split.split(X, y))
X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]

# Verify that both sets preserve the overall proportion of pulsars.
print("Training Set, Pulsars:", (y_train==1).sum(), "out of", y_train.shape[0])
print("Test Set, Pulsars:", (y_test==1).sum(), "out of", y_test.shape[0])

Training Set, Pulsars: 1093 out of 11932
Test Set, Pulsars: 546 out of 5966
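
One way to make the verification explicit is to compare the pulsar fraction in each subset with the fraction in the full dataset. The sketch below uses only the counts printed in Parts 1 and 2; the dictionary layout and the tolerance of `1e-3` are arbitrary choices for illustration.

```python
# (pulsars, total) pairs copied from the printed outputs above.
counts = {"full": (1639, 17898), "train": (1093, 11932), "test": (546, 5966)}

# Pulsar fraction in each subset.
fractions = {name: pos / total for name, (pos, total) in counts.items()}
for name, frac in fractions.items():
    print(f"{name:>5}: {frac:.4f}")

# A stratified split keeps every subset close to the overall class ratio.
max_gap = max(abs(frac - fractions["full"]) for frac in fractions.values())
assert max_gap < 1e-3

# The two subsets should also partition the data exactly.
assert 11932 + 5966 == 17898
assert 1093 + 546 == 1639
```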


# Part 3
We have covered the following classifiers:
- kNN
- LDA, QDA, GNB
- decision trees, random forests
- logistic regression
- SVM

Fit these models to the training data. Where necessary, search for optimal hyperparameters using [`sklearn.model_selection.GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). If standardizing the data, wrap the scaler and the classifier in a [`sklearn.pipeline.Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) so that, within each cross-validation split, the scaler is fit on the training folds only and then applied to the validation fold.

In [6]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsClassifier as knn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as lda
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as qda
from sklearn.naive_bayes import GaussianNB as gnb
from sklearn.tree import DecisionTreeClassifier as dt
from sklearn.ensemble import RandomForestClassifier as rf
from sklearn.linear_model import LogisticRegression as lr
from sklearn.svm import SVC as svm

# Return the estimator with the best mean 5-fold cross-validated (CV) accuracy,
# refit on the full training set by GridSearchCV.
def best(estimator, param_grid):
    search = GridSearchCV(estimator, param_grid, scoring='accuracy', cv=5)
    search.fit(X_train, y_train)
    return search.best_estimator_

# Fit various classifiers to training data.

pipe_knn = Pipeline([('scaler', StandardScaler()), ('knn', knn())]) # Scaling for kNN.
knn_grid = {'knn__n_neighbors': [4, 16, 256]}
knn_model = best(pipe_knn, knn_grid)

lda_model = lda().fit(X_train, y_train)
qda_model = qda().fit(X_train, y_train)
gnb_model = gnb().fit(X_train, y_train)

dt_model = dt().fit(X_train, y_train)

rf_grid = {'n_estimators': [16, 256]}
rf_model = best(rf(), rf_grid)

pipe_lr = Pipeline([('scaler', StandardScaler()), ('lr', lr())]) # Scaling for regularization.
lr_grid = {'lr__penalty': ('l2', 'l1'), 'lr__solver': ['liblinear'], 'lr__C': [1e-1, 1e0, 1e1]}
lr_model = best(pipe_lr, lr_grid)

pipe_svm = Pipeline([('scaler', StandardScaler()), ('svm', svm())]) # Scaling for SVM.
svm_grid = {'svm__kernel': ('linear', 'poly', 'rbf', 'sigmoid')}
svm_model = best(pipe_svm, svm_grid)
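
Because `best` returns only `best_estimator_`, the selected hyperparameters and their cross-validated score are not reported in the cell above. A minimal sketch of how they could be inspected is given below. It runs on synthetic data from `make_classification` so that it is self-contained; the data, the reduced grid, and the names `X_demo`, `y_demo`, and `search` are assumptions for illustration, not the notebook's own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the pulsar training set (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)

# The scaler is refit on the training folds of every CV split, so the
# validation fold never influences the scaling statistics.
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = {"knn__n_neighbors": [4, 16]}

search = GridSearchCV(pipe, grid, scoring="accuracy", cv=5)
search.fit(X_demo, y_demo)

# The fitted search object keeps the winning settings and their mean CV score.
print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", round(search.best_score_, 4))
```

Returning the whole `search` object from `best`, rather than only `best_estimator_`, would expose the same attributes for each model fitted above.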

# Part 4
Which model has the best test performance, and which hyperparameters were used to train it?

In [4]:
from sklearn.metrics import accuracy_score as acc
import numpy as np

models = [knn_model, lda_model, qda_model, gnb_model, dt_model, rf_model, lr_model, svm_model]

# Select the model with the highest accuracy on the held-out test set.
idx = np.argmax([acc(y_test, mod.predict(X_test)) for mod in models])
print("Best:", models[idx])

Best: Pipeline(steps=[('scaler', StandardScaler()),
                ('lr',
                 LogisticRegression(C=10.0, penalty='l1', solver='liblinear'))])
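
The cell above reports only the winning model, not how close the others came. The sketch below shows one possible helper that ranks fitted models by test accuracy. The helper `rank_by_accuracy`, the synthetic data, and the two demo models are hypothetical and stand in for the notebook's eight classifiers.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def rank_by_accuracy(named_models, X_eval, y_eval):
    """Return (name, accuracy) pairs sorted from best to worst."""
    scores = {
        name: accuracy_score(y_eval, model.predict(X_eval))
        for name, model in named_models.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Synthetic, imbalanced stand-in data (assumption), split with stratification.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, stratify=y_demo, random_state=0
)

# A majority-class baseline gives the accuracies some context.
fitted = {
    "baseline": DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr),
    "gnb": GaussianNB().fit(X_tr, y_tr),
}

ranking = rank_by_accuracy(fitted, X_te, y_te)
for name, score in ranking:
    print(f"{name:>8}: {score:.4f}")
```

In the notebook, the same helper could be called with a dictionary of the eight fitted models together with `X_test` and `y_test`.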
