# Predicting Pulsar Stars

## Introduction

HTRU2 is a data set which describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey. 

Pulsars are a rare type of neutron star that produce radio emission detectable here on Earth. They are of considerable scientific interest as probes of space-time, the interstellar medium, and states of matter. 

As a pulsar rotates, its emission beam sweeps across the sky; when the beam crosses our line of sight, it produces a detectable pattern of broadband radio emission. Because pulsars rotate rapidly, this pattern repeats periodically. Pulsar searches therefore involve looking for periodic radio signals with large radio telescopes. 

Each pulsar produces a slightly different emission pattern, which varies slightly with each rotation. Thus a potential signal detection, known as a 'candidate', is averaged over many rotations of the pulsar, as determined by the length of an observation. In the absence of additional information, each candidate could potentially describe a real pulsar. However, in practice almost all detections are caused by radio frequency interference (RFI) and noise, making legitimate signals hard to find. 

Machine learning tools are now being used to automatically label pulsar candidates to facilitate rapid analysis. Classification systems in particular are being widely adopted, which treat the candidate data sets as binary classification problems. Here the legitimate pulsar examples are a minority positive class, and spurious examples the majority negative class. At present multi-class labels are unavailable, given the costs associated with data annotation. 
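Because the positive class is a minority, precision and recall on that class are often more informative than raw accuracy. The following is a minimal sketch of how those two quantities are computed from binary labels; the helper name `precision_recall` and the toy labels are illustrative assumptions, not part of the data set.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (pulsar) class.

    Hypothetical helper for illustration only.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels (made up): 2 real pulsars among 10 candidates
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
```

In this toy case the accuracy looks reasonable even though half of the real pulsars are missed, which is the situation these metrics are meant to expose.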

The data set shared here contains 16,259 spurious examples caused by RFI/noise, and 1,639 real pulsar examples. These examples have all been checked by human annotators. 

Attribute Information:

Each candidate is described by 8 continuous variables, and a single class variable. The first four are simple statistics obtained from the integrated pulse profile (folded profile). This is an array of continuous variables that describe a longitude-resolved version of the signal that has been averaged in both time and frequency (see [3] for more details). The remaining four variables are similarly obtained from the DM-SNR curve (again see [3] for more details). These are summarised below: 

1. Mean of the integrated profile. 
2. Standard deviation of the integrated profile. 
3. Excess kurtosis of the integrated profile. 
4. Skewness of the integrated profile. 
5. Mean of the DM-SNR curve. 
6. Standard deviation of the DM-SNR curve. 
7. Excess kurtosis of the DM-SNR curve. 
8. Skewness of the DM-SNR curve. 
9. Class 
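
The four statistics listed for each curve can be sketched as follows. This is an illustration only: it assumes population (biased) moment estimators, which the description above does not specify, and the function name `profile_statistics` and the sample values are hypothetical.

```python
import math


def profile_statistics(values):
    """Return (mean, std, excess kurtosis, skewness) of a 1-D sequence.

    Assumption: population moments (divide by n). The estimator used to
    build the data set may differ. A constant input would divide by zero.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    std = math.sqrt(m2)
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return mean, std, excess_kurtosis, skewness


# Made-up values standing in for an integrated profile or DM-SNR curve
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, std, excess_kurtosis, skewness = profile_statistics(sample)
print(mean, std, excess_kurtosis, skewness)
```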

HTRU2 Summary:

1. 17,898 total examples. 
2. 1,639 positive examples. 
3. 16,259 negative examples.
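
As a quick check on how imbalanced these counts are, the short sketch below computes the positive-class fraction and the accuracy of a trivial classifier that always predicts the negative class. The variable names are illustrative; the counts are the ones quoted above.

```python
# Class counts quoted in the data set description
n_positive = 1639   # real pulsars
n_negative = 16259  # RFI / noise

n_total = n_positive + n_negative
positive_fraction = n_positive / n_total

# Accuracy obtained by labelling every candidate as "not a pulsar"
majority_baseline_accuracy = n_negative / n_total

print(f"total examples: {n_total}")
print(f"positive fraction: {positive_fraction:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.3f}")
```

Any accuracy reported later in the notebook is best read against this baseline of roughly 0.908.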

## EDA

In [None]:
import pandas as pd

df = pd.read_csv("/Users/cain/Documents/Projects/Data Sets/pulsar_stars.csv")

## Pre-Processing

In [None]:
from sklearn.model_selection import train_test_split

predictors = df.drop('target_class', axis = 1)
target = df['target_class']
X_train, X_test, y_train, y_test = train_test_split(predictors, target, 
                                                    test_size = 0.25, 
                                                    random_state = 13)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

## Techniques

### Logistic Regression (Classification)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np

classifier = LogisticRegression()

# Create regularization penalty space
penalty = ['l1', 'l2']

# Create regularization hyperparameter space
C = np.logspace(0, 4, 10)

solver = ['liblinear']

# Create hyperparameter options
hyperparameters = dict(C=C, penalty=penalty, solver=solver)

# Create grid search using 5-fold cross validation
clf = GridSearchCV(classifier, hyperparameters, cv=5, verbose = 1)

# Fit grid search
best_model = clf.fit(X_train, y_train)

# View best hyperparameters
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])

In [None]:
classifier = LogisticRegression(penalty = 'l1', C = 7.74, solver = 'liblinear')

classifier.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

y_pred = classifier.predict(X_test)

# classification_report returns a string; print it so the table is readable
print(classification_report(y_test, y_pred))

### kNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier()

hyperparameters = {
    'n_neighbors': [3,5,7,9],
    'metric':['euclidean', 'manhattan']
}

clf = GridSearchCV(classifier, hyperparameters, 
                   verbose = 1, cv = 3)

# Fit grid search
best_model = clf.fit(X_train, y_train)

# View best hyperparameters
print('Best K:', best_model.best_estimator_.get_params()['n_neighbors'])
print('Best metric:', best_model.best_estimator_.get_params()['metric'])

In [None]:
classifier = KNeighborsClassifier(n_neighbors = 9, metric = 'manhattan')

classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_test)

accuracy_score(y_test, y_pred)

### SVM

In [None]:
from sklearn.svm import SVC

classifier = SVC()

hyperparameters = [{'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]


clf = GridSearchCV(classifier, hyperparameters, 
                   verbose = 1, cv = 3)

# Fit grid search
best_model = clf.fit(X_train, y_train)

# View best hyperparameters
print('Best C:', best_model.best_estimator_.get_params()['C'])

In [None]:
# Use the linear kernel so the final model matches the grid search above
classifier = SVC(kernel='linear', C=1000)
classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_test)

accuracy_score(y_test, y_pred)

### Naive Bayes (Classification)

In [None]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()

In [None]:
# Use the scaled arrays created in the Pre-Processing section
classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_test)

accuracy_score(y_test, y_pred)

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(random_state = 0)

balance = [{0:100,1:1}, {0:10,1:1}, {0:1,1:1}, {0:1,1:10}, {0:1,1:100}]
hyperparameters = dict(class_weight=balance)


clf = GridSearchCV(classifier, hyperparameters, 
                   verbose = 1, cv = 3)

# Fit grid search
best_model = clf.fit(X_train, y_train)

# View best hyperparameters
print('Best weight:', best_model.best_estimator_.get_params()['class_weight'])

In [None]:
classifier = DecisionTreeClassifier(class_weight = 'balanced', random_state = 0)

classifier.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
y_pred = classifier.predict(X_test)

accuracy_score(y_test, y_pred)

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(random_state = 0)

balance = [{0:100,1:1}, {0:10,1:1}, {0:1,1:1}, {0:1,1:10}, {0:1,1:100}]
hyperparameters = dict(class_weight=balance)


clf = GridSearchCV(classifier, hyperparameters, 
                   verbose = 1, cv = 3)

# Fit grid search
best_model = clf.fit(X_train, y_train)

# View best hyperparameters
print('Best weight:', best_model.best_estimator_.get_params()['class_weight'])

In [None]:
classifier = RandomForestClassifier(random_state = 0, class_weight = 'balanced')

classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_test)

accuracy_score(y_test, y_pred)

### Gradient Boosting

### XGBoost

In [None]:
from xgboost import XGBClassifier

classifier = XGBClassifier()

### AdaBoost

### Neural Network

In [None]:
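# X_train / X_test were standardized in the Pre-Processing section;
# alias them to the lowercase names used in the cells below
x_train, x_test = X_train, X_test

# Hold out the first 4,500 training rows for validation
x_val = x_train[:4500]
partial_x_train = x_train[4500:]

y_val = y_train[:4500]
partial_y_train = y_train[4500:]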

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(x_train)

x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(128, activation = 'relu', 
                input_dim = partial_x_train.shape[1]))
model.add(Dense(64, activation = 'relu'))
model.add(Dropout(0.25))
model.add(Dense(64, activation = 'relu'))
model.add(Dense(32, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))

model.compile(loss = 'binary_crossentropy', 
              optimizer = 'adam', 
              metrics = ['acc'])

In [None]:
history = model.fit(partial_x_train, partial_y_train, 
                    epochs = 100, 
                    batch_size = 64, 
                    validation_data = (x_val, y_val))

In [None]:
import matplotlib.pyplot as plt

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.ylim(0.05, 0.25)
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc = 'upper left')
plt.show()

In [None]:
model.evaluate(x_test, y_test)