Name: Christian Hellum Bye

# Predicting a Pulsar Star

Based on previous work, we have trained a Random Forest Classifier (RFC) as our model. In this deliverable, we explore the relative importance of the features and see if we can further improve our model.

## Preprocessing the data

In [13]:
import numpy as np
from sklearn.model_selection import train_test_split #to split the dataset

In [15]:
data = np.loadtxt('../pulsar_stars.csv', delimiter=',', skiprows=1)

In [16]:
X = data[:, 0:8] #features
y = data[:, 8] #classes

In [17]:
#split the dataset into two parts: 80 % for the training and validation sets, 20 % for the test set
X_train_validation, X_test, y_train_validation, y_test = train_test_split(X, y, test_size=0.2)

#split the larger part again: 75 % (= 60 % of the total data) goes to the training set,
#25 % (= 20 % of the total) goes to the validation set
X_train, X_validation, y_train, y_validation = train_test_split(X_train_validation, y_train_validation, test_size=0.25)
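
The two-step split above draws a different partition on every run and does not stratify by class. A minimal sketch of a reproducible, stratified variant follows. It assumes that fixing a seed and preserving class proportions is acceptable here; neither is done in the original cells, and the `random_state` value and the synthetic arrays `X_demo` and `y_demo` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the pulsar data: 1000 rows, 8 features, roughly 10 % positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 8))
y_demo = (rng.random(1000) < 0.1).astype(int)

# Step 1: hold out 20 % as the test set, keeping the class proportions.
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# Step 2: 25 % of the remainder (20 % of the total) becomes the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.25, stratify=y_tv, random_state=0)

print(len(y_tr), len(y_va), len(y_te))
```

Stratifying matters most when the positive class is rare, since an unlucky random split can otherwise leave the validation or test set with very few pulsars.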

## Feature importance

In [29]:
#import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score
import matplotlib.pyplot as plt
import pickle

In [19]:
with open('RFC_weigths.sav', 'rb') as f: #load the previously trained model
    rfc = pickle.load(f)

In [20]:
#feature labels
feat_labels = ['Mean of the integrated profile', 'Standard deviation of the integrated profile', 'Excess kurtosis of the integrated profile', 'Skewness of the integrated profile', 'Mean of the DM-SNR curve', 'Standard deviation of the DM-SNR curve', 'Excess kurtosis of the DM-SNR curve', 'Skewness of the DM-SNR curve']

In [21]:
for feature in zip(feat_labels, rfc.feature_importances_):
    print(feature)

('Mean of the integrated profile', 0.246078065428781)
('Standard deviation of the integrated profile', 0.026256904658293712)
('Excess kurtosis of the integrated profile', 0.389001337277414)
('Skewness of the integrated profile', 0.15003699764332742)
('Mean of the DM-SNR curve', 0.0442104741037564)
('Standard deviation of the DM-SNR curve', 0.10189573846224295)
('Excess kurtosis of the DM-SNR curve', 0.01988925040409805)
('Skewness of the DM-SNR curve', 0.02263123202208658)


We see that four of the features each have an importance score below 5 %, whereas the other four together account for about 89 % of the total importance. We will train a model based on the four important features:

In [33]:
important_cols = [0, 2, 3, 5] #the columns in the dataset corresponding to the important features

In [49]:
X_train_important = X_train[:, important_cols]
X_validation_important = X_validation[:, important_cols]
X_test_important = X_test[:, important_cols]

We now proceed the same way as in the previous deliverable to set the hyperparameters (all code in this part is copied from deliverable 3):

In [54]:
def rfc_f1(n_estimators, min_samples_split, max_features):
    rfc_important = RandomForestClassifier(n_estimators = n_estimators, min_samples_split = min_samples_split, max_features = max_features, class_weight='balanced')
    rfc_important.fit(X_train_important, y_train) #fits to training set
    
    #make predictions
    train_predict = rfc_important.predict(X_train_important)
    validation_predict = rfc_important.predict(X_validation_important)
    
    rfc_tr_f1 = f1_score(y_train, train_predict) #f1-score for training data
    rfc_validation_f1 = f1_score(y_validation, validation_predict) #f1-score for validation data
    
    return rfc_tr_f1, rfc_validation_f1

In [56]:
n_vals = np.arange(1,51) #n_estimators
min_sample_split_vals = np.arange(2,11) #min_samples_split
max_features_vals = np.arange(1,5) #max_features

In [59]:
f1_scores_rfc_train = np.empty((50, 9, 4))
f1_scores_rfc_validation = np.empty((50, 9, 4))
for i in range(50): #loop through values of n_estimators
    for j in range(9): #loop through values of min_samples_split
        for k in range(4): #loop through values of max_features
            f1_scores_rfc_train[i,j,k], f1_scores_rfc_validation[i,j,k] = rfc_f1(n_vals[i], min_sample_split_vals[j], max_features_vals[k])

In [60]:
#flatten the arrays
f1_tr = np.ravel(f1_scores_rfc_train)
f1_val = np.ravel(f1_scores_rfc_validation)

In [62]:
max_index_rfc = np.argmax(f1_val) #the index corresponding to the greatest f1-score for the validation data
print('Max f1-score for validation data set is:', f1_val[max_index_rfc]) #the f1-score at this index
print('The index that maximizes the f1-score is:', np.unravel_index(max_index_rfc, f1_scores_rfc_validation.shape))

Max f1-score for validation data set is: 0.900489396411093
The index that maximizes the f1-score is: (11, 3, 0)
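
The hand-written triple loop can also be expressed with scikit-learn's own search utilities. The sketch below is one possible alternative, not the method used in this deliverable: it combines `GridSearchCV` with `PredefinedSplit` so that a single fixed validation set is scored, as in the loop above. The synthetic data from `make_classification`, the reduced grid and the seed are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic, imbalanced stand-in for the four selected features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.9, 0.1], random_state=0)

# -1 marks rows that are always used for training; 0 marks the single validation fold.
test_fold = np.full(500, -1)
test_fold[375:] = 0
split = PredefinedSplit(test_fold)

# A deliberately small grid; the loop above covers many more combinations.
param_grid = {
    'n_estimators': [5, 10, 20],
    'min_samples_split': [2, 5],
    'max_features': [1, 2],
}

search = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=0),
    param_grid, scoring='f1', cv=split, refit=False)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

With this approach the best combination is reported by name in `best_params_`, which avoids translating array indices back into parameter values by hand.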


The best parameters are:
* n_estimators = 12 (the 0th index has n_estimators = 1, so index 11 has n_estimators = 12)
* min_samples_split = 5 (the 0th index has min_samples_split = 2, so index 3 has min_samples_split = 5)
* max_features = 1 (the 0th index has max_features = 1)
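
The mapping from grid index to parameter value can be checked directly by indexing the value arrays with the reported index. In this small sketch the arrays are rebuilt as defined earlier and the index `(11, 3, 0)` is copied from the output above; the names `best_index` and `best_params` are introduced here for illustration.

```python
import numpy as np

# Same value ranges as used for the grid search.
n_vals = np.arange(1, 51)
min_sample_split_vals = np.arange(2, 11)
max_features_vals = np.arange(1, 5)

# Index reported by np.unravel_index in the output above.
best_index = (11, 3, 0)
i, j, k = best_index

best_params = {
    'n_estimators': int(n_vals[i]),
    'min_samples_split': int(min_sample_split_vals[j]),
    'max_features': int(max_features_vals[k]),
}
print(best_params)
```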

We test this:

In [None]:
#hyperparameters taken from the best index (11, 3, 0) found above
rfc_important = RandomForestClassifier(n_estimators = 12, min_samples_split = 5, max_features = 1, class_weight='balanced')
rfc_important.fit(X_train_important, y_train) #fits to training set
    
test_predict = rfc_important.predict(X_test_important) #make predictions
rfc_important_test_f1 = f1_score(y_test, test_predict) #f1-score for test data

The confusion matrix for our model is:

In [282]:
confusion_test = confusion_matrix(y_test, test_predict)
print('Confusion matrix: \n')
print(confusion_test)

tn, fp, fn, tp = confusion_test.ravel()

print('\nTrue positives', tp)
print('True negatives', tn)
print('False positives', fp)
print('False negatives', fn)

Confusion matrix: 

[[3238   21]
 [  40  281]]

True positives 281
True negatives 3238
False positives 21
False negatives 40


## Presenting

The results will be presented with a poster. As a first draft, the poster will include the following:

* Problem statement: similar to what's included in deliverable 2
* Background: motivation for choosing the project, what pulsars are and why the problem exists
* The data used: description of the data, similar to what's in deliverables 1 and 2
* Methodology: similar to what's in deliverable 1, but also including the process of selecting the final model as described in this deliverable
* Results: the f1-scores and confusion matrix from this deliverable
* Discussion: compare the results to baseline performance