In [1]:
pip install -U altair

Collecting altair
  Using cached altair-5.1.2-py3-none-any.whl (516 kB)
Installing collected packages: altair
  Attempting uninstall: altair
    Found existing installation: altair 4.2.2
    Uninstalling altair-4.2.2:
      Successfully uninstalled altair-4.2.2
Successfully installed altair-5.1.2
Note: you may need to restart the kernel to use updated packages.


# Group project: pulsar

INTRODUCTION

DATA CLEANING & WRANGLING

To analyze a data set accurately, it is crucial to first inspect and wrangle the data to catch formatting issues or null values. This helps in choosing the best analysis method for the data. First, the required packages are imported.

In [2]:
import pandas as pd
import altair as alt
import numpy as np
from sklearn import set_config
from sklearn.model_selection import train_test_split # importing necessary libraries

In [3]:
set_config(transform_output="pandas") # set output as dataframes instead of arrays

The data set is downloaded from the web and read using the pandas function read_csv. The first five rows of the data set are shown below:

In [4]:
htru2='https://drive.google.com/uc?export=download&id=1kLqmyQYnEt5M-stWnzz35p_9Zk2-FOZD'
pulsar= pd.read_csv(htru2,names=[1,2,3,4,5,6,7,8,9],index_col=False) # reading dataset from data file

In [5]:
pulsar.head(5)

Unnamed: 0,1,2,3,4,5,6,7,8,9
0,140.5625,55.683782,-0.234571,-0.699648,3.199833,19.110426,7.975532,74.242225,0
1,102.507812,58.88243,0.465318,-0.515088,1.677258,14.860146,10.576487,127.39358,0
2,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,0
3,136.75,57.178449,-0.068415,-0.636238,3.642977,20.95928,6.896499,53.593661,0
4,88.726562,40.672225,0.600866,1.123492,1.17893,11.46872,14.269573,252.567306,0


The data is organized but lacks clear variable names and meaningful 'type' column values. We therefore used the rename function to give the columns meaningful names and the replace function to relabel the type values: each column name corresponds to its variable, and the 0s and 1s in the type column become 'others' and 'pulsar'. (See Table 1)

In [6]:
# renaming column names to meaningful names
pulsar=pulsar.rename(columns={
    1:'mean_IP', # Mean of the integrated profile.
    2:'SD_IP', # Standard deviation of the integrated profile.
    3:'EK_IP', # Excess kurtosis of the integrated profile.
    4:'S_IP', # Skewness of the integrated profile.
    5:'mean_DM-SNR', # Mean of the DM-SNR curve.
    6:'SD_DM-SNR', # Standard deviation of the DM-SNR curve.
    7:'EK_DM-SNR',# Excess kurtosis of the DM-SNR curve.
    8:'S_DM-SNR', # Skewness of the DM-SNR curve.
    9:'type'}) # type of star (others or pulsar)
pulsar['type']=pulsar['type'].replace({
    0:'others',
    1:'pulsar'}) # replacing values of type to more meaningful values

In [7]:
pulsar.head(5)

Unnamed: 0,mean_IP,SD_IP,EK_IP,S_IP,mean_DM-SNR,SD_DM-SNR,EK_DM-SNR,S_DM-SNR,type
0,140.5625,55.683782,-0.234571,-0.699648,3.199833,19.110426,7.975532,74.242225,others
1,102.507812,58.88243,0.465318,-0.515088,1.677258,14.860146,10.576487,127.39358,others
2,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,others
3,136.75,57.178449,-0.068415,-0.636238,3.642977,20.95928,6.896499,53.593661,others
4,88.726562,40.672225,0.600866,1.123492,1.17893,11.46872,14.269573,252.567306,others


Table 1

The data frame is then split into training and testing sets (75% training, stratified by type), which allows the model's accuracy to be tested later. (See Table 2)

In [8]:
pulsar_train, pulsar_test = train_test_split(
    pulsar, train_size=0.75, stratify=pulsar["type"]
) # splitting testing and training data

In [9]:
pulsar_train.reset_index()

Unnamed: 0,index,mean_IP,SD_IP,EK_IP,S_IP,mean_DM-SNR,SD_DM-SNR,EK_DM-SNR,S_DM-SNR,type
0,12638,111.781250,49.141686,0.060100,-0.071402,2.544314,18.475797,8.529841,81.072589,others
1,13420,127.007812,47.967843,0.308370,0.028844,17.561037,50.307836,2.768576,6.297872,others
2,16836,87.296875,38.585813,0.674328,1.971914,4.970736,21.946994,5.562278,35.768750,others
3,15729,120.976562,42.774384,0.158774,0.358653,1.137124,11.280014,15.789328,299.054531,others
4,14320,146.359375,53.171117,-0.132668,-0.348910,1.872910,16.061147,11.352980,140.696787,others
...,...,...,...,...,...,...,...,...,...,...
13418,8844,105.757812,49.003945,0.582515,0.302725,1.887124,17.536483,10.964010,128.040165,others
13419,12674,129.171875,50.174556,-0.125990,-0.100984,2.621237,18.332296,9.018426,90.526530,others
13420,17062,123.468750,57.329494,0.027868,-0.726757,11.120401,40.701711,3.716798,12.877808,others
13421,5066,128.960938,52.481009,0.106973,-0.360329,6.277592,31.425628,5.151247,26.062507,others


Table 2
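
As a minimal sketch of what the stratified split does, the example below uses a small made-up data frame (the values and the 180/20 class counts are assumptions for illustration). It also passes random_state, which the split above does not set, so that the same rows are selected on every run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the pulsar data: 180 'others' and 20 'pulsar' rows (hypothetical values).
toy = pd.DataFrame({
    "mean_IP": range(200),
    "type": ["others"] * 180 + ["pulsar"] * 20,
})

# random_state is an added assumption; the original split does not set one.
toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["type"], random_state=1
)

# Stratification keeps the class proportions close to 90/10 in both parts.
train_share = toy_train["type"].value_counts(normalize=True)
test_share = toy_test["type"].value_counts(normalize=True)
print(train_share)
print(test_share)
```

Without a fixed seed, every run of the notebook produces a different training set, so tables and scores further down can change between runs.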

To work with the data, we need to know its basic properties, so we use the info() function to check the column types and non-null counts. (See List 1)


In [10]:
pulsar_train.info() # basic information about training data

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13423 entries, 12638 to 16274
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   mean_IP      13423 non-null  float64
 1   SD_IP        13423 non-null  float64
 2   EK_IP        13423 non-null  float64
 3   S_IP         13423 non-null  float64
 4   mean_DM-SNR  13423 non-null  float64
 5   SD_DM-SNR    13423 non-null  float64
 6   EK_DM-SNR    13423 non-null  float64
 7   S_DM-SNR     13423 non-null  float64
 8   type         13423 non-null  object 
dtypes: float64(8), object(1)
memory usage: 1.0+ MB


List 1

We see that all columns have dtype float64 except the renamed "type" column, which has dtype object. The non-null count is the same (13423) for every column.

To further check whether there are any null values that should be dropped, the number of null values in each column is calculated. (See List 2)

In [11]:
count_nan = pulsar_train.isnull().sum() # total number of null values in each column 
count_nan 

mean_IP        0
SD_IP          0
EK_IP          0
S_IP           0
mean_DM-SNR    0
SD_DM-SNR      0
EK_DM-SNR      0
S_DM-SNR       0
type           0
dtype: int64

List 2

There are no null values in any column, so nothing needs to be dropped.
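
For comparison, here is a small sketch of how the same check behaves when values are missing. The toy rows are made up for illustration and are not part of the pulsar data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with two missing entries.
toy = pd.DataFrame({
    "mean_IP": [140.5, np.nan, 103.0],
    "S_IP": [-0.70, -0.52, 1.05],
    "type": ["others", "others", None],
})

null_counts = toy.isnull().sum()  # per-column count of missing values
cleaned = toy.dropna()            # keep only rows with no missing values
print(null_counts)
print(len(cleaned))
```

Here two of the three rows contain a missing value, so dropna() would keep a single row.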

Next, the column-wise means are calculated for pulsars and other sources to identify any differences between them. (See Table 3)

In [12]:
mean_value=pulsar_train.groupby('type').mean() # mean values of each column for pulsars and other stars
mean_value

type,mean_IP,SD_IP,EK_IP,S_IP,mean_DM-SNR,SD_DM-SNR,EK_DM-SNR,S_DM-SNR
others,116.591802,47.329077,0.210015,0.37921,8.725195,23.185198,8.877485,114.027863
pulsar,56.659937,38.661577,3.125507,15.535663,49.314481,56.16263,2.80656,18.755646


Table 3

The results show a clear difference in the mean value of every variable between other sources and pulsars, indicating distinctive characteristics of the two classes.

Next, we count the pulsar and non-pulsar observations to check whether the classes are balanced, since unequal class sizes can bias the classifier toward the larger class. (See List 3)

In [13]:
count_obs = pulsar_train.groupby('type')['type'].count()  # total number of pulsar observations and other star observations
count_obs 

type
others    12194
pulsar     1229
Name: type, dtype: int64

List 3

Most observations in the dataset are of origins other than pulsars (12,194 versus 1,229 in the training set), which means pulsars are rare. Resampling of the pulsar observations is therefore needed before model training.
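
The size of the imbalance can be worked out directly from the counts in List 3. The sketch below copies those two counts by hand rather than recomputing them from the data.

```python
import pandas as pd

# Class counts copied from List 3.
count_obs = pd.Series({"others": 12194, "pulsar": 1229}, name="type")

share = count_obs / count_obs.sum()                # fraction of the training set in each class
ratio = count_obs["others"] / count_obs["pulsar"]  # number of 'others' rows per pulsar row
print(share.round(3))
print(round(ratio, 1))
```

Pulsars make up about 9.2% of the training rows, roughly one pulsar for every ten other sources. A classifier that always answers 'others' would therefore already be right about 90.8% of the time on this set, which is worth keeping in mind when reading accuracy figures.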

Graph 1 plots the mean of the IP against the skewness of the IP for pulsars and other sources. The two classes are largely separated, with some overlap in the middle where KNN predictions can be challenging. (See Graph 1)

In [14]:
alt.data_transformers.disable_max_rows()
pulsar_mean_plot=alt.Chart(pulsar_train,title='mean IP versus Skewness of IP').mark_point(opacity=0.2).encode(
    x=alt.X('mean_IP'),
    y=alt.Y('S_IP'),
    color='type')
pulsar_mean_plot 

Graph 1

DATA ANALYSIS

In this section, the training data is used to build two models: one based on the Integrated Profile (IP) features and another based on the DM-SNR curve features. First, the pulsar rows in the training data are upsampled to account for the rareness of pulsars, so that the classifier is not dominated by the majority class. The table below shows the upsampled training data. (See Table 4)

In [15]:
from sklearn.utils import resample
np.random.seed(1)
type_pulsar=pulsar_train[pulsar_train['type']=='pulsar']
type_others=pulsar_train[pulsar_train['type']=='others']
type_pulsar_upsampled = resample(
    type_pulsar, n_samples=type_others.shape[0],random_state=1
)

upsampled_pulsar = pd.concat((type_pulsar_upsampled ,type_others))
upsampled_pulsar 

Unnamed: 0,mean_IP,SD_IP,EK_IP,S_IP,mean_DM-SNR,SD_DM-SNR,EK_DM-SNR,S_DM-SNR,type
3351,90.789062,35.814987,1.782977,6.570318,7.310201,33.496492,5.158216,27.061486,pulsar
568,23.671875,31.177694,5.157291,30.346608,69.510870,62.256018,1.346743,1.181281,pulsar
2981,85.578125,45.344620,1.570524,3.205262,1.832776,14.834724,12.665826,180.085976,pulsar
11207,46.570312,38.001991,3.909549,16.579355,48.403846,66.276140,1.453979,1.245745,pulsar
5386,37.054688,35.938382,3.992494,18.970213,61.604515,75.614517,0.983834,-0.211335,pulsar
...,...,...,...,...,...,...,...,...,...
8844,105.757812,49.003945,0.582515,0.302725,1.887124,17.536483,10.964010,128.040165,others
12674,129.171875,50.174556,-0.125990,-0.100984,2.621237,18.332296,9.018426,90.526530,others
17062,123.468750,57.329494,0.027868,-0.726757,11.120401,40.701711,3.716798,12.877808,others
5066,128.960938,52.481009,0.106973,-0.360329,6.277592,31.425628,5.151247,26.062507,others


Now pulsars and other stars have equal numbers of samples.

In [16]:
count_obs = upsampled_pulsar.groupby('type')['type'].count()  # total number of pulsar observations and other star observations
count_obs 

type
others    12194
pulsar    12194
Name: type, dtype: int64
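
The upsampling step can be sketched on a tiny made-up data frame (the values and the 6/2 class counts are assumptions for illustration). The point of the sketch is that resample draws existing minority rows with replacement and does not create new observations.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical data: 6 'others' rows and 2 'pulsar' rows.
toy = pd.DataFrame({
    "S_IP": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 15.0, 16.0],
    "type": ["others"] * 6 + ["pulsar"] * 2,
})
minority = toy[toy["type"] == "pulsar"]
majority = toy[toy["type"] == "others"]

# Sample the minority class with replacement until it matches the majority size.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=1)
balanced = pd.concat([minority_up, majority])

counts = balanced["type"].value_counts()
# Upsampling only repeats existing rows, so at most 2 distinct pulsar rows appear.
n_unique_pulsar_rows = minority_up.index.nunique()
print(counts)
print(n_unique_pulsar_rows)
```

Because the same pulsar row can appear many times, copies of one row can end up in both the training and the validation fold during cross-validation, so scores computed on upsampled data may look more optimistic than scores on unseen data.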

Next, the IP features are selected from the upsampled data and standardized with StandardScaler.

In [17]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer

pulsar_training_IP = upsampled_pulsar[['mean_IP','SD_IP','EK_IP','S_IP','type']]
IP_preprocessor = make_column_transformer(
    (StandardScaler(), ['mean_IP','SD_IP','EK_IP','S_IP']),
    verbose_feature_names_out=False
)
IP_preprocessor.fit(pulsar_training_IP)
scaled_training_IP = IP_preprocessor.transform(pulsar_training_IP)
scaled_training_IP

Unnamed: 0,mean_IP,SD_IP,EK_IP,S_IP
3351,0.106103,-0.862786,0.058617,-0.111047
568,-1.628512,-1.420232,1.760557,1.792607
2981,-0.028572,0.282767,-0.048541,-0.380471
11207,-1.036712,-0.599887,1.131219,0.690329
5386,-1.282639,-0.847952,1.173055,0.881753
...,...,...,...,...
8844,0.492963,0.722652,-0.546874,-0.612863
12674,1.098090,0.863371,-0.904230,-0.645186
17062,0.950695,1.723462,-0.826627,-0.695288
5066,1.092638,1.140628,-0.786728,-0.665950


Table 4

The same is done for the DM-SNR values; the resulting table is shown as Table 5.

In [18]:
pulsar_training_DMSNR= upsampled_pulsar[['mean_DM-SNR','SD_DM-SNR','EK_DM-SNR','S_DM-SNR','type']]
DMSNR_preprocessor = make_column_transformer(
    (StandardScaler(), ['mean_DM-SNR','SD_DM-SNR','EK_DM-SNR','S_DM-SNR']),
    verbose_feature_names_out=False
)
DMSNR_preprocessor.fit(pulsar_training_DMSNR)
scaled_training_DMSNR = DMSNR_preprocessor.transform(pulsar_training_DMSNR)
scaled_training_DMSNR

Unnamed: 0,mean_DM-SNR,SD_DM-SNR,EK_DM-SNR,S_DM-SNR
3351,-0.519939,-0.247802,-0.144494,-0.401404
568,0.972059,0.926066,-0.932178,-0.663343
2981,-0.651326,-1.009513,1.407038,1.147388
11207,0.465768,1.090154,-0.910016,-0.662690
5386,0.782410,1.471315,-1.007177,-0.677438
...,...,...,...,...
8844,-0.650022,-0.899237,1.055339,0.620622
12674,-0.632413,-0.866754,0.653262,0.240939
17062,-0.428545,0.046291,-0.442379,-0.544960
5066,-0.544709,-0.332328,-0.145934,-0.411515


Table 5
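
To make the standardization concrete, the sketch below applies StandardScaler to the five mean_IP values shown in Table 1 and compares the result with the formula (x - mean) / standard deviation computed by hand. Using only these five rows is an assumption for illustration, so the numbers differ from those in Table 4.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# The five mean_IP values from the head of the data set.
toy = pd.DataFrame({"mean_IP": [140.5625, 102.507812, 103.015625, 136.75, 88.726562]})

scaler = StandardScaler().fit(toy)
scaled = np.asarray(scaler.transform(toy)).ravel()

# Same computation by hand: StandardScaler divides by the population
# standard deviation, which corresponds to ddof=0 in pandas.
manual = (toy["mean_IP"] - toy["mean_IP"].mean()) / toy["mean_IP"].std(ddof=0)
print(scaled)
print(manual.to_numpy())
```

After scaling, each column has mean 0 and standard deviation 1, so no single feature dominates the distances that KNN relies on.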

After the IP and DM-SNR values are standardized, the most appropriate k value for each model is found using cross-validation.

First, we list the pipeline's parameters to find the name of the one to tune.

In [19]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

IP_knn = KNeighborsClassifier()
IP_tune_pipe = make_pipeline(IP_preprocessor, IP_knn)
IP_tune_pipe.get_params()


{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                    ['mean_IP', 'SD_IP', 'EK_IP', 'S_IP'])],
                     verbose_feature_names_out=False)),
  ('kneighborsclassifier', KNeighborsClassifier())],
 'verbose': False,
 'columntransformer': ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                  ['mean_IP', 'SD_IP', 'EK_IP', 'S_IP'])],
                   verbose_feature_names_out=False),
 'kneighborsclassifier': KNeighborsClassifier(),
 'columntransformer__n_jobs': None,
 'columntransformer__remainder': 'drop',
 'columntransformer__sparse_threshold': 0.3,
 'columntransformer__transformer_weights': None,
 'columntransformer__transformers': [('standardscaler',
   StandardScaler(),
   ['mean_IP', 'SD_IP', 'EK_IP', 'S_IP'])],
 'columntransformer__verbose': False,
 'columntransformer__verbose_feature_names_out': False,
 'columntransformer

The parameter to tune is kneighborsclassifier__n_neighbors; its default value is 5.

Then we tune over the grid and acquire a table of cross-validation results, which is then plotted as a line plot. (See Graph 2)

In [20]:
IP_parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1,51,5),
}

In [21]:
from sklearn.model_selection import GridSearchCV

IP_tune_grid = GridSearchCV(
    estimator=IP_tune_pipe,
    param_grid=IP_parameter_grid,
    cv=10
)
IP_accuracies_grid = pd.DataFrame(
    IP_tune_grid.fit(
        scaled_training_IP,
        upsampled_pulsar["type"]
    ).cv_results_
)
cross_val_plot=alt.Chart(IP_accuracies_grid).mark_line(point=True).encode(
    y=alt.Y("mean_test_score").scale(zero=False),
    x=alt.X("param_kneighborsclassifier__n_neighbors"),
)
cross_val_plot

Graph 2

From this graph, we see that the highest test score occurs at k=1; however, this would lead to overfitting and a less useful model. k=16 is a reasonable choice: it yields a high test score, the test scores for neighbouring k values do not vary much, and it does not require a significant amount of computational power.
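
Reading k off a plot can be complemented by looking at the numbers behind it. The following sketch runs the same kind of grid search on synthetic data from make_classification (an assumption; it is not the pulsar data, and the grid and cv=5 are chosen only to keep the example fast) and prints the mean and standard deviation of the score for each k.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for the pulsar features (an assumption).
X_toy, y_toy = make_classification(n_samples=300, n_features=4, random_state=1)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(
    estimator=pipe,
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 32, 5)},
    cv=5,
)
grid.fit(X_toy, y_toy)

# One row per candidate k, with the mean and spread of the fold scores.
results = pd.DataFrame(grid.cv_results_)[
    ["param_kneighborsclassifier__n_neighbors", "mean_test_score", "std_test_score"]
]
print(results)
print(grid.best_params_)
```

The std_test_score column gives a rough sense of whether the difference between two neighbouring k values is larger than the fold-to-fold noise.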

The same process is repeated for the DM-SNR features to find the optimal k value, this time with cv=100 in the grid search.

In [22]:
DMSNR_knn = KNeighborsClassifier()
DMSNR_tune_pipe = make_pipeline(DMSNR_preprocessor, DMSNR_knn)
DMSNR_tune_pipe.get_params()


{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                    ['mean_DM-SNR', 'SD_DM-SNR', 'EK_DM-SNR',
                                     'S_DM-SNR'])],
                     verbose_feature_names_out=False)),
  ('kneighborsclassifier', KNeighborsClassifier())],
 'verbose': False,
 'columntransformer': ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                  ['mean_DM-SNR', 'SD_DM-SNR', 'EK_DM-SNR',
                                   'S_DM-SNR'])],
                   verbose_feature_names_out=False),
 'kneighborsclassifier': KNeighborsClassifier(),
 'columntransformer__n_jobs': None,
 'columntransformer__remainder': 'drop',
 'columntransformer__sparse_threshold': 0.3,
 'columntransformer__transformer_weights': None,
 'columntransformer__transformers': [('standardscaler',
   StandardScaler(),
   ['mean_DM-SNR', 'SD_DM-SNR', 'EK_DM-SNR', 'S_DM-SN

In [23]:
DMSNR_parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1,51,5),
}

In [24]:
DMSNR_tune_grid = GridSearchCV(
    estimator=DMSNR_tune_pipe,
    param_grid=DMSNR_parameter_grid,
    cv=100
)
DMSNR_accuracies_grid = pd.DataFrame(
    DMSNR_tune_grid.fit(
        scaled_training_DMSNR,
        upsampled_pulsar["type"]
    ).cv_results_
)
cross_val_plot=alt.Chart(DMSNR_accuracies_grid).mark_line(point=True).encode(
    y=alt.Y("mean_test_score").scale(zero=False),
    x=alt.X("param_kneighborsclassifier__n_neighbors"),
)
cross_val_plot

According to the graph, the highest score is again at k=1; however, as mentioned before, this would lead to overfitting, so other values of k are considered. k=6, k=11, and k=16 were considered, but the differences between the scores for nearby values are quite large. The final value chosen is k=21: it has a moderate score of around 88%, the difference from nearby values is around 1%, and it does not require a significant amount of computational power.

Afterwards, the chosen k values are used to fit the models for both IP and DM-SNR, as shown in the code below.

In [25]:
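IP_knn = KNeighborsClassifier(n_neighbors=16)  # k chosen from Graph 2
# X holds the already standardized IP features, so the pipeline's scaler is refit on standardized values
X = scaled_training_IP[['mean_IP','SD_IP','EK_IP','S_IP']]
y = upsampled_pulsar["type"]
IP_fit = make_pipeline(IP_preprocessor, IP_knn).fit(X, y)
IP_fit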

In [26]:
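DMSNR_knn = KNeighborsClassifier(n_neighbors=21)  # k chosen from the DM-SNR cross-validation plot
# X holds the already standardized DM-SNR features, so the pipeline's scaler is refit on standardized values
X = scaled_training_DMSNR[['mean_DM-SNR','SD_DM-SNR','EK_DM-SNR','S_DM-SNR']]
y = upsampled_pulsar["type"]
DMSNR_fit = make_pipeline(DMSNR_preprocessor, DMSNR_knn).fit(X, y)
DMSNR_fit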
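
One detail worth checking in the two cells above: X is taken from the already standardized tables, so the StandardScaler inside each pipeline is fit on values with mean near 0 and standard deviation near 1, while the later cells pass the raw columns of pulsar to predict. A minimal sketch of the alternative, fitting the whole pipeline on raw columns so that raw rows can be passed at prediction time, is shown below. The toy values, the two-feature subset, and n_neighbors=3 are assumptions for illustration.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical raw (unscaled) training rows with two well-separated classes.
train = pd.DataFrame({
    "mean_IP": [140.6, 102.5, 103.0, 136.8, 23.7, 27.8, 37.1, 46.6],
    "S_IP":    [-0.70, -0.52, 1.05, -0.64, 30.3, 37.4, 19.0, 16.6],
    "type":    ["others"] * 4 + ["pulsar"] * 4,
})
features = ["mean_IP", "S_IP"]

# Fitting on raw columns lets the scaler learn the raw means and spreads,
# so the same pipeline can later be given raw rows at prediction time.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(train[features], train["type"])

new_rows = pd.DataFrame({"mean_IP": [120.0, 30.0], "S_IP": [0.2, 28.0]})
predictions = model.predict(new_rows)
learned_means = model.named_steps["standardscaler"].mean_
print(predictions)
print(learned_means)
```

In this arrangement the scaler's learned means are the raw column means, so training rows and new rows are put on the same scale before distances are computed.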

In [27]:
pulsar_IP = pulsar[["mean_IP", "SD_IP", "EK_IP", "S_IP", "type"]]
pulsar_IP
## Made a new Dataframe with only IP data and its type.

Unnamed: 0,mean_IP,SD_IP,EK_IP,S_IP,type
0,140.562500,55.683782,-0.234571,-0.699648,others
1,102.507812,58.882430,0.465318,-0.515088,others
2,103.015625,39.341649,0.323328,1.051164,others
3,136.750000,57.178449,-0.068415,-0.636238,others
4,88.726562,40.672225,0.600866,1.123492,others
...,...,...,...,...,...
17893,136.429688,59.847421,-0.187846,-0.738123,others
17894,122.554688,49.485605,0.127978,0.323061,others
17895,119.335938,59.935939,0.159363,-0.743025,others
17896,114.507812,53.902400,0.201161,-0.024789,others


In [28]:
pulsar_IP = pulsar_IP.assign(predicted_type = IP_fit.predict(pulsar[["mean_IP", "SD_IP", "EK_IP", "S_IP"]]))
pulsar_IP
## Made a new Column with the predicted pulsar type from our model

Unnamed: 0,mean_IP,SD_IP,EK_IP,S_IP,type,predicted_type
0,140.562500,55.683782,-0.234571,-0.699648,others,others
1,102.507812,58.882430,0.465318,-0.515088,others,others
2,103.015625,39.341649,0.323328,1.051164,others,others
3,136.750000,57.178449,-0.068415,-0.636238,others,others
4,88.726562,40.672225,0.600866,1.123492,others,others
...,...,...,...,...,...,...
17893,136.429688,59.847421,-0.187846,-0.738123,others,others
17894,122.554688,49.485605,0.127978,0.323061,others,others
17895,119.335938,59.935939,0.159363,-0.743025,others,others
17896,114.507812,53.902400,0.201161,-0.024789,others,others


In [29]:
pulsar_IP[pulsar_IP["type"] != pulsar_IP["predicted_type"]]
## Found the rows in which the predictions are not correct

Unnamed: 0,mean_IP,SD_IP,EK_IP,S_IP,type,predicted_type
19,99.367188,41.572202,1.547197,4.154106,pulsar,others
42,120.554688,45.549905,0.282924,0.419909,pulsar,others
61,27.765625,28.666042,5.770087,37.419009,pulsar,others
92,23.625000,29.948654,5.688038,35.987172,pulsar,others
93,94.585938,35.779823,1.187309,3.687469,pulsar,others
...,...,...,...,...,...,...
17515,89.867188,47.482295,1.591325,2.505057,pulsar,others
17529,27.039062,33.754722,4.779124,26.255357,pulsar,others
17558,77.070312,39.000638,1.884421,6.372178,pulsar,others
17642,28.375000,27.649311,6.377273,45.944048,pulsar,others


In [30]:
correct_preds = pulsar_IP[
    pulsar_IP['type'] == pulsar_IP['predicted_type']
]

correct_preds.shape[0] / pulsar_IP.shape[0]

## Finding the Accuracy of the model

0.9116661079450218

In [31]:
confusion_matrix_IP = pd.crosstab(
    pulsar_IP["type"],
    pulsar_IP["predicted_type"]
)
confusion_matrix_IP
## Making Confusion Matrix

type \ predicted_type,others,pulsar
others,16259,0
pulsar,1581,58


In [32]:
pulsar_SNR = pulsar[["mean_DM-SNR", "SD_DM-SNR", "EK_DM-SNR", "S_DM-SNR", "type"]]
pulsar_SNR
## Made a new Dataframe with only SNR data and its type.

Unnamed: 0,mean_DM-SNR,SD_DM-SNR,EK_DM-SNR,S_DM-SNR,type
0,3.199833,19.110426,7.975532,74.242225,others
1,1.677258,14.860146,10.576487,127.393580,others
2,3.121237,21.744669,7.735822,63.171909,others
3,3.642977,20.959280,6.896499,53.593661,others
4,1.178930,11.468720,14.269573,252.567306,others
...,...,...,...,...,...
17893,1.296823,12.166062,15.450260,285.931022,others
17894,16.409699,44.626893,2.945244,8.297092,others
17895,21.430602,58.872000,2.499517,4.595173,others
17896,1.946488,13.381731,10.007967,134.238910,others


In [33]:
pulsar_SNR = pulsar_SNR.assign(predicted_type = DMSNR_fit.predict(pulsar[["mean_DM-SNR", "SD_DM-SNR", "EK_DM-SNR", "S_DM-SNR"]]))
pulsar_SNR
## Made a new Column with the predicted pulsar type from our model

Unnamed: 0,mean_DM-SNR,SD_DM-SNR,EK_DM-SNR,S_DM-SNR,type,predicted_type
0,3.199833,19.110426,7.975532,74.242225,others,others
1,1.677258,14.860146,10.576487,127.393580,others,others
2,3.121237,21.744669,7.735822,63.171909,others,others
3,3.642977,20.959280,6.896499,53.593661,others,others
4,1.178930,11.468720,14.269573,252.567306,others,others
...,...,...,...,...,...,...
17893,1.296823,12.166062,15.450260,285.931022,others,others
17894,16.409699,44.626893,2.945244,8.297092,others,others
17895,21.430602,58.872000,2.499517,4.595173,others,others
17896,1.946488,13.381731,10.007967,134.238910,others,others


In [34]:
pulsar_SNR[pulsar_SNR["type"] != pulsar_SNR["predicted_type"]]
## Found the rows in which the predictions are not correct

Unnamed: 0,mean_DM-SNR,SD_DM-SNR,EK_DM-SNR,S_DM-SNR,type,predicted_type
19,27.555184,61.719016,2.208808,3.662680,pulsar,others
42,1.358696,13.079034,13.312141,212.597029,pulsar,others
61,73.112876,62.070220,1.268206,1.082920,pulsar,others
92,146.568562,82.394624,-0.274902,-1.121848,pulsar,others
93,6.071070,29.760400,5.318767,28.698048,pulsar,others
...,...,...,...,...,...,...
17712,109.775084,98.586560,-0.128871,-1.865853,others,pulsar
17800,72.738294,92.069050,0.592935,-1.454346,others,pulsar
17844,61.095318,75.256580,0.660881,-1.140900,others,pulsar
17847,61.940635,79.716328,0.621127,-1.408289,others,pulsar


In [35]:
correct_preds_SNR = pulsar_SNR[
    pulsar_SNR['type'] == pulsar_SNR['predicted_type']
]

correct_preds_SNR.shape[0] / pulsar_SNR.shape[0]

## Finding the Accuracy of the model

0.9067493574701084

In [36]:
confusion_matrix_SNR = pd.crosstab(
    pulsar_SNR["type"],
    pulsar_SNR["predicted_type"]
)
confusion_matrix_SNR
## Making Confusion Matrix

type \ predicted_type,others,pulsar
others,16035,224
pulsar,1445,194


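As seen above, both models reach about 91% accuracy overall (0.912 for IP and 0.907 for DM-SNR), but this mostly reflects the large 'others' class. The IP model never labels a non-pulsar as a pulsar, yet it identifies only 58 of the 1,639 pulsars, whereas the DM-SNR model identifies 194 of them at the cost of 224 false positives. Because it detects more pulsars, we use the DM-SNR model to predict the type.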
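
The comparison can be made explicit by computing accuracy, recall, and precision for the pulsar class from the two confusion matrices. The sketch below copies the four counts of each matrix by hand; the helper name summarize is hypothetical.

```python
import pandas as pd

# Counts copied from the two confusion matrices above
# (rows: true type, columns: predicted type).
cm_ip = pd.DataFrame({"others": [16259, 1581], "pulsar": [0, 58]}, index=["others", "pulsar"])
cm_snr = pd.DataFrame({"others": [16035, 1445], "pulsar": [224, 194]}, index=["others", "pulsar"])

def summarize(cm):
    """Return accuracy, pulsar recall, and pulsar precision from a 2x2 table."""
    total = cm.to_numpy().sum()
    accuracy = (cm.loc["others", "others"] + cm.loc["pulsar", "pulsar"]) / total
    recall = cm.loc["pulsar", "pulsar"] / cm.loc["pulsar"].sum()   # share of true pulsars found
    precision = cm.loc["pulsar", "pulsar"] / cm["pulsar"].sum()    # share of pulsar calls that are right
    return accuracy, recall, precision

ip_acc, ip_recall, ip_precision = summarize(cm_ip)
snr_acc, snr_recall, snr_precision = summarize(cm_snr)
print(ip_acc, ip_recall, ip_precision)
print(snr_acc, snr_recall, snr_precision)
```

The IP model finds about 3.5% of the pulsars and the DM-SNR model about 11.8%, with a precision of about 46% for the latter. Note that these figures are computed on the full pulsar data frame, which includes the training rows; repeating the calculation on pulsar_test would give an estimate on unseen data.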

In [37]:
final_model = DMSNR_fit