# Group project proposal: Pulsar Star Data

### Introduction

This project investigates data describing **puls**ating **ra**dio **s**ources, better known as **pulsars**.

A pulsar is a rare type of neutron star, which is the scientific term for the collapsed core of a massive supergiant star. As it rotates, it emits beams of electromagnetic radiation that produce a characteristic periodic pattern of radio emission, which large radio telescopes can detect on Earth. Every detected signal is a candidate that could potentially come from a real pulsar. In practice, however, the majority of detections are caused by radio frequency interference (RFI) and noise, making legitimate signals hard to find.

The HTRU2 dataset describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey. It contains 17,898 entries in total: 16,259 spurious examples caused by RFI/noise and 1,639 real pulsar examples. The dataset contains nine variables for each observation. The first four are statistics (mean, standard deviation, excess kurtosis, and skewness) of the integrated pulse profile, a version of the recorded signal that has been averaged in both time and frequency. The next four are the same statistics computed from the so-called DM-SNR curve, which shows the spectral signal-to-noise ratio (SNR) as a function of dispersion measure (DM). The last variable is the class label, which states whether the candidate is a real pulsar or not.
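
As an illustration of what the four profile statistics measure, the sketch below computes them for a made-up profile using only the Python standard library. The function name `profile_statistics` and the toy profile are hypothetical, and the population (divide-by-n) form of the moments is an assumption: this proposal does not state which estimator was used to build HTRU2.

```python
import math

def profile_statistics(profile):
    """Return mean, standard deviation, excess kurtosis and skewness of a sequence.

    Assumption: population (divide-by-n) central moments are used.
    """
    n = len(profile)
    mean = sum(profile) / n
    # Second, third and fourth central moments.
    m2 = sum((x - mean) ** 2 for x in profile) / n
    m3 = sum((x - mean) ** 3 for x in profile) / n
    m4 = sum((x - mean) ** 4 for x in profile) / n
    sd = math.sqrt(m2)
    return {
        "mean": mean,
        "sd": sd,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,  # 0 for a normal distribution
        "skewness": m3 / sd ** 3,               # 0 for a symmetric distribution
    }

# Hypothetical toy profile: a flat baseline with one narrow peak.
toy_profile = [1.0] * 28 + [5.0, 20.0, 5.0] + [1.0] * 33
stats = profile_statistics(toy_profile)
print(stats)
```

A single narrow peak on a flat baseline gives a strongly positive skewness and excess kurtosis, which is the kind of shape information these variables summarize.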

Pulsars are of considerable scientific interest as probes of space-time, the interstellar medium, and states of matter. For that reason, it is important to predict accurately whether a candidate radio signal observed on Earth comes from a real pulsar or is a result of RFI/noise. That will be the main goal of this project.

### Preliminary exploratory data analysis

Before a data set can be analyzed accurately, it must be inspected and wrangled so that formatting issues or null values do not distort the results. This step also helps in choosing the best analysis method for the data set.

In [77]:
#Import the required packages.
import pandas as pd
import altair as alt
import numpy as np
from sklearn import set_config
from sklearn.model_selection import train_test_split

In [78]:
set_config(transform_output="pandas") # set output as dataframes instead of arrays

In [79]:
#The data set is downloaded from the web and read using the pandas function read_csv.
#The columns are given the placeholder names 1 to 9 and are renamed below.
htru2='https://drive.google.com/uc?export=download&id=1kLqmyQYnEt5M-stWnzz35p_9Zk2-FOZD'
pulsar= pd.read_csv(htru2,names=[1,2,3,4,5,6,7,8,9],index_col=False)

In [80]:
pulsar 

Unnamed: 0,1,2,3,4,5,6,7,8,9
0,140.562500,55.683782,-0.234571,-0.699648,3.199833,19.110426,7.975532,74.242225,0
1,102.507812,58.882430,0.465318,-0.515088,1.677258,14.860146,10.576487,127.393580,0
2,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,0
3,136.750000,57.178449,-0.068415,-0.636238,3.642977,20.959280,6.896499,53.593661,0
4,88.726562,40.672225,0.600866,1.123492,1.178930,11.468720,14.269573,252.567306,0
...,...,...,...,...,...,...,...,...,...
17893,136.429688,59.847421,-0.187846,-0.738123,1.296823,12.166062,15.450260,285.931022,0
17894,122.554688,49.485605,0.127978,0.323061,16.409699,44.626893,2.945244,8.297092,0
17895,119.335938,59.935939,0.159363,-0.743025,21.430602,58.872000,2.499517,4.595173,0
17896,114.507812,53.902400,0.201161,-0.024789,1.946488,13.381731,10.007967,134.238910,0


The data is organized, but the columns lack descriptive names and the class column (column 9) only contains the codes 0 and 1. The columns are therefore renamed, and the class values 1 and 0 are replaced with 'pulsar' and 'others' respectively.

In [81]:
pulsar=pulsar.rename(columns={
    1:'mean_IP', #Mean of the integrated profile.
    2:'SD_IP', #Standard deviation of the integrated profile.
    3:'EK_IP', #Excess kurtosis of the integrated profile.
    4:'S_IP', #Skewness of the integrated profile.
    5:'mean_DM-SNR', #Mean of the DM-SNR curve.
    6:'SD_DM-SNR', #Standard deviation of the DM-SNR curve.
    7:'EK_DM-SNR',#Excess kurtosis of the DM-SNR curve.
    8:'S_DM-SNR', #Skewness of the DM-SNR curve.
    9:'type'}) #Class of the candidate (others or pulsar).
#Rename the numbered columns to meaningful names.

In [82]:
pulsar['type']=pulsar['type'].replace({
    0:'others',
    1:'pulsar'}) #replacing values of type to more meaningful values

In [83]:
pulsar

Unnamed: 0,mean_IP,SD_IP,EK_IP,S_IP,mean_DM-SNR,SD_DM-SNR,EK_DM-SNR,S_DM-SNR,type
0,140.562500,55.683782,-0.234571,-0.699648,3.199833,19.110426,7.975532,74.242225,others
1,102.507812,58.882430,0.465318,-0.515088,1.677258,14.860146,10.576487,127.393580,others
2,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,others
3,136.750000,57.178449,-0.068415,-0.636238,3.642977,20.959280,6.896499,53.593661,others
4,88.726562,40.672225,0.600866,1.123492,1.178930,11.468720,14.269573,252.567306,others
...,...,...,...,...,...,...,...,...,...
17893,136.429688,59.847421,-0.187846,-0.738123,1.296823,12.166062,15.450260,285.931022,others
17894,122.554688,49.485605,0.127978,0.323061,16.409699,44.626893,2.945244,8.297092,others
17895,119.335938,59.935939,0.159363,-0.743025,21.430602,58.872000,2.499517,4.595173,others
17896,114.507812,53.902400,0.201161,-0.024789,1.946488,13.381731,10.007967,134.238910,others


In [84]:
#The data frame is then split into training and testing sets so that the classifier can later be evaluated on unseen data.
#Note: no random_state is passed, so the split differs from run to run.
pulsar_train, pulsar_test = train_test_split(
    pulsar, train_size=0.75, stratify=pulsar["type"]
) #splitting into training and testing data
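
To show what the `stratify` argument is meant to achieve, here is a minimal standard-library sketch of a stratified split. It is not scikit-learn's implementation: the helper `stratified_split` is hypothetical and simply takes the same fraction from each class, using the class counts given in the introduction.

```python
import random

def stratified_split(labels, train_fraction, seed=0):
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(len(indices) * train_fraction)  # floor of the per-class share
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test

# Class counts reported for HTRU2 in the introduction.
labels = ["others"] * 16259 + ["pulsar"] * 1639
train_idx, test_idx = stratified_split(labels, train_fraction=0.75)

train_pulsars = sum(labels[i] == "pulsar" for i in train_idx)
print(len(train_idx), train_pulsars)             # 13423 1229
print(round(train_pulsars / len(train_idx), 4))  # 0.0916
print(round(1639 / len(labels), 4))              # 0.0916
```

The share of pulsars is practically the same in the training set as in the full data set, which is the purpose of stratifying on the class column.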

In [85]:
#reset_index returns a new data frame with a fresh 0-based index and the old index kept in an 'index' column.
#The result is only displayed here and not assigned, so pulsar_train itself keeps its original index.
pulsar_train.reset_index()

Unnamed: 0,index,mean_IP,SD_IP,EK_IP,S_IP,mean_DM-SNR,SD_DM-SNR,EK_DM-SNR,S_DM-SNR,type
0,954,112.804688,51.668423,0.195714,0.059445,2.970736,22.281454,8.348756,73.572597,others
1,6879,133.812500,48.118303,-0.103910,0.145506,55.274247,73.513815,0.843660,-0.785131,others
2,6513,115.679688,52.758283,0.082086,-0.497257,1.864548,12.196073,11.150126,176.551192,others
3,8068,128.046875,58.414315,-0.053963,-0.710291,2.629599,16.747358,8.295720,83.421375,others
4,2077,125.429688,46.975340,0.105053,-0.226290,6.050167,33.865623,5.787360,32.972108,others
...,...,...,...,...,...,...,...,...,...,...
13418,8521,105.531250,42.454560,0.623125,0.835590,31.111204,69.203992,2.406829,4.553456,others
13419,11046,123.554688,44.005825,0.169926,0.474113,4.938963,26.654807,5.884115,36.256227,others
13420,15269,113.187500,51.431045,0.002099,-0.365370,4.759197,23.958518,6.209460,43.090209,others
13421,2754,26.578125,31.823637,5.502087,31.529843,134.798495,72.055583,0.029297,-1.211265,pulsar


Before working with the data, we check its basic structure. All columns are of type float64 except the 'type' column, which has dtype object. Every column has the same number of non-null values (13,423).

In [86]:
pulsar_train.info() #basic information about training data

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13423 entries, 954 to 9390
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   mean_IP      13423 non-null  float64
 1   SD_IP        13423 non-null  float64
 2   EK_IP        13423 non-null  float64
 3   S_IP         13423 non-null  float64
 4   mean_DM-SNR  13423 non-null  float64
 5   SD_DM-SNR    13423 non-null  float64
 6   EK_DM-SNR    13423 non-null  float64
 7   S_DM-SNR     13423 non-null  float64
 8   type         13423 non-null  object 
dtypes: float64(8), object(1)
memory usage: 1.0+ MB


In [87]:
#To check whether any null values need to be dropped,
#the number of null values in each column is counted.
count_nan = pulsar_train.isnull().sum() #number of null values in each column
count_nan

mean_IP        0
SD_IP          0
EK_IP          0
S_IP           0
mean_DM-SNR    0
SD_DM-SNR      0
EK_DM-SNR      0
S_DM-SNR       0
type           0
dtype: int64

None of the columns contains null values, so no rows need to be dropped.

In [88]:
#The code below calculates the column-wise means for pulsars and other sources to identify differences between them.
mean_value=pulsar_train.groupby('type').mean() #mean of each column for pulsars and for other sources
mean_value

| type | mean_IP | SD_IP | EK_IP | S_IP | mean_DM-SNR | SD_DM-SNR | EK_DM-SNR | S_DM-SNR |
|---|---|---|---|---|---|---|---|---|
| others | 116.624134 | 47.373524 | 0.207799 | 0.373049 | 8.862507 | 23.253085 | 8.869352 | 113.720493 |
| pulsar | 56.59068 | 38.714436 | 3.139684 | 15.622171 | 49.921206 | 56.409181 | 2.775235 | 18.325875 |


The results show substantial differences between other sources and pulsars in the mean value of every variable (for example, mean_IP is about 116.6 for other sources and 56.6 for pulsars), which suggests that the two classes have distinctive characteristics.

In [89]:
#Count the observations in each class to check for class imbalance.
count_obs = pulsar_train.groupby('type')['type'].count()  #number of pulsar observations and of other observations
count_obs

type
others    12194
pulsar     1229
Name: type, dtype: int64

Most observations in the training set (12,194 of 13,423, about 91%) are of sources other than pulsars, so pulsars are the rare class. This imbalance will probably need to be addressed during model training, for example by resampling the pulsar observations.
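
To make the effect of the imbalance concrete, the sketch below uses the class counts shown above to compute the accuracy of a trivial classifier that always predicts the majority class, and then balances the classes by random oversampling. Random oversampling is only one possible resampling approach and is an assumption here, since the method has not been chosen yet; the helper `oversample_minority` is hypothetical.

```python
import random

# Training-set class counts taken from the output above.
counts = {"others": 12194, "pulsar": 1229}
total = sum(counts.values())

# Accuracy of a trivial classifier that always predicts the majority class.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total
print(majority_class, round(baseline_accuracy, 3))  # others 0.908

def oversample_minority(labels, seed=0):
    """Return indices in which smaller classes are resampled (with replacement)
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = max(len(indices) for indices in by_class.values())
    resampled = []
    for indices in by_class.values():
        resampled.extend(indices)
        resampled.extend(rng.choices(indices, k=target - len(indices)))
    return resampled

labels = ["others"] * counts["others"] + ["pulsar"] * counts["pulsar"]
balanced = oversample_minority(labels)
balanced_pulsars = sum(labels[i] == "pulsar" for i in balanced)
print(len(balanced), balanced_pulsars)  # 24388 12194
```

Any classifier therefore has to beat roughly 91% accuracy to be useful, and resampling of this kind should only ever be applied to the training set, never to the testing set.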

The following scatter plots show how pairs of predictors relate to the class of each observation, which helps us understand the relationships between the variables and the classes.

In [90]:
alt.data_transformers.disable_max_rows()
pulsar_mean_plot=alt.Chart(pulsar_train,title='Mean of IP versus mean of DM-SNR').mark_point(opacity=0.5).encode(
    x=alt.X('mean_IP'),
    y=alt.Y('mean_DM-SNR'),
    color='type')
pulsar_mean_plot #scatter plot of mean_DM-SNR against mean_IP, colored by class

The graph shows a clear separation between pulsars and other sources, with some overlap in the middle where predictions from a K-nearest neighbors (KNN) classifier may be less reliable.

In [91]:
alt.data_transformers.disable_max_rows()
pulsar_SD_plot=alt.Chart(pulsar_train,title='Standard deviation of IP versus standard deviation of DM-SNR').mark_point(opacity=0.5).encode(
    x=alt.X('SD_IP'),
    y=alt.Y('SD_DM-SNR'),
    color='type')
pulsar_SD_plot #scatter plot of SD_DM-SNR against SD_IP, colored by class

This graph shows that pulsars and other sources have similar distributions for these two variables, with many overlapping data points.

In [92]:
alt.data_transformers.disable_max_rows()
pulsar_S_plot=alt.Chart(pulsar_train,title='Skewness of IP versus skewness of DM-SNR').mark_point(opacity=0.5).encode(
    x=alt.X('S_IP'),
    y=alt.Y('S_DM-SNR'),
    color='type')
pulsar_S_plot #scatter plot of S_DM-SNR against S_IP, colored by class

This graph highlights a distinct difference between pulsars and other sources. Other sources tend to have a low skewness of the IP and a wide range of skewness of the DM-SNR curve, while pulsars have a low skewness of the DM-SNR curve and a wide range of skewness of the IP.

### Methods
The methods we use are as follows:
1. Read the data into Jupyter Notebook, rename the columns, replace the class codes with descriptive labels, and reset the index to obtain a clean data frame.
2. Use `isnull`, `groupby`, and `Chart` to check the data and become familiar with it.
3. Divide the data into training and testing sets using `train_test_split`, so that the testing set plays no part in any later fitting step.
4. Because the variables are on very different scales, use `StandardScaler` as a preprocessor to standardize the predictors before building the model.
5. Make a pipeline from the preprocessor and a KNN classifier instance.
6. Call the `fit` method of the pipeline on the training data, assigning all columns except the class column to `X` and the class column to `y`.
7. Once the model is trained, predict the class of a candidate (pulsar or others) using the `.predict` method of the model.
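
The modelling steps above could look roughly like the following sketch. It runs on synthetic stand-in data rather than HTRU2: the two predictors are drawn from made-up normal distributions whose centers are loosely based on the class means shown earlier, and `n_neighbors=5` is a placeholder, since the value of K has not been chosen yet.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT HTRU2): two predictors on very different scales.
# The spreads (17, 30, 0.3, 1.8) are invented for illustration only.
n_others, n_pulsar = 900, 100
demo = pd.DataFrame({
    "mean_IP": np.concatenate([rng.normal(117, 17, n_others),
                               rng.normal(57, 30, n_pulsar)]),
    "EK_IP": np.concatenate([rng.normal(0.2, 0.3, n_others),
                             rng.normal(3.1, 1.8, n_pulsar)]),
    "type": ["others"] * n_others + ["pulsar"] * n_pulsar,
})

# All columns except the class column go to X; the class column goes to y.
X = demo.drop(columns="type")
y = demo["type"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0
)

# The scaler is fitted on the training data only, as part of the pipeline.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(X_train, y_train)

predictions = knn_pipeline.predict(X_test)
accuracy = knn_pipeline.score(X_test, y_test)
print(predictions[:5], round(accuracy, 3))
```

Putting the scaler inside the pipeline means its mean and standard deviation are learned from the training set alone and then reused unchanged when predicting on the testing set.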

### Expected outcome and significance

We expect this project to produce a reliable classifier that can differentiate between real pulsars and RFI/noise. We also expect it to point towards more efficient techniques for filtering RFI/noise out of pulsar candidates.

Pulsars are a rare type of star, and being able to identify them precisely is valuable. The techniques involved could improve current technology and might be applied beyond astronomy. Our understanding of neutron stars, including their structure, temperature, gravitational conditions, and rotation, would become more advanced. Whether the discovery of further stars of this type could change our lives on Earth is another possible impact.

The identification of real pulsars could lead to more advanced classifiers that in the future could include artificial intelligence for pulsar detection and analysis. Moreover, the discovery of new types of pulsar could lead to further questions, such as whether a newly found pulsar is habitable or whether extraterrestrial life already exists there.