# Group 35 Project Proposal

## Introduction (Linda)

Pulsars are rotating neutron stars. They emit radio waves in a beam that can be detected when it sweeps past the Earth. However, not all such signals come from pulsars. Machine learning can be applied to radio emissions data to help distinguish pulsar signals from other signals.

In this project, we will train a K-nearest neighbours model to perform binary classification on radio emissions data, so that it can predict whether a signal originates from a pulsar based on the signal's characteristics. The model will be trained and tested on the HTRU2 dataset, contributed to the University of California, Irvine (UCI) Machine Learning Repository by Robert Lyon, which contains 17 898 observations of radio emissions, 1 639 of which come from pulsars and 16 259 of which do not. For each observation, a variable recording its true origin (pulsar or not) is given, along with 8 continuous variables describing features of the signal. The dataset therefore provides a class variable and 8 potential predictor variables.
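As a quick check of the class balance implied by the counts quoted above, the proportions can be computed directly. This is a minimal sketch in base R; the variable names are illustrative and are not part of the project code.

```r
# Class counts as stated in the dataset description above
class_counts <- c(pulsar = 1639, non_pulsar = 16259)

# Total number of observations (should match the 17 898 quoted above)
total <- sum(class_counts)

# Proportion of each class, rounded to three decimal places
class_props <- round(class_counts / total, 3)
print(class_props)
```

Roughly 9.2% of the observations are pulsars and 90.8% are not, so a classifier that always predicts "not a pulsar" would already be correct about 90.8% of the time. Any accuracy we report should be read against that baseline.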

## Preliminary exploratory data analysis (Arthur)

The dataset is provided in both CSV and ARFF formats. We have opted to use the provided CSV file, which comes without column names. However, the columns and the data they contain are described in the accompanying Readme.txt file, so we can add the column names ourselves. The columns are:
* Mean of the integrated profile.
* Standard deviation of the integrated profile.
* Excess kurtosis of the integrated profile.
* Skewness of the integrated profile.
* Mean of the DM-SNR curve.
* Standard deviation of the DM-SNR curve.
* Excess kurtosis of the DM-SNR curve.
* Skewness of the DM-SNR curve.
* Class label (pulsar or not a pulsar).

```r
# Load necessary libraries
library(tidyverse)
```

```r
data_url <- "https://raw.githubusercontent.com/aronthemon/dsci-100-group-project/main/data/HTRU_2.csv"

# Read data from the dataset's CSV file.
# Add column names based on the documentation in Readme.txt.
# Convert the pulsar binary classification to a factor so it can be used later.
pulsar_data <- read_csv(data_url,
                        col_names = c(
                            "ip_mean",
                            "ip_std_dev",
                            "ip_excess_kurtosis",
                            "ip_skewness",
                            "dm_snr_mean",
                            "dm_snr_std_dev",
                            "dm_snr_excess_kurtosis",
                            "dm_snr_skewness",
                            "pulsar")) |>
    mutate(pulsar = as_factor(pulsar))

# Display the first 6 rows of our data
head(pulsar_data, 6)
```

```
Rows: 17898 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): ip_mean, ip_std_dev, ip_excess_kurtosis, ip_skewness, dm_snr_mean, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```


| ip_mean | ip_std_dev | ip_excess_kurtosis | ip_skewness | dm_snr_mean | dm_snr_std_dev | dm_snr_excess_kurtosis | dm_snr_skewness | pulsar |
|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 140.5625 | 55.68378 | -0.23457141 | -0.6996484 | 3.199833 | 19.11043 | 7.975532 | 74.24222 | 0 |
| 102.50781 | 58.88243 | 0.46531815 | -0.5150879 | 1.677258 | 14.86015 | 10.576487 | 127.39358 | 0 |
| 103.01562 | 39.34165 | 0.32332837 | 1.0511644 | 3.121237 | 21.74467 | 7.735822 | 63.17191 | 0 |
| 136.75 | 57.17845 | -0.06841464 | -0.6362384 | 3.642977 | 20.95928 | 6.896499 | 53.59366 | 0 |
| 88.72656 | 40.67223 | 0.60086608 | 1.1234917 | 1.17893 | 11.46872 | 14.269573 | 252.56731 | 0 |
| 93.57031 | 46.69811 | 0.53190485 | 0.4167211 | 1.636288 | 14.54507 | 10.621748 | 131.394 | 0 |


## Methods (Aaron)

We will use cross-validation to select the number of neighbours, k. We will then use that value of k to build the K-nearest neighbours classification model, which will predict the class of new candidate observations. We will use all 8 variables in the dataset as predictors, so that the model has as much information as possible about each signal when classifying new observations.
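The workflow described above could be sketched roughly as follows. This is an illustration under several assumptions that are not stated in the proposal: it uses the `tidymodels` and `kknn` packages, a 75/25 training/testing split, 5 folds, odd candidate values of k from 1 to 15, and standardized predictors. The data here are randomly generated placeholders, not HTRU2 observations; in the project, `pulsar_data` would take the place of `toy_data`.

```r
library(tidymodels)

set.seed(2023)  # arbitrary seed so the sketch is reproducible

# Placeholder data (NOT real HTRU2 values): two predictors and a binary class
toy_data <- tibble(
  predictor_a = c(rnorm(60, mean = 0), rnorm(60, mean = 3)),
  predictor_b = c(rnorm(60, mean = 0), rnorm(60, mean = 3)),
  pulsar = factor(rep(c("0", "1"), each = 60))
)

# Hold out a test set, keeping the class proportions similar in both parts
data_split <- initial_split(toy_data, prop = 0.75, strata = pulsar)
train_data <- training(data_split)
test_data <- testing(data_split)

# Standardize predictors, since KNN relies on distances between points
knn_recipe <- recipe(pulsar ~ ., data = train_data) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN specification with the number of neighbours left to be tuned
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation over a grid of candidate k values
folds <- vfold_cv(train_data, v = 5, strata = pulsar)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Pick the k with the highest mean cross-validated accuracy
best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on the full training set with the chosen k, then predict the test set
knn_final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_final_spec) |>
  fit(data = train_data)

test_predictions <- predict(knn_fit, test_data) |>
  bind_cols(test_data)

test_predictions |>
  metrics(truth = pulsar, estimate = .pred_class)
```

The test set is only used once, after k has been chosen, so the reported accuracy is not influenced by the tuning step.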

One way we will visualize our results is with histograms: we will plot each variable as its own histogram and use colour to distinguish pulsar observations from non-pulsar observations.
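A possible way to draw one histogram per variable is to reshape the data into long format and facet by variable name. The sketch below assumes this approach and uses randomly generated placeholder values with hypothetical column names (`predictor_a`, `predictor_b`); in the project, the 8 predictor columns of `pulsar_data` would be used instead.

```r
library(tidyverse)

set.seed(100)  # arbitrary seed so the placeholder data are reproducible

# Placeholder data (NOT real HTRU2 values)
toy_data <- tibble(
  predictor_a = c(rnorm(200, mean = 0), rnorm(50, mean = 2)),
  predictor_b = c(rnorm(200, mean = 5), rnorm(50, mean = 4)),
  pulsar = factor(rep(c("0", "1"), times = c(200, 50)))
)

# Reshape so each row holds one (variable, value) pair for one observation
long_data <- toy_data |>
  pivot_longer(cols = -pulsar, names_to = "variable", values_to = "value")

# One histogram panel per variable, with bars coloured by class
histogram_plot <- ggplot(long_data, aes(x = value, fill = pulsar)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  facet_wrap(vars(variable), scales = "free") +
  labs(x = "Value", y = "Number of observations", fill = "Pulsar?")

histogram_plot
```

Overlaying the two classes with partial transparency makes it easier to see which variables separate pulsars from non-pulsars; free scales are used because the variables are measured on different ranges.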


## Expected outcomes and significance (Markus)

We expect to build a classifier that can accurately predict whether or not a radio signal comes from a pulsar. Each new observation will be passed to the model to produce a predicted class. We will also produce a scatter plot of the data using 2 of the predictor variables, which will illustrate how the K-NN algorithm determines the pulsar status of a new point.
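To illustrate the kind of scatter plot described above, here is a minimal sketch of K-NN classification of a single new point using two predictors. Everything in it is an assumption made for illustration: the data are hand-made toy values, the column names `predictor_1` and `predictor_2` are hypothetical, k is fixed at 3, and distances are computed on unscaled values because both toy predictors share the same scale.

```r
library(tidyverse)

# Hand-made toy data (NOT real HTRU2 values)
toy_data <- tibble(
  predictor_1 = c(0.0, 0.5, 1.0, 0.2, 0.8, 3.0, 2.5, 3.5, 2.8, 3.2),
  predictor_2 = c(0.1, 0.9, 0.4, 0.6, 0.2, 3.1, 2.7, 3.3, 3.4, 2.6),
  pulsar = factor(rep(c("0", "1"), each = 5))
)

# Hypothetical new observation to classify, and the number of neighbours
new_point <- tibble(predictor_1 = 2.6, predictor_2 = 2.9)
k <- 3

# Euclidean distance from every point to the new point; keep the k closest
nearest <- toy_data |>
  mutate(
    new_x = new_point$predictor_1,
    new_y = new_point$predictor_2,
    distance = sqrt((predictor_1 - new_x)^2 + (predictor_2 - new_y)^2)
  ) |>
  slice_min(distance, n = k, with_ties = FALSE)

# Majority vote among the k nearest neighbours
predicted_class <- nearest |>
  count(pulsar, name = "votes") |>
  slice_max(votes, n = 1, with_ties = FALSE) |>
  pull(pulsar) |>
  as.character()

# Scatter plot: all points coloured by class, the new point as a black cross,
# and dashed lines joining it to its k nearest neighbours
knn_plot <- ggplot(toy_data, aes(x = predictor_1, y = predictor_2, colour = pulsar)) +
  geom_point(size = 2) +
  geom_segment(data = nearest,
               aes(xend = new_x, yend = new_y),
               linetype = "dashed") +
  geom_point(data = new_point,
             aes(x = predictor_1, y = predictor_2),
             inherit.aes = FALSE, colour = "black", shape = 4, size = 4) +
  labs(x = "Predictor 1", y = "Predictor 2", colour = "Pulsar?")

knn_plot
```

With this toy data, the three nearest neighbours of the new point all belong to class 1, so the predicted class is 1.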

The significance of the project is that we will be able to determine which variables are indicative of pulsar status, and we will have a model that can classify future observations. This could be useful in astronomy labs.

This project may lead to future classification questions, such as which parameters are needed to classify other kinds of astronomical objects.