*Group-005-9: Ethan Hsu,*

# Pulsar Star Predictor

## Introduction:
A pulsar is a rare, rapidly rotating neutron star that emits beams of electromagnetic radiation from its magnetic poles. As the star rotates, these beams produce a detectable, periodic pattern of broadband radio emission. However, radio frequency interference and noise can also trigger detectors and mimic a pulsar's signal, which makes real pulsars challenging to identify. We will use the HTRU2 data set, collected during the High Time Resolution Universe Survey, which contains 17,898 examples, of which 1,639 are real pulsars. Each observation is described by 8 variables and a class variable. The first 4 are the mean, standard deviation, excess kurtosis, and skewness of the integrated profile; the last 4 are the same statistics computed from the DM-SNR curve. We will build a prediction model and measure how accurately it predicts whether a signal comes from a pulsar.
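Because pulsars are so rare in this data set, it helps to know what accuracy a trivial model would reach. The short sketch below uses only the two counts quoted above; the variable names are illustrative.

```r
# Counts taken from the HTRU2 description above
n_total      <- 17898
n_pulsar     <- 1639
n_non_pulsar <- n_total - n_pulsar

# Share of examples that are real pulsars
pulsar_share <- n_pulsar / n_total

# Accuracy of a classifier that always predicts "non_pulsar"
baseline_accuracy <- n_non_pulsar / n_total

round(c(pulsar_share = pulsar_share,
        baseline_accuracy = baseline_accuracy), 4)
```

Roughly 9% of the examples are pulsars, so a model that always answers `non_pulsar` is already right about 91% of the time. Any accuracy we report should be read against that baseline.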
## Preliminary Exploratory Data Analysis:
Start by loading the libraries



In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
library(GGally)
options(repr.matrix.max.rows = 6)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.5     ✔ rsample

Read dataset from the web into R

In [2]:
pulsar_data<-read_csv("https://raw.githubusercontent.com/ehsu2004/K-nearest-Pulsar-Star-Predictor/main/htru2/HTRU_2.csv", col_names=FALSE)

[1mRows: [22m[34m17898[39m [1mColumns: [22m[34m9[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[32mdbl[39m (9): X1, X2, X3, X4, X5, X6, X7, X8, X9

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


## Cleaning and Wrangling the data into a tidy format:
Add column names

In [3]:
colnames(pulsar_data)<-c("mean_profile",
                         "deviation_profile",
                         "kurtosis_profile",
                         "skewness_profile",
                         "mean_curve",
                         "deviation_curve",
                         "kurtosis_curve",
                         "skewness_curve",
                         "class")

Factoring the `class` variable and renaming its values

In [4]:
pulsar_data<-pulsar_data|>
    mutate(class=ifelse(class==1,"pulsar","non_pulsar"))|>
    mutate(class=as_factor(class))
pulsar_data

| mean_profile | deviation_profile | kurtosis_profile | skewness_profile | mean_curve | deviation_curve | kurtosis_curve | skewness_curve | class |
|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 140.5625 | 55.68378 | -0.2345714 | -0.6996484 | 3.199833 | 19.11043 | 7.975532 | 74.24222 | non_pulsar |
| 102.5078 | 58.88243 | 0.4653182 | -0.5150879 | 1.677258 | 14.86015 | 10.576487 | 127.39358 | non_pulsar |
| 103.0156 | 39.34165 | 0.3233284 | 1.0511644 | 3.121237 | 21.74467 | 7.735822 | 63.17191 | non_pulsar |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 119.3359 | 59.93594 | 0.1593631 | -0.74302540 | 21.430602 | 58.87200 | 2.499517 | 4.595173 | non_pulsar |
| 114.5078 | 53.90240 | 0.2011614 | -0.02478884 | 1.946488 | 13.38173 | 10.007967 | 134.238910 | non_pulsar |
| 57.0625 | 85.79734 | 1.4063910 | 0.08951971 | 188.306020 | 64.71256 | -1.597527 | 1.429475 | non_pulsar |
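To make the recoding step concrete, here is a minimal base R sketch on a toy vector. The values in `raw_class` are hypothetical stand-ins for the raw 0/1 `class` column, and base `factor()` is used in place of `as_factor()` so the example runs without extra packages.

```r
# Hypothetical 0/1 labels standing in for the raw `class` column
raw_class <- c(0, 0, 1, 0, 1)

# Recode the numbers to readable labels
class_labels <- ifelse(raw_class == 1, "pulsar", "non_pulsar")

# Convert to a factor. The levels are set explicitly here;
# forcats::as_factor() would instead order them by first appearance.
class_factor <- factor(class_labels, levels = c("non_pulsar", "pulsar"))

levels(class_factor)
table(class_factor)
```

Setting the levels explicitly keeps the level order the same no matter which label happens to appear first in the data.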


## Summarizing the data in at least one table:
Splitting the dataset into training and testing data

In [5]:
set.seed(2024)

pulsar_split<-initial_split(pulsar_data, prop = 0.75, strata = class)
pulsar_train<-training(pulsar_split)
pulsar_test<-testing(pulsar_split)
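The idea behind `strata = class` can be sketched in base R: sample 75% of the rows within each class separately, so both subsets keep the same class proportions. This is an illustration on hypothetical labels, not how `initial_split()` is implemented internally.

```r
set.seed(2024)

# Hypothetical imbalanced labels: 900 non-pulsars and 100 pulsars
labels <- factor(rep(c("non_pulsar", "pulsar"), times = c(900, 100)))

# Group the row indices by class, then draw 75% of each group
idx_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}))

train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

# Class proportions in the training and testing subsets
prop.table(table(train_labels))
prop.table(table(test_labels))
```

Without stratification, a random split of a data set this imbalanced could leave the testing set with noticeably more or fewer pulsars than the training set.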

Counting the number of missing values

In [6]:
pulsar_train|>
    is.na()|>
    sum()
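Note that `is.na() |> sum()` counts missing cells, which can differ from the number of rows that contain a missing value. The toy data frame below is hypothetical and only illustrates the difference.

```r
# Hypothetical data frame: 3 missing cells spread over 2 rows
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c(5, NA, NA, 8))

# Total number of missing cells (the pattern used above)
n_missing_cells <- toy |> is.na() |> sum()

# Number of rows with at least one missing value
n_rows_with_missing <- sum(!complete.cases(toy))

c(cells = n_missing_cells, rows = n_rows_with_missing)
```

When the first count is zero, the second is zero as well, so either check is enough to confirm that a data set is complete.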

Getting the mean of each predictor variable

In [7]:
pulsar_train|>
    select(-class)|>
    map_df(mean)

| mean_profile | deviation_profile | kurtosis_profile | skewness_profile | mean_curve | deviation_curve | kurtosis_curve | skewness_curve |
|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 111.319 | 46.59561 | 0.4725106 | 1.754829 | 12.49007 | 26.29934 | 8.303455 | 104.7698 |


Counting the observations in each `class` with `group_by()` and `summarize()`, then getting summary statistics for every variable with `summary()`

In [8]:
pulsar_train|>
    group_by(class)|>
    summarize(n=n())
pulsar_train|>
    summary()

| class | n |
|---|---|
| `<fct>` | `<int>` |
| non_pulsar | 12197 |
| pulsar | 1226 |


  mean_profile    deviation_profile kurtosis_profile  skewness_profile 
 Min.   :  6.18   Min.   :24.77     Min.   :-1.8760   Min.   :-1.7647  
 1st Qu.:101.12   1st Qu.:42.47     1st Qu.: 0.0230   1st Qu.:-0.1894  
 Median :115.30   Median :46.99     Median : 0.2194   Median : 0.1905  
 Mean   :111.32   Mean   :46.60     Mean   : 0.4725   Mean   : 1.7548  
 3rd Qu.:127.33   3rd Qu.:51.06     3rd Qu.: 0.4721   3rd Qu.: 0.9184  
 Max.   :192.62   Max.   :98.78     Max.   : 8.0695   Max.   :68.1016  
   mean_curve       deviation_curve  kurtosis_curve   skewness_curve    
 Min.   :  0.2132   Min.   :  7.37   Min.   :-2.722   Min.   :  -1.977  
 1st Qu.:  1.9289   1st Qu.: 14.43   1st Qu.: 5.742   1st Qu.:  34.477  
 Median :  2.7985   Median : 18.49   Median : 8.429   Median :  83.013  
 Mean   : 12.4901   Mean   : 26.30   Mean   : 8.303   Mean   : 104.770  
 3rd Qu.:  5.4824   3rd Qu.: 28.46   3rd Qu.:10.682   3rd Qu.: 139.358  
 Max.   :223.3921   Max.   :110.64   Max.   :34.540   Max.

## Visualizing the data with at least one plot:
We will plot the 4 predictor variables from each group, first the integrated pulse profile and then the DM-SNR curve, using `ggpairs()` from the `GGally` library. It plots every variable in a group against the others, with observations colored by `class`.

In [None]:
options(repr.plot.width = 15, repr.plot.height = 15)

pulsar_train_plot <- pulsar_train|>
                     select(!mean_curve:skewness_curve) |>
                     ggpairs(aes(color = class, alpha = 0.5),
                         lower = list(combo = wrap("facethist", binwidth = 1))) +
                     labs(title = "Integrated Profile Variables Plotted Against Each Other") +
                     theme(text = element_text(size = 15))
pulsar_train_plot

In [None]:
options(repr.plot.width = 15, repr.plot.height = 15)

pulsar_train_plot_2 <- pulsar_train|>
                     select(!mean_profile:skewness_profile) |>
                     ggpairs(aes(color = class, alpha = 0.5),
                         lower = list(combo = wrap("facethist", binwidth = 1))) +
                     labs(title = "DM-SNR Curve Variables Plotted Against Each Other") +
                     theme(text = element_text(size = 15))
pulsar_train_plot_2

Comparing the two plots above, the integrated profile variables separate the two classes better than the DM-SNR curve variables do. Therefore, from this point on we will use the four integrated profile variables as our predictors.
Next, we need to center and scale the predictor variables.

In [None]:
pulsar_recipe<-recipe(class~mean_profile+deviation_profile+kurtosis_profile+skewness_profile,data=pulsar_train)|>
                    step_center(all_predictors())|>
                    step_scale(all_predictors())
pulsar_train_scaled<-pulsar_recipe|>
                        prep()|>
                        bake(pulsar_train)
pulsar_train_scaled
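What `step_center()` followed by `step_scale()` does to one column can be written out by hand: subtract the column mean, then divide by the column standard deviation. The vector `x` below is hypothetical.

```r
# Hypothetical values standing in for one predictor column
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Center (subtract the mean), then scale (divide by the standard deviation)
x_scaled <- (x - mean(x)) / sd(x)

# After standardizing, the mean is 0 and the standard deviation is 1
mean(x_scaled)
sd(x_scaled)

# Base R's scale() gives the same result
all.equal(x_scaled, as.numeric(scale(x)))
```

Standardizing matters for K-nearest neighbors because distances are otherwise dominated by whichever variable has the largest numeric range. In the recipe, the means and standard deviations are estimated from the training data when `prep()` is called.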

Lastly, we plot the standardized predictor variables against each other.

In [None]:
options(repr.plot.width=15, repr.plot.height=15)

pulsar_scaled_plot<-pulsar_train_scaled|>
                        ggpairs(aes(color=class, alpha=0.5),
                            lower=list(combo=wrap("facethist", binwidth=1)))+
                        labs(title= "Integrated Profile Variables (Standardized) Plotted Against Each Other")+
                        theme(text=element_text(size=20))
pulsar_scaled_plot

## Methods
We will use the K-nearest neighbors classification algorithm, with `mean_profile`, `deviation_profile`, `kurtosis_profile`, and `skewness_profile` as our predictor variables, and choose the value of K that gives the best predictions. We will visualize the results with scatter plots in which each observation is colored by class (`pulsar` or `non_pulsar`).
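As a rough illustration of how K-nearest neighbors classifies a new observation, the base R sketch below finds the K closest training points by Euclidean distance and takes a majority vote. The helper `knn_predict()` and the toy data are hypothetical; the actual analysis would use a `tidymodels` workflow rather than this function.

```r
# Hypothetical helper: classify one new point by majority vote
# among its k nearest training points (Euclidean distance)
knn_predict <- function(train_x, train_y, new_x, k) {
  # Distance from new_x to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k closest training points
  nearest <- order(dists)[seq_len(k)]
  # Count the class labels among those neighbors; ties go to the first level
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Hypothetical standardized training data with two predictors
train_x <- matrix(c(-1.0, -0.8,
                    -0.9, -1.1,
                    -1.2, -0.7,
                     1.0,  0.9,
                     1.1,  1.2,
                     0.8,  1.0),
                  ncol = 2, byrow = TRUE)
train_y <- factor(c("non_pulsar", "non_pulsar", "non_pulsar",
                    "pulsar", "pulsar", "pulsar"))

knn_predict(train_x, train_y, new_x = c(0.9, 1.1), k = 3)
knn_predict(train_x, train_y, new_x = c(-1.0, -1.0), k = 3)
```

The first call returns `"pulsar"` and the second `"non_pulsar"`, because in each case all three nearest neighbors share that label. Choosing K then amounts to repeating this kind of prediction for several K values and comparing how often the predictions are correct.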
## Expected Outcomes and Significance:
We expect to be able to accurately and precisely determine whether an observed signal comes from a pulsar, using the K-nearest neighbors classification algorithm with the best K value. We hope that doing so can make an impact on astronomy and astronomical research. This could lead to further questions about the physics driving pulsar behavior and about how pulsars form and evolve within galaxies.
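"Accurately and precisely" can be made measurable with a confusion matrix. The sketch below computes accuracy, precision, and recall for the `pulsar` class in base R; the `truth` and `pred` vectors are hypothetical, not results from our model.

```r
lv <- c("non_pulsar", "pulsar")

# Hypothetical true labels: 3 pulsars and 7 non-pulsars
truth <- factor(c("pulsar", "pulsar", "pulsar",
                  rep("non_pulsar", 7)), levels = lv)

# Hypothetical predictions: one pulsar missed, one false alarm
pred <- factor(c("pulsar", "pulsar", "non_pulsar",
                 "pulsar", rep("non_pulsar", 6)), levels = lv)

# Rows are predictions, columns are true labels
confusion <- table(predicted = pred, actual = truth)

tp <- confusion["pulsar", "pulsar"]          # pulsars correctly flagged
fp <- confusion["pulsar", "non_pulsar"]      # false alarms
fn <- confusion["non_pulsar", "pulsar"]      # missed pulsars
tn <- confusion["non_pulsar", "non_pulsar"]  # non-pulsars correctly rejected

accuracy  <- (tp + tn) / sum(confusion)
precision <- tp / (tp + fp)  # of predicted pulsars, how many are real
recall    <- tp / (tp + fn)  # of real pulsars, how many were found

c(accuracy = accuracy, precision = precision, recall = recall)
```

With so few pulsars in the data, precision and recall say more about how useful the classifier is than accuracy alone.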
## References:
R. J. Lyon, HTRU2, DOI: 10.6084/m9.figshare.3080389.v1.

"What Are Pulsars?" Space.com, 24 Jan. 2023, www.space.com/32661-pulsars.html.