# DSCI Group Project Proposal - Pulsar Stars

_Group 66 (Andrew Ahn, Calvin Choi, Allan Hu, Dishika Taneja)_

### Introduction

Pulsar stars are rapidly rotating neutron stars that emit electromagnetic waves from their poles. As a pulsar spins, its "beam" of electromagnetic waves sweeps across Earth. The period between sweeps is useful to scientists, for example as a probe of space-time or as a precise clock.

These stars can be detected and measured from the radio waves they emit, but because they are so far away and there are so many candidates, determining whether a given star is a pulsar can be very time-consuming. Furthermore, many of the radio signals our instruments receive are caused by interference or noise and do not represent a pulsar star at all. In this project, we will try to build a model that predicts whether a radio observation is a pulsar star, allowing rapid analysis of radio wave data coming from space.

To do this, we will use the HTRU2 dataset from the UC Irvine Machine Learning Repository. This dataset contains 17,898 radio signal observations, of which 1,639 are actual pulsar stars and 16,259 are noise. Each observation records the mean, standard deviation, excess kurtosis, and skewness of both the pulse profile and the DM-SNR curve of the signal.

(temp note: See this citation for more information about the pulse profile and DM-SNR curve:

R. J. Lyon, 'Why Are Pulsars Hard To Find?', PhD Thesis, University of Manchester, 2016.)

### Abstract (Preliminary exploratory data analysis)

In [5]:
library(tidyverse)
library(tidymodels)
library(GGally)
options(repr.matrix.max.rows = 6)




In [14]:

# Read the headerless CSV and give the nine columns descriptive names.
signals <- read_csv("data/HTRU_2.csv", col_names = FALSE) |>
    rename(profile_mean = X1, profile_std = X2, profile_exk = X3, profile_skew = X4,
           curve_mean = X5, curve_std = X6, curve_exk = X7, curve_skew = X8, class = X9)
# Treat the class label (1 = pulsar, 0 = noise) as a factor for classification.
signals_set <- signals |>
    mutate(class = as_factor(class))
signals_set

Rows: 17898 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): X1, X2, X3, X4, X5, X6, X7, X8, X9

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| profile_mean | profile_std | profile_exk | profile_skew | curve_mean | curve_std | curve_exk | curve_skew | class |
|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 140.5625 | 55.68378 | -0.2345714 | -0.6996484 | 3.199833 | 19.11043 | 7.975532 | 74.24222 | 0 |
| 102.5078 | 58.88243 | 0.4653182 | -0.5150879 | 1.677258 | 14.86015 | 10.576487 | 127.39358 | 0 |
| 103.0156 | 39.34165 | 0.3233284 | 1.0511644 | 3.121237 | 21.74467 | 7.735822 | 63.17191 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 119.3359 | 59.93594 | 0.1593631 | -0.74302540 | 21.430602 | 58.87200 | 2.499517 | 4.595173 | 0 |
| 114.5078 | 53.90240 | 0.2011614 | -0.02478884 | 1.946488 | 13.38173 | 10.007967 | 134.238910 | 0 |
| 57.0625 | 85.79734 | 1.4063910 | 0.08951971 | 188.306020 | 64.71256 | -1.597527 | 1.429475 | 0 |


In order, the first four columns contain the mean, standard deviation, excess kurtosis, and skewness of the integrated profile, and the next four contain the same statistics for the DM-SNR curve. The last column holds the class, where 1 indicates a pulsar star and 0 indicates noise or interference.

In [24]:
# Hold out 25% of the observations for testing, stratified by class so that
# both sets keep a similar ratio of pulsars to noise.
# The split is random; calling set.seed() beforehand makes it reproducible.
signals_split <- initial_split(signals_set, prop = 0.75, strata = class)
signals_training <- training(signals_split)
signals_testing <- testing(signals_split)
signals_training

# Mean of each predictor in the training set; can be used to decide if normalization is needed.
signals_means <- signals_training |>
    select(!class) |>
    map_df(mean)
signals_means

# Number of pulsars and noise signals in the training set.
signals_class <- signals_training |>
    group_by(class) |>
    summarize(n = n())
signals_class

| profile_mean | profile_std | profile_exk | profile_skew | curve_mean | curve_std | curve_exk | curve_skew | class |
|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 140.5625 | 55.68378 | -0.2345714 | -0.6996484 | 3.199833 | 19.11043 | 7.975532 | 74.24222 | 0 |
| 102.5078 | 58.88243 | 0.4653182 | -0.5150879 | 1.677258 | 14.86015 | 10.576487 | 127.39358 | 0 |
| 103.0156 | 39.34165 | 0.3233284 | 1.0511644 | 3.121237 | 21.74467 | 7.735822 | 63.17191 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 119.3359 | 59.93594 | 0.1593631 | -0.74302540 | 21.430602 | 58.87200 | 2.499517 | 4.595173 | 0 |
| 114.5078 | 53.90240 | 0.2011614 | -0.02478884 | 1.946488 | 13.38173 | 10.007967 | 134.238910 | 0 |
| 57.0625 | 85.79734 | 1.4063910 | 0.08951971 | 188.306020 | 64.71256 | -1.597527 | 1.429475 | 0 |


| profile_mean | profile_std | profile_exk | profile_skew | curve_mean | curve_std | curve_exk | curve_skew |
|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 111.1094 | 46.56328 | 0.4802872 | 1.790632 | 12.71011 | 26.40552 | 8.281586 | 104.3612 |


| class | n |
|---|---|
| `<fct>` | `<int>` |
| 0 | 12186 |
| 1 | 1237 |
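
The class counts above show that noise observations heavily outnumber pulsars in the training set. As a minimal sketch of how this summary could be extended, the code below adds the share of each class and counts the rows that contain missing values. `toy_training` is a hypothetical stand-in for `signals_training` with made-up values, so the block runs without the data file; the names `class_summary` and `rows_with_missing` are likewise assumptions.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for signals_training; the values are made up.
toy_training <- tibble(
    profile_mean = c(140.6, 102.5, NA, 57.1, 99.4, 30.2),
    curve_mean   = c(3.2, 1.7, 3.1, 188.3, NA, 95.0),
    class        = factor(c(0, 0, 0, 0, 1, 1))
)

# Count the observations in each class and add each class's share of the total.
class_summary <- toy_training |>
    group_by(class) |>
    summarize(n = n()) |>
    mutate(proportion = n / sum(n))
class_summary

# Count the rows that have a missing value in at least one column.
rows_with_missing <- sum(!complete.cases(toy_training))
rows_with_missing
```

Replacing `toy_training` with `signals_training` would produce the same kind of summary for the real training set.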


- Demonstrate that the dataset can be read from the web into R 
- Clean and wrangle your data into a tidy format
- Using only **training data**, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data.
- Using only **training data**, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.
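
One possible way to approach the visualization item is to compare the distribution of each predictor between the two classes. The sketch below does this with faceted histograms. It uses simulated values in a hypothetical `toy_training` tibble rather than the real `signals_training` data, and the means and standard deviations passed to `rnorm()` are arbitrary assumptions chosen only so the plot has two visible groups.

```r
library(tidyverse)

# Hypothetical stand-in for signals_training; the values are simulated, not real.
set.seed(1)
toy_training <- tibble(
    profile_mean = c(rnorm(90, mean = 115, sd = 20), rnorm(10, mean = 55, sd = 30)),
    profile_exk  = c(rnorm(90, mean = 0.2, sd = 0.3), rnorm(10, mean = 3, sd = 1.5)),
    class        = factor(rep(c(0, 1), times = c(90, 10)))
)

# Reshape to long format so that every predictor can get its own panel.
predictor_long <- toy_training |>
    pivot_longer(cols = !class, names_to = "predictor", values_to = "value")

# Overlaid histograms per class, one panel per predictor, each with its own axes.
predictor_plot <- ggplot(predictor_long, aes(x = value, fill = class)) +
    geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
    facet_wrap(vars(predictor), scales = "free") +
    labs(x = "Value", y = "Number of observations", fill = "Class (1 = pulsar)")
predictor_plot
```

Free scales are used because the predictors sit on very different ranges, as the table of means suggests.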

### Methods

Explain how you will conduct your data analysis and which variables/columns you will use.

After reading about the DM-SNR curve and the integrated profile, we believe the following:

- We can use statistical tests, e.g. a t-test, to find the right variables to use for our predictive analysis.
- Alternatively, we can graph the variables and compute R squared to find which ones best explain a relationship.
- The expectations say to create a visualization but do not require ggplot, so we could build it in Tableau or Power BI if we want to be extra fancy.
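
The t-test idea in the first bullet could be sketched as follows. This example runs Welch's two-sample t-test (the default of `t.test()` in base R) on each predictor to check whether its mean differs between the two classes. The `toy_training` data frame is a hypothetical stand-in with simulated values, and the `predictors` vector is an assumed selection, not a final choice of variables.

```r
# Hypothetical stand-in for signals_training; the values are simulated, not real.
set.seed(1)
toy_training <- data.frame(
    profile_mean = c(rnorm(90, mean = 115, sd = 20), rnorm(10, mean = 55, sd = 30)),
    curve_mean   = c(rnorm(90, mean = 10, sd = 5), rnorm(10, mean = 50, sd = 20)),
    class        = factor(rep(c(0, 1), times = c(90, 10)))
)

predictors <- c("profile_mean", "curve_mean")

# For each predictor, test whether its mean differs between class 0 and class 1.
p_values <- sapply(predictors, function(name) {
    test_formula <- reformulate("class", response = name)  # e.g. profile_mean ~ class
    t.test(test_formula, data = toy_training)$p.value
})
p_values
```

A small p-value only indicates that the class means differ. With thousands of observations even tiny differences can be significant, so this would be a first screen rather than proof that a variable is a useful predictor.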

### Hypothesis (Expected outcomes and significance)