<h3>Group 004 25 Project Proposal</h3>
<br>

A pulsar is a rapidly rotating neutron star that emits powerful beams of light from its magnetic poles. The beam of emission rotates with the star and is visible only when it crosses our line of sight. When the beam points towards the Earth, it produces a detectable pattern of broadband radio emission. “As the pulsar rotates, this pattern repeats periodically. Thus pulsar search involves looking for periodic radio signals with large radio telescopes.” (Shaw, 2021). In practice, however, radio telescopes searching for pulsars also pick up many signals caused by radio frequency interference (RFI) and/or noise, which makes legitimate signals hard to find.

Our goal in this project is to build a K-nearest neighbor classifier that predicts whether a signal is from a pulsar star or caused by RFI and/or noise (nonpulsar). 
<br>
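To make the idea of a K-nearest neighbor classifier concrete, here is a minimal base R sketch of the prediction step. The toy data, the column names `x1` and `x2`, and the helper `knn_predict` are all made up for illustration; they are not HTRU2 values and not necessarily how the final model will be built.

```r
# Toy training data (made-up values, NOT taken from HTRU2).
train <- data.frame(
  x1 = c(1.0, 1.2, 0.8, 5.0, 5.3, 4.9),
  x2 = c(1.1, 0.9, 1.0, 5.1, 4.8, 5.2),
  label = factor(c("nonpulsar", "nonpulsar", "nonpulsar",
                   "pulsar", "pulsar", "pulsar"))
)

# Hypothetical helper: classify one new point by majority vote
# among its k nearest training observations.
knn_predict <- function(train, new_point, k = 3) {
  features <- as.matrix(train[, c("x1", "x2")])
  # Euclidean distance from the new point to every training observation.
  distances <- sqrt(rowSums(sweep(features, 2, new_point)^2))
  # Labels of the k closest observations.
  nearest <- train$label[order(distances)[seq_len(k)]]
  # The most frequent label among those neighbors wins.
  names(which.max(table(nearest)))
}

prediction <- knn_predict(train, c(4.7, 5.0), k = 3)
prediction
```

Because the prediction depends on distances, predictors measured on very different scales are usually standardized before fitting a K-nearest neighbor model.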

The dataset we will use is HTRU2, which describes a sample of pulsar candidates (potential signal detections) collected during the High Time Resolution Universe Survey (South). 
<br>
<br>
This dataset contains 17898 observations and the following 9 variables:

- Mean of the integrated profile.
- Standard deviation of the integrated profile.
- Excess kurtosis of the integrated profile.
- Skewness of the integrated profile.
- Mean of the DM-SNR curve.
- Standard deviation of the DM-SNR curve.
- Excess kurtosis of the DM-SNR curve.
- Skewness of the DM-SNR curve.
- Class
<br>

The first eight variables describe characteristics of the signal. The Class variable is categorical, with the categories 0 (nonpulsar) and 1 (pulsar), and it will be our target variable.

In [1]:
### Run this cell before continuing.
library(tidyverse)
library(repr)
library(tidymodels)
library(caret)
options(repr.matrix.max.rows = 6)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2

<br>
We have downloaded the HTRU_2 data set; now let's read it in. The file has no header row, so we supply the column names ourselves.

In [2]:
path <- "data/HTRU_2.csv"

pulsar_data <- read_csv(path, col_names = c("mean_ip", "standard_deviation_ip", 
                                      "excess_kurtosis_ip", "skewness_ip",
                                      "mean_c", "standard_deviation_c", 
                                      "excess_kurtosis_c", "skewness_c",
                                      "is_pulsar"))
pulsar_data

Rows: 17898 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): mean_ip, standard_deviation_ip, excess_kurtosis_ip, skewness_ip, me...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


mean_ip,standard_deviation_ip,excess_kurtosis_ip,skewness_ip,mean_c,standard_deviation_c,excess_kurtosis_c,skewness_c,is_pulsar
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
140.5625,55.68378,-0.2345714,-0.6996484,3.199833,19.11043,7.975532,74.24222,0
102.5078,58.88243,0.4653182,-0.5150879,1.677258,14.86015,10.576487,127.39358,0
103.0156,39.34165,0.3233284,1.0511644,3.121237,21.74467,7.735822,63.17191,0
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
119.3359,59.93594,0.1593631,-0.74302540,21.430602,58.87200,2.499517,4.595173,0
114.5078,53.90240,0.2011614,-0.02478884,1.946488,13.38173,10.007967,134.238910,0
57.0625,85.79734,1.4063910,0.08951971,188.306020,64.71256,-1.597527,1.429475,0


 <br>
The values in the `is_pulsar` column are stored as doubles, so we will convert the column to a factor, which is the type a classification model expects for a categorical target.

In [4]:
pulsar_data <- mutate(pulsar_data, is_pulsar = as_factor(is_pulsar))
pulsar_data

mean_ip,standard_deviation_ip,excess_kurtosis_ip,skewness_ip,mean_c,standard_deviation_c,excess_kurtosis_c,skewness_c,is_pulsar
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
140.5625,55.68378,-0.2345714,-0.6996484,3.199833,19.11043,7.975532,74.24222,0
102.5078,58.88243,0.4653182,-0.5150879,1.677258,14.86015,10.576487,127.39358,0
103.0156,39.34165,0.3233284,1.0511644,3.121237,21.74467,7.735822,63.17191,0
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
119.3359,59.93594,0.1593631,-0.74302540,21.430602,58.87200,2.499517,4.595173,0
114.5078,53.90240,0.2011614,-0.02478884,1.946488,13.38173,10.007967,134.238910,0
57.0625,85.79734,1.4063910,0.08951971,188.306020,64.71256,-1.597527,1.429475,0
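As a quick sanity check on this kind of conversion, the sketch below uses base R's `factor()` (swapped in here for `as_factor()` so that it runs without extra packages) on a made-up vector of 0/1 labels. The vector and variable names are hypothetical.

```r
# Made-up 0/1 labels standing in for the is_pulsar column.
labels_dbl <- c(0, 1, 0, 0, 1)

# Convert to a factor with an explicit level order: 0 first, then 1.
labels_fct <- factor(labels_dbl, levels = c(0, 1))

is.factor(labels_fct)   # confirms the type changed
levels(labels_fct)      # the levels are stored as character strings
```

Setting `levels` explicitly keeps the level order fixed regardless of which value happens to appear first in the data.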


<br>
Next, let's examine how the number of observations is distributed among pulsars (1) and nonpulsars (0).

In [5]:
pulsar_distribution <- pulsar_data |>
    group_by(is_pulsar) |>
    summarize(n())

pulsar_distribution

is_pulsar,n()
<fct>,<int>
0,16259
1,1639
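To put these counts in perspective, the following sketch turns them into class proportions and an imbalance ratio. The counts are copied from the table above; the variable names are just for illustration.

```r
# Class counts taken from the summary table above.
counts <- c(nonpulsar = 16259, pulsar = 1639)

# Share of each class among all observations.
proportions <- counts / sum(counts)
round(100 * proportions, 2)

# How many nonpulsars there are for every pulsar.
imbalance_ratio <- counts[["nonpulsar"]] / counts[["pulsar"]]
imbalance_ratio
```

Pulsars make up only about 9% of the observations, so a classifier that always predicted "nonpulsar" would already be right about 91% of the time.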


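The observations are unevenly distributed between the two classes: nonpulsars outnumber pulsars roughly ten to one. Additionally, plotting all 17,898 observations would cause serious overplotting. One solution to both problems is the `downSample` function from the `caret` package, which randomly samples the majority class down to the size of the minority class.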

In [6]:
pulsar_balanced <- pulsar_data |>
    downSample(pulsar_data$is_pulsar)

pulsar_balanced

mean_ip,standard_deviation_ip,excess_kurtosis_ip,skewness_ip,mean_c,standard_deviation_c,excess_kurtosis_c,skewness_c,is_pulsar,Class
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<fct>
139.0391,51.27109,-0.3986428,-0.18010645,1.637960,11.31409,12.904173,232.82199,0,0
119.5312,48.09056,0.3588836,0.52593872,7.802676,34.48856,4.623729,20.97594,0,0
126.4141,50.79378,-0.2143957,-0.02138971,3.409699,22.95370,7.723231,62.71614,0,0
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
71.01562,33.65728,2.378286,9.636837,15.820234,52.43958,3.382535,10.303138,1,1
96.32031,46.13667,1.033362,1.625057,4.243311,26.74649,7.110978,52.701218,1,1
45.09375,28.60956,4.156460,26.198202,34.565217,67.78225,1.872010,2.171717,1,1
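Note that the result above carries a `Class` column that duplicates `is_pulsar`, because the full data frame (target included) was passed as the predictors. The sketch below shows one possible way to avoid the duplicate and make the random sampling reproducible. It runs on a small made-up data frame, and the seed value is an arbitrary assumption.

```r
library(caret)

# downSample() samples at random; fixing a seed (arbitrary value, an
# assumption here) makes the result reproducible.
set.seed(2023)

# Made-up toy data: six nonpulsars (0) and two pulsars (1).
toy <- data.frame(
  mean_ip   = c(140.6, 102.5, 103.0, 119.3, 114.5, 57.1, 71.0, 96.3),
  is_pulsar = factor(c(0, 0, 0, 0, 0, 0, 1, 1))
)

# Pass only the predictors as x and the target as y; yname names the
# target column in the output (the default name is "Class").
toy_balanced <- downSample(
  x = toy[, "mean_ip", drop = FALSE],
  y = toy$is_pulsar,
  yname = "is_pulsar"
)

table(toy_balanced$is_pulsar)
```

With this call the output keeps a single target column named `is_pulsar`, and each class is reduced to the size of the smaller one.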


Now let's examine the distribution of the observations among the classes in our new data frame.

In [7]:
pulsar_balanced_distribution <- pulsar_balanced |>
    group_by(is_pulsar) |>
    summarize(n())

pulsar_balanced_distribution

is_pulsar,n()
<fct>,<int>
0,1639
1,1639


We can see that our data is now balanced. It has fewer observations (3,278 in total), but that is still a decent number to work with.