In [6]:
### Run this cell before continuing.
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
# The course helper scripts are optional; load them only if they are present.
if (file.exists("tests.R")) source("tests.R")
if (file.exists("cleanup.R")) source("cleanup.R")


<h3>Introduction:</h3>

<b>Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal.</b>
<br>
According to the dataset's README file, pulsars are a rare type of neutron star that are useful as probes of space-time, the interstellar medium, and states of matter. Astronomers search for them by looking for radio signals with telescopes. Signals caused by RFI (radio frequency interference) or noise are hard to distinguish from those of real pulsars. Identifying pulsars is therefore a classification problem: each candidate must be assigned to the pulsar or non-pulsar class based on certain variables.

<b>Clearly state the question you will try to answer with your project</b>
<br>
We will classify pulsar candidates as pulsar or non-pulsar to facilitate the exploration and identification of pulsars through data science techniques.
The question: based on certain characteristic variables, is a given candidate a pulsar or a non-pulsar?
<br>

<b>Identify and describe the dataset that will be used to answer the question</b>
- We will be using the dataset available at this link: https://archive.ics.uci.edu/dataset/372/htru2
- Here is the citation for the dataset: 
R. J. Lyon, B. W. Stappers, S. Cooper, J. M. Brooke, J. D. Knowles, Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach, MNRAS, 2016.

- The dataset is titled HTRU2 and contains 17,898 observations of the following 8 variables, as described in the dataset README file:
    - Mean of the integrated profile.
    - Standard deviation of the integrated profile.
    - Excess kurtosis of the integrated profile.
    - Skewness of the integrated profile.
    - Mean of the DM-SNR curve.
    - Standard deviation of the DM-SNR curve.
    - Excess kurtosis of the DM-SNR curve.
    - Skewness of the DM-SNR curve.
- An additional binary class column indicates whether the observation has been classified as a pulsar (class == 1) or not (class == 0).
- The classes are imbalanced: there are only 1,639 real pulsar examples against an overwhelming 16,259 non-pulsar examples (RFI/noise). This will need to be accounted for in the analysis.
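To make the imbalance concrete, the class counts quoted above can be turned into proportions. This is a minimal sketch that types the two counts in by hand rather than reading the data file. The name `class_counts` is only illustrative.

```r
library(tibble)
library(dplyr)

# Class counts as stated in the dataset description (entered by hand here)
class_counts <- tibble(
  is_pulsar = c(0, 1),
  n = c(16259, 1639)
) |>
  mutate(proportion = n / sum(n))  # share of each class among all observations

class_counts
```

Roughly 9% of the observations are pulsars, so a classifier that always predicts "non-pulsar" would already be about 91% accurate. Accuracy alone is therefore likely to be a misleading measure for this problem.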


<h3> Preliminary exploratory data analysis:</h3>

- Demonstrate that the dataset can be read from the web into R 
- Clean and wrangle your data into a tidy format
- Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 
- Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.


In [23]:
# Reading data
pulsar_data <- read_csv("data/HTRU_2.csv", 
                        col_names = c("mean_ip", "standard_deviation_ip", 
                                      "excess_kurtosis_ip", "skewness_ip",
                                      "mean_c", "standard_deviation_c", 
                                      "excess_kurtosis_c", "skewness_c",
                                      "is_pulsar"
                                     ))
pulsar_data

Rows: 17898 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): mean_ip, standard_deviation_ip, excess_kurtosis_ip, skewness_ip, me...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


mean_ip,standard_deviation_ip,excess_kurtosis_ip,skewness_ip,mean_c,standard_deviation_c,excess_kurtosis_c,skewness_c,is_pulsar
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
140.5625,55.68378,-0.2345714,-0.6996484,3.199833,19.11043,7.975532,74.24222,0
102.5078,58.88243,0.4653182,-0.5150879,1.677258,14.86015,10.576487,127.39358,0
103.0156,39.34165,0.3233284,1.0511644,3.121237,21.74467,7.735822,63.17191,0
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
119.3359,59.93594,0.1593631,-0.74302540,21.430602,58.87200,2.499517,4.595173,0
114.5078,53.90240,0.2011614,-0.02478884,1.946488,13.38173,10.007967,134.238910,0
57.0625,85.79734,1.4063910,0.08951971,188.306020,64.71256,-1.597527,1.429475,0
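Since the exploratory analysis is meant to use only training data, one possible next step is to convert `is_pulsar` to a factor, split the data, and summarize the training portion. The sketch below runs on a small toy tibble with made-up values so that it is self-contained. The seed, the 75/25 split, the choice of two predictors, and all object names are assumptions rather than decisions made in this proposal.

```r
library(tidyverse)
library(tidymodels)

set.seed(1)  # arbitrary seed (assumption) so the split is reproducible

# Toy stand-in for pulsar_data: hypothetical values, real column names
toy_pulsar <- tibble(
  mean_ip = c(rnorm(170, mean = 115, sd = 15), rnorm(30, mean = 55, sd = 20)),
  mean_c = c(rnorm(170, mean = 5, sd = 2), rnorm(30, mean = 50, sd = 20)),
  is_pulsar = rep(c(0, 1), times = c(170, 30))
) |>
  mutate(is_pulsar = factor(is_pulsar, levels = c(0, 1)))  # class label as a factor

# Stratified split so both classes appear in the training and testing sets
pulsar_split <- initial_split(toy_pulsar, prop = 0.75, strata = is_pulsar)
pulsar_train <- training(pulsar_split)
pulsar_test <- testing(pulsar_split)

# Training-only summary: class counts, predictor means, rows with missing values
pulsar_summary <- pulsar_train |>
  group_by(is_pulsar) |>
  summarize(
    n = n(),
    mean_of_mean_ip = mean(mean_ip),
    mean_of_mean_c = mean(mean_c),
    rows_with_na = sum(is.na(mean_ip) | is.na(mean_c))
  )

pulsar_summary
```

With the real data, `toy_pulsar` would be replaced by `pulsar_data` and the summary extended to whichever predictors are chosen.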


In [24]:
# Cleaning and wrangling: pivot the two mean columns into long format,
# with the source column name in `type` and its value in `mean`
pulsar_pivot_after_mean <- pulsar_data |>
    pivot_longer(cols = starts_with("mean"),
                 names_to = "type",
                 values_to = "mean")

pulsar_pivot_after_mean


standard_deviation_ip,excess_kurtosis_ip,skewness_ip,standard_deviation_c,excess_kurtosis_c,skewness_c,is_pulsar,type,mean
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>
55.68378,-0.2345714,-0.6996484,19.11043,7.975532,74.24222,0,mean_ip,140.562500
55.68378,-0.2345714,-0.6996484,19.11043,7.975532,74.24222,0,mean_c,3.199833
58.88243,0.4653182,-0.5150879,14.86015,10.576487,127.39358,0,mean_ip,102.507812
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
53.90240,0.2011614,-0.02478884,13.38173,10.007967,134.238910,0,mean_c,1.946488
85.79734,1.4063910,0.08951971,64.71256,-1.597527,1.429475,0,mean_ip,57.062500
85.79734,1.4063910,0.08951971,64.71256,-1.597527,1.429475,0,mean_c,188.306020
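The same `pivot_longer()` idea can be extended to all predictors at once, which makes it easy to compare their distributions by class in a single faceted plot. The sketch below is one way to do this, using a toy training set with made-up values. The object names, the three predictors shown, and the plot settings are assumptions.

```r
library(tidyverse)

set.seed(3)  # arbitrary seed (assumption) for reproducible toy values

# Toy stand-in for the training data: hypothetical values, real column names
toy_train <- tibble(
  mean_ip = c(rnorm(150, mean = 115, sd = 15), rnorm(50, mean = 55, sd = 20)),
  excess_kurtosis_ip = c(rnorm(150, mean = 0.2, sd = 0.3), rnorm(50, mean = 3, sd = 1)),
  mean_c = c(rnorm(150, mean = 5, sd = 2), rnorm(50, mean = 50, sd = 20)),
  is_pulsar = factor(rep(c("0", "1"), times = c(150, 50)))
)

# One row per (observation, predictor) pair, keeping the class label
toy_long <- toy_train |>
  pivot_longer(cols = -is_pulsar,
               names_to = "predictor",
               values_to = "value")

# Overlaid histograms per class, one panel per predictor with its own axes
predictor_plot <- ggplot(toy_long, aes(x = value, fill = is_pulsar)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  facet_wrap(vars(predictor), scales = "free") +
  labs(x = "Predictor value", y = "Count", fill = "Is pulsar?")

predictor_plot
```

Predictors whose two histograms barely overlap are likely to be the most useful for classification.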


<h3>Methods:</h3>

Explain how you will conduct your data analysis and which variables/columns you will use. Note - you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable think: is this a useful variable for prediction?
Describe at least one way that you will visualize the results
Answer: A confusion matrix showing the numbers of true positives, false positives, true negatives, and false negatives.


<h3>Expected outcomes and significance:</h3>

What do you expect to find?
Answer: We expect to find out how well our K-NN classifier predicts whether a candidate is a pulsar or a non-pulsar based on our predictor variables.
What impact could such findings have?
Answer: If we find that our classifier is appropriate for this pulsar classification, we can conclude that the selected variables differ distinctly between pulsars and non-pulsars.
What future questions could this lead to?
Answer: A future question might be how well other types of classifiers distinguish pulsars from non-pulsars.
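
The K-NN classifier and confusion matrix mentioned above could be put together roughly as follows. This is a sketch on toy data with made-up values, not the final analysis. The two predictors, `neighbors = 5`, the seed, and the split proportion are placeholders, and the `kknn` package is assumed to be installed.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024)  # arbitrary seed (assumption) for reproducibility

# Toy stand-in for the pulsar data: hypothetical values, real column names
toy_pulsar <- tibble(
  mean_ip = c(rnorm(150, mean = 115, sd = 15), rnorm(50, mean = 55, sd = 20)),
  excess_kurtosis_ip = c(rnorm(150, mean = 0.2, sd = 0.3), rnorm(50, mean = 3, sd = 1)),
  is_pulsar = factor(rep(c("0", "1"), times = c(150, 50)))
)

toy_split <- initial_split(toy_pulsar, prop = 0.75, strata = is_pulsar)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Standardize predictors so no variable dominates the distance calculation
knn_recipe <- recipe(is_pulsar ~ ., data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-NN specification; neighbors = 5 is a placeholder, not a tuned value
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_train)

# Predict on the held-out test set and tabulate predictions against the truth
toy_predictions <- predict(knn_fit, toy_test) |>
  bind_cols(toy_test)

toy_conf_mat <- toy_predictions |>
  conf_mat(truth = is_pulsar, estimate = .pred_class)

toy_conf_mat
```

In the real analysis the number of neighbors would presumably be chosen by cross-validation on the training set. Because of the class imbalance, the false negative count in the matrix is probably more informative than overall accuracy.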
