<h2>Group 004 25 Project Report</h2>
<br>

A pulsar is a rapidly rotating neutron star that emits powerful beams of electromagnetic radiation from its magnetic poles. The beam rotates with the star and is visible only when it crosses our line of sight. When the beam points toward the Earth, it produces a detectable pattern of broadband radio emission. “As the pulsar rotates, this pattern repeats periodically. Thus pulsar search involves looking for periodic radio signals with large radio telescopes” (Shaw, 2021). In practice, however, radio telescopes searching for pulsars also receive many signals caused by radio frequency interference (RFI) and noise, which makes legitimate pulsar signals hard to find.

Our goal in this project is to build a K-nearest neighbor classifier that predicts whether a candidate signal comes from a pulsar or is a nonpulsar detection caused by RFI and/or noise.
<br>

After completing this project, we will be able to answer the following question: Based on the variables that we choose, can we accurately predict whether a candidate is a pulsar or a nonpulsar signal? 
<br>

The dataset that we will be using is HTRU2, which describes a sample of pulsar candidates (signal detections that come from either pulsars or RFI/noise) collected during the High Time Resolution Universe Survey.
<br>
<br>
This dataset contains 17898 observations and the following 9 variables:

- Mean of the integrated profile.
- Standard deviation of the integrated profile.
- Excess kurtosis of the integrated profile.
- Skewness of the integrated profile.
- Mean of the DM-SNR curve.
- Standard deviation of the DM-SNR curve.
- Excess kurtosis of the DM-SNR curve.
- Skewness of the DM-SNR curve.
- Class
<br>

The first eight variables describe characteristics of the signal, and the Class variable is a categorical variable with the categories 0 (nonpulsar) and 1 (pulsar). The Class variable will be our target variable.

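First, we install the `kknn` package (if needed) and load the libraries.

In [None]:
# Install kknn only if it is not already available
if (!requireNamespace("kknn", quietly = TRUE)) install.packages("kknn")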

In [2]:
# Loading the libraries
library(tidyverse)
library(repr)
library(tidymodels)
library(kknn)
options(repr.matrix.max.rows = 6)


Now we obtain our dataset from the web using a URL.

In [None]:
# Reading data from url
url <- "https://raw.githubusercontent.com/aravic03/group-project-proposal/main/HTRU_2.csv"
pulsar_data <- read_csv(url,
                        # Add column names to the data frame.
                        col_names = c("mean_ip", "standard_deviation_ip",
                                      "excess_kurtosis_ip", "skewness_ip",
                                      "mean_c", "standard_deviation_c",
                                      "excess_kurtosis_c", "skewness_c",
                                      "is_pulsar")) |>
    mutate(is_pulsar = as_factor(is_pulsar)) # Convert our target variable to a factor.
pulsar_data

Since we will be performing classification, we split our data into a training and a testing set.

In [None]:
set.seed(1) # set the seed

# Splitting the data into training and testing sets (60% and 40%)
pulsar_data_split <- initial_split(pulsar_data, prop = 0.6, strata = is_pulsar)
pulsar_training <- training(pulsar_data_split)
pulsar_testing <- testing(pulsar_data_split)

pulsar_training

We will begin our preliminary data analysis process by examining the number of observations we have in the training set for each class.

In [None]:
# Count the observations in each class and express each count as a percentage
num_obs_training <- pulsar_training |>
    group_by(is_pulsar) |>
    summarize(n = n()) |>
    mutate(percentage = 100 * n / nrow(pulsar_training))
num_obs_training

Next, we will look at whether our variables contain missing values.

In [None]:
# Count the missing values in every column
na_counts <- pulsar_training |>
    summarize(across(everything(), ~ sum(is.na(.x))))
na_counts

As shown above, there are no missing values in any of the variables.

We now begin to choose our predictor variables. First, we plot eight histograms, one for each variable other than our target variable `is_pulsar`, to show how the values of that variable are distributed in each class.

In [None]:
# Visualizing the distribution of values in each variable using eight histograms

# Histogram 1: Distribution of the Mean of the Integrated Profile
mean_ip_plot <- ggplot(pulsar_training, aes(x = mean_ip, fill = is_pulsar)) +
  geom_histogram(color = "blue") +
  labs(title = "Distribution of Mean of the Integrated Profile", 
       x = "Mean of the Integrated Profile", 
       fill = "Pulsar")     
mean_ip_plot

# Histogram 2: Distribution of the Standard Deviation of the Integrated Profile
sd_ip_plot <- ggplot(pulsar_training, aes(x = standard_deviation_ip, fill = is_pulsar)) +
  geom_histogram(color = "blue") +
  labs(title = "Standard Deviation Integrated Profile Distribution", 
       x = "Standard Deviation Integrated Profile Values", 
       fill = "Pulsar")     
sd_ip_plot

# Histogram 3: Distribution of the Excess Kurtosis of the Integrated Profile
excess_kurtosis_ip_plot <- ggplot(pulsar_training, aes(x = excess_kurtosis_ip, fill = is_pulsar)) +
  geom_histogram(color = "blue") +
  labs(title = "Excess Kurtosis Integrated Profile Distribution", x = "Excess Kurtosis Integrated Profile Values", fill = "Pulsar" )     
excess_kurtosis_ip_plot

# Histogram 4: Distribution of the Skewness of the Integrated Profile
skewness_ip_plot <- ggplot(pulsar_training, aes(x = skewness_ip, fill = is_pulsar)) +
  geom_histogram(color = "blue") +
  labs(title = "Skewness Integrated Profile Distribution", x = "Skewness Integrated Profile Values", fill = "Pulsar" )     
skewness_ip_plot

# Histogram 5: Distribution of the Mean of DM-SNR Curve
mean_c_plot <- ggplot(pulsar_training, aes(x = mean_c, fill = is_pulsar)) +
  geom_histogram(color = "blue") +
  labs(title = "Mean of the DM-SNR Curve Distribution", x = "Mean of DM-SNR Curve Values", fill = "Pulsar" )     
mean_c_plot

# Histogram 6: Distribution of the Standard Deviation of the DM-SNR Curve
sd_c_plot <- ggplot(pulsar_training, aes(x = standard_deviation_c, fill = is_pulsar)) +
  geom_histogram(color = "blue") +
  labs(title = "Standard Deviation of the DM-SNR Curve Distribution", x = "Standard Deviation of DM-SNR Curve Values", fill = "Pulsar" )     
sd_c_plot

# Histogram 7: Distribution of Excess Kurtosis of DM-SNR Curve
excess_kurtosis_c_plot <- ggplot(pulsar_training, aes(x = excess_kurtosis_c, fill = is_pulsar)) +
  geom_histogram(color = "blue") +
  labs(title = "Excess Kurtosis of the DM-SNR Curve Distribution", x = "Excess Kurtosis of DM-SNR Curve Values", fill = "Pulsar" )     
excess_kurtosis_c_plot

# Histogram 8: Skewness of the DM-SNR Curve
skewness_c_plot <- ggplot(pulsar_training, aes(x = skewness_c, fill = is_pulsar)) +
  geom_histogram(color = "blue") +
  labs(title = "Skewness of the DM-SNR Curve Distribution", x = "Skewness of DM-SNR Curve Values", fill = "Pulsar" )     
skewness_c_plot

Since we have many more nonpulsar observations than pulsar observations, the K-nearest neighbor algorithm is much more likely to classify a new observation as a nonpulsar detection. One way to maintain the accuracy of our model despite this imbalance is to choose variables for which the pulsar observations take values distinct from those of the nonpulsar observations. Based on the histograms, the two classes are well separated for the variables `excess_kurtosis_ip`, `skewness_ip`, and `mean_c`. Therefore, we will use these three variables as our predictors.
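
To make the class-imbalance concern concrete, the sketch below computes the accuracy of a trivial classifier that always predicts the majority class. The counts are an assumption: they are the class totals given in the public HTRU2 description (16,259 nonpulsar and 1,639 pulsar candidates), not the training-set counts computed in this notebook, and `majority_baseline` is a hypothetical helper name.

```r
# Accuracy of a classifier that always predicts the most common class.
# `counts` is a named vector of class counts; the values used below are
# assumed from the public HTRU2 description, not from this notebook's output.
majority_baseline <- function(counts) {
  max(counts) / sum(counts)
}

class_counts <- c(nonpulsar = 16259, pulsar = 1639)
baseline_accuracy <- majority_baseline(class_counts)
round(baseline_accuracy, 3)
```

Under these assumed counts the baseline is already above 90%, so a high accuracy on its own does not show that any pulsars are being detected; this is one reason to also look at precision and recall.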

Now we can begin building our K-NN classification model. 

First, we will make a recipe using our training set.

In [None]:
pulsar_recipe <- recipe(is_pulsar ~ excess_kurtosis_ip + skewness_ip + mean_c, data = pulsar_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
pulsar_recipe
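
As a rough illustration of what the scaling and centering steps in the recipe do to a single predictor, the following base R sketch divides by the standard deviation and then subtracts the mean. The values in `x` are hypothetical, and the sketch only mirrors the arithmetic; it is not the recipes package's implementation.

```r
# Toy predictor values (hypothetical, for illustration only).
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Scaling: divide by the standard deviation.
x_scaled <- x / sd(x)
# Centering: subtract the mean of the scaled values.
x_standardized <- x_scaled - mean(x_scaled)

# Equivalent one-step z-score, for comparison.
z <- (x - mean(x)) / sd(x)

all.equal(x_standardized, z)
c(mean = mean(x_standardized), sd = sd(x_standardized))
```

Standardizing matters for K-NN because distances are otherwise dominated by whichever predictor has the largest numeric range.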

Then, we will build a tuning model for picking the best K value.

In [None]:
# tuning model
tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

tune_spec
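
The `"rectangular"` weight function gives every neighbor an equal vote. A minimal base R sketch of that idea is shown below, assuming Euclidean distance and a simple majority vote; `knn_predict` and the toy data are hypothetical and are not how the `kknn` engine is implemented.

```r
# Minimal unweighted K-NN vote (hypothetical helper, for illustration only).
# train_x: numeric matrix of predictors; train_y: factor of class labels;
# new_x: numeric vector with one value per predictor column.
knn_predict <- function(train_x, train_y, new_x, k) {
  diffs <- sweep(train_x, 2, new_x)      # subtract new_x from every row
  dists <- sqrt(rowSums(diffs^2))        # Euclidean distance to each row
  nearest <- order(dists)[seq_len(k)]    # indices of the k closest rows
  votes <- table(train_y[nearest])       # each neighbor gets one equal vote
  names(votes)[which.max(votes)]         # majority class (first level on ties)
}

# Two well-separated toy clusters labeled "0" and "1".
train_x <- matrix(c(0, 0,  0, 1,  1, 0,
                    5, 5,  5, 6,  6, 5), ncol = 2, byrow = TRUE)
train_y <- factor(c("0", "0", "0", "1", "1", "1"))

knn_predict(train_x, train_y, new_x = c(5.2, 5.1), k = 3)
```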

Next, we perform five-fold cross-validation, create a workflow that calculates the metrics for each of the K values 1, 6, ..., 46, and return a data frame that shows the accuracy for each K value.

In [None]:
set.seed(123) # set the seed

# cross-validation
pulsar_vfold <- vfold_cv(pulsar_training, v = 5, strata = is_pulsar)

# create a set of K values
kvals <- tibble(neighbors = seq(from = 1, to = 50, by = 5))

# data analysis workflow
knn_results <- workflow() |>
    add_recipe(pulsar_recipe) |>
    add_model(tune_spec) |>
    tune_grid(resamples = pulsar_vfold, grid = kvals) |>
    collect_metrics()

accuracies <- knn_results |>
    filter(.metric == "accuracy")

accuracies
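
To illustrate how five-fold cross-validation partitions the training rows, here is a small base R sketch. It assumes a hypothetical training set of 20 rows and ignores stratification, which `vfold_cv` handles through its `strata` argument; it is a simplified picture rather than the rsample implementation.

```r
set.seed(123)  # reproducible fold assignment

n_rows <- 20   # hypothetical number of training rows
v <- 5         # number of folds, as in vfold_cv(v = 5)

# Shuffle the fold labels 1..v so each row belongs to exactly one fold.
fold_id <- sample(rep(seq_len(v), length.out = n_rows))

# For each fold, the held-out rows form the validation set and the
# remaining rows form the analysis (training) set.
validation_sizes <- sapply(seq_len(v), function(i) sum(fold_id == i))
analysis_sizes <- n_rows - validation_sizes
rbind(validation = validation_sizes, analysis = analysis_sizes)
```

Each candidate K is fitted once per fold on the analysis rows and scored on the held-out rows; the reported accuracy is the mean over the five folds.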

After we have found the accuracy of our model for each K value, we create an accuracy versus K plot.

In [None]:
best_k_plot <- accuracies |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Number of neighbors", y = "Accuracy") +
    ggtitle("Accuracy vs. Number of Neighbors")
best_k_plot

From the graph, we can see that the K value with the highest accuracy is 26 (an accuracy of nearly 98%), so we will use this value as the K for our classification model.
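
The best K can also be read off programmatically instead of from the plot. The sketch below uses a hypothetical data frame, `accuracies_example`, whose accuracy values are made up for illustration and are not the notebook's actual cross-validation results.

```r
# Hypothetical cross-validation results; the accuracy values are made up
# for illustration and are NOT the notebook's actual output.
accuracies_example <- data.frame(
  neighbors = seq(from = 1, to = 46, by = 5),
  mean = c(0.968, 0.975, 0.976, 0.977, 0.977,
           0.978, 0.977, 0.977, 0.976, 0.976)
)

# Row with the highest mean accuracy (ties go to the smallest K,
# because which.max() returns the first maximum).
best_row <- accuracies_example[which.max(accuracies_example$mean), ]
best_k <- best_row$neighbors
best_k
```

On the `accuracies` tibble itself, a similar result could probably be obtained with `accuracies |> slice_max(mean, n = 1)`, which keeps all tied rows by default.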

We will now build a new K-NN model with K = 26.

In [None]:
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 26) |>
    set_engine("kknn") |>
    set_mode("classification")
knn_spec

Then, we will use a workflow to fit our model using our training set, predict the labels in our testing set, and incorporate the predictions into our testing set.

In [None]:
knn_fit <- workflow() |>
    add_recipe(pulsar_recipe) |>
    add_model(knn_spec) |>
    fit(data = pulsar_training)

pulsar_predict <- predict(knn_fit, pulsar_testing) |>
    bind_cols(pulsar_testing)

pulsar_predict

After making the predictions, we assess the accuracy of our classifier using the `metrics` function.

In [None]:
pulsar_accuracy <- pulsar_predict |>
    metrics(truth = is_pulsar, estimate = .pred_class) |>
    filter(.metric == "accuracy")

pulsar_accuracy

We can also examine the confusion matrix of our classifier's performance.

In [None]:
confusion_matrix <- pulsar_predict |>
    conf_mat(truth = is_pulsar, estimate = .pred_class)
confusion_matrix

Using the confusion matrix above, we find that our classifier has a precision of 0.9341 and a recall of 0.8305 on the testing set.
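
For reference, precision and recall follow from the confusion-matrix counts as sketched below. The counts are illustrative only (they are not the notebook's confusion matrix), and the sketch assumes that pulsar is treated as the positive class.

```r
# Illustrative counts only (NOT the notebook's confusion matrix),
# treating "pulsar" as the positive class.
true_positive  <- 90  # pulsars predicted as pulsars
false_positive <- 10  # nonpulsars predicted as pulsars
false_negative <- 30  # pulsars predicted as nonpulsars

# Precision: of the candidates predicted to be pulsars, how many are pulsars?
precision <- true_positive / (true_positive + false_positive)
# Recall: of the actual pulsars, how many did the classifier find?
recall <- true_positive / (true_positive + false_negative)

c(precision = precision, recall = recall)
```

If these metrics are computed with yardstick instead, note that its `precision()` and `recall()` functions treat the first factor level as the event by default, so with the levels `0` and `1` the argument `event_level = "second"` would likely be needed.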

<h3>Methods:</h3>

<b>Explain how you will conduct your data analysis and which variables/columns you will use. Note - you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable think: is this a useful variable for prediction?</b>
<br>

- The first four variables are simple statistics obtained from the integrated pulse profile, a continuous array of values describing a longitude-resolved version of the signal that has been averaged in both time and frequency. The remaining four variables are obtained from the DM-SNR curve in the same way. All eight variables are potentially relevant to predicting pulsars, so we need to determine their importance empirically. We can use correlation analysis to calculate the correlation between each variable and the class label; variables with higher absolute correlation are often more relevant. Depending on how the data are distributed, some variables may also separate the two classes more clearly than others. Therefore, we will plot a histogram for each of the eight variables, compare the distributions, and use correlation analysis to decide which variables to use as predictors.

- In our data analysis, we will build our K-NN model by tuning the number of neighbors with the `neighbors = tune()` argument of the `nearest_neighbor` function. We will perform five-fold cross-validation using the `vfold_cv` function. We will also specify the predictor and target variables, and scale and center the predictors, in a recipe. Next, we will combine the recipe and the classification model in a workflow and apply the `tune_grid` and `collect_metrics` functions. After that, we will filter the resulting data frame to keep only the rows containing the accuracy metric. Finally, we will plot the accuracy estimates against K and decide which K value is the most appropriate. At this point, the process of creating the classifier is complete.
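
The correlation analysis mentioned in the first point could be sketched as follows. The data frame `toy` is synthetic and its column names (`feature_a`, `feature_b`, `class_label`) are hypothetical stand-ins for the real predictors and the 0/1 class; the sketch only shows the mechanics and reports no result about HTRU2.

```r
set.seed(2024)  # reproducible synthetic data

# Synthetic stand-in for the training set (hypothetical values):
# feature_a differs between the classes, feature_b does not.
n <- 200
class_label <- rep(c(0, 1), each = n / 2)
toy <- data.frame(
  feature_a = rnorm(n, mean = 3 * class_label),
  feature_b = rnorm(n),
  class_label = class_label
)

# Pearson correlation of each feature with the 0/1 class label.
cor_with_class <- sapply(toy[c("feature_a", "feature_b")],
                         function(col) cor(col, toy$class_label))
round(cor_with_class, 2)
```

A feature whose correlation with the class label has a large absolute value is a stronger candidate predictor, although correlation only captures linear separation and should be read alongside the histograms.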
<br>
<b>Describe at least one way that you will visualize the results</b>
<br>
We can create a confusion matrix to visualize the performance of our classifier. A confusion matrix displays the true positive, true negative, false positive, and false negative counts, allowing us to assess the classifier's accuracy, precision, and recall.
<br>
<h3>Expected outcomes and significance:</h3>
<b>What do you expect to find?</b>
<br>
<p>We expect to find out how well our K-NN classifier predicts whether a candidate signal is a pulsar or a nonpulsar based on our predictor variables.</p>
<b>What impact could such findings have?</b>
<p>If in the end we find that our classifier is appropriate for classifying pulsar candidates, we can conclude that the selected variables show distinct differences between pulsar and nonpulsar signals.
This could allow candidates to be classified as pulsar or nonpulsar more efficiently.
It would also help in cataloging and characterizing these celestial objects, contributing to our knowledge of astrophysics and the cosmos. Distinguishing sources that emit regular signals (pulsars) from those that do not (nonpulsars) may also be relevant to astrobiology and the study of exoplanets; for example, precise pulsar timing has been used to detect planets orbiting pulsars.</p>
<b>What future questions could this lead to?</b>
<br>
- How well do other types of classifiers distinguish pulsar signals from nonpulsar detections?
- How do the radio signal patterns of pulsars evolve over time, and what factors influence their properties?
- What is the distribution of pulsars in our galaxy and beyond?
- How do pulsars interact with their surroundings?
- Are there alternative pathways to pulsar formation, and can they be identified?
