# Part 1: Data Cleaning & Exploration

## Imports

In [None]:
import pandas             as pd
import numpy              as np
import seaborn            as sns
import matplotlib.pyplot  as plt
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
sns.set(style = "white", palette = "deep")
%matplotlib inline

## Table Of Contents

1. [Exploration](#Exploration)
    - [Reading In The Data](#Reading-In-The-Data)
    - [Overview](#Overview)


2. [Data Cleaning](#Data-Cleaning)
    - [Renaming Columns](#Renaming-Columns)


3. [Visualizations](#Visualizations)
    - [Functions](#Functions)
    - [Histograms](#Histograms)
    - [Box Plots](#Box-Plots)
    - [Bar Plots](#Bar-Plots)
    - [Heat Map](#Heat-Map)

## Exploration

### Reading In The Data

In [None]:
pulsar = pd.read_csv("../Data/pulsar_stars.csv")

### Overview

In [None]:
# Checking the shape of the df

print(f"The shape of the pulsar dataframe is {pulsar.shape[0]} rows by {pulsar.shape[1]} columns.")

In [None]:
# Checking the data's head

pulsar.head()

In [None]:
# Checking for null values

pulsar.isnull().sum()

In [None]:
# Checking the df's info

pulsar.info()

In [None]:
# Description of numeric columns

pulsar.describe().T

## Data Cleaning

### Renaming Columns

As we saw above, some of the columns have very long names.  To make them easier to work with, we shorten them.

In [None]:
# Renaming the columns

pulsar = pulsar.rename({" Mean of the integrated profile": "mean_ip",
                        " Standard deviation of the integrated profile": "sd_ip",
                        " Excess kurtosis of the integrated profile": "ex_kurt_ip",
                        " Skewness of the integrated profile": "skew_ip",
                        " Mean of the DM-SNR curve": "mean_dmsnr",
                        " Standard deviation of the DM-SNR curve": "sd_dmsnr",
                        " Excess kurtosis of the DM-SNR curve": "ex_kurt_dmsnr",
                        " Skewness of the DM-SNR curve": "skew_dmsnr",
                        " target_class": "target"}, axis = 1)

In [None]:
pulsar.head(2)

Now that we have made changes to the dataframe, we want them to be available when we model.  To do that, we save a new copy of the dataframe.

In [None]:
pulsar.to_csv("../Data/pulsar_cleaned.csv", index = False)

[Top](#Table-Of-Contents)

## Visualizations

### Functions

In [None]:
# Plotting histograms

def plot_histograms(columns, titles, labels, ticks):
    
    # The count determines the location of the chart within the grid
    count = 0
    fig   = plt.figure(figsize = (14,12))
    
    # Looping through each column in the list to graph
    # enumerating allows for me to index the other lists
    for c, column in enumerate(columns):
        
        # Changing the location
        count += 1
        ax    = fig.add_subplot(4, 2, count)
        
        # Plotting and setting parameters for the graph
        plt.title(f"Distribution Of {titles[c]}", size = 18)
        # histplot replaces the deprecated distplot
        sns.histplot(pulsar[column], color = "black",
                     kde = False)
        plt.axvline(pulsar[column].mean(),
                    color = "red")
        plt.xlabel(f"{labels[c]}", size = 16)
        plt.ylabel("Frequency", size = 16)
        plt.xticks(ticks = ticks[c], size = 14)
        plt.yticks(size = 14)
    plt.tight_layout()
    plt.show();

In [None]:
# Plotting box plots

def plot_boxplots(columns, titles, labels, ticks):
    
    # Count sets the location within the grid
    count = 0
    fig   = plt.figure(figsize = (14,12))
    
    # Looping through each column for a graph
    # Enumerating allows me to index the other lists
    for c, column in enumerate(columns):
        
        # Changing the location for the next graph
        count += 1
        ax    = fig.add_subplot(4, 2, count)
        
        # Plotting and setting parameters for the graph
        plt.title(f"{titles[c]}", size = 18)
        sns.boxplot(x = pulsar[column])
        plt.xlabel(f"{labels[c]}", size = 16)
        plt.xticks(ticks = ticks[c], size = 14)
        plt.yticks(size = 14)
    plt.tight_layout()
    plt.show();

### Histograms

#### Histograms Of The Integrated Profile

In [None]:
plot_histograms(columns = ["mean_ip", "sd_ip", "ex_kurt_ip", "skew_ip"],
                titles  = ["IP: Mean", "IP: Standard Dev.", 
                           "IP: Excess Kurtosis", "IP: Skew"],
                labels  = ["Mean", "Standard Deviation", "Kurtosis", "Skew"],
                ticks   = [np.arange(0,200,25), np.arange(20,100,15),
                           np.arange(-2,9,1), np.arange(-5,70,10)])

`mean` and `standard deviation` are almost normally distributed, and squaring them brings them closer to normal, which we will use in feature engineering.  `mean` also has a very long left tail, which pulls it away from a normal distribution; `skew` is also left-tailed.

The distribution of `excess kurtosis` looks normal, but most of the values are between -1 and 1.  If we square them, they will fall almost entirely between 0 and 1, which will drastically change the distribution.

`skew_ip` also has very low values, and squaring them drastically changes the distribution.
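
To illustrate why squaring changes such a column so much, here is a minimal sketch on synthetic values drawn uniformly from [-1, 1].  The numbers are made up for illustration and are not taken from the pulsar data.

```python
import numpy as np

# Synthetic stand-in for a column whose values lie in [-1, 1];
# these values are illustrative, not the real pulsar measurements
rng    = np.random.default_rng(0)
values = rng.uniform(-1, 1, size = 1_000)

squared = values ** 2

# Squaring maps every value in [-1, 1] into [0, 1] and discards the sign
share_in_unit_interval = np.mean((squared >= 0) & (squared <= 1))

print(f"Original range: {values.min():.2f} to {values.max():.2f}")
print(f"Squared range:  {squared.min():.2f} to {squared.max():.2f}")
print(f"Share of squared values in [0, 1]: {share_in_unit_interval:.2f}")
```

Because the sign is lost, negative and positive values of the same size become indistinguishable after squaring.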

#### Histograms Of The DM-SNR Curve

In [None]:
plot_histograms(columns = ["mean_dmsnr", "sd_dmsnr", "ex_kurt_dmsnr", "skew_dmsnr"],
                titles  = ["DM-SNR: Mean", "DM-SNR: Standard Dev.",
                           "DM-SNR: Excess Kurtosis", "DM-SNR: Skew"],
                labels  = ["Mean", "Standard Deviation", "Kurtosis", "Skew"],
                ticks   = [np.arange(0,250,25), np.arange(0,110, 10),
                           np.arange(-5,40,5), np.arange(-5,1200, 155)])

Some of the distributions look log-normal, but taking the natural log of the values did not produce a normal distribution.


`mean` looks like it could be log-normal, but its long right tail prevents it from having a nice normal distribution, which is the goal of taking the log of the values.


`standard deviation` has a very long right tail which will be obvious in its box plot.


`excess kurtosis` almost looks normally distributed, but the left side of the chart is irregular.
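
One way to check whether a log transform helps is to compare the skewness before and after.  The sketch below uses a synthetic, truly log-normal sample, so the log works well here; it is not a record of the check we ran on the pulsar columns.  Note that `np.log` needs strictly positive values, so a column with zeros or negatives would need a shift first.

```python
import numpy as np
import pandas as pd

# Synthetic log-normal sample (illustrative only, not the pulsar data)
rng    = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean = 0, sigma = 1, size = 5_000))

# Skewness near 0 suggests a roughly symmetric distribution
skew_before = sample.skew()
skew_after  = np.log(sample).skew()

print(f"Skewness before the log: {skew_before:.2f}")
print(f"Skewness after the log:  {skew_after:.2f}")
```

If the skewness stays far from 0 after the transform, the column is probably not log-normal.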

[Top](#Table-Of-Contents)

### Box Plots

#### Box Plots Of The Integrated Profile

In [None]:
plot_boxplots(columns = ["mean_ip", "sd_ip", "ex_kurt_ip", "skew_ip"],
              titles  = ["IP: Mean", "IP: Standard Dev.", 
                         "IP: Excess Kurtosis", "IP: Skew"],
              labels  = ["Mean", "Standard Deviation", "Kurtosis", "Skew"],
              ticks   = [np.arange(0,200,25), np.arange(20,100,15),
                         np.arange(-2,9,1), np.arange(-5,70,10)])

The most notable part of these graphs is the sheer number of outliers, especially for `excess kurtosis` and `skew`.


It is also clear that we will have to scale the data before modeling because the scales are different for all four graphs.
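
To put a number on the outliers, one option is to count the points that fall outside the 1.5 × IQR whiskers that box plots draw by default.  The helper `count_iqr_outliers` and the toy series below are hypothetical and only illustrate the rule.

```python
import pandas as pd

def count_iqr_outliers(series):
    """Count values outside the 1.5 * IQR fences (hypothetical helper)."""
    q1  = series.quantile(0.25)
    q3  = series.quantile(0.75)
    iqr = q3 - q1

    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr

    return int(((series < lower) | (series > upper)).sum())

# Toy data with one obvious outlier (not the pulsar data)
toy        = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
n_outliers = count_iqr_outliers(toy)

print(f"Number of outliers: {n_outliers}")
```

The same function could be applied to each column of the dataframe, for example with `pulsar.apply(count_iqr_outliers)`.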

#### Box Plots Of The DM-SNR Curve

In [None]:
plot_boxplots(columns = ["mean_dmsnr", "sd_dmsnr", "ex_kurt_dmsnr", "skew_dmsnr"],
              titles  = ["DM-SNR: Mean", "DM-SNR: Standard Dev.",
                         "DM-SNR: Excess Kurtosis", "DM-SNR: Skew"],
              labels  = ["Mean", "Standard Deviation", "Kurtosis", "Skew"],
              ticks   = [np.arange(0,250,25), np.arange(0,110, 10),
                         np.arange(-5,40,5), np.arange(-5,1200, 155)])

Similarly to the graphs above, there is an extreme number of outliers, especially for `mean` and `standard deviation`.


It is also evident here that we will have to scale the data.
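
As a rough sketch of what scaling does, the example below standardizes two toy columns on very different scales so that each has a mean of 0 and a standard deviation of 1.  The column names and values are made up, and this is only one possible scaling approach, not necessarily the one used when modeling.

```python
import numpy as np
import pandas as pd

# Toy columns on very different scales (not the pulsar data)
toy = pd.DataFrame({"small_scale": [0.1, 0.2, 0.3, 0.4],
                    "large_scale": [100.0, 250.0, 400.0, 550.0]})

# Standardizing: subtract each column's mean, divide by its standard deviation
scaled = (toy - toy.mean()) / toy.std(ddof = 0)

print(scaled.round(2))
print(scaled.mean().round(2))
print(scaled.std(ddof = 0).round(2))
```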

### Bar Plots

The only column that can be plotted with a bar plot is the target column.  We will plot it again later to determine our baseline accuracy when modeling.

In [None]:
pulsar.columns

In [None]:
tick_labels = ["Non-Pulsar", "Pulsar"]

plt.figure(figsize = (10,5))
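# The rename above only applies if the raw header matches exactly,
# so use whichever name the target column currently has
target_col = "target" if "target" in pulsar.columns else "target_class"
sns.countplot(x = pulsar[target_col])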
plt.title("Pulsar Stars", size = 18)
plt.xlabel("Category", size = 16)
plt.ylabel("Number Of Stars", size = 16)
plt.xticks(np.arange(0,2,1), 
           labels = tick_labels, 
           size = 14)
plt.yticks(size = 14);

The data is extremely unbalanced, which we will have to deal with when modeling and which will inform our choice of models.
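
The baseline accuracy mentioned above is the share of the majority class, i.e. the accuracy of always predicting the most common label.  A minimal sketch on a toy target with a 9:1 split (the real class shares are not reproduced here):

```python
import pandas as pd

# Toy target with nine non-pulsars and one pulsar (illustrative only)
toy_target = pd.Series([0] * 9 + [1] * 1, name = "target")

# Share of each class; the largest share is the baseline accuracy
class_shares      = toy_target.value_counts(normalize = True)
baseline_accuracy = class_shares.max()

print(class_shares)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

With classes this unbalanced, a model has to beat a high baseline, so accuracy alone is a weak measure of performance.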

### Heat Map

In [None]:
columns = pulsar.columns[:8]

plt.figure(figsize   = (16,8),
           facecolor = "white")
plt.title("Correlations Amongst Features", size = 18)
corr = pulsar[columns].corr()
mask = np.zeros_like(corr, dtype = bool)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    sns.heatmap(corr, cmap = "RdBu", mask = mask,
                vmin = -1, vmax = 1, annot = True)
plt.xticks(size = 14)
plt.yticks(size = 14);

One of the most notable features of this heat map is that there are four distinct blocks of correlation:

- `mean_ip` and `sd_ip` with `ex_kurt_ip` and `skew_ip`


- `ex_kurt_ip` and `skew_ip` with `mean_dmsnr` and `sd_dmsnr`


- `mean_dmsnr` and `sd_dmsnr` with `ex_kurt_dmsnr` and `skew_dmsnr`


- `ex_kurt_ip` and `skew_ip` with `ex_kurt_dmsnr` and `skew_dmsnr`


We will use some of these correlations in feature engineering to create interaction columns.
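
An interaction column is usually the product of two existing columns.  The sketch below builds one from `mean_ip` and `ex_kurt_ip` on toy values; the new column name `mean_ip_x_ex_kurt_ip` is hypothetical, and the pairs actually chosen in feature engineering may differ.

```python
import pandas as pd

# Toy values for two correlated columns (not the real pulsar data)
toy = pd.DataFrame({"mean_ip":    [100.0, 120.0, 90.0],
                    "ex_kurt_ip": [0.5, 0.25, 1.0]})

# The interaction term is the row-wise product of the two columns
toy["mean_ip_x_ex_kurt_ip"] = toy["mean_ip"] * toy["ex_kurt_ip"]

print(toy)
```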

[Top](#Table-Of-Contents)