# Part 2: Modeling & Evaluation

## Imports

In [None]:
import pandas                as pd
import numpy                 as np
import matplotlib.pyplot     as plt
import seaborn               as sns
from sklearn.ensemble        import RandomForestClassifier
from sklearn.linear_model    import LogisticRegression
from sklearn.svm             import SVC
from xgboost                 import XGBClassifier
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.metrics         import roc_auc_score
from sklearn.metrics         import accuracy_score
from sklearn.metrics         import precision_score
from sklearn.metrics         import recall_score
from sklearn.metrics         import f1_score
from sklearn.metrics         import balanced_accuracy_score
from sklearn.pipeline        import Pipeline
from IPython.display         import display, HTML
from IPython.display         import display_html
sns.set(style = "white", palette = "deep")
display(HTML("<style>.container { width:95% !important; }</style>"))
%matplotlib inline

## Table Of Contents

1. [Reading In The Data](#Reading-In-The-Data)
    - [Overview](#Overview)
    
    
2. [Feature Engineering](#Feature-Engineering)
    - [Data Manipulation](#Data-Manipulation)
    - [Interaction Columns](#Interaction-Columns)

## Reading In The Data

In [None]:
pulsar = pd.read_csv("../Data/pulsar_cleaned.csv")

### Overview

In [None]:
# Checking the head of the data

pulsar.head()

In [None]:
# Checking the shape of the data

print(f"The shape of the dataset is: {pulsar.shape}")

In [None]:
# Summary of column data types

pulsar.dtypes.value_counts()

In [None]:
# Checking for null values

pulsar.isnull().sum()

## Feature Engineering

### Data Manipulation

While visualizing the data, we noticed that two columns, `mean_ip` and `sd_ip`, appeared to have close to normal distributions.  We decided to square their values to create transformed versions of these features.

In [None]:
# Squaring the `mean_ip` column (vectorized, equivalent to applying x**2 row by row)

pulsar["mean_ip_squared"] = pulsar["mean_ip"] ** 2

# Squaring the `sd_ip` column

pulsar["sd_ip_squared"]   = pulsar["sd_ip"] ** 2
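
One way to check what squaring does to the shape of a distribution is to compare the sample skewness before and after the transformation. The snippet below is only a minimal sketch: the helper name `skew_before_after` and the synthetic column `x` are hypothetical and are not part of the original analysis. It uses a roughly normal, strictly positive variable as stand-in data.

```python
import numpy as np
import pandas as pd


def skew_before_after(df, columns):
    """Compare the sample skewness of each column with that of its square.

    Hypothetical helper for illustration only.
    """
    rows = []
    for col in columns:
        rows.append({
            "column": col,
            "skew_original": df[col].skew(),
            "skew_squared": (df[col] ** 2).skew(),
        })
    return pd.DataFrame(rows).set_index("column")


# Synthetic stand-in data (an assumption, not the pulsar dataset):
# a roughly normal variable centered well above zero.
rng = np.random.default_rng(42)
demo = pd.DataFrame({"x": rng.normal(loc=100, scale=20, size=5000)})

summary = skew_before_after(demo, ["x"])
print(summary)
```

On the notebook's data, the same helper could be called as `skew_before_after(pulsar, ["mean_ip", "sd_ip"])` to see how the squared columns differ from the originals.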

### Interaction Columns

Based on the heat map in the previous notebook, we noticed that some pairs of columns are highly correlated.  We felt that by creating interaction columns, we would emphasize these correlations while also reducing the number of features.

We defined a high correlation as one with a magnitude greater than 0.5.

The column pairs in question are:

| Column 1   | Column 2   | Correlation |
|:-----------|:-----------|:-----------:|
| mean_ip    | sd_ip      | 0.55        |
| ex_kurt_ip | skew_ip    | 0.95        |
| mean_dmsnr | sd_dmsnr   | 0.80        |
| ex_kurt_dmsnr | skew_dmsnr | 0.92        |
| mean_ip    | ex_kurt_ip | -0.87       |
| mean_ip    | skew_ip    | -0.74       |
| sd_ip      | ex_kurt_ip | -0.52       |
| sd_ip      | skew_ip    | -0.54       |
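
The 0.5 threshold can also be applied programmatically rather than read off the heat map. The following is a minimal sketch under stated assumptions: the helper name `high_corr_pairs` is hypothetical, the correlations are Pearson correlations as computed by `DataFrame.corr`, and the small synthetic frame stands in for the real feature columns.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.5):
    """Return column pairs whose absolute Pearson correlation exceeds `threshold`.

    Hypothetical helper for illustration only.
    """
    corr = df.corr()
    # Keep only the strict upper triangle so each pair is listed once
    # and self-correlations on the diagonal are excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["column_1", "column_2", "correlation"]
    # Filtering on magnitude also discards any masked (NaN) entries.
    pairs = pairs[pairs["correlation"].abs() > threshold]
    # Order the surviving pairs from strongest to weakest magnitude.
    order = pairs["correlation"].abs().sort_values(ascending=False).index
    return pairs.loc[order].reset_index(drop=True)


# Synthetic stand-in data (an assumption, not the pulsar dataset).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),   # strong positive correlation with a
    "c": -a + rng.normal(scale=0.1, size=500),  # strong negative correlation with a
    "d": rng.normal(size=500),                  # independent noise
})

result = high_corr_pairs(demo)
print(result)
```

Filtering on the absolute value keeps strongly negative pairs, such as `mean_ip` and `ex_kurt_ip` in the table above, alongside the positive ones. Applied to the original feature columns of `pulsar`, a helper like this should list the same kind of pairs as the table.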