# Preprocessing

This notebook covers preprocessing of the cleaned data, including PCA and feature selection. It builds several pipelines, each producing a differently processed version of the data, so that a machine learning model can be trained on each version and the results compared.

### Import Libraries

In [21]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb

import warnings

warnings.filterwarnings('ignore')

Before we begin preprocessing, we will run some simple models to determine a baseline. This baseline can then be used as a benchmark after completing other preprocessing steps.

### Importing Cleaned Data

In [15]:
#forward slashes avoid invalid escape sequences such as '\d' and also work on Windows
kois = pd.read_csv('../data/kois_cleaned.csv', index_col=0)
kois.head()

rowid,koi_fpflag_nt,koi_fpflag_ss,koi_fpflag_co,koi_fpflag_ec,koi_period,koi_period_err1,koi_period_err2,koi_time0bk,koi_time0bk_err1,koi_time0bk_err2,...,koi_slogg,koi_slogg_err1,koi_slogg_err2,koi_srad,koi_srad_err1,koi_srad_err2,ra,dec,koi_kepmag,koi_disposition_encoded
1,0,0,0,0,9.488036,2.775e-05,-2.775e-05,170.53875,0.00216,-0.00216,...,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347,1
2,0,0,0,0,54.418383,0.0002479,-0.0002479,162.51384,0.00352,-0.00352,...,4.467,0.064,-0.096,0.927,0.105,-0.061,291.93423,48.141651,15.347,1
3,0,1,0,0,19.89914,1.494e-05,-1.494e-05,175.850252,0.000581,-0.000581,...,4.544,0.044,-0.176,0.868,0.233,-0.078,297.00482,48.134129,15.436,2
4,0,1,0,0,1.736952,2.63e-07,-2.63e-07,170.307565,0.000115,-0.000115,...,4.564,0.053,-0.168,0.791,0.201,-0.067,285.53461,48.28521,15.597,2
5,0,0,0,0,2.525592,3.761e-06,-3.761e-06,171.59555,0.00113,-0.00113,...,4.438,0.07,-0.21,1.046,0.334,-0.133,288.75488,48.2262,15.509,1


In [16]:
kois.isnull().sum()

koi_fpflag_nt              0
koi_fpflag_ss              0
koi_fpflag_co              0
koi_fpflag_ec              0
koi_period                 0
koi_period_err1            0
koi_period_err2            0
koi_time0bk                0
koi_time0bk_err1           0
koi_time0bk_err2           0
koi_impact                 0
koi_impact_err1            0
koi_impact_err2            0
koi_duration               0
koi_duration_err1          0
koi_duration_err2          0
koi_depth                  0
koi_depth_err1             0
koi_depth_err2             0
koi_prad                   0
koi_prad_err1              0
koi_prad_err2              0
koi_teq                    0
koi_insol                  0
koi_insol_err1             0
koi_insol_err2             0
koi_model_snr              0
koi_steff                  0
koi_steff_err1             0
koi_steff_err2             0
koi_slogg                  0
koi_slogg_err1             0
koi_slogg_err2             0
koi_srad                   0
koi_srad_err1 

### Splitting Data

In [17]:
#separate our target variable from the rest of the data
y = kois['koi_disposition_encoded']
X = kois.drop(['koi_disposition_encoded'], axis=1)

In [29]:
#print the percentage of values in each class
print(y.value_counts(normalize=True))
y.describe()

koi_disposition_encoded
2    0.505385
1    0.254469
0    0.240147
Name: proportion, dtype: float64


count    9007.000000
mean        1.265238
std         0.821739
min         0.000000
25%         1.000000
50%         2.000000
75%         2.000000
max         2.000000
Name: koi_disposition_encoded, dtype: float64

In [18]:
#split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) #use a test size of 20%
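
Because the classes are imbalanced (roughly 51% / 25% / 24%), one option is to stratify the split so that both sets keep those proportions. The following is a minimal sketch on synthetic stand-in data, not the split used for the results in this notebook; `X_demo`, `y_demo`, and the class probabilities are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the KOI data (hypothetical): three classes at ~24% / 25% / 51%
rng = np.random.default_rng(42)
y_demo = rng.choice([0, 1, 2], size=1000, p=[0.24, 0.25, 0.51])
X_demo = rng.normal(size=(1000, 5))

# stratify=y_demo keeps each class's share nearly identical in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Compare the class proportions in the two splits
train_share = np.bincount(y_tr, minlength=3) / len(y_tr)
test_share = np.bincount(y_te, minlength=3) / len(y_te)
print("train:", train_share.round(3))
print("test: ", test_share.round(3))
```

On the real data this would amount to passing `stratify=y` to the existing `train_test_split` call, which would change the split and therefore the scores reported in this notebook.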

### Instantiating Models

In [19]:
#Decision tree classifier
dtc = DecisionTreeClassifier()

#Random forest classifier
rfc = RandomForestClassifier()

#Logistic regression classifier
logreg = LogisticRegression()

#Support vector machine classifier
svc = SVC()

#K-nearest neighbors classifier
knn = KNeighborsClassifier()

#XGBoost classifier
xgbc = xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')

### Train the Models

In [22]:
#train Decision tree classifier
dtc.fit(X_train, y_train)

#train Random forest classifier
rfc.fit(X_train, y_train)

#train Logistic regression classifier
logreg.fit(X_train, y_train)

#train Support vector machine classifier
svc.fit(X_train, y_train)

#train K-nearest neighbors classifier
knn.fit(X_train, y_train)

#train XGBoost classifier
xgbc.fit(X_train, y_train)

### Check Model Performance

In [23]:
#Decision tree classifier score
dtc_score = dtc.score(X_test, y_test)

#Random forest classifier score
rfc_score = rfc.score(X_test, y_test)

#Logistic regression classifier score
logreg_score = logreg.score(X_test, y_test)

#Support vector machine classifier score
svc_score = svc.score(X_test, y_test)

#K-nearest neighbors classifier score
knn_score = knn.score(X_test, y_test)

#XGBoost classifier score
xgbc_score = xgbc.score(X_test, y_test)

In [30]:
#Print the scores

# Create a dictionary with the model names and their scores
model_scores = {
    "Decision Tree": dtc_score,
    "Random Forest": rfc_score,
    "Logistic Regression": logreg_score,
    "SVM": svc_score,
    "KNN": knn_score,
    "XGBoost": xgbc_score
}

# Convert the dictionary to a DataFrame
scores_df = pd.DataFrame(list(model_scores.items()), columns=['Model', 'Accuracy Score'])

# Sort the DataFrame by 'Accuracy Score' in descending order
scores_df = scores_df.sort_values(by='Accuracy Score', ascending=False)

# Display the DataFrame
print(scores_df)

                 Model  Accuracy Score
5              XGBoost        0.893452
1        Random Forest        0.891787
0        Decision Tree        0.849057
4                  KNN        0.623751
2  Logistic Regression        0.529967
3                  SVM        0.490566


These scores give us a benchmark to work with. The most frequent class of the target variable is 2, which accounts for roughly 50.5% of the samples (it is also the median, but for a classifier the relevant naive baseline is the majority class). Predicting class 2 for every sample would therefore be right about half the time. Comparing each model against that naive guess shows which models gained the most over it: XGBoost and Random Forest both reach about 89% accuracy, and the Decision Tree about 85%.

We can also see which models barely beat the naive guess or fell below it: Logistic Regression scored about 53% and SVM about 49%.
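
The naive baseline can also be measured directly rather than read off the class proportions. The sketch below uses scikit-learn's `DummyClassifier` on synthetic stand-in data; `X_demo`, `y_demo`, and the class probabilities are assumptions made for illustration, so the printed number is not a result from this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the KOI target (hypothetical): class 2 is the majority at ~51%
rng = np.random.default_rng(0)
y_demo = rng.choice([0, 1, 2], size=2000, p=[0.24, 0.25, 0.51])
X_demo = rng.normal(size=(2000, 4))  # DummyClassifier ignores the feature values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Always predict the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)

baseline_score = baseline.score(X_te, y_te)
majority_class = baseline.predict(X_te[:1])[0]
print(f"Majority class: {majority_class}, baseline accuracy: {baseline_score:.3f}")
```

With the real data, fitting the same `DummyClassifier` on `X_train, y_train` and scoring it on `X_test, y_test` would give the test-set baseline to compare against the table above.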

Before we cast judgment on these models, we have to remember that some are far more sensitive to feature scaling than others. Logistic Regression, SVM, and KNN rely on distances between samples or on regularized, iteratively optimized coefficients, so they usually benefit greatly from feature scaling. Tree-based models (Decision Tree, Random Forest, and XGBoost) split on per-feature thresholds and are typically invariant to it.

To use the scale-sensitive models properly, we should therefore apply some form of feature scaling to the data.
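
To illustrate why scale matters for a distance-based model, here is a minimal sketch that compares KNN with and without `StandardScaler` in a `Pipeline`. The data is synthetic and deliberately contrived (one small-scale informative feature, one large-scale noise feature); all names and values are assumptions made for illustration, not the KOI features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 600
y_demo = rng.integers(0, 2, size=n)

# Hypothetical features: the informative one lives on a small scale,
# the pure-noise one on a scale roughly 1000x larger
informative = y_demo + rng.normal(scale=0.3, size=n)
noise = rng.normal(scale=1000.0, size=n)
X_demo = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Unscaled: Euclidean distances are dominated by the large-scale noise feature
raw_knn = KNeighborsClassifier().fit(X_tr, y_tr)

# Scaled: the scaler is fit on the training data only, inside the pipeline
scaled_knn = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
]).fit(X_tr, y_tr)

raw_score = raw_knn.score(X_te, y_te)
scaled_score = scaled_knn.score(X_te, y_te)
print(f"KNN unscaled: {raw_score:.3f}")
print(f"KNN scaled:   {scaled_score:.3f}")
```

Wrapping the scaler and the model in one `Pipeline` means the scaling statistics are learned from the training split only, which avoids leaking test-set information into preprocessing.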

### Feature Scaling