## Machine Learning: Exoplanet Exploration

#### Extensive Data Dictionary: https://exoplanetarchive.ipac.caltech.edu/docs/API_kepcandidate_columns.html

Columns of note:

* kepoi_name: A KOI is a target identified by the Kepler Project that displays at least one transit-like sequence within Kepler time-series photometry that appears to be of astrophysical origin and initially consistent with a planetary transit hypothesis

* kepler_name: [These names] are intended to clearly indicate a class of objects that have been confirmed or validated as planets—a step up from the planet candidate designation.

* koi_disposition: The disposition in the literature towards this exoplanet candidate. One of CANDIDATE, FALSE POSITIVE, NOT DISPOSITIONED or CONFIRMED.

* koi_pdisposition: The disposition Kepler data analysis has towards this exoplanet candidate. One of FALSE POSITIVE, NOT DISPOSITIONED, and CANDIDATE.

* koi_score: A value between 0 and 1 that indicates the confidence in the KOI disposition. For CANDIDATEs, a higher value indicates more confidence in its disposition, while for FALSE POSITIVEs, a higher value indicates less confidence in that disposition.


In [1]:
# # Update scikit-learn to prevent version mismatches
# !pip install scikit-learn --upgrade
# # Install joblib
# !pip install joblib

### Import Dependencies 

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hide warning messages in notebook
import warnings
warnings.filterwarnings('ignore')

# Read the CSV and Perform Basic Data Cleaning

In [3]:
# Read/Load CSV file
df = pd.read_csv("exoplanet_data.csv")

# Drop columns where all values are null
df = df.dropna(axis='columns', how='all')

# Drop rows that contain any null value
df = df.dropna()
df.head()

|  | koi_disposition | koi_fpflag_nt | koi_fpflag_ss | koi_fpflag_co | koi_fpflag_ec | koi_period | koi_period_err1 | koi_period_err2 | koi_time0bk | koi_time0bk_err1 | ... | koi_steff_err2 | koi_slogg | koi_slogg_err1 | koi_slogg_err2 | koi_srad | koi_srad_err1 | koi_srad_err2 | ra | dec | koi_kepmag |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | CONFIRMED | 0 | 0 | 0 | 0 | 54.418383 | 0.0002479 | -0.0002479 | 162.51384 | 0.00352 | ... | -81 | 4.467 | 0.064 | -0.096 | 0.927 | 0.105 | -0.061 | 291.93423 | 48.141651 | 15.347 |
| 1 | FALSE POSITIVE | 0 | 1 | 0 | 0 | 19.89914 | 1.49e-05 | -1.49e-05 | 175.850252 | 0.000581 | ... | -176 | 4.544 | 0.044 | -0.176 | 0.868 | 0.233 | -0.078 | 297.00482 | 48.134129 | 15.436 |
| 2 | FALSE POSITIVE | 0 | 1 | 0 | 0 | 1.736952 | 2.63e-07 | -2.63e-07 | 170.307565 | 0.000115 | ... | -174 | 4.564 | 0.053 | -0.168 | 0.791 | 0.201 | -0.067 | 285.53461 | 48.28521 | 15.597 |
| 3 | CONFIRMED | 0 | 0 | 0 | 0 | 2.525592 | 3.76e-06 | -3.76e-06 | 171.59555 | 0.00113 | ... | -211 | 4.438 | 0.07 | -0.21 | 1.046 | 0.334 | -0.133 | 288.75488 | 48.2262 | 15.509 |
| 4 | CONFIRMED | 0 | 0 | 0 | 0 | 4.134435 | 1.05e-05 | -1.05e-05 | 172.97937 | 0.0019 | ... | -232 | 4.486 | 0.054 | -0.229 | 0.972 | 0.315 | -0.105 | 296.28613 | 48.22467 | 15.714 |
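The two `dropna` calls above do different jobs, which is easy to miss. The sketch below illustrates the difference on a tiny made-up table; the frame `toy` and the column `empty_col` are hypothetical stand-ins, not part of `exoplanet_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table standing in for exoplanet_data.csv
toy = pd.DataFrame({
    "koi_disposition": ["CONFIRMED", "CANDIDATE", "FALSE POSITIVE"],
    "koi_period": [54.4, np.nan, 1.7],
    "empty_col": [np.nan, np.nan, np.nan],
})

# Step 1: drop only the columns in which every value is null
no_empty_cols = toy.dropna(axis="columns", how="all")

# Step 2: drop every row that still contains at least one null
clean = no_empty_cols.dropna()

print(f"{len(toy)} rows -> {len(clean)} rows")
print(list(clean.columns))
```

Running the column step first matters: if `dropna()` ran on the original frame, the all-null column would cause every row to be discarded.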


## Basic Statistic Details

In [4]:
df.describe()

|  | koi_fpflag_nt | koi_fpflag_ss | koi_fpflag_co | koi_fpflag_ec | koi_period | koi_period_err1 | koi_period_err2 | koi_time0bk | koi_time0bk_err1 | koi_time0bk_err2 | ... | koi_steff_err2 | koi_slogg | koi_slogg_err1 | koi_slogg_err2 | koi_srad | koi_srad_err1 | koi_srad_err2 | ra | dec | koi_kepmag |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | ... | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 |
| mean | 0.157059 | 0.244743 | 0.202975 | 0.125018 | 56.191248 | 0.001851122 | -0.001851122 | 164.48882 | 0.00934 | -0.00934 | ... | -161.20698 | 4.305049 | 0.121091 | -0.14048 | 1.740749 | 0.35271 | -0.388568 | 292.082406 | 43.812143 | 14.271508 |
| std | 0.363882 | 0.429966 | 0.402243 | 0.330763 | 117.570962 | 0.007184503 | 0.007184503 | 67.020475 | 0.021989 | 0.021989 | ... | 71.448481 | 0.439238 | 0.132048 | 0.08199 | 5.903415 | 0.839017 | 1.907797 | 4.762908 | 3.606167 | 1.350802 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.25982 | 1.1e-08 | -0.1568 | 120.515914 | 9e-06 | -0.569 | ... | -1733.0 | 0.047 | 0.0 | -1.007 | 0.109 | 0.0 | -103.825 | 279.85608 | 36.577381 | 6.966 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 2.620126 | 5.005e-06 | -0.0002401 | 132.683917 | 0.001145 | -0.01 | ... | -197.0 | 4.209 | 0.044 | -0.195 | 0.829 | 0.128 | -0.252 | 288.70473 | 40.79776 | 13.455 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 8.947426 | 3.3e-05 | -3.3e-05 | 136.73923 | 0.00399 | -0.00399 | ... | -159.0 | 4.436 | 0.07 | -0.127 | 0.999 | 0.248 | -0.111 | 292.31476 | 43.679661 | 14.534 |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 | 34.282605 | 0.0002401 | -5.005e-06 | 169.937005 | 0.01 | -0.001145 | ... | -112.0 | 4.543 | 0.149 | -0.088 | 1.357 | 0.357 | -0.069 | 295.88855 | 46.693659 | 15.322 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1071.232624 | 0.1568 | -1.1e-08 | 1472.522306 | 0.569 | -9e-06 | ... | 0.0 | 5.364 | 1.472 | 0.0 | 180.013 | 25.956 | 0.0 | 301.72076 | 52.33601 | 19.065 |


# Select Features (columns)
* Feature Selection: Removing irrelevant features tends to give a better-performing model that is easier to understand and faster to run.


In [5]:
target_names = df["koi_disposition"].unique()
#target_names
print(df["koi_disposition"].unique())

['CONFIRMED' 'FALSE POSITIVE' 'CANDIDATE']
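Besides listing the class names, it can help to check how balanced the classes are before training. A minimal sketch, assuming a hypothetical label series (in the notebook this would be `df["koi_disposition"]`; the counts below are invented):

```python
import pandas as pd

# Hypothetical labels; the real ones come from df["koi_disposition"]
labels = pd.Series(
    ["FALSE POSITIVE"] * 5 + ["CONFIRMED"] * 3 + ["CANDIDATE"] * 2,
    name="koi_disposition",
)

# Share of each class, largest first
class_share = labels.value_counts(normalize=True)
print(class_share)
```

A strongly dominant class would make plain accuracy a misleading metric, which is one reason to look at per-class precision and recall later on.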


In [6]:
# Assign X (independent features) and y (dependent target)

# Set X equal to the entire data set, except for the first column
X = df.iloc[:, 1:]
# X.head()

# Set y equal to the first column (koi_disposition), kept one-dimensional
# because scikit-learn classifiers expect a 1-D target
y = df.iloc[:, 0]
# y.head()

In [7]:
from sklearn.ensemble import ExtraTreesClassifier

# Search for top 10 features according to feature importances
model = ExtraTreesClassifier()
model.fit(X,y)
model.feature_importances_

# sorted(zip(model.feature_importances_, X), reverse=True)

array([0.13643403, 0.1510962 , 0.1296954 , 0.06021623, 0.02050111,
       0.01276818, 0.01532974, 0.01273703, 0.02475581, 0.02906709,
       0.01270221, 0.01050403, 0.0095014 , 0.01739827, 0.02089361,
       0.02434746, 0.03023508, 0.01306763, 0.01500074, 0.01122412,
       0.0129667 , 0.00999862, 0.01422634, 0.00934513, 0.00817938,
       0.0083652 , 0.03176525, 0.00964307, 0.00911639, 0.02127287,
       0.01493137, 0.01024276, 0.01296766, 0.01278031, 0.00845728,
       0.00890067, 0.0085182 , 0.01150418, 0.010269  , 0.00907425])

In [8]:
# Store the top 10 features as a Series, using the column headers as the index
top_feat = pd.Series(model.feature_importances_, index=X.columns).nlargest(10)
top_feat

koi_fpflag_ss        0.151096
koi_fpflag_nt        0.136434
koi_fpflag_co        0.129695
koi_fpflag_ec        0.060216
koi_model_snr        0.031765
koi_depth            0.030235
koi_time0bk_err2     0.029067
koi_time0bk_err1     0.024756
koi_duration_err2    0.024347
koi_steff_err1       0.021273
dtype: float64
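Because `ExtraTreesClassifier` is randomized, the ranking above can shift between runs. The sketch below shows one way to make such a ranking repeatable by fixing `random_state`; it runs on synthetic data from `make_classification`, and the names `X_demo`, `ranker`, and `feat_i` are illustrative assumptions rather than anything from the exoplanet data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data: 8 features, of which 3 carry signal
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

# Fixing random_state makes the importance ranking reproducible
ranker = ExtraTreesClassifier(n_estimators=100, random_state=0)
ranker.fit(X_demo, y_demo)

# Pair each importance with its column name and keep the largest three
importances = pd.Series(ranker.feature_importances_, index=X_demo.columns)
top_3 = importances.nlargest(3)
print(top_3)
```

The importances sum to 1, so each value reads as that feature's share of the total impurity reduction across the ensemble.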

In [9]:
# Set features based on feature importances
X = df[top_feat.index]

# Use `koi_disposition` for the y values
y = df['koi_disposition']

# y = df['koi_disposition'].values.reshape(-1, 1)

# Create a Train Test Split

In [10]:
from sklearn.model_selection import train_test_split

# Split the data into smaller buckets for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
X_train.head()

|  | koi_fpflag_ss | koi_fpflag_nt | koi_fpflag_co | koi_fpflag_ec | koi_model_snr | koi_depth | koi_time0bk_err2 | koi_time0bk_err1 | koi_duration_err2 | koi_steff_err1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3563 | 0 | 0 | 0 | 0 | 11.7 | 102.9 | -0.00411 | 0.00411 | -0.143 | 120 |
| 4099 | 0 | 0 | 0 | 0 | 18.0 | 593.3 | -0.00446 | 0.00446 | -0.153 | 144 |
| 5460 | 0 | 0 | 0 | 0 | 476.0 | 47337.0 | -9.6e-05 | 9.6e-05 | -0.0152 | 126 |
| 1091 | 0 | 0 | 0 | 0 | 34.7 | 584.8 | -0.00529 | 0.00529 | -0.165 | 101 |
| 5999 | 0 | 0 | 0 | 0 | 8.7 | 193.6 | -0.0286 | 0.0286 | -0.939 | 164 |


In [11]:
# The training split has 5243 rows (75% of the data, the train_test_split default)
X_train.shape, y_train.shape

((5243, 10), (5243,))

In [12]:
# The test split has 1748 rows (25% of the data)
X_test.shape, y_test.shape

((1748, 10), (1748,))
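The split above relies on the default `test_size` of 0.25. As a sketch of a possible refinement (not what this notebook does), the example below states the split size explicitly and passes `stratify` so that each class keeps roughly the same share in both splits. The arrays `X_demo` and `y_demo` are made-up placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 features, three imbalanced classes
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array(["CANDIDATE"] * 20 + ["CONFIRMED"] * 30 + ["FALSE POSITIVE"] * 50)

# Explicit 75/25 split, stratified on the class labels
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=1
)

print(X_tr.shape, X_te.shape)
classes, counts = np.unique(y_te, return_counts=True)
print(dict(zip(classes, counts)))
```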

# Pre-processing

Scale the data using the MinMaxScaler

MinMaxScaler:
* Normalizes the input features
* Rescales each feature to the range 0 to 1, based on the minimum and maximum seen in the data it was fitted on


In [14]:
from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler model and fit it to the training data
X_scaler = MinMaxScaler().fit(X_train)

#Transform the training and testing data using the X_scaler
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
#print(np.matrix(X_test_scaled))
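For reference, `MinMaxScaler` with its default range applies `(x - min) / (max - min)` per feature, using the minimum and maximum of the data it was fitted on. The sketch below checks that formula by hand on made-up numbers (`train` and `test` are hypothetical arrays) and shows that test values outside the training range land outside 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

# Fit on the training data only, then transform the test data
scaler = MinMaxScaler().fit(train)
scaled = scaler.transform(test)

# The same transformation written out by hand
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(scaled.ravel())  # 15 maps to 0.25; 40 maps to 1.5, outside [0, 1]
print(np.allclose(scaled, manual))
```

Fitting the scaler on the training split alone, as the cell above does, keeps information about the test set out of the preprocessing step.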

# Train the Model 
* Uses a Random Forest classifier



In [35]:
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest classifier
model = RandomForestClassifier(n_estimators=200)

# Train (Fit) the model to the data
model.fit(X_train_scaled, y_train)

# Validate the model using the test data
print(f"Training Data Score: {model.score(X_train_scaled, y_train)}")
print(f"Testing Data Score: {model.score(X_test_scaled, y_test)}")

# For a classifier, score() reports mean accuracy (not R2). A training accuracy of 1.0 against roughly 0.90 on the test set suggests the forest is overfitting the training data to some degree.

Training Data Score: 1.0
Testing Data Score: 0.8964530892448512


## Prediction

In [36]:
predictions = model.predict(X_test_scaled)

# print(f"first 10 Predictions{predictions[:10].tolist()}")
# print(f"first 10 Actual{y_test[:10].tolist()}")

# Collect the results in a DataFrame (y_test must be one-dimensional here, so it is not reshaped earlier)
df_pred = pd.DataFrame({"Actual":y_test, "Predicted":predictions}) 
df_pred.head()

|  | Actual | Predicted |
| --- | --- | --- |
| 1213 | CONFIRMED | CANDIDATE |
| 5220 | FALSE POSITIVE | FALSE POSITIVE |
| 6688 | FALSE POSITIVE | FALSE POSITIVE |
| 2031 | CONFIRMED | CONFIRMED |
| 5824 | FALSE POSITIVE | FALSE POSITIVE |


# Hyperparameter Tuning

Use `GridSearchCV` to tune the model's parameters

In [None]:
from sklearn.svm import SVC

# Create the support vector classifier that GridSearchCV will tune
model = SVC(kernel='linear')
model.fit(X_train_scaled, y_train)

## Accuracy

In [37]:
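from sklearn.metrics import accuracy_score, confusion_matrix

# Predict on the scaled test set, since the model was trained on scaled features.
# (The accuracy recorded below was produced with the unscaled X_test.)
y_predic = model.predict(X_test_scaled)

# Build the confusion matrix
cm = confusion_matrix(y_test, y_predic)

# Fraction of test rows classified correctly
accuracy = accuracy_score(y_test, y_predic)

accuracy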

0.715675057208238
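The confusion matrix computed above is never displayed. A minimal sketch of how it could be shown with readable row and column labels follows; the `actual` and `predicted` lists are invented for illustration, and passing `labels` fixes the class order explicitly.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Invented labels for illustration only
class_order = ["CANDIDATE", "CONFIRMED", "FALSE POSITIVE"]
actual = ["CONFIRMED", "CANDIDATE", "FALSE POSITIVE", "CONFIRMED", "CANDIDATE"]
predicted = ["CONFIRMED", "CONFIRMED", "FALSE POSITIVE", "CANDIDATE", "CANDIDATE"]

# Rows are actual classes, columns are predicted classes
cm_demo = confusion_matrix(actual, predicted, labels=class_order)

cm_table = pd.DataFrame(
    cm_demo,
    index=[f"actual {name}" for name in class_order],
    columns=[f"predicted {name}" for name in class_order],
)
print(cm_table)
```

Diagonal cells count correct predictions; off-diagonal cells show which classes are being confused with each other.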

In [38]:
from sklearn.model_selection import GridSearchCV

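# Note: gamma is ignored by SVC when kernel='linear', so only C
# affects the scores in this search
param_grid = {'C': [1, 5, 10, 20, 50],
              'gamma': [0.0001,0.001, 0.01, 0.0005, 0.005]}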

grid = GridSearchCV(model, param_grid, verbose=3)


In [39]:
# Train the model with GridSearch

grid.fit(X_train_scaled, y_train)

Fitting 3 folds for each of 25 candidates, totalling 75 fits
[CV] C=1, gamma=0.0001 ...............................................
[CV] ................... C=1, gamma=0.0001, score=0.833, total=   0.1s
[CV] C=1, gamma=0.0001 ...............................................
[CV] ................... C=1, gamma=0.0001, score=0.818, total=   0.1s
[CV] C=1, gamma=0.0001 ...............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s


[CV] ................... C=1, gamma=0.0001, score=0.817, total=   0.1s
[CV] C=1, gamma=0.001 ................................................
[CV] .................... C=1, gamma=0.001, score=0.833, total=   0.1s
[CV] C=1, gamma=0.001 ................................................
[CV] .................... C=1, gamma=0.001, score=0.818, total=   0.1s
[CV] C=1, gamma=0.001 ................................................
[CV] .................... C=1, gamma=0.001, score=0.817, total=   0.1s
[CV] C=1, gamma=0.01 .................................................
[CV] ..................... C=1, gamma=0.01, score=0.833, total=   0.1s
[CV] C=1, gamma=0.01 .................................................
[CV] ..................... C=1, gamma=0.01, score=0.818, total=   0.1s
[CV] C=1, gamma=0.01 .................................................
[CV] ..................... C=1, gamma=0.01, score=0.817, total=   0.1s
[CV] C=1, gamma=0.0005 ...............................................
[CV] .

[CV] .................. C=50, gamma=0.0001, score=0.862, total=   0.1s
[CV] C=50, gamma=0.0001 ..............................................
[CV] .................. C=50, gamma=0.0001, score=0.842, total=   0.1s
[CV] C=50, gamma=0.0001 ..............................................
[CV] .................. C=50, gamma=0.0001, score=0.843, total=   0.1s
[CV] C=50, gamma=0.001 ...............................................
[CV] ................... C=50, gamma=0.001, score=0.862, total=   0.1s
[CV] C=50, gamma=0.001 ...............................................
[CV] ................... C=50, gamma=0.001, score=0.842, total=   0.1s
[CV] C=50, gamma=0.001 ...............................................
[CV] ................... C=50, gamma=0.001, score=0.843, total=   0.1s
[CV] C=50, gamma=0.01 ................................................
[CV] .................... C=50, gamma=0.01, score=0.862, total=   0.1s
[CV] C=50, gamma=0.01 ................................................
[CV] .

[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    6.6s finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto_deprecated', kernel='linear',
                           max_iter=-1, probability=False, random_state=None,
                           shrinking=True, tol=0.001, verbose=False),
             iid='warn', n_jobs=None,
             param_grid={'C': [1, 5, 10, 20, 50],
                         'gamma': [0.0001, 0.001, 0.01, 0.0005, 0.005]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=3)

In [40]:
# List the best parameters for this dataset
print(grid.best_params_)

# List the best score
print('Best Score: ', grid.best_score_)

{'C': 50, 'gamma': 0.0001}
Best Score:  0.8491321762349799
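In the log above, the scores change with `C` but not with `gamma`, which is expected because a linear-kernel `SVC` does not use `gamma`. As a sketch of a search in which both parameters matter, the example below tunes an RBF-kernel `SVC` on synthetic data. The grid values and the names `X_demo`, `y_demo`, and `search` are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

# With the RBF kernel, gamma controls the width of the kernel
param_grid_demo = {"C": [1, 10], "gamma": [0.01, 1.0]}

# cv is set explicitly rather than relying on the library default
search = GridSearchCV(SVC(kernel="rbf"), param_grid_demo, cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```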


In [41]:
# Score the model
grid.score(X_test_scaled, y_test)

0.870137299771167

In [42]:
# Make predictions with the hypertuned model
predictions = grid.predict(X_test_scaled)
df_grid = pd.DataFrame({"Actual":y_test, "Predicted":predictions}) 
df_grid.head()

|  | Actual | Predicted |
| --- | --- | --- |
| 1213 | CONFIRMED | CONFIRMED |
| 5220 | FALSE POSITIVE | FALSE POSITIVE |
| 6688 | FALSE POSITIVE | FALSE POSITIVE |
| 2031 | CONFIRMED | CONFIRMED |
| 5824 | FALSE POSITIVE | FALSE POSITIVE |


In [43]:
# Calculate classification report
# print(np.array(y_test))

from sklearn.metrics import classification_report

# Let classification_report name the rows itself: it orders classes
# alphabetically, whereas target_names from unique() is in order of
# appearance and would attach the wrong name to each row.
print(classification_report(y_test, predictions))

                precision    recall  f1-score   support

     CANDIDATE       0.85      0.55      0.67       404
     CONFIRMED       0.68      0.90      0.78       435
FALSE POSITIVE       0.99      1.00      0.99       909

      accuracy                           0.87      1748
     macro avg       0.84      0.82      0.81      1748
  weighted avg       0.88      0.87      0.87      1748
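Row labels in a classification report are easy to mismatch, because scikit-learn sorts the class labels while `unique()` returns them in order of first appearance. The sketch below, on invented labels, passes the same sorted list as both `labels` and `target_names` so each row name is guaranteed to match its class. `actual`, `predicted`, and `label_order` are hypothetical names.

```python
from sklearn.metrics import classification_report

# Invented labels for illustration only
actual = ["CONFIRMED", "FALSE POSITIVE", "CANDIDATE", "FALSE POSITIVE"]
predicted = ["CONFIRMED", "FALSE POSITIVE", "CONFIRMED", "FALSE POSITIVE"]

# One sorted list drives both the class order and the row names
label_order = sorted(set(actual))

report = classification_report(
    actual,
    predicted,
    labels=label_order,
    target_names=label_order,
    zero_division=0,   # CANDIDATE is never predicted here
    output_dict=True,  # dict form makes individual values easy to read
)

for name in label_order:
    print(name, report[name]["recall"], report[name]["support"])
```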



# Save the Model

In [47]:
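import joblib

# Save a fitted estimator instance, not the RandomForestClassifier class itself.
# Note: `model` refers to the most recently fitted estimator; after the tuning
# section that is the SVC, so keep the fitted random forest in its own variable
# if that is the model this file is meant to hold.
filename = 'RandomForestClassifier.sav'
joblib.dump(model, filename)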

['RandomForestClassifier.sav']
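A quick way to confirm that a saved model is usable is to load it back and compare predictions. The sketch below does this round trip with a small forest trained on synthetic data and a temporary file; `fitted`, `restored`, and `demo_model.sav` are hypothetical names.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model to persist
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
fitted = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Write the fitted instance to disk and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.sav")
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored model should predict exactly as the original does
same_predictions = bool((restored.predict(X_demo) == fitted.predict(X_demo)).all())
print(same_predictions)
```

Saved models are generally only safe to load with the same scikit-learn version that wrote them, so it is worth recording the version alongside the file.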