## Machine Learning: Exoplanet Exploration

#### Extensive Data Dictionary: https://exoplanetarchive.ipac.caltech.edu/docs/API_kepcandidate_columns.html

Columns of note:

* kepoi_name: A KOI is a target identified by the Kepler Project that displays at least one transit-like sequence within Kepler time-series photometry that appears to be of astrophysical origin and initially consistent with a planetary transit hypothesis.

* kepler_name: Kepler names are intended to clearly indicate a class of objects that have been confirmed or validated as planets, a step up from the planet candidate designation.

* koi_disposition: The disposition in the literature towards this exoplanet candidate. One of CANDIDATE, FALSE POSITIVE, NOT DISPOSITIONED or CONFIRMED.

* koi_pdisposition: The disposition Kepler data analysis has towards this exoplanet candidate. One of FALSE POSITIVE, NOT DISPOSITIONED, and CANDIDATE.

* koi_score: A value between 0 and 1 that indicates the confidence in the KOI disposition. For CANDIDATEs, a higher value indicates more confidence in its disposition, while for FALSE POSITIVEs, a higher value indicates less confidence in that disposition.


In [1]:
# # Update scikit-learn to prevent version mismatches
# !pip install scikit-learn --upgrade
# # Install joblib
# !pip install joblib

### Import Dependencies 

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hide warning messages in notebook
import warnings
warnings.filterwarnings('ignore')

In [3]:
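# Set the NumPy seed so NumPy-based randomness is reproducible
from numpy.random import seed
seed(1)
# Seed TensorFlow as well; the NumPy seed alone does not control Keras weight initialization
import tensorflow
tensorflow.random.set_seed(1)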

# Read the CSV and Perform Basic Data Cleaning

In [4]:
# Read/Load CSV file
df = pd.read_csv("exoplanet_data.csv")

# Drop the null columns where all values are null
df = df.dropna(axis='columns', how='all')

# Drop the null rows
df = df.dropna()
df.head()

| | koi_disposition | koi_fpflag_nt | koi_fpflag_ss | koi_fpflag_co | koi_fpflag_ec | koi_period | koi_period_err1 | koi_period_err2 | koi_time0bk | koi_time0bk_err1 | ... | koi_steff_err2 | koi_slogg | koi_slogg_err1 | koi_slogg_err2 | koi_srad | koi_srad_err1 | koi_srad_err2 | ra | dec | koi_kepmag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CONFIRMED | 0 | 0 | 0 | 0 | 54.418383 | 0.0002479 | -0.0002479 | 162.51384 | 0.00352 | ... | -81 | 4.467 | 0.064 | -0.096 | 0.927 | 0.105 | -0.061 | 291.93423 | 48.141651 | 15.347 |
| 1 | FALSE POSITIVE | 0 | 1 | 0 | 0 | 19.89914 | 1.49e-05 | -1.49e-05 | 175.850252 | 0.000581 | ... | -176 | 4.544 | 0.044 | -0.176 | 0.868 | 0.233 | -0.078 | 297.00482 | 48.134129 | 15.436 |
| 2 | FALSE POSITIVE | 0 | 1 | 0 | 0 | 1.736952 | 2.63e-07 | -2.63e-07 | 170.307565 | 0.000115 | ... | -174 | 4.564 | 0.053 | -0.168 | 0.791 | 0.201 | -0.067 | 285.53461 | 48.28521 | 15.597 |
| 3 | CONFIRMED | 0 | 0 | 0 | 0 | 2.525592 | 3.76e-06 | -3.76e-06 | 171.59555 | 0.00113 | ... | -211 | 4.438 | 0.07 | -0.21 | 1.046 | 0.334 | -0.133 | 288.75488 | 48.2262 | 15.509 |
| 4 | CONFIRMED | 0 | 0 | 0 | 0 | 4.134435 | 1.05e-05 | -1.05e-05 | 172.97937 | 0.0019 | ... | -232 | 4.486 | 0.054 | -0.229 | 0.972 | 0.315 | -0.105 | 296.28613 | 48.22467 | 15.714 |


## Basic Statistic Details

In [5]:
df.describe()

| | koi_fpflag_nt | koi_fpflag_ss | koi_fpflag_co | koi_fpflag_ec | koi_period | koi_period_err1 | koi_period_err2 | koi_time0bk | koi_time0bk_err1 | koi_time0bk_err2 | ... | koi_steff_err2 | koi_slogg | koi_slogg_err1 | koi_slogg_err2 | koi_srad | koi_srad_err1 | koi_srad_err2 | ra | dec | koi_kepmag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | ... | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 | 6991.0 |
| mean | 0.157059 | 0.244743 | 0.202975 | 0.125018 | 56.191248 | 0.001851122 | -0.001851122 | 164.48882 | 0.00934 | -0.00934 | ... | -161.20698 | 4.305049 | 0.121091 | -0.14048 | 1.740749 | 0.35271 | -0.388568 | 292.082406 | 43.812143 | 14.271508 |
| std | 0.363882 | 0.429966 | 0.402243 | 0.330763 | 117.570962 | 0.007184503 | 0.007184503 | 67.020475 | 0.021989 | 0.021989 | ... | 71.448481 | 0.439238 | 0.132048 | 0.08199 | 5.903415 | 0.839017 | 1.907797 | 4.762908 | 3.606167 | 1.350802 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.25982 | 1.1e-08 | -0.1568 | 120.515914 | 9e-06 | -0.569 | ... | -1733.0 | 0.047 | 0.0 | -1.007 | 0.109 | 0.0 | -103.825 | 279.85608 | 36.577381 | 6.966 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 2.620126 | 5.005e-06 | -0.0002401 | 132.683917 | 0.001145 | -0.01 | ... | -197.0 | 4.209 | 0.044 | -0.195 | 0.829 | 0.128 | -0.252 | 288.70473 | 40.79776 | 13.455 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 8.947426 | 3.3e-05 | -3.3e-05 | 136.73923 | 0.00399 | -0.00399 | ... | -159.0 | 4.436 | 0.07 | -0.127 | 0.999 | 0.248 | -0.111 | 292.31476 | 43.679661 | 14.534 |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 | 34.282605 | 0.0002401 | -5.005e-06 | 169.937005 | 0.01 | -0.001145 | ... | -112.0 | 4.543 | 0.149 | -0.088 | 1.357 | 0.357 | -0.069 | 295.88855 | 46.693659 | 15.322 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1071.232624 | 0.1568 | -1.1e-08 | 1472.522306 | 0.569 | -9e-06 | ... | 0.0 | 5.364 | 1.472 | 0.0 | 180.013 | 25.956 | 0.0 | 301.72076 | 52.33601 | 19.065 |


# Select Features (columns)
* Feature selection: Removing irrelevant features can yield a better-performing model that is easier to understand and faster to run.


In [6]:
target_names = df["koi_disposition"].unique()
#target_names
print(df["koi_disposition"].unique())

['CONFIRMED' 'FALSE POSITIVE' 'CANDIDATE']


In [7]:
# Assign X (independent features) and y (dependent target)

# Set X equal to every column except the first (koi_disposition)
X = df.iloc[:, 1:]
# X.head()

# Set y equal to the first column (koi_disposition) as a 1-D Series,
# which is the target shape scikit-learn classifiers expect
y = df.iloc[:, 0]
# y.head()

In [8]:
from sklearn.ensemble import ExtraTreesClassifier

# Search for top 10 features according to feature importances
model = ExtraTreesClassifier()
model.fit(X,y)
model.feature_importances_

# sorted(zip(model.feature_importances_, X), reverse=True)

array([0.13073522, 0.16928087, 0.13071984, 0.05817844, 0.01773339,
       0.01305674, 0.02116328, 0.01249918, 0.02393155, 0.02871403,
       0.01103683, 0.0140373 , 0.00907399, 0.01992297, 0.02198486,
       0.02374409, 0.02089962, 0.01018814, 0.01162114, 0.0112719 ,
       0.00999627, 0.01020811, 0.01191154, 0.00855717, 0.00839419,
       0.00718397, 0.032982  , 0.009957  , 0.00896558, 0.02233331,
       0.01167389, 0.00973852, 0.00971656, 0.01824186, 0.00886778,
       0.01016445, 0.00851821, 0.01119689, 0.0108167 , 0.01078259])

In [9]:
# Store the top 10 features as a Series, using the column headers as the index
top_feat = pd.Series(model.feature_importances_, index=X.columns).nlargest(10)
top_feat

koi_fpflag_ss        0.169281
koi_fpflag_nt        0.130735
koi_fpflag_co        0.130720
koi_fpflag_ec        0.058178
koi_model_snr        0.032982
koi_time0bk_err2     0.028714
koi_time0bk_err1     0.023932
koi_duration_err2    0.023744
koi_steff_err1       0.022333
koi_duration_err1    0.021985
dtype: float64

In [10]:
# Set features based on feature importances
X = df[top_feat.index]

# Use `koi_disposition` for the y values
y = df['koi_disposition']

#y = df['koi_disposition'].values.reshape(-1, 1)

# Create Train/Test Split

In [11]:
from sklearn.model_selection import train_test_split

# Split the data into smaller buckets for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.head()

| | koi_fpflag_ss | koi_fpflag_nt | koi_fpflag_co | koi_fpflag_ec | koi_model_snr | koi_time0bk_err2 | koi_time0bk_err1 | koi_duration_err2 | koi_steff_err1 | koi_duration_err1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 6122 | 0 | 0 | 0 | 0 | 10.8 | -0.00844 | 0.00844 | -0.306 | 154 | 0.306 |
| 6370 | 1 | 0 | 0 | 1 | 13.8 | -0.00795 | 0.00795 | -0.282 | 158 | 0.282 |
| 2879 | 0 | 1 | 0 | 0 | 254.3 | -0.00619 | 0.00619 | 0.0 | 151 | 0.0 |
| 107 | 0 | 0 | 0 | 0 | 38.4 | -0.00182 | 0.00182 | -0.0595 | 76 | 0.0595 |
| 29 | 0 | 0 | 0 | 0 | 696.5 | -8.3e-05 | 8.3e-05 | -0.0075 | 77 | 0.0075 |


In [12]:
# X_train and y_train have 5243 rows (75% of the data, the train_test_split default)
X_train.shape, y_train.shape

((5243, 10), (5243,))

In [13]:
# X_test and y_test have 1748 rows (25% of the data)
X_test.shape, y_test.shape

((1748, 10), (1748,))
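
The shapes above follow from the `train_test_split` defaults. As a rough check, and assuming no `test_size` or `train_size` is passed (so scikit-learn falls back to a 25% test share and rounds the test count up), the row counts can be reproduced by hand:

```python
import math

n_samples = 6991           # rows left after dropna(), per the df.describe() count
default_test_share = 0.25  # scikit-learn's fallback when no size is given

# The test count is rounded up; the training set gets the remainder
n_test = math.ceil(default_test_share * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)                # 5243 1748
print(round(n_train / n_samples, 2))  # 0.75
```

Passing `test_size=0.2` explicitly would be the way to get an 80/20 split instead.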

# Pre-processing

Scale the data using the MinMaxScaler:

MinMaxScaler:
* Normalizes the input features
* Transforms each feature independently into a fixed range
* Uses a default range of 0 to 1, based on the minimum and maximum seen in the training data

In [14]:
from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler model and fit it to the training data
X_scaler = MinMaxScaler().fit(X_train)

#Transform the training and testing data using the X_scaler
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
#print(np.matrix(X_test_scaled))

# Train the Model 
* Uses TensorFlow (Keras) for deep learning



In [15]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# ==================== Step 1: Label-encode data set =============================
label_encoder = LabelEncoder()

label_encoder.fit(y_train)

y_train_encoded = label_encoder.transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# ==================== Step 2: Convert encoded labels to ONE-HOT-encoding ====================
y_train_categorical = to_categorical(y_train_encoded)
y_test_categorical = to_categorical(y_test_encoded)


print(X_train_scaled.shape)

(5243, 10)
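
The two encoding steps above can be sketched with plain NumPy to show what they compute. This is an illustration with four hand-picked labels, not the `LabelEncoder` or `to_categorical` code itself; the variable names are hypothetical.

```python
import numpy as np

labels = np.array(["FALSE POSITIVE", "CANDIDATE", "CONFIRMED", "CANDIDATE"])

# Step 1: map each class name to an integer; classes are sorted alphabetically
classes, encoded = np.unique(labels, return_inverse=True)
print(classes)  # ['CANDIDATE' 'CONFIRMED' 'FALSE POSITIVE']
print(encoded)  # [2 0 1 0]

# Step 2: one-hot encode by picking rows of an identity matrix
one_hot = np.eye(len(classes))[encoded]
print(one_hot)

# Reverse direction: the position of the 1 in each row gives the class back
decoded = classes[one_hot.argmax(axis=1)]
print(decoded)
```

The alphabetical ordering matters later: column 0 of the network's softmax output corresponds to CANDIDATE, column 1 to CONFIRMED, and column 2 to FALSE POSITIVE.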


# Deep Learning Model

In [27]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Create a model and add the neural network layers
model = Sequential()

# 1st hidden layer (takes the 10 input features)
model.add(Dense(units=100, activation='relu', input_dim=10))

# 2nd hidden layer
model.add(Dense(units=80, activation='relu'))

# Output layer (one unit per class)
model.add(Dense(units=3, activation='softmax'))

model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_10 (Dense)             (None, 100)               1100      
_________________________________________________________________
dense_11 (Dense)             (None, 80)                8080      
_________________________________________________________________
dense_12 (Dense)             (None, 3)                 243       
Total params: 9,423
Trainable params: 9,423
Non-trainable params: 0
_________________________________________________________________
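
The parameter counts in the summary can be checked by hand. A small sketch, assuming each `Dense` layer holds one weight per input-output pair plus one bias per output unit:

```python
# Layer widths as in the model above: 10 inputs -> 100 -> 80 -> 3 outputs
layer_sizes = [10, 100, 80, 3]

def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)     # [1100, 8080, 243]
print(total_params)  # 9423
```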


In [28]:
# ==================== Compile the model ====================
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# ==================== Fit the model ====================
model.fit(
    X_train_scaled,
    y_train_categorical,
    epochs=100,
    shuffle=True,
    verbose=2
)

Epoch 1/100
164/164 - 1s - loss: 0.5479 - accuracy: 0.7181
Epoch 2/100
164/164 - 0s - loss: 0.3724 - accuracy: 0.8037
Epoch 3/100
164/164 - 0s - loss: 0.3512 - accuracy: 0.8322
Epoch 4/100
164/164 - 0s - loss: 0.3389 - accuracy: 0.8428
Epoch 5/100
164/164 - 0s - loss: 0.3305 - accuracy: 0.8436
Epoch 6/100
164/164 - 0s - loss: 0.3264 - accuracy: 0.8468
Epoch 7/100
164/164 - 0s - loss: 0.3205 - accuracy: 0.8516
Epoch 8/100
164/164 - 0s - loss: 0.3178 - accuracy: 0.8556
Epoch 9/100
164/164 - 0s - loss: 0.3177 - accuracy: 0.8568
Epoch 10/100
164/164 - 0s - loss: 0.3169 - accuracy: 0.8558
Epoch 11/100
164/164 - 0s - loss: 0.3156 - accuracy: 0.8543
Epoch 12/100
164/164 - 0s - loss: 0.3148 - accuracy: 0.8526
Epoch 13/100
164/164 - 0s - loss: 0.3123 - accuracy: 0.8591
Epoch 14/100
164/164 - 0s - loss: 0.3175 - accuracy: 0.8554
Epoch 15/100
164/164 - 0s - loss: 0.3131 - accuracy: 0.8554
Epoch 16/100
164/164 - 0s - loss: 0.3132 - accuracy: 0.8592
Epoch 17/100
164/164 - 0s - loss: 0.3145 - accura

<tensorflow.python.keras.callbacks.History at 0x1de85efe448>

## Quantify our Trained Model "Evaluate"

In [29]:
# ==================== Evaluate ====================
model_loss, model_accuracy = model.evaluate(X_test_scaled, y_test_categorical, verbose=2)

print(f"Normal Neural Network - Loss: {model_loss}, Accuracy: {model_accuracy}")


55/55 - 0s - loss: 0.3174 - accuracy: 0.8581
Normal Neural Network - Loss: 0.3174290657043457, Accuracy: 0.8581235408782959
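
The `164/164` and `55/55` counters in the logs above are batches, not samples. A quick check, assuming the Keras default batch size of 32 (no `batch_size` is passed to `fit` or `evaluate` in this notebook):

```python
import math

batch_size = 32  # assumed: the Keras default when batch_size is not given

train_rows, test_rows = 5243, 1748

# The last batch may be partial, so the step count is rounded up
train_steps = math.ceil(train_rows / batch_size)
eval_steps = math.ceil(test_rows / batch_size)

print(train_steps)  # 164 steps per training epoch
print(eval_steps)   # 55 steps for evaluate()
```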


## Prediction

In [30]:
# ==================== Prediction ====================

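# Use the first 10 test samples to make predictions and compare them to the actual labels
# (predict_classes was removed in newer TensorFlow releases; argmax over the softmax output is equivalent)
encoded_predictions = np.argmax(model.predict(X_test_scaled[:10]), axis=-1)
prediction_labels = label_encoder.inverse_transform(encoded_predictions)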

# print(f"Predicted classes : {prediction_labels}")
# print(f"Actual Labels: {list(y_test[:10])}")

df_predict = pd.DataFrame({"Actual":y_test[:10], "Predicted":prediction_labels}) 
df_predict.head()

| | Actual | Predicted |
|---|---|---|
| 4982 | FALSE POSITIVE | FALSE POSITIVE |
| 4866 | CANDIDATE | CANDIDATE |
| 2934 | FALSE POSITIVE | FALSE POSITIVE |
| 5007 | FALSE POSITIVE | FALSE POSITIVE |
| 3869 | FALSE POSITIVE | FALSE POSITIVE |


In [33]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test[:10], prediction_labels)

array([[3, 1, 0],
       [0, 1, 0],
       [0, 0, 5]], dtype=int64)
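
One way to read the matrix above: scikit-learn puts actual classes on the rows and predicted classes on the columns, in sorted label order. The sketch below copies the matrix by hand and derives a few summary numbers from it; with only 10 samples these figures are illustrative rather than a reliable estimate.

```python
import numpy as np

# Rows = actual class, columns = predicted class,
# in sorted order: CANDIDATE, CONFIRMED, FALSE POSITIVE
cm = np.array([[3, 1, 0],
               [0, 1, 0],
               [0, 0, 5]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # share of each actual class that was found
precision = np.diag(cm) / cm.sum(axis=0)  # share of each predicted class that was right

print(accuracy)   # 9 of 10 correct
print(recall)     # CANDIDATE 0.75, CONFIRMED 1.0, FALSE POSITIVE 1.0
print(precision)  # CANDIDATE 1.0, CONFIRMED 0.5, FALSE POSITIVE 1.0
```

The single off-diagonal entry is one actual CANDIDATE that the model labeled CONFIRMED.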

## SAVE the Model

In [20]:
model.save("DeepLearning_model.h5")


# Hyperparameter Tuning

Use `GridSearchCV` to tune the model's parameters

In [41]:
# Rebuild the same Sequential model for tuning
model = Sequential()

# 1st hidden layer (takes the 10 input features)
model.add(Dense(units=100, activation='relu', input_dim=10))

# 2nd hidden layer
model.add(Dense(units=80, activation='relu'))

# Output layer (one unit per class)
model.add(Dense(units=3, activation='softmax'))

model.summary()

# ==================== Compile the model ====================
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_13 (Dense)             (None, 100)               1100      
_________________________________________________________________
dense_14 (Dense)             (None, 80)                8080      
_________________________________________________________________
dense_15 (Dense)             (None, 3)                 243       
Total params: 9,423
Trainable params: 9,423
Non-trainable params: 0
_________________________________________________________________


In [56]:
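from sklearn.model_selection import GridSearchCV

scoring = ['precision_macro', 'recall_macro', 'balanced_accuracy']

param_grid = dict(epochs=[10,20,30])

# Note: GridSearchCV needs an estimator that implements get_params/set_params.
# A bare Keras Sequential model does not, so it has to be wrapped in a
# scikit-learn-compatible classifier before grid.fit() can succeed.
# With several scoring metrics, refit must name one of them (not False)
# for best_params_, best_score_, and predict() to be available after fitting.
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, n_jobs=-1, cv=3, refit='balanced_accuracy')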

In [57]:
# Train the model with GridSearch

grid.fit(X_train_scaled, y_train)

TypeError: Cannot clone object '<tensorflow.python.keras.engine.sequential.Sequential object at 0x000001DE84A52E88>' (type <class 'tensorflow.python.keras.engine.sequential.Sequential'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.
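
The error above arises because `GridSearchCV` clones its estimator through `get_params`, which a plain Keras `Sequential` object does not provide. The sketch below shows the same search pattern working end to end, but with substitutions that are assumptions and not part of this notebook: scikit-learn's `MLPClassifier` stands in for the Keras network, `max_iter` stands in for `epochs`, and the data is synthetic. All names ending in `_demo` are hypothetical.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 10 features, 3 classes (not the Kepler data set)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# MLPClassifier implements get_params/set_params, so GridSearchCV can clone it
estimator_demo = MLPClassifier(hidden_layer_sizes=(100, 80), random_state=0)

# max_iter plays the role that epochs plays for the Keras model
param_grid_demo = {"max_iter": [10, 20, 30]}
scoring_demo = ["precision_macro", "recall_macro", "balanced_accuracy"]

# With several metrics, refit names the one used to pick the best candidate
grid_demo = GridSearchCV(estimator_demo, param_grid_demo, scoring=scoring_demo,
                         cv=3, refit="balanced_accuracy")

with warnings.catch_warnings():
    # So few iterations are not expected to converge; silence that warning
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    grid_demo.fit(X_demo, y_demo)

print(grid_demo.best_params_)
print(round(grid_demo.best_score_, 3))
```

To tune the Keras model itself, it would need to be wrapped in a scikit-learn-compatible classifier first; which wrapper is available depends on the installed TensorFlow version.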

In [55]:
# List the best parameters for this dataset
print(grid.best_params_)

# List the best score
print('Best Score: %.3f' % grid.best_score_)

AttributeError: 'GridSearchCV' object has no attribute 'best_params_'

In [56]:
# Score the model
print('Model Score: %.3f' % grid.score(X_test_scaled, y_test))

NotFittedError: This GridSearchCV instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

In [30]:
# Make predictions with the hypertuned model
predictions = grid.predict(X_test_scaled)
df_grid = pd.DataFrame({"Actual":y_test, "Predicted":predictions}) 
df_grid.head()

| | Actual | Predicted |
|---|---|---|
| 4982 | FALSE POSITIVE | FALSE POSITIVE |
| 4866 | CANDIDATE | CANDIDATE |
| 2934 | FALSE POSITIVE | FALSE POSITIVE |
| 5007 | FALSE POSITIVE | FALSE POSITIVE |
| 3869 | FALSE POSITIVE | FALSE POSITIVE |


In [51]:
# Calculate classification report
# print(np.array(y_test))

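from sklearn.metrics import classification_report
# classification_report orders classes alphabetically, so pass the names in that
# same order (label_encoder.classes_) rather than the order returned by unique()
print(classification_report(y_test, predictions,
                            target_names=label_encoder.classes_))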

ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets
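
This error is what scikit-learn raises when `y_pred` is a 2-D array of floats, such as raw softmax probabilities, while `y_true` holds class labels. A minimal sketch of the conversion, using made-up probabilities and assuming the alphabetical class order that a `LabelEncoder` produces:

```python
import numpy as np

# Assumed class order (alphabetical, as LabelEncoder would store it)
classes = np.array(["CANDIDATE", "CONFIRMED", "FALSE POSITIVE"])

# Hypothetical softmax outputs: one row per sample, one column per class
probabilities = np.array([[0.10, 0.05, 0.85],
                          [0.70, 0.25, 0.05],
                          [0.20, 0.60, 0.20]])

# Pick the most probable class per row, then map the index back to its name
predicted_labels = classes[probabilities.argmax(axis=1)]
print(predicted_labels)  # ['FALSE POSITIVE' 'CANDIDATE' 'CONFIRMED']
```

Labels in this form have the same type as `y_test`, so they can be passed to `classification_report` directly.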

# Save the Model

In [32]:
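import joblib
from sklearn.linear_model import LogisticRegression

# Note: this pickles the LogisticRegression class itself, not a fitted model.
# The trained Keras network is the one saved earlier to DeepLearning_model.h5.
filename = 'LogisticRegression.sav'
joblib.dump(LogisticRegression, filename)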

['LogisticRegression.sav']