# Machine Learning on NASA Asteroid Data

## Summary
- summarize findings from EDA
- what cols make for best predictors
- what target
- what preprocessing

### Load our data

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# data preprocessing and tuning
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, explained_variance_score

# Suite of Machine Learning Algorithms
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier

import helper

# to get the newest version of helper
import importlib
importlib.reload(helper)

# Suppress all warnings (this also hides scikit-learn messages such as ConvergenceWarning)
import warnings
warnings.filterwarnings("ignore")

In [2]:
analysis_df = pd.read_csv("data/analysis_data.csv")
analysis_df.head()

| Unnamed: 0 | Absolute Magnitude | Close Approach Date | Relative Velocity km per hr | Miss Dist.(kilometers) | Orbit Determination Date | Orbit Uncertainity | Minimum Orbit Intersection | Epoch Osculation | Eccentricity | Perihelion Distance | Aphelion Dist | Perihelion Time | Mean Anomaly | Hazardous | Est Dia in M(avg) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21.6 | 0 | 22017.003799 | 62753692.0 | 534 | 5 | 0.025282 | 2458000.5 | 0.425549 | 0.808259 | 2.005764 | 2458162.0 | 264.837533 | 1 | 205.846088 |
| 1 | 21.3 | 0 | 65210.346095 | 57298148.0 | 432 | 3 | 0.186935 | 2458000.5 | 0.351674 | 0.7182 | 1.497352 | 2457795.0 | 173.741112 | 0 | 236.342931 |
| 2 | 20.3 | 1 | 27326.560182 | 7622911.5 | 1910 | 0 | 0.043058 | 2458000.5 | 0.348248 | 0.950791 | 1.966857 | 2458120.0 | 292.893654 | 1 | 374.578302 |
| 3 | 27.4 | 2 | 40225.948191 | 42683616.0 | 1761 | 6 | 0.005512 | 2458000.5 | 0.216578 | 0.983902 | 1.527904 | 2457902.0 | 68.741007 | 0 | 14.24107 |
| 4 | 21.6 | 2 | 35426.991794 | 61010824.0 | 1190 | 1 | 0.034798 | 2458000.5 | 0.210448 | 0.967687 | 1.483543 | 2457814.0 | 135.142133 | 1 | 205.846088 |


## Base classifiers

### "Shotgun Approach": Classification Models
The target, `Hazardous`, is a binary label (0 or 1), so let's first compare the accuracy of several classification models.

In [3]:
TARGET = "Hazardous"

# X holds the predictor columns; y is a 1-D Series, the label shape scikit-learn expects
X, y = analysis_df.drop(columns=TARGET), analysis_df[TARGET]

In [4]:
# Train test split data
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                        train_size=0.7,
                                        test_size=0.3,
                                        random_state=42)
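If the two classes are imbalanced, a plain random split can leave the train and test sets with slightly different class ratios. The sketch below shows the `stratify` argument of `train_test_split` on synthetic labels; the 16% positive rate and the `*_demo` names are assumptions for illustration, not values taken from this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 labels with an assumed 16% positive rate
y_demo = np.array([1] * 160 + [0] * 840)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class ratio (nearly) identical in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

print(f"positive rate: train={y_tr.mean():.3f}, test={y_te.mean():.3f}")
```

In the notebook this would amount to passing `stratify=y` in the cell above, which would also change the scores reported below.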

In [5]:
# candidate models to compare against each other
# the same dict is reused to test different X_train variations
cat_models = {
    "KNN": {
        "Estimator": KNeighborsClassifier(),
        },
    "SVM": {
        "Estimator": SVC(),
        },
    "CART": {
        "Estimator": DecisionTreeClassifier(),
        },
    "NB": {
        "Estimator": GaussianNB(),
        },
    "LOGREG": {
        "Estimator": LogisticRegression(),
        }
}

In [10]:
# test performance of different models using X_train
helper.test_models_performance(cat_models, X_train, y_train, isRegressor=False)


[MODEL TYPE: KNN]

>>>> Top Performance: 		0.8293
>>>> Average Performance: 	0.8107
>>>> Spread of Performance: 	0.0104

[MODEL TYPE: SVM]

>>>> Top Performance: 		0.8354
>>>> Average Performance: 	0.8348
>>>> Spread of Performance: 	0.0012

[MODEL TYPE: CART]

>>>> Top Performance: 		1.0000
>>>> Average Performance: 	0.9948
>>>> Spread of Performance: 	0.0045

[MODEL TYPE: NB]

>>>> Top Performance: 		0.8476
>>>> Average Performance: 	0.8220
>>>> Spread of Performance: 	0.0157

[MODEL TYPE: LOGREG]

>>>> Top Performance: 		0.8415
>>>> Average Performance: 	0.8338
>>>> Spread of Performance: 	0.0037
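The `helper` module is not shown in this notebook, so how these numbers are computed is not visible here. A plausible reading is that each model is cross-validated and the best fold, the mean, and the standard deviation are reported. The sketch below implements that reading; `summarize_cv_scores`, the fold count, and the synthetic data are all assumptions, not the actual helper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier


def summarize_cv_scores(models, X, y, n_splits=5, seed=42):
    """Hypothetical stand-in for helper.test_models_performance."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    summary = {}
    for name, spec in models.items():
        # one accuracy score per held-out fold
        scores = cross_val_score(spec["Estimator"], X, y, cv=cv, scoring="accuracy")
        summary[name] = {
            "top": scores.max(),
            "average": scores.mean(),
            "spread": scores.std(),
        }
    return summary


# Synthetic stand-in data; the notebook would pass X_train and y_train instead
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
demo_models = {
    "CART": {"Estimator": DecisionTreeClassifier(random_state=0)},
    "LOGREG": {"Estimator": LogisticRegression(max_iter=1000)},
}

for name, stats in summarize_cv_scores(demo_models, X_demo, y_demo).items():
    print(f"[{name}] top={stats['top']:.4f} avg={stats['average']:.4f} spread={stats['spread']:.4f}")
```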


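The decision tree is probably overfit: its average performance is 99.5% and the top-performing tree scores 100%. Among the other models, SVM and Logistic Regression have the best average performance (0.8348 and 0.8338) with the smallest spread, while Naive Bayes has the highest top score (0.8476) but a lower average (0.8220). Let's use Logistic Regression and try to improve it.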
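One way to test an overfitting suspicion is to compare a tree's accuracy on its own training data with its cross-validated accuracy. A large gap points to overfitting, while a score that stays high on held-out folds suggests the features really do separate the classes. The sketch below uses deliberately noisy synthetic data (an assumption, not the asteroid data) to show what the gap looks like.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), so a perfect fit must be memorization
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the same rows the tree was trained on
train_accuracy = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# Mean accuracy on held-out folds
cv_accuracy = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

print(f"train accuracy: {train_accuracy:.4f}")
print(f"cross-validated accuracy: {cv_accuracy:.4f}")
```

Running the same comparison with `X_train` and `y_train` would show which of the two situations applies to the CART result.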

In [11]:
# fit an untuned LogisticRegression to get a baseline test accuracy to improve on
logreg_model = LogisticRegression()

y_pred = helper.fit_predict(logreg_model, X_train, y_train, X_test, y_test, isRegressor=False)

> ACCURACY: 	84.93%
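Since `helper.fit_predict` is not shown, here is a minimal sketch of what a function with that call shape might do for a classifier: fit on the training split, predict on the test split, and print the accuracy. The name `fit_and_score` and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def fit_and_score(model, X_train, y_train, X_test, y_test):
    """Hypothetical stand-in for helper.fit_predict (classification case)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"> ACCURACY: \t{accuracy_score(y_test, y_pred):.2%}")
    return y_pred


# Synthetic stand-in data; the notebook uses its own train/test split
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

y_pred_demo = fit_and_score(LogisticRegression(max_iter=1000), X_tr, y_tr, X_te, y_te)
```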


#### Accuracy to beat: 84.93%

## Exhaustive Machine Learning

### Tuning: Standard Scaler

In [13]:
# use standard scaler
# check if that will gain better results
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [14]:
y_pred = helper.fit_predict(logreg_model, X_train_scaled, y_train, X_test_scaled, y_test, isRegressor=False)

> ACCURACY: 	95.02%


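Standardizing the features raised the test accuracy from 84.93% to 95.02%! This makes sense: the raw columns span very different ranges (miss distance is in the tens of millions of kilometers, eccentricity is below 1), which makes it harder for a regularized model like logistic regression to fit well.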
#### Accuracy to beat: 95.02%
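When scaling is combined with cross-validation, fitting the scaler on the full training set lets every validation fold influence the scaling statistics. Wrapping the scaler and the model in a pipeline refits the scaler inside each fold instead. The sketch below shows this with `make_pipeline`; the synthetic data and the inflated first column are assumptions meant to mimic features on very different scales.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with one feature on a much larger scale
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_demo[:, 0] *= 1e6

# The scaler is fit only on the training part of each fold
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(scaled_logreg, X_demo, y_demo, cv=5)

print(f"mean CV accuracy: {scores.mean():.4f}")
```

The same pipeline object can be passed to `GridSearchCV`, in which case parameter names take a step prefix such as `logisticregression__C`.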

### Tuning: MinMax Scaler

In [15]:
# use minmax scaler
# check if that will gain better results
scaler = MinMaxScaler()
X_train_mm_scaled = scaler.fit_transform(X_train)
X_test_mm_scaled = scaler.transform(X_test)

In [16]:
y_pred = helper.fit_predict(logreg_model, X_train_mm_scaled, y_train, X_test_mm_scaled, y_test, isRegressor=False)

> ACCURACY: 	92.54%


The MinMax scaler gave 92.54%, which is worse than the 95.02% from the standard scaler. Let's stick to our standard scaled data.

#### Accuracy to beat: 95.02%

In [None]:
# Epoch Osculation vs Perihelion Time

### Tuning: GridSearchCV

In [19]:
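# inspect the signed coefficient the model learned for each feature
# note: coef_ comes from the most recent fit of logreg_model; the last fit shown above used MinMax-scaled data
importances, features = logreg_model.coef_[0], list(X)

# pair each feature name with its coefficient, sorted from most positive to most negative
feature_importances = list(zip(features, importances))
feature_importances.sort(reverse=True, key=lambda pair: pair[1])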

feature_importances

[('Eccentricity', 0.8945822234224964),
 ('Aphelion Dist', 0.6969495661330356),
 ('Relative Velocity km per hr', 0.5887025912082454),
 ('Mean Anomaly', 0.11972332411952545),
 ('Perihelion Distance', 0.04467061736083125),
 ('Miss Dist.(kilometers)', 0.0270469711715801),
 ('Orbit Determination Date', -0.07232177828083347),
 ('Close Approach Date', -0.29709808147759004),
 ('Epoch Osculation', -0.3167106608492958),
 ('Perihelion Time', -0.5957942862935678),
 ('Est Dia in M(avg)', -1.2607464341376515),
 ('Orbit Uncertainity', -2.5202651722567797),
 ('Absolute Magnitude', -8.452310552720926),
 ('Minimum Orbit Intersection', -13.698405188997702)]
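The list above is sorted by signed value, so the features with the largest magnitudes (Minimum Orbit Intersection at about -13.70 and Absolute Magnitude at about -8.45) end up at the bottom. If the goal is to rank features by influence, sorting by absolute value is one option, with the caveat that coefficients are only comparable when the features share a common scale. A minimal sketch on synthetic data, with hypothetical feature names:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical feature names
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# Standardize first so that coefficient magnitudes are comparable
model = LogisticRegression(max_iter=1000)
model.fit(StandardScaler().fit_transform(X_demo), y_demo)

coefficients = model.coef_[0]
order = np.argsort(-np.abs(coefficients))  # largest magnitude first
ranked = [(feature_names[i], float(coefficients[i])) for i in order]

for name, value in ranked:
    print(f"{name}: {value:+.4f}")
```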

In [20]:
# choose hyperparameters to test in the GridSearchCV
# note: not every penalty/solver pair is valid (e.g. "lbfgs" does not support 'l1'),
# and 'l1_ratio' only applies when penalty='elasticnet'
hyperparameters = {
    'penalty': ['l1', 'l2', 'elasticnet', 'none'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'solver': ["liblinear", "lbfgs", "newton-cg", "sag", "saga"],
    'max_iter': [100, 200, 500],
    'l1_ratio': [0.0, 0.25, 0.5, 0.75, 1.0]
}
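A single dict makes `GridSearchCV` try the full cross product, including solver and penalty pairs that `LogisticRegression` does not accept. Passing a list of dicts restricts the search to chosen combinations. The sketch below is one possible grouping, not the notebook's grid: it leaves out the unpenalized option (its spelling depends on the scikit-learn version) and the `l1_ratio` endpoints 0.0 and 1.0 (equivalent to pure L2 and pure L1). `ParameterGrid` is used only to count candidates.

```python
from sklearn.model_selection import ParameterGrid

C_VALUES = [0.001, 0.01, 0.1, 1, 10, 100]
MAX_ITERS = [100, 200, 500]

# The grid from the cell above, for comparison
full_grid = {
    "penalty": ["l1", "l2", "elasticnet", "none"],
    "C": C_VALUES,
    "solver": ["liblinear", "lbfgs", "newton-cg", "sag", "saga"],
    "max_iter": MAX_ITERS,
    "l1_ratio": [0.0, 0.25, 0.5, 0.75, 1.0],
}

# One dict per group of solvers that share the same supported penalties
valid_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"],
     "C": C_VALUES, "max_iter": MAX_ITERS},
    {"solver": ["lbfgs", "newton-cg", "sag"], "penalty": ["l2"],
     "C": C_VALUES, "max_iter": MAX_ITERS},
    {"solver": ["saga"], "penalty": ["l1", "l2"],
     "C": C_VALUES, "max_iter": MAX_ITERS},
    {"solver": ["saga"], "penalty": ["elasticnet"], "l1_ratio": [0.25, 0.5, 0.75],
     "C": C_VALUES, "max_iter": MAX_ITERS},
]

n_full = len(ParameterGrid(full_grid))
n_valid = len(ParameterGrid(valid_grid))
print(f"full cross product: {n_full} candidates")
print(f"restricted grid: {n_valid} candidates")
```

With `cv=10`, each candidate is fit ten times, so trimming the grid reduces the run time roughly in proportion.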

In [21]:
tuned_model = LogisticRegression(random_state=42)
model_tuner = GridSearchCV(tuned_model, hyperparameters, cv=10)

In [None]:
# commenting this out because it took 16 minutes to run
# model_tuner.fit(X_train_scaled, y_train)
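If the exhaustive search is too slow to rerun, one alternative is `RandomizedSearchCV`, which samples a fixed number of candidates instead of trying every combination. Both search classes also accept `n_jobs=-1` to use all CPU cores. The sketch below is an assumption about how that could look, on synthetic data and with only `C` being searched.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; the notebook would use X_train_scaled and y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=500),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # sample C on a log scale
    n_iter=10,        # number of sampled candidates, fixed up front
    cv=5,
    random_state=42,
)
search.fit(X_demo, y_demo)

print("best C:", search.best_params_["C"])
print(f"best CV accuracy: {search.best_score_:.4f}")
```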

## Exhaustive ML
- Pick one or two best models and improve them
- Data preprocessing
- Hyperparameter tuning
- Data transformation
- Additional pipelining
- Use eval metrics and predictive data vis (ROC-AUC, heatmap confusion matrices)
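For the evaluation step listed above, a confusion matrix and ROC-AUC say more than accuracy alone when one class is rarer than the other. The sketch below computes both with scikit-learn on deliberately imbalanced synthetic data; the class weights and variable names are assumptions, not properties of the asteroid dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced stand-in data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.84, 0.16], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_te, model.predict(X_te))

# ROC-AUC needs scores or probabilities for the positive class, not hard labels
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(matrix)
print(f"ROC-AUC: {auc:.4f}")
```

The matrix can be drawn as a heatmap with `sns.heatmap(matrix, annot=True, fmt="d")`, using the seaborn import at the top of the notebook.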

## Summary
- How to further improve your model

### Resources