<a href="https://colab.research.google.com/github/cm-int/machine-learning-fundamentals/blob/main/module_2/Labs/Lab_2_1_NearEarthObjects.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 2.1: Creating a Classification Machine Learning Model

In this lab, you’ll build and test several binary classification models to classify Near Earth Objects (NEOs) as Hazardous or Not Hazardous, depending on whether they are likely to collide with the Earth.

You’ll build a Decision Tree model, a Random Forest model, and a Gradient Boosted Tree model. You’ll compare the performance of all three models and examine how adjusting the probability threshold for predictions affects the recall of a model.




**About the data:**

(From https://www.kaggle.com/code/elnahas/nasa-nearest-earth-objects/data)

There is an infinite number of objects in outer space. Some of them are closer than we think. Although we might think that a distance of 70,000 km cannot harm us, at an astronomical scale this is a very small distance, and an object this close can disrupt many natural phenomena. These objects/asteroids can thus prove to be harmful. Hence, it is wise to know what surrounds us and which of these objects can harm us. This dataset compiles the list of NASA-certified asteroids that are classified as nearest Earth objects.

The solution code for this lab is available <a href="https://colab.research.google.com/github/cm-int/machine-learning-fundamentals/blob/main/module_2/Labs/Lab_2_1_NearEarthObjects_solution.ipynb" target="_parent">here</a>.

# Read the Data

In [None]:
# Download the neo_v2.csv file from GitHub

!wget 'https://raw.githubusercontent.com/cm-int/machine-learning-fundamentals/main/module_2/Labs/neo_v2.csv'

In [None]:
# Read the data from the neo_v2.csv file into a Pandas DataFrame named neos and display the data

import numpy as np
import pandas as pd

# Code goes here

In [None]:
# We want to predict whether an object is hazardous. Create a new DataFrame named hazardous containing dummy variables for the 'hazardous' column and display the results


In [None]:
# Rename the two columns to 'Yes' and 'No'. The default names are 'True' and 'False' which are also the names of Python constants, and can cause problems later.


In [None]:
# Remove the 'No' column from the DataFrame and flatten the result into a NumPy array


In [None]:
# Display the first few values from the hazardous array
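
The dummy-variable steps above can be sketched on a toy column. This is a minimal illustration, not the lab solution: the `toy` DataFrame and its values are made up, and renaming by position assumes `get_dummies` returns the columns in the order `False`, `True`.

```python
import pandas as pd

# Toy stand-in for the 'hazardous' column of neos (hypothetical values)
toy = pd.DataFrame({'hazardous': [False, True, False, False, True]})

# One dummy column per distinct value; the columns come back as False, True
dummies = pd.get_dummies(toy['hazardous'])

# Rename by position so the labels no longer match Python constants
dummies.columns = ['No', 'Yes']

# Keep only 'Yes' and flatten it into a 1-D NumPy array of 0s and 1s
labels = dummies.drop(columns='No').to_numpy().astype(int).ravel()
print(labels)
```

A single 0/1 array is enough because 'No' is just the complement of 'Yes'.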


In [None]:
# Is 'sentry_object' useful in the neos DataFrame? How many different values does it have? (Answer: Just one - False repeated 90836 times)


In [None]:
# Similarly, is 'orbiting_body' useful?
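
One way to answer both questions is to count the distinct values per column. The sketch below uses a made-up `toy` DataFrame; `feature_a` is a hypothetical stand-in for a real numeric feature. A column with a single distinct value carries no information a classifier can use.

```python
import pandas as pd

# Toy stand-in for neos (hypothetical values; 'feature_a' is an invented column)
toy = pd.DataFrame({
    'feature_a': [0.1, 0.3, 0.2, 0.5],
    'orbiting_body': ['Earth'] * 4,
    'sentry_object': [False] * 4,
})

# Number of distinct values in each column
unique_counts = toy.nunique()

# Columns with exactly one distinct value are constant
constant_columns = unique_counts[unique_counts == 1].index.tolist()
print(constant_columns)

# value_counts shows which value is repeated and how often
print(toy['sentry_object'].value_counts())
```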


In [None]:
# Remove the 'hazardous' column from the neos DataFrame, along with columns that probably won't help with classification (orbiting_body, sentry_object)



# Visualize the Data

In [None]:
# Create a TSNE model and transform the data in the neos DataFrame into a 2D representation
# NOTE: This step uses a random sample of 2000 rows from the neos DataFrame, to save time

# This step is complete

from sklearn.manifold import TSNE

sample_data = neos.sample(2000)
sample_hazardous = hazardous[sample_data.index]

visual_model = TSNE(learning_rate = 100, init='pca')
visual_transformation = visual_model.fit_transform(sample_data)


In [None]:
# Extract the first and second columns from the array containing the transformed data and use them to create a Pandas DataFrame

# This step is complete

x_data = visual_transformation[:, 0]
y_data = visual_transformation[:, 1]
transformed_data = pd.DataFrame({'x':x_data, 'y':y_data})

In [None]:
# Display the results as a Matplotlib graph
# The clustering doesn't look promising!

# This step is complete

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
plt.scatter(transformed_data.loc[sample_hazardous == 0, 'x'], transformed_data.loc[sample_hazardous == 0, 'y'], c='red')
plt.scatter(transformed_data.loc[sample_hazardous == 1, 'x'], transformed_data.loc[sample_hazardous == 1, 'y'], c='blue')
plt.legend(loc='lower left', labels=['Not Hazardous', 'Hazardous'])
plt.show()

In [None]:
# Try creating a 3D TSNE model
# NOTE: This step takes between 4 and 5 minutes to run



In [None]:
# Extract the first, second, and third columns from the array containing the transformed data and use them to create a Pandas DataFrame


In [None]:
# Display the results as a 3D subplot in a Matplotlib graph
# This time the clustering looks better, so there's a chance we can come up with a decent model


# Classification using a Decision Tree Model

In [None]:
# Split the data into training and test datasets named features_train, features_test, predictions_train, predictions_test


In [None]:
# Create and fit a Decision Tree model. Use the default values for the hyperparameters.



In [None]:
# Display and review the decision tree using graphviz and export_graphviz
# NOTE: This step takes a minute or two to run. Scroll the results pane to see the tree when it is complete.


In [None]:
# Use the model to make predictions using the features_test dataset. Save the results in an array named test_results


In [None]:
# Compare the first 30 test predictions to the actual test results
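
The split, fit, predict, and compare steps above can be sketched end to end on synthetic data. This is an illustration under assumptions rather than the lab solution: `make_classification` stands in for the NEO data, and the 25% test size, stratified split, and `random_state` values are choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the NEO features and labels (about 10% positive)
X, y = make_classification(n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=0)

# Hold out 25% of the rows for testing, keeping the class ratio in both parts
features_train, features_test, predictions_train, predictions_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Decision tree with default hyperparameters (random_state only fixes tie-breaking)
classifier_model = DecisionTreeClassifier(random_state=0)
classifier_model.fit(features_train, predictions_train)

# Predict the held-out rows and compare with the known labels
test_results = classifier_model.predict(features_test)
print(test_results[:30])
print(predictions_test[:30])
```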



In [None]:
# Examine the accuracy, precision and recall of the model using K-Fold cross-validation. Set K to 5
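
As a hedged sketch of 5-fold cross-validation, `cross_validate` can score several metrics in one pass. The synthetic data from `make_classification` is an assumption standing in for the NEO features. For a classifier, an integer `cv` uses stratified folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the NEO data
X, y = make_classification(n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=0)

metrics = ['accuracy', 'precision', 'recall']

# Fit and score the model on 5 different train/validation splits
scores = cross_validate(DecisionTreeClassifier(random_state=0), X, y, cv=5, scoring=metrics)

# Each metric has one score per fold; report the mean
for metric in metrics:
    print(f"{metric}: {scores[f'test_{metric}'].mean():.3f}")
```

On imbalanced data, accuracy alone can look good while recall stays low, which is why all three metrics are reported.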


In [None]:
# Display the confusion matrix for the model. How many potentially Hazardous NEOs have been misclassified as Not Hazardous (false negatives)? 
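
To read the matrix, it helps to know how scikit-learn lays it out: rows are actual classes and columns are predicted classes, so a binary matrix is `[[TN, FP], [FN, TP]]`. The small hand-made arrays below are illustrative values only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels: 1 = Hazardous, 0 = Not Hazardous (illustrative values)
actual = np.array([0, 0, 0, 0, 1, 1, 1, 0])
predicted = np.array([0, 0, 1, 0, 1, 0, 0, 0])

matrix = confusion_matrix(actual, predicted)
print(matrix)

# Flattening a binary matrix row by row yields TN, FP, FN, TP
tn, fp, fn, tp = matrix.ravel()
print(f'False negatives (Hazardous labelled Not Hazardous): {fn}')
```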



In [None]:
# Generate the ROC curve and calculate the AUC. Is this model better than random guesswork?


In [None]:
# View the class probabilities the decision tree model produces for each prediction. With default hyperparameters the tree typically grows until its leaves are pure, so each probability is either 1 or 0

# This step is complete

test_probs = classifier_model.predict_proba(features_test)
test_probs[0:30]

# Classification using a Random Forest

In [None]:
# Create and fit the model using the same training and test datasets as before


In [None]:
# Test predictions and display the results


In [None]:
# Cross-validate the model and compare the accuracy, precision, and recall against the Decision Tree model


In [None]:
# Compare Confusion Matrix and ROC curves to the decision tree model


# Note: The numbers of FPs and FNs have both dropped slightly. The model is still more likely to label a Hazardous NEO as Not Hazardous than the reverse.
# Is this a good thing? The default threshold of 0.5 is probably too high for this dataset, so investigate further ...

In [None]:
# Find the probabilities for each prediction in the test sample


In [None]:
# Keep the probabilities for the positive outcome (Hazardous) and discard those for the negative outcome (Not Hazardous)


In [None]:
# Generate and display the ROC curve for the model

# Calculate the J statistic to find the optimal threshold for this model. Store this value in a variable named optimal_threshold.


In [None]:
# Display the value of the optimal threshold for this model
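
Youden's J statistic is the true positive rate minus the false positive rate at each candidate threshold, and the threshold that maximizes J is taken as optimal. A minimal sketch follows, assuming synthetic data and a default Random Forest in place of the lab's model. The names `forest`, `X_train`, and `y_test` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the NEO data
X, y = make_classification(n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Column 1 holds the probability of the positive (Hazardous) class
test_probs_hazardous = forest.predict_proba(X_test)[:, 1]

# roc_curve returns one (FPR, TPR) pair per candidate threshold
fpr, tpr, thresholds = roc_curve(y_test, test_probs_hazardous)

# Youden's J: how far the curve sits above the diagonal at each threshold
j_scores = tpr - fpr
optimal_threshold = thresholds[np.argmax(j_scores)]
print(optimal_threshold)
```

J weights false positives and false negatives equally. If missing a Hazardous object is considered worse, a lower threshold than the J optimum may be preferable.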


In [None]:
# Examine how using the optimal threshold impacts the quality of the predictions made using the model 

# This step is complete

# Find the index of every prediction with a probability >= optimal_threshold
indexes = np.where(test_probs_hazardous >= optimal_threshold)

# Create a new array for holding the adjusted predictions after applying the new threshold. Initialize it with zeros, and make it the same length as the original set of test results
adjusted_test_results = np.zeros(test_results.size)

# Set the value in the adjusted predictions array to 1 for each item indicated by the indexes array
adjusted_test_results[indexes] = 1

# Display the results
print(adjusted_test_results)

In [None]:
# Generate a confusion matrix comparing the predictions_test data with the adjusted test results

# The number of false negatives should have dropped considerably, although there is now an increased level of false positives. Are false positives better than false negatives?


In [None]:
# Calculate the accuracy, precision, and recall for the model using this new threshold
# Recall should be much improved from before. Why is this more important than precision for this model?

# This step is complete

from sklearn.metrics import accuracy_score, precision_score, recall_score

accuracy = accuracy_score(predictions_test, adjusted_test_results)
precision = precision_score(predictions_test, adjusted_test_results)
recall = recall_score(predictions_test, adjusted_test_results)

print(f'Accuracy: {accuracy}\n')
print(f'Precision: {precision}\n')
print(f'Recall: {recall}\n')

# Classification using Gradient Boosted Decision Tree

In [None]:
# Create and fit a Gradient Boosted decision tree model. Set n_estimators to 100, learning_rate to 0.1, and max_depth to 3
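
A hedged sketch of these hyperparameters using scikit-learn's `GradientBoostingClassifier`, again on synthetic data rather than the NEO dataset. The names `gb_model` and `gb_probs` are hypothetical, and the lab solution may use a different library or layout.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the NEO data
X, y = make_classification(n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# 100 shallow trees (depth 3), each correcting the errors of the ones before it;
# learning_rate scales the contribution of every tree
gb_model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
gb_model.fit(X_train, y_train)

# One row per test sample: [P(Not Hazardous), P(Hazardous)]
gb_probs = gb_model.predict_proba(X_test)
print(gb_probs[:5])
```

Because these probabilities are graded rather than pinned to 0 or 1, moving the decision threshold changes which samples are labelled Hazardous.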


In [None]:
# Test predictions and display the results


In [None]:
# Perform cross-validation, and calculate the accuracy, precision and recall of the model


In [None]:
# Generate the Confusion Matrix and ROC curves

# This model has even more FNs than the original Random Forest, but far fewer FPs

In [None]:
# Plot the ROC curve and calculate the optimal threshold for this model, as per the Random Forest model


In [None]:
# Display the value of the optimal threshold for this model

In [None]:
# Follow the same process as before to see how amending the threshold affects the quality of the model

# Find the index of every prediction with a probability >= optimal_threshold


# Create a new array for holding the adjusted predictions after applying the new threshold. Initialize it with zeros, and make it the same length as the original set of test results

# Set the value in the adjusted predictions array to 1 for each item indicated by the indexes array

# Display the results


In [None]:
# Calculate the accuracy, precision, and recall for the model using this new threshold
# Again, recall should be much improved from before.

