# Import Required Libraries

In [76]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

In this modelling notebook, we build several models that can predict whether an asteroid is potentially hazardous. To do so, we import the libraries for the models we would like to try, followed by the evaluation metrics.

In [77]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

Before building the predictive models, we have to split the full training data into training and validation sets. It is best practice to compare models on the validation set and to hold the test data back until we have found the best model.

In [78]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, precision_score, recall_score

First, we read our train and test datasets. These datasets were cleaned in the preprocessing part.

In [79]:
TRAIN = pd.read_csv('../input/asteroid-train-and-test-dataset-clean/TRAIN_CLEANED.csv')
TEST = pd.read_csv('../input/asteroid-train-and-test-dataset-clean/TEST_CLEANED.csv')
TRAIN.head()

In [80]:
temp_df = TRAIN.copy()
X = temp_df.drop('potentially_hazardous', axis = 1) # features
y = temp_df['potentially_hazardous'] # target

# Data Splitting

In [81]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, stratify = y, random_state = 42)
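
As a quick sanity check on what `stratify = y` does, the sketch below splits a small synthetic, imbalanced label vector and compares the positive-class share in each part. The toy data (`X_toy`, `y_toy`, 10% positives) is an assumption for illustration and is not the asteroid dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 10% positive class (not the asteroid data)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 100 + [0] * 900)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# With stratification, both parts keep roughly the original positive share
train_share = y_tr.mean()
valid_share = y_va.mean()
print(f'train: {train_share:.3f}, valid: {valid_share:.3f}')
```

The same comparison can be run on `y_train` and `y_valid` from the cell above to confirm that the hazardous class is represented equally in both parts.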

In [82]:
X_test = TEST.drop('potentially_hazardous', axis = 1)
y_test = TEST['potentially_hazardous']

# Build Baseline Models

There are several models that we can try as our baseline models. After getting the result, we can choose one of these models as our final model.

In [83]:
models = [LogisticRegression(), GaussianNB(), KNeighborsClassifier(), DecisionTreeClassifier(), LGBMClassifier()]

We define a function that displays the classification report and the confusion matrix, two of the most important evaluation tools for classification problems. The function also returns the recall score of the model. We choose recall as our main evaluation metric because the dataset is extremely imbalanced: a model can reach high accuracy by mostly predicting the majority class, while a missed hazardous asteroid is the error we care about most.

In [84]:
def get_model_recall(model):
    # Fit on the training split and predict on the validation split
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)
    print('\nModel: ' + str(model))
    # scikit-learn metrics expect (y_true, y_pred), so the true labels go first
    print(f'Accuracy Score: {accuracy_score(y_valid, predictions)}')
    print('Classification Report:')
    print(classification_report(y_valid, predictions))
    print('Confusion Matrix:')
    # Rows are actual classes, columns are predicted classes
    sns.heatmap(confusion_matrix(y_valid, predictions), annot = True, fmt = 'd', cmap = 'YlGnBu')
    plt.show()
    print('')
    return recall_score(y_valid, predictions)
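
The choice of recall over accuracy can be illustrated with a toy example. The labels below are made up (assumption: 95 negatives and 5 positives), and the "model" simply predicts the majority class everywhere. Note that scikit-learn metrics take the true labels first and the predictions second.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 95 non-hazardous (0) and 5 hazardous (1)
y_true = np.array([0] * 95 + [1] * 5)
# A trivial "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Metrics are called as metric(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
print(f'accuracy: {acc:.2f}, recall: {rec:.2f}')
```

The accuracy looks high even though not a single positive sample is found, which is why recall is the more informative number on data like this.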

# Get Recall of Models

In [85]:
evaluation_table = {}

In [86]:
# Train and evaluate once, storing the returned recall score
evaluation_table['CatBoostClassifier()'] = get_model_recall(CatBoostClassifier())

In [87]:
# Train and evaluate once, storing the returned recall score
evaluation_table['XGBoostClassifier()'] = get_model_recall(XGBClassifier())

In [88]:
# Train and evaluate once, storing the returned recall score
evaluation_table['RandomForestClassifier()'] = get_model_recall(RandomForestClassifier())

In [89]:
for model in models:
    evaluation_table[str(model)] = get_model_recall(model)

In [90]:
print(evaluation_table)

In [91]:
for e in evaluation_table:
    print(e)

# Create Data Frame to View Recall Score of All Models

We now have the recall score of every model. To get a better overview, we can collect the scores in a data frame.

In [92]:
model_performance_table = pd.DataFrame(evaluation_table.values(), evaluation_table.keys())
model_performance_table.rename(columns = {0: 'Recall Score'}, inplace = True)
model_performance_table.sort_values('Recall Score', inplace = True, ascending = False)
model_performance_table

From the data frame above, we can see that the top 3 models are XGBoost, Random Forest, and Decision Tree. We choose XGBoost as our final model, since it is an accurate and computationally fast implementation of gradient boosting. A single Decision Tree, by contrast, tends to overfit. Random Forest generalizes better than a Decision Tree, but we still consider XGBoost the better model overall.
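
One way to check whether a model such as a Decision Tree overfits is to compare its recall on the training split with its recall on the validation split. The sketch below does this on synthetic, imbalanced data generated with `make_classification`; the data and the names `X_syn`, `y_syn` are assumptions for illustration, not the asteroid dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption: not the asteroid dataset)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

# An unpruned tree grows until it fits the training split
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# A large gap between the two scores suggests the tree memorised the training data
train_recall = recall_score(y_tr, tree.predict(X_tr))
valid_recall = recall_score(y_va, tree.predict(X_va))
print(f'train recall: {train_recall:.3f}, valid recall: {valid_recall:.3f}')
```

Running the same comparison with `X_train`, `X_valid`, `y_train`, and `y_valid` from this notebook would show how large the gap is for each candidate model.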

In [93]:
best_model = XGBClassifier()
best_model.fit(X_train, y_train)
predictions = best_model.predict(X_valid)

In [94]:
predictions

# Final Validation

In [95]:
def evaluate_model(model, predictions, actual):
    # scikit-learn metrics expect (y_true, y_pred), so the actual labels go first
    print('CLASSIFICATION REPORT:')
    print(classification_report(actual, predictions))
    print('=' * 100)
    print('\nCONFUSION MATRIX:')
    # Rows are actual classes, columns are predicted classes
    sns.heatmap(confusion_matrix(actual, predictions), annot = True, fmt = 'd', cmap = 'YlGnBu')
    plt.title('Confusion Matrix')
    plt.show()

In [96]:
evaluate_model(best_model, predictions, y_valid)

In [97]:
final_predictions = best_model.predict(X_test)
final_predictions

# Testing

In [98]:
evaluate_model(best_model, final_predictions, y_test)

We get a very good result on the test data: only 1 sample is predicted incorrectly.
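
The number of misclassified samples can be read off a confusion matrix as the sum of its off-diagonal entries. The sketch below uses made-up labels (an assumption, not the notebook's test set) in which exactly one prediction is wrong.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; exactly one hazardous sample is missed
y_actual = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_predicted = np.array([0, 0, 0, 0, 1, 0, 0, 1])

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_actual, y_predicted)
# Correct predictions lie on the diagonal; everything else is an error
n_errors = int(cm.sum() - np.trace(cm))
print(f'misclassified samples: {n_errors}')
```

Applied to `y_test` and `final_predictions`, the same two lines give the error count directly, and the position of the off-diagonal entry shows whether the error is a missed hazardous asteroid or a false alarm.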