# NASA asteroid classification

KATE expects your code to define variables with specific names that correspond to certain things we are interested in.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

In this assignment you will train a Decision Tree Classifier to predict whether some Near-Earth Objects (NEOs) constitute a hazard for planet Earth. The dataset is a pre-processed version of the original dataset, which comes from a NASA [website](https://data.nasa.gov/Space-Science/Asteroids-NeoWs-API/73uw-d9i8) and is also available at this [link](https://www.kaggle.com/shrutimehta/nasa-asteroids-classification).

For the preprocessing, some columns have been deleted either because they were redundant, because they were of no use for solving the classification problem, or because they were too correlated with the output variable.

In [1]:
import pandas as pd
import numpy as np

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report


np.random.seed(8)

### Dataset and variables

**1. Load the dataset called `nasa_asteroids.csv` into a DataFrame**

Store the answer in a variable called `df` and call the `.head()` method to view the first 5 rows.

In [2]:
# Add your code below
df = pd.read_csv('data/nasa_asteroids.csv')
df.head()

Unnamed: 0,Est Dia in KM(min),Est Dia in KM(max),Relative Velocity km per sec,Relative Velocity km per hr,Miles per hour,Orbit Uncertainity,Jupiter Tisserand Invariant,Epoch Osculation,Eccentricity,Semi Major Axis,Inclination,Asc Node Longitude,Orbital Period,Perihelion Distance,Perihelion Arg,Aphelion Dist,Mean Anomaly,Mean Motion,Hazardous
0,0.12722,0.284472,6.115834,22017.003799,13680.509944,5,4.634,2458000.5,0.425549,1.407011,6.025981,314.373913,609.599786,0.808259,57.25747,2.005764,264.837533,0.590551,True
1,0.146068,0.326618,18.113985,65210.346095,40519.173105,3,5.457,2458000.5,0.351674,1.107776,28.412996,136.717242,425.869294,0.7182,313.091975,1.497352,173.741112,0.84533,False
2,0.231502,0.517654,7.590711,27326.560182,16979.661798,0,4.557,2458000.5,0.348248,1.458824,4.237961,259.475979,643.580228,0.950791,248.415038,1.966857,292.893654,0.559371,True
3,0.008801,0.019681,11.173874,40225.948191,24994.839864,6,5.093,2458000.5,0.216578,1.255903,7.905894,57.173266,514.08214,0.983902,18.707701,1.527904,68.741007,0.700277,False
4,0.12722,0.284472,9.840831,35426.991794,22012.954985,1,5.154,2458000.5,0.210448,1.225615,16.793382,84.629307,495.597821,0.967687,158.263596,1.483543,135.142133,0.726395,True


Looking at the first few rows we can see that all the columns are numerical except for the output variable we want to predict, the `Hazardous` column, whose values are boolean True / False. We will convert it to a numerical format later in the notebook.

Here is a table with the description of the features. Our goal is to use all the available information regarding the asteroids to predict whether or not each is hazardous.

| Feature name                 | Description                                                                                                             |
|------------------------------|-------------------------------------------------------------------------------------------------------------------------|
| Est Dia in KM(min)           | estimate of the minimum diameter of the asteroid in kilometres (KM)                                                     |
| Est Dia in KM(max)           | estimate of the maximum diameter of the asteroid in kilometres (KM)                                                     |
| Relative Velocity km per sec | relative velocity of the asteroid in kilometres per second                                                              |
| Relative Velocity km per hr  | relative velocity of the asteroid in kilometres per hour                                                                |
| Miles per hour               | relative velocity of the asteroid in miles per hour                                                                     |
| Orbit Uncertainity           | parameter that indicates the uncertainty in the identification of the orbit                                             |
| Jupiter Tisserand Invariant  | Tisserand’s parameter for the asteroid. This parameter is used to distinguish different kinds of orbits                 |
| Epoch Osculation             | the instant of time at which the position and velocity vectors are specified                                            |
| Eccentricity                 | eccentricity of the asteroid’s orbit, i.e. how far from circular each orbit is                                          |
| Semi Major Axis              | Semi Major Axis of the asteroid’s orbit                                                                                 |
| Inclination                  | inclination of the asteroid's orbit                                                                                     |
| Asc Node Longitude           | angular position at which the asteroid passes from the southern side of the orbital plane of Earth to the northern side |
| Orbital Period               | time the asteroid takes to complete one orbit                                                                           |
| Perihelion Distance          | Perihelion distance of the asteroid. For a body orbiting the Sun, the point of least distance is the perihelion         |
| Perihelion Arg               | angle (starting from the center of the orbit) between the asteroid's periapsis and its ascending node                   |
| Aphelion Dist                | distance from the Sun at aphelion, the point of the orbit at which the asteroid is farthest from the Sun                |
| Mean Anomaly                 | fraction of an elliptical orbit's period that has elapsed since the asteroid passed periapsis                           |
| Mean Motion                  | angular speed required for a body to complete one orbit                                                                 |
| Hazardous                    | whether the asteroid is hazardous or not                                                                                |

**2. Create a variable where you store the numpy array with the values of all the columns of `df` except for `Hazardous`**

Call this variable `X`

*Hint: to transform a DataFrame into a numpy array you can use the `.to_numpy()` method.*


In [47]:
# Add your code below
X = df.drop(columns=['Hazardous']).to_numpy()
X.shape

(4687, 18)

**3. Create a variable `y` where you store the array with the values of `Hazardous`, where the values `True` are converted to `1` and the values `False` are converted to `0`.**

*Hint: Converting a boolean array `arr` with values True/False to an array with values 1/0 is as easy as multiplying `arr` by 1!*

Your answer should be a numpy array

In [48]:
# Add your code below
y = (df['Hazardous']*1).to_numpy()
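As a small illustration of the hint above, the sketch below converts a toy boolean Series (made-up values standing in for `df['Hazardous']`) in two ways: by multiplying by 1, as the hint suggests, and with `.astype(int)`, which is an alternative not mentioned in the assignment. Both should give the same array.

```python
import pandas as pd

# Toy boolean column standing in for df['Hazardous'] (values are made up).
hazardous = pd.Series([True, False, True, False, False])

# Option 1: multiply by 1, as in the hint.
by_multiplying = (hazardous * 1).to_numpy()

# Option 2 (an alternative, not required by the assignment): explicit cast.
by_casting = hazardous.astype(int).to_numpy()

print(by_multiplying)
print(by_casting)
```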


**4. What is the fraction of the elements of `y` with label 0? Store this number in the variable `frac0`.**

In [54]:
# Add your code below
frac0 = 1 - np.mean(y)
frac0

0.8389161510561126

**5. What is the fraction of the elements of `y` with label 1? Store this number in the variable `frac1`.**

In [55]:
# Add your code below
frac1 = np.mean(y)
frac1

0.16108384894388736

The dataset is imbalanced: only about 16% of the asteroids are labelled hazardous. Our tree will have to give more weight to the minority class.
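To make "more weight" concrete, here is a minimal sketch of how scikit-learn's `"balanced"` class weights are computed, namely `n_samples / (n_classes * count_per_class)`. The array `y_toy` is a made-up stand-in with roughly the same imbalance as `y`; it is not the real data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with roughly the same imbalance as `y` (84 zeros, 16 ones).
y_toy = np.array([0] * 84 + [1] * 16)

# "balanced" weights follow n_samples / (n_classes * count_per_class).
manual = len(y_toy) / (2 * np.bincount(y_toy))
from_sklearn = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_toy)

print(manual)
print(from_sklearn)
```

With these toy counts the minority class receives a weight of 3.125, a little over five times the weight of the majority class.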

**6. Split the data into training and test sets using sklearn's `train_test_split`, specifying `random_state=8`. Also remember to set the argument `stratify=y` and the `test_size` to 0.25.**

Save the result in the variables `x_train`, `x_test`, `y_train`, `y_test`.

In [56]:
# Add your code below
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=8,
    stratify=y
)
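The effect of `stratify` can be checked on toy data. The sketch below uses made-up arrays (`X_toy`, `y_toy`), not the NASA dataset, and compares the fraction of label 1 in the full set, the training set and the test set; with stratification the three should match closely.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 samples, 16% positives (made-up values, not the NASA data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 160 + [0] * 840)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=8, stratify=y_toy
)

# Fraction of label 1 in the full, training and test labels.
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```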


### Decision Tree Classifier

**7. Create a decision tree classifier, with `max_depth=5`, `min_samples_split=3`, `min_samples_leaf=5`, `class_weight="balanced"`, and `random_state=101`**

The `random_state` is for reproducibility.

Store the tree in a variable called `dec_tree`.

In [57]:
# Add your code below
from sklearn.tree import DecisionTreeClassifier

dec_tree = DecisionTreeClassifier(max_depth=5, min_samples_split=3, min_samples_leaf=5, class_weight='balanced', random_state=101)

**8. Fit the model on `x_train` and `y_train` and compute the accuracy on the training set. Use the `accuracy_score` function from `sklearn.metrics`.**

Store the accuracy in a variable called `acc_train`.


In [58]:
# Add your code below
dec_tree.fit(x_train, y_train)

from sklearn.metrics import accuracy_score

acc_train = accuracy_score(y_train, dec_tree.predict(x_train))
acc_train

0.7914651493598862

**9. What is the accuracy on the test set?**

Store your answer in a variable called `acc_test`.


In [59]:
# Add your code below
acc_test = accuracy_score(y_test, dec_tree.predict(x_test))
acc_test


0.7815699658703071
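A useful reference point for these accuracies is the majority-class baseline. Since about 84% of the labels are 0, a model that always predicts 0 would already reach roughly 0.84 accuracy while finding none of the hazardous asteroids, which is why accuracy alone is not enough here. The sketch below illustrates this with scikit-learn's `DummyClassifier` on made-up labels; `X_toy` and `y_toy` are hypothetical stand-ins.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy labels with 84% zeros; the features are irrelevant to this baseline.
y_toy = np.array([0] * 84 + [1] * 16)
X_toy = np.zeros((len(y_toy), 1))

# Always predicts the most frequent class seen during fit (here, 0).
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

baseline_acc = accuracy_score(y_toy, pred)
baseline_recall = recall_score(y_toy, pred)  # recall for class 1
print(baseline_acc, baseline_recall)
```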

**10. Create a variable `class_report` where you store the classification report on the test set obtained with the `classification_report` function from `sklearn.metrics`.**

In [60]:
# Add your code below
from sklearn.metrics import classification_report

class_report = classification_report(y_test, dec_tree.predict(x_test))
print(class_report)

              precision    recall  f1-score   support

           0       0.99      0.74      0.85       983
           1       0.42      0.97      0.59       189

    accuracy                           0.78      1172
   macro avg       0.71      0.86      0.72      1172
weighted avg       0.90      0.78      0.81      1172



Uncomment and run the cell below to have a nicer view of the report.

In [61]:
print(class_report)

              precision    recall  f1-score   support

           0       0.99      0.74      0.85       983
           1       0.42      0.97      0.59       189

    accuracy                           0.78      1172
   macro avg       0.71      0.86      0.72      1172
weighted avg       0.90      0.78      0.81      1172
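The precision and recall columns of the report can be reproduced from the confusion matrix. The sketch below does this for class 1 on made-up labels and predictions (not the notebook's results) and compares the values with the ones returned by `classification_report(..., output_dict=True)`.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels and predictions, only to illustrate the arithmetic.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision_1 = tp / (tp + fp)  # of the predicted 1s, how many are really 1
recall_1 = tp / (tp + fn)     # of the real 1s, how many were found

# The same numbers, read from the report as a nested dictionary.
report = classification_report(y_true, y_pred, output_dict=True)
print(precision_1, report["1"]["precision"])
print(recall_1, report["1"]["recall"])
```

Read this way, the first report above says that the tree finds 97% of the hazardous asteroids in the test set, but only 42% of the asteroids it flags are actually hazardous.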



### Let's now do a Grid Search to find better hyperparameters

**11. Define a dictionary with the values of the parameters we want to try in the grid search.**

Namely, for `max_depth` we will try the values 3, 5, 12; for `min_samples_split` 3, 8, 16; for `min_samples_leaf` 1, 5, 10.

Store the dictionary in a variable called `grid_param`.


In [69]:
# Add your code below
from sklearn.model_selection import GridSearchCV
grid_param = {'max_depth': [3, 5, 12], 'min_samples_split': [3, 8, 16], 'min_samples_leaf': [1, 5, 10]}
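To get a feel for the size of this search, the sketch below counts the parameter combinations with scikit-learn's `ParameterGrid`. The factor of 5 assumes 5-fold cross-validation, as used in the next step.

```python
from sklearn.model_selection import ParameterGrid

grid_param = {
    "max_depth": [3, 5, 12],
    "min_samples_split": [3, 8, 16],
    "min_samples_leaf": [1, 5, 10],
}

# Every combination of the three lists is one candidate model.
n_candidates = len(ParameterGrid(grid_param))
n_fits = n_candidates * 5  # 5 folds per candidate, before the final refit
print(n_candidates, n_fits)  # 27 135
```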

**12. Define a GridSearch object, where you set `cv=5`. Make sure you use the `dec_tree` variable created earlier.**

Store it in a variable called `grid_cv`.

In [70]:
# Add your code below
grid_cv = GridSearchCV(dec_tree, grid_param, cv=5)

**13. Fit `grid_cv` on the training set and store the best parameters found with the Grid Search in a variable called `best_parameters`.**

Use the `best_params_` attribute.

In [71]:
# Add your code below
grid_cv.fit(x_train, y_train)
best_parameters = grid_cv.best_params_
best_parameters


{'max_depth': 12, 'min_samples_leaf': 1, 'min_samples_split': 3}
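Beyond `best_params_`, a fitted grid search also exposes `cv_results_`, which can help judge how close the candidates were. The sketch below is only an illustration on synthetic data from `make_classification` (not the NASA dataset) with a smaller, hypothetical grid; it loads the results into a DataFrame and sorts them by rank. By default a classifier's grid search is scored with accuracy.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not the NASA dataset).
X_toy, y_toy = make_classification(
    n_samples=400, n_features=6, weights=[0.84, 0.16], random_state=8
)

search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=101),
    {"max_depth": [3, 5, 12], "min_samples_leaf": [1, 5, 10]},
    cv=5,
)
search.fit(X_toy, y_toy)

# One row per candidate, with mean and spread of the cross-validated score.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score"]
print(results.sort_values("rank_test_score")[columns].head(3))
```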

**14. Create a copy of `dec_tree` and store it in a variable called `dec_tree_best`; set its parameters to `best_parameters`. Now train the model on the training set. What is the accuracy on the training set? 
Store your answer in a variable called `acc_train_best`.**

*Hint: to create a copy of your tree, use the `sklearn.base.clone` function, for example `new_tree = sklearn.base.clone(old_tree)`.*

In [76]:
# Add your code below
from sklearn.base import clone

dec_tree_best = clone(dec_tree).set_params(**best_parameters)
dec_tree_best.fit(x_train, y_train)
acc_train_best = accuracy_score(y_train, dec_tree_best.predict(x_train))
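The reason for cloning rather than reusing `dec_tree` directly can be seen in a small sketch on synthetic data (the names `original` and `copied` are hypothetical): `clone` copies the hyperparameters but not the fitted state, and `set_params` on the clone leaves the original estimator untouched.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the NASA dataset).
X_toy, y_toy = make_classification(n_samples=200, random_state=8)

original = DecisionTreeClassifier(max_depth=5, random_state=101).fit(X_toy, y_toy)

# clone copies the hyperparameters but not the fitted state.
copied = clone(original).set_params(max_depth=12)

print(original.get_params()["max_depth"])  # 5: the original is untouched
print(copied.get_params()["max_depth"])    # 12
print(hasattr(original, "tree_"), hasattr(copied, "tree_"))  # True False
```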

**15. What is the accuracy on the test set?**

Store your answer in a variable called `acc_test_best`.

In [77]:
# Add your code below
acc_test_best = accuracy_score(y_test, dec_tree_best.predict(x_test))
print(acc_test_best)

0.8959044368600683


**16. Create a variable `class_report_best` where you store the classification report on the test set.**

In [78]:
# Add your code below
class_report_best = classification_report(y_test, dec_tree_best.predict(x_test))


Uncomment and run the cell below to have a nicer view of the report.

In [79]:
print(class_report_best)

              precision    recall  f1-score   support

           0       0.98      0.90      0.94       983
           1       0.62      0.90      0.74       189

    accuracy                           0.90      1172
   macro avg       0.80      0.90      0.84      1172
weighted avg       0.92      0.90      0.90      1172

