<h1 style="text-align: center">Maratona Behind The Code</h1>
<h2 style="text-align: center">Final Challenge -  Machine Learning applied to Planet Exploration</h2>

<hr>

## Short Description

Astronomy has always fascinated mankind. As far back as we can trace, ancient civilizations looked to the sky, found patterns in the dynamics of the night sky, and noticed whether a celestial body emits light or merely reflects it. Using this simple approach, the Greeks, the Egyptians and the Babylonians mapped the planets out to Saturn.

More than a thousand years passed before the discovery of Uranus, which was only possible thanks to technological advances. During the last three decades humanity has discovered more planets than ever, and now it is time to put A.I. to work helping astronomers classify celestial bodies whose light and gravitational disturbance are barely measurable. Are you up for the challenge?

<hr>

## Installing Libs

In [None]:
!pip install --upgrade ibm-cos-sdk==2.7.0

In [None]:
!pip install --upgrade --force-reinstall ibm_watson_machine_learning

In [None]:
!pip install scikit-learn --upgrade

In [None]:
!pip install xgboost --upgrade

In [None]:
!pip install imbalanced-learn --upgrade

<hr>

## Acquiring the dataset

Load the dataset into this Jupyter notebook as a pandas DataFrame.

In [None]:
<<INSERT_YOUR_PANDAS_DATAFRAME_HERE>>

In [None]:
df_training_dataset = df_data_1
df_training_dataset.fillna(0., inplace=True)
df_training_dataset.tail()

The dataset contains astronomical measurements, and it is important to understand a few of its columns:

- **TARGET**: The disposition in the literature towards this exoplanet candidate. One of CANDIDATE, FALSE POSITIVE or CONFIRMED.
- **koi_pdisposition**: The disposition Kepler data analysis has towards this exoplanet candidate. One of FALSE POSITIVE and CANDIDATE.
- **koi_score**: A value between 0 and 1 that indicates the confidence in the KOI disposition. For CANDIDATEs, a higher value indicates more confidence in its disposition, while for FALSE POSITIVEs, a higher value indicates less confidence in that disposition.

In [None]:
df_training_dataset.info()

In [None]:
df_training_dataset.nunique()

<hr>

## Challenge Details: Multiclass Classification

The goal of the challenge is to classify the measurements available in the dataset so that a machine can indicate whether an object is a planet, a planet candidate that requires further study, or not a planet at all. Two approaches are possible: supervised machine learning (classification) or unsupervised machine learning (clustering). This challenge uses classification, since the dataset is already labeled; in other words, it already contains examples together with the target variable.

The scikit-learn library offers several classification algorithms. Participants are free to use whichever framework they wish to complete this challenge. This notebook, however, is prepared for a scikit-learn deployment.

<hr>

## Data exploration

Use the cells below to explore the data, check which variables most influence the `TARGET` variable and the distribution of values.
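
A minimal sketch of such an exploration, run here on a small made-up frame (the values are illustrative and not taken from the challenge dataset; only the column names come from it). It computes the share of each class and the per-class mean of two features; the same calls could be applied to `df_training_dataset`.

```python
import pandas as pd

# Hypothetical toy data: illustrative values only, not rows of the real dataset.
toy = pd.DataFrame({
    "TARGET": ["CONFIRMED", "FALSE POSITIVE", "CANDIDATE",
               "CONFIRMED", "FALSE POSITIVE", "FALSE POSITIVE"],
    "koi_score": [0.99, 0.02, 0.60, 0.95, 0.00, 0.10],
    "koi_fpflag_nt": [0, 1, 0, 0, 1, 0],
})

# Share of each class: reveals whether the classes are imbalanced.
class_share = toy["TARGET"].value_counts(normalize=True)
print(class_share)

# Mean of each selected feature per class: a first hint of which
# features separate the classes.
mean_by_class = toy.groupby("TARGET")[["koi_score", "koi_fpflag_nt"]].mean()
print(mean_by_class)
```

If one class dominates `class_share`, accuracy alone may be a misleading metric later on.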

## Pre-processing the dataset before training

### Construction of the complete Pipeline for WML encapsulation

#### Preparing custom transformations for loading on WML

To integrate these types of custom transformations into Watson Machine Learning Pipelines, you must first package your custom code as a Python library. This can be done easily using the *setuptools* tool.

The git repository https://github.com/vnderlev/sklearn_transforms contains all the files necessary to create a Python package named **my_custom_sklearn_transforms**.
This package has the following file structure:

    /my_custom_sklearn_transforms.egg-info
        dependency_links.txt
        not-zip-safe
        PKG-INFO
        SOURCES.txt
        top_level.txt
    /my_custom_sklearn_transforms
        __init__.py
        sklearn_transformers.py
    PKG-INFO
    README.md
    setup.cfg
    setup.py
    
The main file, which will contain the code for our custom transforms, is the file **/my_custom_sklearn_transforms/sklearn_transformers.py**. If you access it in the repository, you will notice that it contains a class called `DropColumns()`, which has the necessary methods to remove columns from any dataset.

    - DropColumns() custom transformation code:
    
    from sklearn.base import BaseEstimator, TransformerMixin
    # All sklearn Transforms must have the `transform` and `fit` methods
    class DropColumns(BaseEstimator, TransformerMixin):
        def __init__(self, columns):
            self.columns = columns
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            # First, make a copy of the input dataframe 'X'
            data = X.copy()
            # Return a new dataframe without the unwanted columns
            return data.drop(labels=self.columns, axis='columns')
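
As a quick check of how this transform behaves, the sketch below repeats the class listed above (so that it runs on its own, without installing the package) and applies it to a small made-up frame. The toy values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DropColumns(BaseEstimator, TransformerMixin):
    """Same logic as the DropColumns class listed above, repeated so the sketch is self-contained."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn: the columns to drop are fixed at construction time.
        return self

    def transform(self, X):
        # Work on a copy so the caller's dataframe is left untouched.
        data = X.copy()
        return data.drop(labels=self.columns, axis='columns')


# Hypothetical toy frame (illustrative values only).
toy = pd.DataFrame({"rowid": [1, 2], "koi_score": [0.9, 0.1]})
dropped = DropColumns(columns=["rowid"]).fit_transform(toy)

print(list(dropped.columns))  # columns that remain after the transform
print(list(toy.columns))      # the original frame still has both columns
```

Any additional transform added to **sklearn_transformers.py** can follow the same `fit`/`transform` pattern.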

If you have declared your own transformations (in addition to the provided `DropColumns`), you must add all of the classes you created to this same file. To do this, fork the repository and add your custom classes to the file **sklearn_transformers.py**.

If you only made use of the provided transformation (DropColumns), you can skip this fork step, and continue using the supplied base package! :)

After preparing your Python package with your custom transforms, replace the git repository link in the cell below and run it. If you have not prepared any new transforms, execute the cell with the repository link already provided.

<hr>
    
**PAY ATTENTION**

If the execution of the cell below returns an error saying that the repository already exists, run the following command:

**!rm -r -f sklearn_transforms**

In [None]:
import numpy as np

In [None]:
# replace the link below with the link from your git repository (if applicable)
!git clone https://github.com/vnderlev/sklearn_transforms.git

In [None]:
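# `!cd` runs in a throwaway subshell and does not change the notebook's working directory,
# so list the contents of the cloned repository directly
!ls -ltr sklearn_transforms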

In [None]:
!zip -r sklearn_transforms.zip sklearn_transforms

In [None]:
!pip install sklearn_transforms.zip

In [None]:
from my_custom_sklearn_transforms.sklearn_transformers import DropColumns

In [None]:
# Creating a custom `DropColumns` transform

rm_columns = DropColumns(
    columns=['rowid']
)

In [None]:
# Creating a `SimpleImputer` object
from sklearn.impute import SimpleImputer

si = SimpleImputer(
    missing_values=np.nan,  # the missing values are of type `np.nan` (the pandas default)
    strategy='constant',  # the chosen strategy is to replace each missing value with a constant
    fill_value=0,  # the constant used to fill in the missing values (the integer 0)
    copy=True  # `verbose=0` was the default and the parameter no longer exists in recent scikit-learn releases, so it is omitted
)
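
To illustrate what this imputer does, here is a minimal sketch on a made-up 2x2 array (the values are assumptions for illustration): every `np.nan` is replaced by the constant 0 and all other values are left unchanged.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same configuration as the imputer above, applied to a hypothetical toy array.
toy_imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
toy_values = np.array([[1.0, np.nan],
                       [np.nan, 4.0]])

filled = toy_imputer.fit_transform(toy_values)
print(filled)
```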

## Training a classifier

### Selecting FEATURES and setting the TARGET variable

In [None]:
df_training_dataset.columns

In [None]:
features = df_training_dataset[
    [
        'rowid', 'kepid', 'koi_pdisposition', 'koi_score', 'koi_fpflag_nt',
        'koi_fpflag_ss', 'koi_fpflag_co', 'koi_fpflag_ec', 'koi_period',
        'koi_period_err1', 'koi_period_err2', 'koi_time0bk', 'koi_time0bk_err1',
        'koi_time0bk_err2', 'koi_impact', 'koi_impact_err1', 'koi_impact_err2',
        'koi_duration', 'koi_duration_err1', 'koi_duration_err2', 'koi_depth',
        'koi_depth_err1', 'koi_depth_err2', 'koi_prad', 'koi_prad_err1',
        'koi_prad_err2', 'koi_teq', 'koi_insol', 'koi_insol_err1',
        'koi_insol_err2', 'koi_model_snr', 'koi_tce_plnt_num', 'koi_steff',
        'koi_steff_err1', 'koi_steff_err2', 'koi_slogg', 'koi_slogg_err1',
        'koi_slogg_err2', 'koi_srad', 'koi_srad_err1', 'koi_srad_err2', 'ra',
        'dec', 'koi_kepmag'
    ]
]
target = df_training_dataset['TARGET']  ## DO NOT CHANGE THE NAME OF THE TARGET VARIABLE.

In [None]:
# Preparing the arguments for the methods of the `scikit-learn` library
X = features
y = target

### Splitting the dataset into train and test partition

In [None]:
from sklearn.model_selection import train_test_split

# Separation of data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=337)
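
If the classes turn out to be imbalanced, one option (not used in the cell above, so treat it as a suggestion rather than part of the challenge setup) is to pass `stratify=y` so that both partitions keep the class proportions of the full dataset. The sketch below demonstrates this on made-up labels with an 80/20 imbalance.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with an 80/20 class imbalance.
toy_X = [[i] for i in range(100)]
toy_y = ["FALSE POSITIVE"] * 80 + ["CONFIRMED"] * 20

# `stratify=toy_y` keeps the 80/20 ratio in both the train and the test partition.
_, _, _, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=337, stratify=toy_y
)

test_counts = Counter(toy_y_test)
print(test_counts)
```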

### Building a pipeline

In [None]:
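# Creating our pipeline for storage at Watson Machine Learning:
# NOTE: `sklearn.you` and `YourModel` are placeholders, not a real module or class.
# Replace them with the estimator of your choice before running this cell,
# for example `from sklearn.ensemble import RandomForestClassifier`.
from sklearn.you import YourModel
from sklearn.pipeline import Pipeline

my_pipeline = Pipeline(
    steps=[
        ('step_1_remove_columns', rm_columns),
        ('step_2_imputer', si),
        ('choosen_model', YourModel()),
    ]
)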
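
For reference, a minimal sketch of the same pipeline pattern with a concrete estimator. `RandomForestClassifier` is only an assumed example choice, not the model prescribed by the challenge, and the toy data is made up. The custom `DropColumns` step is left out so that the sketch runs without the external package.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Hypothetical toy data (illustrative values only), including one missing value.
toy_X = pd.DataFrame({
    "koi_score": [0.99, 0.01, 0.95, 0.05, np.nan, 0.02],
    "koi_fpflag_nt": [0, 1, 0, 1, 0, 1],
})
toy_y = pd.Series(["CONFIRMED", "FALSE POSITIVE", "CONFIRMED",
                   "FALSE POSITIVE", "CONFIRMED", "FALSE POSITIVE"])

# Imputation followed by the classifier, chained in a single estimator.
toy_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=0)),
    ("model", RandomForestClassifier(n_estimators=10, random_state=0)),
])

toy_pipeline.fit(toy_X, toy_y)
toy_pred = toy_pipeline.predict(toy_X)
print(toy_pred)
```

Because pre-processing and model live in one object, the whole pipeline can be stored and deployed as a single model.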

In [None]:
# Pipeline initialization (pre-processing and model training)
model = my_pipeline.fit(X_train, y_train)

### Making predictions in the test sample

In [None]:
y_pred = my_pipeline.predict(X_test)
print(y_pred)

Tip: use the `metrics` module in scikit-learn to get more information about your model's performance ([ref](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics)).
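
A minimal sketch of that tip, using made-up labels in place of `y_test` and `y_pred` (the toy lists are assumptions for illustration). It computes the overall accuracy and a per-class precision/recall report.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true and predicted labels (stand-ins for y_test and y_pred).
toy_true = ["CONFIRMED", "CANDIDATE", "FALSE POSITIVE", "CONFIRMED"]
toy_pred = ["CONFIRMED", "CANDIDATE", "FALSE POSITIVE", "CANDIDATE"]

# Fraction of samples classified correctly.
accuracy = accuracy_score(toy_true, toy_pred)
print(accuracy)

# Per-class precision, recall and F1, as a dict and as a printable table.
report = classification_report(toy_true, toy_pred, output_dict=True)
print(classification_report(toy_true, toy_pred))
```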

### Analyzing the quality of the model through the confusion matrix

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import itertools


def plot_confusion_matrix(cm, target_names, title='Confusion matrix', cmap=None, normalize=True):
    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy
    if cmap is None:
        cmap = plt.get_cmap('Blues')
    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=45)
        plt.yticks(tick_marks, target_names)
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.4f}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))
    plt.show()

In [None]:
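from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels

# Use the actual class labels as tick names, in the same sorted order that `confusion_matrix` uses
class_names = [str(label) for label in unique_labels(y_test, y_pred)]
plot_confusion_matrix(confusion_matrix(y_test, y_pred), class_names)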

<hr>

## Deploy to WML

**WARNING**: the model you deploy to Watson Machine Learning must receive as input for a prediction **ALL of the columns** provided in the dataset, **except the TARGET column**. Any operations with the columns, such as dropping, must be done via pipeline. If the model does not behave as expected, your submission will fail.

With the model running, now we can deploy it to Watson Machine Learning, a service available on the IBM Cloud capable of executing and making machine learning models available through an API in a dedicated environment.

In [None]:
from ibm_watson_machine_learning import APIClient

To access Watson Machine Learning, you need to create an APIKEY. There are two ways to do this: via the IBM Cloud CLI or via the IBM Cloud interface.

If you want to create an APIKEY via the CLI, first download and install the [IBM Cloud CLI](https://cloud.ibm.com/docs/cli). Once it is installed, run the following commands to obtain the APIKEY:

`ibmcloud login` <br>
`ibmcloud iam api-key-create API_KEY_NAME`

Through the interface, just click on `Manage` and then on `Access(IAM)` as shown in the image below.

![api-1](https://imgur.com/bS61qef.png "api1")

As soon as the page loads, on the left side there is a menu. Click on API keys to create a new one, as shown in the image below.

![api-2](https://imgur.com/XaOalxq.png "api2")

The image below shows a panel with all the API keys you have created on the IBM Cloud platform. Create a new one for accessing the WML service by clicking on `Create an IBM Cloud API key`.

![api-3](https://imgur.com/0WKTanm.png "api3")

A form will open where you simply name your API key and click on `Create`. As soon as you click the button your API key is created; copy it and insert it as the `apikey` value in the credentials cell further down.

![api-4](https://imgur.com/3wCTLaH.png "api4")

In addition to an APIKEY, accessing Watson Machine Learning requires the URL where the service is located, so take note of the region in which you instantiate the service. Each region has a specific URL, listed below.

- Dallas: `https://us-south.ml.cloud.ibm.com`
- London: `https://eu-gb.ml.cloud.ibm.com`
- Frankfurt: `https://eu-de.ml.cloud.ibm.com`
- Tokyo: `https://jp-tok.ml.cloud.ibm.com`

Once you know the region of your WML instance, enter the corresponding URL in the cell below.
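
One way to avoid typing the URL by hand is to look it up from the region name. The helper below is a sketch under assumptions: the function name `build_wml_credentials` and the environment variable `WML_APIKEY` are hypothetical and not part of the challenge or of the IBM SDK; only the four URLs come from the list above.

```python
import os

# Region endpoints copied from the list above.
WML_REGION_URLS = {
    "Dallas": "https://us-south.ml.cloud.ibm.com",
    "London": "https://eu-gb.ml.cloud.ibm.com",
    "Frankfurt": "https://eu-de.ml.cloud.ibm.com",
    "Tokyo": "https://jp-tok.ml.cloud.ibm.com",
}


def build_wml_credentials(region, apikey=None):
    """Return a credentials dict with the URL matching `region`.

    If no key is given, it is read from the (hypothetical) WML_APIKEY
    environment variable so that the key need not be written in the notebook.
    """
    if region not in WML_REGION_URLS:
        raise ValueError(f"Unknown region: {region!r}")
    if apikey is None:
        apikey = os.environ.get("WML_APIKEY", "YOUR_WML_APIKEY")
    return {"apikey": apikey, "url": WML_REGION_URLS[region]}


# Usage with a dummy key (not a real credential).
example_credentials = build_wml_credentials("London", apikey="dummy-key")
print(example_credentials["url"])
```

The returned dictionary has the same two keys, `apikey` and `url`, as the `wml_credentials` dictionary used in this notebook.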

In [None]:
wml_credentials = {
  "apikey": "YOUR_WML_APIKEY",
  "url": "URL_REGION_OF_YOUR_WML"
}

print(wml_credentials)

In [None]:
client = APIClient(wml_credentials)

### Preparing the environment that will receive the model

Watson Machine Learning organizes the deployment of models into spaces, so that an organization's WML instance can be divided into small spaces dedicated to hosting the models that each department builds and makes available. Therefore, after instantiating WML, it is necessary to create a space to receive the model that we are going to deploy. To create a space in WML, go back to the Cloud Pak for Data home screen and click on `Deployments`, located on the left side, as shown in the image below.

![img-01](https://imgur.com/Fhx5iKO.png "deployment")

As soon as the page loads we are inside the deployment interface that constitutes a direct access to Watson Machine Learning. Now let's click on the `deployment space` button to create a new space, as shown in the image below.

![img-02](https://imgur.com/DRFuLj6.png "space")

Let's create an empty space to receive our model, as shown in the image below.

![img-03](https://imgur.com/uxUf77y.png "creat")

We must fill in some information now. We need to give the space a name, associate an Object Storage and Watson Machine Learning to the space. With the form completed just click on the `Create` button located in the lower right corner.

<!-- ![img-04](https://i.imgur.com/trikImj.png "form") -->


With the space created, we can deploy the trained model and proceed with the execution of the cells of this notebook.

In [None]:
client.spaces.list(limit=10)

In the list of spaces above, find your newly created space and copy its space ID into the cell below, where it is stored in the `space_id` variable.

In [None]:
space_id = 'YOUR_SPACE_ID'

In [None]:
client.set.default_space(space_id)

As seen during the creation of the pipeline, we used a library external to the model creation framework (scikit-learn, tensorflow, keras, etc.). We need to upload this library so that the pipeline can use the methods it contains. The cell below uploads the library so that the model can run correctly.

In [None]:
meta_prop_pkg_extn = {
    client.package_extensions.ConfigurationMetaNames.NAME: "my_custom_sklearn_transforms",
    client.package_extensions.ConfigurationMetaNames.DESCRIPTION: "Pkg extension for custom lib",
    client.package_extensions.ConfigurationMetaNames.TYPE: "pip_zip"
}

pkg_extn_details = client.package_extensions.store(meta_props=meta_prop_pkg_extn, file_path="sklearn_transforms.zip")
pkg_extn_uid = client.package_extensions.get_uid(pkg_extn_details)
pkg_extn_url = client.package_extensions.get_href(pkg_extn_details)

In [None]:
details = client.package_extensions.get_details(pkg_extn_uid)

In [None]:
client.software_specifications.ConfigurationMetaNames.show()

In [None]:
client.software_specifications.list()

In [None]:
sofware_spec_uid = client.software_specifications.get_id_by_name("default_py3.7")

Once the library is uploaded, it must be made available by creating a specific software environment that includes it. The cell below creates such an environment.

In [None]:
meta_prop_sw_spec = {
    client.software_specifications.ConfigurationMetaNames.NAME: "my_custom_sklearn_transforms",
    client.software_specifications.ConfigurationMetaNames.DESCRIPTION: "Software specification for my_custom_sklearn_transforms",
    client.software_specifications.ConfigurationMetaNames.BASE_SOFTWARE_SPECIFICATION: {"guid": sofware_spec_uid}
}

sw_spec_details = client.software_specifications.store(meta_props=meta_prop_sw_spec)
sw_spec_uid = client.software_specifications.get_uid(sw_spec_details)


client.software_specifications.add_package_extension(sw_spec_uid, pkg_extn_uid)

With the environment created, just upload the model created to Watson Machine Learning.

In [None]:
metadata = {
            client.repository.ModelMetaNames.NAME: 'Final',
            client.repository.ModelMetaNames.TYPE: 'scikit-learn_0.23',
            client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: sw_spec_uid
}

published_model = client.repository.store_model(
    model=model,
    meta_props=metadata)

In [None]:
client.repository.list_models()

In [None]:
import json
saved_model_uid = client.repository.get_model_uid(published_model)
model_details = client.repository.get_details(saved_model_uid)
print(json.dumps(model_details, indent=2))

With the model stored in WML, it must now be deployed so that it can be accessed via an API call. To deploy the model, run the cell below.

In [None]:
metadata = {
    client.deployments.ConfigurationMetaNames.NAME: "champion",
    client.deployments.ConfigurationMetaNames.ONLINE: {}
}

created_deployment = client.deployments.create(client.repository.get_model_uid(published_model), meta_props=metadata)

## Making a prediction

In [None]:
deployment_uid = client.deployments.get_uid(created_deployment)

In [None]:
scoring_endpoint = client.deployments.get_scoring_href(created_deployment)
print(scoring_endpoint)

In [None]:
print(np.array(X.iloc[0].values).tolist())
print(y.iloc[0])

In [None]:
scoring_payload = {
    "input_data": [{
        'fields': X.columns.to_list(),
        'values': [[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 9.48803557, 0.02775, -0.02775, 170.53875, 0.00216, -0.00216, 146.0, 318.0, -146.0, 0.0, 0.0819, -0.0819, 615.8, 19.5, 0.0, 2.26, 0.0, -0.15, 793.0, 93.59, 29.45, -16.65, 35.8, 1.0, 5455.0, 81.0, -81.0, 4467.0, 64.0, -96.0, 927.0, 105.0, -61.0, 291.93423, 48.141651, 15.347]]}]
}
scoring_payload
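
Instead of typing the values by hand, the payload can be built directly from DataFrame rows. This is a minimal sketch: the helper name `build_scoring_payload` and the toy frame are hypothetical, while the `input_data` / `fields` / `values` layout is the one used in the cell above.

```python
import pandas as pd


def build_scoring_payload(rows):
    """Build a scoring payload (same layout as above) from a DataFrame of feature rows."""
    return {
        "input_data": [{
            "fields": rows.columns.to_list(),
            # `.tolist()` converts NumPy scalars to plain Python numbers (JSON-friendly).
            "values": rows.values.tolist(),
        }]
    }


# Hypothetical toy frame; in the notebook this could be `X.iloc[[0]]` or `X_test.head()`.
toy_rows = pd.DataFrame({"rowid": [1, 2], "koi_score": [1.0, 0.0]})
toy_payload = build_scoring_payload(toy_rows)
print(toy_payload)
```

Passing a DataFrame with all feature columns keeps `fields` and `values` aligned automatically.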

In [None]:
predictions = client.deployments.score(deployment_uid, scoring_payload)

In [None]:
print(json.dumps(predictions, indent=2))

# Important things you will use next

The cell below prints the credentials you must provide at submission time in the submission app you have deployed on Red Hat OpenShift.

In [None]:
print('WML APIKEY: ', wml_credentials['apikey'])
print('URL to make predictions: ', scoring_endpoint)

# References

- [Cloud Pak for Data docs](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/wml-ai.html)
- [ibm-watson-machine-learning sdk docs](http://ibm-wml-api-pyclient.mybluemix.net)
- [Watson Machine Learning REST API docs](https://cloud.ibm.com/apidocs/machine-learning)
- [Watson Machine Learning tutorials](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-samples-overview.html)