<img src="https://cybersecurity-excellence-awards.com/wp-content/uploads/2017/06/366812.png">

<h1><center>Darwin Supervised Regression Model Building </center></h1>

Prior to getting started, there are a few things you want to do:
1. Set the dataset path.
2. Enter your username and password, and confirm that you can log in successfully.

Once you're up and running, here are a few things to be mindful of:
1. For every run, look up the job status (i.e. requested, failed, running, completed) and wait for the job to complete before proceeding.
2. If you're not satisfied with your model and think that Darwin can do better by exploring a larger search space, use the resume function.
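
The wait-for-completion advice above can be sketched as a small polling loop. This is only an illustration of the idea: in this notebook `ds.wait_for_job` does the waiting for you. The `get_status` callable is hypothetical, and the status spellings are taken from the job records printed later in the notebook ('Failed' is an assumption based on the list above).

```python
import time

# Status strings as they appear in this notebook's job output, plus 'Failed'
# from the list above. The exact spelling returned by the server is an assumption.
DONE_STATES = {"Complete", "Failed"}


def wait_until_done(get_status, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll get_status() until the job finishes or the timeout expires.

    get_status is a hypothetical zero-argument callable returning a dict
    with a 'status' key, like the job records printed in this notebook.
    """
    waited = 0.0
    while True:
        job = get_status()
        if job["status"] in DONE_STATES:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {job['status']!r} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

The caller should still inspect the returned record, since a job that ends in a failed state also stops the loop.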

## Import libraries

In [1]:
# Import necessary libraries
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import Image
from time import sleep
import os
import numpy as np
from sklearn.metrics import r2_score

#from amb_sdk.sdk import DarwinSdk

In [2]:
import os
os.environ["PYTHONPATH"] = "../../"
print(os.environ['PYTHONPATH'])
print(os.getcwd())
import sys
sys.path.append('../../../')
from amb_sdk.sdk import DarwinSdk

../../
/home/skenkare/CS363/Darwin/forkedDarwin/darwin-sdk/examples/Trial


## Setup

**Login to Darwin**<br>
Enter your registered username and password below to login to Darwin.

In [6]:
# Login
ds = DarwinSdk()
ds.set_url('https://amb-demo-api.sparkcognition.com/v1/')
# Replace the placeholders with your registered credentials
status, msg = ds.auth_login_user('<your-username>', '<your-password>')
if not status:
    print(msg)
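
One way to keep credentials out of the notebook is to read them from environment variables. The sketch below assumes two variable names, `DARWIN_USERNAME` and `DARWIN_PASSWORD`, which are made up for this example and are not part of the Darwin SDK.

```python
import os


def load_credentials(env=os.environ):
    """Return (username, password) from environment variables.

    DARWIN_USERNAME and DARWIN_PASSWORD are hypothetical names chosen for
    this example; they are not defined by the Darwin SDK.
    """
    try:
        return env["DARWIN_USERNAME"], env["DARWIN_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"set the {missing.args[0]} environment variable first") from None
```

With both variables set, the login line could become `status, msg = ds.auth_login_user(*load_credentials())`.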

**Data Path** <br>
In the cell below, set the path to your dataset. The default points to Darwin's example datasets.

In [7]:
path = '../../sets/'

## Data Upload and Clean

**Read dataset and view a file snippet**

With the dataset path set, read the dataset locally and preview its first rows before uploading it to the server. <br> To use your own data, change dataset_name in the cell below.

In [8]:
dataset_name = 'boston.csv'
df = pd.read_csv(os.path.join(path, dataset_name))
df.head()

| Unnamed: 0 | PID | ST_NUM | ST_NAME | ST_NAME_SUF | ZIPCODE | Assessed_Value | Lot_Area | Gross_Area | Living_Area | Owner_Occupied | ... | Roof_Type | Exterior_Finish | Main_Bathroom_Style | Main_Kitchen_Style | Heating_type | Exterior_Condition | Overall_Condition | Interior_Condition | Interior_Finish | View |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2001658000_ | 43 | STRATFORD | ST | 02132_ | 963300 | 20897 | 7396 | 3887 | 1 | ... | Hip | Vinyl | Modern | Modern | Forced Air | Average | Average | Average | Normal | Average |
| 1 | 2001659000_ | 47 | STRATFORD | ST | 02132_ | 915600 | 9856 | 6730 | 3566 | 1 | ... | Hip | Wood Shake | Modern | Modern | Hot Water | Average | Average | Good | Normal | Average |
| 2 | 2001660000_ | 53 | STRATFORD | ST | 02132_ | 911400 | 8415 | 6442 | 2843 | 1 | ... | Gable | Wood Shake | Semi-Modern | Modern | Hot Water | Good | Good | Good | Normal | Average |
| 3 | 2001661000_ | 57 | STRATFORD | ST | 02132_ | 862500 | 8333 | 6020 | 3558 | 1 | ... | Hip | Wood Shake | Semi-Modern | Modern | Hot Water | Average | Average | Good | Normal | Average |
| 4 | 2001662000_ | 61 | STRATFORD | ST | 02132_ | 789300 | 8232 | 5574 | 2978 | 1 | ... | Gable | Wood Shake | Modern | Modern | Hot Water | Average | Average | Good | Normal | Average |
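
When you switch to your own data, it can help to confirm that the columns you rely on (for the Boston data, the `Assessed_Value` target used later) are actually in the file before uploading it. The helper below is a minimal sketch using only the standard library. `missing_columns` is a name invented for this example and is not part of the Darwin SDK.

```python
import csv
import io


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Small in-memory sample standing in for a real file.
sample = io.StringIO("PID,ZIPCODE,Assessed_Value,Lot_Area\n2001658000_,02132_,963300,20897\n")
print(missing_columns(sample, ["Assessed_Value", "Living_Area"]))  # ['Living_Area']
```

For a file on disk, pass an open file handle (opened with `newline=""`) instead of the `StringIO` sample.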


**Upload dataset to Darwin**

In [9]:
# Upload dataset
status, dataset = ds.upload_dataset(os.path.join(path, dataset_name))
if not status:
    print(dataset)

**Clean dataset**

In [10]:
# clean dataset
target = "Assessed_Value"
status, job_id = ds.clean_data(dataset_name, target = target)

if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

{'status': 'Requested', 'starttime': '2019-04-13T19:58:22.427757', 'endtime': None, 'percent_complete': 0, 'job_type': 'CleanDataTiny', 'loss': None, 'generations': None, 'dataset_names': ['boston.csv'], 'artifact_names': ['02006f50c4af46eeac34d3b51a6fc956'], 'model_name': None, 'job_error': None}
{'status': 'Complete', 'starttime': '2019-04-13T19:58:22.427757', 'endtime': '2019-04-13T19:58:37.195478', 'percent_complete': 100, 'job_type': 'CleanDataTiny', 'loss': None, 'generations': None, 'dataset_names': ['boston.csv'], 'artifact_names': ['02006f50c4af46eeac34d3b51a6fc956'], 'model_name': None, 'job_error': ''}
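
The job records printed above carry `starttime` and `endtime` fields, so the run time of a finished job can be read off directly. The sketch below assumes the timestamps always follow the format seen in this output (no timezone, fractional seconds present). `job_duration_seconds` is a helper name made up for this example.

```python
from datetime import datetime

# Format inferred from the timestamps printed in this notebook (an assumption).
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def job_duration_seconds(job):
    """Return run time in seconds, or None while the job has no end time."""
    if job.get("endtime") is None:
        return None
    start = datetime.strptime(job["starttime"], TIMESTAMP_FORMAT)
    end = datetime.strptime(job["endtime"], TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


# Values copied from the completed CleanDataTiny record above.
finished = {"status": "Complete",
            "starttime": "2019-04-13T19:58:22.427757",
            "endtime": "2019-04-13T19:58:37.195478"}
print(job_duration_seconds(finished))
```

For the cleaning job shown here this comes to a little under 15 seconds.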


## Create and Train Model 

We will now build a regression model that learns to predict the values in the target column.<br> In the default boston dataset, the target column is "Assessed_Value". <br> You will have to specify your own target name for your custom dataset. <br> You can also increase max_train_time for longer training.


In [16]:
model = target + "_model0"
status, job_id = ds.create_model(dataset_names = dataset_name,
                                 model_name = model,
                                 max_train_time = '00:02')
if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

403: FORBIDDEN - {"message": "Your token is out of date"}
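
The 403 response above means the login token expired between cells, so the model was not created. Logging in again and re-running the cell fixes it. The retry can also be sketched as a small wrapper, assuming (as the printed output suggests) that a failed call returns the error text as its second value instead of raising. `call_with_relogin` and its arguments are hypothetical names, not part of the Darwin SDK.

```python
def call_with_relogin(call, login, marker="token is out of date"):
    """Run call(); if it fails with a stale-token message, log in and retry once.

    call and login are hypothetical zero-argument callables that return
    (status, payload) tuples, the shape used by the SDK calls in this notebook.
    """
    status, payload = call()
    if status or marker not in str(payload):
        return status, payload
    ok, msg = login()
    if not ok:
        # Login itself failed, so report that instead of retrying.
        return ok, msg
    return call()
```

In this notebook, `call` could be a lambda wrapping the `ds.create_model` call and `login` a lambda wrapping `ds.auth_login_user`.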



## Extra Training (Optional)
Run the following cell to continue training the existing model. The parameters can be left as they are.

In [None]:
# Train some more
status, job_id = ds.resume_training_model(dataset_names = dataset_name,
                                          model_name = model,
                                          max_train_time = '00:05')
                                          
if status:
    ds.wait_for_job(job_id['job_name'])
else:
    print(job_id)

## Analyze Model
Analyzing the model returns the feature importances as ranked by the model. <br> This gives a general view of which features have the greatest impact on the model's predictions.

In [13]:
# Retrieve feature importance of built model
status, artifact = ds.analyze_model(model)
sleep(1)
if status:
    ds.wait_for_job(artifact['job_name'])
    # Download only after the analysis job has finished
    status, feature_importance = ds.download_artifact(artifact['artifact_name'])
else:
    print(artifact)

{'status': 'Running', 'starttime': '2019-04-13T20:02:36.091855', 'endtime': None, 'percent_complete': 0, 'job_type': 'AnalyzeModel', 'loss': 0.04829813167452812, 'generations': 6, 'dataset_names': None, 'artifact_names': ['c57d6794a5214286bb954f638ecec615'], 'model_name': 'Assessed_Value_model0', 'job_error': ''}
{'status': 'Complete', 'starttime': '2019-04-13T20:02:36.091855', 'endtime': '2019-04-13T20:02:41.332896', 'percent_complete': 100, 'job_type': 'AnalyzeModel', 'loss': 0.04829813167452812, 'generations': 6, 'dataset_names': None, 'artifact_names': ['c57d6794a5214286bb954f638ecec615'], 'model_name': 'Assessed_Value_model0', 'job_error': ''}


Show the 10 most important features of the model.

In [14]:
feature_importance[:10]

Interior_Finish = Normal         0.427644
Living_Area                      0.205453
Number_of_Floors                 0.157710
Gross_Area                       0.049753
Exterior_Finish = Brick/Stone    0.030753
Lot_Area                         0.027207
ZIPCODE = 02130_                 0.011637
ZIPCODE = 02129_                 0.007876
ST_NAME_SUF = ST                 0.006695
Has_AC = 1                       0.004818
dtype: float64
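
A quick way to read a ranking like this is to ask how many of the top features are needed to account for a given share of the total importance. The sketch below uses the values printed above, copied into a plain list so that it runs on its own. `features_to_reach` is a helper invented for this example, and the 0.80 threshold is arbitrary.

```python
# Importance values copied from the output above.
importances = [
    ("Interior_Finish = Normal", 0.427644),
    ("Living_Area", 0.205453),
    ("Number_of_Floors", 0.157710),
    ("Gross_Area", 0.049753),
    ("Exterior_Finish = Brick/Stone", 0.030753),
    ("Lot_Area", 0.027207),
    ("ZIPCODE = 02130_", 0.011637),
    ("ZIPCODE = 02129_", 0.007876),
    ("ST_NAME_SUF = ST", 0.006695),
    ("Has_AC = 1", 0.004818),
]


def features_to_reach(ranked, threshold):
    """Count how many top-ranked features are needed for their summed
    importance to reach threshold; None if the list never gets there."""
    total = 0.0
    for count, (_, value) in enumerate(ranked, start=1):
        total += value
        if total >= threshold:
            return count
    return None


print(features_to_reach(importances, 0.80))  # 4
```

Here the first three features sum to about 0.79, so a fourth is needed to pass 0.80.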

## Predictions
**Perform model prediction on the training dataset.**

In [15]:
status, artifact = ds.run_model(dataset_name, model)
sleep(1)
ds.wait_for_job(artifact['job_name'])

{'status': 'Running', 'starttime': '2019-04-13T20:05:13.46979', 'endtime': None, 'percent_complete': 0, 'job_type': 'RunModel', 'loss': 0.04829813167452812, 'generations': 6, 'dataset_names': ['boston.csv'], 'artifact_names': ['d35b9ad9f6a64370b285d6c0d7e56b64'], 'model_name': 'Assessed_Value_model0', 'job_error': ''}
{'status': 'Complete', 'starttime': '2019-04-13T20:05:13.46979', 'endtime': '2019-04-13T20:05:22.785682', 'percent_complete': 100, 'job_type': 'RunModel', 'loss': 0.04829813167452812, 'generations': 6, 'dataset_names': ['boston.csv'], 'artifact_names': ['d35b9ad9f6a64370b285d6c0d7e56b64'], 'model_name': 'Assessed_Value_model0', 'job_error': ''}


(True, 'Job completed')

Download predictions from Darwin's server.

In [None]:
status, prediction = ds.download_artifact(artifact['artifact_name'])
prediction.head()

Create a plot comparing the predictions with the actual target values.

In [None]:
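# Plot predictions vs actual
plt.plot(df[target], prediction[target], '.')
# Dashed diagonal marks where predictions equal the actual values
plt.plot([0, 2.3e7], [0, 2.3e7], '--k')
plt.xlabel('Actual ' + target)
plt.ylabel('Predicted ' + target)
print('R^2 : ', r2_score(df[target], prediction[target]))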
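
For reference, the R^2 score printed by the cell above is the coefficient of determination, 1 - SS_res / SS_tot. The sketch below computes it by hand in plain Python so the definition is visible. The numbers are toy values for illustration only, not taken from the Boston data.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left over after prediction
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: spread of the actual values around their mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy numbers for illustration only.
print(round(r_squared([3, -0.5, 2, 7], [2.5, 0.0, 2, 8]), 3))  # 0.949
```

A value of 1.0 means the predictions match the actual values exactly. Note that this simple version divides by zero when all actual values are identical.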

## Find out which machine learning model Darwin used

In [None]:
status, model_type = ds.lookup_model_name(model)
print(model_type['description']['best_genome'])