# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [54]:
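from azureml.core import Workspace, Experiment, Dataset
from azureml.data.dataset_factory import TabularDatasetFactory
# AutoMLConfig is used in the "AutoML Configuration" section below
from azureml.train.automl import AutoMLConfig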

ws = Workspace.from_config()

## Dataset

### Overview

This project uses the data from a DrivenData competition - [Pump it Up: Data Mining the Water Table](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/).

The training data is divided into two files: one with the target variable (labels) and one with the other variables (values). The target variable describes the functioning status of each pump (*functional* and *non functional*). Descriptive variables include the pump location, its funder, water quality and quantity, water point type, etc.

Because one needs to be logged in to DrivenData to access the data, it cannot be downloaded via direct links and was instead stored as .csv files in the *data* folder.

In [55]:
#local paths to train data
path_labels = "data/train_labels.csv"
path_values = "data/train_values.csv"

# get the datastore to upload prepared data
datastore = ws.get_default_datastore()

# upload the contents of the local src_dir to target_path in the datastore
datastore.upload(src_dir='data', target_path='data', overwrite=True)

# create datasets referencing the cloud location
ds_labels = Dataset.Tabular.from_delimited_files(path=[(datastore, path_labels)])
ds_values = Dataset.Tabular.from_delimited_files(path=[(datastore, path_values)])

Uploading an estimated of 3 files
Uploading data/train_labels.csv
Uploaded data/train_labels.csv, 1 files out of an estimated total of 3
Uploading data/train_pump.csv
Uploaded data/train_pump.csv, 2 files out of an estimated total of 3
Uploading data/train_values.csv
Uploaded data/train_values.csv, 3 files out of an estimated total of 3
Uploaded 3 files


In [56]:
ds_labels.take(3).to_pandas_dataframe()

Unnamed: 0,id,status_group
0,69572,functional
1,8776,functional
2,34310,functional


In [57]:
ds_values.take(3).to_pandas_dataframe()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe


In [58]:
# join the target variable with other variables
df_labels = ds_labels.to_pandas_dataframe()
df_values = ds_values.to_pandas_dataframe()
df_joined = df_values.join(df_labels.set_index('id'), on='id')
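The join above assumes that every `id` appears exactly once in both files. A minimal sketch of how that assumption could be checked with `pandas.DataFrame.merge` is shown here; the two small frames are stand-ins built from the ids in the previews above, and `join_checked` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd

# Stand-in frames: ids and values copied from the previews above, columns trimmed.
values = pd.DataFrame({"id": [69572, 8776, 34310],
                       "amount_tsh": [6000.0, 0.0, 25.0]})
labels = pd.DataFrame({"id": [69572, 8776, 34310],
                       "status_group": ["functional"] * 3})


def join_checked(values, labels):
    """Left-join labels onto values and count rows that found no label."""
    merged = values.merge(
        labels,
        on="id",
        how="left",
        validate="one_to_one",  # raises MergeError if an id is duplicated
        indicator=True,         # adds a "_merge" column describing each row
    )
    unmatched = int((merged["_merge"] != "both").sum())
    return merged.drop(columns="_merge"), unmatched


joined, unmatched = join_checked(values, labels)
print(f"{len(joined)} rows joined, {unmatched} without a label")
```

On the full data, the same helper could be called as `join_checked(df_values, df_labels)`; a non-zero count would indicate rows whose `status_group` is missing after the join.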

In [59]:
df_joined.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional


In [60]:
# store the merged data locally
path_df_joined = "data/train_pump.csv"
df_joined.to_csv(path_df_joined, index=False)

In [61]:
# upload the contents of the local src_dir (now including the merged file) to the datastore
datastore.upload(src_dir='data', target_path='data', overwrite=True)
ds_joined = Dataset.Tabular.from_delimited_files(path=[(datastore, path_df_joined)])

Uploading an estimated of 3 files
Uploading data/train_labels.csv
Uploaded data/train_labels.csv, 1 files out of an estimated total of 3
Uploading data/train_pump.csv
Uploaded data/train_pump.csv, 2 files out of an estimated total of 3
Uploading data/train_values.csv
Uploaded data/train_values.csv, 3 files out of an estimated total of 3
Uploaded 3 files


In [62]:
# register dataset
# source: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets
ds_joined = ds_joined.register(workspace=ws,
                               name='train_pump',
                               description='Training data for the Pump it Up project',
                               create_new_version=True)

In [63]:
# create experiment
experiment_name = 'pump_up'
experiment = Experiment(ws, experiment_name)

## AutoML Configuration

TODO: Explain why you chose the automl settings and configuration you used below.

In [None]:
# TODO: Put your automl settings here
automl_settings = {}

# TODO: Put your automl config here
automl_config = AutoMLConfig()
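The placeholders above are still empty, and `AutoMLConfig()` needs at least a task to be constructed. As a rough sketch of what a filled-in configuration could look like for this classification problem, the block below uses illustrative values only: the timeout, concurrency, metric, and cross-validation settings are assumptions rather than the author's choices, `build_automl_config` is a hypothetical helper, and `compute_target` stands for a compute cluster that this notebook does not create.

```python
# Illustrative values only: none of these settings come from the original notebook.
automl_settings = {
    "experiment_timeout_hours": 0.5,   # stop the whole experiment after 30 minutes
    "max_concurrent_iterations": 4,    # should not exceed the cluster's node count
    "primary_metric": "accuracy",      # metric AutoML optimizes and ranks models by
    "n_cross_validations": 5,
    "enable_early_stopping": True,
}


def build_automl_config(training_data, compute_target, settings=automl_settings):
    """Build an AutoMLConfig that predicts `status_group` from the joined data."""
    # Imported lazily so the settings can be inspected without the Azure ML SDK.
    from azureml.train.automl import AutoMLConfig

    return AutoMLConfig(
        task="classification",
        training_data=training_data,       # e.g. the registered `ds_joined` dataset
        label_column_name="status_group",  # target column seen in the joined preview
        compute_target=compute_target,     # assumed to exist; not created in this notebook
        featurization="auto",
        **settings,
    )
```

With a compute target available, the configuration could be created as `automl_config = build_automl_config(ds_joined, compute_target)`.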

In [None]:
# TODO: Submit your experiment
remote_run = experiment.submit(automl_config)

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.
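One possible way to fill in this cell is sketched below. `show_run_details` wraps the `RunDetails` widget from `azureml.widgets`, while `collect_child_metrics` and `rank_runs` are hypothetical helpers for comparing the child runs afterwards. The default metric name `accuracy` and the assumption that a higher score is better are illustrative.

```python
def show_run_details(run):
    """Render the RunDetails widget and block until the run finishes."""
    # Imported lazily so the helpers below work without the Azure ML SDK.
    from azureml.widgets import RunDetails

    RunDetails(run).show()
    run.wait_for_completion(show_output=True)


def collect_child_metrics(run):
    """Map each child run id to the metrics it logged."""
    return {child.id: child.get_metrics() for child in run.get_children()}


def rank_runs(metrics_by_run, metric="accuracy"):
    """Return (run_id, score) pairs from best to worst, skipping runs without the metric."""
    scored = [(run_id, m[metric]) for run_id, m in metrics_by_run.items() if metric in m]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

After submission, `show_run_details(remote_run)` would display the widget, and `rank_runs(collect_child_metrics(remote_run))` would give a simple leaderboard to support the discussion of model performance.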

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [None]:
#TODO: Save the best model
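The two steps above (retrieving the best model and saving it) could be sketched as follows. `get_output()` is the AutoML run method that returns the best child run together with its fitted model; the helper names, the summary layout, and the output path are assumptions made for illustration.

```python
import os


def get_best_model(automl_run):
    """Return the best child run, its fitted model, and a short summary."""
    best_run, fitted_model = automl_run.get_output()
    summary = {
        "run_id": best_run.id,
        "metrics": best_run.get_metrics(),
        # AutoML usually returns a scikit-learn style pipeline; list its step names if present.
        "steps": [name for name, _ in getattr(fitted_model, "steps", [])],
    }
    return best_run, fitted_model, summary


def save_model(fitted_model, path="outputs/automl_best_model.pkl"):
    """Serialize the fitted model to a local file and return the path."""
    import joblib  # imported lazily; ships with scikit-learn environments

    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    joblib.dump(fitted_model, path)
    return path
```

In the notebook this would be used as `best_run, fitted_model, summary = get_best_model(remote_run)` followed by `save_model(fitted_model)`.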

## Model Deployment

Remember that you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.
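A minimal sketch of the register, configure, and deploy sequence is given below, assuming deployment to Azure Container Instances. The model name, service name, resource sizes, and the `score.py` entry script are all hypothetical; `is_valid_service_name` reflects the naming rule stated in the SDK documentation for `Model.deploy` (lowercase letters, digits, or dashes, starting with a letter, 3 to 32 characters).

```python
import re


def is_valid_service_name(name):
    """Check the documented webservice naming rule before attempting a deployment."""
    return re.fullmatch(r"[a-z][a-z0-9-]{2,31}", name) is not None


def deploy_best_model(ws, automl_run, best_run,
                      entry_script="score.py",             # hypothetical scoring script
                      service_name="pump-automl-service"):  # hypothetical service name
    """Register the best AutoML model and deploy it as an ACI web service."""
    # Imported lazily so the name check above works without the Azure ML SDK.
    from azureml.core.model import InferenceConfig, Model
    from azureml.core.webservice import AciWebservice

    if not is_valid_service_name(service_name):
        raise ValueError(f"invalid service name: {service_name!r}")

    # Register the model produced by the best AutoML iteration.
    model = automl_run.register_model(
        model_name="pump_automl_model",  # hypothetical model name
        description="Best AutoML model for the Pump it Up project",
    )

    # Reuse the environment of the best run so training and inference dependencies match.
    inference_config = InferenceConfig(entry_script=entry_script,
                                       environment=best_run.get_environment())
    deployment_config = AciWebservice.deploy_configuration(
        cpu_cores=1, memory_gb=1, enable_app_insights=True)

    service = Model.deploy(ws, service_name, [model], inference_config, deployment_config)
    service.wait_for_deployment(show_output=True)
    return service
```

Enabling Application Insights is optional; it is switched on here only because it makes the service logs easier to inspect later.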

TODO: In the cell below, send a request to the web service you deployed to test it.
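Testing the endpoint could look roughly like the sketch below. It assumes the entry script expects a JSON body of the form `{"data": [...]}` with one dictionary per record, which depends on how the scoring script is written. `build_payload` and `score` are hypothetical helpers, and the optional key applies only if authentication is enabled on the service.

```python
import json


def build_payload(records):
    """Wrap a list of record dictionaries in the assumed {"data": [...]} envelope."""
    return json.dumps({"data": records})


def score(scoring_uri, records, key=None):
    """POST the records to the web service and return the decoded JSON response."""
    import requests  # imported lazily; only needed when actually calling the service

    headers = {"Content-Type": "application/json"}
    if key:
        headers["Authorization"] = f"Bearer {key}"
    response = requests.post(scoring_uri, data=build_payload(records), headers=headers)
    response.raise_for_status()
    return response.json()
```

A few rows of the joined data without the `status_group` column could serve as test records, for example `df_joined.drop(columns="status_group").head(2).to_dict(orient="records")`, sent to the `scoring_uri` of the deployed service.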

TODO: In the cell below, print the logs of the web service and delete the service
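The clean-up step could be sketched as follows, assuming a deployed `Webservice` object is available (called `service` here, a hypothetical name). `get_logs()` and `delete()` are methods of the SDK's `Webservice` class; the wrapper function itself is only an illustration.

```python
def print_logs_and_delete(service):
    """Print the web service logs, then delete the service to stop incurring costs."""
    logs = service.get_logs()
    print(logs)
    service.delete()
    return logs
```

Fetching the logs before calling `delete()` matters, because they are no longer available once the service has been removed.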