# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [1]:
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.dataset import Dataset
from train import prep_data
from azureml.train.automl.run import AutoMLRun
from azureml.widgets import RunDetails
from azureml.train.automl import AutoMLConfig


ws = Workspace.from_config()

In [2]:
# Create compute cluster
cluster_name = "compute-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print(f"Existing Compute using {cluster_name}")
except ComputeTargetException:
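    # No cluster with this name exists yet, so provision one and wait until it is ready
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_D2_V2', max_nodes=6, min_nodes=1)
    compute_target = ComputeTarget.create(workspace=ws, name=cluster_name, provisioning_configuration=compute_config)
    compute_target.wait_for_completion(show_output=True)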

Existing Compute using compute-cluster


## Dataset

### Overview
This dataset was extracted from the 1994 US Census. The goal is to predict whether a person earns a salary of more than $50K per year. The features are a subset of the broader 1994 census data. The dataset was retrieved from Kaggle: https://www.kaggle.com/datasets/ayessa/salary-prediction-classification.

Dataset Features:
* **age**: The age of the person (int)
* **workclass**: The class of job for the person (string category)
* **fnlwgt**: Weight assigned by US Census Bureau for number of people represented by that entry (int)
* **education**: The type of education someone has received (string category)
* **education-num**: A numerical representation of the education column (int)
* **marital-status**: The marital status of a person (string category)
* **occupation**: The occupation of a person (string category)
* **relationship**: The family relationship of a person (string category)
* **race**: The ethnicity of a person in the report (string category)
* **sex**: The gender of the person in the report (string category)
* **capital-gain**: The capital gains of a person in the report (int)
* **capital-loss**: The capital losses of a person in the report (int)
* **hours-per-week**: Hours worked per week (int)
* **native-country**: The native country for the person in the report (string category)
* **salary**: The salary <=50K or >50K we are trying to predict (category)
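
The `prep_data` function imported from `train.py` is not shown in this notebook, so the following is only a minimal sketch of the kind of preprocessing the feature list suggests: one-hot encoding for the string categories and standardization for the integer columns. The helper names `one_hot` and `standardize` and the toy records are hypothetical and are not taken from the project code or the dataset.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Return (categories, rows) for a one-hot encoding of a categorical column."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows


def standardize(values):
    """Rescale numeric values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


# Toy records, made up for illustration only
workclass = ["Private", "State-gov", "Private", "Self-emp-inc"]
age = [39, 50, 38, 53]

categories, encoded = one_hot(workclass)
scaled_age = standardize(age)

print(categories)
print(encoded)
print([round(value, 3) for value in scaled_age])
```

Each category becomes its own 0/1 column, and each numeric column ends up centered on zero, so features measured on very different scales become comparable.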


TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [3]:
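# Choose a name for the experiment
experiment_name = 'capstone-automl'
experiment = Experiment(ws, experiment_name)

# Load the registered 'salary' dataset and preprocess it with prep_data from train.py
dataset = Dataset.get_by_name(ws, name='salary')
df = dataset.to_pandas_dataframe()
X, y = prep_data(df)

# AutoML expects the label column in the same table as the features
X['Salary'] = y

# Register the preprocessed data as a tabular dataset in the default datastore
ds = ws.get_default_datastore()
train_dataset = Dataset.Tabular.register_pandas_dataframe(X, ds, 'automl-dataset')
X.head()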

Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/b0bcc5ed-8416-484b-953d-43588e4e7034/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


| | workclass_Federal-gov | workclass_Local-gov | workclass_Never-worked | workclass_Other | workclass_Private | workclass_Self-emp-inc | workclass_Self-emp-not-inc | workclass_State-gov | workclass_Without-pay | marital-status_Divorced | ... | native-country_United-States | native-country_Vietnam | native-country_Yugoslavia | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | Salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0.030671 | -1.063611 | 1.134739 | 0.148453 | -0.21666 | -0.035429 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0.837109 | -1.008707 | 1.134739 | -0.14592 | -0.21666 | -2.222153 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | -0.042642 | 0.245079 | -0.42006 | -0.14592 | -0.21666 | -0.035429 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1.057047 | 0.425801 | -1.197459 | -0.14592 | -0.21666 | -0.035429 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | -0.775768 | 1.408176 | 1.134739 | -0.14592 | -0.21666 | -0.035429 | 0 |


## AutoML Configuration

* **task**: Set to 'classification' since the goal is to classify whether a person's salary is above or below $50K
* **experiment_timeout_minutes**: Set to 30 minutes to get results in a timely manner without wasting compute resources
* **training_data**: The registered tabular dataset that contains both the features and the label column
* **primary_metric**: Set to 'accuracy', the fraction of correct predictions, which is a reasonable metric for ranking models on a two-class problem
* **label_column_name**: Set to 'Salary', the column that records whether the salary is more or less than 50K, which is what is being predicted
* **compute_target**: The compute cluster to run the experiment on
* **n_cross_validations**: Set to 5 so that each model is scored on five different train/validation splits, which gives a more reliable estimate than a single split
* **enable_early_stopping**: Terminate the experiment early if the score isn't improving
* **max_concurrent_iterations**: Set to 5 so that up to five iterations run in parallel on the cluster
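
To make the `n_cross_validations` setting more concrete, here is a minimal, standard-library sketch of how k-fold cross-validation partitions row indices. It illustrates the general technique only; the function name `k_fold_indices` is hypothetical, and the way AutoML actually builds its splits (for example shuffling or stratifying) may differ.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_indices, validation_indices) for n_splits contiguous folds."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional row so every row is used
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(12, n_splits=5))
for i, (train, validation) in enumerate(folds):
    print(f"fold {i}: validate on {validation}, train on {len(train)} rows")
```

Every row is used for validation exactly once and for training in the remaining folds, so the reported score is an average over five held-out estimates rather than a single one.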

In [4]:
# Additional AutoML settings, passed to AutoMLConfig as keyword arguments
automl_settings = {
  "max_concurrent_iterations": 5
}

# AutoML configuration for the salary classification experiment
automl_config = AutoMLConfig(
  task="classification",
  experiment_timeout_minutes=30,
  training_data=train_dataset,
  primary_metric='accuracy',
  label_column_name='Salary',
  compute_target=compute_target,
  n_cross_validations=5,
  enable_early_stopping=True,
  **automl_settings
)
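
Since the configuration ranks models by `accuracy`, it can help to see what that number measures and what it should be compared against. The sketch below computes accuracy alongside the score of always predicting the most common class. The helper names and the toy labels are hypothetical and are not results from this experiment.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Toy labels, made up for illustration: 0 means <=50K, 1 means >50K
y_true = [0, 0, 0, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 1, 0, 0, 0, 0]

print("model accuracy:   ", accuracy(y_true, y_pred))
print("majority baseline:", majority_baseline_accuracy(y_true))
```

When one class is much more common than the other, a model only adds value to the extent that its accuracy exceeds this baseline.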

In [5]:
# Submit the AutoML experiment to the remote compute and show its progress
remote_run: AutoMLRun = experiment.submit(automl_config)
RunDetails(remote_run).show()

Submitting remote run.


| Experiment | Id | Type | Status | Details Page | Docs Page |
|---|---|---|---|---|---|
| capstone-automl | AutoML_43b2d79d-d87f-4ad6-89f0-655d88b0b052 | automl | NotStarted | Link to Azure Machine Learning studio | Link to Documentation |


_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [6]:
from azureml.widgets import RunDetails
RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [7]:
# Retrieve the best run and its fitted model, then register the model
best_run, fitted_model = remote_run.get_output()
print(best_run)
print(fitted_model.steps)
registered_model = remote_run.register_model()
registered_model

Run(Experiment: capstone-automl,
Id: AutoML_43b2d79d-d87f-4ad6-89f0-655d88b0b052_38,
Type: azureml.scriptrun,
Status: Completed)
[('datatransformer', DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=False, task='classification')), ('prefittedsoftvotingclassifier', PreFittedSoftVotingClassifier(classification_labels=array([0, 1]), estimators=[('0', Pipeline(memory=None, steps=[('maxabsscaler', MaxAbsScaler(copy=True)), ('lightgbmclassifier', LightGBMClassifier(min_data_in_leaf=20, n_jobs=1, problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'cpu'}), random_state=None))], verbose=False)), ('27', Pipeline(memory=None, steps=[('standardscalerwrapper', StandardScalerWrapper(copy=True, with_mean=False, with_std=False)), ('xgboostclassifier', XGBoostClassifier(booster='gbtree', colsample_bytree=1, eta=0.

Model(workspace=Workspace.create(name='quick-starts-ws-194966', subscription_id='9a7511b8-150f-4a58-8528-3e7d50216c31', resource_group='aml-quickstarts-194966'), name=AutoML43b2d79dd38, id=AutoML43b2d79dd38:1, version=1, tags={}, properties={})
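
The best pipeline printed above ends in a `PreFittedSoftVotingClassifier`, which combines several fitted estimators. As a rough illustration of the soft-voting idea in general (not AutoML's actual implementation), the sketch below averages per-model class probabilities, optionally weighted, and picks the class with the highest average. The function name `soft_vote`, the probabilities, and the weights are all hypothetical.

```python
def soft_vote(probabilities, weights=None):
    """Average per-model class probabilities and return (label_index, averaged)."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    if len(weights) != len(probabilities):
        raise ValueError("one weight is required per model")
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    # The predicted label is the class with the highest averaged probability
    label = max(range(n_classes), key=averaged.__getitem__)
    return label, averaged


# Made-up probabilities for one person from three models: [P(<=50K), P(>50K)]
model_probs = [
    [0.60, 0.40],
    [0.30, 0.70],
    [0.45, 0.55],
]

label, averaged = soft_vote(model_probs)
print(label, [round(p, 2) for p in averaged])

weighted_label, weighted_avg = soft_vote(model_probs, weights=[3, 1, 1])
print(weighted_label, [round(p, 2) for p in weighted_avg])
```

With equal weights the second and third models outvote the first. Giving the first model a larger weight flips the prediction, which is why the ensemble weights matter as much as the individual models.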

**Submission Checklist**
- ~~I have registered the model.~~
- (skipped) I have deployed the model with the best accuracy as a webservice.
- (skipped) I have tested the webservice by sending a request to the model endpoint.
- (skipped) I have deleted the webservice and shut down all the computes that I have used.
- (skipped) I have taken a screenshot showing the model endpoint as active.
- ~~The project includes a file containing the environment details.~~ [file](./conda_dependencies_automl.yml)
