aspatton/udacity_Optimizing_a_Pipeline_in_Azure-Starter_Files

 
 

Optimizing an ML Pipeline in Microsoft Azure - Udacity course

Tony Patton - 03/01/2024

Project 1

I used a personal Azure subscription because the Udacity lab environments were never accessible; they would spin indefinitely and never start.

Overview

This project is part of the Udacity Machine Learning Engineer in Microsoft Azure course.

In this project, we build and optimize an Azure ML pipeline using Python and AzureML SDK.

The code compares models built with Scikit-learn and Azure AutoML.

This is the first project in the course.

Azure Machine Learning supports hyperparameter tuning through Hyperdrive experiments. This process launches multiple child runs, each with a different hyperparameter configuration. After all runs are complete, the best model can be evaluated and registered in Azure Machine Learning Studio. This project uses this approach to find the best model to use with the provided dataset.

--> Screenshots of the experiment runs and the environment are available in the img subdirectory

Summary

The dataset contains bank marketing records.

The project leverages data stemming from the direct marketing endeavors of a banking institution in Portugal, predominantly conducted via phone calls. The dataset encompasses 20 variables, including age, occupation, and marital status, with the goal column delineating two categories—Yes and No—to signify whether the client subscribed to a term deposit at the bank.

We used the AzureML Python SDK, with both Hyperdrive and AutoML, to predict whether a potential client would subscribe to a term deposit. In principle, this allows the bank to focus its resources on clients with a higher propensity to subscribe.

A Voting Ensemble model discovered through the AutoML run achieved the highest accuracy at 91.706% - see img/BestModel.png and img/VotingEnsemble.png. The Logistic Regression classifier, tuned using Hyperdrive, followed closely with an accuracy of about 90.6%.

Scikit-learn Pipeline

Initially, a Logistic Regression model was defined and trained with Scikit-learn in train.py. The script performs the following steps:

  • Incorporation of the banking dataset via Azure TabularDataset Factory
  • Application of a cleaning function for data purification and transformation
  • Division of processed data into training and testing subsets
  • Exposure of C and max_iter as script parameters, allowing subsequent optimization via Hyperdrive
  • Preservation of the trained model
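The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's train.py: it swaps in a synthetic dataset from `make_classification` for the cleaned bank marketing data, and the function names `parse_args` and `train` and the optional `model_path` argument are hypothetical. It assumes scikit-learn and joblib are installed.

```python
import argparse

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def parse_args(argv=None):
    """Expose C and max_iter as command-line arguments so a tuner can vary them."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--C", type=float, default=1.0,
                        help="Inverse of regularization strength")
    parser.add_argument("--max_iter", type=int, default=100,
                        help="Maximum number of solver iterations")
    return parser.parse_args(argv)


def train(C, max_iter, model_path=None, random_state=0):
    """Train a logistic regression model and return it with its test accuracy."""
    # Assumption: synthetic data stands in for the cleaned bank marketing dataset.
    X, y = make_classification(n_samples=500, n_features=10,
                               random_state=random_state)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state)

    model = LogisticRegression(C=C, max_iter=max_iter)
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)

    # Optionally persist the trained model (the path is hypothetical).
    if model_path is not None:
        joblib.dump(model, model_path)
    return model, accuracy


# Example invocation with the parameter values reported for the best run.
args = parse_args(["--C", "1000", "--max_iter", "100"])
model, accuracy = train(args.C, args.max_iter)
print(f"C={args.C} max_iter={args.max_iter} accuracy={accuracy:.4f}")
```

Because the data here is synthetic, the printed accuracy will not match the figures reported for the bank marketing dataset.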

With parameters C=1000 and max_iter=100, the model attained an accuracy of 90.62%.

Accuracy - 0.9062879 [HD_354605b0-1961-4b5b-b876-3a7c79c9cd12_4]

Note: see img/SecondBest.png and img/LogisticRegression.png for details on the run.

HyperDrive

The initially trained model underwent optimization with Hyperdrive, allowing for efficient, parallel hyperparameter tuning. The Hyperdrive implementation consisted of:

  • Azure cloud resources configuration
  • Hyperdrive configuration
  • Execution of the Hyperdrive
  • Retrieval of the optimal model

The Hyperdrive configuration used RandomParameterSampling and BanditPolicy. Random sampling explores the hyperparameter space efficiently, and the bandit policy terminates underperforming runs early.
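To make the two ideas concrete, here is a small pure-Python sketch of random sampling and of the slack-factor rule a bandit policy applies when the primary metric is being maximized. It does not call the AzureML SDK. The search space values, the slack factor of 0.1, and the function names `sample_config` and `bandit_should_stop` are assumptions chosen for illustration, not the values used in this project.

```python
import random


def sample_config(rng):
    """Random sampling: draw each hyperparameter independently from its choices."""
    return {
        # Hypothetical search space; the project's actual ranges are not shown here.
        "C": rng.choice([0.01, 0.1, 1, 10, 100, 1000]),
        "max_iter": rng.choice([50, 100, 200]),
    }


def bandit_should_stop(run_metric, best_metric, slack_factor):
    """Return True if a run should be terminated under the slack-factor rule.

    For a metric that is being maximized, a run is stopped when its metric
    falls below best_metric / (1 + slack_factor).
    """
    return run_metric < best_metric / (1.0 + slack_factor)


rng = random.Random(0)
configs = [sample_config(rng) for _ in range(4)]
for config in configs:
    print(config)

# With a best accuracy of 0.91 and slack_factor 0.1 the cutoff is about 0.827.
print(bandit_should_stop(0.80, 0.91, 0.1))  # below the cutoff, so stopped
print(bandit_should_stop(0.90, 0.91, 0.1))  # within the slack, so kept
```

In the SDK, the policy also takes settings that control how often and how soon the rule is evaluated; those are left out of this sketch.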

AutoML

Best Model Description: The best model generated by AutoML is a Voting Ensemble model. A Voting Ensemble (or majority voting ensemble) is a model that combines the predictions from multiple machine learning algorithms. It is often used to achieve higher accuracy and model stability. This is especially beneficial in situations where different models perform well for different types of input data, as the ensemble can leverage the strengths of each constituent model.

Accuracy - 0.91706 [AutoML_2e4bada9-b5a6-437e-a9a0-516e6a526204_21]

Parameters of the Best Model:

  1. voting: 'soft' The ensemble used soft voting: each constituent model predicts class-membership probabilities, the ensemble computes their weighted average, and the class with the highest average probability is predicted.

  2. weights: The weights assigned to the models in the ensemble. A higher weight means the model has more influence on the final prediction.

  3. n_jobs: None This parameter is used to specify the number of jobs to run in parallel. None usually means 1, i.e., the computations will not run in parallel.

  4. flatten_transform: None This parameter only affects the shape of the output of the transform method under soft voting; it does not change predictions. In this case, None means the per-model probabilities are not flattened into a single matrix.
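The effect of the voting and weights parameters can be illustrated with a small pure-Python sketch of weighted soft voting for a single sample. The probabilities and weights below are made up for illustration and are not taken from the AutoML run; `soft_vote` is a hypothetical helper, not an AutoML or Scikit-learn function.

```python
def soft_vote(probabilities, weights):
    """Weighted soft voting for one sample.

    probabilities: one list of class probabilities per model.
    weights: one weight per model.
    Returns the weighted-average probabilities and the index of the winning class.
    """
    total_weight = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for p, w in zip(probabilities, weights)) / total_weight
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return averaged, predicted


# Three hypothetical models scoring one client: [P(no), P(yes)].
model_probabilities = [
    [0.9, 0.1],
    [0.4, 0.6],
    [0.4, 0.6],
]

equal_avg, equal_pred = soft_vote(model_probabilities, [1, 1, 1])
weighted_avg, weighted_pred = soft_vote(model_probabilities, [1, 3, 3])

print(equal_avg, equal_pred)        # the confident first model tips the vote to class 0
print(weighted_avg, weighted_pred)  # heavier weights on the other two flip it to class 1
```

The example shows why the weights matter: the same three predictions yield a different class once the second and third models are given more influence.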

Pipeline comparison

In my testing, AutoML is the winner with regard to accuracy. In addition, AutoML is easier to use because it requires less code. However, AutoML takes longer to run and is therefore more expensive in terms of resource usage.

The big advantage of AutoML is the ease with which it evaluates many different algorithms. This project provides a clear example: a model type we had not considered outperformed the model we selected and tuned with HyperDrive.

Future work

Progress can be achieved by refining the Voting Ensemble algorithm from AutoML using Hyperdrive for hyperparameter tuning. Hyperdrive can explore varied parameter sampling methods, possibly unveiling a hyperparameter set that surpasses the accuracy attained by AutoML.

The bank in this example appears to focus on accurately identifying clients who want to sign up, so that it avoids wasting resources on unlikely subscribers. For that goal, the model's precision matters most.
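As a brief illustration of how precision can differ from accuracy, the sketch below computes both for a tiny set of made-up labels. The labels and the helper name `accuracy_and_precision` are hypothetical and are not drawn from the project's results.

```python
def accuracy_and_precision(y_true, y_pred, positive=1):
    """Compute accuracy and precision for binary labels."""
    pairs = list(zip(y_true, y_pred))
    true_positives = sum(1 for t, p in pairs if p == positive and t == positive)
    false_positives = sum(1 for t, p in pairs if p == positive and t != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    predicted_positives = true_positives + false_positives
    # Precision: of the clients predicted to subscribe, how many actually did.
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    return accuracy, precision


# Ten hypothetical clients: 1 = subscribed, 0 = did not subscribe.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy, precision = accuracy_and_precision(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f}")
```

Here 8 of 10 predictions are correct, yet only 1 of the 2 clients flagged as likely subscribers actually subscribed, so a model can look strong on accuracy while half of the bank's outreach is misdirected.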

Proof of cluster clean up

The cluster was deleted. I used my own subscription because I could never get the Udacity labs to work. An image of the cluster cleanup is provided (img/ML_Job_Cluster_Cleanup_Confirm.png).
