- Overview
- Summary
- Scikit-learn Pipeline
- AutoML
- Pipeline comparison
- Future work
- Proof of Cluster Clean Up
This project is part of the Udacity Azure ML Nanodegree. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model, and then compare this model to an Azure AutoML run. The process flow of the project is as follows:
The problem statement consists of data from the marketing sector of a bank, where various features are captured and each record is labelled according to whether a customer agrees to a certain marketing proposal. Our goal is not only to correctly classify the agreement to the marketing proposal but also to keep an eye on the true positive rate (recall) and the false positive rate. Hence, we look not only at the accuracy but also at the AUC-ROC curve, which trades off the true positive rate (recall) against the false positive rate, so that the high imbalance in the dataset doesn't bias our accuracy!
The best performing model from the Hyperdrive methodology [Run ID: HD_e354b747-f90c-4124-a95b-40a1e6c38010_7] achieved an accuracy of 0.916 and an AUC-ROC score of 0.933. On the other hand, the best performing model from the AutoML methodology [Run ID: AutoML_b8a1f7b2-b6de-40eb-b7c0-7d90d727ded4_30] achieved an accuracy of 0.916 and an AUC-ROC score of 0.95.
Explain the pipeline architecture, including data, hyperparameter tuning, and classification algorithm.
In architecting the pipeline for the logistic regression with Hyperdrive, we used the data from the csv file at [https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv] and cleaned it up so that it is ready to be ingested in an easily consumable form. The cleaning of the data included several steps, such as encoding the month names as numbers so that only numbers are fed into the system instead of text, encoding the remaining categorical columns, and finally returning the dataframe on which the Logistic Regression model runs. We tune 2 hyperparameters using RandomParameterSampling (a minimal sketch of the sampler follows the list below). The hyperparameters are:
- C: Range of [0.001, 0.01, 0.1, 1, 10, 50, 100, 150, 200, 300, 400, 500, 1000]
- max_iter: Range of [50, 100, 150, 200, 300]
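Below is a minimal sketch of how this search space could be declared with the Azure ML SDK; the argument names `--C` and `--max_iter` are assumptions based on the parameters the training script accepts.

```python
from azureml.train.hyperdrive import RandomParameterSampling, choice

# Random sampling over the discrete hyperparameter ranges listed above.
# "--C" and "--max_iter" are assumed to be the arguments accepted by train.py.
param_sampling = RandomParameterSampling({
    "--C": choice(0.001, 0.01, 0.1, 1, 10, 50, 100, 150, 200, 300, 400, 500, 1000),
    "--max_iter": choice(50, 100, 150, 200, 300),
})
```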
The reason for using RandomParameterSampling is that it supports early termination of low-performance runs. We could have used GridParameterSampling or BayesianParameterSampling to perform an exhaustive search over the values provided or to explore the hyperparameter space more thoroughly, respectively. But since time was a constraint on the VM, I preferred RandomParameterSampling as it led to faster execution of the code.
The classification algorithm used was Logistic Regression, where the target variable was y, assigned 1 for a YES and 0 for a NO.
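As a rough sketch of the cleaning and encoding described above (the column name `month`, the exact mapping, and the use of `pd.get_dummies` are assumptions, not the exact contents of the provided `train.py`):

```python
import pandas as pd

# Assumed sketch of the clean_data step: encode month names as numbers,
# one-hot encode the remaining categorical columns, and map the target
# column y to 1 for "yes" and 0 for "no".
months = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

def clean_data(df: pd.DataFrame):
    x_df = df.copy()
    x_df["month"] = x_df["month"].map(months)                      # month names -> numbers
    y_df = x_df.pop("y").apply(lambda s: 1 if s == "yes" else 0)   # 1 for YES, 0 for NO
    x_df = pd.get_dummies(x_df)                                    # encode remaining categoricals
    return x_df, y_df
```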
What are the benefits of the parameter sampler you chose?
- C: In scikit-learn this is the inverse of the regularization strength, so tuning it lets us train models that are able to generalize better on unseen data.
- max_iter: This is the maximum number of iterations the solver can take to converge (both hyperparameters are illustrated in the sketch below).
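The snippet below is purely illustrative (it uses a synthetic imbalanced dataset, not the bank marketing data) and shows how C and max_iter feed into scikit-learn's LogisticRegression; smaller C means stronger regularization.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the bank marketing data (illustration only).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)

for C in (0.01, 1, 100):
    clf = LogisticRegression(C=C, max_iter=200)                # C: inverse regularization strength
    auc = cross_val_score(clf, X, y, scoring="roc_auc", cv=4).mean()
    print(f"C={C:>6}: mean AUC-ROC = {auc:.3f}")
```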
What are the benefits of the early stopping policy you chose?
Syntax:

```python
policy = BanditPolicy(evaluation_interval=1, slack_factor=0.1)
```
An early stopping policy was chosen so that any run which is performing poorly can be automatically terminated, improving computational efficiency and making sure that resources are utilized at an optimum level. For this, a BanditPolicy was chosen and 2 parameters were passed because of the benefits mentioned below (a sketch of how the policy plugs into the HyperDrive configuration follows the list).
- evaluation_interval: This denotes how frequently the policy is applied.
- slack_factor: This is the allowed slack, as a ratio, with respect to the best performing run; any run whose primary metric falls outside this slack is automatically terminated.
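A hedged sketch of how the sampler, the BanditPolicy, and the training script might be tied together in a HyperDriveConfig; `src` (the run configuration for train.py), the primary metric name, and the run limits are assumptions, not values taken from the actual experiment.

```python
from azureml.train.hyperdrive import (BanditPolicy, HyperDriveConfig,
                                      PrimaryMetricGoal)

# `src` is assumed to be a ScriptRunConfig for train.py, and `param_sampling`
# is the RandomParameterSampling sketched earlier; the metric name and run
# limits below are illustrative values only.
hyperdrive_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=param_sampling,
    policy=BanditPolicy(evaluation_interval=1, slack_factor=0.1),  # early stopping policy
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4,
)
```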
In 1-2 sentences, describe the model and hyperparameters generated by AutoML.
The AutoML configuration was given the following parameters to build the corresponding model (a configuration sketch follows the list):
- task='classification': To signify what kind of model needs to be built.
- primary_metric='AUC_weighted': This is the metric that is weighted against when the final ranking is done.
- training_data=ds: This is to signify what data gets fed into the AutoML pipeline.
- label_column_name='y': This is to direct the config to the corresponding column of the target variable.
- n_cross_validations=4: This is to make sure that we don't over/under-estimate any metric on just one hold-out set; using different folds means the metric is averaged across different subsets of the data.
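A minimal sketch of the corresponding AutoMLConfig; `ds` is the cleaned training dataset, and the compute target and timeout are assumptions added for completeness.

```python
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",              # type of model to build
    primary_metric="AUC_weighted",      # metric used to rank the candidate models
    training_data=ds,                   # cleaned dataset fed into the AutoML pipeline
    label_column_name="y",              # target column
    n_cross_validations=4,              # average metrics over 4 folds
    compute_target=compute_target,      # assumed: an existing compute cluster
    experiment_timeout_minutes=30,      # assumed: illustrative timeout
)
```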
Compare the two models and their performance. What are the differences in accuracy? In architecture? If there was a difference, why do you think there was one?
From the above table we see that the accuracies are fairly similar to each other, but what is strikingly different is that AutoML is able to identify a VotingEnsemble model, which is an ensemble of different ML models, and thereby achieve a better AUC-ROC. In terms of architecture, Hyperdrive had a limited number of choices to pick from [as only 1 model was used], in contrast to AutoML, which had access to many different types of models such as LightGBM, stacking ensembles, etc. These points explain the differences that arise when these two pipelines are run.
The reason we went for the AUC score is the highly imbalanced dataset we had.
Evaluation metric from best model of AutoML:
Evaluation metric from best model of Hyperdrive:
AutoML showing the high imbalance in the dataset:
Feature importance as a by-product of model explanation:
What are some areas of improvement for future experiments? Why might these improvements help the model?
- Class Imbalance
In future experiments, I would like to explore the following ideas to enable better prediction on the dataset so that we improve both the AUC and the accuracy in a holistic way:
- Using techniques like SMOTE to oversample the data while making sure that we don't bias it: with SMOTE we generate synthetic samples for the minority class; using k-NN based oversampling, synthetic data points are created in the close vicinity of existing minority points so that we attain class balance (see the SMOTE sketch at the end of this section).
- Using undersampling techniques like NearMiss to reduce the imbalance. This helps the model by removing some sample points from the majority class so as to increase the gap between the 2 classes. The near-neighbour method also keeps information loss low during undersampling.
- Looking at other evaluation metrics such as Precision/Recall/F1 score, understanding whether precision or recall matters more (of course getting in touch with business stakeholders to know their focus area), and steering the model's development in that direction. This will bring us closer to solving the business case and hence improve adoption by the business.
- Feature Engineering: Understand what different features can be derived from the existing pool of features, which might explain the model even better. Synthesizing such features will make the explainability even richer. However, this might increase computation time and elevate expenses, so a trade-off needs to be made to justify the benefit of the extra computation time.
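As a pointer for the class-imbalance idea above, here is a minimal sketch of SMOTE oversampling using the imbalanced-learn package on a synthetic dataset (an assumed stand-in, not the bank marketing data).

```python
from collections import Counter

from imblearn.over_sampling import SMOTE          # assumes imbalanced-learn is installed
from sklearn.datasets import make_classification

# Synthetic imbalanced dataset as a stand-in for the real data (illustration only).
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
print("class counts before:", Counter(y))

# SMOTE builds synthetic minority samples between a point and its k nearest
# minority-class neighbours until the classes are balanced.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("class counts after: ", Counter(y_res))
```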