# Introduction to Machine Learning

*Machine Learning* is the foundation for most artificial intelligence solutions, and the creation of an intelligent solution often begins with the use of *supervised* machine learning to train a predictive model using historic data that you have collected.

Supervised machine learning techniques involve training a model to operate on a set of *features* and predict a *label* using a dataset that includes some already-known label values. The training process *fits* the features to the known labels to define a model, which can then be applied to new features to predict labels that are not yet known.

You can think of this as a function, in which ***y*** represents the label we want to predict and ***x*** represents the features the model uses to predict it - like this:

$$y = f(x)$$

## An Example

That all seems a little abstract, so let's take a look at an example that might be simpler to understand.

<p style='text-align:center'><img src='./images/adventureworks.jpg' alt='Adventure Works cycle rental location, on a cloudy day in January'/></p>

Suppose *Adventure Works Cycles* is a business that rents cycles in a city, and its owners want to leverage historic data to train a model that predicts daily rental demand so they can ensure that sufficient staff and cycles are available.

The demand would be measured in terms of the number of rentals for a given day (this is our *y* value, or *label*).

The factors that affect demand might include things like:

 - What day of the week it is
 - The season of the year
 - The temperature
 - The level of rainfall
 - The humidity level
 - The wind speed

These are our *x* values, or *features*.

So what we're looking for is a function (*f*) that performs some kind of calculation based on the various seasonal and weather features on a given day (*x*) that produces a label that indicates the likely number of rentals that day (*y*) as a result.

The specific operation that the ***f*** function performs on *x* to calculate *y* depends on the type of machine learning model you are trying to create. This kind of machine learning, where the label you're trying to predict is a numeric value (in this case, the number of rentals), is called *regression*.
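
To make the idea of fitting a function *f* more concrete, here is a minimal sketch of the simplest kind of regression: a straight line fitted by ordinary least squares. It assumes a single feature (temperature) and uses made-up data generated from a known line, so every number and name in it is hypothetical and it is not the model you will train later in this lab.

```python
# Hypothetical training data: temperature in degrees C and rentals that day.
# The labels are generated from rentals = 20 * temperature + 50, so we know
# what a good fit should recover.
temperatures = [5.0, 10.0, 15.0, 20.0, 25.0]
rentals = [20.0 * t + 50.0 for t in temperatures]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training" finds the parameters that define f.
slope, intercept = fit_line(temperatures, rentals)


def f(x):
    """The trained model: predicts a label (rentals) from a feature (temperature)."""
    return slope * x + intercept


print("Predicted rentals at 18 degrees:", round(f(18.0)))
```

A real model for this scenario would use all of the features listed above rather than temperature alone, and the relationship would rarely be a perfect straight line.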

## Azure Machine Learning

Data scientists expend a lot of effort exploring and pre-processing data, and trying various model-training algorithms to produce accurate models. This work is time consuming, and it often makes inefficient use of expensive compute hardware.

Azure Machine Learning is a cloud-based platform for building and operating machine learning solutions in Azure. It includes capabilities that help data scientists prepare data, train models, publish predictive services, and monitor their usage.

Most importantly, it helps data scientists increase their efficiency by automating many of the time-consuming tasks associated with training models; and it enables them to use cloud-based compute resources that scale effectively to handle large volumes of data while incurring costs only when actually used.

## Create an Azure Machine Learning Workspace

To use Azure Machine Learning, you create a *workspace* in your Azure subscription. You can then use this workspace to manage data, compute resources, code, models, and other artefacts related to your machine learning workloads.

Follow these steps to create a workspace:

1. Sign in to the [Azure portal](https://portal.azure.com) using your Microsoft credentials.
2. Click the **&#65291;Create a resource** button, search for *Machine Learning*, and create a new **Machine Learning** resource with the following settings:
    - **Workspace Name**: *A unique name of your choice*
    - **Subscription**: *Your Azure subscription*
    - **Resource group**: *Create a new resource group with a unique name*
    - **Location**: *Choose any available location*
    - **Workspace edition**: Enterprise
3. Wait for your workspace to be created. Then go to it in the portal.

## Create Azure Machine Learning Resources

Now that you have an Azure Machine Learning workspace, you can use it to manage the various assets and resources you need to create machine learning solutions. At its core, Azure Machine Learning is a platform for training and managing machine learning models, for which you need two fundamental things: data from which to train the models, and compute on which to run the training process. You'll manage both of these for the cycle rental prediction model using the Azure Machine Learning *studio* web interface.

> **Important**: You are going to create some compute resources in your workspace that will run in the background as you complete the lab. If you decide not to complete the lab, make sure you perform the **Reset Resources** exercise at the end of this notebook to avoid leaving your compute running and incurring unnecessary charges to your Azure subscription. 

1. In the [Azure portal](https://portal.azure.com), in the **Overview** page for your Azure Machine Learning workspace, click the link to launch Azure Machine Learning **studio**. Alternatively, browse to [https://ml.azure.com](https://ml.azure.com), sign in using your Microsoft credentials, and select your Azure subscription and Azure Machine Learning workspace.
2. View the **Compute** page (under **Manage**). This is where you manage the compute targets for your data science activities. There are four kinds of compute resource you can create:
    - **Compute Instances**: Development workstations that data scientists can use to work with data and models.
    - **Training Clusters**: Scalable clusters of virtual machines for on-demand processing of model training code.
    - **Inference Clusters**: Deployment targets for predictive services that use your trained models.
    - **Attached Compute**: Links to existing Azure compute resources, such as Virtual Machines or Azure Databricks clusters.
3. Switch to the **Training Clusters** tab, and add a new training cluster with the following settings:
    - **Compute name**: aml-cluster
    - **Virtual Machine size**: Standard_DS1_v2
    - **Virtual Machine priority**: Dedicated
    - **Minimum number of nodes**: 2
    - **Maximum number of nodes**: 2
    - **Idle seconds before scale down**: 120
4. While the training cluster is being created, view the **Datasets** page (under **Assets**), and create a new dataset ***from web files*** with the following settings:
    - **Basic Info**:
        - **Web URL**: https://aka.ms/bike-rentals
        - **Name**: bike-rentals
        - **Dataset type**: Tabular
        - **Description**: Bicycle rental data
    - **Settings and preview**:
        - **File format**: Delimited
        - **Delimiter**: Comma
        - **Encoding**: UTF-8
        - **Column headers**: Use headers from first file
        - **Skip rows**: None
    - **Schema**:
        - Include all columns other than **Path**
        - Review the automatically detected types
    - **Confirm details**:
        - Do not profile the dataset after creation
5. After the dataset has been created, open it and view the **Explore** page to see a sample of the data. This data contains historical features and labels for bike rentals.

> **Citation**: *This data is derived from [Capital Bikeshare](https://www.capitalbikeshare.com/system-data) and is used in accordance with the published data [license agreement](https://www.capitalbikeshare.com/data-license-agreement).*
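
The dataset settings above describe a comma-delimited, UTF-8 file whose first row holds the column headers. As a rough illustration of what that means, the following sketch parses a tiny in-memory sample in the same shape and separates features from the label. Only the `rentals` column name comes from this lab; the other column names and all of the values are made up.

```python
import csv
import io

# Hypothetical sample: comma-delimited text with headers in the first row.
# Only the "rentals" column name comes from this lab; the rest is made up.
sample = io.StringIO(
    "temp,hum,rentals\n"
    "0.34,0.81,331\n"
    "0.36,0.70,131\n"
    "0.20,0.44,120\n"
)

# DictReader uses the first row as column headers, like the studio setting.
rows = list(csv.DictReader(sample, delimiter=","))

# Separate the features (x) from the label (y), converting text to numbers.
features = [[float(row["temp"]), float(row["hum"])] for row in rows]
labels = [int(row["rentals"]) for row in rows]

print("Columns:", list(rows[0].keys()))
print("First feature row:", features[0], "label:", labels[0])
```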

## Train a Model Using Automated Machine Learning

Azure Machine Learning includes an *automated machine learning* capability that leverages the scalability of cloud compute to automatically try multiple model-training algorithms and pre-processing techniques in parallel to find the best performing model for your data. Perform the following steps to use it to train a model that predicts bike rentals.

1. In [Azure Machine Learning studio](https://ml.azure.com), view the **Automated ML** page (under **Author**).
2. Create a new Automated ML run with the following settings:
    - **Dataset**: bike-rentals
    - **Experiment name**: auto-train-bike-rental
    - **Target column**: rentals
    - **Training compute target**: aml-cluster
    - **Task type**: Regression
    - **Additional configuration settings:**
        - **Primary metric**: Normalized root mean square error (more about this later!)
        - **Automated featurization**: Selected
        - **Explain best model**: Unselected
        - **Blocked algorithms**: *block <u>all</u> other than **RandomForest** and **LightGBM** - normally you'd want to try as many as possible, but doing so will take more time than we have available in this lab!*
3. When you finish submitting the automated ML run details, it will start automatically. Wait for the run status to change from *Preparing* to *Running* (this may take 10 minutes or so, as the cluster nodes need to be initialized before training can begin - now might be a good time for a coffee break!). You may need to click **&#8635;Refresh** periodically.
4. When the run status changes to *Running*, click the **Models** tab and observe as each possible combination of training algorithm and pre-processing steps is tried and the performance of the resulting model is evaluated. The page will auto-refresh periodically, but you can also click **&#8635;Refresh**.
5. After a few models have been trained and evaluated (with a status of **Completed**), click **&#10754;Cancel** to cancel the remaining iterations (otherwise you could be here for quite a while!).
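
Conceptually, the run you just started trains several candidate models, scores each one with the primary metric, and keeps the best. The sketch below imitates that loop in plain Python under heavy simplifications: the two candidates (a mean baseline and a straight-line fit) stand in for algorithms such as RandomForest and LightGBM, a single hold-out split replaces the service's own validation, the candidates run one after another rather than in parallel, and the data is made up. It is an illustration of the idea, not of how Azure Machine Learning implements it.

```python
import math

# Hypothetical data: (temperature, rentals) pairs, made up for illustration.
data = [(5.0, 150.0), (10.0, 260.0), (15.0, 340.0), (20.0, 455.0),
        (25.0, 548.0), (30.0, 652.0)]
train, validation = data[:4], data[4:]


def train_mean_model(rows):
    """Candidate 1: always predict the mean label seen in training."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y


def train_linear_model(rows):
    """Candidate 2: least-squares line through the training points."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    return lambda x: mean_y + slope * (x - mean_x)


def rmse(model, rows):
    """Root mean square error of a model on held-out rows."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in rows) / len(rows))


candidates = {"mean baseline": train_mean_model, "linear": train_linear_model}

# Train every candidate, score it on the validation rows, and keep the best.
scores = {name: rmse(trainer(train), validation)
          for name, trainer in candidates.items()}
best_name = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: RMSE = {score:.1f}")
print("Best model:", best_name)
```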

## Review and Deploy the Best Model

Although you cancelled the automated machine learning run, some models were trained; so you can review the best performing one and deploy it as a predictive service.

1. On the **Details** tab of the automated machine learning run, note the recommended model.

    This recommendation is based on the *Normalized root mean square error* metric you specified.

    To calculate this, the training process used some of the data to train the model, and applied a technique called *cross-validation* to test the trained model with data it wasn't trained with (but for which the actual label value is known), and compare the predicted value with the actual known value.
    
    The differences between the predicted and actual values (known as the *residuals*) indicate the amount of *error* in the model. The root mean square error is calculated by squaring the errors across all of the test cases, finding the mean of these squares, and then taking the square root; the *normalized* form of the metric then scales this result relative to the spread of the label values. What all of this means is that the smaller this value is, the more accurately the model is predicting.

2. Click **View model details**, and note that you can see all of the run metrics that give statistical information about the performance of the model.
3. Click the **Visualizations** tab and review the charts that show the performance of the model by comparing the predicted values against the true values, and showing the *residuals* (differences between predicted and actual values) as a histogram.
    - The **Predicted vs. True** chart should show a diagonal trend in which the predicted value correlates closely to the true value. A dotted line shows how a perfect model should perform, and the closer the line for your model's average predicted value is to this, the better its performance.
    - The **Residual Histogram** shows the frequency of residual value ranges. Residuals represent variance between predicted and true values that can't be explained by the model - in other words, errors; so what you should hope to see is that the most frequently occurring residual values are clustered around 0 (in other words, most of the errors are small), with fewer errors at the extreme ends of the scale.
4. Return to the **Model details** tab, and click **Deploy model**. Then deploy the model with the following settings:
    - **Name**: predict-rentals
    - **Description**: Predict cycle rentals
    - **Compute type**: ACI
    - **Enable authentication**: Selected
5. Wait for the deployment to complete - this may take a few minutes.
6. In Azure Machine Learning studio, view the **Endpoints** page and find the **predict-rentals** real-time endpoint.
7. Click the **predict-rentals** endpoint and verify that the Deployment state is *Healthy* (if it is *Transitioning*, wait a few more minutes and refresh the page until it is *Healthy*). Then click the **Consume** tab and note the information there. You need this to connect to your deployed service from a client application.
8. Copy the REST endpoint for your service (you can use the &#10697; link next to it), and paste it in the code below (replacing YOUR_ENDPOINT).
9. Copy the Primary Key for your service and paste it in the code below, replacing YOUR_KEY.
10. Run the code in the cell below by clicking its green &#9655; button.

```python
endpoint = 'YOUR_ENDPOINT' # Replace with your endpoint
key = 'YOUR_KEY' # Replace with your key

print('Ready to use', endpoint, 'with key', key)
```

Now you're ready to use your service.

Run the code cell below, which defines features for a five-day period using hypothetical weather forecast data, and uses the **predict-rentals** service you created to predict cycle rentals for those five days.

> **Note**: Don't worry too much about the details of the code - the point is just to verify that your published model works!

```python
import json
import requests

# An array of features based on five-day weather forecast
x = [[1,1,2022,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446],
     [2,1,2022,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539],
     [3,1,2022,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309],
     [4,1,2022,1,0,2,1,1,0.2,0.212122,0.590435,0.160296],
     [5,1,2022,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869]]

# Convert the array to JSON format
input_json = json.dumps({"data": x})

# Set the content type and authentication for the request
headers = {"Content-Type":"application/json",
           "Authorization":"Bearer " + key}

# Send the request
response = requests.post(endpoint, input_json, headers=headers)

# If we got a valid response, display the predictions
if response.status_code == 200:
    y = json.loads(response.json())
    print("Predictions:")
    for i in range(len(x)):
        print(" Day: {}. Predicted rentals: {}".format(i+1, round(y["result"][i])))
else:
    print(response)
```
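
One detail of the cell above is worth a closer look: it calls `json.loads` on the value returned by `response.json()`, which only works if the service returns a JSON-encoded string rather than a JSON object. The following sketch reproduces that double decoding without calling the service. The response body and the prediction values in it are made up for illustration.

```python
import json

# Hypothetical response body: a JSON string whose content is itself JSON.
# The prediction values are made up for illustration.
inner = json.dumps({"result": [331.4, 131.9]})
body = json.dumps(inner)

first_pass = json.loads(body)         # what response.json() would return: a str
second_pass = json.loads(first_pass)  # the dictionary holding the predictions

print(type(first_pass).__name__)
print([round(value) for value in second_pass["result"]])
```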

Your machine learning model is predicting rentals based on the features you submit to your service, making it possible for Adventure Works Cycles to ensure they have enough staff and cycles in place to meet demand.
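
If you want to see how the metric from the model review steps works, the sketch below computes a root mean square error by hand, normalizes it, and shows how cross-validation can split rows so that every row is tested by a model that was not trained on it. It assumes the common convention of dividing by the range of the actual labels; check the Azure Machine Learning documentation for the exact definition the service uses. The function names, the fold assignment, and the data are all hypothetical.

```python
import math


def rmse(predicted, actual):
    """Square each residual, average the squares, then take the square root."""
    residuals = [p - a for p, a in zip(predicted, actual)]
    return math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))


def normalized_rmse(predicted, actual):
    """RMSE divided by the range of the actual labels (an assumed convention)."""
    return rmse(predicted, actual) / (max(actual) - min(actual))


def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) so each row is tested exactly once."""
    for fold in range(k):
        test = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, test


# Hypothetical labels and predictions, made up for illustration.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print("RMSE:", rmse(predicted, actual))
print("Normalized RMSE:", normalized_rmse(predicted, actual))

# Each fold trains on most rows and tests on the rows that were held out.
for train_idx, test_idx in k_fold_indices(n_rows=6, k=3):
    print("train on rows", train_idx, "test on rows", test_idx)
```

Because the normalized value is relative to the scale of the labels, it is easier to compare across datasets than the raw error, and in both cases a smaller value indicates better predictions.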

## Reset Resources

The web service is hosted in an *Azure Container Instance*. If you don't intend to experiment with it further, you should delete the endpoint to avoid accruing unnecessary Azure charges. You should also stop the training cluster until you need it again.

1. In [Azure Machine Learning studio](https://ml.azure.com), on the **Endpoints** tab, select the **predict-rentals** endpoint. Then click the **Delete** (&#128465;) button and confirm that you want to delete the endpoint.
2. On the **Compute** page, on the **Training clusters** tab, open the **aml-cluster** compute target and click **Edit**. Then set the **Minimum number of nodes** setting to **0** and click **Update**.
