# Building a Regression Analysis using Azure Machine Learning

This first notebook shows how to perform a straightforward regression analysis using Azure Machine Learning.  We will use `scikit-learn`'s `DecisionTreeRegressor` to train a model and see how that model fares.

Because we are running this notebook directly from Azure Machine Learning, `Workspace.from_config()` "just works."  As we'll see later, running it locally requires setting up a configuration file first.
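As a rough sketch of what that configuration file looks like: `Workspace.from_config()` reads a `config.json` file containing the subscription ID, resource group, and workspace name, typically from the working directory or an `.azureml` subfolder.  The helper name `write_workspace_config` and the placeholder values below are hypothetical, and the demo writes into a temporary directory rather than a real project folder.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(base_dir, subscription_id, resource_group, workspace_name):
    """Write an Azure ML style config.json under <base_dir>/.azureml (hypothetical helper)."""
    config_dir = Path(base_dir) / ".azureml"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config.json"
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    config_path.write_text(json.dumps(config, indent=2))
    return config_path


# Placeholder values only; substitute the details of your own workspace.
demo_dir = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_dir, "<subscription-id>", "<resource-group>", "<workspace-name>"
)
print(config_path.read_text())
```

In a real project, the same file can also be downloaded from the workspace overview page in the Azure portal instead of being written by hand.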

In [None]:
from azureml.core import Workspace, Environment, Datastore, Dataset
from azureml.core.experiment import Experiment
from azureml.data.datapath import DataPath
from azureml.data import DataType
from azureml.core.run import Run
from azureml.core.model import Model

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import joblib
import numpy as np
import pandas as pd
from math import sqrt

ws = Workspace.from_config()

This section assumes that you already have an Azure SQL Database set up and have created a datastore named `expense_reports`.

In [None]:

expenses_datastore = Datastore.get(ws, datastore_name="expense_reports")

In this section, we will get two separate datasets.  The first dataset is all expense report data prior to the year 2017, and the second is all data from 2017 onward.  It turns out that there was some expense reporting fraud starting in the year 2018, so we want to train the model on data from before the fraud began.

In [None]:
query = """SELECT
    er.EmployeeID,
    CONCAT(e.FirstName, ' ', e.LastName) AS EmployeeName,
    ec.ExpenseCategoryID,
    ec.ExpenseCategory,
    er.ExpenseDate,
    YEAR(er.ExpenseDate) AS ExpenseYear,
    -- Cast to FLOAT so the column maps to DataType.to_float(); azureml's DataType has no decimal type
    CAST(er.Amount AS FLOAT) AS Amount
FROM dbo.ExpenseReport er
    INNER JOIN dbo.ExpenseCategory ec
        ON er.ExpenseCategoryID = ec.ExpenseCategoryID
    INNER JOIN dbo.Employee e
        ON e.EmployeeID = er.EmployeeID
WHERE
    YEAR(er.ExpenseDate) < 2017;"""
queryTraining = DataPath(expenses_datastore, query)

data_types = {
    'EmployeeID': DataType.to_long(),
    'EmployeeName': DataType.to_string(),
    'ExpenseCategoryID': DataType.to_long(),
    'ExpenseCategory': DataType.to_string(),
    'ExpenseDate': DataType.to_datetime('%Y-%m-%d'),
    'ExpenseYear': DataType.to_long(),
    'Amount': DataType.to_float()
}

queryTesting = DataPath(expenses_datastore, query.replace("YEAR(er.ExpenseDate) < 2017;", "YEAR(er.ExpenseDate) >= 2017;"))

training = Dataset.Tabular.from_sql_query(queryTraining, set_column_types=data_types).to_pandas_dataframe()
testing = Dataset.Tabular.from_sql_query(queryTesting, set_column_types=data_types).to_pandas_dataframe()
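
The testing query above is derived from the training query with `str.replace`, which silently leaves the query unchanged if the `WHERE` clause text ever drifts.  One possible alternative, sketched here with a shortened `SELECT` list, is to build both queries from a single template.  The names `QUERY_TEMPLATE`, `ALLOWED_COMPARISONS`, and `build_expense_query` are hypothetical and not part of the Azure ML SDK.

```python
# Shortened version of the notebook's query, with the filter left as a placeholder.
QUERY_TEMPLATE = """SELECT
    er.EmployeeID,
    er.ExpenseCategoryID,
    YEAR(er.ExpenseDate) AS ExpenseYear,
    CAST(er.Amount AS FLOAT) AS Amount
FROM dbo.ExpenseReport er
WHERE
    YEAR(er.ExpenseDate) {comparison} {year};"""

# Only the comparisons this notebook needs are allowed into the SQL text.
ALLOWED_COMPARISONS = {"<", ">="}


def build_expense_query(comparison, year):
    """Return the expense query filtered by year (hypothetical helper)."""
    if comparison not in ALLOWED_COMPARISONS:
        raise ValueError("Unsupported comparison: {!r}".format(comparison))
    return QUERY_TEMPLATE.format(comparison=comparison, year=int(year))


training_query = build_expense_query("<", 2017)
testing_query = build_expense_query(">=", 2017)
print(testing_query)
```

Each string could then be passed to `DataPath(expenses_datastore, ...)` in the same way as the original `query`.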

Here, we run our experiment, named `ExpenseReportsNotebook`.  We create a `DecisionTreeRegressor()` and fit it using the expense category and the year as inputs, trying to predict the amount spent.  Once we do that, we measure the quality of the model on the testing data using Root Mean Squared Error (RMSE) and log the result.
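
For reference, RMSE is the square root of the average squared difference between actual and predicted values, so it is expressed in the same units as the amounts themselves.  The following is a minimal pure-Python sketch of that calculation on made-up numbers.  The function name `compute_rmse` is hypothetical, and the notebook itself relies on `sqrt(mean_squared_error(...))` instead.

```python
from math import sqrt


def compute_rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences (illustrative helper)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("Inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up amounts: the errors are -2, 2, and -3, so the squared errors are 4, 4, and 9.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(compute_rmse(actual, predicted))  # square root of 17/3
```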

From there, we look at the RMSE for each combination of employee and expense category, in an attempt to see if there is anything additional we might be able to glean, such as which people might have engaged in fraudulent behavior.  It turns out that RMSE alone is enough to find the fraudsters.
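
To illustrate why a grouped RMSE can surface outliers, here is a small sketch on invented data: one employee's amounts sit close to the predictions while the other's are far above them.  The employee names and all values are hypothetical, and the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

# Invented test results: every prediction is 21.0 for the "Meals" category.
toy = pd.DataFrame({
    "EmployeeName": ["Hypothetical A", "Hypothetical A", "Hypothetical B", "Hypothetical B"],
    "ExpenseCategory": ["Meals", "Meals", "Meals", "Meals"],
    "Amount": [20.0, 22.0, 60.0, 80.0],
    "AmountPrediction": [21.0, 21.0, 21.0, 21.0],
})

# Squared error per row, then mean and square root within each group.
toy["SquaredError"] = (toy["Amount"] - toy["AmountPrediction"]) ** 2
group_rmse = (
    toy.groupby(["EmployeeName", "ExpenseCategory"])["SquaredError"]
    .mean()
    .pow(0.5)
    .sort_values(ascending=False)
    .rename("RMSE")
)
print(group_rmse)
```

Sorting in descending order puts the groups the model predicts worst at the top, which is where unusual spending would be expected to show up.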

Finally, we'll save the model to `outputs/model.pkl` and register it with the workspace.  This way, we can deploy the model later if we so desire.

In [None]:
# Begin experiment
experiment = Experiment(workspace=ws, name="ExpenseReportsNotebook")
run = experiment.start_logging()

# Fit the data to a decision tree
reg = DecisionTreeRegressor()
reg.fit(training[["ExpenseCategoryID", "ExpenseYear"]], training[["Amount"]].values.ravel())

# Generate predictions based on the trained model
pred = pd.DataFrame({"AmountPrediction": reg.predict(testing[["ExpenseCategoryID", "ExpenseYear"]]) })
# Concatenate testing data with predictions
testdf = pd.concat([testing, pred], axis=1)
# Calculate the root mean squared error
rmse = sqrt(mean_squared_error(testdf["Amount"], testdf["AmountPrediction"]))

# Log the overall rmse
run.log('RMSE', rmse)

print()
print('#############################')
print('RMSE is {}'.format(rmse))
print('#############################')
print()

# Log the RMSE for each employee and expense category
# (a separate variable keeps the overall rmse above from being overwritten)
employees = testdf.groupby(['EmployeeName', 'ExpenseCategory'])
for (empname, expcat), grp in employees:
    group_rmse = sqrt(mean_squared_error(grp["Amount"], grp["AmountPrediction"]))
    run.log('{}, {}, RMSE'.format(empname, expcat), group_rmse)

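# Save the model under outputs/, creating the directory first in case it does not exist
import os
model_file_name = 'outputs/model.pkl'
os.makedirs(os.path.dirname(model_file_name), exist_ok=True)
joblib.dump(value=reg, filename=model_file_name)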

# Typically, the run.upload_file() method would be used to capture saved files
# However, as per the Azure documentation, files stored in the outputs/ directory are automatically captured by the current Run

# Complete the run
run.complete()

# Register the model with the workspace
model = run.register_model(model_name = 'ExpenseReportsNotebookModel', model_path = model_file_name)
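
Before deploying a saved model, it can be worth confirming that the pickle file loads back and produces the same predictions.  The sketch below does this round trip with a toy `DecisionTreeRegressor` trained on invented data and a temporary directory, so it does not touch the workspace or the registered model.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeRegressor

# Invented training rows: [ExpenseCategoryID, ExpenseYear] -> Amount
X = [[1, 2015], [1, 2016], [2, 2015], [2, 2016]]
y = [10.0, 12.0, 50.0, 54.0]
toy_model = DecisionTreeRegressor(random_state=0).fit(X, y)

# Save the model to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(toy_model, model_path)
    restored = joblib.load(model_path)

# Both models should agree on unseen rows.
new_rows = [[1, 2017], [2, 2017]]
original_pred = toy_model.predict(new_rows)
restored_pred = restored.predict(new_rows)
print(list(original_pred) == list(restored_pred))
```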