# Step 1: Experiment in a notebook
<div class="alert alert-warning"> This notebook was last tested on a SageMaker Studio JupyterLab instance using the <code>SageMaker Distribution Image 3.6.1</code> and the SageMaker Python SDK version <code>2.255.0</code>.</div>

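In this step, you run data processing, model training, and model evaluation locally in the notebook. You don't use the `sagemaker` or `boto3` packages for processing or training; the notebook uses them only for setup tasks such as checking the MLflow tracking server and transferring data to and from Amazon S3.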

**From idea to production in six steps:**
||||
|---|---|---|
|1. |Experiment in a notebook |**<<<< YOU ARE HERE**|
|2. |Scale with SageMaker AI processing jobs and SageMaker SDK ||
|3. |Operationalize with ML pipeline, model registry, and feature store ||
|4. |Add a model building CI/CD pipeline ||
|5. |Add a model deployment pipeline ||
|6. |Add model and data monitoring ||

<div class="alert alert-info"> Make sure you are using the <code>Python 3</code> kernel in JupyterLab for this notebook.</div>



In [None]:
# We use the open-source XGBoost algorithm to implement the model
%pip install -q xgboost

In [None]:
# Need to install mlflow so this notebook can run standalone as a SageMaker job
%pip install --upgrade "mlflow>=2,<3" sagemaker-mlflow



In [None]:
import pandas as pd
import numpy as np 
import json
import joblib
import xgboost as xgb
import sagemaker
import boto3
import os
import matplotlib.pyplot as plt
import mlflow
import mlflow.xgboost  # the XGBoost flavor is used for autologging during training
from mlflow.models import infer_signature
from time import gmtime, strftime, sleep
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

(sagemaker.__version__, boto3.__version__, mlflow.__version__)

In [None]:
%store -r 

%store

try:
    initialized
except NameError:
    print("+++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN 00-start-here notebook   ")
    print("+++++++++++++++++++++++++++++++++++++++++++++++++")
    user_profile_name = sagemaker.get_execution_role()

In [None]:
target_col = "y"

In [None]:
%store target_col

In [None]:
session = sagemaker.Session()
sm = session.sagemaker_client

## Load data
The following cell is tagged with `parameters` as the cell tag to enable parametrization for headless execution of the notebook as [SageMaker Notebook-based workflow](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run.html). Refer to the section **Run the notebook as a SageMaker job** for details and an example. Ignore this for now.

In [None]:
# This cell is tagged with the `parameters` tag and is overridden if the notebook is executed headlessly
file_source = "local"
file_name = "bank-additional-full.csv"
input_path = "./data/bank-additional" 
output_path = "./data"

In [None]:
# If the notebook runs as a job (non-interactively, or headlessly), it cannot access the JupyterLab EBS volume, so download the dataset from S3 instead
# See the section "Run the notebook as a SageMaker job" for more details
if file_source != "local":
    session.download_data(
        path=os.path.join(input_path, ""), 
        bucket=bucket_name,
        key_prefix=f"{bucket_prefix}/input/{file_name}"
    )

## EDA
Let's do some exploratory data analysis on this dataset.

In [None]:
df_data = pd.read_csv(os.path.join(input_path, file_name), sep=";")

pd.set_option("display.max_columns", 500)  # View all of the columns
df_data  # show first 5 and last 5 rows of the dataframe

In [None]:
# see column metadata
df_data.info()

In [None]:
# see column statistics
df_data.describe()

In [None]:
# see target distribution
df_data[target_col].value_counts().plot.bar()

plt.show()

In [None]:
# see if there are any missing values
df_data.isna().sum()

In [None]:
cat_columns = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

fig, axs = plt.subplots(3, 3, sharex=False, sharey=False, figsize=(20, 15))

counter = 0
for cat_column in cat_columns:
    value_counts = df_data[cat_column].value_counts()
    
    trace_x = counter // 3
    trace_y = counter % 3
    x_pos = np.arange(0, len(value_counts))
    
    axs[trace_x, trace_y].bar(x_pos, value_counts.values, tick_label = value_counts.index)
    
    axs[trace_x, trace_y].set_title(cat_column)
    
    for tick in axs[trace_x, trace_y].get_xticklabels():
        tick.set_rotation(90)
    
    counter += 1

plt.show()

In [None]:
num_columns = ['duration', 'campaign', 'pdays', 'previous']

fig, axs = plt.subplots(2, 2, sharex=False, sharey=False, figsize=(20, 15))

counter = 0
for num_column in num_columns:
    
    trace_x = counter // 2
    trace_y = counter % 2
    
    axs[trace_x, trace_y].hist(df_data[num_column])
    
    axs[trace_x, trace_y].set_title(num_column)
    
    counter += 1

plt.show()

In [None]:
j_df = pd.DataFrame()

j_df['yes'] = df_data[df_data[target_col] == 'yes']['marital'].value_counts()
j_df['no'] = df_data[df_data[target_col] == 'no']['marital'].value_counts()

j_df.plot.bar(title = 'Marital status and deposit')

## Tracking experiments with SageMaker and MLflow integration
You can [manage machine learning experiments using Amazon SageMaker with MLflow](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html) and use the full functionality of MLflow with a centrally managed MLflow server.

[MLflow tracking](https://mlflow.org/docs/latest/tracking.html) lets you automatically track the inputs, parameters, configurations, and models of your iterations as `experiments` and `runs`. See [MLflow concepts](https://mlflow.org/docs/latest/tracking.html#concepts) to understand how runs and experiments are organized.

In [None]:
if file_source == "local":
    # Check that the MLflow server is in the status 'Created' or 'Started'
    sm = boto3.client("sagemaker")
    
    while sm.describe_mlflow_tracking_server(TrackingServerName=mlflow_name)['TrackingServerStatus'] not in ['Created', 'Started']:
        print(f"The MLflow server {mlflow_name} is not in the status 'Created' or 'Started'")
        sleep(30)
    else:
        print(f"Using server {mlflow_name}")

In [None]:
mlflow.set_tracking_uri(mlflow_arn)

In [None]:
experiment_suffix = strftime('%d-%H-%M-%S', gmtime())
experiment_name = f"from-idea-to-prod-experiment-{experiment_suffix}"
registered_model_name = f"from-idea-to-prod-experiment-model-{experiment_suffix}"

In [None]:
%store experiment_name

In [None]:
experiment = mlflow.set_experiment(experiment_name=experiment_name)

## Feature engineering

As an example, the following code implements this feature engineering:
1. Create a new column called `no_previous_contact`. Set its value to `1` when `pdays` is `999` and to `0` otherwise
1. Create a new column called `not_working` that flags customers who are not actively employed, based on the `job` column
1. Remove the economic features from the dataset, because they would need to be forecasted with high precision to be used as features at inference time
1. Remove `duration`, because it is not known before a call is performed
1. Bin `age` into ranges and replace it with indicator columns
1. Scale `pdays`, `previous`, and `campaign` to the [0, 1] range with `MinMaxScaler`
1. Convert categorical variables to numeric using **one-hot encoding**
1. Move the target column `y` to the front

In a real-world project, you implement additional processing, data quality handling, and feature engineering. You also go through multiple trial-and-error iterations.

In [None]:
# Indicator variable to capture when pdays takes a value of 999
df_data["no_previous_contact"] = np.where(df_data["pdays"] == 999, 1, 0)

# Indicator for individuals not actively employed
# np.isin replaces np.in1d, which is deprecated in recent NumPy versions
df_data["not_working"] = np.where(
    np.isin(df_data["job"], ["student", "retired", "unemployed"]), 1, 0
)

# remove unnecessary data
df_model_data = df_data.drop(
    ["duration", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"],
    axis=1,
)


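# Use left-closed bins so that each label matches its interval, for example '30-39' covers [30, 40)
# Ages outside the bin range get no age indicator
bins = [18, 30, 40, 50, 60, 70, 90]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70-plus']

df_model_data['age_range'] = pd.cut(df_model_data.age, bins, labels=labels, right=False, include_lowest=True)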
df_model_data = pd.concat([df_model_data, pd.get_dummies(df_model_data['age_range'], prefix='age', dtype=int)], axis=1)
df_model_data.drop('age', axis=1, inplace=True)
df_model_data.drop('age_range', axis=1, inplace=True)

scaled_features = ['pdays', 'previous', 'campaign']
df_model_data[scaled_features] = MinMaxScaler().fit_transform(df_model_data[scaled_features])

df_model_data = pd.get_dummies(df_model_data, dtype=int)  # Convert categorical variables to sets of indicators

# Replace "y_no" and "y_yes" with a single label column, and bring it to the front:
df_model_data = pd.concat(
    [
        df_model_data["y_yes"].rename(target_col),
        df_model_data.drop(["y_no", "y_yes"], axis=1),
    ],
    axis=1,
)

In [None]:
df_model_data

In [None]:
df_model_data.describe()

## Split data

[SageMaker XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#InputOutput-XGBoost) expects data in the libSVM or CSV formats, with:

- The target variable in the first column, and
- No header row
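
As a rough illustration of that layout, the following sketch checks a CSV snippet for the two properties above. The helper name `looks_like_xgboost_csv` and the sample rows are hypothetical, and the check assumes a binary 0/1 target, as in this notebook.

```python
import csv
import io


def looks_like_xgboost_csv(text: str) -> bool:
    """Return True if every row is fully numeric and starts with a 0/1 label."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return False
    for row in rows:
        try:
            values = [float(field) for field in row]
        except ValueError:
            # A non-numeric field usually means a header row or an unencoded category
            return False
        if not values or values[0] not in (0.0, 1.0):
            return False
    return True


# Hypothetical sample: target in the first column, no header row
good = "1,0.5,0,1\n0,0.1,1,0\n"
# Same data with a header row, which the CSV input format does not allow
bad = "y,pdays,previous,campaign\n1,0.5,0,1\n"

print(looks_like_xgboost_csv(good))
print(looks_like_xgboost_csv(bad))
```

A check like this can be run on the first few lines of `train.csv` before handing the file to a training job.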

In [None]:
# Shuffle and split the dataset into train (70%), validation (20%), and test (10%) sets
train_data, validation_data, test_data = np.split(
    df_model_data.sample(frac=1, random_state=1729),
    [int(0.7 * len(df_model_data)), int(0.9 * len(df_model_data))],
)

print(f"Data split > train:{train_data.shape} | validation:{validation_data.shape} | test:{test_data.shape}")

In [None]:
# Save data to Studio filesystem
train_data.to_csv(os.path.join(output_path, "train.csv"), index=False, header=False)
validation_data.to_csv(os.path.join(output_path, "validation.csv"), index=False, header=False)
test_data.to_csv(os.path.join(output_path, "test.csv"), index=False, header=False)

## Model training and validation

In [None]:
train_features = train_data.drop(target_col, axis=1)
train_label = pd.DataFrame(train_data[target_col])

In [None]:
dtrain = xgb.DMatrix(train_features, label=train_label)

In [None]:
hyperparams = {
    "max_depth": 5,
    "eta": 0.5,
    "alpha": 2.5,
    "objective": "binary:logistic",
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "min_child_weight": 3,
}

num_boost_round = 150
nfold = 3
early_stopping_rounds = 10

First, run `nfold`-fold cross-validation on the training dataset: the data is split into `nfold` folds, and each fold is used once for validation while the model trains on the remaining folds.
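
To make the fold mechanics concrete, here is a minimal sketch of how `nfold` train/validation pairs can be built from row indices. The helper `kfold_indices` is hypothetical and only for illustration; `xgb.cv` creates its folds internally, and its assignment differs in detail (for example, it can shuffle rows first).

```python
def kfold_indices(n_rows: int, nfold: int) -> list[tuple[list[int], list[int]]]:
    """Split row indices 0..n_rows-1 into nfold (train, validation) pairs."""
    indices = list(range(n_rows))
    # Distribute rows so that fold sizes differ by at most one
    fold_sizes = [n_rows // nfold + (1 if i < n_rows % nfold else 0) for i in range(nfold)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    pairs = []
    for i, validation in enumerate(folds):
        # Train on every fold except the one held out for validation
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((train, validation))
    return pairs


for train_idx, val_idx in kfold_indices(n_rows=7, nfold=3):
    print(f"train={train_idx} validation={val_idx}")
```

Each row lands in exactly one validation fold, which is why the averaged validation metric uses every training row once.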

In [None]:
# Cross-validate on training data
cv_results = xgb.cv(
    params=hyperparams,
    dtrain=dtrain,
    num_boost_round=num_boost_round,
    nfold=nfold,
    early_stopping_rounds=early_stopping_rounds,
    metrics=["auc"],
    seed=10,
)

In [None]:
metrics_data = {
    "binary_classification_metrics": {
        "validation:auc": {
            "value": cv_results.iloc[-1]["test-auc-mean"],
            "standard_deviation": cv_results.iloc[-1]["test-auc-std"]
        },
        "train:auc": {
            "value": cv_results.iloc[-1]["train-auc-mean"],
            "standard_deviation": cv_results.iloc[-1]["train-auc-std"]
        },
    }
}

In [None]:
print(f"Cross-validated train-auc:{cv_results.iloc[-1]['train-auc-mean']:.2f}")
print(f"Cross-validated validation-auc:{cv_results.iloc[-1]['test-auc-mean']:.2f}")

In [None]:
cv_results

Now retrain the model on the full training dataset instead of splitting the training dataset across a number of folds. Use the test dataset for early stopping.
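
The early-stopping rule can be sketched in plain Python as follows. The helper `best_round_with_early_stopping` and the sample AUC values are hypothetical; the sketch assumes a metric where higher is better and is not XGBoost's actual implementation.

```python
def best_round_with_early_stopping(scores: list[float], early_stopping_rounds: int) -> tuple[int, int]:
    """Return (best_round, rounds_run) for a metric where higher is better."""
    best_round, best_score = 0, float("-inf")
    for current_round, score in enumerate(scores):
        if score > best_score:
            best_round, best_score = current_round, score
        elif current_round - best_round >= early_stopping_rounds:
            # No improvement for `early_stopping_rounds` consecutive rounds: stop
            return best_round, current_round + 1
    return best_round, len(scores)


# Hypothetical evaluation AUC per boosting round
eval_auc = [0.70, 0.75, 0.78, 0.77, 0.78, 0.76, 0.79]
print(best_round_with_early_stopping(eval_auc, early_stopping_rounds=3))  # (2, 6)
```

In this example, training stops after six rounds because the metric did not improve on round 2 for three rounds in a row, so the later value of 0.79 is never reached. This is the trade-off that `early_stopping_rounds` controls.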

In [None]:
test_features = test_data.drop(target_col, axis=1)
test_label = pd.DataFrame(test_data[target_col])
dtest = xgb.DMatrix(test_features, label=test_label)

### Create an experiment run
Create a new run using the [`mlflow.start_run()`](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.start_run) API and call the [`log_params()`](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_params) and [`log_artifact()`](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_artifact) MLflow logging functions to record information into the run. Note that `mlflow.log_artifact()` uploads a local file to the MLflow artifact store under the S3 URI that you specified when you created the MLflow server.

You can also use the [`log_input()`](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.log_input) method to persistently log a dataset to the MLflow artifact store.

In [None]:
run_suffix = strftime('%d-%H-%M-%S', gmtime())

with mlflow.start_run(
    run_name=f"feature-engineering-{run_suffix}",
    description="feature-engineering in the notebook 01 ideation") as run:
    mlflow.log_params(
        {
            "train": 0.7,
            "validate": 0.2,
            "test": 0.1
        }
    )
    # Log input dataset metadata and output
    mlflow.log_artifact(local_path=os.path.join(input_path, file_name))
    mlflow.log_artifact(local_path=os.path.join(output_path, "train.csv"))
    mlflow.log_artifact(local_path=os.path.join(output_path, "validation.csv"))
    mlflow.log_artifact(local_path=os.path.join(output_path, "test.csv"))

### Train a model
Use [MLflow model flavor](https://mlflow.org/docs/latest/python_api/index.html#python-api) and [logging functions](https://mlflow.org/docs/latest/tracking/tracking-api.html#tracking-logging-functions) to log parameters, model, model metrics, and various metadata in your experiment runs.

In [None]:
# In production code, use unique IDs for run names
run_suffix = strftime('%d-%H-%M-%S', gmtime())
max_metric = 0.0
best_model_run_id = None

with mlflow.start_run(
    run_name=f"training-{run_suffix}",
    description="Fit estimator with different max_depth") as parent_run:
    mlflow.set_tags({'mlflow.user':user_profile_name})
    
    # Train the model for different max_depth values
    for i, d in enumerate([2, 5, 10, 15, 20]):
        hyperparams["max_depth"] = d
        print(f"Fit estimator with max_depth={d}")
    
        with mlflow.start_run(
            run_name=f"max_depth={d}",
            description=f"Fit estimator with max_depth={d}",
            nested=True) as child_run:
            mlflow.set_tags({'mlflow.user':user_profile_name})
            
            mlflow.xgboost.autolog(log_model_signatures=False, log_datasets=False)
            
            # Train the model
            model = xgb.train(
                params=hyperparams, 
                dtrain=dtrain, 
                evals = [(dtrain,'train'), (dtest,'eval')], 
                num_boost_round=num_boost_round, 
                early_stopping_rounds=early_stopping_rounds, 
                verbose_eval = 0
            )
    
            # Calculate metrics
            test_auc = roc_auc_score(test_label, model.predict(dtest))
            train_auc = roc_auc_score(train_label, model.predict(dtrain))
            
            # Log hyperparameters and metrics to the run
            mlflow.log_params(hyperparams)
            mlflow.log_metrics({"test_auc":test_auc, "train_auc":train_auc}, step=i)
    
            if test_auc > max_metric:
                best_model_run_id = child_run.info.run_id
                max_metric = test_auc
    
            print(f"Test AUC: {test_auc:.4f} | Train AUC: {train_auc:.4f}")

### Register the model in the MLflow model registry
Use [MLflow model registry](https://mlflow.org/docs/latest/model-registry.html#adding-an-mlflow-model-to-the-model-registry) to register a model.
In this example, we use the [mlflow.register_model()](https://mlflow.org/docs/latest/python_api/mlflow.html#mlflow.register_model) API to register the best model as a new version under the registered model name. `mlflow.register_model()` also automatically registers the model with the SageMaker Model Registry: when you register the MLflow model, a corresponding model package group and model package version are created in SageMaker.

In [None]:
# mlflow.xgboost.autolog() logs the trained model under the "model" artifact path
model_uri = f"runs:/{best_model_run_id}/model"
mv = mlflow.register_model(model_uri, registered_model_name)

## Explore experiments with the MLflow UI
As a starting point, you can access all experiments in the Studio UI in the **SageMaker Home** > **Experiments** widget. 

For example, select your experiment:

![](img/experiments-studio.png)

The experiment is opened in the MLflow UI in a new browser window:

![](img/experiment-mlflow.png)

You can select the runs you would like to analyze and click **Compare**. In the comparison window, you can analyze the metrics and parameters logged in your runs:

![](img/comparing-runs.png)

Switch to the **Models** tab to see the registered models and all their versions:

![](img/models-mlflow.png)

Refer to the [Announcing the general availability of fully managed MLflow on Amazon SageMaker](https://aws.amazon.com/blogs/aws/manage-ml-and-generative-ai-experiments-using-amazon-sagemaker-with-mlflow/) launch blog post and [Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-track-experiments.html) for more examples and details on SageMaker Experiments with MLflow.

## Optional: Run the notebook as a SageMaker job
Sometimes you might want to run your notebooks as non-interactive, scheduled jobs. Studio provides fast and simple tools built on the existing Amazon EventBridge, SageMaker Training, and SageMaker Pipelines services to help you schedule your notebook jobs interactively. You don’t have to craft your own custom solution or enlist features from other services that may require additional overhead in time and costs to deploy.

You can run your notebook as a SageMaker job on demand or on any schedule you choose. You can also run multiple notebooks in parallel and parametrize cells in your notebooks.

### Adapt the notebook to run headlessly
A headless notebook runs in a shell outside of the Studio environment. Therefore, your code in the notebook cannot depend on or access the Studio local storage, environment variables, or Python store. You must accordingly change any code which uses the local Studio environment.
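
One way to catch such dependencies early is to check which variables the notebook expects from the Python store and fail with a clear message when they are absent. The following is a minimal sketch; the helper `missing_names` and the sample namespace are hypothetical, and in a notebook you would pass `globals()` as the namespace.

```python
def missing_names(required: list[str], namespace: dict) -> list[str]:
    """Return the required variable names that are not defined in the namespace."""
    return [name for name in required if name not in namespace]


# Names this notebook reads from the Python store when run interactively
required = ["bucket_name", "bucket_prefix", "mlflow_arn"]

# Hypothetical headless namespace in which only one value was passed as a parameter
headless_namespace = {"mlflow_arn": "<SET TO YOUR MLFLOW SERVER ARN>"}

missing = missing_names(required, headless_namespace)
if missing:
    print(f"Pass these values as job parameters: {missing}")
```

The remaining names then need to be supplied as job parameters instead of being restored with `%store -r`.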

### How to run
Follow the instructions in [Notebook-based Workflows](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run.html) in the Developer Guide to run this notebook in non-interactive mode as a SageMaker job:
1. [Configure](https://docs.aws.amazon.com/sagemaker/latest/dg/scheduled-notebook-policies.html) the trust policy and additional IAM permissions for the Studio execution role. If you run this notebook in the domain in the AWS-preprovisioned account, the required permissions are automatically deployed
2. Provide the parameters as specified below
3. Run the notebook on-demand or schedule a job
4. Explore the results

### Set parameters

In [None]:
# output the name of the S3 bucket used by SageMaker – you need this value as bucket_name parameter
print(bucket_name)

In [None]:
# output MLflow server ARN
print(mlflow_arn)

In [None]:
# If running interactively, upload data to S3 to have it here for a headless run
if file_source == 'local':
    input_s3_url = session.upload_data(
        path=os.path.join(input_path, file_name),
        bucket=bucket_name,
        key_prefix=f"{bucket_prefix}/input"
    )
    
    print(input_s3_url)

To parameterize your notebook, you [set](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-auto-run-troubleshoot-override.html) the tag `parameters` on a single cell in your notebook, which marks it as the "parameter cell". At runtime, SageMaker notebook execution inserts a new generated cell directly after the cell tagged with `parameters`. The generated cell contains code that sets the parameters to the values you specify when you start an execution job.
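
The effect can be pictured as two consecutive cells, where the second one reassigns the defaults from the first. The sketch below is illustrative only: the exact content of the generated cell is an assumption, and the values are the ones suggested later in this section.

```python
# Cell tagged `parameters`: defaults used for interactive runs
file_source = "local"
input_path = "./data/bank-additional"
output_path = "./data"

# Generated cell (illustrative): inserted after the tagged cell at runtime
file_source = "S3"
input_path = "/opt/ml/input/data/sagemaker_headless_execution"
output_path = "/opt/ml/output/data"

# Every later cell sees the overridden values
print(file_source, input_path, output_path)
```

Because the override happens by plain reassignment, only variables defined in the tagged cell should be treated as defaults; anything assigned in later cells would overwrite the job parameters again.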

The notebook execution job has no access to the JupyterLab EBS volume. Any data you need to pass to the notebook must be copied to an S3 bucket, where the notebook can access it.

To run this notebook as a SageMaker job, choose the **Create a notebook job** icon in the notebook taskbar: 

![](img/notebook-as-sm-job-run.png)

Set the following parameters to the specified values in the **Parameters** section of the form:

![](img/notebook-as-sm-job-parameters.png)

Parameters and values:

```
mlflow_arn = <SET TO YOUR MLFLOW SERVER ARN>
file_source = S3
input_path = /opt/ml/input/data/sagemaker_headless_execution 
output_path = /opt/ml/output/data
bucket_name = <SET TO YOUR SAGEMAKER BUCKET NAME>
bucket_prefix = from-idea-to-prod/xgboost
```

Select **Run now** or **Run on a schedule** and choose **Create**.

You can also [create a notebook job programmatically with SageMaker Python SDK](https://docs.aws.amazon.com/sagemaker/latest/dg/create-notebook-auto-run-sdk.html). 

---

## Continue with step 2
Open the step 2 [notebook](02-sagemaker-containers.ipynb).

## Further development ideas for your real-world projects
- Try different models, for example some of the [SageMaker built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html), such as [CatBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/catboost.html), [AutoGluon-Tabular](https://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular.html), or [Linear Learner Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html)
- Try [SageMaker Autopilot](https://aws.amazon.com/sagemaker/autopilot/) to automatically explore different solutions to find the best model. Refer to this hands-on tutorial: [Automatically Create Machine Learning Models](https://aws.amazon.com/getting-started/hands-on/machine-learning-tutorial-automatically-create-models/)
- Implement batch inference using [SageMaker Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html)

## Additional resources
- [Build and Train a Machine Learning Model Locally](https://aws.amazon.com/getting-started/hands-on/machine-learning-tutorial-build-model-locally/)
- [Amazon SageMaker XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html)
- [Automatically Create Machine Learning Models](https://aws.amazon.com/getting-started/hands-on/machine-learning-tutorial-automatically-create-models/)
- [Operationalize your Amazon SageMaker Studio notebooks as scheduled notebook jobs](https://aws.amazon.com/blogs/machine-learning/operationalize-your-amazon-sagemaker-studio-notebooks-as-scheduled-notebook-jobs/)
- [Dataset transformations](https://scikit-learn.org/stable/data_transforms.html)
- [Extracting, transforming and selecting features](https://spark.apache.org/docs/latest/ml-features.html)

# Shutdown kernel

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>