# Today you are a Machine Learning Engineer at Walmart!
For your latest assignment, your manager has asked your team to work alongside the Data Engineering team to design an optimal Machine Learning pipeline for Automated Inventory, i.e. accurately predicting purchases per customer. The Data Engineering team has provided you with some clean online shopping data gathered from the servers over the last few months.

You are expected to successfully complete the following tasks:


1. EDA - Feature Selection
2. Build, and evaluate, a Classifier using Sagemaker


# Table of Contents
1. **Introduction**
    1. Imports
    2. Loading the Data
2. **Task 1: Perform Exploratory Data Analysis (EDA)**
    1. Method 1: RandomForestClassifier Feature Importance
    2. Method 2: SelectKBest
    3. Trimming the Data and EDA Cont.
3. **Task 2: Build a classifier model training pipeline with Sagemaker**
    1. Step 1: Write Data to S3  
    2. Convenience Functions
    3. Step 2: Initiate model training pipeline
        1. *Option 1*
        2. *Option 2*
    

## Introduction <a name="introduction"></a>

<a name="Imports"></a>
### Imports
Your first step should always be: Importing required libraries!

In [39]:
import io
import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.feature_selection as fs

import boto3
import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
from sagemaker.image_uris import retrieve as retrieve_image_uris

<a name="LoadingData"></a>
### Loading the data

Since our data engineers have provided us with clean data by means of a `.csv` file, we can simply read it into a `pd.DataFrame`!

> **A note on data and SageMaker:**
>
>  One of the benefits of using a tool like SageMaker, especially in conjunction with S3 and other remote storage options, is that you never have to store data on your local machine. This helps with data privacy considerations, and it is a major benefit of a remote, cloud-based ML solution.

In [None]:
### START CODE HERE ###

# read the data from the .csv file into a pandas dataframe
raw_data = None

### END CODE HERE ###
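
As a rough illustration of reading CSV data into a DataFrame, the sketch below parses a tiny in-memory CSV with `pd.read_csv`. The column names `clicks` and `purchased` and all values are made up for the example; the real file path and schema come from the lab data.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file
csv_text = "user_id,product_id,clicks,purchased\n1,10,3,0\n2,11,7,1\n"

# pd.read_csv accepts a path or any file-like object
demo_df = pd.read_csv(io.StringIO(csv_text))
print(demo_df.shape)
```

With the real data you would pass the path to the `.csv` file instead of the `StringIO` object.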

In [None]:
print(f"The data has {raw_data.shape[0]} rows and {raw_data.shape[1]} columns")
raw_data.head()

As you can see, the Data Engineering team has sent over ~1M records corresponding to users and their interactions with products. Notice two things:
1. All features are numeric (so the data has already been one-hot encoded).
2. The last column indicates whether the user-product interaction resulted in a purchase.


This kind of data is referred to as a "user journey"!

## Task 1: Perform Exploratory Data Analysis (EDA) to find features that are important
You can choose any one of (or both of) the two methods given below.


1. [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) feature importance, which ranks features by their mean decrease in Gini impurity (an impurity-based measure, related to but distinct from entropy)

2. [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html) method using the sklearn feature_selection library


Observe how the two methods rank the features differently!

Separate the `raw_data` DataFrame into XData (minus the two 'index columns' of `user_id` and `product_id`) and YData

In [7]:
### START CODE HERE ###

# separate the data into features and labels (remember that we do not need the index columns)
XData = None
YData = None

### END CODE HERE ###
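
One possible way to separate features from labels is sketched below on a toy DataFrame. Only `user_id` and `product_id` come from the lab text; the `clicks` and `purchased` columns are hypothetical stand-ins, and the sketch assumes the label is the last column.

```python
import pandas as pd

# Toy frame; `clicks` and `purchased` are hypothetical column names
demo_df = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "product_id": [10, 11, 12],
        "clicks": [3, 7, 1],
        "purchased": [0, 1, 0],
    }
)

label_col = demo_df.columns[-1]  # assumes the label is the last column
features_demo = demo_df.drop(columns=["user_id", "product_id", label_col])
labels_demo = demo_df[label_col]
print(list(features_demo.columns), labels_demo.tolist())
```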

Split the data into training and test data, stratifying on YData to ensure equal label proportions in the training and test sets

In [8]:
### START CODE HERE ###

# split the data into training and testing sets using a split of 0.3 for the testing set
X_train, X_test, y_train, y_test = None

### END CODE HERE ###
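
To see what stratification does, here is a minimal sketch on synthetic data (not the lab data). It assumes a 20% positive rate and shows that `train_test_split` with `stratify` keeps that rate in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 20 positives and 80 negatives
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

# Both fractions should match the overall positive rate
print("train positives:", y_tr.mean())
print("test positives: ", y_te.mean())
```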

Next, you'll want to scale the data using a [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html?highlight=minmax#sklearn.preprocessing.MinMaxScaler)

In [10]:
### START CODE HERE ###

# scale the data using a MinMaxScaler
# be careful not to leak information
# about the test data into the training data
MMscaler = None
X_train = None
X_test = None

### END CODE HERE ###
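
The leakage concern can be illustrated with a small sketch, assuming made-up numbers: the scaler learns its minimum and maximum from the training rows only, and the test rows are transformed with those same statistics. A test value outside the training range then lands outside `[0, 1]`, which is expected.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values for illustration
train_toy = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test_toy = np.array([[2.5, 40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_toy)  # learns min/max from training rows only
test_scaled = scaler.transform(test_toy)        # reuses the training min/max

print(train_scaled)
print(test_scaled)  # second value exceeds 1 because 40 is above the training max
```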

#### Method 1: RandomForestClassifier Feature Importance

Initialize and fit a RandomForestClassifier estimator for use in finding feature importance

In [11]:
### START CODE HERE ###

# create a SelectFromModel object to select the most important features from the RandomForestClassifier
sel = None

#fit the model
None

### END CODE HERE ###


Now, you'll want to get the feature importances from the estimator.

In [None]:
### START CODE HERE ###

# extract all of the importances from the model
importances = None

### END CODE HERE ###
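
As a sketch of how impurity-based importances behave, the example below fits a `SelectFromModel` wrapper around a small random forest on synthetic data where only the first column determines the label. The data, the `n_estimators=50` setting, and the variable names are assumptions for illustration, not the lab's expected solution.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X_toy = rng.random((300, 4))
y_toy = (X_toy[:, 0] > 0.5).astype(int)  # label depends on column 0 only

selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))
selector.fit(X_toy, y_toy)

# The fitted forest is stored on the selector as `estimator_`
toy_importances = selector.estimator_.feature_importances_
toy_ranking = np.argsort(toy_importances)[::-1]  # most important first

print(toy_importances)
print(selector.get_support())  # boolean mask of features above the default threshold
```

The importances sum to 1, so each value can be read as that feature's share of the total impurity reduction.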

Plot the selected features by importance in descending order

In [None]:
indices = np.argsort(importances)[::-1] 
colname = XData.columns[indices]
plt.figure(figsize=(15,9))
plt.title("Feature importances",size=20)
sns.barplot(x=colname, y=importances[indices],palette="deep")
plt.xticks(rotation=90,size=20)
plt.show()

#### Method 2: SelectKBest

Next you'll want to grab the `k = num features` best features, fit SelectKBest and get a list of feature names and scores!

In [None]:
### START CODE HERE ###

# use SelectKBest to select all of the features and get their importances
kb = None

# fit kb on the training data

# extract the names, and scores from kb
names = None
scores = None

### END CODE HERE ###

names_scores = list(zip(names, scores))
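
For comparison, here is a minimal `SelectKBest` sketch on synthetic data. It assumes the ANOVA F-test (`f_classif`) as the scoring function, which is scikit-learn's default; the column names `a`, `b`, `c` are made up.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
toy_df = pd.DataFrame(rng.random((300, 3)), columns=["a", "b", "c"])
toy_y = (toy_df["b"] > 0.5).astype(int)  # label depends on column "b" only

# k="all" keeps every feature so that each one receives a score
kbest = SelectKBest(score_func=f_classif, k="all").fit(toy_df, toy_y)

toy_scores = dict(zip(toy_df.columns, kbest.scores_))
best = max(toy_scores, key=toy_scores.get)
print(toy_scores)
print("highest F-score:", best)
```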

Plot the selected features by importance in descending order

In [17]:
fScoreDF = pd.DataFrame(data = names_scores, columns=['Feat_names','F_Scores'])
fScoreDF_sorted = fScoreDF.sort_values(['F_Scores','Feat_names'], ascending =[False, True])
plt.figure(figsize=(15,9))
sns.barplot(x= "Feat_names", y="F_Scores",data=fScoreDF_sorted)
plt.xticks(rotation=90,size=20)
plt.show()

### Trimming the Data and EDA Cont.
Regardless of which feature importance method you select, you'll find that the temporal data provided does not contribute significantly to the result, so we can safely remove those columns.

In [19]:
### START CODE HERE ###

# create a new dataframe with only the most important features
X_train_1= None
X_test_1= None

### END CODE HERE ###

Next up, let's view the distribution of results

In [None]:

plt.hist(y_train)
plt.hist(y_test)
plt.show()
print("Fraction of Purchases in train data=", np.sum(y_train)/np.shape(y_train)[0])
print("Fraction of Purchases in test data=", np.sum(y_test)/np.shape(y_test)[0])

## Task 2: Build a classifier model training pipeline with Sagemaker.

Finally, the time has come to build your model training pipeline!

> **A note on AutoML and Autopilot**
> 
> A key feature of SageMaker is SageMaker Autopilot, an automated-ML solution that can help you find and tune hyperparameters, and a lot more! These features are an excellent tool in your toolbelt, but they are also very compute- and time-intensive, which puts them outside the scope of this lab.

### Step 1: Write Data to S3  

It's time for you to convert the processed training data to protobuf and write it to S3 for a [linear-learner](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html) model pipeline using Sagemaker. 

Ensure the `bucket_new` variable is the S3 bucket you made earlier in the lab. (FIRSTNAMEmlops)

In [22]:
### START CODE HERE ###

bucket_new = None

# get an execution role for the model
role = None

### END CODE HERE ###

prefix = 'sagemaker/ecommerce'
s3_train_key = "{}/train/recordio-pb-data".format(prefix)
s3_train_path = os.path.join("s3://", bucket_new, s3_train_key)
vectors = np.array([t.tolist() for t in X_train_1]).astype("float32")
labels = np.array(y_train).astype("float32")
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, vectors, labels)
buf.seek(0)
boto3.resource("s3").Bucket(bucket_new).Object(s3_train_key).upload_fileobj(buf)
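
The S3 key and URI used above are plain strings, so they can be checked without touching AWS. The sketch below builds the same layout with `posixpath.join` and an f-string, which always use forward slashes regardless of operating system. The bucket name is a placeholder.

```python
import posixpath

bucket_demo = "example-bucket"  # placeholder; use your own bucket name
prefix_demo = "sagemaker/ecommerce"

# The object key inside the bucket
key_demo = posixpath.join(prefix_demo, "train", "recordio-pb-data")

# The full S3 URI that a training job would read from
uri_demo = f"s3://{bucket_demo}/{key_demo}"
print(uri_demo)
```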


#### Convenience Functions

The first convenience function will be a wrapper for training that takes in the S3 location of the training data, the model hyperparameters that define our training job, and the S3 output path for model artifacts. Inside the function, we'll hardcode the algorithm container, the number and type of EC2 instances to train on, and the input and output data formats:

In [44]:
def predictor_from_hyperparams(s3_train_data, hyperparams, output_path):
    """
    Create an Estimator from the given hyperparams, fit to training data, and return a deployed predictor
    """
    # specify algorithm containers and instantiate an Estimator with given hyperparams
    container = retrieve_image_uris("linear-learner", boto3.Session().region_name)

    linear = sagemaker.estimator.Estimator(
        container,
        role,
        instance_count=1,
        instance_type="ml.m4.xlarge",
        output_path=output_path,
        sagemaker_session=sagemaker.Session(),
    )
    linear.set_hyperparameters(**hyperparams)
    # train model
    linear.fit({"train": s3_train_data})
    # deploy a predictor
    linear_predictor = linear.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")
    linear_predictor.serializer = CSVSerializer()
    linear_predictor.deserializer = JSONDeserializer()
    return linear_predictor

The second convenience function is for setting up a hosting endpoint, making predictions, and evaluating the model. To make predictions, we need to set up a model hosting endpoint. Then we feed test features to the endpoint and receive predicted test labels. To evaluate the models we create in this exercise, we'll capture predicted test labels and compare them to actual test data using some common binary classification metrics:

In [32]:
def evaluate(linear_predictor, test_features, test_labels, model_name, verbose=True):
    """
    Evaluate a model on a test set given the prediction endpoint.  Return binary classification metrics.
    """
    # split the test data set into 100 batches and evaluate using prediction endpoint
    prediction_batches = [
        linear_predictor.predict(batch)["predictions"]
        for batch in np.array_split(test_features, 100)
    ]
    # parse raw predictions json to extract predicted label
    test_preds = np.concatenate(
        [np.array([x["predicted_label"] for x in batch]) for batch in prediction_batches]
    )

    # calculate true positives, false positives, true negatives, false negatives
    tp = np.logical_and(test_labels, test_preds).sum()
    fp = np.logical_and(1 - test_labels, test_preds).sum()
    tn = np.logical_and(1 - test_labels, 1 - test_preds).sum()
    fn = np.logical_and(test_labels, 1 - test_preds).sum()

    # calculate binary classification metrics
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)

    if verbose:
        print(pd.crosstab(test_labels, test_preds, rownames=["actuals"], colnames=["predictions"]))
        print("\n{:<11} {:.3f}".format("Recall:", recall))
        print("{:<11} {:.3f}".format("Precision:", precision))
        print("{:<11} {:.3f}".format("Accuracy:", accuracy))
        print("{:<11} {:.3f}".format("F1:", f1))

    return {
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "TN": tn,
        "Precision": precision,
        "Recall": recall,
        "Accuracy": accuracy,
        "F1": f1,
        "Model": model_name,
    }
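
The confusion-matrix arithmetic inside `evaluate` can be checked by hand on a tiny example. The labels and predictions below are made up; no endpoint is involved.

```python
import numpy as np

labels_toy = np.array([1, 1, 1, 0, 0, 0, 0, 0])
preds_toy = np.array([1, 1, 0, 1, 0, 0, 0, 0])

tp = np.logical_and(labels_toy, preds_toy).sum()          # 2 purchases predicted correctly
fp = np.logical_and(1 - labels_toy, preds_toy).sum()      # 1 non-purchase flagged as purchase
tn = np.logical_and(1 - labels_toy, 1 - preds_toy).sum()  # 4 non-purchases predicted correctly
fn = np.logical_and(labels_toy, 1 - preds_toy).sum()      # 1 purchase missed

recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (tp + fp + tn + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f} precision={precision:.3f} accuracy={accuracy:.3f} f1={f1:.3f}")
```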

The last convenience function is to delete prediction endpoints after we're done with them:

In [33]:
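def delete_endpoint(predictor):
    """Delete the hosting endpoint behind a predictor."""
    # SageMaker Python SDK v2 exposes the endpoint name as `endpoint_name`
    try:
        boto3.client("sagemaker").delete_endpoint(EndpointName=predictor.endpoint_name)
        print("Deleted {}".format(predictor.endpoint_name))
    except Exception:
        print("Already deleted: {}".format(predictor.endpoint_name))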

### Step 2: Initiate model training pipeline

Next up you're going to set up a pipeline for training your [linear learner](https://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html) model. 

You have two options:
1. Option 1: A binary classifier (Logistic Regression) with automated threshold tuning
2. Option 2: A binary classifier with hinge loss, balanced class weights, and automated threshold tuning

Both options will use early stopping criteria (read more [here](https://github.com/aws/amazon-sagemaker-examples/blob/main/scientific_details_of_algorithms/linear_learner_class_weights_loss_functions/linear_learner_class_weights_loss_functions.ipynb)) with a target `Recall` of `0.8`. Because you're using early stopping, you don't have to worry as much about setting the number of epochs too high and overtraining your model. With that in mind, you can train the model for 10 epochs.
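
To build intuition for tuning a threshold against a target recall of `0.8`, the sketch below picks the highest score threshold that still reaches that recall on made-up scores and labels. It illustrates the general idea only and is not a description of how linear learner implements threshold tuning internally.

```python
import numpy as np

# Made-up model scores and true labels
scores_toy = np.array([0.95, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10])
labels_toy = np.array([1, 1, 0, 1, 1, 0, 1, 0])
target_recall = 0.8


def recall_at(threshold):
    """Recall when every score at or above `threshold` is predicted positive."""
    preds = scores_toy >= threshold
    return np.logical_and(preds, labels_toy).sum() / labels_toy.sum()


# Try thresholds from highest to lowest and keep the first one that reaches the target
candidates = np.sort(np.unique(scores_toy))[::-1]
chosen = next(t for t in candidates if recall_at(t) >= target_recall)

print("chosen threshold:", chosen, "recall:", recall_at(chosen))
```

Choosing the highest qualifying threshold keeps false positives as low as possible while still meeting the recall target.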





### Model Training Options
---
***ONLY SELECT ONE OPTION GOING FORWARD.***

---

#### Option 1

In [None]:
# [OPTION 1]: Training a binary classifier (Logistic Regression) with automated threshold tuning

### START CODE HERE ###

autothresh_hyperparams = {
    "feature_dim": None,  # replace with the number of features
    "predictor_type": None,  # replace with the appropriate predictor
    "binary_classifier_model_selection_criteria": None,  # replace with the appropriate model selection criteria
    "target_recall": None,  # replace with the desired recall
    "epochs": None,  # replace with the number of epochs
}

# make a prediction endpoint, ensure the path points to the correct bucket, and prefix
autothresh_output_path = f"s3://{None}/{None}/autothresh/output"

# call the convenience function to create a predictor with the correct parameters
autothresh_predictor = predictor_from_hyperparams(
    None, None, None
)

### END CODE HERE ###

In [None]:
# [OPTION 1:] Evaluate the model

### START CODE HERE ###

# set the predictor you created
predictors = {
    "Logistic with auto threshold": None,
         
}

# call the evaluate function with the appropriate parameters
metrics = {
    key: evaluate(None, None, None, None, False)
    for key, predictor in predictors.items()
}

### END CODE HERE ###

pd.set_option("display.float_format", lambda x: "%.3f" % x)
display(
    pd.DataFrame(list(metrics.values())).loc[:, ["Model", "Recall", "Precision", "Accuracy", "F1"]]
)

#### Option 2

In [None]:
# [OPTION 2]: Training a binary classifier with hinge loss, balanced class weights, and automated threshold tuning

### START CODE HERE ###

linear_balanced_hyperparams = {
    "feature_dim": None,  # replace with the number of features
    "predictor_type": None,  # replace with the appropriate predictor
    "loss": None,  # replace with the appropriate loss
    "binary_classifier_model_selection_criteria": None,  # replace with the appropriate model selection criteria
    "target_recall": None,  # replace with the desired recall
    "positive_example_weight_mult": None,  # replace with the appropriate positive example weight
    "epochs": None,  # replace with the number of epochs
}

# make a prediction endpoint, ensure the path points to the correct bucket, and prefix
linear_balanced_output_path = f"s3://{None}/{None}/linear_balanced/output"

# call the convenience function to create a predictor with the correct parameters
linear_balanced_predictor = predictor_from_hyperparams(
    None, None, None
)

### END CODE HERE ###
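
The idea behind hinge loss with balanced class weights can be sketched in a few lines of NumPy. The labels and raw scores are made up, and weighting positives by the negative-to-positive ratio is one common balancing scheme assumed here for illustration; it is not taken from the linear learner implementation.

```python
import numpy as np

y_toy = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1])  # 2 positives, 8 negatives
raw_scores = np.array([0.2, -1.5, -0.3, -2.0, 0.4, -1.0, -1.2, -0.8, -2.5, 1.5])

# Hinge loss uses labels in {-1, +1} and penalises margins below 1
signed = np.where(y_toy == 1, 1.0, -1.0)
hinge = np.maximum(0.0, 1.0 - signed * raw_scores)

# Assumed balancing scheme: weight each positive by (#negatives / #positives)
pos_weight = (y_toy == 0).sum() / (y_toy == 1).sum()
weights = np.where(y_toy == 1, pos_weight, 1.0)

weighted_loss = (weights * hinge).sum() / weights.sum()
print("positive weight:", pos_weight)
print("weighted hinge loss:", round(float(weighted_loss), 4))
```

With this weighting, the two positives carry the same total weight as the eight negatives, so mistakes on the rare class are not drowned out.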

In [None]:
# [OPTION 2:] Evaluate the model

### START CODE HERE ###

# set the predictor you created
predictors = {
    "Hinge with class weights": None,
    
}

# call the evaluate function with the appropriate parameters
metrics = {
    key: evaluate(None, None, None, None, False)
    for key, predictor in predictors.items()
}

### END CODE HERE ###


pd.set_option("display.float_format", lambda x: "%.3f" % x)
display(
    pd.DataFrame(list(metrics.values())).loc[:, ["Model", "Recall", "Precision", "Accuracy", "F1"]]
)

### Conclusion

Now that you've trained your model, you can report your results back to your boss!

Before you do, though, don't forget the most important step: Clean up!

### Clean Up

Make sure you delete your predictor endpoints using the `delete_endpoint` convenience function defined earlier.

Once you've completed the lab, please be sure to stop, and then delete, your Sagemaker Notebook instance - as well as remove the contents of, and delete, the S3 bucket you created!

In [None]:
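# Finally, clean up the predictor(s) you created.
# Only one of these names exists if you followed a single option,
# so look each one up and skip any that was never defined.
for name in ["autothresh_predictor", "linear_balanced_predictor"]:
    predictor = globals().get(name)
    if predictor is not None:
        delete_endpoint(predictor)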