## Introduction to training a model using data from AWS Data Exchange and an algorithm from AWS Marketplace

We tend to be drawn to certain fragrant elements, or to combinations of them. These combinations of elements play an important role in our psychophysiological activity, as explained in this [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5198031/). Although each of us has unique fragrance preferences, certain combinations of elements are widely popular. Understanding these combinations is important when you are creating products meant to appeal to the masses. A decision support system that can tell you whether the products in a product line you wish to launch contain those elements can be immensely beneficial.

Today, we will conduct an experiment to identify combinations of elements that are widely popular.

As part of our experiment, we will use a product popularity dataset containing information about products from Bath & Body Works, a popular bath products company, to train a machine learning model.

Our model would be based on two simple features:
1. Name of the product
2. Category of the product.

To train the machine learning model, we will use a third-party decision forest classification algorithm.

Note: You may use any algorithm supported by Amazon SageMaker to train a model. However, for this experiment, we will use a third-party algorithm from AWS Marketplace.


### Contents:
* [Pre-requisites](#Pre-requisites)
* [Step 1: Export data from AWS Data Exchange to an Amazon S3 bucket](#Step-1:-Export-data-from-AWS-Data-Exchange-to-an-Amazon-S3-bucket)
* [Step 2: Data analysis & Feature Engineering](#Step-2:-Data-analysis-&-Feature-Engineering)
    * [Step 2.1: Remove unnecessary features](#Step-2.1:-Remove-unnecessary-features)
    * [Step 2.2: Create the outcome variable](#Step-2.2:-Create-the-outcome-variable)
    * [Step 2.3: Feature engineer categorical columns](#Step-2.3:-Feature-engineer-categorical-columns)
* [Step 3: Train a machine learning model](#Step-3:-Train-a-machine-learning-model)
    * [Step 3.1 Set up environment](#Step-3.1-Set-up-environment)
    * [Step 3.2 Prepare input dataset](#Step-3.2-Prepare-input-dataset)
    * [Step 3.3 Train a model](#Step-3.3-Train-a-model)
    * [Step 3.4: Tune your model! (Optional)](#Step-3.4:-Tune-your-model!-(Optional))
* [Step 4: Deploy model and verify results](#Step-4:-Deploy-model-and-verify-results)
* [Step 5: Cleanup](#Step-5:-Cleanup)


#### Usage instructions
You can run this notebook one cell at a time (use Shift+Enter to run a cell).

### Pre-requisites:

You need to grant the following IAM permissions to the SageMaker execution role to run this notebook successfully:
1. "dataexchange:CreateJob"
2. "dataexchange:StartJob"
3. "dataexchange:GetJob"
4. "dataexchange:ListRevisionAssets"


This sample notebook requires a subscription to each of the following listings:
1. A dataset: [VK Retail Data Sets Trial product](https://console.aws.amazon.com/dataexchange/home?region=us-east-1#/products/prodview-gq5plolrup4va)
2. An algorithm: [Intel®DAAL DecisionForest Classification](https://aws.amazon.com/marketplace/pp/prodview-begzvcpjty3g2?qid=1569615711264&sr=0-2&ref_=srh_res_product_title)

If your AWS account has not been subscribed to these listings, here is the process you can follow:


#### Subscribe to data from AWS Data Exchange:

1. Open the [VK Retail Data Sets Trial product](https://console.aws.amazon.com/dataexchange/home?region=us-east-1#/products/prodview-gq5plolrup4va) from AWS Data Exchange console
2. Read the overview and other information such as pricing, usage, support. 
3. Choose __Continue to Subscribe__
4. If your organization agrees to the subscription terms, pricing information, and data subscription agreement, then review/update the renewal settings and choose __Subscribe__
5. Once the subscription has been successfully created (this step may take 5-10 minutes), you will find the dataset listed in the [__Subscriptions__](https://console.aws.amazon.com/dataexchange/home?region=us-east-1#/subscriptions) section of the console
6. From the [subscription page](https://console.aws.amazon.com/dataexchange/home?region=us-east-1#/subscriptions), open **Retail Data Sets (TRIAL)**, and for this use case, choose the __retail_trials-bathbodyworks__ dataset. This is the dataset we will use to train a machine learning model.


#### Subscribe to algorithm from AWS Marketplace:
1. Open the [Intel®DAAL DecisionForest Classification listing](https://aws.amazon.com/marketplace/pp/prodview-begzvcpjty3g2?qid=1569615711264&sr=0-2&ref_=srh_res_product_title) from AWS Marketplace
2. Read the **Highlights** section and then **product overview** section of the listing.
3. View **usage information** and then **additional resources**.
4. Note the supported instance types.
5. Next, click on **Continue to subscribe**.
6. Review **End user license agreement**, **support terms**, as well as **pricing information**.
7. Choose **Accept Offer** if your organization agrees with the EULA, pricing information, and support terms.

**Notes**: 
1. If the **Continue to configuration** button is active, your account already has a subscription to this listing.
2. Once you choose **Continue to configuration** and then select a region, a **Product Arn** appears. This is the algorithm ARN that you need to specify when creating a training job. However, for this notebook, the algorithm ARN is provided by the **src/algorithm_arns.py** file (imported in this notebook as `AlgorithmArnProvider`), so you do not need to specify it explicitly.

In [None]:
import sys
!{sys.executable} -m pip install gensim

Congratulations! You are now ready to import the data from AWS Data Exchange to your S3 bucket and train a machine learning model.

In [None]:
#Import necessary libraries.
import math
import re
import os
import json
import time

import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter, CategoricalParameter

import boto3
import sagemaker as sage
from sagemaker import AlgorithmEstimator
from sagemaker.amazon.amazon_estimator import RecordSet
from sagemaker import get_execution_role
from sagemaker.predictor import csv_serializer
from src.algorithm_arns import AlgorithmArnProvider

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer 
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

#Download the NLTK data (stopword list and tokenizer models) used by this notebook.
nltk.download('stopwords')
nltk.download('punkt')

In [None]:
#Define common variables.

scaler = MinMaxScaler()

#NLP specific variables
stop_words = stopwords.words('english') 
ps = PorterStemmer() 

#visualization variables
palette=sns.color_palette("RdBu", n_colors=7)

#Amazon SageMaker interaction variables
region_name = boto3.Session().region_name
bucket_name=sage.Session().default_bucket()
role = get_execution_role()


## Step 1: Export data from AWS Data Exchange to an Amazon S3 bucket

In this step, we copy the data from AWS Data Exchange into our S3 bucket. Once the subscription to the dataset has been created, you can open the [subscription](https://console.aws.amazon.com/junto/home?region=us-east-1#/subscriptions/prodview-gq5plolrup4va). Choose [retail_trials-bathbodyworks](https://console.aws.amazon.com/junto/home?region=us-east-1#/subscriptions/prodview-gq5plolrup4va/data-sets/b2ef8479168ba3d93979a779431fecd0) and note the value of **dataset_id**. Then choose the **Revisions** tab and note the value of **revision_id**.

In [None]:
#Declare Dataset specific variables
dataset_id='b2ef8479168ba3d93979a779431fecd0'
revision_id='1bb695f6aa40aa7ba0d234e849dafcf3'

assets=[]
dataexchange=boto3.client(service_name='dataexchange',region_name='us-east-1')

An asset in AWS Data Exchange is a piece of data that can be stored as an Amazon S3 object. A revision of a dataset contains one or more assets.
In this step, we will list the assets that are part of the specified revision, put them in an array, and then run a job that exports the assets to an S3 bucket.
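
If a revision holds many assets, the listing call may return its results in pages. The sketch below shows one possible way to collect every asset by following `NextToken`. The pagination handling, the `collect_asset_destinations` helper, and the `FakeDataExchange` stand-in are assumptions added for illustration and are not part of the original notebook. With a real client you would pass `boto3.client('dataexchange')` instead of the fake.

```python
def collect_asset_destinations(client, dataset_id, revision_id, bucket):
    """List every asset in a revision and pair it with a destination bucket."""
    destinations = []
    kwargs = {"DataSetId": dataset_id, "RevisionId": revision_id}
    while True:
        response = client.list_revision_assets(**kwargs)
        for asset in response["Assets"]:
            destinations.append({"AssetId": asset["Id"], "Bucket": bucket})
        token = response.get("NextToken")
        if not token:
            return destinations
        # Ask for the next page on the following iteration.
        kwargs["NextToken"] = token


class FakeDataExchange:
    """Hypothetical stand-in for the boto3 client, used only for this demo."""

    def __init__(self, pages):
        self.pages = pages

    def list_revision_assets(self, **kwargs):
        index = int(kwargs.get("NextToken", 0))
        page = {"Assets": self.pages[index]}
        if index + 1 < len(self.pages):
            page["NextToken"] = str(index + 1)
        return page


fake = FakeDataExchange([[{"Id": "a1"}, {"Id": "a2"}], [{"Id": "a3"}]])
destinations = collect_asset_destinations(fake, "dataset", "revision", "my-bucket")
print(destinations)
```

The resulting list has the same shape as the `assets` array that the export job expects under `AssetDestinations`.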

In [None]:
list_assets_response = dataexchange.list_revision_assets(
    DataSetId=dataset_id,
    RevisionId=revision_id
)
#Add asset-ids to an array
for asset in list_assets_response['Assets']:
    assets.append({'AssetId':asset['Id'],'Bucket': bucket_name})

#Next, create a job that exports the data from AWS Data Exchange to Amazon S3
create_job_response = dataexchange.create_job(
    Details={
        'ExportAssetsToS3': {
            'AssetDestinations': assets,
            'DataSetId': dataset_id,
            'RevisionId': revision_id
        }
    },
    Type='EXPORT_ASSETS_TO_S3'
)
job_id=create_job_response['Id']

#Trigger the job.
dataexchange.start_job(JobId=job_id)
#Wait while the export job runs.

max_time = time.time() + 60*60 # 1 hour
while time.time() < max_time:
    response = dataexchange.get_job(JobId=job_id)
    status = response['State']
    print('get_job_status'+": {}".format(status))
    #Stop polling once the job reaches a terminal state.
    if status in ("COMPLETED", "ERROR", "CANCELLED", "TIMED_OUT"):
        break
    time.sleep(30)

Now that data is available in S3, let us download it to our notebook instance.

## Step 2: Data analysis & Feature Engineering

In [None]:
!aws s3 cp --recursive s3://$bucket_name/'retail_trials/bathbodyworks' ./data/raw

We can see that there are two types of files in the dataset. One is a product file and another is a variants file. Let us take a look at each of these files.

In [None]:
!head data/raw/products-2018-01-15.csv

We can see that several columns in the file are useful. The number of reviews and the average rating are useful information. Also, the difference between first_discovered and last_discovered shows how long the product lasted in the market.

We can also see that the product name contains sub-category information. The sub-category would make a relevant feature.

In [None]:
!head data/raw/products-2018-01-08.csv

A quick look at another file from the dataset shows that a trend is available, i.e. whether the number of reviews or the rating went up for a product. We are not interested in the trend itself, but in how long the product stayed in the market and how popular it became. That is why we are interested only in the latest numbers available for each product.
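
Keeping only the latest numbers per product can be sketched in plain Python as follows. The sample rows and the `latest_per_product` helper are hypothetical and exist only to illustrate the idea; the notebook itself works on a pandas dataframe.

```python
from datetime import date

# Hypothetical weekly snapshots; the same product appears in two of them.
snapshots = [
    {"name": "Mahogany Apple 3-Wick Candle", "review_count": 12,
     "extract_date": date(2018, 1, 15)},
    {"name": "Mahogany Apple 3-Wick Candle", "review_count": 9,
     "extract_date": date(2018, 1, 8)},
    {"name": "Ginger Orange Hand Soap", "review_count": 4,
     "extract_date": date(2018, 1, 8)},
]


def latest_per_product(rows):
    """Keep, for each product name, the row with the most recent extract date."""
    latest = {}
    for row in sorted(rows, key=lambda r: r["extract_date"]):
        latest[row["name"]] = row  # later dates overwrite earlier ones
    return list(latest.values())


latest_rows = latest_per_product(snapshots)
for row in latest_rows:
    print(row["name"], row["review_count"])
```

Sorting by date before overwriting is what makes "last one wins" mean "newest one wins"; without the sort, the result would depend on the order in which the files happened to be read.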

Now, let us analyze a sample variants file.

In [None]:
!head data/raw/variants-2018-01-15.csv

You can see that there is additional information available in this file that we could use, such as promotional text, price, and size of the product. These features can potentially improve the model.

However, our experiment setup is simple, and we will limit our features to name and category. We will not use the variants files for our experiment.

In [None]:
!rm -rf data/raw/variants-*

Next, let us load these files into a pandas dataframe and keep the interesting attributes.
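
Each file name carries the date of the snapshot. As a sketch, the date can be pulled out of the name with a regular expression rather than by stripping a fixed path prefix, so that it does not depend on the folder the files live in. The `extract_date_from_filename` helper below is hypothetical and not part of the original notebook.

```python
import os
import re
from datetime import datetime

# Matches a date such as 2018-01-15 anywhere in the file name.
DATE_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2})")


def extract_date_from_filename(path):
    """Return the date embedded in a file name such as products-2018-01-15.csv."""
    match = DATE_PATTERN.search(os.path.basename(path))
    if match is None:
        raise ValueError("no YYYY-MM-DD date found in " + path)
    return datetime.strptime(match.group(1), "%Y-%m-%d").date()


snapshot_date = extract_date_from_filename("data/raw/products-2018-01-15.csv")
print(snapshot_date)
```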

In [None]:
#This function combines CSV files into a single dataframe and adds an additional feature 
#with name extract-date based on the date available in the file's name.
def read_csv_files_from_folder(location):
    
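    df_list = []
    #Sort the file names so that the snapshots are read in chronological order;
    #the later drop_duplicates(keep='last') step relies on the newest snapshot coming last.
    files = sorted(glob.glob(location + "/*.csv"))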
    
    #Read each file into a dataframe and then add the dataframe to the list
    for filename in files:
        df = pd.read_csv(filename, index_col=None, header=0)
        df['extract-date']=pd.to_datetime(filename.replace('data/raw/products-', '').replace('.csv', ''))
        df_list.append(df)
    
    #Concatenate the list of dataframes.
    frame = pd.concat(df_list, axis=0, ignore_index=True, sort=False)
    
    return frame

In [None]:
df=read_csv_files_from_folder('data/raw')

### Step 2.1: Remove unnecessary features 

In [None]:
df.head()

In [None]:
#Let us create a new column that indicates how long the product lasted in the market
df['to']=pd.to_datetime(df['last_discovered'])
df['from']=pd.to_datetime(df['first_discovered'])
df['lasted_for']=(df['to'] - df['from'] ).dt.days

#Let us drop all unnecessary columns
df.drop(['id','brand_name','site_url','first_discovered','last_discovered','to','from','external_id','extract-date'],axis=1,inplace=True)


In [None]:
print("Products")
print(df.shape)
print(df.dtypes)

In [None]:
#Next, let us keep only the latest stats available for a product. 
df.drop_duplicates(subset='name', keep='last',inplace=True)

#Drop rows for products that lasted 0 days and have no reviews, as the data might be erroneous.
df.drop(df[(df.lasted_for == 0) & (df.review_count==0)].index, inplace=True)

#Print percent missing values.
print("Missing data before removing the missing category data")
print((df.isna().sum()/df.shape[0])*100)

#Missing category data is ~0.074%. Let us drop the missing category data.
df.drop(df[df['category'].isna()].index, inplace=True)

df.reset_index(drop=True, inplace=True)

#Print percent missing values.
print()
print("Missing data")
(df.isna().sum()/df.shape[0])*100

The average_rating column has missing information, and we have multiple columns which indicate popularity. We will address missing data in average_rating column while creating the label column.

In [None]:
print(df.shape)

df.head()

Data looks much better! We will do the feature engineering for categorical columns after we define the outcome variable.

### Step 2.2: Create the outcome variable

Our goal is to determine popularity based on review_count, lasted_for, and average_rating, which are indicators of popularity. Let us analyze these and create an outcome variable.

#### Analyze review count for products

In [None]:
review_count_threshold_percentile=df['review_count'].quantile(0.70)
review_count_50_percentile=df['review_count'].quantile(0.50)

print(df['review_count'].quantile([0.01,.1, 0.25, .5,0.70,0.75,0.9,0.99]))
sns.set(rc={'figure.figsize':(14,1.27)})
sns.set(style="whitegrid")
bplot = sns.boxplot(x=df['review_count'],orient ='h', palette=palette)

#### Analyze how long the products lasted

In [None]:
lasted_for_threshold_percentile=df['lasted_for'].quantile(0.70)
lasted_for_50_percentile=df['lasted_for'].quantile(0.50)
print(df['lasted_for'].quantile([0.01,.1, 0.25, .5,0.70,0.75,0.9,0.99]))

bplot = sns.boxplot(x=df['lasted_for'],orient ='h',palette=palette)

#### Analyze product review ratings 

In [None]:
rating_threshold_percentile=df['average_rating'].quantile(0.70)
print(df['average_rating'].quantile([0.01,.1, 0.25, .5,0.70,0.75,0.9,0.99]))

bplot = sns.boxplot(x=df['average_rating'],orient ='h',palette=palette)

Our goal is to predict, based on the name, whether a product will become popular. Let us define popularity_status using the following rules:
1. Product is __popular__ if:

    A. The product's number of reviews is higher than that of __70%__ of the products OR   
    
    B. It lasted longer than __70%__ of the products OR
    
    C. Its rating is higher than that of __70%__ of the products, it lasted longer than __50%__ of the products, and its number of reviews is higher than that of __50%__ of the products. 
    

2. Else:
    Product is __not popular__.
    
With these criteria, the problem becomes a simple binary classification problem.
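
To make the rules above concrete, here is a small worked sketch that applies them to a few made-up products. The threshold values are hypothetical (in the notebook they come from the quantiles computed earlier), and `label_product` is an illustrative helper, not the notebook's own function. A product with a missing rating can still be popular through rule A or B, but never through rule C.

```python
import math

# Hypothetical thresholds; in the notebook these come from df[...].quantile(...).
THRESHOLDS = {
    "review_count_70": 40,
    "review_count_50": 15,
    "lasted_for_70": 300,
    "lasted_for_50": 120,
    "rating_70": 4.7,
}


def label_product(review_count, lasted_for, average_rating, t=THRESHOLDS):
    """Apply rules A, B and C in turn and return the popularity label."""
    if review_count > t["review_count_70"]:  # rule A
        return "popular"
    if lasted_for > t["lasted_for_70"]:  # rule B
        return "popular"
    rating_known = not math.isnan(average_rating)
    if (rating_known
            and average_rating >= t["rating_70"]
            and lasted_for > t["lasted_for_50"]
            and review_count > t["review_count_50"]):  # rule C
        return "popular"
    return "not_popular"


print(label_product(50, 10, float("nan")))   # many reviews, rating missing
print(label_product(20, 150, 4.8))           # qualifies through rule C
print(label_product(20, 150, float("nan")))  # same product without a rating
```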

In [None]:
#This method accepts a row and returns the outcome value based on rules defined in the 
#previous cell.
def is_popular(row):
    if  row['lasted_for'] > lasted_for_threshold_percentile:
        return 'popular'
    
    if  row['review_count'] > review_count_threshold_percentile:
        return 'popular'

    #Note that the predicates below are evaluated from left to right.
    #If the review_count and lasted_for predicates are false, and average_rating is NaN,
    #then we automatically mark the product as not being popular.
  
    if (row['review_count'] > review_count_50_percentile \
        and row['lasted_for'] > lasted_for_50_percentile \
        and not math.isnan(row['average_rating']) \
        and (row['average_rating']>=rating_threshold_percentile)):
        return 'popular'
    
    return 'not_popular'

In [None]:
df['label']=df.apply(lambda row:is_popular(row),axis=1)

df['label'].value_counts()

In [None]:
#As we have a clear outcome variable, we don't need the following three variables anymore.
#Let us drop review_count, lasted_for, and average_rating features from the dataframe.

df.drop(['review_count','lasted_for','average_rating'],axis=1,inplace=True)

In [None]:
df.head()

Next, let us perform feature engineering on categorical columns.

### Step 2.3: Feature engineer categorical columns

#### Create a product-length feature 
Let us create an additional feature that contains the number of words in the original name.

In [None]:
df['length']=df['name'].apply(lambda x:len(x.split()))

Let us visualize the data in the newly created column.

In [None]:
sns.set(rc={'figure.figsize':(14,1.27)})
sns.set(style="whitegrid")

sns.boxplot(y='label',x='length',data=df, orient ='h',order=['popular','not_popular'])

Simply by looking at the data, we can see that 75% of the popular products have names shorter than 6 words.

#### Clean the 'name' feature

In [None]:
df.head()

Based on a quick look at the dataframe, we can clearly see that the last few words repeat, which indicates the presence of a subcategory. Let us clean the name feature and then extract a subcategory.

In [None]:
num2words = {1: 'one', 2: 'two', 3: 'three', 4: 'four', 5: 'five', \
             6: 'six', 7: 'seven', 8: 'eight', 9: 'nine'}
#The following method accepts a text and performs the following tasks:
#1. Converts the text into lowercase.
#2. Replaces dashes with spaces.
#3. Removes all special characters.
#4. Creates the stem of each word using the Porter stemmer algorithm.
#5. Keeps stems that have at least 3 characters and are not stopwords.
#6. Converts the single-digit numbers 1-9 into their word representations.
def clean_text(text):
    
    text=text.lower().strip()
    
    text = re.sub('-',' ', text)

    text = re.sub('[^A-Za-z0-9 ]+','', text)
    
    tokenized_name=nltk.word_tokenize(text)
    
    clean_words=[]
    
    for word in tokenized_name:
        stem=ps.stem(word) 
        if ((stem not in stop_words) and (len(stem)>=3)):
            clean_words.append(stem)  
        elif(stem.isnumeric()):
            if  (int(stem) in num2words):
                clean_words.append(num2words[int(stem)])  
            
    return clean_words

#Let us perform a test!
#print(clean_text('Vanilla Spiced Pear Wallflowers Fragrance Refill 3-wick candle'))
#print(clean_text('Mahogany Apple 3-Wick Candle'))

In [None]:
#Let us create a new column that contains clean representations
df['name']=df['name'].apply(lambda x:clean_text(x))
df.tail()

#### Create subcategory feature 

Now that we have cleaned the 'name' column, we can create a sub-category column based on the number of occurrences of a suffix.
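
The idea of counting suffix occurrences can be illustrated on a handful of made-up names. This is a simplified sketch under stated assumptions: the sample names are hypothetical, the `count_suffixes` helper is not the notebook's function, and it uses the number of cleaned tokens rather than the `length` column. The cutoff is lowered so that the toy data produces a result.

```python
from collections import Counter

# Hypothetical cleaned (tokenized and stemmed) product names.
names = [
    ["mahogani", "appl", "three", "wick", "candl"],
    ["vanilla", "bean", "three", "wick", "candl"],
    ["ginger", "orang", "hand", "soap"],
    ["kitchen", "lemon", "hand", "soap"],
    ["rose", "candl"],
]


def count_suffixes(tokenized_names, max_words=3):
    """Count trailing 1..max_words-word suffixes, always leaving one leading word."""
    counts = Counter()
    for tokens in tokenized_names:
        longest = min(max_words, len(tokens) - 1)
        for size in range(1, longest + 1):
            counts[" ".join(tokens[-size:])] += 1
    return counts


suffix_counts = count_suffixes(names)
# Keep suffixes seen more than once (the notebook uses a cutoff of 15 instead).
frequent = sorted(s for s, n in suffix_counts.items() if n > 1)
print(frequent)
```

Suffixes such as `three wick candl` and `hand soap` survive the cutoff, while one-off suffixes like `orang hand soap` are discarded.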

In [None]:
#Based on the length of the 'name' feature, emit one or more suffixes to identify the 
#popular sub-categories.

#This function accepts dataframe and extracts sub-categories from the name.
def get_sub_categories(df):
    
    potential_sub_categories=[]
    
    for index,row in df.iterrows():
        name=row['name']
        sub_category=''
    
        if(row['length']>=4):
            potential_sub_categories.append(' '.join(name[-3:]).strip())
            potential_sub_categories.append(' '.join(name[-2:]).strip())
            potential_sub_categories.append(' '.join(name[-1:]).strip())
        
        elif(row['length']>2):
            potential_sub_categories.append(' '.join(name[-2:]).strip())
            potential_sub_categories.append(' '.join(name[-1:]).strip())
        
        elif(row['length']==2):
            potential_sub_categories.append(' '.join(name[-1:]).strip())
    
    #For this experiment, we consider only those suffixes that have occurred more than
    #fifteen times as valid subcategories.
    
    sub_category_counts=pd.Index(potential_sub_categories).value_counts()>15
    sub_categories=sub_category_counts[sub_category_counts].index
    
    #Print size and a few sample categories from the list.
    print(len(sub_categories))
    print(sub_categories)
    
    return sub_categories

In [None]:
pd.set_option('display.max_colwidth', -1)

sub_categories=get_sub_categories(df)

In [None]:
df.tail()

Based on a quick look, these look like real sub-categories. Let us remove them from the name and populate them in a separate column.

In [None]:
#Next, let us set the sub-category feature and strip the sub-category from the name. Preference is 
#given to the longest sub-category if more than one sub-category has been found in the name.

def set_sub_category_and_name(row):

    name=row['name']
    length=row['length']
    
    if(length>=4):
        sub_category = ' '.join(name[-3:]).strip()
        if sub_category in sub_categories:
            row['sub_category']=sub_category.replace(" ","_")
            row['name']= name[:-3]
            return row
    
    if (length>=3):
        sub_category = ' '.join(name[-2:]).strip()

        if sub_category in sub_categories:
            row['sub_category']=sub_category.replace(" ","_")
            row['name']= name[:-2]
            return row
    
    if (length>=2):
        sub_category = ' '.join(name[-1:]).strip()
        if sub_category in sub_categories:
            row['sub_category']=sub_category.replace(" ","_")
            row['name']= name[:-1]
            return row
    return row
#row={'name':['mahogani','appl','three','wick','candl'],'length':5}
#print(set_sub_category_and_name(row))

In [None]:
df=df.apply(lambda row: set_sub_category_and_name(row),axis=1) 
df.tail()

In [None]:
pd.set_option('display.max_colwidth', -1)

In [None]:
sns.countplot(x=df['label'],orient ='h',hue=df['category'],order=['popular','not_popular'])
sns.set(rc={'figure.figsize':(14,8.27)})


Here is a quick analysis based on the data:
1. We can clearly see that specific categories such as __Body care__ do predictably better than __Home fragrance__. 
2. We can also see that __Gifts__ category is less popular which might be due to seasonal aspect associated with it. 
3. We can also see that __Hidden categories__ do somewhat well.
4. We can also see that __Hand Soaps__ do not seem to perform that well.

In [None]:
df.groupby('label').describe()

We can see that products with shorter names tend to do better. 

In [None]:
print("Missing data")
print((df.isna().sum()/df.shape[0])*100)

df=df.fillna("None")
print((df.isna().sum()/df.shape[0])*100)

In [None]:
df.head()

In [None]:
#Let us perform one-hot encoding for category and sub_category features
df = pd.concat([df,pd.get_dummies(df['category'], prefix='category')],axis=1)
df = pd.concat([df,pd.get_dummies(df['sub_category'], prefix='sub_category')],axis=1)

# drop original columns
df.drop(['sub_category','category'],axis=1, inplace=True)

We will generate embeddings for product names using machine learning. Before we do so, let us split the data into two sets: train and test data.

In [None]:
#Let us split the dataset into train and test
np.random.seed(0)

#Shuffle the dataset.
df=df.sample(frac=1)

train, test = train_test_split(df, test_size=0.2)

Next, we will create embeddings for the name column using the training dataset. Embedding is a language modelling technique that maps words or phrases to vectors of real numbers.

In [None]:
#Create embeddings for name using train dataset.
labels={"popular":1,"not_popular":0}
documents =[]
for index,row in train.iterrows():
    name=row['name']
    documents.append(TaggedDocument(words=name, tags=[labels.get(row['label'])]))

model = Doc2Vec(vector_size=12, min_count=4, window=2,dbow_words=1)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)


In [None]:
print(model.infer_vector(['blossom','shea','butter']))
print(model.wv.most_similar(['orang']))

In [None]:
#This function sets embeddings for the row in the dataframe
def set_name_embeddings(row):
    #Extract the embedding for this row's cleaned name (a list of tokens)
    embedding= model.infer_vector(row['name'])
    i=0
    #Set the column with value based on the embedding
    for entry in embedding:
        row['name_'+str(i)]=entry
        i=i+1
    return row

In [None]:
train=train.apply(lambda row: set_name_embeddings(row),axis=1) 
test=test.apply(lambda row: set_name_embeddings(row),axis=1) 

Now that we have embeddings, we don't need the original "name" feature anymore. Let's drop it.

In [None]:
train.drop(['name',],axis=1, inplace=True)
test.drop(['name',],axis=1, inplace=True)

In [None]:
le = LabelEncoder()

#Add class as the label-encoded column to the dataframe and drop the original label column.
train['class'] = le.fit_transform(train['label'])
test['class'] = le.transform(test['label'])

train.drop(['label'],axis=1, inplace=True)
test.drop(['label'],axis=1, inplace=True)

list(le.classes_)

## Step 3: Train a machine learning model

Now that our dataset is ready and all columns are in numeric format, we are ready to train a machine learning model. 

### Step 3.1 Set up environment

In [None]:
#Import necessary libraries and initialize variables
sagemaker_session = sage.Session()
bucket=sagemaker_session.default_bucket()
role = get_execution_role()
output_location = 's3://{}/{}'.format(bucket, 'output')

### Step 3.2 Prepare input dataset

In [None]:
#Let us save the training data as a CSV file and upload it to an S3 bucket.
file='data/training.csv'

np.savetxt(file,train,delimiter=',')

data=sagemaker_session.upload_data(file, bucket=bucket, key_prefix='data_file')

### Step 3.3 Train a model

In [None]:
#Define hyperparameters
hyperparameters={"nClasses": 2, \
                 "nTrees":571,\
                 "maxTreeDepth":13,\
                 "varImportance":"none",\
                 "resultsToCompute":"computeOutOfBagError"}

In [None]:
#Let us load the algorithm's ARN into a variable.
algo_arn = AlgorithmArnProvider.get_decision_forest_algorithm_arn(region_name)
algo_arn

In [None]:
#Create an estimator object for running a training job
estimator = sage.algorithm.AlgorithmEstimator(
    algorithm_arn=algo_arn,
    base_job_name="daal-decision-forest",
    role=role,
    train_instance_count=1,
    train_instance_type='ml.m5.2xlarge',
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters
)
#Run the training job.
estimator.fit({"training": data})

Since this is an experiment, you do not need to run a hyperparameter tuning job. However, if you would like to see how to tune a model trained using a third-party algorithm with Amazon SageMaker's hyperparameter tuning functionality, you can run the optional tuning step.

### Step 3.4: Tune your model! (Optional)

The algorithm's [product detail page](https://aws.amazon.com/marketplace/pp/prodview-begzvcpjty3g2) specifies that **minObservationsInLeafNode**, **maxTreeDepth**, **nTrees**, and **featuresPerNode** are supported tunable parameters. Let us specify ranges for three of these parameters and perform hyperparameter tuning.

In [None]:
hyperparameter_ranges = {'maxTreeDepth': IntegerParameter(10, 500),
                         'nTrees': IntegerParameter(100, 1000),
                         'featuresPerNode': IntegerParameter(10, 110)}

tuner = HyperparameterTuner(estimator=estimator,
                            base_tuning_job_name='decision-forest',
                            objective_metric_name='OutOfBagError',
                            objective_type='Minimize',
                            hyperparameter_ranges=hyperparameter_ranges,
                            max_jobs=50,
                            max_parallel_jobs=7)

#Uncomment the following two lines to run the hyperparameter tuning job.
#tuner.fit({'training':  data})
#tuner.wait()

## Step 4: Deploy model and verify results

Let us deploy the model for performing real-time inference.

In [None]:
predictor = estimator.deploy(1, 'ml.m4.xlarge', serializer=csv_serializer)

Calculate metrics such as accuracy on the test dataset.
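
Accuracy alone can hide how the model behaves on each class, so precision and recall are worth a look as well. The sketch below computes all three by hand on made-up labels. The `confusion_counts` and `summarize` helpers and the toy data are hypothetical. It treats class 1 as the positive class, on the assumption that `LabelEncoder` sorted the labels so that `not_popular` became 0 and `popular` became 1.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def summarize(actual, predicted, positive=1):
    """Return accuracy, precision and recall as a dict."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels: 1 = popular, 0 = not_popular.
toy_actual = [1, 0, 1, 1, 0, 0, 1, 0]
toy_predicted = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = summarize(toy_actual, toy_predicted)
print(metrics)
```

In the notebook, the already imported `precision_score` and `recall_score` from scikit-learn could be called on `actual` and `predicted` for the same purpose.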

In [None]:
#Extract features from the test dataset
features_test = np.array(test.drop(["class"], axis=1)).astype("float32")

#perform prediction on the features 
prediction = predictor.predict(features_test).decode('utf-8')

#Extract predictions and put them into an Array
predicted=[]

predictions = np.fromstring(prediction, dtype=np.float64, sep=' ').reshape(1,features_test.shape[0])[0]

for pred in predictions:
    predicted.append(int(pred))

#Extract labels from the test dataset
actual = np.array(test["class"]).astype("float32")

#Print metric
print("Accuracy on test data: ", str(accuracy_score(actual, predicted)))

In [None]:
#perform_inference method accepts name and category and converts them into the 
#payload format supported by model. 
#To do so, it performs following steps:
# 1. Extracts length feature
# 2. cleans the name 
# 3. extracts sub-category and cleaned name
# 4. sets one-hot encoded category & subcategory column to 1.
# 5. Sets embeddings for the name column.
# 6. Performs prediction 
# 7. Prints the label -  popular/not_popular
def perform_inference(name,category):
    row={}
    
    row["length"]=len(name.split())
    
    row["name"]=clean_text(name)

    row['category_'+category]=1
    
    row = set_sub_category_and_name(row)
    if 'sub_category' in row:
        row['sub_category_'+row['sub_category']]=1
        del row["sub_category"]
    else:
        row["sub_category_None"]=1
    
    row=set_name_embeddings(row)
    del row["name"]
    
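    #Use the training feature columns (everything except the label) so that the payload
    #has the same columns, in the same order, as the data the model was trained on.
    df_infer = pd.DataFrame(data=None, columns=train.drop(['class'],axis=1).columns)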
    df_infer=df_infer.append(row,ignore_index=True)
    df_infer=df_infer.fillna(0)
    features = df_infer.values.astype("float64")
        
    prediction = predictor.predict(features).decode('utf-8')
    
    prediction_label = int(np.fromstring(prediction, dtype=np.float64, sep=' ')[0])

    print(le.inverse_transform([prediction_label]))

In [None]:
perform_inference("Ginger orange 3-Wick Candle","Home Fragrance")

You just used data available from AWS Data Exchange to train a machine learning model that can predict popularity of a product based on its name within the category.

## Step 5: Cleanup

In [None]:
predictor.delete_endpoint()
predictor.delete_model()

Finally, if the AWS Marketplace subscription was created just for this experiment and you would like to unsubscribe from the product, you can follow the steps listed here.
Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: You can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product in AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=lbr_tab_ml)
2. Locate the listing for which you want to cancel the subscription, and then choose __Cancel Subscription__.
