# SageMaker Endpoint Demo
Create a text classification model endpoint using [BlazingText](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html) on Amazon SageMaker. BlazingText is Amazon's implementation of [fastText](https://fasttext.cc/). The supervised version implements [Word2Vec](https://en.wikipedia.org/wiki/Word2vec) to featurize a text corpus and multinomial logistic regression to predict an input's class.

## Dependencies
* Create an [AWS account](https://aws.amazon.com) if you don't already have one
* [Create an admin user in AWS IAM](https://docs.aws.amazon.com/IAM/latest/UserGuide/getting-started_create-admin-group.html)
* ``pip install awscli --upgrade --user``
* [Configure the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html#cli-quick-configuration)
* ``pip install boto3``
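* ``pip install "sagemaker<2"`` (the code below uses version 1 SDK names such as ``get_image_uri`` and ``s3_input``, which were renamed in version 2)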
* [Create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html)

In [None]:
import pandas as pd

import boto3
import sagemaker 

from utils import clean_text

### Format training data
BlazingText requires a ``.txt`` file as input for training. Each line must start with the class label prefixed by ``__label__``, followed by a space and the text to classify: **"\__label__class_name text input to categorize"**

**Here's an example:**
```
__label__positive the service was fantastic and the pizza was to die for
__label__negative i got food poisoning here definitely not going back wish i could give 0 stars
```
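
The ``clean_text`` helper imported from ``utils`` is not shown in this notebook. As a rough sketch of the formatting step, assuming that cleaning means lowercasing, dropping punctuation, and collapsing whitespace (which matches the example lines above), something like the following would produce lines in the expected format. The names ``clean_text_sketch`` and ``to_blazingtext_line`` are hypothetical and are not part of ``utils`` or BlazingText.

```python
import re


def clean_text_sketch(text):
    """Hypothetical stand-in for utils.clean_text (the real helper may differ)."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def to_blazingtext_line(label, text):
    """Build one training line: '__label__<label> <cleaned text>'."""
    return "__label__{} {}".format(label, clean_text_sketch(text))


print(to_blazingtext_line("positive", "The service was fantastic, and the pizza was to die for!"))
# prints: __label__positive the service was fantastic and the pizza was to die for
```

Removing newlines matters here because BlazingText treats each line of the file as one training example.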

In [None]:
# Read in Yelp dataset.
df = pd.read_csv('./data/train.csv', header='infer', names=['rating', 'item_text'])

# Subset the data for purposes of this demo.
df = df.sample(frac=.1)

# Format text for BlazingText input.
df['item_text'] = df['item_text'].map(clean_text)
df['rating'] = df['rating'].astype(str)
df['data'] = '__label__' + df['rating'] + ' ' + df['item_text']

# Write data locally.
with open('./data/train.txt', 'w') as f:
    for item in df['data']:
        f.write('{}\n'.format(item))

In [None]:
# Create SageMaker session.
sess = sagemaker.Session()

### Training data to S3
Define the S3 bucket and key prefix where the training data and model output will be stored.

In [None]:
# Define the S3 bucket to use (replace with your own), along with the path for model files.
bucket = 'bwl-sage'
prefix = 'sagemaker/blaze-demo'
train_channel = prefix + '/train'
s3_train_data = 's3://{}/{}'.format(bucket, train_channel)
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)

In [None]:
# Upload the training data to S3.
sess.upload_data(path='./data/train.txt', bucket=bucket,
                 key_prefix=train_channel)

### Containers
SageMaker comes with several built-in model containers, or you can define your own.

In [None]:
# Define the AWS region.
region = boto3.Session().region_name

# Use the BlazingText built-in container.
container = sagemaker.amazon.amazon_estimator.get_image_uri(
    region, "blazingtext", "latest"
)

### SageMaker IAM roles
Create a SageMaker execution role by following the instructions [here](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role-sagemaker-notebook.html). Once the role is created, paste its ARN below.

In [None]:
# Your IAM role ARN here.
role = 'arn:aws:iam::484039584206:role/service-role/AmazonSageMaker-ExecutionRole-20190723T163578'

### Build and train the model
The purpose of this demo is to build a model API with SageMaker. Because of that, I've ignored hyperparameter tuning and model validation. With SageMaker, you can perform [Bayesian hyperparameter optimization](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html) -- check it out!

In [None]:
blazing = sagemaker.estimator.Estimator(
    container, role, train_instance_count=1,
    train_instance_type='ml.m4.xlarge', train_volume_size=5,
    train_max_run=36000, input_mode='File', output_path=s3_output_location,
    sagemaker_session=sess
)
blazing.set_hyperparameters(
    mode="supervised", epochs=5, min_count=5, sampling_threshold=.0001,
    learning_rate=.03, window_size=5, vector_dim=100, negative_samples=5,
    subwords=True
)

In [None]:
train_data = sagemaker.session.s3_input(
    s3_train_data, distribution='FullyReplicated', content_type='text/plain',
    s3_data_type='S3Prefix'
)
data_channels = {'train': train_data}

In [None]:
# Train the model.
blazing.fit(inputs=data_channels, logs=True)

In [None]:
# Deploy the model and print the endpoint.
blazing_endpoint = blazing.deploy(initial_instance_count=1, instance_type='ml.t2.medium')
print('Model deployed to endpoint: {}'.format(blazing_endpoint.endpoint))

In [None]:
import json
blazing_endpoint.predict(json.dumps({'instances': ['great food!', 'bad service']}))
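
For a BlazingText classifier, the endpoint should return JSON with one entry per input, each holding a ``label`` list and a ``prob`` list. Below is a small sketch for turning that response into (label, probability) pairs. The function name ``parse_predictions`` is hypothetical, and the sample response is made up for illustration rather than taken from a trained model.

```python
import json


def parse_predictions(raw_response):
    """Return the top (label, probability) pair for each input.

    Assumes the response looks like:
    [{"label": ["__label__5"], "prob": [0.97]}, ...]
    """
    if isinstance(raw_response, bytes):
        raw_response = raw_response.decode("utf-8")
    results = []
    for item in json.loads(raw_response):
        # Strip the '__label__' prefix that was added when formatting the training data.
        label = item["label"][0].replace("__label__", "", 1)
        results.append((label, item["prob"][0]))
    return results


# Illustrative response only; real probabilities depend on the trained model.
sample = b'[{"label": ["__label__5"], "prob": [0.97]}, {"label": ["__label__1"], "prob": [0.88]}]'
print(parse_predictions(sample))
```

In this notebook the labels are the Yelp ratings, so the parsed label is the predicted rating as a string.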

### WARNING
Leaving these resources up costs $$$! If you don't intend to leave them up, be sure to remove them by navigating to SageMaker and S3 in the AWS console and deleting the resources you've provisioned.

In [None]:
# You can delete the endpoint by uncommenting the line below and running this cell.
# You'll still need to delete your other resources -- S3, Lambda, API Gateway.

# blazing.delete_endpoint()
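
If you would rather clean up with ``boto3`` than through the console, the sketch below removes the endpoint together with its endpoint configuration and model. The function name ``delete_endpoint_resources`` is hypothetical; the client methods it calls are the standard ``boto3`` SageMaker client operations. It does not touch S3, Lambda, or API Gateway.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint plus the endpoint config and models behind it.

    sm_client is expected to be a boto3 SageMaker client,
    e.g. boto3.client("sagemaker").
    """
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    # Delete the endpoint first; it is what keeps the hosting instance running.
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
    return {"endpoint": endpoint_name, "config": config_name, "models": model_names}


# Usage (left commented out so nothing is deleted by accident):
# import boto3
# delete_endpoint_resources(boto3.client("sagemaker"), blazing_endpoint.endpoint)
```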

# Next
* **Create an IAM policy and role that allow Lambda to invoke the SageMaker endpoint**
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "sagemaker:InvokeEndpoint",
            "Resource": "*"
        }
    ]
}
```
* **Define the Lambda function -- use ``lambda_function.py`` or build your own**
* **Configure API Gateway**
* **Test the API**
```json
{
    "data": [
        "Pizza was ammaazzzzing. Def would recommend.",
        "my dog loves the beer. atmosphere was great.",
        "food was ok. horrible service. not going back.",
        "Service was way too slow!"
    ]
}
```
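
If you build your own Lambda function, a minimal handler might look like the sketch below. This is not the contents of ``lambda_function.py``. It assumes API Gateway passes the request body through as the event (a non-proxy integration), that the body has the ``data`` key shown in the test request above, and that the endpoint name is stored in a hypothetical ``ENDPOINT_NAME`` environment variable. It also skips any text cleaning, which you would likely want to match whatever was applied to the training data.

```python
import json
import os


def lambda_handler(event, context, runtime=None):
    """Forward the texts in event['data'] to the SageMaker endpoint.

    The optional runtime argument exists only so the handler can be
    exercised with a stub client; Lambda itself passes event and context.
    """
    if runtime is None:
        import boto3  # imported lazily so the handler can be tested without AWS access
        runtime = boto3.client("sagemaker-runtime")

    # BlazingText expects the texts under the 'instances' key.
    payload = {"instances": event["data"]}
    response = runtime.invoke_endpoint(
        EndpointName=os.environ["ENDPOINT_NAME"],  # hypothetical variable name
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    predictions = json.loads(response["Body"].read().decode("utf-8"))
    return {"statusCode": 200, "body": json.dumps(predictions)}
```

The role attached to the function needs the ``sagemaker:InvokeEndpoint`` permission from the policy above for the ``invoke_endpoint`` call to succeed.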