
AWS SageMaker ⭕ sp19-616-111 ✋

Amazon SageMaker is a fully managed machine learning service. Using Amazon SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment.

Amazon SageMaker provides the following advantages:

  • An integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you do not have to manage servers
  • Common machine learning algorithms that are optimized to run efficiently against extremely large data in a distributed environment
  • Native support for bring-your-own algorithms and frameworks; Amazon SageMaker offers flexible distributed training options that adjust to your specific workflows
  • One-click deployment of a model into a secure and scalable environment from the Amazon SageMaker console
  • Training and hosting billed by minutes of usage, with no minimum fees and no upfront commitments

Machine Learning with Amazon SageMaker

This section explains a typical machine learning workflow and summarizes how you accomplish those tasks with Amazon SageMaker.

In general, machine learning is all about teaching a computer to make predictions, or inferences. As a first step, you use an algorithm and example data to train a model. Then you integrate your model into your application to generate inferences in real time and at scale. In a production environment, a model typically learns from millions of example data items and produces inferences in anywhere from a few hundred milliseconds down to less than 20 milliseconds.

The following diagram illustrates the typical workflow for creating a machine learning model (⭕ use proper image notation with caption and citation as discussed in notation.md):

AWS SageMaker

As the above diagram illustrates, you typically perform the following activities:

Generate example data—To train a model, you need example data. The type of data that you need depends on the business problem that you want the model to solve (the inferences that you want the model to generate). For example, suppose that you want to create a model to predict a number given an input image of a handwritten digit. To train such a model, you need example images of handwritten numbers.

Data scientists often spend a lot of time exploring and preprocessing, or "wrangling," example data before using it for model training. To preprocess data, you typically do the following:

  • Fetch the data: You might have in-house example data repositories, or you might use datasets that are publicly available. Typically, you pull the dataset or datasets into a single repository.

  • Clean the data: To improve model training, inspect the data and clean it up as needed. For example, if your data has a country name attribute with the values United States and US, you might want to edit the data to be consistent (see the short sketch after this list).

  • Prepare or transform the data: To improve performance, you might perform additional data transformations. For example, you might choose to combine attributes. If your model predicts the conditions that require de-icing an aircraft, instead of using temperature and humidity attributes separately you might combine them into a new attribute to get a better model.
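As a concrete illustration of the cleaning step, the country-name fix mentioned above is a one-line operation in pandas. The following is a minimal sketch, where customers.csv and the country column are hypothetical names:

```python
import pandas as pd

# Hypothetical dataset with an inconsistent country-name attribute
df = pd.read_csv('customers.csv')

# Normalize 'US' to 'United States' so the attribute has one value per country
df['country'] = df['country'].replace({'US': 'United States'})
```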

In Amazon SageMaker, you preprocess example data in a Jupyter notebook on your notebook instance. You use your notebook to fetch your dataset, explore it and prepare it for model training.

Train a model—Model training includes both training and evaluating the model, as follows:

  • Training the model: To train a model, you need an algorithm. The algorithm you choose depends on a number of factors. For a quick, out-of-the-box solution, you might be able to use one of the algorithms that Amazon SageMaker provides.

You also need compute resources for training. Depending on the size of your training dataset and how quickly you need the results, you can use resources ranging from a single, small general-purpose instance to a distributed cluster of GPU instances. For more information, refer to the subsection Train a Model with Amazon SageMaker.

  • Evaluating the model: After you've trained your model, you evaluate it to determine whether the accuracy of the inferences is acceptable. In Amazon SageMaker, you use either the AWS SDK for Python (Boto) or the high-level Python library that Amazon SageMaker provides to send requests to the model for inferences.

You use a Jupyter notebook in your Amazon SageMaker notebook instance to train and evaluate your model.

Deploy the model—You traditionally re-engineer a model before you integrate it with your application and deploy it. With Amazon SageMaker hosting services, you can deploy your model independently, decoupling it from your application code. For more information, see Deploy a Model on Amazon SageMaker Hosting Services.

Machine learning is a continuous cycle. After deploying a model, you monitor the inferences, collect "ground truth," and evaluate the model to identify drift. You then increase the accuracy of your inferences by updating your training data to include the newly collected ground truth and retraining the model with the new dataset. As more and more example data becomes available, you continue retraining your model to increase accuracy.

Get Started with SageMaker

In this section, we explain how to create your first Amazon SageMaker notebook instance and train a model. You train the model using an algorithm provided by Amazon SageMaker, deploy it, and validate it by sending inference requests to the model's endpoint.

You can use this notebook instance with any of the machine learning models available as part of an AWS SageMaker notebook instance, or with custom machine learning libraries.

Train a Model with Amazon SageMaker

To train a model in Amazon SageMaker, you complete the following steps:

  • Download the MNIST dataset to your Amazon SageMaker notebook instance, then review the data and preprocess it. For efficient training, you convert the dataset from the numpy.array format to the RecordIO protobuf format. A numpy.array is an n-dimensional array object used by the NumPy scientific computing library. RecordIO protobuf is a binary data format that the Amazon SageMaker K-Means algorithm expects as input.

  • Start an Amazon SageMaker training job.

  • Deploy the model in Amazon SageMaker.

  • Validate the model by sending inference requests to the model's endpoint. You send images of handwritten, single-digit numbers, and the model returns the cluster (0 through 9) that each image belongs to. A minimal end-to-end sketch of these steps follows this list.
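The following is a minimal sketch of these steps using the high-level Python library (the v1-era SageMaker Python SDK API). The instance types and S3 prefix are illustrative, and the sketch assumes the role and bucket variables initialized in Create a Jupyter Notebook and Initialize Variables, plus the MNIST train_set loaded in Download, Explore, and Transform the Training Data:

```python
from sagemaker import KMeans

# Configure the built-in k-means algorithm: one cluster per digit (k=10)
kmeans = KMeans(role=role,
                train_instance_count=2,
                train_instance_type='ml.c4.xlarge',
                output_path='s3://{}/kmeans/output'.format(bucket),
                k=10)

# Start the training job; record_set converts the numpy array for you
kmeans.fit(kmeans.record_set(train_set[0]))

# Deploy the trained model to a real-time hosted endpoint
kmeans_predictor = kmeans.deploy(initial_instance_count=1,
                                 instance_type='ml.m4.xlarge')

# Validate: send one handwritten-digit image and read back its cluster
result = kmeans_predictor.predict(train_set[0][30:31])
print(result)
```

When you're done experimenting, remember to delete the endpoint so you don't incur ongoing hosting charges.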

Note that for model training, deployment, and validation, you can use either of the following:

  • The high-level Python library provided by Amazon SageMaker

  • The AWS SDK for Python (Boto)

The high-level library abstracts several implementation details and is easy to use. This exercise provides separate code examples using both libraries. If you're a first-time Amazon SageMaker user, we recommend that you use the high-level Python library.
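For comparison, validating a deployed model with the AWS SDK for Python (Boto) is a direct call to the SageMaker runtime. The following is a minimal sketch in which the endpoint name and payload are placeholders:

```python
import boto3

runtime = boto3.client('sagemaker-runtime')

# Placeholder payload: one 28x28 MNIST image, flattened and serialized as CSV
payload = ','.join(['0.0'] * 784)

response = runtime.invoke_endpoint(EndpointName='kmeans-endpoint',  # hypothetical name
                                   ContentType='text/csv',
                                   Body=payload)
print(response['Body'].read())
```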

There are two ways to work through this exercise:

  • Follow the steps to create, deploy, and validate the model yourself. You create a Jupyter notebook in your Amazon SageMaker notebook instance, copy the code into it, and run it.

  • If you're familiar with using sample notebooks, open and run the following example notebooks that Amazon SageMaker provides in the SageMaker Python SDK section of the SageMaker Examples tab of your notebook instance:

  1. kmeans_mnist.ipynb

  2. kmeans_mnist_lowlevel.ipynb

Create a Jupyter Notebook and Initialize Variables

Now, create a Jupyter notebook in your Amazon SageMaker notebook instance and initialize variables.

To create a Jupyter notebook, sign in to the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.

Open the notebook instance by choosing Open next to its name. The Jupyter notebook server page appears:

AWS SageMaker

  • To create a notebook, in the Files tab, choose New, and then conda_python3. This pre-installed environment includes the default Anaconda installation and Python 3.

  • In the Jupyter notebook, under File, choose Save as, and name the notebook.

Copy the following Python code and paste it into your notebook. Add the name of the S3 bucket that you created in Set Up Amazon SageMaker, and run the code. The get_execution_role function retrieves the IAM role you created when you created your notebook instance.

from sagemaker import get_execution_role
role = get_execution_role()
bucket = 'bucket-name' # Use the name of your s3 bucket here

Download, Explore, and Transform the Training Data

Now download the MNIST dataset to your notebook instance. Then review the data, transform it, and upload it to your S3 bucket.

You transform the data by changing its format from numpy.array to RecordIO. The RecordIO format is more efficient for the algorithms provided by Amazon SageMaker.

MNIST dataset

To download the MNIST dataset, copy and paste the following code into the notebook and run it:

%%time

import pickle, gzip, numpy, urllib.request, json

# Load the dataset
urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz", "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')

The above code does the following:

  • Downloads the MNIST dataset (mnist.pkl.gz) from the deeplearning.net website to your Amazon SageMaker notebook instance.

  • Unzips the file and reads the following three datasets into the notebook's memory:

      • train_set—You use these images of handwritten numbers to train a model.

      • valid_set—After you train the model, you validate it using the images in this dataset.

      • test_set—You don't use this dataset in this exercise.

Training Dataset

Typically, you explore training data to determine what you need to clean up and which transformations to apply to improve model training. For this exercise, you don't need to clean up the MNIST dataset. Simply display one of the images in the train_set dataset.

%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (2, 10)


def show_digit(img, caption='', subplot=None):
    # Create a new subplot if the caller didn't supply one
    if subplot is None:
        _, subplot = plt.subplots(1, 1)
    # Each MNIST image is a flat vector of 784 pixels; reshape it to 28x28
    imgr = img.reshape((28, 28))
    subplot.axis('off')
    subplot.imshow(imgr, cmap='gray')
    plt.title(caption)

show_digit(train_set[0][30], 'This is a {}'.format(train_set[1][30]))

The code uses the matplotlib library to retrieve and display the 31st image (index 30) from the training dataset, along with its label.
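The tutorial's next step, converting train_set from numpy.array to RecordIO protobuf and uploading it to your S3 bucket, might look like the following sketch. It assumes the bucket variable from earlier and uses the write_numpy_to_dense_tensor helper from the SageMaker Python SDK; the kmeans/data key prefix is an arbitrary choice for this example:

```python
import io
import boto3
import numpy as np
import sagemaker.amazon.common as smac

# train_set[0] holds the images, train_set[1] the labels
vectors = np.array(train_set[0]).astype('float32')
labels = np.array(train_set[1]).astype('float32')

# Serialize the arrays into the RecordIO protobuf format in memory
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, vectors, labels)
buf.seek(0)

# Upload the serialized data to S3 for training
boto3.resource('s3').Bucket(bucket).Object('kmeans/data/recordio-pb-data').upload_fileobj(buf)
```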

Amazon SageMaker Examples

This repository contains example notebooks that show how to apply machine learning and deep learning in Amazon SageMaker.

Examples

Introduction to Ground Truth Labeling Jobs

These examples provide quick walkthroughs to get you up and running with the labeling job workflow for Amazon SageMaker Ground Truth.

Introduction to Applying Machine Learning

These examples provide a gentle introduction to machine learning concepts as they are applied in practical use cases across a variety of sectors.

  • Targeted Direct Marketing predicts potential customers that are most likely to convert based on customer and aggregate level metrics, using Amazon SageMaker's implementation of XGBoost.
  • Predicting Customer Churn uses customer interaction and service usage data to find those most likely to churn, and then walks through the cost/benefit trade-offs of providing retention incentives. This uses Amazon SageMaker's implementation of XGBoost to create a highly predictive model.
  • Time-series Forecasting generates a forecast for topline product demand using Amazon SageMaker's Linear Learner algorithm.
  • Cancer Prediction predicts Breast Cancer based on features derived from images, using SageMaker's Linear Learner.
  • Ensembling predicts income using two Amazon SageMaker models to show the advantages in ensembling.
  • Video Game Sales develops a binary prediction model for the success of video games based on review scores.
  • MXNet Gluon Recommender System uses neural network embeddings for non-linear matrix factorization to predict user movie ratings on Amazon digital reviews.
  • Fair Linear Learner is an example of an effective way to create fair linear models with respect to sensitive features.
  • Population Segmentation of US Census Data using PCA and Kmeans analyzes US census data, reducing dimensionality with PCA and then clustering US counties with KMeans to identify segments of similar counties.

SageMaker Automatic Model Tuning

These examples introduce SageMaker's hyperparameter tuning functionality, which helps deliver the best possible predictions by running a large number of training jobs to determine which hyperparameter values are the most impactful. A minimal sketch of the tuning API follows this list.

  • XGBoost Tuning shows how to use SageMaker hyperparameter tuning to improve your model fits for the Targeted Direct Marketing task.
  • TensorFlow Tuning shows how to use SageMaker hyperparameter tuning with the pre-built TensorFlow container and MNIST dataset.
  • MXNet Tuning shows how to use SageMaker hyperparameter tuning with the pre-built MXNet container and MNIST dataset.
  • Keras BYO Tuning shows how to use SageMaker hyperparameter tuning with a custom container running a Keras convolutional network on CIFAR-10 data.
  • R BYO Tuning shows how to use SageMaker hyperparameter tuning with the custom container from the Bring Your Own R Algorithm example.
  • Analyzing Results is a shared notebook that can be used after each of the above notebooks to provide analysis on how training jobs with different hyperparameters performed.
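To give a feel for the API these notebooks exercise, the following is a minimal sketch of a tuning job with the SageMaker Python SDK; the estimator, objective metric, hyperparameter ranges, and data channel paths are all illustrative assumptions:

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# 'estimator' is assumed to be any already-configured SageMaker estimator (e.g., XGBoost)
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='validation:auc',  # illustrative metric
    hyperparameter_ranges={
        'eta': ContinuousParameter(0.01, 0.3),
        'max_depth': IntegerParameter(3, 10),
    },
    max_jobs=20,          # total training jobs to run
    max_parallel_jobs=3,  # jobs to run concurrently
)

# s3_train and s3_validation are assumed S3 URIs for the data channels
tuner.fit({'train': s3_train, 'validation': s3_validation})
```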

Introduction to Amazon Algorithms

These examples provide quick walkthroughs to get you up and running with Amazon SageMaker's custom developed algorithms. Most of these algorithms can train on distributed hardware, scale incredibly well, and are faster and cheaper than popular alternatives.

  • k-means is our introductory example for Amazon SageMaker. It walks through the process of clustering MNIST images of handwritten digits using Amazon SageMaker k-means.
  • Factorization Machines showcases Amazon SageMaker's implementation of the algorithm to predict whether a handwritten digit from the MNIST dataset is a 0 or not using a binary classifier.
  • Latent Dirichlet Allocation (LDA) introduces topic modeling using Amazon SageMaker Latent Dirichlet Allocation (LDA) on a synthetic dataset.
  • Linear Learner predicts whether a handwritten digit from the MNIST dataset is a 0 or not using a binary classifier from Amazon SageMaker Linear Learner.
  • Neural Topic Model (NTM) uses Amazon SageMaker Neural Topic Model (NTM) to uncover topics in documents from a synthetic data source, where topic distributions are known.
  • Principal Components Analysis (PCA) uses Amazon SageMaker PCA to calculate eigendigits from MNIST.
  • Seq2Seq uses the Amazon SageMaker Seq2Seq algorithm that's built on top of Sockeye, which is a sequence-to-sequence framework for Neural Machine Translation based on MXNet. Seq2Seq implements state-of-the-art encoder-decoder architectures which can also be used for tasks like Abstractive Summarization in addition to Machine Translation. This notebook shows translation from English to German text.
  • Image Classification includes full training and transfer learning examples of Amazon SageMaker's Image Classification algorithm. This uses a ResNet deep convolutional neural network to classify images from the caltech dataset.
  • XGBoost for regression predicts the age of abalone (Abalone dataset) using regression from Amazon SageMaker's implementation of XGBoost.
  • XGBoost for multi-class classification uses Amazon SageMaker's implementation of XGBoost to classify handwritten digits from the MNIST dataset as one of the ten digits using a multi-class classifier. Both single machine and distributed use-cases are presented.
  • DeepAR for time series forecasting illustrates how to use the Amazon SageMaker DeepAR algorithm for time series forecasting on a synthetically generated data set.
  • BlazingText Word2Vec generates Word2Vec embeddings from a cleaned text dump of Wikipedia articles using SageMaker's fast and scalable BlazingText implementation.
  • Object Detection illustrates how to train an object detector using the Amazon SageMaker Object Detection algorithm with different input formats (RecordIO and image). It uses the Pascal VOC dataset. A third notebook is provided to demonstrate the use of incremental training.
  • Object detection for bird images demonstrates how to use the Amazon SageMaker Object Detection algorithm with a public dataset of Bird images.
  • Object2Vec for movie recommendation demonstrates how Object2Vec can be used to model data consisting of pairs of singleton tokens using movie recommendation as a running example.
  • Object2Vec for multi-label classification shows how the Object2Vec algorithm can train on data consisting of pairs of sequences and singleton tokens, using the setting of genre prediction of movies based on their plot descriptions.
  • Object2Vec for sentence similarity explains how to train Object2Vec using sequence pairs as input using sentence similarity analysis as the application.
  • IP Insights for suspicious logins shows how to train IP Insights on login events for a web server to identify suspicious login attempts.
  • Semantic Segmentation shows how to train a semantic segmentation algorithm using the Amazon SageMaker Semantic Segmentation algorithm. It also demonstrates how to host the model and produce segmentation masks and the probability of segmentation.

Amazon SageMaker RL

The following examples demonstrate different capabilities of Amazon SageMaker RL.

  • Cartpole using Coach demonstrates the simplest use case of Amazon SageMaker RL using Intel's RL Coach.
  • AWS DeepRacer demonstrates AWS DeepRacer training using RL Coach in the Gazebo environment.
  • HVAC using EnergyPlus demonstrates the training of HVAC systems using the EnergyPlus environment.
  • Knapsack Problem demonstrates how to solve the knapsack problem using a custom environment.
  • Mountain Car is a classic RL problem. This notebook explains how to solve it using the OpenAI Gym environment.
  • Distributed Neural Network Compression This notebook explains how to compress ResNets using RL, using a custom environment and the RLLib toolkit.
  • Turtlebot Tracker This notebook demonstrates object tracking using AWS Robomaker and RL Coach in the Gazebo environment.
  • Portfolio Management This notebook uses a custom Gym environment to manage multiple financial investments.
  • Autoscaling demonstrates how to adjust load depending on demand. This uses RL Coach and a custom environment.
  • Roboschool is an open source physics simulator that is commonly used to train RL policies for robotic systems. This notebook demonstrates training a few agents using it.
  • Stable Baselines In this notebook example, we make the HalfCheetah agent learn to walk using Stable Baselines, a set of improved implementations of Reinforcement Learning (RL) algorithms based on OpenAI Baselines.
  • Travelling Salesman is a classic NP-hard problem, which this notebook solves with Amazon SageMaker RL.

Scientific Details of Algorithms

These examples provide more thorough mathematical treatment on a select group of algorithms.

  • Streaming Median sequentially introduces concepts used in streaming algorithms, which many SageMaker algorithms rely on to deliver speed and scalability.
  • Latent Dirichlet Allocation (LDA) dives into Amazon SageMaker's spectral decomposition approach to LDA.
  • Linear Learner features shows how to use the class weights and loss functions features of the SageMaker Linear Learner algorithm to improve performance on a credit card fraud prediction task.

Advanced Amazon SageMaker Functionality

These examples showcase unique functionality available in Amazon SageMaker. They cover a broad range of topics and use a variety of methods, but aim to provide the user with sufficient insight or inspiration to develop within Amazon SageMaker.

  • Data Distribution Types showcases the difference between two methods for sending data from S3 to Amazon SageMaker Training instances. This has particular implications for the scalability and accuracy of distributed training.
  • Encrypting Your Data shows how to use Server Side KMS encrypted data with Amazon SageMaker training. The IAM role used for S3 access needs to have permissions to encrypt and decrypt data with the KMS key.
  • Using Parquet Data shows how to bring Parquet data sitting in S3 into an Amazon SageMaker Notebook and convert it into the recordIO-protobuf format that many SageMaker algorithms consume.
  • Connecting to Redshift demonstrates how to copy data from Redshift to S3 and vice-versa without leaving Amazon SageMaker Notebooks.
  • Bring Your Own XGBoost Model shows how to use Amazon SageMaker Algorithms containers to bring a pre-trained model to a realtime hosted endpoint without ever needing to think about REST APIs.
  • Bring Your Own k-means Model shows how to take a model that's been fit elsewhere and use Amazon SageMaker Algorithms containers to host it.
  • Bring Your Own R Algorithm shows how to bring your own algorithm container to Amazon SageMaker using the R language.
  • Installing the R Kernel shows how to install the R kernel into an Amazon SageMaker Notebook Instance.
  • Bring Your Own scikit Algorithm provides a detailed walkthrough on how to package a scikit learn algorithm for training and production-ready hosting.
  • Bring Your Own MXNet Model shows how to bring a model trained anywhere using MXNet into Amazon SageMaker.
  • Bring Your Own TensorFlow Model shows how to bring a model trained anywhere using TensorFlow into Amazon SageMaker.
  • Inference Pipeline with SparkML and XGBoost shows how to deploy an Inference Pipeline with SparkML for data pre-processing and XGBoost for training on the Abalone dataset. The pre-processing code is written once and used between training and inference.
  • Inference Pipeline with SparkML and BlazingText shows how to deploy an Inference Pipeline with SparkML for data pre-processing and BlazingText for training on the DBPedia dataset. The pre-processing code is written once and used between training and inference.
  • Experiment Management Capabilities with Search shows how to organize Training Jobs into projects, and track relationships between Models, Endpoints, and Training Jobs.
  • Creating Algorithm and Model Package - Listing on AWS Marketplace provides a detailed walkthrough on how to package a scikit learn algorithm to create SageMaker Algorithm and SageMaker Model Package entities that can be used with the enhanced SageMaker Train/Transform/Hosting/Tuning APIs and listed on AWS Marketplace.
  • Using Algorithm and Model Packages - From AWS Marketplace provides a detailed walkthrough on how to use Algorithm and Model Package entities with the enhanced SageMaker Train/Transform/Hosting/Tuning APIs by choosing a canonical product listed on AWS Marketplace.

Amazon SageMaker Pre-Built Framework Containers and the Python SDK

Pre-Built Deep Learning Framework Containers

These examples show you how to write idiomatic TensorFlow or MXNet and then train or host the models in pre-built containers using the SageMaker Python SDK.

Pre-Built Machine Learning Framework Containers

These examples show you how to build machine learning models with frameworks like Apache Spark or Scikit-learn using the SageMaker Python SDK.

Using Amazon SageMaker with Apache Spark

These examples show how to use Amazon SageMaker for model training, hosting, and inference through Apache Spark using SageMaker Spark. SageMaker Spark allows you to interleave Spark Pipeline stages with Pipeline stages that interact with Amazon SageMaker.
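As a rough sketch of what this interleaving looks like, the sagemaker_pyspark library exposes SageMaker algorithms as Spark estimators. The names below follow the SageMaker Spark README, but treat the exact signatures as assumptions; training_df must be a Spark DataFrame with a Vector-typed features column, and role_arn is an assumed IAM role ARN:

```python
from pyspark.sql import SparkSession
from sagemaker_pyspark import IAMRole, classpath_jars
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

# SageMaker Spark ships JARs that must be on the Spark classpath
spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", ":".join(classpath_jars()))
         .getOrCreate())

# Train on SageMaker-managed instances and host the result on a SageMaker endpoint
estimator = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole(role_arn),
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1)
estimator.setK(10)
estimator.setFeatureDim(784)

# fit() runs a SageMaker training job; transform() calls the endpoint for inference
model = estimator.fit(training_df)
predictions = model.transform(training_df)
```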

Additional Resources

What do I need in order to get started?

Will these examples work outside of Amazon SageMaker Notebook Instances?

  • Although most examples utilize key Amazon SageMaker functionality like distributed, managed training or real-time hosted endpoints, these notebooks can be run outside of Amazon SageMaker Notebook Instances with minimal modification (updating IAM role definition and installing the necessary libraries).