# Intro to SageMaker

A fully managed, cloud-based machine learning service made up of three capabilities:

* Build - Jupyter Notebook development environment
    * Extensive collection of popular machine learning algorithms
    * Preconfigured to run TensorFlow and Apache MXNet
    * Bring your own algorithm
* Train - managed training infrastructure
    * Distribute training across one or more instances
    * Managed model training infrastructure
    * Scales to petabytes
    * Compute instances automatically launched and released; artifacts stored in S3
* Deploy - scalable hosting infrastructure
    * Real time prediction
        * For interactive and low-latency use cases
        * Autoscaling to maintain adequate capacity, replace unhealthy instances, scale-out and scale-in based on workload
    * Batch transform
        * Non-interactive use-cases
        * Suitable when you need inference for your entire dataset and don't need a persistent real-time endpoint or sub-second latency
        * Manages all resources needed for batch transform

## Instance Types and Pricing

Instance families

* Standard
    * balanced CPU, memory, and network
    * T2, T3, M5 - T for bursty workloads, M for sustained load
* Compute optimized
    * Highest CPU performance
    * Latest CPUs - C4, C5
    * Good for both training and hosting
* Accelerated computing
    * graphics/GPU compute
    * Speeds up algorithms optimized for GPUs
    * P2, P3
    * Costs more but can reduce training time; can also serve GPU-optimized inference
* Inference acceleration
    * Add-on Fractional GPUs
    * Some algorithms are GPU intensive during training but need only fractional GPU during inference
    
    
How to decide?

* CPU vs GPU
* Try different sizes once a family is selected
* Instance type and size
    * `<instance family><hardware generation>.<size>`, e.g. `c5.2xlarge` (SageMaker adds an `ml.` prefix, e.g. `ml.c5.2xlarge`)
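
The naming pattern above can be taken apart mechanically. The following is a minimal sketch, not an AWS API: `parse_instance_type` and its regular expression are hypothetical helpers, and the optional `ml.` prefix reflects how SageMaker names its instance types.

```python
import re

# Matches names such as "c5.2xlarge" or "ml.p3.8xlarge".
# Letters after the generation digit (for example the "d" in "m5d") are skipped.
INSTANCE_RE = re.compile(r"^(?:ml\.)?([a-z]+)(\d+)[a-z]*\.(\w+)$")

def parse_instance_type(name):
    """Split an instance type name into family, hardware generation, and size."""
    match = INSTANCE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    family, generation, size = match.groups()
    return {"family": family, "generation": int(generation), "size": size}

print(parse_instance_type("c5.2xlarge"))
print(parse_instance_type("ml.p3.8xlarge"))
```

Splitting names this way makes it easier to sweep over sizes within one family when comparing cost against training time.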
    
Pricing components

* Instance type and size
* Fractional GPUs
* Storage
* Data transfer
* Region

Training - On Demand Pricing

* Instance hourly cost
* Storage
* Instances are automatically launched and terminated

Hosting - Realtime

* Instance + Fractional GPU 
* Storage
* Data transfer

Hosting - Batch

* Instance + Fractional GPU 
* Storage
* Data transfer
* Automatic termination
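
To see how the pricing components combine, here is a rough cost sketch. Every rate in it is a made-up placeholder rather than a real AWS price, the function `estimate_cost` is hypothetical, and prorating monthly storage over 730 hours is a simplifying assumption.

```python
HOURS_PER_MONTH = 730  # assumption: average hours per month, used to prorate storage

def estimate_cost(instance_usd_per_hour, instance_count, hours,
                  accelerator_usd_per_hour=0.0,
                  storage_gb=0.0, storage_usd_per_gb_month=0.0,
                  transfer_gb=0.0, transfer_usd_per_gb=0.0):
    """Rough cost of a training or hosting run from its pricing components."""
    # Instance cost plus any fractional-GPU add-on, billed per instance-hour.
    compute = (instance_usd_per_hour + accelerator_usd_per_hour) * instance_count * hours
    # Storage is quoted per GB-month, so scale it to the run's duration.
    storage = storage_gb * storage_usd_per_gb_month * hours / HOURS_PER_MONTH
    transfer = transfer_gb * transfer_usd_per_gb
    return compute + storage + transfer

# Placeholder rates for illustration only; look up real prices for your region.
training = estimate_cost(0.50, instance_count=2, hours=3,
                         storage_gb=100, storage_usd_per_gb_month=0.10)
print(f"estimated training cost: ${training:.2f}")
```

Because training and batch instances terminate automatically, `hours` is the job duration; for a real-time endpoint it is however long the endpoint stays up.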

## SageMaker Supported Data Formats

Training

* CSV
* RecordIO
* Algorithm specific formats (LibSVM, JSON, Parquet)

Training data must be stored in S3, either as a single file or split across multiple files in a folder.

There are two ways to transfer data from S3 to the training instance:

* File mode: copies the entire dataset from S3 to the training instance; space needs are the entire dataset size plus the final model artifacts
* Pipe mode: streams data from S3 to the training instance; faster start time and better throughput, and space is needed only for the final model artifacts
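
The difference in disk requirements can be written down directly. This is a minimal sketch: `required_volume_gb` is a hypothetical helper, and the 20% `headroom` default is an arbitrary safety margin, not an AWS recommendation.

```python
import math

def required_volume_gb(dataset_gb, model_artifacts_gb, input_mode, headroom=1.2):
    """Estimate the training volume size for a given input mode."""
    if input_mode == "File":
        # The whole dataset is copied to the instance before training starts.
        needed = dataset_gb + model_artifacts_gb
    elif input_mode == "Pipe":
        # Data is streamed, so only the model artifacts need local space.
        needed = model_artifacts_gb
    else:
        raise ValueError(f"unknown input mode: {input_mode!r}")
    return math.ceil(needed * headroom)

for mode in ("File", "Pipe"):
    print(mode, required_volume_gb(50, 2, mode), "GB")
```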

## Built-In Algorithms

SageMaker Training and Hosting Options

* Use built-in algorithms
* Use prebuilt container images with popular frameworks like MXNet, TensorFlow, scikit-learn, PyTorch
* Extend prebuilt containers
* Use custom container images - custom algorithm, language, frameworks

Built-In Algorithms

* Provided by SageMaker
* Easy to scale and use
* Optimized for the AWS cloud
* GPU support

BlazingText

* Used for text
* Cloud-optimized version of fastText
* Unsupervised version: converts words to vectors (Word2Vec)
    * Text preprocessing step for downstream NLP tasks such as sentiment analysis, named entity recognition, and translation
    * Semantically similar words have vectors that are closer to each other; for example, vegetable names sit near one another in vector space
* Supervised version: multi-class and multi-label classification
    * Classification based on text (single label), for example spam detection (spam/not spam)
    * A single instance can belong to many classes (multi-label), for example a movie can belong to multiple genres
* See the SageMaker BlazingText documentation and [fastText](https://fasttext.cc)
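
"Closer to each other" is usually measured with cosine similarity. The sketch below uses tiny hand-made vectors to show the idea; the three-dimensional embeddings are invented for illustration and are not output from BlazingText.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; real word vectors typically have 100+ dimensions.
vectors = {
    "carrot": [0.9, 0.1, 0.0],
    "potato": [0.8, 0.2, 0.1],
    "paris":  [0.0, 0.1, 0.9],
}

print("carrot vs potato:", round(cosine_similarity(vectors["carrot"], vectors["potato"]), 3))
print("carrot vs paris: ", round(cosine_similarity(vectors["carrot"], vectors["paris"]), 3))
```

With these toy values the two vegetables score much higher against each other than either does against the city name, which is the behavior the unsupervised mode aims for.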

Object2Vec

* Supervised
* Can be used for classification and regression
* Extends Word2Vec: learns relationships between pairs of objects, captures structure of sentences
* Examples: similarity based on customer-product or movie-rating pairs

Factorization Machines

* Supervised
* Used for regression, classification
* Works very well with high-dimensional sparse datasets
* Popular for building recommender systems
* Collaborative filtering
* Example: movie recommendations based on your viewing habits, cross recommend based on similar users
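
For intuition about why factorization machines suit sparse data, here is a sketch of the standard second-order factorization machine score: a bias, a linear term, and pairwise interactions weighted by dot products of learned factor vectors. It illustrates the textbook model, not SageMaker's implementation, and all weights below are made-up values.

```python
def fm_predict(x, w0, w, v):
    """Score one feature vector x with bias w0, linear weights w, and factors v."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    n = len(x)
    for i in range(n):
        if x[i] == 0:
            continue  # zero features contribute nothing, which keeps sparse inputs cheap
        for j in range(i + 1, n):
            if x[j] == 0:
                continue
            # Interaction strength is the dot product of the two factor vectors.
            interaction = sum(a * b for a, b in zip(v[i], v[j]))
            pairwise += interaction * x[i] * x[j]
    return w0 + linear + pairwise

# One-hot encoding: features 0-1 are users, features 2-3 are movies (hypothetical).
x = [1, 0, 1, 0]
w0 = 0.1
w = [0.2, 0.0, 0.3, 0.0]
v = [[1.0, 0.5], [0.0, 0.0], [0.5, 1.0], [0.0, 0.0]]
print(fm_predict(x, w0, w, v))
```

Because a user and a movie interact only through their factor vectors, the model can score pairs it never saw together in training, which is what makes it useful for recommendations.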

K-Nearest Neighbors

* Supervised, used for regression and classification
* Classification - queries the k nearest neighbors and assigns the majority class to the instance
* Regression - queries the k nearest neighbors and returns their average value for the instance
* Does not scale well for large datasets
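
Both modes can be sketched in a few lines of brute-force Python. This is a plain illustration of the technique with hypothetical function names and toy data, not SageMaker's implementation. It also shows the scaling problem: every query scans the whole training set.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the targets of the k training points closest to the query."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    return [target for _, target in by_distance[:k]]

def knn_classify(train, query, k=3):
    """Majority class among the k nearest neighbors."""
    return Counter(k_nearest(train, query, k)).most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Average target value among the k nearest neighbors."""
    neighbors = k_nearest(train, query, k)
    return sum(neighbors) / len(neighbors)

labeled = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((5, 5), "b"), ((5, 6), "b")]
valued = [((0,), 1.0), ((1,), 2.0), ((2,), 3.0), ((10,), 100.0)]

print(knn_classify(labeled, (0.2, 0.2)))
print(knn_regress(valued, (1,)))
```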

Linear Learner

* Supervised
* Regression, classification
* Linear models used for regression, binary classification, and multi-class classification

XGBoost

* Supervised
* Regression, classification
* Gradient boosted trees algorithm, very popular, has won several competitions

DeepAR

* Supervised, used for time series forecasting
* Train multiple related time series using a single model
* Generate predictions for new, similar time series
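
Training on many related series means packing them into one dataset. A minimal sketch, assuming the JSON Lines layout that DeepAR documents (one series per line, with a `start` timestamp and a `target` array); the helper `to_deepar_jsonl` and the sample data are hypothetical.

```python
import json

def to_deepar_jsonl(series):
    """Serialize (start, values) pairs as JSON Lines, one time series per line."""
    return "\n".join(
        json.dumps({"start": start, "target": list(values)})
        for start, values in series
    )

# Hypothetical daily sales for two related stores.
related_series = [
    ("2024-01-01 00:00:00", [12, 15, 9, 20]),
    ("2024-01-01 00:00:00", [3, 4, 6, 5]),
]

jsonl = to_deepar_jsonl(related_series)
print(jsonl)
```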

Object Detection

* Supervised, classification
* Used for image analysis, detects and classifies objects in an image, returns bounding box of each object location

Image Classification

* Supervised, classification
* Image analysis algorithm, classifies entire image, supports multilabels

Semantic Segmentation

* Supervised, classification
* Image analysis algorithm for computer vision applications
* Tags each pixel in an image with a class label
* Example: identify shape of car

Sequence to Sequence (seq2seq)

* Supervised, converts one sequence of tokens into another
* Input: sequence of tokens
* Output: another sequence of tokens
* Examples: text summarization, language translation, speech to text

K-Means

* Unsupervised, clustering
* Identifying discrete groups within data
* Members of a group are as similar as possible to one another and as different as possible from members of other groups
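
The grouping idea can be sketched with textbook Lloyd's algorithm: assign each point to its nearest centroid, then move each centroid to the mean of its members. This is an illustration only; the naive "first k points" initialization is an assumption made to keep the example deterministic, and SageMaker's built-in K-Means may initialize and scale differently.

```python
import math

def kmeans(points, k, iterations=10):
    """Cluster points into k groups; returns (centroids, assignments)."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments

points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```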

Latent Dirichlet Allocation (LDA)

* Unsupervised, topic modeling
* Groups documents into a user-specified number of topics
* For each document, assigns a probability score for each topic

Neural Topic Modeling

* Unsupervised, topic modeling
* Similar to LDA

Principal Component Analysis (PCA)

* Unsupervised, dimensionality reduction
* Reduces dimensionality of dataset while retaining as much information as possible
* Returns components - new sets of features that are composites of the original features and that are uncorrelated with one another
* Examples: reduce the dimensions of a dataset, visualize high dimensional datasets, remove highly correlated features
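
A minimal pure-Python sketch of the idea, assuming only the first component is wanted. It uses power iteration on the covariance matrix as a stand-in for a full eigendecomposition, which is not necessarily how SageMaker computes PCA; the function name and sample data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (component, projections) for the direction of greatest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly applying cov pulls the vector toward the
    # top eigenvector (assumes the start vector is not orthogonal to it).
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Project each centered row onto the component: 2 features become 1.
    projections = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return vec, projections

# Two highly correlated features (made-up values).
rows = [(2.0, 1.9), (4.0, 4.1), (6.0, 6.2), (8.0, 7.8)]
component, projections = first_principal_component(rows)
print(component)
print(projections)
```

Since the two features move together, a single component captures almost all of the variance, which is the "remove highly correlated features" use case.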

Random Cut Forest (RCF)

* Unsupervised, anomaly detection
* Anomalous points are observations that diverge from otherwise well-structured or patterned data
* For each data point, RCF assigns an anomaly score
* Low score indicates normal data and high score indicates an anomaly.
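
RCF returns scores rather than verdicts, so a cutoff has to be chosen. A common heuristic is to flag scores more than three standard deviations above the mean score; the sketch below applies that rule to made-up scores. The function `flag_anomalies` is hypothetical, and the right threshold depends on the data.

```python
import statistics

def flag_anomalies(scores, num_std=3.0):
    """Mark scores that exceed mean + num_std * standard deviation."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)
    cutoff = mean + num_std * std
    return [score > cutoff for score in scores]

# Made-up anomaly scores: twenty ordinary points and one outlier.
scores = [1.0] * 20 + [10.0]
flags = flag_anomalies(scores)
print([i for i, flagged in enumerate(flags) if flagged])
```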

IP Insights

* Unsupervised, detect unusual network activity
* Learns from (entity, IPv4 address) pairs
* Entity can be account id, user id
* For a given pair returns a score
* A high score indicates an unusual event; for example, a website can trigger an MFA challenge




## SageMaker Ground Truth

Automatic labeling

* Learns based on examples
* Very cost effective

Manual Labeling

* Human labeling - Mechanical Turk
* Manages workflow

## SageMaker Neo

* Run machine learning algorithms anywhere in the cloud and at edge locations
* Edge - where latency is critical
* Cross-compilation capability that can optimize your algorithms to run on Intel, NVIDIA, Arm, and other hardware

## Bring Your Own Algorithms

* SageMaker makes extensive use of Docker containers for build and runtime tasks
* [Bring your own algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html) is also based on containers
* Popular pattern - Apache Spark for preprocessing, then train and host with SageMaker

Popular Framework Support

* TensorFlow
* MXNet
* scikit-learn
* PyTorch
* Chainer
* SparkML

SageMaker provides SDKs and prebuilt Docker images to train and host models using these frameworks.

You can develop your own algorithms with the frameworks and languages of your choice by conforming to the SageMaker container interfaces.
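
To make "conforming to the container interfaces" concrete, here is a minimal training entry point sketch. It assumes the standard SageMaker training container paths under `/opt/ml` (`input/config/hyperparameters.json`, `input/data/<channel>`, and `model`). The channel name `train`, the `base_dir` parameter, and the line-counting "model" are placeholders for illustration.

```python
import json
import os
import tempfile

def train(base_dir="/opt/ml"):
    """Read hyperparameters and input data, then write a model artifact."""
    hp_path = os.path.join(base_dir, "input", "config", "hyperparameters.json")
    data_dir = os.path.join(base_dir, "input", "data", "train")  # assumed channel name
    model_dir = os.path.join(base_dir, "model")

    hyperparameters = {}
    if os.path.exists(hp_path):
        with open(hp_path) as f:
            hyperparameters = json.load(f)

    # Placeholder "training": count the lines across all input files.
    line_count = 0
    for name in sorted(os.listdir(data_dir)):
        with open(os.path.join(data_dir, name)) as f:
            line_count += sum(1 for _ in f)

    # Anything written to the model directory is what gets saved as the artifact.
    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, "model.json")
    with open(model_path, "w") as f:
        json.dump({"lines_seen": line_count, "hyperparameters": hyperparameters}, f)
    return model_path

# Local dry run against a temporary directory that mimics the container layout.
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "input", "data", "train"))
    with open(os.path.join(base, "input", "data", "train", "part-0.csv"), "w") as f:
        f.write("1,0.5\n0,0.7\n")
    with open(train(base_dir=base)) as f:
        demo_model = json.load(f)
print(demo_model)
```

Keeping the base directory a parameter lets the same script be tested locally before it is packaged into an image.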

Deep Learning AMIs

* Launch EC2 instances preconfigured with all the tools and frameworks
* Use cases include modifying or extending DL frameworks, troubleshooting frameworks, and contributing to framework projects

