# AWS Certified Machine Learning - Specialty

This notebook outlines the key AWS services and techniques to know for the certification exam. For more details, refer to the AWS documentation. Some of the material can also be found in the **aws-cloud-practitioner-essentials notebooks** saved in this repository.


## Domain 1: Data Engineering
### AWS Kinesis
#### Kinesis Streams
- Use case: Real-time application
- Stream limits
    - Producer: 1 MB/s or 1,000 messages/s written per shard
    - Consumer: 2 MB/s read per shard, and 5 API calls per second per shard across all consumers
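
The per-shard limits above translate directly into a shard count for a given workload. The following is a minimal sketch of that arithmetic; the function name `shards_needed` and the example workload are hypothetical, and the constants simply restate the limits listed in these notes.

```python
import math

# Per-shard limits as quoted in the notes above.
WRITE_MB_PER_SEC = 1.0
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0


def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    by_write_bytes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC)
    by_write_records = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    by_read_bytes = math.ceil(read_mb_per_sec / READ_MB_PER_SEC)
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


# Hypothetical workload: 4.5 MB/s written, 3,000 records/s,
# and two consumers that each read the full stream (9 MB/s in total).
print(shards_needed(write_mb_per_sec=4.5, records_per_sec=3000, read_mb_per_sec=9.0))  # 5
```

The tightest of the three limits decides the result, which is why the function takes the maximum rather than the sum.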
    
#### Kinesis Firehose
- Use case: Data ingestion

#### Kinesis Analytics
- Use case: Streaming ETL, continuous metric generation, responsive analytics
- Machine learning functions:
    - RANDOM_CUT_FOREST for anomaly detection
    - HOTSPOTS
    
#### Kinesis Video Streams

### Glue 
#### Glue Data Catalog, Glue Crawlers, Glue ETL
- Glue ETL runs Apache Spark code

### AWS Data Stores
#### Redshift
- Data warehouse
- SQL analytics, OLAP - online analytical processing

#### RDS, Aurora
- Relational database
- OLTP - Online Transaction Processing

#### DynamoDB
- NoSQL data store

#### S3
- Object store
- S3 storage tiers
- S3 encryption (using KMS)
- S3 security (using IAM policies and other relevant AWS services)

#### Elasticsearch
#### ElastiCache

### AWS Data Pipeline
- Orchestration service

### AWS Batch
- Run batch jobs as Docker images

### AWS DMS (Database Migration Service)

### AWS Step Functions

## Domain 2: Exploratory Data Analysis
### Amazon Athena
- Pay-as-you-go: $5 per TB scanned
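
Because the charge depends on the data scanned, anything that reduces scanned bytes (for example columnar formats or partitioning) lowers the bill. A minimal sketch of the arithmetic follows; the function name `athena_query_cost` and the example sizes are hypothetical, and treating 1 TB as 10^12 bytes is an assumption made for illustration.

```python
PRICE_PER_TB_USD = 5.0    # rate quoted in the notes above
BYTES_PER_TB = 10 ** 12   # assumption: decimal terabytes


def athena_query_cost(bytes_scanned):
    """Estimate the cost in USD of a query that scans the given number of bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Hypothetical comparison: a full scan of 2 TB of raw data versus
# the same query reading only 200 GB from a columnar, partitioned copy.
full_scan_cost = athena_query_cost(2 * 10 ** 12)
pruned_scan_cost = athena_query_cost(200 * 10 ** 9)
print(f"full scan: ${full_scan_cost:.2f}, pruned scan: ${pruned_scan_cost:.2f}")
```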

### Amazon QuickSight
- Data visualization tool for everyone
- 10 GB of SPICE (Super-fast, Parallel, In-memory Calculation Engine) storage
- Machine learning features: anomaly detection, forecasting, auto-narratives

### EMR (Elastic MapReduce)
- Hadoop framework on EC2 instances
- Nodes
- Storage
- Spark
- Zeppelin
- EMR notebook
- Master node instance type: m4.large if < 50 nodes, m4.xlarge if > 50 nodes
- Spot instances are a good choice for task nodes

### SageMaker Ground Truth

## Domain 3: Modeling

### Deep learning on EC2 / EMR
- EMR supports Apache MXNet and **GPU** instance types


### Use SageMaker through SageMaker Notebooks
#### 1. Open SageMaker on AWS Management Console
#### 2. Create a notebook instance and set up the following accordingly. (The notebook instance is separate from the SageMaker instances that train or host the model.)
- Name and instance type
- Permission and encryption - IAM role 
- Network - VPC
- Git repository
- Tags


#### 3. Open Jupyter from the notebook instance 
#### 4. Create or upload a Python training script
```python
if __name__ == '__main__':
    # Training code goes here

    # Save the model
    pass
```
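
To make the skeleton above more concrete, here is a hedged sketch of what such a script might look like. It assumes the SageMaker training container exposes the `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` environment variables and passes hyperparameters as command-line arguments. The argument names, the placeholder data, the toy "model", and the JSON output format are all hypothetical and stand in for real training code.

```python
import argparse
import json
import os
import tempfile

# Placeholder data so the sketch runs anywhere; a real script would load
# files from the directory given by --train instead.
PLACEHOLDER_DATA = [1.0, 2.0, 3.0]


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # In script mode, hyperparameters arrive as command-line arguments.
    parser.add_argument("--epochs", type=int, default=1)
    # SageMaker sets SM_MODEL_DIR and SM_CHANNEL_TRAINING inside the training
    # container; the fallbacks are assumptions for running outside SageMaker.
    default_model_dir = os.path.join(tempfile.gettempdir(), "model")
    parser.add_argument("--sm-model-dir",
                        default=os.environ.get("SM_MODEL_DIR", default_model_dir))
    parser.add_argument("--train",
                        default=os.environ.get("SM_CHANNEL_TRAINING", "data"))
    args, _unknown = parser.parse_known_args(argv)
    return args


def train(values, epochs):
    """Toy 'model': a constant predictor that always returns the mean."""
    return {"mean": sum(values) / len(values), "epochs": epochs}


def save_model(model, model_dir):
    """Write the model where SageMaker expects to find training artifacts."""
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "model.json")
    with open(path, "w") as f:
        json.dump(model, f)
    return path


def main(argv=None):
    args = parse_args(argv)
    model = train(PLACEHOLDER_DATA, args.epochs)
    return save_model(model, args.sm_model_dir)


if __name__ == "__main__":
    print("Model saved to", main())
```

Keeping the work inside `main()` behind the `__name__` guard means the script can be imported and tested locally before it is handed to an estimator.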
    
#### 5. In the SageMaker notebook, import an estimator from sagemaker, create it with the training script, and fit the model. (SageMaker then starts a training instance that runs the script in a Docker container, using a Docker image registered in ECR.)
```python
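import sagemaker
from sagemaker.tensorflow import TensorFlow

# IAM role attached to the notebook instance; SageMaker assumes it for training
role = sagemaker.get_execution_role()

# Parameter names follow SageMaker Python SDK v1; v2 renames
# train_instance_count/train_instance_type to instance_count/instance_type
tf_estimator = TensorFlow(entry_point='train.py',
                          role=role,
                          train_instance_count=1,
                          train_instance_type='local',  # or an AWS instance, e.g. ml.p3.2xlarge
                          framework_version='1.12',
                          py_version='py3',
                          script_mode=True,
                          hyperparameters={})

# s3_training_input_path and s3_validation_input_path are the S3 URIs of the input data
tf_estimator.fit({'training': s3_training_input_path, 'validation': s3_validation_input_path})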
```
##### Pre-built models are Docker images. The training options include:
- Built-in training algorithms
- scikit-learn
- Spark MLlib
- TensorFlow, MXNet, Chainer, PyTorch
- Custom Python TensorFlow / MXNet code
- Your own Docker image
- Algorithms purchased from AWS Marketplace

#### 6. Deploy the model after training. (SageMaker then starts a hosting instance that serves the model from a Docker container, using a Docker image registered in ECR. ***In addition, an endpoint is created for the model.***)
```python
tf_predictor = tf_estimator.deploy(initial_instance_count=1,
                                   instance_type='ml.c5.large',
                                   accelerator_type='ml.eia1.medium',
                                   endpoint_name=tf_endpoint_name)
```

### SageMaker Built-In Algorithms

#### 1. Classification and Regression
- Linear Learner - CPU or GPU
- XGBoost - CPU
```python
import sagemaker.xgboost  # XGBoost estimator and model classes in the SageMaker Python SDK
```
- KNN - CPU or GPU
    - Dimensionality reduction stage included

#### 2. Deep Learning
- Seq2Seq - GPU only
    - From a sequence of tokens to sequence of tokens with CNN and RNN
- DeepAR - CPU or GPU
    - Forecasting one-dimensional time series data with RNN
- BlazingText
    - Text classification: supervised; predict labels on a sentence 
    - Word2vec
- Object2Vec
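    - General-purpose embeddings learned from pairs of objects; the embeddings can be used for tasks such as clustering
- Object Detection - GPU for training; CPU or GPU for inference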
    - Identify all objects in an image
- Image Classification - GPU for training; CPU or GPU for inference
    - Assign one or more labels to an image
- Semantic Segmentation - GPU for training; CPU or GPU for inference
    - Pixel-level object classification
    
#### 3. Unsupervised Machine Learning
- Random Cut Forest - CPU (M4, C4, C5)
    - Anomaly detection
- Neural Topic Model - CPU or GPU
    - Unsupervised; Topic modeling by Neural Variational Inference
- LDA (Latent Dirichlet Allocation) - CPU
    - Unsupervised; Topic modeling
- K-Means - CPU or GPU
    - Web-scale clustering
- PCA - CPU or GPU
    - Singular value decomposition (SVD)
- Factorization Machines - CPU or GPU
    - Pair-wise data; Recommendation systems
- IP Insights - CPU or GPU
    - Unsupervised; Identify suspicious behavior from IP addresses

#### 4. Reinforcement Learning
- Reinforcement Learning
    - Q-Learning
    - Markov Decision Process
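
As a reminder of how Q-learning works on a Markov Decision Process, here is a small tabular sketch. The corridor environment, the hyperparameters, and all names are hypothetical and chosen only for illustration; this is not SageMaker's reinforcement learning API.

```python
import random

N_STATES = 5       # states 0..4; state 4 is terminal
ACTIONS = (0, 1)   # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic corridor: reward 1.0 only on reaching the last state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn a Q-table while following a uniformly random behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best value of the next state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
# Greedy policy for the non-terminal states: pick the action with the larger Q-value.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

Q-learning is off-policy, so the agent can explore at random and still learn the values of the greedy policy, which here is to move right in every state.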

### Automatic Model Tuning with SageMaker
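
Automatic model tuning runs many training jobs with different hyperparameter values and keeps the best one according to an objective metric. The sketch below illustrates the idea with a plain random search in standard Python. It does not use the SageMaker tuning API; the objective function, the hyperparameter names, and the ranges are hypothetical stand-ins for a real training job and its validation metric.

```python
import math
import random


def validation_loss(learning_rate, max_depth):
    """Hypothetical stand-in for 'train a model and return its validation loss'."""
    return (math.log10(learning_rate) + 2) ** 2 + 0.1 * (max_depth - 6) ** 2


def random_search(n_trials=50, seed=0):
    """Sample hyperparameters at random and keep the trial with the lowest loss."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            # Sample the learning rate on a log scale between 1e-4 and 1.
            "learning_rate": 10 ** rng.uniform(-4, 0),
            "max_depth": rng.randint(2, 10),
        }
        loss = validation_loss(**params)
        if best is None or loss < best["loss"]:
            best = {"loss": loss, **params}
    return best


best = random_search()
print(best)
```

Sampling the learning rate on a log scale spreads trials evenly across orders of magnitude, which usually matters more than even spacing on a linear scale.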

### SageMaker + Spark
```python
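import sagemaker_pyspark                 # SageMaker Spark SDK for PySpark
import sagemaker_pyspark.algorithms      # estimators for the built-in algorithms
import sagemaker_pyspark.transformation  # request/response serialization helpers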
```

### High-Level AI/ML Services
- Amazon Comprehend
    - NLP and text analytics: entities, key phrases, language, sentiment, and syntax
- Amazon Translate
- Amazon Transcribe
    - Speech to text, speaker identification, and channel identification
- Amazon Polly
    - Text to speech
- Amazon Rekognition
    - Image and video analysis
- Amazon Forecast
    - Time-series analysis with ARIMA, DeepAR, ETS, NPTS, and Prophet
- Amazon Lex
    - Natural-language chatbot engine

## Domain 4: Machine Learning Implementation and Operations

All models in SageMaker are hosted in Docker containers. Containers are isolated and contain all dependencies and resources needed to run. ***Docker containers are created from images, which are built from a Dockerfile and saved in a repository, such as Amazon Elastic Container Registry (ECR).***

### Production Variants
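
As far as I understand it, production variants let one endpoint host several model versions and split incoming traffic in proportion to each variant's weight, which is useful for A/B tests and canary rollouts. The sketch below simulates that weighted routing in plain Python. It does not call SageMaker, and the variant names, weights, and function name are hypothetical.

```python
import random
from collections import Counter


def route_requests(variant_weights, n_requests, seed=0):
    """Assign each request to a variant with probability proportional to its weight."""
    rng = random.Random(seed)
    names = list(variant_weights)
    weights = [variant_weights[name] for name in names]
    chosen = rng.choices(names, weights=weights, k=n_requests)
    return Counter(chosen)


# Hypothetical canary setup: roughly 90% of traffic to the current model
# and 10% to a new candidate.
counts = route_requests({"current-model": 9, "candidate-model": 1}, n_requests=10_000)
print(counts)
```

Only the ratio between the weights matters, so weights of 9 and 1 behave the same as 0.9 and 0.1.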

### SageMaker Neo + AWS IoT Greengrass
Greengrass deploys the model compiled by Neo to edge devices.
