# This notebook can be used to refresh your understanding of the ML services provided by AWS, their uses, and their benefits

# General

<b>The AWS Machine Learning Specialty exam evaluates the candidate primarily in four sections:</b>
1. Data engineering - 20%
2. Exploratory data analysis - 24% 
3. Modeling - 36%
4. ML implementation & operations - 20%

## Important Tips

1. Read and understand the question before reading answer options (pretend the answer options aren't even there at first).
2. Identify the key phrases and qualifiers in the question.
3. Try to answer the question before even looking at the answer choices, then see if any of those answer choices match your original answer.
4. Eliminate answer options based on what you know about the question, including the key phrases and qualifiers you highlighted earlier.
5. If you still don't know the answer, consider flagging the question and moving on to easier questions. But remember to answer all questions before the time is up on the exam, as there are no penalties for guessing.

<b>As a data scientist, you should understand the following nine stages of the data hierarchy:</b>
1. Collect data.
2. Move data reliably from one location to another.
3. Store data in structured or unstructured databases, as the business requires.
4. Transform raw data to remove anomalies.
5. Aggregate data from different sources and perform basic analytics.
6. Label the data, perform feature engineering, and separate testing data from training data.
7. Optimize ML algorithms.
8. Interpret results and improve the algorithms.
9. Implement the AI and deep learning system.

### AWS ML Stack  

The AWS machine learning stack has three tiers:

![image.png](attachment:image.png)

# Exam Topics

## 1. Data Engineering

## Part-One Create data repositories for machine learning

**Important Topics**
* AWS Lake Formation
* Amazon S3 (as storage for a data lake)
* Amazon FSx for Lustre
* Amazon EFS
* Amazon EBS volumes
* Amazon S3 lifecycle configuration
* Amazon S3 data storage options

### a. AWS Lake Formation
AWS Lake Formation is a managed service for building data lakes. It simplifies and automates many of the complex manual steps that are usually required to create a data lake. These steps include collecting, cleansing, moving, and cataloging data, and securely making that data available for analytics and machine learning.

1. A data lake can store both structured and unstructured data.
2. AWS Lake Formation makes it easier to create secure data lakes and to make data available for wide-ranging analytics.
3. Amazon S3 is the preferred storage option for data science processing on AWS.
4. However, Amazon S3 is not the only storage option for model training.

### b. Amazon S3
You can use Amazon S3 while you’re training your ML models with Amazon SageMaker. Amazon S3 is integrated with Amazon SageMaker to store your training data and model training output.

![image.png](attachment:image.png)
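As a rough sketch of how a training job is pointed at data in Amazon S3, the helper below builds one channel entry in the shape used by the `InputDataConfig` parameter of the SageMaker `CreateTrainingJob` API. The bucket name, prefix, and channel name are placeholders chosen for illustration.

```python
def s3_training_channel(s3_uri, channel_name="train"):
    """Build one InputDataConfig entry that reads training data from Amazon S3."""
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "S3DataSource": {
                # Use every object stored under the given prefix.
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                # Copy the full dataset to each training instance.
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


# Placeholder location; replace with your own bucket and prefix.
channel = s3_training_channel("s3://example-bucket/training-data/")
print(channel)
```

The resulting dictionary would be passed in the `InputDataConfig` list when the training job is created.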

### c. Data Storage options in Amazon S3 and their Features

<b>Let's go through the available storage options, starting with Amazon S3 and its storage classes.</b>

*Amazon S3 Standard Features:*
1. For active and frequently accessed data
2. Millisecond access
3. ≥ 3 Availability Zones

*Amazon S3 Intelligent-Tiering (INT) Features:*
1. For data with changing access patterns
2. Millisecond access
3. ≥ 3 Availability Zones

*Amazon S3 Standard-Infrequent Access (S-IA) Features:*
1. For infrequently accessed data
2. Millisecond access
3. ≥ 3 Availability Zones

*Amazon S3 One Zone-Infrequent Access (1Z-IA) Features:*
1. For re-creatable, less frequently accessed data
2. Millisecond access
3. 1 Availability Zone

*Amazon S3 Glacier Features:*
1. For archive data
2. Retrieval in minutes or hours
3. ≥ 3 Availability Zones

![image.png](attachment:image.png)
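The storage class is chosen per object at upload time. The sketch below maps the classes listed above to the `StorageClass` strings accepted by the S3 `PutObject` API and builds the keyword arguments for a boto3 `put_object` call. The bucket, key, and the short tier names used as dictionary keys are illustrative assumptions.

```python
# Storage classes discussed above, mapped to the StorageClass values of the S3 API.
STORAGE_CLASS = {
    "standard": "STANDARD",
    "intelligent_tiering": "INTELLIGENT_TIERING",
    "standard_ia": "STANDARD_IA",
    "one_zone_ia": "ONEZONE_IA",
    "glacier": "GLACIER",
}


def put_object_args(bucket, key, body, tier="standard"):
    """Build keyword arguments for an S3 put_object call with an explicit storage class."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "StorageClass": STORAGE_CLASS[tier],
    }


# Placeholder bucket and key; infrequently accessed raw data goes to Standard-IA.
args = put_object_args("example-bucket", "raw/events.csv", b"id,value\n1,42\n", tier="standard_ia")
print(args["StorageClass"])

# With AWS credentials configured, the upload itself would be:
#   import boto3
#   boto3.client("s3").put_object(**args)
```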



<br> </br>

### d. Amazon S3 lifecycle configuration

An Amazon S3 lifecycle configuration is a set of rules that define actions Amazon S3 applies automatically to a group of objects. There are two kinds of actions: transition actions, which move objects from one S3 storage class to another, and expiration actions, which define when objects expire.

*Transition actions* – These actions define when objects transition to another storage class. For example, you might choose to transition objects to the S3 Standard-IA storage class 30 days after creating them, or archive objects to the S3 Glacier Flexible Retrieval storage class one year after creating them.

*Expiration actions* – These actions define when objects expire. Amazon S3 deletes expired objects on your behalf.
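The transition and expiration actions described above can be sketched as a lifecycle configuration in the structure that boto3's `put_bucket_lifecycle_configuration` expects. The rule ID, prefix, bucket name, and the two-year expiration are assumptions made for this example; the 30-day and one-year transitions mirror the example in the text.

```python
def build_lifecycle_configuration(prefix, ia_after_days=30, glacier_after_days=365,
                                  expire_after_days=None):
    """Build a lifecycle configuration with two transitions and an optional expiration."""
    rule = {
        "ID": "tier-then-archive",        # hypothetical rule name
        "Filter": {"Prefix": prefix},     # apply only to objects under this prefix
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            # "GLACIER" is the API value for S3 Glacier Flexible Retrieval.
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }
    if expire_after_days is not None:
        rule["Expiration"] = {"Days": expire_after_days}
    return {"Rules": [rule]}


def apply_lifecycle(s3_client, bucket, configuration):
    """Attach the configuration to a bucket (requires a boto3 S3 client and credentials)."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=configuration
    )


config = build_lifecycle_configuration("training-data/", expire_after_days=730)
print(config)
```

Calling `apply_lifecycle(boto3.client("s3"), "example-bucket", config)` would apply the rule to a bucket you own.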

### e. Amazon FSx for Lustre
When your training data is already in Amazon S3 and you plan to run training jobs several times using different algorithms and parameters, consider using Amazon FSx for Lustre, a file system service. FSx for Lustre speeds up your training jobs by serving your Amazon S3 data to Amazon SageMaker at high speeds. The first time you run a training job, FSx for Lustre automatically copies data from Amazon S3 and makes it available to Amazon SageMaker. You can use the same Amazon FSx file system for subsequent iterations of training jobs, preventing repeated downloads of common Amazon S3 objects.

![image.png](attachment:image.png)
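A training job can read from a file system instead of Amazon S3 by using a file system data source. The sketch below builds such a channel in the shape of the `FileSystemDataSource` structure in the SageMaker `CreateTrainingJob` API. The file system ID and directory path are placeholders, and a real job would also need a VPC configuration that can reach the file system.

```python
def file_system_channel(file_system_id, directory_path,
                        file_system_type="FSxLustre", channel_name="train"):
    """Build one InputDataConfig entry backed by FSx for Lustre or Amazon EFS."""
    if file_system_type not in ("FSxLustre", "EFS"):
        raise ValueError("file_system_type must be 'FSxLustre' or 'EFS'")
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": file_system_type,
                "FileSystemAccessMode": "ro",   # training only needs read access
                "DirectoryPath": directory_path,
            }
        },
    }


# Placeholder identifiers for illustration only.
fsx_channel = file_system_channel("fs-0123456789abcdef0", "/fsx/training-data")
print(fsx_channel)
```

Because the file system persists between jobs, later training runs can reuse the same channel without downloading the S3 objects again.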

### f. Amazon Elastic File System (Amazon EFS)
If your training data is already in Amazon EFS, you can use it directly as the training data source. This lets you launch training jobs from the file system without moving data first, which results in faster training start times. This is often the case in environments where data scientists have home directories in Amazon EFS and are quickly iterating on their models by bringing in new data, sharing data with colleagues, and experimenting with including different fields or labels in their dataset. For example, a data scientist can use a Jupyter notebook to do initial cleansing on a training set, launch a training job from Amazon SageMaker, then use their notebook to drop a column and re-launch the training job, comparing the resulting models to see which works better.

![image-2.png](attachment:image-2.png)

### g. Amazon EBS Volumes

An Amazon Elastic Block Store (Amazon EBS) volume is a block storage device that you attach to an EC2 instance, much like a hard drive. Volumes are available in a range of sizes, and each volume is automatically replicated within its Availability Zone to increase availability.

**Types of EBS Volumes**

*General Purpose SSD* - Use these volumes as boot volumes, for virtual machines, and for medium-size databases. They are a good fit for development and testing environments.

*Provisioned IOPS SSD* - Use these volumes when a workload needs more sustained I/O than General Purpose SSD delivers. A General Purpose SSD (gp3) volume has a baseline of 3,000 IOPS, whereas a Provisioned IOPS SSD (io1) volume can be provisioned with up to 64,000 IOPS. This makes Provisioned IOPS SSD suitable for large, I/O-intensive databases and other critical workloads.

*Magnetic Storage volume* - Use magnetic volumes for infrequently accessed data and nonessential workloads. They average about 100 IOPS per volume.

#### When choosing a file system, take into consideration the training load time 

The table below shows an example of some different file systems and the relative rate that they can transfer images to a compute cluster. 

| File System | Relative Speed* |
| :--- | ---: |
| Amazon S3 | <1.00 |
| Amazon EFS| 1 |
| Amazon EBS | 1.29 |
| Amazon FSx | >1.6 |

**Comparison of the relative (to Amazon EFS) images per second that each file system can load*
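To make the table concrete, the sketch below converts relative speed into an estimated load time, assuming load time is inversely proportional to images per second. The 600-second Amazon EFS baseline is a made-up figure for illustration, and the Amazon FSx value of 1.6 is only the lower bound from the table. Amazon S3 is omitted because the table gives only an upper bound for it.

```python
# Relative speeds from the table (Amazon EFS = 1). The FSx value is a lower bound.
RELATIVE_SPEED = {"Amazon EFS": 1.0, "Amazon EBS": 1.29, "Amazon FSx": 1.6}


def estimated_load_seconds(efs_seconds, relative_speed):
    """Estimate load time, assuming it scales inversely with relative speed."""
    return efs_seconds / relative_speed


efs_baseline = 600  # hypothetical: loading the dataset from Amazon EFS takes 10 minutes
for name, speed in RELATIVE_SPEED.items():
    print(f"{name}: about {estimated_load_seconds(efs_baseline, speed):.0f} s")
```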

## Part-Two Identify and implement a data ingestion solution

**Important Topics**
* Data Ingestion Solution & Types
* Amazon Kinesis
* Amazon Kinesis Data Streams
* Amazon Kinesis Data Firehose
* Amazon Kinesis Data Analytics
* Amazon Kinesis Video Streams
* AWS Glue
* Apache Kafka

One of the core benefits of a data lake solution is the ability to quickly and easily ingest multiple types of data. In some cases, your data will reside outside your Amazon S3 data lake solution, in databases, on-premises storage platforms, data warehouses, and other locations. To use this data for ML, you may need to ingest it into a storage service like Amazon S3.

### a. Batch and stream processing are two kinds of data ingestion

#### 1. Batch processing 
With batch processing, the ingestion layer periodically collects and groups source data and sends it to a destination like Amazon S3. You can process groups based on any logical ordering, the activation of certain conditions, or a simple schedule. Batch processing is typically used when there is no real need for real-time or near-real-time data, because it is generally easier and more affordably implemented than other ingestion options.

![image.png](attachment:image.png)
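A minimal sketch of the grouping step in batch ingestion: records are collected into fixed-size groups, and each group would be written to one object in Amazon S3. The batch size, key prefix, and record shape are assumptions for this example, and the actual upload is left out.

```python
from itertools import islice


def batches(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def object_key(prefix, batch_number):
    """Name the S3 object that would hold one batch."""
    return f"{prefix}/batch-{batch_number:05d}.jsonl"


# Seven hypothetical source records, grouped three at a time.
source = ({"order_id": i} for i in range(7))
plan = [
    (object_key("ingest/orders", number), len(chunk))
    for number, chunk in enumerate(batches(source, 3))
]
print(plan)
```

Grouping by a schedule or by a triggering condition follows the same pattern, with the chunk boundary decided by time or by an event rather than by count.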

**Several services can help with batch processing into the AWS Cloud**

### b. AWS Glue 
AWS Glue is a serverless, event-driven ETL (extract, transform, and load) data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. Because it is serverless, AWS Glue provisions and manages the compute resources that your ETL jobs need, and jobs can run on a schedule, on demand, or in response to events.

![image-2.png](attachment:image-2.png)
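Once an ETL job is defined in AWS Glue, a run can be started on demand. The sketch below builds the keyword arguments for boto3's Glue `start_job_run` call. The job name and the two job parameters are hypothetical, since they depend on how your own job script is written.

```python
def glue_job_run_args(job_name, source_path, target_path):
    """Build keyword arguments for starting one run of an existing Glue job."""
    return {
        "JobName": job_name,
        # Job parameters are passed as strings; the "--" prefix is part of the key.
        "Arguments": {
            "--source_path": source_path,
            "--target_path": target_path,
        },
    }


# Placeholder job name and S3 locations.
run_args = glue_job_run_args(
    "clean-orders-job",
    "s3://example-bucket/raw/orders/",
    "s3://example-bucket/clean/orders/",
)
print(run_args)

# With AWS credentials configured, the run would be started with:
#   import boto3
#   run_id = boto3.client("glue").start_job_run(**run_args)["JobRunId"]
```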

**AWS Database Migration Service (AWS DMS)** - Another service that helps with batch ingestion. It reads historical data from source systems, such as relational database management systems, data warehouses, and NoSQL databases, at any desired interval. You can also automate ETL tasks that involve complex workflows by using AWS Step Functions.

#### 2. Stream processing
Stream processing, which includes real-time processing, involves no grouping at all. Data is sourced, manipulated, and loaded as soon as it is created or recognized by the data ingestion layer. This kind of ingestion is less cost-effective, since it requires systems to constantly monitor sources and accept new information. But you might want to use it for real-time predictions using an Amazon SageMaker endpoint that you want to show your customers on your website or some real-time analytics that require continually refreshed data, like real-time dashboards.

![image.png](attachment:image.png)

### c. Amazon Kinesis
Amazon Kinesis is a platform for streaming data on AWS so you can get timely insights and react quickly to new information. Amazon Kinesis offers key capabilities to cost-effectively process streaming data at any scale, along with the flexibility to choose the tools that best suit the requirements of your application.

![image.png](attachment:image.png)


### d. Amazon Kinesis Data Streams
Amazon Kinesis Data Streams is a scalable and durable real-time data streaming service that can continuously capture gigabytes of data per second from hundreds of thousands of sources. 
You can use the Kinesis Producer Library (KPL), an intermediary between your producer application code and the Kinesis Data Streams API, to write to a Kinesis data stream. With the Kinesis Client Library (KCL), you can build your own consumer application to preprocess the streaming data as it arrives and emit the data for generating incremental views and downstream analysis.
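For a simple producer, the plain Kinesis Data Streams API can be used instead of the KPL. The sketch below builds the keyword arguments for boto3's Kinesis `put_record` call. The stream name and event fields are hypothetical, and using the device ID as the partition key is one possible choice, not a requirement.

```python
import json


def kinesis_put_record_args(stream_name, event, partition_field):
    """Build keyword arguments for writing one event to a Kinesis data stream."""
    return {
        "StreamName": stream_name,
        # The record payload must be bytes.
        "Data": json.dumps(event).encode("utf-8"),
        # Records with the same partition key are routed to the same shard.
        "PartitionKey": str(event[partition_field]),
    }


# Hypothetical sensor reading.
event = {"device_id": "sensor-7", "temperature": 21.5}
record_args = kinesis_put_record_args("example-stream", event, "device_id")
print(record_args)

# With AWS credentials configured, the write would be:
#   import boto3
#   boto3.client("kinesis").put_record(**record_args)
```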


### e. Amazon Kinesis Data Firehose
Amazon Kinesis Data Firehose is the easiest way to capture, transform, and load data streams into AWS data stores for near-real-time analytics with existing business intelligence tools.
As data is ingested in real time, you can use Amazon Kinesis Data Firehose to batch and compress the data to generate incremental views. Kinesis Data Firehose also lets you run custom transformation logic using AWS Lambda before delivering the incremental view to Amazon S3.


### f. Amazon Kinesis Data Analytics
Amazon Kinesis Data Analytics is the easiest way to process data streams in real time with SQL or Apache Flink without having to learn new programming languages or processing frameworks. This lets you gain actionable insights in near-real time from the incremental stream before storing it in Amazon S3.


### g. Amazon Kinesis Video Streams
Amazon Kinesis Video Streams makes it easy to securely stream video and audio from connected devices to AWS for analytics, machine learning (ML), and other processing.
For example, a leading home security provider ingests audio and video from their home security cameras using Kinesis Video Streams. They then attach their own custom ML models running in Amazon SageMaker to detect and analyze objects to build richer user experiences.



**Difference between Data Streams and Data Firehose**
![image-2.png](attachment:image-2.png)
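One feature specific to Kinesis Data Firehose is that it can call an AWS Lambda function to transform records before delivery. The sketch below is a minimal transformation handler that follows the record contract Firehose uses (`recordId`, `result`, and base64-encoded `data`). The sensor payload and the Fahrenheit-to-Celsius conversion are invented for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Convert temperature_f to temperature_c in every incoming Firehose record."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        fahrenheit = payload.pop("temperature_f")
        payload["temperature_c"] = round((fahrenheit - 32) * 5 / 9, 2)
        # Newline-delimit the JSON so delivered objects have one record per line.
        transformed = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],   # must echo the incoming ID
            "result": "Ok",                   # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}


# Local dry run with one hand-built record.
raw = json.dumps({"sensor": "a", "temperature_f": 212}).encode("utf-8")
sample_event = {"records": [{"recordId": "1", "data": base64.b64encode(raw).decode("utf-8")}]}
result = lambda_handler(sample_event, None)
print(base64.b64decode(result["records"][0]["data"]))
```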

### h. Apache Kafka
Apache Kafka is an open-source distributed event streaming platform used for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

Kafka provides three main functions to its users:
* Publish and subscribe to streams of records
* Effectively store streams of records in the order in which records were generated
* Process streams of records in real time

Kafka is primarily used to build real-time streaming data pipelines and applications that adapt to the data streams. It combines messaging, storage, and stream processing to allow storage and analysis of both historical and real-time data.  That is why Kafka is also often used as a message broker solution, which is a platform that processes and mediates communication between two applications.
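The combination of ordered storage and publish/subscribe can be illustrated with a toy in-memory log. This is not the Kafka client API; `TopicLog` and its methods are invented here to show the idea that records are appended in order and each consumer group tracks its own read offset, so real-time and historical readers can share the same stream.

```python
from collections import defaultdict


class TopicLog:
    """Toy stand-in for one topic partition: an append-only list plus per-group offsets."""

    def __init__(self):
        self._records = []
        self._offsets = defaultdict(int)

    def publish(self, record):
        """Append a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer_group, max_records=10):
        """Return the next unread records for this group and advance its offset."""
        start = self._offsets[consumer_group]
        chunk = self._records[start:start + max_records]
        self._offsets[consumer_group] = start + len(chunk)
        return chunk


log = TopicLog()
for event in ("login", "click", "purchase"):
    log.publish(event)

realtime = log.poll("dashboard", max_records=2)   # one group reads as events arrive
replay = log.poll("nightly-report")               # another group reads the full history
print(realtime, replay)
```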

## Part-Three Identify and implement a data transformation solution

**Important Topics**
* Apache Spark on Amazon EMR
* Apache Spark and Amazon SageMaker

*Data Transformation*<br>
The raw data ingested into a service like Amazon S3 is usually not ML ready as is. The data needs to be transformed and cleaned, which includes deduplication, incomplete data management, and attribute standardization. Data transformation can also involve changing the data structures, if necessary, usually into an OLAP model to facilitate easy querying of data. 
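The three cleaning steps named above can be sketched in plain Python. The record fields and the specific rules (trim and lower-case strings, drop rows with a missing required field, drop exact duplicates after standardization) are assumptions chosen for illustration; real pipelines would usually do this at scale with a tool such as Spark or AWS Glue.

```python
def clean_records(records, required_fields):
    """Standardize attributes, drop incomplete rows, and remove duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        # Attribute standardization: trim whitespace and lower-case string values.
        normalized = {
            key: value.strip().lower() if isinstance(value, str) else value
            for key, value in record.items()
        }
        # Incomplete data management: skip rows missing any required field.
        if any(normalized.get(field) in (None, "") for field in required_fields):
            continue
        # Deduplication: skip rows identical to one already kept.
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned


raw = [
    {"customer_id": 1, "country": " US "},
    {"customer_id": 1, "country": "us"},     # duplicate after standardization
    {"customer_id": 2, "country": None},     # incomplete
    {"customer_id": 3, "country": "DE"},
]
cleaned = clean_records(raw, required_fields=["customer_id", "country"])
print(cleaned)
```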

### a. Apache Spark on Amazon Elastic MapReduce (EMR)
**Apache Spark** is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. Spark is designed for low-latency computation: it can process data interactively and handle streaming data efficiently. The Spark framework is often used within the context of machine learning workflows to run data transformation or feature engineering workloads at scale.

Using Apache Spark on Amazon EMR provides a managed framework that can process massive quantities of data. Amazon EMR supports many instance types that have proportionally high CPU with increased network performance, which is well suited for HPC (high-performance computing) applications.

![image.png](attachment:image.png)

**ETL processing services:**
* Amazon Athena
* AWS Glue
* Amazon Redshift Spectrum

The choice of ETL processing tool is largely dictated by the type of data you have. For example, tabular data processing with Athena lets you manipulate your data files in Amazon S3 using SQL. If your datasets or computations are not optimally compatible with SQL, you can use AWS Glue to seamlessly run Spark jobs (Scala and Python support) on data stored in your Amazon S3 buckets.

**Amazon Athena** is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
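As a sketch of querying data in Amazon S3 with SQL, the helper below builds the keyword arguments for boto3's Athena `start_query_execution` call. The database, table, column names, and result location are placeholders and assume a table has already been defined over the S3 data.

```python
def athena_query_args(sql, database, output_location):
    """Build keyword arguments for submitting one SQL query to Athena."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical table and columns.
query_args = athena_query_args(
    "SELECT country, COUNT(*) AS orders FROM orders GROUP BY country",
    database="example_db",
    output_location="s3://example-bucket/athena-results/",
)
print(query_args)

# With AWS credentials configured, the query would be submitted with:
#   import boto3
#   query_id = boto3.client("athena").start_query_execution(**query_args)["QueryExecutionId"]
```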

### b. Apache Spark and Amazon SageMaker

Amazon SageMaker provides a set of prebuilt Docker images that include Apache Spark and other dependencies needed to run distributed data processing jobs on Amazon SageMaker. With the Amazon SageMaker Python SDK, you can easily apply data transformations and extract features (feature engineering) using the Spark framework. 
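A rough sketch of launching a Spark processing job with the SageMaker Python SDK's `PySparkProcessor` follows. The job name, instance settings, framework version, script path, and the `--input`/`--output` argument names are assumptions for this example; the script itself would contain your own PySpark transformation code. The SDK import is done inside the function so the snippet loads even where the SDK is not installed.

```python
def spark_job_arguments(input_uri, output_uri):
    """Command-line arguments handed to the PySpark script (names are hypothetical)."""
    return ["--input", input_uri, "--output", output_uri]


def run_spark_processing(role_arn, script_path, input_uri, output_uri):
    """Submit a PySpark script as a SageMaker processing job (needs the SDK and credentials)."""
    from sagemaker.spark.processing import PySparkProcessor

    processor = PySparkProcessor(
        base_job_name="spark-preprocess",   # placeholder job name
        framework_version="3.1",            # Spark version of the prebuilt image
        role=role_arn,
        instance_count=2,
        instance_type="ml.m5.xlarge",
        max_runtime_in_seconds=1200,
    )
    processor.run(
        submit_app=script_path,
        arguments=spark_job_arguments(input_uri, output_uri),
    )


# Only the argument list is built here; nothing is submitted.
job_arguments = spark_job_arguments(
    "s3://example-bucket/raw/", "s3://example-bucket/features/"
)
print(job_arguments)
```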