# Data Lakes in AWS

## Overview

A typical ML project spends 70% of its effort on data preparation and cleaning. A data lake can help streamline the data-related parts of an ML project.

Data Lake vs Data Warehouse

* A data lake is a vast pool of raw data, the purpose of which is not yet defined. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.

AWS whitepaper - [Data Lake on AWS](https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/building-data-lake-aws.html)

Aspects of data lakes

* Storage 
* Governance
* Analytics

AWS data lake related services

* Storage
    * S3 - storage
    * Glacier - backup and archival
    
* Ingestion
    * Kinesis Firehose
    * Storage Gateway - moves on-premises data to the cloud. Three modes: file share, block storage, or virtual tape device.
    * Large data migration - Snowball appliance (petabyte scale transfers) and Snowmobile (exabyte scale).
    * SDK, CLI and more to store data in S3
    
* Data Catalog
    * Without a metadata catalog to aid discovery of data in the data lake, you end up with a data swamp.
    * A data swamp is a deteriorated, unmanaged data lake that is either inaccessible to its intended users or provides little value.
    * DIY - make data discoverable and usable. Build your own catalog using S3, Lambda, Elasticsearch, and DynamoDB to maintain metadata.
    * Glue - data catalog (metadata repository). Automatically crawls and collects metadata from S3, DynamoDB, and any database that supports JDBC connectivity.
    
## Kinesis

Allows you to ingest, buffer, and process streaming data in real-time.

Streaming data

* Generated continuously from thousands of sources. 
* Small payloads

* Batch processing - data is ingested and stored, then processed in batches at various points in time (hourly, daily, etc.).
* Stream processing - data is analyzed as it arrives, with responses in seconds.
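
The difference between the two styles can be sketched in plain Python. This is only an illustration of the processing model, not Kinesis code; the `(sensor_id, reading)` record shape and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, int]  # (sensor_id, reading) - hypothetical record shape


def batch_totals(stored_events: Iterable[Event]) -> Dict[str, int]:
    """Batch style: all events are already stored; compute totals in one pass."""
    totals: Dict[str, int] = defaultdict(int)
    for sensor_id, reading in stored_events:
        totals[sensor_id] += reading
    return dict(totals)


def stream_totals(events: Iterable[Event]) -> Iterator[Dict[str, int]]:
    """Stream style: emit an updated result as each event arrives."""
    totals: Dict[str, int] = defaultdict(int)
    for sensor_id, reading in events:
        totals[sensor_id] += reading
        yield dict(totals)  # a fresh snapshot is available immediately


events = [("s1", 3), ("s2", 5), ("s1", 4)]
final_batch = batch_totals(events)        # one answer, after everything is stored
snapshots = list(stream_totals(events))   # one answer per arriving event
```

The batch function produces a single result once the data is at rest, while the streaming generator yields a usable result after every event, which is what makes second-level response times possible.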

Kinesis Capabilities

* Video streams - use for video playback, monitoring, Rekognition, etc.
* Data streams - data streaming; process with Kinesis Data Analytics, Spark on EMR, EC2, or Lambda
* Firehose - collect data and load it directly into S3, Redshift, Elasticsearch, and Splunk
* Kinesis Data Analytics - process streams using SQL or Flink

## Data Formats and Tools

Choosing the right format can lower storage cost and improve query performance.

One of the core values of a data lake is that it is the collection point and repository for all of an organization's data assets, in whatever their native formats are.

Recommendation:

* Collect in native format
* Transform data in data lake

Data Organization

* Row storage - optimized for reading entire rows
* Column storage - optimized for reading a subset of columns
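
A minimal sketch of the two layouts, using plain Python lists and dicts rather than any real file format. The sample records and the `to_columns` helper are made up for illustration.

```python
from typing import Any, Dict, List

# Row-oriented: each record is stored together.
rows: List[Dict[str, Any]] = [
    {"id": 1, "city": "Austin", "temp_c": 31.0},
    {"id": 2, "city": "Oslo", "temp_c": 12.5},
    {"id": 3, "city": "Lima", "temp_c": 19.0},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot row records into one list per column."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Reading one whole record is a single lookup in the row layout...
second_row = rows[1]

# ...while an aggregate over one column touches only that column's values
# in the columnar layout.
mean_temp = sum(columns["temp_c"]) / len(columns["temp_c"])
```

In the columnar layout the `id` and `city` values are never touched when averaging `temp_c`, which is the reason column formats suit analytic queries over a few columns.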





### Data Formats - Text


| Format | Organization | Use |
| --- | --------- | :------------ |
| csv, tsv | row | easy to use, no data type support, duplication when used for hierarchical data, not optimized for reading specific columns |
| json | row | format of choice for communication between web services, supports data types, efficiently represents hierarchical data |
| json lines | row | newline-delimited JSON, convenient for processing one record at a time |
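
The trade-offs in the table can be seen with the standard library alone. This sketch round-trips two made-up records through JSON Lines and CSV; the `|` separator used to flatten the nested list is an arbitrary choice for the example.

```python
import csv
import io
import json

records = [
    {"id": 1, "name": "sensor-a", "tags": ["roof", "north"]},
    {"id": 2, "name": "sensor-b", "tags": ["lobby"]},
]

# JSON Lines: one JSON document per line; types and nesting survive.
jsonl_text = "\n".join(json.dumps(r) for r in records)
from_jsonl = [json.loads(line) for line in jsonl_text.splitlines()]

# CSV: the nested list has to be flattened (here joined with "|"),
# and every value comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name", "tags"])
writer.writeheader()
for r in records:
    writer.writerow({**r, "tags": "|".join(r["tags"])})
from_csv = list(csv.DictReader(io.StringIO(buffer.getvalue())))
```

After the round trip, `from_jsonl` equals the original records, while `from_csv` holds `"1"` instead of `1` and a flattened tag string that the reader must know how to split.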

### Data Formats - Binary

| Format | Organization | Use |
| --- | --------- | :------------ |
| Parquet | Columnar | ideal for use cases that require only a subset of columns, efficiently query large amounts of data, write once read many (WORM), compressed storage, extensive tool support, data type support |
| ORC | Columnar | optimized row columnar storage, similar to Parquet |
| Avro | row | ideal for write-heavy use cases, ideal for scenarios where you need to read the entire record, data type support |

Example of columnar storage [here](https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html). 

### Data Transformation

Recommended Approach

* Collect in native format
* Transform in data lake

Approaches

| Service | Purpose | Use |
| --- | --------- | :------------ |
| Amazon EMR | big data prep and processing | managed Hadoop environment, support for tools like Spark, Hive, HBase, support for ML tools like TensorFlow and MXNet |
| Glue | ETL | automatically generates ETL scripts, schedules and runs them on a managed Spark environment |
| Kinesis Firehose | streaming data transformation | transforms streaming data to Parquet or ORC, delivers transformed data to AWS data stores, backs up original data to S3 |
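
Besides format conversion, Firehose can invoke a Lambda function on each buffered batch of records. The sketch below follows the record contract as I understand it from the Firehose documentation (each input record carries `recordId` and base64 `data`; each output record returns `recordId`, `result`, and base64 `data`). The `temp_f` and `temp_c` fields are hypothetical and only there to give the handler something to transform.

```python
import base64
import json


def lambda_handler(event, context):
    """Convert a hypothetical temp_f field to temp_c on each record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["temp_c"] = round((payload.pop("temp_f") - 32) * 5 / 9, 2)
            # Newline-terminate so the delivered objects are JSON Lines.
            data = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": data.decode("utf-8"),
            })
        except (ValueError, KeyError, TypeError, AttributeError):
            # Records that cannot be parsed are flagged, not silently dropped.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Returning the original `data` for failed records keeps the raw payload available for inspection, which fits the "backup original data to S3" idea in the table.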


EMR format conversion approaches

* Source format to S3 -> load to Hive tables -> export to target format
* Source format to S3 -> Spark -> transform data to desired format
* Point Glue at the source and target; Glue generates the scripts, creates the Spark cluster, runs the job, etc.


## In-Place Analytics and Portfolio of Tools


A couple of options:

* Athena -> SQL -> S3
* Redshift spectrum -> SQL -> S3

| Service | Purpose | Use |
| --- | --------- | :------------ |
| Athena | In-place SQL query | Query data in S3 without needing to ETL it into a separate service or platform, charged based on amount of data scanned |
| Redshift Spectrum | In-place SQL query (Redshift-compatible SQL) | Query data in S3 without needing to ETL it into a separate service or platform, more suitable for complex queries and large datasets (up to exabytes) |

Recommendations

* Athena - ad hoc queries, discovery
* Redshift - more complex queries, large numbers of users
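
Because Athena charges by data scanned, the storage format directly affects query cost. The arithmetic below is a rough sketch: the rate of 5 USD per TB is an assumption (check current pricing for your region), the table is hypothetical, and compression is ignored.

```python
TB = 1024 ** 4  # bytes per terabyte (binary convention assumed here)


def scan_cost_usd(bytes_scanned: float, usd_per_tb: float = 5.0) -> float:
    """Estimate query cost from bytes scanned; usd_per_tb is an assumed rate."""
    return bytes_scanned / TB * usd_per_tb


# Hypothetical 2 TB table with 20 equally sized columns.
table_bytes = 2 * TB
full_scan = scan_cost_usd(table_bytes)              # row format: read everything
two_columns = scan_cost_usd(table_bytes * 2 / 20)   # columnar: read 2 of 20 columns
```

Under these assumptions a query that needs only two columns scans a tenth of the data, and costs a tenth as much, when the table is stored in a columnar format.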

### Streaming Query

| Service | Purpose | Use |
| --- | --------- | :------------ |
| Kinesis Data Analytics | Streaming data SQL query | Query and analyze streaming data with SQL |

### Broader Analytics Portfolio

| Service | Purpose | Use |
| --- | --------- | :------------ |
| Amazon EMR | Hadoop ecosystem tools | Run a variety of workloads using Spark, Hive, Pig, HBase, TensorFlow, MXNet, etc. |
| SageMaker | Machine learning | Managed machine learning service with a variety of algorithms |
| Artificial Intelligence | Video, Image, Natural Language | Pre-trained, ready-to-use AI services for video analysis, speech and natural language processing, etc. |
| Quicksight | Business intelligence | Managed BI tool to create interactive dashboards |
| Redshift | Data warehouse (columnar storage) | Managed petabyte scale data warehouse. SQL based querying and easily integrates with your existing BI tools |
| Lambda | Business logic | Serverless backend processing logic with trigger-based code execution |

## Monitoring and Optimization

CloudWatch

* Track operational metrics
* Set alarms
* Use metrics to scale resources

Logs

* Consolidate logs
* Scrape and alert

CloudTrail

* Audit trail - who, what, when
* Logs land in S3, can query using Athena

Cost Optimization

* S3 lifecycle management
    * Transition objects from Standard to Infrequent Access to Glacier
* S3 storage class analysis
* Intelligent tiering
* Amazon Glacier and Glacier Deep Archive
* Data formats
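
A lifecycle rule that tiers objects from Standard to Infrequent Access to Glacier could be expressed as below. The dict follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts, to the best of my knowledge; the `raw/` prefix, the day counts, and the `lifecycle_rule` helper are hypothetical. Nothing here calls AWS.

```python
import json


def lifecycle_rule(prefix: str, ia_days: int, glacier_days: int) -> dict:
    """Build one lifecycle rule that tiers objects under `prefix`."""
    if not 0 < ia_days < glacier_days:
        raise ValueError("expected 0 < ia_days < glacier_days")
    return {
        "ID": f"tier-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
        ],
    }


config = {"Rules": [lifecycle_rule("raw/", ia_days=30, glacier_days=90)]}
print(json.dumps(config, indent=2))
# With boto3 installed and credentials configured, this dict could be passed as
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake", LifecycleConfiguration=config)  # bucket name is made up
```

Scoping the rule to a prefix lets raw landing data age out to cheaper tiers while curated data stays in Standard.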

## Security and Protection

By nature, a data lake consolidates all data into a single place. Therefore it is important to protect the data lake from a security perspective.

Access can be controlled with resource-based policies and user-based policies.

Resource-based policies

* Attached directly to a resource such as a bucket or object


User-based policies

* Permissions granted to users and groups

It is easier to manage permissions at the user/group level, but consider resource-based policies for rules that must be enforced for everyone.
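
As an example of a rule enforced for everyone, a bucket policy can deny any upload that is not encrypted with SSE-KMS. The sketch below only builds the policy document; the bucket name and the `require_kms_encryption_policy` helper are hypothetical, and the policy should be reviewed before use in a real account.

```python
import json


def require_kms_encryption_policy(bucket: str) -> str:
    """Return a bucket policy that denies uploads not encrypted with SSE-KMS."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            }
        ],
    }
    return json.dumps(policy, indent=2)


policy_json = require_kms_encryption_policy("my-data-lake")  # made-up bucket name
```

Because the statement is an explicit `Deny` attached to the bucket, it applies regardless of what individual user or group policies allow.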

S3 Data Encryption

* Protection of data at rest
* Integrated with KMS
* Use KMS customer master keys (CMKs) to encrypt data; without permission to use the key, objects cannot be read even if the bucket is made public

### Protection

 A data lake must protect data against corruption, loss, accidental or malicious overwrites, modifications, and deletions.
 
 S3
 
 * Designed for 11 nines (99.999999999%) of durability
 * Versioning - protects against accidental and malicious deletes
     * Can use lifecycle rules to manage versions
     * Can use MFA for delete protection
 * Cross-region replication
 * Object tagging for object level metadata, define access policies based on tags
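
How versioning protects against deletes can be shown with a toy in-memory model. This is not the S3 API; `VersionedStore` and its methods are hypothetical and only mimic the idea that a delete adds a marker on top of the version stack instead of erasing data.

```python
from typing import Dict, List, Optional

DELETE_MARKER = object()  # stands in for an S3 delete marker


class VersionedStore:
    """Toy in-memory model of a versioned bucket (not the S3 API)."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[object]] = {}

    def put(self, key: str, body: bytes) -> None:
        # A put appends a new version; earlier versions are kept.
        self._versions.setdefault(key, []).append(body)

    def delete(self, key: str) -> None:
        # A delete hides the object behind a marker but removes nothing.
        self._versions.setdefault(key, []).append(DELETE_MARKER)

    def get(self, key: str) -> Optional[bytes]:
        versions = self._versions.get(key, [])
        if not versions or versions[-1] is DELETE_MARKER:
            return None
        return versions[-1]

    def undelete(self, key: str) -> None:
        """Remove a trailing delete marker, exposing the prior version again."""
        versions = self._versions.get(key, [])
        if versions and versions[-1] is DELETE_MARKER:
            versions.pop()


store = VersionedStore()
store.put("report.csv", b"v1")
store.put("report.csv", b"v2")
store.delete("report.csv")
after_delete = store.get("report.csv")   # hidden by the delete marker
store.undelete("report.csv")
after_restore = store.get("report.csv")  # latest real version is back
```

Since old versions accumulate, this model also hints at why lifecycle rules for managing versions matter for cost.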
 
 