AWS Big Data Specialty Cert Study

Disclaimer: this guide covers the tail end of my studying, so it is not complete; it mostly focuses on trivia/gotchas.

Anti-patterns are taken from the AWS Big Data Whitepaper.

Data Collection

Kinesis Data Streams

  • Service notes
    • 200 ms latency with one standard consumer, ~70 ms with enhanced fan-out (HTTP/2 push); both are considered real time
    • Server side encryption supported (default CMK, user managed CMK, or KMS imported key material)
  • Limits
    • retention period default 24 hours, max 7 days
    • a shard can ingest 1 MiB/s or 1,000 records/sec; max message size is 1 MiB (see the producer sketch after this list)
    • consumer throughput is 2 MiB/s per shard; a single GetRecords call returns at most 10,000 records or 10 MiB
  • Anti-patterns
    • small scale consistent throughput (less than 200 KB/sec)
    • long term storage/analytics
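
A minimal producer sketch using boto3 (stream name and event payloads are hypothetical); it shows how the partition key maps records to shards and why PutRecords responses need checking:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical click events; in practice these come from your application.
events = [{"user": "u1", "action": "click"}, {"user": "u2", "action": "view"}]

response = kinesis.put_records(
    StreamName="example-stream",  # hypothetical stream name
    Records=[
        {
            "Data": json.dumps(e).encode("utf-8"),
            # Records with the same partition key always land on the same shard,
            # which is what the 1 MiB/s per-shard ingest limit applies to.
            "PartitionKey": e["user"],
        }
        for e in events
    ],
)

# PutRecords is not all-or-nothing: failed records must be retried explicitly.
if response["FailedRecordCount"]:
    print("some records need retrying")
```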

Kinesis Data Firehose

  • Service notes
    • "near real time"
    • destinations: S3, Redshift, Elasticsearch, Splunk
    • stores records for up to 24 hours in case the downstream system is unavailable
    • can back up pre-transformation records and log transformation/delivery errors
    • a Kinesis Data Stream can feed a Firehose stream
  • Limits
    • max record size 1,000 KiB; throughput limits are soft limits (see the producer sketch after this list)
    • buffer hints:
      • S3: 1-128 MiB
      • Amazon Elasticsearch Service (direct delivery; failed or all records can be backed up to S3): 1-100 MiB
      • Lambda processor: 1-3 MiB
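
A minimal Firehose producer sketch with boto3 (the delivery stream name is hypothetical). Firehose holds records until a buffer hint is hit, so nothing may appear at the destination until the size or interval threshold passes:

```python
import json
import boto3

firehose = boto3.client("firehose")

record = {"metric": "cpu", "value": 73}

firehose.put_record(
    DeliveryStreamName="example-delivery-stream",  # hypothetical name
    # Firehose concatenates records on delivery, so a trailing newline is a
    # common convention to keep records separable in S3.
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```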

Kinesis Data Analytics

  • Service notes
    • Input streams: Kinesis data streams or Firehose, static reference tables from S3 for joins
    • Output streams can be written to Kinesis Data Streams, Firehose, or trigger Lambda
      • An event is emitted about once per second for a sliding window and once per time period for a tumbling window (see the sketch after this list)
    • pumps select from one stream and insert into the next stream
    • You can have multiple writers insert into an in-application stream, and there can be multiple readers selected from the stream. Think of an in-application stream as implementing a publish/subscribe messaging paradigm.

    • Random Cut Forest: takes one or more numeric columns and outputs an anomaly score using a machine learning model built during execution time
    • Window types:
      • Stagger Windows: A query that aggregates data using keyed time-based windows that open as data arrives. The keys allow for multiple overlapping windows. Using stagger windows is a windowing method that is suited for analyzing groups of data that arrive at inconsistent times.
      • Tumbling Windows: A query that aggregates data using distinct time-based windows that open and close at regular intervals.
      • Sliding Windows: A query that aggregates data continuously, using a fixed time or rowcount interval. It re-evaluates the window each time a new event comes in (blog)
  • Limits
    • 1 input stream, 1 reference source, 3 output streams
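
A sketch of a tumbling-window application, assuming the default in-application stream name SOURCE_SQL_STREAM_001 and hypothetical application/column names; the pump pattern (select from one stream, insert into the next) and the once-per-period emission are visible in the SQL:

```python
import boto3

# Count events per ticker in non-overlapping one-minute windows;
# FLOOR(ROWTIME TO MINUTE) closes the window (and emits a row) once per minute.
application_code = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (ticker VARCHAR(4), ticker_count INTEGER);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM ticker, COUNT(*) AS ticker_count
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY ticker, FLOOR("SOURCE_SQL_STREAM_001".ROWTIME TO MINUTE);
"""

client = boto3.client("kinesisanalytics")
client.create_application(
    ApplicationName="example-analytics-app",  # hypothetical
    ApplicationCode=application_code,
    # Inputs/Outputs wiring (stream ARNs, roles) omitted for brevity.
)
```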

Kinesis Video Streams

  • Service notes
    • Consuming video streams
      • GetMedia API: real-time stream for building your own consumer with the Stream Parser Library (see the sketch after this list)
      • HLS: 3-5 second latency standard HTTP streaming
    • You can also send non-video time-serialized data such as audio data, thermal imagery, depth data, RADAR data, and more

    • Metadata can either be transient, such as to mark an event within the stream, or persistent, such as to identify fragments where a given event is taking place. A persistent metadata item remains, and is applied to each consecutive fragment, until it is canceled.
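
A minimal GetMedia consumer sketch with boto3 (stream name is hypothetical). GetMedia is served from a stream-specific endpoint, so the endpoint has to be looked up first:

```python
import boto3

kvs = boto3.client("kinesisvideo")
endpoint = kvs.get_data_endpoint(
    StreamName="example-video-stream",  # hypothetical
    APIName="GET_MEDIA",
)["DataEndpoint"]

media = boto3.client("kinesis-video-media", endpoint_url=endpoint)
response = media.get_media(
    StreamName="example-video-stream",
    StartSelector={"StartSelectorType": "NOW"},  # start at the live edge
)

# The payload is a raw MKV byte stream; a real consumer would feed it to the
# Stream Parser Library rather than reading chunks by hand.
chunk = response["Payload"].read(1024)
```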

Kinesis Agent

  • stand-alone Java program; reads its configuration file and processes log files
  • can do some transformations like multiline, CSV to JSON, or Apache log to JSON (an example config is sketched below)
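
A sketch of what an agent config might look like (file paths, stream name, and field names are all hypothetical), written out from Python to keep the example self-contained; CSVTOJSON is the agent's built-in CSV-to-JSON transform:

```python
import json

agent_config = {
    "flows": [
        {
            # Tail matching log files and send each line to a Kinesis stream.
            "filePattern": "/var/log/app/*.log",
            "kinesisStream": "example-stream",
            "dataProcessingOptions": [
                {
                    # Convert each CSV line into a JSON object with these keys.
                    "optionName": "CSVTOJSON",
                    "customFieldNames": ["ts", "user", "action"],
                }
            ],
        }
    ]
}

# The agent reads its config from /etc/aws-kinesis/agent.json.
print(json.dumps(agent_config, indent=2))
```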

Kinesis Libraries/Helpers

Kinesis Producer Library (KPL)

  • Key concepts overview
  • manages retries and batching
  • types of batching:
    • Aggregation:
      • batches many small user records into fewer Kinesis data records
      • decreases cost since fewer shards are needed, and also increases client PUT performance
      • aggregated user records go to the same shard, since they travel as a single data record
      • Firehose supports KPL deaggregation when fed from a Kinesis data stream
    • Collection:
      • wait and send multiple data records at the same time using PutRecords API rather than sending each in its own HTTP transaction
      • increases client PUT performance since fewer requests are made
      • records in PutRecords API call don't necessarily have to go to the same shard
  • Monitoring has metric level (NONE, SUMMARY, or DETAILED) and granularity level (GLOBAL, STREAM, or SHARD)
  • integrates with the KCL to deaggregate records (a non-KCL deaggregation sketch follows this list)
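
Outside the KCL, KPL-aggregated records can also be unpacked with the aws-kinesis-agg helper package; a hedged sketch of a Lambda consumer (the package and its deaggregate_records helper come from the amazon-kinesis-aggregation project, not the KPL itself):

```python
import base64
from aws_kinesis_agg.deaggregator import deaggregate_records

def handler(event, context):
    # Each Kinesis data record may carry many KPL-aggregated user records;
    # deaggregate_records expands them back into individual user records.
    for record in deaggregate_records(event["Records"]):
        payload = base64.b64decode(record["kinesis"]["data"])
        print(payload)
```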

Kinesis Client Library (KCL)

  • The KCL takes care of many of the complex tasks associated with distributed computing, such as load balancing across multiple instances, responding to instance failures, checkpointing processed records, and reacting to resharding (the per-shard loop it automates is sketched after this list).

  • uses DynamoDB to coordinate between multiple workers for leases and checkpoints
  • can deaggregate KPL data records into user records
  • monitoring with CloudWatch
    • per-application metrics, across all workers
    • per-worker metrics, across all record processors on that worker
    • per-shard metrics, corresponds to one record processor
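
For contrast, here is roughly the per-shard loop the KCL automates, written with plain boto3 (stream and shard IDs are hypothetical); the KCL adds lease balancing across workers and checkpointing in DynamoDB on top of this:

```python
import time
import boto3

kinesis = boto3.client("kinesis")

iterator = kinesis.get_shard_iterator(
    StreamName="example-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",  # start from the oldest record
)["ShardIterator"]

while iterator:
    result = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in result["Records"]:
        # A KCL record processor would checkpoint this sequence number.
        print(record["SequenceNumber"], record["Data"])
    iterator = result.get("NextShardIterator")  # absent once the shard closes
    time.sleep(0.25)  # stay under the 5 GetRecords calls/sec/shard limit
```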

Data Pipeline

  • a data node is the logical destination; a database object holds the physical connection information
  • Data Nodes (both input and output): DynamoDBDataNode, SqlDataNode, RedshiftDataNode, S3DataNode
  • Activities
    • CopyActivity: S3 and SqlDataNodes input/output
    • EmrActivity: runs an Amazon EMR cluster
    • HiveActivity: runs a Hive query on an Amazon EMR cluster
    • HiveCopyActivity: runs a Hive query on an Amazon EMR cluster with support for advanced data filtering and for S3DataNode and DynamoDBDataNode
    • PigActivity: runs a Pig script on an Amazon EMR cluster
    • RedshiftCopyActivity: copies data to and from Amazon Redshift tables
    • ShellCommandActivity: runs a custom UNIX/Linux shell command as an activity
    • SqlActivity: runs a SQL query on a database
  • Task Runners can run either on your own long-running instances or on EC2/EMR resources that Data Pipeline spins up dynamically
  • Preconditions can be system managed (DynamoDBDataExists, DynamoDBTableExists, S3KeyExists, S3PrefixNotEmpty) or user managed on your own compute resource (Exists, ShellCommandPrecondition); see the sketch below
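
A minimal pipeline-definition sketch with boto3 (names, worker group, and S3 key are hypothetical; roles and schedules are trimmed to the bare minimum). It wires a ShellCommandActivity behind an S3KeyExists precondition, to be executed by a Task Runner polling the given worker group:

```python
import boto3

dp = boto3.client("datapipeline")

pipeline_id = dp.create_pipeline(name="example-pipeline", uniqueId="example-1")["pipelineId"]

dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ]},
        # System-managed precondition: wait until the S3 object exists.
        {"id": "InputReady", "name": "InputReady", "fields": [
            {"key": "type", "stringValue": "S3KeyExists"},
            {"key": "s3Key", "stringValue": "s3://example-bucket/input/ready.flag"},
        ]},
        # The activity runs on whichever Task Runner polls this worker group.
        {"id": "CopyJob", "name": "CopyJob", "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "echo hello"},
            {"key": "workerGroup", "stringValue": "example-group"},
            {"key": "precondition", "refValue": "InputReady"},
        ]},
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)
```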

Data Processing

AWS ML

EMR

  • Open source components
    • Spark
      • Spark SQL: use SQL to process your data (see the PySpark sketch at the end of this section)
      • Spark Streaming: lets you treat streaming data as micro-batches and use the same analysis code on batch and streaming data
      • MLlib: train machine learning models on Spark
      • GraphX: graph queries/algorithms
    • Phoenix: SQL against HBase
    • Sqoop: efficient loading of RDBMS to HDFS
  • EMRFS
  • Other storage options
  • Security configuration
  • Compression
  • EMR Notebooks
    • serverless Jupyter notebooks, contents stored in S3
    • from the announcement:

You can create multiple notebooks directly from the console. There is no software or instances to manage, and notebooks spin up instantly. You can either attach a notebook to an existing cluster or provision a new cluster directly from the console. You can attach multiple notebooks to a single cluster, detach notebooks, and re-attach them to new clusters.
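
A minimal Spark SQL sketch as it might run in an EMR notebook (bucket path and schema are hypothetical); EMRFS is what lets Spark read s3:// paths directly:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Register S3-backed JSON data as a temporary view, then query it with SQL.
df = spark.read.json("s3://example-bucket/events/")
df.createOrReplaceTempView("events")

top_users = spark.sql("""
    SELECT user, COUNT(*) AS event_count
    FROM events
    GROUP BY user
    ORDER BY event_count DESC
    LIMIT 10
""")
top_users.show()
```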

Data Storage

Redshift

  • distribution key styles
    • even: for when a table doesn't participate in joins, or when the choice between key and all isn't clear
    • key: for when you want to co-locate data across tables on join columns
    • all: a copy of the table on every node. Use only for slow-moving tables; small dimension tables don't benefit much, since the cost of redistributing them is already low
  • sort key types
    • compound: columns are evaluated in the order listed; most effective when queries filter on a prefix of the sort-key columns
    • interleaved: equal weight to each column in the sort key. More benefit for larger tables; don't go above 4 columns. The more of the sort-key columns a query uses, the greater the benefit.
  • Load into redshift using:
    • Pull: COPY initiated from within Redshift (see the sketch after this list)
      • S3: can read either from a prefix or manifest file, specify an encryption key for CSE-KMS, and copy from an S3 bucket in another region
      • DynamoDB
      • EMR: by HDFS prefix
      • SSH: uses a manifest file to copy data from multiple hosts in parallel
    • Push from an external service (still uses COPY)
      • Firehose
      • Data Pipeline
  • Workload Management (WLM)
    • Automatic WLM: Redshift manages concurrency and memory for you with one queue
    • Concurrency Scaling: AWS adds a temporary cluster to burst when queues start getting backed up
    • Short Query Acceleration (SQA)
      • queries detected for SQA get moved off to the side to execute and don't take up WLM slots
      • uses machine learning to set the max runtime dynamically based on cluster load, or it can be set to a static 1-20 seconds
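
A hedged sketch of a pull COPY from S3 over a psycopg2 connection (cluster endpoint, table, bucket, manifest, and IAM role are all hypothetical); MANIFEST and REGION correspond to the manifest-file and cross-region options above:

```python
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="admin", password="...",
)

copy_sql = """
    COPY sales
    FROM 's3://example-bucket/sales/manifest.json'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    MANIFEST                -- treat the S3 object as a manifest, not data
    REGION 'us-west-2'      -- bucket lives in a different region
    FORMAT AS CSV;
"""

# COPY runs inside Redshift and pulls the files from S3 in parallel.
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
```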

Elasticsearch Service

  • data ingestion integration list
    • S3, Kinesis Data Streams, DynamoDB, and CloudWatch integrations use Lambda functions (see the sketch below)
    • Firehose and IoT rule actions are native integrations
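
A sketch of the Lambda pattern for the non-native sources, following the shape of AWS's samples (domain endpoint and index are hypothetical; the requests and requests_aws4auth packages must be bundled with the function). It decodes Kinesis records and indexes each as a document:

```python
import base64
import json
import os

import boto3
import requests
from requests_aws4auth import AWS4Auth

# Hypothetical Amazon ES domain endpoint, index "logs", type "_doc".
ES_URL = "https://search-example-domain.us-east-1.es.amazonaws.com/logs/_doc"

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(
    credentials.access_key, credentials.secret_key,
    os.environ["AWS_REGION"], "es", session_token=credentials.token,
)

def handler(event, context):
    # Kinesis delivers record data base64-encoded.
    for record in event["Records"]:
        doc = json.loads(base64.b64decode(record["kinesis"]["data"]))
        requests.post(ES_URL, auth=awsauth, json=doc)
```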

Visualization/Reporting

Athena

  • best practices
    • partition data
    • use columnar formats
    • compress, and split files into reasonable sizes to increase parallelism (see the query sketch after this list)
      • snappy: default for parquet
      • zlib: default for orc
      • lzo
      • gzip: not splittable
    • IAM access
    • supports SSE-S3, SSE-KMS, and CSE-KMS (client-side) encryption
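
A minimal query sketch with boto3 (database, table, partition columns, and output bucket are hypothetical); filtering on partition columns is what keeps scanned bytes, and therefore cost, down:

```python
import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="""
        SELECT action, COUNT(*) AS n
        FROM events
        WHERE year = '2019' AND month = '05'  -- prune to matching partitions
        GROUP BY action
    """,
    QueryExecutionContext={"Database": "example_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)

# Queries run asynchronously; poll get_query_execution with this ID for status.
print(execution["QueryExecutionId"])
```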

QuickSight

  • Components
    • data sources: relational, file, or 3rd party SaaS
    • data set: identifies the specific data in a data source; can be refreshed into SPICE
    • analysis: "container for a set of related visuals and stories", up to 20 data sets and 20 visuals
    • visual: graphical representation of your data
    • dashboard: read only snapshot of an analysis
  • READ THIS! row level security
    • can be added before or after the data set is shared
    • NULL (no value) means all values
    • Two modes:
      • Grant access to data set: no entry=denied, entry with NULL filters=see all rows
      • Deny access to data set: no entry=see all rows, entry with NULL filters=denied
  • joining tables
    • tables must come from the same SQL data source; if they come from different data sources, join them before importing
    • can't join on calculated fields
    • refreshing SPICE and imported data
  • User Management for Enterprise edition
    • AD via AWS Directory Service or AD Connector
    • SAML
  • visualization types

Resources

I passed with a 92%! Resources used were the following:
