# Hadoop
- Open Source
- Distributed storage and computing
- Can scale to petabytes of data
- Consists of
    - Hadoop Distributed File System (HDFS)
    - MapReduce
    
Source: LinkedIn: big-data-analytics-with-hadoop-and-apache-spark

Practicals run on:
- Ambari sandbox
- Zeppelin notebook

## Hadoop Distributed File System (HDFS)
- good and cheap option for storing large amounts of data
- provides scalability, security and cost benefits
- suitable for enterprises with in-house data centers
- Cloud Alternatives - Amazon S3, Oracle OSS, Google Cloud Storage

## MapReduce
- Scales Horizontally
- Very slow, as it uses disk storage instead of memory for intermediate caching
- Faster Alternatives - Apache Spark, Apache Flink
- These alternatives also have a growing list of supporting libraries
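
For context on what MapReduce distributes, here is a minimal single-process sketch of the map, shuffle and reduce steps using word count. The function names and sample lines are illustrative assumptions, not Hadoop's API. In a real Hadoop job these steps run on many nodes and the intermediate pairs are written to disk between phases, which is the slowness noted above.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input: each string stands for one line of a large file.
lines = ["big data with hadoop", "big data with spark"]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```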

## Apache Spark
- Open Source
- Large scale distributed data processing engine
- Uses memory to speed up computations
- Batch, streaming, ML and graph capabilities
- Supports Scala, Java, Python and R
- One of the most popular big data platforms today

## Hadoop and Spark
- Spark is well integrated with Hadoop
- Spark can access and process data in HDFS using
    - parallel nodes
    - read optimizations that use less memory and I/O
- Spark can use HDFS for intermediate data caching
- YARN provides single cluster management for both HDFS and Spark

## HDFS Data Modelling for Analytics

#### 1. Hadoop Storage Formats
- Raw Texts (blobs)
- Structured text files (csv, xml, json)
- Sequence Files
- Avro
- ORC
- Parquet

Text Files:  
- Simple to read/write
- Low performance - no parallel operations
- More Storage
- No schema

Avro:  
- Language neutral data serialization
- Row format
- Self-describing schema support
- Compressible
- Splittable
- Ideal for multi-language support

Parquet:  
- Columnar Format
- Reads only the selected columns (saves I/O)
- Schema Support
- Compressible (column level) and Splittable
- Supports nested Data Structures
- Ideal for analytics applications  

Parquet for Analytics:  
- provides overall better performance and flexibility for analytics applications
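
To make the row versus columnar difference concrete, the following is a toy in-memory sketch, not the actual Avro or Parquet on-disk format. The sample records and helper names are assumptions for illustration. It counts how many stored values must be touched to sum a single field in each layout.

```python
# Three records with three fields each (hypothetical sample data).
records = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "IN", "amount": 25.5},
    {"id": 3, "country": "US", "amount": 7.25},
]

# Row layout (Avro-like): the values of one record are stored together.
row_layout = [tuple(r.values()) for r in records]

# Columnar layout (Parquet-like): the values of one column are stored together.
column_layout = {name: [r[name] for r in records] for name in records[0]}

def total_from_rows(rows):
    """Summing one field still touches every value of every row."""
    values_read = 0
    total = 0.0
    for row in rows:
        values_read += len(row)
        total += row[2]  # 'amount' is the third field
    return total, values_read

def total_from_columns(columns):
    """Summing one field touches only that column's values."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)

row_total, row_reads = total_from_rows(row_layout)
col_total, col_reads = total_from_columns(column_layout)
print(row_total, row_reads)   # 42.75 9
print(col_total, col_reads)   # 42.75 3
```

Both layouts give the same answer, but the columnar one reads a third of the values here. The gap grows with the number of columns that a query does not need.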

#### 2. Hadoop Compression Options
- Snappy
    - Compression codec developed by Google
    - Moderate Compression
    - Excellent read/write performance
    - Compresses entire file
    - Not splittable on its own, so a Snappy-compressed text file doesn't support parallel operations
- LZO
    - Moderate compression
    - Excellent processing performance
    - Splittable (when indexed), so it supports parallel processing
    - Requires separate installation because of its GPL license
- GZIP
    - Very good compression
    - Moderate processing performance
    - Not Splittable
    - Ideal for container type applications
- bzip2
    - Excellent compression
    - Slower processing performance 
    - Splittable
    - Ideal for archival type applications
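
As a rough way to see the size trade-off, the sketch below compresses the same bytes with GZIP and bzip2 using Python's standard library. Snappy and LZO are left out because they need third-party packages. The sample data is a made-up, highly repetitive CSV line, so the sizes printed are only indicative; as the best practices below suggest, run the comparison on actual data.

```python
import bz2
import gzip

# Repetitive CSV-like sample data (hypothetical); real ratios depend on the data.
raw = ("2024-01-01,US,click,42\n" * 5000).encode("utf-8")

gz = gzip.compress(raw)
bz = bz2.compress(raw)

for name, blob in (("raw", raw), ("gzip", gz), ("bzip2", bz)):
    print(f"{name:>5}: {len(blob):>7} bytes")

# Both codecs are lossless: decompressing returns the original bytes.
assert gzip.decompress(gz) == raw
assert bz2.decompress(bz) == raw
```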

#### 3. Partitioning
- HDFS does not have the concept of indexes
- Even to read a single row, the entire file must be read
- Partitioning provides a way to read only a subset of data
- Multiple attributes can be used for hierarchical partitioning
- Split data into directories based on individual values of attributes

<img src="Image/partitioning.JPG" width="600" />

- choose attributes with a limited set of values and those that are most used in SELECT filters
- otherwise many sub directories will be created
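
The directory layout can be sketched on a local file system as follows. This is a simplified illustration using `key=value` directory names, not HDFS or Spark code; `write_partitioned`, `read_partition`, the `part-0.csv` file name and the sample rows are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path

rows = [
    {"country": "US", "year": "2023", "amount": "10"},
    {"country": "US", "year": "2024", "amount": "20"},
    {"country": "IN", "year": "2024", "amount": "30"},
]

def write_partitioned(rows, root, keys):
    """Write each row under root/key1=value1/key2=value2/part-0.csv."""
    for row in rows:
        part_dir = Path(root).joinpath(*[f"{k}={row[k]}" for k in keys])
        part_dir.mkdir(parents=True, exist_ok=True)
        with (part_dir / "part-0.csv").open("a", newline="") as f:
            # Partition values live in the directory name, not in the file.
            csv.writer(f).writerow([v for k, v in row.items() if k not in keys])

def read_partition(root, **filters):
    """Open only the files whose directory path matches every filter."""
    wanted = {f"{k}={v}" for k, v in filters.items()}
    result = []
    for path in sorted(Path(root).rglob("part-0.csv")):
        if wanted <= set(path.relative_to(root).parts):
            with path.open(newline="") as f:
                result.extend(csv.reader(f))
    return result

root = tempfile.mkdtemp()
write_partitioned(rows, root, keys=["country", "year"])

us_rows = read_partition(root, country="US")
print(us_rows)  # [['10'], ['20']]
```

A filter on `country` never opens the files under `country=IN`. With a high-cardinality key the same scheme would create one directory per distinct value, which is the problem the last bullet warns about.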

#### 4. Bucketing
- Partitioning is optimal when an attribute has a small set of unique values
- What if we need to partition on a key with a large number of unique values?
- Bucketing is similar to partitioning, but instead of using the value directly it applies a hash function to convert the value into a hash key
- Controls the number of unique directories created
- Even distribution
- Choose attributes with large number of unique values and those that are most used in SELECT filters.
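
A minimal sketch of the idea, assuming a CRC32 hash and four buckets. Both choices are illustrative only; Hive and Spark use their own hash functions, and `bucket_for` is a hypothetical helper.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 4  # fixed up front, regardless of how many distinct keys exist

def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket number with a deterministic hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

# 1,000 distinct customer ids (hypothetical) collapse into at most 4 buckets.
customer_ids = [f"cust-{i}" for i in range(1000)]
distribution = Counter(bucket_for(c) for c in customer_ids)
print(sorted(distribution.items()))

# A lookup for one key only needs to scan the bucket it hashes to.
print(bucket_for("cust-42"))
```

Because the bucket number depends only on the key, the same key always lands in the same bucket, and the number of buckets stays fixed however many distinct keys arrive.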

#### 5. HDFS Schema Designing and Storage Best Practices
- understand whether the data is read intensive, write intensive or both
- determine what needs optimization and what can be compromised (reduce storage requirements, or trade storage for better read/write performance)
- choose options carefully as they can't be changed easily
- run tests on actual data to understand performance and storage characteristics
- choose partitioning and bucketing keys wisely

### Data Extraction with Spark Best Practices
- reduce data read into memory by:
    - using filters based on partition key
    - reading only required columns
- use data sources and file formats that support parallelism
- keep number of partitions >= (No. of executors * No. of cores per executor)
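
The partition rule of thumb is simple arithmetic, sketched below for a hypothetical cluster. It relies on each Spark task processing one partition on one core, so fewer partitions than cores leaves cores idle. The helper names are illustrative and not part of Spark.

```python
def min_partitions(num_executors, cores_per_executor):
    """Smallest partition count that can keep every core busy at once."""
    return num_executors * cores_per_executor

def idle_cores(num_partitions, num_executors, cores_per_executor):
    """Cores left without a task when there are too few partitions."""
    total_cores = min_partitions(num_executors, cores_per_executor)
    return max(0, total_cores - num_partitions)

# Hypothetical cluster: 5 executors with 4 cores each.
print(min_partitions(5, 4))   # 20
print(idle_cores(8, 5, 4))    # 12
print(idle_cores(40, 5, 4))   # 0
```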