Merge branch 'main' of github.com:data-burst/data-engineering-wiki
ShAlireza committed Jun 5, 2024
2 parents db3b960 + 5fc9ccc commit e0d3a40
Showing 16 changed files with 465 additions and 1 deletion.
6 changes: 6 additions & 0 deletions hugo-blog/content/docs/roadmap/data-architecture/_index.md
@@ -0,0 +1,6 @@
---
bookCollapseSection: true
weight: 18
title: "Data Architecture"
---

24 changes: 24 additions & 0 deletions hugo-blog/content/docs/roadmap/data-architecture/kappa.md
@@ -0,0 +1,24 @@
---
title: "Kappa"
---

# Kappa Architecture

## Introduction

**Kappa Architecture** is a data-processing paradigm that serves both real-time and historical workloads through a single streaming pipeline. Unlike the traditional Lambda Architecture, which maintains separate pipelines for batch and real-time data, Kappa Architecture treats all data as a stream: historical results are reproduced by replaying the event log through the same processing code.

Key points about Kappa Architecture:

1. **Unified Processing**: Kappa Architecture eliminates the need for a separate batch layer, reducing latency and complexity. It processes data continuously, making it well-suited for applications requiring real-time insights.

2. **Stream Processing**: The core idea is to leverage stream processing engines to handle large volumes of data in real time. These engines clean, enrich, transform, filter, and aggregate streaming events.
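
The whole idea fits in a few lines. Below is a minimal Python sketch (standard library only; the event shapes and names are invented for illustration) in which one processing function serves live events and, by replaying the log, also reproduces any historical view:

```python
from collections import defaultdict

# Append-only event log: Kappa's single source of truth. In practice this
# is a durable log such as Kafka; here it is just a Python list.
event_log = []

def process(event, view):
    """One processing function, shared by the live path and replays."""
    view[event["user"]] += event["amount"]

def handle_live(event, view):
    event_log.append(event)  # persist to the log first
    process(event, view)     # then update the serving view

def rebuild_view():
    """'Batch' in Kappa is just replaying the log through the same code."""
    view = defaultdict(float)
    for event in event_log:
        process(event, view)
    return view

live_view = defaultdict(float)
handle_live({"user": "alice", "amount": 3.0}, live_view)
handle_live({"user": "bob", "amount": 2.0}, live_view)

# Recomputing history (e.g. after a bug fix) is just another replay.
assert rebuild_view() == live_view
```

Reprocessing after a code change is simply another replay, which is exactly the Kappa claim: one code path, one pipeline.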

## Learning Resources

### Articles
- [The Kappa Architecture](https://medium.com/@devin.bost/the-kappa-architecture-8105a3c10f98)
- [Kappa Architecture | Dremio](https://www.dremio.com/wiki/kappa-architecture/)
- [Kappa Architecture: A Comprehensive Guide](https://medium.com/@sivakumar-mahalingam/kappa-architecture-a-comprehensive-guide-eb18050a6295)
- [Kappa architecture - Wikipedia](https://en.wikipedia.org/wiki/Kappa_architecture)

32 changes: 32 additions & 0 deletions hugo-blog/content/docs/roadmap/data-architecture/lambda.md
@@ -0,0 +1,32 @@
---
title: "Lambda"
---

# Lambda Architecture

## Introduction

**Lambda architecture** is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. This approach aims to balance latency, throughput, and fault-tolerance. Let's explore the key components of lambda architecture:

1. **Batch Layer**:
- The batch layer precomputes results using a distributed processing system capable of handling large data volumes.
- It ensures perfect accuracy by processing all available data when generating views.
- Output is stored in a read-only database, with updates replacing existing precomputed views.

2. **Speed Layer**:
   - The speed layer processes data streams in real time, without the accuracy fix-ups or completeness guarantees of the batch layer.
- It minimizes latency by providing immediate views into the most recent data.
- These views may not be as accurate or complete as batch layer views but are available promptly.

3. **Serving Layer**:
- The serving layer responds to user queries by combining batch and speed layer outputs.
- It provides comprehensive and timely views of data for various use cases.
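
As a toy illustration of how the three layers divide the work, here is a self-contained Python sketch (data shapes and function names are invented; a real system would use a distributed batch engine and a stream processor):

```python
from collections import defaultdict

master_data = []               # immutable, append-only record of all events
batch_view = {}                # accurate but stale; rebuilt periodically
speed_view = defaultdict(int)  # fresh but covers only recent events

def ingest(event):
    """Speed layer: append to master data and update the real-time view."""
    master_data.append(event)
    speed_view[event["key"]] += event["value"]

def batch_recompute():
    """Batch layer: recompute the view from ALL data (slow, accurate)."""
    global batch_view
    view = defaultdict(int)
    for event in master_data:
        view[event["key"]] += event["value"]
    batch_view = dict(view)
    speed_view.clear()  # these events are now covered by the batch view

def query(key):
    """Serving layer: merge the batch and speed views at read time."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

ingest({"key": "clicks", "value": 5})
batch_recompute()                      # batch view now holds clicks -> 5
ingest({"key": "clicks", "value": 2})  # visible immediately via speed layer
print(query("clicks"))                 # 7
```

Real systems coordinate the hand-off between a finished batch run and the expiry of speed-layer state far more carefully; the sketch only shows the division of labor.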

## Learning Resources

### Articles

- [Lambda Architecture: Design Simpler, Resilient, Maintainable, and Scalable Big Data Solutions](https://www.infoq.com/articles/lambda-architecture-scalable-big-data-solutions/)
- [Lambda Architecture | Snowflake](https://www.snowflake.com/guides/lambda-architecture)
- [What Is Lambda Architecture? - Databricks](https://www.databricks.com/glossary/lambda-architecture)

@@ -12,8 +12,11 @@

CDC (Change Data Capture) is a data management technique used to identify and track changes made to data in a database.

### How CDC Works
**1. Capture:** CDC monitors a source database for changes. Different methods can be used to capture changes, such as transaction log scanning, triggers, or timestamps.

**2. Record:** Once changes are detected, CDC records them in a staging area or a log table. This record includes the type of change (insert, update, or delete) and the specific data fields that were affected.

**3. Process:** The recorded changes are processed and transformed according to the needs of the target system or application. This might involve filtering, mapping, or other forms of transformation.

**4. Deliver:** Finally, the processed changes are delivered to the target system, which could be a data warehouse, another database, or a message queue for further processing.
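
To make the four steps concrete, here is a minimal timestamp-based polling sketch in Python (tables are plain dicts and all names are invented; production systems more often read the database's transaction log):

```python
# Source "table": rows carry an updated_at timestamp (timestamp-based CDC).
source_rows = {
    1: {"name": "alice", "updated_at": 100},
    2: {"name": "bob", "updated_at": 105},
}
last_seen = 0    # high-water mark of the previous poll
change_log = []  # staging area for recorded changes

def capture_and_record():
    """Steps 1 and 2: detect rows changed since the last poll, record them."""
    global last_seen
    for row_id, row in source_rows.items():
        if row["updated_at"] > last_seen:
            change_log.append({"op": "upsert", "id": row_id, "data": dict(row)})
    last_seen = max(r["updated_at"] for r in source_rows.values())

def process_and_deliver(target):
    """Steps 3 and 4: transform each change and apply it to the target."""
    while change_log:
        change = change_log.pop(0)
        change["data"]["name"] = change["data"]["name"].upper()  # toy transform
        target[change["id"]] = change["data"]

target_table = {}
capture_and_record()
process_and_deliver(target_table)
print(target_table)  # both rows delivered, names uppercased
```

Polling on timestamps is the simplest method, but it cannot observe deletes, which is one reason log-based capture is often preferred.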

### Methods of CDC Implementation
@@ -57,4 +60,4 @@

CDC is crucial in data management strategies that require timely and accurate data.
- [What is Change Data Capture?](https://medium.com/@andrew.macconnell/using-change-data-capture-9548ff7b41e3)
- [Change Data Capture (CDC): What it is and How it Works](https://www.striim.com/blog/change-data-capture-cdc-what-it-is-and-how-it-works/)
- [Why Change Data Capture?](https://www.confluent.io/learn/change-data-capture/)
- [Change Data Capture](https://www.qlik.com/us/change-data-capture/cdc-change-data-capture)
96 changes: 96 additions & 0 deletions hugo-blog/content/docs/roadmap/query-engine/_index.md
@@ -0,0 +1,96 @@
---
bookCollapseSection: true
weight: 17
---

# Query Engine

## Introduction
Big Data Query Engines are specialized systems designed to handle the vast and complex datasets characteristic of big data environments. They provide the tools necessary to efficiently query and analyze data distributed across many nodes in a cluster. These engines enable businesses and organizations to derive insights from their data by leveraging parallel processing and optimized query execution strategies.

### Key Functions and Features

1. **Distributed Query Processing:**

- **Parallel Execution:** Big data query engines break down queries into smaller tasks that can be executed in parallel across multiple nodes, significantly improving performance and speed.

- **Data Locality Optimization:** These engines optimize query execution by minimizing data movement and processing data where it is stored, reducing latency and network congestion.

2. **Fault Tolerance and Reliability:**

- **Redundancy:** Data is often replicated across multiple nodes to ensure that if one node fails, others can take over, ensuring continuous availability.

- **Checkpointing and Logging:** Mechanisms such as checkpointing and transaction logs help in recovering from failures and maintaining data integrity.

3. **Scalability:**

   - **Horizontal Scaling:** Big data query engines can scale out by adding more nodes to the cluster, allowing them to handle increasing volumes of data and query loads.

- **Elasticity:** These engines can dynamically adjust resources based on workload, optimizing performance and cost.

4. **Advanced Query Optimization:**

- **Cost-Based Optimization:** Uses statistical information about data distribution and storage to choose the most efficient execution plan.

- **Predicate Pushdown:** Filters and conditions are applied as early as possible in the query execution process to reduce the amount of data processed.

5. **Support for Various Data Formats and Sources:**

- **Multi-Format Support:** Big data query engines can handle structured, semi-structured, and unstructured data formats such as JSON, Avro, Parquet, and ORC.

- **Data Source Integration:** They can integrate with various data sources including Hadoop Distributed File System (HDFS), NoSQL databases, cloud storage systems, and traditional relational databases.
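
Points 1 and 4 can be shown in miniature. The Python sketch below (standard library only, invented data) scans partitions in parallel and pushes the filter down to the scan, so only small partial aggregates, not raw rows, are combined at the end:

```python
from concurrent.futures import ThreadPoolExecutor

# Data distributed across "nodes" as partitions.
partitions = [
    [{"region": "eu", "amount": 10}, {"region": "us", "amount": 7}],
    [{"region": "eu", "amount": 5}, {"region": "us", "amount": 1}],
    [{"region": "eu", "amount": 2}],
]

def scan_partition(rows):
    """Predicate pushdown: the filter runs at the storage node, so only a
    partial sum, never the raw rows, leaves the partition."""
    return sum(r["amount"] for r in rows if r["region"] == "eu")

# Parallel execution: each partition is scanned concurrently.
with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(scan_partition, partitions))

# Final aggregation combines the small partial results.
print(sum(partial_sums))  # 17
```

The same shape scales from threads on one machine to tasks on hundreds of nodes; only the scheduling substrate changes.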

### Popular Big Data Query Engines

1. **Apache Hive:**

- Built on top of Hadoop, Hive provides a SQL-like interface to query data stored in Hadoop clusters.
- Converts SQL queries into MapReduce, Tez, or Spark jobs, enabling efficient data processing.

2. **Presto:**

- An open-source distributed SQL query engine designed for interactive analytic queries against data sources of all sizes.
- Supports querying data where it lives, including Hive, HDFS, relational databases, and object storage.


3. **Apache Drill:**

- Provides a schema-free SQL query engine for big data, enabling queries across multiple data sources without requiring a fixed schema.
- Known for its flexibility and ability to handle complex, nested data structures.

4. **Apache Spark SQL:**

- Part of the Apache Spark framework, Spark SQL allows users to run SQL queries on large datasets using Spark’s powerful in-memory processing capabilities.
- Integrates seamlessly with Spark’s other components, enabling complex analytics and machine learning workflows.

5. **Google BigQuery:**

- A fully-managed, serverless data warehouse that enables super-fast SQL queries using the processing power of Google's infrastructure.
- Supports real-time analytics and can handle terabytes to petabytes of data.
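
As a taste of one of these engines, the following minimal PySpark snippet (assuming a local Spark installation; the sample data is invented) registers a DataFrame as a SQL view and queries it with plain SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-demo").getOrCreate()

# Register a small DataFrame as a temporary SQL view.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")

# Standard SQL, executed by Spark's distributed engine.
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```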


### Key Benefits

- **Performance:** Big data query engines are designed to handle and process large volumes of data quickly and efficiently, leveraging parallelism and optimized query execution strategies.
- **Scalability:** They can easily scale out by adding more nodes to handle increasing data volumes and query complexity.
- **Flexibility:** Support for various data formats and sources allows for integration with diverse data environments.
- **Cost-Efficiency:** By optimizing resource usage and leveraging cloud-based solutions, big data query engines can offer cost-effective data processing solutions.

### Use Cases
- **Data Warehousing:** Big data query engines are fundamental in building data warehouses that store and process vast amounts of structured and unstructured data for analytics.
- **Real-Time Analytics:** Engines like Apache Spark SQL and Google BigQuery enable real-time data analysis, providing immediate insights for time-sensitive decision-making.
- **Business Intelligence:** These engines power BI tools by providing fast and reliable access to large datasets for reporting and analysis.
- **ETL Processes:** They facilitate efficient Extract, Transform, Load (ETL) processes, preparing data for analysis by cleaning, transforming, and loading it from various sources.




Big Data Query Engines are critical components in the modern data ecosystem, enabling the efficient querying and analysis of large, complex datasets. By leveraging distributed processing, advanced optimization techniques, and robust scalability, these engines empower organizations to harness the full potential of their data. Whether for real-time analytics, data warehousing, or complex BI tasks, big data query engines provide the necessary tools to transform vast amounts of data into actionable insights.

## Learning Resources
### Books
- [How Query Engines Work](https://andygrove.io/how-query-engines-work/)

### Miscellaneous
- [WHAT IS A QUERY ENGINE?](https://www.alluxio.io/learn/presto/query/#:~:text=At%20a%20high%20level%2C%20a,answers%20for%20users%20or%20applications.)
72 changes: 72 additions & 0 deletions hugo-blog/content/docs/roadmap/query-engine/hive/_index.md
@@ -0,0 +1,72 @@
---
title: "Apache Hive"
weight: 1
---

# Apache Hive

## Introduction
Apache Hive is an open-source data warehouse software built on Apache Hadoop. It provides a SQL-like interface for querying and analyzing large datasets stored in various databases and file systems integrated with Hadoop.

### Key Features
1. **SQL-Like Interface (HiveQL):**
- Hive uses HiveQL, a query language similar to SQL, making it accessible to users familiar with SQL.

2. **Scalability and Performance:**
- Designed for handling large datasets distributed across multiple nodes.
- Converts queries into efficient execution plans leveraging Hadoop’s parallel processing.

3. **Data Storage Integration:**
- Natively works with Hadoop Distributed File System (HDFS).
- Supports multiple data formats: plain text, RCFile, ORC, Avro, Parquet.

4. **Schema and Data Management:**
- Schema-on-read: Applies schema at query time.
- Partitioning and Bucketing: Enhances query performance by managing large datasets efficiently.

5. **Extensibility:**
- User-Defined Functions (UDFs): Custom functions to extend capabilities.
- Integration with big data tools like Apache Spark, HBase, and Ranger.

6. **Ease of Use:**
- Provides an interactive shell and supports batch queries using scripts.
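
Features 1 and 4 together look like this in practice. The sketch below submits HiveQL through the PyHive client (an assumption; any HiveServer2 client works, and the connection details and table names are invented): it creates a date-partitioned table, then runs a query that prunes down to a single partition:

```python
from pyhive import hive  # assumed client; needs a running HiveServer2

conn = hive.connect(host="localhost", port=10000, username="hive")
cur = conn.cursor()

# Partitioning (feature 4): data is laid out by event_date on HDFS, so
# queries filtering on the partition column scan far less data.
cur.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        user_id STRING,
        url     STRING
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC
""")

# HiveQL (feature 1): familiar SQL-like syntax; partition pruning means
# only the 2024-06-01 partition is read.
cur.execute("""
    SELECT url, COUNT(*) AS views
    FROM page_views
    WHERE event_date = '2024-06-01'
    GROUP BY url
""")
print(cur.fetchall())
```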

### Architecture
- **Metastore:** Stores metadata about tables, columns, data types, and locations.
- **Driver:** Manages the lifecycle of a HiveQL statement, including parsing, compiling, and optimizing.
- **Execution Engine:** Supports MapReduce and Tez for efficient query execution.
- **Storage Handlers:** Integrates with various data sources, including HDFS and HBase.

### Use Cases
- **Data Warehousing:** Creating large-scale data warehouses.
- **ETL Processes:** Efficient data transformation and loading.
- **Business Intelligence:** Supports complex queries for data analytics and reporting.
- **Data Exploration:** Allows interactive queries against large datasets.

### Advantages
- **Familiarity:** The SQL-like language is easy to pick up for users coming from relational databases.
- **Scalability:** Handles large datasets across multiple nodes.
- **Flexibility:** Supports various data formats and storage systems.
- **Extensibility:** Extendable through UDFs and integration with other tools.


![Apache Hive](hive.png)

Apache Hive is essential for querying and analyzing large datasets in Hadoop, providing scalability, flexibility, and ease of use. It is ideal for data warehousing, ETL processes, business intelligence, and data exploration.


## Learning Resources
### Books
- [Apache Hive Cookbook](https://www.amazon.de/-/en/Shrey-Mehrotra/dp/1782161082)
- [Apache Hive Essentials](https://www.scholarvox.com/catalog/book/88860073?_locale=en)
- [Introduction to Apache Hive](https://www.oreilly.com/library/view/introduction-to-apache/9781771374804/)

### Courses
- [Apache Hive Introduction & Architecture](https://www.youtube.com/watch?v=taTfW2kXSoE)
- [What is Apache Hive? : Understanding Hive](https://www.youtube.com/watch?v=cMziv1iYt28)
- [Hive architecture | Explained with a Hive query example](https://www.youtube.com/watch?v=W1XnmXv8Wpo)


### Miscellaneous
- [About the Hive Engine](https://api-docs.treasuredata.com/en/tools/hive/quickstart/)
- [What is Apache Hive?](https://www.databricks.com/glossary/apache-hive)
84 changes: 84 additions & 0 deletions hugo-blog/content/docs/roadmap/query-engine/presto/_index.md
@@ -0,0 +1,84 @@
---
title: "Presto"
weight: 4
---

# Presto
## Introduction
Presto is an open-source distributed SQL query engine designed for running interactive analytic queries against data sources of all sizes. Originally developed by Facebook, Presto is now widely used in various industries for its ability to query data where it resides, including Hive, HDFS, relational databases, and cloud storage.

### Key Features
1. **Distributed Query Processing:**
- Executes queries across multiple nodes, providing high performance and scalability.
- Supports querying large datasets in a parallel and distributed manner.

2. **SQL Compatibility:**
   - Supports standard ANSI SQL, making it easy for users familiar with SQL to write queries.

3. **Data Source Integration:**
- Integrates with a wide range of data sources: HDFS, Hive, relational databases, NoSQL databases, and cloud storage systems.

4. **Performance Optimization:**
- Cost-based optimizer: Uses statistics to determine the most efficient way to execute queries.
- In-memory processing: Reduces latency by processing data in memory.

5. **Extensibility:**
- Connector architecture: Allows easy integration with new data sources by adding custom connectors.

6. **Fault Tolerance and Reliability:**
- Designed for high availability and fault tolerance, ensuring continuous query execution even in the event of node failures.

### Architecture
1. **Coordinator:**
   - Manages the lifecycle of queries, including parsing, planning, and scheduling.
   - Coordinates with workers to execute distributed tasks.

2. **Workers:**
- Execute tasks assigned by the coordinator.
- Handle data processing, including reading from data sources, performing joins, aggregations, and other operations.

3. **Connectors:**
- Plugins that enable Presto to communicate with various data sources.
- Provide a unified interface for data access across different systems.
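
Clients talk only to the coordinator; scheduling onto workers and dispatch through connectors happen behind it. Here is a minimal sketch with the prestodb Python client (an assumption; any Presto driver works, and the host, catalog, and table are invented):

```python
import prestodb  # assumed: the presto-python-client package

# Connect to the coordinator; it plans the query and schedules
# fragments onto the workers.
conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",   # connector: where the data actually lives
    schema="default",
)

cur = conn.cursor()
# Workers read from the underlying source through the hive connector
# and stream results back via the coordinator.
cur.execute("SELECT region, COUNT(*) FROM orders GROUP BY region")
for row in cur.fetchall():
    print(row)
```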

### Use Cases
1. **Interactive Analytics:**
- Enables real-time data analysis, providing quick insights from large datasets.

2. **Data Warehousing:**
- Acts as a query layer on top of existing data warehouses, allowing fast and efficient querying.

3. **Business Intelligence:**
- Powers BI tools by providing a high-performance query engine for reporting and analytics.

4. **Ad Hoc Queries:**
- Supports exploratory data analysis with the ability to run complex queries on diverse data sources.

### Advantages
1. **High Performance:** Optimized for fast query execution with low latency.

2. **Scalability:** Easily scales out to handle increasing data volumes and query complexity.

3. **Flexibility:** Supports a wide range of data sources and can query data where it resides.

4. **Ease of Use:** Standard SQL support makes it accessible to users with SQL knowledge.

5. **Extensibility:** Connector architecture allows integration with new data sources and systems.

![Presto](presto.png)

Presto is a powerful and versatile SQL query engine designed for high-performance, distributed querying across diverse data sources. Its ability to handle large-scale data processing with low latency makes it ideal for interactive analytics, data warehousing, business intelligence, and ad hoc querying. With its extensibility and ease of use, Presto is a valuable tool for organizations seeking to derive insights from their data quickly and efficiently.

## Learning Resources
### Books
- [Learning and Operating Presto](https://www.oreilly.com/library/view/learning-and-operating/9781098141844/)
- [Presto: The Definitive Guide: SQL at Any Scale, on Any Storage, in Any Environment](https://books.google.de/books/about/Presto.html?id=hgJ_xgEACAAJ&redir_esc=y)

### Courses
- [What Is Presto | PrestoDB Explained | Presto Overview Video | Intellipaat](https://www.youtube.com/watch?v=nPhqnfy8DSE)
- [Presto 101: An Introduction to Open Source Presto](https://www.youtube.com/watch?v=rKy7ifPhwrA)
- [Presto: a Powerful SQL Query Engine for Big Data! | Hadoop Big Data Tutorial | Lecture 39](https://www.youtube.com/watch?v=QhgkbJJZoag)

### Miscellaneous
- [Fast and Reliable SQL Engine for Data Analytics and the Open Lakehouse](https://www.youtube.com/watch?v=QhgkbJJZoag)
- [Presto](https://github.com/prestodb/presto)
13 changes: 13 additions & 0 deletions hugo-blog/content/docs/roadmap/query-engine/sparksql/_index.md
@@ -0,0 +1,13 @@
---
title: "SparkSQL"
weight: 2
---

# SparkSQL

## Introduction

## Learning Resources
### Books
### Courses
### Miscellaneous
13 changes: 13 additions & 0 deletions hugo-blog/content/docs/roadmap/query-engine/trino/_index.md
@@ -0,0 +1,13 @@
---
title: "Trino"
weight: 3
---

# Trino

## Introduction

## Learning Resources
### Books
### Courses
### Miscellaneous
4 changes: 4 additions & 0 deletions hugo-blog/content/docs/roadmap/sql-fundamentals/_index.md
@@ -0,0 +1,4 @@
---
bookCollapseSection: true
weight: 10
---