# Data Warehousing and Data Lake Concepts

Databases have been around since the early days of computing, fulfilling crucial needs. From the first computerized databases of the 1960s to today's technological advancements, we've come a long way. In this module we will learn the key concepts of databases, DBMSs, RDBMSs, data warehouses, and data lakes. Finally, we will take a closer look at the key differences between the last two.

## Database
_An organized set of data that is stored in a computer and can be looked at and used in various ways_ ~ Oxford Learner's Dictionary
![db1.jpeg](attachment:db1.jpeg)

- Databases are designed for fast and efficient search and retrieval of data.
- The organization of data to achieve this is a crucial concept in databases.
- Accessing data quickly and efficiently is vital for the usefulness of the database.

## Database Management Systems (DBMS)
A Database Management System (DBMS) is used to manage a database. A DBMS enables users to create, read, update, delete, and secure data within a database. It essentially serves as an interface between the database and the end-user or application.
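
The four basic operations mentioned above can be sketched with Python's built-in `sqlite3` module. This is only an illustration: SQLite stands in for a full DBMS, and the `customers` table and its columns are hypothetical names chosen for the example.

```python
import sqlite3

# In-memory SQLite database; SQLite stands in for a full DBMS here.
conn = sqlite3.connect(":memory:")

# Create: define a table and insert rows (table and column names are hypothetical).
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Asha", "Pune"), ("Ben", "Leeds")],
)

# Read: fetch every row.
rows_after_insert = conn.execute(
    "SELECT id, name, city FROM customers ORDER BY id"
).fetchall()

# Update: change one customer's city.
conn.execute("UPDATE customers SET city = ? WHERE name = ?", ("London", "Ben"))

# Delete: remove one customer.
conn.execute("DELETE FROM customers WHERE name = ?", ("Asha",))

rows_final = conn.execute(
    "SELECT id, name, city FROM customers ORDER BY id"
).fetchall()

print(rows_after_insert)
print(rows_final)
conn.close()
```

The application never touches the underlying file format directly; every operation goes through the DBMS interface.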

### What can a DBMS do?
- Configuring authentication and authorization
- Performance tuning
- Data backup and recovery
- Centralized data viewing

### Components of a DBMS

![dbms1.jpeg](attachment:dbms1.jpeg)

1. **Storage engine**: Interacts with the file system at the OS level to store data. All queries pass through the storage engine.

2. **Query language**: Provides an API (e.g., SQL) to access and modify data.

3. **Optimization engine**: Parses queries and determines the best execution plan.

4. **Query processor**: Interprets user queries into actionable commands for the database.

5. **Support for indexing**: Utilizes special lookup data structures to speed up data retrieval.

6. **Metadata catalog**: Acts as a repository for all database objects created. It is updated dynamically upon object creation and is used to verify user requests for data.

7. **Log manager**: Manages data integrity, backup, and recovery and keeps a record of all data changes made by the DBMS.

8. **Data utilities**: These include various tools for activities like reorganization, backup, recovery, etc.
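
Several of these components can be observed from the outside. The sketch below uses SQLite as an example DBMS (an assumption; other systems expose similar features through different commands). It asks the optimizer for its execution plan before and after an index is created, then reads the metadata catalog, which SQLite exposes as the `sqlite_master` table. The `orders` table and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

query = "SELECT SUM(amount) FROM orders WHERE customer_id = 42"


def plan(sql):
    """Return the optimizer's plan; the 4th column of each row is the description."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Without an index, the optimizer has to scan the whole table.
plan_before = plan(query)

# Indexing support: add a lookup structure on the filtered column.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(query)

# Metadata catalog: SQLite records every table and index in sqlite_master.
catalog = conn.execute("SELECT type, name FROM sqlite_master ORDER BY name").fetchall()

print(plan_before)
print(plan_after)
print(catalog)
```

The exact wording of the plan varies between SQLite versions, but the first plan reports a scan of `orders` and the second reports a search using `idx_orders_customer`.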

## Relational Database Management System (RDBMS)
An RDBMS is a structured system that arranges data into predefined relationships. It uses tables made up of rows and columns to represent the data. In database terms, rows are also known as records, and columns are referred to as fields. Each row contains information about one object, while each column represents a specific type of information shared by all objects in the table.

![RDBMS2.jpg](attachment:RDBMS2.jpg)

1. **Data Organization:** RDBMSs organize data in a structured, tabular format, promoting efficiency and clarity in data management.

2. **Relationships:** The relationships between tables are defined through primary keys and foreign keys, ensuring data integrity and consistency.

3. **Query Language:** SQL (Structured Query Language) is commonly used to interact with RDBMSs, allowing users to retrieve and manipulate data.

4. **Scalability:** A single-server RDBMS handles small to medium-sized datasets efficiently, but very large-scale data usually calls for additional techniques such as partitioning, replication, or sharding.

5. **Data Integrity:** RDBMSs enforce constraints and rules to maintain data integrity, preventing inconsistencies or errors in the database.

6. **Well-established Technology:** RDBMSs have been widely used for decades, making them a mature and reliable choice for various applications.

7. **ACID Properties:** ACID (Atomicity, Consistency, Isolation, Durability) properties ensure that database transactions are reliable and maintain data integrity.
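
Atomicity, the "A" in ACID, can be illustrated with a small transfer between two accounts. This is a minimal sketch using SQLite through Python's `sqlite3` module; the `accounts` table, the names, and the `transfer` helper are hypothetical. The `CHECK` constraint forbids negative balances, so an overdraft makes the second update fail and the whole transaction is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer("alice", "bob", 30)       # succeeds
failed = transfer("alice", "bob", 500)  # would overdraw alice, so it is rolled back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer the credit to `bob` runs first and is then undone, so the database never exposes a state in which money was created.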

### Major aspects of an RDBMS

1. **Tables and Schema:**
   - Tables are the core elements of an RDBMS, representing entities and organizing data in a tabular format.
   - The structure and layout of a table, defining the columns and their data types, is known as the schema.
   - The schema outlines the blueprint for the database, ensuring consistency in data representation.

2. **Columns/Attributes:**
   - Columns, also called attributes, represent specific data elements within a table.
   - Each column has a name and a defined data type, specifying the kind of data it can store.
   - Columns enforce data integrity by validating the data against their defined data types.

3. **Domains:**
   - The set of all allowable values for a particular attribute (column) in a database table.
   - Defines the data type and constraints that restrict the kind of data that can be stored in that attribute.

4. **Rows/Records:**
   - Rows, also referred to as records, represent individual instances of data in a table.
   - Each row contains data values for each attribute (column) defined in the schema.
   - Rows are uniquely identified by a primary key, ensuring their distinctiveness.

![rdbms-table4.jpeg](attachment:rdbms-table4.jpeg)

5. **Relationships:**
   - Relationships establish connections between tables, defining how data in one table relates to another.
   - Primary keys and foreign keys are used to establish relationships between tables.
   - Common relationships include one-to-one, one-to-many, and many-to-many.

6. **Types of Keys:**
   - **Primary Key:** A primary key is a unique identifier for each row in a table. It ensures the uniqueness of each record.
   - **Foreign Key:** A foreign key establishes a link between two tables by referencing the primary key of another table. It enforces referential integrity.
   - **Composite Key:** A composite key is a combination of two or more columns used as a unique identifier for a row.
   - **Unique Key:** A unique key ensures that the values in a column or a set of columns are unique but allows for null values.
   
![rdbms-table-key2.png](attachment:rdbms-table-key2.png)

7. **Indexing:**
   - Indexes are data structures used to enhance data retrieval speed by providing efficient lookup.
   - They improve query performance by allowing rapid access to specific data.
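
The key types and domain constraints described above can be tried out in SQLite. The following is a sketch under assumptions: the `departments`, `employees`, and `project_assignments` tables are hypothetical, and SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.executescript("""
CREATE TABLE departments (
    dept_id INTEGER PRIMARY KEY,              -- primary key
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE employees (
    emp_id  INTEGER PRIMARY KEY,
    email   TEXT UNIQUE,                      -- unique key, NULL allowed
    salary  INTEGER CHECK (salary > 0),       -- domain: positive integers only
    dept_id INTEGER NOT NULL REFERENCES departments (dept_id)  -- foreign key
);
CREATE TABLE project_assignments (
    emp_id     INTEGER REFERENCES employees (emp_id),
    project_id INTEGER,
    PRIMARY KEY (emp_id, project_id)          -- composite key
);
""")

conn.execute("INSERT INTO departments VALUES (1, 'Analytics')")
conn.execute("INSERT INTO employees VALUES (10, 'a@example.com', 5000, 1)")


def try_insert(sql):
    """Run an INSERT and report whether the RDBMS accepted it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


results = {
    "duplicate primary key": try_insert(
        "INSERT INTO employees VALUES (10, 'b@example.com', 4000, 1)"),
    "unknown department": try_insert(
        "INSERT INTO employees VALUES (11, 'c@example.com', 4000, 99)"),
    "salary outside domain": try_insert(
        "INSERT INTO employees VALUES (12, 'd@example.com', -5, 1)"),
    "null in unique column": try_insert(
        "INSERT INTO employees VALUES (13, NULL, 4000, 1)"),
    "first assignment": try_insert(
        "INSERT INTO project_assignments VALUES (10, 1)"),
    "duplicate composite key": try_insert(
        "INSERT INTO project_assignments VALUES (10, 1)"),
}

for case, outcome in results.items():
    print(f"{case}: {outcome}")
```

Each rejected insert corresponds to one integrity rule from the list: primary-key uniqueness, referential integrity, the column's domain, and composite-key uniqueness.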

## Data Warehouse

### Need for Analytical Data Systems
Effective data utilization can be the determining factor between a business thriving or faltering. When data is generated, it is typically handled first by an Online Transaction Processing (OLTP) database, which is suited to immediate transaction handling such as sales or ATM withdrawals. Analysis that drives decision-making, however, relies on Online Analytical Processing (OLAP) databases. These analytical databases, which include data warehouses, are optimized to run analytical queries quickly, giving businesses the insights they need. Analytical databases form a vital part of the broader ecosystem of analytic systems.

### Analytical Systems
These systems process data for analysis, as defined by the National Information Technology Laboratory. They comprise databases, data processing software, and Web services. While analytic databases fall under this category, analytical systems may also include other data processing or storage methods. 

Analytical systems primarily perform computations on data, and whether they store data or separate compute from storage determines if they are also classified as analytical databases. An example of such a system is a SQL query engine, which processes data for analysis without storing it, making it part of an analytical system but not an analytical database.

### Analytical Databases
These databases support crucial business decisions by providing efficient access to performance-related data, aiding data scientists and analysts. They optimize data structure to enable fast and versatile queries for a wide range of information.

Typically, analytical databases store aggregated historical data, unlike transactional systems that hold current production data. For instance, customer details may be deleted from a transactional database after some time, but their purchase history could be retained in the analytical database for longer periods.

While both transactional and analytical systems use indexing, the indexing approach differs. Transactional systems typically use primary or foreign keys, while analytical databases employ indexing relevant to analysts' querying patterns, enabling quick and efficient analysis.

Examples of analytical databases include OLAP cubes and data warehouses.

![dw-analytical-db.png](attachment:dw-analytical-db.png)

### Data Warehouse - Defined
Data warehouses are specialized analytical databases that empower organizations to access and analyze both current and historical data, enabling valuable insights for business decisions, reporting, and forecasting.

Examples of data warehouses include:
1. Amazon Redshift: A cloud-based data warehousing service for analyzing large datasets.
2. Snowflake: A cloud data platform offering a versatile data warehouse with features like data sharing and elasticity.
3. Google BigQuery: A fully managed, serverless data warehouse that runs SQL queries over large datasets and supports near real-time analytics.

Services like these make large-scale analysis efficient, but a data warehouse still requires careful planning and consideration of future data needs.

#### Trade-offs in Data Warehouses: Efficiency vs. Adaptability
Data warehouses excel in analytic speed and efficiency due to their structured data storage. This optimized structure allows for easy querying, benefiting analysts and data consumers.

However, this efficiency comes at a cost. Data warehouse architects must meticulously plan the data warehouse's schema-on-write model before its construction. This time-consuming process demands expert guidance to design a structure tailored to address specific business questions.

Moreover, once data is transformed and loaded into the data warehouse, modifying its structure becomes challenging. Any new data must align with the original design to maintain the warehouse's functionality. This lack of adaptability might hinder efficiently addressing new questions as industries and data collection evolve.


### Types of Data in a Data Warehouse

Certain types of data are well-suited for storage within a data warehouse, such as financial transaction data, customer relationship data, and enterprise resource planning data. However, organizations may not store all collected data in a data warehouse due to cost and resource limitations. Social media data, documents, and sensor data are examples of data that might be better handled by other technologies like data lakes or data lakehouses.

Examples:
1. **Financial Transaction Data:** Records of sales, purchases, and financial activities are commonly stored in data warehouses, facilitating financial analysis and reporting.

2. **Customer Relationship Data:** Information about customer interactions, preferences, and buying behavior can be efficiently stored and analyzed in data warehouses, enabling personalized marketing strategies.

3. **Enterprise Resource Planning (ERP) Data:** Data related to inventory, supply chain, and business operations can be consolidated and accessed for comprehensive business insights.

While some organizations solely rely on data warehouses for analytical purposes, this approach may limit access to all collected data. To address this, they can consider integrating a data lake alongside the data warehouse or adopt a data lakehouse approach, combining the benefits of both technologies. The choice depends on the organization's analytical needs and the complexity of the data they collect.

### Data Warehouse Schema Designs: Star, Snowflake, and Galaxy

In data warehousing, schema designs play a crucial role in organizing data for efficient querying and analysis. Three common schema designs are the Star, Snowflake, and Galaxy schemas, each with its unique characteristics and use cases.

1. **Star Schema:**
   - The Star Schema is the simplest and most widely used schema design in data warehousing.
   - It consists of a central fact table that holds the primary measures or metrics of interest, surrounded by dimension tables.
   - Dimension tables contain descriptive attributes related to the central fact table, forming a star-like shape when visually represented.

![star-schema-erd.png](attachment:star-schema-erd.png)

**Example of Star Schema:**
Consider an e-commerce data warehouse. The fact table could contain sales data, with measures like total revenue, quantity sold, and profit. The dimension tables might include information about products (product name, category, etc.), customers (customer ID, location, etc.), and time (date, month, year).

2. **Snowflake Schema:**
   - The Snowflake Schema is an extension of the Star Schema and is designed to reduce data redundancy and storage space.
   - Dimension tables are further normalized into sub-dimensions, resulting in a more complex structure compared to the Star Schema.

![snowflake-schema.jpg](attachment:snowflake-schema.jpg)

**Example of Snowflake Schema:**
Using the same e-commerce data warehouse example, the dimension table for customers might be normalized into sub-dimensions like customer details and customer location.

3. **Galaxy Schema (Fact Constellation Schema):**
   - The Galaxy Schema contains multiple fact tables that share one or more dimension tables.
   - It can be viewed as a collection of star schemas linked through their common dimensions, which lets it model several related business processes at once.

![galaxy-schema.png](attachment:galaxy-schema.png)

**Example of Galaxy Schema:**
In the e-commerce data warehouse, a sales fact table and a shipments fact table might both reference the same product, customer, and time dimension tables.

Each schema design has its advantages and considerations. The Star Schema offers simplicity and ease of use, ideal for straightforward data analysis. The Snowflake Schema saves storage space but adds complexity to queries. The Galaxy Schema provides flexibility and allows for more complex business requirements. Data architects and analysts carefully choose the appropriate schema design based on the organization's specific needs and data characteristics.
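
To make the star schema concrete, here is a minimal sketch of the e-commerce example in SQLite. All table names, column names, and sample rows are hypothetical. One fact table references three dimension tables, and a typical analytical query joins the fact table to the dimensions it needs and aggregates a measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Fact table: measures plus a foreign key to each dimension
CREATE TABLE fact_sales (
    product_id  INTEGER REFERENCES dim_product (product_id),
    customer_id INTEGER REFERENCES dim_customer (customer_id),
    date_id     INTEGER REFERENCES dim_date (date_id),
    quantity    INTEGER,
    revenue     REAL
);

INSERT INTO dim_product  VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_customer VALUES (1, 'Pune'), (2, 'Leeds');
INSERT INTO dim_date     VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 1, 20240101, 2, 2000.0),
    (2, 1, 20240101, 1,  300.0),
    (1, 2, 20240201, 1, 1000.0);
""")

# Revenue per product category and month: one join per dimension used.
revenue_by_category_month = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date    AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()

for row in revenue_by_category_month:
    print(row)
```

In a snowflake variant, `category` would move out of `dim_product` into its own table, which removes the repeated category names at the price of one more join in the query.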

### Components of Data Warehouse Architecture

1. **Data Sources:**
   - These are the various systems and applications that generate and store data. Data sources could include transactional databases, external systems, spreadsheets, logs, and more.

2. **Data Integration:**
   - Data integration involves the process of extracting data from diverse sources, transforming it into a consistent format, and loading it into the data warehouse.

3. **Staging Area:**
   - The staging area acts as an intermediary storage space between data sources and the data warehouse. It holds the extracted data temporarily before it is further processed and loaded into the warehouse.

4. **Data Warehouse Database:**
   - The core of the data warehouse architecture is the database that stores the integrated, cleaned, and transformed data. It is designed for efficient querying and analysis.

5. **Data Modeling:**
   - Data modeling defines the logical and physical structures of the data warehouse, including the creation of dimensional models like star schemas or snowflake schemas.


![dw-arch2.png](attachment:dw-arch2.png)

6. **ETL (Extract, Transform, Load):**
   - ETL processes are responsible for extracting data from source systems, transforming it to fit the data warehouse schema, and loading it into the warehouse database.

7. **Metadata Management:**
   - Metadata is data about data and includes information about data sources, data definitions, data lineage, and transformations. Effective metadata management is crucial for understanding and managing the data in the warehouse.

8. **Business Intelligence (BI) Tools:**
   - BI tools are used to query, analyze, and visualize the data stored in the data warehouse, providing valuable insights to business users and decision-makers.

9. **Data Access and Security:**
   - Data access controls ensure that only authorized users can access specific data in the data warehouse. Proper security measures are implemented to protect the sensitive data.

10. **Data Governance and Quality Management:**
    - Data governance encompasses policies and procedures for managing data, ensuring data quality, and maintaining compliance with regulations and business rules.

11. **Performance Optimization:**
    - Performance optimization involves tuning the data warehouse to ensure efficient data retrieval and query processing, improving overall system performance.

The combination of these components creates a robust data warehouse architecture that enables organizations to efficiently store, manage, and analyze large volumes of data, supporting data-driven decision-making and business intelligence.
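
The flow from data source through staging to the warehouse database can be sketched in a few lines. This is an illustration only: the inlined CSV text stands in for a source-system export, SQLite stands in for the warehouse, and the `transform` function, its cleaning rules, and the `orders` table are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Extract: a hypothetical CSV export from a source system, inlined to stay self-contained.
raw_csv = """order_id,customer,amount,order_date
1, asha ,120.50,2024-01-05
2,BEN,not-a-number,2024-01-06
3,Chen,80,2024-01-07
"""
staged = list(csv.DictReader(io.StringIO(raw_csv)))  # rows held in a staging area


def transform(row):
    """Return a cleaned tuple matching the warehouse schema, or None if invalid."""
    try:
        amount = float(row["amount"])
    except ValueError:
        return None  # reject rows whose amount is not numeric
    customer = row["customer"].strip().title()  # normalize whitespace and casing
    return (int(row["order_id"]), customer, amount, row["order_date"])


# Transform: clean each staged row and drop the ones that fail validation.
clean = [t for t in map(transform, staged) if t is not None]
rejected = len(staged) - len(clean)

# Load: write the cleaned rows into the warehouse table in one transaction.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders "
    "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
)
with warehouse:
    warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", clean)

loaded = warehouse.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
print("rejected rows:", rejected)
```

Because the data is shaped before it is written, every row in the warehouse already conforms to the schema, which is the schema-on-write model discussed earlier.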

## Data Lakes

Imagine a data lake as an extensive reservoir brimming with raw data collected from diverse sources. Data lakes are not only for storage but also for analysis, leveraging machine learning and artificial intelligence algorithms. This adaptable technology empowers organizations to harness valuable insights, tailored to their specific business needs.

1. **Data Variety and Flexibility:** Data lakes accommodate various data formats, from structured to unstructured, and support unaltered data ingestion, allowing seamless integration of new data sources.

2. **Scalability and Cost-Effectiveness:** Data lakes are highly scalable, capable of handling vast amounts of data without incurring high infrastructure costs, making them a cost-effective choice for big data storage.

3. **Data Exploration and Discovery:** Data lakes encourage data exploration and discovery, enabling data scientists and analysts to experiment with diverse datasets and uncover hidden patterns or trends.

4. **Data Governance and Security:** Proper data governance and security measures are crucial for data lakes to maintain data quality, integrity, and ensure compliance with regulations.

5. **Schema on Read vs. Schema on Write:** Data lakes follow the "Schema on Read" approach, allowing data to be structured during the analysis phase, providing greater flexibility compared to the traditional "Schema on Write" approach used in databases.

6. **Integration with Data Warehouses:** Data lakes and data warehouses often work in tandem. Data lakes handle raw and unprocessed data, while data warehouses store curated, structured data for business intelligence and reporting purposes.

7. **Real-Time Data Processing:** Organizations can employ real-time data processing frameworks like Apache Spark or Apache Flink to handle streaming data efficiently.

By embracing the potential of data lakes, organizations gain a competitive edge, discovering valuable insights that drive innovation, growth, and improved decision-making across various industries.
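
The "Schema on Read" idea from the list above can be shown with plain Python. In this sketch the raw events are stored exactly as they arrived, and each analysis applies its own schema when it reads them. The event fields and the `read_with_schema` helper are hypothetical.

```python
import json

# Raw events as they might sit in a data lake: different shapes, stored unchanged.
raw_events = [
    '{"user": "asha", "action": "click", "ts": "2024-01-05T10:00:00", "page": "/home"}',
    '{"user": "ben", "action": "purchase", "ts": "2024-01-05T10:05:00", "amount": "19.99"}',
    '{"sensor": "t-17", "temp_c": 21.4, "ts": "2024-01-05T10:06:00"}',
]


def read_with_schema(lines, schema):
    """Parse raw JSON lines, keep records that contain every field in `schema`,
    and cast each field with the function given for it."""
    for line in lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


# Two analyses impose two different schemas on the same raw data.
purchase_schema = {"user": str, "amount": float}
sensor_schema = {"sensor": str, "temp_c": float}

purchases = list(read_with_schema(raw_events, purchase_schema))
readings = list(read_with_schema(raw_events, sensor_schema))

print(purchases)
print(readings)
```

Nothing was decided about structure at ingestion time, so a new question only needs a new schema definition rather than a redesign of the stored data.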


![data-lake3.webp](attachment:data-lake3.webp)

### Components of a Data Lake Architecture

1. **Data Sources:**
   - Data lakes ingest data from various sources, such as databases, applications, log files, sensors, social media, and other external systems.
   - These sources feed raw and diverse data into the data lake.

2. **Data Ingestion Layer:**
   - The data ingestion layer is responsible for collecting and loading data from various sources into the data lake.
   - It ensures data is efficiently processed, validated, and made available for further processing.

3. **Data Storage:**
   - The core component of a data lake is its storage layer, which stores vast amounts of raw and unprocessed data in its native format.
   - Data storage can be based on distributed file systems like Hadoop Distributed File System (HDFS) or cloud-based object storage services.

4. **Metadata Store:**
   - The metadata store is a catalog or repository that keeps track of data stored in the data lake.
   - It includes information about data sources, data formats, schema, data quality, and data lineage.

5. **Data Processing and Transformation Layer:**
   - This layer enables data transformation, cleansing, and preparation for analytical purposes.
   - Big data processing technologies like Apache Spark, Apache Hive, or Apache Flink are often used for this purpose.

6. **Data Governance and Security:**
   - Data governance and security components ensure proper data access controls, data privacy, compliance, and data quality management within the data lake.

7. **Data Catalog and Metadata Management:**
   - The data catalog helps users discover, understand, and access data assets stored in the data lake.
   - Metadata management ensures consistent metadata across the data lake for better data governance and discovery.

8. **Data Exploration and Analytics:**
   - Data exploration and analytics tools enable data scientists, analysts, and business users to run queries and perform data analysis on the data lake.

9. **Data Visualization and Reporting:**
   - Data visualization and reporting tools help in creating interactive dashboards and visual representations of data analysis results for easy consumption.

10. **Data Access APIs and Services:**
    - APIs and services allow different applications and tools to interact with the data lake, enabling seamless integration with other systems.

11. **Data Lineage and Auditing:**
    - Data lineage tracks the origin and transformation history of data in the data lake, ensuring data provenance and compliance.
    - Auditing capabilities help track data access, changes, and user actions for governance and compliance purposes.

The components of a data lake architecture work together to create a scalable, flexible, and cost-effective environment for storing, processing, and analyzing vast volumes of raw data, empowering organizations to extract valuable insights and make informed decisions.
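
As a rough sketch of how the ingestion layer, storage layer, and metadata store fit together, the example below writes raw payloads unchanged into a partitioned folder layout and records a catalog entry for each file. Everything here is an assumption made for illustration: a temporary local directory stands in for HDFS or object storage, a Python list stands in for the metadata store, and the `raw/<source>/<date>/` layout and the `ingest` and `find` helpers are hypothetical.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# A temporary local directory stands in for HDFS or cloud object storage.
lake_root = Path(tempfile.mkdtemp(prefix="lake_"))
catalog = []  # a plain list stands in for the metadata store


def ingest(source, payload, fmt, ingest_date):
    """Store raw bytes unchanged under raw/<source>/<date>/ and record metadata."""
    target_dir = lake_root / "raw" / source / ingest_date
    target_dir.mkdir(parents=True, exist_ok=True)
    part_number = len(list(target_dir.iterdir()))
    path = target_dir / f"part-{part_number:04d}.{fmt}"
    path.write_bytes(payload)  # native format, no transformation
    catalog.append({
        "path": path.relative_to(lake_root).as_posix(),
        "source": source,
        "format": fmt,
        "size_bytes": len(payload),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    })
    return path


ingest("web_logs", b'{"user": "asha", "page": "/home"}\n', "json", "2024-01-05")
ingest("sensors", b"t-17,21.4\nt-18,19.9\n", "csv", "2024-01-05")


def find(source):
    """Data discovery: look up stored files by source in the catalog."""
    return [entry["path"] for entry in catalog if entry["source"] == source]


sensor_paths = find("sensors")
print(sensor_paths)
print(json.dumps(catalog[0], indent=2))
```

Without the catalog entries, the files would still exist but would be hard to discover, which is why the metadata store and data catalog appear as separate components in the list above.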

## Data Warehousing vs. Data Lakes: Key Differences

1. **Data Structure:**
   - Data Warehousing: Data warehouses store structured data with predefined schemas, suitable for structured and well-defined business analysis.
   - Data Lakes: Data lakes store raw data in its native format, whether structured, semi-structured, or unstructured, allowing flexibility in data exploration and analysis.

2. **Data Processing Approach:**
   - Data Warehousing: Data processing in data warehouses follows the "Schema on Write" approach, where data is structured and transformed before being loaded into the warehouse.
   - Data Lakes: Data processing in data lakes follows the "Schema on Read" approach, meaning data is structured during analysis, offering greater flexibility for data exploration.

3. **Data Ingestion and Storage:**
   - Data Warehousing: Data warehouses focus on ingesting curated data from well-defined sources, following strict ETL (Extract, Transform, Load) processes.
   - Data Lakes: Data lakes ingest raw data from diverse sources, storing it in its native format without extensive transformation, allowing agile data integration.

4. **Data Processing Performance:**
   - Data Warehousing: Data warehouses excel in optimized query performance for structured data, supporting complex analytical queries.
   - Data Lakes: Data processing performance may vary in data lakes, as the data is often raw and unprocessed, requiring additional data transformation steps.

![DL-vs-DW1.png](attachment:DL-vs-DW1.png)


5. **Data Usage and Analytics:**
   - Data Warehousing: Data warehouses primarily serve business intelligence and reporting needs, offering curated and reliable data for decision-making.
   - Data Lakes: Data lakes cater to data exploration, advanced analytics, and machine learning applications, allowing data scientists to work with diverse, unstructured data.

6. **Scalability and Cost:**
   - Data Warehousing: Data warehouses are designed for moderate to large-scale data, but the schema and indexing can impact storage costs.
   - Data Lakes: Data lakes are highly scalable and can hold petabytes of data at a comparatively low storage cost, making them suitable for big data scenarios.

7. **Data Governance and Security:**
   - Data Warehousing: Data governance and security are well-established in data warehouses, ensuring data integrity and compliance.
   - Data Lakes: Data governance and security in data lakes require careful planning due to the diversity of data sources and the dynamic nature of raw data.

Both data warehousing and data lakes have their strengths and are suited to different use cases. Data warehouses are ideal for structured data and well-defined reporting needs, while data lakes excel in handling diverse and raw data for exploratory analytics and advanced data science applications. Organizations often use both solutions in combination to complement each other's capabilities and cover a wide range of data requirements.