# Dimensional modeling

- Dimensional modeling is a core concept in data warehousing and data engineering. It involves designing data structures that support efficient data analysis and reporting.
- It is a data modeling technique that organizes data so that it is optimized for analytical queries. It is built on two concepts: dimensions and facts.
- Facts are typically (but not always) numeric values that can be aggregated; dimensions are groups of hierarchies and descriptors that give the facts context.
- **Example:**
    - Dimensions are attributes of data that are used to describe it. For example, the dimensions of a sales transaction might include the product, customer, date, and store.
    - Facts are quantitative measures of data. For example, the fact of a sales transaction might be the quantity sold or the total price.
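
To make the distinction concrete, here is a minimal Python sketch of the sales transaction described above. The `SaleRecord` class, its field names, and the sample rows are hypothetical and chosen only for illustration; the point is that dimensions are what you group by, and facts are what you sum.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SaleRecord:
    # Dimensions: descriptive context used to filter and group.
    product: str
    customer: str
    sale_date: date
    store: str
    # Facts: numeric measures that can be aggregated.
    quantity_sold: int
    total_price: float


# Hypothetical sample data.
sales = [
    SaleRecord("Laptop", "Alice", date(2024, 1, 5), "Downtown", 1, 1200.0),
    SaleRecord("Mouse", "Bob", date(2024, 1, 5), "Downtown", 2, 50.0),
    SaleRecord("Laptop", "Carol", date(2024, 1, 6), "Airport", 1, 1150.0),
]


def total_by(records, dimension):
    """Sum the total_price fact, grouped by one dimension attribute."""
    totals = {}
    for record in records:
        key = getattr(record, dimension)
        totals[key] = totals.get(key, 0.0) + record.total_price
    return totals


revenue_by_product = total_by(sales, "product")
revenue_by_store = total_by(sales, "store")
```

The same fact (`total_price`) is aggregated twice, each time along a different dimension, which is the basic pattern behind most analytical queries.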


### Dimensional modeling schemas

In dimensional modeling, two schema designs are common:
- the star schema
- the snowflake schema
 
These schemas are specifically structured to support data warehousing and analytics. 

1. **Star Schema**:
   - **Description**: The star schema keeps each dimension in a single, fully denormalized table, which makes the model easy to read and efficient to query.
   - **Characteristics**:
     - Fact table at the center: A central fact table holds quantitative data, surrounded by dimension tables.
     - Denormalized dimensions: Each dimension table is denormalized, containing all the necessary attributes, including hierarchies.
     - Simple to understand: Star schemas are intuitive and straightforward for end users to work with.
   - **Advantages**:
     - Fast query performance: Queries are typically faster because denormalized dimensions require fewer joins.
     - Simplified reporting: Users can easily create reports and perform ad-hoc analysis.
     - Suitable for data warehousing and analytics.

2. **Snowflake Schema**:
   - **Description**: The snowflake schema is a normalized design where dimension tables are partially or fully normalized. It can be viewed as an extension of the star schema.
   - **Characteristics**:
     - Dimension table normalization: In a snowflake schema, dimension tables might be normalized to reduce data redundancy.
     - Multiple related tables: This can lead to a more complex schema with multiple related tables.
   - **Advantages**:
     - Data consistency: Normalization can improve data consistency and reduce the chances of update anomalies.
     - Space efficiency: Snowflake schemas can be more space-efficient, especially when dealing with large datasets.
     - Easier maintenance: Normalized data may be easier to maintain, especially when dealing with slowly changing dimensions (SCD).
   - **Considerations**:
     - Query complexity: Snowflake schemas can introduce additional complexity in query design due to the need to join multiple related tables.

The choice between star and snowflake schemas depends on various factors, including the organization's specific data needs, data update frequency, and performance requirements. In practice, many data warehousing solutions use a combination of both, adapting the schema design to suit different dimensions and business requirements within the same data warehouse.
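
The trade-off can be seen in a small sketch. The example below uses Python's built-in `sqlite3` module as a stand-in for a warehouse; all table names, column names, and rows are hypothetical. It models the same product dimension once in star form and once in snowflake form, then runs the same "revenue by category" question against both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: one denormalized product dimension (category stored inline).
CREATE TABLE product_dim_star (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT,
    category_name TEXT
);
-- Snowflake: category is normalized into its own table.
CREATE TABLE category_dim (
    category_key INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE product_dim_snow (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES category_dim(category_key)
);
CREATE TABLE sales_fact (
    product_key INTEGER,
    sales_amount REAL
);
INSERT INTO product_dim_star VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO category_dim VALUES (10, 'Electronics'), (20, 'Furniture');
INSERT INTO product_dim_snow VALUES (1, 'Laptop', 10), (2, 'Desk', 20);
INSERT INTO sales_fact VALUES (1, 1000.0), (1, 500.0), (2, 300.0);
""")

# Star schema: one join from the fact table to the dimension.
star_query = """
SELECT d.category_name, SUM(f.sales_amount)
FROM sales_fact f
JOIN product_dim_star d ON d.product_key = f.product_key
GROUP BY d.category_name
ORDER BY d.category_name
"""

# Snowflake schema: an extra join to reach the category name.
snowflake_query = """
SELECT c.category_name, SUM(f.sales_amount)
FROM sales_fact f
JOIN product_dim_snow p ON p.product_key = f.product_key
JOIN category_dim c ON c.category_key = p.category_key
GROUP BY c.category_name
ORDER BY c.category_name
"""

star_result = conn.execute(star_query).fetchall()
snowflake_result = conn.execute(snowflake_query).fetchall()
```

Both queries return the same totals. The snowflake version stores each category name once but needs one more join, which is the complexity cost noted above.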


<blockquote style="border-left: 5px solid #CEB5BC; background-color: #DFD531; padding: 10px; color: #000000;">
<h2>Warehouse dimensional modeling</h2>
</blockquote>

- Warehouse dimensional modeling is a data modeling technique for storing and organizing data in a data warehouse so that it is easy to query and analyze.

- It is particularly valuable when dealing with large datasets and complex business questions, and it supports a wide range of business intelligence and data warehousing applications.

- Warehouse dimensional modeling is based on the concept of **dimensions** and **facts**. 

  - **Dimensions** are the attributes by which you want to analyze your data. Common dimensions include time, geography, product, and customer, depending on the domain.
    - Dimensions often have hierarchies. For example, a time dimension might have a hierarchy of Year > Quarter > Month > Day. Hierarchies enable drill-down analysis.

  - **Facts** are quantitative measurements of data, such as sales, inventory, and website traffic.
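
As a rough illustration of drill-down along the Year > Quarter > Month > Day hierarchy, the sketch below rolls a few hypothetical daily sales amounts up to different levels. The function names and the sample data are assumptions made for this example only.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales facts: (date, amount).
daily_sales = [
    (date(2024, 1, 15), 100.0),
    (date(2024, 2, 10), 150.0),
    (date(2024, 4, 3), 200.0),
    (date(2024, 4, 20), 50.0),
]


def time_attributes(d):
    """Derive the hierarchy levels Year > Quarter > Month > Day for one date."""
    return {
        "year": d.year,
        "quarter": (d.month - 1) // 3 + 1,
        "month": d.month,
        "day": d.day,
    }


def roll_up(facts, levels):
    """Sum amounts at the given hierarchy levels, e.g. ("year", "quarter")."""
    totals = defaultdict(float)
    for d, amount in facts:
        attrs = time_attributes(d)
        totals[tuple(attrs[level] for level in levels)] += amount
    return dict(totals)


by_year = roll_up(daily_sales, ("year",))
by_quarter = roll_up(daily_sales, ("year", "quarter"))
by_month = roll_up(daily_sales, ("year", "quarter", "month"))
```

Moving from `by_year` to `by_quarter` to `by_month` is a drill-down: the same fact is shown at progressively finer levels of the time hierarchy.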

### Objectives of Dimensional Modeling
The purposes of dimensional modeling are:

- To produce a database structure that end users can easily understand and query.
- To maximize query efficiency. Both goals are pursued by minimizing the number of tables and the relationships between them.

### Types of tables in dimensional modeling

In a warehouse dimensional model, data is stored in two types of tables: 

  - **Dimension tables:**
    - Each dimension table contains attributes related to that dimension. For example, a "Time" dimension might have attributes like Year, Quarter, Month, and Day.
    - In a star schema, dimension tables are typically denormalized: all descriptive attributes, including hierarchy levels, sit in one wide table even if some values repeat. A snowflake schema normalizes them into several related tables.

  - **Fact tables:**
    - Fact tables store the measurable data points or metrics you want to analyze, such as sales revenue, quantity sold, or profit.
    - ***Foreign keys:*** Fact tables typically contain ***foreign keys*** that link to dimension tables. These foreign keys establish relationships between dimensions and facts.
    - In contrast to dimension tables, fact tables are narrow and normalized: each row holds only keys and measures, while descriptive text lives in the dimension tables. This keeps the largest tables in the warehouse compact, which helps query performance.

  <img src="images/Dimension-fact.png" alt="Types of data" style="max-width: 600px;"/>

Dimension tables contain data about the dimensions, and fact tables contain data about the facts.
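
A minimal sketch of this relationship, using Python's built-in `sqlite3` module as a stand-in for a warehouse database. The table and column names (`time_dim`, `sales_fact`, `time_key`) are hypothetical. Note that SQLite only enforces foreign keys when the `foreign_keys` pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Dimension table: descriptive attributes of the Time dimension.
conn.execute("""
    CREATE TABLE time_dim (
        time_key INTEGER PRIMARY KEY,
        year INTEGER, quarter INTEGER, month INTEGER, day INTEGER
    )
""")

# Fact table: a foreign key to the dimension plus numeric measures.
conn.execute("""
    CREATE TABLE sales_fact (
        time_key INTEGER NOT NULL REFERENCES time_dim(time_key),
        sales_revenue REAL,
        quantity_sold INTEGER
    )
""")

conn.execute("INSERT INTO time_dim VALUES (20240105, 2024, 1, 1, 5)")
conn.execute("INSERT INTO sales_fact VALUES (20240105, 99.5, 3)")

# A fact row pointing at a missing dimension row is rejected.
try:
    conn.execute("INSERT INTO sales_fact VALUES (19990101, 10.0, 1)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

fact_rows = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```

Many analytical warehouses treat foreign keys as informational and do not enforce them, so in practice this check is often done in the load process instead.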

### Advantages of Dimensional Modeling

Here are some of the benefits of using warehouse dimensional modeling:

- **Improved query performance:** Warehouse dimensional models are designed for fast querying and analysis. Denormalized dimension tables keep the number of joins low, and fact tables stay compact because they hold only keys and measures.
- **Reduced data redundancy:** Descriptive attributes are stored once in a dimension table instead of being repeated on every fact row. This can save storage space and improve data quality.
- **Improved data consistency:** Warehouse dimensional models can help to improve data consistency by enforcing referential integrity constraints between dimension tables and fact tables. This prevents fact rows from referencing dimension members that do not exist.
- **Improved data accessibility:** Data from different source systems is integrated into shared dimension and fact tables, typically in a single database, so it can be queried in one place.

Warehouse dimensional modeling is a valuable technique for any organization that needs to store and analyze large amounts of data. It can help organizations to improve their business intelligence and data warehousing capabilities.

### Disadvantages of Dimensional Modeling
- Loading the data warehouse with records from various operational systems is complicated, because the integrity of facts and dimensions must be maintained.
- It is difficult to modify the data warehouse if the organization changes the way it does business.

#### Example

Here are some examples of how warehouse dimensional modeling is used in the real world:

  - **Retail:** Retailers use warehouse dimensional modeling to analyze customer data, product data, and sales data. This data can be used to improve customer targeting, product recommendations, and inventory management.
  - **Finance:** Financial institutions use warehouse dimensional modeling to analyze customer data, transaction data, and market data. This data can be used to improve risk management, fraud detection, and investment decisions.
  - **Healthcare:** Healthcare organizations use warehouse dimensional modeling to analyze patient data, medical research data, and clinical trial data. This data can be used to improve patient care, develop new treatments, and reduce costs. 
  
  Warehouse dimensional modeling is a powerful technique that can be used to support a wide range of business applications.

### Steps to create a dimensional data model

Dimensional data modeling is a key technique in data warehousing and analytics. Here are the basic steps to create a dimensional data model:

1. **Identify Business Requirements**: Begin by understanding the business needs and the questions you want to answer with your data model. This step is crucial as it guides the entire modeling process.

2. **Select Dimension and Fact Tables**: Identify the entities (dimensions) and the measurable data (facts) that are relevant to your analysis. Dimensions are attributes by which you want to analyze your data, while facts are the numerical data points you want to measure.

3. **Design Dimension Tables**:
   - Create dimension tables for each dimension identified in step 2. These tables contain attributes related to each dimension. For example, if you are modeling sales data, you might have a "Time" dimension with attributes like Year, Quarter, Month, etc.
   - Define hierarchies within dimensions, such as Year > Quarter > Month in the Time dimension.

4. **Design Fact Tables**:
   - Create fact tables that store the measures or metrics you want to analyze. These tables typically contain foreign keys to link to dimension tables.
   - Decide on the granularity of your fact table. This determines the level of detail at which data is recorded. For example, daily sales or monthly sales.

5. **Establish Relationships**: Define the relationships between dimension tables and fact tables using foreign keys. This linkage allows you to perform meaningful queries across different dimensions.

6. **Create a Star or Snowflake Schema**:
   - In a star schema, all dimension tables directly relate to the fact table. It's a simple and denormalized structure.
   - In a snowflake schema, dimension tables may be normalized, leading to a more complex but space-efficient structure.

7. **Populate Data**: Load data into your dimension and fact tables from various sources, such as databases or external files. Ensure data quality and consistency.

8. **Add Hierarchies and Aggregations**:
   - Create hierarchies in your dimension tables to facilitate drill-down analysis.
   - Pre-calculate aggregations like sums, averages, or counts in aggregate (summary) tables derived from your fact table to improve query performance.

9. **Implement Business Logic**: Apply any required business rules or calculations to the data to meet specific analytical needs.

10. **Testing and Validation**: Thoroughly test your data model by running sample queries and ensuring the results match the expected outcomes.

11. **Documentation**: Document the structure of your dimensional data model, including definitions of dimensions, facts, hierarchies, and relationships. This documentation is crucial for future users.

12. **Optimization**: Continuously monitor and optimize the performance of your data model as the data volume grows or business requirements change.

13. **Usage and Reporting**: Finally, make the data model available to end-users through data visualization tools or analytics platforms for reporting and analysis.

Remember that dimensional modeling is highly dependent on the specific business context and data you're working with. Tailor your approach to meet the unique requirements of your projects, whether they are related to data science, data analytics, finance, or data engineering.
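
The pre-calculated aggregations mentioned in step 8 can be sketched as a summary table built from the detailed facts. The example uses Python's `sqlite3` module; the table names `daily_sales_fact` and `monthly_sales_agg` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_sales_fact (
    year INTEGER, month INTEGER, day INTEGER, sales_amount REAL
);
INSERT INTO daily_sales_fact VALUES
    (2024, 1, 5, 100.0), (2024, 1, 6, 40.0), (2024, 2, 1, 60.0);

-- Pre-computed monthly summary derived from the detailed fact table.
CREATE TABLE monthly_sales_agg AS
SELECT year, month, SUM(sales_amount) AS total_sales, COUNT(*) AS row_count
FROM daily_sales_fact
GROUP BY year, month;
""")

# Reports at the monthly level can now read the small summary table.
monthly = conn.execute(
    "SELECT year, month, total_sales, row_count "
    "FROM monthly_sales_agg ORDER BY year, month"
).fetchall()
```

The summary has to be refreshed whenever the detailed table changes, so it trades extra load work for faster reads.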

The main steps of dimensional modeling can be summarized as:

<img src="images/Data-modelling.png" alt="Types of data" style="max-width: 400px;"/>

### Ways to structure a data warehouse

There are several ways to structure a data warehouse, depending on the specific needs and requirements of the organization. Here are some common approaches to structuring a data warehouse:

1. **Star Schema:** The star schema is a widely used structure in data warehousing. It consists of a central fact table surrounded by multiple dimension tables. The fact table contains the measurements or facts of the business, such as sales or revenue, and the dimension tables provide context to these facts. Each dimension table represents a different aspect of the business, such as time, product, customer, or location. The star schema offers simplicity, ease of understanding, and efficient query performance.

<img src="images/image-4.png" alt="Types of data" style="max-width: 700px;"/>

(Image credit: https://en.wikipedia.org/wiki/Snowflake_schema)

2. **Snowflake Schema:** The snowflake schema is an extension of the star schema. It expands on the dimension tables by normalizing them into multiple levels. In a snowflake schema, dimension tables are connected through hierarchical relationships, resulting in a more complex structure. This schema can be useful when dealing with dimensions that have many attributes and hierarchies. However, it may require more complex query joins and can potentially impact performance.

<img src="images/image-5.png" alt="Types of data" style="max-width: 700px;"/>


(Image credit: https://en.wikipedia.org/wiki/Snowflake_schema)

3. **Fact Constellation (Galaxy) Schema:** The fact constellation schema, also known as the galaxy schema, is a more complex structure that consists of multiple fact tables sharing dimension tables. It is suitable when there are multiple fact tables representing different business processes or areas, but they share common dimensions. The fact constellation schema offers greater flexibility and can support complex analysis involving multiple fact tables.

<img src="images/image-6.png" alt="Types of data" style="max-width: 700px;"/>


(Image credit: https://www.geeksforgeeks.org/fact-constellation-in-data-warehouse-modelling/)

4. **Data Vault:** The Data Vault methodology is a modeling technique that focuses on scalability, flexibility, and historical tracking of data. It involves separating the data into three main types of tables: the hub tables representing core business entities, the satellite tables containing descriptive information about the hubs, and the link tables that capture the relationships between hubs. Data Vault structures are designed to handle large volumes of data, accommodate changes over time, and provide traceability of data.

<img src="images/image-7.png" alt="Types of data" style="max-width: 700px;"/>


(Image credit: https://blog.viadee.de/data-vault-nutzen-und-funktionsweise)

5. **Hybrid Approaches:** In practice, many data warehouses use a combination of different structures to meet specific needs. This can involve a mix of star schemas, snowflake schemas, and other modeling techniques. Hybrid approaches allow organizations to strike a balance between simplicity, performance, and flexibility by adopting different structures based on the specific requirements of different parts of the data warehouse.

The choice of data warehouse structure depends on factors such as the complexity of the data, the analytical requirements, the organization's reporting needs, and the scalability and performance considerations. It's important to carefully analyze the data and business requirements before determining the appropriate structure for a data warehouse.
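
As an illustration of the fact constellation idea, the sketch below has two fact tables sharing one product dimension. It uses Python's `sqlite3` module, and all table names, columns, and rows are hypothetical. Each fact table is aggregated on its own before the results are combined through the shared dimension; joining the two fact tables row by row would double count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sales_fact (product_key INTEGER, quantity_sold INTEGER);
CREATE TABLE shipment_fact (product_key INTEGER, quantity_shipped INTEGER);
INSERT INTO product_dim VALUES (1, 'Laptop'), (2, 'Desk');
INSERT INTO sales_fact VALUES (1, 3), (1, 2), (2, 4);
INSERT INTO shipment_fact VALUES (1, 4), (2, 4);
""")

# Aggregate each fact table separately, then join on the shared dimension.
drill_across = conn.execute("""
    WITH sold AS (
        SELECT product_key, SUM(quantity_sold) AS qty
        FROM sales_fact GROUP BY product_key
    ),
    shipped AS (
        SELECT product_key, SUM(quantity_shipped) AS qty
        FROM shipment_fact GROUP BY product_key
    )
    SELECT p.product_name, sold.qty, shipped.qty
    FROM product_dim p
    JOIN sold ON sold.product_key = p.product_key
    JOIN shipped ON shipped.product_key = p.product_key
    ORDER BY p.product_name
""").fetchall()
```

Because both fact tables use the same `product_key`, the two business processes can be compared side by side for each product.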

### Types of Fact tables

Fact tables are a fundamental component of a star or snowflake schema in a data warehousing environment. They store quantitative data (measures) that can be analyzed and aggregated. There are several types of fact tables, each designed to serve a specific analytical purpose. Here are some common types of fact tables:

1. **Transaction Fact Table**:
   - Also known as a "detail" or "atomic" fact table, this type records individual transactions or events at the most granular level. It contains foreign keys to dimension tables and measures associated with each event. Transaction fact tables are used for detailed analysis and are the basis for aggregations in other fact tables.

2. **Snapshot Fact Table**:
   - A snapshot fact table captures data at specific points in time, often at regular intervals (e.g., daily, weekly, monthly). It's useful for tracking changes over time and creating historical reports. Examples include daily sales or inventory snapshots.

3. **Accumulating Snapshot Fact Table**:
   - This type of fact table is used to track a process that goes through well-defined stages. One row represents one process instance (for example, one order) and is updated as the instance reaches each stage. It is commonly used for business processes such as order fulfillment or project management.

4. **Periodic Snapshot Fact Table**:
   - This is the regular-interval form of the snapshot fact table described above. Unlike the accumulating snapshot, it records data at fixed intervals (periods) rather than at process stages. It's often used for analyzing performance metrics, such as monthly sales or quarterly revenue.

5. **Factless Fact Table**:
   - A factless fact table contains only foreign keys to dimension tables and no measurable data (facts). It's used to represent events or relationships between dimensions. For example, it can be used to track student attendance, where the fact is the presence of a student in a class.

6. **Semi-additive Fact Table**:
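   - Semi-additive fact tables contain measures that can be aggregated along some dimensions but not along others. For instance, account balances or inventory levels can be summed across accounts or products at a single point in time, but not across time, because adding Monday's balance to Tuesday's balance doesn't produce a meaningful number. Sales amounts, by contrast, are fully additive.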

7. **Non-additive Fact Table**:
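   - Non-additive fact tables contain measures that cannot be meaningfully summed along any dimension. These measures are typically ratios, percentages, or other derived values, such as profit margin or customer churn rate. They are usually recomputed from their additive components at the level of the query.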

8. **Degenerate Dimension Fact Table**:
   - This type of fact table incorporates dimensions that are not represented in dimension tables because they don't provide additional information. Instead, they are included as attributes in the fact table. Common examples include transaction IDs and invoice numbers.

9. **Multi-fact Table**:
   - Multi-fact tables combine multiple measures from different fact tables into a single table. This is done to simplify querying and reporting when multiple measures are often used together.

The choice of which type of fact table to use depends on the specific analytical requirements of your data warehousing project. Different types of fact tables serve different purposes and are optimized for different types of queries and reporting.
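
Additivity is easiest to see with numbers. The sketch below uses hypothetical account balances (a semi-additive measure) and hypothetical revenue and profit figures (from which a non-additive margin is derived). The function names and data are assumptions for illustration only.

```python
# Hypothetical end-of-day account balances: (day, account, balance).
balances = [
    ("2024-01-01", "A", 100.0),
    ("2024-01-01", "B", 50.0),
    ("2024-01-02", "A", 120.0),
    ("2024-01-02", "B", 30.0),
]


def total_balance_on(rows, day):
    """Summing across accounts for ONE day is meaningful."""
    return sum(balance for d, _, balance in rows if d == day)


def average_daily_balance(rows, account):
    """Across time, balances are averaged (or the last value taken), not summed."""
    values = [balance for _, a, balance in rows if a == account]
    return sum(values) / len(values)


total_jan_2 = total_balance_on(balances, "2024-01-02")
avg_balance_a = average_daily_balance(balances, "A")

# Non-additive measure: profit margin. Hypothetical (revenue, profit) rows.
sales = [(100.0, 20.0), (300.0, 30.0)]

# Correct: recompute the ratio from its additive components.
overall_margin = sum(p for _, p in sales) / sum(r for r, _ in sales)

# Misleading: averaging the per-row margins ignores the row sizes.
average_of_margins = sum(p / r for r, p in sales) / len(sales)
```

Here the overall margin is 12.5%, while the naive average of the two row margins is about 15%, which is why ratios are stored as their additive parts and computed at query time.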

### Steps to create a fact table

Creating a fact table is a fundamental step in designing a data warehouse schema. Here are the steps to create a fact table:

1. **Define the Business Process or Analytical Objective**:
   - Start by understanding the business process or analytical objective that you want to model using the fact table. Determine what kind of measures (quantitative data) you need to track and analyze.

2. **Identify Dimensions**:
   - Identify the dimensions that are relevant to the business process. Dimensions are descriptive attributes that provide context to the measures in the fact table. For example, if you are tracking sales, dimensions might include time, product, store, and customer.

3. **Determine Grain**:
   - Define the grain or granularity of the fact table. This specifies the level of detail at which you want to record data. For example, in a sales fact table, you might choose to record sales at the daily, weekly, or monthly level.

4. **Select Measures**:
   - Determine the measures (facts) you want to store in the fact table. These are the quantitative values that you want to analyze. Common measures include sales revenue, quantity sold, cost, and profit.

5. **Design the Schema**:
   - Decide on the schema type for your data warehouse. Common schema types include star schema and snowflake schema. In a star schema, the fact table is at the center, surrounded by dimension tables. In a snowflake schema, dimension tables may be further normalized.

6. **Create the Fact Table**:
   - Using your chosen schema design, create the fact table in your data warehouse. Define the columns for the measures and foreign keys for the related dimension tables. Ensure that the fact table has a primary key.

7. **Load Data**:
   - Populate the fact table with data from your source systems. This may involve ETL (Extract, Transform, Load) processes to transform and clean the data before inserting it into the fact table.

8. **Establish Relationships**:
   - Create relationships between the fact table and dimension tables using foreign keys. These relationships enable you to join the fact table with dimension tables for analysis.

9. **Indexing and Partitioning**:
   - Depending on the size of your fact table and the query performance requirements, consider indexing and partitioning strategies. Indexes can improve query performance, while partitioning can help manage large fact tables.

10. **Testing and Validation**:
    - Thoroughly test the fact table by running sample queries and ensuring that the data is accurate and consistent with your business requirements. Validate that the aggregations and calculations work as expected.

11. **Documentation**:
    - Document the fact table structure, relationships with dimension tables, and the meaning of each measure. This documentation is essential for the understanding and maintenance of the data warehouse.

12. **Maintenance and Updates**:
    - As your data warehouse evolves, you may need to update and maintain the fact table. This includes adding new data, handling changes in source systems, and ensuring data quality.

13. **Access Control**:
    - Implement access control and security measures to restrict access to the fact table based on user roles and permissions.

Creating a fact table is a critical step in building an effective data warehouse that supports business intelligence and analytics. Properly designed and maintained fact tables are key to extracting valuable insights from your data.

#### Example

Let's create a fact table for a fictional retail business to track sales data. We'll go through the steps of creating the fact table using this example:

**Step 1: Define the Business Process**
- Our business process is to track and analyze sales data. We want to understand how much revenue we generate from the sales of various products in different stores over time.

**Step 2: Identify Dimensions**
- Dimensions for our sales fact table include:
  - Time (Date of the sale)
  - Product (Product ID, Name, Category)
  - Store (Store ID, Name, Location)
  - Customer (Customer ID, Name)

**Step 3: Determine Grain**
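- We'll record one row per sale, so our grain is the individual sales transaction. Each row carries the date of the sale, so daily totals can be derived by aggregating on the date.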

**Step 4: Select Measures**
- Measures we want to include in our fact table:
  - Sales Amount (Revenue generated from each sale)
  - Quantity Sold (Number of products sold in each transaction)

**Step 5: Design the Schema**
- In this example, we'll use a star schema. The fact table will be at the center, surrounded by dimension tables (Time, Product, Store, Customer).

**Step 6: Create the Fact Table**
- Our fact table, let's call it `SALES_FACT`, is created with the following columns:
  - `SaleID` (Primary Key)
  - `DateKey` (Foreign Key to Time Dimension)
  - `ProductKey` (Foreign Key to Product Dimension)
  - `StoreKey` (Foreign Key to Store Dimension)
  - `CustomerKey` (Foreign Key to Customer Dimension)
  - `SalesAmount` (Numeric)
  - `QuantitySold` (Numeric)


<img src="images/sales-fact-dime.png" alt="Types of data" style="max-width: 800px;"/>


**Step 7: Load Data**
- Populate the `SALES_FACT` table with data from source systems. This involves extracting data from various sources, transforming it (e.g., data cleansing, data type conversions), and loading it into the fact table.

**Step 8: Establish Relationships**
- Create relationships between the fact table (`SALES_FACT`) and dimension tables (e.g., `TIME_DIM`, `PRODUCT_DIM`, `STORE_DIM`, `CUSTOMER_DIM`) using foreign keys. These relationships enable us to join the fact table with dimension tables for analysis.
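
Steps 6 and 8 can be sketched with Python's built-in `sqlite3` module as a stand-in for the warehouse. The table names and the `SALES_FACT` columns follow the example above; the dimension columns (`FullDate`, `ProductName`, `Location`, and so on), the data types, and the sample rows are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE TIME_DIM (DateKey INTEGER PRIMARY KEY, FullDate TEXT, Year INTEGER, Month INTEGER);
CREATE TABLE PRODUCT_DIM (ProductKey INTEGER PRIMARY KEY, ProductName TEXT, Category TEXT);
CREATE TABLE STORE_DIM (StoreKey INTEGER PRIMARY KEY, StoreName TEXT, Location TEXT);
CREATE TABLE CUSTOMER_DIM (CustomerKey INTEGER PRIMARY KEY, CustomerName TEXT);

CREATE TABLE SALES_FACT (
    SaleID INTEGER PRIMARY KEY,
    DateKey INTEGER NOT NULL REFERENCES TIME_DIM(DateKey),
    ProductKey INTEGER NOT NULL REFERENCES PRODUCT_DIM(ProductKey),
    StoreKey INTEGER NOT NULL REFERENCES STORE_DIM(StoreKey),
    CustomerKey INTEGER NOT NULL REFERENCES CUSTOMER_DIM(CustomerKey),
    SalesAmount REAL NOT NULL,
    QuantitySold INTEGER NOT NULL
);

-- Hypothetical sample rows: dimensions first, then facts.
INSERT INTO TIME_DIM VALUES (20240105, '2024-01-05', 2024, 1), (20240106, '2024-01-06', 2024, 1);
INSERT INTO PRODUCT_DIM VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO STORE_DIM VALUES (1, 'Downtown', 'Berlin'), (2, 'Airport', 'Munich');
INSERT INTO CUSTOMER_DIM VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO SALES_FACT VALUES
    (1, 20240105, 1, 1, 1, 1200.0, 1),
    (2, 20240105, 2, 1, 2, 300.0, 2),
    (3, 20240106, 1, 2, 1, 1150.0, 1);
""")

# Join the fact table to two dimensions and aggregate the measures.
revenue_by_category_and_store = conn.execute("""
    SELECT p.Category, s.StoreName, SUM(f.SalesAmount), SUM(f.QuantitySold)
    FROM SALES_FACT f
    JOIN PRODUCT_DIM p ON p.ProductKey = f.ProductKey
    JOIN STORE_DIM s ON s.StoreKey = f.StoreKey
    GROUP BY p.Category, s.StoreName
    ORDER BY p.Category, s.StoreName
""").fetchall()
```

The foreign keys do double duty here: they guard against fact rows with unknown dimension members, and they are the join paths used by every report.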

**Step 9: Indexing and Partitioning**
- Depending on the size of the fact table, consider adding indexes on foreign key columns for performance optimization. You may also partition the fact table by time to manage large volumes of data efficiently.
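
A small sketch of the indexing part, again using Python's `sqlite3` module with a simplified `SALES_FACT` table. The index naming scheme is an assumption. SQLite has no native table partitioning, so that part is not shown; partitioning syntax is specific to each warehouse platform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SALES_FACT (
        SaleID INTEGER PRIMARY KEY,
        DateKey INTEGER, ProductKey INTEGER, StoreKey INTEGER, CustomerKey INTEGER,
        SalesAmount REAL, QuantitySold INTEGER
    )
""")

# One index per foreign key column, since these columns drive joins and filters.
for column in ("DateKey", "ProductKey", "StoreKey", "CustomerKey"):
    conn.execute(
        f"CREATE INDEX idx_sales_fact_{column.lower()} ON SALES_FACT ({column})"
    )

index_names = sorted(
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'SALES_FACT'"
    )
)

# Ask SQLite how it would run a date-filtered query.
# The exact wording of the plan varies between SQLite versions.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT SUM(SalesAmount) FROM SALES_FACT WHERE DateKey = 20240105"
).fetchall()
```

Indexes speed up reads but slow down loads, so whether to index every foreign key depends on the workload and on the database engine in use.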

**Step 10: Testing and Validation**
- Perform thorough testing by running sample queries and ensuring that data is accurate and consistent. Validate that aggregations and calculations (e.g., total revenue) work as expected.
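
Two typical validation checks can be sketched as follows, using Python's `sqlite3` module and a simplified version of the example tables. The sample rows and the `expected_total` control figure are hypothetical; in practice the control total would come from the source system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PRODUCT_DIM (ProductKey INTEGER PRIMARY KEY, ProductName TEXT);
CREATE TABLE SALES_FACT (SaleID INTEGER PRIMARY KEY, ProductKey INTEGER, SalesAmount REAL);
INSERT INTO PRODUCT_DIM VALUES (1, 'Laptop'), (2, 'Desk');
-- Sale 3 deliberately references a product that does not exist.
INSERT INTO SALES_FACT VALUES (1, 1, 1200.0), (2, 2, 300.0), (3, 99, 75.0);
""")

# Check 1: fact rows whose ProductKey has no matching dimension row.
orphans = conn.execute("""
    SELECT f.SaleID
    FROM SALES_FACT f
    LEFT JOIN PRODUCT_DIM p ON p.ProductKey = f.ProductKey
    WHERE p.ProductKey IS NULL
""").fetchall()

# Check 2: compare the loaded total with a control total from the source.
expected_total = 1575.0  # hypothetical figure reported by the source system
loaded_total = conn.execute("SELECT SUM(SalesAmount) FROM SALES_FACT").fetchone()[0]
totals_match = abs(loaded_total - expected_total) < 0.01
```

In this sample the totals reconcile, but the orphan check still flags sale 3, which shows why both kinds of check are worth running.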

**Step 11: Documentation**
- Document the structure of the `SALES_FACT` table, including column descriptions and relationships with dimension tables. Explain the meaning of each measure to aid in understanding and maintenance.

**Step 12: Maintenance and Updates**
- Regularly update the `SALES_FACT` table with new sales data as it becomes available. Ensure that historical data remains accessible and accurate.

**Step 13: Access Control**
- Implement access control and security measures to restrict access to the fact table based on user roles and permissions to protect sensitive data.

By following these steps and using the example of a sales fact table, you can create a structured and well-designed fact table that supports analytics and reporting for your retail business.