<a href="https://colab.research.google.com/github/brendanpshea/data-science/blob/main/DataScience_01_OrganizingData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 1: Organizing Data for Analysis - From Tom Nook's Shop to Data Warehouses

In the world of data science, organizing and managing data effectively is crucial for deriving meaningful insights. This chapter uses the familiar setting of Tom Nook's shop in Nintendo's popular Animal Crossing to introduce fundamental concepts in data organization and management. We'll explore how data structures evolve from simple lists to complex data warehouses, mirroring the growth of Tom's business from a small island shop to a multi-island empire.

We begin with basic data organization using lists and simple databases, progressing to more advanced concepts like relational databases, data normalization, and non-relational databases. As Tom's business expands, we'll dive into data lakes, warehouses, and marts, explaining how these large-scale data storage solutions support business intelligence and decision-making.

The chapter also covers essential data processing systems - Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) - demonstrating how they handle Tom's daily sales and complex business analyses respectively. We'll explore data warehouse design patterns like Star and Snowflake schemas, showing how they optimize data for analytical queries.

Finally, we'll discuss the concept of Slowly Changing Dimensions, illustrating how businesses like Tom's can track historical changes in their data, enabling trend analysis and informed decision-making.

By the end of this chapter, you'll have a comprehensive understanding of data organization principles, from basic storage to complex analytical systems, all contextualized through the lens of a familiar and relatable business scenario.

## Learning Outcomes

By the end of this chapter, you will be able to:

1. Explain the basic principles of data organization and storage
2. Describe the structure and purpose of relational databases
3. Understand the process and benefits of data normalization
4. Compare and contrast relational and non-relational databases
5. Define and differentiate between data lakes, data warehouses, and data marts
6. Explain the fundamental differences between OLTP and OLAP systems
7. Describe the structure and benefits of Star and Snowflake schemas in data warehouses
8. Understand the concept of Slowly Changing Dimensions and their importance in tracking historical data
9. Apply data organization principles to real-world business scenarios
10. Evaluate the appropriate data storage and processing solutions for different business needs

## Keywords

Relational databases, data normalization, non-relational databases, data lakes, data warehouses, data marts, OLTP, OLAP, Star schema, Snowflake schema, Slowly Changing Dimensions



## How Does Tom Nook Keep Track of His Inventory?

Imagine walking into Tom Nook's general store in Animal Crossing. The shelves are neatly stocked with various items - from furniture to tools, clothing to snacks. Have you ever wondered how Tom manages to keep track of all these items efficiently? The answer lies in the world of data organization, specifically in **data schemas** and **dimensions**.

A **data schema** is like a blueprint for organizing information. It defines the structure of a database, describing how data is arranged and how different pieces of information relate to each other. In Tom's case, a data schema would outline how he stores information about each item in his shop.

**Dimensions**, on the other hand, are the different ways we can describe or categorize data. They're like the characteristics or attributes of the information we're storing. For Tom's inventory, dimensions might include things like item type, price, color, or size.

Let's look at a simple example of how Tom might organize his data, using a programming language called **Structured Query Language (SQL)**.

In [1]:
#load sql magic and connect to nook.db
%reload_ext sql
%config SqlMagic.autopandas=True
%sql sqlite:///nook.db

In [2]:
%%sql
DROP TABLE IF EXISTS Inventory;
CREATE TABLE Inventory (
    item_id TEXT PRIMARY KEY,
    name TEXT,
    type TEXT,
    price INTEGER,
    color TEXT,
    stock INTEGER
);

INSERT INTO Inventory (item_id, name, type, price, color, stock) VALUES
('001', 'Leaf Table', 'Furniture', 1200, 'Green', 5),
('002', 'Fishing Rod', 'Tool', 500, 'Blue', 10),
('003', 'T-shirt', 'Clothing', 800, 'Red', 15);

SELECT * FROM Inventory;

 * sqlite:///nook.db
Done.
Done.
3 rows affected.
Done.


Unnamed: 0,item_id,name,type,price,color,stock
0,1,Leaf Table,Furniture,1200,Green,5
1,2,Fishing Rod,Tool,500,Blue,10
2,3,T-shirt,Clothing,800,Red,15


In this short SQL script, we see the following:
-   The `CREATE TABLE` statement initializes a new table within the database. It defines the table's name (`Inventory`) and its columns, along with the column names and data types. This sets up the foundational structure for how data will be stored and organized.
-  Each column in the table has an associated **data type** (e.g., `TEXT` for text strings, `INTEGER` for whole numbers). Data types specify the kind of data that can be stored in each column, ensuring that each piece of data is stored in a consistent format, which helps maintain data integrity and facilitates efficient querying.
- The `INSERT INTO` statement is used to add new records (rows) to the table. It specifies the columns to be filled and the corresponding values for each new record. For example, `INSERT INTO Inventory (item_id, name, type, price, color, stock) VALUES ('001', 'Leaf Table', 'Furniture', 1200, 'Green', 5)` adds a new item to the `Inventory` table with its specific attributes.
-  The `SELECT` statement is used to retrieve data from the table. The command `SELECT * FROM Inventory;` fetches all records from the `Inventory` table. This allows us to view the entire dataset stored within the table, which is crucial for analyzing and using the data effectively.

Once we have data encoded in this way, we can easily write queries to find out the stock of each item:

In [3]:
%%sql
SELECT name, stock
FROM Inventory
WHERE type = 'Furniture';

 * sqlite:///nook.db
Done.


Unnamed: 0,name,stock
0,Leaf Table,5


Here, we see a few new things:
-  In the `SELECT` statement, specifying columns allows us to retrieve only certain pieces of data from the table. Here, `SELECT name, stock` tells the database to fetch only the `name` and `stock` columns from the `Inventory` table, instead of retrieving all the columns.
-   The `FROM` clause indicates the table from which to retrieve the data. In this case, `FROM Inventory` specifies that we are pulling data from the `Inventory` table.
-  The `WHERE` clause is used to filter the records based on specific conditions. `WHERE type = 'Furniture'` restricts the query to only include rows where the `type` column has the value 'Furniture'. This allows us to narrow down the results to only the relevant data.

Together, this query fetches the names and stock quantities of all items in the `Inventory` table that are categorized as 'Furniture'.


Understanding data schemas and dimensions is crucial in data science because they form the foundation of how we organize, store, and analyze data. A well-designed schema makes it easier to:

1.  *Retrieve information*. Tom can quickly find out how many Fishing Rods he has in stock.
2.  *Update data*. If Tom sells a Leaf Table, he can easily decrease the stock count.
3.  *Analyze trends*. Tom can look at which types of items are selling best or which colors are most popular.
4.  *Make decisions*. Based on his data, Tom can decide what items to restock or what new products to introduce.
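
The retrieve, update, and analyze steps in the list above can be sketched in Python with the standard-library `sqlite3` module. This is a minimal illustration rather than part of Tom's notebook: it assumes an in-memory database (so `nook.db` is left untouched) and recreates the three sample rows from the first `Inventory` table.

```python
import sqlite3

# In-memory database (an assumption for this sketch; the notebook uses nook.db)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Inventory (
        item_id TEXT PRIMARY KEY, name TEXT, type TEXT,
        price INTEGER, color TEXT, stock INTEGER
    )
""")
conn.executemany(
    "INSERT INTO Inventory VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("001", "Leaf Table", "Furniture", 1200, "Green", 5),
        ("002", "Fishing Rod", "Tool", 500, "Blue", 10),
        ("003", "T-shirt", "Clothing", 800, "Red", 15),
    ],
)

# Update data: Tom sells one Leaf Table, so its stock drops by one
conn.execute("UPDATE Inventory SET stock = stock - 1 WHERE name = ?", ("Leaf Table",))

# Retrieve information: how many Leaf Tables are left?
leaf_stock = conn.execute(
    "SELECT stock FROM Inventory WHERE name = ?", ("Leaf Table",)
).fetchone()[0]

# Analyze trends: total units in stock per item type
stock_by_type = dict(
    conn.execute("SELECT type, SUM(stock) FROM Inventory GROUP BY type")
)
print(leaf_stock, stock_by_type)
```

The `?` placeholders let `sqlite3` substitute values safely instead of building SQL strings by hand.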

As we delve deeper into data science, you'll see how these basic concepts of schemas and dimensions play a crucial role in more complex data structures and analysis techniques. They're the building blocks that allow data scientists to turn raw information into valuable insights, helping businesses like Tom's general store thrive in the digital age.

### How Does Tom Nook Manage Complex Data Relationships?

As Tom Nook's business grows, he needs a more sophisticated way to manage his data. Enter **relational databases**, a powerful tool that allows Tom to organize and connect different types of information efficiently.

A **relational database** is a type of database that stores and provides access to data points that are related to one another. It's based on the relational model, an intuitive, straightforward way of representing data in tables. In a relational database, each row in the table is a record with a unique ID called the key. Columns of the table hold attributes of the data, and each record usually has a value for each attribute.

Let's see how Tom might use a relational database to manage his store. First, let's recreate his inventory table.


In [4]:
%%sql
DROP TABLE IF EXISTS Inventory;
-- Create the Inventory table
CREATE TABLE Inventory (
    ItemID TEXT PRIMARY KEY,
    Name TEXT,
    Type TEXT,
    Price INTEGER,
    Color TEXT
);

-- Insert some sample data
INSERT INTO Inventory VALUES
    ('001', 'Leaf Table', 'Furniture', 1200, 'Green'),
    ('002', 'Fishing Rod', 'Tool', 500, 'Blue'),
    ('003', 'T-shirt', 'Clothing', 800, 'Red');

-- select all data
SELECT * FROM Inventory;

 * sqlite:///nook.db
Done.
Done.
3 rows affected.
Done.


Unnamed: 0,ItemID,Name,Type,Price,Color
0,1,Leaf Table,Furniture,1200,Green
1,2,Fishing Rod,Tool,500,Blue
2,3,T-shirt,Clothing,800,Red


Now, let's create a table for `Stock`.

In [5]:
%%sql
DROP TABLE IF EXISTS Stock;
-- Create the Stock table
CREATE TABLE Stock (
    StockID TEXT PRIMARY KEY,
    ItemID TEXT,
    Quantity INTEGER,
    Location TEXT,
    FOREIGN KEY (ItemID) REFERENCES Inventory(ItemID)
);

INSERT INTO Stock VALUES
    ('S001', '001', 5, 'Shelf A'),
    ('S002', '002', 10, 'Shelf B'),
    ('S003', '003', 15, 'Shelf C');

SELECT * FROM Stock;

 * sqlite:///nook.db
Done.
Done.
3 rows affected.
Done.


Unnamed: 0,StockID,ItemID,Quantity,Location
0,S001,1,5,Shelf A
1,S002,2,10,Shelf B
2,S003,3,15,Shelf C


Finally, let's create a table for `Sales`.

In [6]:
%%sql
DROP TABLE IF EXISTS Sales;

-- Create the Sales table
CREATE TABLE Sales (
    SaleID TEXT PRIMARY KEY,
    ItemID TEXT,
    Quantity INTEGER,
    Date TEXT,
    FOREIGN KEY (ItemID) REFERENCES Inventory(ItemID)
);

INSERT INTO Sales VALUES
    ('SA001', '001', 1, '2024-06-27'),
    ('SA002', '002', 2, '2024-06-27'),
    ('SA003', '003', 3, '2024-06-28');


SELECT * FROM Sales;


 * sqlite:///nook.db
Done.
Done.
3 rows affected.
Done.


Unnamed: 0,SaleID,ItemID,Quantity,Date
0,SA001,1,1,2024-06-27
1,SA002,2,2,2024-06-27
2,SA003,3,3,2024-06-28



In this relational structure, we have three tables: Inventory, Stock, and Sales. These tables are **related** to each other through the ItemID. This relationship allows Tom to connect information across tables, enabling more complex queries and data analysis.

Key concepts in relational databases include:

1.  **Tables**: Also known as relations, these are the main structures in a relational database. Each table represents a specific type of entity (like items, stock, or sales).
2.  **Columns**: Also called attributes or fields, these define the type of data stored in the table (like ItemID, Name, Price).
3.  **Rows**: Also known as records or tuples, each row represents a single data entry in the table.
4.  **Primary Key**: A unique identifier for each row in a table (like ItemID in the Inventory table).
5.  **Foreign Key**: A field in one table that uniquely identifies a row of another table (like ItemID in the Stock and Sales tables, which refers to the Inventory table).
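
To see what primary and foreign keys actually enforce, here is a small Python sketch using the standard-library `sqlite3` module. It assumes an in-memory database and trimmed-down versions of the `Inventory` and `Stock` tables. One detail worth knowing: SQLite only checks foreign keys when `PRAGMA foreign_keys = ON` has been set for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is on for the connection
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Inventory (ItemID TEXT PRIMARY KEY, Name TEXT)")
conn.execute("""
    CREATE TABLE Stock (
        StockID TEXT PRIMARY KEY,
        ItemID TEXT,
        Quantity INTEGER,
        FOREIGN KEY (ItemID) REFERENCES Inventory(ItemID)
    )
""")
conn.execute("INSERT INTO Inventory VALUES ('001', 'Leaf Table')")
conn.execute("INSERT INTO Stock VALUES ('S001', '001', 5)")  # valid reference

rejected = []
try:
    # Primary key violation: StockID 'S001' already exists
    conn.execute("INSERT INTO Stock VALUES ('S001', '001', 7)")
except sqlite3.IntegrityError:
    rejected.append("duplicate primary key")
try:
    # Foreign key violation: there is no item '999' in Inventory
    conn.execute("INSERT INTO Stock VALUES ('S002', '999', 3)")
except sqlite3.IntegrityError:
    rejected.append("unknown foreign key")

print(rejected)
```

Both bad inserts are refused by the database itself, so Tom's tables cannot drift into a state where stock points at an item that does not exist.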

Let's see how Tom might use this relational structure in practice:

In [7]:
%%sql
-- Query to get sales info for a specific item
SELECT Inventory.Name, Sales.Quantity, Sales.Date
FROM Sales
JOIN Inventory ON Sales.ItemID = Inventory.ItemID
WHERE Inventory.Name = 'Leaf Table';

 * sqlite:///nook.db
Done.


Unnamed: 0,Name,Quantity,Date
0,Leaf Table,1,2024-06-27


In this query, the `JOIN` clause (seen above) is used to combine rows from two or more tables based on a related column between them. In relational databases, **joins** are central because they allow us to create more complex queries by linking tables that share common data, enabling comprehensive data analysis across multiple data sets. Here are the basic ideas:

-   `FROM Sales` indicates that the primary table for this query is `Sales`. However, we need data that isn't (just) in this table!
-   `JOIN Inventory ON Sales.ItemID = Inventory.ItemID` specifies that we are joining the `Sales` table with the `Inventory` table. The join is performed on the `ItemID` column, which must be present in both tables. This operation links sales data with the corresponding inventory information for each item.
-  `SELECT Inventory.Name, Sales.Quantity, Sales.Date` tells the database to retrieve specific columns: the `Name` column from the `Inventory` table and the `Quantity` and `Date` columns from the `Sales` table. This selective retrieval allows us to get only the relevant data we need.
-  `WHERE Inventory.Name = 'Leaf Table'` filters the results to include only the sales information for the item named 'Leaf Table'. This ensures the query returns data specifically related to this item.

In the end, this allows Tom Nook to answer questions that require using data from multiple related "tables" in the database.
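
Joins become more useful still when combined with aggregation. The sketch below is an illustration in Python's standard-library `sqlite3` module, assuming an in-memory copy of the `Inventory` and `Sales` sample rows. It computes revenue per item by multiplying each sale's quantity by the item's price, a value that lives in a different table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Inventory (ItemID TEXT PRIMARY KEY, Name TEXT, Price INTEGER);
    CREATE TABLE Sales (SaleID TEXT PRIMARY KEY, ItemID TEXT, Quantity INTEGER, Date TEXT);
    INSERT INTO Inventory VALUES
        ('001', 'Leaf Table', 1200), ('002', 'Fishing Rod', 500), ('003', 'T-shirt', 800);
    INSERT INTO Sales VALUES
        ('SA001', '001', 1, '2024-06-27'),
        ('SA002', '002', 2, '2024-06-27'),
        ('SA003', '003', 3, '2024-06-28');
""")

# Price comes from Inventory, Quantity from Sales; the join brings them together
revenue_by_item = conn.execute("""
    SELECT Inventory.Name, SUM(Sales.Quantity * Inventory.Price) AS Revenue
    FROM Sales
    JOIN Inventory ON Sales.ItemID = Inventory.ItemID
    GROUP BY Inventory.Name
    ORDER BY Revenue DESC
""").fetchall()
print(revenue_by_item)
```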

## How Can Tom Nook Organize His Shop's Data More Efficiently?

After setting up his relational database, Tom Nook realizes that some of his tables have repeated information and are becoming difficult to manage. He's heard about something called "normalization" that might help. Let's explore what this means and how it can help Tom organize his shop's data more efficiently.

**Normalization** is like tidying up your room, but for databases. It's a way to organize data to reduce repetition and make it easier to manage. The goal is to structure your database so that each piece of information is stored in only one place.

### Why Normalize?

Imagine if Tom kept a notebook where he wrote down every sale, including all the customer's details each time. It might look something like this:

| Sale ID | Customer Name | Customer Island | Item Sold | Price |
|---------|---------------|-----------------|-----------|-------|
| 1 | Isabelle | Resident Services | Leaf Table | 1200 |
| 2 | Isabelle | Resident Services | Fishing Rod | 500 |
| 3 | K.K. Slider | Touring | Leaf Table | 1200 |

This table has some problems:
1. Customer information is repeated for each sale.
2. If Isabelle moves to a new island, Tom would need to update multiple rows.
3. If Tom wants to change the price of a Leaf Table, he'd need to find and update every row that includes it.

These issues are what normalization helps solve.

### How Does Normalization Work?

Normalization involves breaking down big tables into smaller, more focused tables. Let's see how Tom could normalize his sales data:

Customers Table:

| Customer ID | Name | Island |
|-------------|------|--------|
| 1 | Isabelle | Resident Services |
| 2 | K.K. Slider | Touring |

Products Table:

| Product ID | Name | Price |
|------------|------|-------|
| 1 | Leaf Table | 1200 |
| 2 | Fishing Rod | 500 |

Sales Table:

| Sale ID | Customer ID | Product ID |
|---------|-------------|------------|
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 3 | 2 | 1 |

Now, each piece of information is stored in only one place:
- Customer details are in the Customers table.
- Product details are in the Products table.
- The Sales table just links customers to products.

Some benefits of normalization include:

1. Each fact is stored in one place, saving space.
2. If Isabelle moves, Tom only needs to update one row in the Customers table.
3. With less repeated data, there's less chance of inconsistencies.
4. It's easier to add new types of data or change existing structures.
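
The normalized design can be tried out with a short Python sketch using the standard-library `sqlite3` module. It assumes an in-memory database holding the three tables shown above; the new island name is made up for the example. Isabelle's move touches a single row, and a join rebuilds the wide "notebook" view on demand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT, Island TEXT);
    CREATE TABLE Products  (ProductID INTEGER PRIMARY KEY, Name TEXT, Price INTEGER);
    CREATE TABLE Sales (
        SaleID INTEGER PRIMARY KEY,
        CustomerID INTEGER REFERENCES Customers(CustomerID),
        ProductID INTEGER REFERENCES Products(ProductID)
    );
    INSERT INTO Customers VALUES (1, 'Isabelle', 'Resident Services'), (2, 'K.K. Slider', 'Touring');
    INSERT INTO Products VALUES (1, 'Leaf Table', 1200), (2, 'Fishing Rod', 500);
    INSERT INTO Sales VALUES (1, 1, 1), (2, 1, 2), (3, 2, 1);
""")

# Isabelle moves: one row changes, no matter how many sales she has
# ('Harvest Island' is a made-up new address for this sketch)
conn.execute("UPDATE Customers SET Island = 'Harvest Island' WHERE Name = 'Isabelle'")

# A join rebuilds the wide "notebook" view whenever Tom wants it
notebook_view = conn.execute("""
    SELECT s.SaleID, c.Name, c.Island, p.Name, p.Price
    FROM Sales AS s
    JOIN Customers AS c ON s.CustomerID = c.CustomerID
    JOIN Products  AS p ON s.ProductID = p.ProductID
    ORDER BY s.SaleID
""").fetchall()
for row in notebook_view:
    print(row)
```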

While normalization is great for many situations, there are times when it might not be the best approach. As we explore concepts like non-relational databases, data lakes, and data warehouses in future sections, we'll see different ways of organizing data that sometimes purposely "denormalize" for specific benefits.

## Table: Basic SQL Commands

| **Command** | **Description** |
| --- | --- |
| **Data Definition (DDL)** |  |
| `CREATE TABLE table_name (column1 datatype, column2 datatype, ...)` | Create a new table. |
| `DROP TABLE table_name` | Delete a table. |
| `ALTER TABLE table_name ADD column_name datatype` | Add a new column to a table. |
| `ALTER TABLE table_name DROP COLUMN column_name` | Remove a column from a table. |
| **Data Manipulation (DML)** |  |
| `INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...)` | Insert new data into a table. |
| `UPDATE table_name SET column1 = value1, column2 = value2, ... WHERE condition` | Update existing data in a table. |
| `DELETE FROM table_name WHERE condition` | Delete data from a table based on a condition. |
| **Data Querying (DQL)** |  |
| `SELECT column1, column2, ... FROM table_name` | Select specific columns from a table. |
| `SELECT * FROM table_name` | Select all columns from a table. |
| `SELECT column1, aggregate_function(column2) FROM table_name GROUP BY column1` | Group data and apply an aggregate function (e.g., `SUM`, `COUNT`, `AVG`). |
| `SELECT column1, column2 FROM table_name WHERE condition` | Select data that meets a specific condition. |
| `SELECT column1, column2 FROM table_name ORDER BY column1 ASC\|DESC` | Select data and sort the results in ascending or descending order. |
| `SELECT table1.column1, table2.column2 FROM table1 JOIN table2 ON table1.common_column = table2.common_column` | Perform an inner join between two tables. |
| `SELECT table1.column1, table2.column2 FROM table1 LEFT JOIN table2 ON table1.common_column = table2.common_column` | Perform a left join between two tables. |
| `SELECT table1.column1, table2.column2 FROM table1 RIGHT JOIN table2 ON table1.common_column = table2.common_column` | Perform a right join between two tables. |
| `SELECT column1, COUNT(column2) FROM table_name GROUP BY column1` | Group data by a column and count the occurrences of another column. |
| `SELECT column1, SUM(column2) FROM table_name GROUP BY column1` | Group data by a column and calculate the sum of another column. |
| `SELECT column1, AVG(column2) FROM table_name GROUP BY column1` | Group data by a column and calculate the average of another column. |
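
The difference between an inner join and a left join in the table above is easiest to see on data where one side has no match. The following Python sketch uses the standard-library `sqlite3` module and assumes a made-up sample in which the T-shirt has not sold yet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Inventory (ItemID TEXT PRIMARY KEY, Name TEXT);
    CREATE TABLE Sales (SaleID TEXT PRIMARY KEY, ItemID TEXT, Quantity INTEGER);
    INSERT INTO Inventory VALUES ('001', 'Leaf Table'), ('002', 'Fishing Rod'), ('003', 'T-shirt');
    -- Only two of the three items have sold in this made-up sample
    INSERT INTO Sales VALUES ('SA001', '001', 1), ('SA002', '002', 2);
""")

inner_rows = conn.execute("""
    SELECT Inventory.Name, Sales.Quantity
    FROM Inventory JOIN Sales ON Inventory.ItemID = Sales.ItemID
    ORDER BY Inventory.ItemID
""").fetchall()

left_rows = conn.execute("""
    SELECT Inventory.Name, Sales.Quantity
    FROM Inventory LEFT JOIN Sales ON Inventory.ItemID = Sales.ItemID
    ORDER BY Inventory.ItemID
""").fetchall()

print(inner_rows)  # only items with at least one sale
print(left_rows)   # every item; Quantity is None where no sale matched
```

A left join is the natural choice when Tom wants a report that lists every item, including the ones nobody has bought.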

## How Can Tom Nook Handle Diverse and Rapidly Changing Data?

As Tom Nook's business expands and diversifies, he starts encountering data that doesn't fit neatly into the rows and columns of his relational database. He needs a more flexible solution to handle things like customer reviews, complex product descriptions, and rapidly changing inventory for special events. This is where **non-relational databases**, also known as **NoSQL databases**, come into play.

A **non-relational database** is a type of database that provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. These databases are designed to handle a wide variety of data models, offering flexibility that traditional relational databases can't match.

Let's explore some common types of non-relational databases and how Tom might use them:

### Document Databases

Document databases store data as flexible documents, most often in a format based on **JavaScript Object Notation (JSON)**. Each document can have a different structure, allowing for more versatility in data representation. This is particularly useful for Tom when he needs to store detailed product information that might vary significantly between different types of items.

For example, here's how a fishing rod might be represented in a document database:

```javascript
{
  "_id": "PROD001",
  "name": "Deluxe Fishing Rod",
  "price": 2500,
  "attributes": {
    "material": "carbon fiber",
    "length": "2.5m",
    "features": ["telescopic", "anti-corrosion"]
  },
  "reviews": [
    {"user": "Isabelle", "rating": 5, "comment": "Caught a sea bass on my first try!"},
    {"user": "Blathers", "rating": 4, "comment": "Quite sturdy, but a bit heavy."}
  ]
}
```

In this structure, Tom can easily add or remove fields for different products without affecting others. He could store detailed attributes for a fishing rod, while a piece of furniture might have completely different attributes. This flexibility is a key advantage over relational databases, where changing the structure of a table affects all rows.
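
A rough feel for this flexibility can be had in plain Python, where dictionaries play the role of documents. This sketch does not use a real document database; the second product and the `average_rating` helper are hypothetical and exist only to show code coping with documents of different shapes.

```python
import json

# Two product documents with different shapes, as a document store would allow
products = [
    {
        "_id": "PROD001",
        "name": "Deluxe Fishing Rod",
        "price": 2500,
        "attributes": {"material": "carbon fiber", "length": "2.5m"},
        "reviews": [{"user": "Isabelle", "rating": 5}, {"user": "Blathers", "rating": 4}],
    },
    {
        # Hypothetical furniture document: different attributes, no reviews yet
        "_id": "PROD002",
        "name": "Leaf Table",
        "price": 1200,
        "attributes": {"color": "Green", "seats": 4},
    },
]

def average_rating(doc):
    """Return the mean review rating, or None if the document has no reviews."""
    ratings = [r["rating"] for r in doc.get("reviews", [])]
    return sum(ratings) / len(ratings) if ratings else None

ratings_by_id = {doc["_id"]: average_rating(doc) for doc in products}
print(json.dumps(ratings_by_id))
```

The price of this freedom is that the reading code has to handle missing fields itself, which is why `doc.get("reviews", [])` is used instead of `doc["reviews"]`.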

### Key-Value Stores

Key-value stores are the simplest NoSQL databases. They store data as a collection of key-value pairs, where the key is a unique identifier. This model is incredibly efficient for certain types of operations.

For instance, during a flash sale, Tom might use a key-value store for real-time inventory tracking:
```
SET inventory:PROD001 100
SET inventory:PROD002 50
SET inventory:PROD003 75
```

Here, each product ID is a key, and the current inventory count is the value. Key-value stores excel at rapid reads and writes, making them a good fit for scenarios where speed is crucial. While a relational database could handle this task, a key-value store skips much of the work a relational engine does (parsing queries, checking schemas, planning joins), so simple lookups and updates are often faster, especially at scale.
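
The key-value model is simple enough to imitate with a Python dictionary. The sketch below is only a stand-in: the helper names `kv_set`, `kv_get`, and `kv_decrby` are hypothetical, and a real key-value store would add persistence, networking, and safe concurrent updates on top of the same idea.

```python
# A plain dict stands in for the key-value store in this sketch
store = {}

def kv_set(key, value):
    """Store value under key, replacing any previous value."""
    store[key] = value

def kv_get(key):
    """Return the value stored under key, or None if the key is absent."""
    return store.get(key)

def kv_decrby(key, amount):
    """Subtract amount from the integer stored at key and return the new value."""
    store[key] = store.get(key, 0) - amount
    return store[key]

kv_set("inventory:PROD001", 100)
kv_set("inventory:PROD002", 50)
kv_set("inventory:PROD003", 75)

# A customer buys three units of PROD001 during the flash sale
remaining = kv_decrby("inventory:PROD001", 3)
print(remaining, kv_get("inventory:PROD002"))
```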

### Graph Databases

Graph databases use graph structures with nodes, edges, and properties to represent and store data. They're excellent for data with complex relationships. Tom might use a graph database to analyze customer relationships or create a product recommendation system.

Here's a simple example of how relationships might be represented in a graph database:

```cypher
CREATE (tom:Person {name: 'Tom Nook'})-[:SELLS]->(rod:Product {name: 'Fishing Rod'})
CREATE (isabelle:Person {name: 'Isabelle'})-[:BOUGHT]->(rod)
CREATE (isabelle)-[:FRIENDS_WITH]->(tom)
```

This structure allows Tom to easily query complex relationships, like finding all products bought by friends of customers who purchased a specific item. While possible in a relational database, these queries can become very complex and slow as the relationships grow.
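
The "products bought by friends of customers who purchased a specific item" question can be sketched in plain Python by storing edges as triples. This is not how a graph database is implemented, only an illustration of the traversal. Isabelle's purchase and her friendship with Tom come from the Cypher example; the Blathers and Net rows are made up so that the traversal has something to find.

```python
# Edges as (source, relationship, target) triples.
# The Blathers and Net rows are invented for this sketch.
edges = [
    ("Tom Nook", "SELLS", "Fishing Rod"),
    ("Isabelle", "BOUGHT", "Fishing Rod"),
    ("Isabelle", "FRIENDS_WITH", "Tom Nook"),
    ("Isabelle", "FRIENDS_WITH", "Blathers"),
    ("Blathers", "BOUGHT", "Net"),
]

def targets(source, relationship):
    """Nodes reached from source by following one edge of the given type."""
    return {t for s, r, t in edges if s == source and r == relationship}

def sources(relationship, target):
    """Nodes that have an edge of the given type pointing at target."""
    return {s for s, r, t in edges if r == relationship and t == target}

def bought_by_friends_of_buyers(item):
    """Products bought by friends of anyone who bought item (item itself excluded)."""
    found = set()
    for buyer in sources("BOUGHT", item):
        for friend in targets(buyer, "FRIENDS_WITH"):
            found |= targets(friend, "BOUGHT")
    return found - {item}

recommendations = bought_by_friends_of_buyers("Fishing Rod")
print(recommendations)
```

Each hop in the question becomes one step along an edge, which is the operation graph databases are built to make cheap.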

### Bridging Relational and Non-Relational: JSON in SQL

Modern relational databases have evolved to incorporate some non-relational features. One common approach is the addition of JSON columns, which allow for the storage of semi-structured data within a relational table.

Let's see how Tom might use this in his existing relational database:

In [8]:
%%sql
-- Add a JSON column to the Inventory table
ALTER TABLE Inventory ADD COLUMN Details JSON;

-- Insert a product with JSON details
INSERT INTO Inventory (ItemID, Name, Type, Price, Details)
VALUES ('PROD001', 'Deluxe Fishing Rod', 'Tool', 2500,
  '{"material": "carbon fiber",
    "length": "2.5m",
    "features": ["telescopic", "anti-corrosion"],
    "reviews": [
      {"user": "Isabelle", "rating": 5, "comment": "Caught a sea bass on my first try!"},
      {"user": "Blathers", "rating": 4, "comment": "Quite sturdy, but a bit heavy."}
    ]
  }');

SELECT * FROM Inventory;

 * sqlite:///nook.db
Done.
1 rows affected.
Done.


Unnamed: 0,ItemID,Name,Type,Price,Color,Details
0,001,Leaf Table,Furniture,1200,Green,
1,002,Fishing Rod,Tool,500,Blue,
2,003,T-shirt,Clothing,800,Red,
3,PROD001,Deluxe Fishing Rod,Tool,2500,,"{""material"": ""carbon fiber"",\n ""length"": ""2..."


In [9]:
%%sql
-- Query JSON data
SELECT Name, JSON_EXTRACT(Details, '$.material') AS Material
FROM Inventory
WHERE JSON_EXTRACT(Details, '$.features[0]') = 'telescopic';

 * sqlite:///nook.db
Done.


Unnamed: 0,Name,Material
0,Deluxe Fishing Rod,carbon fiber


This approach allows Tom to combine the strengths of relational databases (like strong consistency and complex joins) with the flexibility of non-relational data storage. He can store structured data in regular columns and semi-structured or varying data in the JSON column.
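
When the database's own JSON functions are not available, or when the analysis is easier in Python, the JSON column can simply be read back and parsed with the standard `json` module. The sketch below assumes an in-memory table with a plain `TEXT` column for `Details` (a simplification of the `JSON` column above) and computes an average review rating.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Inventory (ItemID TEXT PRIMARY KEY, Name TEXT, Details TEXT)")

details = {
    "material": "carbon fiber",
    "features": ["telescopic", "anti-corrosion"],
    "reviews": [
        {"user": "Isabelle", "rating": 5},
        {"user": "Blathers", "rating": 4},
    ],
}
# json.dumps guarantees the stored string is valid JSON
conn.execute(
    "INSERT INTO Inventory VALUES (?, ?, ?)",
    ("PROD001", "Deluxe Fishing Rod", json.dumps(details)),
)
conn.execute("INSERT INTO Inventory VALUES ('001', 'Leaf Table', NULL)")

# Read the column back and parse it in Python; rows without Details are skipped
avg_rating = {}
for name, raw in conn.execute("SELECT Name, Details FROM Inventory"):
    if raw is None:
        continue
    ratings = [r["rating"] for r in json.loads(raw).get("reviews", [])]
    if ratings:
        avg_rating[name] = sum(ratings) / len(ratings)
print(avg_rating)
```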

Understanding both relational and non-relational databases is crucial in modern data science. While relational databases remain important for structured data and transactions, non-relational databases have become essential for handling the volume, velocity, and variety of big data in modern applications. As a data scientist, you'll likely encounter both types in your career, and knowing when and how to use each is a valuable skill.

## How Does Tom Nook Organize and Analyze His Growing Data Empire?

As Tom Nook's business continues to flourish across multiple islands, he finds himself dealing with an ever-increasing volume and variety of data. From daily sales transactions to customer preferences, from inventory levels to seasonal trends, Tom needs a way to store, manage, and analyze all this information effectively. This is where concepts like **data lakes**, **data warehouses**, and **data marts** come into play. While these are often structured as relational databases, they can also contain non-relational data.

### Data Lakes: The Vast Ocean of Raw Data

A **data lake** is a storage repository that holds a vast amount of raw data in its native format until it's needed. Imagine Tom has a gigantic pool where he tosses in all sorts of data as soon as it's generated - sales receipts, customer surveys, social media mentions, weather reports, anything and everything that might be useful someday.

Key characteristics of a data lake:
- Stores all types of data: structured, semi-structured, and unstructured
- Data is stored in its raw form, without transformation
- Highly scalable and flexible
- Supports big data analytics and machine learning

For example, Tom's data lake might contain:

| Data Type | Example | Format |
|-----------|---------|--------|
| Sales Transactions | Daily sales logs | CSV files |
| Customer Feedback | Customer emails | Plain text |
| App Usage | User interactions | JSON |
| Product Images | New item photos | JPEG, PNG |

Tom can dump all this diverse data into his data lake without worrying about organizing it upfront. This allows him to collect data now and figure out how to use it later.

### Data Warehouses: The Organized Archive

While a data lake is great for storing raw data, Tom needs a more structured approach for regular business reporting and analysis. This is where a **data warehouse** comes in. A data warehouse is a central repository of integrated data from one or more disparate sources, designed for query and analysis.

Key characteristics of a data warehouse:
- Stores structured, processed data
- Data is organized into schemas optimized for analysis
- Supports complex queries and business intelligence tools
- Provides a historical view of the business

Let's visualize how Tom might structure his data warehouse:

FactSales Table:

| SaleID | DateID | ProductID | CustomerID | StoreID | Quantity | Revenue |
|--------|--------|-----------|------------|---------|----------|---------|
| 1001   | 20240601 | P001    | C101       | S01     | 2        | 2400    |
| 1002   | 20240601 | P002    | C102       | S02     | 1        | 500     |
| 1003   | 20240602 | P001    | C103       | S01     | 1        | 1200    |

DimProduct Table:

| ProductID | ProductName | Category | Subcategory | Price |
|-----------|-------------|----------|-------------|-------|
| P001      | Deluxe Fishing Rod | Tool | Fishing | 1200  |
| P002      | Leaf Table  | Furniture | Table    | 500   |
| P003      | T-shirt     | Clothing | Tops      | 800   |

DimDate Table:

| DateID    | Date       | Day | Month | Year | Season |
|-----------|------------|-----|-------|------|--------|
| 20240601  | 2024-06-01 | 1   | 6     | 2024 | Summer |
| 20240602  | 2024-06-02 | 2   | 6     | 2024 | Summer |
| 20240603  | 2024-06-03 | 3   | 6     | 2024 | Summer |

This structure allows Tom to easily answer questions like "What was the total revenue for furniture items during the summer season across all island stores?"
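
As a rough sketch of how such a question is answered, the Python snippet below loads the sample fact and dimension rows into an in-memory SQLite database (an assumption; a real warehouse would run on a dedicated platform) and joins the fact table to its dimensions. Only the columns the query needs are recreated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimProduct (ProductID TEXT PRIMARY KEY, ProductName TEXT, Category TEXT);
    CREATE TABLE DimDate (DateID INTEGER PRIMARY KEY, Date TEXT, Season TEXT);
    CREATE TABLE FactSales (
        SaleID INTEGER PRIMARY KEY, DateID INTEGER, ProductID TEXT,
        Quantity INTEGER, Revenue INTEGER
    );
    INSERT INTO DimProduct VALUES
        ('P001', 'Deluxe Fishing Rod', 'Tool'),
        ('P002', 'Leaf Table', 'Furniture'),
        ('P003', 'T-shirt', 'Clothing');
    INSERT INTO DimDate VALUES
        (20240601, '2024-06-01', 'Summer'),
        (20240602, '2024-06-02', 'Summer'),
        (20240603, '2024-06-03', 'Summer');
    INSERT INTO FactSales VALUES
        (1001, 20240601, 'P001', 2, 2400),
        (1002, 20240601, 'P002', 1, 500),
        (1003, 20240602, 'P001', 1, 1200);
""")

# Filter on dimension attributes, add up a measure from the fact table
summer_furniture = conn.execute("""
    SELECT SUM(f.Revenue)
    FROM FactSales AS f
    JOIN DimProduct AS p ON f.ProductID = p.ProductID
    JOIN DimDate    AS d ON f.DateID = d.DateID
    WHERE p.Category = 'Furniture' AND d.Season = 'Summer'
""").fetchone()[0]

revenue_by_category = dict(conn.execute("""
    SELECT p.Category, SUM(f.Revenue)
    FROM FactSales AS f
    JOIN DimProduct AS p ON f.ProductID = p.ProductID
    GROUP BY p.Category
"""))
print(summer_furniture, revenue_by_category)
```

The pattern is always the same: numbers come from the fact table, while the words used to filter and group them come from the dimension tables.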

### Data Marts: Focused Subsets for Specific Needs

As Tom's business grows, different departments might need quick access to specific subsets of data. This is where **data marts** come in. A data mart is a subset of a data warehouse, oriented to a specific business line or team.

For example, here's how a data mart for Tom's marketing team might look:

Marketing Data Mart:

| CustomerID | Name | Age | Island | Category | TotalPurchases | TotalSpent |
|------------|------|-----|--------|----------|----------------|------------|
| C101 | Isabelle | 28 | Resident Services | Tool | 5 | 6000 |
| C102 | Blathers | 35 | Museum | Furniture | 3 | 1500 |
| C103 | K.K. Slider | 30 | Plaza | Clothing | 10 | 8000 |

This focused view provides the marketing team with customer information, their purchase history by category, and total spend, which they can use for customer segmentation and targeted campaign planning.
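
One lightweight way to build such a mart is as a view over the warehouse tables. The Python sketch below assumes an in-memory SQLite database, a hypothetical `DimCustomer` table, and only the three fact rows from the warehouse example, so its totals are smaller than the illustrative figures in the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimCustomer (CustomerID TEXT PRIMARY KEY, Name TEXT, Island TEXT);
    CREATE TABLE DimProduct (ProductID TEXT PRIMARY KEY, Category TEXT);
    CREATE TABLE FactSales (
        SaleID INTEGER PRIMARY KEY, ProductID TEXT, CustomerID TEXT,
        Quantity INTEGER, Revenue INTEGER
    );
    INSERT INTO DimCustomer VALUES
        ('C101', 'Isabelle', 'Resident Services'),
        ('C102', 'Blathers', 'Museum'),
        ('C103', 'K.K. Slider', 'Plaza');
    INSERT INTO DimProduct VALUES ('P001', 'Tool'), ('P002', 'Furniture');
    INSERT INTO FactSales VALUES
        (1001, 'P001', 'C101', 2, 2400),
        (1002, 'P002', 'C102', 1, 500),
        (1003, 'P001', 'C103', 1, 1200);

    -- The "mart": a pre-joined, pre-aggregated view for the marketing team
    CREATE VIEW MarketingMart AS
    SELECT c.CustomerID, c.Name, c.Island, p.Category,
           SUM(f.Quantity) AS TotalPurchases, SUM(f.Revenue) AS TotalSpent
    FROM FactSales AS f
    JOIN DimCustomer AS c ON f.CustomerID = c.CustomerID
    JOIN DimProduct  AS p ON f.ProductID = p.ProductID
    GROUP BY c.CustomerID, c.Name, c.Island, p.Category;
""")

# The marketing team queries the mart without knowing about facts or dimensions
mart_rows = conn.execute(
    "SELECT Name, Category, TotalPurchases, TotalSpent FROM MarketingMart ORDER BY CustomerID"
).fetchall()
for row in mart_rows:
    print(row)
```

In larger systems a mart is often materialized as its own table or database for speed; a view is simply the smallest version of the same idea.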

### Putting It All Together

In practice, Tom's data infrastructure might look something like this:

1. Raw data from various sources flows into the data lake
2. ETL (Extract, Transform, Load) processes clean and structure this data
3. Structured data is loaded into the data warehouse
4. Specific subsets of the warehouse data are used to create data marts

This setup allows Tom to:
- Collect and store all potentially useful data (data lake)
- Perform comprehensive business analysis (data warehouse)
- Provide tailored, efficient access to specific teams (data marts)
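
The ETL step in the middle of this pipeline can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the raw CSV text stands in for a file in the data lake, the cleaning rules are invented, and an in-memory SQLite database plays the warehouse.

```python
import csv
import io
import sqlite3

# Extract: a raw sales log as it might sit in the data lake (made-up sample)
raw_csv = """sale_id,date,product,quantity,price
1001,2024-06-01,Leaf Table,2,1200
1002,2024-06-01, fishing rod ,1,500
1003,2024-06-02,T-shirt,,800
"""
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: tidy product names, drop rows with missing quantity, compute revenue
clean = []
for r in rows:
    if not r["quantity"]:
        continue  # the incomplete record stays in the lake, not the warehouse
    quantity, price = int(r["quantity"]), int(r["price"])
    product = r["product"].strip().title()
    clean.append((int(r["sale_id"]), r["date"], product, quantity, quantity * price))

# Load: write the structured rows into a warehouse table
warehouse = sqlite3.connect(":memory:")
warehouse.execute("""
    CREATE TABLE FactSales (
        SaleID INTEGER PRIMARY KEY, Date TEXT, Product TEXT,
        Quantity INTEGER, Revenue INTEGER
    )
""")
warehouse.executemany("INSERT INTO FactSales VALUES (?, ?, ?, ?, ?)", clean)

loaded = warehouse.execute(
    "SELECT Product, Revenue FROM FactSales ORDER BY SaleID"
).fetchall()
print(loaded)
```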

Understanding these concepts is crucial in data science, as they form the backbone of how organizations manage and utilize their data assets. As a data scientist, you'll often work with data from warehouses and marts, but you might also need to dive into the data lake for more exploratory analysis or to access raw data for machine learning projects.



## How Does Tom Nook Handle Daily Transactions and Complex Analysis?

As Tom Nook's business empire grows, he finds himself dealing with two distinct types of data processing needs. On one hand, he needs to handle numerous small, fast transactions throughout the day as customers make purchases. On the other hand, he wants to analyze large amounts of historical data to make informed business decisions. These two needs are addressed by OLTP and OLAP systems respectively.

### OLTP: Keeping the Bells Ringing

**OLTP** stands for **Online Transaction Processing**. This kind of system is designed for transaction-oriented applications: it handles a large number of short online transactions, typically simple data entry and retrieval operations.

In Tom's store, the OLTP system would handle tasks like:
- Processing a customer's purchase
- Updating inventory levels
- Recording customer information

Key characteristics of OLTP:
- Handles large numbers of short, simple transactions
- Emphasizes very fast query processing
- Maintains data integrity in multi-access environments
- Focuses on day-to-day operations

Remember the relational databases we discussed earlier? OLTP systems typically use these types of databases. The structured nature of relational databases, with their tables and relationships, is perfect for the quick, consistent transactions that OLTP requires.

Let's look at how Tom's OLTP system might record a transaction in a relational database table:

| Transaction ID | Date       | Customer ID | Product ID | Quantity | Price | Total |
|----------------|------------|-------------|------------|----------|-------|-------|
| T1001          | 2024-06-27 | C101        | P001       | 1        | 1200  | 1200  |
| T1002          | 2024-06-27 | C102        | P002       | 2        | 500   | 1000  |
| T1003          | 2024-06-27 | C101        | P003       | 3        | 800   | 2400  |

Each row represents a single transaction, recorded in real time as customers make purchases. This table might be part of a larger relational database schema, with other tables for customers, products, and inventory.
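
To make the "multi-access, data integrity" idea concrete, here is a minimal sketch of one purchase handled as a single transaction, using Python's built-in `sqlite3` module. The `inventory` table, its stock levels, and the `record_sale` helper are assumptions made for illustration; they are not part of Tom's schema above. The point is that the sale row and the stock update either both succeed or both get undone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventory (
    product_id TEXT PRIMARY KEY,
    stock      INTEGER NOT NULL CHECK (stock >= 0)  -- stock can never go negative
);
CREATE TABLE transactions (
    transaction_id TEXT PRIMARY KEY,
    date           TEXT,
    customer_id    TEXT,
    product_id     TEXT REFERENCES inventory(product_id),
    quantity       INTEGER,
    price          INTEGER,
    total          INTEGER
);
-- Hypothetical starting stock
INSERT INTO inventory VALUES ('P001', 5), ('P002', 1);
""")


def record_sale(conn, transaction_id, date, customer_id, product_id, quantity, price):
    """Record a sale and reduce stock as one all-or-nothing unit of work."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?, ?)",
                (transaction_id, date, customer_id, product_id,
                 quantity, price, quantity * price),
            )
            conn.execute(
                "UPDATE inventory SET stock = stock - ? WHERE product_id = ?",
                (quantity, product_id),
            )
    except sqlite3.IntegrityError:
        return False  # e.g. not enough stock; the INSERT above was rolled back
    return True


first_ok = record_sale(conn, "T1001", "2024-06-27", "C101", "P001", 1, 1200)
second_ok = record_sale(conn, "T1002", "2024-06-27", "C102", "P002", 2, 500)
print(first_ok, second_ok)
```

In this sketch the second sale asks for two units of a product with only one in stock, so the `CHECK` constraint rejects the update and the whole transaction is rolled back, leaving no orphaned sale row behind.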

### OLAP: Mining the Data for Insights

**OLAP** stands for **Online Analytical Processing**. This system is designed to quickly answer multi-dimensional analytical queries. It's used for complex calculations, trend analysis, and sophisticated data modeling.

Tom might use an OLAP system to:
- Analyze sales trends over time
- Compare performance across different island locations
- Understand customer buying patterns

Key characteristics of OLAP:
- Handles large volumes of data
- Performs complex queries across multiple dimensions
- Optimized for read-heavy queries that scan and summarize many rows
- Focused on business intelligence and decision support

Remember the data warehouses we talked about earlier? OLAP systems typically operate on data warehouses. The structured, historical data in a data warehouse is ideal for the complex analyses that OLAP performs.

Let's visualize how Tom's OLAP system might organize data for analysis in a data warehouse:

Sales Cube:

| Product   | Location   | Time    | Sales |
|-----------|------------|---------|-------|
| Furniture | Island A   | Q1 2024 | 50000 |
| Furniture | Island A   | Q2 2024 | 60000 |
| Furniture | Island B   | Q1 2024 | 45000 |
| Furniture | Island B   | Q2 2024 | 55000 |
| Clothing  | Island A   | Q1 2024 | 30000 |
| Clothing  | Island A   | Q2 2024 | 35000 |
| Clothing  | Island B   | Q1 2024 | 25000 |
| Clothing  | Island B   | Q2 2024 | 30000 |

This multi-dimensional view, often called a "cube" in OLAP terminology, allows Tom to easily compare sales across different products, locations, and time periods. It's similar to the fact and dimension tables we saw in the data warehouse section, but optimized for quick analysis across multiple dimensions.
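
As a rough illustration of "comparing across dimensions," the sketch below loads the Sales Cube rows into an in-memory SQLite table and totals sales along one dimension at a time (often called a roll-up). The table name `sales_cube`, the column name `period`, and the `roll_up` helper are assumptions for this example; a real OLAP engine would offer far richer operations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_cube (product TEXT, location TEXT, period TEXT, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO sales_cube VALUES (?, ?, ?, ?)",
    [
        ("Furniture", "Island A", "Q1 2024", 50000),
        ("Furniture", "Island A", "Q2 2024", 60000),
        ("Furniture", "Island B", "Q1 2024", 45000),
        ("Furniture", "Island B", "Q2 2024", 55000),
        ("Clothing", "Island A", "Q1 2024", 30000),
        ("Clothing", "Island A", "Q2 2024", 35000),
        ("Clothing", "Island B", "Q1 2024", 25000),
        ("Clothing", "Island B", "Q2 2024", 30000),
    ],
)

DIMENSIONS = {"product", "location", "period"}


def roll_up(conn, dimension):
    """Return total sales grouped by a single dimension of the cube."""
    if dimension not in DIMENSIONS:  # whitelist, since the name is put into the SQL
        raise ValueError(f"unknown dimension: {dimension}")
    rows = conn.execute(
        f"SELECT {dimension}, SUM(sales) FROM sales_cube "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    ).fetchall()
    return dict(rows)


by_product = roll_up(conn, "product")
by_location = roll_up(conn, "location")
by_period = roll_up(conn, "period")
print(by_product)
print(by_location)
print(by_period)
```

The same eight rows answer three different questions depending on which dimension Tom groups by: furniture outsells clothing (210,000 vs. 120,000 Bells), Island A outsells Island B, and Q2 beats Q1.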

### Comparing OLTP and OLAP

To better understand the differences, let's compare OLTP and OLAP side by side:

| Characteristic | OLTP                          | OLAP                           |
|----------------|-------------------------------|--------------------------------|
| Purpose        | Day-to-day transactions       | Complex analysis and reporting |
| Data Source    | Operational databases         | Data warehouses                |
| Database Design| Normalized (like we saw in relational DB section) | Denormalized (like our data warehouse examples) |
| Data View      | Current, detailed             | Historical, summarized         |
| Users          | Large number of end users     | Analysts, managers             |
| Query Type     | Simple transactions           | Complex queries                |
| Response Time  | Milliseconds                  | Seconds to minutes             |

### Putting It All Together

Now, let's see how all the concepts we've learned fit together in Tom's data ecosystem:

1. The OLTP system, built on a relational database, handles daily transactions in the store, ensuring smooth operations.
2. Data from the OLTP system is periodically extracted, transformed, and loaded (ETL) into a data warehouse.
3. The data warehouse might also incorporate data from other sources, like the data lake we discussed earlier, which could include unstructured data like customer reviews or social media mentions.
4. The OLAP system uses this warehoused data to perform complex analyses.
5. For specific department needs, data marts (remember those?) might be created from the data warehouse.
6. Tom uses insights from the OLAP system to make informed business decisions, which in turn affect the day-to-day operations managed by the OLTP system.
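
Step 2 of this flow, the ETL job, can be sketched in a few lines. The example below uses two in-memory SQLite databases to stand in for the OLTP system and the warehouse; the `daily_customer_sales` table and the `run_etl` function are hypothetical names chosen for illustration, and a production pipeline would typically use dedicated tooling rather than a hand-written loop.

```python
import sqlite3

# Stand-in for the OLTP database, holding the transactions shown earlier
oltp = sqlite3.connect(":memory:")
oltp.execute(
    "CREATE TABLE transactions (transaction_id TEXT, date TEXT, customer_id TEXT, "
    "product_id TEXT, quantity INTEGER, price INTEGER, total INTEGER)"
)
oltp.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("T1001", "2024-06-27", "C101", "P001", 1, 1200, 1200),
        ("T1002", "2024-06-27", "C102", "P002", 2, 500, 1000),
        ("T1003", "2024-06-27", "C101", "P003", 3, 800, 2400),
    ],
)

# Stand-in for the data warehouse (hypothetical summary table)
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE daily_customer_sales "
    "(date TEXT, customer_id TEXT, items_sold INTEGER, revenue INTEGER)"
)


def run_etl(oltp, warehouse):
    """Copy a per-day, per-customer summary from the OLTP system into the warehouse."""
    # Extract: read only the columns the analysis needs
    rows = oltp.execute(
        "SELECT date, customer_id, quantity, total FROM transactions"
    ).fetchall()

    # Transform: summarize individual transactions per (date, customer)
    summary = {}
    for date, customer_id, quantity, total in rows:
        items, revenue = summary.get((date, customer_id), (0, 0))
        summary[(date, customer_id)] = (items + quantity, revenue + total)

    # Load: a full refresh, so running the job twice does not duplicate rows
    with warehouse:
        warehouse.execute("DELETE FROM daily_customer_sales")
        warehouse.executemany(
            "INSERT INTO daily_customer_sales VALUES (?, ?, ?, ?)",
            [(d, c, i, r) for (d, c), (i, r) in sorted(summary.items())],
        )
    return len(summary)


rows_loaded = run_etl(oltp, warehouse)
loaded = warehouse.execute(
    "SELECT * FROM daily_customer_sales ORDER BY customer_id"
).fetchall()
print(rows_loaded, loaded)
```

Notice that the warehouse ends up with fewer, more summarized rows than the OLTP system: customer C101's two purchases become a single row with 4 items and 3,600 Bells of revenue.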

As a data scientist, you'll likely interact more with OLAP systems for your analyses, but understanding both OLTP and OLAP is crucial. The data you analyze in OLAP systems often originates from OLTP systems, and understanding this flow helps in data validation, troubleshooting, and designing effective data pipelines.

Remember, while OLTP keeps Tom's business running smoothly day-to-day, OLAP helps him understand trends, make predictions, and strategize for the future. Both are essential for the success of his island empire, and they rely on the database concepts we've been learning throughout this chapter!



## How Does Tom Nook Organize His Data Warehouse for Efficient Analysis?

As Tom Nook's business continues to grow, he finds that his data warehouse needs a specific structure to support fast and efficient analysis. This is where the **Star Schema** comes into play, a fundamental concept in data warehouse design that builds upon the ideas of relational databases and OLAP systems we've discussed earlier.

### What is a Star Schema?

A **Star Schema** is a relational database design used in data warehousing to optimize database systems for querying large data sets. It's called a "star" schema because the diagram of this structure resembles a star, with a central table (the fact table) connected to multiple surrounding tables (dimension tables).

Let's break it down using Tom's shop, Nook's Cranny, as an example:

1.  **Fact Table**. This is the central table in a star schema. It contains the measures or metrics of the business process. In Tom's case, this might be sales transactions.
2.  **Dimension Tables**. These are the tables that surround the fact table. They contain descriptive attributes used to filter, group, or label the facts.

Here's how Tom's Star Schema for sales analysis might look:

In [10]:
import base64
from IPython.display import Image, display, HTML

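def mm(graph):
    """Render a Mermaid diagram inline by sending its definition to mermaid.ink."""
    # mermaid.ink expects the diagram source base64-encoded in the URL path
    graphbytes = graph.encode("utf8")
    base64_bytes = base64.b64encode(graphbytes)
    base64_string = base64_bytes.decode("ascii")
    # Requires internet access: the image is fetched from the mermaid.ink service
    display(Image(url="https://mermaid.ink/img/" + base64_string))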


mm("""
erDiagram
    FactSales }o--|| DimDate : "DateID"
    FactSales }o--|| DimProduct : "ProductID"
    FactSales }o--|| DimCustomer : "CustomerID"
    FactSales }o--|| DimStore : "StoreID"

    FactSales {
        int SaleID PK
        int DateID FK
        int ProductID FK
        int CustomerID FK
        int StoreID FK
        int Quantity
        float TotalAmount
    }
    DimDate {
        int DateID PK
        string Date
        int Day
        int Month
        int Year
        string Season
    }
    DimProduct {
        int ProductID PK
        string ProductName
        string Category
        string Subcategory
        float Price
    }
    DimCustomer {
        int CustomerID PK
        string CustomerName
        string Island
        date JoinDate
    }
    DimStore {
        int StoreID PK
        string StoreName
        string Island
        string Manager
    }


""")

In this schema:

-   FactSales is the fact table, containing the actual sales data.
-   DimDate, DimProduct, DimCustomer, and DimStore are dimension tables, providing context to the sales data.
-   PK stands for Primary Key, a unique identifier for each row in a table.
-   FK stands for Foreign Key, a field that links to the primary key in another table.

Benefits of the Star Schema include:

1.  *Simplicity*. The star schema is intuitive and easy to understand, making it simpler for analysts to write queries.
2.  *Query Performance*. Because each dimension is stored as a single denormalized table, a query needs at most one join per dimension to reach any attribute. Fewer joins generally means faster queries, which is crucial for the quick response times needed in OLAP systems.
3.  *Consistency*. Dimension tables serve as a single source of truth for attributes, ensuring consistency across analyses.

### Connecting the Dots

The Star Schema ties together several concepts we've discussed:

-   It uses the principles of **relational databases** that we learned about earlier, with tables and relationships between them.
-   It's optimized for the kind of analytical queries performed in **OLAP** systems, allowing for fast aggregations across multiple dimensions.
-   It forms the backbone of a **data warehouse**, providing a structured way to store data from various operational (**OLTP**) systems.
-   The fact table is typically loaded from transactional systems (one row per sale, or a summary of several), while dimension tables might incorporate data from various sources, including the **data lake**.

By using a Star Schema in his data warehouse, Tom can efficiently analyze his business from multiple angles. He can quickly answer questions like:

-   Which products sell best in each season?
-   How do sales vary across different islands?
-   Who are the top customers, and what do they buy?

These insights help Tom make data-driven decisions to grow Nook's Cranny and keep his customers happy across all islands.

### Example: Creating and Querying a Data Warehouse
Let's take a look at how Tom Nook could create and query his data warehouse. First, we will create and populate the tables:

In [14]:
%%sql
DROP TABLE IF EXISTS FactSales;
DROP TABLE IF EXISTS DimDate;
DROP TABLE IF EXISTS DimProduct;
DROP TABLE IF EXISTS DimCustomer;
DROP TABLE IF EXISTS DimStore;

-- Create Dimension Tables

CREATE TABLE DimDate (
    DateID INT PRIMARY KEY,
    Date DATE,
    Day INT,
    Month INT,
    Year INT,
    Season VARCHAR(10)
);

CREATE TABLE DimProduct (
    ProductID INT PRIMARY KEY,
    ProductName VARCHAR(50),
    Category VARCHAR(50),
    Subcategory VARCHAR(50),
    Price FLOAT
);

CREATE TABLE DimCustomer (
    CustomerID INT PRIMARY KEY,
    CustomerName VARCHAR(50),
    Island VARCHAR(50),
    JoinDate DATE
);

CREATE TABLE DimStore (
    StoreID INT PRIMARY KEY,
    StoreName VARCHAR(50),
    Island VARCHAR(50),
    Manager VARCHAR(50)
);

-- Create Fact Table

CREATE TABLE FactSales (
    SaleID INT PRIMARY KEY,
    DateID INT,
    ProductID INT,
    CustomerID INT,
    StoreID INT,
    Quantity INT,
    TotalAmount FLOAT,
    FOREIGN KEY (DateID) REFERENCES DimDate(DateID),
    FOREIGN KEY (ProductID) REFERENCES DimProduct(ProductID),
    FOREIGN KEY (CustomerID) REFERENCES DimCustomer(CustomerID),
    FOREIGN KEY (StoreID) REFERENCES DimStore(StoreID)
);

-- Populate Dimension Tables

-- DimDate
INSERT INTO DimDate (DateID, Date, Day, Month, Year, Season) VALUES
(1, '2024-07-01', 1, 7, 2024, 'Summer'),
(2, '2024-07-02', 2, 7, 2024, 'Summer'),
(3, '2024-07-03', 3, 7, 2024, 'Summer');

-- DimProduct
INSERT INTO DimProduct (ProductID, ProductName, Category, Subcategory, Price) VALUES
(1, 'Fishing Rod', 'Tools', 'Fishing', 500),
(2, 'Bug Net', 'Tools', 'Bug Catching', 400),
(3, 'Shovel', 'Tools', 'Digging', 600);

-- DimCustomer
INSERT INTO DimCustomer (CustomerID, CustomerName, Island, JoinDate) VALUES
(1, 'Villager A', 'Island 1', '2023-05-01'),
(2, 'Villager B', 'Island 2', '2023-06-15'),
(3, 'Villager C', 'Island 3', '2023-07-20');

-- DimStore
INSERT INTO DimStore (StoreID, StoreName, Island, Manager) VALUES
(1, 'Nook''s Cranny', 'Island 1', 'Tom Nook'),
(2, 'Able Sisters', 'Island 2', 'Mabel'),
(3, 'Museum', 'Island 3', 'Blathers');

-- Populate Fact Table

INSERT INTO FactSales (SaleID, DateID, ProductID, CustomerID, StoreID, Quantity, TotalAmount) VALUES
(1, 1, 1, 1, 1, 2, 1000),
(2, 2, 2, 2, 2, 1, 400),
(3, 3, 3, 3, 3, 3, 1800);


 * sqlite:///nook.db
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
Done.
3 rows affected.
3 rows affected.
3 rows affected.
3 rows affected.
3 rows affected.
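
Once the tables are populated, Tom can start asking questions. The sketch below rebuilds a trimmed copy of the schema in an in-memory SQLite database using Python's built-in `sqlite3` module (an assumption made so the example runs on its own; `DimCustomer` is left out because these two queries don't need it). It then answers two of the questions listed earlier: which products sell best in each season, and how sales vary across islands.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimDate (DateID INT PRIMARY KEY, Date DATE, Day INT, Month INT,
                      Year INT, Season VARCHAR(10));
CREATE TABLE DimProduct (ProductID INT PRIMARY KEY, ProductName VARCHAR(50),
                         Category VARCHAR(50), Subcategory VARCHAR(50), Price FLOAT);
CREATE TABLE DimStore (StoreID INT PRIMARY KEY, StoreName VARCHAR(50),
                       Island VARCHAR(50), Manager VARCHAR(50));
CREATE TABLE FactSales (SaleID INT PRIMARY KEY, DateID INT, ProductID INT,
                        CustomerID INT, StoreID INT, Quantity INT, TotalAmount FLOAT);

INSERT INTO DimDate VALUES
    (1, '2024-07-01', 1, 7, 2024, 'Summer'),
    (2, '2024-07-02', 2, 7, 2024, 'Summer'),
    (3, '2024-07-03', 3, 7, 2024, 'Summer');
INSERT INTO DimProduct VALUES
    (1, 'Fishing Rod', 'Tools', 'Fishing', 500),
    (2, 'Bug Net', 'Tools', 'Bug Catching', 400),
    (3, 'Shovel', 'Tools', 'Digging', 600);
INSERT INTO DimStore VALUES
    (1, 'Nook''s Cranny', 'Island 1', 'Tom Nook'),
    (2, 'Able Sisters', 'Island 2', 'Mabel'),
    (3, 'Museum', 'Island 3', 'Blathers');
INSERT INTO FactSales VALUES
    (1, 1, 1, 1, 1, 2, 1000),
    (2, 2, 2, 2, 2, 1, 400),
    (3, 3, 3, 3, 3, 3, 1800);
""")

# Which products sell best in each season? One join per dimension used.
sales_by_season_and_product = conn.execute("""
    SELECT d.Season, p.ProductName,
           SUM(f.Quantity)    AS UnitsSold,
           SUM(f.TotalAmount) AS Revenue
    FROM FactSales AS f
    JOIN DimDate    AS d ON f.DateID = d.DateID
    JOIN DimProduct AS p ON f.ProductID = p.ProductID
    GROUP BY d.Season, p.ProductName
    ORDER BY Revenue DESC
""").fetchall()

# How do sales vary across different islands?
sales_by_island = conn.execute("""
    SELECT s.Island, SUM(f.TotalAmount) AS Revenue
    FROM FactSales AS f
    JOIN DimStore AS s ON f.StoreID = s.StoreID
    GROUP BY s.Island
    ORDER BY s.Island
""").fetchall()

for row in sales_by_season_and_product:
    print(row)
for row in sales_by_island:
    print(row)
```

Both queries follow the same pattern: start from the fact table, join only the dimensions the question mentions, then group by the descriptive columns and aggregate the measures.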


## How Can Tom Nook Further Normalize His Data Warehouse?

As Tom Nook's business continues to expand and become more complex, he finds that some of his dimension tables in the Star Schema are becoming quite large and contain hierarchical data. To address this, Tom's data team introduces him to the concept of the **Snowflake Schema**.

### What is a Snowflake Schema?

A **Snowflake Schema** is a variation of the Star Schema where dimension tables are normalized into multiple related tables. This creates a structure that looks like a snowflake, hence the name. While a Star Schema has all the attributes of a dimension in a single table, a Snowflake Schema divides these attributes into separate tables.

Let's see how this might apply to the Nook's Cranny data warehouse:

1. **Fact Table**: This remains the same as in the Star Schema – it's still the central table containing the measures or metrics of the business process (in our case, sales transactions).

2. **Dimension Tables**: These are now broken down into multiple related tables, creating a hierarchy. For example, the Product dimension might be split into Product, Category, and Subcategory tables.

Here's how Tom's Snowflake Schema for sales analysis might look:




In [12]:
mm("""
erDiagram
    FactSales {
        int SaleID PK
        int DateID FK
        int ProductID FK
        int CustomerID FK
        int StoreID FK
        int Quantity
        float TotalAmount
    }
    DimDate {
        int DateID PK
        string Date
        int Day
        int Month
        int Year
        string Season
    }
    DimProduct {
        int ProductID PK
        string ProductName
        float Price
        int SubcategoryID FK
    }
    DimCustomer {
        int CustomerID PK
        string CustomerName
        string Island
        date JoinDate
    }
    DimStore {
        int StoreID PK
        string StoreName
        string Manager
        int IslandID FK
    }
    DimSubcategory {
        int SubcategoryID PK
        string SubcategoryName
        int CategoryID FK
    }
    DimCategory {
        int CategoryID PK
        string CategoryName
    }
    DimIsland {
        int IslandID PK
        string IslandName
    }

    FactSales }o--|| DimDate : "DateID"
    FactSales }o--|| DimProduct : "ProductID"
    FactSales }o--|| DimCustomer : "CustomerID"
    FactSales }o--|| DimStore : "StoreID"
    DimProduct }o--|| DimSubcategory : "SubcategoryID"
    DimSubcategory }o--|| DimCategory : "CategoryID"
    DimStore }o--|| DimIsland : "IslandID"

""")

In this schema:
- FactSales remains the central fact table, just like in the Star Schema.
- The Product dimension is now split into three tables: DimProduct, DimCategory, and DimSubcategory.
- The Store dimension is split into DimStore and DimIsland.
- Date and Customer dimensions remain denormalized for simplicity and performance.

Tom might choose to use a Snowflake Schema when:
- He has complex hierarchies in his dimension data (like product categories and subcategories).
- Data storage is a significant concern, and he needs to minimize redundancy.
- The dimension tables are very large and contain many attributes.

However, Tom needs to balance these benefits against the cost: reaching an attribute such as a product's category now takes extra joins, which can slow down common queries and makes them longer to write.
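
The extra joins are easiest to see in a query. The following sketch builds just the product branch of the snowflake in an in-memory SQLite database (a trimmed `FactSales` plus `DimProduct`, `DimSubcategory`, and `DimCategory`, filled with the same sample products as before; the ID values for categories and subcategories are assumptions) and reports revenue by category and subcategory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimCategory (CategoryID INT PRIMARY KEY, CategoryName VARCHAR(50));
CREATE TABLE DimSubcategory (
    SubcategoryID INT PRIMARY KEY,
    SubcategoryName VARCHAR(50),
    CategoryID INT REFERENCES DimCategory(CategoryID)
);
CREATE TABLE DimProduct (
    ProductID INT PRIMARY KEY,
    ProductName VARCHAR(50),
    Price FLOAT,
    SubcategoryID INT REFERENCES DimSubcategory(SubcategoryID)
);
CREATE TABLE FactSales (
    SaleID INT PRIMARY KEY,
    ProductID INT REFERENCES DimProduct(ProductID),
    Quantity INT,
    TotalAmount FLOAT
);

INSERT INTO DimCategory VALUES (1, 'Tools');
INSERT INTO DimSubcategory VALUES (1, 'Fishing', 1), (2, 'Bug Catching', 1), (3, 'Digging', 1);
INSERT INTO DimProduct VALUES (1, 'Fishing Rod', 500, 1), (2, 'Bug Net', 400, 2), (3, 'Shovel', 600, 3);
INSERT INTO FactSales VALUES (1, 1, 2, 1000), (2, 2, 1, 400), (3, 3, 3, 1800);
""")

# In the star schema, Category and Subcategory were columns of DimProduct, so
# one join was enough. Here the same question needs three joins.
revenue_by_subcategory = conn.execute("""
    SELECT c.CategoryName, sc.SubcategoryName, SUM(f.TotalAmount) AS Revenue
    FROM FactSales AS f
    JOIN DimProduct     AS p  ON f.ProductID = p.ProductID
    JOIN DimSubcategory AS sc ON p.SubcategoryID = sc.SubcategoryID
    JOIN DimCategory    AS c  ON sc.CategoryID = c.CategoryID
    GROUP BY c.CategoryName, sc.SubcategoryName
    ORDER BY sc.SubcategoryName
""").fetchall()

for row in revenue_by_subcategory:
    print(row)
```

The upside shows up in the data itself: the category name 'Tools' is stored once in `DimCategory` instead of being repeated on every product row.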

### Connecting the Dots

The Snowflake Schema builds upon the concepts we've discussed earlier:
- It's an extension of the **Star Schema**, providing more normalization.
- Like the Star Schema, it's used in **data warehouses** and optimized for **OLAP** systems.
- It applies **normalization** principles from relational database design to dimension tables.

By using a Snowflake Schema, Tom can handle more complex hierarchies in his data, potentially save on storage, and maintain data consistency more easily. However, he needs to be aware of the potential performance trade-offs compared to a Star Schema.

As a data scientist, you might encounter both Star and Snowflake Schemas in different data warehouses. Understanding the trade-offs between them will help you optimize your queries and choose the right schema for different analytical needs.



## How Does Tom Nook Remember What Happened in the Past?

Imagine you're playing Animal Crossing, and you notice that the price of a Leaf Table at Nook's Cranny has changed. You might wonder, "Has it always been this price? When did it change?" This is where the concept of keeping track of both current and historical information becomes important, not just in video games, but in real-world businesses too!

Tom Nook, being the savvy businessman he is, understands that remembering the past can help him make better decisions for the future. By keeping track of how things change over time, he can spot trends, learn from history, and even answer "What if" questions about his business.

### Introducing Slowly Changing Dimensions

In the world of data, we often deal with information that changes slowly over time. We call these "Slowly Changing Dimensions" or SCDs. For Tom, things like product prices, customer addresses, or even the characteristics of items he sells might change occasionally, but not every day. These are his slowly changing dimensions.

Let's look at how Tom might handle these changes in his data. We'll explore a simple method called "Keeping a History Log," which is similar to what data scientists call a Type 2 SCD.

Here's what Tom's product table might look like initially:

| ProductID | ProductName | Price | DateChanged |
|-----------|-------------|-------|-------------|
| 1         | Leaf Table  | 1200  | 2024-01-01  |
| 2         | Fishing Rod | 500   | 2024-01-01  |

Now, let's say on July 1, 2024, Tom decides to change the price of the Leaf Table. Instead of just updating the price, he adds a new row to keep the history:

| ProductID | ProductName | Price | DateChanged |
|-----------|-------------|-------|-------------|
| 1         | Leaf Table  | 1200  | 2024-01-01  |
| 2         | Fishing Rod | 500   | 2024-01-01  |
| 1         | Leaf Table  | 1500  | 2024-07-01  |

By adding a new row instead of overwriting the old one, Tom creates a historical record of how the price has changed. This approach allows him to see both the current price (the most recent entry) and the price history for each product.

### The Power of Historical Data

Keeping track of these changes gives Tom some super data powers. He can now see how the price of the Leaf Table has changed over time. If a customer asks, "What was the price of a Leaf Table on March 15, 2024?" Tom can easily find out. He might even notice patterns, such as prices tending to rise in the summer, and think about why that happens.
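
Here is one way Tom's "price on a given date" lookup could be written, as a minimal sketch using Python's built-in `sqlite3` module. The table name `ProductHistory` and the helper `price_on` are assumptions for this example. The idea is to take the most recent row whose `DateChanged` is on or before the date in question; storing dates as `YYYY-MM-DD` text lets them be compared as plain strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ProductHistory "
    "(ProductID INTEGER, ProductName TEXT, Price INTEGER, DateChanged TEXT)"
)
conn.executemany(
    "INSERT INTO ProductHistory VALUES (?, ?, ?, ?)",
    [
        (1, "Leaf Table", 1200, "2024-01-01"),
        (2, "Fishing Rod", 500, "2024-01-01"),
        (1, "Leaf Table", 1500, "2024-07-01"),  # new row added, old row kept
    ],
)


def price_on(conn, product_id, as_of):
    """Return the price in effect on the given date, or None if none is recorded."""
    row = conn.execute(
        """
        SELECT Price
        FROM ProductHistory
        WHERE ProductID = ? AND DateChanged <= ?
        ORDER BY DateChanged DESC
        LIMIT 1
        """,
        (product_id, as_of),
    ).fetchone()
    return row[0] if row else None


march_price = price_on(conn, 1, "2024-03-15")
july_price = price_on(conn, 1, "2024-07-01")
before_history = price_on(conn, 1, "2023-12-31")
print(march_price, july_price, before_history)
```

For March 15, 2024 the lookup finds the January row (1200 Bells); from July 1 onward it finds the new row (1500 Bells); and for a date before any recorded history it returns `None`.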

Let's look at another example. Tom wants to keep track of where his customers live, as they sometimes move between islands. Here's how he might track this:

| CustomerID | CustomerName | Island    | DateChanged |
|------------|--------------|-----------|-------------|
| 1          | Isabelle     | Resident  | 2024-01-01  |
| 2          | Tom Nook     | Resident  | 2024-01-01  |
| 3          | K.K. Slider  | Touring   | 2024-01-01  |
| 1          | Isabelle     | Vacation  | 2024-06-15  |
| 3          | K.K. Slider  | Resident  | 2024-07-01  |

From this table, Tom can see a story unfolding. Isabelle moved from Resident Island to Vacation Island on June 15, 2024. K.K. Slider, who was initially on tour, settled down on Resident Island on July 1, 2024. Meanwhile, Tom Nook hasn't moved at all.

### Challenges of Time Travel (in Data)

While keeping historical data is incredibly useful, it does come with some challenges. Tom needs more storage space to keep all this extra information. When he wants to find current information, he needs to look for the most recent entry for each product or customer, which is a bit more work. He also has to decide what information is important enough to keep a history of – after all, he can't record every tiny detail!
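
The "bit more work" needed to find current information can be sketched as a query that keeps only each customer's latest row. The example below loads the customer history from the table above into an in-memory SQLite database; the table name `CustomerHistory` is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CustomerHistory "
    "(CustomerID INTEGER, CustomerName TEXT, Island TEXT, DateChanged TEXT)"
)
conn.executemany(
    "INSERT INTO CustomerHistory VALUES (?, ?, ?, ?)",
    [
        (1, "Isabelle", "Resident", "2024-01-01"),
        (2, "Tom Nook", "Resident", "2024-01-01"),
        (3, "K.K. Slider", "Touring", "2024-01-01"),
        (1, "Isabelle", "Vacation", "2024-06-15"),
        (3, "K.K. Slider", "Resident", "2024-07-01"),
    ],
)

# Find each customer's most recent change date, then join back to the
# history table to pick up the matching row.
current_locations = conn.execute("""
    SELECT h.CustomerID, h.CustomerName, h.Island
    FROM CustomerHistory AS h
    JOIN (
        SELECT CustomerID, MAX(DateChanged) AS LatestChange
        FROM CustomerHistory
        GROUP BY CustomerID
    ) AS latest
      ON h.CustomerID = latest.CustomerID
     AND h.DateChanged = latest.LatestChange
    ORDER BY h.CustomerID
""").fetchall()

history_rows = conn.execute("SELECT COUNT(*) FROM CustomerHistory").fetchone()[0]
print(history_rows, current_locations)
```

Five history rows collapse to three current ones, which also illustrates the storage cost: the table grows with every change, not just with every new customer. Many Type 2 designs add an "is current" flag or an end date to each row so this lookup becomes a simple filter, at the price of updating the old row whenever a new one is added.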



### Connecting the Time Streams

This idea of tracking changes over time isn't just a standalone concept. It weaves together many of the data ideas we've explored so far in our journey. Let's connect the dots:

- **Relational Databases**: Remember those tables we talked about? Tracking historical data builds on that idea, but adds a time dimension. It's like having a table that remembers its own past!
- **OLAP Systems**: When we discussed Online Analytical Processing, we talked about analyzing data from different angles. Adding historical tracking gives us a whole new dimension to explore – time itself!
- **Data Warehouses**: By incorporating historical data, Tom's data warehouse transforms from a simple snapshot of the present into a rich, historical narrative of his business. It's like turning a photo album into a movie!
- **Star and Snowflake Schemas**: These designs for organizing data can be adapted to include slowly changing dimensions, allowing for efficient storage and retrieval of historical information.

By keeping track of how things change, Tom gains a deeper understanding of his business. He can see how prices, customer behavior, and product popularity have evolved over time. This knowledge helps him make smarter decisions about what to sell, how to price items, and how to keep his customers happy.

For aspiring data scientists, understanding how to track and use historical data is crucial. It allows you to tell more interesting stories with data, understand how things have changed over time, and make better predictions about what might happen in the future. It's like being a time traveler, but instead of a time machine, you use data to visit the past and glimpse the future!

Remember, in the world of data, the past isn't just history – it's a valuable resource that can help shape the future. Whether you're managing a virtual store like Tom Nook or analyzing real-world data, the ability to track and understand changes over time is a superpower in the data science world!



### Key Points Review

- Data organization evolves with business growth, from simple lists to complex data warehouses
- Relational databases use tables with defined relationships to store structured data efficiently
- Data normalization reduces redundancy and improves data integrity in relational databases
- Non-relational databases offer flexibility for handling diverse and rapidly changing data
- Data lakes store raw data, data warehouses store processed data for analysis, and data marts focus on specific business areas
- OLTP systems handle day-to-day transactions, while OLAP systems support complex analytical queries
- Star and Snowflake schemas optimize data warehouse design for analytical processing
- Slowly Changing Dimensions allow businesses to track historical changes in their data
- Choosing the right data organization strategy depends on specific business needs and data characteristics
- Effective data organization is fundamental to deriving meaningful insights and supporting data-driven decision making


### Review With Quizlet

In [13]:
%%html
<iframe src="https://quizlet.com/927144641/learn/embed?i=psvlh&x=1jj1" height="600" width="100%" style="border:0"></iframe>

## Glossary

| Term | Definition |
|------|------------|
| Data science | An interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured information. It combines expertise from various fields such as statistics, mathematics, computer science, and domain knowledge to analyze complex sets of information and solve real-world problems. |
| Data | Raw facts, figures, or information that can be processed or analyzed to gain knowledge or make decisions. This information can exist in various forms, such as numbers, text, images, or audio, and can be structured or unstructured. |
| Data schema | A blueprint or structure that defines how data is organized within a database or data warehouse. It specifies the attributes, types, relationships, and constraints of the stored data, ensuring consistency and facilitating efficient querying and analysis. |
| Dimension | A perspective or attribute used to analyze data in multidimensional databases or data warehouses. Dimensions are typically categorical or discrete variables that provide context for numerical measures, allowing for more detailed and nuanced analysis. |
| Structured Query Language (SQL) | A standardized programming language used for managing and manipulating relational databases. It allows users to create, read, update, and delete information, as well as define and modify database structures and perform complex queries to extract specific details. |
| Relational database | A type of database that organizes data into tables with predefined relationships between them. Its structure allows data to be accessed or reassembled in many different ways while maintaining integrity and avoiding redundancy. |
| Table (relation) | A collection of related entries organized into rows (records) and columns (fields) in a relational database. Each table represents a specific entity or concept and follows a defined structure that specifies the format and constraints of the data it contains. |
| Primary key | A unique identifier for each record in a database table. It ensures that each row can be uniquely identified and accessed, and it helps maintain integrity by preventing duplicate or null entries for the key field. |
| Foreign Key | A field in one table that refers to the primary key of another table. It establishes and enforces a link between the two tables, maintaining referential integrity and allowing for efficient querying across related data. |
| Data integrity | The accuracy, consistency, and reliability of data throughout its lifecycle. It ensures that data remains intact and unaltered during operations such as transfer, storage, and retrieval, and includes concepts like referential integrity, entity integrity, and domain integrity. |
| Data normalization | The process of organizing data in a relational database to reduce redundancy and improve integrity. It involves breaking down large tables into smaller, more manageable ones and establishing relationships between them to minimize duplication and anomalies. |
| Denormalization | An optimization technique that involves adding redundant data to one or more tables to improve read performance. This process reverses some aspects of normalization to reduce the complexity of join operations and speed up query execution, especially in data warehousing scenarios. |
| Non-Relational (NoSQL) database | A type of database that stores and retrieves data modeled in ways other than the tabular relations used in relational databases. These systems are designed to handle large volumes of unstructured or semi-structured data and offer more flexible data models. |
| Document database | A type of non-relational database that stores data in flexible, JSON-like documents. Each document can have a different structure, allowing for more dynamic and adaptable models compared to traditional relational databases. |
| JavaScript Object Notation (JSON) | A lightweight, text-based data interchange format that is easy for humans to read and write and easy for machines to parse and generate. It is commonly used for transmitting data between a server and web application, as well as for storing semi-structured data in certain types of databases. |
| Graph database | A database that uses graph structures with nodes, edges, and properties to represent and store data. It is designed to handle highly connected data and is particularly useful for analyzing relationships in social networks, fraud detection, and recommendation systems. |
| Wide-column store | A type of non-relational database that organizes data in tables with rows and dynamic columns. It's designed for storing large amounts of structured and semi-structured data across many commodity servers, offering high scalability and performance for certain types of big data applications. |
| Data Lake | A centralized repository that allows storage of all structured and unstructured data at any scale. It can hold data in raw format, without having to structure it first, allowing for more flexibility in analysis and the ability to ask new questions as business needs change. |
| Data Warehouse | A central repository of integrated data from one or more disparate sources. It stores current and historical data in one single place and is used for creating analytical reports for knowledge workers throughout the enterprise. These systems are optimized for read-heavy operations and complex queries. |
| Data Mart | A subset of a data warehouse oriented to a specific business line or team. Data marts often draw data from only a few sources, making them easier to create and maintain than full-scale warehouses. They're designed to meet the specific demands of a particular group of users. |
| OLTP (Online Transaction Processing) | A class of systems that support transaction-oriented applications, such as point-of-sale or order entry. These systems are designed to handle a large number of short, atomic, isolated operations that typically involve inserting, updating, and retrieving small amounts of data. |
| OLAP (Online Analytical Processing) | A technology used to organize large business databases and support complex analysis. It allows users to analyze multidimensional data interactively from multiple perspectives, facilitating complex calculations, trend analyses, and sophisticated modeling. |
| Star Schema | A widely used model for data warehouses. It consists of one or more central fact tables referencing any number of dimension tables, forming a star-like structure. This arrangement simplifies queries and provides fast aggregations, making it ideal for analytical processing systems. |
| Fact Table | The central table in a star schema. It contains the measures or metrics of a business process (like sales amount or quantity sold) and references to the dimension tables that provide context for these metrics. Fact tables are typically very large and are optimized for analytical queries. |
| Dimension Table | A companion to the fact table in a star or snowflake schema. It contains descriptive attributes, typically textual fields or discrete numbers, representing business entities like products, customers, or time. Dimension tables provide the context for the measures in the fact table. |
| Snowflake Schema | An extension of the star schema in which dimension tables are normalized into multiple related tables. This reduces data redundancy but can make queries more complex due to the increased number of joins required. It's useful when dimension tables are very large or when there's a need for more granular hierarchies. |
| Slowly Changing Dimensions | A concept in data warehousing that addresses how to handle changes to dimension data over time. It defines several types (commonly Type 1, 2, and 3) that determine how historical data is preserved when attributes change, allowing for accurate point-in-time reporting and analysis. |
| SQLite | A lightweight, serverless, and self-contained relational database management system. It's embedded directly into applications, requiring no separate server process or configuration. This system is widely used in mobile apps, browsers, and other applications where a full-scale server would be overkill. |
| Transaction | A sequence of database operations that are treated as a single unit of work. In database management systems, transactions should have ACID properties: Atomicity (all operations complete successfully or none do), Consistency (the database remains in a consistent state), Isolation (concurrent transactions don't interfere with each other), and Durability (completed transactions persist even in case of system failure). |
