# Week 2 - Dynamic Database Design

You’ll learn more about database systems, including data marts, data lakes, data warehouses, and ETL processes. You’ll also investigate the five factors of database performance: workload, throughput, resources, optimization, and contention. Finally, you’ll consider how to design efficient queries that get the most from a system.

## Learning Objectives

- Discover strategies to create an ETL process that meets organizational and stakeholder needs, and to maintain that process efficiently.

- Understand what the different data storage and extraction processes and tools may include (Extract/Load: Stitch, Segment, Fivetran; Transform: DBT, Airflow, Looker).

- Explain how to optimize when building new tables.

- Identify and describe where new tables can fit in the pipeline.

- Recognize the different aspects of databases, including OLAP and OLTP, columnar and relational, distributed and single-homed databases.

- Understand the importance of database performance and optimization.

- Describe the five factors of database performance: workload, throughput, resources, optimization, and contention.

- Perform pipeline debugging using queries.

## Database Performance

### Data marts and lakes


A data mart, as you may recall, is a subject-oriented
database that can be a subset of a larger data warehouse.
In BI, subject-oriented describes something that is associated with specific areas or
departments of a business, such as finance, sales, or marketing.


A data lake is a database system that stores large amounts of raw data
in its original format until it's needed.
This makes the data easily accessible
because it doesn't require a lot of processing.
Like a data warehouse, a data lake combines many different sources, but
data warehouses are hierarchical, with files and folders to organize the data,
whereas data lakes are flat. Although the data has been tagged so it is identifiable,
it's not organized; it's fluid, which is why it's called a data lake.

ELT takes the same steps as ETL but reorders them so
that the pipeline extracts, loads, and then transforms the data.
Basically, ELT is a type of data pipeline that enables data to be gathered from
different sources, usually data lakes, then loaded into a unified destination system and
transformed into a useful format.
ELT enables BI professionals to ingest many different kinds of data into
a storage system as soon as that data is available.

### ETL vs ELT

Differences | ETL | ELT
------|-----|-----
The order of extraction, transformation, and loading data | Data is extracted, transformed in a staging area, and loaded into the target system | Data is extracted, loaded into the target system, and transformed as needed for analysis
Location of transformations | Data is moved to a staging area where it is transformed before delivery | Data is transformed in the destination system, so no staging area is required
Age of the technology | ETL has been used for over 20 years, and many tools have been developed to support ETL pipeline systems | ELT is a newer technology with fewer support tools built into existing technology
Access to data within the system | ETL systems only transform and load the data designated when the warehouse and pipeline are constructed | ELT systems load all of the data, allowing users to choose which data to analyze at any time
Calculations| Calculations executed in an ETL system replace or revise existing columns in order to push the results to the target table | Calculations are added directly to the existing dataset
Compatible storage systems | ETL systems are typically integrated with structured, relational data warehouses| ELT systems can ingest unstructured data from sources like data lakes
Security and compliance| Sensitive information can be redacted or anonymized before loading it into the data warehouse, which protects data| Data has to be uploaded before data can be anonymized, making it more vulnerable
Data size| ETL is great for dealing with smaller datasets that need to undergo complex transformations| ELT is well-suited to systems using large amounts of both structured and unstructured data
Wait times| ETL systems have longer load times, but analysis is faster because data has already been transformed when users access it|Data loading is very fast in ELT systems because data can be ingested without waiting for transformations to occur, but analysis is slower
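
The difference in ordering can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: it uses the standard-library `sqlite3` module as a stand-in for the target system, and the `RAW_ORDERS` data, the table names, and the `clean` helper are all hypothetical.

```python
import sqlite3

# Hypothetical raw records from a source system (illustrative data only).
RAW_ORDERS = [
    ("1001", " alice ", "19.99"),
    ("1002", "bob", "5.00"),
    ("1003", " carol", "42.50"),
]


def clean(row):
    """Transform step: cast types and trim whitespace from the customer name."""
    order_id, customer, amount = row
    return int(order_id), customer.strip(), float(amount)


def run_etl(conn):
    """ETL: transform in a staging area first, then load only the result."""
    conn.execute(
        "CREATE TABLE orders_etl (order_id INTEGER, customer TEXT, amount REAL)"
    )
    staged = [clean(row) for row in RAW_ORDERS]  # staging area (in memory here)
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", staged)


def run_elt(conn):
    """ELT: load the raw data as-is, then transform inside the target system."""
    conn.execute(
        "CREATE TABLE orders_raw (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", RAW_ORDERS)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               TRIM(customer)            AS customer,
               CAST(amount AS REAL)      AS amount
        FROM orders_raw
        """
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)

etl_total = conn.execute("SELECT SUM(amount) FROM orders_etl").fetchone()[0]
elt_total = conn.execute("SELECT SUM(amount) FROM orders_elt").fetchone()[0]
raw_count = conn.execute("SELECT COUNT(*) FROM orders_raw").fetchone()[0]
etl_names = [r[0] for r in conn.execute(
    "SELECT customer FROM orders_etl ORDER BY order_id")]
elt_names = [r[0] for r in conn.execute(
    "SELECT customer FROM orders_elt ORDER BY order_id")]
```

Both paths end with the same cleaned table, but only the ELT path keeps the untransformed `orders_raw` table in the target system. That mirrors the "Access to data within the system" and "Security and compliance" rows of the table above.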

### Data Storage System
Data Warehouse | Data Lake
------|-----
Data has already been processed and stored in a relational system| Data is raw and unprocessed until it is needed for analysis; additionally, it can have a copy of the entire OLTP or relational database
The data’s purpose has already been assigned, and the data is currently in use|The data’s purpose has not been determined yet
Making changes to the system can be complicated and require a lot of work|Systems are highly accessible and easy to update


### Performance
Database performance is a measure of
the workload that can be processed by a database,
as well as the associated costs. 

1. Workload

In BI, workload refers to
the combination of transactions, queries,
analysis, and system commands being
processed by the database system at any given time.
It's common for a database's workload to
fluctuate drastically from day to day,
depending on what jobs are being processed
and how many users are interacting with the database.
The good news is that you can
often predict these fluctuations.


2. Throughput

Throughput is the overall capability of
the database's hardware and software to process requests.
Throughput is made up of the input and output speed,
the central processing unit (CPU) speed,
how well the machine can run parallel processes,
the database management system,
and the operating system and system software.
Basically, throughput describes
the workload size that the system can handle.


3. Resources

Resources are the hardware and software tools available for use
in a database system, including disk space and memory.
Resources are a big part of
a database system's ability to
process requests and handle data.
They can also fluctuate,
especially if the hardware or
other dedicated resources are
shared with additional databases,
software applications, or services.
Also, cloud-based systems are
particularly prone to fluctuation.


4. Optimization

Optimization involves
maximizing the speed and efficiency with
which data is retrieved in order to
ensure high levels of database performance.
This is one of the most important factors that
BI professionals return to again and again,
and it is covered in more detail
in the Optimization section of these notes.


5. Contention 

Contention occurs when two or more components
attempt to use a single resource in a conflicting way.
This can really slow things down.
For instance, if there are
multiple processes trying to
update the same piece of data,
those processes are in contention.
As contention increases,
the throughput of the database decreases.
Limiting contention as much as possible will help
ensure the database is performing at its best. 


### Optimization

To make queries efficient, start with the query plan. A query plan is a description of the steps
the database system takes in order to execute a query. The query plan is the how.
If queries are running slowly,
checking the query plan to find out whether any steps
are drawing more resources than necessary can be helpful.
This is another iterative process.
After checking the query plan,
you might rewrite the query or create
new tables and then check the query plan again. 
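
The check, rewrite, and check-again loop described above might look like the following sketch. It assumes SQLite (through Python's standard-library `sqlite3` module) and a hypothetical `sales` table; other database systems expose their query plans through different commands and formats, so treat this as one possible illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and index, used only for this illustration.
conn.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)
conn.execute("CREATE INDEX idx_sales_order_date ON sales (order_date)")
conn.executemany(
    "INSERT INTO sales (order_date, amount) VALUES (?, ?)",
    [
        ("2022-12-31", 10.0),
        ("2023-01-15", 20.0),
        ("2023-06-30", 30.0),
        ("2024-01-01", 40.0),
    ],
)


def plan_details(query):
    """Return the human-readable 'detail' text of each step in the query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return [row[3] for row in rows]


# First attempt: wrapping the column in a function hides it from the index,
# so the plan reads every row of the table.
original_query = (
    "SELECT SUM(amount) FROM sales WHERE substr(order_date, 1, 4) = '2023'"
)

# Rewrite: a plain range filter on the column lets the plan use the index.
rewritten_query = (
    "SELECT SUM(amount) FROM sales "
    "WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'"
)

original_plan = plan_details(original_query)
rewritten_plan = plan_details(rewritten_query)

# Both queries must still return the same answer.
original_total = conn.execute(original_query).fetchone()[0]
rewritten_total = conn.execute(rewritten_query).fetchone()[0]

print(original_plan)
print(rewritten_plan)
```

The exact wording of the plan text varies between SQLite versions, but the first plan reports a scan of the table and the second reports a search that uses `idx_sales_order_date`.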


An index is an organizational tag
used to quickly locate data within a database system.
If the tables within a database
haven't been fully indexed,
it can take the database longer to locate resources.
In cloud-based systems working with big data,
you might have data partitions instead of indexes.
Data partitioning is the process
of dividing a database into
distinct logical parts in order to
improve query processing and increase manageability. 
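
To make the idea concrete, here is a simplified, pure-Python model of an index and of partitions. It is a conceptual sketch only: real database systems use more sophisticated structures, and the `ROWS` data and all function names are hypothetical. It counts how many rows each approach has to examine.

```python
from collections import defaultdict

# Hypothetical sales rows: (order_id, region, order_month)
ROWS = [
    (1, "east", "2023-01"),
    (2, "west", "2023-01"),
    (3, "east", "2023-02"),
    (4, "north", "2023-02"),
    (5, "east", "2023-03"),
    (6, "west", "2023-03"),
]


def full_scan(rows, region):
    """No index: every row has to be examined."""
    examined = 0
    matches = []
    for row in rows:
        examined += 1
        if row[1] == region:
            matches.append(row)
    return matches, examined


def build_index(rows, column):
    """Index: map each value in a column to the positions of matching rows."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index


def index_lookup(rows, index, value):
    """With an index, only the matching rows are touched."""
    positions = index.get(value, [])
    return [rows[p] for p in positions], len(positions)


def partition_by_month(rows):
    """Partitioning: divide the data into distinct logical parts by month."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[2]].append(row)
    return partitions


def partition_scan(partitions, month, region):
    """Only the partition for the requested month is scanned."""
    examined = 0
    matches = []
    for row in partitions.get(month, []):
        examined += 1
        if row[1] == region:
            matches.append(row)
    return matches, examined


scan_matches, scan_examined = full_scan(ROWS, "east")
region_index = build_index(ROWS, column=1)
index_matches, index_examined = index_lookup(ROWS, region_index, "east")
month_partitions = partition_by_month(ROWS)
part_matches, part_examined = partition_scan(month_partitions, "2023-02", "east")
```

With this data, the full scan examines all 6 rows to find the 3 `east` rows, the index lookup touches only those 3, and the partition scan examines just the 2 rows stored for `2023-02`.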

Another issue is fragmented data.
Fragmented data occurs when data is
broken up into many pieces that are not stored together,
often as a result of using
the data frequently or creating,
deleting, or modifying files.
For example, if you are accessing the same data
often and versions of it are being saved in your cache,
those versions are actually
causing fragmentation in your system. 
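
One way to observe a related effect is sketched below, assuming SQLite through Python's standard-library `sqlite3` module and a hypothetical `events` table. Deleting rows leaves unused pages scattered through the database file, and SQLite's `VACUUM` command rebuilds the file compactly. This illustrates fragmentation caused by deletes, not the caching scenario above, and other database systems have their own maintenance commands.

```python
import os
import sqlite3
import tempfile

# Hypothetical database file and table, created only for this demonstration.
db_path = os.path.join(tempfile.mkdtemp(), "fragmentation_demo.db")
conn = sqlite3.connect(db_path)

# Make sure free pages are not reclaimed automatically (assumed default).
conn.execute("PRAGMA auto_vacuum = NONE")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [("x" * 200,) for _ in range(2000)],
)
conn.commit()


def pragma(name):
    """Read a single-value PRAGMA such as page_count or freelist_count."""
    return conn.execute(f"PRAGMA {name}").fetchone()[0]


# Deleting most of the rows leaves unused pages behind in the file.
conn.execute("DELETE FROM events WHERE id <= 1800")
conn.commit()
pages_before = pragma("page_count")
free_before = pragma("freelist_count")

# VACUUM rewrites the database so the remaining data is stored together.
conn.execute("VACUUM")
pages_after = pragma("page_count")
free_after = pragma("freelist_count")

remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
conn.close()
```

After the delete, `free_before` is greater than zero because the freed pages are still part of the file. After `VACUUM`, `free_after` is zero and the file uses fewer pages, while the 200 remaining rows are untouched.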

Optimizing queries will make your pipeline operations faster and more efficient. In your role as a BI professional, you might work on projects with extremely large datasets. For these projects, it’s important to write SQL queries that are as fast and efficient as possible. Otherwise, your data pipelines might be slow and difficult to work with.




## Glossary

Contention: When two or more components attempt to use a single resource in a conflicting way

Data partitioning: The process of dividing a database into distinct, logical parts in order to improve query processing and increase manageability

Database performance: A measure of the workload that can be processed by a database, as well as associated costs

ELT (extract, load, and transform): A type of data pipeline that enables data to be gathered from data lakes, loaded into a unified destination system, and transformed into a useful format 

Fragmented data: Data that is broken up into many pieces that are not stored together, often as a result of using the data frequently or creating, deleting, or modifying files

Index: An organizational tag used to quickly locate data within a database system

Optimization: Maximizing the speed and efficiency with which data is retrieved in order to ensure high levels of database performance

Query plan: A description of the steps a database system takes in order to execute a query

Resources: The hardware and software tools available for use in a database system

Subject-oriented: Associated with specific areas or departments of a business

Throughput: The overall capability of the database’s hardware and software to process requests

Workload: The combination of transactions, queries, data warehousing analysis, and system commands being processed by the database system at any given time

### Terms and definitions from previous weeks

A

Application programming interface (API): A set of functions and procedures that integrate computer programs, forming a connection that enables them to communicate 

Applications software developer: A person who designs computer or mobile applications, generally for consumers

Attribute: In a dimensional model, a characteristic or quality used to describe a dimension

B

Business intelligence (BI): Automating processes and information channels in order to transform relevant data into actionable insights that are easily available to decision-makers

Business intelligence governance: A process for defining and implementing business intelligence systems and frameworks within an organization

Business intelligence monitoring: Building and using hardware and software tools to easily and rapidly analyze data and enable stakeholders to make impactful business decisions

Business intelligence stages: The sequence of stages that determine both BI business value and organizational data maturity, which are capture, analyze, and monitor

Business intelligence strategy: The management of the people, processes, and tools used in the business intelligence process

C

Columnar database: A database organized by columns instead of rows

Combined systems: Database systems that store and analyze data in the same place

Compiled programming language: A programming language that compiles coded instructions that are executed directly by the target machine

D

Data analysts: People who collect, transform, and organize data

Data availability: The degree or extent to which timely and relevant information is readily accessible and able to be put to use

Data governance professionals: People who are responsible for the formal management of an organization’s data assets

Data integrity: The accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle

Data lake: A database system that stores large amounts of raw data in its original format until it’s needed

Data mart: A subject-oriented database that can be a subset of a larger data warehouse

Data maturity: The extent to which an organization is able to effectively use its data in order to extract actionable insights

Data model: A tool for organizing data elements and how they relate to one another

Data pipeline: A series of processes that transports data from different sources to their final destination for storage and analysis

Data visibility: The degree or extent to which information can be identified, monitored, and integrated from disparate internal and external sources

Data warehouse: A specific type of database that consolidates data from multiple source systems for data consistency, accuracy, and efficient access

Data warehousing specialists: People who develop processes and procedures to effectively store and organize data

Database migration: Moving data from one source platform to another target database

Deliverable: Any product, service, or result that must be achieved in order to complete a project

Developer: A person who uses programming languages to create, execute, test, and troubleshoot software applications

Dimension (data modeling): A piece of information that provides more detail and context regarding a fact

Dimension table: The table where the attributes of the dimensions of a fact are stored

Design pattern: A solution that uses relevant measures and facts to create a model in support of business needs

Dimensional model: A type of relational model that has been optimized to quickly retrieve data from a data warehouse

Distributed database: A collection of data systems distributed across multiple physical locations

E

ETL (extract, transform, and load): A type of data pipeline that enables data to be gathered from source systems, converted into a useful format, and brought into a data warehouse or other unified destination system

Experiential learning: Understanding through doing

F

Fact: In a dimensional model, a measurement or metric

Fact table: A table that contains measurements or metrics related to a particular event

Foreign key: A field within a database table that is a primary key in another table (Refer to primary key)

Functional programming language: A programming language modeled around functions

G

Google DataFlow: A serverless data-processing service that reads data from the source, transforms it, and writes it in the destination location

I

Information technology professionals: People who test, install, repair, upgrade, and maintain hardware and software solutions

Interpreted programming language: A programming language that uses an interpreter, typically another program, to read and execute coded instructions

Iteration: Repeating a procedure over and over again in order to keep getting closer to the desired result

K

Key performance indicator (KPI): A quantifiable value, closely linked to business strategy, which is used to track progress toward a goal

L

Logical data modeling: Representing different tables in the physical data model

M

Metric: A single, quantifiable data point that is used to evaluate performance

O

Object-oriented programming language: A programming language modeled around data objects

OLAP (Online Analytical Processing) system: A tool that has been optimized for analysis in addition to processing and can analyze data from multiple databases

OLTP (Online Transaction Processing) database: A type of database that has been optimized for data processing instead of analysis

P

Portfolio: A collection of materials that can be shared with potential employers

Primary key: An identifier in a database that references a column or a group of columns in which each row uniquely identifies each record in the table (Refer to foreign key)

Project manager: A person who handles a project’s day-to-day steps, scope, schedule, budget, and resources

Project sponsor: A person who has overall accountability for a project and establishes the criteria for its success

Python: A general purpose programming language

R

Response time: The time it takes for a database to complete a user request

Row-based database: A database that is organized by rows

S

Separated storage and computing systems: Databases where data is stored remotely, and relevant data is stored locally for analysis

Single-homed database: Database where all of the data is stored in the same physical location

Snowflake schema: An extension of a star schema with additional dimensions and, often, subdimensions

Star schema: A schema consisting of one fact table that references any number of dimension tables

Strategy: A plan for achieving a goal or arriving at a desired future state

Systems analyst: A person who identifies ways to design, implement, and advance information systems in order to ensure that they help make it possible to achieve business goals

Systems software developer: A person who develops applications and programs for the backend processing systems used in organizations

T

Tactic: A method used to enable an accomplishment

Target table: The predetermined location where pipeline data is sent in order to be acted on

Transferable skill: A capability or proficiency that can be applied from one job to another

V

Vanity metric: Data points that are intended to impress others, but are not indicative of actual performance and, therefore, cannot reveal any meaningful business insights

