<a href="https://colab.research.google.com/github/brendanpshea/database_sql/blob/main/Database_01_StarShipSQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Starship SQL: An Introduction to Databases
### Databases Through Pop Culture: Brendan Shea, PhD

This chapter provides a comprehensive introduction to the world of databases, highlighting their importance in managing and organizing data effectively. By drawing parallels between database concepts and scenarios from the beloved science fiction franchise, Star Trek, the chapter aims to make these concepts more relatable and easier to grasp. Readers will explore the distinctions between data, information, and knowledge, and learn about various data models, including flat files, relational databases, document databases, and graph databases. The chapter also delves into the process of data modeling, discussing the conceptual, logical, and physical levels of abstraction. Additionally, it covers the selection of appropriate database management systems and the choice between cloud storage and local storage. Finally, the chapter explores different concepts of data storage, including data lakes, data warehouses, and data marts.

Learning Outcomes:

1.  Understand the differences between data, information, and knowledge and how databases help transform data into knowledge.
2.  Recognize the advantages of databases over flat files in terms of data organization, retrieval, and consistency.
3.  Comprehend the process of data modeling and its three main levels of abstraction: conceptual, logical, and physical.
4.  Identify and compare the key features, advantages, and use cases of different logical data models, including relational, document (JSON), and graph databases.
5.  Understand the factors to consider when choosing a database management system (DBMS) and the trade-offs between cloud storage and local storage.
6.  Explain the relationship between databases, data lakes, data warehouses, and data marts.


## Introduction to the Case Study: Starship SQL
Welcome aboard the Starship Enterprise, an iconic vessel from the beloved science fiction franchise, Star Trek. In this introductory chapter, we'll be exploring the fascinating world of databases using examples and scenarios inspired by the adventures of the Enterprise and its intrepid crew.

But why use a fictional spaceship to learn about databases? The answer lies in the versatility and universal applicability of database concepts. Whether you're managing a starship, a small business, or a global enterprise, the principles of data storage, retrieval, and manipulation remain the same.

By setting our learning journey against the backdrop of the Enterprise's missions, we can make the abstract concepts of databases more relatable and easier to grasp. We'll see how databases can help us solve real-world problems, from organizing crew schedules to analyzing sensor data from uncharted planets. As we embark on this journey together, we'll demystify the world of databases and discover how they can help us navigate the complexities of data in the 21st century.

Get ready to boldly go where no database learner has gone before!

## What's the Difference Between Data, Information, and Knowledge?

To understand the role of databases, it's essential to grasp the distinction between data, information, and knowledge. Let's beam down to a planet's surface with the Enterprise's away team and explore these concepts in action.

**Data** refers to raw, unorganized facts and figures. Imagine the Enterprise's sensors collecting various readings about the planet's atmosphere, temperature, and geology. These individual measurements, such as "Nitrogen: 78.1%," "Oxygen: 20.9%," and "Temperature: 25°C," are data points. On their own, they don't provide much insight or meaning. Examples of data include:

1. "Temperature: 25°C"
2. "Gravity: 9.8 m/s²"
3. "Radiation levels: 0.5 mSv/hr"
4. "Soil pH: 6.5"
5. "Atmospheric pressure: 1.01 bar"

**Information** is data that has been processed, structured, and given context. When the Enterprise's computer systems analyze the sensor data and present it in a meaningful way, such as "The planet's atmosphere closely matches Earth's in nitrogen and oxygen content," it becomes information. This level of organization and interpretation makes the data more useful and accessible. Examples of information include:

1. "The planet's average temperature is similar to Earth's, suitable for human habitation."
2. "The gravity on this planet is approximately equal to Earth's, which means the crew can move around easily."
3. "Radiation levels are within safe limits for short-term exposure, but prolonged stays may require protective gear."
4. "The soil pH indicates that the ground is slightly acidic, which may affect the growth of certain plants."
5. "Atmospheric pressure is comparable to Earth's at sea level, allowing for normal breathing."

**Knowledge** is the understanding and application of information. When the Enterprise's science officer, Mr. Spock, reviews the atmospheric information and concludes, "Captain, this planet is capable of supporting human life," he demonstrates knowledge. By combining the information with his expertise and experience, Spock can make informed decisions and recommendations. Examples of knowledge include:

1. "Based on the temperature and atmospheric data, this planet is classified as habitable for humans, and we can proceed with landing without special equipment."
2. "The similar gravity to Earth's means that our standard transportation vehicles and equipment will function normally on this planet."
3. "While the radiation levels are safe for now, we should limit our time on the surface and regularly monitor our exposure to prevent long-term health risks."
4. "The slightly acidic soil suggests that we may need to adjust our agricultural techniques and select crops that can thrive in this environment."
5. "With the atmospheric pressure being similar to Earth's, we can rule out the need for specialized breathing apparatus, making our exploration more efficient."

In this scenario, a database would serve as the central repository for storing and organizing the raw sensor data. By structuring the data into tables, rows, and columns, the database makes it easier to process and analyze the information. The Enterprise's crew can then query the database to gain insights, identify patterns, and ultimately, generate knowledge to guide their actions.

For example, suppose the away team discovers a new plant species. They can collect data on its physical characteristics, chemical composition, and genetic structure, storing it in the database. By comparing this information with records of known plants, the crew can determine whether the species is edible, medicinal, or potentially dangerous. This knowledge can then be used to ensure the safety and well-being of the crew during their mission.
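The progression from data to information to knowledge can be sketched in a few lines of Python. This is only an illustration: the readings come from the examples above, while the reference ranges, function names, and recommendation text are assumptions invented for this sketch, not real habitability criteria.

```python
# Raw sensor readings (data): values taken from the examples above.
readings = {
    "temperature_c": 25.0,
    "gravity_m_s2": 9.8,
    "radiation_msv_hr": 0.5,
    "soil_ph": 6.5,
    "pressure_bar": 1.01,
}

# Hypothetical Earth-like reference ranges, invented for this illustration.
EARTH_LIKE_RANGES = {
    "temperature_c": (0.0, 40.0),
    "gravity_m_s2": (8.0, 11.0),
    "pressure_bar": (0.8, 1.2),
}


def to_information(data):
    """Add context: compare each reading with its reference range."""
    info = {}
    for name, (low, high) in EARTH_LIKE_RANGES.items():
        info[name] = low <= data[name] <= high
    return info


def to_knowledge(info):
    """Apply the information: produce a recommendation the crew can act on."""
    if all(info.values()):
        return "Habitable: proceed with landing"
    return "Not Earth-like: further analysis required"


information = to_information(readings)
recommendation = to_knowledge(information)
print(information)
print(recommendation)
```

The raw dictionary plays the role of data, the range comparison adds context, and the final recommendation stands in for the judgment a science officer would apply.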

## What is a Database? How Can They Help Turn Data Into Knowledge?

A **database** is a structured collection of data that is organized in a way that allows for efficient storage, retrieval, and manipulation of information. Databases are designed to handle large amounts of data and provide a reliable and secure way to manage and access that data.

In contrast, a **flat file** is a simple, **linear** structure that stores data in a single table or spreadsheet. (Linear means the data is arranged in a "line," so finding a record may require reading through all of the records before it.) Flat files are easy to create and understand but have limitations when dealing with complex data relationships and large datasets.

Let's consider an example from the Starship Enterprise. Suppose the crew is tasked with cataloging the various alien species they encounter during their missions. Using a flat file, they might create a spreadsheet with columns for species name, planet of origin, physical characteristics, and level of technological advancement.

While this approach may work for a small number of entries, it quickly becomes cumbersome as the list grows. Searching for specific information, updating records, and maintaining data consistency becomes increasingly difficult.

On the other hand, a database can store this information in a more organized and efficient manner. By breaking the data into separate tables for species, planets, and technological levels, and establishing relationships between these tables, the database can provide a more comprehensive and flexible way to manage the information.

For instance, the database can ensure that each species is linked to its correct planet of origin, prevent duplicate entries, and allow for easy updating of information across multiple records. The crew can then use **queries** to search for specific species based on various criteria, such as all species from a particular planet or those with a certain level of technological advancement. A database is also not restricted to a linear layout: structures such as indexes let the system locate a record without reading through all of the "previous" entries first.
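The difference between scanning a flat file and looking a record up directly can be sketched with plain Python. The species rows below are illustrative sample data, and the dictionary is only a stand-in for the idea of an index; real database systems typically use on-disk structures such as B-trees.

```python
import csv
import io

# A tiny flat file: one record per line, as in a spreadsheet export.
# The rows are illustrative sample data.
FLAT_FILE = """species,planet,tech_level
Vulcan,Vulcan,Warp
Andorian,Andoria,Warp
Human,Earth,Warp
Klingon,Qo'noS,Warp
"""


def linear_search(text, species):
    """Scan the flat file from the top until the record is found."""
    rows_examined = 0
    for row in csv.DictReader(io.StringIO(text)):
        rows_examined += 1
        if row["species"] == species:
            return row, rows_examined
    return None, rows_examined


def build_index(text):
    """Build a lookup structure keyed by species name (a very simple 'index')."""
    return {row["species"]: row for row in csv.DictReader(io.StringIO(text))}


record, steps = linear_search(FLAT_FILE, "Klingon")
index = build_index(FLAT_FILE)
print(steps)                       # number of rows read by the scan
print(index["Klingon"]["planet"])  # direct lookup, no scan of earlier rows
```

With four rows the difference is negligible, but the scan grows with the size of the file while the keyed lookup does not.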

Furthermore, databases can help turn data into knowledge by enabling complex analysis and pattern recognition. By using **data mining** techniques, the Enterprise's science team can uncover hidden relationships and trends within the species data. They might discover that certain physical characteristics are correlated with higher levels of technological advancement, or that species from certain regions of space are more likely to be hostile.

This knowledge can then inform the crew's decision-making processes and help them prepare for future encounters. For example, if the database analysis reveals that species with certain features tend to be more aggressive, the Enterprise can adjust its diplomatic approach or defensive strategies accordingly.

## What is "Data Modeling"?

The first step in creating a database is to build a "data model." **Data modeling** is the process of analyzing and defining the structure, relationships, and constraints of an organization's data to create a standardized representation of the data that will be stored in a database. This representation serves as a blueprint for the database system, guiding its design, development, and maintenance.

Data modeling involves understanding the business requirements, identifying the entities (objects or concepts) that need to be represented, and determining how these entities relate to each other. The goal is to create a model that accurately reflects the organization's data needs, supports its business processes, and enables efficient data management.

The data modeling process typically involves three main levels of abstraction:

1.  **Conceptual Data Model**: This high-level model focuses on capturing the overall structure of the data and the relationships between different entities, without delving into the specifics of implementation. At this stage, data modelers work closely with business stakeholders to understand their requirements, identify the main data entities, and define the business rules that govern the data. The conceptual model is often represented using Entity-Relationship Diagrams (ERDs), which visually depict the entities, their attributes, and the relationships between them.
2.  **Logical Data Model**: The logical model takes the conceptual model and refines it to provide a more detailed, technology-independent representation of the data structure. At this level, data modelers decide on the specific data attributes, data types, and the relationships between entities. They also define the primary keys, foreign keys, and any constraints or rules that govern the data. The logical model is still independent of the specific database technology being used, allowing for flexibility in implementation. Depending on the organization's needs, the logical model may be represented using a relational model (for SQL databases), a document model (for NoSQL databases like MongoDB), or other data models.
3.  **Physical Data Model**: The physical model is a technology-specific representation of the logical model, taking into account the specific database management system being used and any performance or storage considerations. This model includes details such as table structures, indexes, and data types. For the Enterprise, the physical model would be the actual implementation of a system, such as mission planning, using a specific database technology like PostgreSQL or Oracle.

To illustrate the data modeling process, let's consider the Starship Enterprise's mission planning system. At the conceptual level, the data modelers would identify entities such as "Mission," "Starship," "Crew Member," and "Planet," and define the relationships between them (e.g., a Mission involves a Starship and a Crew, and may take place on a Planet). They would also define business rules, such as "each Mission must have at least one Crew Member assigned to it."

At the logical level, the data modelers would refine the model by adding attributes to each entity (e.g., Mission has a start date, end date, and objective), and determining the cardinality of the relationships (e.g., a Starship can have many Missions, but a Mission is associated with only one Starship). They would also choose the appropriate data model, such as a relational model, based on the organization's requirements.

Finally, at the physical level, the data modelers would translate the logical model into a specific database schema, defining tables, columns, data types, and any database-specific optimizations needed to ensure efficient data storage and retrieval.

By following this data modeling process, the Starship Enterprise can ensure that its mission planning system is well-structured, efficient, and aligned with the organization's needs, enabling effective data management and decision-making.

In the following sections, we'll take a brief look at the most common logical data models. In later chapters, we will consider each type of modeling in much greater detail.

### The Entity-Relationship Model

The Entity-Relationship (ER) Model is a visual representation used primarily in the conceptual and logical stages of data modeling. It provides a way to describe the data requirements of an organization in a graphical format, making it easier for both technical and non-technical stakeholders to understand and validate the data structure.

Key components of the ER Model include:

1.  **Entities**: These represent distinct objects or concepts in the system (e.g., Mission, Starship, Crew Member).
2.  **Attributes**: These are properties or characteristics of entities (e.g., mission start date, starship name, crew member rank).
3.  **Relationships**: These show how entities are related to each other (e.g., a Mission is assigned to a Starship).
4.  **Cardinality**: This indicates the number of instances of one entity that can be associated with the other entity in a relationship (e.g., one Starship can have many Missions, but each Mission is associated with only one Starship).

The ER Model fits into the data modeling process as follows:

-   In the **Conceptual Data Model** stage, a high-level ER diagram is created to capture the main entities and their relationships, without including all the detailed attributes. This helps stakeholders understand the overall structure of the data.
-   In the **Logical Data Model** stage, the ER diagram is refined to include all attributes, specify data types, and define the cardinality of relationships. This more detailed ER model serves as a bridge between the conceptual understanding and the actual database implementation.
-   When moving to the **Physical Data Model**, the ER diagram is translated into the specific schema of the chosen database system. For relational databases, entities typically become tables, attributes become columns, and relationships are implemented through primary and foreign keys.

By using the Entity-Relationship Model throughout the data modeling process, organizations like Starfleet can ensure that their data structures are well-defined, easily understood by all stakeholders, and accurately represent the complex relationships in their systems, such as those between missions, starships, crew members, and explored planets.
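To make the four ER components concrete, here is a minimal Python sketch using dataclasses. It is only an analogy, not how a database stores anything: the class and field names are assumptions chosen to mirror the Mission and Starship examples above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Starship:  # entity
    name: str    # attribute
    # Cardinality: one starship can have many missions.
    # repr/compare are disabled to avoid recursing through the back-reference.
    missions: List["Mission"] = field(default_factory=list, repr=False, compare=False)


@dataclass
class Mission:          # entity
    objective: str      # attribute
    start_date: date    # attribute
    starship: Starship  # relationship: each mission uses exactly one starship

    def __post_init__(self):
        # Keep both sides of the one-to-many relationship in sync.
        self.starship.missions.append(self)


enterprise = Starship("Enterprise")
m1 = Mission("Diplomatic meeting", date(2258, 1, 15), enterprise)
m2 = Mission("Scientific research", date(2258, 2, 3), enterprise)
print(len(enterprise.missions))
print(m1.starship.name)
```

In a relational implementation the `starship` reference would become a foreign key column, and the `missions` list would be derived by a query rather than stored.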

### Example: Conceptual ERD

```python
import base64
from IPython.display import Image, display, HTML

def mm(graph):
    """Render a Mermaid diagram definition as an image via the mermaid.ink service."""
    graphbytes = graph.encode("utf8")
    base64_bytes = base64.b64encode(graphbytes)
    base64_string = base64_bytes.decode("ascii")
    display(Image(url="https://mermaid.ink/img/" + base64_string))

# Conceptual model: entities and relationships only, no attributes.
mm("""
  erDiagram
        MISSION ||--o{ CREW_MEMBER : "assigned to"
        MISSION }o--|| STARSHIP : "uses"
        MISSION }o--|| PLANET : "explores"
    """)
```

### Example: Logical ERD

```python
# Logical model: attributes, keys, and a junction entity for the
# many-to-many relationship between missions and crew members.
mm("""
erDiagram
        MISSION_L {
            int mission_id PK
            date start_date
            date end_date
            string objective
            int starship_id FK
            int planet_id FK
        }
        STARSHIP_L {
            int starship_id PK
            string name
            string class
        }
        CREW_MEMBER_L {
            int crew_member_id PK
            string name
            string rank
        }
        PLANET_L {
            int planet_id PK
            string name
            string classification
        }
        MISSION_CREW_L {
            int mission_id FK
            int crew_member_id FK
        }
        MISSION_L ||--o{ MISSION_CREW_L : has
        MISSION_CREW_L }o--|| CREW_MEMBER_L : includes
        MISSION_L }o--|| STARSHIP_L : uses
        MISSION_L }o--|| PLANET_L : explores

""")
```

### Example: Physical ERD

```python
# Physical model: concrete table names and database-specific column types.
mm("""
erDiagram
missions {
            serial mission_id PK
            date start_date
            date end_date
            varchar objective
            int starship_id FK
            int planet_id FK
        }
        starships {
            serial starship_id PK
            varchar name
            varchar class
        }
        crew_members {
            serial crew_member_id PK
            varchar name
            varchar rank
        }
        planets {
            serial planet_id PK
            varchar name
            varchar classification
        }
        mission_crew {
            int mission_id FK
            int crew_member_id FK
        }
        missions ||--o{ mission_crew : has
        mission_crew }o--|| crew_members : includes
        missions }o--|| starships : uses
        missions }o--|| planets : explores
""")
```
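As a rough sketch of what the physical level might look like in practice, the diagram above can be turned into table definitions. The diagram uses PostgreSQL-style types (`serial`, `varchar`); the version below is an adaptation that assumes SQLite through Python's built-in `sqlite3` module so that it runs without a server. The `NOT NULL` constraints and the composite primary key on `mission_crew` are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE starships (
    starship_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    class       TEXT
);
CREATE TABLE planets (
    planet_id      INTEGER PRIMARY KEY,
    name           TEXT NOT NULL,
    classification TEXT
);
CREATE TABLE crew_members (
    crew_member_id INTEGER PRIMARY KEY,
    name           TEXT NOT NULL,
    rank           TEXT
);
CREATE TABLE missions (
    mission_id  INTEGER PRIMARY KEY,
    start_date  TEXT,
    end_date    TEXT,
    objective   TEXT,
    starship_id INTEGER NOT NULL REFERENCES starships(starship_id),
    planet_id   INTEGER NOT NULL REFERENCES planets(planet_id)
);
CREATE TABLE mission_crew (
    mission_id     INTEGER NOT NULL REFERENCES missions(mission_id),
    crew_member_id INTEGER NOT NULL REFERENCES crew_members(crew_member_id),
    PRIMARY KEY (mission_id, crew_member_id)
);
""")

# List the tables that were created.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

Each entity becomes a table, each attribute a column, and each relationship a `REFERENCES` clause, which is the translation the physical level calls for.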

## Logical Data Models: The Relational Model

The **relational model** is the most widely used logical data model, particularly in SQL (Structured Query Language) databases. In the relational model, data is organized into tables (also known as relations), with each table consisting of rows (tuples) and columns (attributes). The relational model provides a simple, flexible, and powerful way to represent and manipulate data.

Key concepts in the relational model include:

1.  **Tables**: A table is a collection of related data entries, organized into rows and columns. Each table represents a single entity or concept, such as "Crew Member" or "Mission."
2.  **Columns**: Each column in a table represents a specific attribute of the entity, such as "Name," "Rank," or "Employee ID" for the "Crew Member" table.
3.  **Rows**: Each row in a table represents a unique instance of the entity, such as a specific crew member or mission.
4.  **Primary Key**: A primary key is a column (or set of columns) that uniquely identifies each row in a table. For example, in the "Crew Member" table, the "Employee ID" could be the primary key.
5.  **Foreign Key**: A foreign key is a column (or set of columns) in one table that refers to the primary key of another table, establishing a relationship between the two tables. For example, the "Mission" table might have a foreign key "Crew Member ID" that refers to the primary key "Employee ID" in the "Crew Member" table.
6.  **Relationships**: Relationships define how tables are connected to each other based on their primary and foreign keys. The three main types of relationships are one-to-one, one-to-many, and many-to-many.

To illustrate the relational model, let's consider a simplified example of the Starship Enterprise's crew management system. We'll define two tables: "Crew Member" and "Department."

**Crew Member Table**

| Employee ID (PK) | Name | Rank | Department ID (FK) |
| --- | --- | --- | --- |
| 1 | James Kirk | Captain | 1 |
| 2 | Spock | Commander | 2 |
| 3 | Uhura | Lieutenant | 3 |
| 4 | Leonard McCoy | Lieutenant | 4 |

**Department Table**

| Department ID (PK) | Department Name |
| --- | --- |
| 1 | Command |
| 2 | Science |
| 3 | Communications |
| 4 | Medical |

In this example, the "Crew Member" table has a primary key "Employee ID" and a foreign key "Department ID," which references the primary key "Department ID" in the "Department" table. This establishes a one-to-many relationship between the two tables, as each crew member belongs to a single department, but a department can have multiple crew members.

Using this relational model, the Enterprise can easily store, retrieve, and manipulate data about its crew members and departments. For example, they can query the database to find all crew members belonging to a specific department or join the two tables to retrieve the department name for each crew member. The link between primary and foreign keys allows relational databases to provide strong support for **referential integrity** (basically, that the foreign key points to a "real" primary key in another table).
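The two tables above, the join, and referential integrity can all be tried out with Python's built-in `sqlite3` module. This is a minimal sketch assuming SQLite; the snake_case table and column names are adaptations of the headings above, and the rejected row (a crew member assigned to a nonexistent department 99) is invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for SQLite to enforce foreign keys

conn.executescript("""
CREATE TABLE department (
    department_id   INTEGER PRIMARY KEY,
    department_name TEXT NOT NULL
);
CREATE TABLE crew_member (
    employee_id   INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    rank          TEXT,
    department_id INTEGER REFERENCES department(department_id)
);
INSERT INTO department VALUES (1, 'Command'), (2, 'Science'),
                              (3, 'Communications'), (4, 'Medical');
INSERT INTO crew_member VALUES (1, 'James Kirk', 'Captain', 1),
                               (2, 'Spock', 'Commander', 2),
                               (3, 'Uhura', 'Lieutenant', 3),
                               (4, 'Leonard McCoy', 'Lieutenant', 4);
""")

# Join the two tables to retrieve each crew member's department name.
rows = conn.execute("""
    SELECT c.name, d.department_name
    FROM crew_member AS c
    JOIN department AS d ON d.department_id = c.department_id
    ORDER BY c.employee_id
""").fetchall()
print(rows)

# Referential integrity: department 99 does not exist, so the insert is rejected.
try:
    conn.execute("INSERT INTO crew_member VALUES (5, 'Pavel Chekov', 'Ensign', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The failed insert shows the database itself, rather than application code, refusing a foreign key that does not point at a real primary key.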

The relational model provides a strong foundation for organizing and managing data in a structured and efficient manner, and has been the dominant model since the 1970s. Leading database management software such as Oracle, MySQL, PostgreSQL, Microsoft Access, Microsoft SQL Server, and SQLite are all based on the relational model. We'll focus much of our attention on this data model.

## Logical Data Models: JSON and Document Databases

**JSON (JavaScript Object Notation)** is a lightweight, text-based data format that has gained popularity as a way to represent and store data in NoSQL databases, particularly in **document databases**. JSON provides a flexible and intuitive structure for organizing data, making it well-suited for handling semi-structured and hierarchical data.

In JSON, data is represented as **key-value pairs** and arrays. **Keys** are strings, and **values** can be various data types, such as strings, numbers, booleans, objects, or arrays. JSON supports nested objects and arrays, allowing for the creation of complex, hierarchical data structures within a single document.

```javascript

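// Example of a JSON document; each entry has the form "key": value
// (The // comments are explanatory only: standard JSON does not allow comments.)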
{
  "crew_id": "001",
  "name": "James Kirk",
  "rank": "Captain",
  "ship": "Enterprise",
  
  // Missions is an example of a "nested" data structure
  "missions": [
    {
      "mission_id": "M001",
      "planet": "Vulcan",
      "objective": "Diplomatic meeting",
      "start_date": "2258-01-15",
      "end_date": "2258-01-18"
    },
    {
      "mission_id": "M002",
      "planet": "Andoria",
      "objective": "Scientific research",
      "start_date": "2258-02-03",
      "end_date": "2258-02-07"
    }
  ],
  "skills": ["Leadership", "Tactics", "Diplomacy"],
  "performance_reviews": [
    {
      "date": "2258-12-31",
      "reviewer": "Admiral Pike",
      "rating": 9,
      "comments": "Exceptional leadership and decision-making skills."
    }
  ]
}

```

Document databases, such as MongoDB and Couchbase, leverage the JSON format (or a binary variant like BSON) to store data as semi-structured documents. Each document can have a different structure, allowing for flexible and schema-less data storage. This flexibility enables developers to easily modify the data structure as application requirements evolve, without the need for costly schema migrations.
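Working with such a document from application code is straightforward. The sketch below uses Python's standard `json` module on a trimmed-down copy of the crew document shown above; it does not use any document database, and the second document (with its `home_planet` field) is a hypothetical example of a record with a different structure.

```python
import json

# A trimmed-down version of the crew document shown above.
document = """
{
  "crew_id": "001",
  "name": "James Kirk",
  "rank": "Captain",
  "missions": [
    {"mission_id": "M001", "planet": "Vulcan", "objective": "Diplomatic meeting"},
    {"mission_id": "M002", "planet": "Andoria", "objective": "Scientific research"}
  ],
  "skills": ["Leadership", "Tactics", "Diplomacy"]
}
"""

crew = json.loads(document)  # JSON objects become dicts, arrays become lists
planets = [m["planet"] for m in crew["missions"]]

# Documents need not share a structure: this one adds a field and omits others.
other = json.loads('{"crew_id": "002", "name": "Spock", "home_planet": "Vulcan"}')

print(planets)
print(other.get("missions", []))  # a missing field is handled by the application
```

Note that the second document has no `missions` field at all, so the application has to decide what a missing field means; this is the flip side of schema flexibility.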

Most modern "relational" database management systems (mentioned above) also have the ability to interact natively with JSON. Later in this book, we'll see how this works.

Key features and advantages of JSON and document databases include:

1.  **Flexibility**: JSON allows for the storage of semi-structured and unstructured data, accommodating evolving data requirements and enabling rapid application development.
2.  **Scalability**: Document databases are designed to scale horizontally, distributing data across multiple servers to handle large volumes of data and high read/write throughput.
3.  **Performance**: By storing related data together within a single document, document databases can reduce the need for expensive joins and improve read performance.
4.  **Expressive Query Languages**: Document databases often provide expressive query languages that support complex queries, indexing, and aggregation operations on JSON data.

Compared to relational databases, JSON and document databases offer a different approach to data modeling and storage. While relational databases enforce a strict, predefined schema and normalize data across multiple tables, JSON and document databases allow for flexible, denormalized data storage within a single document. This approach can simplify data modeling and improve performance for certain use cases, particularly when dealing with rapidly changing or unstructured data.

However, it's important to note that relational databases still excel in scenarios that require strong data consistency, complex transactions, and rigorous ACID (Atomicity, Consistency, Isolation, Durability) properties. The choice between JSON/document databases and relational databases depends on the specific requirements of the application, such as data structure, scalability needs, and consistency guarantees.
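Atomicity, the "A" in ACID, can be illustrated with a short sketch using Python's built-in `sqlite3` module. The `supplies` table, the ship names, and the quantities are hypothetical; the point is only that a transaction either applies all of its changes or none of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE supplies (ship TEXT PRIMARY KEY, dilithium INTEGER CHECK (dilithium >= 0))"
)
conn.execute("INSERT INTO supplies VALUES ('Enterprise', 10), ('Reliant', 5)")
conn.commit()


def transfer(conn, source, target, amount):
    """Move crystals between ships as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE supplies SET dilithium = dilithium + ? WHERE ship = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE supplies SET dilithium = dilithium - ? WHERE ship = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "Enterprise", "Reliant", 3)       # succeeds
failed = transfer(conn, "Enterprise", "Reliant", 50)  # would go negative: rolled back
balances = dict(conn.execute("SELECT ship, dilithium FROM supplies"))
print(ok, failed, balances)
```

In the second call the first `UPDATE` succeeds before the `CHECK` constraint rejects the second, yet the rollback undoes both, so the target ship never keeps crystals the source could not supply.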

In the context of the Starship Enterprise, a document database using JSON could be used to store and manage various types of data, such as mission reports, crew profiles, and sensor readings, allowing for flexible and easily extensible data representation. The ability to nest objects and arrays within JSON documents enables the creation of rich, hierarchical data structures that can be efficiently queried and updated, supporting the diverse data management needs of the Enterprise.

## Logical Data Models: Graph Databases

Graph databases (such as **Neo4j**) are a type of NoSQL database that use a graph structure to represent and store data. They focus on the relationships between data entities, making them well-suited for handling highly connected and complex data. Graph databases excel in scenarios where the relationships between data elements are as important as the data itself.

In a graph database, data is represented as **nodes (vertices)** and **edges (relationships)**. Nodes represent entities, such as crew members, planets, or starships, while edges represent the connections or relationships between these entities. Both nodes and edges can have properties (key-value pairs) that store additional information about the entities and relationships.

Example:

```cypher
// Create crew member nodes
CREATE (kirk:CrewMember {name: "James Kirk", rank: "Captain"})
CREATE (spock:CrewMember {name: "Spock", rank: "Commander"})

// Create starship node
CREATE (enterprise:Starship {name: "Enterprise", registry: "NCC-1701"})

// Create planet node
CREATE (vulcan:Planet {name: "Vulcan", classification: "M"})

// Create mission node
CREATE (mission:Mission {name: "Diplomatic meeting", objective: "Establish relations with Vulcan"})

// Create relationships between nodes
CREATE (kirk)-[:COMMANDS]->(enterprise)
CREATE (enterprise)-[:VISITED]->(vulcan)
CREATE (spock)-[:SERVES_ON]->(enterprise)
CREATE (kirk)-[:PARTICIPATES_IN]->(mission)
CREATE (mission)-[:TAKES_PLACE_ON]->(vulcan)
```


In this example, we use Cypher to create nodes representing crew members (`kirk` and `spock`), a starship (`enterprise`), a planet (`vulcan`), and a mission (`mission`). We also create relationships between these nodes using the `CREATE` statement and the `[]` syntax to specify the relationship types, such as `COMMANDS`, `VISITED`, `SERVES_ON`, `PARTICIPATES_IN`, and `TAKES_PLACE_ON`.
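To see why traversing relationships is the natural operation on this kind of data, the same nodes and edges can be modeled in plain Python. This is a conceptual sketch only: it does not use Neo4j or Cypher, and the adjacency-list layout and `shortest_path` helper are assumptions made for illustration.

```python
from collections import deque

# Edges copied from the Cypher example: (start node, relationship type, end node).
EDGES = [
    ("kirk", "COMMANDS", "enterprise"),
    ("enterprise", "VISITED", "vulcan"),
    ("spock", "SERVES_ON", "enterprise"),
    ("kirk", "PARTICIPATES_IN", "mission"),
    ("mission", "TAKES_PLACE_ON", "vulcan"),
]

# Adjacency list: each node maps to its outgoing (relationship, neighbor) pairs.
graph = {}
for start, rel, end in EDGES:
    graph.setdefault(start, []).append((rel, end))
    graph.setdefault(end, [])


def shortest_path(graph, source, target):
    """Breadth-first search following edges in their stored direction."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for _rel, neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(shortest_path(graph, "spock", "vulcan"))
print(shortest_path(graph, "vulcan", "kirk"))
```

The first call finds how Spock is connected to Vulcan through the Enterprise; the second returns `None` because the edges are directed and nothing leads out of the `vulcan` node.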

Compared to relational databases, graph databases offer a different perspective on data modeling and querying. While relational databases normalize data and define relationships through foreign keys, graph databases prioritize the relationships between entities and store them as first-class citizens. This approach can lead to more intuitive data modeling and faster querying of highly connected data.

However, graph databases may not be the best fit for all use cases. They are particularly well-suited for scenarios where the relationships between data elements are complex, frequently traversed, and subject to change. In contrast, relational databases are better suited for structured data with well-defined schemas and strong consistency requirements.

In the context of the Starship Enterprise, a graph database could be used to model and analyze various relationships, such as:

-   Social connections between crew members
-   Dependency chains between ship systems and components
-   Trade routes and diplomatic relationships between planets and civilizations
-   Mapping of the explored universe and the connections between star systems

By leveraging the power of graph databases, the Enterprise can gain valuable insights into the complex web of relationships that underlie its operations, enabling better decision-making, faster problem-solving, and more efficient exploration of the final frontier.

## Logical Data Models: Column Databases

Column databases, also known as columnar databases, store data by column rather than by row. They are often grouped with NoSQL systems, although many of them (such as Amazon Redshift and Google BigQuery) are queried with SQL. This approach offers significant advantages for certain types of data analytics and business intelligence workloads.

In a column database, each column is stored separately on disk, so all values of a given column (which share a data type) sit together. This structure allows for efficient compression and faster querying of specific attributes across large datasets.

Consider a simple table of starship crew members:

| ID | Name | Rank | Specialization |
| --- | --- | --- | --- |
| 1 | James Kirk | Captain | Command |
| 2 | Spock | Commander | Science |
| 3 | McCoy | Lieutenant | Medical |

In a column database, this would be stored as:

```
ID: [1, 2, 3]
Name: ["James Kirk", "Spock", "McCoy"]
Rank: ["Captain", "Commander", "Lieutenant"]
Specialization: ["Command", "Science", "Medical"]
```

Key advantages of column databases include:

-   Efficient data compression
-   Fast aggregation and analytics on specific columns
-   Scalability for large volumes of data
-   Improved I/O efficiency for queries on a subset of columns
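The row-to-column pivot and the compression benefit can be sketched in plain Python. This is a toy model, not how any particular column database is implemented: the helper names are invented, the `ranks` list is hypothetical sample data, and run-length encoding is used here as one common example of a columnar compression technique.

```python
# Row-oriented layout: one record per crew member (values from the table above).
rows = [
    {"id": 1, "name": "James Kirk", "rank": "Captain", "specialization": "Command"},
    {"id": 2, "name": "Spock", "rank": "Commander", "specialization": "Science"},
    {"id": 3, "name": "McCoy", "rank": "Lieutenant", "specialization": "Medical"},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    return {key: [record[key] for record in records] for key in records[0]}


def run_length_encode(values):
    """Compress consecutive repeats into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)
print(columns["rank"])  # a query on one attribute touches one list only

# Hypothetical sorted column with many repeats, to show why columns compress well.
ranks = ["Ensign"] * 4 + ["Lieutenant"] * 2 + ["Captain"]
print(run_length_encode(ranks))
```

Because a column holds values of one type that often repeat, seven rank values shrink to three pairs here, and a query that needs only `rank` never reads the other columns.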

In the context of Star Trek and the Starship Enterprise, column databases could be used for various applications:

1.  Store and analyze vast amounts of data from long-range scanners. Each column could represent a different type of measurement (e.g., radiation levels, temperature, gravitational fields) across millions of data points. This structure would allow for efficient storage and quick analysis of trends across vast regions of space.
2.  Track various performance indicators for crew members over time. Columns could include attributes like mission success rates, efficiency ratings, and health metrics. This would enable quick analysis of crew performance trends and identify areas for improvement.
3.  Store detailed maintenance records for all systems on the Enterprise. Each column could represent a different component or system, with entries for maintenance dates, issues encountered, and parts replaced. This would allow for efficient querying of maintenance history for specific systems across the entire fleet.

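Column databases excel in scenarios involving large-scale analytics, data warehousing, and business intelligence. They are particularly useful when dealing with time-series data or when performing aggregations and calculations on specific attributes across massive datasets. Popular column databases include Google BigQuery and Amazon Redshift. Apache Cassandra is often grouped with them, though it is more precisely a wide-column store: it organizes data into partitioned rows with flexible columns and is geared toward high-volume reads and writes by key rather than analytical scans.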

While column databases offer significant advantages for analytical workloads, they may be less efficient for transactional processing compared to traditional row-oriented relational databases. The choice between different database models depends on the specific requirements of the application, such as query patterns, data volume, and performance needs.

### Table: Logical Models Compared

| Aspect | Relational | Document (JSON) | Graph | Column |
| --- | --- | --- | --- | --- |
| Structure | Multiple tables with rows and columns | Documents with nested key-value pairs | Nodes and edges | Columnar storage of data |
| Schema | Fixed, well-defined schema | Flexible, semi-structured schema | Flexible, schema-less or schema-optional | Flexible schema, column-oriented |
| Data Integrity | High, enforced through constraints and keys | Medium, depends on application logic | Medium to high, depending on the graph database | Medium, typically enforced at the application level |
| Referential Integrity | Strong, enforced through foreign key constraints | Weak, typically managed at the application level | Strong, inherent in edge relationships | Weak, typically managed at the application level |
| Query Language | SQL | NoSQL query languages (e.g., MongoDB query syntax) | Graph query languages (e.g., Cypher, Gremlin) | SQL-like languages, often with extensions for columnar operations |
| Relationships | Explicit, using foreign keys | Implicit, through nested documents | Explicit, through edges | Implicit, through column associations |
| Scalability | Good, but can be complex and costly at scale | High, designed for horizontal scaling | High, especially for complex, interconnected data | Very high, particularly for read-heavy analytical workloads |
| Use Cases | Traditional business applications, transactional data | Flexible applications, content management systems | Social networks, recommendation systems, IoT | Data warehousing, business intelligence, big data analytics |
| Performance | Generally efficient, but can slow with complex joins | Fast for read-heavy operations | Efficient for traversing relationships | Excellent for analytical queries and aggregations |
| Example | MySQL, PostgreSQL | MongoDB, CouchDB | Neo4j, ArangoDB | Apache Cassandra, Google BigQuery |

In this table:

-   **Data Integrity** refers to the accuracy, consistency, and reliability of data stored in a database, ensuring that it remains complete, accurate, and valid over its lifecycle.
-   **Referential Integrity** is a specific aspect of data integrity that ensures relationships between tables remain consistent. It's typically enforced through foreign key constraints in relational databases.
-   **Query Language** is the language or syntax used to retrieve, manipulate, and manage data stored in a database.
-   **Relationships** are the connections or associations between different entities or objects in a database, representing how they are related to each other.
-   **Scalability** is the ability of a database system to handle increasing amounts of data and accommodate growth in terms of data volume, traffic, and complexity.
-   **Use Cases** are the specific scenarios, applications, or problems that a particular database system is well-suited to address or solve.
-   **Performance** is the speed, efficiency, and responsiveness of a database system in executing queries, retrieving data, and performing various operations.

## Commander Spock's Database Lesson: Models and Starfleet Use Cases

*Scene: A young Starfleet cadet enters Commander Spock's office aboard the Enterprise for a mentoring session on database systems.*

Cadet: Good morning, Commander Spock. I'm here to learn more about database modeling and when to use different types of databases for Starfleet operations.

Spock: Greetings, Cadet. Your pursuit of knowledge is logical. Let us begin with the relationship between conceptual, logical, and physical models in database design.

Cadet: I've heard these terms, but I'm not entirely clear on how they relate to each other, especially in Starfleet contexts.

Spock: Indeed. Consider these models as three levels of abstraction in database design, each serving a specific purpose and audience within Starfleet.

1.  The conceptual model is the highest level of abstraction. It represents the overall structure of the data in a business context, independent of any database management system. For Starfleet, this might include entities like 'Starship', 'Crew Member', 'Mission', and 'Alien Species'.
2.  The logical model is a more detailed representation that includes all entities, attributes, relationships, and keys. For instance, it would specify that a 'Starship' has attributes like 'Registry Number', 'Class', and 'Current Mission', and that it has a many-to-many relationship with 'Crew Members'.
3.  The physical model is the most specific, detailing how the logical model will be implemented in a particular database management system. It includes tables, columns, data types, indexes, and other physical storage details. This is what our database administrators would work with directly.

Cadet: I see. So we start with a broad concept of Starfleet operations and gradually refine it into something that can be implemented in our computer systems?

Spock: Precisely. The conceptual model might be used in discussions with Starfleet Command, the logical model with our science officers and system designers, and the physical model with our computer systems specialists.
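
As an aside to the dialogue, the physical level Spock describes can be sketched with Python's built-in `sqlite3` module. This is one possible physical model for the 'Starship' and 'Crew Member' entities, assuming SQLite as the DBMS. The table names, column names, data types, and sample rows are illustrative choices, not part of the lesson itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

conn.executescript("""
CREATE TABLE starship (
    registry_number TEXT PRIMARY KEY,
    ship_class      TEXT NOT NULL,
    current_mission TEXT
);
CREATE TABLE crew_member (
    crew_id   INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL,
    crew_rank TEXT
);
-- Junction table: implements the many-to-many relationship from the logical model.
CREATE TABLE assignment (
    registry_number TEXT    NOT NULL REFERENCES starship(registry_number),
    crew_id         INTEGER NOT NULL REFERENCES crew_member(crew_id),
    PRIMARY KEY (registry_number, crew_id)
);
-- Index: a purely physical detail that speeds up lookups by crew member.
CREATE INDEX idx_assignment_crew ON assignment(crew_id);
""")

# Sample rows, for illustration only.
conn.execute("INSERT INTO starship VALUES ('NCC-1701', 'Constitution', 'Exploration')")
conn.executemany(
    "INSERT INTO crew_member VALUES (?, ?, ?)",
    [(1, "James Kirk", "Captain"), (2, "Spock", "Commander")],
)
conn.executemany(
    "INSERT INTO assignment VALUES (?, ?)",
    [("NCC-1701", 1), ("NCC-1701", 2)],
)

crew = conn.execute("""
    SELECT c.full_name
    FROM crew_member AS c
    JOIN assignment AS a ON a.crew_id = c.crew_id
    WHERE a.registry_number = 'NCC-1701'
    ORDER BY c.crew_id
""").fetchall()
print(crew)
```

The conceptual model would only name the entities, and the logical model would add attributes and relationships. Data types, the junction table, and the index belong to the physical level.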

Cadet: That's clear. But how do we decide which logical model to use? I've heard about relational and non-relational databases, but I'm not sure when to use each for our various Starfleet systems.

Spock: An astute question. The choice between relational and non-relational models depends on the specific requirements of the application. However, it's crucial to note that most modern Relational Database Management Systems (RDBMS) used by Starfleet can support non-relational logical models, such as JSON and key-value structures, within a relational framework. This flexibility allows us to leverage the strengths of both paradigms. Let me elaborate on some Starfleet-specific use cases.

Relational models are most appropriate when:

1.  We have structured data with clear relationships. For example, our personnel database, which links crew members to their assignments, qualifications, and medical records.
2.  We need strong consistency and ACID compliance. This is crucial for our ship's systems, where data integrity can be a matter of life and death.
3.  We require complex queries and transactions. For instance, our mission planning system needs to join data from multiple sources like star charts, crew rosters, and equipment inventories.

Cadet: Could you give a specific example of when a relational database would be the best choice in Starfleet operations?

Spock: Certainly. Consider our ship's life support system. It requires strict data integrity, has clear relationships (e.g., decks have rooms, rooms have environmental controls), and needs complex queries (e.g., calculating oxygen levels, managing power distribution). A relational database would be most suitable here, allowing for accurate and efficient management of these critical systems.

Cadet: That's clear. What about non-relational models?

Spock: Non-relational or NoSQL models are more appropriate in scenarios such as:

1.  Handling large volumes of unstructured or semi-structured data. For example, our long-range sensor arrays collect vast amounts of varied data that don't always fit a predefined structure.
2.  When we need high scalability and performance for simple read/write operations. Our ship's computer logs, which record millions of events per second across all systems, benefit from this model.
3.  When our data schema is likely to evolve rapidly. As we encounter new phenomena and species, our xenobiology database needs to adapt quickly to store new types of information.

However, remember that modern RDBMS can often handle these scenarios as well, using features like JSON data types or key-value storage within a relational framework.

Cadet: Could you provide an example of when a non-relational model would be the best choice in Starfleet?

Spock: Indeed. Consider our Stellar Cartography system. It needs to handle various types of spatial and temporal data, from basic star positions to complex phenomena like temporal anomalies and subspace rifts. The data structures vary greatly and evolve as we make new discoveries. In this case, a document-based model within our RDBMS, using JSON to store flexible, schema-less data, could be more suitable. It allows for adaptable data storage while still benefiting from the ACID properties and querying capabilities of our relational systems.

Cadet: That's fascinating. Are there situations where you might use both relational and non-relational models in the same Starfleet system?

Spock: Yes, this is quite common in our advanced systems. For instance, our main computer uses a relational model for core ship functions and crew data, while employing non-relational models for sensor logs and scientific data analysis. Modern RDBMS allow us to do this within a single system, providing the benefits of both paradigms. For example, consider the following table from a relational database, which stores (non-relational) sensor log data:

**Sensor_Logs Table:**

| LogID | Timestamp | SensorType | LogData |
| --- | --- | --- | --- |
| 1 | 2023-06-24 14:30:00 | long-range | {"location": {"coordinates": [123.45, -67.89], "sector": "Alpha Quadrant"}, "readings": {"radiation": 0.02, "subspace_distortions": 3, "nearby_vessels": 1}} |
| 2 | 2023-06-24 14:30:05 | internal | {"deck": 7, "section": "Engineering", "readings": {"temperature": 20.5, "humidity": 45, "antimatter_containment": "stable"}} |

This approach allows us to index and query the structured data (LogID, Timestamp, SensorType) efficiently, while still maintaining the flexibility to store varied log data structures.
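
The Sensor_Logs table above can be reproduced as a small sketch with Python's built-in `sqlite3` module. It assumes the underlying SQLite library includes its JSON functions (`json_extract`), which is the case in most current Python distributions but is not guaranteed. The snake_case column names and the index name are illustrative choices.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sensor_logs (
        log_id        INTEGER PRIMARY KEY,
        log_timestamp TEXT NOT NULL,
        sensor_type   TEXT NOT NULL,
        log_data      TEXT NOT NULL   -- JSON document stored as text
    )
""")
# Index on a structured column so filtering by sensor type stays fast.
conn.execute("CREATE INDEX idx_sensor_type ON sensor_logs(sensor_type)")

logs = [
    (1, "2023-06-24 14:30:00", "long-range", {
        "location": {"coordinates": [123.45, -67.89], "sector": "Alpha Quadrant"},
        "readings": {"radiation": 0.02, "subspace_distortions": 3, "nearby_vessels": 1},
    }),
    (2, "2023-06-24 14:30:05", "internal", {
        "deck": 7,
        "section": "Engineering",
        "readings": {"temperature": 20.5, "humidity": 45, "antimatter_containment": "stable"},
    }),
]
conn.executemany(
    "INSERT INTO sensor_logs VALUES (?, ?, ?, ?)",
    [(log_id, ts, kind, json.dumps(data)) for log_id, ts, kind, data in logs],
)

# Filter on a relational column, then reach into the JSON document for one value.
row = conn.execute("""
    SELECT log_id, json_extract(log_data, '$.readings.radiation')
    FROM sensor_logs
    WHERE sensor_type = 'long-range'
""").fetchone()
print(row)
```

The two log entries have different JSON shapes, yet both fit in the same column. Only the query needs to know which keys to expect.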



Cadet: Thank you, Commander Spock. This has been incredibly helpful. One last question: How do you recommend I practice applying these concepts?

Spock: I suggest you start by designing conceptual, logical, and physical models for a simple system, such as a shuttlecraft maintenance log. Then, try to identify scenarios in various Starfleet operations where different data models would be appropriate. Remember, the key is to match the data model to the specific needs of the application, while also considering the capabilities of our advanced RDBMS to support multiple models.

Cadet: I'll definitely do that. Thank you for your time and wisdom, Commander Spock.

Spock: You're welcome, Cadet. Live long and prosper in your database endeavors.

## Data Model Quiz
Click the following cell to launch a data model quiz.

```python
!wget https://github.com/brendanpshea/colab-utilities/raw/main/data_model_quiz.py -nc -q
from data_model_quiz import logical_data_models_quiz
logical_data_models_quiz()
```

Welcome to the Star Trek Logical Data Models Quiz!
For each description, enter the corresponding data model:
1. Flat, 2. Relational, 3. Document, 4. Graph

Type 'quit' to exit the game.


Statement: Suitable for small datasets.
Your answer (1-4): 1
Correct! Engage!


Statement: Data is represented as nodes and edges.
Your answer (1-4): 4
Correct! Engage!


Statement: Good for storing unstructured data.
Your answer (1-4): quit
Game exited early. Your final score: 2/10


## Physical Models: Choosing a DBMS


When it comes to storing and managing data for your application, you'll need to choose a suitable database system. But before diving into the different options, let's understand what a **database management system (DBMS)** is.

A DBMS is software that allows you to create, organize, and interact with databases. It provides a way to store, retrieve, and manage data efficiently. Think of it as a sophisticated filing cabinet that helps you keep your data organized and easily accessible.

Now, let's explore the different types of databases you can choose from:

1.  **File Database (e.g., SQLite)**: Imagine you're developing a small application for a personal tricorder device. In such cases, a file database like SQLite could be a great choice. SQLite is lightweight, easy to set up, and stores the entire database as a single file on your device. It's perfect for applications that don't require multi-user access or complex data management, just like a tricorder's simple data storage needs.
2.  **Personal Database (e.g., Microsoft Access)**: If you're building a small-scale application for managing crew member records on a single computer, a personal database like Microsoft Access might be suitable. Access provides a user-friendly interface for creating tables, forms, and reports. It's great for managing data that doesn't require extensive scalability or advanced features, similar to maintaining a local database on a starship's computer.
3.  **Open-Source Client-Server Database (e.g., MySQL, PostgreSQL)**: When your application needs to handle multiple users and requires more advanced functionality, like managing the Enterprise's crew and mission data, an open-source client-server database like MySQL or PostgreSQL could be the way to go. These databases are powerful, scalable, and offer a wide range of features. They are ideal for web applications, content management systems, and enterprise-level solutions, much as the Enterprise relies on robust databases to manage its vast amount of data across multiple systems and users.
4.  **Proprietary Client-Server Database (e.g., Oracle, SQL Server)**: For large-scale, mission-critical applications that demand high performance, reliability, and advanced features, such as managing the entire Starfleet's data, proprietary client-server databases like Oracle or SQL Server are often the choice. These databases come with extensive support, comprehensive documentation, and a wide array of tools and utilities. They are trusted by large organizations, similar to how Starfleet would rely on top-notch database systems to handle its critical data.
5.  **NoSQL Database (e.g., MongoDB, Cassandra)**: In some cases, your application might deal with unstructured or semi-structured data, such as sensor readings from the Enterprise's scientific instruments or logs from various starship systems. NoSQL databases, like MongoDB or Cassandra, are designed to handle such data efficiently. They provide flexibility, scalability, and high performance for handling large volumes of diverse data types, much as the Enterprise would use specialized databases to store and analyze complex data from its missions.
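
The "single file" property of the first option in this list is easy to see in practice. The sketch below uses Python's built-in `sqlite3` module to create a database, close it, and reopen it from the same file. The file name `tricorder.db`, the `scans` table, and the sample reading are all hypothetical.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as folder:
    db_path = os.path.join(folder, "tricorder.db")  # hypothetical file name

    # First session: create the database file and store one reading.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE scans (scan_id INTEGER PRIMARY KEY, subject TEXT, reading REAL)"
    )
    conn.execute(
        "INSERT INTO scans (subject, reading) VALUES (?, ?)",
        ("unknown mineral", 4.2),
    )
    conn.commit()
    conn.close()

    # The whole database now lives in this one ordinary file.
    file_exists = os.path.exists(db_path)

    # Second session: reopen the file and read the data back.
    conn = sqlite3.connect(db_path)
    saved = conn.execute("SELECT subject, reading FROM scans").fetchall()
    conn.close()

print(file_exists, saved)
```

No server process is involved at any point, which is why this style of database suits a single-user device. A client-server system such as PostgreSQL would instead require a running server that the application connects to.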

Different use cases call for different choices of DBMS. For example, for managing simple crew member records on a single computer, a personal database like Microsoft Access would suffice. However, for handling the vast amount of data generated by the Enterprise's sensors, scientific instruments, and logs, a combination of an open-source client-server database like PostgreSQL and a NoSQL database like MongoDB would be more suitable.

### Table: Leading Database Management Systems

| DBMS | Description | Supported Logical Models |
| --- | --- | --- |
| Oracle | A powerful, enterprise-level database system known for its scalability and reliability, often used by large organizations. | Relational, JSON, XML, Spatial |
| MySQL | A popular, open-source database system widely used for web applications, offering simplicity and good performance. | Relational, JSON (MySQL 5.7+) |
| Microsoft SQL Server | A robust database system developed by Microsoft, providing scalability, security, and integration with other Microsoft products. | Relational, JSON (SQL Server 2016+), XML, Spatial |
| Microsoft Access | A user-friendly, small-scale database system included in the Microsoft Office suite, suitable for personal projects and small businesses. | Relational |
| SQLite | A lightweight, file-based database system commonly used as an embedded database in applications, known for its simplicity and efficiency. | Relational, JSON (partial) |
| MongoDB | A flexible, document-oriented NoSQL database that stores data in JSON-like formats, designed for scalability and agile development. | Document (JSON), Geospatial |
| Apache Cassandra | A highly scalable, distributed NoSQL database built to handle large amounts of structured data across multiple servers. | Wide Column Store |
| Amazon DynamoDB | A fully managed NoSQL database service provided by Amazon Web Services, offering automatic scaling and low latency data access. | Key-value, Document (JSON) |
| PostgreSQL | A powerful, open-source database system known for its reliability, robustness, and support for advanced data types. | Relational, JSON, XML, Spatial, Key-value (Hstore) |
| IBM Db2 | An enterprise-level database system developed by IBM, offering scalability, reliability, and advanced analytics capabilities. | Relational, JSON, XML, Spatial |

## Advanced Data Storage Concepts

In this section, we'll explore four fundamental concepts in data management: databases, data lakes, data warehouses, and data marts. We'll build on the basic idea of a relational database and see how these other concepts expand our ability to store and use data effectively.

### (Relational) Databases: The Foundation of Data Management

As we've learned, a database is an organized collection of data stored and accessed electronically. The most common type of database is a relational database, which organizes data into tables with rows and columns. This structure is based on the relational model, proposed by E.F. Codd in 1970.

Key features of relational databases include:

1.  **Data integrity.** Ensuring data accuracy and consistency
2.  **ACID properties.** Guaranteeing reliable transaction processing
3.  **Normalization.** Organizing data to reduce redundancy
4.  **Indexing.** Improving query performance
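
Several of these features can be observed in a few lines of Python using the built-in `sqlite3` module. The sketch below is illustrative only: the `starship` and `crew` tables and their rows are made up, and SQLite stands in for whichever relational DBMS you might actually use. It shows a foreign key constraint rejecting a bad row and the enclosing transaction being rolled back as a unit (the "A" in ACID).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("CREATE TABLE starship (ship_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)")
conn.execute("""
    CREATE TABLE crew (
        crew_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        ship_id INTEGER NOT NULL REFERENCES starship(ship_id)
    )
""")
# Indexing: speeds up "who serves on ship X?" lookups.
conn.execute("CREATE INDEX idx_crew_ship ON crew(ship_id)")

conn.execute("INSERT INTO starship VALUES (1, 'Enterprise')")
conn.commit()

try:
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute("INSERT INTO crew VALUES (1, 'Uhura', 1)")  # valid on its own
        conn.execute("INSERT INTO crew VALUES (2, 'Sulu', 99)")  # no ship 99: rejected
except sqlite3.IntegrityError as error:
    print("Transaction rolled back:", error)

crew_count = conn.execute("SELECT COUNT(*) FROM crew").fetchone()[0]
print(crew_count)
```

Because both inserts belong to one transaction, the valid first row is discarded along with the invalid second one, leaving the `crew` table empty. The database never ends up in a half-updated state.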

While relational databases excel at handling structured data with clear relationships, they can be less flexible when dealing with unstructured or semi-structured data. This limitation led to the development of other data storage concepts.

### Data Lakes: Storing Raw, Unstructured Data

A data lake is a storage repository that holds a vast amount of raw data in its native format until it's needed. Unlike a relational database, a data lake can store structured, semi-structured, and unstructured data.

Key characteristics of data lakes include:

1.  The structure of the data is not defined until it's retrieved (**schema-on-read**).
2.  Can store any type of data (text files, images, sensor data, etc.).
3.  Can easily grow to accommodate large volumes of data (**scalability**).
4.  Often uses commodity hardware for cost-effective storage.

In a Star Trek context, a data lake might store diverse information like sensor logs, crew reports, and alien artifact scans, all in their original formats.

Data lakes are particularly useful for big data analytics and machine learning projects, where analysts might want to explore data in various ways not predetermined by a fixed schema.
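
Schema-on-read can be illustrated with a toy example in plain Python. Here a dictionary stands in for the data lake, which is a simplifying assumption since real data lakes sit on distributed file or object storage. The file names, contents, and the `read_with_schema` helper are all hypothetical. Everything is stored as raw text, and structure is imposed only when a file is read.

```python
import csv
import io
import json

# Raw "files" kept exactly as they arrived, with no schema enforced on write.
data_lake = {
    "sensor_log.json": '{"sensor": "long-range", "radiation": 0.02}',
    "crew_report.txt": "Away team returned safely from the surface.",
    "artifact_scans.csv": "artifact,mass_kg\nobelisk,1200\norb,3.5\n",
}


def read_with_schema(name):
    """Apply a structure at read time, chosen by the reader (hypothetical helper)."""
    raw = data_lake[name]
    if name.endswith(".json"):
        return json.loads(raw)
    if name.endswith(".csv"):
        return [
            {"artifact": row["artifact"], "mass_kg": float(row["mass_kg"])}
            for row in csv.DictReader(io.StringIO(raw))
        ]
    return {"text": raw}  # unstructured text is passed through untouched


scans = read_with_schema("artifact_scans.csv")
total_mass = sum(scan["mass_kg"] for scan in scans)
print(total_mass)
```

A different analyst could read the same CSV text with a different schema, for example keeping masses as strings, without anything in the lake having to change.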

### Data Warehouses: Integrated Data for Analysis

A data warehouse is a system used for reporting and data analysis. It's a central repository of integrated data from one or more disparate sources. Data warehouses use a process called ETL (Extract, Transform, Load) to gather data from various sources, clean and standardize it, and load it into the warehouse.
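
A miniature version of ETL can be sketched in Python with the built-in `sqlite3` module acting as the warehouse. The two source formats, the cleaning rules in `transform`, and the `missions` table are assumptions made for this example. Real ETL pipelines typically rely on dedicated tooling.

```python
import sqlite3

# Extract: two hypothetical sources that describe missions in inconsistent ways.
ship_log_rows = [{"ship": "enterprise", "mission": "Survey", "duration_days": "12"}]
personnel_rows = [("USS Enterprise ", "Diplomacy", 5)]


def transform(ship, mission, days):
    """Standardize ship names, mission labels, and numeric types."""
    name = ship.strip().upper()
    if not name.startswith("USS "):
        name = "USS " + name
    return (name, mission.strip().lower(), int(days))


# Transform: bring both sources into one consistent format.
cleaned = [
    transform(row["ship"], row["mission"], row["duration_days"])
    for row in ship_log_rows
]
cleaned += [transform(*row) for row in personnel_rows]

# Load: write the cleaned rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE missions (ship TEXT, mission_type TEXT, duration_days INTEGER)"
)
warehouse.executemany("INSERT INTO missions VALUES (?, ?, ?)", cleaned)
warehouse.commit()

loaded = warehouse.execute(
    "SELECT ship, SUM(duration_days) FROM missions GROUP BY ship"
).fetchall()
print(loaded)
```

Without the transform step, "enterprise" and "USS Enterprise " would be counted as two different ships, and the totals would be wrong.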

Key features of data warehouses include:

1.  Organized around major subjects (like customers or products)
2.  Data from different sources is merged into a consistent format
3.  Maintains historical data for trend analysis
4.  Data is stable (**non-volatile**); updates are made in batches, not in real-time

Data warehouses often use a star schema or snowflake schema, which are variations of the relational model optimized for analytical queries. These schemas typically have a central fact table connected to multiple dimension tables.
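
A star schema is easier to picture with a small example. The sketch below uses Python's built-in `sqlite3` module, with one fact table and two dimension tables. All table names, column names, and figures are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive context.
CREATE TABLE dim_ship (ship_key INTEGER PRIMARY KEY, ship_name TEXT);
CREATE TABLE dim_mission_type (type_key INTEGER PRIMARY KEY, type_name TEXT);

-- Fact table: one row per mission, holding measurements plus keys to the dimensions.
CREATE TABLE fact_mission (
    mission_key   INTEGER PRIMARY KEY,
    ship_key      INTEGER REFERENCES dim_ship(ship_key),
    type_key      INTEGER REFERENCES dim_mission_type(type_key),
    duration_days INTEGER
);

INSERT INTO dim_ship VALUES (1, 'Enterprise'), (2, 'Reliant');
INSERT INTO dim_mission_type VALUES (1, 'Exploration'), (2, 'Diplomacy');
INSERT INTO fact_mission VALUES (1, 1, 1, 30), (2, 1, 2, 10), (3, 2, 1, 20);
""")

# Typical analytical query: aggregate a fact measure, grouped by a dimension attribute.
totals = conn.execute("""
    SELECT s.ship_name, SUM(f.duration_days) AS total_days
    FROM fact_mission AS f
    JOIN dim_ship AS s ON s.ship_key = f.ship_key
    GROUP BY s.ship_name
    ORDER BY s.ship_name
""").fetchall()
print(totals)
```

Every dimension connects directly to the central fact table, which gives the schema its star shape. A snowflake schema would further split the dimensions into sub-tables.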

In our Star Trek example, Starfleet might use a data warehouse to analyze trends across all starship missions, integrating data from individual ship logs, personnel records, and scientific findings.

### Data Marts: Focused Subsets of Data Warehouses

A data mart is a subset of a data warehouse oriented to a specific business line or team. It's essentially a condensed version of a data warehouse focused on a particular subject area.

Key aspects of data marts include:

1.  Tailored to a particular business function or department
2.  Contains less data than a full data warehouse
3.  Due to their focused nature, queries often run faster
4.  Often managed by specific business units

There are two main approaches to creating data marts:

-   Top-down: Created from an existing enterprise-wide data warehouse
-   Bottom-up: Created first, potentially integrated into a larger warehouse later

In our Star Trek scenario, while Starfleet might have a central data warehouse, individual departments like Security or Engineering might have their own data marts tailored to their specific analytical needs.
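
The top-down approach can be sketched in Python with the built-in `sqlite3` module by deriving a department-specific table from a warehouse table. The `warehouse_incidents` table, its rows, and the `engineering_mart` name are hypothetical, and a single in-memory database stands in for what would normally be separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Stand-in for the central, Starfleet-wide warehouse.
CREATE TABLE warehouse_incidents (
    incident_id INTEGER PRIMARY KEY,
    department  TEXT,
    ship_system TEXT,
    severity    INTEGER
);
INSERT INTO warehouse_incidents VALUES
    (1, 'Engineering', 'warp core',   3),
    (2, 'Security',    'brig',        1),
    (3, 'Engineering', 'transporter', 2),
    (4, 'Medical',     'sickbay',     1);

-- Top-down data mart: only the rows and columns Engineering cares about.
CREATE TABLE engineering_mart AS
    SELECT incident_id, ship_system, severity
    FROM warehouse_incidents
    WHERE department = 'Engineering';
""")

mart_rows = conn.execute(
    "SELECT ship_system, severity FROM engineering_mart ORDER BY severity DESC"
).fetchall()
print(mart_rows)
```

The mart holds fewer rows and columns than the warehouse, so Engineering's queries have less data to scan and no need to filter by department.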

## Review Questions

Choose 3 (or more) of the following questions to answer in detail. Some of these might require outside research, and you are welcome to use things like Stack Overflow, Wikipedia, various chatbots, etc. However, the goal is that *you* should be able to explain these things (for example, on an exam).

1. How can the Starship Enterprise utilize databases to optimize its crew management and mission planning processes?
2. Discuss the benefits and drawbacks of using a relational database versus a flat file for storing and managing the Enterprise's inventory data.
3. Imagine you are a software engineer tasked with developing a new application for the Enterprise. How would you determine which logical data model (relational, document, or graph) to use?
4. As a data analyst aboard the Enterprise, how would you use data mining techniques on a document database to identify patterns and insights from unstructured mission reports?
5. Explain how the Enterprise can leverage a graph database to analyze the relationships between planets, species, and diplomatic treaties to inform its decision-making process.
6. Discuss the importance of data consistency and integrity in the context of the Enterprise's scientific experiments and research data.
7. How can the Enterprise ensure data security and privacy when storing sensitive information, such as personal crew member details or classified mission files, in a database?

## Review With Quizlet
Click the following cell to launch a Quizlet review for this chapter.

```
%%html
<iframe src="https://quizlet.com/819297827/learn/embed?i=psvlh&x=1jj1" height="600" width="100%" style="border:0"></iframe>
```

## Glossary

| Term | Definition |
| --- | --- |
| Data | Raw, unorganized facts and figures |
| Information | Processed, structured, and contextualized data |
| Knowledge | Understanding and application of information |
| Database | Structured collection of data organized for efficient storage, retrieval, and manipulation |
| Flat (linear) file | Simple data structure that stores data in a single table or spreadsheet |
| Data consistency | Ensuring data remains consistent and free of contradictions across all instances |
| Data integrity | Maintaining accuracy, reliability, and overall quality of data |
| Data model (general) | Standardized representation of data structure, relationships, and constraints |
| Conceptual Model | High-level model focusing on overall data structure and relationships, without implementation specifics |
| Logical Model | Detailed, technology-independent representation of data structure, refining the conceptual model |
| Physical Model | Technology-specific representation of the logical model, considering database management system and performance |
| Relational Model | Organizes data into tables (relations) with rows (tuples) and columns (attributes) |
| Table (relational model) | Collection of related data entries organized into rows and columns |
| Attribute (relational model) | Column in a table representing a specific data characteristic |
| Primary Key (relational model) | Column(s) uniquely identifying each row in a table |
| Foreign Key (relational model) | Column(s) in one table referring to the primary key of another table |
| Document Model | Stores data as semi-structured documents, often using JSON format |
| JSON | Lightweight, text-based data format for representing and storing data as key-value pairs and arrays |
| Key (document model) | String identifying a value in a key-value pair within a JSON document |
| Value (document model) | Data associated with a key in a key-value pair within a JSON document |
| Graph Model | Represents data as nodes (vertices) and edges (relationships) |
| Node (graph model) | Represents an entity in a graph database |
| Edge (graph model) | Represents a connection or relationship between nodes in a graph database |
| Database Management System (DBMS) | Software for creating, organizing, and interacting with databases |
| Structured data | Data that is organized in a predefined manner, typically in tabular format with rows and columns, making it easy to search and analyze. |
| Unstructured data | Data that does not have a predefined data model or organization, such as text, images, videos, and social media posts. |
| NoSQL database | A non-relational database designed to handle large volumes of unstructured or semi-structured data, offering flexibility and scalability. |
| Data Lake | A centralized repository that allows the storage of all data types (structured, semi-structured, and unstructured) at any scale in their raw format. |
| Data Warehouse | A centralized repository designed for storing structured data from multiple sources, optimized for querying and analysis. |
| Data Mart | A subset of a data warehouse, focused on a specific business line or team, designed to provide specific insights. |
| Scalability | The ability of a system to handle an increasing amount of work or its potential to be enlarged to accommodate growth. |
| Schema-on-read | An approach where the data schema is applied at the time of reading the data, rather than when the data is written, providing flexibility in data use. Associated with data lakes. |