
## An Intro to Data and Databases

In the digital age, we're surrounded by so much information that it has become virtually impossible to comprehend its scale without the aid of computers. This ocean of information is made up of 'data', a term you'll hear frequently in computer science and one that names the fundamental unit of our digital world. This chapter will introduce you to the concept of data, its types, and its critical role in computer science.

In any field, data serves as the basis for decision-making and problem-solving. For instance, meteorologists predict weather patterns based on historical climate data, marketers analyze consumer behavior data to tailor their advertising strategies, doctors make use of medical history data to diagnose and treat patients, and financial analysts forecast market trends using financial data. All these examples involve data analysis and interpretation, which are crucial components of computer science.

In computer science, we employ data in myriad ways: creating algorithms, designing software applications, and most notably, building and managing databases. But what exactly is a database? How does it organize vast amounts of data so efficiently? And how can we interact with a database to retrieve the information we need? To answer these questions, we delve into the world of databases and the language that we use to interact with them, Structured Query Language (SQL).

In this chapter, we'll explore what a database is, why it's an essential tool in handling data, and the various types of databases that exist. We'll also introduce you to SQL, discussing how to use it to retrieve data from a database efficiently. We'll primarily focus on SQL SELECT statements, which form the backbone of data retrieval in SQL.

Through this chapter, you'll not only learn about the concept and importance of data and databases but also acquire practical skills that are widely applicable in the field of computer science. By the end of it, you'll be well-equipped with the knowledge to navigate the world of data and databases and a newfound appreciation for the art and science of handling data in computer science.

## What is Data?
**Data** is a collection of facts in the form of words, numbers, images, or even more abstract concepts. It is raw and unprocessed information that, when processed or structured, can provide meaningful insights. For example, if we record the daily temperature of a city, each recorded number is a piece of data. Another example might be the individual scores of students on a test, which can be processed to find the class average.

In the field of computer science, data is foundational and is the core subject of many, if not most, activities. It is the input to algorithms, the content stored and retrieved in databases, and the information transmitted across networks. Data is critical in a variety of fields and use cases, from decision-making in business environments to scientific research, from machine learning algorithms to web applications, and beyond.

Now, to further understand data, we can categorize it into two types: structured and unstructured data.

1.  **Structured Data:** This type of data is organized and formatted in a way that it's easily searchable in relational databases. Structured data adheres to a model that defines what fields of data exist and what types of data they hold. For example, an address book containing names, phone numbers, and addresses, where each piece of data has a certain type and is stored in a specific field.

2.  **Unstructured Data:** This type of data has no specific format or organization, making it more difficult to collect, process, and analyze. Examples of unstructured data include text files, images, videos, emails, social media posts, and web pages.
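
To make the contrast concrete, here is a minimal Python sketch. The names, phone numbers, and the free-form note are made up for illustration; the point is only that structured records can be queried by field, while unstructured text can only be searched as raw characters.

```python
# Structured: every record has the same named fields with predictable types.
# (All names and numbers here are hypothetical.)
address_book = [
    {"name": "Ada Lovelace", "phone": "555-0100", "city": "London"},
    {"name": "Alan Turing", "phone": "555-0199", "city": "Manchester"},
]

# Unstructured: similar facts buried in free-form text.
note = "Called Ada (London, 555-0100) and then Alan, who is in Manchester now."

# With structured data, a question maps directly onto a field lookup.
londoners = [person["name"] for person in address_book if person["city"] == "London"]
print(londoners)  # ['Ada Lovelace']

# With unstructured data, we can only search the raw text; the program
# does not know which words are names, cities, or phone numbers.
print("London" in note)  # True
```

Turning the note into something as queryable as the address book would require extra processing, which is why unstructured data is harder to analyze.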

The importance of data in computer science cannot be overstated. It fuels the processes that drive decision-making algorithms, allows us to analyze and forecast trends, and forms the basis of machine learning and artificial intelligence. Understanding how to work with data - from basic data structures in programming languages to complex databases - is a crucial part of computer science education.

## What is a Database?
A **database** is a highly structured collection of data that is stored, managed, and accessed electronically. It's a systematic and organized way to store, retrieve, and manipulate data, facilitating efficient information management. In the relational databases this chapter focuses on, data is organized into tables, each comprising rows and columns, forming a grid-like structure.

One common way to understand the value of databases is to contrast them with flat files. A **flat file** is a plain text or binary file that contains data but lacks the structured relationships between data elements that a database maintains. For example, you could store customer data in a flat file, with each line of the text file representing a customer and different pieces of information (like name, address, and phone number) separated by a delimiter such as a comma. This format is simple and can work for very small sets of data. Spreadsheets (like Microsoft Excel or Google Sheets) are commonly used as flat files.

However, as the volume of data grows, flat files become less practical. Retrieving specific information from a large flat file can be slow and resource-intensive. Furthermore, updating or modifying data in a flat file can be complex and prone to errors. Unlike databases, flat files don't support transactions (a logical unit of work that must either be entirely completed or entirely undone), concurrent access (multiple users accessing the data simultaneously), or constraints (rules governing the data), making data integrity and consistency harder to maintain.  On the other hand, databases are designed to manage, store, and retrieve large volumes of data efficiently. They are equipped with tools and functionalities to handle complex queries, support concurrent user access, ensure data consistency and security, and provide robust data recovery and backup systems.
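
The difference can be sketched in a few lines of Python. This is an illustration under assumptions, not a benchmark: the customer rows are invented, the "flat file" is an in-memory string rather than a file on disk, and the database is a temporary in-memory SQLite database.

```python
import csv
import io
import sqlite3

# A tiny "flat file": one customer per line, fields separated by commas.
flat_file = io.StringIO(
    "name,city,phone\n"
    "Ada,London,555-0100\n"
    "Alan,Manchester,555-0199\n"
    "Grace,New York,555-0142\n"
)

# Flat-file lookup: read every line and test it ourselves.
flat_matches = [row["name"] for row in csv.DictReader(flat_file) if row["city"] == "London"]

# Database lookup: load the same rows into a table, then ask for what we want.
conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.execute("CREATE TABLE Customers (name TEXT, city TEXT, phone TEXT UNIQUE)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?)",
    [("Ada", "London", "555-0100"),
     ("Alan", "Manchester", "555-0199"),
     ("Grace", "New York", "555-0142")],
)
db_matches = [row[0] for row in conn.execute("SELECT name FROM Customers WHERE city = 'London'")]
print(flat_matches, db_matches)

# A constraint the flat file cannot express: duplicate phone numbers are rejected.
try:
    conn.execute("INSERT INTO Customers VALUES ('Eve', 'Paris', '555-0100')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```

Both lookups find the same customer, but only the database refuses the row that breaks the `UNIQUE` rule. With the flat file, nothing stops a program from appending a conflicting line.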

Databases are designed around a database model, the most popular of which is the relational model. **Relational databases** organize data into tables (or relations), where each table represents an entity (like Customers, Orders, Products, etc.) and each row in the table represents a record or instance of that entity. Many different relational database systems are used today, each suited to different tasks:

1.  Microsoft Access: This is an entry-level, small-scale database management system (DBMS) provided by Microsoft. It's integrated with the Microsoft Office suite and is excellent for small businesses or departments within large organizations.

2.  SQLite: SQLite is a self-contained, serverless, and zero-configuration database engine used widely in mobile apps and small to medium-sized applications. It's renowned for its simplicity and lightweight footprint.

3.  Oracle Database: Oracle is a high-end, fully-featured DBMS. It is often used in large systems and for enterprise-level applications where performance, scalability, and reliability are crucial.

4.  Microsoft SQL Server: This is another enterprise-level DBMS, provided by Microsoft. SQL Server supports a wide range of transaction processing, business intelligence, and analytics applications in corporate IT environments.

5.  PostgreSQL: PostgreSQL is a powerful, open-source relational database system with a strong reputation for reliability, data integrity, and correctness. It supports both SQL (relational) and JSON (non-relational) querying.

6.  MySQL: MySQL is one of the most popular open-source relational database systems. It's widely used for web databases and is a part of the popular LAMP (Linux, Apache, MySQL, PHP) stack for web development.

Databases are an integral part of modern computing. Their structured nature allows efficient storage, retrieval, and manipulation of data, making them vital for a broad range of applications, from small scale applications like mobile apps to large, enterprise-level systems.

## Advantages of Databases over Flat Files
Databases have several advantages over flat files:

- *Efficient Storage:* Databases provide a highly efficient way to store large amounts of data. They allow us to organize data in a structured format, making it easier to manage. For example, a university might have a database that stores information about students, courses, and grades, all organized into separate tables for efficient storage.

- *Data Retrieval:* One of the significant advantages of databases is the ease with which we can retrieve data. Thanks to their structured format, we can quickly find and extract the information we need. Using our university example, if we want to find out all the courses a particular student has taken, we can easily retrieve this information from the database.

- *Data Management:* Databases also provide robust tools for managing data. They allow us to update data, enforce security measures, and ensure that the data remains consistent and accurate. In our university database, for instance, we could easily update a student's grade or add a new course.

- *Concurrency Control:* Databases allow multiple users to interact with the data simultaneously, ensuring that transactions are processed reliably and accurately, even when the database is being accessed concurrently by multiple users.

- *Data Protection:* Databases have built-in mechanisms for data backup and recovery, ensuring that data isn't lost in case of a failure.

In modern computing, databases are vital because they allow for the efficient handling of large amounts of data. They form the backbone of many applications, ranging from web and mobile applications to enterprise software systems. Understanding databases and how to interact with them is a crucial skill in computer science and many other fields.

## A Short Introduction to SQL
**Structured Query Language**, or **SQL**, is a programming language specifically designed for managing data held in a relational database management system (RDBMS). It is used for storing, manipulating, and retrieving data stored in databases.

SQL dates back to the 1970s. It was initially developed at IBM by Donald D. Chamberlin and Raymond F. Boyce. The language was originally called SEQUEL (Structured English Query Language) but was later shortened to SQL. The first commercial version was introduced in 1979 by Relational Software, Inc., the company now known as Oracle.

SQL is a little different from other programming languages. It's what we call a **declarative language**. This means that when you write SQL, you describe what you want without having to outline a detailed sequence of steps to get it. This is different from **procedural languages** like Python or C, where you provide the computer with step-by-step instructions.

Let's consider a simple example. Suppose you're playing a game of hide-and-seek with your friends. If your friend were a database, and you were trying to "query" your friend to find out where another friend was hiding, a procedural approach might involve asking a series of yes/no questions: "Are they hiding upstairs?", "Are they in a room with a window?", etc. In contrast, with a declarative approach like SQL, you would simply ask, "Where is our friend hiding?" SQL is designed to get the information you need in one question, without the need for a step-by-step process.
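
The same contrast can be sketched in code. In this minimal example the player names and hiding places are hypothetical, and the table lives in a temporary in-memory SQLite database. The first function spells out the search step by step; the second only describes the answer it wants.

```python
import sqlite3

# Hypothetical data: (name, hiding_place) pairs for a game of hide-and-seek.
players = [("Ahsoka", "attic"), ("Ezra", "garden"), ("Sabine", "kitchen")]

# Procedural: we spell out each step of the search ourselves.
def find_procedural(name):
    for player, place in players:
        if player == name:
            return place
    return None

# Declarative: we describe the result we want and let the database find it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE HidingPlaces (Name TEXT, Place TEXT)")
conn.executemany("INSERT INTO HidingPlaces VALUES (?, ?)", players)

def find_declarative(name):
    row = conn.execute(
        "SELECT Place FROM HidingPlaces WHERE Name = ?", (name,)
    ).fetchone()
    return row[0] if row else None

print(find_procedural("Ezra"), find_declarative("Ezra"))  # garden garden
```

The SQL version says nothing about loops or comparisons. How the rows are scanned is left to the database engine.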

In practical terms, SQL is used for tasks like finding data that fits specific criteria, adding new data, updating or deleting existing data, and performing functions on the data such as adding it up or calculating averages. SQL is also used to create and modify the structure of databases themselves.

Understanding SQL is crucial for anyone who works with databases, as it is used, with minor dialect differences, by virtually all relational database systems. SQL has been around for a long time and continues to be widely used in the industry, which means learning SQL is a valuable skill that can open many doors in the tech field.

## Case Study: Database for Jedi Academy
In this case study, we will introduce the basic concepts of relational databases and SQL using the unique context of a Jedi Academy. We're dealing with a variety of intriguing characters, diverse Jedi training courses, and multifaceted relationships between students and their classes. To manage this intricate network of information, we'll be using a tool perfect for the job - a relational database.

A relational database allows us to store and manage data by organizing it into one or more tables. Each table represents a specific entity and stores relevant pieces of information about that entity, known as attributes. For the Jedi Academy, we'll create three tables:

1.  The `Students` table, representing all Jedi students. It will include each student's unique `StudentID` (an integer), `FirstName` and `LastName` (text strings, or `VARCHAR`), their `Level` (also an integer, indicating stages of Jedi training, such as 1 for Padawan, 2 for Knight, 3 for Master), and `GPA` (a float, for storing grade point average).

2.  The `Classes` table, representing different training courses. It will store the `ClassID` (an integer), `ClassName` (a text string or `VARCHAR`), `MasterName` (the name of the Jedi Master teaching the class, another `VARCHAR`), and the `RoomNumber` (another integer).

3.  The `Enrollment` table, representing the relationship between students and classes. It will include `StudentID`, `ClassID`, and `EnrollmentDate` (a `DATE`, for when the student enrolled in the class).

This database design allows us to efficiently manage and retrieve the academy's data. By leveraging SQL, we can quickly find out all classes a particular student is enrolled in, identify which students are under Master Yoda's tutelage, change the room of a class, and much more.

SQL uses several data types to define what kind of data each column in a table can store. For instance, `INTEGER` is used for whole numbers, `VARCHAR` for text strings, `FLOAT` for decimal numbers, and `DATE` for dates. Choosing the correct data type is crucial to maintaining data integrity, optimizing storage, and enabling accurate operations and comparisons.
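
As a quick check of how these declared types behave in SQLite specifically, the sketch below stores one value of each kind and asks SQLite what it actually stored, using its built-in `typeof()` function. The `TypeDemo` table and its column names are made up for this illustration, and other database systems may report types differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # scratch database for the demonstration
conn.execute(
    "CREATE TABLE TypeDemo (AnInt INTEGER, SomeText VARCHAR(50), AFloat FLOAT, ADate DATE)"
)
conn.execute("INSERT INTO TypeDemo VALUES (3, 'Padawan', 3.8, '2023-09-01')")

# typeof() reports the storage class SQLite used for each stored value.
stored_types = conn.execute(
    "SELECT typeof(AnInt), typeof(SomeText), typeof(AFloat), typeof(ADate) FROM TypeDemo"
).fetchone()
print(stored_types)  # ('integer', 'text', 'real', 'text')
```

Note that SQLite has no separate storage class for dates, so the `DATE` value is kept as text. Writing dates in `YYYY-MM-DD` form keeps them sortable, because alphabetical order then matches chronological order.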

As we proceed through this case study, we'll be delving deeper into these data types, learning to harness the power of SQL, and exploring how it brings organization and accessibility to our Jedi Academy's database.

## Setting Up SQLite Database in Jupyter Notebook

Before we start creating tables, we need to set up our database. For this case study, we'll be using SQLite, a simple yet powerful database engine, through a Jupyter Notebook. To interact with SQLite using SQL commands directly in the notebook, we'll use a feature called SQL "magic".

Here are the steps to get started:

### Install necessary libraries

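SQL magic is provided by the `ipython-sql` library, which uses SQLAlchemy to connect to databases. The cell below downgrades SQLAlchemy to a version that works with SQL magic in Colab. (Colab typically ships with `ipython-sql`; elsewhere you may need to install it with `pip install ipython-sql`.) In a Jupyter notebook cell, run: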

In [1]:
# This will downgrade SQLAlchemy so it works in Colab
!pip install sqlalchemy==1.3.24


### Load SQL magic

Once the installation cell has finished, load SQL magic by running the following command:

In [2]:
%load_ext sql

### Create the SQLite database

Now, we'll create our SQLite database. SQL magic connects to SQLite using the syntax below; if the file `jedi_academy.db` does not exist yet, SQLite creates it:

In [3]:
%sql sqlite:///jedi_academy.db
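
Outside a notebook, a similar connection can be sketched with Python's built-in `sqlite3` module. The example below is only an illustration: it uses a hypothetical file name, `demo_academy.db`, inside a temporary folder so that the real `jedi_academy.db` is left untouched.

```python
import os
import sqlite3
import tempfile

# Work in a throwaway directory so the real jedi_academy.db is untouched.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo_academy.db")  # hypothetical file name
    existed_before = os.path.exists(path)

    conn = sqlite3.connect(path)               # opens the file, creating it if needed
    conn.execute("CREATE TABLE Demo (x INT)")  # write something so the file has content
    conn.commit()
    conn.close()

    exists_after = os.path.exists(path)

print(existed_before, exists_after)  # False True
```

The whole database is a single ordinary file, which is part of what "serverless" and "zero-configuration" mean for SQLite.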

### Creating Tables with SQL

Now that we've planned our Jedi Academy database, it's time to put our ideas into action. SQL allows us to create tables using the CREATE TABLE statement. Let's see how we can define our Students, Classes, and Enrollment tables.

#### Creating the `Students` table
Here is the SQL statement to create the `Students` table:

In [4]:
%%sql

DROP TABLE IF EXISTS Students; -- Deletes table if it already exists
CREATE TABLE Students (
    StudentID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    Level INT,
    GPA FLOAT
);


 * sqlite:///jedi_academy.db
Done.
Done.


[]

In this statement, `CREATE TABLE Students` tells SQL that we want to create a new table named `Students`. Each line inside the parentheses defines a column in our table, with its name and its data type. For example, `StudentID INT` creates a column named `StudentID` that will store integers.

`PRIMARY KEY` is a constraint that we've added to `StudentID` to tell SQL that this column will uniquely identify each record in our table. In other words, there cannot be two students with the same `StudentID`.
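
To see the constraint at work, the sketch below recreates the `Students` table in a separate in-memory database (so the notebook's `jedi_academy.db` is not modified) and tries to insert two students with the same `StudentID`. It uses Python's built-in `sqlite3` module instead of SQL magic so that the error can be caught and inspected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # scratch copy, separate from jedi_academy.db
conn.execute("""
    CREATE TABLE Students (
        StudentID INT PRIMARY KEY,
        FirstName VARCHAR(50),
        LastName VARCHAR(50),
        Level INT,
        GPA FLOAT
    )
""")
conn.execute("INSERT INTO Students VALUES (1, 'Anakin', 'Skywalker', 3, 3.8)")

try:
    # Same StudentID as the row above, so the database should refuse it.
    conn.execute("INSERT INTO Students VALUES (1, 'Luke', 'Skywalker', 2, 3.7)")
    error_message = None
except sqlite3.IntegrityError as exc:
    error_message = str(exc)

print(error_message)  # describes which constraint failed
student_count = conn.execute("SELECT COUNT(*) FROM Students").fetchone()[0]
print(student_count)  # 1
```

The second insert is rejected, and the table still holds only the original row.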

#### Creating the `Classes` table

Next, let's create our `Classes` table:

In [5]:
%%sql
DROP TABLE IF EXISTS Classes; -- Deletes table if it already exists
CREATE TABLE Classes (
    ClassID INT PRIMARY KEY,
    ClassName VARCHAR(100),
    MasterName VARCHAR(50),
    RoomNumber INT
);


 * sqlite:///jedi_academy.db
Done.
Done.


[]

This statement is similar to the first, but creates a table named `Classes` with its own unique columns. The `ClassID` column is the primary key here, so each class must have a unique `ClassID`.

#### Creating the `Enrollment` table

Finally, we'll create the `Enrollment` table:

In [None]:
%%sql
DROP TABLE IF EXISTS Enrollment;

CREATE TABLE Enrollment (
    StudentID INT,
    ClassID INT,
    EnrollmentDate DATE,
    PRIMARY KEY (StudentID, ClassID),
    FOREIGN KEY (StudentID) REFERENCES Students(StudentID),
    FOREIGN KEY (ClassID) REFERENCES Classes(ClassID)
);



In the `Enrollment` table, the primary key is made up of two columns, `StudentID` and `ClassID`. This is known as a composite primary key. Neither column has to be unique on its own: a student can appear in many rows, and so can a class. It is the combination of `StudentID` and `ClassID` that must be unique for each record, which ensures that a student can only be enrolled in each class once.

You'll notice that `StudentID` and `ClassID` are also foreign keys, referencing `StudentID` in the `Students` table and `ClassID` in the `Classes` table respectively. This is how we establish relationships between our tables in SQL.
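
Both rules can be tried out in a scratch database. The sketch below is a simplified illustration: the tables are cut down to a few columns, the class names are invented, and it runs in memory through Python's `sqlite3` module. One SQLite-specific detail matters here: foreign key checks are switched off by default and must be enabled per connection with `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign key checks off by default
conn.executescript("""
    CREATE TABLE Students (StudentID INT PRIMARY KEY, FirstName VARCHAR(50));
    CREATE TABLE Classes (ClassID INT PRIMARY KEY, ClassName VARCHAR(100));
    CREATE TABLE Enrollment (
        StudentID INT,
        ClassID INT,
        EnrollmentDate DATE,
        PRIMARY KEY (StudentID, ClassID),
        FOREIGN KEY (StudentID) REFERENCES Students(StudentID),
        FOREIGN KEY (ClassID) REFERENCES Classes(ClassID)
    );
    INSERT INTO Students VALUES (1, 'Anakin');
    -- Hypothetical class names, for illustration only
    INSERT INTO Classes VALUES (101, 'Lightsaber Basics'), (102, 'Force Meditation');
    INSERT INTO Enrollment VALUES (1, 101, '2023-09-01');
""")

def try_enroll(student_id, class_id):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO Enrollment VALUES (?, ?, '2023-09-02')",
                (student_id, class_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False

same_student_new_class = try_enroll(1, 102)  # new (StudentID, ClassID) pair
duplicate_enrollment = try_enroll(1, 101)    # this pair already exists
unknown_student = try_enroll(99, 101)        # no Students row with StudentID 99
print(same_student_new_class, duplicate_enrollment, unknown_student)  # True False False
```

The same student can enroll in a second class, but a repeated pair violates the composite primary key, and an enrollment for a student who does not exist violates the foreign key.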

By creating these three tables, we have successfully set up a basic structure for our Jedi Academy's database. In the next sections, we'll populate these tables with data and learn how to manipulate it.

### Inserting Data into Tables with SQL

Having established the structure of our tables, we now turn to populating them with data. We'll use the `INSERT INTO` statement for this purpose. Let's begin with our `Students` and `Classes` tables.

#### Inserting data into the `Students` table

We can insert data into the `Students` table with a statement that looks like this:

```
INSERT INTO Students (StudentID, FirstName, LastName, Level, GPA)
VALUES (1, 'Anakin', 'Skywalker', 3, 3.8);
```

This statement adds a single row to our table. It begins with `INSERT INTO Students`, which tells SQL that we want to insert data into the `Students` table. The names in parentheses (`StudentID`, `FirstName`, `LastName`, `Level`, `GPA`) are the columns we're inserting data into. The `VALUES` keyword is followed by the values we're inserting, in the same order as the column names.
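
The pairing between the column list and the values can be checked with a small experiment. This sketch again uses a scratch in-memory copy of the `Students` table through Python's `sqlite3` module. It lists the columns in a different order than the table definition, and then inserts a row that names only two columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # scratch database, separate from jedi_academy.db
conn.execute("""
    CREATE TABLE Students (
        StudentID INT PRIMARY KEY,
        FirstName VARCHAR(50),
        LastName VARCHAR(50),
        Level INT,
        GPA FLOAT
    )
""")

# Columns listed in a different order than the table definition:
# each value is matched to the column at the same position in the list.
conn.execute(
    "INSERT INTO Students (LastName, FirstName, StudentID, GPA, Level) "
    "VALUES (?, ?, ?, ?, ?)",
    ("Skywalker", "Anakin", 1, 3.8, 3),
)

# Columns left out of the list are stored as NULL (None in Python).
conn.execute("INSERT INTO Students (StudentID, FirstName) VALUES (?, ?)", (2, "Grogu"))

rows = conn.execute(
    "SELECT StudentID, FirstName, LastName, Level, GPA FROM Students ORDER BY StudentID"
).fetchall()
print(rows)
# [(1, 'Anakin', 'Skywalker', 3, 3.8), (2, 'Grogu', None, None, None)]
```

The `?` placeholders are a feature of the `sqlite3` module, which fills them in from the tuple that follows. This is a safer habit than pasting values into the SQL string by hand.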

Now, let's add data for 10 students:

In [6]:
%%sql 

INSERT INTO Students (StudentID, FirstName, LastName, Level, GPA)
VALUES (1, 'Anakin', 'Skywalker', 3, 3.8),
       (2, 'Grogu', '', 3, 3.9),
       (3, 'Luke', 'Skywalker', 2, 3.7),
       (4, 'Leia', 'Organa', 2, 4.0),
       (5, 'Rey', 'Skywalker', 1, 3.2),
       (6, 'Chewbacca', 'Wookiee', 1, 3.5),
       (7, 'Mara', 'Jade', 3, 4.0),
       (8, 'Mace', 'Windu', 3, 3.8),
       (9, 'Padme', 'Amidala', 1, 3.7),
       (10, 'Qui-Gon', 'Jinn', 3, 3.9);


 * sqlite:///jedi_academy.db
10 rows affected.


[]