# SQL 101 - Intro to Databases and Querying

By: Martin Arroyo

## **Introduction**

Welcome to **SQL 101 - Introduction to Databases and Querying**! In this class, we will cover some of the highlights of the theory behind relational databases as well as introduce you to common SQL query patterns and concepts that you should know to be successful in many analytics roles. The focus of this class is to provide you with a foundational framework to understand how databases work, their importance in today's digital world, and how these concepts are applied on-the-job.

First, we'll start by providing an overview of the theory and concepts behind relational databases. Then we introduce you to SQL, the language of data. We teach you how to write basic data retrieval queries, as well as how to process and summarize data. 

The content that we cover here is typically delivered over the course of an entire semester at most universities. Our aim is to introduce you to the important concepts that you'll need to know to be effective on the job. This means that we will not go into as much depth as a university course when it comes to theory. We include links to external resources that cover those topics at length to help enhance your study. Being effective on the job is a great start, but understanding the theory behind the tools will make you even more effective. We hope that this material sparks an interest in you to study databases further.

In the first half of the notebook we cover the theoretical content, which is mostly reading material that introduces you to key concepts, with comprehension checks along the way. The second half of the notebook involves the practical application of SQL query skills and is meant to be interactive - *you will be writing code and executing queries as you learn*.

You will come away from this class (as well as the subsequent 102 class) with a better understanding of databases, practical SQL skills to write basic to complex queries, and the excitement to learn more. Let's begin!


## **I. Databases & Relational Database Concepts (Theory)**

### **What are databases?**

**Databases** are ubiquitous in today's world. Every day, we interact with many different databases, whether when checking our phones for updates on our favorite applications or when we go to the store and buy things. Without them, we would not be able to store information for very long, and much of the technology that we have come to depend on would not be able to function. But what is a database exactly?

A database is simply an organized collection of data that is stored electronically. Imagine a digital library where tons of books are organized and stored, and you can access any one you like by asking for what you want - that is what a database is like. There are many [different kinds of databases](https://www.simplilearn.com/tutorials/dbms-tutorial/what-are-various-types-of-databases) for a wide variety of use cases. For this class, we will focus primarily on one of the most popular types - **the relational database**.

While they will not be covered in this course, we should mention **NoSQL databases** if we are discussing relational databases. NoSQL databases can generally be thought of as databases which are not structured like a relational database. They typically handle semi-structured or unstructured data, unlike relational databases which impose structure on data by default. There are many different types of NoSQL databases. If you are curious, [here is a resource](https://www.ibm.com/topics/nosql-databases) where you can learn more about them. 

A popular feature of relational databases is the language that is used to manage them, **Structured Query Language (SQL)**. Unlike relational databases, NoSQL databases - as you might expect from the name - either do not use SQL or don't *just use* SQL. 

>**<em>A note about the modern data stack and nascent technologies in the data ecosystem:</em>**
>
>Relational database system architectures have been around, and steadily evolving, since they were first introduced in the 1970s. However, as the amount of data in the world continues to grow massively, new database architectures and paradigms have risen to meet the increased processing and storage needs. Architectures such as Data Lakes, Data Lakehouses, and Distributed Databases are some examples of newer methods for storing data. The details of such systems are outside of the scope of this course since it is focused on data analysis using SQL. But if you are interested in pursuing a role in a specialty like data engineering, then it would be worth it to study these topics in addition to relational databases.
>
>An interesting thing to note is that for these newer architectures, there is a common language that is being used to query and manipulate the data in them - SQL! While they are not relational databases, SQL has become so ubiquitous that designers of these systems made sure that it was a key component. That's one of the reasons why SQL is a critical skill for any data professional! 



### **What are relational databases?**

Here is an optional video that gives you a brief overview of Relational Databases. It may be helpful to watch this video first, then continue on with the rest of the content:

```python
from IPython.display import HTML

HTML("""
<div align="center">
    <iframe width="560" height="315"
    src="https://www.youtube.com/embed/NvrpuBAMddw?si=oxA2ii4Tv5Bu0R4H">
    </iframe>
</div>
""")
```

#### **Entity Relationship Diagrams (ERD)**

We just introduced databases, as well as relational databases, and discussed how they differ from NoSQL databases. But what exactly is a relational database?

Simply put, **a relational database is a type of database that stores information in tables which are related (or connected) to one another**. It's good to visualize what we mean by relationships between tables, so let's introduce you to the **Entity Relationship Diagram (ERD)**. ERDs are used to represent a database by modeling the relationships between different entities, or *tables*, using a type of flowchart like the one shown below:

![COOP ERD](assets/COOP_ERD.png)

The entities in this ERD should be familiar to you - it's the COOP program structure - except this is how we might model COOP in a database based on the relationships between the different roles in the program. As mentioned earlier, entities can be thought of as the tables in our database. The boxes each represent a table in our database, and the lines between entities show which ones are connected. The marks at the end of a line, known as the "Crow's Foot", indicate the type of relationship between the two tables. Each table has a name, and inside of the box are its attributes/column names along with their data types. We'll cover more about columns, data types, and relationships a little later on.

As you can see, each of the roles in COOP (Captain, Program Manager, and Apprentice) is represented by a table. Captains and Program Managers (PMs) are connected directly to one another since PMs supervise Captains. Then we see that Apprentices are connected to the Cohort, just as you are an apprentice who is assigned to one cohort. Finally, the Cohort is the glue that connects the Captains to Apprentices (and, by extension, the PMs).

#### **Types of Relationships**

Earlier, we mentioned that the "Crow's Foot" notation indicated the type of relationship between the two tables. In relational databases, **the type of the relationship between two tables indicates how the rows in one table are related to the rows in another**. This is an important consideration when designing databases, as well as when you want to combine the data in two or more tables. **There are three main types of relationships: `One-to-One`, `One-to-Many`, and `Many-to-Many`**.

Let's imagine we have two tables, `Table A` and `Table B`, that are related:
- **`One-to-One`**: This means that exactly one row in `Table A` is related to exactly one (and only one) row in `Table B`. An example of this type of relationship would be between a user and their password - one user should have, at most, one password.
- **`One-to-Many`**: One row in `Table A` can be matched to one or more rows in `Table B`. A real-life example would be the relationship between the `ProgramManager` and `Captain` tables, where one Program Manager supervises many captains, but each captain reports to just one Program Manager.
- **`Many-to-Many`**: One or more rows in `Table A` can be matched to one or more rows in `Table B`. An example of this is the relationship between the `Apprentice` and `Captain` tables - each `Captain` has multiple Apprentices and each `Apprentice` has multiple Captains.  

![Crows Foot Notation Example](assets/crows-foot.png)
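
A many-to-many relationship is usually stored with the help of a third table that records each pairing. The sketch below is one hedged way to picture that, not the COOP database itself: it uses Python's built-in `sqlite3` module instead of the `duckdb` setup used later in this notebook, and the `Assignment` linking table, the `ID`/`FirstName` columns, and all sample values are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Captain (ID INTEGER PRIMARY KEY, FirstName TEXT)")
con.execute("CREATE TABLE Apprentice (ID INTEGER PRIMARY KEY, FirstName TEXT)")
# Hypothetical linking table: one row per (captain, apprentice) pairing.
con.execute("CREATE TABLE Assignment (CaptainID INTEGER, ApprenticeID INTEGER)")

con.executemany("INSERT INTO Captain VALUES (?, ?)", [(1, "Sam"), (2, "Alex")])
con.executemany("INSERT INTO Apprentice VALUES (?, ?)", [(1, "Angela"), (2, "Luis")])
# In this tiny example every captain works with every apprentice.
con.executemany(
    "INSERT INTO Assignment VALUES (?, ?)", [(1, 1), (1, 2), (2, 1), (2, 2)]
)

# Many apprentices per captain...
apprentices_per_captain = con.execute(
    "SELECT CaptainID, COUNT(*) FROM Assignment "
    "GROUP BY CaptainID ORDER BY CaptainID"
).fetchall()
# ...and many captains per apprentice.
captains_per_apprentice = con.execute(
    "SELECT ApprenticeID, COUNT(*) FROM Assignment "
    "GROUP BY ApprenticeID ORDER BY ApprenticeID"
).fetchall()

print(apprentices_per_captain)
print(captains_per_apprentice)
```

Both counts come out greater than one in each direction, which is what separates many-to-many from one-to-many.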

---

**<em>Comprehension Check</em>**

Answer the questions below to check your understanding of what we have covered so far. Try to answer the questions first before looking at the answers:

*1. What is a database?*
<details>
    <summary>Click to reveal the answer</summary>
    <p>A database is an organized collection of data stored electronically.</p>
</details>
<br>

*2. What are some of the differences between relational databases and NoSQL databases?*
<details>
    <summary>Click to reveal the answer</summary>
    <p>Relational databases are managed with SQL, whereas NoSQL databases either don't use SQL at all or don't just use SQL. Relational databases impose structure on data, while NoSQL databases typically deal with semi-structured or unstructured data.</p>
</details>
<br>

*3. What type of relationship is modeled between the entities in our COOP Program ERD?*
<details>
    <summary>Click to reveal the answer</summary>
    <p>The relationships in our chart are all One-to-Many relationships. This is denoted by the use of the crow's foot notation on only one side of our line connecting the entities. Learn more about crow's foot notation <a href="https://vertabelo.com/blog/crow-s-foot-notation/" target="_blank">here</a>.</p>
</details>

---

#### **Tables, Columns, and Rows**

![Tables, columns, and rows delineated in a chart](assets/tables-columns-rows.png)

In relational databases, our data is stored in a tabular structure called a relation. However, it is much more common to refer to relations as tables. We will be using the terms interchangeably. **Tables are 2-dimensional structures that store data in rows and columns**. 

**Each table typically represents an "entity" or some "thing" we would like to model in our database**. Using our ERD example from earlier, an `Apprentice` would be considered an entity. 

**Columns represent attributes about an entity**. Continuing with our example, some attributes of an `Apprentice` include their `FirstName` and their `CohortID`.

**Rows each represent a single entity record for the table**. In the `Apprentice` table, each row would represent one Apprentice.

The structure of tables in a relational database is similar to how we arrange data in spreadsheets (like Excel).
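
To make the vocabulary concrete, here is a minimal sketch that builds a tiny `Apprentice` table and reads it back. It uses Python's built-in `sqlite3` module rather than the `duckdb` setup used later in this notebook, purely so that it runs anywhere. The `ID` column and all sample values are made up for illustration.

```python
import sqlite3

# In-memory database: nothing is written to disk.
con = sqlite3.connect(":memory:")

# The table models the Apprentice entity; each column is one attribute.
con.execute(
    "CREATE TABLE Apprentice (ID INTEGER, FirstName TEXT, CohortID INTEGER)"
)

# Each row is one apprentice record (made-up sample values).
con.executemany(
    "INSERT INTO Apprentice VALUES (?, ?, ?)",
    [(1, "Angela", 10), (2, "Luis", 10), (3, "Priya", 11)],
)

cursor = con.execute("SELECT * FROM Apprentice ORDER BY ID")
# cursor.description holds one entry per column; the name comes first.
column_names = [column[0] for column in cursor.description]
rows = cursor.fetchall()

print(column_names)
for row in rows:
    print(row)
```

The list of column names plays the role of a spreadsheet's header row, and each tuple that follows is one record.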

#### **Data Types and Structure**

![Common SQL Data Types](assets/data-types.png)

**Attributes of an entity have both a type (the kind of data it is) and a value (the data itself.)** For example, the `FirstName` attribute for the `Apprentice` table is text data, so we would consider "text" to be its data type, and a possible value could be "Angela". The data type is very important because columns will only support one data type each. In Excel, you are allowed to input any type of data in a column that you would like - one cell of a column can have a number and the next cell in that same column can have a text value. However, that is where SQL and Excel differ. 

SQL helps impose structure on our data by making it so that all the values in a column must share the same data type. This is necessary when we are doing things like making calculations based on our data. As an example, if we summarize the data in a column of numbers, we would want to be sure that each value in that column is indeed a number, otherwise we may get incorrect results or errors.
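
The short sketch below illustrates why mixed types in one column cause trouble. It is plain Python rather than SQL: two lists stand in for columns, and the values are made up. It is only meant to show the kind of error a consistent data type protects us from.

```python
# A spreadsheet-style column where one cell holds text instead of a number.
mixed_column = [120, 85, "ninety"]
# A typed column: every value is an integer.
typed_column = [120, 85, 90]

try:
    mixed_total = sum(mixed_column)
except TypeError as error:
    # Adding a number to text is not defined, so the total cannot be computed.
    mixed_total = None
    print(f"Cannot total the mixed column: {error}")

typed_total = sum(typed_column)
print(typed_total)
```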

So far, we have seen how entities/tables can be related to one another and the types of relationships they can have (one-to-one, one-to-many, and many-to-many.) But the key (pun fully intended) to establishing these relationships lies in the concept of **Primary Keys** and **Foreign Keys**.

**Primary Keys are a column (or multiple columns) in a table that identify a unique record**. In our COOP example, the `ID` column in the `ProgramManager` table would be considered a primary key because it uniquely identifies a single Program Manager. You might be thinking, "why not just use the `FirstName`, `LastName`, or a combination of the two instead?" The reason we wouldn't be able to use those columns as unique identifiers in our table is that two Program Managers could share the same first and last name. Therefore, neither of those columns (or their combination) can be used as a reliable indicator of uniqueness.

**Foreign Keys are used to formally establish a relationship between two tables in the database and do not need to uniquely identify rows**. Typically, one table will have a column that references the Primary Key of another table. The column in the table that references the Primary Key in this example is called the Foreign Key. One way to think about this is by using a parent-child relationship as an example - the "parent" table in this case is the one with the Primary Key that the "child" table (with the Foreign Key) refers to. This is effectively establishing a one-to-many relationship between the "parent" and "children", since parents can have one or more "children" while children (in this scenario!) have one and only one parent.

To illustrate this further, let's look at the `ProgramManager` and `Captain` tables, which have a one-to-many relationship, in our ERD:

![ERD showing the relationship between Program Managers and Captains](assets/COOP-PMs-Captains-relationship.png)

The `PM_ID` column in the `Captain` table is the Foreign Key in this relationship, since it references the `ID` column of `ProgramManager`, which is the Primary Key of that table. Put another way, the relationship between `ProgramManager` and `Captain` is similar to the parent-child example since they are both one-to-many relationships established through the Primary Key (`ID`) in the "parent" table (`ProgramManager`) and the Foreign Key (`PM_ID`) in the "child" table (`Captain`).

Understanding these relationships is important when we want to combine data from two or more tables, as we will need to know what columns establish that relationship between the tables in order to join them.
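
Here is a hedged sketch of how a database can enforce these two kinds of keys. It uses Python's built-in `sqlite3` module (not the `duckdb` setup used later in this notebook) so it is self-contained. The `ProgramManager` columns and `PM_ID` follow the ERD described above, while the `Captain` columns `ID` and `FirstName` and all names in the sample rows are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
con.execute("PRAGMA foreign_keys = ON")

con.execute("""
    CREATE TABLE ProgramManager (
        ID INTEGER PRIMARY KEY,
        FirstName TEXT,
        LastName TEXT
    )
""")
con.execute("""
    CREATE TABLE Captain (
        ID INTEGER PRIMARY KEY,
        FirstName TEXT,
        PM_ID INTEGER REFERENCES ProgramManager (ID)
    )
""")

con.execute("INSERT INTO ProgramManager VALUES (1, 'Dana', 'Reyes')")
# Two captains share PM_ID = 1: a foreign key does not have to be unique.
con.executemany(
    "INSERT INTO Captain VALUES (?, ?, ?)", [(1, "Sam", 1), (2, "Alex", 1)]
)

# A repeated primary key value is rejected.
try:
    con.execute("INSERT INTO ProgramManager VALUES (1, 'Other', 'Person')")
    duplicate_pk_rejected = False
except sqlite3.IntegrityError:
    duplicate_pk_rejected = True

# A captain pointing at a Program Manager that does not exist is rejected.
try:
    con.execute("INSERT INTO Captain VALUES (3, 'Jo', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

captains_of_pm_1 = con.execute(
    "SELECT FirstName FROM Captain WHERE PM_ID = 1 ORDER BY ID"
).fetchall()
print(duplicate_pk_rejected, orphan_rejected, captains_of_pm_1)
```

The one-to-many shape shows up in the data itself: `ID` values never repeat in `ProgramManager`, while the same `PM_ID` appears on several `Captain` rows.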

Here are some more resources on [Primary Keys](https://www.w3schools.com/sql/sql_primarykey.asp) and [Foreign Keys](https://www.w3schools.com/sql/sql_foreignkey.asp).

---

**<em>Comprehension Check</em>**

Answer the questions below to check your understanding of what we have covered so far. Try to answer the questions first before looking at the answers:

*1. What do tables, columns, and rows represent in a relational database?*
<details>
    <summary>Click to reveal the answer</summary>
    <p>Tables represent entities or things in a database. Columns are part of a table and represent the attributes of an entity. Rows each represent one record of an entity in a given table.</p>
</details>
<br>

*2. What role do data types play in structuring data in a relational database?*
<details>
    <summary>Click to reveal the answer</summary>
    <p>Data Types are used to help ensure that the data in a column are all consistent.</p>
</details>
<br>

*3. What is the difference between a primary key and a foreign key?*
<details>
    <summary>Click to reveal the answer</summary>
    <p>Primary Keys are used to identify unique rows in a table. Foreign Keys are used to formally establish relationships between tables in the database. Often, a foreign key will establish a relationship with a primary key in another table.</p>
</details>

---

#### **Schemas and Metadata**

![Schemas in the COOP ERD](assets/schemas.png)

**Schemas describe how data is organized in a database**. "Schema" is a broad term that, in practice, can refer to different scopes of organization in the database.
It can refer to the:

- **Structure of a single table**: When we talk about the schema of a table, we are referring to the table's name as well as the structure of the table, such as the column names and their data types. 

- **Collections of tables within a database**: Databases can be subdivided into different schemas, which each have their own collection of tables. For example, the tables in our COOP database could potentially be split into an "Employee" schema (`ProgramManager` and `Captain`) and a "Student" schema (`Apprentice`, `Cohort`). 

- **Set of ALL tables in a database**: When referring to the structure of a database as a whole, the schema describes the entire set of all tables (e.g. table names and schemas) in the database.

When you hear the word "schema", it could potentially mean any one of these, so **it's important to understand the context in which it is mentioned; when in doubt, clarify which schema is being referred to.** 

**Metadata is a set of data that describes another set of data**. An example of metadata in everyday life would be the table of contents or a summary of a book. Since schemas describe how data are organized in databases, they are considered a type of metadata. Knowing the metadata of either the table or the database that you are using is important as it gives you critical context and information for any analysis you may do.
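
Most database systems let you query this metadata directly. The sketch below shows one way to do it with Python's built-in `sqlite3` module. Note that `sqlite_master` and `PRAGMA table_info` are specific to SQLite, and other systems (including the `duckdb` setup used later in this notebook) expose the same information through their own commands. The `Cohort` columns are assumed for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Cohort (ID INTEGER PRIMARY KEY, Name TEXT)")
con.execute(
    "CREATE TABLE Apprentice "
    "(ID INTEGER PRIMARY KEY, FirstName TEXT, CohortID INTEGER)"
)

# Database-level schema: which tables exist?
table_names = [
    name for (name,) in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# Table-level schema: column names and declared data types of one table.
# Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
apprentice_columns = [
    (row[1], row[2]) for row in con.execute("PRAGMA table_info(Apprentice)")
]

print(table_names)
print(apprentice_columns)
```

Neither query returns any apprentice or cohort records. Both return data *about* the tables, which is exactly what makes it metadata.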

#### **Normalization, Denormalization, and OLTP vs OLAP**

Normalization and denormalization are data modeling methods that have different goals for data storage and retrieval. **Normalization is used when we want to ensure the consistency and integrity of the data by eliminating redundancy (e.g. duplicate records.)** This is achieved by dividing the tables into smaller sub-tables until redundant data are eliminated. **Denormalization, on the other hand, favors easier querying of the data and achieves this by combining data/tables together, even if it may introduce redundancy.**

>**Extra Context**: *What exactly is redundancy in a database and why does it matter?*
>
>Redundancy in a database refers to the unnecessary repetition of data or storing the same piece of information in multiple places. It matters because it can lead to increased storage costs, data inconsistencies, and complications in data updates and retrieval. By reducing or eliminating redundancy, the integrity (accuracy and consistency) of the data is maintained.

![Normalization vs Denormalization](assets/normalized-denormalized.png)
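
The trade-off can be seen in a small sketch. It uses Python's built-in `sqlite3` module (not the `duckdb` setup used later in this notebook), and the `CaptainWide` table, the simplified columns, and the sample names are hypothetical. The normalized design stores the Program Manager's name once, while the denormalized table repeats it on every captain row.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Normalized: each Program Manager's name is stored exactly once.
con.execute("CREATE TABLE ProgramManager (ID INTEGER PRIMARY KEY, FirstName TEXT)")
con.execute(
    "CREATE TABLE Captain (ID INTEGER PRIMARY KEY, FirstName TEXT, PM_ID INTEGER)"
)
con.execute("INSERT INTO ProgramManager VALUES (1, 'Dana')")
con.executemany(
    "INSERT INTO Captain VALUES (?, ?, ?)",
    [(1, "Sam", 1), (2, "Alex", 1), (3, "Jo", 1)],
)

# Denormalized: one wide table that repeats the manager's name on every row.
con.execute("""
    CREATE TABLE CaptainWide AS
    SELECT c.ID AS CaptainID,
           c.FirstName AS CaptainName,
           pm.FirstName AS ManagerName
    FROM Captain AS c
    JOIN ProgramManager AS pm ON pm.ID = c.PM_ID
""")

# Renaming the manager touches one row in the normalized design...
normalized_updates = con.execute(
    "UPDATE ProgramManager SET FirstName = 'Dana-Marie' WHERE ID = 1"
).rowcount
# ...but every copy of the name in the denormalized table.
denormalized_updates = con.execute(
    "UPDATE CaptainWide SET ManagerName = 'Dana-Marie' WHERE ManagerName = 'Dana'"
).rowcount

print(normalized_updates, denormalized_updates)
```

The wide table is easier to read from because no join is needed, but every repeated value is another place where an update can be missed.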

The normalization method is typically used in **Online Transactional Processing (OLTP) systems, which are used to model real-time transactions, favoring (write) speed, consistency, and data integrity**. Relational databases are used to create this model because of the strong emphasis on structure. OLTP typically has a focus on business-critical applications, meaning items that need to be processed and recorded in real-time. Examples of this kind of processing would be managing inventory in a warehouse or credit card transactions. Data in these databases are not usually kept for very long, so there is little historical data generally available in these systems.

**Online Analytical Processing (OLAP) systems, which generally use a denormalized structure for data, are used when querying and analyzing historical data is more important than being able to store data quickly and without redundancy**. OLAP systems can be built using relational databases, but may also be built using other types of databases or architectures (e.g. [Data Lakes](https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/)). These systems are used for information mining and gathering insights for a business. They typically store much more data than OLTP systems, as they are essentially archives of historical data that are continuously added to. Very often, the data source for an OLAP system will be the data from an OLTP system.

While OLTP systems will generally use normalization and OLAP systems denormalization, it's important to point out that this isn't a hard and fast rule. There may be cases where denormalization is preferred in an OLTP system, as well as using normalization in OLAP systems. This is context and use-case dependent, so you will need to understand how the particular database you are using was structured. 

The diagram below shows a high-level breakdown of how OLTP and OLAP systems are used:

![OLTP and OLAP systems](assets/oltp-olap.png)

As data analysts, you may work with databases that are either OLTP or OLAP systems. The most common system used for data analytics is OLAP, which often models data in what are known as data warehouses. Data Warehouses combine data from multiple sources within a business to create a unified, holistic view of the data for reporting and analysis. This is a topic that books are devoted to and is out of the scope of what we will cover in our class. [This article](https://www.oracle.com/database/what-is-a-data-warehouse/) expands on the concept further.  

Normalization, denormalization, OLTP, and OLAP are all concepts that go much deeper than the treatment given here. For our purposes, we will not dive into these topics further. However, if you would like to learn more, here are some resources we recommend:

- [OLTP vs OLAP](https://aws.amazon.com/compare/the-difference-between-olap-and-oltp/)
- [Normalization vs Denormalization](https://medium.com/analytics-vidhya/database-normalization-vs-denormalization-a42d211dd891#:~:text=Normalization%20is%20the%20technique%20of,to%20make%20data%20retrieval%20faster.)
- [When and How You Should Denormalize a Relational Database](https://www.linkedin.com/pulse/when-how-you-should-denormalize-relational-database-pathuri/)

---

**<em>Comprehension Check</em>**

Answer the questions below to check your understanding of what we have covered so far. Try to answer the questions first before looking at the answers:

*1. What is a schema, and what are two possible things it could refer to?*
<details>
    <summary>Click to reveal the answer</summary>
    <p>The schema refers to the organization of data inside of a database. Data can be organized according to different scopes. For tables, a table schema provides metadata about a specific table, such as the name, the column names, and their data types. A database schema provides metadata about the database as a whole, including the names of the tables that are in it and the possible schemas that the tables are organized into. It could also simply refer to a collection of tables within a database.</p>
</details>
<br>

*2. When do we use denormalization to store our data in a database?*
<details>
    <summary>Click to reveal the answer</summary>
    <p>Denormalization is used when we want to focus on querying and analyzing historical data.</p>
</details>
<br>

*3. What is the difference between OLTP and OLAP Systems? As a Data Analyst, which type of system are you more likely to work in?*
<details>
    <summary>Click to reveal the answer</summary>
    <p>OLTP Systems focus more on real-time business processes and applications. OLAP Systems focus on making historical data easily available for querying and analysis. As Data Analysts, you may work with both systems, but are more likely to work in an OLAP system.</p>
</details>

---

#### **Relational Database Management System (RDBMS)**

![Top 6 Most Popular RDBMS Systems in 2023](assets/rdbms.png)

Until now, we have mostly been describing the relational database from a theoretical perspective. **However, when we interact with a relational database on a computer, we do so using a Relational Database Management System (RDBMS)**. The RDBMS essentially brings the relational database model to life and makes it something we can actually use by handling the physical storage of the data on a device. Think of it like the engine in a car - while we interact with the exterior of the car, the engine is what makes sure it runs smoothly. 

There are quite a few different RDBMS's out there, such as **PostgreSQL, MySQL, Oracle and SQL Server**. An important thing to note is that each RDBMS uses a slightly different version of SQL (these are called *dialects*). This means that the syntax can be different for some queries between SQL Server and PostgreSQL, for example.

**Why use a Relational Database instead of just working with data in Excel (or a similar spreadsheet software)?**

While there are still instances where data is stored exclusively in spreadsheets, this is becoming less and less common. As the amount of data to be analyzed grows, using a database eventually becomes a necessity. Here are a few reasons why:

1. **RDBMS's can store much more information than spreadsheets can.** Excel, for example, has a hard limit of 1,048,576 rows and 16,384 columns per worksheet. Even if you are below these limits, Excel can become difficult to use once you are working with larger data sets (think hundreds of thousands of rows or more).
2. **RDBMS's allow for many people to collaborate and use the same data source all at once.** While you can share Excel workbooks with one another and even work collaboratively, once you get to more than a handful of people working on updating the same document the process quickly becomes unwieldy. Conflicts begin to creep up more and more often, and there is a good chance that work can be lost or overwritten easily, costing hours of productivity. This is not an issue with RDBMS's as they are designed to handle many people using them all at once.
3. **Relational Databases and RDBMS's help ensure the integrity and structure of the data.** This is critical when data is updated on a regular basis. Excel does not have any mechanisms built-in to guarantee the integrity and structure of the data.
4. **RDBMS's provide more security for your data than spreadsheets.** While it is possible to encrypt spreadsheets, these methods are not robust against a determined attacker. On the other hand, RDBMS's have access control capabilities built-in, which allow you to control who has access to what data and what level of access they should have. They are generally more secure than spreadsheets.

These are the main reasons for using an RDBMS to store data over a spreadsheet. However, there are exceptions where storing data in a spreadsheet may be preferable, such as when the data you need to store is relatively small, it doesn't need to be updated, and it doesn't need to be kept particularly secure. All in all, the decision of what storage method to use should always be tied back to your needs and use case(s). 


## **What is SQL?**

**Structured Query Language (SQL - pronounced <em>"Sequel"</em>)** was created in the early 1970s by researchers at IBM based on [the relational model that was described by Edgar F. Codd](https://dl.acm.org/doi/10.1145/362384.362685). It is a domain-specific programming language that is used for querying and managing an RDBMS.

>**Extra Context**: *The math behind relational databases*
>
>[Relational algebra](https://www.geeksforgeeks.org/introduction-of-relational-algebra-in-dbms/) is a set of mathematical operations that define the foundation for querying relational databases. It provides a theoretical framework consisting of operations like selection, projection, union, set difference, and Cartesian product to manipulate data sets (relations). SQL, the standard language for querying relational databases, is essentially a practical implementation of these concepts, translating relational algebra operations into familiar, user-friendly query syntax.
>
>For those who want a challenge, after completing SQL 101, see if you can go back and translate your queries into expressions using Relational Algebra!

SQL is also known as a declarative language. This means that when we write queries in SQL, we are describing what data we want rather than describing, step-by-step, how the RDBMS should retrieve that data. The syntax is relatively simple to learn and writing queries can feel more intuitive which makes SQL a beginner-friendly language. 

Another important point to reiterate is that SQL itself is just a specification. **Different RDBMS vendors implement their own versions of SQL** based on that specification. This is why the SQL syntax for one RDBMS will differ (albeit slightly) from that of another. In practice, this means that you should always be aware of what RDBMS you are using so that you know what language features are available to you (and what documentation to use).

Overall, you can think of SQL as a translator between you and a vast library of information. You ask it (query) for specific books or details (data), and it fetches them for you. It's important to emphasize that when we write queries, we are asking for what we want (declarative) rather than telling the database how to retrieve our data (this would be considered ["imperative"](https://www.educative.io/blog/declarative-vs-imperative-programming)).
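
To see the difference side by side, here is a small sketch that answers the same question twice: which apprentices are in cohort 10? The first version is an imperative Python loop, and the second is a declarative SQL query. It uses Python's built-in `sqlite3` module rather than the `duckdb` setup used later in this notebook, and the sample rows are made up.

```python
import sqlite3

# Made-up sample rows: (FirstName, CohortID).
apprentices = [("Angela", 10), ("Luis", 10), ("Priya", 11)]

# Imperative: spell out each step of the search ourselves.
imperative_result = []
for first_name, cohort_id in apprentices:
    if cohort_id == 10:
        imperative_result.append(first_name)
imperative_result.sort()

# Declarative: describe the rows we want and let the database work out how.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Apprentice (FirstName TEXT, CohortID INTEGER)")
con.executemany("INSERT INTO Apprentice VALUES (?, ?)", apprentices)
declarative_result = [
    name for (name,) in con.execute(
        "SELECT FirstName FROM Apprentice WHERE CohortID = 10 ORDER BY FirstName"
    )
]

print(imperative_result == declarative_result)
```

Both approaches return the same names. The SQL version never says how to loop over or sort the rows, only which rows are wanted and in what order.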

### **SQL Sub-languages**

SQL can be further divided into five sub-languages, each of which contains commands for specific tasks like creating databases or manipulating and querying data. Of these five, we will focus on the Data Query Language (DQL). The five sub-languages are, in no particular order:
- **Data Definition Language (DDL)**: Used for creating or modifying the structure of tables or databases. Common DDL commands include `CREATE`, `DROP`, and `ALTER`.
- **Data Manipulation Language (DML)**: Used for manipulating data that is already stored in the database, typically by either adding, removing, or updating the data. Common DML commands include `INSERT`, `UPDATE`, and `DELETE`.
- **Data Query Language (DQL)**: Used for querying data in the database. DQL is where the `SELECT` command comes from.
- **Data Control Language (DCL)**: Used for granting or modifying access to data stored in tables. Common DCL commands include `GRANT` and `REVOKE`.
- **Transaction Control Language (TCL)**: Used for controlling transactions in the database. Common TCL commands include `COMMIT`, `SAVEPOINT`, and `ROLLBACK`.

![SQL Sub-languages](assets/sql-sub-languages.png)
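
The sketch below walks through four of the five sub-languages in order, as a rough illustration rather than a complete tour. It uses Python's built-in `sqlite3` module (not the `duckdb` setup used later in this notebook). The `Cohort` table and its values are made up. DCL is left out because SQLite has no user accounts and therefore does not support `GRANT` or `REVOKE`.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# DDL: define the structure of a table.
con.execute("CREATE TABLE Cohort (ID INTEGER PRIMARY KEY, Name TEXT)")

# DML: add and change the data stored in it.
con.execute("INSERT INTO Cohort VALUES (1, 'Data Analytics')")
con.execute("UPDATE Cohort SET Name = 'Data Analytics 2024' WHERE ID = 1")

# TCL: commit() issues COMMIT, making the changes so far permanent.
con.commit()

# DML followed by TCL: rollback() issues ROLLBACK, which undoes the delete.
con.execute("DELETE FROM Cohort")
con.rollback()

# DQL: read the data back.
cohorts = con.execute("SELECT ID, Name FROM Cohort").fetchall()
print(cohorts)
```

The row survives the `DELETE` because the rollback discards everything done since the last commit.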


---

**<em>Comprehension Check</em>**

Answer the questions below to check your understanding of what we have covered so far. Try to answer the questions first before looking at the answers:

*1. What does SQL stand for?*
<details>
    <summary>Click to reveal the answer</summary>
    <p>Structured Query Language.</p>
</details>
<br>

*2. When would we use a Relational Database Management System? When would it be best to just use a spreadsheet?*
<details>
    <summary>Click to reveal the answer</summary>
    <p>We would use a Relational Database Management System when we are working with a lot of data that is consistently updated and need to have a lot of people work on that data simultaneously. However, when you don't need a lot of people to work on the data and it is not updated very often (or at all) then a spreadsheet is sufficient.</p>
</details>
<br>

*3. Which sub-language of SQL will we be focused on for the purpose of writing queries?*
<details>
    <summary>Click to reveal the answer</summary>
    <p>Data Query Language (DQL).</p>
</details>

---


**The rest of the material in SQL 101 is devoted to helping you master the basics of SQL querying. These skills will be applicable to just about any version of the language you may encounter.**

## **II. Querying and Data Retrieval (Practice)**


Now that you have a foundation of database theory under your belt, it's time to learn how to communicate with a database and write queries using SQL! In this portion of the notebook, we will focus on teaching you the basics of writing queries against single tables. In 102, we will show you how to combine data from multiple tables to enrich your queries - but we will walk before we run. 

All of your queries will be written against preloaded databases that are available only in this notebook. Our "RDBMS" and SQL dialect is `duckdb`, a popular in-process database engine that we use here through its Python library. You can find [the documentation for `duckdb` here](https://duckdb.org/docs/sql/introduction) - you will want to keep the documentation handy. 

`teachdb`, which provides the data that you will be working with, is a Python library written by The Freestack Initiative, a group of COOP alumni who want to empower the community to learn and improve technical skills by providing materials and resources at low (or no) cost.

First, we'll do a quick tutorial on how to use the notebook with these tools, then we'll dive into your first SQL clause and query!

## How to use this notebook

**Step 1: Run the cell below to set up the database and notebook**

Don't worry about what the code below does exactly. It simply sets up this notebook for the lesson:

In [28]:
# Install `teachdb` if it's not in the system already
!pip install --quiet --upgrade git+https://github.com/freestackinitiative/teachingdb.git
import pandas as pd
from teachdb.teachdb import connect_teachdb
# Set configurations for notebook
%load_ext sql
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
# None means "no limit"; the old value -1 is deprecated in newer pandas versions
pd.set_option('display.max_colwidth', None)
# Load data
con = connect_teachdb(databases=["sales_cogs_opex", "restaurant"])

%sql con


Connected to the `teachdb` from the Freestack Initiative.


**Step 2: Run a query**

To run SQL queries against the database, make sure that the cell you are writing in has `%%sql` written at the top. You can write your queries underneath that and run the cell to execute them. 

Go ahead and try it by executing the query in the cell below:

In [29]:
%%sql

SELECT *
FROM Dishes
LIMIT 5

|   | DishID | Name | Description | Price | Type |
|---|---|---|---|---|---|
| 0 | 1 | Parmesan Deviled Eggs | These delectable little bites are made with organic eggs, fresh Parmesan, and chopped pine nuts. | 8.0 | Appetizer |
| 1 | 2 | Artichokes with Garlic Aioli | Our artichokes are brushed with an olive oil and rosemary blend and then broiled to perfection. Served with a side of creamy garlic aioli. | 9.0 | Appetizer |
| 2 | 3 | French Onion Soup | Caramelized onions slow cooked in a savory broth, topped with sourdough and a provolone cheese blend. Served with sourdough bread. | 7.0 | Main |
| 3 | 4 | Mini Cheeseburgers | These mini cheeseburgers are served on a fresh baked pretzel bun with lettuce, tomato, avocado, and your choice of cheese. | 8.0 | Main |
| 4 | 5 | Panko Stuffed Mushrooms | Large mushroom caps are filled a savory cream cheese, bacon and panko breadcrumb stuffing, topped with cheddar cheese. | 7.0 | Appetizer |


Now you know how to write your queries in this notebook! Feel free to make as many new cells as you need to experiment with queries. Don't forget to save a copy of the notebook so that you won't lose any of your work. Let's go ahead and learn how to `SELECT` data!

## Basic Data Retrieval

**Key Skills/Concepts**

- Write SQL statements to retrieve data from tables.
- Use aliases for clarity in queries.
- Understand how to eliminate duplicate results.

#### **Scenario**

To make things a bit more interesting, we'll be using a dataset from a fictional restaurant. The queries we write will help you learn more SQL while also learning more about the restaurant and its data. Let's start by looking at the restaurant's menu, which is found in the `Dishes` table. 

#### **`SELECT *`**

The most basic SQL query you will ever write is one where you simply retrieve all of the data from a single table. In order to create this query, you will need to know two commands: `SELECT` and `FROM`:

- `SELECT`: Used to choose which columns to retrieve from a table. When we want to select all of the columns in a table, we can use `*` as a shortcut instead of writing all of the column names.
- `FROM`: Specifies the name of the table that you would like to query.

The general form of a query where we ask to see all of the columns and rows in a table is:

```sql
SELECT *
FROM table_name
```

**Action Item**

Write a query that shows all of the dishes from the `Dishes` table. The query should include all of the columns from that table as well. Use the general form above to help you structure it: 

In [30]:
%%sql

SELECT *
FROM Dishes

|   | DishID | Name | Description | Price | Type |
|---|---|---|---|---|---|
| 0 | 1 | Parmesan Deviled Eggs | These delectable little bites are made with organic eggs, fresh Parmesan, and chopped pine nuts. | 8.0 | Appetizer |
| 1 | 2 | Artichokes with Garlic Aioli | Our artichokes are brushed with an olive oil and rosemary blend and then broiled to perfection. Served with a side of creamy garlic aioli. | 9.0 | Appetizer |
| 2 | 3 | French Onion Soup | Caramelized onions slow cooked in a savory broth, topped with sourdough and a provolone cheese blend. Served with sourdough bread. | 7.0 | Main |
| 3 | 4 | Mini Cheeseburgers | These mini cheeseburgers are served on a fresh baked pretzel bun with lettuce, tomato, avocado, and your choice of cheese. | 8.0 | Main |
| 4 | 5 | Panko Stuffed Mushrooms | Large mushroom caps are filled a savory cream cheese, bacon and panko breadcrumb stuffing, topped with cheddar cheese. | 7.0 | Appetizer |
| 5 | 6 | Garden Buffet | Choose from our fresh local, organically grown ingredients to make a custom salad. | 9.99 | Main |
| 6 | 7 | House Salad | Our house salad is made with romaine lettuce and spinach, topped with tomatoes, cucumbers, red onions and carrots. Served with a dressing of your choice. | 7.0 | Main |
| 7 | 8 | Chef's Salad | The chef's salad has cucumber, tomatoes, red onions, mushrooms, hard-boiled eggs, cheese, and hot grilled chicken on a bed of romaine lettuce. Served with croutons and your choice of dressing. | 9.0 | Main |
| 8 | 9 | Quinoa Salmon Salad | Our quinoa salad is served with quinoa, tomatoes, cucumber, scallions, and smoked salmon. Served with your choice of dressing. | 9.99 | Main |
| 9 | 10 | Classic Burger | Our classic burger is made with 100% pure angus beef, served with lettuce, tomatoes, onions, pickles, and cheese of your choice. Veggie burger available upon request. Served with French fries, fresh fruit, or a side salad. | 9.99 | Main |


#### **`SELECT column1, column2, column3, ...`**

Awesome! You just wrote your first query that retrieved the menu from our database. It seems like they have some really tasty dishes at very reasonable prices. But this is also a lot of information to take in all at once. Right now, I'd just like to see the name of the dish, the price, and what type it is. That means we'll need to specify columns that we want, which leads to the next general form of a simple select query:

```sql
SELECT column1, column2, column3
FROM table_name
```

The difference between this form and the previous one is that rather than using "`*`" to specify all columns, I am listing the names of the columns that I want instead. This method is generally preferred, since it limits the selection to only the data we want to see rather than grabbing everything from the table. 
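To see the two forms side by side, here is a minimal sketch. It assumes Python's built-in `sqlite3` module as a stand-in for the notebook's `duckdb` connection, and the two rows are a hand-typed subset of `Dishes` with shortened, made-up descriptions:

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the notebook's duckdb setup
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Dishes (DishID INTEGER, Name TEXT, Description TEXT, Price REAL, Type TEXT)"
)
con.executemany(
    "INSERT INTO Dishes VALUES (?, ?, ?, ?, ?)",
    [
        # Descriptions are shortened placeholders, not the real menu text
        (1, "Parmesan Deviled Eggs", "Eggs with Parmesan.", 8.0, "Appetizer"),
        (3, "French Onion Soup", "Onions in broth.", 7.0, "Main"),
    ],
)

# SELECT * returns every column in the table
all_columns = con.execute("SELECT * FROM Dishes")
all_column_names = [col[0] for col in all_columns.description]

# Listing columns returns only the ones we asked for
some_columns = con.execute("SELECT Name, Price, Type FROM Dishes")
some_column_names = [col[0] for col in some_columns.description]
some_rows = some_columns.fetchall()

print(all_column_names)
print(some_column_names)
print(some_rows)
```

The first query carries all five columns, while the second carries only the three that were named, which is the behavior the general form describes.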

>**Pro-Tip:**
> In general, you should always try to only query for the data you need. The datasets in this notebook don't have tables with many columns, but in the real world it is possible to have tables with many more columns (hundreds or even thousands!). Queries that pull in a large amount of data can get expensive and take up a lot of your computer's resources, so it's best to avoid unnecessarily large queries by simply being more specific about the data you want.
>
> Here's a rule of thumb as to when to use "`*`" in your query: If you really need to have **ALL** of the columns in a table, then use it. Otherwise, specify only the columns you need.

**Action Item**

Write a query that shows the `Name`, `Price`, and `Type` of dishes in the `Dishes` table.

In [34]:
%%sql

SELECT Name, Price, Type
FROM Dishes

|   | Name | Price | Type |
|---|---|---|---|
| 0 | Parmesan Deviled Eggs | 8.0 | Appetizer |
| 1 | Artichokes with Garlic Aioli | 9.0 | Appetizer |
| 2 | French Onion Soup | 7.0 | Main |
| 3 | Mini Cheeseburgers | 8.0 | Main |
| 4 | Panko Stuffed Mushrooms | 7.0 | Appetizer |
| 5 | Garden Buffet | 9.99 | Main |
| 6 | House Salad | 7.0 | Main |
| 7 | Chef's Salad | 9.0 | Main |
| 8 | Quinoa Salmon Salad | 9.99 | Main |
| 9 | Classic Burger | 9.99 | Main |


#### **Aliases (`AS`)**

Nice! We consolidated our data into a view that makes it easier to understand by only selecting the columns we wanted to see. But I'm going to be a stickler for details here - I don't like that the column that has our dish name is just called `Name`. I'd prefer if it was displayed as `DishName` so that it's a little more descriptive. Thankfully, SQL makes this easy for us to do with aliases. 

By default, when we write a query, the output shows the column names from the table. But sometimes we want to rename these columns for presentation purposes or just our own clarity. This is a scenario when we would use an alias. Aliases allow us to change the name of a column or a table in a query result to something we choose. It's important to note that it only changes the column name for the query result and does not actually change the name of the column in the table - it's only temporary and for display purposes. To specify an alias, we use the `AS` clause.

Here's the general form of the query:

```sql
SELECT column1 AS Col1, column2 AS "Column 2"
FROM some_table AS my_table 
```

There are two alias styles shown here - one without quotation marks and one with them. If your alias name is a single word, then you do not need quotation marks. However, if you want your alias name to have a space between words, then you will use double-quotes around the alias (this could be different depending on which SQL dialect you are using, so be sure to check your documentation!).
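Here is a small sketch of both alias styles, and of the point that an alias only affects the query result. It assumes Python's built-in `sqlite3` module in place of the notebook's `duckdb` connection; the one-row table and the `"Dish Price"` alias are made up for illustration:

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the notebook's duckdb setup
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Dishes (Name TEXT, Price REAL)")
con.execute("INSERT INTO Dishes VALUES ('House Salad', 7.0)")

# One alias without quotes, one with double quotes because it contains a space.
# The table itself is also aliased as `d`.
aliased = con.execute('SELECT Name AS DishName, Price AS "Dish Price" FROM Dishes AS d')
aliased_names = [col[0] for col in aliased.description]

# Querying the table again shows its real column names are unchanged
plain = con.execute("SELECT * FROM Dishes")
table_names = [col[0] for col in plain.description]

print(aliased_names)
print(table_names)
```

The aliased result is labeled `DishName` and `Dish Price`, while the table still reports `Name` and `Price`.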

>**Pro-Tip:**
> Unless you are presenting the results of a query directly to a client or stakeholder, where neatness and readability are important, avoid using spaces in column names and aliases. Using spaces in column names, especially in more complex queries, can introduce odd bugs and additional complexity that is often just not worth it. 

Let's rename the `Name` column to `DishName` together:

```sql
SELECT Name AS DishName, Price, Type
FROM Dishes
```

**Action Item**

Write a query using the same format as the one prior, except rename the `Name` column as `DishName`, the `Price` column as `Cost`, and the `Type` column as `DishType`.


In [35]:
%%sql

SELECT Name AS DishName, Price AS Cost, Type AS DishType
FROM Dishes

|   | DishName | Cost | DishType |
|---|---|---|---|
| 0 | Parmesan Deviled Eggs | 8.0 | Appetizer |
| 1 | Artichokes with Garlic Aioli | 9.0 | Appetizer |
| 2 | French Onion Soup | 7.0 | Main |
| 3 | Mini Cheeseburgers | 8.0 | Main |
| 4 | Panko Stuffed Mushrooms | 7.0 | Appetizer |
| 5 | Garden Buffet | 9.99 | Main |
| 6 | House Salad | 7.0 | Main |
| 7 | Chef's Salad | 9.0 | Main |
| 8 | Quinoa Salmon Salad | 9.99 | Main |
| 9 | Classic Burger | 9.99 | Main |


#### **`DISTINCT` - Removing Duplicates** 

Great work changing those column names with aliases! We're going to change gears a little bit by taking a look at the `Type` column from `Dishes`. This column specifies the category of the particular dish, such as whether it is an appetizer or main course. The values in `Type` repeat a lot, which indicates that there are fewer types than there are dishes on the menu (which would make sense!) 

I want to see a list of the types that are available in the `Dishes` table. Since there aren't many rows, we could do this just by eyeballing the results and creating a list by hand. But what if there were hundreds or thousands of rows? We couldn't accurately eyeball that. This is a good case to introduce you to a new clause, `DISTINCT`. The `DISTINCT` clause will remove duplicate rows from a query based on the columns you specify. 

Here is the general query form:

```sql
SELECT DISTINCT column1, column2, column3
FROM some_table
```

What will happen here is SQL will query the table using only the columns specified. Then, it will look at the rows *from that query result* and remove any duplicates that it finds. The final result will have no duplicate rows. 
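The sketch below illustrates that `DISTINCT` looks at the whole row of selected columns, not at each column separately. It assumes Python's built-in `sqlite3` module in place of the notebook's `duckdb` connection, uses four hand-typed rows from `Dishes`, and adds an `ORDER BY` (not covered yet in this lesson) only so the output order is predictable:

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the notebook's duckdb setup
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Dishes (Name TEXT, Price REAL, Type TEXT)")
con.executemany(
    "INSERT INTO Dishes VALUES (?, ?, ?)",
    [
        ("Parmesan Deviled Eggs", 8.0, "Appetizer"),
        ("Panko Stuffed Mushrooms", 7.0, "Appetizer"),
        ("French Onion Soup", 7.0, "Main"),
        ("House Salad", 7.0, "Main"),
    ],
)

# One column: each type appears once
distinct_types = con.execute(
    "SELECT DISTINCT Type FROM Dishes ORDER BY Type"
).fetchall()

# Two columns: each (Type, Price) combination appears once
distinct_pairs = con.execute(
    "SELECT DISTINCT Type, Price FROM Dishes ORDER BY Type, Price"
).fetchall()

print(distinct_types)
print(distinct_pairs)
```

With one column there are two distinct rows. With two columns there are three, because the two `Main` dishes share a price and collapse into one row, while the two `Appetizer` dishes have different prices and stay separate.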

>**Pro-Tip:**
> `DISTINCT` is an expensive operation, which means that it can take a lot of resources to process and complete. It should be used sparingly. If you happen to find yourself reaching for `DISTINCT` to help remove duplicates, first double-check that your query is actually correct. Often, it is used to cover up minor errors in query logic. Finding duplicate rows can also indicate an issue with the underlying data model. If you find yourself constantly having to use `DISTINCT` on a table just to get the correct results, then there is likely something wrong with the data. In short, only use `DISTINCT` when it's needed.

**Action Item**

Using the `Dishes` table, write a query using just the `Type` column that shows the distinct dish types that are available.

In [36]:
%%sql

SELECT DISTINCT Type
FROM Dishes

|   | Type |
|---|---|
| 0 | Appetizer |
| 1 | Main |
| 2 | Dessert |
| 3 | Beverage |


#### Data Retrieval - Comprehension Check

Great job in this section! You learned how to write basic select queries against tables in a database, how to create aliases for column and table names, and how to remove duplicate rows in your query using `DISTINCT`. Let's do some review over what we've learned so far:

**1. What is the difference between a query that uses `SELECT *` and one that uses `SELECT col1, col2`?**
<details>
    <summary>Click here to reveal the answer</summary>
    <p>The query using SELECT * will automatically include all of the columns from the table in the result set. On the other hand, when we use SELECT col1, col2, etc. we are specifying the columns that we want to return.</p>
</details>

**2. What is the clause that we use to create an alias? And is an alias name permanent?**
<details>
    <summary>Click here to reveal the answer</summary>
    <p>We use the AS clause to create an alias. Aliases are not permanent and are generally only for display purposes.</p>
</details>

**3. What does `DISTINCT` do and why should we only use it sparingly?**
<details>
    <summary>Click here to reveal the answer</summary>
    <p>DISTINCT removes duplicate rows in a query result. However, it should be used sparingly because it is a computationally expensive command to run, especially on large data sets.</p>
</details>

**4. In the cell below, write a query that shows the `FirstName`, `LastName`, `City`, `State`, and `Birthday` for customers in the `Customers` table. We have to give the results to the restaurant's owners, so the column names should look neat. Alias the `FirstName` and `LastName` columns so that they say `First Name` and `Last Name`, respectively.** 
<details>
    <summary>Click here to reveal the answer</summary>
    <p>SELECT FirstName AS "First Name", LastName AS "Last Name", City, State, Birthday</p>
    <p>FROM Customers</p>
</details>

In [40]:
%%sql

SELECT FirstName AS "First Name", LastName AS "Last Name", City, State, Birthday
FROM Customers

|   | First Name | Last Name | City | State | Birthday |
|---|---|---|---|---|---|
| 0 | Maggi | Domney | San Bernardino | CA | 1938-10-11 |
| 1 | Javier | Dawks | Hartford | CT | 1953-11-21 |
| 2 | Aleen | Fasey | Boca Raton | FL | 1900-08-10 |
| 3 | Taylor | Jenkins | Fort Lauderdale | FL | 1961-05-02 |
| 4 | Imogen | Kabsch | Anderson | SC | 1919-08-27 |
| 5 | Don | Weingarten | Columbus | GA | 1919-07-19 |
| 6 | Cammi | Kynett | Washington | DC | 1927-03-06 |
| 7 | Steffie | Kleis | Evansville | IN | 1957-12-27 |
| 8 | Carilyn | Calver | Dulles | VA | 1960-12-17 |
| 9 | Barbara-anne | Sweet | San Antonio | TX | 1911-04-01 |


## Practice: More queries with constraints

Write a SQL statement to display all the information for those customers with a grade of 200.

Table: `customer`

In [None]:
%%sql



## Basic Pattern Matching

Pattern matching is a powerful feature in SQL that allows you to search for patterns within text data. The `LIKE` operator is used in SQL to perform pattern matching. Here are some of the basics of pattern matching using the `LIKE` operator together with the `%` and `_` wildcards:

#### `LIKE` with the `%` Operator
The `%` is known as a wildcard operator. This means that it can be used to match zero or more characters (including no characters at all) in a given string. Here are some examples of how we can use it to filter for text patterns in our data.

The following code:

```sql
SELECT *
FROM table_name
WHERE column_name LIKE 'COOP%';
```

will filter for and match any strings that are like the following:
```
“COOP”
“COOP Careers”
“COOP Careers: Overcoming Underemployment”
```

Essentially, this *filters for any text string that starts with* the word “COOP”. We can also use the `%` operator to *filter for words within a given text string.*

The following code:

```sql
SELECT *
FROM table_name
WHERE column_name LIKE '%COOP%';
```

will filter for and match any strings that are like the following:
```
“COOP Careers”
“COOP”
“It’s pronounced ‘COOP’ not ‘COOP’”
```
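As a runnable sketch of both `%` patterns, the example below assumes Python's built-in `sqlite3` module in place of the notebook's `duckdb` connection, and a hypothetical `phrases` table with made-up rows. SQLite's `LIKE` ignores ASCII letter case by default, while other engines may treat `LIKE` as case-sensitive, so the data is chosen so that case does not affect the result:

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the notebook's duckdb setup
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE phrases (phrase TEXT)")  # hypothetical table
con.executemany(
    "INSERT INTO phrases VALUES (?)",
    [("COOP",), ("COOP Careers",), ("Join COOP today",), ("Careers",)],
)

# 'COOP%': the string must start with COOP, followed by zero or more characters
starts_with = sorted(
    row[0] for row in con.execute("SELECT phrase FROM phrases WHERE phrase LIKE 'COOP%'")
)

# '%COOP%': COOP may appear anywhere in the string
contains = sorted(
    row[0] for row in con.execute("SELECT phrase FROM phrases WHERE phrase LIKE '%COOP%'")
)

print(starts_with)
print(contains)
```

Note that the bare string `COOP` matches both patterns, since `%` is allowed to match zero characters, and that `Join COOP today` only matches once a `%` is placed in front of the word.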
#### `LIKE` with the `_` Operator

The `_` operator is also a wildcard operator, just like `%`. The difference is that `_` matches exactly one character.

The following code:

```sql
SELECT *
FROM table_name
WHERE column_name LIKE 'COOP_';
```

will filter for and match any strings that are like the following:

```
“COOP ”
“COOP1”
```

but it will not match **“COOP”** (there is no character after the final "P" for `_` to match), **“COOP12”**, or **“COOP12345”**.
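To check the one-character rule directly, here is a minimal sketch. It assumes Python's built-in `sqlite3` module in place of the notebook's `duckdb` connection, and a hypothetical `phrases` table with made-up rows:

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the notebook's duckdb setup
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE phrases (phrase TEXT)")  # hypothetical table
con.executemany(
    "INSERT INTO phrases VALUES (?)",
    [("COOP",), ("COOP1",), ("COOPs",), ("COOP12",)],
)

# 'COOP_': COOP followed by exactly one character
one_char = sorted(
    row[0] for row in con.execute("SELECT phrase FROM phrases WHERE phrase LIKE 'COOP_'")
)

print(one_char)
```

Only the five-character strings come back. `COOP` is too short because `_` must match something, and `COOP12` is too long because `_` matches only one character.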


-------------------------------------------------------------------------------

## Practice: Pattern Matching Queries

table: `customer`

1. Using the `customer` table, write a query that displays all the customers with the first name **Brad**:

In [None]:
%%sql


2. Write a query that displays all cities that have the letter "i" in their name.

*(e.g. Paris: there is an "i" in Paris, so it is in the results)*

In [None]:
%%sql


3. Write a query that filters for rows that match this pattern: `Ne* York`

The asterisk (`*`) here means that it can match any **single** letter or number.

In [None]:
%%sql

## More resources for further practice

- [SQL Bolt](https://sqlbolt.com/): The lessons here are a great introduction to SQL and you know the platform already!
- [Mode](https://mode.com/sql-tutorial/): A comprehensive SQL tutorial from beginner all the way to advanced SQL. There's even a data analytics with SQL tutorial. This is a great resource to learn about SQL in depth and practice what you learn in their online database.
- [StrataScratch](https://platform.stratascratch.com/coding): Practice coding questions geared toward data analysts and data scientists. You can solve coding problems used by real companies for technical interviews using PostgreSQL, Python, R, or MySQL. It's free to sign up!
- [Codecademy - Free Learn SQL Course](https://www.codecademy.com/learn/learn-sql): Codecademy is another great resource to learn SQL as well as most other languages. There are a lot of free resources here that can help you learn SQL, Python, R, and many other languages.
- [Socratica SQL (YouTube)](https://www.youtube.com/watch?v=nWyyDHhTxYU&list=PLih4ch-U2DiBbMoFK4ML9faT3k3MM2UQY): This is a great playlist that will get you started learning SQL with one of the most popular relational databases - Postgres.
- [DB Fiddle](https://dbfiddle.uk/): This site is like a SQL scratch pad. You can use it to practice doing stuff like creating tables and inserting data into them, and all sorts of other stuff that you might not be able to do so freely in a live database. It's a sandbox, basically. Here are a couple of links to fiddles with some data in them to play with: [fiddle 1](https://dbfiddle.uk/?rdbms=postgres_13&fiddle=366b683701596d3f7459b0411c15acd1) and [fiddle 2](https://dbfiddle.uk/?rdbms=postgres_13&fiddle=dfffc1939f629d9286c55d732fb656c5).


And don't forget to keep your [SQL Cheatsheet](https://martinmarroyo.github.io/sqlcheatsheetandresources-coop/) handy!