<img src="./img/Dolan.png" width="180px" align="right">

# **DATA 6510**
# **Lesson 1: Database Systems**
_A view from 40,000 feet_



## **Learning Objectives**
### **Theory / Be able to explain ...**
- Data Lakes, Data Warehouses, Data Marts
- Features and components of database systems
- Data models and data integrity
- Functions of a DB Management System
- Terminology like apps, layers, DBMS, SQL, metadata, etc.


### **Skills / Know how to ...**
- Identify the parts of a database table
- Use keys to match records from separate tables
- Run SQLite queries in a Jupyter notebook

## **BIG PICTURE: Business Intelligence = 80% Data, 20% Visualization**



While the title of this course is "Data Warehousing and Visualization," it is really a first course in **Business Intelligence (BI)** technology and practice.

Business intelligence is about generating actionable insights from data. It informs decision making at all levels of a firm: Given what we know now, how did it get that way and what can we do to get the results we desire? In other words, how can we use data to make us *smarter*?  



Data visualization, probably *the* key feature for most BI apps, is actually the end of a sometimes long and complex **data pipeline**. Given access to lots of preprocessed data, with all the anomalies smoothed out, a dashboard of Key Performance Indicators or a report in Excel allows us to see whatever stories the data can tell. That's very important $-$ literally why we collect, clean, and protect the data in the first place $-$ but most of the actual work lies in the pipelines, not the visuals.

![BI Pipelines](./img/L9_Data_Pipelines.png)

A data pipeline starts (on the left) with raw source data, which can take on many forms. Ideally this data is collected into a holding area called a **data lake**. (Fun fact: it's called a "lake" because it's where you fish for useful facts.) While there may be some effort to validate and catalog the data in the lake, an effort is made to keep it as pristine as possible.

 

A **data warehouse** (in the middle) is a repository of analytical data that has been:
- **Extracted** from a variety of data sources
- **Transformed** to suit the analytical uses
- **Loaded** into a database management system that handles user access, storage, etc.

The ETL process itself is a subject of a whole course, but suffice it to say that with due care it produces the "one true, comprehensive source" of analytical data.
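
As a rough illustration of the three ETL steps, the sketch below uses Python's built-in `sqlite3` module, with an in-memory database standing in for the warehouse. The source records, the `transform` helper, and the `sales` table are all invented for this example; a real pipeline would involve far more validation.

```python
import sqlite3

# EXTRACT: hypothetical raw records pulled from two different sources
store_a = [("2019-01-02", "shirt", "4.50"), ("2019-01-02", "suit", "12.00")]
store_b = [("01/02/2019", "SHIRT", "4.75")]

def transform(row, us_dates=False):
    """TRANSFORM: standardize the date format, item label, and price type."""
    date, item, price = row
    if us_dates:  # convert MM/DD/YYYY to ISO YYYY-MM-DD
        month, day, year = date.split("/")
        date = f"{year}-{month}-{day}"
    return (date, item.lower(), float(price))

clean_rows = [transform(r) for r in store_a]
clean_rows += [transform(r, us_dates=True) for r in store_b]

# LOAD: write the cleaned rows into a (hypothetical) warehouse table
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (sale_date TEXT, item TEXT, price REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean_rows)
warehouse.commit()

# Analysts can now query one consistent table
summary = warehouse.execute(
    "SELECT item, COUNT(*), SUM(price) FROM sales GROUP BY item ORDER BY item"
).fetchall()
print(summary)
```

Note how the two sources disagree on date format and capitalization until the transform step reconciles them; that reconciliation is where most of the ETL effort usually goes.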

In cases where the volume and variety of warehoused data are too much for desktop software to handle, data may be extracted into **data marts** with datasets tailored to suit specific analytical uses.



This course touches on four basic themes, each centered on a fundamental skill:
1. **Databases**: how to use Structured Query Language (SQL) to extract and analyze data from a relational database
2. **Data Models:** how to express and analyze our assumptions and data requirements
3. **Data Warehouses:** how to use dimensional modeling to structure data for analytical uses
4. **Beyond SQL:** how to use advanced data models that extend beyond traditional relational models

The rest of this lesson kicks off with a discussion of database systems, followed by a hands-on tutorial in SQLite.

---
## **Database Systems: Three Different Perspectives**
Any system that relies on the use of a data store is considered a database system. With that very broad definition, just about any "smart" device or app you use today is a database system. A smart watch that collects and stores data about the wearer is a database system, as is an email client that retrieves and archives email for reading later or the point of sale system used at the bodega around the corner.

Here we will look at database systems three different ways:
- **Technical Architecture** describes hardware and software resources that need to be bought, installed, integrated, maintained, and secured.   
- **Software Architecture** describes the logical structure of the system and how data is processed.
- **Data Architecture** describes how the data itself is organized, used, and maintained.

---
## **Technical Architecture: Networks, Devices, Apps, and Servers**

![Enterprise Architecture](./img/L1_enterprise_architecture.png)

When viewed with an IT director's eye, technical resources include anything that has to be installed. Consider, for example, the technology needs of a small regional retail chain, as diagrammed above.

At each **retail location** (on the left), one would find a number of devices needed to complete sales, track inventory, and report to headquarters. While some of the technology might be proprietary, much of it would be licensed from vendors who specialize in retail systems.

At the right is the **corporate headquarters**, where functional managers and executive staff make decisions about marketing, human resources, supply chains, technology, etc. The needs of these executive offices are somewhat less industrial than the retail locations, with more of a need for historical data that can be analyzed offline. While they may, in fact, need to monitor individual transactions (e.g., if fraud is detected) they usually work with aggregated data (data marts) constructed to support specific kinds of decisions.

Somewhere in between the stores and headquarters is a **centralized data center**. Invisible to the users, this is where all the work is done to process and record the transactions (sales, incoming orders, outgoing shipments, etc.) that are the beating heart of the enterprise. It cannot be exaggerated how critical these central servers are: if they go down or are hacked then *everything* else reverts back to pencil and paper.

Connecting the various locations together is a **virtual private network (VPN)** that is secured using state of the art technology. Like the central servers, these networks are potentially always under attack. If a remote hacker is going to gain access to the systems then it is going to be over the VPN. (However, despite what you see in the movies, virtually all black hat hacking actually happens *within* the VPN rather than through some cryptographic hack of the network itself. Usually, a user exposes a password, installs a bit of malware, or is the criminal in question.)



### **Scale: Embedded / User / Workgroup / Enterprise**

Database systems come in different sizes, each with different needs and operating characteristics.

Some are so small that you barely know they are there, **embedded** in other hardware and software. So, for example, the scanning wand used by a point of sale system may keep a cache of recent scans. Or your sports watch keeps a record of your heart rate that is synced (uploaded and reset) through an app on your phone. Often, if the device is turned off (all the way off, not in hibernation) for a long enough period of time, the data is lost. However, with static memory and solid state disks becoming cheaper and smaller every day, even the smallest devices often come equipped with persistent storage that survives a reboot.

The next level up is data stored in files by an **end user app**. Such data will usually survive a system restart, though perhaps with some corruption if the system was writing data to storage at the time. In our example, the point of sale system may have a local storage mechanism so it can recover from power outages, void incorrect transactions, etc. Similarly, behind the scenes most desktop software stores data in caches, documents, or other kinds of files in order to improve the overall user experience.  

At a broader level are so-called **workgroup applications**, where data access is shared with a limited number of other users and devices on the same local area network (LAN). In our retail example, the back office systems and inventory systems might share a workgroup server that keeps track of recent activity. At the other end of the diagram, a similar setup connects the executive information systems and the analyst workstations to the data marts and file archives needed to do their work.

At the largest scale are the **enterprise** systems in the center of the diagram. They are not necessarily designed for raw speed but instead for throughput. The data on these servers may come from hundreds or even thousands of devices or users, and it is more important that each transaction complete correctly than that any particular transaction complete quickly.



### **Usage: Transaction Processing vs Analytical Processing**

In our retail example, there was a contrast between the operational systems used in the retail locations on the left and the decision support systems used by the headquarters on the right.

We call the kinds of work performed by the retail locations and the central data center **transaction processing**. The emphasis is on capturing *what is happening right now* in as much detail as needed and then storing it for posterity. The transaction server and database are thus designed for writing data quickly and accurately, without dropping any transactions due to bandwidth constraints or technical failures.

The work performed at headquarters, meanwhile, tends to be what we call **analytical processing**. Here the emphasis is on aggregating and understanding the transactions data and perhaps integrating it with other data collected elsewhere. These sorts of activities are more about data integration and communication, with read-only access to (scrubbed and aggregated) historical data in a data warehouse or data mart. Such systems may be nearly as large as the transaction systems, in that they contain the same basic volume of facts, but they do not have to support as many users and are less subject to data corruption. Read-only data is not corruptible. If it is corrupt then it was so when it was created.
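
The contrast can be sketched in a few lines of Python using the built-in `sqlite3` module. The `sales` table and its rows are made up for illustration, and a single in-memory database plays both roles here, whereas real systems would typically keep the two workloads on separate servers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, item TEXT, amount REAL)")

# Transaction processing: record each sale as it happens, one small write at a time
new_sales = [("north", "shirt", 4.5), ("north", "suit", 12.0), ("south", "shirt", 4.5)]
for sale in new_sales:
    with conn:  # commits on success, rolls back if the write fails
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", sale)

# Analytical processing: read-only aggregation over the accumulated history
totals = conn.execute(
    "SELECT store, COUNT(*) AS n_sales, SUM(amount) AS revenue "
    "FROM sales GROUP BY store ORDER BY store"
).fetchall()
print(totals)
```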



### **Security: Files / DBMS / Services**

We conclude our discussion of information technology with a note about the effect of architecture on privacy and security. As an illustration, please consider the three "houses" below, each of which is designed with security and privacy in mind. The first is Philip Johnson's world famous Glass House, which looks stunning but would not provide much in the way of privacy. In the center is Johnson's almost as famous brick Guest House, a windowless structure with a single door that would provide lots of privacy and security, except for the easily accessed skylights on the roof. Lastly, we have the bomb shelter on the right, which provides maximum security and privacy but only if one is willing to trade off natural light and air.

![Glass Brick Concrete](./img/L1_glass_brick_concrete.jpg)

**The Glass House is analogous to files on a local hard drive or in an email attachment.** Just as the thick panes of glass give the appearance of security but not any privacy, so does relying on file storage to keep your data safe. The contents of your files are visible to anyone with physical access to the storage device or network. While we can, of course, provide security ourselves $-$ with curtains for the house or encryption for the files $-$ doing so would potentially spoil the elegance and convenience of the original design. *So, unless you want to spend your time worrying about private data leaking out of your organization on thumb drives, email, etc. then do not rely on file storage to secure your files. It's about as insecure as it gets.*

**The brick Guest House is like direct access to a Database Management System (DBMS) over a secured local network.** By providing a single point of access (i.e., a thick wooden door), such systems make it fairly simple to secure the data behind a user authentication and authorization system. *However, just as the brick house's skylights can potentially be compromised, so can a database located on a local network drive. Its files may take longer to break into but with time can be compromised by those with the patience and skill to do so.*

**Finally, the bomb shelter is equivalent to an encrypted database that can only be accessed through a secure application server.** In this case, *all access* is limited to the specific commands (service requests) implemented by the app server. While in principle we could argue that a DBMS is itself an application server, in this case we have the ability to layer on more security beyond what is provided by the database vendor. It is certainly inconvenient, but is about as secure as it can possibly be.

**So, whether you are storing data in the cloud or on your local hard disk, be sure that you use a secured network, with a single point of access to the data. Even then, use encrypted file storage to provide one more layer of protection from those with physical access to the server. What you may lose in raw speed will be more than made up for with peace of mind.**
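
To make the "bomb shelter" idea a little more concrete, here is a minimal sketch of a service layer that exposes exactly one request and keeps the database connection to itself. The `CustomerService` class, its method, and the sample rows are hypothetical, and the sketch deliberately leaves out the authentication and encryption a real application server would add.

```python
import sqlite3

class CustomerService:
    """Hypothetical service layer: callers never touch the connection or write SQL."""

    def __init__(self):
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)"
        )
        self._conn.execute("INSERT INTO customers (name) VALUES ('Ada'), ('Grace')")
        self._conn.commit()

    def get_customer_name(self, customer_id):
        """The only request this service supports: look up one name by id."""
        if not isinstance(customer_id, int):
            raise ValueError("customer_id must be an integer")
        # The ? placeholder keeps user input out of the SQL text itself
        row = self._conn.execute(
            "SELECT name FROM customers WHERE customer_id = ?", (customer_id,)
        ).fetchone()
        return row[0] if row else None

service = CustomerService()
found = service.get_customer_name(1)
missing = service.get_customer_name(99)
print(found, missing)
```

Because callers can only issue the requests the service defines, there is no way for them to run an arbitrary query, which is the inconvenience and the protection described above.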

---
## **Software Architecture: Presentation / Logic / Data**

![Three Tiered Architecture](./img/L1_three_tiered_architecture.png)

**If we ignore networks and devices, then all remaining technology is software.** Conceptually, if technology is an organization's brain and central nervous system then software is its mind. The wiring, neurons, synapses and other parts exist to implement the thinking and executive processes needed to survive. Similarly, **information technology exists to implement the software** that makes it useful.

Virtually all modern software implements some variation of the three part structure shown above:
- The **presentation** tier (or layer) that the end user sees and interacts with. To many users this is the totality of the software. Perhaps the most familiar example is the web browser, which many Microsoft users of a certain age still call "the Internet" even though Internet Explorer has been officially dead for many years now.
- The **logic** tier that connects the user to other users, retains information that may be useful later, controls access to critical resources, etc. In its simplest form, it is defined by a set of actions or **requests** to be carried out by an application service. Continuing with our web example, each web page is assembled by the web browser through one or more requests made of the web server. Each request is received, authorized, and (potentially) executed, with a **response** (HTML, CSS, javascript, file, etc.) delivered back to the browser for presentation to the user.  
- The **data** tier that is responsible for reading and writing persistent data. If the entire system is rebooted from scratch, then all essential data should be restored from storage by the data tier. If any data cannot be restored then the data tier should initiate a **rollback** of any data that has become invalid because of the loss.
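
One common form of rollback happens within a single transaction: if any step fails, every earlier step in that transaction is undone. The sketch below shows this with Python's built-in `sqlite3` module; the `accounts` table and its balances are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('store', 100.0), ('supplier', 0.0)")
conn.commit()

try:
    with conn:  # one transaction: commit if every statement succeeds, else roll back
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'supplier'")
        # This second step violates the CHECK constraint, so the first is undone too
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'store'")
except sqlite3.IntegrityError as err:
    print("Transaction rolled back:", err)

balances = conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall()
print(balances)
```

After the failed transfer, both balances are back to their starting values, so the data never reflects a half-finished transaction.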

In our look at software we will take an operational view of database systems:
- How the three tiers cooperate (via requests and responses) to carry out a business transaction
- The operations, functions, and features of database management systems (DBMS)
- SQL Standards for relational database systems
- How we will use SQL in this class



### **DBMS Operations, Functions, and Features**

As we saw with the sales transaction, even everyday business gets pretty complicated when we have to implement it in software. From the perspective of the database, however, there are only four kinds of **operations**:
- **Create** (add) new data. Upon storage, the DBMS should respond with an identifier for retrieving the data later.
- **Retrieve** specified data. The request is often called a *query* and the response includes a data *payload* and perhaps some descriptive *metadata*. It is possible, depending on the query language, to return collections of data if needed.
- **Update** specified data. The request indicates what data is to be updated and how it is to be modified. The response indicates whether the update was successful.
- **Delete** specified data. The DBMS either deletes the data or returns an error code if the data cannot be deleted.   

These fundamental operations are commonly referred to by the acronym CRUD and are found in every database system regardless of the technology or use case.
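
The four operations can be sketched with Python's built-in `sqlite3` module. The `customers` table and its values are made up for illustration; the point is the shape of each request and response.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)

# CREATE: the DBMS hands back an identifier for the new row
cur = conn.execute(
    "INSERT INTO customers (name, city) VALUES (?, ?)", ("Ada", "Fairfield")
)
new_id = cur.lastrowid

# RETRIEVE: a query returns a payload (rows) plus metadata (column names)
cur = conn.execute("SELECT * FROM customers WHERE customer_id = ?", (new_id,))
columns = [d[0] for d in cur.description]
rows = cur.fetchall()

# UPDATE: rowcount reports how many rows were changed
updated = conn.execute(
    "UPDATE customers SET city = ? WHERE customer_id = ?", ("Bridgeport", new_id)
).rowcount

# DELETE: again, rowcount confirms what happened
deleted = conn.execute(
    "DELETE FROM customers WHERE customer_id = ?", (new_id,)
).rowcount
conn.commit()

print(new_id, columns, rows, updated, deleted)
```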

As DBMS technology has matured, the industry has agreed on a few standard functional definitions (shown below).  
  
![DBMS Functions and Features](./img/L1_dbms_features_functions.png)



### **SQL Database Standards**

**The *lingua franca* of DBMS technology is Structured Query Language.** It is the standard against which *every other* database technology is defined. Further, while there are dialect differences between the various SQL implementations, they are relatively rare, allowing most SQL queries to run unmodified between DBMS vendors. There may be quicker or easier vendor-specific ways to code a particular query, but there will also be a *standard* way.

**SQL is more than just a language.** Each DBMS vendor provides a *platform* with tools, apps, and other utility software. To keep the chaos at a minimum, SQL Standards include specifications for DBMS functionality like ...
- How to connect to a database and initiate a request
- How data is stored and organized
- How transactions are handled to prevent data corruption
- How user permissions are granted and revoked

We will run a few simple SQL queries in the next section and then again for pretty much every lesson in this course.



### **Jupyter and %sql Magic**

In this course we will interact with a variety of different database servers, but we are generally only going to use one database client: `jupysql` (the successor to `ipython-sql`) running right here inside our Jupyter notebooks.

By default, Jupyter assumes you will be writing code in Python. In order to run SQL in Jupyter *without* Python, we will use [`JupySql`](https://jupysql.ploomber.io/en/latest/quick-start.html), a Jupyter add-on that also goes by the name "%sql magic." It does exactly what the name implies, doing all the hard work (i.e., magic) to interact with databases using just SQL. Recalling our earlier discussion of the three houses, %sql magic connections are like the Brick House with direct interactions with a remote database.



We'll start with a tiny bit of Python to let Jupyter know that we are going to be using `jupysql`. The cell below, typically located near the top of the notebook, is used to enable (load) the %sql magic. Jupyter will let us know if it has already been loaded, but there is no harm in loading it twice.

![load sql magic](./img/L1_load_sql.png)

With %sql magic loaded we can now create and run SQL queries in any code cell. First, however, we will need to tell %sql magic where to find the database we want to connect to.

![SQLite in memory connection](./img/L1_sqlite_in_memory_connection.png)

You'll notice that the part after `%sql` looks a bit like a web URL, with a protocol (`sqlite`) followed by `://`. That's no accident. We call this a **connection string**, which includes all the information (protocol, user, password, server, database) needed to find and then connect to the database, which can reside just about anywhere on the Internet. In this case, we are actually working directly with a database *in memory* (i.e., no network, no files, ... right inside your browser), a trick that comes naturally to SQLite, which was originally designed for embedded use in tiny devices without file systems.
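
Because a connection string follows URL conventions, Python's standard `urllib.parse` module can pull one apart. In the sketch below the PostgreSQL-style URL, with its user, password, host, and database name, is entirely made up and is not used anywhere in this course. The last few lines show the plain `sqlite3` equivalent of an in-memory connection.

```python
import sqlite3
from urllib.parse import urlsplit

# A made-up connection string for a networked database (illustration only)
parts = urlsplit("postgresql://analyst:secret@db.example.com:5432/sales")
print("protocol:", parts.scheme)
print("user:    ", parts.username)
print("password:", parts.password)
print("server:  ", parts.hostname, parts.port)
print("database:", parts.path.lstrip("/"))

# SQLite needs none of that: ":memory:" asks for a throwaway in-memory database
conn = sqlite3.connect(":memory:")
answer = conn.execute("SELECT 1 + 1").fetchone()[0]
print(answer)
```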




Once we have a database connection, we use the `%%sql` magic invocation (sort of like *abracadabra*) at the top of a code cell to indicate that all code after the first line is SQL. For example, the screenshot below includes a bit of SQL to create (or recreate) a table of customer profile data.

![Create Table Example](./img/L1_create_table_example.png)

Again, the text below the `%%sql` invocation is written in SQLite-compatible SQL. We will come back to this soon enough in the ***SQL AND BEYOND*** tutorial at the end of the lesson.



---
## **Data Architecture: Entities, Attributes, Values, and Relationships**

The figure below shows three different views of the same data:
* A receipt from a dry cleaner order from January 2, 2019
* An entity relationship diagram showing how the data is organized
* A table of invoices from January 2, 2019

In this section we explore data from the ground up, starting with basic definitions and issues, then moving on to data modeling and database design, and concluding with actual SQL code to implement the design.   

![DeluxCare](./img/L1_DeluxCare.png)



### **Data $\rightarrow$ Information $\rightarrow$ Knowledge**

In data analytics we often treat "data" as a general term for whatever evidence we use to build and test models about the world. If it feeds our models then it's data. However, for the purposes of this course that sort of thinking is putting the cart before the horse. There is a lot of work needed before we can use data for our analyses, work that requires a somewhat more nuanced understanding of data.

In its most fundamental form, **data is just facts that have been encoded so they can be stored.** The encoding itself is called a data type, of which there are many possibilities. Data may be unstructured text ("Hi there!"), categorical labels (red, blue, green), numerical quantities (ordinal, integer, rational, real), coordinates (latitude/longitude), images (bitmapped pixels, jpg, png, etc.) or even just a blob of binary data. If it can be stored, then it is data.



**Information is data that has meaning.** So, while the location `41.1588° N, 73.2574° W` is data, it is not very informative until we note that it is where we would find Fairfield University on a map. We give data meaning by providing **metadata ("data about data")** about the context. Metadata includes things like the data type, what the data represents, when the data was recorded, and cross-references to related data that can further provide context. If we can interpret it, then data is information.



**Knowledge is how we use information to do things.** The data models that we build are knowledge. They convert information into actionable insights. So, if we were creating a routing algorithm to get to Fairfield University from anywhere else in the world, we would need information about the starting location, maps that capture the possible pathways for different kinds of transportation, and perhaps a desired arrival time. The routing algorithm is the knowledge, the rest is information.



**Database systems are designed to manage information and make it available to other apps.** At a minimum they:
- Store facts as persistent data
- Structure the facts with metadata (stored with the data)
- Allow access to the data for use by other systems (CRUD operations)

### **Data Integrity: Do you really know what's in the data you're consuming?**

Converting data into useful information is hard work. It is often estimated that as much as 80% of data analytics work is wrangling with data to prep it for use. If the dataset is small and the analytical models fairly simple, then we might barely notice having to clean up missing data, typos, or other problems.




From a database perspective, where we don't always know exactly how the data will be used, we focus on three kinds of **data integrity**:

- **Domain integrity** is about how facts are encoded. Is it in a way that makes sense for potential uses? Can an expert in the field recognize and interpret the data in a meaningful way?
- **Entity integrity** is about data storage and retrieval. If we are seeking the facts for a given situation, can we retrieve just those facts and no others?
- **Referential integrity** is about how facts relate to each other. Do any and all cross-references connect the right facts? Are there any invalid cross-references that point to missing data? Are any required relationships missing entirely?  
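
A relational DBMS can enforce a simplified version of each kind of integrity through table constraints. The sketch below is one way to see that in SQLite; the two tables and the three deliberately bad inserts are hypothetical, and constraints like these catch only the mechanical part of each integrity problem, not shifts in meaning.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,                 -- entity integrity
    name        TEXT NOT NULL
);
CREATE TABLE invoices (
    invoice_id  INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customers (customer_id),  -- referential integrity
    amount      REAL CHECK (amount >= 0)             -- domain integrity
);
INSERT INTO customers VALUES (1, 'Ada');
""")

bad_inserts = {
    "domain": "INSERT INTO invoices VALUES (10, 1, -5.00)",         # negative amount
    "entity": "INSERT INTO customers VALUES (1, 'Duplicate Ada')",  # reused primary key
    "referential": "INSERT INTO invoices VALUES (11, 99, 20.00)",   # no customer 99
}

rejected = []
for kind, statement in bad_inserts.items():
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError as err:
        rejected.append(kind)
        print(f"{kind} integrity violation: {err}")

invoice_count = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
print(rejected, invoice_count)
```

All three inserts are refused, so the invoices table stays empty and the one valid customer is untouched.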



One of the most frustrating things about data integrity is that it tends to degrade over time:
- People's understanding of the data may shift, invalidating domain integrity.
- Identifiers like names may change, making it hard to find exactly what is needed.
- Data is continually being added, updated, or deleted, with each transaction potentially causing a referential integrity error.

This is sort of like the database version of **entropy**, the law of physics that suggests that the universe will eventually wind up as a totally disorganized mess.

### **Data Models ... Once and Forever**

Maintaining data integrity is an active process that requires continual evolution of the database itself. That, in a nutshell, is the focus of database design.

All design starts and ends with modeling. In database design we focus on three kinds of data models:
- **Conceptual models** (diagrams) that permit visualization of the database structure before it is built. These are like blueprints for the database.
- **Logical models** (programs) that define processes for building, using, and maintaining the database.
- **Physical models** (technology) that define how the databases operate on real-world hardware.

*We will cover the basics of conceptual and logical design in lessons 4 and 5. Physical design is included here for completeness.*



#### **Conceptual Models**
The primary conceptual model used by database designers is the Entity Relationship Diagram (ERD). Interestingly, ERDs are not actually about data so much as the things (entities) the data is meant to describe. An **entity** is a *specific* thing we capture data about. Typically, the entity has a name or number that acts as a **unique identifier**; if any two things share the same identifier then neither is an entity. The facts that we attach to an entity are called **attributes**. Some attributes are special, used to refer to other entities, usually of a different type.

In the DeluxCare ERD below there are three types of entities (Customers, Invoices, and Garments), each shown as a rounded box. The name of the entity type is shown as a label at the top of the box. The attributes are then listed in the bottom portion of the box. If an attribute is to be used as an identifier then it has a `PK` (for *primary key*) listed to its left. Attributes that are used as cross-references are listed with an `FK` (for *foreign key*) next to them. The nature of the relationships between the entities is specified with special notations at either end of the connecting lines. So, for example, the line connecting `Customer` to `Invoice` specifies that each customer has zero or more invoices and that each invoice must relate to exactly one customer. (In fact, the `customer_id` foreign key on the Invoice entity is there because of the relationship. However, there is no corresponding `invoice_id` foreign key on the Customer entity. Why do you suppose that is?)

![DeluxCare ERD](./img/L1_DeluxCare_erd.png)



#### **Logical Models**

Logical models are more than diagrams. They are meant to be *runnable*, which in this class means SQL.

SQL database logic is said to be **relational**, which is a fancy way of saying that the data is organized into tables where:
- Each table corresponds to a given entity type
- Each row of the table corresponds to one *instance* of the entity type
- Each column contains one attribute (fact) for the instance
- Row and column order do not affect data meaning
- Each row has one (or more) column(s) called a *primary key* that can be used to retrieve the row after it is created
- Any cross-references are implemented as *foreign key* attributes that indicate the primary key value of the entity being referenced. Note: a table does not have to have a foreign key.

The design of the `invoices` table for the DeluxCare database is shown below.

![DeluxCare Invoices Table](./img/L1_DeluxCare_invoices_table.png)
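
To see how keys match records from separate tables, consider the sketch below. It builds two tiny tables in memory and joins them on `customer_id`. The column names and sample rows are assumptions made for illustration and are not copied from the actual DeluxCare database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE invoices  (invoice_id  INTEGER PRIMARY KEY,
                        customer_id INTEGER REFERENCES customers (customer_id),
                        total       REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO invoices  VALUES (100, 1, 18.50), (101, 1, 7.25), (102, 2, 30.00);
""")

# Match each invoice to its customer: the invoice's foreign key must equal
# the customer's primary key
matched = conn.execute("""
    SELECT i.invoice_id, c.name, i.total
    FROM invoices AS i
    JOIN customers AS c ON i.customer_id = c.customer_id
    ORDER BY i.invoice_id
""").fetchall()
print(matched)
```

The customer's name is stored once, in `customers`, yet appears next to every matching invoice in the result.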



#### **Physical Models**

The physical model of a given database is somewhat vendor-specific. Each DBMS vendor provides different options, depending on the needs of their customers. Here we will focus on just two of the many possible physical design decisions behind a given DBMS. Both relate to how data redundancy is used to make the database more reliable and performant.

The first decision is how the data is stored and backed up. One option is to keep the data in files on a local hard drive. MySQL, for example, stores each table as a single file buried somewhere on a hard disk. SQLite does something similar, except in a location of the user's choosing. In either case, one can back up the data by copying the actual database files, along with perhaps some metadata about the files themselves. The alternative to locally-accessible files is to hide all storage implementation behind a service, which may be the database management system itself. In this case, backups are often retrieved as SQL 'dump files' that contain instructions for recreating the data from scratch. Of course, since the server is managed by the vendor, they may also offer a paid backup service that uses proprietary technology of their own design.
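
Both backup styles can be tried out with Python's built-in `sqlite3` module, which offers `Connection.iterdump()` for SQL dumps and `Connection.backup()` for direct copies. The `garments` table and its single row below are placeholders invented for this sketch.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE garments (garment_id INTEGER PRIMARY KEY, description TEXT)"
)
source.execute("INSERT INTO garments (description) VALUES ('wool coat')")
source.commit()

# Option 1: a SQL "dump" is just the statements needed to rebuild the data
dump_sql = "\n".join(source.iterdump())
print(dump_sql)

# Replaying the dump on an empty database recreates the table and its rows
restored = sqlite3.connect(":memory:")
restored.executescript(dump_sql)
restored_rows = restored.execute("SELECT * FROM garments").fetchall()

# Option 2: copy the live database directly into another connection
copy = sqlite3.connect(":memory:")
source.backup(copy)
copied_rows = copy.execute("SELECT * FROM garments").fetchall()

print(restored_rows, copied_rows)
```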

The second decision is how many live copies of the data to keep. The classic approach is to use a centralized database server. In this case there is only one live copy of the data to secure, backup, etc. However, this can cause delays for remote users, where each DB request and response may have to travel a long route through the Internet, causing significant latency for the end users. The solution is to use a decentralized architecture, with multiple copies of the database distributed close to the end users. It is then the database system's job to keep the various copies in sync so that two users on opposite sides of the world aren't working with different "facts" for the same entities.   



---
## **Tech Spotlight: SQLite ... Files optional, no server required**

We have already been introduced to SQLite a couple of times in this lesson. Now let's go deeper with a runnable demonstration.


### **What Is SQLite?**

From the [SQLite.org](https://sqlite.org/index.html) website:
>SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day. More Information...
>
>The SQLite file format is stable, cross-platform, and backwards compatible and the developers pledge to keep it that way through the year 2050. SQLite database files are commonly used as containers to transfer rich content between systems [1] [2] [3] and as a long-term archival format for data [4]. There are over 1 trillion (1e12) SQLite databases in active use [5].
>
>SQLite source code is in the public-domain and is free to everyone to use for any purpose.

There is a lot to parse there. Some highlights:

- Implemented in C for maximum speed and minimal installation footprint.
- Very, very popular with device makers of all sorts, with over 1 trillion SQLite databases in use.
- Stores data in files or in memory (as mentioned earlier in the lesson).
- Minimal file size allows it to be used as a transfer container for shipping large data repositories from one place to another.
- There is no DBMS per se. Remote access, security protections, etc. have to be provided through a separate application server.

In this class we are using the `sqlite3` library that comes built into Python, which when combined with `%%sql` magic is the most convenient way to work with relational data in Jupyter.

> **What follows is a step-by-step tutorial. Each step has a brief explanation in Markdown followed by a code cell with some Python code for you to run right here in Jupyter. Run each of the code cells below, one at a time, as you read the text.**



### **1. Let Jupyter know that we will be using `%%sql` magic, `sqlite3`, and (possibly) `pandas`.**

This cell is mostly 'boilerplate' that we will run at the start of just about every lesson from now on.  

In [None]:
# Load %%sql magic
!pip install jupysql
%load_ext sql
%config SqlMagic.displaylimit = None

# Standard Imports
import sqlite3
import pandas as pd

- `!pip install jupysql` downloads and installs the `jupysql` package for use in Jupyter.
- `%load_ext sql` was explained above; it enables the `%%sql` magic from `jupysql`.   
- `%config SqlMagic.displaylimit = None` configures `%%sql` to show all rows selected. 
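- The two standard imports (`import ...`) make the `sqlite3` and `pandas` packages available. `sqlite3` is part of Python's standard library; `pandas` is a separate package that many Jupyter distributions (such as Anaconda) include.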

### **2. Initiate and test a SQLite connection.**

The cell below uses a special *connection string* to 'open' the database file for querying. Connection strings follow the [SQLAlchemy Database URL standards](https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls). For SQLite the format is `sqlite:///path/to/file` where `path/to/file` is a valid filepath *without spaces*. Most of the time connection URLs will be provided for you, but you should still pay attention to the way they are structured.

In [None]:
%sql sqlite:///data/DeluxCare/L1_DeluxCare.db
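
If you ever need to build one of these URLs yourself, the pattern is simply the prefix `sqlite:///` followed by the file path. The sketch below uses a hypothetical helper, `sqlite_url`, written just for this illustration (it is not part of `jupysql` or SQLAlchemy), and the path `/tmp/example.db` is made up.

```python
def sqlite_url(path: str) -> str:
    """Hypothetical helper: turn a file path into a SQLite connection URL."""
    if " " in path:
        # This lesson assumes file paths without spaces.
        raise ValueError("file path must not contain spaces")
    return "sqlite:///" + path


# A relative path is resolved from the notebook's working directory.
relative_url = sqlite_url("data/DeluxCare/L1_DeluxCare.db")

# An absolute Unix path starts with its own slash, giving four in a row.
absolute_url = sqlite_url("/tmp/example.db")

print(relative_url)
print(absolute_url)
```

The first URL printed is the same one used in the cell above.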

The cell below makes sure that the connection worked. It should return metadata listing the names and columns for two tables. The `sql` column shows the actual code needed to create each table in the database.

In [None]:
%%sql
SELECT * FROM sqlite_master
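
To see where that metadata comes from, here is a small sketch that creates two toy tables in an in-memory database and then reads them back out of `sqlite_master`. The table definitions are simplified assumptions that reuse column names from this lesson; the real DeluxCare tables have more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified, assumed schema (not the real DeluxCare definitions).
conn.executescript(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        first_date  TIMESTAMP
    );
    CREATE TABLE invoices (
        invoice_id  INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers (customer_id)
    );
    """
)

# sqlite_master keeps one row per object, including the SQL that created it.
catalog = conn.execute(
    "SELECT type, name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
conn.close()

for obj_type, name, create_sql in catalog:
    print(obj_type, name)
    print(create_sql)
```

SQLite stores the `CREATE TABLE` text itself, so the `sql` column doubles as documentation of each table's structure.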

### **3. Finally, we get to run some SQL queries.**

Now that we have access to the database let's explore a bit to see what we have. We'll title each query as a question. The code cells then answer the questions. Don't forget to run the code before reading the remarks.

#### **What information is available about each customer?**

In [None]:
%%sql
SELECT *
FROM customers 
LIMIT 20

REMARKS:
- This is 100% real data. The customer names, addresses, and other identifying information were scrubbed to protect customer privacy.
- Not every customer is a person. Some accounts represent hotels, restaurants, or other businesses.  
- Like all of the queries in this lesson, we are using a `SELECT` statement. The `FROM` clause tells the query which table(s) to use.
- The rows happen to be displayed in chronological order here, but without an `ORDER BY` clause SQL does not guarantee any particular row order. (See the next query.)

ANSWER: Inspect the resultset for sample data. Note that the columns do not match the ERD exactly!

#### **Which are the oldest customer accounts?**

In [None]:
%%sql
SELECT *
FROM customers
ORDER BY first_date
LIMIT 10;

REMARKS:

- Customer 1 is actually a special account used to handle cash transactions that do not have a proper invoice. That's why it does not have a `first_date`.
- Similarly, customers 2, 3, and 4 represent industrial accounts that predate the system. December 30, 1899 is the earliest date the POS system could handle.
- All dates are encoded as TIMESTAMPs that include the time of day. In this case, they are all treated as happening at midnight.
- The `ORDER BY` clause sorts the data using the `first_date` column.
- The `LIMIT` clause tells the query to return up to 10 rows of data.   

ANSWER:  Customers 5 and 6 have the longest-running customer accounts. Both were created on February 2, 1997. The oldest account with activity in 2020 was customer 7. It really is remarkable how loyal these customers have been over the years.
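
Why does an account with no `first_date` show up at the top of an "oldest first" list? In SQLite, missing values (`NULL`) sort before everything else in ascending order. The sketch below shows this with a handful of made-up rows; only the column names are borrowed from the lesson.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_date TEXT)")

# Toy rows, deliberately inserted out of date order.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        (9, "2019-06-01 00:00:00"),
        (5, "1997-02-02 00:00:00"),
        (1, None),  # no first_date at all
        (2, "1899-12-30 00:00:00"),
    ],
)

# ORDER BY sorts ascending by default; LIMIT caps the number of rows returned.
oldest = conn.execute(
    "SELECT customer_id, first_date FROM customers ORDER BY first_date LIMIT 3"
).fetchall()
conn.close()

print(oldest)
```

The row with the missing date comes first, followed by the two earliest real dates. The fourth row is cut off by `LIMIT 3`.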

#### **How many transactions (invoices) did customer 5 have in 2019?**

In [None]:
%%sql
SELECT count(invoice_id)
FROM invoices
WHERE customer_id=5;

REMARKS:

- We had to use the `invoices` table to answer this question. Each invoice has an `invoice_id` (primary key) and a `customer_id` (foreign key).
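- The `WHERE` clause restricts the rows included in the query to just those with `customer_id` equal to 5.
- Note that the query has no date filter. It answers the question as asked only if the `invoices` table holds nothing but 2019 invoices.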

ANSWER: Customer 5 had two transactions in 2019.
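
If a table held invoices from several years, the `WHERE` clause would need a second condition to pick out a single year. Here is a sketch of how that might look. The `invoice_date` column and all of the rows are assumptions made up for this example; check the real table's columns before borrowing the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    "invoice_id INTEGER PRIMARY KEY, customer_id INTEGER, invoice_date TEXT)"
)

# Toy rows; invoice_date is a hypothetical column name.
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [
        (1, 5, "2018-11-03 00:00:00"),
        (2, 5, "2019-01-15 00:00:00"),
        (3, 5, "2019-07-22 00:00:00"),
        (4, 8, "2019-07-22 00:00:00"),
    ],
)

# One condition: every invoice for customer 5, regardless of year.
all_years = conn.execute(
    "SELECT count(invoice_id) FROM invoices WHERE customer_id = ?", (5,)
).fetchone()[0]

# Two conditions joined with AND: customer 5, and only invoices dated 2019.
only_2019 = conn.execute(
    "SELECT count(invoice_id) FROM invoices "
    "WHERE customer_id = ? AND strftime('%Y', invoice_date) = '2019'",
    (5,),
).fetchone()[0]
conn.close()

print(all_years, only_2019)
```

With these toy rows the first count is 3 and the second is 2, because one of customer 5's invoices falls in 2018.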

#### **Which customers had the most transactions in 2019?**

In [None]:
%%sql
SELECT customer_id, count(invoice_id) as invoice_count
FROM invoices
GROUP BY customer_id
ORDER BY invoice_count DESC
LIMIT 10;

REMARKS:

- This query does the same job as an Excel pivot table that counts invoices per customer.
- The `AS` keyword gives the calculation `count(invoice_id)` an alias (name), `invoice_count`, that we can use later.
- The `GROUP BY` clause tells the query to do sub-counts broken down by `customer_id`.
- The `ORDER BY` clause sorts the groups in descending order (`DESC`) by `invoice_count`.

ANSWER: It looks like customer 1708 is the winner. Like the others on this list, it most certainly represents an industrial account. After all, who else goes to the cleaners multiple times per day?
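
If `GROUP BY` still feels abstract, it may help to see the same tally done "by hand." The sketch below runs the grouped query on a few made-up invoices and then repeats the count with Python's `collections.Counter`. The data is invented; only the table and column names come from the lesson.

```python
import sqlite3
from collections import Counter

# Made-up customer ids, one per invoice.
customer_ids = [1708, 42, 1708, 7, 42, 1708]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany("INSERT INTO invoices (customer_id) VALUES (?)", [(c,) for c in customer_ids])

# GROUP BY makes one output row per customer; count() tallies that customer's invoices.
top = conn.execute(
    """
    SELECT customer_id, count(invoice_id) AS invoice_count
    FROM invoices
    GROUP BY customer_id
    ORDER BY invoice_count DESC
    LIMIT 2
    """
).fetchall()
conn.close()

# The same tally in plain Python: count occurrences, keep the two largest.
by_hand = Counter(customer_ids).most_common(2)

print(top)
print(by_hand)
```

Both approaches report 3 invoices for customer 1708 and 2 for customer 42.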

#### **For each of the top 10 customers, how long have they been customers?**

In [None]:
%%sql
SELECT customer_id, date(first_date) as start_date, count(invoice_id) as invoice_count
FROM invoices JOIN customers USING (customer_id)
GROUP BY customer_id, first_date
ORDER BY invoice_count DESC
LIMIT 10;

REMARKS:

- We had to get data from both tables to answer this one.
- We used the `date()` function (and an alias) to convert the `TIMESTAMP` data to dates without the time of day.
- The `JOIN` expression in the `FROM` clause matches each invoice with a customer `USING` the `customer_id`.
- The `first_date` column was added to the `GROUP BY` so that it could be used in the `SELECT` clause. Some DBMSs don't require this precaution but it's better to be safe than sorry.
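
Here is the same join on a tiny made-up data set, small enough to check by eye. The rows are invented for illustration, and the sketch adds `customer_id` as a tie-breaker in the `ORDER BY` so that the output order is fully predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_date TEXT)")
conn.execute("CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY, customer_id INTEGER)")

# Toy data: three customers and six invoices.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(5, "1997-02-02 00:00:00"), (6, "1997-02-02 00:00:00"), (7, "2001-05-09 00:00:00")],
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?)",
    [(1, 5), (2, 5), (3, 7), (4, 7), (5, 7), (6, 6)],
)

# JOIN ... USING pairs each invoice with the customer row that shares its customer_id.
top = conn.execute(
    """
    SELECT customer_id, date(first_date) AS start_date, count(invoice_id) AS invoice_count
    FROM invoices JOIN customers USING (customer_id)
    GROUP BY customer_id, first_date
    ORDER BY invoice_count DESC, customer_id
    LIMIT 2
    """
).fetchall()
conn.close()

print(top)
```

Customer 7 comes out on top with 3 invoices, followed by customer 5 with 2. The `date()` function has trimmed the midnight time from each `first_date`.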

#### **That last result is hard to read. Can we have it as a nicely formatted table?**

In [None]:
_.DataFrame()

REMARKS:

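- This is a convenient Jupyter trick. The result of the most recently run cell is always kept in a special variable called `_`. In this case it's a `%sql` magic resultset, which can be converted to a pandas DataFrame as shown.
- The `_` trick **only works once** for each run of a `%%sql` cell. After the conversion cell runs, `_` holds the DataFrame rather than the resultset, so re-run the `%%sql` cell first if you need to repeat it.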

---
## **Congratulations! You've made it to the end of Lesson 1.**

There are just 8 more to go. The purpose of this lesson was to provide necessary context *before* diving into SQL in Lesson 2.