<img src="https://github.com/christopherhuntley/DATA6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **DATA 6510**
# **Lesson 1: Principles & Overview** 
_The whole course from 40,000 feet_

## **Learning Objectives**
### **Theory / Be able to explain ...**
- Importance of data for decision making
- Features and components of database systems 
- Data models and data integrity
- Functions of a DB Management System
- Terminology like apps, layers, DBMS, SQL, metadata, etc.


### **Skills / Know how to ...**
- Identify the parts of a database table
- Use keys to match records from separate tables
- Run SQLite queries in a Jupyter notebook


--------
## **LESSON 1 HIGHLIGHTS**

In [None]:
#@title Run this cell if video does not appear
%%html
<div style="max-width:1000px">
  <div style="position: relative;padding-bottom: 56.25%;height: 0;">
    <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  src="https://www.youtube.com/embed/-a9C4VWjr7Q" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
  </div>
</div>

## **BIG PICTURE: Where does data come from? Why do we care?** 

Data lives in a somewhat unique place in between technology and people. **Without people there is no data.** Data is a strictly artificial (i.e., human generated) artifact, our record of facts observed and imagined. It is also artificial in another sense, in that **it can't exist without technology.** Facts that remain locked in our heads are not data. They are just thoughts and memories that have to be **encoded into data** so they can be communicated, stored, updated, and (eventually) purged altogether when we don't need them anymore. Technology is how we do all that. Without it we are lost. 

In a real sense, understanding how data works and where it comes from gives us a peek inside how people think and, more specifically, how people make good decisions. Every ***rational*** decision involves at least four stages:
- A **stimulus** that prompts the need for action
- Collection of ***potentially* relevant facts** about alternative courses of action
- Development and application of **decision rationale** (models) informed by the facts and objectives
- Selection and **execution of a course of action** in keeping with the rationale  

Or, to put it another way, when trying to hit a target when it really counts, the best plan of action is more like Ready $\rightarrow$ Aim $\rightarrow$ Fire with information and intention instead of flailing about randomly in the dark with hunches based on nothing, explainable to no one.

Good leaders *who have to be accountable to others* can explain *why* they make decisions, *what* supporting evidence was used, and *how* they can be persuaded to act differently. Otherwise, why would anyone in their right minds choose to follow them? A common thread here is access to relevant data, which allows them to formulate, validate, communicate, and act on their decisions. 

So, if you really want to succeed in business, it is best to treat data as a critical resource, worthy of continual investment of time, money, and attention. Can you access data when you need it? Can you trust what it is telling you? Is it relevant to what you need to know at the time? Can you integrate data from multiple sources? Can you then communicate decisions in a way that any rational person can agree (or disagree) with? If not, then expect lots of unfortunate surprises and in some cases outright failures. 

And where does that data reside? Hopefully, in a **database system** that has been designed to meet the specific needs of the people using it. In this lesson we will sample ideas from the lessons that follow, providing just enough information to explain where we are going and why we need to go there.   

> ### TLDR for the impatient
> * Access to data and information are fundamental to modern business
> * Management is about decision making
> * Good decisions require information and rationale
> * Good information requires relevant, accurate, and timely data  
> * Ideally, that data is managed in a database system that has been designed for the needs of those who use it
>
> $\Rightarrow$ Important to understand how databases work and interact with business applications, getting as close to original source data (i.e., the unadulterated truth) as you can manage




## **Database Systems: Three Different Perspectives**
Any system that relies on the use of a data store is considered a database system. With that very broad definition, just about any "smart" device or app you use today is a database system. A smart watch that collects and stores data about the wearer is a database system, as is an email client that retrieves and archives email for reading later or the point of sale system used at the bodega around the corner. 

In this lesson we will look at database systems three different ways:
- **Technical Architecture** describes hardware and software resources that need to be bought, installed, integrated, maintained, and secured.   
- **Software Architecture** describes the logical structure of the system and how data is processed. 
- **Data Architecture** describes how the data itself is organized, used, and maintained. 

---
## **Technical Architecture: Networks, Devices, Apps, and Servers**

![Enterprise Architecture](https://github.com/christopherhuntley/DATA6510/raw/master/img/L1_enterprise_architecture.png)

When viewed with an IT director's eye, technical resources include anything that has to be installed. Consider, for example, the technology needs of a small regional retail chain, as diagrammed above. 

At each **retail location** (on the left), one would find a number of devices needed to complete sales, track inventory, and report to headquarters. While some of the technology might be proprietary, much of it would be licensed from vendors who specialize in retail systems. 

At the right is the **corporate headquarters**, where functional managers and executive staff make decisions about marketing, human resources, supply chains, technology, etc. The needs of these executive offices are somewhat less industrial than the retail locations, with more of a need for historical data that can be analyzed offline. While they may, in fact, need to monitor individual transactions (e.g., if fraud is detected) they usually work with aggregated data (data marts) constructed to support specific kinds of decisions.

Somewhere in between the stores and headquarters is a **centralized data center**. Invisible to the users, this is where all the work is done to process and record the transactions (sales, incoming orders, outgoing shipments, etc.) that are the beating heart of the enterprise. It cannot be exaggerated how critical these central servers are: if they go down or are hacked then *everything* else reverts back to pencil and paper. 

Connecting the various locations together is a **virtual private network (VPN)** that is secured using state of the art technology. Like the central servers, these networks are potentially always under attack. If a remote hacker is going to gain access to the systems then it is going to be over the VPN. (However, despite what you see in the movies, virtually all black hat hacking actually happens *within* the VPN rather than through some cryptographic hack of the network itself. Usually, a user exposes a password, installs a bit of malware, or is the criminal in question.)

### **Scale: Embedded / User / Workgroup / Enterprise**

Database systems come in different sizes, each with different needs and operating characteristics.

Some are so small that you barely know they are there, **embedded** in other hardware and software. So, for example, the scanning wand used by a point of sale system may keep a cache of recent scans. Or your sports watch keeps a record of your heart rate that is synced (uploaded and reset) through an app on your phone. Often, if the device is turned off (all the way off, not in hibernation) for a long enough period of time, the data is lost. However, with static memory and solid state disks becoming cheaper and smaller every day, even the smallest devices often come equipped with persistent storage that survives a reboot. 

The next level up is data stored in files by an **end user app**. Such data will usually survive a system restart, though perhaps with some corruption if the system was writing data to storage at the time. In our example, the point of sale system may have a local storage mechanism so it can recover from power outages, void incorrect transactions, etc. Similarly, behind the scenes most desktop software stores data in caches, documents, or other kinds of files in order to improve the overall user experience.  

At a broader level are so-called **workgroup applications**, where data access is shared with a limited number of other users and devices on the same local area network (LAN). In our retail example, the back office systems and inventory systems might share a workgroup server that keeps track of recent activity. At the other end of the diagram, a similar setup connects the executive information systems and the analyst workstations to the data marts and file archives needed to do their work. 

At the largest scale are the **enterprise** systems in the center of the diagram. They are not necessarily designed for raw speed but instead for throughput. The data on these servers may come from hundreds or even thousands of devices or users, and it is more important that each transaction complete correctly than that any particular transaction complete quickly. 

### **Usage: Transaction Processing vs Analytical Processing**

In our retail example, there was a contrast between the operational systems used in the retail locations on the left and the decision support systems used by the headquarters on the right. 

We call the kinds of work performed by the retail locations and the central data center **transaction processing**. The emphasis is on capturing *what is happening right now* in as much detail as needed and then storing it for posterity. The transaction server and database are thus designed for writing data quickly and accurately, without dropping any transactions due to bandwidth constraints or technical failures. 

The work performed at headquarters, meanwhile, tends to be what we call **analytical processing**. Here the emphasis is on aggregating and understanding the transaction data and perhaps integrating it with other data collected elsewhere. These sorts of activities are more about data integration and communication, with read-only access to (scrubbed and aggregated) historical data in a data warehouse or data mart. Such systems may be nearly as large as the transaction systems, in that they contain the same basic volume of facts, but they do not have to support as many users and are less subject to data corruption. Read-only data is not corruptible. If it is corrupt then it was so when it was created. 
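
To make the contrast concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `sales` table, store names, and amounts are made up for illustration; a real retail system would be far larger and would usually keep the analytical copy in a separate data warehouse or data mart.

```python
import sqlite3

# In-memory database standing in for the transaction server's data store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, store TEXT, amount REAL)"
)

# Transaction processing: many small writes, one per sale as it happens.
for store, amount in [("Main St", 12.50), ("Main St", 7.25), ("Mall", 30.00)]:
    with conn:  # commits on success, rolls back if the insert fails
        conn.execute(
            "INSERT INTO sales (store, amount) VALUES (?, ?)", (store, amount)
        )

# Analytical processing: a read-only query that aggregates the history.
totals = conn.execute(
    "SELECT store, COUNT(*), SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()
print(totals)
```

The writes touch one row at a time, while the read summarizes every row at once. That difference in access pattern is the reason the two kinds of systems are designed differently.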

### **Security: Files / DBMS / Services**

We conclude our discussion of information technology with a note about the effect of architecture on privacy and security. As an illustration please consider the three "houses" below, each of which is designed with security and privacy in mind. The first is Philip Johnson's world famous Glass House, which looks stunning but would not provide much in the way of privacy. In the center is Johnson's almost as famous brick Guest House, a windowless structure with a single door that would provide lots of privacy and security, except for the easily accessed skylights on the roof. Lastly, we have the bomb shelter on the right, which provides maximum security and privacy but only if one is willing to trade off natural light and air.   

![Glass Brick Concrete](https://github.com/christopherhuntley/DATA6510/raw/master/img/L1_glass_brick_concrete.jpg)

**The Glass House is analogous to files on a local hard drive or in an email attachment.** Just as the thick panes of glass give the appearance of security but not any privacy, so does relying on file storage to keep your data safe. The contents of your files are visible to anyone with physical access to the storage device or network. While we can, of course, provide security ourselves $-$ with curtains for the house or encryption for the files $-$ doing so would potentially spoil the elegance and convenience of the original design. *So, unless you want to spend your time worrying about private data leaking out of your organization on thumb drives, email, etc. then do not rely on file storage to secure your files. It's about as insecure as it gets.*

**The brick Guest House is like direct access to a Database Management System (DBMS) over a secured local network.** By providing a single point of access (i.e., a thick wooden door), such systems make it fairly simple to secure the data behind a user authentication and authorization system. *However, just as the brick house's skylights can potentially be compromised, so can a database located on a local network drive. Its files may take longer to break into but with time can be compromised by those with the patience and skill to do so.*

**Finally, the bomb shelter is equivalent to an encrypted database that can only be accessed through a secure application server.** In this case, *all access* is limited to the specific commands (service requests) implemented by the app server. While in principle we could argue that a DBMS is itself an application server, in this case we have the ability to layer on more security beyond what is provided by the database vendor. It is certainly inconvenient, but is about as secure as it can possibly be. 

**So, whether you are storing data in the cloud or on your local hard disk, be sure that you use a secured network, with a single point of access to the data. Even then, use encrypted file storage to provide one more layer of protection from those with physical access to the server. What you may lose in raw speed will be more than made up for with peace of mind.** 



---
## **Software Architecture: Presentation / Logic / Data**

![Three Tiered Architecture](https://github.com/christopherhuntley/DATA6510/raw/master/img/L1_three_tiered_architecture.png)

**If we ignore networks and devices, then all remaining technology is software.** Conceptually, if technology is an organization's brain and central nervous system then software is its mind. The wiring, neurons, synapses and other parts exist to implement the thinking and executive processes needed to survive. Similarly, **information technology exists to implement the software** that makes it useful. 

Virtually all modern software implements some variation of the three part structure shown above:
- The **presentation** tier (or layer) that the end user sees and interacts with. To many users this is the totality of the software. Perhaps the most familiar example is the web browser, which many Microsoft users of a certain age still call "the Internet" even though Internet Explorer has been officially dead for many years now. 
- The **logic** tier that connects the user to other users, retains information that may be useful later, controls access to critical resources, etc. In its simplest form, it is defined by a set of actions or **requests** to be carried out by an application service. Continuing with our web example, each web page is assembled by the web browser through one or more requests made of the web server. Each request is received, authorized, and (potentially) executed, with a **response** (HTML, CSS, JavaScript, file, etc.) delivered back to the browser for presentation to the user.  
- The **data** tier that is responsible for reading and writing persistent data. If the entire system is rebooted from scratch, then all essential data should be restored from storage by the data tier. If any data cannot be restored then the data tier should initiate a **rollback** of any data that has become invalid because of the loss. 

In our look at software we will take an operational view of database systems:
- How the three tiers cooperate (via requests and responses) to carry out a business transaction 
- The operations, functions, and features of database management systems (DBMS)
- SQL Standards for relational database systems
- How we will use SQL in this class 

### **The Transaction Lifecycle**

Consider this everyday sales transaction at a mom-and-pop retailer near you:
1. The customer selects a few items off the shelves and then approaches the cashier to check out. 
2. The cashier asks for the customer's phone number or other identifying account information to "make future checkouts easier."   
    2a. If the customer refuses ("the number is unlisted") then the cashier enters a dummy customer number (like "000000") and continues on with the transaction.    
    2b. If the customer supplies a phone number, then the cashier looks up the customer in the system. If the customer does not exist in the system then the cashier asks for a name and creates a new customer account.  
3. The cashier rings up the items and calculates a total. 
4. The customer pays with cash or credit card. 
5. The system confirms the transaction as valid and complete. 
6. The cashier offers a receipt and tells the customer to have a good day. 

It seems pretty simple, right? Now let's look at the same transaction as a set of requests and responses between the Point of Sale terminal, the Transaction Server, and the Database Server. To keep things simple, let's assume that the customer does not have an account but is willing to set one up. Each arrow on this UML sequence diagram is a request (solid line) or a response (dashed line) from one system to another. The logical order is always top to bottom, with interactions at the top occurring before the ones below them. 

![Sales Transaction](https://github.com/christopherhuntley/DATA6510/raw/master/img/L1_sales_transaction.png)

Note that most server responses come only after issuing requests for help from the database. The requests are in SQL, of course, with responses that may include data or just a response code (e.g., "OK"). 

So ... it's not so simple after all! Here all of the requests succeed (with data or an "OK" response) but the system needs to also handle failed requests. We also didn't consider the possible transactions with the credit card company. Depending on the system design, those may be processed by the POS terminal or the Transaction Server. 

And if the system suffers a catastrophic failure, where does it look to start a reboot? With whatever data is in the database. In most cases, that's just fine. However, if the database itself shuts down in the middle of recording a transaction then it needs to *roll back* its data to just before the failure and notify the server of the error, which then gets reported to the sales terminal. We'll consider such cases in Lesson 8. 
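
As a preview, the sketch below shows the all-or-nothing idea using Python's built-in `sqlite3` module. The table layout and the `record_sale` helper are hypothetical, invented here for illustration, and the "failure" is a simple constraint violation rather than a real crash.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL, total REAL NOT NULL)"
)
conn.commit()


def record_sale(conn, name, total):
    """Hypothetical helper: add a customer and a sale as one transaction."""
    try:
        with conn:  # commit if the block succeeds, roll back if it raises
            cur = conn.execute("INSERT INTO customers (name) VALUES (?)", (name,))
            conn.execute(
                "INSERT INTO sales (customer_id, total) VALUES (?, ?)",
                (cur.lastrowid, total),
            )
        return "OK"
    except sqlite3.IntegrityError as err:
        return f"ERROR: {err}"


ok = record_sale(conn, "Pat", 19.99)
failed = record_sale(conn, "Sam", None)  # missing total violates NOT NULL
customer_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(ok, "|", failed, "|", customer_count)
```

The second call fails on the sale, so the customer row that was inserted a moment earlier is undone as well. Only the first customer remains, and the caller gets an error response it can pass back to the sales terminal.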

### **DBMS Operations, Functions, and Features** 

As we saw with the sales transaction, even everyday business gets pretty complicated when we have to implement it in software. From the perspective of the database, however, there are only four kinds of **operations**: 
- **Create** (add) new data. Upon storage, the DBMS should respond with an identifier for retrieving the data later. 
- **Retrieve** specified data. The request is often called a *query* and the response includes a data *payload* and perhaps some descriptive *metadata*. It is possible, depending on the query language, to return collections of data if needed. 
- **Update** specified data. The request indicates what data is to be updated and how it is to be modified. The response indicates whether the update was successful. 
- **Delete** specified data. The DBMS either deletes the data or returns an error code if the data cannot be deleted.   

These fundamental operations are commonly referred to by the acronym CRUD and are found in every database system regardless of the technology or use case. 
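
Here is a minimal sketch of the four operations, again using Python's built-in `sqlite3` module with an in-memory database. The `customers` table and its contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, phone TEXT)"
)

# Create: the DBMS hands back an identifier for finding the row later.
cur = conn.execute(
    "INSERT INTO customers (name, phone) VALUES (?, ?)", ("Pat", "555-0100")
)
new_id = cur.lastrowid

# Retrieve: the query returns a payload plus metadata (here, column names).
cur = conn.execute(
    "SELECT name, phone FROM customers WHERE customer_id = ?", (new_id,)
)
columns = [col[0] for col in cur.description]
row = cur.fetchone()

# Update: the response says how many rows were changed.
updated = conn.execute(
    "UPDATE customers SET phone = ? WHERE customer_id = ?", ("555-0199", new_id)
).rowcount

# Delete: again, the response reports how many rows were affected.
deleted = conn.execute(
    "DELETE FROM customers WHERE customer_id = ?", (new_id,)
).rowcount
remaining = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

print(new_id, columns, row, updated, deleted, remaining)
```

Note that the four operations map onto the SQL `INSERT`, `SELECT`, `UPDATE`, and `DELETE` commands.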

As DBMS technology has matured, the industry has agreed on a few standard functional definitions (shown below). We will consider many of these in more detail in Lessons 7 and 8. 

Within and beyond these standards, there is plenty of room for DBMS vendors to innovate. We will consider vendor-specific features (also shown below) in our discussion of NoSQL and Distributed DBMS technology in Lessons 11 and 12. 
  
![DBMS Functions and Features](https://github.com/christopherhuntley/DATA6510/raw/master/img/L1_dbms_features_functions.png)

### **SQL Database Standards**

**The *lingua franca* of DBMS technology is Structured Query Language.** It is the standard against which *every other* database technology is defined. Further, while there are dialect differences between the various SQL implementations, they are relatively rare, allowing most SQL queries to run unmodified between DBMS vendors. There may be quicker or easier vendor-specific ways to code a particular query, but there will also be a *standard* way. 

**SQL is more than just a language.** Each DBMS vendor provides a *platform* with tools, apps, and other utility software. To keep the chaos at a minimum, SQL Standards include specifications for DBMS functionality like ...
- How to connect to a database and initiate a request
- How data is stored and organized
- How transactions are handled to prevent data corruption
- How user permissions are granted and revoked

We will run a few simple SQL queries in the next section and then again for pretty much every lesson in this course. 

### **Jupyter, Colab, and %sql Magic**

In this course we will interact with a variety of different database servers, but we are generally only going to use one database client: `ipython-sql` running right here inside our Jupyter notebooks. 

We first learned about Jupyter in Lesson 0. It is a programming and reporting environment that combines formatted text and runnable code organized as "notebook" documents. There are different flavors of Jupyter notebooks from various vendors. What follows assumes that you are using Google Colab, though most actions translate pretty well to the other Jupyter variants. 

Text is entered in Markdown format into text cells. If you double click on this text you can see Markdown formatting for yourself in a fairly large text cell (screenshot below). When open this way, the cell is editable. If you modify anything then the formatted text (displayed to the right) also changes. Double-click the formatted text to hide the markdown text. 

![Markdown Screenshot](https://github.com/christopherhuntley/DATA6510/raw/master/img/L1_markdown_screenshot.png)

Runnable code is entered into code cells, identified by icons to the left of the cell. Pristine code that has not been run yet appears with an empty box icon.

![New code cell](https://github.com/christopherhuntley/DATA6510/raw/master/img/L1_code_cell_screenshot1.png)

In this case, the code is in Python but as we will see Jupyter can run code in many different languages. Python just happens to be the default. We will be using *mostly* SQL in this class. 

To run the code, hover over the cell and press the circular run icon.

![Run code cell](https://github.com/christopherhuntley/DATA6510/raw/master/img/L1_code_cell_screenshot2.png)

After the code has been run (and you are no longer hovering over the cell), the box icon returns, this time with a number inside. 

![Output code cell](https://github.com/christopherhuntley/DATA6510/raw/master/img/L1_code_cell_screenshot3.png)

The number indicates the order in which the cells in a notebook were run. It is possible to run the cells in a different order than they appear. That allows you to go back and debug things as you go along. However, you should also do a "clean run" (top-down after resetting the code cells) from time to time to be sure that the notebook works as written. 

Here, try it yourself.  The cell below is live. Run it. 



In [None]:
print('Hi There!')

Congrats. For many of you this is your first Python code. We'll see a little more Python before we switch to SQL pretty much full time. 

In order to run SQL in Jupyter *without* Python, we will use [`ipython-sql`](https://github.com/catherinedevlin/ipython-sql), a Jupyter add-on that also goes by the name "%sql magic." It does exactly what the name implies, doing all the hard work (i.e., magic) to interact with databases using just SQL. Recalling our earlier discussion of the three houses, %sql magic connections are like the Brick House with direct interactions with a remote database. 

We'll start with a tiny bit of Python to let Jupyter know that we are going to be using `ipython-sql`. The cell below, typically located near the top of the notebook, is used to enable (load) the %sql magic. Colab will let us know if it has already been loaded but there is no harm in loading it twice. 

![load sql magic](https://github.com/christopherhuntley/DATA6510/raw/master/img/L1_load_sql.png)

With %sql magic loaded we can now create and run SQL queries in any code cell. First, however, we will need to tell %sql magic where to find the database we want to connect to. 

![SQLite in memory connection](https://github.com/christopherhuntley/DATA6510/raw/master/img/L1_sqlite_in_memory_connection.png)

You'll notice that the part after `%sql` looks a bit like a web URL, with a protocol (`sqlite`) followed by `://`. That's no accident. We call this a **connection string**, which includes all the information (protocol, user, password, server, database) needed to find and then connect to the database, which can reside just about anywhere on the Internet. In this case, we are actually working directly with a database *in memory* (i.e., no network, no files, ... right inside the notebook's runtime), a trick that is characteristic of SQLite, which was originally designed for embedded use in tiny devices without file systems. 
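
To see how the pieces of a connection string line up with those labels, here is a small sketch that takes one apart with Python's standard URL parser. The `describe` helper, the PostgreSQL server name, and the credentials are all made up for illustration. This is not how `ipython-sql` itself handles connection strings (it hands them to the SQLAlchemy library); the sketch only shows the URL-like structure.

```python
from urllib.parse import urlsplit


def describe(connection_string):
    """Hypothetical helper: split a connection string into labeled parts."""
    parts = urlsplit(connection_string)
    return {
        "protocol": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "server": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/") or None,
    }


# A made-up remote database: every part is filled in.
remote = describe("postgresql://analyst:s3cret@db.example.com:5432/sales")

# The in-memory SQLite connection: only the protocol is present.
in_memory = describe("sqlite://")

print(remote)
print(in_memory)
```

With nothing after `sqlite://`, there is no user, server, or database file to find, which matches the idea of a database that lives only in memory.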

Once we have a database connection, we use the `%%sql` magic invocation (sort of like *abracadabra*) at the top of a code cell to indicate that all code after the first line is SQL. For example, the screenshot below includes a bit of SQL to create (or recreate) a table of customer profile data. 

![Create Table Example](https://github.com/christopherhuntley/DATA6510/raw/master/img/L1_create_table_example.png)

Again, the text below the `%%sql` invocation is written in SQLite-compatible SQL. We will come back to this soon enough in the ***SQL AND BEYOND*** tutorial at the end of the lesson. 


---
## **Data Architecture: Entities, Attributes, Values, and Relationships**

The figure below shows three different views of the same data:
* A receipt from a dry cleaner order from January 2, 2019
* An entity relationship diagram showing how the data is organized
* A table of invoices from January 2, 2019

In this section we explore data from the ground up, starting with basic definitions and issues, then moving on to data modeling and database design, and concluding with actual SQL code to implement the design.   

![DeluxCare](https://github.com/christopherhuntley/DATA6510/raw/master/img/L1_DeluxCare.png)

### **Data $\rightarrow$ Information $\rightarrow$ Knowledge**

In data analytics we often treat "data" as a general term for whatever evidence we use to build and test models about the world. If it feeds our models then it's data. However, for the purposes of this course that sort of thinking is putting the cart before the horse. There is a lot of work needed before we can use data for our analyses, work that requires a somewhat more nuanced understanding of data. 

In its most fundamental form, **data is just facts that have been encoded so they can be stored.** The encoding itself is called a data type, of which there are many possibilities. Data may be unstructured text ("Hi there!"), categorical labels (red, blue, green), numerical quantities (ordinal, integer, rational, real), coordinates (latitude/longitude), images (bitmapped pixels, jpg, png, etc.) or even just a blob of binary data. If it can be stored, then it is data.

**Information is data that has meaning.** So, while the location `41.1588° N, 73.2574° W` is data, it is not very informative until we note that it is where we would find Fairfield University on a map. We give data meaning by providing **metadata ("data about data")** about the context. Metadata includes things like the data type, what the data represents, when the data was recorded, and cross-references to related data that can further provide context. If we can interpret it, then data is information. 

**Knowledge is how we use information to do things.** The data models that we build are knowledge. They convert information into actionable insights. So, if we were creating a routing algorithm to get to the Fairfield University from anywhere else in the world, we would need information about the starting  location, maps that capture the possible pathways for different kinds of transportation, and perhaps a desired arrival time. The routing algorithm is the knowledge, the rest is information. 

**Database systems are designed to manage information and make it available to other apps.** At a minimum they:
- Store facts as persistent data
- Structure the facts with metadata (stored with the data)
- Allow access to the data for use by other systems (CRUD operations)
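
The sketch below shows the difference between data and metadata in a small SQLite database, using Python's built-in `sqlite3` module. The `places` table is made up for illustration; `PRAGMA table_info` is SQLite's own command for listing a table's columns and is not part of standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (name TEXT, latitude REAL, longitude REAL)")
conn.execute(
    "INSERT INTO places VALUES (?, ?, ?)",
    ("Fairfield University", 41.1588, -73.2574),  # west longitude is negative
)

# The data: the stored facts themselves.
data = conn.execute("SELECT * FROM places").fetchall()

# The metadata: column names and declared types, kept alongside the data.
# Each table_info row is (cid, name, type, notnull, dflt_value, pk).
metadata = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(places)")]

print(data)
print(metadata)
```

Without the metadata, the row is just a string and two numbers. With it, we can tell that the numbers are a latitude and a longitude for a named place.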


### **Data Integrity: Do you really know what's in the data you're consuming?**

Converting data into useful information is hard work. It is often estimated that as much as 80% of data analytics work is wrangling with data to prep it for use. If the dataset is small and the analytical models fairly simple, then we might barely notice having to clean up missing data, typos, or other problems. However, once the datasets get larger, the problems get bigger as well. We will cover many of these problems in Lessons 4 and 9. For now, just know that as datasets get larger, the odds of running into every possible kind of data problem increase. For really big datasets, the odds are virtually 100% that there is at least one serious data problem to resolve before moving on to analytical modeling.  

From a database perspective, where we don't always know exactly how the data will be used, we focus on three kinds of **data integrity**:

- **Domain integrity** is about how facts are encoded. Is it in a way that makes sense for potential uses? Can an expert in the field recognize and interpret the data in a meaningful way? 
- **Entity integrity** is about data storage and retrieval. If we are seeking the facts for a given situation, can we retrieve just those facts and no others? 
- **Referential integrity** is about how facts relate to each other. Do any and all cross-references connect the right facts? Are there any invalid cross-references that point to missing data? Are any required relationships missing entirely?  
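
A DBMS can enforce a basic version of each kind of integrity through declared constraints. The sketch below is one way this might look in SQLite, using Python's built-in `sqlite3` module. The table layout, the rule that totals cannot be negative, and the `attempt` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- entity integrity: one row per id
    name        TEXT NOT NULL
);
CREATE TABLE invoices (
    invoice_id  INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customers(customer_id),  -- referential integrity
    total       REAL NOT NULL CHECK (total >= 0)    -- domain integrity
);
INSERT INTO customers VALUES (1, 'Pat');
""")


def attempt(sql):
    """Hypothetical helper: run a statement and report the DBMS's verdict."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError as err:
        return f"rejected: {err}"


results = {
    "domain": attempt("INSERT INTO invoices VALUES (10, 1, -5.00)"),
    "entity": attempt("INSERT INTO customers VALUES (1, 'Sam')"),
    "referential": attempt("INSERT INTO invoices VALUES (11, 99, 20.00)"),
    "valid": attempt("INSERT INTO invoices VALUES (12, 1, 20.00)"),
}
for kind, verdict in results.items():
    print(kind, "->", verdict)
```

The first three inserts are refused: a negative total, a duplicate customer identifier, and an invoice for a customer that does not exist. Constraints like these catch only the mechanical errors, though. Whether the data still *means* what its users think it means is not something the DBMS can check.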

One of the most frustrating things about data integrity is that it tends to degrade over time:
- People's understanding of the data may shift, invalidating domain integrity.
- Identifiers like names may change, making it hard to find exactly what is needed.
- Data is continually being added, updated, or deleted, with each transaction potentially causing a referential integrity error. 

This is sort of like the database version of **entropy**, the idea from physics (the second law of thermodynamics) that suggests that the universe will eventually wind up as a totally disorganized mess. 


### **Data Models ... Once and Forever**

Maintaining data integrity is an active process that requires continual evolution of the database itself. So, while it would be great to just think about data retrieval (i.e., SQL SELECTS), in this course we will by necessity spend a lot of time on **database design** and how to maintain databases using SQL CREATE, INSERT, UPDATE, and DELETE commands. 

All design starts and ends with modeling. In database design we focus on three kinds of data models:
- **Conceptual models** (diagrams) that permit visualization of the database structure before it is built. These are like blueprints for the database. 
- **Logical models** (programs) that define processes for building, using, and maintaining the database. 
- **Physical models** (technology) that define how the databases operate on real-world hardware. 

We will cover conceptual modeling in Lessons 5, 6, 9 and 10, logical modeling in Lessons 4, 7 and 8, and physical modeling in Lessons 11 and 12. However, we will conclude this section with a few essentials to get us started. 

#### **Conceptual Models**
The primary conceptual model used by database designers is the Entity Relationship Diagram (ERD). Interestingly, ERDs are not actually about data so much as the things (entities) the data is meant to describe. An **entity** is a *specific* thing we capture data about. Typically, the entity has a name or number that acts as a **unique identifier**; if any two things share the same identifier then neither is an entity. The facts that we attach to an entity are called **attributes**. Some attributes are special, used to refer to other entities, usually of a different type. 

In the DeluxCare ERD below there are three types of entities (Customers, Invoices, and Garments), each shown as a rounded box. The name of the entity type is shown as a label at the top of the box. The attributes are then listed in the bottom portion of the box. If an attribute is to be used as an identifier then it has a `PK` (for *primary key*) listed to its left. Attributes that are used as cross-references are listed with an `FK` (for *foreign key*) next to them. The nature of each relationship between entities is specified with special notations at either end of the connecting line. So, for example, the line connecting `Customer` to `Invoice` specifies that each customer has zero or more invoices and that each invoice must relate to exactly one customer. (In fact, the `customer_id` foreign key on the Invoice entity is there because of the relationship. However, there is no corresponding `invoice_id` foreign key on the Customer entity. Why do you suppose that is?)

![DeluxCare ERD](https://github.com/christopherhuntley/DATA6510/raw/master/img/L1_DeluxCare_erd.png)

#### **Logical Models**

Logical design focuses on two artifacts: 
- the tables that correspond to the entities in the ERDs
- the SQL code necessary to create the tables in the database

Since we will be working with live SQL in the next section, we will for now just consider table design. 

SQL database logic is said to be **relational**, which is a fancy way of saying that the data is organized into tables where:
- Each table corresponds to a given entity type
- Each row of the table corresponds to one *instance* of the entity type 
- Each column contains one attribute (fact) for the instance
- Row and column order do not affect data meaning
- Each row has one (or more) column(s) called a *primary key* that can be used to retrieve the row after it is created
- Any cross-references are implemented as *foreign key* attributes that indicate the primary key value of the entity being referenced. Note: a table does not have to have a foreign key. 

The design of the `invoices` table for the DeluxCare database is shown below. 

![DeluxCare Invoices Table](https://github.com/christopherhuntley/DATA6510/raw/master/img/L1_DeluxCare_invoices_table.png)
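
To make the roles of the keys concrete, here is a small sketch of how a primary key retrieves one row and how a foreign key value leads to the matching row in another table. The tables are simplified versions of the DeluxCare ones and every row is made up, so treat it as an illustration of the idea rather than the actual database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.row_factory = sqlite3.Row      # lets us read columns by name

# Simplified, hypothetical tables with made-up rows
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    first_date  TIMESTAMP
);
CREATE TABLE invoices (
    invoice_id  INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers (customer_id),  -- foreign key
    total       REAL
);
INSERT INTO customers VALUES (7, '1997-03-11'), (8, '1997-03-11');
INSERT INTO invoices VALUES (501, 8, 42.00), (502, 7, 18.75), (503, 8, 9.50);
""")

# The primary key retrieves exactly one invoice row ...
invoice = conn.execute(
    "SELECT * FROM invoices WHERE invoice_id = ?", (502,)
).fetchone()

# ... and its foreign key value is the primary key of the customer it refers to
customer = conn.execute(
    "SELECT * FROM customers WHERE customer_id = ?", (invoice["customer_id"],)
).fetchone()

print(dict(invoice))
print(dict(customer))
```

Notice that the lookup never depends on the order in which the rows were inserted. Only the key values matter.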

#### **Physical Models**
The physical model of a given database is somewhat vendor-specific. Each DBMS vendor provides different options, depending on the needs of their customers. Here we will focus on just two of the many possible physical design decisions behind a given DBMS. Both relate to how data redundancy is used to make the database more reliable and performant. 

The first decision is how the data is stored and backed up. One option is to keep the data in files on a local hard drive. MySQL, for example, stores each table as a single file buried somewhere on a hard disk. SQLite does something similar, except in a location of the user's choosing. In either case, one can back up the data by copying the actual database files, along with perhaps some metadata about the files themselves. The alternative to locally-accessible files is to hide all storage implementation behind a service, which may be the database management system itself. In this case, backups are often retrieved as SQL 'dump files' that contain instructions for recreating the data from scratch. Of course, since the server is managed by the vendor, they may also offer a paid backup service that uses proprietary technology of their own design.
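
As a rough illustration of what a SQL dump file contains, the sketch below uses the `iterdump()` method of Python's built-in `sqlite3` connections to turn a tiny database into SQL text and then rebuilds a second database from that text. The table and rows are made up, and a hosted DBMS would have its own dump tools, so this only shows the general idea.

```python
import sqlite3

# A tiny made-up database to back up
source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_date TIMESTAMP);
INSERT INTO customers VALUES (5, '1997-02-19');
INSERT INTO customers VALUES (6, '1997-02-19');
""")

# The dump is plain SQL: the statements needed to recreate the data from scratch
dump_sql = "\n".join(source.iterdump())
print(dump_sql)

# Restoring means running those statements against an empty database
restored = sqlite3.connect(":memory:")
restored.executescript(dump_sql)
restored_rows = restored.execute(
    "SELECT customer_id, first_date FROM customers ORDER BY customer_id"
).fetchall()
print(restored_rows)
```

Because the dump is just text, it can be inspected, versioned, or loaded into a different machine without copying the original database file.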

The second decision is how many live copies of the data to keep. The classic approach is to use a centralized database server. In this case there is only one live copy of the data to secure, backup, etc. However, this can cause delays for remote users, where each DB request and response may have to travel a long route through the Internet, causing significant latency for the end users. The solution is to use a decentralized architecture, with multiple copies of the database distributed close to the end users. It is then the database system's job to keep the various copies in sync so that two users on opposite sides of the world aren't working with different "facts" for the same entities.  

We will come back to these issues in Lesson 12 at the end of the course. 



---
## **PRO TIPS: How to pare a data model to its bare essentials**

Data modeling is about capturing the essence of the _things_ around us. What are they? What do they look (or feel or sound ...) like? What are they made of? What are they doing? What is acting on them? There are just so many questions to answer. 

However, everything starts with the *things* themselves. Anything that could be referred to with a noun is a potential thing that produces data. Further, since nouns are used as subjects and objects, anything described with a verb phrase is also a candidate for data collection. In just about any real-life situation one can generate a veritable mountain of data just by studying the things around us. 

So, where do we start? How do we keep our data to just the facts we need to know? Fortunately, we are not the first people to ask that question and there is plenty of practical philosophy to draw on. Here we'll cover a few of the best practices as a set of useful contrasts. 

### **Dynamic vs Static**
When observing a situation, it is often useful to take "snapshots" and then look for two kinds of facts:

- **Dynamic elements: Facts that change from frame to frame.** What key differences do you see over time? These changes hint at underlying processes. The changes themselves are called *events*. If the dynamic elements you detect are relevant, then capture the events so you can recreate the patterns over time. 
- **Static elements: Facts that do not change at all, ever.** These may be key structural elements that help us understand or predict the behaviors of the dynamic elements. In many cases they represent the essential makeup or composition of the entities we are interested in. 

**Wisdom:** When observing a situation, ask yourself what things you would need to know to simulate it. Focus your observations on things that change and things that constrain the changes. The right data will be in there somewhere. 

### **Entities vs Data Objects**
As we have already discussed, entities are things that have unique identities. They are the things we need to capture data about. But what about the other things that are not entities? Either they are just noise to be ignored or they are **data objects** (sometimes called **value objects**) that make up the attributes of the entities. Every attribute has a data type and can take on some range of values. **These data objects are the things that make up the attribute ranges.** Often this is a matter of context. The color "brown," for example, might be an entity in a physics lab but is a value object when cited as hair color on a driver's license. 

**Wisdom:** Don't just ignore data objects. Instead, ask what attributes they represent and what entities have those attributes. 

### **Abstraction vs Specification**
Often the same thing can go by many different names. What is the essential difference between a car and an automobile? Or between employee and staff? 

Abstraction is about finding similarities. When faced with such fine distinctions that may not be relevant to your needs, it is common practice to aggregate similar things together into groups as a matter of efficiency. If they behave the same then why treat them differently? Instead, we abstract away the irrelevant differences and then isolate (specify) the relevant ones. To keep our specifications coherent we'll often invent new names for the entities and value objects so generated. 

This process of generalization and specification follows a cyclical pattern, where we look to generalize away specifics that don't matter and then specify what facts remain. We do it every day but expert data modelers do it better. 

**Wisdom:** Names and relationships matter. Try to standardize your language in ways that allow people to generalize and specify as needed. 

### **Data vs Database**

When capturing data it is easy to confuse database technology with the data itself. We use database technology to store and maintain data. However, the database is not the data. It's like the difference between your wallet and the data it contains. When you lose your wallet, you may lose some cash, credit cards, etc. (where data takes physical form) but you don't actually lose the right to drive or your accounts. 

**Wisdom:** Eliminate technical jargon whenever you can. If you can't imagine a person working in the field using a phrase in everyday speech, then it is likely not relevant. So, unless the business domain is IT, avoid using terms like "repository" or "key" or "server" that only make sense to IT folks. 

### **Conclusion: Start big, then pare down and stick with what works**

The above bits of wisdom only apply once you have captured plenty of facts. So, be expansive at first, taking in whatever you can. Then make an active effort to eliminate all nonessential details. You want a cohesive whole made of a minimal number of interconnected entity types where ...
- Each entity is as simple as possible
- Each attribute is as relevant and relatable as you can make it
- Each relationship is as precise as you can make it

**If we find ourselves continually changing our data models then we are just not doing it right. Essentials are eternal and *never* change. If they do then they are *accidental* instead of essential. Keep it simple and consistent over time.** 

---
## **SQL AND BEYOND: SQLite ... Files optional, no server required** 

We have already been introduced to SQLite a couple of times in this lesson. Now let's go deeper with a runnable demonstration. 

### **What Is SQLite?**

From the [SQLite.org](https://sqlite.org/index.html) website:
>SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine. SQLite is the most used database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day.
>
>The SQLite file format is stable, cross-platform, and backwards compatible and the developers pledge to keep it that way through the year 2050. SQLite database files are commonly used as containers to transfer rich content between systems and as a long-term archival format for data. There are over 1 trillion (1e12) SQLite databases in active use.
>
>SQLite source code is in the public-domain and is free to everyone to use for any purpose.

There is a lot to parse there. Some highlights:

- Implemented in C for maximum speed and minimal installation footprint. 
- Very, very popular with device makers of all sorts, with over 1 trillion SQLite databases in use. 
- Stores data in files or in memory (as mentioned earlier in the lesson). 
- Minimal file size allows it to be used as a transfer container for shipping large data repositories from one place to another. 
- There is no DBMS per se. Remote access, security protections, etc. have to be provided through a separate application server. 

In this class we are using the `sqlite3` library that comes built into Python, which when combined with `%%sql` magic is the most convenient way to work with relational data in Colab. 

What follows is a step-by-step tutorial. Each step has a brief explanation in Markdown followed by a code cell with some Python code for you to run. Don't know Python? Don't worry. Your job is to run the cells, not write Python code (yet?).



### **1. Complete a few admin tasks.**

Before we can do much with queries, we'll need to do some administrative tasks in Google Drive. Some cells may fail if they run into something unexpected, such as ...
- You are using a browser other than Chrome. 
- Chrome thinks you are in something other than your `@student.fairfield.edu` account. 
- Colab thinks you are in something other than your `@student.fairfield.edu` account.
- Your Google Drive is not set up for Google Colab and DATA 6510 as directed in Lesson 0. 
- The necessary data files could not be found at GitHub.
- Google Drive is being *really slow*. 

Don't worry. Each of these things can usually be resolved in a few minutes. Ask for help on Slack if needed.

Run each code cell, one at a time, reading any notes provided (above or below the cell) that explain what the code is doing. 

#### **1a. Set up a workspace in Google Drive for our database files.**

The cell below connects to Google Drive and adds a folder to store our new files.  



In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Create the DATA6510/data/DeluxCare folder in Google Drive
from pathlib import Path
data_root = Path("./drive/My Drive/Colab Notebooks/DATA6510")
if not data_root.exists():
  print(
      '''
      Warning! The folder '/Colab Notebooks/DATA6510' could not be found in the connected Google Drive. 
      Please make 100% sure that both Colab and Chrome are set up to use your @student.fairfield.edu account. 
      For now, a new folder with the correct path has been created in whatever Google Drive it found. 
      ''')
data_root = data_root / 'data' / 'DeluxCare'
data_root.mkdir(parents=True, exist_ok=True)

Mounted at /content/drive


The cell below makes a `data6510` symlink ("shortcut") to the `DATA6510` folder that works around spaces in the folder names (`My Drive`, `Colab Notebooks`, etc.). FWIW, `bash` is a commonly used shell environment for Unix and Linux servers. It was long the default in the macOS Terminal, though newer versions of macOS default to `zsh`.  

In [2]:
%%bash
# -f and -n replace an existing data6510 link, so the cell is safe to rerun
ln -sfn drive/My\ Drive/Colab\ Notebooks/DATA6510 data6510

**Heads up: We will need to refresh the Google Drive connection and the symlink for each SQLite session. Colab forgets about all file-related things between sessions. If you've been away for 12 hours or more then expect to have to set up Google Drive and the symlink again. All you'll need to do is rerun the cells provided, usually at the top of the notebook for each lesson.**

#### **1b. Retrieve source data and store it in Google Drive.**

This cell downloads a copy of the database from GitHub and saves it to Google Drive. The database contains data about every sale (invoice) in 2019 for the DeluxCare dry cleaner business. 

In [4]:
import requests

# Download the database file from GitHub in streamed chunks
file_url = "https://github.com/christopherhuntley/DATA6510/raw/master/data/DeluxCare/L1_DeluxCare.db"
r = requests.get(file_url, stream=True)
r.raise_for_status()  # stop here if the download failed (e.g., file not found)

# Save the file to Google Drive
db_path = data_root / "L1_DeluxCare.db"  # note: data_root was set in 1a.
with open(db_path, "wb") as file:
    for block in r.iter_content(chunk_size=1024):
        if block:
            file.write(block)

 

### **2. Connect to the Database**


#### **2a. Let Colab know that we will be using `%%sql` magic, `sqlite3`, and (possibly) pandas.**

This cell is mostly 'boilerplate' that we will run at the start of just about every lesson from now on. 



In [None]:
# Load %%sql magic
%load_ext sql

# Standard Imports
import sqlite3
import pandas as pd

#### **2b. Initiate and test a SQLite connection.**

The cell below uses a special *connection string* to 'open' the database file for querying. Connection strings follow the [SQLAlchemy Database URL standards](https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls). For SQLite the format is `sqlite:///path/to/file` where `path/to/file` is a valid filepath *without spaces*. We used the symlink in step 1a to avoid the spaces. Most of the time connection URLs will be provided for you, but you should still pay attention to the way they are structured. 

In [None]:
%sql sqlite:///data6510/data/DeluxCare/L1_DeluxCare.db

'Connected: @data6510/data/DeluxCare/L1_DeluxCare.db'

The cell below makes sure that the connection worked. It should return metadata listing the names and columns for two tables. The SQL shown for each table is the actual code needed to create that table in the database. (The SQL indentation is off, by the way. We will learn to do it properly in Lesson 7.)

In [None]:
%%sql
SELECT * FROM sqlite_master

 * sqlite:///data6510/data/DeluxCare/L1_DeluxCare.db
Done.


type,name,tbl_name,rootpage,sql
table,customers,customers,2,"CREATE TABLE customers (  customer_id INTEGER PRIMARY KEY,  first_date TIMESTAMP,  last_date TIMESTAMP )"
table,invoices,invoices,3,"CREATE TABLE invoices (  invoice_id INTEGER PRIMARY KEY,  customer_id INTEGER,  date_invoice TIMESTAMP,  date_finished TIMESTAMP,  date_paid TIMESTAMP,  date_pickup TIMESTAMP,  date_ready TIMESTAMP,  total REAL,  discount REAL,  prepaid REAL,  items INTEGER )"
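
For the curious, the same kind of metadata can be read with plain `sqlite3` calls instead of `%%sql` magic. This is only a sketch against a throwaway in-memory database with abbreviated, made-up versions of the two tables; in the notebook you would connect to the downloaded `.db` file instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database file

# Abbreviated, hypothetical versions of the two DeluxCare tables
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_date TIMESTAMP, last_date TIMESTAMP);
CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
""")

# sqlite_master holds one row per table, including the SQL that created it
tables = conn.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
for name, sql in tables:
    print(name, "->", sql)

# PRAGMA table_info lists the columns of one table; the column name is the second field
columns = [row[1] for row in conn.execute("PRAGMA table_info(invoices)")]
print(columns)
```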


### **3. Finally, we get to run some SQL queries.**

Now that we have access to the database let's explore a bit to see what we have. We'll title each query as a question. The code cells then answer the questions. Don't forget to run the code before reading the remarks. 

#### **How many customers are in the database?**

In [None]:
%%sql
SELECT count(DISTINCT customer_id) 
FROM customers;

 * sqlite:///data6510/data/DeluxCare/L1_DeluxCare.db
Done.


count(DISTINCT customer_id)
4169


REMARKS: 
- Not every customer is a person. Some represent hotels or other businesses.  
- Like all of the queries in this lesson, we are using a `SELECT` statement. The `FROM` clause tells the query what tables to use. 
- The query counts the number of unique (`DISTINCT`) `customer_id` values that appear in the `customers` table. Since `customer_id` is the primary key of the `customers` table, they are each unique and we do not actually need to use `DISTINCT`.

ANSWER: There are 4169 customer accounts in the database. 
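
To see when `DISTINCT` actually matters, compare the different flavors of `count()` on a toy dataset. The sketch below uses made-up rows in an in-memory database: on a primary key column `DISTINCT` changes nothing, but on a foreign key column (where values repeat or may be missing) the counts differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Made-up rows: customer 1 has two invoices, one invoice has no customer at all
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO customers VALUES (1), (2), (3);
INSERT INTO invoices VALUES (10, 1), (11, 1), (12, 2), (13, NULL);
""")

# On a primary key, every value is already unique
customer_counts = conn.execute(
    "SELECT count(*), count(DISTINCT customer_id) FROM customers"
).fetchone()

# On a foreign key: count(*) counts rows, count(column) skips NULLs,
# and count(DISTINCT column) also collapses repeated values
invoice_counts = conn.execute(
    "SELECT count(*), count(customer_id), count(DISTINCT customer_id) FROM invoices"
).fetchone()

print(customer_counts)
print(invoice_counts)
```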

#### **Which are the oldest customer accounts?**

In [None]:
%%sql
SELECT * 
FROM customers
ORDER BY first_date
LIMIT 10; 

 * sqlite:///data6510/data/DeluxCare/L1_DeluxCare.db
Done.


customer_id,first_date,last_date
1,,2019-02-13 00:00:00
2,1899-12-30 00:00:00,2019-04-06 00:00:00
3,1899-12-30 00:00:00,2020-03-16 00:00:00
4,1899-12-30 00:00:00,2019-12-30 00:00:00
5,1997-02-19 00:00:00,2019-05-08 00:00:00
6,1997-02-19 00:00:00,2019-12-30 00:00:00
7,1997-03-11 00:00:00,2020-03-06 00:00:00
8,1997-03-11 00:00:00,2019-01-16 00:00:00
9,1997-03-11 00:00:00,2019-02-08 00:00:00
10,1997-03-11 00:00:00,2020-03-09 00:00:00


REMARKS: 

- This is 100% real data. The customer names, addresses, and other identifying information were scrubbed to protect customer privacy. 
- Customer 1 actually is a special account used to handle cash transactions that do not have a proper invoice. That's why it does not have a first_date. 
- Similarly, customers 2, 3, and 4 represent industrial accounts that predate the system. December 30, 1899 is the earliest date the POS system could handle.
- All dates are encoded as TIMESTAMPs that include the time of day. In this case, they are all treated as happening at midnight. 
- The `ORDER BY` clause sorts the data using the `first_date` column. 
- The `LIMIT` clause tells the query to return up to 10 rows of data.   

ANSWER: Customers 5 and 6 have the longest-running customer accounts with a recorded start date. Both were created on February 19, 1997. The oldest such account with activity in 2020 was customer 7. It really is remarkable how loyal these customers have been over the years. 

#### **How many transactions (invoices) did customer 5 have in 2019?**

In [None]:
%%sql
SELECT count(invoice_id) 
FROM invoices
WHERE customer_id=5;

 * sqlite:///data6510/data/DeluxCare/L1_DeluxCare.db
Done.


count(invoice_id)
2


REMARKS:

- We had to use the `invoices` table to answer this question. Each invoice has an `invoice_id` (primary key) and a `customer_id` (foreign key). 
- The `WHERE` clause restricts the rows included in the query to just those with `customer_id` equal to 5. 

ANSWER: Customer 5 had two transactions in 2019.

#### **Which customers had the most transactions in 2019?**

In [None]:
%%sql
SELECT customer_id, count(invoice_id) as invoice_count
FROM invoices
GROUP BY customer_id
ORDER BY invoice_count DESC
LIMIT 10;

 * sqlite:///data6510/data/DeluxCare/L1_DeluxCare.db
Done.


customer_id,invoice_count
1708,508
2598,412
1147,400
2300,381
2090,379
2925,338
445,337
1164,323
2061,323
1975,276


REMARKS

- This query does the same job as a simple Excel pivot table that counts invoices per customer. 
- The `AS` keyword gives the calculation `count(invoice_id)` an alias (name) `invoice_count` that we can use later. 
- The `GROUP BY` clause tells the query to do sub-counts broken down by `customer_id`.
- The `ORDER BY` clause sorts the groups in descending order (`DESC`) by `invoice_count`.

ANSWER: It looks like customer 1708 is the winner. Like the others on this list, it almost certainly represents an industrial account. After all, who else goes to the cleaners almost every day of the year?
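
If `GROUP BY` still feels abstract, it may help to see that it amounts to counting rows per group, which we can also do by hand. The sketch below runs the same kind of query on a handful of made-up invoice rows and compares the result with a tally built using Python's `collections.Counter`.

```python
import sqlite3
from collections import Counter

# Made-up (invoice_id, customer_id) pairs
rows = [(1, 1708), (2, 1708), (3, 2598), (4, 1708), (5, 2598), (6, 445)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.executemany("INSERT INTO invoices VALUES (?, ?)", rows)

# Let the database do the sub-counts
grouped = conn.execute("""
    SELECT customer_id, count(invoice_id) AS invoice_count
    FROM invoices
    GROUP BY customer_id
    ORDER BY invoice_count DESC, customer_id
    LIMIT 2
""").fetchall()

# The same tally done by hand in Python
by_hand = Counter(customer_id for _, customer_id in rows)

print(grouped)
print(by_hand.most_common(2))
```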

#### **For each of the top 10 customers, how long have they been customers?**

In [None]:
%%sql
SELECT customer_id, date(first_date) as start_date, count(invoice_id) as invoice_count
FROM invoices JOIN customers USING (customer_id)
GROUP BY customer_id, first_date
ORDER BY invoice_count DESC
LIMIT 10;

 * sqlite:///data6510/data/DeluxCare/L1_DeluxCare.db
Done.


customer_id,start_date,invoice_count
1708,2008-11-24,508
2598,2016-04-20,412
1147,2005-03-21,400
2300,2014-02-06,381
2090,2012-07-02,379
2925,2018-01-02,338
445,1999-03-02,337
1164,2005-05-10,323
2061,2012-03-26,323
1975,2011-07-20,276


REMARKS

- We had to get data from both tables to answer this one. 
- We used the `date()` function (and an alias) to convert the `TIMESTAMP` data to dates without the time of day. 
- The `JOIN` expression in the `FROM` clause matches each invoice with a customer `USING` the `customer_id`. 
- The `first_date` column was added to the `GROUP BY` (alongside `customer_id`) so that it could be used in the `SELECT` clause. Some DBMSs don't require this precaution but it's better to be safe than sorry. 
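
As a sketch of what `JOIN ... USING` is doing, the example below runs the query above on a tiny in-memory database and then repeats it with the longer `JOIN ... ON` spelling. The two customer start dates are borrowed from the results above, but the invoice rows are made up, so the counts are purely illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two customers (start dates taken from the lesson output) and three made-up invoices
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_date TIMESTAMP);
CREATE TABLE invoices (
    invoice_id  INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers (customer_id)
);
INSERT INTO customers VALUES (445, '1999-03-02 00:00:00'), (1708, '2008-11-24 00:00:00');
INSERT INTO invoices VALUES (1, 1708), (2, 445), (3, 1708);
""")

# USING (customer_id) matches rows where the column has the same value in both tables
using_rows = conn.execute("""
    SELECT customer_id, date(first_date) AS start_date, count(invoice_id) AS invoice_count
    FROM invoices JOIN customers USING (customer_id)
    GROUP BY customer_id, first_date
    ORDER BY invoice_count DESC
""").fetchall()

# The same match written out explicitly with ON and table aliases
on_rows = conn.execute("""
    SELECT c.customer_id, date(c.first_date) AS start_date, count(i.invoice_id) AS invoice_count
    FROM invoices AS i JOIN customers AS c ON i.customer_id = c.customer_id
    GROUP BY c.customer_id, c.first_date
    ORDER BY invoice_count DESC
""").fetchall()

print(using_rows)
print(on_rows)
```

`USING` is just shorthand for the common case where the foreign key and the primary key share the same column name.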

#### **That last one is hard to read. Can we have it as a nicely formatted table?**

In [None]:
_.DataFrame()

,customer_id,start_date,invoice_count
0,1708,2008-11-24,508
1,2598,2016-04-20,412
2,1147,2005-03-21,400
3,2300,2014-02-06,381
4,2090,2012-07-02,379
5,2925,2018-01-02,338
6,445,1999-03-02,337
7,1164,2005-05-10,323
8,2061,2012-03-26,323
9,1975,2011-07-20,276


REMARKS

- This is a convenient Jupyter trick. The results of the previous cell are always kept in a special variable called `_`. In this case it's a `%sql` magic resultset, which can be converted to a pandas DataFrame as shown. 
- The `_` trick only works once for each run of a `%%sql` cell. Can you figure out why? 

---
## **Congratulations! You've made it to the end of Lesson 1.**

There are just 11 more to go. For Quiz 1, focus on everything in this lesson but SQL. The purpose of this lesson was to provide necessary context *before* diving into SQL in Lesson 2. 

## **On your way out ... Be sure to save your work.**
Save this notebook file to your `DATA6510` folder so you can find it next time. 