<img src="https://github.com/christopherhuntley/DATA6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **DATA 6510**
# **Lesson 4: The Relational Model** 
_A mostly gentle introduction to the mathematics of data modeling._

## **Learning Objectives**
### **Theory / Be able to explain ...**
- The elements of the relational data model
- Coherent relations
- SQL `SELECT` queries as Relational Algebra 
- Importance of Data Integrity
- The many kinds of keys and how they are used

### **Skills / Know how to ...**
- Assess data integrity by examining data and basic assumptions
- Identify (and eliminate) duplicate table rows
- Embed SQL into Python code (without `%%sql` magic)

--------
## **LESSON 4 HIGHLIGHTS**

In [None]:
#@title Run this cell if the video does not appear
%%html
<div style="max-width:1000px">
  <div style="position: relative;padding-bottom: 56.25%;height: 0;">
    <iframe style="position: absolute;top: 0;left: 0;width: 100%;height: 100%;" rel="0" modestbranding="1"  src="https://www.youtube.com/embed/spCA0XC6jNY" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
  </div>
</div>

---
## **BIG PICTURE: Why Spreadsheets are not Relational Databases (but also kind of are)**
So many analysts get their first taste of data analysis working with spreadsheets. In the first sitting they'll type some data into a couple of columns, then maybe add some formatting and perhaps column totals. Then they learn about copying cells with calculations in them, the proper use of `$` to freeze the row or column addresses, etc. Eventually, they'll perhaps learn about defining cell ranges as "databases" that can be dragged into PivotTables. It all seems so easy! Why not just stick with that? 

The problem is that for all of the expressive power of Excel, it is a poor substitute for a *real* database (and no, MS Access doesn't count). Some issues:
- **Any cell can have data of any type.** So, we can't predict what is in a cell without looking inside. 
- **The data is only semi-structured.** The meaning of a given row or column can vary from row to row and column to column within the same data set. At the top of the spreadsheet the third column might be home addresses but lower down it might be average home values. 
- **Spreadsheets have limited capacity.** An Excel worksheet tops out at 1,048,576 rows, so if you have 2 million rows of data then MS Excel just won't work. 
- **Using `lookup()` or other similar means to combine data from multiple sheets is slow and error prone.** There is no way to *really* know you got the cell references right in your `vlookup()` call except to know in advance what values the lookup should return. 
- **While spreadsheets do allow named ranges almost nobody actually uses them.** So, you have to know which rows and columns make up a given range (say `$B2:$F$1093`). Then if you copy a formula that uses those ranges to another cell you have to be careful about the `$` in the cell addresses. 

Consider for example, the sheet below:

![Spreadsheet bugs](https://github.com/christopherhuntley/DATA6510/raw/master/img/L4_Spreadsheet_Bugs.png)

A few issues:
- Why did `countif()` in column C get Greta Life's customer count wrong?   
  >*Because of a missing `$`.*
- Why might the `vlookup()` in column H have gone so terribly wrong?  
 >*Because the staff were not sorted alphabetically once Brock Lee was hired. If they fired Brock (or resorted the staff roster) then the bug would seem to go away.*
- What could go wrong if we added more customers?  
 >*If we just tack customers onto the end of columns F-H, does the customer count get updated too? It depends on how the `countif()` range is defined.*
- How about if we hired Brock Lee's dad, who is also named Brock Lee? What might happen then? How would we fix it?   
 >*The `vlookup()` in column H would get confused if there are two Brock Lees. The names would have to be disambiguated using something like Brock Lee Jr and Brock Lee Sr. That in turn could mess up the data in column G. If we forget to update even one name in that column then our data is corrupted.*

None of these problems are programming bugs on the part of Microsoft or Google. They are instead more like bad spreadsheet design hygiene. With different data the bugs might just go away. 

In fact, the *internal* data representation used by spreadsheet software is  rock solid, as database design goes. After all, spreadsheets have ...
- unique cell addresses (A1, B2, etc.) 
- well-behaved data types (general, number, etc.) 
- a stable data structure (fixed rows and columns) that does not change when we add new data
- traceable cell dependencies (see below)

 ![Spreadsheet trace precedents](https://github.com/christopherhuntley/DATA6510/raw/master/img/L4_excel_trace_precedents.png)

That's pretty much the definition of a relational database, buried deep inside of MS Excel but with lots of user interface conveniences that actually make it more likely that data will get corrupted. **So, while it *is possible* to use a spreadsheet as a crude database, it takes lots of discipline to keep the data clean and consistently structured.**

In this lesson we will consider the relational database model, which was designed specifically to prevent just these sorts of **integrity** issues. Rather than rely on end users to *just know* when they are corrupting their own data we will learn to apply a few rules that prevent data corruption in the first place.

---
## **Tables as Relations**

The Relational Data Model was introduced by E. F. Codd at IBM in 1970. It is based on so-called Relational Algebra, which defines a set of rules and operations that should apply to tabular data (a.k.a., relations). SQL (or Sequel back then) was the programming language IBM used to implement the relational model.  

### **Terminology and Equivalencies**
In the time before the relational model, all data existed in files. In order to read or write data, a program had to implement a file system. A computer operating system, for example, has a file system for just this purpose. Apps make use of it to access data storage. 

In old-school data **files**, each line of text (it was always text or at least text-encoded) was called a **record**. Each record had some (possibly variable) number of **fields**, each representing one datum. Fields were delimited with a separator character, often a `tab` because it was not likely to be present in the data. A modern example is the so-called CSV file, where CSV stands for *comma-separated values*.

One step up from a file is a data set, where the file is explicitly in a tabular format, with rows and columns. This is the model on which SQL is built. Note that we implement the relational model in SQL, but it is possible to do things in SQL that technically violate the rules of the relational model. 

When working with *data models*, possibly before we actually have any data, we don't actually refer to tables. Instead, we talk of *entity types*, *instances*, and *attributes*. We will get into this deeper in Lesson 6. However, it is generally acknowledged that each entity type corresponds to a table, an instance is a row, and an attribute is a column.

| Relation    | Tuple    | Attribute |
|:---------   |:---------|:--------- |
| File        | Record   | Field     |
| Table       | Row      | Column    |
| Entity Type | Instance | Attribute |

As shown in the table above, the relational model formalizes these general notions using mathematical language: **relations**, **tuples**, and **attributes**. We will get into the exact meaning of these terms, but for now you can just think of them as synonymous with tables, rows, and columns.  

### **Sets, Mappings, and Relations**

You may have learned about mathematical relations in middle school, probably in a lesson about sets. A set is a collection of items (numbers, words, pictures, cats, ...) without duplication. The items can be of mixed types (e.g., cats and pictures) as long as no item is represented twice. 

![Set with cats and gifs](https://github.com/christopherhuntley/DATA6510/raw/master/img/L4_Set.png)

If we have multiple sets then we can *map* one set to another. The most familiar kind of mapping is a function, which maps a *domain* set (i.e., all possible function inputs) to a *range* set (i.e., all possible outputs). 

![Functional Mapping](https://github.com/christopherhuntley/DATA6510/raw/master/img/L4_Functions.png)

In a **functional** (single-valued) mapping, each item in the domain maps to exactly one item in the range. In other words, each time the function is called with a given input, the function always returns the same output. If the set of inputs is finite, we can replace any calculation with a table, with one row for each input value and its associated output value. The mathematical name for such pairings is *tuple*, which we'll come back to in a bit. 

![General Relational Mapping](https://github.com/christopherhuntley/DATA6510/raw/master/img/L4_Relations.png)

A **relation** is a more general kind of mapping where items in the domain can map to *multiple items* in the range. Like with a functional mapping, we can represent relations as tables, only this time we may need multiple rows per input item. As long as we can capture each mapping arrow as a pair, then we can call the relationship a relation.  
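
To make the distinction concrete, here is a minimal Python sketch. The `square` lookup table, the `owns` pairs, and the helper `is_functional` are all made up for illustration; they are not part of any library.

```python
# A function stored as a lookup table: each input appears in exactly one pair.
square = {1: 1, 2: 4, 3: 9}

# A relation stored as a set of (input, output) pairs: inputs may repeat.
# The people and pets are invented for this example.
owns = {("Ann", "cat"), ("Ann", "dog"), ("Bob", "cat")}

def is_functional(pairs):
    """Return True if every input maps to exactly one output."""
    inputs = [a for a, _ in pairs]
    return len(inputs) == len(set(inputs))

print(is_functional(set(square.items())))  # True
print(is_functional(owns))                 # False: "Ann" maps to two pets
```

Every function is a relation, but a relation only counts as a function when no input shows up in more than one pair.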

>**Heads up:** It often confuses people when they find out that the Relational Model is about relations, not relation*ships*. Roll with it. In the end it doesn't matter much ... unless you are a mathematician. 

We can extend the pairwise mappings to allow multiple range sets (or *codomains*). 

![Multiple Codomains](https://github.com/christopherhuntley/DATA6510/raw/master/img/L4_Relation_Multiple_Codomains.png)

The result is that as we add more codomains the pair-wise mappings become triplets, quadruplets, quin**tuple**ts, sex**tuple**ts, ... or, as we  generally call them, **tuples** that can have any number of values. Depending on the mappings, it is possible that some tuples will contain empty values, but that is allowed in the relational model. 

Going back to the table / row / column terminology, each row is a tuple and each column is a named attribute. A table is then a set of rows (tuples). **Within each tuple, every entity (row) in the domain is mapped to a value in each of the attribute codomains (columns with data types).** Further, there can be no row duplication; otherwise the relation would violate the first rule of sets. 
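
As a rough illustration of a table as a set of tuples, the sketch below uses Python's `namedtuple` for rows with named attributes. The `Staff` tuple type is an assumption made for this example (it borrows names from the staff roster used later in the lesson).

```python
from collections import namedtuple

# Hypothetical tuple type: three named attributes (columns).
Staff = namedtuple("Staff", ["eid", "name", "email"])

rows = [
    Staff(1, "Barb Ackue", "backue@acmesales.com"),
    Staff(2, "Buck Kinnear", "bkinnear@acmesales.com"),
    Staff(1, "Barb Ackue", "backue@acmesales.com"),  # exact duplicate of the first row
]

relation = set(rows)                  # a set cannot hold the same tuple twice
same_relation = set(reversed(rows))   # loading the rows in another order changes nothing

print(len(rows), len(relation))       # 3 2
print(relation == same_relation)      # True
```

The duplicate disappears the moment the rows are treated as a set, which is exactly the "no row duplication" rule.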

### **Coherent Relations**
A relation (table) is said to be *coherent* if:
- each row describes one domain entity (i.e., not a composite) 
- each column is one attribute 
- there are no duplicate rows or columns
- there is just one fact per (row, column) pair
- row and column order don't affect interpretation

Coherence in this case is in the mathematical sense. It just means that we can translate the relation's tuples back to the original set mappings. 

Without coherence, the rest of the relational model falls apart. These rules are the minimal requirements for making sense of what a table represents. 

---
## **Relational Algebra**

### **About Algebra**
**An algebra** $-$ note the phrasing $-$ is a mathematical system that implements operations like addition, subtraction, multiplication, and division. Generally, we define a kind of algebra based on the data it operates on and perhaps special rules that only apply to that kind of data. The **Elementary Algebra** you learned in school involves basic arithmetic operations on numerical data: 1+1=2, 2x3=6, ...

In our discussion of boolean expressions in Lesson 2 we encountered **Boolean Algebra** that operates on `0` and `1` data where
- `OR` $\Leftrightarrow$ `+`
- `AND` $\Leftrightarrow$ $\times$
- `NOT` $\Leftrightarrow$ `-`
- `1 OR 1 = 1`$\Leftrightarrow$ `1+1=1`
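
One way to read these equivalences in code is sketched below. The function names are invented for this example, and treating `NOT` as `1 - a` is an assumption about what the unary `-` means here.

```python
def b_or(a, b):
    return min(a + b, 1)   # addition that saturates at 1, so 1 + 1 = 1

def b_and(a, b):
    return a * b           # multiplication on 0/1 values

def b_not(a):
    return 1 - a           # assumed reading of the unary "-"

# Check every combination against Python's own boolean operators.
for a in (0, 1):
    for b in (0, 1):
        assert b_or(a, b) == int(bool(a) or bool(b))
        assert b_and(a, b) == int(bool(a) and bool(b))
    assert b_not(a) == int(not a)

print(b_or(1, 1))  # 1
```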

Yet another common algebra operates on sets. The **Set Algebra** operators are shown in the table below.

| Operator | Symbol | Meaning |
| :------  | :----: | :----- |
| Union    | $\cup$ | `A` $\cup$ `B` returns all items in `A` **or** `B`.|
| Intersection | $\cap$ | `A` $\cap$ `B` returns all items in `A` **and** `B`.|
| Difference | $-$ | `A` $-$ `B` returns all items in `A` **and not** in `B`.|
| Product | $\times$ | `A` $\times$ `B` is the set of all possible pairs `(a,b)` where `a` is in set `A` and `b` is in set `B`.| 

A couple remarks:
- If set `A` has 2 items and set `B` has 3 items, then `A` $\times$ `B` has 6 items. That's where the name `Product` comes from. 
- Boolean Algebra is a special case of Set Algebra where every set (boolean expression) is a singleton containing a `1` or `0` (for True or False). The dead giveaway here is the presence of **and**, **or**, and **and not** in the **Meaning** column of our table above. 
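
Python's built-in `set` type supports the first three operators directly, and `itertools.product` covers the fourth. The sets `A` and `B` below are arbitrary examples.

```python
from itertools import product

A = {1, 2}
B = {2, 3, 4}

union = A | B                 # items in A or B
intersection = A & B          # items in A and B
difference = A - B            # items in A and not in B
cross = set(product(A, B))    # all (a, b) pairs

print(union)          # {1, 2, 3, 4}
print(intersection)   # {2}
print(difference)     # {1}
print(len(cross))     # 6, i.e. 2 items times 3 items
```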

### **Relational Operators**
Relational algebra is a set of operations that can be applied to any relation (table). Like boolean algebra, relational algebra is a kind of extension of set algebra where the items are tuples within coherent relations. 

When applied to a relation, each operator produces a **resultset**, which is itself a relation. 

#### **Restrict**
The **Restrict** operator chooses which tuples to include in the resultset. It is equivalent to the SQL `WHERE` clause. 

#### **Project**
The **Project** operator indicates which attributes (columns) to include in the tuples. It is equivalent to the column list in the SQL `SELECT` clause.  Note that it is also how we can include calculated columns like counts and sums that do not exist in the original data.

#### **Product**
The **Product** operator calculates the cross product of two relations. It is equivalent to a list of tables in the SQL `FROM` clause. 

#### **Except and Union**
The **Except** operator calculates the *difference* between two relations with similar tuples. The **Union** operator *adds* one set of tuples to another. We covered these at the end of Lesson 3.  
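
The sketch below runs one small query per operator using Python's standard `sqlite3` module. The tables `Ta` and `Tb` and their contents are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Ta (id INTEGER, name TEXT);
    CREATE TABLE Tb (id INTEGER, name TEXT);
    INSERT INTO Ta VALUES (1, 'Barb'), (2, 'Buck'), (3, 'Greta');
    INSERT INTO Tb VALUES (3, 'Greta'), (4, 'Ira');
""")

def run(sql):
    """Run a query and return the resultset as a list of tuples."""
    return con.execute(sql).fetchall()

restrict = run("SELECT * FROM Ta WHERE id > 1")               # Restrict: WHERE
project  = run("SELECT name FROM Ta")                         # Project: column list
prod     = run("SELECT * FROM Ta, Tb")                        # Product: table list in FROM
except_  = run("SELECT * FROM Ta EXCEPT SELECT * FROM Tb")    # Except
union    = run("SELECT * FROM Ta UNION SELECT * FROM Tb")     # Union

# Row counts of each resultset
print(len(restrict), len(project), len(prod), len(except_), len(union))  # 2 3 6 2 4
```

Each result is itself a list of tuples, that is, another relation that could be fed to the next operator.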

#### **Chaining: Joins and Subqueries**

Since relational operators always produce relational resultsets, we can feed the results of one operator to the next in a chain of operators. 

So, for example, a table join is actually three operations. Let `TableA` and `TableB` be two relations, then a table join is equivalent to:  

>`TableA` **Product** `TableB` **Restrict** join-conditions **Project** columns

In fact, that is exactly how an implicit join expresses it. In SQL we have the `JOIN` operator so we don't accidentally forget to do the **Restrict** and **Project** after the **Product**. Otherwise it is totally redundant. 

Similarly, a SQL subquery is just a relation (in the form of a virtual table) that we insert in the chain of relational operators that make up a SQL query. 
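
The three-step chain can be spelled out in plain Python. This is only a sketch of the idea, not how a database engine actually executes a join; the `staff` and `customers` tables and the `rep_eid` assignments are invented for the example.

```python
from itertools import product

# Hypothetical tables as lists of dictionaries (one dictionary per row).
staff = [
    {"eid": 1, "name": "Barb Ackue"},
    {"eid": 2, "name": "Buck Kinnear"},
]
customers = [
    {"cid": 10, "customer": "Mario Speedwagon", "rep_eid": 1},
    {"cid": 11, "customer": "Petey Cruiser", "rep_eid": 2},
    {"cid": 12, "customer": "Anna Sthesia", "rep_eid": 1},
]

# Product: pair every staff row with every customer row.
pairs = [{**s, **c} for s, c in product(staff, customers)]

# Restrict: keep only the pairs that satisfy the join condition.
matched = [row for row in pairs if row["eid"] == row["rep_eid"]]

# Project: keep only the columns we want to report.
result = [(row["name"], row["customer"]) for row in matched]

print(len(pairs), len(matched))  # 6 3
print(result)
```

Because each step returns a list of rows, the output of one step is a valid input to the next, which is all that chaining requires.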


---
## **Data Integrity**
The relational model is designed to provide three kinds of **data integrity**:
- **Domain Integrity:** The data types (codomains) are known in advance and are configurable to suit a given usage. If birthdays are known to be dates, then they are *always* dates. **That way we always know how to interpret the facts.**
- **Entity Integrity:** Data about any given thing (in the domain) can be recalled precisely. **There is no risk of confusing one thing with another or losing track of a thing altogether.** 
- **Referential Integrity:** If tuple A refers to tuple B, then for sure tuple B exists in the data set. If tuple B ceases to exist, then the reference in tuple A is updated accordingly. **There are no bad references.**

We could go on about how great each of these three integrity rules is, but it should be self-evident by now. The purpose of a relational database is to keep data safe from corruption, especially as you add, alter, or delete data. If the database has rock solid integrity then it is pretty hard to mess things up ... unless you have terrible table designs, which we will get into in Lesson 5. 
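
Here is a minimal sketch of the three kinds of integrity in SQLite, driven from Python's standard `sqlite3` module. The table layouts, the birthday values, and the helper `rejected` are assumptions made for this example. Two SQLite quirks matter here: it only enforces foreign keys after `PRAGMA foreign_keys = ON`, and it needs a `CHECK` constraint to insist that a text column holds a valid date.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign key checks off by default

con.execute("""
    CREATE TABLE Staff (
        eid      INTEGER PRIMARY KEY,                        -- entity integrity
        name     TEXT NOT NULL,
        birthday TEXT CHECK (date(birthday) IS NOT NULL)     -- domain integrity
    )""")
con.execute("""
    CREATE TABLE Customer (
        cid     INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        rep_eid INTEGER REFERENCES Staff(eid)                -- referential integrity
    )""")
con.execute("INSERT INTO Staff VALUES (1, 'Barb Ackue', '1990-05-17')")

def rejected(sql):
    """Return True if the database refuses to run the statement."""
    try:
        con.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# Domain: 'last Tuesday' is not a date.
print(rejected("INSERT INTO Staff VALUES (2, 'Buck Kinnear', 'last Tuesday')"))  # True
# Entity: eid 1 is already taken.
print(rejected("INSERT INTO Staff VALUES (1, 'Greta Life', '1985-01-02')"))      # True
# Referential: there is no staff member with eid 99.
print(rejected("INSERT INTO Customer VALUES (10, 'Mario Speedwagon', 99)"))      # True
```

In each case the database refuses the bad row instead of relying on the user to notice the problem.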



---
## **The Many Kinds of Keys**
One of the ways we enforce data integrity is through the design of table **keys**. We have been using the term "key" pretty loosely so far in the course. We've seen primary keys, foreign keys, and surrogate keys. However, there is a lot more to it than that. We'll start with the concept of an *index* and then work our way through the various kinds of keys, one at a time. 

### **Indexes**
An **index** is a lookup table for finding things quickly. You will often find them in the back of nonfiction books to indicate the pages where a given topic is discussed. The purpose is to prevent the reader from having to search through the book page by page. 

Indexes (sometimes called *indices*, pronounced with a soft second *i*) are used for a similar purpose in SQL. We can *index* a column to make it quick to find every occurrence of each value in the column. 

Let's say that we have a table with people's names and email addresses, perhaps something like this:

| name | email |
|:---  |:---   |
| Barb Ackue | backue@acmesales.com |
| Buck Kinnear | bkinnear@acmesales.com |
| Greta Life | glife@acmesales.com |
| Ira Membrit | imembrit@acmesales.com |
| Shonda Leer | sleer@acmesales.com |
| Brock Lee | blee@acmesales.com |
| Brock Lee | bleesr@acmesales.com |
| Mario Speedwagon | mario.speedwagon@gmail.com |
| Petey Cruiser | pcruiser1958@hotmail.com |
| Anna Sthesia | anna@noneofyourbusiness.org |
| Paul Molive | pmolive@pmolive.com |
| Anna Mull | Anna.Mull@ctspca.com |
| Gail Forcewind | forcewindg@bwards.com |
| Paige Turner | paigeturner@yahoo.com |
| Bob Frapples | bob888237@aol.com |
| Walter Melon | Melon@camp.com |
| Nick R. Bocke | Nicky@earthlink.com |
| Greta Life | glife@acmesales.com |

Then the index for the `name` column might look like this:

| name | rowid |
|:---  | ---:  |
| Anna Mull | 12 |
| Anna Sthesia | 10 |
| Barb Ackue | 1 |
| Bob Frapples | 15 |
| Brock Lee | 6 |
| Brock Lee | 7 |
| Buck Kinnear | 2 |
| Gail Forcewind | 13 |
| Greta Life | 3 |
| Greta Life | 18 |
| Ira Membrit | 4 |
| Mario Speedwagon | 8 |
| Nick R. Bocke | 17 |
| Paige Turner | 14 |
| Paul Molive | 11 |
| Petey Cruiser | 9 |
| Shonda Leer | 5 |
| Walter Melon | 16 |

Take note:
- The `name` values are sorted to make lookups really fast.
- The `rowid` (line #) of each occurrence is noted, even when there are duplicates. 

**Why do we care about this? Because every *key* is just a special kind of index.**
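
To see why a sorted index makes lookups fast, here is a rough Python sketch built from the first seven rows of the table above. Real database engines use more elaborate structures (typically B-trees), and the helper `lookup` is invented for this example.

```python
from bisect import bisect_left

# The name column, in row order (rowids start at 1).
names = ["Barb Ackue", "Buck Kinnear", "Greta Life", "Ira Membrit",
         "Shonda Leer", "Brock Lee", "Brock Lee"]

# Build the index: (value, rowid) pairs sorted by value.
index = sorted((name, rowid) for rowid, name in enumerate(names, start=1))

def lookup(index, value):
    """Return the rowids for value using binary search instead of a full scan."""
    i = bisect_left(index, (value, 0))   # jump to the first entry for this value
    rowids = []
    while i < len(index) and index[i][0] == value:
        rowids.append(index[i][1])
        i += 1
    return rowids

print(lookup(index, "Brock Lee"))    # [6, 7]
print(lookup(index, "Greta Life"))   # [3]
```

Because the entries are sorted, the search can halve the remaining range at each step rather than reading every row.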

### **Candidate Keys & Primary Keys**

An index is a **candidate key** if the values are unique. In other words there are no repeats like Brock Lee or Greta Life. 

There can be multiple candidate keys in a given table. Only one of them gets to be called the **primary** key. That is a design choice made by the database designer. 

### **Composite Keys**

A **composite key** is a candidate key that is based on a multi-column index. So, we could create an index for the (name, email) pairs in our indexing example. The effect would be to disambiguate Brock Lee and his dad because they have distinct email addresses. 
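
A quick way to test whether a column (or combination of columns) could serve as a key is to check its values for repeats. The helper `is_candidate_key` below is a hypothetical name for this sketch. Keep in mind that a check on sample data can only rule a key out; whether values stay unique in the future is a design decision.

```python
def is_candidate_key(rows, columns):
    """True if the chosen columns hold a distinct combination of values in every row."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(keys) == len(set(keys))

# Three rows taken from the indexing example.
staff = [
    {"name": "Brock Lee", "email": "blee@acmesales.com"},
    {"name": "Brock Lee", "email": "bleesr@acmesales.com"},
    {"name": "Ira Membrit", "email": "imembrit@acmesales.com"},
]

print(is_candidate_key(staff, ["name"]))            # False: two Brock Lees
print(is_candidate_key(staff, ["email"]))           # True
print(is_candidate_key(staff, ["name", "email"]))   # True: the pair disambiguates
```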

### **Surrogate Keys**

Ideally, a designer wants a primary key that is:
- **short** to make them easy to type and conserve space
- **numeric** to save space
- **unique**, guaranteed never to have a conflict
- **permanent**, never changing
- **meaningless**, so that users don't try to change them

Such is the general ideal of a **surrogate** key. It numbers each row added to the table, typically counting upwards from 1. If a row is subsequently deleted, then the key values *are not renumbered*. We want a key to remain stable as a rock even when data changes. 
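
The sketch below shows the "never renumbered" behavior in SQLite using Python's standard `sqlite3` module. The column name `staffid` is an assumption made for this example; `AUTOINCREMENT` is SQLite's way of asking that key values never be reused.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Staff (staffid INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")

# The database assigns staffid 1, 2, 3; we never supply it ourselves.
con.executemany("INSERT INTO Staff (name) VALUES (?)",
                [("Barb Ackue",), ("Buck Kinnear",), ("Greta Life",)])

# Delete the middle row, then add a new one.
con.execute("DELETE FROM Staff WHERE name = 'Buck Kinnear'")
con.execute("INSERT INTO Staff (name) VALUES ('Ira Membrit')")

rows = con.execute("SELECT staffid, name FROM Staff ORDER BY staffid").fetchall()
print(rows)  # [(1, 'Barb Ackue'), (3, 'Greta Life'), (4, 'Ira Membrit')]
```

The gap at 2 stays a gap, so any old reference to staff member 2 can never be mistaken for someone else.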

> **Heads Up: Current best practice is to use surrogate primary keys whenever possible. The naming convention is to append `id` to the table name. When creating a new database using anything but surrogate keys, expect to get lots of questions.**

### **Foreign Keys**
**A foreign key column is an index, not a candidate key.** We call it a key because it refers to a *foreign* candidate key but we can of course have duplicates in a foreign key column if multiple rows refer to the same foreign row. 
 


---
## **PRO TIPS: How to find *and remove* duplicate rows**
It is commonplace that source data will have lots of redundancies. While some kinds of redundancy can only be detected with the sorts of advanced techniques covered in Lesson 5, we can at least check for redundant rows with a simple SQL `SELECT` query.

Let's take our Staff table from the MS Excel example. Run the cell below to load it into a SQLite in-memory database.


In [None]:
# Load %%sql magic
%load_ext sql

# Standard Imports
import sqlite3

%sql sqlite://

%sql DROP TABLE IF EXISTS Staff;
%sql CREATE TABLE Staff (eid INTEGER PRIMARY KEY, name TEXT, email TEXT);
%sql INSERT INTO Staff (name,email) VALUES ('Barb Ackue','backue@acmesales.com'),('Buck Kinnear','bkinnear@acmesales.com'), ('Greta Life',	'glife@acmesales.com'), ('Greta Life',	'glife@acmesales.com'),('Greta Life',	'glife@acmesales.com')

The sql extension is already loaded. To reload it, use:
  %reload_ext sql
 * sqlite://
Done.
 * sqlite://
Done.
 * sqlite://
5 rows affected.


[]

If we look carefully we see that somehow Greta Life was recorded three times. Oops.

In [None]:
%%sql
SELECT * 
FROM Staff; 

 * sqlite://
Done.


eid,name,email
1,Barb Ackue,backue@acmesales.com
2,Buck Kinnear,bkinnear@acmesales.com
3,Greta Life,glife@acmesales.com
4,Greta Life,glife@acmesales.com
5,Greta Life,glife@acmesales.com


There are lots of ways to find duplicates like this one. The easiest is to use a `GROUP BY` to count the number of times each row appears.

In [None]:
%%sql
SELECT name, email
FROM Staff
GROUP BY name, email
HAVING count(*) > 1;

 * sqlite://
Done.


name,email
Greta Life,glife@acmesales.com


That seems simple enough. 

Now, in a quick preview of Lesson 8, we will show a quick fix. Finding and eliminating rows in our duplicates query would seem pretty simple: just return the rows without duplicates. 

 

In [None]:
%%sql
SELECT eid, name, email
FROM Staff
GROUP BY name, email
HAVING count(*) = 1;

 * sqlite://
Done.


eid,name,email
1,Barb Ackue,backue@acmesales.com
2,Buck Kinnear,bkinnear@acmesales.com


Maybe it's not quite so simple. The problem is how to eliminate *all but one* of the rows in each group. The query below uses a simple math trick. 

In [None]:
%%sql
SELECT min(eid), name, email
FROM Staff
GROUP BY name,email;

 * sqlite://
Done.


min(eid),name,email
1,Barb Ackue,backue@acmesales.com
2,Buck Kinnear,bkinnear@acmesales.com
3,Greta Life,glife@acmesales.com


We used the `min()` function to keep the row with the smallest `eid` value in each group. Now we're getting somewhere. We *could*, of course, save this resultset as a new table. However, let's see if we can delete the rows we *don't* want. For that we'll use a `DELETE` query from Lesson 8 and a simple bit of logic to eliminate any rows that are *not* on our list of rows to keep. 

In [None]:
%%sql
-- Delete all redundant rows
DELETE FROM Staff
WHERE eid NOT IN 
  -- subquery to get the eids of rows we want to keep
  ( SELECT min(eid)
    FROM Staff
    GROUP BY name,email );

-- Verify that it worked
SELECT * 
FROM Staff;

 * sqlite://
Done.
Done.


eid,name,email
1,Barb Ackue,backue@acmesales.com
2,Buck Kinnear,bkinnear@acmesales.com
3,Greta Life,glife@acmesales.com


It seems to have worked. All it took was a little understanding of relational algebra (**Restrict** and **Project** in this case) and some creativity. We will try out a few more edge cases like this in Lesson 8. 
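
The same de-duping query can be run from plain Python, without the `%%sql` magic, using the standard `sqlite3` module. This is a minimal sketch that rebuilds the Staff table in a fresh in-memory database so that it stands on its own.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Staff (eid INTEGER PRIMARY KEY, name TEXT, email TEXT)")
con.executemany(
    "INSERT INTO Staff (name, email) VALUES (?, ?)",
    [("Barb Ackue", "backue@acmesales.com"),
     ("Buck Kinnear", "bkinnear@acmesales.com"),
     ("Greta Life", "glife@acmesales.com"),
     ("Greta Life", "glife@acmesales.com"),
     ("Greta Life", "glife@acmesales.com")],
)

# Delete every row whose eid is not the smallest in its (name, email) group.
cur = con.execute("""
    DELETE FROM Staff
    WHERE eid NOT IN (SELECT min(eid) FROM Staff GROUP BY name, email)
""")
con.commit()

print(cur.rowcount)  # 2 rows deleted
remaining = con.execute("SELECT * FROM Staff ORDER BY eid").fetchall()
print(remaining)
```

The SQL is unchanged; Python only supplies the connection and collects the results.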

> For SQL geeks only: Google BigQuery's Standard SQL can use [exclusions for column lists](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select_except) in the `SELECT` clause. Perhaps someday soon it will be possible to write our de-duping query without listing any columns except the primary key. It would look something like this:
```sql
-- Delete all redundant rows
DELETE FROM Staff
WHERE eid NOT IN 
  -- subquery to get the eids of rows we want to keep
  ( 
    SELECT min(eid)
    FROM Staff
    GROUP BY * EXCEPT eid -- every column except `eid`
  );
```
>This can save a lot of typing and potential bugs if there are a lot of columns. Who says you can't teach an old language new tricks?

---
## **SQL AND BEYOND: pandas DataFrames**

> **Heads Up:** This section is targeted toward those who know some Python. Those of you who aren't Python coders can relax. There will be no Python on the next quiz.

Beyond a few informal conventions (lists of dictionaries, dictionaries of lists, dataclasses (new in Python 3.7), etc.) there is no built-in support for tabular data structures in Python. Instead, nearly everyone uses a third-party library called *pandas* $-$ note that the lowercase name is intentional $-$ whenever they need a well-behaved data table. Pandas actually includes two data structures of interest:
- `Series`, a sequence of values of the same data type
- `DataFrame`, a collection of `Series` of the same length, with each `Series` having a unique name

A `DataFrame` is roughly equivalent to a dictionary of `Series`, making it really convenient to do column-oriented operations like calculating sums and subtotals. In fact, every `DataFrame` comes equipped with functions that can be used to index, restrict, translate, merge, append, and exclude data as needed. (For those of you keeping track, this means that the `DataFrame` data structure is a functionally complete implementation of relational algebra. It even supports chaining.) 

The table below compares the features of a *pandas* `DataFrame` with its SQL equivalent. `df`, `df_a`, `df_b` are `DataFrames`. The tables `T`, `Ta`, `Tb` are their SQL equivalents. 

| Operation | SQL | *pandas* `DataFrame` |
|----|-----|--------|
| set indexing | primary key | index column |
| **Restrict** | `SELECT * FROM T WHERE id = 3` | `df[df.index == 3]` |
| **Restrict**  | `SELECT * FROM T WHERE col1 > 3` | `df[df.col1 > 3]` |
| **Restrict** | `SELECT * FROM T LIMIT 10` | `df.head(10)` |
| **Project**  | `SELECT col1, col2 FROM T` | `df[['col1','col2']]` |
| **Project**  | `SELECT count(*) FROM T`  | `len(df)` |
| aggregation | `SELECT col1, count(col2) FROM T GROUP BY col1` | `df.groupby('col1').agg({'col2': 'count'})`|
| **Product** | `SELECT * FROM Ta, Tb` | `df_a.merge(df_b, how="cross")` |
| inner join |  `SELECT * FROM Ta JOIN Tb ON ...`|`pd.merge(df_a, df_b, on= ...)`  |
| left join | `SELECT * FROM Ta LEFT JOIN Tb ON ...`|`pd.merge(df_a, df_b, how='left', on= ...)` |
| **Union** | `SELECT * FROM Ta UNION SELECT * FROM Tb`| `pd.concat([df_a, df_b]).drop_duplicates()`|
| **Except** | `SELECT * FROM Ta EXCEPT SELECT * FROM Tb` | `df_a.merge(df_b, how='left', indicator=True).query("_merge == 'left_only'").drop(columns='_merge')` | 
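
A few of these rows can be tried out on toy data. The sketch below assumes pandas 1.2 or later (needed for `how="cross"`); the two small `DataFrame`s are invented for illustration, and the **Except** line uses a left merge with `indicator=True` as one possible way to express a set difference.

```python
import pandas as pd

df_a = pd.DataFrame({"id": [1, 2, 3], "name": ["Barb", "Buck", "Greta"]})
df_b = pd.DataFrame({"id": [3, 4], "name": ["Greta", "Ira"]})

restricted = df_a[df_a["id"] > 1]                     # Restrict
projected = df_a[["name"]]                            # Project
crossed = df_a.merge(df_b, how="cross")               # Product
unioned = pd.concat([df_a, df_b]).drop_duplicates()   # Union

# Except: keep the rows of df_a that have no exact match in df_b.
excepted = (
    df_a.merge(df_b, how="left", indicator=True)
        .query("_merge == 'left_only'")
        .drop(columns="_merge")
)

print(len(restricted), len(crossed), len(unioned), len(excepted))  # 2 6 4 2
```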


The similarities go well beyond these examples. Here is the same calculation in SQL and pandas.
```sql 
-- SQL 
-- Adapted from Lesson 3
-- lineup_play_facts is a table of NBA play stats
-- (assumes each lineup plays for a single team)
SELECT year, lineup, team, 
       sum(play_length_mins) AS minutes, 
       36*(sum(p_points) - sum(m_points))/sum(play_length_mins) AS plus_minus_36m
FROM lineup_play_facts
GROUP BY year, lineup, team
ORDER BY year, lineup
```

```python
# Python 
# lineup_play_facts_df is a DataFrame of NBA play stats
# The parentheses let the method chain span multiple lines
lineup_facts_df = (
    lineup_play_facts_df
    .groupby(['year', 'lineup'], as_index=False)  # as_index=False retains year and lineup as columns
    .agg(
        team=('team', 'first'),
        minutes=('play_length_mins', 'sum'),  # same source column as the SQL version
        p_points=('p_points', 'sum'),
        m_points=('m_points', 'sum'),
    )
    .sort_values(['year', 'lineup'])
)

# add a new column for plus_minus_36m
lineup_facts_df['plus_minus_36m'] = (
    36 * (lineup_facts_df['p_points'] - lineup_facts_df['m_points']) / lineup_facts_df['minutes']
)
```

In many ways SQL and *pandas* are complementary. If your focus is on managing data, then use SQL. If your focus is on building analytical models, then use pandas. SQL and Python eventually meet somewhere in the middle, usually as part of a data pipeline process. 

In so-called ETL pipelines, for example, it is pretty common to **make SQL calls from within Python code.** The *pandas* library comes with the [`pd.read_sql()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html#pandas.read_sql) function and the [`df.to_sql()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html)  method that allow a Python program to 
- extract data using a SQL query, 
- process it in Python as a `DataFrame`, and then 
- write the data back using SQL. 
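
The three steps above can be sketched with an in-memory SQLite database. The table name `StaffClean` is made up for this example, and a real pipeline would connect to an existing database rather than building one on the fly.

```python
import sqlite3
import pandas as pd

# Set up a small source table (stand-in for a real database).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Staff (eid INTEGER PRIMARY KEY, name TEXT, email TEXT)")
con.executemany(
    "INSERT INTO Staff (name, email) VALUES (?, ?)",
    [("Barb Ackue", "backue@acmesales.com"),
     ("Greta Life", "glife@acmesales.com"),
     ("Greta Life", "glife@acmesales.com")],
)
con.commit()

# Extract: run a SQL query and receive the resultset as a DataFrame.
df = pd.read_sql("SELECT * FROM Staff", con)

# Process: drop redundant rows in pandas, keeping the first of each group.
clean = df.drop_duplicates(subset=["name", "email"], keep="first")

# Load: write the cleaned rows back to the database as a new table.
clean.to_sql("StaffClean", con, index=False, if_exists="replace")

n_clean = con.execute("SELECT count(*) FROM StaffClean").fetchone()[0]
print(n_clean)  # 2
```

Note that no `%%sql` magic is involved: the SQL travels as an ordinary Python string.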

There are even functions and methods for cloud services like BigQuery that don't use connection strings. 

---
## **Congratulations! You've made it to the end of Lesson 4.**

In this lesson we covered the essential mathematical theory that underlies the relational database model. In the next two lessons we will apply what we've learned to database design.  



## **On your way out ... Be sure to save your work**.
In Google Drive, save this notebook file to your `DATA6510` folder so you can find it next time.