## Using Joins on Census Data

Population Distribution and Change: 2000 to 2010

- https://www.census.gov/prod/cen2010/briefs/c2010br-01.pdf

Data dictionary
- https://www.census.gov/prod/cen2010/doc/pl94-171.pdf



In [1]:
%load_ext sql

Connect to the empty database created with pgAdmin.

In [2]:
%sql postgresql://postgres:eric@localhost:5432/analysis

'Connected: postgres@analysis'

### Create some experimental tables

In [6]:
%%sql
CREATE TABLE departments (
    dept_id bigserial,
    dept varchar(100),
    city varchar(100),
    CONSTRAINT dept_key PRIMARY KEY (dept_id),
    CONSTRAINT dept_city_unique UNIQUE (dept, city)
);

CREATE TABLE employees (
    emp_id bigserial,
    first_name varchar(100),
    last_name varchar(100),
    salary integer,
    dept_id integer REFERENCES departments (dept_id),
    CONSTRAINT emp_key PRIMARY KEY (emp_id),
    CONSTRAINT emp_dept_unique UNIQUE (emp_id, dept_id)
);

 * postgresql://postgres:***@localhost:5432/analysis
Done.
Done.


[]

In [8]:
%%sql
INSERT INTO departments (dept, city)
VALUES
    ('Tax', 'Atlanta'),
    ('IT', 'Boston');

INSERT INTO employees (first_name, last_name, salary, dept_id)
VALUES
    ('Nancy', 'Jones', 62500, 1),
    ('Lee', 'Smith', 59300, 1),
    ('Soo', 'Nguyen', 83000, 2),
    ('Janet', 'King', 95000, 2);

 * postgresql://postgres:***@localhost:5432/analysis
2 rows affected.
4 rows affected.


[]

In [9]:
%sql select * from departments;

 * postgresql://postgres:***@localhost:5432/analysis
2 rows affected.


dept_id,dept,city
1,Tax,Atlanta
2,IT,Boston


In [10]:
%sql select * from employees;

 * postgresql://postgres:***@localhost:5432/analysis
4 rows affected.


emp_id,first_name,last_name,salary,dept_id
1,Nancy,Jones,62500,1
2,Lee,Smith,59300,1
3,Soo,Nguyen,83000,2
4,Janet,King,95000,2


In [11]:
%%sql
SELECT * 
FROM employees JOIN departments
ON employees.dept_id = departments.dept_id



 * postgresql://postgres:***@localhost:5432/analysis
4 rows affected.


emp_id,first_name,last_name,salary,dept_id,dept_id_1,dept,city
1,Nancy,Jones,62500,1,1,Tax,Atlanta
2,Lee,Smith,59300,1,1,Tax,Atlanta
3,Soo,Nguyen,83000,2,2,IT,Boston
4,Janet,King,95000,2,2,IT,Boston


### Join theory

JOIN Returns rows from both tables where matching values are found in the joined columns of both tables. Alternate syntax is INNER JOIN.

LEFT JOIN Returns every row from the left table plus rows that match values in the joined column from the right table. When a left table row doesn’t have a match in the right table, the result shows no values (NULL) from the right table.

RIGHT JOIN Returns every row from the right table plus rows that match the key values in the key column from the left table. When a right table row doesn’t have a match in the left table, the result shows no values (NULL) from the left table.

FULL OUTER JOIN Returns every row from both tables and joins the rows where values in the joined columns match. If there’s no match for a value in either the left or right table, the query result contains NULL values in the other table’s columns.

CROSS JOIN Returns every possible combination of rows from both tables.
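
As a rough mental model, the join types can be sketched with Python sets over the id values used in the schools_left and schools_right tables created in the next cell. This is only an illustration, not how PostgreSQL executes joins, and it relies on the assumption that id is unique within each table (true here because id is a primary key). The variable names are made up for this sketch.

```python
from itertools import product

# id values from schools_left and schools_right (unique per table)
left_ids = {1, 2, 5, 6}
right_ids = {1, 2, 3, 4, 6}

inner_ids = left_ids & right_ids       # JOIN: ids present in both tables
left_join_ids = left_ids               # LEFT JOIN: every left id, matched or not
right_join_ids = right_ids             # RIGHT JOIN: every right id, matched or not
full_outer_ids = left_ids | right_ids  # FULL OUTER JOIN: every id from either table
# CROSS JOIN: every (left, right) pairing, regardless of matching
cross_pairs = list(product(sorted(left_ids), sorted(right_ids)))

print(sorted(inner_ids))       # [1, 2, 6]
print(sorted(full_outer_ids))  # [1, 2, 3, 4, 5, 6]
print(len(cross_pairs))        # 20
```

The row counts (3, 4, 5, 6 and 20) line up with the query results later in this notebook.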


Creating two tables to explore JOIN types

In [12]:
%%sql
CREATE TABLE schools_left (
    id integer CONSTRAINT left_id_key PRIMARY KEY,
    left_school varchar(30)
);

CREATE TABLE schools_right (
    id integer CONSTRAINT right_id_key PRIMARY KEY,
    right_school varchar(30)
);

INSERT INTO schools_left (id, left_school) VALUES
    (1, 'Oak Street School'),
    (2, 'Roosevelt High School'),
    (5, 'Washington Middle School'),
    (6, 'Jefferson High School');

INSERT INTO schools_right (id, right_school) VALUES
    (1, 'Oak Street School'),
    (2, 'Roosevelt High School'),
    (3, 'Morrison Elementary'),
    (4, 'Chase Magnet Academy'),
    (6, 'Jefferson High School');

 * postgresql://postgres:***@localhost:5432/analysis
Done.
Done.
4 rows affected.
5 rows affected.


[]

In [15]:
%sql select * from schools_left;

 * postgresql://postgres:***@localhost:5432/analysis
4 rows affected.


id,left_school
1,Oak Street School
2,Roosevelt High School
5,Washington Middle School
6,Jefferson High School


In [16]:
%sql select * from schools_right;

 * postgresql://postgres:***@localhost:5432/analysis
5 rows affected.


id,right_school
1,Oak Street School
2,Roosevelt High School
3,Morrison Elementary
4,Chase Magnet Academy
6,Jefferson High School


### JOIN

In [18]:
%%sql
SELECT *
FROM schools_left JOIN schools_right
ON schools_left.id = schools_right.id;

 * postgresql://postgres:***@localhost:5432/analysis
3 rows affected.


id,left_school,id_1,right_school
1,Oak Street School,1,Oak Street School
2,Roosevelt High School,2,Roosevelt High School
6,Jefferson High School,6,Jefferson High School


JOIN returns only the three rows whose id values match in both tables. It doesn't return rows that exist in only one of the tables; use one of the other join types for that.

### Left and Right Join
LEFT JOIN and RIGHT JOIN each return all rows from one table and show NULL values in the other table's columns when no matching values are found in the joined columns.

In [21]:
%%sql
SELECT *
FROM schools_left LEFT JOIN schools_right
ON schools_left.id = schools_right.id

 * postgresql://postgres:***@localhost:5432/analysis
4 rows affected.


id,left_school,id_1,right_school
1,Oak Street School,1.0,Oak Street School
2,Roosevelt High School,2.0,Roosevelt High School
5,Washington Middle School,,
6,Jefferson High School,6.0,Jefferson High School


The result shows all four rows from schools_left as well as the three rows in schools_right where the id fields matched. Because schools_right doesn’t contain a value of 5 in its id column, there’s no match, so LEFT JOIN shows empty (NULL) values on the right rather than omitting the entire row from the left table as JOIN does.



In [22]:
%%sql
SELECT *
FROM schools_left RIGHT JOIN schools_right
ON schools_left.id = schools_right.id

 * postgresql://postgres:***@localhost:5432/analysis
5 rows affected.


id,left_school,id_1,right_school
1.0,Oak Street School,1,Oak Street School
2.0,Roosevelt High School,2,Roosevelt High School
,,3,Morrison Elementary
,,4,Chase Magnet Academy
6.0,Jefferson High School,6,Jefferson High School


The same applies to RIGHT JOIN: all five rows from schools_right are returned, and ids 3 and 4 have no match in schools_left. Use LEFT JOIN or RIGHT JOIN when:

- You want the query results to contain all the rows from one of the tables.
- You want to look for missing values in one of the tables; for example, when you’re comparing data about an entity representing two different time periods.
- You know some rows in a joined table won’t have matching values.
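
The "look for missing values" case can be sketched by filtering a LEFT JOIN on NULL in the right table's key. The example below is a minimal, standalone sketch that assumes an in-memory SQLite database (via Python's sqlite3 module) as a stand-in for the PostgreSQL analysis database; the query itself should work unchanged in a %%sql cell.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL database
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE schools_left (id integer PRIMARY KEY, left_school varchar(30));
CREATE TABLE schools_right (id integer PRIMARY KEY, right_school varchar(30));
INSERT INTO schools_left (id, left_school) VALUES
    (1, 'Oak Street School'), (2, 'Roosevelt High School'),
    (5, 'Washington Middle School'), (6, 'Jefferson High School');
INSERT INTO schools_right (id, right_school) VALUES
    (1, 'Oak Street School'), (2, 'Roosevelt High School'),
    (3, 'Morrison Elementary'), (4, 'Chase Magnet Academy'),
    (6, 'Jefferson High School');
""")

# Keep only left rows that found no partner in schools_right
missing_from_right = conn.execute("""
    SELECT schools_left.id, schools_left.left_school
    FROM schools_left LEFT JOIN schools_right
    ON schools_left.id = schools_right.id
    WHERE schools_right.id IS NULL
""").fetchall()

print(missing_from_right)  # [(5, 'Washington Middle School')]
```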

### Full Outer Join
Returns all rows from both tables, whether or not they have a match.

In [24]:
%%sql
SELECT * 
FROM schools_left FULL OUTER JOIN schools_right
ON schools_left.id = schools_right.id

 * postgresql://postgres:***@localhost:5432/analysis
6 rows affected.


id,left_school,id_1,right_school
1.0,Oak Street School,1.0,Oak Street School
2.0,Roosevelt High School,2.0,Roosevelt High School
5.0,Washington Middle School,,
6.0,Jefferson High School,6.0,Jefferson High School
,,4.0,Chase Magnet Academy
,,3.0,Morrison Elementary


FULL OUTER JOIN returns all the rows from both tables. Use it to:
- merge two data sources that partially overlap
- visualize the degree to which the tables share matching values.
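
One way to quantify how much the two tables overlap is to classify each row of the full outer join as matched, left only, or right only. The sketch below again assumes an in-memory SQLite database as a stand-in for PostgreSQL. Because older SQLite versions lack FULL OUTER JOIN, it emulates one with a LEFT JOIN plus the unmatched right rows; in PostgreSQL the FULL OUTER JOIN query above could be used directly. The classify helper is hypothetical, written only for this illustration.

```python
import sqlite3
from collections import Counter

# Assumption: in-memory SQLite stands in for the PostgreSQL database
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE schools_left (id integer PRIMARY KEY, left_school varchar(30));
CREATE TABLE schools_right (id integer PRIMARY KEY, right_school varchar(30));
INSERT INTO schools_left (id, left_school) VALUES
    (1, 'Oak Street School'), (2, 'Roosevelt High School'),
    (5, 'Washington Middle School'), (6, 'Jefferson High School');
INSERT INTO schools_right (id, right_school) VALUES
    (1, 'Oak Street School'), (2, 'Roosevelt High School'),
    (3, 'Morrison Elementary'), (4, 'Chase Magnet Academy'),
    (6, 'Jefferson High School');
""")

# Emulated full outer join: all left rows, then right rows with no left match
rows = conn.execute("""
    SELECT l.id, r.id
    FROM schools_left AS l LEFT JOIN schools_right AS r ON l.id = r.id
    UNION ALL
    SELECT l.id, r.id
    FROM schools_right AS r LEFT JOIN schools_left AS l ON l.id = r.id
    WHERE l.id IS NULL
""").fetchall()

def classify(left_id, right_id):
    """Label a joined row by which side supplied a value."""
    if left_id is None:
        return "right only"
    if right_id is None:
        return "left only"
    return "both"

summary = Counter(classify(left_id, right_id) for left_id, right_id in rows)
print(summary["both"], summary["left only"], summary["right only"])  # 3 1 2
```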

### Cross Join
Returns the Cartesian product: every combination of rows from the two tables.

In [25]:
%%sql
SELECT *
FROM schools_left CROSS JOIN schools_right;


 * postgresql://postgres:***@localhost:5432/analysis
20 rows affected.


id,left_school,id_1,right_school
1,Oak Street School,1,Oak Street School
1,Oak Street School,2,Roosevelt High School
1,Oak Street School,3,Morrison Elementary
1,Oak Street School,4,Chase Magnet Academy
1,Oak Street School,6,Jefferson High School
2,Roosevelt High School,1,Oak Street School
2,Roosevelt High School,2,Roosevelt High School
2,Roosevelt High School,3,Morrison Elementary
2,Roosevelt High School,4,Chase Magnet Academy
2,Roosevelt High School,6,Jefferson High School


The result has 20 rows: the four rows in the left table times the five rows in the right table (only the first 10 are listed above).

### Specifying columns

In [27]:
%%sql
SELECT id 
FROMschools_left LEFT JOIN schools_right
ON schools_left.id = schools_right.id

 * postgresql://postgres:***@localhost:5432/analysis
(psycopg2.ProgrammingError) syntax error at or near "LEFT"
LINE 2: FROMschools_left LEFT JOIN schools_right
                         ^
 [SQL: 'SELECT id \nFROMschools_left LEFT JOIN schools_right\nON schools_left.id = schools_right.id'] (Background on this error at: http://sqlalche.me/e/f405)


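This cell fails, but not for the intended reason: the syntax error shown comes from the missing space between FROM and schools_left. With the space restored, the query still fails, because id exists in both tables and PostgreSQL reports the column reference as ambiguous. Add the table name to fix it.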

In [29]:
%%sql
SELECT schools_left.id,
        schools_left.left_school,
        schools_right.right_school
FROM schools_left LEFT JOIN schools_right
ON schools_left.id = schools_right.id

 * postgresql://postgres:***@localhost:5432/analysis
4 rows affected.


id,left_school,right_school
1,Oak Street School,Oak Street School
2,Roosevelt High School,Roosevelt High School
5,Washington Middle School,
6,Jefferson High School,Jefferson High School
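
The ambiguity and its fix can be reproduced in a small standalone sketch. It assumes an in-memory SQLite database rather than PostgreSQL, so the exact error text differs from what psycopg2 would report, but the cause is the same: an unqualified id that exists in both tables.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL database;
# the tables are left empty because the error occurs before any rows are read
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE schools_left (id integer PRIMARY KEY, left_school varchar(30));
CREATE TABLE schools_right (id integer PRIMARY KEY, right_school varchar(30));
""")

error_message = None
try:
    # Unqualified id: the database cannot tell which table's id is meant
    conn.execute("""
        SELECT id
        FROM schools_left LEFT JOIN schools_right
        ON schools_left.id = schools_right.id
    """)
except sqlite3.OperationalError as exc:
    error_message = str(exc)

print(error_message)

# Qualifying the column with its table name resolves the ambiguity
qualified_rows = conn.execute("""
    SELECT schools_left.id
    FROM schools_left LEFT JOIN schools_right
    ON schools_left.id = schools_right.id
""").fetchall()
```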


### Using Aliases
Table aliases shorten the table names in a query and improve the readability of joins.

In [3]:
%%sql
SELECT lt.id, lt.left_school, rt.right_school
FROM schools_left AS lt LEFT JOIN schools_right AS rt
ON lt.id = rt.id

 * postgresql://postgres:***@localhost:5432/analysis
4 rows affected.


id,left_school,right_school
1,Oak Street School,Oak Street School
2,Roosevelt High School,Roosevelt High School
5,Washington Middle School,
6,Jefferson High School,Jefferson High School
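
Aliases apply equally to the employees and departments tables from the start of this notebook. The sketch below is a minimal standalone version that assumes an in-memory SQLite database in place of PostgreSQL, so bigserial is replaced by an integer primary key (which SQLite auto-assigns) and the constraints are trimmed. The aliases e and d are arbitrary names chosen for this example.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL database
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (
    dept_id integer PRIMARY KEY,
    dept varchar(100),
    city varchar(100)
);
CREATE TABLE employees (
    emp_id integer PRIMARY KEY,
    first_name varchar(100),
    last_name varchar(100),
    salary integer,
    dept_id integer REFERENCES departments (dept_id)
);
INSERT INTO departments (dept, city) VALUES
    ('Tax', 'Atlanta'), ('IT', 'Boston');
INSERT INTO employees (first_name, last_name, salary, dept_id) VALUES
    ('Nancy', 'Jones', 62500, 1), ('Lee', 'Smith', 59300, 1),
    ('Soo', 'Nguyen', 83000, 2), ('Janet', 'King', 95000, 2);
""")

# e and d are short aliases for employees and departments
rows = conn.execute("""
    SELECT e.first_name, e.last_name, d.dept, d.city
    FROM employees AS e JOIN departments AS d
    ON e.dept_id = d.dept_id
    ORDER BY e.emp_id
""").fetchall()

for row in rows:
    print(row)
```

Selecting named columns through the aliases also avoids the duplicated dept_id column that SELECT * produced in the first join.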
