**Querying PostgreSQL in a Jupyter notebook**

Accessing the database from Python inside a notebook has been useful for keeping notes alongside the queries. It has also been nice for breaking complicated queries into pieces and then putting them back together.
-Ben

In [2]:
import pandas as pd
import sqlalchemy
import sqlalchemy_utils
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2

# Accessing the database Alise set up

In [3]:
# Working with PostgreSQL in Python
# Connect to make queries using psycopg2
con = None
con = psycopg2.connect(host = 'ec2-54-245-31-214.us-west-2.compute.amazonaws.com',
                       port = '5291',
                       database = 'ecommerce',
                       user = 'sqlpractice',
                       password = 'iloveSQL!')
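
The cell above writes the host and password directly into the notebook. A minimal sketch of one alternative, assuming the credentials have been exported as environment variables beforehand. The variable names follow the ones libpq reads (`PGHOST`, `PGPORT`, `PGDATABASE`, `PGUSER`, `PGPASSWORD`); the helpers `connection_params` and `connect` are hypothetical and not part of the original notebook.

```python
import os


def connection_params(env=os.environ):
    """Collect keyword arguments for psycopg2.connect() from environment variables."""
    return {
        "host": env["PGHOST"],
        "port": env.get("PGPORT", "5432"),  # 5432 is PostgreSQL's default port
        "database": env["PGDATABASE"],
        "user": env["PGUSER"],
        "password": env["PGPASSWORD"],
    }


def connect(env=os.environ):
    """Open a connection; psycopg2 is imported lazily so connection_params works without it."""
    import psycopg2

    return psycopg2.connect(**connection_params(env))
```

With the variables set, `con = connect()` could replace the hardcoded call, and the rest of the notebook would stay the same.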

# Notes

## Order of execution

1. FROM, including JOINs
2. WHERE
3. GROUP BY
4. HAVING
5. WINDOW functions
6. SELECT
7. DISTINCT
8. UNION
9. ORDER BY
10. LIMIT and OFFSET

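It can be roughly summarized as:
**calling (FROM, WHERE) -> aggregating (GROUP BY, HAVING) -> displaying (SELECT, DISTINCT) -> ordering and limiting (ORDER BY, LIMIT)**

- The query first builds a working dataset from the tables in FROM and JOIN, then WHERE filters its rows by the given conditions.
- Column aliases are created in SELECT, so clauses evaluated before SELECT (such as WHERE and HAVING) cannot reference them, while ORDER BY can. Table aliases can be used throughout the query.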
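
To make the ordering concrete, here is a rough sketch that walks a small query through the same steps in plain Python. The `orders` rows are made-up toy data, not rows from the ecommerce database, and the step numbers in the comments refer to the list above.

```python
from collections import defaultdict

# Toy rows standing in for a table (hypothetical data)
orders = [
    {"country": "USA", "freight": 30.0},
    {"country": "USA", "freight": 70.0},
    {"country": "France", "freight": 20.0},
    {"country": "Brazil", "freight": 65.0},
    {"country": "Brazil", "freight": 5.0},
    {"country": "Germany", "freight": 8.0},
]

# SQL being imitated:
#   SELECT country, SUM(freight) AS total
#   FROM orders
#   WHERE freight > 6
#   GROUP BY country
#   HAVING SUM(freight) > 15
#   ORDER BY total DESC
#   LIMIT 2;

rows = list(orders)                                   # 1. FROM
rows = [r for r in rows if r["freight"] > 6]          # 2. WHERE (the alias "total" does not exist yet)

groups = defaultdict(list)                            # 3. GROUP BY
for r in rows:
    groups[r["country"]].append(r["freight"])

kept = {c: f for c, f in groups.items() if sum(f) > 15}            # 4. HAVING
result = [{"country": c, "total": sum(f)} for c, f in kept.items()]  # 6. SELECT creates the alias
result.sort(key=lambda r: r["total"], reverse=True)   # 9. ORDER BY can use the alias
result = result[:2]                                   # 10. LIMIT

print(result)
```

The HAVING step has to repeat `sum(f)` because the alias is only created in the SELECT step, which mirrors why HAVING cannot reference a SELECT alias.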

## Window functions

- Window functions are only allowed in the SELECT list and in the ORDER BY clause.
    - Several kinds of functions can be used as window functions, including aggregates (e.g. SUM, AVG) and ranking functions.

- Common window functions
    - RANK() - ties share a rank, leaving gaps in the rank values after a tie
    - DENSE_RANK() - ties share a rank, with no gaps in rank values
    - ROW_NUMBER() - assigns a unique sequential integer to each row within a partition of a result set, starting at 1
    - NTILE(n) - splits the ordered rows into n buckets, e.g. to identify quartiles or percentiles
    - LAG() - pulls a value from a previous row
    - LEAD() - pulls a value from a following row
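
As a study aid, the behavior of these functions can be imitated in plain Python. This is only a sketch of the semantics over a single partition, assuming the input list is already sorted the way the ORDER BY inside OVER() would sort it; the function names and the sample values are made up for illustration.

```python
def row_number(values):
    """ROW_NUMBER(): 1, 2, 3, ... in the given order."""
    return list(range(1, len(values) + 1))


def rank(values):
    """RANK(): ties share a rank and the next rank skips ahead."""
    ranks = []
    for i, v in enumerate(values):
        if i > 0 and v == values[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(i + 1)
    return ranks


def dense_rank(values):
    """DENSE_RANK(): ties share a rank and no rank is skipped."""
    ranks = []
    for i, v in enumerate(values):
        if i == 0:
            ranks.append(1)
        elif v == values[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(ranks[-1] + 1)
    return ranks


def ntile(n_rows, buckets):
    """NTILE(buckets): bucket numbers for n_rows ordered rows; earlier buckets take the extra rows."""
    base, extra = divmod(n_rows, buckets)
    out = []
    for b in range(1, buckets + 1):
        out.extend([b] * (base + (1 if b <= extra else 0)))
    return out


def lag(values, offset=1):
    """LAG(): value from `offset` rows earlier, or None when there is no such row."""
    return [values[i - offset] if i - offset >= 0 else None for i in range(len(values))]


def lead(values, offset=1):
    """LEAD(): value from `offset` rows later, or None when there is no such row."""
    n = len(values)
    return [values[i + offset] if i + offset < n else None for i in range(n)]


sample = [90, 90, 80, 70]  # already sorted descending (hypothetical values)
print(rank(sample), dense_rank(sample), row_number(sample))
print(lag(sample), lead(sample), ntile(5, 2))
```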
    

## Other topics to study

- Self joins
- Cross joins
- Data types
- DATES/DATETIME
- Built-in SQL functions
    - ROUND()
    - CAST() - converts a value to another data type, e.g. an integer to a float so that division is not truncated
- Creating and updating tables
- Pivoting tables with CASE statements
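
Two of these topics are small enough to try right away. The sketch below uses Python's built-in `sqlite3` module as a stand-in so it runs without the PostgreSQL server; integer division truncates in both systems, but PostgreSQL would typically use `CAST(7 AS FLOAT)` where SQLite uses `REAL`. The `sales` table and its rows are made up for the example.

```python
import sqlite3

demo = sqlite3.connect(":memory:")

# CAST: integer / integer truncates, so cast one side before dividing
int_div, float_div = demo.execute(
    "SELECT 7 / 2, CAST(7 AS REAL) / 2"
).fetchone()
print(int_div, float_div)

# Pivoting with CASE: one output column per quarter (hypothetical data)
demo.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount INTEGER)")
demo.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", "Q1", 10), ("East", "Q2", 20), ("West", "Q1", 5), ("West", "Q1", 7)],
)
pivot = demo.execute("""
SELECT region,
       SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
       SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2
FROM sales
GROUP BY region
ORDER BY region;
""").fetchall()
print(pivot)
demo.close()
```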

# Queries

## Overview of tables

The connection is set to the ecommerce database.

In [4]:
## Show all tables in database
sql_query = """
SELECT table_name
FROM information_schema.tables
WHERE table_schema='public'
AND table_type='BASE TABLE';
"""

df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,table_name
0,categories
1,customergroupthresholds
2,customers
3,products
4,suppliers
5,employees
6,orderdetails
7,orders
8,shippers


In [5]:
# Preview table
sql_query = """
SELECT * FROM categories
LIMIT 5;
"""

df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,categoryid,categoryname,description
0,1,Beverages,"Soft drinks, coffees, teas, beers, and ales"
1,2,Condiments,"Sweet and savory sauces, relishes, spreads, an..."
2,3,Confections,"Desserts, candies, and sweet breads"
3,4,Dairy Products,Cheeses
4,5,Grains/Cereals,"Breads, crackers, pasta, and cereal"


In [6]:
# Overview of table
sql_query = """
SELECT *
FROM customergroupthresholds
LIMIT 5;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,customergroupname,rangebottom,rangetop
0,Low,$0.00,"$1,000.00"
1,Medium,"$1,000.00","$5,000.00"
2,High,"$5,000.00","$10,000.00"
3,Very High,"$10,000.00","$922,337,203,685,477.58"


In [7]:
## Overview of table
sql_query = """
SELECT *
FROM customers
LIMIT 3;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,customerid,companyname,contactname,contacttitle,address,city,region,postalcode,country,phone,fax
0,ALFKI,Alfreds Futterkiste,Maria Anders,Sales Representative,Obere Str. 57,Berlin,,12209,Germany,030-0074321,030-0076545
1,ANATR,Ana Trujillo Emparedados y helados,Ana Trujillo,Owner,Avda. de la Constitucin 2222,Mxico D.F.,,5021,Mexico,(5) 555-4729,(5) 555-3745
2,ANTON,Antonio Moreno Taquera,Antonio Moreno,Owner,Mataderos 2312,Mxico D.F.,,5023,Mexico,(5) 555-3932,


In [8]:
## Overview of table
sql_query = """
SELECT *
FROM suppliers
LIMIT 3;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,supplierid,companyname,contactname,contacttitle,address,city,region,postalcode,country,phone,fax,homepage
0,1,Exotic Liquids,Charlotte Cooper,Purchasing Manager,49 Gilbert St.,London,,EC1 4SD,UK,(171) 555-2222,,
1,2,New Orleans Cajun Delights,Shelley Burke,orders Administrator,P.O. Box 78934,New Orleans,LA,70117,USA,(100) 555-4822,,#CAJUN.HTM#
2,3,Grandma Kelly's Homestead,Regina Murphy,Sales Representative,707 Oxford Rd.,Ann Arbor,MI,48104,USA,(313) 555-5735,(313) 555-3349,


In [9]:
## Overview of table
sql_query = """
SELECT *
FROM employees
LIMIT 3;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,employeeid,lastname,firstname,title,titleofcourtesy,birthdate,hiredate,address,city,region,postalcode,country,homephone,extension,notes,reportsto,photopath
0,1,Davolio,Nancy,Sales Representative,Ms.,1966-12-08,2010-05-01,507 - 20th Ave. E. Apt.57,New York,NY,10027,USA,(206) 555-9857,5467,Education includes a BA in psychology from Col...,2.0,http://accweb/emmployees/davolio.bmp
1,2,Fuller,Andrew,"Vice President, Sales",Dr.,1970-02-19,2010-08-14,908 W. Capital Way,Tacoma,WA,98401,USA,(206) 555-9482,3457,Andrew received his BTS commercial in 1974 and...,,http://accweb/emmployees/fuller.bmp
2,3,Leverling,Janet,Sales Representative,Ms.,1981-08-30,2010-04-01,722 Moss Bay Blvd.,Kirkland,WA,98033,USA,(206) 555-3412,3355,Janet has a BS degree in chemistry from Boston...,2.0,http://accweb/emmployees/leverling.bmp


In [10]:
## Overview of table
sql_query = """
SELECT *
FROM orderdetails
LIMIT 3;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,orderid,productid,unitprice,quantity,discount
0,10248,11,$14.00,12,0.0
1,10248,42,$9.80,10,0.0
2,10248,72,$34.80,5,0.0


In [11]:
## Overview of table
sql_query = """
SELECT *
FROM orders
LIMIT 3;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,orderid,customerid,employeeid,orderdate,requireddate,shippeddate,shipvia,freight,shipname,shipaddress,shipcity,shipregion,shippostalcode,shipcountry
0,10248,VINET,5,2014-07-04 08:00:00,2014-08-01,2014-07-16,3,$32.38,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,,51100,France
1,10249,TOMSP,6,2014-07-05 04:00:00,2014-08-16,2014-07-10,1,$11.61,Toms Spezialitten,Luisenstr. 48,Mnster,,44087,Germany
2,10250,HANAR,4,2014-07-08 15:00:00,2014-08-05,2014-07-12,2,$65.83,Hanari Carnes,"Rua do Pao, 67",Rio de Janeiro,RJ,05454-876,Brazil


In [12]:
## Overview of table
sql_query = """
SELECT *
FROM shippers
LIMIT 3;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,shipperid,companyname,phone
0,1,Speedy Express,(503) 555-9831
1,2,United Package,(503) 555-3199
2,3,Federal Shipping,(503) 555-9931


## Window functions: LAG() and LEAD()

What is the difference in quantity ordered of a product from a previous order?

In [13]:
sql_query = """
SELECT *
FROM orderdetails
LIMIT 3;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,orderid,productid,unitprice,quantity,discount
0,10248,11,$14.00,12,0.0
1,10248,42,$9.80,10,0.0
2,10248,72,$34.80,5,0.0


In [14]:
# Limit to one product
sql_query = """
SELECT *
FROM orderdetails
WHERE productid=11
LIMIT 10;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,orderid,productid,unitprice,quantity,discount
0,10248,11,$14.00,12,0.0
1,10296,11,$16.80,12,0.0
2,10327,11,$16.80,50,0.2
3,10353,11,$16.80,12,0.2
4,10365,11,$16.80,24,0.0
5,10407,11,$16.80,30,0.0
6,10434,11,$16.80,6,0.0
7,10442,11,$16.80,30,0.0
8,10443,11,$16.80,6,0.2
9,10466,11,$16.80,10,0.0


In [15]:
# Show LAG and LEAD side by side; ORDER BY orderid makes "previous" and "next" well defined
sql_query = """
SELECT *,
       LAG(quantity, 1) OVER(ORDER BY orderid) AS quantity_prev,
       LEAD(quantity, 1) OVER(ORDER BY orderid) AS quantity_ahead
FROM orderdetails
WHERE productid=11
ORDER BY orderid
LIMIT 10;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,orderid,productid,unitprice,quantity,discount,quantity_prev,quantity_ahead
0,10248,11,$14.00,12,0.0,,12
1,10296,11,$16.80,12,0.0,12.0,50
2,10327,11,$16.80,50,0.2,12.0,12
3,10353,11,$16.80,12,0.2,50.0,24
4,10365,11,$16.80,24,0.0,12.0,30
5,10407,11,$16.80,30,0.0,24.0,6
6,10434,11,$16.80,6,0.0,30.0,30
7,10442,11,$16.80,30,0.0,6.0,6
8,10443,11,$16.80,6,0.2,30.0,10
9,10466,11,$16.80,10,0.0,6.0,5


In [16]:
# Calculate difference
sql_query = """
SELECT t.*,
       t.quantity-t.quantity_prev AS q_diff_fr_prev,
       t.quantity-t.quantity_ahead AS q_diff_fr_ahead
FROM
    (SELECT *,
           LAG(quantity, 1) OVER(ORDER BY orderid) AS quantity_prev,
           LEAD(quantity, 1) OVER(ORDER BY orderid) AS quantity_ahead
    FROM orderdetails
    WHERE productid=11) AS t
ORDER BY t.orderid;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,orderid,productid,unitprice,quantity,discount,quantity_prev,quantity_ahead,q_diff_fr_prev,q_diff_fr_ahead
0,10248,11,$14.00,12,0.0,,12.0,,0.0
1,10296,11,$16.80,12,0.0,12.0,50.0,0.0,-38.0
2,10327,11,$16.80,50,0.2,12.0,12.0,38.0,38.0
3,10353,11,$16.80,12,0.2,50.0,24.0,-38.0,-12.0
4,10365,11,$16.80,24,0.0,12.0,30.0,12.0,-6.0
5,10407,11,$16.80,30,0.0,24.0,6.0,6.0,24.0
6,10434,11,$16.80,6,0.0,30.0,30.0,-24.0,-24.0
7,10442,11,$16.80,30,0.0,6.0,6.0,24.0,24.0
8,10443,11,$16.80,6,0.2,30.0,10.0,-24.0,-4.0
9,10466,11,$16.80,10,0.0,6.0,5.0,4.0,5.0
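
The queries above look at a single product. One way to answer the same question for every product at once is to add PARTITION BY to the window. The sketch below is a hedged illustration that uses the built-in `sqlite3` module (SQLite 3.25 or newer is needed for window functions) in place of the PostgreSQL server. The product 11 rows and the first product 42 row are copied from the outputs above; the other product 42 rows are made up.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE orderdetails (orderid INTEGER, productid INTEGER, quantity INTEGER)"
)
demo.executemany(
    "INSERT INTO orderdetails VALUES (?, ?, ?)",
    [
        (10248, 11, 12), (10248, 42, 10),  # rows shown in the notebook output
        (10296, 11, 12), (10327, 11, 50),  # rows shown in the notebook output
        (10300, 42, 4), (10340, 42, 9),    # made-up rows for product 42
    ],
)

# PARTITION BY restarts LAG for each product; ORDER BY fixes what "previous" means
rows = demo.execute("""
SELECT productid,
       orderid,
       quantity,
       quantity - LAG(quantity, 1) OVER (PARTITION BY productid ORDER BY orderid) AS q_diff_fr_prev
FROM orderdetails
ORDER BY productid, orderid;
""").fetchall()

for row in rows:
    print(row)
demo.close()
```

The first order of each product has no previous row, so its difference comes back as NULL (`None` in Python), just like the first row of the single-product output.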


## Query using CASE

Orders from the USA that were placed in the summer months (June, July, August) need to get shipped from Canada. All others can stay the same. Return the customerid, country, orderdate, shipcountry and a new column you're creating called newshipcountry.

In [17]:
# Overview of customers table
sql_query = """
SELECT *
FROM customers
LIMIT 3;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,customerid,companyname,contactname,contacttitle,address,city,region,postalcode,country,phone,fax
0,ALFKI,Alfreds Futterkiste,Maria Anders,Sales Representative,Obere Str. 57,Berlin,,12209,Germany,030-0074321,030-0076545
1,ANATR,Ana Trujillo Emparedados y helados,Ana Trujillo,Owner,Avda. de la Constitucin 2222,Mxico D.F.,,5021,Mexico,(5) 555-4729,(5) 555-3745
2,ANTON,Antonio Moreno Taquera,Antonio Moreno,Owner,Mataderos 2312,Mxico D.F.,,5023,Mexico,(5) 555-3932,


In [18]:
# Overview of orders table
sql_query = """
SELECT *
FROM orders
LIMIT 3;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,orderid,customerid,employeeid,orderdate,requireddate,shippeddate,shipvia,freight,shipname,shipaddress,shipcity,shipregion,shippostalcode,shipcountry
0,10248,VINET,5,2014-07-04 08:00:00,2014-08-01,2014-07-16,3,$32.38,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,,51100,France
1,10249,TOMSP,6,2014-07-05 04:00:00,2014-08-16,2014-07-10,1,$11.61,Toms Spezialitten,Luisenstr. 48,Mnster,,44087,Germany
2,10250,HANAR,4,2014-07-08 15:00:00,2014-08-05,2014-07-12,2,$65.83,Hanari Carnes,"Rua do Pao, 67",Rio de Janeiro,RJ,05454-876,Brazil


In [21]:
# Limit to US orders
sql_query = """
SELECT c.customerid,
       c.country,
       o.orderdate,
       o.shipcountry
FROM customers AS c
JOIN orders AS o
ON c.customerid=o.customerid
WHERE c.country='USA';
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,customerid,country,orderdate,shipcountry
0,RATTC,USA,2014-07-22 19:00:00,USA
1,WHITC,USA,2014-07-31 00:00:00,USA
2,SPLIR,USA,2014-08-01 05:00:00,USA
3,RATTC,USA,2014-08-02 03:00:00,USA
4,RATTC,USA,2014-08-30 05:00:00,USA
...,...,...,...,...
117,GREAL,USA,2016-04-22 13:00:00,USA
118,GREAL,USA,2016-04-30 00:00:00,USA
119,SAVEA,USA,2016-05-01 01:00:00,USA
120,WHITC,USA,2016-05-01 12:00:00,USA


In [22]:
# Limit to US orders + CASE statement on a fixed date range (covers summer 2014 only)
sql_query = """
SELECT c.customerid,
       c.country,
       o.orderdate,
       o.shipcountry,
       CASE WHEN o.orderdate BETWEEN '2014-06-01' AND '2014-08-31'
                 THEN 'Canada'
            ELSE o.shipcountry
            END AS newshipcountry
FROM customers AS c
JOIN orders AS o
ON c.customerid=o.customerid
WHERE c.country='USA';
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,customerid,country,orderdate,shipcountry,newshipcountry
0,RATTC,USA,2014-07-22 19:00:00,USA,Canada
1,WHITC,USA,2014-07-31 00:00:00,USA,Canada
2,SPLIR,USA,2014-08-01 05:00:00,USA,Canada
3,RATTC,USA,2014-08-02 03:00:00,USA,Canada
4,RATTC,USA,2014-08-30 05:00:00,USA,Canada
...,...,...,...,...,...
117,GREAL,USA,2016-04-22 13:00:00,USA,USA
118,GREAL,USA,2016-04-30 00:00:00,USA,USA
119,SAVEA,USA,2016-05-01 01:00:00,USA,USA
120,WHITC,USA,2016-05-01 12:00:00,USA,USA


In [23]:
# Limit to US orders + CASE statement for the summer months
sql_query = """
SELECT c.customerid,
       c.country,
       o.orderdate,
       o.shipcountry,
       CASE WHEN EXTRACT (MONTH FROM o.orderdate) BETWEEN 6 AND 8
                 THEN 'Canada'
            ELSE o.shipcountry
            END AS newshipcountry
FROM customers AS c
JOIN orders AS o
ON c.customerid=o.customerid
WHERE c.country='USA';
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,customerid,country,orderdate,shipcountry,newshipcountry
0,RATTC,USA,2014-07-22 19:00:00,USA,Canada
1,WHITC,USA,2014-07-31 00:00:00,USA,Canada
2,SPLIR,USA,2014-08-01 05:00:00,USA,Canada
3,RATTC,USA,2014-08-02 03:00:00,USA,Canada
4,RATTC,USA,2014-08-30 05:00:00,USA,Canada
...,...,...,...,...,...
117,GREAL,USA,2016-04-22 13:00:00,USA,USA
118,GREAL,USA,2016-04-30 00:00:00,USA,USA
119,SAVEA,USA,2016-05-01 01:00:00,USA,USA
120,WHITC,USA,2016-05-01 12:00:00,USA,USA


In [26]:
# Limit to US orders + CASE statement for the summer months - check with ORDER BY RANDOM()
sql_query = """
SELECT c.customerid,
       c.country,
       o.orderdate,
       o.shipcountry,
       CASE WHEN EXTRACT (MONTH FROM o.orderdate) BETWEEN 6 AND 8
                 THEN 'Canada'
            ELSE o.shipcountry
            END AS newshipcountry
FROM customers AS c
JOIN orders AS o
ON c.customerid=o.customerid
WHERE c.country='USA'
ORDER BY RANDOM();
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,customerid,country,orderdate,shipcountry,newshipcountry
0,SAVEA,USA,2016-04-06 14:00:00,USA,USA
1,LONEP,USA,2016-02-12 01:00:00,USA,USA
2,SAVEA,USA,2015-10-10 19:00:00,USA,USA
3,WHITC,USA,2015-04-11 00:00:00,USA,USA
4,SPLIR,USA,2016-03-25 21:00:00,USA,USA
...,...,...,...,...,...
117,RATTC,USA,2014-11-05 13:00:00,USA,USA
118,WHITC,USA,2015-03-10 20:00:00,USA,USA
119,GREAL,USA,2016-01-06 22:00:00,USA,USA
120,LONEP,USA,2014-09-30 00:00:00,USA,USA
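
The two CASE versions above return the same first rows but are not equivalent. A small Python sketch of both rules shows where they diverge; the function names are hypothetical, and the last two test dates are made up to probe the edge cases rather than taken from the orders table.

```python
from datetime import datetime


def new_ship_country(order_date, ship_country):
    """Mirror of CASE WHEN EXTRACT(MONTH FROM orderdate) BETWEEN 6 AND 8."""
    if 6 <= order_date.month <= 8:
        return "Canada"
    return ship_country


def new_ship_country_fixed_range(order_date, ship_country):
    """Mirror of CASE WHEN orderdate BETWEEN '2014-06-01' AND '2014-08-31'.

    The date literals become timestamps at midnight, so the upper bound
    is 2014-08-31 00:00:00.
    """
    if datetime(2014, 6, 1) <= order_date <= datetime(2014, 8, 31):
        return "Canada"
    return ship_country


checks = [
    datetime(2014, 7, 22, 19),  # from the output above: summer 2014
    datetime(2016, 4, 22, 13),  # from the output above: not summer
    datetime(2015, 7, 1, 9),    # made-up date: summer of another year
    datetime(2014, 8, 31, 12),  # made-up date: midday on the last day of the range
]
for d in checks:
    print(d, new_ship_country(d, "USA"), new_ship_country_fixed_range(d, "USA"))
```

The month-based rule handles every year and the whole of August 31, which is why it is the safer reading of "placed in the summer months".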


# Practicing LeetCode problems


**Rank scores**

Write a SQL query to rank scores. If there is a tie between two scores, both should have the same ranking. Note that after a tie, the next ranking number should be the next consecutive integer value. In other words, there should be no "holes" between ranks.

| Id | Score |
|---|-----|
| 1  | 3.50  |
| 2  | 3.65  |
| 3  | 4.00  |
| 4  | 3.85  |
| 5  | 4.00  |
| 6  | 3.65  |



## Creating a temporary table with a CTE

In [27]:
# Creating input table
sql_query = """
WITH  input_table (id, score) AS (VALUES (1,3.50), (2,3.65), (3,4.00), (4,3.85), (5,4.00), (6,3.65))

SELECT *
FROM input_table;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,score
0,1,3.5
1,2,3.65
2,3,4.0
3,4,3.85
4,5,4.0
5,6,3.65


In [28]:
# Query with RANK()
sql_query = """
WITH  input_table (id, score) AS (VALUES (1,3.50), (2,3.65), (3,4.00), (4,3.85), (5,4.00), (6,3.65))

SELECT score,
       RANK() OVER(ORDER BY score DESC) AS score_rank
FROM input_table;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,score,score_rank
0,4.0,1
1,4.0,1
2,3.85,3
3,3.65,4
4,3.65,4
5,3.5,6


In [29]:
# Query with DENSE_RANK()
sql_query = """
WITH  input_table (id, score) AS (VALUES (1,3.50), (2,3.65), (3,4.00), (4,3.85), (5,4.00), (6,3.65))

SELECT score,
       DENSE_RANK() OVER(ORDER BY score DESC) AS score_rank
FROM input_table;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,score,score_rank
0,4.0,1
1,4.0,1
2,3.85,2
3,3.65,3
4,3.65,3
5,3.5,4


**Bonus: How would you do this without a window function?**

In [32]:
# Without a window function
sql_query = """
WITH  input_table (id, score) AS (VALUES (1,3.50), (2,3.65), (3,4.00), (4,3.85), (5,4.00), (6,3.65))

SELECT score,
       (SELECT COUNT(DISTINCT(score)) 
        FROM input_table
        WHERE input_table.score >= it.score) AS rank
FROM input_table AS it
ORDER BY rank;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,score,rank
0,4.0,1
1,4.0,1
2,3.85,2
3,3.65,3
4,3.65,3
5,3.5,4
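
The correlated subquery works because a score's dense rank equals the number of distinct scores greater than or equal to it. A short Python sketch of the same counting idea, using the scores from the problem statement (the function name is made up for illustration):

```python
scores = [3.50, 3.65, 4.00, 3.85, 4.00, 3.65]


def dense_rank_by_counting(values):
    """Rank of v = number of distinct values >= v, as in the correlated subquery."""
    return [len({other for other in values if other >= v}) for v in values]


# Pair each score with its rank, then sort by rank like ORDER BY rank
ranked = sorted(zip(scores, dense_rank_by_counting(scores)), key=lambda pair: pair[1])
print(ranked)
```

Like the subquery, this compares every row against every other row, so it scales quadratically with the number of rows; the window-function version is usually the better choice when it is available.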


## Validating solution with temporary table

Do this with a window function first

In [19]:
# Query with RANK()



In [20]:
# Query with dense_rank()


