<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Postgres SQL Lab


---
We are going to continue working with the `northwind` database.


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#1.-Inspect-the-database" data-toc-modified-id="1.-Inspect-the-database-0.1">1. Inspect the database</a></span></li><li><span><a href="#2.-Print-schemas" data-toc-modified-id="2.-Print-schemas-0.2">2. Print schemas</a></span></li><li><span><a href="#3.-Table-peek" data-toc-modified-id="3.-Table-peek-0.3">3. Table peek</a></span></li></ul></li><li><span><a href="#4.-Investigating-products" data-toc-modified-id="4.-Investigating-products-1">4. Investigating products</a></span><ul class="toc-item"><li><span><a href="#4.1-What-categories-of-products-is-the-company-selling?" data-toc-modified-id="4.1-What-categories-of-products-is-the-company-selling?-1.1">4.1 What categories of products is the company selling?</a></span></li><li><span><a href="#4.2-How-many-products-per-category-does-the-catalog-contain?" data-toc-modified-id="4.2-How-many-products-per-category-does-the-catalog-contain?-1.2">4.2 How many products per category does the catalog contain?</a></span></li><li><span><a href="#4.3-How-many-not-discontinued-products-are-there-per-category?" data-toc-modified-id="4.3-How-many-not-discontinued-products-are-there-per-category?-1.3">4.3 How many <em>not discontinued</em> products are there per category?</a></span></li><li><span><a href="#4.4-What-are-the-top-five-most-expensive-products-(not-discontinued)?" data-toc-modified-id="4.4-What-are-the-top-five-most-expensive-products-(not-discontinued)?-1.4">4.4 What are the top five most expensive products (not discontinued)?</a></span></li><li><span><a href="#4.5-How-many-units-of-each-of-these-5-products-are-there-in-stock?" data-toc-modified-id="4.5-How-many-units-of-each-of-these-5-products-are-there-in-stock?-1.5">4.5 How many units of each of these 5 products are there in stock?</a></span></li><li><span><a href="#4.6-Use-pandas-to-make-a-useful-bar-chart-of-the-product-data." 
data-toc-modified-id="4.6-Use-pandas-to-make-a-useful-bar-chart-of-the-product-data.-1.6">4.6 Use pandas to make a useful bar chart of the product data.</a></span></li></ul></li><li><span><a href="#5.-Investigating-orders" data-toc-modified-id="5.-Investigating-orders-2">5. Investigating orders</a></span><ul class="toc-item"><li><span><a href="#5.1-How-many-orders-in-total?" data-toc-modified-id="5.1-How-many-orders-in-total?-2.1">5.1 How many orders in total?</a></span></li><li><span><a href="#5.2-How-many-orders-per-year?" data-toc-modified-id="5.2-How-many-orders-per-year?-2.2">5.2 How many orders per year?</a></span></li><li><span><a href="#5.3-How-many-orders-per-quarter?" data-toc-modified-id="5.3-How-many-orders-per-quarter?-2.3">5.3 How many orders per quarter?</a></span></li><li><span><a href="#5.4-Which-country-is-receiving-the-most-orders?" data-toc-modified-id="5.4-Which-country-is-receiving-the-most-orders?-2.4">5.4 Which country is receiving the most orders?</a></span></li><li><span><a href="#5.5-Which-country-is-receiving-the-least?" data-toc-modified-id="5.5-Which-country-is-receiving-the-least?-2.5">5.5 Which country is receiving the least?</a></span></li><li><span><a href="#5.6-What's-the-average-shipping-time-(ShippedDate---OrderDate)?" data-toc-modified-id="5.6-What's-the-average-shipping-time-(ShippedDate---OrderDate)?-2.6">5.6 What's the average shipping time (ShippedDate - OrderDate)?</a></span></li><li><span><a href="#5.7-What-customer-is-submitting-the-highest-number-of-orders?" data-toc-modified-id="5.7-What-customer-is-submitting-the-highest-number-of-orders?-2.7">5.7 What customer is submitting the highest number of orders?</a></span></li><li><span><a href="#5.8-What-customer-is-generating-the-highest-revenue?" 
data-toc-modified-id="5.8-What-customer-is-generating-the-highest-revenue?-2.8">5.8 What customer is generating the highest revenue?</a></span></li><li><span><a href="#5.9-[Challenge]-What-fraction-of-the-revenue-is-generated-by-the-top-5-customers?" data-toc-modified-id="5.9-[Challenge]-What-fraction-of-the-revenue-is-generated-by-the-top-5-customers?-2.9">5.9 [Challenge] What fraction of the revenue is generated by the top 5 customers?</a></span></li></ul></li></ul></div>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
sns.set(font_scale=1.5)

In [2]:
import psycopg2

# DSN (data source name) format for database connections:
# [protocol / database system]://[username]:[password]@[hostname / ip]:[port]/[database name]


# on your computer you are the user postgres (full administrative access)
db_user = 'postgres'
# if you need a password to access a database, put it here
db_password = ''
# on your computer, use localhost
db_host = 'localhost'
# the default port for postgres is 5432
db_port = 5432
# we want to connect to the northwind database
database = 'northwind'

conn_str = f'postgresql://{db_user}:{db_password}@{db_host}:{db_port}/{database}'
conn = psycopg2.connect(conn_str)

**We can generate DataFrames from string queries using pandas `pd.read_sql` like so:**

In [3]:
pd.read_sql("SELECT tablename FROM pg_catalog.pg_tables WHERE schemaname='public'", con=conn)

| | tablename |
|---|---|
| 0 | customer_customer_demo |
| 1 | suppliers |
| 2 | shippers |
| 3 | customer_demographics |
| 4 | territories |
| 5 | region |
| 6 | us_states |
| 7 | categories |
| 8 | employees |
| 9 | employee_territories |


### 1. Inspect the database

If we were connected via the `psql` console, it would be easy to list all tables using `\dt`. From Python, we can get the same information by querying a catalog table such as `pg_catalog.pg_tables` or the `information_schema.tables` view.

**Write a `SELECT` statement that lists all the tables in the public schema of the `northwind` database, sorted alphabetically.**

```*.sql
SELECT tablename 
FROM pg_catalog.pg_tables 
WHERE schemaname='public'
```

In [4]:
# A:
query = """
SELECT tablename 
FROM pg_catalog.pg_tables 
WHERE schemaname='public'
"""

pd.read_sql(query, con=conn)

| | tablename |
|---|---|
| 0 | customer_customer_demo |
| 1 | suppliers |
| 2 | shippers |
| 3 | customer_demographics |
| 4 | territories |
| 5 | region |
| 6 | us_states |
| 7 | categories |
| 8 | employees |
| 9 | employee_territories |
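
The task asks for an alphabetical listing, and without an `ORDER BY` clause PostgreSQL makes no promise about row order. A minimal sketch of a sorted version follows, once against `pg_catalog.pg_tables` and once against the `information_schema.tables` view. The variable names are illustrative and not part of the lab.

```python
# Sorted listing via the PostgreSQL catalog (same source as the query above).
query_pg_tables = """
SELECT tablename
FROM pg_catalog.pg_tables
WHERE schemaname = 'public'
ORDER BY tablename
"""

# Comparable listing via the standard information_schema view.
# table_type = 'BASE TABLE' leaves out views.
query_info_schema = """
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public'
  AND table_type = 'BASE TABLE'
ORDER BY table_name
"""
```

Either string can be passed to `pd.read_sql(..., con=conn)` in the same way as the query above.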


### 2. Print schemas

The `information_schema.columns` view describes every column of every table: which table it belongs to, its name, and its data type.

Query it to display the schema of every public table. In particular, we are interested in the column names and data types. Make sure you only include tables in the `public` schema, so the results are not cluttered with PostgreSQL's internal tables.

Specifically, select the columns `table_name`, `column_name`, `data_type`, and `table_schema`, keeping only rows where `table_schema` is `'public'`.

In [5]:
# A:
query = """
SELECT table_name, column_name, data_type, table_schema
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name
"""

pd.read_sql(query, con=conn).head()

| | table_name | column_name | data_type | table_schema |
|---|---|---|---|---|
| 0 | categories | description | text | public |
| 1 | categories | category_name | character varying | public |
| 2 | categories | category_id | smallint | public |
| 3 | categories | picture | bytea | public |
| 4 | customer_customer_demo | customer_id | character | public |


### 3. Table peek

Another way of quickly looking at table information is to query the first few rows. Do this for a table or two, for example: `orders`, `products`, `us_states`. 

Display only the first 3 rows.
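
One possible way to avoid retyping the same query for each table is a small helper that builds the `LIMIT` query. This is only a sketch: `peek_query` is a hypothetical name, not something the lab provides.

```python
def peek_query(table_name, n_rows=3):
    """Return a SELECT that fetches the first n_rows rows of table_name."""
    # Table names cannot be bound as query parameters, so only plain
    # identifiers are accepted before the name is interpolated.
    if not table_name.isidentifier():
        raise ValueError(f"unexpected table name: {table_name!r}")
    return f"SELECT * FROM {table_name} LIMIT {int(n_rows)}"


# Print the query text for the tables suggested in the task.
for name in ("orders", "products", "us_states"):
    print(peek_query(name))
```

With the connection opened earlier, `pd.read_sql(peek_query('orders'), con=conn)` would return the rows as a DataFrame. Without an `ORDER BY`, which three rows come back is not guaranteed.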

In [6]:
# A:


> Some tables (like `us_states` or `region`) contain information that is probably less prone to change than other tables (like `orders` or `order_details`). This database is well organized to avoid unnecessary duplication. Let's start digging deeper into the data.

## 4. Investigating products

---

What products is this company selling? The `products` and `categories` tables contain information to answer this question.

We will use a combination of SQL queries and Pandas to answer the following questions:

1. What categories of products is the company selling?
2. How many products per category does the catalog contain?
3. How many products per category remain once discontinued products are excluded?
4. What are the five most expensive products (not discontinued)?
5. How many units of each of these five products are there in stock?
6. Construct a bar chart of the data with pandas.

### 4.1 What categories of products is the company selling?

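> Remember that PostgreSQL folds unquoted identifiers to lowercase, so a table or column that was created with capital letters in its name must be referenced in double quotes. String comparisons are case sensitive as well.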
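
As an illustration of how identifier case behaves, here is a sketch of a category query written in two spellings. The column names are taken from the schema listing in section 2; the variable names are illustrative.

```python
# Unquoted identifiers are folded to lowercase, so both spellings below
# address the same table and columns.
query_categories = """
SELECT category_id, category_name, description
FROM categories
ORDER BY category_id
"""

query_categories_mixed_case = """
SELECT CATEGORY_ID, Category_Name, description
FROM Categories
ORDER BY category_id
"""

# A double-quoted identifier is matched exactly: "Category_Name" would
# fail in PostgreSQL because the column is stored as category_name.
```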

### 4.2 How many products per category does the catalog contain?


### 4.3 How many _not discontinued_ products are there per category?
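
A sketch of one way to count active products per category. It assumes `products` has a `category_id` column and an integer `discontinued` flag where `0` means the product is still sold; neither is shown in the schema excerpt above, so check them against your own schema listing.

```python
# Assumed columns: products.category_id, products.discontinued (0 = active).
query_active_per_category = """
SELECT c.category_name,
       COUNT(*) AS n_products
FROM products AS p
JOIN categories AS c ON c.category_id = p.category_id
WHERE p.discontinued = 0
GROUP BY c.category_name
ORDER BY n_products DESC
"""
```

Dropping the `WHERE` line gives the count over the whole catalog. Note that a category whose products are all discontinued does not appear in the result at all.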

### 4.4 What are the top five most expensive products (not discontinued)?

### 4.5 How many units of each of these 5 products are there in stock?
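
The five most expensive active products and their stock levels can be fetched in a single query. This is a sketch that assumes the `products` columns are named `product_name`, `unit_price`, `units_in_stock`, and `discontinued` (with `0` meaning active), which the lab does not confirm.

```python
# Assumed columns: product_name, unit_price, units_in_stock, discontinued.
query_top5_products = """
SELECT product_name, unit_price, units_in_stock
FROM products
WHERE discontinued = 0
ORDER BY unit_price DESC
LIMIT 5
"""
```

If several products share the fifth-highest price, which of them is returned is arbitrary unless a second sort key is added.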

### 4.6 Use pandas to make a useful bar chart of the product data.

## 5. Investigating orders

---

Now that we have a better understanding of products, let's start digging into orders.

1. How many orders are there in total?
2. How many orders per year?
3. How many orders per quarter?
4. Which country is receiving the most orders?
5. Which country is receiving the least?
6. What's the average shipping time (shipped_date - order_date)?
7. What customer is submitting the highest number of orders?
8. What customer is generating the highest revenue (this requires combining `orders` with `order_details`, for example via `pd.merge`)?
9. [Challenge] What fraction of the revenue is generated by the top 5 customers?

### 5.1 How many orders in total?

### 5.2 How many orders per year?  
The SQL [`Extract`](https://www.tutorialspoint.com/sql/sql-date-functions.htm#function_extract) function will be useful here.

### 5.3 How many orders per quarter?

Make a line plot of this data as well.

The `::` operator is PostgreSQL shorthand for `CAST`; here it converts the extracted year and quarter to integers:

```*.sql
    EXTRACT(year FROM order_date)::INT AS year,
    EXTRACT(quarter FROM order_date)::INT AS quarter
```
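
Put together, a per-quarter count might look like the sketch below. It assumes the date column is stored as `order_date`, as in the task list at the top of this section; if your copy of the database uses mixed-case names, write it as `"OrderDate"` in double quotes instead.

```python
# Assumed column: orders.order_date.
# GROUP BY 1, 2 refers to the first and second output columns.
query_orders_per_quarter = """
SELECT EXTRACT(year FROM order_date)::INT AS year,
       EXTRACT(quarter FROM order_date)::INT AS quarter,
       COUNT(*) AS n_orders
FROM orders
GROUP BY 1, 2
ORDER BY 1, 2
"""
```

Removing the `quarter` line (and grouping by `1` only) gives the per-year count. For the line plot, one option is to combine `year` and `quarter` into a single label column in pandas before calling `.plot()`.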

### 5.4 Which country is receiving the most orders?

### 5.5 Which country is receiving the least?
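
Both country questions can be read off one grouped result. The sketch below assumes the destination is stored in a column named `ship_country`, which is not confirmed anywhere in this lab.

```python
# Assumed column: orders.ship_country.
query_orders_per_country = """
SELECT ship_country,
       COUNT(*) AS n_orders
FROM orders
GROUP BY ship_country
ORDER BY n_orders DESC
"""
```

The first row of the result is the country receiving the most orders and the last row the one receiving the fewest; ties share the same count, so it is worth looking at neighbouring rows too.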

### 5.6 What's the average shipping time (ShippedDate - OrderDate)?
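
A sketch of the shipping-time calculation, assuming the columns are named `shipped_date` and `order_date` and are of type `date`. Under that assumption PostgreSQL returns the difference as a whole number of days; with timestamp columns the result would be an interval instead.

```python
# Assumed columns: orders.shipped_date and orders.order_date, both of type date.
# Orders that have not shipped yet (NULL shipped_date) are excluded explicitly;
# AVG would skip their NULL differences anyway.
query_avg_shipping_time = """
SELECT AVG(shipped_date - order_date) AS avg_days_to_ship
FROM orders
WHERE shipped_date IS NOT NULL
"""
```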

### 5.7 What customer is submitting the highest number of orders?

### 5.8 What customer is generating the highest revenue?

> Hint: You will need to join `orders` with `order_details`. And don't forget the discount column!
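
One way to follow the hint is to do the join and the discount arithmetic in SQL. The sketch below assumes `order_details` has `order_id`, `unit_price`, `quantity`, and `discount` columns, that `orders` has `order_id` and `customer_id`, and that `discount` is stored as a fraction between 0 and 1. None of this is confirmed by the lab text.

```python
# Assumed columns: orders.order_id, orders.customer_id,
# order_details.order_id, unit_price, quantity, discount (fraction 0..1).
query_revenue_per_customer = """
SELECT o.customer_id,
       SUM(od.unit_price * od.quantity * (1 - od.discount)) AS revenue
FROM orders AS o
JOIN order_details AS od ON od.order_id = o.order_id
GROUP BY o.customer_id
ORDER BY revenue DESC
"""
```

The first row of the result is the customer with the highest revenue. The same result could be built in pandas by loading both tables and using `pd.merge` on `order_id`.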

### 5.9 [Challenge] What fraction of the revenue is generated by the top 5 customers?
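
Once revenue per customer is available, the fraction itself is a short calculation. The helper below is a sketch; `top_n_share` is a hypothetical name and the numbers in the example are made up, not taken from the northwind data.

```python
def top_n_share(revenues, n=5):
    """Fraction of total revenue contributed by the n largest values."""
    ordered = sorted(revenues, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("total revenue is zero")
    return sum(ordered[:n]) / total


# Hypothetical revenue figures, one per customer.
example_revenues = [50.0, 30.0, 10.0, 5.0, 3.0, 2.0]
print(top_n_share(example_revenues, n=2))  # 0.8
```

For a DataFrame with one row per customer and a `revenue` column, `top_n_share(df["revenue"])` would give the share of the top five.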