# Lesson 1.4 - Starter Code 
## Advanced SQL Walkthrough & Independent Practice

Here's the situation: you're working with a PostgreSQL database at a large wine distributor who needs you to maintain their database. You'll use some of your advanced SQL skills to take care of customer cases. Let's begin!

GA provided database credentials:

psql -h dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com -p 5432 -U dsi_student northwind
password: gastudents

First, let's create a SQLAlchemy engine so that we can run SQL from within the IPython notebook using pandas.

In [1]:
#Connect to the remote database with the parameters provided
import pandas as pd
#import psycopg2 
import sqlalchemy


engine = sqlalchemy.create_engine('postgresql://dsi_student:gastudents@dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com:5432/northwind')

# This can work, but sometimes there are issues with the connection being specifically supported by psql
# params = {
#   'dbname': 'northwind',
#   'user': 'dsi_student',
#   'password': 'gastudents',
#   'host': 'dsi.c20gkj5cvu3l.us-east-1.rds.amazonaws.com',
#   'port': 5432
# }

# conn = psycopg2.connect(**params)

Let's check out the schema and tables of the northwind database: https://northwinddatabase.codeplex.com/

In [2]:
pd.read_sql_query('SELECT DISTINCT(table_schema) FROM information_schema.tables\
                          ORDER BY 1;',con=engine)

Unnamed: 0,table_schema
0,information_schema
1,pg_catalog
2,public


In [6]:
pd.read_sql("SELECT DISTINCT(table_type) FROM  information_schema.tables;",con=engine)

Unnamed: 0,table_type
0,BASE TABLE
1,VIEW


In [3]:
#View tables in this database

pd.read_sql("SELECT table_schema,table_name, table_type FROM  information_schema.tables WHERE table_schema = 'public';",con=engine)

Unnamed: 0,table_schema,table_name,table_type
0,public,categories,BASE TABLE
1,public,customercustomerdemo,BASE TABLE
2,public,customerdemographics,BASE TABLE
3,public,customers,BASE TABLE
4,public,employees,BASE TABLE
5,public,employeeterritories,BASE TABLE
6,public,order_details,BASE TABLE
7,public,orders,BASE TABLE
8,public,products,BASE TABLE
9,public,region,BASE TABLE


Check the database documentation for syntax and helpful queries for when things go wrong!

In [7]:
pd.read_sql_query("SELECT * FROM order_details LIMIT 1", con=engine)

Unnamed: 0,OrderID,ProductID,UnitPrice,Quantity,Discount
0,10248,11,14.0,12,0.0


In [8]:
pd.read_sql_query("SELECT * FROM products LIMIT 3", con=engine)

Unnamed: 0,ProductID,ProductName,SupplierID,CategoryID,QuantityPerUnit,UnitPrice,UnitsInStock,UnitsOnOrder,ReorderLevel,Discontinued
0,1,Chai,8,1,10 boxes x 30 bags,18.0,39,0,10,1
1,2,Chang,1,1,24 - 12 oz bottles,19.0,17,40,25,1
2,3,Aniseed Syrup,1,2,12 - 550 ml bottles,10.0,13,70,25,0


## WALK THROUGH FUNCTIONS WE REVIEWED TODAY
*Consider removing the "LIMIT 10" clause at the end of each SQL statement; I included it for readability.

In [10]:
#Leverage a CASE statement to label customers located in Berlin
wt_query0 = """\
SELECT "City",
 CASE WHEN "City" = 'Berlin' THEN 'one'
 ELSE NULL END AS "New City"
 FROM customers
 LIMIT 10;
"""

pd.read_sql(wt_query0, engine)

Unnamed: 0,City,New City
0,Berlin,one
1,México D.F.,
2,México D.F.,
3,London,
4,Luleå,
5,Mannheim,
6,Strasbourg,
7,Madrid,
8,Marseille,
9,Tsawassen,


In [14]:
#Leverage HAVING to find only products with average price greater than 15
wt_query1 = """\
SELECT "ProductID", avg("UnitPrice")
FROM order_details
GROUP BY 1
HAVING avg("UnitPrice") > 15
LIMIT 10;
"""

pd.read_sql(wt_query1, engine)

Unnamed: 0,ProductID,avg
0,43,43.042857
1,8,38.769231
2,11,19.6
3,39,16.68
4,16,16.376745
5,61,27.7875
6,14,21.347727
7,17,36.470271
8,28,41.975757
9,36,17.896774


In [18]:
#Concat
wt_query3 = """\
SELECT CONCAT("City", ', ',  "Country") AS Mailing_Destination
FROM customers
LIMIT 10;
"""

pd.read_sql(wt_query3, engine)

Unnamed: 0,mailing_destination
0,"Berlin, Germany"
1,"México D.F., Mexico"
2,"México D.F., Mexico"
3,"London, UK"
4,"Luleå, Sweden"
5,"Mannheim, Germany"
6,"Strasbourg, France"
7,"Madrid, Spain"
8,"Marseille, France"
9,"Tsawassen, Canada"


In [19]:
#Lower
wt_query4 = """\
SELECT LOWER("City") FROM customers
LIMIT 10; 
"""

pd.read_sql(wt_query4, engine)

Unnamed: 0,lower
0,berlin
1,méxico d.f.
2,méxico d.f.
3,london
4,luleå
5,mannheim
6,strasbourg
7,madrid
8,marseille
9,tsawassen


## INDEPENDENT PRACTICE SOLUTIONS

In [22]:
#query to check to make sure table names are accurate
query_0 = """
SELECT * 
FROM order_details
LIMIT 1
"""
print(pd.read_sql_query(query_0, con=engine))  # look at column names

query_1 = """
SELECT count(distinct "OrderID")
FROM order_details;"""

print('number of distinct OrderID values:')
print(pd.read_sql_query(query_1, con=engine))

   OrderID  ProductID  UnitPrice  Quantity  Discount
0    10248         11       14.0        12       0.0
number of distinct OrderID values:
   count
0    830


**Question 1: Order Subtotals**

Calculate a subtotal for each order (identified by OrderID).

Table: order_details

Comments for solution Query: This can be done with a query that uses GROUP BY to aggregate the data for each order.
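
A minimal sketch of the GROUP BY approach described above, run against an in-memory SQLite table so that it works without the course database. Apart from the first row (taken from the `order_details` sample shown earlier), the rows are made up, and applying the discount inside the subtotal is an assumption about what "subtotal" should mean here.

```python
import sqlite3

# Toy stand-in for order_details; only the first row comes from the notebook output.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE order_details (
    "OrderID" INTEGER, "ProductID" INTEGER,
    "UnitPrice" REAL, "Quantity" INTEGER, "Discount" REAL
)
""")
conn.executemany(
    "INSERT INTO order_details VALUES (?, ?, ?, ?, ?)",
    [
        (10248, 11, 14.0, 12, 0.0),
        (10248, 42, 10.0, 10, 0.0),  # hypothetical row
        (10249, 14, 20.0, 5, 0.5),   # hypothetical row
    ],
)

# One output row per order: sum the discounted line totals within each OrderID.
subtotal_query = """
SELECT "OrderID",
       SUM("UnitPrice" * "Quantity" * (1 - "Discount")) AS subtotal
FROM order_details
GROUP BY "OrderID"
ORDER BY "OrderID";
"""

subtotals = conn.execute(subtotal_query).fetchall()
print(subtotals)
```

The query string uses only standard SQL, so it could likely be passed unchanged to `pd.read_sql(subtotal_query, engine)` against the northwind database.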

**Question 2: Alphabetical List of Products**

Learn more about the products they have in stock in their store. Are you interested in all the products? Even the discontinued ones?

This is a rather simple query to get an alphabetical list of products.
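
As a rough illustration, the sketch below sorts a toy `products` table alphabetically, once with every product and once without the discontinued ones. The three rows are the ones shown in the `products` sample earlier; treating `"Discontinued" = 1` as "discontinued" is an assumption about how the flag is encoded.

```python
import sqlite3

# Toy products table holding the three sample rows from the notebook output.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE products ("ProductID" INTEGER, "ProductName" TEXT, "Discontinued" INTEGER)'
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "Chai", 1), (2, "Chang", 1), (3, "Aniseed Syrup", 0)],
)

# Every product, in alphabetical order.
all_query = 'SELECT "ProductName" FROM products ORDER BY "ProductName";'

# Only products still being sold (assumes 0 means "not discontinued").
current_query = """
SELECT "ProductName"
FROM products
WHERE "Discontinued" = 0
ORDER BY "ProductName";
"""

all_products = [row[0] for row in conn.execute(all_query)]
current_products = [row[0] for row in conn.execute(current_query)]
print(all_products)
print(current_products)
```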

**Question 3: Sales by Year**

Find the order subtotals by ship year.

This query shows how to get the year part from the Shipped_Date column. A subtotal is calculated for each order by a sub-query. The sub-query forms a table that is then joined with the orders table.
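
The sub-query-then-join pattern described above might look like the following sketch. It runs on in-memory SQLite with made-up rows, so it uses SQLite's `strftime('%Y', ...)` to pull out the year; on PostgreSQL the equivalent would be `EXTRACT(YEAR FROM ...)`. The column name `"ShippedDate"` is an assumption and should be checked against the real `orders` table.

```python
import sqlite3

# Toy orders and order_details tables; every row here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders ("OrderID" INTEGER, "ShippedDate" TEXT);
CREATE TABLE order_details (
    "OrderID" INTEGER, "UnitPrice" REAL, "Quantity" INTEGER, "Discount" REAL
);
INSERT INTO orders VALUES (1, '1996-07-16'), (2, '1996-08-01'), (3, '1997-01-10');
INSERT INTO order_details VALUES
    (1, 10.0, 2, 0.0), (1, 5.0, 4, 0.0), (2, 20.0, 1, 0.5), (3, 8.0, 5, 0.0);
""")

# The sub-query computes one subtotal per order; the outer query joins it
# to orders and sums the subtotals by ship year.
sales_by_year_query = """
SELECT strftime('%Y', o."ShippedDate") AS ship_year,
       SUM(s.subtotal) AS total_sales
FROM orders AS o
JOIN (
    SELECT "OrderID",
           SUM("UnitPrice" * "Quantity" * (1 - "Discount")) AS subtotal
    FROM order_details
    GROUP BY "OrderID"
) AS s ON s."OrderID" = o."OrderID"
GROUP BY strftime('%Y', o."ShippedDate")
ORDER BY ship_year;
"""

sales_by_year = conn.execute(sales_by_year_query).fetchall()
print(sales_by_year)
```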

**Question 4: Sales by Product After Discount**

More on sales and products, especially after discounts:

a) Find the sales price by product after discount.

b) Then find the highest-grossing products after discount.

This query calculates the sales price for each order after the discount is applied.
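
One possible reading of parts a) and b), sketched on a toy table with made-up rows: the first query computes the discounted sale price of every order line, and the second sums those prices per product and sorts from highest to lowest. The alias names `sale_price` and `total_sales` are illustrative choices, not names from the course solution.

```python
import sqlite3

# Toy order_details table; rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE order_details (
    "OrderID" INTEGER, "ProductID" INTEGER,
    "UnitPrice" REAL, "Quantity" INTEGER, "Discount" REAL
)
""")
conn.executemany(
    "INSERT INTO order_details VALUES (?, ?, ?, ?, ?)",
    [
        (1, 11, 14.0, 12, 0.0),
        (2, 11, 14.0, 10, 0.5),
        (2, 42, 10.0, 10, 0.0),
        (3, 72, 40.0, 5, 0.25),
    ],
)

# a) Sale price of each order line after the discount is applied.
line_query = """
SELECT "OrderID", "ProductID",
       "UnitPrice" * "Quantity" * (1 - "Discount") AS sale_price
FROM order_details
ORDER BY "OrderID", "ProductID";
"""

# b) Total discounted sales per product, highest grossing first.
top_query = """
SELECT "ProductID",
       SUM("UnitPrice" * "Quantity" * (1 - "Discount")) AS total_sales
FROM order_details
GROUP BY "ProductID"
ORDER BY total_sales DESC;
"""

line_sales = conn.execute(line_query).fetchall()
top_products = conn.execute(top_query).fetchall()
print(line_sales)
print(top_products)
```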

**Question 5: Customers and Suppliers by City**

What type of relationships do you have in each city? Your sales team wants to know so they can better allocate regions and hire more staff.

HINT: Use UNION, and consider adding a constant column to each query to distinguish which table each row came from.
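
To make the hint concrete, here is a small sketch of UNION with a constant label column. Both tables and all rows are made up, and the existence of a `suppliers` table with `"City"` and `"CompanyName"` columns is an assumption to verify against the real schema.

```python
import sqlite3

# Toy customers and suppliers tables; every row is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers ("CompanyName" TEXT, "City" TEXT);
CREATE TABLE suppliers ("CompanyName" TEXT, "City" TEXT);
INSERT INTO customers VALUES ('Customer A', 'Berlin'), ('Customer B', 'London');
INSERT INTO suppliers VALUES ('Supplier X', 'Berlin'), ('Supplier Y', 'Tokyo');
""")

# Each SELECT adds a constant column so the combined rows stay distinguishable.
union_query = """
SELECT "City", "CompanyName", 'Customer' AS relationship FROM customers
UNION
SELECT "City", "CompanyName", 'Supplier' AS relationship FROM suppliers
ORDER BY 1, 2;
"""

relationships = conn.execute(union_query).fetchall()
for row in relationships:
    print(row)
```

Without the constant column, a customer and a supplier sharing a city would be indistinguishable in the combined result.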
    


**Question 6: Find the products sold and total sale by category and product name**

For each category, we get the list of products sold and the total sales amount. 

Comments for solution Query: Note that the inner query for the nested table (i.e. "nested_table") gets the sales for each product on each order. It is then joined with the outer query on Product_ID. In the outer query, products are grouped by category.
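
The nested-table structure described above could be sketched as follows. The product names and IDs come from the `products` sample shown earlier; the category names, the order rows, and the column names of `categories` are assumptions made for the example.

```python
import sqlite3

# Toy tables; order rows and category names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories ("CategoryID" INTEGER, "CategoryName" TEXT);
CREATE TABLE products ("ProductID" INTEGER, "ProductName" TEXT, "CategoryID" INTEGER);
CREATE TABLE order_details (
    "OrderID" INTEGER, "ProductID" INTEGER,
    "UnitPrice" REAL, "Quantity" INTEGER, "Discount" REAL
);
INSERT INTO categories VALUES (1, 'Beverages'), (2, 'Condiments');
INSERT INTO products VALUES (1, 'Chai', 1), (2, 'Chang', 1), (3, 'Aniseed Syrup', 2);
INSERT INTO order_details VALUES
    (1, 1, 18.0, 10, 0.0), (2, 1, 18.0, 5, 0.0),
    (2, 2, 19.0, 10, 0.5), (3, 3, 10.0, 3, 0.0);
""")

# Inner query: sales for each product on each order.
# Outer query: join on ProductID, then total by category and product name.
category_sales_query = """
SELECT c."CategoryName", p."ProductName",
       SUM(nested_table.sale) AS product_sales
FROM (
    SELECT "OrderID", "ProductID",
           "UnitPrice" * "Quantity" * (1 - "Discount") AS sale
    FROM order_details
) AS nested_table
JOIN products AS p ON p."ProductID" = nested_table."ProductID"
JOIN categories AS c ON c."CategoryID" = p."CategoryID"
GROUP BY c."CategoryName", p."ProductName"
ORDER BY c."CategoryName", p."ProductName";
"""

category_sales = conn.execute(category_sales_query).fetchall()
print(category_sales)
```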

**Question 7: How many units are in stock by category and supplier continent?**

HINT: Use "IN", "CASE", and "GROUP BY". Use the CASE statement to map countries to continents.

Comments for solution Query: This query shows a CASE statement used in the GROUP BY clause to list the number of units in stock for each product category and supplier continent. Note that if only s.Country (not the CASE statement) is used in the GROUP BY, duplicate rows will appear for each product category and supplier continent.
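
A hedged sketch of grouping by a CASE expression, on made-up data. The country-to-continent mapping, the `suppliers` table layout, and the stock figures are all assumptions for illustration. The CASE text is kept in a Python string so that the SELECT and GROUP BY clauses are guaranteed to use the identical expression.

```python
import sqlite3

# Toy tables; every row is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE suppliers ("SupplierID" INTEGER, "Country" TEXT);
CREATE TABLE categories ("CategoryID" INTEGER, "CategoryName" TEXT);
CREATE TABLE products (
    "ProductID" INTEGER, "SupplierID" INTEGER,
    "CategoryID" INTEGER, "UnitsInStock" INTEGER
);
INSERT INTO suppliers VALUES (1, 'UK'), (2, 'Germany'), (3, 'USA'), (4, 'Japan');
INSERT INTO categories VALUES (1, 'Beverages'), (2, 'Condiments');
INSERT INTO products VALUES (1, 1, 1, 39), (2, 2, 1, 17), (3, 3, 2, 13), (4, 4, 2, 35);
""")

# Map supplier countries to continents; the lists are illustrative, not complete.
continent_case = """CASE
    WHEN s."Country" IN ('UK', 'Germany', 'France') THEN 'Europe'
    WHEN s."Country" IN ('USA', 'Canada') THEN 'America'
    ELSE 'Other'
END"""

stock_query = f"""
SELECT c."CategoryName",
       {continent_case} AS supplier_continent,
       SUM(p."UnitsInStock") AS units_in_stock
FROM products AS p
JOIN suppliers AS s ON s."SupplierID" = p."SupplierID"
JOIN categories AS c ON c."CategoryID" = p."CategoryID"
GROUP BY c."CategoryName", {continent_case}
ORDER BY 1, 2;
"""

stock_by_continent = conn.execute(stock_query).fetchall()
print(stock_by_continent)
```

In this toy data the UK and German suppliers collapse into a single Beverages/Europe row; grouping by `s."Country"` instead would leave them as two separate Europe rows.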