**Querying PostgreSQL in a Jupyter notebook**

Useful for writing notes and iterating on SQL queries. The "hard" examples further down show how a query can be broken into smaller parts that are then combined into a more complicated query.
-Ben

In [3]:
import pandas as pd
import sqlalchemy
import sqlalchemy_utils
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2

In [3]:
# Alise's database
# Working with PostgreSQL in Python
# Connect to make queries using psycopg2
con = None
con = psycopg2.connect(host = 'ec2-54-245-31-214.us-west-2.compute.amazonaws.com',
                       port = '5291',
                       database = 'ecommerce',
                       user = 'sqlpractice',
                       password = 'iloveSQL!')

In [4]:
# Define a database name
# Set your postgres username
dbname = "baseball"
username = "lacar"  # change this to your username

# Working with PostgreSQL in Python
# Connect to make queries using psycopg2
con = None
con = psycopg2.connect(database=dbname, user=username)

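# Here, we're using postgres, but sqlalchemy can connect to other things too.
# Note: SQLAlchemy 1.4+ only accepts the "postgresql://" scheme; "postgres://" worked in older versions.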
engine = create_engine("postgres://%s@localhost/%s" % (username, dbname))
print(engine.url)

postgres://lacar@localhost/baseball


# Notes

## Order of execution

1. FROM /JOIN (subqueries)
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT
6. DISTINCT
7. ORDER BY
8. LIMIT/OFFSET

It can be summarized as:
**building the working set -> filtering -> aggregating -> displaying -> sorting and limiting**

<br>
- The database first builds a working dataset (FROM/JOIN), then filters its rows with the WHERE conditions
<br>
- Column aliases defined in SELECT can only be used by clauses that run after SELECT, such as ORDER BY; table aliases are fine everywhere. PostgreSQL makes an exception and also accepts SELECT column aliases in GROUP BY
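
A minimal sketch of the execution order, assuming an in-memory SQLite database as a stand-in for PostgreSQL (the `staff` table and its rows are made up for illustration). WHERE removes a row before grouping, so HAVING sees the already reduced groups, and ORDER BY can use the alias defined in SELECT.

```python
import sqlite3

# Hypothetical toy data; SQLite is only a stand-in for the PostgreSQL server.
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE staff (name TEXT, city TEXT, salary INTEGER);
INSERT INTO staff VALUES
    ('Ann', 'London', 50), ('Bob', 'London', 70),
    ('Cid', 'Tacoma', 40), ('Dee', 'Tacoma', 90), ('Eve', 'Seattle', 80);
""")

rows = con_demo.execute("""
SELECT city, COUNT(*) AS n_staff   -- 5. SELECT defines the alias
FROM staff                         -- 1. FROM builds the working set
WHERE salary >= 50                 -- 2. WHERE drops Cid before grouping
GROUP BY city                      -- 3. GROUP BY
HAVING COUNT(*) >= 2               -- 4. HAVING filters whole groups
ORDER BY n_staff DESC              -- 7. ORDER BY may use the alias
LIMIT 2;                           -- 8. LIMIT
""").fetchall()
print(rows)  # [('London', 2)]
```

Tacoma has two rows in the table, but only one survives WHERE, so HAVING removes the group.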

# Alise session 1

## Window functions

- Window functions are only allowed in the SELECT list and the ORDER BY clause
    - Several kinds of functions can be used as window functions, including aggregate and ranking functions
    
- Common window functions
    - RANK() - ties share a rank, leaving gaps in the rank values
    - DENSE_RANK() - no gaps in rank values
    - ROW_NUMBER() - assigns a unique sequential integer to rows within a partition of a result set; the first row starts with 1
    - NTILE() - splits rows into buckets to identify percentiles or quartiles **no MEDIAN() function; NTILE can approximate it, and PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ...) computes it directly**
    - LAG() - pulls a value from a preceding row to compare rows **important for some questions**
    - LEAD() - pulls a value from a following row to compare rows **important for some questions**
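
The functions listed above can be compared side by side in a small sketch. It assumes an in-memory SQLite database (version 3.25 or later, which supports window functions) as a stand-in for PostgreSQL; the `scores` table is hypothetical.

```python
import sqlite3

# Hypothetical toy data with a tie between players a and b.
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE scores (player TEXT, points INTEGER);
INSERT INTO scores VALUES ('a', 30), ('b', 30), ('c', 20), ('d', 10);
""")

window_rows = con_demo.execute("""
SELECT player,
       points,
       RANK()       OVER (ORDER BY points DESC)         AS rnk,
       DENSE_RANK() OVER (ORDER BY points DESC)         AS dense_rnk,
       ROW_NUMBER() OVER (ORDER BY points DESC, player) AS row_num,
       LAG(points)  OVER (ORDER BY points DESC, player) AS prev_points,
       LEAD(points) OVER (ORDER BY points DESC, player) AS next_points
FROM scores
ORDER BY points DESC, player;
""").fetchall()

for row in window_rows:
    print(row)
# ('a', 30, 1, 1, 1, None, 30)
# ('b', 30, 1, 1, 2, 30, 20)
# ('c', 20, 3, 2, 3, 30, 10)
# ('d', 10, 4, 3, 4, 20, None)
```

After the tie, RANK() jumps to 3 while DENSE_RANK() continues with 2; LAG() and LEAD() return NULL (None) at the edges of the window.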
    

## Optimizing SQL queries

- Query runtime can be affected by multiple factors, including:
    - table size
    - joins
    - aggregations
    
- Amount of data and desired output influences runtime and methods of optimization

- Optimizing query runtime can be done by implementing practices such as:
    - EXPLAIN - place it before any query to see the planner's execution plan and cost estimates (EXPLAIN ANALYZE also runs the query and reports actual timings)
    - Filtering data with WHERE or LIMIT and selecting only columns you need
    - Aggregate tables before joining them
    - Break query into multiple queries
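
As a rough illustration of "aggregate tables before joining them", the sketch below runs both query shapes and checks that they agree. It assumes an in-memory SQLite database as a stand-in for PostgreSQL, and the `customers_demo` and `orders_demo` tables are hypothetical. Whether the second form is actually faster depends on the data and the planner, so compare the plans with EXPLAIN on the real database.

```python
import sqlite3

# Hypothetical toy tables; SQLite is only a stand-in for PostgreSQL.
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE customers_demo (customerid TEXT, country TEXT);
CREATE TABLE orders_demo (orderid INTEGER, customerid TEXT, freight REAL);
INSERT INTO customers_demo VALUES ('A', 'UK'), ('B', 'UK'), ('C', 'USA');
INSERT INTO orders_demo VALUES (1, 'A', 10.0), (2, 'A', 20.0), (3, 'B', 5.0), (4, 'C', 7.5);
""")

# Join every order row first, then aggregate.
join_then_aggregate = """
SELECT c.country, SUM(o.freight) AS total_freight
FROM customers_demo c
JOIN orders_demo o ON c.customerid = o.customerid
GROUP BY c.country
ORDER BY c.country;
"""

# Collapse orders to one row per customer first, then join the smaller result.
aggregate_then_join = """
SELECT c.country, SUM(o.freight_per_customer) AS total_freight
FROM customers_demo c
JOIN (
    SELECT customerid, SUM(freight) AS freight_per_customer
    FROM orders_demo
    GROUP BY customerid
) o ON c.customerid = o.customerid
GROUP BY c.country
ORDER BY c.country;
"""

result_a = con_demo.execute(join_then_aggregate).fetchall()
result_b = con_demo.execute(aggregate_then_join).fetchall()
print(result_a)              # [('UK', 35.0), ('USA', 7.5)]
print(result_a == result_b)  # True
```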


### Other topics to study

- Self joins
- Cross joins
- Data types
- DATES/DATETIME
- Built-in SQL functions
    - ROUND()
    - CAST() - converts a value to another type, e.g. an integer to a float so that division is not truncated
- Creating and updating tables
- Pivoting tables with CASE statements
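
Two of the topics above, CAST() with ROUND() and pivoting with CASE, can be sketched briefly. This assumes an in-memory SQLite database as a stand-in for PostgreSQL; the `staff` table is hypothetical.

```python
import sqlite3

# Hypothetical toy data; SQLite is only a stand-in for PostgreSQL.
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE staff (city TEXT, title TEXT);
INSERT INTO staff VALUES
    ('London', 'Sales Representative'), ('London', 'Sales Manager'),
    ('Tacoma', 'Sales Representative');
""")

# Integer division truncates; casting one operand keeps the fraction.
int_div, float_div = con_demo.execute(
    "SELECT 7 / 2, ROUND(CAST(7 AS REAL) / 2, 1);"
).fetchone()
print(int_div, float_div)  # 3 3.5

# Pivot: one output column per title, built from CASE expressions.
pivot = con_demo.execute("""
SELECT city,
       SUM(CASE WHEN title = 'Sales Representative' THEN 1 ELSE 0 END) AS reps,
       SUM(CASE WHEN title = 'Sales Manager' THEN 1 ELSE 0 END) AS managers
FROM staff
GROUP BY city
ORDER BY city;
""").fetchall()
print(pivot)  # [('London', 1, 1), ('Tacoma', 1, 0)]
```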

## Overview of tables

In [4]:
## Overview of table
sql_query = """
SELECT table_name
FROM information_schema.tables
WHERE table_schema='public'
AND table_type='BASE TABLE';
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

                table_name
0               categories
1  customergroupthresholds
2                customers
3                 products
4                suppliers
5                employees
6             orderdetails
7                   orders
8                 shippers


In [5]:
## Overview of table
sql_query = """
SELECT * FROM public.employees 
LIMIT 5;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

   employeeid   lastname firstname                  title titleofcourtesy  \
0           1    Davolio     Nancy   Sales Representative             Ms.   
1           2     Fuller    Andrew  Vice President, Sales             Dr.   
2           3  Leverling     Janet   Sales Representative             Ms.   
3           4    Peacock  Margaret   Sales Representative            Mrs.   
4           5   Buchanan    Steven          Sales Manager             Mr.   

   birthdate   hiredate                    address      city region  \
0 1966-12-08 2010-05-01  507 - 20th Ave. E. Apt.57  New York     NY   
1 1970-02-19 2010-08-14         908 W. Capital Way    Tacoma     WA   
2 1981-08-30 2010-04-01         722 Moss Bay Blvd.  Kirkland     WA   
3 1955-09-19 2011-05-03       4110 Old Redmond Rd.   Redmond     WA   
4 1973-03-04 2011-10-17            14 Garrett Hill    London   None   

  postalcode country       homephone extension  \
0      10027     USA  (206) 555-9857      5467   
1      984

In [66]:
sql_query = """
SELECT COUNT(*) FROM public.employees 
LIMIT 5;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

   count
0      9


## Question 1
Return a distinct list of titles and the count of employees with this title

In [67]:
sql_query = """
-- GROUP BY already returns one row per title, so DISTINCT is not needed
SELECT title AS unique_title, COUNT(employeeid)
FROM employees 
GROUP BY title;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

               unique_title  count
0             Sales Manager      1
1     Vice President, Sales      1
2      Sales Representative      6
3  Inside Sales Coordinator      1


## Question 2
Return the first name, last name and title of the employee(s) who have "Sales" in their title and a hire date before January 1, 2011

In [68]:
sql_query = """
SELECT firstname, lastname, title 
FROM employees 
WHERE title LIKE '%Sales%'
AND hiredate < '2011-01-01';
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

  firstname   lastname                  title
0     Nancy    Davolio   Sales Representative
1    Andrew     Fuller  Vice President, Sales
2     Janet  Leverling   Sales Representative


## Question 3
Return each employee's first and last name and the number of years of employment

In [69]:
sql_query = """
SELECT firstname, lastname, EXTRACT(YEAR FROM CURRENT_DATE)-EXTRACT(YEAR FROM hiredate) AS employment_length
FROM employees;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

  firstname   lastname  employment_length
0     Nancy    Davolio               10.0
1    Andrew     Fuller               10.0
2     Janet  Leverling               10.0
3  Margaret    Peacock                9.0
4    Steven   Buchanan                9.0
5   Michael     Suyama                9.0
6    Robert       King                8.0
7     Laura   Callahan                8.0
8      Anne  Dodsworth                8.0
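
Subtracting the extracted years counts calendar-year boundaries rather than completed years. The sketch below shows the difference in plain Python, using Andrew Fuller's hire date from the table and the notebook's run date of 2020-03-12 (both function names are made up for illustration). In PostgreSQL, `EXTRACT(YEAR FROM AGE(CURRENT_DATE, hiredate))` should give the completed-years figure.

```python
from datetime import date


def calendar_year_diff(start, today):
    """Mirror EXTRACT(YEAR FROM today) - EXTRACT(YEAR FROM start)."""
    return today.year - start.year


def completed_years(start, today):
    """Count only full years, subtracting one before the anniversary."""
    before_anniversary = (today.month, today.day) < (start.month, start.day)
    return today.year - start.year - int(before_anniversary)


hired = date(2010, 8, 14)   # Andrew Fuller's hiredate
today = date(2020, 3, 12)   # date the notebook was run

print(calendar_year_diff(hired, today))  # 10
print(completed_years(hired, today))     # 9
```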


## Question 4
Return the maximum number of years of employment for any employee

In [70]:
# Method 1
sql_query = """
SELECT EXTRACT(YEAR FROM CURRENT_DATE)-EXTRACT(YEAR FROM hiredate) AS employment_length
FROM employees
ORDER BY employment_length DESC
LIMIT 1;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

   employment_length
0               10.0


In [71]:
# Method 2
sql_query = """
SELECT MAX(EXTRACT(YEAR FROM CURRENT_DATE)-EXTRACT(YEAR FROM hiredate)) AS max_years
FROM employees;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

   max_years
0       10.0


## Question 5
Return the employee city ordered by the total number of unique employees currently employed (Assume each employeeid appears only once)

In [72]:
# Method 1
sql_query = """
SELECT city, COUNT(DISTINCT employeeid) AS no_unique_employees
FROM employees
GROUP BY city
ORDER BY no_unique_employees DESC;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

         city  no_unique_employees
0      London                    2
1       Miner                    1
2    New York                    1
3     Redmond                    1
4     Seattle                    1
5      Tacoma                    1
6    Kirkland                    1
7  Winchester                    1


In [73]:
# Method 2 (Alise method)

sql_query = """
SELECT city, COUNT(employeeid) AS employeecount 
FROM public.employees 
GROUP BY 1 
ORDER BY COUNT(employeeid) DESC;
"""

df_query = pd.read_sql_query(sql_query,con)
print(df_query)

         city  employeecount
0      London              2
1     Seattle              1
2    Kirkland              1
3  Winchester              1
4       Miner              1
5      Tacoma              1
6    New York              1
7     Redmond              1


## Question 6

Return each city and the birthdate of the oldest employee as "mindate"

In [74]:
# Method 1

sql_query =  """
SELECT city, MIN(birthdate) AS mindate
FROM employees
GROUP BY city
"""

df_query = pd.read_sql_query(sql_query, con)
print(df_query)

         city    mindate
0    New York 1966-12-08
1     Seattle 1976-01-09
2    Kirkland 1981-08-30
3      London 1973-03-04
4  Winchester 1978-05-29
5       Miner 1981-07-02
6      Tacoma 1970-02-19
7     Redmond 1955-09-19


In [75]:
# Method 2

sql_query =  """
SELECT city, MIN(birthdate) AS mindate
FROM employees
GROUP BY 1
"""

df_query = pd.read_sql_query(sql_query, con)
print(df_query)

         city    mindate
0    New York 1966-12-08
1     Seattle 1976-01-09
2    Kirkland 1981-08-30
3      London 1973-03-04
4  Winchester 1978-05-29
5       Miner 1981-07-02
6      Tacoma 1970-02-19
7     Redmond 1955-09-19


## Question 7

Return the first and last name and age of the youngest employee
<br>
**Review years extraction from date**

In [76]:
# Method 1

sql_query =  """
SELECT firstname, lastname, EXTRACT(YEAR FROM CURRENT_DATE)-EXTRACT(YEAR FROM birthdate) AS age
FROM employees
ORDER BY age ASC
LIMIT 1;
"""

df_query = pd.read_sql_query(sql_query, con)
print(df_query)

  firstname   lastname   age
0      Anne  Dodsworth  36.0


In [77]:
# Method 2 - use of MAX in where

sql_query =  """
SELECT firstname, lastname, (EXTRACT(YEAR FROM CURRENT_TIMESTAMP) - EXTRACT(YEAR FROM birthdate)) AS age 
FROM public.employees 
WHERE birthdate = (SELECT MAX(birthdate) AS maxbdate FROM public.employees);
"""

df_query = pd.read_sql_query(sql_query, con)
print(df_query)

  firstname   lastname   age
0      Anne  Dodsworth  36.0


## Question 8

Return firstname, lastname and title for employees with a title containing 'Sales' 


In [78]:

sql_query =  """
SELECT firstname, lastname, title
FROM employees
WHERE title LIKE '%Sales%';
"""

df_query = pd.read_sql_query(sql_query, con)
print(df_query)

  firstname   lastname                     title
0     Nancy    Davolio      Sales Representative
1    Andrew     Fuller     Vice President, Sales
2     Janet  Leverling      Sales Representative
3  Margaret    Peacock      Sales Representative
4    Steven   Buchanan             Sales Manager
5   Michael     Suyama      Sales Representative
6    Robert       King      Sales Representative
7     Laura   Callahan  Inside Sales Coordinator
8      Anne  Dodsworth      Sales Representative


## Question 9

In [79]:
## Overview of table
sql_query = """
SELECT * FROM public.categories 
LIMIT 5;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

   categoryid    categoryname  \
0           1       Beverages   
1           2      Condiments   
2           3     Confections   
3           4  Dairy Products   
4           5  Grains/Cereals   

                                         description  
0        Soft drinks, coffees, teas, beers, and ales  
1  Sweet and savory sauces, relishes, spreads, an...  
2                Desserts, candies, and sweet breads  
3                                            Cheeses  
4                Breads, crackers, pasta, and cereal  


# Alise session 2

3/5/20

## View database

In [154]:
## Show TABLES in database

sql_query = """
SELECT table_name
FROM information_schema.tables
WHERE table_schema='public'
AND table_type='BASE TABLE';
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

                table_name
0               categories
1  customergroupthresholds
2                customers
3                 products
4                suppliers
5                employees
6             orderdetails
7                   orders
8                 shippers


In [90]:
sql_query = """
SELECT * FROM public.products;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

    productid                    productname  supplierid  categoryid  \
0           1                           Chai           1           1   
1           2                          Chang           1           1   
2           3                  Aniseed Syrup           1           2   
3           4   Chef Anton's Cajun Seasoning           2           2   
4           5         Chef Anton's Gumbo Mix           2           2   
..        ...                            ...         ...         ...   
72         73                      Rd Kaviar          17           8   
73         74                  Longlife Tofu           4           7   
74         75             Rhnbru Klosterbier          12           1   
75         76                     Lakkalikri          23           1   
76         77  Original Frankfurter grne Soe          12           2   

        quantityperunit unitprice  unitsinstock  unitsonorder  reorderlevel  \
0    10 boxes x 20 bags    $18.00            39         

## Medium problem 1

Show the list with CategoryNames and the TotalNumber of products for each Category sorted by number of products in desc order. (8 rows). Here you have to combine JOIN and GROUP BY.

In [91]:

sql_query = """
SELECT categoryname, COUNT(productid) AS totalnumber
FROM public.categories 
JOIN public.products
ON public.categories.categoryid=public.products.categoryid
GROUP BY categoryname
ORDER BY totalnumber DESC;

"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

     categoryname  totalnumber
0     Confections           13
1      Condiments           12
2         Seafood           12
3       Beverages           12
4  Dairy Products           10
5  Grains/Cereals            7
6    Meat/Poultry            6
7         Produce            5


## Medium problem 2

In the Customers table, show the total number of Customers per Country and City. (69 rows)

In [93]:
sql_query = """
SELECT * FROM customers
LIMIT 5;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

  customerid                         companyname         contactname  \
0      ALFKI                 Alfreds Futterkiste        Maria Anders   
1      ANATR  Ana Trujillo Emparedados y helados        Ana Trujillo   
2      ANTON              Antonio Moreno Taquera      Antonio Moreno   
3      AROUT                     Around the Horn        Thomas Hardy   
4      BERGS                   Berglunds snabbkp  Christina Berglund   

           contacttitle                       address        city region  \
0  Sales Representative                 Obere Str. 57      Berlin   None   
1                 Owner  Avda. de la Constitucin 2222  Mxico D.F.   None   
2                 Owner               Mataderos  2312  Mxico D.F.   None   
3  Sales Representative               120 Hanover Sq.      London   None   
4  orders Administrator                Berguvsvgen  8        Lule   None   

  postalcode  country           phone             fax  
0      12209  Germany     030-0074321     030-0076545 

In [96]:
sql_query = """
SELECT country, city, COUNT(customerid) AS no_customers
FROM customers
GROUP BY city, country
ORDER BY no_customers DESC
LIMIT 5;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

     country            city  no_customers
0         UK          London             6
1     Mexico      Mxico D.F.             5
2     Brazil       Sao Paulo             4
3  Argentina    Buenos Aires             3
4     Brazil  Rio de Janeiro             3


## Medium problem 3
Which products in our inventory need to be reordered? Sort results by ProductID (22 rows). Use only the condition UnitsInStock <= ReorderLevel.


In [8]:
sql_query = """
SELECT *
FROM products
LIMIT 5;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

   productid                   productname  supplierid  categoryid  \
0          1                          Chai           1           1   
1          2                         Chang           1           1   
2          3                 Aniseed Syrup           1           2   
3          4  Chef Anton's Cajun Seasoning           2           2   
4          5        Chef Anton's Gumbo Mix           2           2   

       quantityperunit unitprice  unitsinstock  unitsonorder  reorderlevel  \
0   10 boxes x 20 bags    $18.00            39             0            10   
1   24 - 12 oz bottles    $19.00            17            40            25   
2  12 - 550 ml bottles    $10.00            13            70            25   
3       48 - 6 oz jars    $22.00            53             0             0   
4             36 boxes    $21.35             0             0             0   

   discontinued  
0         False  
1         False  
2         False  
3         False  
4          True  
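
A minimal sketch of the filter this problem describes, assuming an in-memory SQLite table (the hypothetical `products_demo`) loaded with only the five rows previewed above. Against the real `products` table, the same WHERE and ORDER BY clauses should apply unchanged.

```python
import sqlite3

# The five preview rows: (productid, productname, unitsinstock, reorderlevel).
preview_rows = [
    (1, "Chai", 39, 10),
    (2, "Chang", 17, 25),
    (3, "Aniseed Syrup", 13, 25),
    (4, "Chef Anton's Cajun Seasoning", 53, 0),
    (5, "Chef Anton's Gumbo Mix", 0, 0),
]

con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE products_demo "
    "(productid INTEGER, productname TEXT, unitsinstock INTEGER, reorderlevel INTEGER)"
)
con_demo.executemany("INSERT INTO products_demo VALUES (?, ?, ?, ?)", preview_rows)

reorder_ids = [
    row[0]
    for row in con_demo.execute("""
    SELECT productid, productname, unitsinstock, reorderlevel
    FROM products_demo
    WHERE unitsinstock <= reorderlevel
    ORDER BY productid;
    """)
]
print(reorder_ids)  # [2, 3, 5]
```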


## Medium problem 4
Incorporate the UnitsOnOrder and Discontinued fields: now UnitsOnOrder + UnitsInStock <= ReorderLevel and the Discontinued flag is False (2 rows).


In [13]:
sql_query = """
SELECT productid
FROM products p
WHERE (p.unitsinstock + p.unitsonorder) <= p.reorderlevel
AND p.discontinued='False';
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

   productid
0         30
1         70


In [14]:
sql_query = """
SELECT *
FROM customers
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

   customerid                         companyname              contactname  \
0       ALFKI                 Alfreds Futterkiste             Maria Anders   
1       ANATR  Ana Trujillo Emparedados y helados             Ana Trujillo   
2       ANTON              Antonio Moreno Taquera           Antonio Moreno   
3       AROUT                     Around the Horn             Thomas Hardy   
4       BERGS                   Berglunds snabbkp       Christina Berglund   
..        ...                                 ...                      ...   
86      WARTH                      Wartian Herkku         Pirkko Koskitalo   
87      WELLI              Wellington Importadora            Paula Parente   
88      WHITC                White Clover Markets           Karl Jablonski   
89      WILMK                         Wilman Kala          Matti Karttunen   
90      WOLZA                      Wolski  Zajazd  Zbyszek Piestrzeniewicz   

                 contacttitle                       address    

## Medium problem 6

We want to investigate some more shipping options for our customers. Return the three countries with the highest average freight ordered by average freight in descending order. Use GROUP BY, AVG() and LIMIT statements.

In [14]:
sql_query = """
SELECT * FROM customers
LIMIT 3;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

  customerid                         companyname     contactname  \
0      ALFKI                 Alfreds Futterkiste    Maria Anders   
1      ANATR  Ana Trujillo Emparedados y helados    Ana Trujillo   
2      ANTON              Antonio Moreno Taquera  Antonio Moreno   

           contacttitle                       address        city region  \
0  Sales Representative                 Obere Str. 57      Berlin   None   
1                 Owner  Avda. de la Constitucin 2222  Mxico D.F.   None   
2                 Owner               Mataderos  2312  Mxico D.F.   None   

  postalcode  country         phone           fax  
0      12209  Germany   030-0074321   030-0076545  
1      05021   Mexico  (5) 555-4729  (5) 555-3745  
2      05023   Mexico  (5) 555-3932          None  


In [15]:
sql_query = """
SELECT * FROM orders
LIMIT 3;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

   orderid customerid  employeeid           orderdate requireddate  \
0    10248      VINET           5 2014-07-04 08:00:00   2014-08-01   
1    10249      TOMSP           6 2014-07-05 04:00:00   2014-08-16   
2    10250      HANAR           4 2014-07-08 15:00:00   2014-08-05   

  shippeddate  shipvia freight                   shipname         shipaddress  \
0  2014-07-16        3  $32.38  Vins et alcools Chevalier  59 rue de l'Abbaye   
1  2014-07-10        1  $11.61          Toms Spezialitten       Luisenstr. 48   
2  2014-07-12        2  $65.83              Hanari Carnes      Rua do Pao, 67   

         shipcity shipregion shippostalcode shipcountry  
0           Reims       None          51100      France  
1          Mnster       None          44087     Germany  
2  Rio de Janeiro         RJ      05454-876      Brazil  


In [22]:
# Assume I just need the shipcountry in the orders table
sql_query = """
SELECT shipcountry, AVG(freight::decimal) AS avg_freight
FROM orders
GROUP BY shipcountry
ORDER BY avg_freight DESC
LIMIT 3;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

  shipcountry  avg_freight
0     Austria   184.787500
1     Ireland   145.012632
2         USA   112.879426


In [23]:
# Assume I just need the shipcountry in the orders table - can also use numeric
sql_query = """
SELECT shipcountry, AVG(freight::numeric) AS avg_freight
FROM orders
GROUP BY shipcountry
ORDER BY avg_freight DESC
LIMIT 3;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

  shipcountry  avg_freight
0     Austria   184.787500
1     Ireland   145.012632
2         USA   112.879426


## Medium problem 7

Return the same result as previous problem, but for only 2015. 

In [24]:
sql_query = """
SELECT *
FROM orders
LIMIT 3
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

   orderid customerid  employeeid           orderdate requireddate  \
0    10248      VINET           5 2014-07-04 08:00:00   2014-08-01   
1    10249      TOMSP           6 2014-07-05 04:00:00   2014-08-16   
2    10250      HANAR           4 2014-07-08 15:00:00   2014-08-05   

  shippeddate  shipvia freight                   shipname         shipaddress  \
0  2014-07-16        3  $32.38  Vins et alcools Chevalier  59 rue de l'Abbaye   
1  2014-07-10        1  $11.61          Toms Spezialitten       Luisenstr. 48   
2  2014-07-12        2  $65.83              Hanari Carnes      Rua do Pao, 67   

         shipcity shipregion shippostalcode shipcountry  
0           Reims       None          51100      France  
1          Mnster       None          44087     Germany  
2  Rio de Janeiro         RJ      05454-876      Brazil  


In [27]:
# Assume I just need the shipcountry in the orders table - can also use numeric
# Double-quote the alias to keep it case sensitive (unquoted names are folded to lowercase)

sql_query = """
SELECT shipcountry, AVG(freight::decimal) AS "AvgFreight"
FROM orders
WHERE shippeddate BETWEEN '2015-01-01' AND '2015-12-31'
GROUP BY shipcountry
ORDER BY "AvgFreight" DESC
LIMIT 3;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

   shipcountry  AvgFreight
0      Austria    186.5130
1  Switzerland    117.1775
2       Sweden    105.1600
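
One caveat with BETWEEN on dates: if the filter column were a timestamp (such as `orderdate`) rather than a date, `'2015-12-31'` would be read as midnight and anything later that day would be excluded. A half-open range avoids this. The sketch below assumes an in-memory SQLite database, which compares ISO-formatted strings as text and so behaves analogously here; the `orders_demo` table is hypothetical.

```python
import sqlite3

# Hypothetical toy data with timestamps stored as ISO text.
con_demo = sqlite3.connect(":memory:")
con_demo.executescript("""
CREATE TABLE orders_demo (orderid INTEGER, orderdate TEXT);
INSERT INTO orders_demo VALUES
    (1, '2015-01-01 00:00:00'),
    (2, '2015-12-31 08:00:00'),
    (3, '2016-01-01 00:00:00');
""")

# BETWEEN stops at the start of Dec 31, so order 2 is missed.
between_ids = [r[0] for r in con_demo.execute("""
SELECT orderid FROM orders_demo
WHERE orderdate BETWEEN '2015-01-01' AND '2015-12-31'
ORDER BY orderid;
""")]

# Half-open range: on or after Jan 1, strictly before the next Jan 1.
half_open_ids = [r[0] for r in con_demo.execute("""
SELECT orderid FROM orders_demo
WHERE orderdate >= '2015-01-01' AND orderdate < '2016-01-01'
ORDER BY orderid;
""")]

print(between_ids)    # [1]
print(half_open_ids)  # [1, 2]
```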


## Medium problem 8

Return the same result as previous problem, but for the last 12 months only

In [32]:
# Testing dates


sql_query = """
SELECT 
    CURRENT_DATE,
    CURRENT_DATE - interval '12 months' AS "Date_1_year_ago";
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

  current_date Date_1_year_ago
0   2020-03-12      2019-03-12


In [33]:
# Assume I just need the shipcountry in the orders table - can also use numeric
# Using date interval


sql_query = """
SELECT shipcountry, AVG(freight::decimal) AS "AvgFreight"
FROM orders
WHERE shippeddate BETWEEN (CURRENT_DATE - interval '1 year') AND CURRENT_DATE
GROUP BY shipcountry
ORDER BY "AvgFreight" DESC
LIMIT 3;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

Empty DataFrame
Columns: [shipcountry, AvgFreight]
Index: []


In [38]:
# The table is probably old so take the max date

sql_query = """
SELECT shipcountry, AVG(freight::decimal) AS "AvgFreight"
FROM orders
WHERE shippeddate BETWEEN
    (SELECT MAX(shippeddate) - interval '1 year' FROM orders)
    AND 
    (SELECT MAX(shippeddate) FROM orders)
GROUP BY shipcountry
ORDER BY "AvgFreight" DESC
LIMIT 3;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

  shipcountry  AvgFreight
0     Austria  213.055833
1     Ireland  200.210000
2         USA  121.691071


In [37]:
sql_query = """
SELECT MAX(shippeddate)
FROM orders;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

         max
0 2016-05-06


## Medium problem 9

Show OrderId, Quantity, ProductName, EmployeeID and LastName for each order in Orders table. Sort by OrderId and ProductID (2155 rows). Yes, you have to JOIN 4 tables and select only these columns.

In [45]:
sql_query = """
SELECT * FROM orders
LIMIT 5;
"""
pd.read_sql_query(sql_query,con)

Unnamed: 0,orderid,customerid,employeeid,orderdate,requireddate,shippeddate,shipvia,freight,shipname,shipaddress,shipcity,shipregion,shippostalcode,shipcountry
0,10248,VINET,5,2014-07-04 08:00:00,2014-08-01,2014-07-16,3,$32.38,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,,51100,France
1,10249,TOMSP,6,2014-07-05 04:00:00,2014-08-16,2014-07-10,1,$11.61,Toms Spezialitten,Luisenstr. 48,Mnster,,44087,Germany
2,10250,HANAR,4,2014-07-08 15:00:00,2014-08-05,2014-07-12,2,$65.83,Hanari Carnes,"Rua do Pao, 67",Rio de Janeiro,RJ,05454-876,Brazil
3,10251,VICTE,3,2014-07-08 14:00:00,2014-08-05,2014-07-15,1,$41.34,Victuailles en stock,"2, rue du Commerce",Lyon,,69004,France
4,10252,SUPRD,4,2014-07-09 01:00:00,2014-08-06,2014-07-11,2,$51.30,Suprmes dlices,"Boulevard Tirou, 255",Charleroi,,B-6000,Belgium


In [48]:
sql_query = """
SELECT * FROM orderdetails
LIMIT 5;
"""
pd.read_sql_query(sql_query,con)

Unnamed: 0,orderid,productid,unitprice,quantity,discount
0,10248,11,$14.00,12,0.0
1,10248,42,$9.80,10,0.0
2,10248,72,$34.80,5,0.0
3,10249,14,$18.60,9,0.0
4,10249,51,$42.40,40,0.0


In [46]:
sql_query = """
SELECT * FROM customers
LIMIT 5;
"""
pd.read_sql_query(sql_query,con)

Unnamed: 0,customerid,companyname,contactname,contacttitle,address,city,region,postalcode,country,phone,fax
0,ALFKI,Alfreds Futterkiste,Maria Anders,Sales Representative,Obere Str. 57,Berlin,,12209,Germany,030-0074321,030-0076545
1,ANATR,Ana Trujillo Emparedados y helados,Ana Trujillo,Owner,Avda. de la Constitucin 2222,Mxico D.F.,,05021,Mexico,(5) 555-4729,(5) 555-3745
2,ANTON,Antonio Moreno Taquera,Antonio Moreno,Owner,Mataderos 2312,Mxico D.F.,,05023,Mexico,(5) 555-3932,
3,AROUT,Around the Horn,Thomas Hardy,Sales Representative,120 Hanover Sq.,London,,WA1 1DP,UK,(171) 555-7788,(171) 555-6750
4,BERGS,Berglunds snabbkp,Christina Berglund,orders Administrator,Berguvsvgen 8,Lule,,S-958 22,Sweden,0921-12 34 65,0921-12 34 67


In [47]:
sql_query = """
SELECT * FROM products
LIMIT 5;
"""
pd.read_sql_query(sql_query,con)

Unnamed: 0,productid,productname,supplierid,categoryid,quantityperunit,unitprice,unitsinstock,unitsonorder,reorderlevel,discontinued
0,1,Chai,1,1,10 boxes x 20 bags,$18.00,39,0,10,False
1,2,Chang,1,1,24 - 12 oz bottles,$19.00,17,40,25,False
2,3,Aniseed Syrup,1,2,12 - 550 ml bottles,$10.00,13,70,25,False
3,4,Chef Anton's Cajun Seasoning,2,2,48 - 6 oz jars,$22.00,53,0,0,False
4,5,Chef Anton's Gumbo Mix,2,2,36 boxes,$21.35,0,0,0,True


In [49]:
sql_query = """
SELECT * FROM employees
LIMIT 5;
"""
pd.read_sql_query(sql_query,con)

Unnamed: 0,employeeid,lastname,firstname,title,titleofcourtesy,birthdate,hiredate,address,city,region,postalcode,country,homephone,extension,notes,reportsto,photopath
0,1,Davolio,Nancy,Sales Representative,Ms.,1966-12-08,2010-05-01,507 - 20th Ave. E. Apt.57,New York,NY,10027,USA,(206) 555-9857,5467,Education includes a BA in psychology from Col...,2.0,http://accweb/emmployees/davolio.bmp
1,2,Fuller,Andrew,"Vice President, Sales",Dr.,1970-02-19,2010-08-14,908 W. Capital Way,Tacoma,WA,98401,USA,(206) 555-9482,3457,Andrew received his BTS commercial in 1974 and...,,http://accweb/emmployees/fuller.bmp
2,3,Leverling,Janet,Sales Representative,Ms.,1981-08-30,2010-04-01,722 Moss Bay Blvd.,Kirkland,WA,98033,USA,(206) 555-3412,3355,Janet has a BS degree in chemistry from Boston...,2.0,http://accweb/emmployees/leverling.bmp
3,4,Peacock,Margaret,Sales Representative,Mrs.,1955-09-19,2011-05-03,4110 Old Redmond Rd.,Redmond,WA,98052,USA,(206) 555-8122,5176,Margaret holds a BA in English literature from...,2.0,http://accweb/emmployees/peacock.bmp
4,5,Buchanan,Steven,Sales Manager,Mr.,1973-03-04,2011-10-17,14 Garrett Hill,London,,SW1 8JR,UK,(71) 555-4848,3453,Steven Buchanan graduated from St. Andrews Uni...,2.0,http://accweb/emmployees/buchanan.bmp


In [57]:
# OrderId, Quantity, ProductName, EmployeeID and LastName

# Let's say you have 3 tables. Can you join on one key from one table, then on another key from the second table?
# Yes - see joining on orderdetails

sql_query = """
SELECT
    orders.orderid,
    orderdetails.quantity,
    products.productname,
    employees.employeeid,
    employees.lastname
FROM orders
JOIN orderdetails
    ON 
    orders.orderid=orderdetails.orderid
JOIN products
    ON orderdetails.productid=products.productid
JOIN employees
    ON orders.employeeid=employees.employeeid
ORDER BY orders.orderid, products.productid;
"""
pd.read_sql_query(sql_query,con)

Unnamed: 0,orderid,quantity,productname,employeeid,lastname
0,10248,12,Queso Cabrales,5,Buchanan
1,10248,10,Singaporean Hokkien Fried Mee,5,Buchanan
2,10248,5,Mozzarella di Giovanni,5,Buchanan
3,10249,9,Tofu,6,Suyama
4,10249,40,Manjimup Dried Apples,6,Suyama
...,...,...,...,...,...
2150,11077,2,Wimmers gute Semmelkndel,1,Davolio
2151,11077,1,Louisiana Hot Spiced Okra,1,Davolio
2152,11077,2,Rd Kaviar,1,Davolio
2153,11077,4,Rhnbru Klosterbier,1,Davolio


In [62]:
# Same but using table aliases for easier typing

sql_query = """
SELECT
    o.orderid,
    od.quantity,
    p.productname,
    e.employeeid,
    e.lastname
FROM orders o
JOIN orderdetails od
    ON o.orderid=od.orderid
JOIN products p
    ON od.productid=p.productid
JOIN employees e
    ON o.employeeid=e.employeeid
ORDER BY o.orderid, p.productid;
"""
pd.read_sql_query(sql_query,con)

Unnamed: 0,orderid,quantity,productname,employeeid,lastname
0,10248,12,Queso Cabrales,5,Buchanan
1,10248,10,Singaporean Hokkien Fried Mee,5,Buchanan
2,10248,5,Mozzarella di Giovanni,5,Buchanan
3,10249,9,Tofu,6,Suyama
4,10249,40,Manjimup Dried Apples,6,Suyama
...,...,...,...,...,...
2150,11077,2,Wimmers gute Semmelkndel,1,Davolio
2151,11077,1,Louisiana Hot Spiced Okra,1,Davolio
2152,11077,2,Rd Kaviar,1,Davolio
2153,11077,4,Rhnbru Klosterbier,1,Davolio


## Medium problem 10
Show all customers who have never placed an order (2 rows).

In [210]:
# Look at customer table
sql_query = """
SELECT
*
FROM customers
LIMIT 3;
"""

pd.read_sql_query(sql_query,con)

Unnamed: 0,customerid,companyname,contactname,contacttitle,address,city,region,postalcode,country,phone,fax
0,ALFKI,Alfreds Futterkiste,Maria Anders,Sales Representative,Obere Str. 57,Berlin,,12209,Germany,030-0074321,030-0076545
1,ANATR,Ana Trujillo Emparedados y helados,Ana Trujillo,Owner,Avda. de la Constitucin 2222,Mxico D.F.,,5021,Mexico,(5) 555-4729,(5) 555-3745
2,ANTON,Antonio Moreno Taquera,Antonio Moreno,Owner,Mataderos 2312,Mxico D.F.,,5023,Mexico,(5) 555-3932,


In [212]:
# Look at orders table
sql_query = """
SELECT
*
FROM orders
LIMIT 3;
"""

pd.read_sql_query(sql_query,con)

Unnamed: 0,orderid,customerid,employeeid,orderdate,requireddate,shippeddate,shipvia,freight,shipname,shipaddress,shipcity,shipregion,shippostalcode,shipcountry
0,10248,VINET,5,2014-07-04 08:00:00,2014-08-01,2014-07-16,3,$32.38,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,,51100,France
1,10249,TOMSP,6,2014-07-05 04:00:00,2014-08-16,2014-07-10,1,$11.61,Toms Spezialitten,Luisenstr. 48,Mnster,,44087,Germany
2,10250,HANAR,4,2014-07-08 15:00:00,2014-08-05,2014-07-12,2,$65.83,Hanari Carnes,"Rua do Pao, 67",Rio de Janeiro,RJ,05454-876,Brazil


In [218]:
# Find a column where you can get Null values
sql_query = """
SELECT customers.customerid, orders.orderdate
FROM customers
LEFT JOIN orders
ON customers.customerid=orders.customerid
WHERE orders.orderdate IS NULL;
"""

pd.read_sql_query(sql_query,con)

Unnamed: 0,customerid,orderdate
0,PARIS,
1,FISSA,


In [220]:
# Finalize query
sql_query = """
SELECT customers.customerid
FROM customers
LEFT JOIN orders
ON customers.customerid=orders.customerid
WHERE orders.orderdate IS NULL;
"""

pd.read_sql_query(sql_query,con)

Unnamed: 0,customerid
0,PARIS
1,FISSA
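
The LEFT JOIN ... IS NULL pattern used above is an anti-join. Here is a minimal sketch of the same idea on a toy in-memory SQLite database, next to the equivalent NOT EXISTS form. The table contents are made up for illustration and are not the real ecommerce data; the connection is named `toy` so it does not overwrite `con`.

```python
import sqlite3

# Toy data, invented for illustration only
toy = sqlite3.connect(":memory:")
toy.executescript("""
CREATE TABLE customers (customerid TEXT);
CREATE TABLE orders (orderid INTEGER, customerid TEXT);
INSERT INTO customers VALUES ('A'), ('B'), ('C');
INSERT INTO orders VALUES (1, 'A'), (2, 'A'), (3, 'C');
""")

# Anti-join via LEFT JOIN: unmatched customers get NULLs on the orders side
left_join_rows = toy.execute("""
SELECT c.customerid
FROM customers c
LEFT JOIN orders o ON c.customerid = o.customerid
WHERE o.orderid IS NULL
ORDER BY c.customerid;
""").fetchall()

# Same result with NOT EXISTS
not_exists_rows = toy.execute("""
SELECT c.customerid
FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customerid = c.customerid)
ORDER BY c.customerid;
""").fetchall()

print(left_join_rows, not_exists_rows)
```

Both forms return only customer 'B' here, the one with no matching order.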


## Medium problem 11
Show all customers who have never placed an order with *Employee 4* (16 rows).

In [239]:
# Look at orders table
sql_query = """
SELECT *
FROM orders
LIMIT 5;
"""

pd.read_sql_query(sql_query,con)

Unnamed: 0,orderid,customerid,employeeid,orderdate,requireddate,shippeddate,shipvia,freight,shipname,shipaddress,shipcity,shipregion,shippostalcode,shipcountry
0,10248,VINET,5,2014-07-04 08:00:00,2014-08-01,2014-07-16,3,$32.38,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,,51100,France
1,10249,TOMSP,6,2014-07-05 04:00:00,2014-08-16,2014-07-10,1,$11.61,Toms Spezialitten,Luisenstr. 48,Mnster,,44087,Germany
2,10250,HANAR,4,2014-07-08 15:00:00,2014-08-05,2014-07-12,2,$65.83,Hanari Carnes,"Rua do Pao, 67",Rio de Janeiro,RJ,05454-876,Brazil
3,10251,VICTE,3,2014-07-08 14:00:00,2014-08-05,2014-07-15,1,$41.34,Victuailles en stock,"2, rue du Commerce",Lyon,,69004,France
4,10252,SUPRD,4,2014-07-09 01:00:00,2014-08-06,2014-07-11,2,$51.30,Suprmes dlices,"Boulevard Tirou, 255",Charleroi,,B-6000,Belgium


In [237]:
# Look at employees table - don't need this table
# sql_query = """
# SELECT *
# FROM employees
# LIMIT 2;
# """

# pd.read_sql_query(sql_query,con)

In [241]:
# Show all the orders with Employee 4

# Look at orders table, get a list of customer id
sql_query = """
SELECT *
FROM orders
WHERE employeeid=4
;
"""

pd.read_sql_query(sql_query,con)

Unnamed: 0,orderid,customerid,employeeid,orderdate,requireddate,shippeddate,shipvia,freight,shipname,shipaddress,shipcity,shipregion,shippostalcode,shipcountry
0,10250,HANAR,4,2014-07-08 15:00:00,2014-08-05,2014-07-12,2,$65.83,Hanari Carnes,"Rua do Pao, 67",Rio de Janeiro,RJ,05454-876,Brazil
1,10252,SUPRD,4,2014-07-09 01:00:00,2014-08-06,2014-07-11,2,$51.30,Suprmes dlices,"Boulevard Tirou, 255",Charleroi,,B-6000,Belgium
2,10257,HILAA,4,2014-07-16 15:00:00,2014-08-13,2014-07-22,3,$81.91,HILARION-Abastos,Carrera 22 con Ave. Carlos Soublette #8-35,San Cristbal,Tchira,5022,Venezuela
3,10259,CENTC,4,2014-07-18 16:00:00,2014-08-15,2014-07-25,3,$3.25,Centro comercial Moctezuma,Sierras de Granada 9993,Mxico D.F.,,05022,Mexico
4,10260,OTTIK,4,2014-07-19 09:00:00,2014-08-16,2014-07-29,1,$55.09,Ottilies Kseladen,Mehrheimerstr. 369,Kln,,50739,Germany
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
151,11044,WOLZA,4,2016-04-23 19:00:00,2016-05-21,2016-05-01,1,$8.72,Wolski Zajazd,ul. Filtrowa 68,Warszawa,,01-012,Poland
152,11061,GREAL,4,2016-04-30 00:00:00,2016-06-11,NaT,3,$14.01,Great Lakes Food Market,2732 Baker Blvd.,Eugene,OR,97403,USA
153,11062,REGGC,4,2016-04-30 00:00:00,2016-05-28,NaT,2,$29.93,Reggiani Caseifici,Strada Provinciale 124,Reggio Emilia,,42100,Italy
154,11072,ERNSH,4,2016-05-05 15:00:00,2016-06-02,NaT,2,$258.64,Ernst Handel,Kirchgasse 6,Graz,,8010,Austria


In [243]:
# Show all the orders with Employee 4

# Look at orders table, get a list of customer id
sql_query = """
SELECT DISTINCT(customerid)
FROM orders
WHERE employeeid=4
;
"""

pd.read_sql_query(sql_query,con)

Unnamed: 0,customerid
0,TOMSP
1,LONEP
2,OLDWO
3,WARTH
4,MAGAA
...,...
70,LILAS
71,RICSU
72,WHITC
73,TRAIH


In [249]:
# Use the distinct list from above in the WHERE

sql_query = """
SELECT DISTINCT(customerid)
FROM orders
WHERE customerid NOT IN (SELECT DISTINCT(customerid)
                         FROM orders
                         WHERE employeeid=4)
;
"""

pd.read_sql_query(sql_query,con)

Unnamed: 0,customerid
0,VINET
1,DUMON
2,PRINI
3,PERIC
4,CONSH
5,LAZYK
6,THEBI
7,GROSR
8,FRANR
9,LAUGB


In [250]:
# Above query is almost there, but it misses customers who have never placed an order (problem 10 above)

# Query for those customers by joining the customers and orders tables (same as problem 10)
sql_query = """
SELECT customers.customerid
FROM customers
LEFT JOIN orders
ON customers.customerid=orders.customerid
WHERE orders.orderdate IS NULL;
"""

pd.read_sql_query(sql_query,con)

Unnamed: 0,customerid
0,PARIS
1,FISSA
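
The last query returns only the two customers with no orders at all, not the expected 16. One possible way to answer the question in a single query is to move the employee filter into the join condition, so that customers without an Employee 4 order survive the LEFT JOIN with NULLs. This is a sketch on made-up toy data in an in-memory SQLite database, not the real tables:

```python
import sqlite3

toy = sqlite3.connect(":memory:")
toy.executescript("""
CREATE TABLE customers (customerid TEXT);
CREATE TABLE orders (orderid INTEGER, customerid TEXT, employeeid INTEGER);
INSERT INTO customers VALUES ('A'), ('B'), ('C'), ('D');
-- A ordered with employee 4, B only with employee 5, C with both, D never ordered
INSERT INTO orders VALUES (1, 'A', 4), (2, 'B', 5), (3, 'C', 4), (4, 'C', 5);
""")

# The employee filter sits in ON, not WHERE, so unmatched customers are kept
never_with_emp4 = toy.execute("""
SELECT c.customerid
FROM customers c
LEFT JOIN orders o
    ON c.customerid = o.customerid
    AND o.employeeid = 4
WHERE o.orderid IS NULL
ORDER BY c.customerid;
""").fetchall()
print(never_with_emp4)
```

This sketch has not been run against the ecommerce database, so the expected 16 rows are unverified here.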


## Hard problem 1

Assuming that now is January 1, 2017, we want to find all high-value customers from 2016 who made at least one order with a total value of $10,000 or more, and give them VIP status. (6 rows)

In [99]:
sql_query = """
SELECT *
FROM orders
LIMIT 2;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

   orderid customerid  employeeid           orderdate requireddate  \
0    10248      VINET           5 2014-07-04 08:00:00   2014-08-01   
1    10249      TOMSP           6 2014-07-05 04:00:00   2014-08-16   

  shippeddate  shipvia freight                   shipname         shipaddress  \
0  2014-07-16        3  $32.38  Vins et alcools Chevalier  59 rue de l'Abbaye   
1  2014-07-10        1  $11.61          Toms Spezialitten       Luisenstr. 48   

  shipcity shipregion shippostalcode shipcountry  
0    Reims       None          51100      France  
1   Mnster       None          44087     Germany  


In [100]:
sql_query = """
SELECT *
FROM orderdetails
LIMIT 2;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

   orderid  productid unitprice  quantity  discount
0    10248         11    $14.00        12       0.0
1    10248         42     $9.80        10       0.0


In [103]:
sql_query = """
SELECT unitprice
FROM orderdetails
LIMIT 5;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

  ?column?
0   $42.00
1   $29.40
2  $104.40
3   $55.80
4  $127.20


In [118]:
sql_query = """
SELECT orders.customerid, orders.orderid, SUM((orderdetails.unitprice * orderdetails.quantity)) AS total_value
FROM orders
JOIN orderdetails
ON orders.orderid=orderdetails.orderid
WHERE orders.orderdate BETWEEN '2016-01-01' AND '2016-12-31'
GROUP BY orders.customerid, orders.orderid
HAVING SUM((orderdetails.unitprice * orderdetails.quantity)::numeric)  > 10000;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

  customerid  orderid total_value
0      HUNGO    10897  $10,835.24
1      RATTC    10889  $11,380.00
2      HANAR    10981  $15,810.00
3      QUICK    10865  $17,250.00
4      SAVEA    11030  $16,321.90
5      KOENE    10817  $11,490.70
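
One thing to watch: `orderdate` holds timestamps, and `BETWEEN '2016-01-01' AND '2016-12-31'` stops at midnight on December 31, so an order placed later that day would be missed. A half-open range avoids this. Here is a small sketch with made-up rows in an in-memory SQLite database, where timestamps are stored as ISO-formatted text and compare in the same order for this purpose:

```python
import sqlite3

toy = sqlite3.connect(":memory:")
toy.executescript("""
CREATE TABLE orders (orderid INTEGER, orderdate TEXT);
INSERT INTO orders VALUES
    (1, '2016-01-01 00:00:00'),
    (2, '2016-12-31 11:00:00'),
    (3, '2017-01-01 00:00:00');
""")

# BETWEEN with date-only bounds: the upper bound is midnight on Dec 31
between_ids = [r[0] for r in toy.execute("""
SELECT orderid FROM orders
WHERE orderdate BETWEEN '2016-01-01' AND '2016-12-31'
ORDER BY orderid;
""")]

# Half-open range: everything from Jan 1, 2016 up to (not including) Jan 1, 2017
half_open_ids = [r[0] for r in toy.execute("""
SELECT orderid FROM orders
WHERE orderdate >= '2016-01-01' AND orderdate < '2017-01-01'
ORDER BY orderid;
""")]

print(between_ids)    # order 2 is missing
print(half_open_ids)  # orders 1 and 2
```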


## Hard problem 2

What if the salesperson changes her mind and, instead of one order of at least $10,000, asks about customers whose orders total at least $15,000 during 2016?

In [119]:
sql_query = """
SELECT orders.customerid, orders.orderid, SUM((orderdetails.unitprice * orderdetails.quantity)) AS total_value
FROM orders
JOIN orderdetails
ON orders.orderid=orderdetails.orderid
WHERE orders.orderdate BETWEEN '2016-01-01' AND '2016-12-31'
GROUP BY orders.customerid, orders.orderid
HAVING SUM((orderdetails.unitprice * orderdetails.quantity)::numeric)  > 15000;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

  customerid  orderid total_value
0      HANAR    10981  $15,810.00
1      QUICK    10865  $17,250.00
2      SAVEA    11030  $16,321.90
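
The query above still groups by order, so it finds single orders above 15,000. If the question is read as 15,000 in total across all of a customer's 2016 orders, grouping by customer alone is one way to do it. This is a minimal sketch on invented toy data in SQLite, using plain numbers instead of the `money` type, so no `::numeric` cast is needed here:

```python
import sqlite3

toy = sqlite3.connect(":memory:")
toy.executescript("""
CREATE TABLE orders (orderid INTEGER, customerid TEXT, orderdate TEXT);
CREATE TABLE orderdetails (orderid INTEGER, unitprice REAL, quantity INTEGER);
INSERT INTO orders VALUES
    (1, 'A', '2016-02-01 00:00:00'),
    (2, 'A', '2016-03-01 00:00:00'),
    (3, 'B', '2016-04-01 00:00:00');
INSERT INTO orderdetails VALUES
    (1, 100.0, 80),   -- 8,000
    (2, 100.0, 90),   -- 9,000
    (3, 100.0, 120);  -- 12,000
""")

# Group by customer only, so the sum runs over all of that customer's orders
big_customers = toy.execute("""
SELECT o.customerid, SUM(od.unitprice * od.quantity) AS total_value
FROM orders o
JOIN orderdetails od ON o.orderid = od.orderid
WHERE o.orderdate >= '2016-01-01' AND o.orderdate < '2017-01-01'
GROUP BY o.customerid
HAVING SUM(od.unitprice * od.quantity) >= 15000
ORDER BY total_value DESC;
""").fetchall()
print(big_customers)
```

Customer 'A' qualifies with two orders of 8,000 and 9,000, even though neither order reaches 15,000 on its own.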


## Hard problem 3
    
Change the above query to include Discount in the calculation. Order it by the total amount, which includes the discount.

In [None]:
# Untested sketch: apply the discount to each line item, then order by the discounted total

sql_query = """
SELECT orders.customerid, orders.orderid,
    SUM(orderdetails.unitprice::numeric * orderdetails.quantity * (1 - orderdetails.discount)) AS total_with_discount
FROM orders
JOIN orderdetails
ON orders.orderid=orderdetails.orderid
WHERE orders.orderdate BETWEEN '2016-01-01' AND '2016-12-31'
GROUP BY orders.customerid, orders.orderid
HAVING SUM(orderdetails.unitprice::numeric * orderdetails.quantity * (1 - orderdetails.discount)) > 15000
ORDER BY total_with_discount DESC;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

## Hard problem 4

At the end of the month salespeople are trying harder to get orders. Select all orders which were placed on the last day of the month, ordered by EmployeeID and OrderID. (24 rows)

In [12]:
# View whole table
sql_query = """
SELECT *
FROM orders
LIMIT 5;
"""

pd.read_sql_query(sql_query,con)    

Unnamed: 0,orderid,customerid,employeeid,orderdate,requireddate,shippeddate,shipvia,freight,shipname,shipaddress,shipcity,shipregion,shippostalcode,shipcountry
0,10248,VINET,5,2014-07-04 08:00:00,2014-08-01,2014-07-16,3,$32.38,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,,51100,France
1,10249,TOMSP,6,2014-07-05 04:00:00,2014-08-16,2014-07-10,1,$11.61,Toms Spezialitten,Luisenstr. 48,Mnster,,44087,Germany
2,10250,HANAR,4,2014-07-08 15:00:00,2014-08-05,2014-07-12,2,$65.83,Hanari Carnes,"Rua do Pao, 67",Rio de Janeiro,RJ,05454-876,Brazil
3,10251,VICTE,3,2014-07-08 14:00:00,2014-08-05,2014-07-15,1,$41.34,Victuailles en stock,"2, rue du Commerce",Lyon,,69004,France
4,10252,SUPRD,4,2014-07-09 01:00:00,2014-08-06,2014-07-11,2,$51.30,Suprmes dlices,"Boulevard Tirou, 255",Charleroi,,B-6000,Belgium


In [21]:
# Test day extraction
sql_query = """
SELECT orderdate, EXTRACT(DAY FROM orderdate)
FROM orders
LIMIT 5;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

            orderdate  date_part
0 2014-07-04 08:00:00        4.0
1 2014-07-05 04:00:00        5.0
2 2014-07-08 15:00:00        8.0
3 2014-07-08 14:00:00        8.0
4 2014-07-09 01:00:00        9.0


In [22]:
# Test month extraction
sql_query = """
SELECT orderdate, EXTRACT(MONTH FROM orderdate)
FROM orders
LIMIT 5;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

            orderdate  date_part
0 2014-07-04 08:00:00        7.0
1 2014-07-05 04:00:00        7.0
2 2014-07-08 15:00:00        7.0
3 2014-07-08 14:00:00        7.0
4 2014-07-09 01:00:00        7.0


In [27]:
# Subtracting a date interval (adding works the same way)
# https://www.postgresql.org/docs/9.0/functions-datetime.html
sql_query = """
SELECT orderdate, orderdate - interval '1 day'
FROM orders
LIMIT 5;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

            orderdate            ?column?
0 2014-07-04 08:00:00 2014-07-03 08:00:00
1 2014-07-05 04:00:00 2014-07-04 04:00:00
2 2014-07-08 15:00:00 2014-07-07 15:00:00
3 2014-07-08 14:00:00 2014-07-07 14:00:00
4 2014-07-09 01:00:00 2014-07-08 01:00:00


In [33]:
# Final query
sql_query = """
SELECT orderid, orderdate, employeeid
FROM orders
WHERE EXTRACT(MONTH FROM orderdate) <> EXTRACT(MONTH FROM orderdate + INTERVAL '1 day')
ORDER BY employeeid, orderid;
"""

df_query = pd.read_sql_query(sql_query,con)    
print(df_query)

    orderid           orderdate  employeeid
0     10461 2015-02-28 00:00:00           1
1     10616 2015-07-31 00:00:00           1
2     10583 2015-06-30 00:00:00           2
3     10686 2015-09-30 00:00:00           2
4     10989 2016-03-31 00:00:00           2
5     11060 2016-04-30 00:00:00           2
6     10432 2015-01-31 00:00:00           3
7     10806 2015-12-31 11:00:00           3
8     10988 2016-03-31 00:00:00           3
9     11063 2016-04-30 00:00:00           3
10    10343 2014-10-31 00:00:00           4
11    10522 2015-04-30 00:00:00           4
12    10584 2015-06-30 00:00:00           4
13    10617 2015-07-31 00:00:00           4
14    10725 2015-10-31 00:00:00           4
15    10807 2015-12-31 11:00:00           4
16    11061 2016-04-30 00:00:00           4
17    11062 2016-04-30 00:00:00           4
18    10269 2014-07-31 00:00:00           5
19    10317 2014-09-30 00:00:00           6
20    10490 2015-03-31 00:00:00           7
21    10399 2014-12-31 00:00:00 
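
The WHERE clause works because adding one day to the last day of a month rolls over into the next month. The same check in plain Python, as a sketch (the helper name `is_last_day_of_month` is made up for illustration):

```python
from datetime import datetime, timedelta

def is_last_day_of_month(ts):
    """True when adding one day moves ts into a different month."""
    return ts.month != (ts + timedelta(days=1)).month

print(is_last_day_of_month(datetime(2015, 2, 28)))       # True (2015 is not a leap year)
print(is_last_day_of_month(datetime(2016, 2, 28)))       # False (2016 is a leap year)
print(is_last_day_of_month(datetime(2015, 12, 31, 11)))  # True
```

Comparing months this way handles month lengths and leap years without a lookup table.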

## Hard problem 5

We want to identify the size of the table for our website. Show the top 10 orders with maximum items. Ordered by number of items.

I'm not entirely sure what the question is asking for, but I will get the number of items for each orderid and take the top 10.

In [8]:
# Look at orders table
sql_query = """
SELECT *
FROM orders
LIMIT 3;
"""
pd.read_sql_query(sql_query,con)

Unnamed: 0,orderid,customerid,employeeid,orderdate,requireddate,shippeddate,shipvia,freight,shipname,shipaddress,shipcity,shipregion,shippostalcode,shipcountry
0,10248,VINET,5,2014-07-04 08:00:00,2014-08-01,2014-07-16,3,$32.38,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,,51100,France
1,10249,TOMSP,6,2014-07-05 04:00:00,2014-08-16,2014-07-10,1,$11.61,Toms Spezialitten,Luisenstr. 48,Mnster,,44087,Germany
2,10250,HANAR,4,2014-07-08 15:00:00,2014-08-05,2014-07-12,2,$65.83,Hanari Carnes,"Rua do Pao, 67",Rio de Janeiro,RJ,05454-876,Brazil


In [9]:
# View order details table
sql_query = """
SELECT *
FROM orderdetails
LIMIT 3;
"""
pd.read_sql_query(sql_query,con)    

Unnamed: 0,orderid,productid,unitprice,quantity,discount
0,10248,11,$14.00,12,0.0
1,10248,42,$9.80,10,0.0
2,10248,72,$34.80,5,0.0


In [15]:
# Total quantity per order, top 10
sql_query = """
SELECT orderid, SUM(quantity) AS total_qty_order
FROM orderdetails
GROUP BY orderid
ORDER BY total_qty_order DESC
LIMIT 10;
"""
pd.read_sql_query(sql_query,con)    

Unnamed: 0,orderid,total_qty_order
0,10895,346
1,11030,330
2,10847,288
3,10515,286
4,10678,280
5,10612,263
6,10990,256
7,10658,255
8,10263,250
9,10845,245


**After looking at the solution, they were seeking the number of line items (rows in orderdetails) per order**

In [16]:
# Count line items per order, top 10
sql_query = """
SELECT orderid, COUNT(*) AS num_of_orderids
FROM orderdetails
GROUP BY orderid
ORDER BY num_of_orderids DESC
LIMIT 10;
"""
pd.read_sql_query(sql_query,con)    

Unnamed: 0,orderid,num_of_orderids
0,11077,25
1,10979,6
2,10657,6
3,10847,6
4,10360,5
5,10893,5
6,10553,5
7,10294,5
8,10514,5
9,11064,5


## Hard question 6

Select 2% of the OrderDetails table. (41 rows). Use statement  random() < 0.02

In [19]:
# Preview orderdetails table
sql_query = """
SELECT *
FROM orderdetails
LIMIT 3;
"""
pd.read_sql_query(sql_query,con)    

Unnamed: 0,orderid,productid,unitprice,quantity,discount
0,10248,11,$14.00,12,0.0
1,10248,42,$9.80,10,0.0
2,10248,72,$34.80,5,0.0


In [18]:
# Understand random - generates a number between 0 and 1
# https://www.postgresqltutorial.com/postgresql-random-range/

sql_query = """
SELECT RANDOM()
FROM orderdetails
LIMIT 3;
"""
pd.read_sql_query(sql_query,con)    

Unnamed: 0,random
0,0.236654
1,0.277124
2,0.460387


In [21]:
# Enter an argument
sql_query = """
SELECT RANDOM(orderid)
FROM orderdetails
LIMIT 3;
"""
pd.read_sql_query(sql_query,con)    

DatabaseError: Execution failed on sql '
SELECT RANDOM(orderid)
FROM orderdetails
LIMIT 3;
': function random(integer) does not exist
LINE 2: SELECT RANDOM(orderid)
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.


In [23]:
# See output of random
sql_query = """
SELECT RANDOM() < 0.02
FROM orderdetails;
"""
pd.read_sql_query(sql_query,con)    

Unnamed: 0,?column?
0,False
1,False
2,False
3,False
4,False
...,...
2150,False
2151,False
2152,False
2153,False


In [33]:
# Take a random 2% of the rows with ORDER BY RANDOM() and LIMIT - I get 43 rows (not 41), since 2154 * 0.02 = 43.08
sql_query = """
SELECT *
FROM orderdetails
ORDER BY RANDOM()

--You could put FLOOR to be more explicit but this works as is
LIMIT (SELECT COUNT(*) FROM orderdetails)*0.02
;
"""
pd.read_sql_query(sql_query,con)    

Unnamed: 0,orderid,productid,unitprice,quantity,discount
0,10396,71,$17.20,60,0.0
1,10364,69,$28.80,30,0.0
2,10868,49,$20.00,42,0.1
3,10468,30,$20.70,8,0.0
4,11038,52,$7.00,2,0.0
5,10729,21,$10.00,30,0.0
6,11030,2,$19.00,100,0.25
7,10441,27,$35.10,50,0.0
8,10889,38,$263.50,40,0.0
9,10623,24,$4.50,3,0.0


In [27]:
# Check what limit does with decimals
sql_query = """
SELECT *
FROM orderdetails
ORDER BY RANDOM()
LIMIT 2.5
;
"""
pd.read_sql_query(sql_query,con)    

Unnamed: 0,orderid,productid,unitprice,quantity,discount
0,10746,69,$36.00,40,0.0
1,10445,54,$5.90,15,0.0
2,10558,47,$9.50,25,0.0


In [28]:
# Check what limit does with decimals - it rounds to nearest whole number
sql_query = """
SELECT *
FROM orderdetails
ORDER BY RANDOM()
LIMIT 2.4
;
"""
pd.read_sql_query(sql_query,con)    

Unnamed: 0,orderid,productid,unitprice,quantity,discount
0,10709,51,$53.00,28,0.0
1,10296,69,$28.80,15,0.0


In [29]:
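# Solution has a simpler way (but I get 40 rows; the count varies per run because each row is kept independently with probability 0.02)

# Filter rows with RANDOM() in the WHERE clause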
sql_query = """
SELECT *
FROM orderdetails
WHERE RANDOM() < 0.02
;
"""
pd.read_sql_query(sql_query,con)    

Unnamed: 0,orderid,productid,unitprice,quantity,discount
0,10248,11,$14.00,12,0.0
1,10287,16,$13.90,40,0.15
2,10318,41,$7.70,20,0.0
3,10333,14,$18.60,10,0.0
4,10367,77,$10.40,7,0.0
5,10369,29,$99.00,20,0.0
6,10394,62,$39.40,10,0.0
7,10403,16,$13.90,21,0.15
8,10436,75,$6.20,24,0.1
9,10458,26,$24.90,30,0.0
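
The two approaches differ in kind: `ORDER BY RANDOM() LIMIT n` returns exactly n rows, while `WHERE RANDOM() < 0.02` keeps each row independently with probability 0.02, so the row count changes on every run and only averages about 2%. A quick simulation in plain Python (the 2,154-row table size is taken from the outputs above; the seed is arbitrary):

```python
import random

n_rows = 2154            # rows in orderdetails, per the outputs above
rng = random.Random(0)   # arbitrary seed so the run is repeatable

# Fixed-size sample: pick round(2% of rows) at random, like ORDER BY RANDOM() LIMIT
fixed_size = len(rng.sample(range(n_rows), round(n_rows * 0.02)))

# Bernoulli sample: keep each row with probability 0.02, like WHERE RANDOM() < 0.02
bernoulli_sizes = [sum(rng.random() < 0.02 for _ in range(n_rows)) for _ in range(5)]

print(fixed_size)        # always 43
print(bernoulli_sizes)   # five counts that vary from run to run
```

This would explain why one run gave 40 rows while the problem statement mentions 41.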


## Hard question 7

One of the salespeople thinks that she accidentally entered a line item twice on an order, each time with different ProductId, but the same quantity.  She remembers that quantity was 60 or more. Show all OrderID that match this, ordered by OrderID. (5 rows)

In [35]:
sql_query = """
SELECT *
FROM orderdetails
LIMIT 5
"""
pd.read_sql_query(sql_query, con)

Unnamed: 0,orderid,productid,unitprice,quantity,discount
0,10248,11,$14.00,12,0.0
1,10248,42,$9.80,10,0.0
2,10248,72,$34.80,5,0.0
3,10249,14,$18.60,9,0.0
4,10249,51,$42.40,40,0.0


In [49]:
sql_query = """

SELECT DISTINCT(t1.orderid)
FROM orderdetails AS t1
JOIN orderdetails AS t2
    ON t1.orderid=t2.orderid
    AND t1.quantity=t2.quantity
    AND t1.productid!=t2.productid
    
WHERE t1.quantity >=60
AND t2.quantity >=60
ORDER BY t1.orderid ASC;    
"""
pd.read_sql_query(sql_query, con)

Unnamed: 0,orderid
0,10263
1,10658
2,10990
3,11030
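
A self-join is one way to find the suspect orders. Another is to group by order and quantity and keep the groups with more than one row. This is a sketch on made-up rows in an in-memory SQLite database, not the real orderdetails table:

```python
import sqlite3

toy = sqlite3.connect(":memory:")
toy.executescript("""
CREATE TABLE orderdetails (orderid INTEGER, productid INTEGER, quantity INTEGER);
INSERT INTO orderdetails VALUES
    (1, 10, 60), (1, 11, 60),   -- same quantity twice, 60 or more
    (2, 10, 70), (2, 11, 80),   -- different quantities
    (3, 10, 20), (3, 11, 20);   -- same quantity but below 60
""")

# A group with COUNT(*) > 1 means one order has the same quantity on several line items
suspect_orders = [r[0] for r in toy.execute("""
SELECT DISTINCT orderid
FROM orderdetails
WHERE quantity >= 60
GROUP BY orderid, quantity
HAVING COUNT(*) > 1
ORDER BY orderid;
""")]
print(suspect_orders)
```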


## Hard question 8

For all orders from the previous question, show OrderDetails.

In [59]:
# You can use t1.* for the table

sql_query = """
SELECT t1.*
FROM orderdetails AS t1
JOIN orderdetails AS t2
    ON t1.orderid=t2.orderid
    AND t1.quantity=t2.quantity
    AND t1.productid!=t2.productid
WHERE t1.quantity >=60
AND t2.quantity >=60
ORDER BY t1.orderid ASC;    
"""
pd.read_sql_query(sql_query, con)

Unnamed: 0,orderid,productid,unitprice,quantity,discount
0,10263,30,$20.70,60,0.25
1,10263,16,$13.90,60,0.25
2,10263,74,$8.00,65,0.25
3,10263,24,$3.60,65,0.0
4,10658,40,$18.40,70,0.05
5,10658,77,$13.00,70,0.05
6,10990,21,$10.00,65,0.0
7,10990,55,$24.00,65,0.15
8,11030,2,$19.00,100,0.25
9,11030,59,$55.00,100,0.25


## Hard question 9

Get the list of all orders that are late. Ordered by delay.

In [53]:
sql_query = """
SELECT *
FROM orders
LIMIT 2;    
"""
pd.read_sql_query(sql_query, con)

Unnamed: 0,orderid,customerid,employeeid,orderdate,requireddate,shippeddate,shipvia,freight,shipname,shipaddress,shipcity,shipregion,shippostalcode,shipcountry
0,10248,VINET,5,2014-07-04 08:00:00,2014-08-01,2014-07-16,3,$32.38,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,,51100,France
1,10249,TOMSP,6,2014-07-05 04:00:00,2014-08-16,2014-07-10,1,$11.61,Toms Spezialitten,Luisenstr. 48,Mnster,,44087,Germany


In [60]:
# Not entirely sure what late refers to but I'll look at required and shipped date

sql_query = """
SELECT *, 
    (shippeddate-requireddate) AS delay
FROM orders
WHERE shippeddate > requireddate
ORDER BY delay DESC
;    
"""
pd.read_sql_query(sql_query, con)

Unnamed: 0,orderid,customerid,employeeid,orderdate,requireddate,shippeddate,shipvia,freight,shipname,shipaddress,shipcity,shipregion,shippostalcode,shipcountry,delay
0,10777,GOURL,7,2015-12-15 08:00:00,2015-12-29,2016-01-21,2,$3.01,Gourmet Lanchonetes,"Av. Brasil, 442",Campinas,SP,04876-786,Brazil,23 days
1,10423,GOURL,6,2015-01-23 01:00:00,2015-02-06,2015-02-24,3,$24.50,Gourmet Lanchonetes,"Av. Brasil, 442",Campinas,SP,04876-786,Brazil,18 days
2,10726,EASTC,4,2015-11-03 05:00:00,2015-11-17,2015-12-05,1,$16.56,Eastern Connection,35 King George,London,,WX3 6FW,UK,18 days
3,10970,BOLID,9,2016-03-24 09:00:00,2016-04-07,2016-04-24,1,$16.16,Blido Comidas preparadas,"C/ Araquil, 67",Madrid,,28023,Spain,17 days
4,10515,QUICK,2,2015-04-23 17:00:00,2015-05-07,2015-05-23,1,$204.47,QUICK-Stop,Taucherstrae 10,Cunewalde,,01307,Germany,16 days
5,10827,BONAP,1,2016-01-12 19:00:00,2016-01-26,2016-02-06,2,$63.54,Bon app',"12, rue des Bouchers",Marseille,,13008,France,11 days
6,10663,BONAP,2,2015-09-10 12:00:00,2015-09-24,2015-10-03,2,$113.15,Bon app',"12, rue des Bouchers",Marseille,,13008,France,9 days
7,10660,HUNGC,8,2015-09-08 03:00:00,2015-10-06,2015-10-15,1,$111.29,Hungry Coyote Import Store,City Center Plaza 516 Main St.,Elgin,OR,97827,USA,9 days
8,10828,RANCH,9,2016-01-13 04:00:00,2016-01-27,2016-02-04,1,$90.85,Rancho grande,Av. del Libertador 900,Buenos Aires,,1010,Argentina,8 days
9,10451,QUICK,4,2015-02-19 08:00:00,2015-03-05,2015-03-12,3,$189.09,QUICK-Stop,Taucherstrae 10,Cunewalde,,01307,Germany,7 days


## Hard question 10

Which salespeople have the most late orders?

In [62]:
# count, group by

sql_query = """
SELECT employeeid, COUNT(*) AS num_of_delays
FROM orders
WHERE shippeddate > requireddate
GROUP BY employeeid
ORDER BY num_of_delays DESC
;    
"""
pd.read_sql_query(sql_query, con)

Unnamed: 0,employeeid,num_of_delays
0,4,10
1,3,5
2,7,4
3,9,4
4,2,4
5,8,4
6,6,3
7,1,2


**Wanted names so I edited query for the following**

In [65]:
# Preview employees table

sql_query = """
SELECT *
FROM employees
LIMIT 3
;    
"""
pd.read_sql_query(sql_query, con)

Unnamed: 0,employeeid,lastname,firstname,title,titleofcourtesy,birthdate,hiredate,address,city,region,postalcode,country,homephone,extension,notes,reportsto,photopath
0,1,Davolio,Nancy,Sales Representative,Ms.,1966-12-08,2010-05-01,507 - 20th Ave. E. Apt.57,New York,NY,10027,USA,(206) 555-9857,5467,Education includes a BA in psychology from Col...,2.0,http://accweb/emmployees/davolio.bmp
1,2,Fuller,Andrew,"Vice President, Sales",Dr.,1970-02-19,2010-08-14,908 W. Capital Way,Tacoma,WA,98401,USA,(206) 555-9482,3457,Andrew received his BTS commercial in 1974 and...,,http://accweb/emmployees/fuller.bmp
2,3,Leverling,Janet,Sales Representative,Ms.,1981-08-30,2010-04-01,722 Moss Bay Blvd.,Kirkland,WA,98033,USA,(206) 555-3412,3355,Janet has a BS degree in chemistry from Boston...,2.0,http://accweb/emmployees/leverling.bmp


In [67]:
# count, group by

sql_query = """
SELECT orders.employeeid, e.firstname, e.lastname, COUNT(*) AS num_of_delays
FROM orders
JOIN employees AS e
ON orders.employeeid=e.employeeid 
WHERE shippeddate > requireddate
GROUP BY orders.employeeid, e.firstname, e.lastname
ORDER BY num_of_delays DESC
;    
"""
pd.read_sql_query(sql_query, con)

Unnamed: 0,employeeid,firstname,lastname,num_of_delays
0,4,Margaret,Peacock,10
1,3,Janet,Leverling,5
2,8,Laura,Callahan,4
3,7,Robert,King,4
4,2,Andrew,Fuller,4
5,9,Anne,Dodsworth,4
6,6,Michael,Suyama,3
7,1,Nancy,Davolio,2


# Insight paired mock interview problems

## SQL question 1
Calculate the distance between every pair of points first, and then display the minimum one.

### Create a temporary table

In [159]:
# Create temporary tables only for the purpose of testing the queries
# The WITH line is creating the table

sql_query = """
WITH  point (x) AS (VALUES (-1), (0), (2))
SELECT * FROM point;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,x
0,-1
1,0
2,2


In [160]:
# Solution provided
sql_query = """
WITH  point (x) AS (VALUES (-1), (0), (2))

SELECT
    p1.x, p2.x, ABS(p1.x - p2.x) AS distance
FROM
    point p1
        JOIN
    point p2 ON p1.x != p2.x;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,x,x.1,distance
0,-1,0,1
1,-1,2,3
2,0,-1,1
3,0,2,2
4,2,-1,3
5,2,0,2


### If there was an id field

In [111]:
sql_query = """
WITH  point (id, x) AS (VALUES (1, -1), (2, 0), (3, 2))

SELECT * FROM point;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,x
0,1,-1
1,2,0
2,3,2


In [115]:
# Use the id to join on the next row and create a difference field

sql_query = """
WITH  point (id, x) AS (VALUES (1, -1), (2, 0), (3, 2))

SELECT p1.x, p2.x, (p2.x-p1.x) AS diff
FROM point AS p1
JOIN point AS p2
ON p1.id=p2.id-1;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,x,x.1,diff
0,-1,0,1
1,0,2,2


In [116]:
# Finalize query - select minimum difference

sql_query = """
WITH  point (id, x) AS (VALUES (1, -1), (2, 0), (3, 2))

SELECT MIN(p2.x-p1.x) AS shortest_distance
FROM point AS p1
JOIN point AS p2
ON p1.id=p2.id-1;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,shortest_distance
0,1
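
Joining on consecutive ids assumes the rows are already sorted by x and that the ids have no gaps. Without that assumption, the minimum can be taken directly over all pairs. A sketch using an in-memory SQLite database with the same three points:

```python
import sqlite3

toy = sqlite3.connect(":memory:")

# p1.x < p2.x visits each unordered pair of points exactly once
shortest = toy.execute("""
WITH point (x) AS (VALUES (-1), (0), (2))
SELECT MIN(ABS(p1.x - p2.x)) AS shortest_distance
FROM point p1
JOIN point p2 ON p1.x < p2.x;
""").fetchone()[0]
print(shortest)
```

This compares every pair, so it does more work than the id-based join but does not depend on row order.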


## SQL question 2

In social networks like Facebook or Twitter, people send friend requests and accept others' requests as well.

Table `request_accepted`:

| requester_id | accepter_id | accept_date |
|--------------|-------------|-------------|
| 1            | 2           | 2016-06-03  |
| 1            | 3           | 2016-06-08  |
| 2            | 3           | 2016-06-08  |
| 3            | 4           | 2016-06-09  |

This table holds the data of friend acceptance; requester_id and accepter_id are both the id of a person.

Write a query to find the person who has the most friends, along with that number of friends, under the following rules:
- It is guaranteed there is only 1 person having the most friends.
- A friend request can only be accepted once, which means there are no multiple records with the same requester_id and accepter_id value.

For the sample data above, the result is:

| id | num |
|----|-----|
| 3  | 3   |

The person with id '3' is a friend of people '1', '2' and '4', so he has 3 friends in total, which is more than any other person.


In [119]:
# Create temporary tables only for the purpose of testing the queries
# The WITH line is creating the request_accepted table

sql_query = """
WITH  request_accepted (requester_id, accepter_id) AS (VALUES (1,2), (1,3), (2,3), (3,4))

SELECT * FROM request_accepted;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,requester_id,accepter_id
0,1,2
1,1,3
2,2,3
3,3,4


### Strategy 1: Try solution's approach with UNION
- Create two tables, one for requester_id, one for accepter_id
- Then concat vertically with UNION (stacking rows, like an rbind) and select off that

In [155]:
# Just see how the UNION works

sql_query = """
WITH  request_accepted (requester_id, accepter_id) AS (VALUES (1,2), (1,3), (2,3), (3,4))

(SELECT requester_id, COUNT(*) 
FROM request_accepted
GROUP BY requester_id)
UNION
(SELECT accepter_id, COUNT(*) 
FROM request_accepted
GROUP BY accepter_id)
;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,requester_id,count
0,4,1
1,3,1
2,2,1
3,1,2
4,3,2


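NOTE that UNION doesn't maintain the order; the requester_id and accepter_id rows are all mixed together. UNION also removes duplicate rows: id 2 has a count of 1 in both subqueries, but appears only once above. UNION ALL would keep both rows.
Interestingly, the requester_id column name is kept, so the columns will be aliased to avoid confusion.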

In [129]:
# Turn the UNION into one big table and select off that

sql_query = """
WITH  request_accepted (requester_id, accepter_id) AS (VALUES (1,2), (1,3), (2,3), (3,4))

SELECT big_table.id, SUM(big_table.total) AS total_friends
FROM
    ((SELECT requester_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY requester_id)
    UNION ALL  -- ALL keeps duplicate (id, total) rows so both counts are summed
    (SELECT accepter_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY accepter_id)) AS big_table
GROUP BY big_table.id
ORDER BY total_friends DESC
LIMIT 1;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,total_friends
0,3,3.0
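
UNION removes duplicate rows, which matters here: id 2 has one row from each subquery with the same count, and deduplication collapses them into one. UNION ALL keeps both. Here is a sketch comparing the two in an in-memory SQLite database. The helper `friend_counts` is made up for illustration, and the parentheses around each subquery are dropped because SQLite does not accept them inside a compound SELECT.

```python
import sqlite3

toy = sqlite3.connect(":memory:")
cte = "WITH request_accepted (requester_id, accepter_id) AS (VALUES (1,2), (1,3), (2,3), (3,4))"

def friend_counts(set_op):
    """Total friends per id, combining both subqueries with UNION or UNION ALL."""
    return toy.execute(f"""
    {cte}
    SELECT id, SUM(total) AS total_friends
    FROM (SELECT requester_id AS id, COUNT(*) AS total
          FROM request_accepted GROUP BY requester_id
          {set_op}
          SELECT accepter_id AS id, COUNT(*) AS total
          FROM request_accepted GROUP BY accepter_id) AS big_table
    GROUP BY id
    ORDER BY id;
    """).fetchall()

print(friend_counts("UNION"))      # id 2 is undercounted
print(friend_counts("UNION ALL"))  # id 2 gets both of its friends
```

The top result (id 3 with 3 friends) happens to be the same either way for this sample, but other data could give a wrong winner with plain UNION.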


### Strategy 2: Try doing a JOIN between the two tables
- Create two tables, one for requester_id, one for accepter_id
- Then concat horizontally with OUTER JOIN (adding columns, like cbind), sum, and select off that

In [140]:
# See if the JOIN works

sql_query = """
WITH  request_accepted (requester_id, accepter_id) AS (VALUES (1,2), (1,3), (2,3), (3,4))

SELECT r_table.id, r_table.total
FROM
    (SELECT requester_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY requester_id) AS r_table
JOIN
    (SELECT accepter_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY accepter_id) AS a_table
ON r_table.id=a_table.id;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,total
0,2,1
1,3,1


The inner join drops ids 1 and 4, which each appear in only one of the two subqueries

In [156]:
# Try with an OUTER JOIN and do sum

sql_query = """
WITH  request_accepted (requester_id, accepter_id) AS (VALUES (1,2), (1,3), (2,3), (3,4))

SELECT id, r_table.total, a_table.total
FROM
    (SELECT requester_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY requester_id) AS r_table
FULL OUTER JOIN
    (SELECT accepter_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY accepter_id) AS a_table
ON r_table.id=a_table.id;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

DatabaseError: Execution failed on sql '
WITH  request_accepted (requester_id, accepter_id) AS (VALUES (1,2), (1,3), (2,3), (3,4))

SELECT id, r_table.total, a_table.total
FROM
    (SELECT requester_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY requester_id) AS r_table
FULL OUTER JOIN
    (SELECT accepter_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY accepter_id) AS a_table
ON r_table.id=a_table.id;
': column reference "id" is ambiguous
LINE 4: SELECT id, r_table.total, a_table.total
               ^


The id field is treated as ambiguous

In [149]:
# Need an OUTER JOIN and do sum

sql_query = """
WITH  request_accepted (requester_id, accepter_id) AS (VALUES (1,2), (1,3), (2,3), (3,4))

SELECT r_table.id, r_table.total, a_table.id, a_table.total
FROM
    (SELECT requester_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY requester_id) AS r_table
FULL OUTER JOIN
    (SELECT accepter_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY accepter_id) AS a_table
ON r_table.id=a_table.id;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,total,id.1,total.1
0,1.0,2.0,,
1,2.0,1.0,2.0,1.0
2,3.0,1.0,3.0,2.0
3,,,4.0,1.0


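The outer join approach does not work yet because the two id columns come back separately, each with NULLs where the other side has no match. Merging them with COALESCE(r_table.id, a_table.id), and wrapping each total in COALESCE(..., 0) before adding them, would be one way to finish it.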

## Coding question 1

In [184]:
def bubble_sort_algo(arr):
    '''
    input: an array of size n
    output: sorted array
    No use of sort function
    '''
    for j in range(len(arr)-1):
        for i in range(len(arr)-j-1):
            if arr[i] > arr[i+1]:
                temp = arr[i]
                arr[i] = arr[i+1]
                arr[i+1] = temp
        print(j, arr)
      
    return arr

In [185]:
my_array = [2,0,2,1,1,0]
bubble_sort_algo(my_array)

0 [0, 2, 1, 1, 0, 2]
1 [0, 1, 1, 0, 2, 2]
2 [0, 1, 0, 1, 2, 2]
3 [0, 0, 1, 1, 2, 2]
4 [0, 0, 1, 1, 2, 2]


[0, 0, 1, 1, 2, 2]

In [186]:
def onepass_sort(arr):
    '''
    input: an array of size n containing only the values 0, 1 and 2
    output: sorted array
    No use of sort function
    '''
    p1=0
    p2=len(arr)-1
    cur=0
    
    while cur <= p2:
        if arr[cur]==0:
            arr[p1], arr[cur] = arr[cur], arr[p1]
            p1 += 1
            cur += 1
        elif arr[cur]==2:
            arr[p2], arr[cur] = arr[cur], arr[p2]
            p2 -= 1
        else:
            cur += 1
            
    return arr

In [187]:
my_array = [2,0,2,1,1,0]
onepass_sort(my_array)

[0, 0, 1, 1, 2, 2]

## Coding question 2

In [195]:
def maxMoney(arr):
    '''
    input: an array representing values of the houses
    output: maximum value
    '''
    
    # Strategy - determine the sum of each non-adjacent pair
    for i in range(len(arr)-1):
        for j in range(len(arr)-1):
            if abs(j-i)!=1:
                val = sum(arr[i]+arr[j])
                print(val)
                
    # 
    
    #return max_val

In [196]:
my_array= [2,7,9,3,1]
maxMoney(my_array)

TypeError: 'int' object is not iterable

In [197]:
sum(my_array)

22

In [198]:
len(my_array)

5

Leetcode 198 (House Robber)

In [203]:
def rob(arr):
    # prevMax: best total up to two houses back
    # currMax: best total up to the previous house
    prevMax = 0
    currMax = 0
    for x in arr:
        temp = currMax
        # Either rob this house (prevMax + x) or skip it (currMax)
        currMax = max(prevMax + x, currMax)
        prevMax = temp
    return currMax

In [204]:
rob([2,7,9,3,1])

12

In [None]:
public int rob(int[] num) {
    int prevMax = 0;
    int currMax = 0;
    for (int x : num) {
        int temp = currMax;
        currMax = Math.max(prevMax + x, currMax);
        prevMax = temp;
    }
    return currMax;
}


# BRL SQL questions

In [None]:
/*
-- user_summary
-- user_id | join_ts | join_client | country

-- problems_solved
-- user_id | problem_id | ts | action | answer_is_correct
-- We have the following actions:
Viewed problem
Tried problem

*/

-- In the last 14d, what are the top 5 countries of people joining Brilliant

SELECT country, COUNT(*) AS no_of_people
FROM user_summary
WHERE join_ts BETWEEN (NOW() - INTERVAL '14 days') AND NOW()
GROUP BY country
ORDER BY no_of_people DESC
LIMIT 5

In [None]:
-- For each country, what are the average and total number of problems viewed 4 hours after a user joins?
-- | country | user_id | total_number_of_problems

WITH t1 AS
  (SELECT us.country, us.user_id, COUNT(*) AS number_of_problems
  FROM user_summary AS us
  JOIN problems_solved AS ps
  ON us.user_id=ps.user_id
  WHERE ps.ts BETWEEN us.join_ts AND (us.join_ts + INTERVAL '4 hours')
  AND ps.action='viewed'
  GROUP BY us.country, us.user_id)
  
SELECT t1.country, 
       AVG(number_of_problems) AS avg_number_of_problems,
       SUM(number_of_problems) AS total_number_of_problems
FROM t1
GROUP BY t1.country;

In [None]:
-- For the US, if someone tried a problem 4 hours after joining, what % of them joined via iOS? Trend this by the date someone joined
-- | date | percentage |
-- date including zeros for ios

--join_client in ('desktop-browser','mobile-browser','ios-native','android-native')

WITH t1 AS
  (SELECT us.join_ts::date AS join_date,
          -- distinct users who joined via iOS among those who tried a problem
          COUNT(DISTINCT CASE WHEN us.join_client = 'ios-native' THEN us.user_id END) AS ios_counts,
          COUNT(DISTINCT us.user_id) AS total_counts
  FROM user_summary AS us
  JOIN problems_solved AS ps
  ON us.user_id=ps.user_id
  WHERE us.country='US'
  AND ps.ts BETWEEN us.join_ts AND (us.join_ts + INTERVAL '4 hours')
  AND ps.action='tried'
  GROUP BY us.join_ts::date)
  
SELECT t1.join_date, 
       100*(t1.ios_counts::numeric/t1.total_counts) AS pct_ios_users
FROM t1
ORDER BY t1.join_date;
  

## Re-creating query

In [58]:
# Create table within a local database

# Define a database name, set your postgres username
dbname = "baseball"
username = "lacar"  # change this to your username

# Working with PostgreSQL in Python
con = psycopg2.connect(database=dbname, user=username)

# Here, we're using postgres, but sqlalchemy can connect to other things too.
engine = create_engine("postgres://%s@localhost/%s" % (username, dbname))
print(engine.url)

postgres://lacar@localhost/baseball


In [None]:
user_id = [123, 123, 456, 456]
join_ts = ['2-14-20 3:05pm', '2-14-20 3:06pm', '2-15-20 5:46pm', '2-15-20 5:50pm']
join_client = ['desktop-browser','mobile-browser','ios-native','android-native']
country = [''] * len(user_id)  # country values were left blank in these notes

# One column per field of user_summary: user_id | join_ts | join_client | country
user_summary = pd.DataFrame({'user_id': user_id,
                             'join_ts': join_ts,
                             'join_client': join_client,
                             'country': country})

In [None]:
user_id | join_ts | join_client | country

## Facebook problem

(From Yaniv) What percent of users click?



Rough idea: `case when` on the click column (click vs. null), then `sum`....

Example table:

user1  |  click
user2  |  click
user2  |  NULL
user3  |  
user3  | 

Possible values: click, null
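
A minimal Python sketch of the "percent of users who click" idea, under two assumptions that are mine and not part of the original question: a user counts as a clicker if at least one of their rows is a click, and blank cells are treated like NULL. The function name `percent_users_who_click` is hypothetical.

```python
def percent_users_who_click(rows):
    """rows: iterable of (user, action) pairs; action is 'click' or None."""
    all_users = set()
    clickers = set()
    for user, action in rows:
        all_users.add(user)
        if action == "click":
            clickers.add(user)
    if not all_users:
        return 0.0
    return 100.0 * len(clickers) / len(all_users)


# Rows mirroring the sketch table above (blank cells treated as None)
events = [
    ("user1", "click"),
    ("user2", "click"),
    ("user2", None),
    ("user3", None),
    ("user3", None),
]
pct = percent_users_who_click(events)
print(round(pct, 1))  # 66.7
```

In SQL the same idea would probably be a per-user `CASE WHEN` flag followed by a `SUM` divided by the number of distinct users.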


# Insight SQL QotDs 

## 4/1/20

### Question 1. Compute scores of teams.

You would like to compute the scores of all teams after all matches. Points are awarded as follows:
* A team receives three points if they win a match (score strictly more goals than the opponent team).
* A team receives one point if they draw a match (same number of goals as the opponent team).
* A team receives no points if they lose a match (score fewer goals than the opponent team).
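
Before writing the SQL, the scoring rule can be checked in plain Python. This is only a sketch: `match_points` and `total_points` are hypothetical helper names, and the tuples use the same five sample matches as the SQL cells in this section, in the order (host_team, guest_team, host_goals, guest_goals).

```python
from collections import defaultdict


def match_points(own_goals, opponent_goals):
    """Points for one team in one match: 3 for a win, 1 for a draw, 0 for a loss."""
    if own_goals > opponent_goals:
        return 3
    if own_goals == opponent_goals:
        return 1
    return 0


def total_points(matches):
    """Sum points per team over (host_team, guest_team, host_goals, guest_goals) rows."""
    totals = defaultdict(int)
    for host, guest, host_goals, guest_goals in matches:
        totals[host] += match_points(host_goals, guest_goals)
        totals[guest] += match_points(guest_goals, host_goals)
    return dict(totals)


sample_matches = [(1, 2, 10, 0), (2, 3, 2, 4), (3, 4, 5, 2), (4, 5, 3, 3), (5, 1, 0, 5)]
points = total_points(sample_matches)
print(points)  # {1: 6, 2: 0, 3: 6, 4: 1, 5: 1}
```

Note that a team with no matches at all would be missing from this dictionary, which is the same gap an inner join leaves in SQL.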


Table: Teams
+---------------+----------+
| Column Name   | Type     |
+---------------+----------+
| team_id       | int      |
| team_name     | varchar  |
+---------------+----------+
team_id is the primary key of this table.
Each row of this table represents a single football team.

Table: Matches
+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| match_id      | int     |
| host_team     | int     |
| guest_team    | int     | 
| host_goals    | int     |
| guest_goals   | int     |
+---------------+---------+
match_id is the primary key of this table.
Each row is a record of a finished match between two different teams. 
Teams host_team and guest_team are represented by their IDs in the teams table (team_id) and they scored host_goals and guest_goals respectively.


Output table

+---------------+----------+
| Column Name   | Type     |
+---------------+----------+
| team_id       | int      |
| team_name     | varchar  |
| total_points  | int      |
+---------------+----------+


In [100]:
# Create temporary tables, use join on host team
sql_query = """
WITH teams (team_id, team_names) AS (
    VALUES (1,'Padres'),
           (2,'Dodgers'),
           (3,'Giants'),
           (4,'Dbacks'),
           (5,'Rockies')),
           
matches (match_id, host_team, guest_team, host_goals, guest_goals) AS (
    VALUES (1, 1, 2, 10, 0),
           (2, 2, 3, 2, 4),
           (3, 3, 4, 5, 2),
           (4, 4, 5, 3, 3),
           (5, 5, 1, 0, 5))          
           
SELECT *
FROM teams
JOIN matches
ON teams.team_id=matches.host_team
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,team_id,team_names,match_id,host_team,guest_team,host_goals,guest_goals
0,1,Padres,1,1,2,10,0
1,2,Dodgers,2,2,3,2,4
2,3,Giants,3,3,4,5,2
3,4,Dbacks,4,4,5,3,3
4,5,Rockies,5,5,1,0,5


In [101]:
# Create temporary tables
sql_query = """
WITH teams (team_id, team_names) AS (
    VALUES (1,'Padres'),
           (2,'Dodgers'),
           (3,'Giants'),
           (4,'Dbacks'),
           (5,'Rockies')),
           
matches (match_id, host_team, guest_team, host_goals, guest_goals) AS (
    VALUES (1, 1, 2, 10, 0),
           (2, 2, 3, 2, 4),
           (3, 3, 4, 5, 2),
           (4, 4, 5, 3, 3),
           (5, 5, 1, 0, 5))          
           
SELECT *
FROM matches;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,match_id,host_team,guest_team,host_goals,guest_goals
0,1,1,2,10,0
1,2,2,3,2,4
2,3,3,4,5,2
3,4,4,5,3,3
4,5,5,1,0,5


In [102]:
# Create temporary tables
sql_query = """
WITH teams (team_id, team_names) AS (
    VALUES (1,'Padres'),
           (2,'Dodgers'),
           (3,'Giants'),
           (4,'Dbacks'),
           (5,'Rockies')),
           
matches (match_id, host_team, guest_team, host_goals, guest_goals) AS (
    VALUES (1, 1, 2, 10, 0),
           (2, 2, 3, 2, 4),
           (3, 3, 4, 5, 2),
           (4, 4, 5, 3, 3),
           (5, 5, 1, 0, 5))          
           
SELECT *
FROM teams
JOIN matches
ON CASE WHEN matches.host_goals > matches.guest_goals THEN teams.team_id=matches.host_team
    END;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,team_id,team_names,match_id,host_team,guest_team,host_goals,guest_goals
0,1,Padres,1,1,2,10,0
1,3,Giants,3,3,4,5,2


In [104]:
# Create temporary tables
sql_query = """
WITH teams (team_id, team_names) AS (
    VALUES (1,'Padres'),
           (2,'Dodgers'),
           (3,'Giants'),
           (4,'Dbacks'),
           (5,'Rockies')),
           
matches (match_id, host_team, guest_team, host_goals, guest_goals) AS (
    VALUES (1, 1, 2, 10, 0),
           (2, 2, 3, 2, 4),
           (3, 3, 4, 5, 2),
           (4, 4, 5, 3, 3),
           (5, 5, 1, 0, 5)),          
           
home_results AS
    (SELECT team_id,
       team_names, 
       SUM(CASE WHEN matches.host_goals>matches.guest_goals THEN 3
                WHEN matches.host_goals=matches.guest_goals THEN 1
                WHEN matches.host_goals<matches.guest_goals THEN 0
                END) AS home_total_points
    FROM teams
    JOIN matches
    ON teams.team_id=matches.host_team
    GROUP BY team_id, team_names),
    
away_results AS
    (SELECT team_id,
       team_names, 
       SUM(CASE WHEN matches.host_goals>matches.guest_goals THEN 0
                WHEN matches.host_goals=matches.guest_goals THEN 1
                WHEN matches.host_goals<matches.guest_goals THEN 3
                END) AS away_total_points
    FROM teams
    JOIN matches
    ON teams.team_id=matches.guest_team
    GROUP BY team_id, team_names) 
    
SELECT *
FROM home_results
JOIN away_results
ON home_results.team_id=away_results.team_id;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,team_id,team_names,home_total_points,team_id.1,team_names.1,away_total_points
0,1,Padres,3,1,Padres,3
1,2,Dodgers,0,2,Dodgers,0
2,3,Giants,3,3,Giants,3
3,5,Rockies,0,5,Rockies,1
4,4,Dbacks,1,4,Dbacks,0


In [105]:
# Final output
sql_query = """
WITH teams (team_id, team_names) AS (
    VALUES (1,'Padres'),
           (2,'Dodgers'),
           (3,'Giants'),
           (4,'Dbacks'),
           (5,'Rockies')),
           
matches (match_id, host_team, guest_team, host_goals, guest_goals) AS (
    VALUES (1, 1, 2, 10, 0),
           (2, 2, 3, 2, 4),
           (3, 3, 4, 5, 2),
           (4, 4, 5, 3, 3),
           (5, 5, 1, 0, 5)),          
           
home_results AS
    (SELECT team_id,
       team_names, 
       SUM(CASE WHEN matches.host_goals>matches.guest_goals THEN 3
                WHEN matches.host_goals=matches.guest_goals THEN 1
                WHEN matches.host_goals<matches.guest_goals THEN 0
                END) AS home_total_points
    FROM teams
    JOIN matches
    ON teams.team_id=matches.host_team
    GROUP BY team_id, team_names),
    
away_results AS
    (SELECT team_id,
       team_names, 
       SUM(CASE WHEN matches.host_goals>matches.guest_goals THEN 0
                WHEN matches.host_goals=matches.guest_goals THEN 1
                WHEN matches.host_goals<matches.guest_goals THEN 3
                END) AS away_total_points
    FROM teams
    JOIN matches
    ON teams.team_id=matches.guest_team
    GROUP BY team_id, team_names) 
    
SELECT home_results.team_id AS team_id,
       (home_total_points + away_total_points) AS total_points
FROM home_results
JOIN away_results
ON home_results.team_id=away_results.team_id;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,team_id,total_points
0,1,6
1,2,0
2,3,6
3,5,1
4,4,1


In [108]:
# Final output - shorter
sql_query = """
WITH teams (team_id, team_names) AS (
    VALUES (1,'Padres'),
           (2,'Dodgers'),
           (3,'Giants'),
           (4,'Dbacks'),
           (5,'Rockies')),
           
matches (match_id, host_team, guest_team, host_goals, guest_goals) AS (
    VALUES (1, 1, 2, 10, 0),
           (2, 2, 3, 2, 4),
           (3, 3, 4, 5, 2),
           (4, 4, 5, 3, 3),
           (5, 5, 1, 0, 5)),          
           
home_results AS
    (SELECT team_id,
       team_names, 
       SUM(CASE WHEN matches.host_goals>matches.guest_goals THEN 3
                WHEN matches.host_goals=matches.guest_goals THEN 1
                ELSE 0
                END) AS home_total_points
    FROM teams
    JOIN matches
    ON teams.team_id=matches.host_team
    GROUP BY team_id, team_names),
    
away_results AS
    (SELECT team_id,
       team_names, 
       SUM(CASE WHEN matches.host_goals<matches.guest_goals THEN 3
                WHEN matches.host_goals=matches.guest_goals THEN 1
                ELSE 0
                END) AS away_total_points
    FROM teams
    JOIN matches
    ON teams.team_id=matches.guest_team
    GROUP BY team_id, team_names) 
    
SELECT home_results.team_id AS team_id,
       (home_total_points + away_total_points) AS total_points
FROM home_results
JOIN away_results
ON home_results.team_id=away_results.team_id
ORDER BY total_points DESC, home_results.team_id ASC;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,team_id,total_points
0,1,6
1,3,6
2,4,1
3,5,1
4,2,0


In [120]:
# Testing Minting's solution (can't follow all of it)
sql_query = """
WITH teams (team_id, team_names) AS (
    VALUES (1,'Padres'),
           (2,'Dodgers'),
           (3,'Giants'),
           (4,'Dbacks'),
           (5,'Rockies')),
           
matches (match_id, host_team, guest_team, host_goals, guest_goals) AS (
    VALUES (1, 1, 2, 10, 0),
           (2, 2, 3, 2, 4),
           (3, 3, 4, 5, 2),
           (4, 4, 5, 3, 3),
           (5, 5, 1, 0, 5)),          
           
score AS (
    SELECT match_id, 
           host_team, 
           guest_team,
           (CASE WHEN host_goals > guest_goals THEN 3
                 WHEN host_goals = guest_goals THEN 1
                 ELSE 0 END) AS host_score,
           (CASE WHEN guest_goals > host_goals THEN 3
                WHEN guest_goals = host_goals THEN 1
                ELSE 0 END) AS guest_score
           FROM matches)
    
SELECT team_id, team_name, SUM(team_score) AS total_points
FROM (SELECT t1.team_id AS team_id, 
             t1.team_names AS team_name, 
             s.host_score AS team_score
             FROM score s
             INNER JOIN teams t1
             ON s.host_team = t1.team_id
       UNION ALL
       SELECT t2.team_id AS team_id,
              t2.team_names AS team_name, 
              s.guest_score AS team_score
              FROM score s
              INNER JOIN teams t2
              ON s.guest_team = t2.team_id) sub
GROUP BY team_id, team_name
ORDER BY total_points DESC, team_id;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,team_id,team_name,total_points
0,1,Padres,6
1,3,Giants,6
2,4,Dbacks,1
3,5,Rockies,1
4,2,Dodgers,0


In [None]:
-- Minting's query as written against the Teams table from the problem statement (column team_name)
SELECT team_id, team_name, SUM(team_score) AS total_points
FROM (SELECT t1.team_id AS team_id, t1.team_name AS team_name, s.host_score AS team_score
      FROM score s
      INNER JOIN Teams t1
      ON s.host_team = t1.team_id
      UNION ALL
      SELECT t2.team_id AS team_id, t2.team_name AS team_name, s.guest_score AS team_score
      FROM score s
      INNER JOIN Teams t2
      ON s.guest_team = t2.team_id) sub
GROUP BY team_id, team_name
ORDER BY total_points DESC, team_id

### Question 2. Rank scores.

Write a SQL query to rank scores. If there is a tie between two scores, both should have the same ranking. Note that after a tie, the next ranking number should be the next consecutive integer value. In other words, there should be no "holes" between ranks.

+----+-------+
| Id | Score |
+----+-------+
| 1  | 3.50  |
| 2  | 3.65  |
| 3  | 4.00  |
| 4  | 3.85  |
| 5  | 4.00  |
| 6  | 3.65  |
+----+-------+
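
The "no holes" ranking rule can be sketched in plain Python first: rank each score by its position among the distinct scores, sorted from highest to lowest. The helper name `dense_rank` is my own and only illustrates the rule; it is not the SQL answer itself.

```python
def dense_rank(scores):
    """Return (score, rank) pairs, highest score first, with no gaps after ties."""
    # Distinct scores from highest to lowest; position in this list is the rank
    distinct_desc = sorted(set(scores), reverse=True)
    rank_of = {score: position for position, score in enumerate(distinct_desc, start=1)}
    return [(score, rank_of[score]) for score in sorted(scores, reverse=True)]


scores = [3.50, 3.65, 4.00, 3.85, 4.00, 3.65]
ranked = dense_rank(scores)
print(ranked)
# [(4.0, 1), (4.0, 1), (3.85, 2), (3.65, 3), (3.65, 3), (3.5, 4)]
```

Counting distinct scores greater than or equal to the current one gives the same number, which is what a correlated subquery can compute in SQL.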


In [4]:
# 
sql_query = """
WITH  input_table (id, score) AS (VALUES (1,3.50), (2,3.65), (3,4.00), (4,3.85), (5,4.00), (6,3.65))

SELECT *
FROM input_table;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,score
0,1,3.5
1,2,3.65
2,3,4.0
3,4,3.85
4,5,4.0
5,6,3.65


In [None]:
sql_query = """
WITH  input_table (id, score) AS (VALUES (1,3.50), (2,3.65), (3,4.00), (4,3.85), (5,4.00), (6,3.65))

SELECT score
FROM input_table AS it;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

In [76]:
sql_query = """
WITH  input_table (id, score) AS (VALUES (1,3.50), (2,3.65), (3,4.00), (4,3.85), (5,4.00), (6,3.65))

SELECT score,
       (SELECT COUNT(*) 
       FROM input_table
       WHERE input_table.score > it.score) AS rank
FROM input_table AS it
ORDER BY rank;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,score,rank
0,4.0,0
1,4.0,0
2,3.85,2
3,3.65,3
4,3.65,3
5,3.5,5


In [80]:
# 
sql_query = """
WITH  input_table (id, score) AS (VALUES (1,3.50), (2,3.65), (3,4.00), (4,3.85), (5,4.00), (6,3.65))

SELECT score,
       (SELECT COUNT(DISTINCT(score)) 
        FROM input_table
        WHERE input_table.score >= it.score) AS rank
FROM input_table AS it
ORDER BY rank;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,score,rank
0,4.0,1
1,4.0,1
2,3.85,2
3,3.65,3
4,3.65,3
5,3.5,4


In [17]:
# 
sql_query = """
WITH  input_table (id, score) AS (VALUES (1,3.50), (2,3.65), (3,4.00), (4,3.85), (5,4.00), (6,3.65))

SELECT score,
       DENSE_RANK() OVER(ORDER BY score DESC) AS score_rank
FROM input_table;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,score,score_rank
0,4.0,1
1,4.0,1
2,3.85,2
3,3.65,3
4,3.65,3
5,3.5,4


Write a SQL query to get the nth highest salary from the Employee table.

+----+--------+
| Id | Salary |
+----+--------+
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |
+----+--------+
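
One way to think about "nth highest" is: sort the distinct salaries in descending order and skip the first n-1. The sketch below tries that with Python's built-in `sqlite3` module as a stand-in for PostgreSQL (an assumption made so the example runs without a server; `LIMIT ... OFFSET` is also valid PostgreSQL syntax). The function name `nth_highest_salary` is hypothetical and n is assumed to be at least 1.

```python
import sqlite3


def nth_highest_salary(conn, n):
    """Return the nth highest distinct salary, or None if fewer than n exist."""
    row = conn.execute(
        "SELECT DISTINCT salary FROM Employee "
        "ORDER BY salary DESC LIMIT 1 OFFSET ?",
        (n - 1,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (id INTEGER, salary INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)", [(1, 100), (2, 200), (3, 300)])

second = nth_highest_salary(conn, 2)
fourth = nth_highest_salary(conn, 4)
print(second)  # 200
print(fourth)  # None
```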


In [29]:
sql_query = """
WITH input_table (id, salary) AS (VALUES (1,100), (2,200), (3,300))

SELECT *
FROM input_table;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,salary
0,1,100
1,2,200
2,3,300


In [32]:
sql_query = """
WITH Employee (id, salary) AS (VALUES (1,100), (2,200), (3,300))

SELECT MIN(salary)
FROM(SELECT DISTINCT salary
FROM Employee
ORDER BY salary DESC
LIMIT 2) AS sub_table
;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query



Unnamed: 0,min
0,200


In [124]:
sql_query = """
WITH  input_table (id, salary) AS (VALUES (1,100), (2,200), (3,300))

SELECT DISTINCT it.salary
FROM input_table AS it
WHERE (SELECT COUNT(DISTINCT salary)
       FROM input_table
       WHERE input_table.salary > it.salary) = 1;  -- n-1 distinct salaries above it (n=2 here)
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

InterfaceError: connection already closed

## 4/2/20

### Most frequent product

Given a table of products, find the most frequent product each day

| ID | Date | Product |
|----|------|---------|
| 1  | 2-12 | apple   |
| 2  | 2-12 | apple   |
| 3  | 2-12 | orange  |
| 4  | 2-13 | pear    |
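
As a sanity check on the expected answer, the same grouping can be done in plain Python: count products per day, then keep whichever product(s) reach the daily maximum. The helper name `most_frequent_per_day` is hypothetical, and returning all tied products is my own assumption about how ties should be handled.

```python
from collections import Counter, defaultdict


def most_frequent_per_day(rows):
    """rows: (id, date, product) tuples. Returns {date: [most frequent product(s)]}."""
    counts = defaultdict(Counter)
    for _id, date, product in rows:
        counts[date][product] += 1

    result = {}
    for date, counter in counts.items():
        top = max(counter.values())
        # Keep every product that ties for the daily maximum
        result[date] = sorted(p for p, c in counter.items() if c == top)
    return result


rows = [(1, "2-12", "apple"), (2, "2-12", "apple"), (3, "2-12", "orange"), (4, "2-13", "pear")]
winners = most_frequent_per_day(rows)
print(winners)  # {'2-12': ['apple'], '2-13': ['pear']}
```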


In [153]:
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'))
              
           
SELECT *
FROM products;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,date,product
0,1,2020-02-12,apple
1,2,2020-02-12,apple
2,3,2020-02-12,orange
3,4,2020-02-13,pear


In [166]:
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear')),
              
t1 AS
    (SELECT date,
           product,
           COUNT(*) AS no_items
    FROM products
    GROUP BY date, product)
    
SELECT *
FROM t1;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product,no_items
0,2020-02-12,apple,2
1,2020-02-13,pear,1
2,2020-02-12,orange,1


In [189]:
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'))
    
SELECT date, 
       product,
       COUNT(*)
FROM products
GROUP BY date, product;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product,count
0,2020-02-12,apple,2
1,2020-02-13,pear,1
2,2020-02-12,orange,1


In [191]:
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear')),
    
t1 AS (SELECT date, 
       product,
       COUNT(*)
        FROM products
        GROUP BY date, product)

SELECT *
FROM t1;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product,count
0,2020-02-12,apple,2
1,2020-02-13,pear,1
2,2020-02-12,orange,1


In [193]:
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear')),
    
t1 AS (SELECT date, 
       product,
       COUNT(*)
       FROM products
       GROUP BY date, product)

SELECT date, MAX(count)
FROM t1
GROUP BY date;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,max
0,2020-02-12,2
1,2020-02-13,1


In [194]:
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear')),
    
t1 AS (SELECT date, 
       product,
       COUNT(*)
       FROM products
       GROUP BY date, product)

SELECT t1.date,
       t1.product
FROM t1
JOIN 
    (SELECT date, MAX(count) AS max_count
    FROM t1
    GROUP BY date) AS t2
ON t1.date=t2.date
AND t1.count=t2.max_count;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product
0,2020-02-12,apple
1,2020-02-13,pear


In [210]:
# Try with window function
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'))

SELECT date,
       product,
       RANK() OVER(PARTITION BY date ORDER BY COUNT(product) DESC) AS n_rank
FROM products
GROUP BY date, product;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product,n_rank
0,2020-02-12,apple,1
1,2020-02-12,orange,2
2,2020-02-13,pear,1


In [None]:
# Try Shu's window function
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'))

SELECT t.date, t.product
FROM
    (SELECT date,
           product,
           RANK() OVER(PARTITION BY date ORDER BY COUNT(product) DESC) AS n_rank
    FROM products
    GROUP BY date, product) AS t
WHERE t.n_rank=1;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

In [5]:
# Try with window function
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'),
           (5, '2020-02-13', 'orange'),
           (6, '2020-02-13', 'orange'),
           (7, '2020-02-14', 'apple'),
           (8, '2020-02-14', 'apple'),
           (9, '2020-02-14', 'pear'))

SELECT t.date, t.product
FROM
    (SELECT date,
           product,
           RANK() OVER(PARTITION BY date ORDER BY COUNT(product) DESC) AS n_rank
    FROM products
    GROUP BY date, product) AS t
WHERE t.n_rank=1;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product
0,2020-02-12,apple
1,2020-02-13,orange
2,2020-02-14,apple


In [None]:
# Without a window function (repeat attempt, 4/13/20)
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'),
           (5, '2020-02-13', 'orange'),
           (6, '2020-02-13', 'orange'),
           (7, '2020-02-14', 'apple'),
           (8, '2020-02-14', 'apple'),
           (9, '2020-02-14', 'pear')),

t1 AS
    (SELECT date,
            product,
            COUNT(product) AS n_items
    FROM products
    GROUP BY date, product)

-- Keep the product(s) whose count equals that day's maximum count
SELECT t1.date, t1.product
FROM t1
WHERE t1.n_items = (SELECT MAX(t2.n_items)
                    FROM t1 AS t2
                    WHERE t2.date = t1.date);

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

## 4/3/20

### Transactions

You have a table of transactions where each row represents a single transaction. The table has four columns: a user_id for the user sending money (the sender), a user_id for the user receiving money (the receiver), the amount sent by the sender to the receiver, and a timestamp for when the transaction took place. The user_ids in the sender and receiver columns are foreign keys to the same user table, and all values in the amount column are positive.

Write a single query that gives the change in net worth for each user since recording began in this table.


In [None]:
transactions
| sender_id | receiver_id | amount | ts |


output
| user_id | net_worth |


In [None]:
# Sent table: total amount each user has sent

SELECT 
    sender_id AS user_id,
    SUM(amount) AS total_sent
FROM transactions
GROUP BY 
    sender_id

| sender_id | receiver_id | amount | ts |
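
One possible way to finish this problem is to stack the table on itself: each transaction becomes a negative row for the sender and a positive row for the receiver, and the rows are then summed per user. The sketch below tries that with the built-in `sqlite3` module as a stand-in for PostgreSQL (my assumption, so it runs without a server). The three sample transactions are made up for illustration.

```python
import sqlite3

NET_WORTH_SQL = """
SELECT user_id, SUM(delta) AS net_worth
FROM (SELECT sender_id AS user_id, -amount AS delta FROM transactions
      UNION ALL
      SELECT receiver_id AS user_id, amount AS delta FROM transactions) AS flows
GROUP BY user_id
ORDER BY user_id
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (sender_id INTEGER, receiver_id INTEGER, amount REAL, ts TEXT)"
)
# Hypothetical rows: user 1 sends 10 to user 2, user 2 sends 4 to user 3, user 3 sends 1 to user 1
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        (1, 2, 10.0, "2020-04-01 09:00"),
        (2, 3, 4.0, "2020-04-01 10:00"),
        (3, 1, 1.0, "2020-04-02 11:00"),
    ],
)

net_worth = conn.execute(NET_WORTH_SQL).fetchall()
print(net_worth)  # [(1, -9.0), (2, 6.0), (3, 3.0)]
```

`UNION ALL` matters here: plain `UNION` would drop duplicate rows, so two identical payments would be counted once.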

### Exchange Seats

Mary is a teacher in a middle school and she has a table `seat` storing students' names and their corresponding seat ids.

The id column is a continuous increment (consecutive integers).

Mary wants to swap the seats of adjacent students. If the number of students is odd, the last student keeps their seat.

Can you write a SQL query to output the result for Mary?

+---------+---------+
|    id   | student |
+---------+---------+
|    1    | Abbot   |
|    2    | Doris   |
|    3    | Emerson |
|    4    | Green   |
|    5    | Jeames  |
+---------+---------+
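
To pin down the expected result before writing SQL, the swap can be sketched in plain Python: walk the list two seats at a time and exchange each pair, leaving a final unpaired student where they are. The helper name `exchange_seats` is hypothetical.

```python
def exchange_seats(students):
    """students: names ordered by seat id (1, 2, 3, ...). Returns [(id, student), ...]."""
    swapped = list(students)
    # Step through pairs (0,1), (2,3), ...; an odd last student is never touched
    for i in range(0, len(swapped) - 1, 2):
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
    return list(enumerate(swapped, start=1))


seats = exchange_seats(["Abbot", "Doris", "Emerson", "Green", "Jeames"])
print(seats)
# [(1, 'Doris'), (2, 'Abbot'), (3, 'Green'), (4, 'Emerson'), (5, 'Jeames')]
```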

In [5]:
# Input query
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Abbot'),
           (2, 'Doris'),
           (3, 'Emerson'),
           (4, 'Green'),
           (5, 'Jeames'))

SELECT *
FROM seat;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,student
0,1,Abbot
1,2,Doris
2,3,Emerson
3,4,Green
4,5,Jeames


In [20]:
# Input query
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Abbot'),
           (2, 'Doris'),
           (3, 'Emerson'),
           (4, 'Green'),
           (5, 'Jeames'))

SELECT MAX(id) FROM seat;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,max
0,5


In [23]:
# Input query
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Abbot'),
           (2, 'Doris'),
           (3, 'Emerson'),
           (4, 'Green'),
           (5, 'Jeames'))

SELECT id,
       CASE WHEN id = (SELECT MAX(id) FROM seat) THEN student
            WHEN id%2 <> 0 THEN LEAD(student, 1) OVER()
            WHEN id%2 = 0 THEN LAG(student, 1) OVER()
            END AS student
FROM seat;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,student
0,1,Doris
1,2,Abbot
2,3,Green
3,4,Emerson
4,5,Jeames


In [29]:
# Input query
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Abbot'),
           (2, 'Doris'),
           (3, 'Emerson'),
           (4, 'Green'),
           (5, 'Jeames'))

SELECT id,
       CASE WHEN (id = (SELECT MAX(id) FROM seat)) AND (id%2 <> 0) THEN student
            WHEN id%2 <> 0 THEN LEAD(student, 1) OVER(ORDER BY student)
            WHEN id%2 = 0 THEN LAG(student, 1) OVER(ORDER BY student)
            END AS student
FROM seat
ORDER BY id;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,student
0,1,Doris
1,2,Abbot
2,3,Green
3,4,Emerson
4,5,Jeames


{"headers": {"seat": ["id","student"]}, "rows": {"seat": [[1,"Craigie"],[2,"Julius"],[3,"Denis"],[4,"Isabel"],[5,"Windsor"],[6,"Vincent"],[7,"Mike"],[8,"Russell"],[9,"FitzGerald"],[10,"Rob"]]}}

In [32]:
# Input query
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT *
FROM seat;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,student
0,1,Craigie
1,2,Julius
2,3,Denis
3,4,Isabel
4,5,Windsor
5,6,Vincent
6,7,Mike
7,8,Russell
8,9,FitzGerald
9,10,Rob


In [34]:
# Input query
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT id,
       CASE WHEN (id = (SELECT MAX(id) FROM seat)) AND (id%2 <> 0) THEN student
            WHEN id%2 <> 0 THEN LEAD(student, 1) OVER(ORDER BY id)
            WHEN id%2 = 0 THEN LAG(student, 1) OVER(ORDER BY id)
            END AS student
FROM seat
ORDER BY id;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,student
0,1,Julius
1,2,Craigie
2,3,Isabel
3,4,Denis
4,5,Vincent
5,6,Windsor
6,7,Russell
7,8,Mike
8,9,Rob
9,10,FitzGerald
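
The first dataset happened to be in alphabetical order by id, so `OVER(ORDER BY student)` and `OVER(ORDER BY id)` gave the same swap. The sketch below reruns the same CASE expression on the second dataset with both orderings to show that only `ORDER BY id` pairs neighbouring seats. It assumes an in-memory SQLite database (3.25 or later, for window functions) as a stand-in for the PostgreSQL connection; `swap_seats` is a hypothetical helper name.

```python
import sqlite3

# Same seat data as the cell above, kept as a reusable CTE prefix.
SEAT_CTE = """
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'), (2, 'Julius'), (3, 'Denis'), (4, 'Isabel'),
           (5, 'Windsor'), (6, 'Vincent'), (7, 'Mike'), (8, 'Russell'),
           (9, 'FitzGerald'), (10, 'Rob'))
"""


def swap_seats(window_order):
    """Run the seat-swap query with the given window ORDER BY column."""
    sql = SEAT_CTE + f"""
    SELECT id,
           CASE WHEN (id = (SELECT MAX(id) FROM seat)) AND (id%2 <> 0) THEN student
                WHEN id%2 <> 0 THEN LEAD(student, 1) OVER (ORDER BY {window_order})
                WHEN id%2 = 0 THEN LAG(student, 1) OVER (ORDER BY {window_order})
                END AS student
    FROM seat
    ORDER BY id;
    """
    # SQLite in memory is an assumption made so the example runs without a server.
    con = sqlite3.connect(":memory:")
    try:
        return con.execute(sql).fetchall()
    finally:
        con.close()


by_id = swap_seats("id")            # neighbours by seat number
by_student = swap_seats("student")  # neighbours by alphabetical name
print(by_id[:2])
print(by_student[:2])
```

With `ORDER BY student`, seat 1 receives the alphabetically next name (Denis) rather than the student in seat 2 (Julius), so the window must be ordered by `id`.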


In [39]:
# Without window function
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT id,
       CASE WHEN (id = (SELECT MAX(id) FROM seat)) AND (id%2 <> 0) THEN student
            WHEN id%2 <> 0 THEN (SELECT student FROM seat WHERE id-1)
            WHEN id%2 = 0 THEN (SELECT student FROM seat WHERE id+1)
            END AS student
FROM seat;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

DatabaseError: Execution failed on sql '
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT id,
       CASE WHEN (id = (SELECT MAX(id) FROM seat)) AND (id%2 <> 0) THEN student
            WHEN id%2 <> 0 THEN (SELECT student FROM seat WHERE id-1)
            WHEN id%2 = 0 THEN (SELECT student FROM seat WHERE id+1)
            END AS student
FROM seat;

': argument of WHERE must be type boolean, not type integer
LINE 16: ...   WHEN id%2 <> 0 THEN (SELECT student FROM seat WHERE id-1)
                                                                   ^


In [46]:
# Without window function
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT id,
       id+1,
       id-1
FROM seat;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,?column?,?column?.1
0,1,2,0
1,2,3,1
2,3,4,2
3,4,5,3
4,5,6,4
5,6,7,5
6,7,8,6
7,8,9,7
8,9,10,8
9,10,11,9


In [50]:
# Without window function
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT id,
       CASE WHEN (id = (SELECT MAX(id) FROM seat)) AND (id%2 <> 0) THEN student
            WHEN id%2 <> 0 THEN (SELECT student FROM seat WHERE id=2)
            WHEN id%2 = 0 THEN (SELECT student FROM seat WHERE id=1)
            END AS student
FROM seat;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,student
0,1,Julius
1,2,Craigie
2,3,Julius
3,4,Craigie
4,5,Julius
5,6,Craigie
6,7,Julius
7,8,Craigie
8,9,Julius
9,10,Craigie


In [51]:
# Without window function
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT id,
       CASE WHEN (id = (SELECT MAX(id) FROM seat)) AND (id%2 <> 0) THEN student
            WHEN id%2 <> 0 THEN (SELECT student FROM seat WHERE id=(SELECT id+1 FROM seat))
            END AS student
FROM seat;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

DatabaseError: Execution failed on sql '
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT id,
       CASE WHEN (id = (SELECT MAX(id) FROM seat)) AND (id%2 <> 0) THEN student
            WHEN id%2 <> 0 THEN (SELECT student FROM seat WHERE id=(SELECT id+1 FROM seat))
            END AS student
FROM seat;

': more than one row returned by a subquery used as an expression


In [57]:
# Without window function
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT s.student
FROM seat s
WHERE id=(SELECT id+1 FROM seat);
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

DatabaseError: Execution failed on sql '
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT s.student
FROM seat s
WHERE id=(SELECT id+1 FROM seat);
': more than one row returned by a subquery used as an expression
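
The errors above have two causes: `WHERE id-1` is an integer rather than a boolean condition, and `(SELECT id+1 FROM seat)` is not tied to the outer row, so it returns one value per seat. A minimal sketch of one way around both is a correlated subquery that looks up the neighbour of the current outer row. It assumes an in-memory SQLite database as a stand-in for the PostgreSQL connection; the aliases `s`, `n`, `p` and the helper `run_query` are hypothetical names.

```python
import sqlite3

SWAP_QUERY = """
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'), (2, 'Julius'), (3, 'Denis'), (4, 'Isabel'),
           (5, 'Windsor'), (6, 'Vincent'), (7, 'Mike'), (8, 'Russell'),
           (9, 'FitzGerald'), (10, 'Rob'))

SELECT s.id,
       CASE WHEN s.id = (SELECT MAX(id) FROM seat) AND s.id % 2 <> 0 THEN s.student
            -- odd seat: take the student from the next seat of THIS row
            WHEN s.id % 2 <> 0 THEN (SELECT n.student FROM seat n WHERE n.id = s.id + 1)
            -- even seat: take the student from the previous seat of THIS row
            ELSE (SELECT p.student FROM seat p WHERE p.id = s.id - 1)
            END AS student
FROM seat s
ORDER BY s.id;
"""


def run_query(sql):
    """Execute a query on a throwaway in-memory SQLite database (assumption)."""
    con = sqlite3.connect(":memory:")
    try:
        return con.execute(sql).fetchall()
    finally:
        con.close()


swapped = run_query(SWAP_QUERY)
for row in swapped:
    print(row)
```

Each scalar subquery now filters on the outer row's `s.id`, so it returns at most one row, and the result matches the window-function version.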


## 4/13/20 

### Group features

A team is testing a change to a feature and wants to know whether group A behaves differently from group B.
How many people in each group visit the SignUp page? What is each group's click-through rate?
Based on the results, can we conclude that the groups differ? What test should be run? What hypotheses should be formed? Should the test be one-tailed or two-tailed? How would you convince the manager?

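Table: Cohort

| ID | GroupAssigment |
|----|----------------|
| 1  | Group A        |
| 2  | Group A        |
| 3  | Group B        |
| 4  | Group B        |

Table: Event

| ID | Page   | Click |
|----|--------|-------|
| 1  | SignUp | 1     |
| 2  | SignUp | 0     |
| 3  | SignIn | 0     |
| 4  | SignIn | 0     |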




In [None]:
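-- Visits and click-through rate on the SignUp page, per group.
-- Assumes event.id refers to the same person as cohort.id.
SELECT c.groupassigment,
       COUNT(*) AS n_visits,
       AVG(e.click) AS click_through_rate
FROM cohort c
JOIN event e
ON c.id=e.id
WHERE e.page='SignUp'
GROUP BY c.groupassigment;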
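
A per-group count and click-through rate can be checked against the toy Cohort and Event tables above. This is a rough sketch that loads those rows into an in-memory SQLite database (a stand-in for PostgreSQL) and assumes `event.id` identifies the same person as `cohort.id`. A `LEFT JOIN` is used so that a group with no SignUp visits still appears.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection
con.executescript("""
CREATE TABLE cohort (id INTEGER, groupassigment TEXT);
CREATE TABLE event (id INTEGER, page TEXT, click INTEGER);
INSERT INTO cohort VALUES (1, 'Group A'), (2, 'Group A'), (3, 'Group B'), (4, 'Group B');
INSERT INTO event VALUES (1, 'SignUp', 1), (2, 'SignUp', 0), (3, 'SignIn', 0), (4, 'SignIn', 0);
""")

rows = con.execute("""
SELECT c.groupassigment,
       COUNT(e.id) AS n_visits,            -- counts only matched SignUp events
       AVG(e.click) AS click_through_rate  -- NULL when the group has no visits
FROM cohort c
LEFT JOIN event e
ON c.id = e.id AND e.page = 'SignUp'
GROUP BY c.groupassigment
ORDER BY c.groupassigment;
""").fetchall()
con.close()

for row in rows:
    print(row)
```

On this sample, Group A has two SignUp visits with one click and Group B has none, so four rows are far too few to support any conclusion about a difference between the groups.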

## 4/14/20

### Uber's fraud team

1) Suppose you work on Uber's fraud team. While it is common for clients to use Uber in a foreign city, it is rare for a driver to drive in a foreign city. List all the drivers who have completed at least 5 trips in a foreign city in the last 28 days.


Output:
| driver_id | n_trips |

-- Compare the country of the trip's city with the country of the driver's home city.
-- Assumes trips.driver_id references users.id.
SELECT t.driver_id,
       COUNT(*) AS n_trips
FROM trips t
JOIN cities tc
ON t.city_id=tc.id
JOIN users u
ON t.driver_id=u.id
JOIN cities uc
ON u.city_id=uc.id
WHERE u.role='driver'
AND t.complete_time >= NOW() - INTERVAL '28 days'
AND tc.country <> uc.country
GROUP BY t.driver_id
HAVING COUNT(*) >= 5;


### Projects

1) Given the tables below, find the name of the department with the highest number of projects.





In [None]:
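WITH t1 AS (
    -- Count each project once per department, even when several employees share it
    SELECT d.name,
           COUNT(DISTINCT p.id) AS project_count
    FROM departments d
    JOIN employees e
    ON d.id=e.department_id
    JOIN employees_projects ep
    ON e.id=ep.employee_id  -- assumes the link table has an employee_id column
    JOIN projects p
    ON ep.project_id=p.id
    GROUP BY d.name),

t2 AS (
    -- Rank departments from most to fewest projects
    SELECT *,
           RANK() OVER (ORDER BY project_count DESC) AS project_rank
    FROM t1)

SELECT name
FROM t2
WHERE project_rank=1;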
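
Since the tables for this question are not shown, here is a toy run of the ranking idea on made-up data. Everything in it is an assumption for illustration: the schema (including the `employee_id` column on `employees_projects`), the sample rows, and the use of in-memory SQLite (3.25 or later) in place of PostgreSQL. It shows why the rank needs a descending order and why a project shared by two employees should be counted once.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection
con.executescript("""
-- Hypothetical schema and rows, for illustration only
CREATE TABLE departments (id INTEGER, name TEXT);
CREATE TABLE employees (id INTEGER, department_id INTEGER);
CREATE TABLE projects (id INTEGER, title TEXT);
CREATE TABLE employees_projects (employee_id INTEGER, project_id INTEGER);
INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Sales');
INSERT INTO employees VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO projects VALUES (10, 'alpha'), (11, 'beta'), (12, 'gamma');
-- employees 1 and 2 both work on project 10
INSERT INTO employees_projects VALUES (1, 10), (2, 10), (2, 11), (3, 12);
""")

top = con.execute("""
WITH t1 AS (
    SELECT d.name,
           COUNT(DISTINCT ep.project_id) AS project_count
    FROM departments d
    JOIN employees e ON d.id = e.department_id
    JOIN employees_projects ep ON e.id = ep.employee_id
    GROUP BY d.name),
t2 AS (
    SELECT name,
           project_count,
           RANK() OVER (ORDER BY project_count DESC) AS project_rank
    FROM t1)
SELECT name, project_count
FROM t2
WHERE project_rank = 1;
""").fetchall()
con.close()

print(top)
```

`RANK()` keeps ties, so if two departments shared the highest count both would be returned; `ROW_NUMBER()` or `LIMIT 1` would return only one of them.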



# --