# Fun with Window Functions


## Database Connection

In [2]:
# connect to db in public repo
ib.connect_db('ib://ewu/w4111-public/databases/w4111')

Connected to: ib://ewu/w4111-public/databases/w4111


# "The interview"

You are being interviewed for a data scientist position at a major online store, and to get the job, you need to prove that you can answer some of their questions regarding past sales.

To do so, you have access to a dataset containing every sale between 2014 and 2017, stored in a single table. For simplicity, every transaction contains only one item.

The columns are the following:
* **transaction_id** - unique integer id for that transaction **(do not assume any ordering!)**
* **transaction_datetime** - date and time of the transaction, encoded as Postgres' timestamp type
* **customer_id** - unique integer id for that customer
* **customer_first_name**
* **customer_last_name**
* **shipping_state**
* **item_id** - unique integer id for that item
* **item_description**
* **item_price**

In [14]:
%%sql
select * from fact_sales limit 5;

Unnamed: 0,transaction_id,transaction_datetime,customer_id,customer_first_name,customer_last_name,shipping_state,item_id,item_description,item_price
0,1,2014-01-01 00:06:40,100000236,Carol,Blake,Missouri,8068199275,Slacks,87.7799988
1,2,2014-01-01 00:29:26,300000427,Dominic,Welch,Michigan,2695940506,Sweater,59.4000015
2,3,2014-01-01 00:53:39,100000197,Blake,Ross,Delaware,1336619171,Slacks,63.8400002
3,4,2014-01-01 01:41:48,100000116,Alison,Johnston,Delaware,5187373849,Short Slip,12.8699999
4,5,2014-01-01 02:31:54,100000007,Dorothy,Thomson,Alabama,8586814459,Gown & Robe Set,65.6399994



__For each of the customers listed below by ID, show their individual cumulative sum over transactions made in 2015.__

Customer IDs: 300000073, 100000381, 100000160

You must provide a single query, containing the following columns:
* transaction_id
* transaction_datetime
* customer_id
* customer_last_name
* item_id
* item_description
* item_price
* cumulative_sum, containing the running total of item_price over time for that customer

In [10]:
%%sql
SELECT
    transaction_id,
    transaction_datetime,
    customer_id,
    customer_last_name,
    item_id,
    item_description,
    item_price,
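    -- ORDER BY makes the running total follow transaction time; without it the row order inside the frame is undefined
    -- (no need to partition by year: the WHERE clause already restricts the rows to 2015)
    sum(item_price) OVER (PARTITION BY customer_id
                          ORDER BY transaction_datetime
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
        as cumulative_sum
FROM fact_sales
WHERE customer_id IN (300000073, 100000381, 100000160)
  AND extract(year from transaction_datetime) = 2015
ORDER BY customer_id, transaction_datetime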



Unnamed: 0,transaction_id,transaction_datetime,customer_id,customer_last_name,item_id,item_description,item_price,cumulative_sum
0,12525,2015-01-03 22:22:30,100000160,Walker,7690258628,Baby Blanket,21.9300003,21.9300003
1,16544,2015-05-01 00:57:43,100000160,Walker,9223468295,Bath Mat,23.8799992,45.8099976
2,16891,2015-05-10 18:39:10,100000160,Walker,3135576553,Underwear,5.88000011,51.6899986
3,18761,2015-07-03 05:31:12,100000160,Walker,3405268556,Handbag,114.769997,166.459991
4,19235,2015-07-16 19:32:50,100000160,Walker,9538578209,Dress,87.3600006,253.819992
5,19806,2015-08-02 10:57:33,100000160,Walker,2113413915,Shorts,59.3300018,313.149994
6,19948,2015-08-07 01:20:17,100000160,Walker,4681342313,Sheets,29.8999996,343.049988
7,20591,2015-08-25 18:07:31,100000160,Walker,2363094138,Short Overalls,71.8199997,414.869995
8,21297,2015-09-15 21:20:41,100000160,Walker,8068199275,Slacks,87.7799988,502.649994
9,22549,2015-10-22 09:08:15,100000160,Walker,3123824581,Sweatpants,43.8899994,546.539978
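
A running total is only well defined when the window specifies an order. The following is a minimal sketch of that point, run against an in-memory SQLite database instead of the course Postgres instance. The table `toy_sales` and its rows are invented for illustration, and the sketch assumes the bundled SQLite is version 3.25 or later (the first release with window functions).

```python
import sqlite3

# Toy rows, inserted deliberately out of chronological order.
# (transaction_id, transaction_datetime, customer_id, item_price)
rows = [
    (1, "2015-03-01 10:00:00", 7, 10.0),
    (2, "2015-01-15 09:00:00", 7, 5.0),
    (3, "2015-02-20 12:00:00", 7, 20.0),
    (4, "2015-01-02 08:00:00", 9, 1.5),
    (5, "2015-06-30 18:00:00", 9, 2.5),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_sales ("
    "transaction_id INTEGER, transaction_datetime TEXT, "
    "customer_id INTEGER, item_price REAL)"
)
conn.executemany("INSERT INTO toy_sales VALUES (?, ?, ?, ?)", rows)

# PARTITION BY restarts the sum for each customer;
# ORDER BY inside OVER fixes the order in which prices are accumulated.
running = conn.execute(
    """
    SELECT customer_id,
           transaction_datetime,
           sum(item_price) OVER (
               PARTITION BY customer_id
               ORDER BY transaction_datetime
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS cumulative_sum
    FROM toy_sales
    ORDER BY customer_id, transaction_datetime
    """
).fetchall()

for row in running:
    print(row)
```

Customer 7's totals come out as 5.0, 25.0, 35.0 in date order, even though the rows were inserted in a different order.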


__What were the top 5 shipping states per year, based on the total value of transactions, for 2015 and 2016?__

You must provide a single query, containing the following columns:
* year, string
* shipping_state, string
* total_sales, containing the total value of transactions
* position, containing the rank of that state among the top 5 for that year.

In [31]:
%%sql
WITH temp AS (
    SELECT DISTINCT
        extract(year from transaction_datetime) as year,
        shipping_state,
        sum(item_price) OVER (PARTITION BY shipping_state, extract(year from transaction_datetime)) as total_sales
    FROM fact_sales
    WHERE extract(year from transaction_datetime) IN (2015,2016)
),
ranked AS (
    SELECT
        year,
        shipping_state,
        total_sales,
        rank() OVER (PARTITION BY year ORDER BY total_sales DESC) as position
    FROM temp
)
-- filter on the rank itself instead of relying on LIMIT 10, so the result does not depend on row count
SELECT
    year,
    shipping_state,
    total_sales,
    position
FROM ranked
WHERE position <= 5
ORDER BY position ASC, year


Unnamed: 0,year,shipping_state,total_sales,position
0,2015,Texas,28772.8223,1
1,2016,Texas,27251.5059,1
2,2015,Massachusetts,27991.8809,2
3,2016,Massachusetts,26252.7773,2
4,2015,Kentucky,22268.4961,3
5,2016,Rhode Island,22217.4824,3
6,2015,Rhode Island,20617.5215,4
7,2016,Kentucky,21055.3145,4
8,2015,Arkansas,17653.7988,5
9,2016,Arkansas,19702.6836,5
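
The "top N per group" pattern can be sketched on toy data as a ranking step followed by a filter on the rank. This is an illustrative example only: the table `toy_totals` and its values are made up, it keeps the top 2 instead of the top 5 to stay short, and it runs on in-memory SQLite (assumed to be 3.25 or later), not on the course database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_totals (year INTEGER, shipping_state TEXT, total_sales REAL)"
)
# Hypothetical yearly totals for three made-up states.
conn.executemany(
    "INSERT INTO toy_totals VALUES (?, ?, ?)",
    [
        (2015, "A", 300.0), (2015, "B", 100.0), (2015, "C", 200.0),
        (2016, "A", 50.0), (2016, "B", 400.0), (2016, "C", 150.0),
    ],
)

# Rank states within each year, then keep only the first two ranks.
# A window function cannot appear in WHERE directly, hence the CTE.
top2 = conn.execute(
    """
    WITH ranked AS (
        SELECT year,
               shipping_state,
               rank() OVER (PARTITION BY year ORDER BY total_sales DESC) AS position
        FROM toy_totals
    )
    SELECT year, shipping_state, position
    FROM ranked
    WHERE position <= 2
    ORDER BY year, position
    """
).fetchall()

print(top2)
# [(2015, 'A', 1), (2015, 'C', 2), (2016, 'B', 1), (2016, 'C', 2)]
```

Note that `rank()` gives tied rows the same rank, so a tie at the cutoff would return more than N rows per group; `row_number()` is the usual alternative when exactly N rows are required.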


__Compute the daily total value of sales from 2014-01-01 to 2014-01-20 (inclusive), along with a running 7-day average and a month-to-date total.__

You must provide a single query, containing the following columns:
* date
* total_sales, containing the total for such date
* sales_7d_avg, containing the running 7d average (i.e. between 6 days before and the current date). Note: the average should be computed even if there's some missing data for the past 6 days (e.g. 2014-01-01).
* month_to_date, containing the cumulative sum from the start of the month to the current date.


In [105]:
%%sql
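WITH temp AS (
SELECT DISTINCT
    cast(transaction_datetime as date) as date,
    sum(item_price) OVER (PARTITION BY cast(transaction_datetime as date)) as total_sales
FROM fact_sales
-- half-open range: keeps all of 2014-01-20 and nothing from 2014-01-21
WHERE transaction_datetime >= '2014-01-01' AND transaction_datetime < '2014-01-21'
        )
SELECT 
    date,
    total_sales,
    -- the window needs its own ORDER BY; an ORDER BY inside the CTE does not guarantee frame order
    -- ROWS counts rows, not calendar days, so this assumes one row per day with no gaps
    avg(total_sales) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as sales_7d_avg,
    -- the whole range lies within January 2014, so no PARTITION BY month is needed here
    sum(total_sales) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as month_to_date
FROM temp
ORDER BY date ASC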

Unnamed: 0,date,total_sales,sales_7d_avg,month_to_date
0,2014-01-01,1870.06982,1870.06982421875,1870.06982
1,2014-01-02,1933.42004,1901.7449340820312,3803.48975
2,2014-01-03,1801.95007,1868.47998046875,5605.43994
3,2014-01-04,1061.10999,1666.6374816894531,6666.5498
4,2014-01-05,2227.86011,1778.8820068359375,8894.41016
5,2014-01-06,2297.97998,1865.3983357747395,11192.3906
6,2014-01-07,1919.49023,1873.1257498604912,13111.8809
7,2014-01-08,1796.0,1862.5443464006696,14907.8809
8,2014-01-09,1036.31006,1734.385777064732,15944.1914
9,2014-01-10,1709.91003,1721.2372000558037,17654.1016
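
To see how a `ROWS BETWEEN n PRECEDING AND CURRENT ROW` frame behaves at the start of the data, here is a small sketch with a 3-row window (instead of 7) over four invented daily totals. The table `toy_daily` and its values are assumptions for illustration, and the code runs on in-memory SQLite (assumed 3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_daily (day TEXT, total_sales REAL)")
# Hypothetical daily totals, one row per day with no gaps.
conn.executemany(
    "INSERT INTO toy_daily VALUES (?, ?)",
    [
        ("2014-01-01", 10.0),
        ("2014-01-02", 20.0),
        ("2014-01-03", 30.0),
        ("2014-01-04", 40.0),
    ],
)

rolling = conn.execute(
    """
    SELECT day,
           -- average over the current row and up to 2 rows before it
           avg(total_sales) OVER (
               ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS sales_3d_avg,
           -- running sum from the first row to the current row
           sum(total_sales) OVER (
               ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS month_to_date
    FROM toy_daily
    ORDER BY day
    """
).fetchall()

for row in rolling:
    print(row)
# ('2014-01-01', 10.0, 10.0)
# ('2014-01-02', 15.0, 30.0)
# ('2014-01-03', 20.0, 60.0)
# ('2014-01-04', 30.0, 100.0)
```

The first two rows average over fewer than three values because the frame simply shrinks when there are not enough preceding rows. A `ROWS` frame counts rows, so if a calendar day were missing from the table the window would reach further back in time than intended.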


__Compute the yearly total value of sales for New York, Massachusetts and Michigan, along with % variation in relation to the previous year.__

You must provide a single query, containing the following columns:
* shipping_state
* year
* total_sales, containing the total for such state and year
* variation, containing the % variation in relation to the previous year, i.e. if the yearly total for a state has doubled, this field would contain a value of 100.0. For the first year, this field should be NULL.

In [106]:
%%sql
WITH temp AS (
    SELECT DISTINCT
        shipping_state,
        extract(year from transaction_datetime) as year,
        sum(item_price) OVER (PARTITION BY shipping_state, extract(year from transaction_datetime)) as total_sales  
    FROM fact_sales
    WHERE shipping_state IN ('New York','Massachusetts','Michigan')
            )
    SELECT 
        shipping_state,
        year,
        total_sales,
        -- lag needs an ORDER BY inside the window to define which row is the previous year
        ((total_sales - lag(total_sales,1) OVER w) / lag(total_sales,1) OVER w) * 100 as variation
    FROM temp
    WINDOW w AS (PARTITION BY shipping_state ORDER BY year)
    ORDER BY shipping_state, year



Unnamed: 0,shipping_state,year,total_sales,variation
0,Massachusetts,2014,26338.0801,
1,Massachusetts,2015,27991.9004,6.279198080301285
2,Massachusetts,2016,26252.7949,-6.212888285517693
3,Massachusetts,2017,24987.0293,-4.821451008319856
4,Michigan,2014,13891.4531,
5,Michigan,2015,13060.791,-5.979663133621216
6,Michigan,2016,14520.0176,11.172574013471603
7,Michigan,2017,14958.4912,3.019787184894085
8,New York,2014,11459.1533,
9,New York,2015,13970.4395,21.9151109457016
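
The year-over-year variation is (current - previous) / previous * 100, with `lag` supplying the previous value. A minimal sketch on invented numbers follows. The table `toy_yearly`, its states and its totals are hypothetical, and the code runs on in-memory SQLite (assumed 3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_yearly (shipping_state TEXT, year INTEGER, total_sales REAL)"
)
# Hypothetical yearly totals for two made-up states.
conn.executemany(
    "INSERT INTO toy_yearly VALUES (?, ?, ?)",
    [
        ("A", 2014, 100.0), ("A", 2015, 200.0), ("A", 2016, 150.0),
        ("B", 2014, 80.0), ("B", 2015, 100.0),
    ],
)

variation = conn.execute(
    """
    WITH with_prev AS (
        SELECT shipping_state,
               year,
               total_sales,
               -- previous year's total within the same state; NULL for the first year
               lag(total_sales) OVER (
                   PARTITION BY shipping_state ORDER BY year
               ) AS prev_sales
        FROM toy_yearly
    )
    SELECT shipping_state,
           year,
           (total_sales - prev_sales) / prev_sales * 100 AS variation
    FROM with_prev
    ORDER BY shipping_state, year
    """
).fetchall()

for row in variation:
    print(row)
# ('A', 2014, None)
# ('A', 2015, 100.0)
# ('A', 2016, -25.0)
# ('B', 2014, None)
# ('B', 2015, 25.0)
```

Because arithmetic with NULL yields NULL, the first year of each state gets a NULL variation without any special casing.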


__The store has decided to reward customers who spent at least $2500 in a single year. Show which customer(s) will receive an award, along with the datetime of the transaction that took each of them past the minimum value needed to get the prize.__

You must provide a single query, containing the following columns:
* year
* customer_id
* award_transaction_datetime


In [104]:
%%sql
WITH temp AS (
SELECT
    transaction_datetime,
    extract(year from transaction_datetime) as year,
    customer_id,
    -- running total per customer and year, accumulated in transaction order
    sum(item_price) OVER (PARTITION BY customer_id, extract(year from transaction_datetime)
                          ORDER BY transaction_datetime
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as cumulative_sum
FROM fact_sales
)
-- the award transaction is the earliest one at which the running total reaches 2500
SELECT 
    year,
    customer_id,
    min(transaction_datetime) as award_transaction_datetime
FROM temp
WHERE cumulative_sum >= 2500
GROUP BY year, customer_id

Unnamed: 0,year,customer_id,award_transaction_datetime
0,2015,100000885,2015-12-09 23:48:33
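
Finding the transaction that crosses a spending threshold comes down to a running total plus "earliest row where the total is at or above the threshold". The sketch below shows this on invented data with a threshold of 25 instead of 2500. The table `toy_purchases`, its rows and the threshold are assumptions for illustration, all rows fall in a single year so the year is left out, and the code runs on in-memory SQLite (assumed 3.25 or later).

```python
import sqlite3

THRESHOLD = 25.0  # hypothetical award threshold for the toy data

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_purchases ("
    "customer_id INTEGER, transaction_datetime TEXT, item_price REAL)"
)
conn.executemany(
    "INSERT INTO toy_purchases VALUES (?, ?, ?)",
    [
        (1, "2015-01-10", 10.0),  # running total 10
        (1, "2015-02-01", 20.0),  # running total 30, crosses the threshold
        (1, "2015-03-05", 30.0),  # running total 60
        (2, "2015-01-20", 5.0),   # running total 5
        (2, "2015-04-01", 5.0),   # running total 10, never crosses
    ],
)

awards = conn.execute(
    """
    WITH running AS (
        SELECT customer_id,
               transaction_datetime,
               sum(item_price) OVER (
                   PARTITION BY customer_id
                   ORDER BY transaction_datetime
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
               ) AS cumulative_sum
        FROM toy_purchases
    )
    -- earliest transaction at which the running total reaches the threshold
    SELECT customer_id, min(transaction_datetime) AS award_transaction_datetime
    FROM running
    WHERE cumulative_sum >= ?
    GROUP BY customer_id
    ORDER BY customer_id
    """,
    (THRESHOLD,),
).fetchall()

print(awards)
# [(1, '2015-02-01')]
```

Customer 2 never reaches the threshold, so no row is returned for them.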
