**Querying PostgreSQL in a Jupyter notebook**

Useful for writing notes and iterating on SQL queries. The "hard" examples further down show how a query can be broken into smaller parts and then combined into a more complicated query.
-Ben

In [2]:
import numpy as np  # used by the data-generation helpers further down
import pandas as pd
import sqlalchemy
import sqlalchemy_utils
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database
import psycopg2

In [2]:
# Alise's database
# Working with PostgreSQL in Python
# Connect to make queries using psycopg2
con = None
con = psycopg2.connect(host = 'ec2-54-245-31-214.us-west-2.compute.amazonaws.com',
                       port = '5291',
                       database = 'ecommerce',
                       user = 'sqlpractice',
                       password = 'iloveSQL!')

In [3]:
# Define a database name
# Set your postgres username
dbname = "baseball"
username = "lacar"  # change this to your username

# Working with PostgreSQL in Python
# Connect to make queries using psycopg2
con = None
con = psycopg2.connect(database=dbname, user=username)

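# Here, we're using postgres, but sqlalchemy can connect to other things too.
# Note: SQLAlchemy 1.4+ only accepts the "postgresql://" scheme; "postgres://" works on older versions.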
engine = create_engine("postgres://%s@localhost/%s" % (username, dbname))
print(engine.url)

postgres://lacar@localhost/baseball


# Notes

## Order of execution

1. FROM /JOIN (subqueries)
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT
6. DISTINCT
7. ORDER BY
8. LIMIT/OFFSET

It can be summarized as:
**build the working set -> filter rows -> aggregate -> filter groups -> select -> sort and limit**

<br>
- The query first builds a working dataset from FROM/JOIN, then WHERE removes the rows that fail its conditions
<br>
- Column aliases defined in SELECT are only visible to clauses that run after SELECT, such as ORDER BY (table aliases can be used anywhere). PostgreSQL makes an exception for GROUP BY, which also accepts SELECT column aliases
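The difference between WHERE (step 2) and HAVING (step 4) is the easiest way to see the execution order in action. The sketch below is a minimal illustration that uses Python's built-in `sqlite3` with an in-memory database instead of the notebook's PostgreSQL connection; the `orders` table and its values are made up for the example.

```python
import sqlite3

# Hypothetical toy table, not one of the notebook's tables
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 5), ("ann", 50), ("bob", 20), ("bob", 30), ("cat", 5)],
)

# WHERE runs before GROUP BY: rows with amount < 10 are dropped first,
# so they never reach SUM()
where_rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount >= 10
    GROUP BY customer
    ORDER BY customer
    """
).fetchall()

# HAVING runs after GROUP BY: every row is summed, then whole groups
# are kept or dropped based on the aggregate
having_rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) >= 10
    ORDER BY customer
    """
).fetchall()

print(where_rows)   # ann's 5 is excluded from her total
print(having_rows)  # ann's 5 is included; cat's group is dropped
```

The same two queries should behave the same way in PostgreSQL, since neither relies on SQLite-specific syntax.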

## Window functions

- Window functions are only allowed in the SELECT list and in ORDER BY
    - Several kinds of functions can be used as window functions, including aggregates (SUM, AVG, COUNT) and ranking functions
    
- Common window functions
    - RANK() - tied rows share a rank, and the rank values that follow are skipped
    - DENSE_RANK() - like RANK(), but with no gaps in rank values
    - ROW_NUMBER() - assigns a unique sequential integer to each row within a partition of the result set, starting at 1
    - NTILE(n) - splits the ordered rows into n buckets, e.g. quartiles or percentiles. **PostgreSQL has no MEDIAN(); NTILE(2) only approximates it, while PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ...) returns it directly**
    - LAG() - pulls a value from a preceding row, to compare a row with the one before it **important for some questions**
    - LEAD() - pulls a value from a following row, to compare a row with the one after it **important for some questions**
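To make the differences between the ranking functions concrete, here is a small sketch on a table with one tie. It runs on Python's built-in `sqlite3` (window functions need SQLite 3.25 or newer) rather than on the notebook's PostgreSQL database, and the `scores` table is invented for the example.

```python
import sqlite3

# Hypothetical toy table with a tie between players a and b
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, pts INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 30), ("b", 30), ("c", 20), ("d", 10)],
)

rows = conn.execute(
    """
    SELECT player,
           pts,
           RANK()       OVER (ORDER BY pts DESC)         AS rnk,
           DENSE_RANK() OVER (ORDER BY pts DESC)         AS dense_rnk,
           ROW_NUMBER() OVER (ORDER BY pts DESC, player) AS row_num,
           LAG(pts)     OVER (ORDER BY pts DESC, player) AS prev_pts
    FROM scores
    ORDER BY pts DESC, player
    """
).fetchall()

for row in rows:
    print(row)
# RANK jumps from 1 to 3 after the tie, DENSE_RANK goes 1, 1, 2, 3,
# ROW_NUMBER is always unique, and LAG is NULL (None) on the first row
```

The same window expressions are valid PostgreSQL, so the query can be pasted into a `sql_query` string and run through `pd.read_sql_query` as in the other cells.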
    

## Optimizing SQL queries

- Query runtime can be affected by multiple factors, including:
    - table size
    - joins
    - aggregations
    
- The amount of data and the desired output determine both the runtime and which optimizations are worth applying

- Optimizing query runtime can be done by implementing practices such as:
    - EXPLAIN - place it before any query to see the planner's execution plan and cost estimates; EXPLAIN ANALYZE also runs the query and reports actual timings
    - Filtering data with WHERE or LIMIT and selecting only the columns you need
    - Aggregating tables before joining them
    - Breaking a query into multiple queries
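As an illustration of "aggregate before joining", the sketch below writes the same per-country event count two ways and checks that the results match. It is a toy setup on Python's built-in `sqlite3`; the `users` and `events` tables are hypothetical, and whether the second form is actually faster depends on the data and the planner, which is what EXPLAIN is for.

```python
import sqlite3

# Hypothetical toy tables, loosely modeled on a users/events schema
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER, country TEXT);
    CREATE TABLE events (user_id INTEGER, action TEXT);
    INSERT INTO users VALUES (1, 'US'), (2, 'US'), (3, 'Canada');
    INSERT INTO events VALUES
        (1, 'view'), (1, 'view'), (2, 'view'), (3, 'view'), (3, 'click');
    """
)

# Join every event row to its user, then aggregate
join_then_aggregate = """
SELECT u.country, COUNT(*) AS n_events
FROM users AS u
JOIN events AS e ON u.user_id = e.user_id
GROUP BY u.country
ORDER BY u.country
"""

# Collapse events to one row per user first, so the join handles fewer rows
aggregate_then_join = """
SELECT u.country, SUM(e.n_events) AS n_events
FROM users AS u
JOIN (SELECT user_id, COUNT(*) AS n_events
      FROM events
      GROUP BY user_id) AS e
  ON u.user_id = e.user_id
GROUP BY u.country
ORDER BY u.country
"""

result_a = conn.execute(join_then_aggregate).fetchall()
result_b = conn.execute(aggregate_then_join).fetchall()
print(result_a)
print(result_b)
```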


### Other topics to study

- Self joins
- Cross joins
- Data types
- DATES/DATETIME
- Built-in SQL functions
    - ROUND()
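    - CAST() - converts a value to another type, e.g. an integer to a numeric so that division is not truncated (the `::decimal` casts in the queries below do the same thing)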
- Creating and updating tables
- Pivoting tables with CASE statements
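Two of the topics above, pivoting with CASE and casting before division, fit in one short example. This is a sketch on Python's built-in `sqlite3` with a made-up `actions` table; in PostgreSQL the cast could also be written as `::numeric`, as in the later cells.

```python
import sqlite3

# Hypothetical long-format table: one row per (user, action)
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE actions (user_id INTEGER, action TEXT);
    INSERT INTO actions VALUES
        (1, 'viewed'), (1, 'viewed'), (1, 'clicked'), (2, 'viewed');
    """
)

rows = conn.execute(
    """
    SELECT user_id,
           -- Pivot: one output column per action value
           SUM(CASE WHEN action = 'viewed'  THEN 1 ELSE 0 END) AS n_viewed,
           SUM(CASE WHEN action = 'clicked' THEN 1 ELSE 0 END) AS n_clicked,
           -- Integer / integer truncates toward zero
           SUM(CASE WHEN action = 'clicked' THEN 1 ELSE 0 END)
               / COUNT(*) AS int_ratio,
           -- Casting one side first gives the fractional result
           CAST(SUM(CASE WHEN action = 'clicked' THEN 1 ELSE 0 END) AS REAL)
               / COUNT(*) AS real_ratio
    FROM actions
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()

for row in rows:
    print(row)
```

For user 1 the integer ratio comes out as 0 while the cast version is one third, which is the reason for the casts in the proportion queries.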

**Wanted names so I edited query for the following**

### Creating tables for validation

In [70]:
user_id = [123, 123, 456, 456]
action = ['start', 'cancel', 'start', 'publish']
timestamp = ['2-14-20 3:05pm', '2-14-20 3:06pm', '2-15-20 5:46pm', '2-15-20 5:50pm']

composer = pd.DataFrame([user_id, action, timestamp]).T
composer.columns = ['user_id', 'action', 'timestamp']

# Temp table created here that I'll just over-write with each new problem
composer.to_sql('temp_table', engine, if_exists='replace')

In [71]:
# Overview of table
sql_query = """
SELECT *
FROM temp_table;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,index,user_id,action,timestamp
0,0,123,start,2-14-20 3:05pm
1,1,123,cancel,2-14-20 3:06pm
2,2,456,start,2-15-20 5:46pm
3,3,456,publish,2-15-20 5:50pm


# Edu study SQL

In [2]:
# Define a database name
# Set your postgres username
dbname = "postgres"
username = "lacar"  # change this to your username

# Working with PostgreSQL in Python
# Connect to make queries using psycopg2
con = None
con = psycopg2.connect(database=dbname, user=username)

# Here, we're using postgres, but sqlalchemy can connect to other things too.
engine = create_engine("postgres://%s@localhost/%s" % (username, dbname))
print(engine.url)

postgres://lacar@localhost/postgres


In [3]:
# df_state_info_gs_census = pd.read_csv('df_state_info_gs_census_ALL.csv', index_col=0)
# df_state_info_gs_census = pd.read_csv(
#     "~/Documents/Goals_and_careers/Edu_Data_Science/Insight/edu_project_for_interview/df_state_info_gs_census_ALL.csv"
# )

#df_state_info_gs_census.iloc[:, 0:107].to_sql('edu', engine, if_exists='replace')

#[print(i, col) for i, col in enumerate(df_state_info_gs_census)]

In [5]:
sql_query = """
SELECT *
FROM edu
LIMIT 5
"""

df_query = pd.read_sql_query(sql_query, con)
print(df_query)

# df_query.to_sql('statcast_simple', engine, if_exists='replace')

   index  Unnamed: 0        CDSCode NCESDist NCESSchool StatusType   County  \
0      0           0  1100170112607  0691051      10947     Active  Alameda   
1      1           1  1611190130229  0601770      00041     Active  Alameda   
2      2           2  1611190130625  0601770      08674     Active  Alameda   
3      3          23  1612590108944  0628050      10726     Active  Alameda   
4      4           3  1611270130450  0601860      00059     Active  Alameda   

                             District                                  School  \
0  Alameda County Office of Education  Envision Academy for Arts & Technology   
1                     Alameda Unified                            Alameda High   
2                     Alameda Unified                  Alternatives in Action   
3                     Oakland Unified       Lighthouse Community Charter High   
4                 Albany City Unified                             Albany High   

                    Street  ... test_s

In [14]:
sql_query = """
SELECT DISTINCT "County"
FROM edu;
"""

df_query = pd.read_sql_query(sql_query, con)
print(df_query)

# df_query.to_sql('statcast_simple', engine, if_exists='replace')

             County
0            Madera
1             Butte
2            Orange
3        Stanislaus
4      Contra Costa
5        Santa Cruz
6            Nevada
7            Fresno
8              Napa
9             Modoc
10           Lassen
11    San Francisco
12         Humboldt
13      San Joaquin
14           Sutter
15            Glenn
16             Kern
17           Colusa
18           Amador
19           Solano
20      Santa Clara
21         Tuolumne
22             Yolo
23        San Diego
24           Placer
25   San Bernardino
26          Ventura
27        Calaveras
28         Siskiyou
29         Monterey
30           Shasta
31       San Benito
32        Mendocino
33      Los Angeles
34           Tehama
35             Inyo
36           Sonoma
37        El Dorado
38  San Luis Obispo
39    Santa Barbara
40           Sierra
41             Mono
42          Alameda
43             Yuba
44        San Mateo
45        Riverside
46           Tulare
47           Merced
48         Mariposa


In [16]:
# Show the number of students in each county
sql_query = """
SELECT "County",
        SUM("enrollment") AS n_students
FROM edu
GROUP BY "County"
ORDER BY n_students DESC;
"""

df_query = pd.read_sql_query(sql_query, con)
print(df_query)

# df_query.to_sql('statcast_simple', engine, if_exists='replace')

             County  n_students
0       Los Angeles    374767.0
1            Orange    136512.0
2         San Diego    128139.0
3    San Bernardino    113710.0
4         Riverside    112294.0
5       Santa Clara     76437.0
6        Sacramento     57697.0
7           Alameda     49637.0
8            Fresno     47767.0
9           Ventura     41478.0
10      San Joaquin     34802.0
11     Contra Costa     34266.0
12           Tulare     31111.0
13       Stanislaus     29717.0
14        San Mateo     25774.0
15         Monterey     19557.0
16             Kern     19487.0
17           Placer     18734.0
18    Santa Barbara     18572.0
19           Sonoma     17986.0
20           Solano     17813.0
21    San Francisco     15526.0
22           Merced     12552.0
23  San Luis Obispo     12382.0
24         Imperial     10912.0
25       Santa Cruz     10039.0
26            Butte      8891.0
27           Madera      8045.0
28             Yolo      7979.0
29            Marin      7507.0
30      

In [20]:
# Show the number of students in each district
sql_query = """
SELECT "District",
        SUM("enrollment") AS n_students
FROM edu
GROUP BY "District"
HAVING SUM("enrollment") != 'NaN'
ORDER BY n_students DESC;
"""

df_query = pd.read_sql_query(sql_query, con)
print(df_query)

# df_query.to_sql('statcast_simple', engine, if_exists='replace')

                           District  n_students
0               Los Angeles Unified    128582.0
1                 San Diego Unified     30414.0
2             Sweetwater Union High     27646.0
3              East Side Union High     25770.0
4          Chaffey Joint Union High     23683.0
..                              ...         ...
428  Southern Trinity Joint Unified        30.0
429          Leggett Valley Unified        23.0
430            Owens Valley Unified        22.0
431            Death Valley Unified        17.0
432                 Mattole Unified         9.0

[433 rows x 2 columns]


In [22]:
# Show the number of schools in each district
sql_query = """
SELECT "District",
        COUNT(DISTINCT("School")) AS n_schools
FROM edu
GROUP BY "District"
ORDER BY n_schools DESC;
"""

df_query = pd.read_sql_query(sql_query, con)
print(df_query)

# df_query.to_sql('statcast_simple', engine, if_exists='replace')

                    District  n_schools
0        Los Angeles Unified        125
1          San Diego Unified         24
2       East Side Union High         20
3      San Francisco Unified         17
4      Sweetwater Union High         13
..                       ...        ...
434        Tehachapi Unified          1
435      Temple City Unified          1
436        Templeton Unified          1
437   Tracy Joint Union High          1
438  Tranquillity Union High          1

[439 rows x 2 columns]


In [38]:
# Show the proportion of schools in the district 
# that have percent of low income students > 50 
# when the number of schools in the district is at least 5

# | district | n_schools | n_high_LI | pct_high_LI |

sql_query = """
SELECT "District",
       COUNT("School") AS n_schools
FROM edu
GROUP BY "District"
ORDER BY "District";
"""

df_query = pd.read_sql_query(sql_query, con)
df_query

Unnamed: 0,District,n_schools
0,ABC Unified,3
1,Acalanes Union High,4
2,Acton-Agua Dulce Unified,2
3,Alameda County Office of Education,1
4,Alameda Unified,2
...,...,...
434,Woodland Joint Unified,2
435,Yosemite Unified,2
436,Yreka Union High,1
437,Yuba City Unified,2


In [40]:
# Show the proportion of schools in the district 
# that have percent of low income students > 50 
# when the number of schools in the district is at least 5

# | district | n_schools | n_high_LI | p_high_LI |

sql_query = """

SELECT t1."District",
       t1.n_schools,
       t1.n_high_LI,
       t1.n_high_LI::decimal / t1.n_schools AS p_high_LI
FROM
    (SELECT "District",
           COUNT(DISTINCT "School") AS n_schools,
           SUM(CASE WHEN "pct_LI_students" > 50 THEN 1 ELSE 0 END) AS n_high_LI
    FROM edu
    GROUP BY "District"
    ORDER BY "District") AS t1
WHERE t1.n_schools >= 5
ORDER BY p_high_LI DESC;
"""

df_query = pd.read_sql_query(sql_query, con)
df_query

Unnamed: 0,District,n_schools,n_high_li,p_high_li
0,Stockton Unified,6,6,1.0
1,Fontana Unified,5,5,1.0
2,Victor Valley Union High,6,6,1.0
3,Porterville Unified,6,6,1.0
4,Los Angeles Unified,125,117,0.936
5,Sacramento City Unified,10,9,0.9
6,San Bernardino City Unified,8,7,0.875
7,Pomona Unified,7,6,0.857143
8,Santa Ana Unified,6,5,0.833333
9,Tulare Joint Union High,5,4,0.8


In [42]:
# Show the proportion of schools in the district 
# that have percent of low income students > 50 
# when the number of schools in the district is at least 5

# | district | n_schools | n_high_LI | p_high_LI |

sql_query = """

SELECT t1."District",
       t1.n_schools,
       t1.n_high_LI,
       t1.n_high_LI::decimal / t1.n_schools AS p_high_LI
FROM
    (SELECT "District",
           COUNT(DISTINCT "School") AS n_schools,
           COUNT(CASE WHEN "pct_LI_students" > 50 THEN 1 ELSE NULL END) AS n_high_LI
    FROM edu
    GROUP BY "District"
    ORDER BY "District") AS t1
WHERE t1.n_schools >= 5
ORDER BY p_high_LI DESC;
"""

df_query = pd.read_sql_query(sql_query, con)
df_query

Unnamed: 0,District,n_schools,n_high_li,p_high_li
0,Stockton Unified,6,6,1.0
1,Fontana Unified,5,5,1.0
2,Victor Valley Union High,6,6,1.0
3,Porterville Unified,6,6,1.0
4,Los Angeles Unified,125,117,0.936
5,Sacramento City Unified,10,9,0.9
6,San Bernardino City Unified,8,7,0.875
7,Pomona Unified,7,6,0.857143
8,Santa Ana Unified,6,5,0.833333
9,Tulare Joint Union High,5,4,0.8


In [43]:
# WRONG APPLICATION  - count with 0
# Show the proportion of schools in the district 
# that have percent of low income students > 50 
# when the number of schools in the district is at least 5

# | district | n_schools | n_high_LI | p_high_LI |

sql_query = """

SELECT t1."District",
       t1.n_schools,
       t1.n_high_LI,
       t1.n_high_LI::decimal / t1.n_schools AS p_high_LI
FROM
    (SELECT "District",
           COUNT(DISTINCT "School") AS n_schools,
           COUNT(CASE WHEN "pct_LI_students" > 50 THEN 1 ELSE 0 END) AS n_high_LI
    FROM edu
    GROUP BY "District"
    ORDER BY "District") AS t1
WHERE t1.n_schools >= 5
ORDER BY p_high_LI DESC;
"""

df_query = pd.read_sql_query(sql_query, con)
df_query

Unnamed: 0,District,n_schools,n_high_li,p_high_li
0,Anaheim Union High,8,8,1.0
1,Antelope Valley Union High,10,10,1.0
2,Campbell Union High,5,5,1.0
3,Capistrano Unified,6,6,1.0
4,Chaffey Joint Union High,8,8,1.0
5,Clovis Unified,6,6,1.0
6,Corona-Norco Unified,5,5,1.0
7,East Side Union High,20,20,1.0
8,Elk Grove Unified,10,10,1.0
9,Escondido Union High,7,7,1.0


In [50]:
# Show the proportion of schools in the district 
# that have high SPLICE ( > 60)
# when the number of schools in the district is at least 5

# | district | n_schools | n_high_SPLICE | p_high_SPLICE |

sql_query = """

SELECT t1."District",
       t1.n_schools,
       t1.n_high_splice,
       t1.n_high_splice::numeric / t1.n_schools AS p_high_SPLICE
FROM
(SELECT "District",
       COUNT("School") AS n_schools,
       SUM(CASE WHEN "graduation_rates_UCCSU_eligibility_LIstudents">60 THEN 1 ELSE 0 END) AS n_high_SPLICE
FROM edu
GROUP BY "District") AS t1
WHERE t1.n_schools >=5
ORDER BY p_high_splice DESC;
"""

df_query = pd.read_sql_query(sql_query, con)
df_query

Unnamed: 0,District,n_schools,n_high_splice,p_high_splice
0,San Diego Unified,24,18,0.75
1,Poway Unified,5,3,0.6
2,San Francisco Unified,17,10,0.588235
3,Los Angeles Unified,125,64,0.512
4,Oakland Unified,12,6,0.5
5,Fremont Union High,5,2,0.4
6,Fullerton Joint Union High,5,2,0.4
7,Roseville Joint Union High,5,2,0.4
8,Santa Ana Unified,6,2,0.333333
9,Sacramento City Unified,10,3,0.3


In [51]:
# Show columns

sql_query = """

SELECT *
FROM edu
LIMIT 2;
"""

df_query = pd.read_sql_query(sql_query, con)
df_query

Unnamed: 0.1,index,Unnamed: 0,CDSCode,NCESDist,NCESSchool,StatusType,County,District,School,Street,...,test_scores_math_LIstudents,test_scores_math_LIstudents_n_students,graduation_rates_UCCSU_eligibility_allStudents,graduation_rates_UCCSU_eligibility_LIstudents,graduation_rates_gradRates_eligibility_allStudents,graduation_rates_gradRates_eligibility_LIstudents,pct_LI_students,schoolname4merge,zip_code_GS,Id
0,0,0,1100170112607,691051,10947,Active,Alameda,Alameda County Office of Education,Envision Academy for Arts & Technology,1515 Webster Street,...,18.0,77.0,100.0,100.0,82.0,83.0,67.0,Envision Academy for Arts & Technology,94612.0,8600000US94612
1,1,1,1611190130229,601770,41,Active,Alameda,Alameda Unified,Alameda High,2200 Central Ave,...,49.0,104.0,66.0,50.0,95.0,94.0,18.0,Alameda High,94501.0,8600000US94501


# Insight paired mock interview problems

## SQL question 1
Calculate the distance between every pair of points, then display the minimum one.

### Create a temporary table

In [159]:
# Create temporary tables only for the purpose of testing the queries
# The WITH line is creating the table

sql_query = """
WITH  point (x) AS (VALUES (-1), (0), (2))
SELECT * FROM point;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,x
0,-1
1,0
2,2


In [160]:
# Solution provided
sql_query = """
WITH  point (x) AS (VALUES (-1), (0), (2))

SELECT
    p1.x, p2.x, ABS(p1.x - p2.x) AS distance
FROM
    point p1
        JOIN
    point p2 ON p1.x != p2.x;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,x,x.1,distance
0,-1,0,1
1,-1,2,3
2,0,-1,1
3,0,2,2
4,2,-1,3
5,2,0,2


### If there was an id field

In [111]:
sql_query = """
WITH  point (id, x) AS (VALUES (1, -1), (2, 0), (3, 2))

SELECT * FROM point;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,x
0,1,-1
1,2,0
2,3,2


In [115]:
# Use the id to join on the next row and create a difference field

sql_query = """
WITH  point (id, x) AS (VALUES (1, -1), (2, 0), (3, 2))

SELECT p1.x, p2.x, (p2.x-p1.x) AS diff
FROM point AS p1
JOIN point AS p2
ON p1.id=p2.id-1;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,x,x.1,diff
0,-1,0,1
1,0,2,2


In [116]:
# Finalize query - select minimum difference

sql_query = """
WITH  point (id, x) AS (VALUES (1, -1), (2, 0), (3, 2))

SELECT MIN(p2.x-p1.x) AS shortest_distance
FROM point AS p1
JOIN point AS p2
ON p1.id=p2.id-1;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,shortest_distance
0,1
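
One caveat about the id-based join: it only compares rows with consecutive ids, so it gives the right answer here because the ids happen to follow the sort order of x. The sketch below, on Python's built-in `sqlite3` with the same three points in a different id order (an assumption made for the example), compares it with a self-join on `p1.x < p2.x`, which does not depend on ids.

```python
import sqlite3

# Same x values as above, but ids no longer follow the order of x
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE point (id INTEGER, x INTEGER);
    INSERT INTO point VALUES (1, 2), (2, -1), (3, 0);
    """
)

# Join on consecutive ids: breaks when ids are not sorted by x
by_id = conn.execute(
    """
    SELECT MIN(p2.x - p1.x)
    FROM point AS p1
    JOIN point AS p2 ON p1.id = p2.id - 1
    """
).fetchone()[0]

# Join each point to every larger point: the difference is always positive
by_value = conn.execute(
    """
    SELECT MIN(p2.x - p1.x)
    FROM point AS p1
    JOIN point AS p2 ON p1.x < p2.x
    """
).fetchone()[0]

print(by_id)     # negative, not a valid distance
print(by_value)  # shortest distance between any two points
```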


## SQL question 2

In social networks like Facebook or Twitter, people send friend requests and accept others' requests as well.

Table `request_accepted`:

| requester_id | accepter_id | accept_date |
|--------------|-------------|-------------|
| 1            | 2           | 2016-06-03  |
| 1            | 3           | 2016-06-08  |
| 2            | 3           | 2016-06-08  |
| 3            | 4           | 2016-06-09  |

This table holds the data of friend acceptance; requester_id and accepter_id are both the id of a person.

Write a query to find the person who has the most friends and that number of friends, under the following rules:
- It is guaranteed that only 1 person has the most friends.
- A friend request can only be accepted once, which means there are no multiple records with the same requester_id and accepter_id values.

For the sample data above, the result is:

| id | num |
|----|-----|
| 3  | 3   |

The person with id '3' is a friend of people '1', '2' and '4', so he has 3 friends in total, which is more than anyone else.


In [119]:
# Create temporary tables only for the purpose of testing the queries
# The WITH line is creating the request_accepted table

sql_query = """
WITH  request_accepted (requester_id, accepter_id) AS (VALUES (1,2), (1,3), (2,3), (3,4))

SELECT * FROM request_accepted;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,requester_id,accepter_id
0,1,2
1,1,3
2,2,3
3,3,4


### Strategy 1: Try solution's approach with UNION
- Create two tables, one for requester_id, one for accepter_id
- Then stack them vertically with UNION (like an rbind) and select off that

In [155]:
# Just see how the UNION works

sql_query = """
WITH  request_accepted (requester_id, accepter_id) AS (VALUES (1,2), (1,3), (2,3), (3,4))

(SELECT requester_id, COUNT(*) 
FROM request_accepted
GROUP BY requester_id)
UNION
(SELECT accepter_id, COUNT(*) 
FROM request_accepted
GROUP BY accepter_id)
;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,requester_id,count
0,4,1
1,3,1
2,2,1
3,1,2
4,3,2


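NOTE that UNION doesn't guarantee row order, so the requester_id and accepter_id rows come back mixed together.
UNION also removes duplicate rows: id 2 has one row as requester and one as accepter, both (2, 1), and only one of them survives above. UNION ALL keeps both.
The output column takes its name from the first SELECT (requester_id); it is aliased below to avoid confusion.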

In [129]:
# Turn the UNION into one big table and select off that

sql_query = """
WITH  request_accepted (requester_id, accepter_id) AS (VALUES (1,2), (1,3), (2,3), (3,4))

SELECT big_table.id, SUM(big_table.total) AS total_friends
FROM
    ((SELECT requester_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY requester_id)
    UNION ALL  -- plain UNION would drop the duplicate (2, 1) row
    (SELECT accepter_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY accepter_id)) AS big_table
GROUP BY big_table.id
ORDER BY total_friends DESC
LIMIT 1;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,total_friends
0,3,3.0


### Strategy 2: Try doing a JOIN between the two tables
- Create two tables, one for requester_id, one for accepter_id
- Then combine them side by side with OUTER JOIN (like cbind), sum, and select off that

In [140]:
# See if the JOIN works

sql_query = """
WITH  request_accepted (requester_id, accepter_id) AS (VALUES (1,2), (1,3), (2,3), (3,4))

SELECT r_table.id, r_table.total
FROM
    (SELECT requester_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY requester_id) AS r_table
JOIN
    (SELECT accepter_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY accepter_id) AS a_table
ON r_table.id=a_table.id;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,total
0,2,1
1,3,1


The inner join keeps only ids 2 and 3, which appear on both sides; ids 1 and 4 are dropped.

In [156]:
# Try with an OUTER JOIN and do sum

sql_query = """
WITH  request_accepted (requester_id, accepter_id) AS (VALUES (1,2), (1,3), (2,3), (3,4))

SELECT id, r_table.total, a_table.total
FROM
    (SELECT requester_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY requester_id) AS r_table
FULL OUTER JOIN
    (SELECT accepter_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY accepter_id) AS a_table
ON r_table.id=a_table.id;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

DatabaseError: Execution failed on sql '
WITH  request_accepted (requester_id, accepter_id) AS (VALUES (1,2), (1,3), (2,3), (3,4))

SELECT id, r_table.total, a_table.total
FROM
    (SELECT requester_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY requester_id) AS r_table
FULL OUTER JOIN
    (SELECT accepter_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY accepter_id) AS a_table
ON r_table.id=a_table.id;
': column reference "id" is ambiguous
LINE 4: SELECT id, r_table.total, a_table.total
               ^


The id field is treated as ambiguous because both subqueries expose a column named id; it has to be qualified with a table alias.

In [149]:
# Need an OUTER JOIN and do sum

sql_query = """
WITH  request_accepted (requester_id, accepter_id) AS (VALUES (1,2), (1,3), (2,3), (3,4))

SELECT r_table.id, r_table.total, a_table.id, a_table.total
FROM
    (SELECT requester_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY requester_id) AS r_table
FULL OUTER JOIN
    (SELECT accepter_id AS id, COUNT(*) AS total
    FROM request_accepted
    GROUP BY accepter_id) AS a_table
ON r_table.id=a_table.id;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,total,id.1,total.1
0,1.0,2.0,,
1,2.0,1.0,2.0,1.0
2,3.0,1.0,3.0,2.0
3,,,4.0,1.0


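The outer join now returns every id, but each id is split across two columns with NULLs on the unmatched side. Selecting COALESCE(r_table.id, a_table.id) AS id together with COALESCE(r_table.total, 0) + COALESCE(a_table.total, 0) would give a single id column and the friend total.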

## Coding question 1

In [184]:
def bubble_sort_algo(arr):
    '''
    input: an array of size n
    output: sorted array
    No use of sort function
    '''
    for j in range(len(arr)-1):
        for i in range(len(arr)-j-1):
            if arr[i] > arr[i+1]:
                temp = arr[i]
                arr[i] = arr[i+1]
                arr[i+1] = temp
        print(j, arr)
      
    return arr

In [185]:
my_array = [2,0,2,1,1,0]
bubble_sort_algo(my_array)

0 [0, 2, 1, 1, 0, 2]
1 [0, 1, 1, 0, 2, 2]
2 [0, 1, 0, 1, 2, 2]
3 [0, 0, 1, 1, 2, 2]
4 [0, 0, 1, 1, 2, 2]


[0, 0, 1, 1, 2, 2]

In [186]:
def onepass_sort(arr):
    '''
    input: an array of size n
    output: sorted array
    No use of sort function
    '''
    p1=0
    p2=len(arr)-1
    cur=0
    
    while cur <= p2:
        if arr[cur]==0:
            arr[p1], arr[cur] = arr[cur], arr[p1]
            p1 += 1
            cur += 1
        elif arr[cur]==2:
            arr[p2], arr[cur] = arr[cur], arr[p2]
            p2 -= 1
        else:
            cur += 1
            
    return arr

In [187]:
my_array = [2,0,2,1,1,0]
onepass_sort(my_array)

[0, 0, 1, 1, 2, 2]

## Coding question 2

In [195]:
def maxMoney(arr):
    '''
    input: an array representing values of the houses
    output: maximum value
    '''
    
    # Strategy - determine the sum of each non-adjacent pair
    for i in range(len(arr)-1):
        for j in range(len(arr)-1):
            if abs(j-i)!=1:
                val = sum(arr[i]+arr[j])
                print(val)
                
    # 
    
    #return max_val

In [196]:
my_array= [2,7,9,3,1]
maxMoney(my_array)

TypeError: 'int' object is not iterable

`sum()` expects an iterable, but `arr[i] + arr[j]` is already a single int, hence the error. The pairwise strategy is also incomplete, since the best answer can use more than two houses.

In [197]:
sum(my_array)

22

In [198]:
len(my_array)

5

Leetcode 198

In [203]:
def rob(arr):
    # prevMax: best total up to two houses back; currMax: best total so far
    prevMax = 0
    currMax = 0
    for x in arr:
        temp = currMax
        # Either rob this house (prevMax + x) or skip it (currMax)
        currMax = max(prevMax + x, currMax)
        prevMax = temp
    return currMax
    

In [204]:
rob([2,7,9,3,1])

12

In [None]:
public int rob(int[] num) {
    int prevMax = 0;
    int currMax = 0;
    for (int x : num) {
        int temp = currMax;
        currMax = Math.max(prevMax + x, currMax);
        prevMax = temp;
    }
    return currMax;
}


# BRL SQL questions

## Tables for validation

In [203]:
# Create table within a local database

# Define a database name, set your postgres username
dbname = "baseball"
username = "lacar"  # change this to your username

# Working with PostgreSQL in Python
con = psycopg2.connect(database=dbname, user=username)

# Here, we're using postgres, but sqlalchemy can connect to other things too.
engine = create_engine("postgres://%s@localhost/%s" % (username, dbname))
print(engine.url)

postgres://lacar@localhost/baseball


In [257]:
# Generate random date ranges

# From https://towardsdatascience.com/mastering-dates-and-timestamps-in-pandas-and-python-in-general-5b8c6edcc50c

import random
import time
from dateutil.parser import parse
def str_time_prop(start, end, format, prop):
    stime = time.mktime(time.strptime(start, format))
    etime = time.mktime(time.strptime(end, format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(format, time.localtime(ptime))

selected_format = '%Y-%m-%d %H:%M:%S'

def random_date(start, end, prop):
    return parse(str_time_prop(start, end, selected_format, prop)).strftime(selected_format)

def make_date(x):
    return random_date("2020-01-01 13:40:00", "2020-01-14 14:50:00", random.random())


In [258]:
# Generate dates (my function)
def generate_dates(n_dates):
    return sorted([make_date(x) for x in range(n_dates)])

In [210]:
# Generate names (my function)
def generate_name_list(n_names):
    import names   # needed to pip install
    name_list = list()
    for i in range(n_names):
        name_list.append(names.get_first_name())
    return name_list

In [214]:
# Generate 3-digit codes (e.g. city ids) (my function)
def generate_codes(n_codes):
    # 3 digits between 110 and 999 without repeating
    import random
    code_ids = random.sample(range(110, 1000), n_codes)
    return code_ids

In [238]:
# Generate random list following input of a set of values to choose
def generate_custom_vals(list2consider, n_items):
    custom_list = np.random.choice(list2consider, size=n_items, replace=True).tolist()
    return custom_list

### Table 1

In [259]:
my_date_list = generate_dates(10)

In [260]:
my_date_list

['2020-01-03 21:42:44',
 '2020-01-04 13:57:34',
 '2020-01-06 05:24:27',
 '2020-01-06 06:47:06',
 '2020-01-07 05:57:20',
 '2020-01-07 23:55:00',
 '2020-01-08 21:16:25',
 '2020-01-10 21:16:19',
 '2020-01-11 02:02:49',
 '2020-01-11 07:00:19']

In [213]:
my_name_list = generate_name_list(10)

In [215]:
my_city_codes = generate_codes(10)
my_city_codes

[237, 650, 179, 358, 260, 182, 617, 453, 402, 600]

In [239]:
my_list2consider = ['a', 'b', 'c']
generate_custom_vals(my_list2consider, 10)

['a', 'a', 'a', 'c', 'a', 'c', 'a', 'c', 'b', 'a']

In [248]:
my_list2consider = ['desktop-browser','mobile-browser','ios-native','android-native']
my_list2consider4table = generate_custom_vals(my_list2consider, 10)

In [264]:
my_list2consider = ['US', 'Canada', 'Mexico']
my_countries4table = generate_custom_vals(my_list2consider, 10)
my_countries4table

['Canada',
 'US',
 'US',
 'Canada',
 'Canada',
 'Canada',
 'US',
 'US',
 'US',
 'Mexico']

In [298]:
col_1 = range(1, 11)
col_2 = pd.to_datetime(my_date_list)
col_3 = my_list2consider4table
col_4 = my_countries4table

table1 = pd.DataFrame([col_1, col_2, col_3, col_4]).T
table1.columns = ['user_id', 'join_ts', 'join_client', 'country']

# Temp table created here that I'll just over-write with each new problem
table1.to_sql('user_summary', engine, if_exists='replace')


### Table 2

In [269]:
uid_action = generate_custom_vals(range(1, 11), 10)
uid_action

[3, 3, 1, 10, 7, 3, 10, 6, 8, 10]

In [270]:
my_list2consider = range(1,4)
my_page_id = generate_custom_vals(my_list2consider, 10)
my_page_id

[2, 1, 3, 1, 3, 1, 2, 1, 3, 1]

In [273]:
my_date_list2 = generate_dates(10)
my_date_list2

['2020-01-03 23:28:36',
 '2020-01-03 23:32:05',
 '2020-01-08 03:02:15',
 '2020-01-08 05:07:58',
 '2020-01-08 13:44:02',
 '2020-01-10 06:34:01',
 '2020-01-13 02:12:42',
 '2020-01-13 14:28:43',
 '2020-01-14 01:29:32',
 '2020-01-14 10:52:38']

In [277]:
my_list2consider = ['viewed', 'clicked']
my_action = generate_custom_vals(my_list2consider, 10)
my_action

['viewed',
 'viewed',
 'viewed',
 'viewed',
 'clicked',
 'clicked',
 'clicked',
 'viewed',
 'viewed',
 'viewed']

In [279]:
table2

Unnamed: 0,user_id,page_id,ts,action
0,3,2,2020-01-03 23:28:36,viewed
1,3,1,2020-01-03 23:32:05,viewed
2,1,3,2020-01-08 03:02:15,viewed
3,10,1,2020-01-08 05:07:58,viewed
4,7,3,2020-01-08 13:44:02,clicked
5,3,1,2020-01-10 06:34:01,clicked
6,10,2,2020-01-13 02:12:42,clicked
7,6,1,2020-01-13 14:28:43,viewed
8,8,3,2020-01-14 01:29:32,viewed
9,10,1,2020-01-14 10:52:38,viewed


In [295]:
col_1 = uid_action
col_2 = my_page_id
col_3 = pd.to_datetime(my_date_list2)
col_4 = my_action

table2 = pd.DataFrame([col_1, col_2, col_3, col_4]).T
table2.columns = ['user_id', 'page_id', 'ts', 'action']

# Temp table created here that I'll just over-write with each new problem
table2.to_sql('page_actions', engine, if_exists='replace')

**Need to account for users who don't comment at all.**

In [None]:
/*
-- user_summary
-- user_id | join_ts | join_client | country

-- problems_solved
-- user_id | problem_id | ts | action | answer_is_correct
-- We have the following actions:
Viewed problem
Tried problem

*/

-- In the last 14d, what are the top 5 countries of people joining Brilliant

SELECT country, COUNT(*) AS no_of_people
FROM user_summary
WHERE join_ts BETWEEN (NOW() - INTERVAL '14 days') AND NOW()
GROUP BY country
ORDER BY no_of_people DESC
LIMIT 5

In [None]:
-- For each country, what are the average and total number of problems viewed 4 hours after a user joins?
-- | country | user_id | total_number_of_problems

WITH t1 AS
  (SELECT us.country, us.user_id, COUNT(*) AS number_of_problems
  FROM user_summary AS us
  JOIN problems_solved AS ps
  ON us.user_id=ps.user_id
  WHERE ps.ts BETWEEN us.join_ts AND (us.join_ts + INTERVAL '4 hours')
  AND ps.action='viewed'
  GROUP BY us.country, us.user_id)
  
SELECT t1.country, 
       AVG(number_of_problems) AS avg_number_of_problems,
       SUM(number_of_problems) AS total_number_of_problems
FROM t1
GROUP BY t1.country;

In [None]:
-- For the US, if someone tried a problem within 4 hours after joining, what % of them joined via iOS? Trend this by the date someone joined
-- | date | percentage |
-- date including zeros for ios

--join_client in ('desktop-browser','mobile-browser','ios-native','android-native')

-- t1: one row per US user who tried a problem within 4 hours of joining
WITH t1 AS
  (SELECT DISTINCT us.user_id,
          us.join_ts::date AS join_date,
          us.join_client
  FROM user_summary AS us
  JOIN problems_solved AS ps
  ON us.user_id=ps.user_id
  WHERE us.country='US'
  AND ps.ts BETWEEN us.join_ts AND (us.join_ts + interval '4 hours')
  AND ps.action='tried')
  
SELECT t1.join_date, 
       100*AVG(CASE WHEN t1.join_client = 'ios-native' THEN 1 ELSE 0 END) AS pct_ios_users
FROM t1
GROUP BY t1.join_date
ORDER BY t1.join_date;
  

## Re-creating query

In [383]:
sql_query = """            
SELECT *
FROM user_summary;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,index,user_id,join_ts,join_client,country
0,0,1,2020-01-03 21:42:44,android-native,Canada
1,1,2,2020-01-04 13:57:34,android-native,US
2,2,3,2020-01-06 05:24:27,mobile-browser,US
3,3,4,2020-01-06 06:47:06,android-native,Canada
4,4,5,2020-01-07 05:57:20,mobile-browser,Canada
5,5,6,2020-01-07 23:55:00,desktop-browser,Canada
6,6,7,2020-01-08 21:16:25,mobile-browser,US
7,7,8,2020-01-10 21:16:19,mobile-browser,US
8,8,9,2020-01-11 02:02:49,desktop-browser,US
9,9,10,2020-01-11 07:00:19,android-native,Mexico


In [384]:
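# Print each row as a VALUES tuple for a WITH ... AS (VALUES ...) block.
# Note: print() puts a space between its arguments, so the quoted values
# come out padded (' US ' rather than 'US') and need trimming before use.
for i in df_query.index:
    print("(", df_query.loc[i, "user_id"], 
          ", CAST('", df_query.loc[i, "join_ts"], "' AS date)", 
          ",'", df_query.loc[i, "join_client"], "'",
          ",'", df_query.loc[i, "country"], "'),")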
    

( 1 , CAST(' 2020-01-03 21:42:44 ' AS date) ,' android-native ' ,' Canada '),
( 2 , CAST(' 2020-01-04 13:57:34 ' AS date) ,' android-native ' ,' US '),
( 3 , CAST(' 2020-01-06 05:24:27 ' AS date) ,' mobile-browser ' ,' US '),
( 4 , CAST(' 2020-01-06 06:47:06 ' AS date) ,' android-native ' ,' Canada '),
( 5 , CAST(' 2020-01-07 05:57:20 ' AS date) ,' mobile-browser ' ,' Canada '),
( 6 , CAST(' 2020-01-07 23:55:00 ' AS date) ,' desktop-browser ' ,' Canada '),
( 7 , CAST(' 2020-01-08 21:16:25 ' AS date) ,' mobile-browser ' ,' US '),
( 8 , CAST(' 2020-01-10 21:16:19 ' AS date) ,' mobile-browser ' ,' US '),
( 9 , CAST(' 2020-01-11 02:02:49 ' AS date) ,' desktop-browser ' ,' US '),
( 10 , CAST(' 2020-01-11 07:00:19 ' AS date) ,' android-native ' ,' Mexico '),


In [385]:
# Test temp table
sql_query = """
WITH user_summary (user_id, join_ts, join_client, country)
AS (VALUES
( 1 , CAST('2020-01-03 21:42:44' AS date), 'android-native', 'Canada'),
( 2 , CAST('2020-01-04 13:57:34' AS date), 'android-native', 'US'),
( 3 , CAST('2020-01-06 05:24:27' AS date), 'mobile-browser', 'US'),
( 4 , CAST('2020-01-06 06:47:06' AS date), 'android-native', 'Canada'),
( 5 , CAST('2020-01-07 05:57:20' AS date), 'mobile-browser', 'Canada'),
( 6 , CAST('2020-01-07 23:55:00' AS date), 'desktop-browser', 'Canada'),
( 7 , CAST('2020-01-08 21:16:25' AS date), 'mobile-browser', 'US'),
( 8 , CAST('2020-01-10 21:16:19' AS date), 'mobile-browser', 'US'),
( 9 , CAST('2020-01-11 02:02:49' AS date), 'desktop-browser', 'US'),
( 10 , CAST('2020-01-11 07:00:19' AS date), 'android-native', 'Mexico'))

SELECT *
FROM user_summary;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,user_id,join_ts,join_client,country
0,1,2020-01-03,android-native,Canada
1,2,2020-01-04,android-native,US
2,3,2020-01-06,mobile-browser,US
3,4,2020-01-06,android-native,Canada
4,5,2020-01-07,mobile-browser,Canada
5,6,2020-01-07,desktop-browser,Canada
6,7,2020-01-08,mobile-browser,US
7,8,2020-01-10,mobile-browser,US
8,9,2020-01-11,desktop-browser,US
9,10,2020-01-11,android-native,Mexico


In [380]:
# Test temp table
sql_query = """
WITH user_summary (user_id, join_ts, join_client, country)
AS (VALUES
( 1 , CAST(' 2020-01-03 21:42:44 ' AS date) ,' android-native ' ,' Canada '),
( 2 , CAST(' 2020-01-04 13:57:34 ' AS date) ,' android-native ' ,' US '),
( 3 , CAST(' 2020-01-06 05:24:27 ' AS date) ,' mobile-browser ' ,' US '),
( 4 , CAST(' 2020-01-06 06:47:06 ' AS date) ,' android-native ' ,' Canada '),
( 5 , CAST(' 2020-01-07 05:57:20 ' AS date) ,' mobile-browser ' ,' Canada '),
( 6 , CAST(' 2020-01-07 23:55:00 ' AS date) ,' desktop-browser ' ,' Canada '),
( 7 , CAST(' 2020-01-08 21:16:25 ' AS date) ,' mobile-browser ' ,' US '),
( 8 , CAST(' 2020-01-10 21:16:19 ' AS date) ,' mobile-browser ' ,' US '),
( 9 , CAST(' 2020-01-11 02:02:49 ' AS date) ,' desktop-browser ' ,' US '),
( 10 , CAST(' 2020-01-11 07:00:19 ' AS date) ,' android-native ' ,' Mexico '))

SELECT *
FROM user_summary
WHERE join_ts > '2020-01-05';
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,user_id,join_ts,join_client,country
0,3,2020-01-06,mobile-browser,US
1,4,2020-01-06,android-native,Canada
2,5,2020-01-07,mobile-browser,Canada
3,6,2020-01-07,desktop-browser,Canada
4,7,2020-01-08,mobile-browser,US
5,8,2020-01-10,mobile-browser,US
6,9,2020-01-11,desktop-browser,US
7,10,2020-01-11,android-native,Mexico


In [386]:
# Test query
sql_query = """            
SELECT *
FROM page_actions;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,index,user_id,page_id,ts,action
0,0,3,2,2020-01-03 23:28:36,viewed
1,1,3,1,2020-01-03 23:32:05,viewed
2,2,1,3,2020-01-08 03:02:15,viewed
3,3,10,1,2020-01-08 05:07:58,viewed
4,4,7,3,2020-01-08 13:44:02,clicked
5,5,3,1,2020-01-10 06:34:01,clicked
6,6,10,2,2020-01-13 02:12:42,clicked
7,7,6,1,2020-01-13 14:28:43,viewed
8,8,8,3,2020-01-14 01:29:32,viewed
9,9,10,1,2020-01-14 10:52:38,viewed


In [388]:
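# Same row printer for page_actions. page_id is wrapped in quotes here, so it
# would be read as text unless the quotes are removed (as in the next cell).
for i in df_query.index:
    print("(", df_query.loc[i, "user_id"], 
          ",'", df_query.loc[i, "page_id"], "'",
          ", CAST('", df_query.loc[i, "ts"], "' AS date)",
          ",'", df_query.loc[i, "action"], "'),")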
    

( 3 ,' 2 ' , CAST(' 2020-01-03 23:28:36 ' AS date) ,' viewed '),
( 3 ,' 1 ' , CAST(' 2020-01-03 23:32:05 ' AS date) ,' viewed '),
( 1 ,' 3 ' , CAST(' 2020-01-08 03:02:15 ' AS date) ,' viewed '),
( 10 ,' 1 ' , CAST(' 2020-01-08 05:07:58 ' AS date) ,' viewed '),
( 7 ,' 3 ' , CAST(' 2020-01-08 13:44:02 ' AS date) ,' clicked '),
( 3 ,' 1 ' , CAST(' 2020-01-10 06:34:01 ' AS date) ,' clicked '),
( 10 ,' 2 ' , CAST(' 2020-01-13 02:12:42 ' AS date) ,' clicked '),
( 6 ,' 1 ' , CAST(' 2020-01-13 14:28:43 ' AS date) ,' viewed '),
( 8 ,' 3 ' , CAST(' 2020-01-14 01:29:32 ' AS date) ,' viewed '),
( 10 ,' 1 ' , CAST(' 2020-01-14 10:52:38 ' AS date) ,' viewed '),


In [390]:
# Test query
sql_query = """            
WITH page_actions (user_id, page_id, ts, action)
AS (VALUES
( 3,2, CAST('2020-01-03 23:28:36 ' AS date), 'viewed'),
( 3,1, CAST('2020-01-03 23:32:05 ' AS date), 'viewed'),
( 1,3, CAST('2020-01-08 03:02:15 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-08 05:07:58 ' AS date), 'viewed'),
( 7,3, CAST('2020-01-08 13:44:02 ' AS date), 'clicked'),
( 3,1, CAST('2020-01-10 06:34:01 ' AS date), 'clicked'),
( 10,2, CAST('2020-01-13 02:12:42 ' AS date), 'clicked'),
( 6,1, CAST('2020-01-13 14:28:43 ' AS date), 'viewed'),
( 8,3, CAST('2020-01-14 01:29:32 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-14 10:52:38 ' AS date), 'viewed'))

SELECT *
FROM page_actions;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,user_id,page_id,ts,action
0,3,2,2020-01-03,viewed
1,3,1,2020-01-03,viewed
2,1,3,2020-01-08,viewed
3,10,1,2020-01-08,viewed
4,7,3,2020-01-08,clicked
5,3,1,2020-01-10,clicked
6,10,2,2020-01-13,clicked
7,6,1,2020-01-13,viewed
8,8,3,2020-01-14,viewed
9,10,1,2020-01-14,viewed


### Problem 1

In the last 4 months, what are the top 5 countries by number of people joining the platform?

- What if there are ties?



In [102]:
# Modified
sql_query = """   
SELECT country, COUNT(*) AS no_of_people
FROM user_summary
WHERE NOW()-join_ts <= interval '4 months'
GROUP BY country
ORDER BY no_of_people DESC, country ASC
LIMIT 5
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,country,no_of_people
0,US,5
1,Canada,4
2,Mexico,1


In [393]:
# Modified
sql_query = """   
SELECT country, COUNT(*) AS no_of_people
FROM user_summary
WHERE join_ts > NOW() - interval '4 months'
GROUP BY country
ORDER BY no_of_people DESC, country ASC
LIMIT 5
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,country,no_of_people
0,US,5
1,Canada,4
2,Mexico,1
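
On the tie question: `LIMIT 5` cuts off at five rows, so a country tied with the fifth one can be dropped depending on the tiebreak. One option is to rank the counts and keep every country whose count is among the top N distinct counts, which is the DENSE_RANK idea. A minimal sketch, assuming an in-memory SQLite database as a stand-in for the PostgreSQL connection and a small made-up `user_summary` table (both are assumptions for illustration):

```python
import sqlite3

# In-memory SQLite stands in for the notebook's PostgreSQL connection.
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE user_summary (user_id INTEGER, country TEXT)")
demo_con.executemany(
    "INSERT INTO user_summary VALUES (?, ?)",
    [(1, "Canada"), (2, "US"), (3, "US"), (4, "Canada"), (5, "Mexico")],
)

top_n = 1  # made-up cutoff; Canada and US are tied for first place

query = """
WITH counts AS (
    SELECT country, COUNT(*) AS no_of_people
    FROM user_summary
    GROUP BY country)
SELECT c.country, c.no_of_people
FROM counts AS c
-- keep a country if fewer than top_n distinct counts are larger than its own
WHERE (SELECT COUNT(DISTINCT c2.no_of_people)
       FROM counts AS c2
       WHERE c2.no_of_people > c.no_of_people) < ?
ORDER BY c.no_of_people DESC, c.country ASC
"""
tied_top = demo_con.execute(query, (top_n,)).fetchall()
print(tied_top)
```

With `LIMIT 1` only Canada would come back; the ranked version returns both tied countries.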


### Problem 2

For each country, what are the average and total number of pages viewed within 4 days after a user joins?

In [328]:
sql_query = """  

WITH t1 AS
  (SELECT country, us.user_id, COUNT(*) AS number_of_views
  FROM user_summary AS us
  JOIN page_actions AS pa
  ON us.user_id=pa.user_id
  WHERE pa.action='viewed'
  GROUP BY us.country, us.user_id)
  
SELECT *
FROM t1;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,country,user_id,number_of_views
0,Canada,1,1
1,Canada,6,1
2,Mexico,10,2
3,US,3,2
4,US,8,1


In [330]:
sql_query = """  

WITH t1 AS
  (SELECT country, us.user_id, COUNT(*) AS number_of_views
  FROM user_summary AS us
  JOIN page_actions AS pa
  ON us.user_id=pa.user_id
  WHERE pa.ts BETWEEN us.join_ts AND (us.join_ts + interval '4 days')
  AND pa.action='viewed'
  GROUP BY us.country, us.user_id)
  
SELECT *
FROM t1;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,country,user_id,number_of_views
0,Mexico,10,1
1,US,8,1


In [332]:
sql_query = """  

WITH t1 AS
  (SELECT country, us.user_id, COUNT(*) AS n_views
  FROM user_summary AS us
  JOIN page_actions AS pa
  ON us.user_id=pa.user_id
  WHERE pa.ts BETWEEN us.join_ts AND (us.join_ts + interval '4 days')
  AND pa.action='viewed'
  GROUP BY us.country, us.user_id)
  
SELECT t1.country, 
	   AVG(n_views) AS avg,
       SUM(n_views) AS total
FROM t1
GROUP BY t1.country;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,country,avg,total
0,Mexico,1.0,1.0
1,US,1.0,1.0
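
The inner join above only keeps users with at least one qualifying view, so the average ignores users who viewed nothing. One way to count those users as zeros is a LEFT JOIN with the filters moved into the ON clause. A minimal sketch, assuming in-memory SQLite as a stand-in for PostgreSQL (so `datetime(join_ts, '+4 days')` replaces `join_ts + interval '4 days'`) and a subset of the rows shown earlier, with full timestamps:

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection
demo_con.execute("CREATE TABLE user_summary (user_id INTEGER, join_ts TEXT, country TEXT)")
demo_con.execute("CREATE TABLE page_actions (user_id INTEGER, page_id INTEGER, ts TEXT, action TEXT)")
demo_con.executemany("INSERT INTO user_summary VALUES (?, ?, ?)", [
    (1, "2020-01-03 21:42:44", "Canada"),
    (8, "2020-01-10 21:16:19", "US"),
    (9, "2020-01-11 02:02:49", "US"),
    (10, "2020-01-11 07:00:19", "Mexico"),
])
demo_con.executemany("INSERT INTO page_actions VALUES (?, ?, ?, ?)", [
    (1, 3, "2020-01-08 03:02:15", "viewed"),
    (10, 1, "2020-01-08 05:07:58", "viewed"),
    (10, 2, "2020-01-13 02:12:42", "clicked"),
    (8, 3, "2020-01-14 01:29:32", "viewed"),
    (10, 1, "2020-01-14 10:52:38", "viewed"),
])

query = """
WITH t1 AS (
    -- COUNT(pa.user_id) counts matched rows only, so unmatched users get 0
    SELECT us.country, us.user_id, COUNT(pa.user_id) AS n_views
    FROM user_summary AS us
    LEFT JOIN page_actions AS pa
      ON us.user_id = pa.user_id
     AND pa.action = 'viewed'
     AND pa.ts BETWEEN us.join_ts AND datetime(us.join_ts, '+4 days')
    GROUP BY us.country, us.user_id)
SELECT country, AVG(n_views) AS avg_views, SUM(n_views) AS total_views
FROM t1
GROUP BY country
ORDER BY country
"""
per_country = demo_con.execute(query).fetchall()
print(per_country)
```

Putting the filters in WHERE instead of ON would turn the LEFT JOIN back into an inner join, because unmatched rows have NULL in the `pa` columns.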


### Problem 3

For the US, if someone clicked on a page within 4 hours after joining, what % of them joined via android-native? Trend this by the date someone joined.

In [103]:
sql_query = """  
WITH t1 AS
  (SELECT DISTINCT us.user_id, 
   	      us.join_ts, 
          SUM(CASE WHEN join_client = 'android-native' THEN 1 END) AS ios_counts,
   				COUNT(DISTINCT us.user_id) AS total_counts
  FROM user_summary AS us
  JOIN page_actions AS pa
  ON us.user_id=pa.user_id
  WHERE country='US'
  AND pa.ts BETWEEN us.join_ts AND (us.join_ts + interval '4 hours')
  AND pa.action='tried'
  GROUP BY us.user_id)
  
SELECT t1.join_ts, 
       100*(t1.ios_counts::numeric/t1.total_counts) AS pct_ios_users
FROM t1

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

DatabaseError: Execution failed on sql '  
WITH t1 AS
  (SELECT DISTINCT us.user_id, 
   	      us.join_ts, 
          SUM(CASE WHEN join_client = 'android-native' THEN 1 END) AS ios_counts,
   				COUNT(DISTINCT us.user_id) AS total_counts
  FROM user_summary AS us
  JOIN page_actions AS pa
  ON us.user_id=pa.user_id
  WHERE country='US'
  AND pa.ts BETWEEN us.join_ts AND (us.join_ts + interval '4 hours')
  AND pa.action='tried'
  GROUP BY us.user_id)
  
SELECT t1.join_ts, 
       100*(t1.ios_counts::numeric/t1.total_counts) AS pct_ios_users
FROM t1

': column "us.join_ts" must appear in the GROUP BY clause or be used in an aggregate function
LINE 4:           us.join_ts, 
                  ^
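
The error above comes from selecting `us.join_ts` while grouping only by `us.user_id`. The filter also uses `action='tried'`, which does not appear in `page_actions`. One possible fix is to first collect the distinct qualifying users, then group by join date. A minimal sketch, assuming in-memory SQLite as a stand-in for PostgreSQL (`date()` and `datetime(..., '+4 hours')` replace `::date` and `interval '4 hours'`) and made-up rows chosen so that some users qualify:

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection
demo_con.execute(
    "CREATE TABLE user_summary (user_id INTEGER, join_ts TEXT, join_client TEXT, country TEXT)")
demo_con.execute(
    "CREATE TABLE page_actions (user_id INTEGER, page_id INTEGER, ts TEXT, action TEXT)")
# Made-up data for illustration only.
demo_con.executemany("INSERT INTO user_summary VALUES (?, ?, ?, ?)", [
    (1, "2020-01-04 10:00:00", "android-native", "US"),
    (2, "2020-01-04 12:00:00", "mobile-browser", "US"),
    (3, "2020-01-05 09:00:00", "mobile-browser", "US"),
    (4, "2020-01-05 09:30:00", "android-native", "Canada"),
    (5, "2020-01-04 08:00:00", "android-native", "US"),
])
demo_con.executemany("INSERT INTO page_actions VALUES (?, ?, ?, ?)", [
    (1, 1, "2020-01-04 11:00:00", "clicked"),
    (1, 2, "2020-01-04 11:30:00", "clicked"),  # second click, same user
    (2, 1, "2020-01-04 13:00:00", "clicked"),
    (3, 1, "2020-01-05 10:00:00", "clicked"),
    (4, 1, "2020-01-05 10:00:00", "clicked"),  # not a US user
    (5, 1, "2020-01-04 20:00:00", "clicked"),  # 12 hours after joining
])

query = """
WITH clickers AS (
    -- one row per US user who clicked within 4 hours of joining
    SELECT DISTINCT us.user_id, date(us.join_ts) AS join_date, us.join_client
    FROM user_summary AS us
    JOIN page_actions AS pa ON us.user_id = pa.user_id
    WHERE us.country = 'US'
      AND pa.action = 'clicked'
      AND pa.ts BETWEEN us.join_ts AND datetime(us.join_ts, '+4 hours'))
SELECT join_date,
       100.0 * AVG(CASE WHEN join_client = 'android-native' THEN 1 ELSE 0 END) AS pct_android
FROM clickers
GROUP BY join_date
ORDER BY join_date
"""
trend = demo_con.execute(query).fetchall()
print(trend)
```

The hour-level window needs full timestamps; the test tables built with `CAST(... AS date)` earlier drop the time of day.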


## Facebook problem

(From Yaniv) What percent of users click?

table
USER   | ACTION
user1  |  click
user2  |  click
user2  |  NULL
user3  |  NULL
user3  |  NULL

output
| pct_click | 

intermediate output
| click | total users


-- GROUP BY count
-- CASE WHEN count > 0 THEN 1 ELSE 0


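-- `user` and `table` are reserved words in PostgreSQL; rename (or double-quote) them before running
SELECT SUM(CASE WHEN n_click > 0 THEN 1 ELSE 0 END)::numeric / 
       COUNT(*) AS prop_click
FROM
    (SELECT user,
           SUM(CASE WHEN action='click' THEN 1 ELSE 0 END) AS n_click
    FROM table
    GROUP BY user) AS t1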



In [130]:
# `table` can't be used as the name in WITH table ..., and neither can `user`:
# both are reserved key words in PostgreSQL, so they would need double quotes.

# sql_query = """
# WITH table (id, score) AS 
#    (VALUES ('user1', 'click'),
#            ('user2', 'click'),
#            ('user2', NULL),
#            ('user3', NULL),
#            ('user3', NULL))
  
# SELECT *
# FROM table;

# """
# df_query = pd.read_sql_query(sql_query,con)    
# df_query

In [143]:
sql_query = """
WITH input_table (id, action) AS 
   (VALUES ('user1', 'click'),
           ('user2', 'click'),
           ('user2', NULL),
           ('user3', NULL),
           ('user3', NULL))
  
SELECT *
FROM input_table;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,action
0,user1,click
1,user2,click
2,user2,
3,user3,
4,user3,


In [149]:
sql_query = """
WITH input_table (id, action) AS 
   (VALUES ('user1', 'click'),
           ('user2', 'click'),
           ('user2', NULL),
           ('user3', NULL),
           ('user3', NULL))
  

SELECT *
FROM
    (SELECT id,
           SUM(CASE WHEN action='click' THEN 1 ELSE 0 END) AS n_click
    FROM input_table
    GROUP BY id) AS t1;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,n_click
0,user2,1
1,user1,1
2,user3,0


In [4]:
sql_query = """
WITH input_table (id, action) AS 
   (VALUES ('user1', 'click'),
           ('user2', 'click'),
           ('user2', NULL),
           ('user3', NULL),
           ('user3', NULL))
  

SELECT SUM(CASE WHEN n_click > 0 THEN 1 ELSE 0 END)::numeric / 
       COUNT(*) AS prop_click
FROM
    (SELECT id,
           SUM(CASE WHEN action='click' THEN 1 ELSE 0 END) AS n_click
    FROM input_table
    GROUP BY id) AS t1;

"""
df_query = pd.read_sql_query(sql_query,con)    

df_query

Unnamed: 0,prop_click
0,0.666667


**Using AVG to find a proportion**
When the values are all 0s and 1s, AVG gives the same result as computing the numerator (the sum of the 1s) and the denominator (the row count) separately and dividing.

In [5]:
sql_query = """
WITH input_table (id, action) AS 
   (VALUES ('user1', 'click'),
           ('user2', 'click'),
           ('user2', NULL),
           ('user3', NULL),
           ('user3', NULL))
  

SELECT AVG(CASE WHEN n_click > 0 THEN 1 ELSE 0 END)::numeric
FROM
    (SELECT id,
           SUM(CASE WHEN action='click' THEN 1 ELSE 0 END) AS n_click
    FROM input_table
    GROUP BY id) AS t1;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,avg
0,0.666667


In [26]:
sql_query = """
WITH input_table (id, action) AS 
   (VALUES ('user1', 'click'),
           ('user2', 'click'),
           ('user2', NULL),
           ('user3', NULL),
           ('user3', NULL))
  

SELECT AVG(CASE WHEN n_click > 0 THEN 1 ELSE 0 END)*1.0 AS prop
FROM
    (SELECT id,
           SUM(CASE WHEN action='click' THEN 1 ELSE 0 END) AS n_click
    FROM input_table
    GROUP BY id) AS t1;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,prop
0,0.666667


In [154]:
# checking Mike's
sql_query = """
WITH input_table (id, action) AS 
   (VALUES ('user1', 'click'),
           ('user2', 'click'),
           ('user2', NULL),
           ('user3', NULL),
           ('user3', NULL))
  

SELECT id,
    MAX(action)
    FROM input_table
    GROUP BY id

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,max
0,user2,click
1,user1,click
2,user3,
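
Mike's `MAX(action)` idea can be carried through to the final proportion. MAX ignores NULLs, so it is NULL only for users who never clicked, and `COUNT(max_action)` then counts the users who did. A minimal sketch, assuming in-memory SQLite as a stand-in for the PostgreSQL connection and that 'click' is the only non-NULL action, as in the example table:

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection
demo_con.execute("CREATE TABLE input_table (id TEXT, action TEXT)")
demo_con.executemany("INSERT INTO input_table VALUES (?, ?)", [
    ("user1", "click"),
    ("user2", "click"),
    ("user2", None),
    ("user3", None),
    ("user3", None),
])

query = """
-- COUNT(max_action) skips NULLs; COUNT(*) counts every user
SELECT 1.0 * COUNT(max_action) / COUNT(*) AS prop_click
FROM (SELECT id, MAX(action) AS max_action
      FROM input_table
      GROUP BY id) AS t1
"""
prop_click = demo_con.execute(query).fetchone()[0]
print(round(prop_click, 6))
```

The `1.0 *` plays the role of the `::numeric` cast in the cells above, avoiding integer division.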


# QotD 4/1/20

## Question 1. Compute scores of teams.

You would like to compute the scores of all teams after all matches. Points are awarded as follows:
* A team receives three points if they win a match (score strictly more goals than the opponent team).
* A team receives one point if they draw a match (same number of goals as the opponent team).
* A team receives no points if they lose a match (score fewer goals than the opponent team).


Table: Teams
+---------------+----------+
| Column Name   | Type     |
+---------------+----------+
| team_id       | int      |
| team_name     | varchar  |
+---------------+----------+
team_id is the primary key of this table.
Each row of this table represents a single football team.

Table: Matches
+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| match_id      | int     |
| host_team     | int     |
| guest_team    | int     | 
| host_goals    | int     |
| guest_goals   | int     |
+---------------+---------+
match_id is the primary key of this table.
Each row is a record of a finished match between two different teams. 
Teams host_team and guest_team are represented by their IDs in the teams table (team_id) and they scored host_goals and guest_goals respectively.


Output table

+---------------+----------+
| Column Name   | Type     |
+---------------+----------+
| team_id       | int      |
| team_name     | varchar  |
| total_points  | int      |
+---------------+----------+


In [100]:
# Create temporary tables, use join on host team
sql_query = """
WITH teams (team_id, team_names) AS (
    VALUES (1,'Padres'),
           (2,'Dodgers'),
           (3,'Giants'),
           (4,'Dbacks'),
           (5,'Rockies')),
           
matches (match_id, host_team, guest_team, host_goals, guest_goals) AS (
    VALUES (1, 1, 2, 10, 0),
           (2, 2, 3, 2, 4),
           (3, 3, 4, 5, 2),
           (4, 4, 5, 3, 3),
           (5, 5, 1, 0, 5))          
           
SELECT *
FROM teams
JOIN matches
ON teams.team_id=matches.host_team
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,team_id,team_names,match_id,host_team,guest_team,host_goals,guest_goals
0,1,Padres,1,1,2,10,0
1,2,Dodgers,2,2,3,2,4
2,3,Giants,3,3,4,5,2
3,4,Dbacks,4,4,5,3,3
4,5,Rockies,5,5,1,0,5


In [113]:
# Create temporary tables
sql_query = """
WITH teams (team_id, team_names) AS (
    VALUES (1,'Padres'),
           (2,'Dodgers'),
           (3,'Giants'),
           (4,'Dbacks'),
           (5,'Rockies')),
           
matches (match_id, host_team, guest_team, host_goals, guest_goals) AS (
    VALUES (1, 1, 2, 10, 0),
           (2, 2, 3, 2, 4),
           (3, 3, 4, 5, 2),
           (4, 4, 5, 3, 3),
           (5, 5, 1, 0, 5))          
           
SELECT *
FROM matches;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,match_id,host_team,guest_team,host_goals,guest_goals
0,1,1,2,10,0
1,2,2,3,2,4
2,3,3,4,5,2
3,4,4,5,3,3
4,5,5,1,0,5


In [102]:
# Create temporary tables
sql_query = """
WITH teams (team_id, team_names) AS (
    VALUES (1,'Padres'),
           (2,'Dodgers'),
           (3,'Giants'),
           (4,'Dbacks'),
           (5,'Rockies')),
           
matches (match_id, host_team, guest_team, host_goals, guest_goals) AS (
    VALUES (1, 1, 2, 10, 0),
           (2, 2, 3, 2, 4),
           (3, 3, 4, 5, 2),
           (4, 4, 5, 3, 3),
           (5, 5, 1, 0, 5))          
           
SELECT *
FROM teams
JOIN matches
ON CASE WHEN matches.host_goals > matches.guest_goals THEN teams.team_id=matches.host_team
    END;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,team_id,team_names,match_id,host_team,guest_team,host_goals,guest_goals
0,1,Padres,1,1,2,10,0
1,3,Giants,3,3,4,5,2


In [104]:
# Create temporary tables
sql_query = """
WITH teams (team_id, team_names) AS (
    VALUES (1,'Padres'),
           (2,'Dodgers'),
           (3,'Giants'),
           (4,'Dbacks'),
           (5,'Rockies')),
           
matches (match_id, host_team, guest_team, host_goals, guest_goals) AS (
    VALUES (1, 1, 2, 10, 0),
           (2, 2, 3, 2, 4),
           (3, 3, 4, 5, 2),
           (4, 4, 5, 3, 3),
           (5, 5, 1, 0, 5)),          
           
home_results AS
    (SELECT team_id,
       team_names, 
       SUM(CASE WHEN matches.host_goals>matches.guest_goals THEN 3
                WHEN matches.host_goals=matches.guest_goals THEN 1
                WHEN matches.host_goals<matches.guest_goals THEN 0
                END) AS home_total_points
    FROM teams
    JOIN matches
    ON teams.team_id=matches.host_team
    GROUP BY team_id, team_names),
    
away_results AS
    (SELECT team_id,
       team_names, 
       SUM(CASE WHEN matches.host_goals>matches.guest_goals THEN 0
                WHEN matches.host_goals=matches.guest_goals THEN 1
                WHEN matches.host_goals<matches.guest_goals THEN 3
                END) AS away_total_points
    FROM teams
    JOIN matches
    ON teams.team_id=matches.guest_team
    GROUP BY team_id, team_names) 
    
SELECT *
FROM home_results
JOIN away_results
ON home_results.team_id=away_results.team_id;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,team_id,team_names,home_total_points,team_id.1,team_names.1,away_total_points
0,1,Padres,3,1,Padres,3
1,2,Dodgers,0,2,Dodgers,0
2,3,Giants,3,3,Giants,3
3,5,Rockies,0,5,Rockies,1
4,4,Dbacks,1,4,Dbacks,0


In [105]:
# Final output
sql_query = """
WITH teams (team_id, team_names) AS (
    VALUES (1,'Padres'),
           (2,'Dodgers'),
           (3,'Giants'),
           (4,'Dbacks'),
           (5,'Rockies')),
           
matches (match_id, host_team, guest_team, host_goals, guest_goals) AS (
    VALUES (1, 1, 2, 10, 0),
           (2, 2, 3, 2, 4),
           (3, 3, 4, 5, 2),
           (4, 4, 5, 3, 3),
           (5, 5, 1, 0, 5)),          
           
home_results AS
    (SELECT team_id,
       team_names, 
       SUM(CASE WHEN matches.host_goals>matches.guest_goals THEN 3
                WHEN matches.host_goals=matches.guest_goals THEN 1
                WHEN matches.host_goals<matches.guest_goals THEN 0
                END) AS home_total_points
    FROM teams
    JOIN matches
    ON teams.team_id=matches.host_team
    GROUP BY team_id, team_names),
    
away_results AS
    (SELECT team_id,
       team_names, 
       SUM(CASE WHEN matches.host_goals>matches.guest_goals THEN 0
                WHEN matches.host_goals=matches.guest_goals THEN 1
                WHEN matches.host_goals<matches.guest_goals THEN 3
                END) AS away_total_points
    FROM teams
    JOIN matches
    ON teams.team_id=matches.guest_team
    GROUP BY team_id, team_names) 
    
SELECT home_results.team_id AS team_id,
       (home_total_points + away_total_points) AS total_points
FROM home_results
JOIN away_results
ON home_results.team_id=away_results.team_id;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,team_id,total_points
0,1,6
1,2,0
2,3,6
3,5,1
4,4,1


In [108]:
# Final output - shorter
sql_query = """
WITH teams (team_id, team_names) AS (
    VALUES (1,'Padres'),
           (2,'Dodgers'),
           (3,'Giants'),
           (4,'Dbacks'),
           (5,'Rockies')),
           
matches (match_id, host_team, guest_team, host_goals, guest_goals) AS (
    VALUES (1, 1, 2, 10, 0),
           (2, 2, 3, 2, 4),
           (3, 3, 4, 5, 2),
           (4, 4, 5, 3, 3),
           (5, 5, 1, 0, 5)),          
           
home_results AS
    (SELECT team_id,
       team_names, 
       SUM(CASE WHEN matches.host_goals>matches.guest_goals THEN 3
                WHEN matches.host_goals=matches.guest_goals THEN 1
                ELSE 0
                END) AS home_total_points
    FROM teams
    JOIN matches
    ON teams.team_id=matches.host_team
    GROUP BY team_id, team_names),
    
away_results AS
    (SELECT team_id,
       team_names, 
       SUM(CASE WHEN matches.host_goals<matches.guest_goals THEN 3
                WHEN matches.host_goals=matches.guest_goals THEN 1
                ELSE 0
                END) AS away_total_points
    FROM teams
    JOIN matches
    ON teams.team_id=matches.guest_team
    GROUP BY team_id, team_names) 
    
SELECT home_results.team_id AS team_id,
       (home_total_points + away_total_points) AS total_points
FROM home_results
JOIN away_results
ON home_results.team_id=away_results.team_id
ORDER BY total_points DESC, home_results.team_id ASC;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,team_id,total_points
0,1,6
1,3,6
2,4,1
3,5,1
4,2,0


In [120]:
# Testing Minting's solution (can't follow all of it)
sql_query = """
WITH teams (team_id, team_names) AS (
    VALUES (1,'Padres'),
           (2,'Dodgers'),
           (3,'Giants'),
           (4,'Dbacks'),
           (5,'Rockies')),
           
matches (match_id, host_team, guest_team, host_goals, guest_goals) AS (
    VALUES (1, 1, 2, 10, 0),
           (2, 2, 3, 2, 4),
           (3, 3, 4, 5, 2),
           (4, 4, 5, 3, 3),
           (5, 5, 1, 0, 5)),          
           
score AS (
    SELECT match_id, 
           host_team, 
           guest_team,
           (CASE WHEN host_goals > guest_goals THEN 3
                 WHEN host_goals = guest_goals THEN 1
                 ELSE 0 END) AS host_score,
           (CASE WHEN guest_goals > host_goals THEN 3
                WHEN guest_goals = host_goals THEN 1
                ELSE 0 END) AS guest_score
           FROM matches)
    
SELECT team_id, team_name, sum(team_score)
FROM (SELECT t1.team_id AS team_id, 
             t1.team_name AS team_name, 
             s.host_score AS team_score
             FROM score s
             INNER JOIN Teams t1
             ON s.host_team = t1.team_id
       UNION ALL
       SELECT t2.team_id AS team_id,
              t2.team_name AS team_name, 
              s.guest_score AS team_score
              FROM score s
              INNER JOIN teams t2
              ON s.guest_team = t2.team_id) sub
GROUP BY team_id, team_name
ORDER BY team_score DESC, team_id;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

DatabaseError: Execution failed on sql '
WITH teams (team_id, team_names) AS (
    VALUES (1,'Padres'),
           (2,'Dodgers'),
           (3,'Giants'),
           (4,'Dbacks'),
           (5,'Rockies')),
           
matches (match_id, host_team, guest_team, host_goals, guest_goals) AS (
    VALUES (1, 1, 2, 10, 0),
           (2, 2, 3, 2, 4),
           (3, 3, 4, 5, 2),
           (4, 4, 5, 3, 3),
           (5, 5, 1, 0, 5)),          
           
score AS (
    SELECT match_id, 
           host_team, 
           guest_team,
           (CASE WHEN host_goals > guest_goals THEN 3
                 WHEN host_goals = guest_goals THEN 1
                 ELSE 0 END) AS host_score,
           (CASE WHEN guest_goals > host_goals THEN 3
                WHEN guest_goals = host_goals THEN 1
                ELSE 0 END) AS guest_score
           FROM matches)
    
SELECT team_id, team_name, sum(team_score)
FROM (SELECT t1.team_id AS team_id, 
             t1.team_name AS team_name, 
             s.host_score AS team_score
             FROM score s
             INNER JOIN Teams t1
             ON s.host_team = t1.team_id
       UNION ALL
       SELECT t2.team_id AS team_id,
              t2.team_name AS team_name, 
              s.guest_score AS team_score
              FROM score s
              INNER JOIN teams t2
              ON s.guest_team = t2.team_id) sub
GROUP BY team_id, team_name
ORDER BY team_score DESC, team_id;
': column t1.team_name does not exist
LINE 30:              t1.team_name AS team_name, 
                      ^
HINT:  Perhaps you meant to reference the column "t1.team_names".


In [None]:
-- Minting's final SELECT; it needs the score CTE and a teams table with a team_name column
SELECT team_id, team_name, SUM(team_score) AS total_points
FROM (SELECT t1.team_id AS team_id, t1.team_name AS team_name, s.host_score AS team_score
FROM score s
INNER JOIN Teams t1
ON s.host_team = t1.team_id
UNION ALL
SELECT t2.team_id AS team_id, t2.team_name AS team_name, s.guest_score AS team_score
FROM score s
INNER JOIN Teams t2
ON s.guest_team = t2.team_id) sub
GROUP BY team_id, team_name
ORDER BY total_points DESC, team_id
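
The failed run of this solution traces back to the test CTE naming its column `team_names` while the query asks for `team_name`. Ordering by `team_score` after the GROUP BY would likely be rejected next, since that column is neither grouped nor aggregated. A minimal sketch of the same UNION ALL idea, assuming in-memory SQLite as a stand-in for PostgreSQL. Team 6 ('Mariners') is a made-up team with no matches, added to show that the LEFT JOIN keeps teams that an inner join of home and away results would drop.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL connection

query = """
WITH teams (team_id, team_name) AS (
    VALUES (1,'Padres'), (2,'Dodgers'), (3,'Giants'),
           (4,'Dbacks'), (5,'Rockies'),
           (6,'Mariners')),  -- hypothetical team with no matches
matches (match_id, host_team, guest_team, host_goals, guest_goals) AS (
    VALUES (1, 1, 2, 10, 0),
           (2, 2, 3, 2, 4),
           (3, 3, 4, 5, 2),
           (4, 4, 5, 3, 3),
           (5, 5, 1, 0, 5)),
points AS (
    -- one row per team per match: host side first, then guest side
    SELECT host_team AS team_id,
           CASE WHEN host_goals > guest_goals THEN 3
                WHEN host_goals = guest_goals THEN 1
                ELSE 0 END AS pts
    FROM matches
    UNION ALL
    SELECT guest_team,
           CASE WHEN guest_goals > host_goals THEN 3
                WHEN guest_goals = host_goals THEN 1
                ELSE 0 END
    FROM matches)
SELECT t.team_id, t.team_name, COALESCE(SUM(p.pts), 0) AS total_points
FROM teams AS t
LEFT JOIN points AS p ON t.team_id = p.team_id
GROUP BY t.team_id, t.team_name
ORDER BY total_points DESC, t.team_id
"""
standings = demo_con.execute(query).fetchall()
for row in standings:
    print(row)
```

This also returns `team_name`, which the expected output table asks for.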

## Question 2. Rank scores.

Write a SQL query to rank scores. If there is a tie between two scores, both should have the same ranking. Note that after a tie, the next ranking number should be the next consecutive integer value. In other words, there should be no "holes" between ranks.

+----+-------+
| Id | Score |
+----+-------+
| 1  | 3.50  |
| 2  | 3.65  |
| 3  | 4.00  |
| 4  | 3.85  |
| 5  | 4.00  |
| 6  | 3.65  |
+----+-------+


In [4]:
# Build the input table from the problem statement
sql_query = """
WITH  input_table (id, score) AS (VALUES (1,3.50), (2,3.65), (3,4.00), (4,3.85), (5,4.00), (6,3.65))

SELECT *
FROM input_table;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,score
0,1,3.5
1,2,3.65
2,3,4.0
3,4,3.85
4,5,4.0
5,6,3.65


In [None]:
sql_query = """
WITH  input_table (id, score) AS (VALUES (1,3.50), (2,3.65), (3,4.00), (4,3.85), (5,4.00), (6,3.65))

SELECT score
FROM input_table AS it;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

In [76]:
sql_query = """
WITH  input_table (id, score) AS (VALUES (1,3.50), (2,3.65), (3,4.00), (4,3.85), (5,4.00), (6,3.65))

SELECT score,
       (SELECT COUNT(*) 
       FROM input_table
       WHERE input_table.score > it.score) AS rank
FROM input_table AS it
ORDER BY rank;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,score,rank
0,4.0,0
1,4.0,0
2,3.85,2
3,3.65,3
4,3.65,3
5,3.5,5


In [80]:
# Dense rank via a correlated subquery: count the distinct scores >= the current score
sql_query = """
WITH  input_table (id, score) AS (VALUES (1,3.50), (2,3.65), (3,4.00), (4,3.85), (5,4.00), (6,3.65))

SELECT score,
       (SELECT COUNT(DISTINCT(score)) 
        FROM input_table
        WHERE input_table.score >= it.score) AS rank
FROM input_table AS it
ORDER BY rank;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,score,rank
0,4.0,1
1,4.0,1
2,3.85,2
3,3.65,3
4,3.65,3
5,3.5,4


In [17]:
# Same ranking with the DENSE_RANK() window function
sql_query = """
WITH  input_table (id, score) AS (VALUES (1,3.50), (2,3.65), (3,4.00), (4,3.85), (5,4.00), (6,3.65))

SELECT score,
       DENSE_RANK() OVER(ORDER BY score DESC) AS score_rank
FROM input_table;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,score,score_rank
0,4.0,1
1,4.0,1
2,3.85,2
3,3.65,3
4,3.65,3
5,3.5,4
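
The two subquery attempts above differ in what they count. Counting rows with a strictly higher score gives a ranking with gaps after ties (the first attempt, offset by one), while counting distinct scores at or above the current one gives the gap-free ranking the question asks for. A minimal sketch of both definitions in plain Python, using the scores from the problem statement:

```python
scores = [3.50, 3.65, 4.00, 3.85, 4.00, 3.65]

# Dense rank: position of the score among the distinct scores, highest first.
distinct_desc = sorted(set(scores), reverse=True)
dense_rank = {s: i + 1 for i, s in enumerate(distinct_desc)}

# Gapped rank: 1 + number of rows with a strictly higher score.
gapped_rank = {s: 1 + sum(other > s for other in scores) for s in scores}

ordered = sorted(scores, reverse=True)
dense = [(s, dense_rank[s]) for s in ordered]
gapped = [(s, gapped_rank[s]) for s in ordered]

print(dense)
print(gapped)
```

The `dense` list matches the `DENSE_RANK()` output above; the `gapped` list corresponds to what `RANK()` returns.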


## Question 3. Nth highest salary.

Write a SQL query to get the nth highest salary from the Employee table.

+----+--------+
| Id | Salary |
+----+--------+
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |
+----+--------+


In [29]:
sql_query = """
WITH input_table (id, salary) AS (VALUES (1,100), (2,200), (3,300))

SELECT *
FROM input_table;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,salary
0,1,100
1,2,200
2,3,300


In [32]:
sql_query = """
WITH Employee (id, salary) AS (VALUES (1,100), (2,200), (3,300))

SELECT MIN(salary)
FROM (SELECT DISTINCT salary
      FROM Employee
      ORDER BY salary DESC
      LIMIT 2) AS sub_table;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query



Unnamed: 0,min
0,200
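
The cell above hard-codes n = 2. One way to generalize it, sketched here with Python's built-in `sqlite3` rather than the notebook's PostgreSQL connection (an assumption made so the example runs on its own), is to skip n - 1 distinct salaries with `OFFSET`. The helper name `nth_highest_salary` is hypothetical. Unlike `MIN` over `LIMIT n`, this returns nothing when there are fewer than n distinct salaries.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL connection `con`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (id INTEGER, salary INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)",
                 [(1, 100), (2, 200), (3, 300)])


def nth_highest_salary(conn, n):
    """Return the nth highest distinct salary, or None if it does not exist."""
    if n < 1:
        return None
    sql_query = """
    SELECT DISTINCT salary
    FROM Employee
    ORDER BY salary DESC
    LIMIT 1 OFFSET ?;
    """
    # Skip the n - 1 larger distinct salaries, then take one row.
    row = conn.execute(sql_query, (n - 1,)).fetchone()
    return row[0] if row else None


for n in range(1, 5):
    print(n, nth_highest_salary(conn, n))
```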


In [124]:
sql_query = """
WITH  input_table (id, salary) AS (VALUES (1,100), (2,200), (3,300))

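-- nth highest salary without LIMIT: keep the salary with exactly n-1 distinct salaries above it (n = 2 here)
SELECT DISTINCT it.salary
FROM input_table AS it
WHERE (SELECT COUNT(DISTINCT salary)
       FROM input_table
       WHERE input_table.salary > it.salary) = 1;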
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

InterfaceError: connection already closed

In [None]:
sql_query = """
WITH  input_table (id, salary) AS (VALUES (1,100), (2,200), (3,300))

SELECT *
FROM input_table;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

# QotD 4/2/20

## Most frequent product

Given a table of products, find the most frequent product each day

ID   |  Date  |   Product  |
-----| ------ |  --------- |
1    |  2-12  |   apple    |
2    |  2-12  |   apple    |
3    |  2-12  |   orange   |
4    |  2-13  |   pear     |


In [153]:
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'))
              
           
SELECT *
FROM products;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,date,product
0,1,2020-02-12,apple
1,2,2020-02-12,apple
2,3,2020-02-12,orange
3,4,2020-02-13,pear


In [166]:
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear')),
              
t1 AS
    (SELECT date,
           product,
           COUNT(*) AS no_items
    FROM products
    GROUP BY date, product)
    
SELECT *
FROM t1;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product,no_items
0,2020-02-12,apple,2
1,2020-02-13,pear,1
2,2020-02-12,orange,1


In [177]:
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear')),
              
t1 AS
    (SELECT date,
           product,
           COUNT(*) AS no_items
    FROM products
    GROUP BY date, product)

SELECT *
FROM t1;
            
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product,no_items
0,2020-02-12,apple,2
1,2020-02-13,pear,1
2,2020-02-12,orange,1


In [189]:
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'))
    
SELECT date, 
       product,
       COUNT(*)
FROM products
GROUP BY date, product;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product,count
0,2020-02-12,apple,2
1,2020-02-13,pear,1
2,2020-02-12,orange,1


In [191]:
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear')),
    
t1 AS (SELECT date, 
       product,
       COUNT(*)
        FROM products
        GROUP BY date, product)

SELECT *
FROM t1;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product,count
0,2020-02-12,apple,2
1,2020-02-13,pear,1
2,2020-02-12,orange,1


In [193]:
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear')),
    
t1 AS (SELECT date, 
       product,
       COUNT(*)
       FROM products
       GROUP BY date, product)

SELECT date, MAX(count)
FROM t1
GROUP BY date;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,max
0,2020-02-12,2
1,2020-02-13,1


In [194]:
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear')),
    
t1 AS (SELECT date, 
       product,
       COUNT(*)
       FROM products
       GROUP BY date, product)

SELECT t1.date,
       t1.product
FROM t1
JOIN 
    (SELECT date, MAX(count) AS max_count
    FROM t1
    GROUP BY date) AS t2
ON t1.date=t2.date
AND t1.count=t2.max_count;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product
0,2020-02-12,apple
1,2020-02-13,pear


In [210]:
# Try with window function
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'))

SELECT date,
       product,
       RANK() OVER(PARTITION BY date ORDER BY COUNT(product) DESC) AS n_rank
FROM products
GROUP BY date, product;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product,n_rank
0,2020-02-12,apple,1
1,2020-02-12,orange,2
2,2020-02-13,pear,1


In [None]:
# Try Shu's window function
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'))

SELECT t.date, t.product
FROM
    (SELECT date,
           product,
           RANK() OVER(PARTITION BY date ORDER BY COUNT(product) DESC) AS n_rank
    FROM products
    GROUP BY date, product) AS t
WHERE t.n_rank=1;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

In [5]:
# Try with window function
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'),
           (5, '2020-02-13', 'orange'),
           (6, '2020-02-13', 'orange'),
           (7, '2020-02-14', 'apple'),
           (8, '2020-02-14', 'apple'),
           (9, '2020-02-14', 'pear'))

SELECT t.date, t.product
FROM
    (SELECT date,
           product,
           RANK() OVER(PARTITION BY date ORDER BY COUNT(product) DESC) AS n_rank
    FROM products
    GROUP BY date, product) AS t
WHERE t.n_rank=1;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product
0,2020-02-12,apple
1,2020-02-13,orange
2,2020-02-14,apple
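
One thing the sample data never exercises is a tie for the most frequent product. The sketch below adds one hypothetical row so that apple and orange tie on 2020-02-12, and compares `RANK()` with `ROW_NUMBER()`. It runs on Python's built-in `sqlite3` (3.25 or newer) instead of PostgreSQL, and computes the counts in a separate CTE first; both choices are assumptions made to keep the example self-contained.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL connection `con`.
conn = sqlite3.connect(":memory:")

sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'),
           (5, '2020-02-12', 'orange')),  -- hypothetical row that creates a tie

counts AS
    (SELECT date,
            product,
            COUNT(product) AS n_product
     FROM products
     GROUP BY date, product)

SELECT date,
       product,
       RANK() OVER (PARTITION BY date ORDER BY n_product DESC) AS n_rank,
       ROW_NUMBER() OVER (PARTITION BY date
                          ORDER BY n_product DESC, product) AS n_row
FROM counts
ORDER BY date, product;
"""
rows = conn.execute(sql_query).fetchall()

# RANK keeps every tied product; ROW_NUMBER keeps exactly one per day.
by_rank = [(d, p) for d, p, n_rank, n_row in rows if n_rank == 1]
by_row = [(d, p) for d, p, n_rank, n_row in rows if n_row == 1]
print(by_rank)
print(by_row)
```

Which behavior is right depends on whether the question wants all tied products or a single winner per day.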


In [16]:
# Without a  window function and repeat 4/13/20
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'),
           (5, '2020-02-13', 'orange'),
           (6, '2020-02-13', 'orange'),
           (7, '2020-02-14', 'apple'),
           (8, '2020-02-14', 'apple'),
           (9, '2020-02-14', 'pear'))

SELECT t.date, t.product
FROM
    (SELECT date,
            product,
            COUNT(product) DESC) AS n_rank
    FROM products
    GROUP BY date, product) AS t
WHERE t.n_rank=1;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

DatabaseError: Execution failed on sql '
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'),
           (5, '2020-02-13', 'orange'),
           (6, '2020-02-13', 'orange'),
           (7, '2020-02-14', 'apple'),
           (8, '2020-02-14', 'apple'),
           (9, '2020-02-14', 'pear'))

SELECT t.date, t.product
FROM
    (SELECT date,
            product,
            COUNT(product) DESC) AS n_rank
    FROM products
    GROUP BY date, product) AS t
WHERE t.n_rank=1;

': syntax error at or near "DESC"
LINE 17:             COUNT(product) DESC) AS n_rank
                                    ^


In [17]:
#  ----------  RE-VISIT  4/15/20 -------------
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'))
              
--need a count of each product for each day

SELECT date,
       product,
       COUNT(product) AS n_product
FROM products
GROUP BY date, product;
    
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product,n_product
0,2020-02-12,apple,2
1,2020-02-13,pear,1
2,2020-02-12,orange,1


In [18]:
#  ----------  RE-VISIT  4/15/20 -------------
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'))
              
--need a count of each product for each day
    --then need a ranking

SELECT t.date,
       t.product,
       RANK() OVER(ORDER BY t.n_product DESC) AS rank
FROM
    (SELECT date,
            product,
           COUNT(product) AS n_product
    FROM products
    GROUP BY date, product) AS t;
    
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product,rank
0,2020-02-12,apple,1
1,2020-02-13,pear,2
2,2020-02-12,orange,2


In [20]:
#  ----------  RE-VISIT  4/15/20 -------------
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'))
              
--need a count of each product for each day
--then need a ranking
    -- then need a selection of the rank

SELECT tr.date,
       tr.product
FROM
    (SELECT t.date,
            t.product,
           RANK() OVER(PARTITION BY date ORDER BY t.n_product DESC) AS rank
    FROM
        (SELECT date,
                product,
               COUNT(product) AS n_product
        FROM products
        GROUP BY date, product) AS t) AS tr
WHERE tr.rank=1
ORDER BY date;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product
0,2020-02-12,apple
1,2020-02-13,pear


In [28]:
#  ----------  RE-VISIT  4/15/20 (shorter window function) -------------
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'))
              

--need a count
--need a ranking and selection


SELECT date,
       product,
       RANK() OVER(PARTITION BY date ORDER BY COUNT(product) DESC) AS n_rank
FROM products
GROUP BY date, product;


"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product,n_rank
0,2020-02-12,apple,1
1,2020-02-12,orange,2
2,2020-02-13,pear,1


In [26]:
#  ----------  RE-VISIT  4/15/20 (shorter window function) -------------
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'))
              

--need a count
--need a ranking and selection

SELECT rt.date, rt.product
FROM
    (SELECT date,
           product,
           RANK() OVER(PARTITION BY date ORDER BY COUNT(product) DESC) AS n_rank
    FROM products
    GROUP BY date, product) AS rt
WHERE rt.n_rank=1;


"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product
0,2020-02-12,apple
1,2020-02-13,pear


In [34]:
#  ----------  RE-VISIT  4/15/20 (without a window function) -------------
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'),
           (5, '2020-02-13', 'orange'),
           (6, '2020-02-13', 'orange'),
           (7, '2020-02-14', 'apple'),
           (8, '2020-02-14', 'apple'),
           (9, '2020-02-14', 'pear'))

--get count

SELECT date,
        product,
        COUNT(product) AS n_prod
FROM products
GROUP BY date, product
ORDER BY date, n_prod DESC

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product,n_prod
0,2020-02-12,apple,2
1,2020-02-12,orange,1
2,2020-02-13,orange,2
3,2020-02-13,pear,1
4,2020-02-14,apple,2
5,2020-02-14,pear,1


In [None]:
SELECT score,
       (SELECT COUNT(*) 
       FROM input_table
       WHERE input_table.score > it.score) AS rank
FROM input_table AS it
ORDER BY rank;

In [57]:
#  ----------  RE-VISIT  4/15/20 (without a window function) -------------
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'),
           (5, '2020-02-13', 'orange'),
           (6, '2020-02-13', 'orange'),
           (7, '2020-02-14', 'apple'),
           (8, '2020-02-14', 'apple'),
           (9, '2020-02-14', 'pear'))

SELECT *
FROM products;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,date,product
0,1,2020-02-12,apple
1,2,2020-02-12,apple
2,3,2020-02-12,orange
3,4,2020-02-13,pear
4,5,2020-02-13,orange
5,6,2020-02-13,orange
6,7,2020-02-14,apple
7,8,2020-02-14,apple
8,9,2020-02-14,pear


In [61]:
#  ----------  RE-VISIT  4/15/20 (without a window function) -------------
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'),
           (5, '2020-02-13', 'orange'),
           (6, '2020-02-13', 'orange'),
           (7, '2020-02-14', 'apple'),
           (8, '2020-02-14', 'apple'),
           (9, '2020-02-14', 'pear')),

-- get count
-- get ranking using COUNT of n_product by comparing with its own table for that date

t AS
    (SELECT date,
            product,
            COUNT(product) AS n_prod
    FROM products
    GROUP BY date, product
    ORDER BY date, n_prod DESC)

SELECT t1.date,
       t1.product,
       (SELECT COUNT(*)
        FROM t
        WHERE t.n_prod >= t1.n_prod
        AND t.date=t1.date) AS rank
FROM t AS t1;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product,rank
0,2020-02-12,apple,1
1,2020-02-12,orange,2
2,2020-02-13,orange,1
3,2020-02-13,pear,2
4,2020-02-14,apple,1
5,2020-02-14,pear,2


In [63]:
#  ----------  RE-VISIT  4/15/20 (without a window function) -------------
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'),
           (5, '2020-02-13', 'orange'),
           (6, '2020-02-13', 'orange'),
           (7, '2020-02-14', 'apple'),
           (8, '2020-02-14', 'apple'),
           (9, '2020-02-14', 'pear')),

-- get count
-- get ranking using COUNT of n_product by comparing with its own table for that date
    -- select the rank

t AS
    (SELECT date,
            product,
            COUNT(product) AS n_prod
    FROM products
    GROUP BY date, product
    ORDER BY date, n_prod DESC)

SELECT t2.date,
       t2.product
FROM
    (SELECT t1.date,
           t1.product,
           (SELECT COUNT(*)
            FROM t
            WHERE t.n_prod >= t1.n_prod
            AND t.date=t1.date) AS rank
    FROM t AS t1) AS t2
WHERE t2.rank=1;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,date,product
0,2020-02-12,apple
1,2020-02-13,orange
2,2020-02-14,apple
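
A caveat on the `>=` self-count used above: it gives rank 1 only when the top product is unique. With a tie, every tied product counts the others as well as itself, so no row gets rank 1. The sketch below shows this with one hypothetical extra row and compares it with counting strictly larger rows and adding one. It uses Python's built-in `sqlite3` instead of PostgreSQL, which is an assumption made so it runs standalone.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL connection `con`.
conn = sqlite3.connect(":memory:")

sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-12', 'orange')),  -- hypothetical row: apple and orange tie

t AS
    (SELECT date,
            product,
            COUNT(product) AS n_prod
     FROM products
     GROUP BY date, product)

SELECT t1.product,
       (SELECT COUNT(*)
        FROM t
        WHERE t.n_prod >= t1.n_prod
        AND t.date = t1.date) AS rank_ge,
       (SELECT COUNT(*) + 1
        FROM t
        WHERE t.n_prod > t1.n_prod
        AND t.date = t1.date) AS rank_gt
FROM t AS t1
ORDER BY t1.product;
"""
rows = conn.execute(sql_query).fetchall()
for row in rows:
    print(row)
```

Here `rank_ge` is 2 for both products, so filtering on `rank_ge = 1` would return nothing, while `rank_gt` is 1 for both and behaves like `RANK()`.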


In [64]:
#  ----------  RE-VISIT  4/15/20 (without a window function - use MAX) -------------
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'),
           (5, '2020-02-13', 'orange'),
           (6, '2020-02-13', 'orange'),
           (7, '2020-02-14', 'apple'),
           (8, '2020-02-14', 'apple'),
           (9, '2020-02-14', 'pear')),

-- get count
-- get ranking using COUNT of n_product by comparing with its own table for that date
    -- select the rank

t AS
    (SELECT date,
            product,
            COUNT(product) AS n_prod
    FROM products
    GROUP BY date, product
    ORDER BY date, n_prod DESC)

SELECT t.date,
       t.product
FROM t
JOIN
    (SELECT date,
            MAX(COUNT(*)) AS max_count
            FROM t
            GROUP BY date) AS t1
ON t.date=t1.date
AND t.n_prod=t1.max_count


"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

DatabaseError: Execution failed on sql '
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'),
           (5, '2020-02-13', 'orange'),
           (6, '2020-02-13', 'orange'),
           (7, '2020-02-14', 'apple'),
           (8, '2020-02-14', 'apple'),
           (9, '2020-02-14', 'pear')),

-- get count
-- get ranking using COUNT of n_product by comparing with its own table for that date
    -- select the rank

t AS
    (SELECT date,
            product,
            COUNT(product) AS n_prod
    FROM products
    GROUP BY date, product
    ORDER BY date, n_prod DESC)

SELECT t.date,
       t.product
FROM t
JOIN
    (SELECT date,
            MAX(COUNT(*)) AS max_count
            FROM t
            GROUP BY date) AS t1
ON t.date=t1.date
AND t.n_prod=t1.max_count


': aggregate function calls cannot be nested
LINE 30:             MAX(COUNT(*)) AS max_count
                         ^


In [67]:
#  ----------  RE-VISIT  4/15/20 (without a window function - use MAX - not sure if this can work) -------------
# Given a table of products, find the most frequent product each day
sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'),
           (5, '2020-02-13', 'orange'),
           (6, '2020-02-13', 'orange'),
           (7, '2020-02-14', 'apple'),
           (8, '2020-02-14', 'apple'),
           (9, '2020-02-14', 'pear')),

-- get count
-- get ranking using COUNT of n_product by comparing with its own table for that date
    -- select the rank

t1 AS
    (SELECT date,
            product,
            COUNT(product) AS n_prod
    FROM products
    GROUP BY date, product)

SELECT t1.date,
       t1.product
FROM t1
JOIN 
    (SELECT date, MAX(n_prod) AS max_count
    FROM t1
    GROUP BY date) AS t2
ON t1.date=t2.date
AND t1.count=t2.max_count

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

DatabaseError: Execution failed on sql '
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'),
           (5, '2020-02-13', 'orange'),
           (6, '2020-02-13', 'orange'),
           (7, '2020-02-14', 'apple'),
           (8, '2020-02-14', 'apple'),
           (9, '2020-02-14', 'pear')),

-- get count
-- get ranking using COUNT of n_product by comparing with its own table for that date
    -- select the rank

t1 AS
    (SELECT date,
            product,
            COUNT(product) AS n_prod
    FROM products
    GROUP BY date, product)

SELECT t1.date,
       t1.product
FROM t1
JOIN 
    (SELECT date, MAX(n_prod) AS max_count
    FROM t1
    GROUP BY date) AS t2
ON t1.date=t2.date
AND t1.count=t2.max_count



': aggregate functions are not allowed in JOIN conditions
LINE 32: AND t1.count=t2.max_count
             ^


Keys: Remain calm. Do things piece by piece. Don't forget the FROM when adding a nested subquery.
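
The last MAX attempt above fails because `t1` has no column named `count` (the alias is `n_prod`). PostgreSQL appears to read `t1.count` as a call to the aggregate `count(t1)`, hence the error about aggregates in JOIN conditions. A sketch of the same join with the alias corrected, run on Python's built-in `sqlite3` as a stand-in for PostgreSQL (an assumption, so it can run without a server):

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL connection `con`.
conn = sqlite3.connect(":memory:")

sql_query = """
WITH products (id, date, product) AS (
    VALUES (1, '2020-02-12', 'apple'),
           (2, '2020-02-12', 'apple'),
           (3, '2020-02-12', 'orange'),
           (4, '2020-02-13', 'pear'),
           (5, '2020-02-13', 'orange'),
           (6, '2020-02-13', 'orange'),
           (7, '2020-02-14', 'apple'),
           (8, '2020-02-14', 'apple'),
           (9, '2020-02-14', 'pear')),

t1 AS
    (SELECT date,
            product,
            COUNT(product) AS n_prod
     FROM products
     GROUP BY date, product)

SELECT t1.date,
       t1.product
FROM t1
JOIN
    (SELECT date, MAX(n_prod) AS max_count
     FROM t1
     GROUP BY date) AS t2
ON t1.date = t2.date
AND t1.n_prod = t2.max_count   -- n_prod, not count
ORDER BY t1.date;
"""
rows = conn.execute(sql_query).fetchall()
for row in rows:
    print(row)
```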

# QotD 4/3/20

## Transactions

2) You have a table of transactions where each row represents a single transaction. The table has four columns: a user_id for the user sending money (from now on, the sender), a user_id for the user receiving money (from now on, the receiver), an amount that was sent by the sender to the receiver, and a timestamp for when the transaction took place. User_ids appearing in both the sender and receiver columns are foreign keys to the same user table, and all values in the amount column are positive. 

Write a single query that gives the change in net worth for each user since this table started recording data.


In [None]:
transactions
| sender_id | receiver_id | amount | ts |


output
| user_id | net_worth |


In [None]:
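# Sent table: total amount each user has sent (one half of the net worth calculation)

SELECT 
    sender_id AS user_id,
    SUM(amount) AS total_sent
FROM transactions
GROUP BY 
    sender_id
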

| sender_id | receiver_id | amount | ts |
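
One possible way to finish this problem is to turn every transaction into two signed rows (negative for the sender, positive for the receiver) and sum per user. This is a sketch, not a checked answer: the sample rows are hypothetical, the aliases `moves` and `delta` are made up, and it runs on Python's built-in `sqlite3` instead of PostgreSQL.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL connection `con`.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions
                (sender_id INTEGER, receiver_id INTEGER, amount INTEGER, ts TEXT)""")
# Hypothetical sample rows, only to exercise the query.
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", [
    (1, 2, 10, "2020-04-01 09:00:00"),
    (2, 3, 4, "2020-04-01 10:00:00"),
    (1, 3, 1, "2020-04-02 11:00:00"),
])

sql_query = """
SELECT user_id,
       SUM(delta) AS net_worth
FROM
    (SELECT sender_id AS user_id, -amount AS delta
     FROM transactions
     UNION ALL
     SELECT receiver_id AS user_id, amount AS delta
     FROM transactions) AS moves
GROUP BY user_id
ORDER BY user_id;
"""
rows = conn.execute(sql_query).fetchall()
for row in rows:
    print(row)
```

`UNION ALL` matters here: plain `UNION` would drop duplicate rows, such as two identical payments to the same user. The net worth changes sum to zero, which is a quick sanity check.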

## Exchange Seats

Mary is a teacher in a middle school and she has a table seat storing students' names and their corresponding seat ids.

The id column increments continuously.
 
Mary wants to change seats for the adjacent students.
 
Can you write a SQL query to output the result for Mary?

+---------+---------+
|    id   | student |
+---------+---------+
|    1    | Abbot   |
|    2    | Doris   |
|    3    | Emerson |
|    4    | Green   |
|    5    | Jeames  |
+---------+---------+

In [5]:
# Input query
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Abbot'),
           (2, 'Doris'),
           (3, 'Emerson'),
           (4, 'Green'),
           (5, 'Jeames'))

SELECT *
FROM seat;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,student
0,1,Abbot
1,2,Doris
2,3,Emerson
3,4,Green
4,5,Jeames


In [20]:
# Find the last seat id
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Abbot'),
           (2, 'Doris'),
           (3, 'Emerson'),
           (4, 'Green'),
           (5, 'Jeames'))

SELECT MAX(id) FROM seat;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,max
0,5


In [23]:
# Swap adjacent seats with LEAD/LAG (no ORDER BY in the window, so this relies on the input row order)
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Abbot'),
           (2, 'Doris'),
           (3, 'Emerson'),
           (4, 'Green'),
           (5, 'Jeames'))

SELECT id,
       CASE WHEN id = (SELECT MAX(id) FROM seat) THEN student
            WHEN id%2 <> 0 THEN LEAD(student, 1) OVER()
            WHEN id%2 = 0 THEN LAG(student, 1) OVER()
            END AS student
FROM seat;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,student
0,1,Doris
1,2,Abbot
2,3,Green
3,4,Emerson
4,5,Jeames


In [29]:
# Same swap with the window ordered by student (works here only because the names are alphabetical by id)
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Abbot'),
           (2, 'Doris'),
           (3, 'Emerson'),
           (4, 'Green'),
           (5, 'Jeames'))

SELECT id,
       CASE WHEN (id = (SELECT MAX(id) FROM seat)) AND (id%2 <> 0) THEN student
            WHEN id%2 <> 0 THEN LEAD(student, 1) OVER(ORDER BY student)
            WHEN id%2 = 0 THEN LAG(student, 1) OVER(ORDER BY student)
            END AS student
FROM seat
ORDER BY id;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,student
0,1,Doris
1,2,Abbot
2,3,Green
3,4,Emerson
4,5,Jeames


{"headers": {"seat": ["id","student"]}, "rows": {"seat": [[1,"Craigie"],[2,"Julius"],[3,"Denis"],[4,"Isabel"],[5,"Windsor"],[6,"Vincent"],[7,"Mike"],[8,"Russell"],[9,"FitzGerald"],[10,"Rob"]]}}

In [32]:
# Input query
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT *
FROM seat;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,student
0,1,Craigie
1,2,Julius
2,3,Denis
3,4,Isabel
4,5,Windsor
5,6,Vincent
6,7,Mike
7,8,Russell
8,9,FitzGerald
9,10,Rob


In [34]:
# Same swap with the window ordered by id (does not depend on the names)
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT id,
       CASE WHEN (id = (SELECT MAX(id) FROM seat)) AND (id%2 <> 0) THEN student
            WHEN id%2 <> 0 THEN LEAD(student, 1) OVER(ORDER BY id)
            WHEN id%2 = 0 THEN LAG(student, 1) OVER(ORDER BY id)
            END AS student
FROM seat
ORDER BY id;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,student
0,1,Julius
1,2,Craigie
2,3,Isabel
3,4,Denis
4,5,Vincent
5,6,Windsor
6,7,Russell
7,8,Mike
8,9,Rob
9,10,FitzGerald


In [39]:
# Without window function
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT id,
       CASE WHEN (id = (SELECT MAX(id) FROM seat)) AND (id%2 <> 0) THEN student
            WHEN id%2 <> 0 THEN (SELECT student FROM seat WHERE id-1)
            WHEN id%2 = 0 THEN (SELECT student FROM seat WHERE id+1)
            END AS student
FROM seat;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

DatabaseError: Execution failed on sql '
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT id,
       CASE WHEN (id = (SELECT MAX(id) FROM seat)) AND (id%2 <> 0) THEN student
            WHEN id%2 <> 0 THEN (SELECT student FROM seat WHERE id-1)
            WHEN id%2 = 0 THEN (SELECT student FROM seat WHERE id+1)
            END AS student
FROM seat;

': argument of WHERE must be type boolean, not type integer
LINE 16: ...   WHEN id%2 <> 0 THEN (SELECT student FROM seat WHERE id-1)
                                                                   ^


In [46]:
# Without window function
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT id,
       id+1,
       id-1
FROM seat;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,?column?,?column?.1
0,1,2,0
1,2,3,1
2,3,4,2
3,4,5,3
4,5,6,4
5,6,7,5
6,7,8,6
7,8,9,7
8,9,10,8
9,10,11,9


In [50]:
# Without window function
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT id,
       CASE WHEN (id = (SELECT MAX(id) FROM seat)) AND (id%2 <> 0) THEN student
            WHEN id%2 <> 0 THEN (SELECT student FROM seat WHERE id=2)
            WHEN id%2 = 0 THEN (SELECT student FROM seat WHERE id=1)
            END AS student
FROM seat;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,id,student
0,1,Julius
1,2,Craigie
2,3,Julius
3,4,Craigie
4,5,Julius
5,6,Craigie
6,7,Julius
7,8,Craigie
8,9,Julius
9,10,Craigie


In [51]:
# Without window function
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT id,
       CASE WHEN (id = (SELECT MAX(id) FROM seat)) AND (id%2 <> 0) THEN student
            WHEN id%2 <> 0 THEN (SELECT student FROM seat WHERE id=(SELECT id+1 FROM seat))
            END AS student
FROM seat;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

DatabaseError: Execution failed on sql '
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT id,
       CASE WHEN (id = (SELECT MAX(id) FROM seat)) AND (id%2 <> 0) THEN student
            WHEN id%2 <> 0 THEN (SELECT student FROM seat WHERE id=(SELECT id+1 FROM seat))
            END AS student
FROM seat;

': more than one row returned by a subquery used as an expression


In [57]:
# Without window function
sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT s.student
FROM seat s
WHERE id=(SELECT id+1 FROM seat);
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

DatabaseError: Execution failed on sql '
WITH seat (id, student) AS (
    VALUES (1, 'Craigie'),
           (2, 'Julius'),
           (3, 'Denis'),
           (4, 'Isabel'),
           (5, 'Windsor'),
           (6, 'Vincent'),
           (7, 'Mike'),
           (8, 'Russell'),
           (9, 'FitzGerald'),
           (10, 'Rob'))

SELECT s.student
FROM seat s
WHERE id=(SELECT id+1 FROM seat);
': more than one row returned by a subquery used as an expression
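
The attempts above fail because `(SELECT id+1 FROM seat)` is not tied to the current row, so it returns one value per seat. A sketch of a non-window version follows: alias the outer table so the subquery can refer to the current row's id, and fall back to the student's own name when there is no neighbor (the last odd seat). It runs on Python's built-in `sqlite3` as a stand-in for PostgreSQL, which is an assumption; the aliases `s` and `s2` are made up.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL connection `con`.
conn = sqlite3.connect(":memory:")

sql_query = """
WITH seat (id, student) AS (
    VALUES (1, 'Abbot'),
           (2, 'Doris'),
           (3, 'Emerson'),
           (4, 'Green'),
           (5, 'Jeames'))

SELECT s.id,
       COALESCE(
           -- correlated subquery: look up the neighbor of the current row
           (SELECT s2.student
            FROM seat AS s2
            WHERE s2.id = CASE WHEN s.id % 2 <> 0 THEN s.id + 1
                               ELSE s.id - 1 END),
           s.student) AS student
FROM seat AS s
ORDER BY s.id;
"""
rows = conn.execute(sql_query).fetchall()
for row in rows:
    print(row)
```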


# QotD 4/13/20 

## Group features

They are testing a change to a feature and want to know whether group A behaves differently from group B. 
How many people in each group visit the SignUp page? What is the click-through rate for each group?
Based on the results, can we conclude that the groups are different? What test should be run? What hypothesis should be formed? Should it be a one-tailed or two-tailed test? How would you convince the manager? 

Table: Cohort

ID   |  GroupAssigment  |
-----| ---------------- |
1    |  Group A         |
2    |  Group A         |
3    |  Group B         |
4    |  Group B         |

Table: Event

ID   |  Page    |  Click  |
-----| -------- | ------- |
1    |  SignUp  |  1      |
2    |  SignUp  |  0      |
3    |  SignIn  |  0      |
4    |  SignIn  |  0      |




In [None]:
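-- assumes Event.ID is the user ID that matches Cohort.ID
SELECT c.GroupAssigment,
       COUNT(*) AS n_visits,
       AVG(e.click) AS click_through_rate
FROM cohort c
JOIN event e
ON c.id=e.id
WHERE e.page='SignUp'
GROUP BY c.GroupAssigment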
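
One way to check the per-group counts against the four sample rows is to load them into a throwaway database. This is a sketch under assumptions: it uses Python's built-in `sqlite3` rather than the PostgreSQL connection, treats `Event.ID` as the user ID that matches `Cohort.ID`, and renames the group column to a hypothetical `group_assignment`. The `LEFT JOIN` keeps groups that have no SignUp visits.

```python
import sqlite3

# Throwaway in-memory copy of the two sample tables.
demo_con = sqlite3.connect(":memory:")
demo_con.executescript("""
CREATE TABLE cohort (id INTEGER, group_assignment TEXT);
CREATE TABLE event (id INTEGER, page TEXT, click INTEGER);
INSERT INTO cohort VALUES (1, 'Group A'), (2, 'Group A'),
                          (3, 'Group B'), (4, 'Group B');
INSERT INTO event VALUES (1, 'SignUp', 1), (2, 'SignUp', 0),
                         (3, 'SignIn', 0), (4, 'SignIn', 0);
""")

# Putting the page filter in the ON clause (not WHERE) keeps every group in
# the result, even one with zero SignUp visits.
group_query = """
SELECT c.group_assignment,
       COUNT(e.id) AS n_signup_visits,
       AVG(e.click) AS click_through_rate
FROM cohort c
LEFT JOIN event e
ON c.id = e.id AND e.page = 'SignUp'
GROUP BY c.group_assignment
ORDER BY c.group_assignment;
"""
group_stats = demo_con.execute(group_query).fetchall()
print(group_stats)
```

With this sample, Group B has no SignUp visits, so its click-through rate is undefined (`None`) rather than zero.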

# QotD 4/14/20

## Uber's fraud team

1) Suppose you work at Uber's fraud team. While it is common for clients to use Uber in a foreign city, it is rare for a driver to drive in a foreign city. List all the drivers that have at least 5 completed trips in a foreign city in the last 28 days.


output:
| driver_id |  n_trips

SELECT t.driver_id,
       COUNT(*) AS n_trips
FROM trips t
JOIN cities tc              -- city where the trip took place
ON t.city_id=tc.id
JOIN users u                -- assumes users.id identifies the driver
ON t.driver_id=u.id
JOIN cities hc              -- driver's home city
ON u.city_id=hc.id
WHERE u.role='driver'
AND t.complete_time >= NOW() - interval '28 days'  -- a NULL complete_time (trip not completed) is filtered out
AND tc.country <> hc.country
GROUP BY t.driver_id
HAVING COUNT(*) >= 5


## Projects

1) Given the tables below, find the name of the department with the highest number of projects.





In [None]:
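WITH t1 AS
    (SELECT d.name,
           COUNT(DISTINCT p.id) AS project_count  -- count each project once per department
    FROM departments d
    JOIN employees e
    ON d.id=e.department_id
    JOIN employees_projects ep
    ON e.id=ep.employee_id  -- assumes employees_projects has an employee_id column
    JOIN projects p
    ON ep.project_id=p.id
    GROUP BY d.name),
    
t2 AS (
    SELECT *,
           RANK() OVER (ORDER BY project_count DESC) AS rank  -- rank 1 = most projects
    FROM t1)

SELECT name,
       project_count
FROM t2
WHERE rank=1;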



# QotD 4/15/20

## Problem statement and initial solution

1) Write a SQL query to create a histogram of the number of comments per user in the month of January 2019. Assume the bins have a class interval of one.

- no. of people with 1 comment, 2 comments, 3 comments...

| n_comments | n_people |

SELECT 
    ct.n_comments,
    COUNT(*) AS n_people
FROM
    (SELECT uc.user_id,
           COUNT(*) AS n_comments
    FROM user_comments AS uc
    WHERE uc.created_at >= '2019-01-01'
    AND uc.created_at < '2019-02-01'  -- half-open range keeps all of Jan 31
    GROUP BY uc.user_id) AS ct
GROUP BY ct.n_comments
ORDER BY ct.n_comments
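
The two-level aggregation (comments per user, then users per comment count) can be tried on a few made-up rows. The sketch below assumes a `user_comments(user_id, created_at)` table and uses Python's built-in `sqlite3` instead of PostgreSQL; every row in `rows` is hypothetical and chosen to exercise the month boundaries.

```python
import sqlite3

# Hypothetical comments: user 3 has one comment outside January,
# user 4 comments late on January 31.
rows = [
    (1, '2019-01-02 09:00:00'), (1, '2019-01-05 10:30:00'), (1, '2019-01-20 18:45:00'),
    (2, '2019-01-10 12:00:00'),
    (3, '2019-01-15 08:00:00'), (3, '2019-02-01 00:00:00'),
    (4, '2019-01-31 23:00:00'),
]

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE user_comments (user_id INTEGER, created_at TEXT)")
demo_con.executemany("INSERT INTO user_comments VALUES (?, ?)", rows)

# Inner query: comments per user in January. Outer query: users per count.
histogram_query = """
SELECT ct.n_comments,
       COUNT(*) AS n_people
FROM
    (SELECT user_id,
            COUNT(*) AS n_comments
     FROM user_comments
     WHERE created_at >= '2019-01-01'
     AND created_at < '2019-02-01'
     GROUP BY user_id) AS ct
GROUP BY ct.n_comments
ORDER BY ct.n_comments;
"""
histogram = demo_con.execute(histogram_query).fetchall()
print(histogram)
```

Users with no comments in January never reach the inner query, so a zero-comment bin would need a join against a full user list.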

In [76]:
[make_date(x) for x in range(10)]

['2019-12-26 05:13:56',
 '2019-12-30 07:16:11',
 '2019-12-25 01:39:35',
 '2019-12-25 13:30:31',
 '2019-12-31 01:09:45',
 '2019-12-29 20:05:27',
 '2019-12-28 14:33:06',
 '2019-12-30 08:58:11',
 '2019-12-29 07:59:45',
 '2019-12-26 14:15:40']

In [71]:
# Overview of table
sql_query = """
SELECT *
FROM temp_table;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,index,user_id,action,timestamp
0,0,123,start,2-14-20 3:05pm
1,1,123,cancel,2-14-20 3:06pm
2,2,456,start,2-15-20 5:46pm
3,3,456,publish,2-15-20 5:50pm


# QotD 4/16/20

## SQL question 

A table called Users contains columns for user id, country, and sign up date. Write a query to return, for each country, the first and last user to sign up.

In [112]:
# Overview of table
sql_query = """
WITH users (user_id, product_id, time_stamp)
AS (VALUES
(1, 'China', CAST('4-12-20' AS date)),
(2, 'US', CAST('4-13-20' AS date)),
(3, 'Canada', CAST('4-14-20' AS date)),
(4, 'China', CAST('4-15-20' AS date)),
(5, 'Mexico', CAST('4-16-20' AS date)),
(6, 'China', CAST('4-17-20' AS date)),
(7, 'US', CAST('4-17-20' AS date)))

SELECT *
FROM users;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,user_id,product_id,time_stamp
0,1,China,2020-04-12
1,2,US,2020-04-13
2,3,Canada,2020-04-14
3,4,China,2020-04-15
4,5,Mexico,2020-04-16
5,6,China,2020-04-17
6,7,US,2020-04-17


In [139]:
# Join the earliest and latest sign-ups per country (country is stored in product_id)
sql_query = """
WITH users (user_id, product_id, time_stamp)
AS (VALUES
(1, 'China', CAST('4-12-20' AS date)),
(2, 'US', CAST('4-13-20' AS date)),
(3, 'Canada', CAST('4-14-20' AS date)),
(4, 'China', CAST('4-15-20' AS date)),
(5, 'Mexico', CAST('4-16-20' AS date)),
(6, 'China', CAST('4-17-20' AS date)),
(7, 'US', CAST('4-17-20' AS date)))


SELECT uu1.product_id,
       uu1.user_id AS early_user,
       uu2.user_id AS late_user

FROM

    -- Make a table for the early users for each country
    (SELECT u.product_id,
           u.user_id
    FROM users u
    JOIN
        (SELECT product_id AS product_early,
                MIN(time_stamp) AS earliest_date
        FROM users
        GROUP BY product_id) AS u1
    ON u.time_stamp=u1.earliest_date
    AND u.product_id=u1.product_early) AS uu1

JOIN

    -- JOIN a table for the late users for each country
    (SELECT u.product_id,
           u.user_id
    FROM users u
    JOIN
        (SELECT product_id AS product_late,
                MAX(time_stamp) AS latest_date
        FROM users
        GROUP BY product_id) AS u2
    ON u.time_stamp=u2.latest_date
    AND u.product_id=u2.product_late) AS uu2

ON uu1.product_id=uu2.product_id;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,product_id,early_user,late_user
0,US,2,7
1,Mexico,5,5
2,Canada,3,3
3,China,1,6


In [147]:
# Add each country's earliest and latest sign-up dates with window functions
sql_query = """
WITH users (user_id, product_id, time_stamp)
AS (VALUES
(1, 'China', CAST('4-12-20' AS date)),
(2, 'US', CAST('4-13-20' AS date)),
(3, 'Canada', CAST('4-14-20' AS date)),
(4, 'China', CAST('4-15-20' AS date)),
(5, 'Mexico', CAST('4-16-20' AS date)),
(6, 'China', CAST('4-17-20' AS date)),
(7, 'US', CAST('4-17-20' AS date)))

-- try with aggregate min/max window function
SELECT user_id,
       product_id,
       time_stamp,
       MIN(time_stamp) OVER(PARTITION BY product_id) AS early,
       MAX(time_stamp) OVER(PARTITION BY product_id) AS late
FROM users
;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,user_id,product_id,time_stamp,early,late
0,3,Canada,2020-04-14,2020-04-14,2020-04-14
1,4,China,2020-04-15,2020-04-12,2020-04-17
2,1,China,2020-04-12,2020-04-12,2020-04-17
3,6,China,2020-04-17,2020-04-12,2020-04-17
4,5,Mexico,2020-04-16,2020-04-16,2020-04-16
5,2,US,2020-04-13,2020-04-13,2020-04-17
6,7,US,2020-04-17,2020-04-13,2020-04-17


In [156]:
# Keep only the rows that match the earliest or latest sign-up date
sql_query = """
WITH users (user_id, product_id, time_stamp)
AS (VALUES
(1, 'China', CAST('4-12-20' AS date)),
(2, 'US', CAST('4-13-20' AS date)),
(3, 'Canada', CAST('4-14-20' AS date)),
(4, 'China', CAST('4-15-20' AS date)),
(5, 'Mexico', CAST('4-16-20' AS date)),
(6, 'China', CAST('4-17-20' AS date)),
(7, 'US', CAST('4-17-20' AS date)))

-- try with aggregate min/max window function, then filter in the outer query

SELECT *
FROM
    (SELECT user_id,
           product_id,
           time_stamp,
           MIN(time_stamp) OVER(PARTITION BY product_id) AS early,
           MAX(time_stamp) OVER(PARTITION BY product_id) AS late
    FROM users) u
WHERE u.time_stamp=u.early OR u.time_stamp=u.late
;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,user_id,product_id,time_stamp,early,late
0,3,Canada,2020-04-14,2020-04-14,2020-04-14
1,1,China,2020-04-12,2020-04-12,2020-04-17
2,6,China,2020-04-17,2020-04-12,2020-04-17
3,5,Mexico,2020-04-16,2020-04-16,2020-04-16
4,2,US,2020-04-13,2020-04-13,2020-04-17
5,7,US,2020-04-17,2020-04-13,2020-04-17


In [160]:
# Use CASE to split the matching rows into early_user and late_user columns
sql_query = """
WITH users (user_id, product_id, time_stamp)
AS (VALUES
(1, 'China', CAST('4-12-20' AS date)),
(2, 'US', CAST('4-13-20' AS date)),
(3, 'Canada', CAST('4-14-20' AS date)),
(4, 'China', CAST('4-15-20' AS date)),
(5, 'Mexico', CAST('4-16-20' AS date)),
(6, 'China', CAST('4-17-20' AS date)),
(7, 'US', CAST('4-17-20' AS date)))

-- try with aggregate min/max window function and case

SELECT product_id,
       CASE WHEN time_stamp=early THEN user_id END early_user,
       CASE WHEN time_stamp=late THEN user_id END late_user
FROM
    (SELECT user_id,
           product_id,
           time_stamp,
           MIN(time_stamp) OVER(PARTITION BY product_id) AS early,
           MAX(time_stamp) OVER(PARTITION BY product_id) AS late
    FROM users) u
;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,product_id,early_user,late_user
0,Canada,3.0,3.0
1,China,,
2,China,1.0,
3,China,,6.0
4,Mexico,5.0,5.0
5,US,2.0,
6,US,,7.0
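
The CASE output above still has one row per user, with gaps where neither condition matches. One way to collapse it to a single row per country is to wrap each `CASE` in `MAX()` and group by country. The sketch below is a stand-in that runs on Python's built-in `sqlite3` (window functions need SQLite 3.25 or newer); `demo_con` and `collapse_query` are names invented for the example. If two users share the earliest or latest date, `MAX()` simply picks the larger `user_id`.

```python
import sqlite3

# Same sample rows as the notebook (country is stored in product_id).
sample_users = [
    (1, 'China', '2020-04-12'), (2, 'US', '2020-04-13'),
    (3, 'Canada', '2020-04-14'), (4, 'China', '2020-04-15'),
    (5, 'Mexico', '2020-04-16'), (6, 'China', '2020-04-17'),
    (7, 'US', '2020-04-17'),
]

demo_con = sqlite3.connect(":memory:")
demo_con.execute(
    "CREATE TABLE users (user_id INTEGER, product_id TEXT, time_stamp TEXT)")
demo_con.executemany("INSERT INTO users VALUES (?, ?, ?)", sample_users)

# MAX() ignores the NULLs produced by CASE, leaving one value per country.
collapse_query = """
SELECT product_id,
       MAX(CASE WHEN time_stamp = early THEN user_id END) AS early_user,
       MAX(CASE WHEN time_stamp = late THEN user_id END) AS late_user
FROM
    (SELECT user_id,
            product_id,
            time_stamp,
            MIN(time_stamp) OVER (PARTITION BY product_id) AS early,
            MAX(time_stamp) OVER (PARTITION BY product_id) AS late
     FROM users) u
GROUP BY product_id
ORDER BY product_id;
"""
first_last = demo_con.execute(collapse_query).fetchall()
print(first_last)
```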


In [131]:
# Chuchu - may not be right?
sql_query = """
WITH users (user_id, product_id, time_stamp)
AS (VALUES
(1, 'China', CAST('4-12-20' AS date)),
(2, 'US', CAST('4-13-20' AS date)),
(3, 'Canada', CAST('4-14-20' AS date)),
(4, 'China', CAST('4-15-20' AS date)),
(5, 'Mexico', CAST('4-16-20' AS date)),
(6, 'China', CAST('4-17-20' AS date)),
(7, 'US', CAST('4-17-20' AS date)))

SELECT s.product_id, u.user_id AS first, u1.user_id AS last
FROM(
SELECT product_id, MAX(time_stamp) as laststamp, MIN(time_stamp) as firststamp
FROM users
GROUP BY product_id) s
JOIN users u
ON u.product_id = s.product_id AND u.time_stamp = s.firststamp
JOIN users u1
ON u.product_id = s.product_id AND u.time_stamp = s.laststamp

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,product_id,first,last
0,Canada,3,1
1,Canada,3,2
2,Canada,3,3
3,Canada,3,4
4,Canada,3,5
5,Canada,3,6
6,Canada,3,7
7,Mexico,5,1
8,Mexico,5,2
9,Mexico,5,3
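
In the query above, the second join's `ON` clause tests `u` again instead of `u1`, so `u1` is never constrained. Countries whose first and last sign-up coincide (Canada, Mexico) get paired with every user, and the other countries drop out. A possible correction, sketched with Python's built-in `sqlite3` as a stand-in for PostgreSQL, points both conditions at `u1`. The aliases `first_user` and `last_user` are renamed here only to avoid keyword clashes.

```python
import sqlite3

# Same sample rows as the notebook (country is stored in product_id).
sample_users = [
    (1, 'China', '2020-04-12'), (2, 'US', '2020-04-13'),
    (3, 'Canada', '2020-04-14'), (4, 'China', '2020-04-15'),
    (5, 'Mexico', '2020-04-16'), (6, 'China', '2020-04-17'),
    (7, 'US', '2020-04-17'),
]

demo_con = sqlite3.connect(":memory:")
demo_con.execute(
    "CREATE TABLE users (user_id INTEGER, product_id TEXT, time_stamp TEXT)")
demo_con.executemany("INSERT INTO users VALUES (?, ?, ?)", sample_users)

# u matches the earliest sign-up, u1 matches the latest sign-up.
fixed_query = """
SELECT s.product_id,
       u.user_id AS first_user,
       u1.user_id AS last_user
FROM (SELECT product_id,
             MAX(time_stamp) AS laststamp,
             MIN(time_stamp) AS firststamp
      FROM users
      GROUP BY product_id) s
JOIN users u
ON u.product_id = s.product_id AND u.time_stamp = s.firststamp
JOIN users u1
ON u1.product_id = s.product_id AND u1.time_stamp = s.laststamp
ORDER BY s.product_id;
"""
fixed_rows = demo_con.execute(fixed_query).fetchall()
print(fixed_rows)
```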


## Mike's problem

Write an SQL query that makes recommendations using the pages that your friends liked. Assume you have two tables: a table of users and friends and table of users and pages they liked. It should not recommend pages you already like.

In [165]:
sql_query = """

WITH friends (user_id, friends)
AS (VALUES
(1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (3, 4), (4, 1), (4, 3)),

likes (user_id, page_likes)
AS (VALUES
(1, 'A'), (1, 'B'), (1, 'C'), (2, 'A'), (3, 'B'), (3, 'C'), (4, 'B'))

SELECT *
FROM likes

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,user_id,page_likes
0,1,A
1,1,B
2,1,C
3,2,A
4,3,B
5,3,C
6,4,B


In [166]:
sql_query = """

WITH friends (user_id, friends)
AS (VALUES
(1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (3, 4), (4, 1), (4, 3)),

likes (user_id, page_likes)
AS (VALUES
(1, 'A'), (1, 'B'), (1, 'C'), (2, 'A'), (3, 'B'), (3, 'C'), (4, 'B'))

SELECT *
FROM friends

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,user_id,friends
0,1,2
1,1,3
2,1,4
3,2,1
4,3,1
5,3,4
6,4,1
7,4,3


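Strategy: for each user, count how many of their friends like each page.

- exclude the user themselves from the count
- exclude pages the user already likes

Friend-like counts per page, before excluding pages the user already likes:

| user_id | A | B | C |
|---|---|---|---|
| 1 | 1 | 2 | 1 |
| 2 | 1 | 1 | 1 |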


In [175]:
# Get the pages liked by each user's friends

sql_query = """

WITH friends (user_id, friends)
AS (VALUES
(1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (3, 4), (4, 1), (4, 3)),

likes (user_id, page_likes)
AS (VALUES
(1, 'A'), (1, 'B'), (1, 'C'), (2, 'A'), (3, 'B'), (3, 'C'), (4, 'B'))

SELECT *
FROM friends
JOIN likes
ON friends.friends=likes.user_id

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,user_id,friends,user_id.1,page_likes
0,1,2,2,A
1,1,3,3,C
2,1,3,3,B
3,1,4,4,B
4,2,1,1,C
5,2,1,1,B
6,2,1,1,A
7,3,1,1,C
8,3,1,1,B
9,3,1,1,A


In [195]:
# Get each user's friends' likes, keeping only the columns needed for the next step

sql_query = """
WITH friends (user_id, friends)
AS (VALUES
(1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (3, 4), (4, 1), (4, 3)),

likes (user_id, page_likes)
AS (VALUES
(1, 'A'), (1, 'B'), (1, 'C'), (2, 'A'), (3, 'B'), (3, 'C'), (4, 'B'))

SELECT f.user_id,
       f.friends,
       l.page_likes
FROM friends f
JOIN likes l
ON f.friends=l.user_id

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,user_id,friends,page_likes
0,1,2,A
1,1,3,C
2,1,3,B
3,1,4,B
4,2,1,C
5,2,1,B
6,2,1,A
7,3,1,C
8,3,1,B
9,3,1,A


In [202]:
# Get each user's friends' likes, excluding pages the user already likes

sql_query = """
WITH friends (user_id, friends)
AS (VALUES
(1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (3, 4), (4, 1), (4, 3)),

likes (user_id, page_likes)
AS (VALUES
(1, 'A'), (1, 'B'), (1, 'C'), (2, 'A'), (3, 'B'), (3, 'C'), (4, 'B'))

SELECT f.user_id,
       f.friends AS recommending_friend,
       l.page_likes AS liked_by_friend
FROM friends f
JOIN likes l
ON f.friends=l.user_id
WHERE l.page_likes NOT IN (SELECT DISTINCT(page_likes)
                           FROM likes l2
                           WHERE l2.user_id=f.user_id)
ORDER BY f.user_id;

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,user_id,recommending_friend,liked_by_friend
0,2,1,B
1,2,1,C
2,3,1,A
3,4,1,A
4,4,1,C
5,4,3,C
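
The result above lists one row per recommending friend, so a page can appear more than once for the same user. A natural last step is to group by user and page, which also gives a count that can be used for ranking. The sketch below is an assumption-laden stand-in: it uses Python's built-in `sqlite3`, swaps `NOT IN` for an equivalent `NOT EXISTS`, and introduces the illustrative names `rec_query` and `n_friends`.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE friends (user_id INTEGER, friends INTEGER)")
demo_con.execute("CREATE TABLE likes (user_id INTEGER, page_likes TEXT)")
demo_con.executemany("INSERT INTO friends VALUES (?, ?)", [
    (1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (3, 4), (4, 1), (4, 3)])
demo_con.executemany("INSERT INTO likes VALUES (?, ?)", [
    (1, 'A'), (1, 'B'), (1, 'C'), (2, 'A'), (3, 'B'), (3, 'C'), (4, 'B')])

# One row per (user, page): pages liked by friends but not by the user,
# with the number of friends who like each page.
rec_query = """
SELECT f.user_id,
       l.page_likes AS recommended_page,
       COUNT(*) AS n_friends
FROM friends f
JOIN likes l
ON f.friends = l.user_id
WHERE NOT EXISTS (SELECT 1
                  FROM likes own
                  WHERE own.user_id = f.user_id
                  AND own.page_likes = l.page_likes)
GROUP BY f.user_id, l.page_likes
ORDER BY f.user_id, n_friends DESC, recommended_page;
"""
recommendations = demo_con.execute(rec_query).fetchall()
print(recommendations)
```

User 1 already likes every page, so no rows are returned for that user.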


In [193]:
# Get the pages that user 1 already likes

sql_query = """

WITH friends (user_id, friends)
AS (VALUES
(1, 2), (1, 3), (1, 4), (2, 1), (3, 1), (3, 4), (4, 1), (4, 3)),

likes (user_id, page_likes)
AS (VALUES
(1, 'A'), (1, 'B'), (1, 'C'), (2, 'A'), (3, 'B'), (3, 'C'), (4, 'B'))

SELECT DISTINCT(page_likes)
FROM likes
WHERE user_id=1

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,page_likes
0,A
1,C
2,B


# QotD 4/17/20 (I'm leading)

## Brilliant questions

In the last 14 days, which 5 countries had the most people joining the platform?

In [105]:
# Testing query
sql_query = """

WITH user_summary (user_id, join_ts, join_client, country)
AS (VALUES
( 1 , CAST('2020-01-03 21:42:44' AS date), 'android-native', 'Canada'),
( 2 , CAST('2020-01-04 13:57:34' AS date), 'android-native', 'US'),
( 3 , CAST('2020-01-06 05:24:27' AS date), 'mobile-browser', 'US'),
( 4 , CAST('2020-01-06 06:47:06' AS date), 'android-native', 'Canada'),
( 5 , CAST('2020-01-07 05:57:20' AS date), 'mobile-browser', 'Canada'),
( 6 , CAST('2020-01-07 23:55:00' AS date), 'desktop-browser', 'Canada'),
( 7 , CAST('2020-01-08 21:16:25' AS date), 'mobile-browser', 'US'),
( 8 , CAST('2020-01-10 21:16:19' AS date), 'mobile-browser', 'US'),
( 9 , CAST('2020-01-11 02:02:49' AS date), 'desktop-browser', 'US'),
( 10 , CAST('2020-01-11 07:00:19' AS date), 'android-native', 'Mexico')),

page_actions (user_id, page_id, ts, action)
AS (VALUES
( 3,2, CAST('2020-01-03 23:28:36 ' AS date), 'viewed'),
( 3,1, CAST('2020-01-03 23:32:05 ' AS date), 'viewed'),
( 1,3, CAST('2020-01-08 03:02:15 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-08 05:07:58 ' AS date), 'viewed'),
( 7,3, CAST('2020-01-08 13:44:02 ' AS date), 'clicked'),
( 3,1, CAST('2020-01-10 06:34:01 ' AS date), 'clicked'),
( 10,2, CAST('2020-01-13 02:12:42 ' AS date), 'clicked'),
( 6,1, CAST('2020-01-13 14:28:43 ' AS date), 'viewed'),
( 8,3, CAST('2020-01-14 01:29:32 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-14 10:52:38 ' AS date), 'viewed'))

SELECT *
FROM page_actions;
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,user_id,page_id,ts,action
0,3,2,2020-01-03,viewed
1,3,1,2020-01-03,viewed
2,1,3,2020-01-08,viewed
3,10,1,2020-01-08,viewed
4,7,3,2020-01-08,clicked
5,3,1,2020-01-10,clicked
6,10,2,2020-01-13,clicked
7,6,1,2020-01-13,viewed
8,8,3,2020-01-14,viewed
9,10,1,2020-01-14,viewed


### Question 1

In [42]:
# Testing query
sql_query = """

WITH user_summary (user_id, join_ts, join_client, country)
AS (VALUES
( 1 , CAST('2020-01-03 21:42:44' AS date), 'android-native', 'Canada'),
( 2 , CAST('2020-01-04 13:57:34' AS date), 'android-native', 'US'),
( 3 , CAST('2020-01-06 05:24:27' AS date), 'mobile-browser', 'US'),
( 4 , CAST('2020-01-06 06:47:06' AS date), 'android-native', 'Canada'),
( 5 , CAST('2020-01-07 05:57:20' AS date), 'mobile-browser', 'Canada'),
( 6 , CAST('2020-01-07 23:55:00' AS date), 'desktop-browser', 'Canada'),
( 7 , CAST('2020-01-08 21:16:25' AS date), 'mobile-browser', 'US'),
( 8 , CAST('2020-01-10 21:16:19' AS date), 'mobile-browser', 'US'),
( 9 , CAST('2020-01-11 02:02:49' AS date), 'desktop-browser', 'US'),
( 10 , CAST('2020-01-11 07:00:19' AS date), 'android-native', 'Mexico')),

page_actions (user_id, page_id, ts, action)
AS (VALUES
( 3,2, CAST('2020-01-03 23:28:36 ' AS date), 'viewed'),
( 3,1, CAST('2020-01-03 23:32:05 ' AS date), 'viewed'),
( 1,3, CAST('2020-01-08 03:02:15 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-08 05:07:58 ' AS date), 'viewed'),
( 7,3, CAST('2020-01-08 13:44:02 ' AS date), 'clicked'),
( 3,1, CAST('2020-01-10 06:34:01 ' AS date), 'clicked'),
( 10,2, CAST('2020-01-13 02:12:42 ' AS date), 'clicked'),
( 6,1, CAST('2020-01-13 14:28:43 ' AS date), 'viewed'),
( 8,3, CAST('2020-01-14 01:29:32 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-14 10:52:38 ' AS date), 'viewed'))

SELECT country,
       COUNT(*) AS no_joined
FROM user_summary
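-- anchor the 14-day window to the latest join date in this sample (use NOW() on live data)
WHERE join_ts >= (SELECT MAX(join_ts) FROM user_summary) - interval '14 days'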
GROUP BY country
;
"""

df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,country,no_joined
0,US,5
1,Mexico,1
2,Canada,4
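
A "last 14 days" filter needs a fixed reference point, and "top 5" needs an `ORDER BY` with a `LIMIT`. The sketch below shows one way to combine them. It is a stand-in that runs on Python's built-in `sqlite3`, anchors the window to the latest join timestamp in the sample instead of the current time, and adds one hypothetical row (user 11, Brazil) that is not in the notebook so the filter has something to exclude.

```python
import sqlite3

# Join records copied from the sample user_summary data (full timestamps).
joins = [
    (1, '2020-01-03 21:42:44', 'Canada'),
    (2, '2020-01-04 13:57:34', 'US'),
    (3, '2020-01-06 05:24:27', 'US'),
    (4, '2020-01-06 06:47:06', 'Canada'),
    (5, '2020-01-07 05:57:20', 'Canada'),
    (6, '2020-01-07 23:55:00', 'Canada'),
    (7, '2020-01-08 21:16:25', 'US'),
    (8, '2020-01-10 21:16:19', 'US'),
    (9, '2020-01-11 02:02:49', 'US'),
    (10, '2020-01-11 07:00:19', 'Mexico'),
    # Hypothetical extra row (not in the notebook), older than 14 days.
    (11, '2019-12-01 12:00:00', 'Brazil'),
]

demo_con = sqlite3.connect(":memory:")
demo_con.execute(
    "CREATE TABLE user_summary (user_id INTEGER, join_ts TEXT, country TEXT)")
demo_con.executemany("INSERT INTO user_summary VALUES (?, ?, ?)", joins)

# Window start = latest join timestamp minus 14 days; then rank countries.
top_query = """
SELECT country,
       COUNT(*) AS no_joined
FROM user_summary
WHERE join_ts >= datetime((SELECT MAX(join_ts) FROM user_summary), '-14 days')
GROUP BY country
ORDER BY no_joined DESC, country
LIMIT 5;
"""
top_countries = demo_con.execute(top_query).fetchall()
print(top_countries)
```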


### Question 2
For each country, what are the average and total number of pages viewed within 4 days after a user joins?

In [56]:
# Testing query
sql_query = """

WITH user_summary (user_id, join_ts, join_client, country)
AS (VALUES
( 1 , CAST('2020-01-03 21:42:44' AS date), 'android-native', 'Canada'),
( 2 , CAST('2020-01-04 13:57:34' AS date), 'android-native', 'US'),
( 3 , CAST('2020-01-06 05:24:27' AS date), 'mobile-browser', 'US'),
( 4 , CAST('2020-01-06 06:47:06' AS date), 'android-native', 'Canada'),
( 5 , CAST('2020-01-07 05:57:20' AS date), 'mobile-browser', 'Canada'),
( 6 , CAST('2020-01-07 23:55:00' AS date), 'desktop-browser', 'Canada'),
( 7 , CAST('2020-01-08 21:16:25' AS date), 'mobile-browser', 'US'),
( 8 , CAST('2020-01-10 21:16:19' AS date), 'mobile-browser', 'US'),
( 9 , CAST('2020-01-11 02:02:49' AS date), 'desktop-browser', 'US'),
( 10 , CAST('2020-01-11 07:00:19' AS date), 'android-native', 'Mexico')),

page_actions (user_id, page_id, ts, action)
AS (VALUES
( 3,2, CAST('2020-01-03 23:28:36 ' AS date), 'viewed'),
( 3,1, CAST('2020-01-03 23:32:05 ' AS date), 'viewed'),
( 1,3, CAST('2020-01-08 03:02:15 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-08 05:07:58 ' AS date), 'viewed'),
( 7,3, CAST('2020-01-08 13:44:02 ' AS date), 'clicked'),
( 3,1, CAST('2020-01-10 06:34:01 ' AS date), 'clicked'),
( 10,2, CAST('2020-01-13 02:12:42 ' AS date), 'clicked'),
( 6,1, CAST('2020-01-13 14:28:43 ' AS date), 'viewed'),
( 8,3, CAST('2020-01-14 01:29:32 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-14 10:52:38 ' AS date), 'viewed')),


u_counts AS
    (SELECT us.user_id,
           COUNT(*) AS n_viewed
    FROM user_summary us
    JOIN page_actions pa
    ON us.user_id=pa.user_id
    WHERE pa.ts <= us.join_ts + (interval '4 days')
    AND pa.action='viewed'
    GROUP BY us.user_id)
    
SELECT *
FROM u_counts
RIGHT JOIN user_summary us
ON u_counts.user_id=us.user_id;

"""

df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,user_id,n_viewed,user_id.1,join_ts,join_client,country
0,,,1,2020-01-03,android-native,Canada
1,,,2,2020-01-04,android-native,US
2,3.0,2.0,3,2020-01-06,mobile-browser,US
3,,,4,2020-01-06,android-native,Canada
4,,,5,2020-01-07,mobile-browser,Canada
5,,,6,2020-01-07,desktop-browser,Canada
6,,,7,2020-01-08,mobile-browser,US
7,8.0,1.0,8,2020-01-10,mobile-browser,US
8,,,9,2020-01-11,desktop-browser,US
9,10.0,2.0,10,2020-01-11,android-native,Mexico


In [61]:
# Trying with CASE
sql_query = """

WITH user_summary (user_id, join_ts, join_client, country)
AS (VALUES
( 1 , CAST('2020-01-03 21:42:44' AS date), 'android-native', 'Canada'),
( 2 , CAST('2020-01-04 13:57:34' AS date), 'android-native', 'US'),
( 3 , CAST('2020-01-06 05:24:27' AS date), 'mobile-browser', 'US'),
( 4 , CAST('2020-01-06 06:47:06' AS date), 'android-native', 'Canada'),
( 5 , CAST('2020-01-07 05:57:20' AS date), 'mobile-browser', 'Canada'),
( 6 , CAST('2020-01-07 23:55:00' AS date), 'desktop-browser', 'Canada'),
( 7 , CAST('2020-01-08 21:16:25' AS date), 'mobile-browser', 'US'),
( 8 , CAST('2020-01-10 21:16:19' AS date), 'mobile-browser', 'US'),
( 9 , CAST('2020-01-11 02:02:49' AS date), 'desktop-browser', 'US'),
( 10 , CAST('2020-01-11 07:00:19' AS date), 'android-native', 'Mexico')),

page_actions (user_id, page_id, ts, action)
AS (VALUES
( 3,2, CAST('2020-01-03 23:28:36 ' AS date), 'viewed'),
( 3,1, CAST('2020-01-03 23:32:05 ' AS date), 'viewed'),
( 1,3, CAST('2020-01-08 03:02:15 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-08 05:07:58 ' AS date), 'viewed'),
( 7,3, CAST('2020-01-08 13:44:02 ' AS date), 'clicked'),
( 3,1, CAST('2020-01-10 06:34:01 ' AS date), 'clicked'),
( 10,2, CAST('2020-01-13 02:12:42 ' AS date), 'clicked'),
( 6,1, CAST('2020-01-13 14:28:43 ' AS date), 'viewed'),
( 8,3, CAST('2020-01-14 01:29:32 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-14 10:52:38 ' AS date), 'viewed'))


SELECT us.user_id,
       SUM(CASE WHEN action='viewed' THEN 1 ELSE 0 END) AS n_views
FROM user_summary us
JOIN page_actions pa
ON us.user_id=pa.user_id
WHERE pa.ts <= us.join_ts + (interval '4 days')
GROUP BY us.user_id

"""

df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,user_id,n_views
0,8,1
1,10,2
2,7,0
3,3,2


In [68]:
# Trying with CASE, without time restriction
sql_query = """

WITH user_summary (user_id, join_ts, join_client, country)
AS (VALUES
( 1 , CAST('2020-01-03 21:42:44' AS date), 'android-native', 'Canada'),
( 2 , CAST('2020-01-04 13:57:34' AS date), 'android-native', 'US'),
( 3 , CAST('2020-01-06 05:24:27' AS date), 'mobile-browser', 'US'),
( 4 , CAST('2020-01-06 06:47:06' AS date), 'android-native', 'Canada'),
( 5 , CAST('2020-01-07 05:57:20' AS date), 'mobile-browser', 'Canada'),
( 6 , CAST('2020-01-07 23:55:00' AS date), 'desktop-browser', 'Canada'),
( 7 , CAST('2020-01-08 21:16:25' AS date), 'mobile-browser', 'US'),
( 8 , CAST('2020-01-10 21:16:19' AS date), 'mobile-browser', 'US'),
( 9 , CAST('2020-01-11 02:02:49' AS date), 'desktop-browser', 'US'),
( 10 , CAST('2020-01-11 07:00:19' AS date), 'android-native', 'Mexico')),

page_actions (user_id, page_id, ts, action)
AS (VALUES
( 3,2, CAST('2020-01-03 23:28:36 ' AS date), 'viewed'),
( 3,1, CAST('2020-01-03 23:32:05 ' AS date), 'viewed'),
( 1,3, CAST('2020-01-08 03:02:15 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-08 05:07:58 ' AS date), 'viewed'),
( 7,3, CAST('2020-01-08 13:44:02 ' AS date), 'clicked'),
( 3,1, CAST('2020-01-10 06:34:01 ' AS date), 'clicked'),
( 10,2, CAST('2020-01-13 02:12:42 ' AS date), 'clicked'),
( 6,1, CAST('2020-01-13 14:28:43 ' AS date), 'viewed'),
( 8,3, CAST('2020-01-14 01:29:32 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-14 10:52:38 ' AS date), 'viewed'))


SELECT us.user_id,
       SUM(CASE WHEN action='viewed' THEN 1 ELSE 0 END) AS n_views
FROM user_summary us
LEFT JOIN page_actions pa
ON us.user_id=pa.user_id
GROUP BY us.user_id
ORDER BY us.user_id

"""

df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,user_id,n_views
0,1,1
1,2,0
2,3,2
3,4,0
4,5,0
5,6,1
6,7,0
7,8,1
8,9,0
9,10,2


In [109]:
# Finalize query (note: still without the 4-day restriction)
sql_query = """

WITH user_summary (user_id, join_ts, join_client, country)
AS (VALUES
( 1 , CAST('2020-01-03 21:42:44' AS date), 'android-native', 'Canada'),
( 2 , CAST('2020-01-04 13:57:34' AS date), 'android-native', 'US'),
( 3 , CAST('2020-01-06 05:24:27' AS date), 'mobile-browser', 'US'),
( 4 , CAST('2020-01-06 06:47:06' AS date), 'android-native', 'Canada'),
( 5 , CAST('2020-01-07 05:57:20' AS date), 'mobile-browser', 'Canada'),
( 6 , CAST('2020-01-07 23:55:00' AS date), 'desktop-browser', 'Canada'),
( 7 , CAST('2020-01-08 21:16:25' AS date), 'mobile-browser', 'US'),
( 8 , CAST('2020-01-10 21:16:19' AS date), 'mobile-browser', 'US'),
( 9 , CAST('2020-01-11 02:02:49' AS date), 'desktop-browser', 'US'),
( 10 , CAST('2020-01-11 07:00:19' AS date), 'android-native', 'Mexico')),

page_actions (user_id, page_id, ts, action)
AS (VALUES
( 3,2, CAST('2020-01-03 23:28:36 ' AS date), 'viewed'),
( 3,1, CAST('2020-01-03 23:32:05 ' AS date), 'viewed'),
( 1,3, CAST('2020-01-08 03:02:15 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-08 05:07:58 ' AS date), 'viewed'),
( 7,3, CAST('2020-01-08 13:44:02 ' AS date), 'clicked'),
( 3,1, CAST('2020-01-10 06:34:01 ' AS date), 'clicked'),
( 10,2, CAST('2020-01-13 02:12:42 ' AS date), 'clicked'),
( 6,1, CAST('2020-01-13 14:28:43 ' AS date), 'viewed'),
( 8,3, CAST('2020-01-14 01:29:32 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-14 10:52:38 ' AS date), 'viewed')),


u_counts AS
    (SELECT us.user_id,
       SUM(CASE WHEN action='viewed' THEN 1 ELSE 0 END) AS n_views
    FROM user_summary us
    LEFT JOIN page_actions pa
    ON us.user_id=pa.user_id
    GROUP BY us.user_id
    ORDER BY us.user_id)

SELECT country,
       AVG(n_views) AS avg_views,
       SUM(n_views) AS total_views
FROM u_counts
JOIN user_summary us
ON u_counts.user_id=us.user_id
GROUP BY country;

"""

df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,country,avg_views,total_views
0,Mexico,2.0,2.0
1,US,0.6,3.0
2,Canada,0.5,2.0
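
The finalized query above counts every view, while the question asks only for views within 4 days after joining. One way to add that window without losing users who have no qualifying views is to move the time test inside the `CASE`. The sketch below is a stand-in under assumptions: it runs on Python's built-in `sqlite3`, keeps the full timestamps as text instead of casting to `date`, and counts only views at or after the join time.

```python
import sqlite3

joins = [
    (1, '2020-01-03 21:42:44', 'Canada'), (2, '2020-01-04 13:57:34', 'US'),
    (3, '2020-01-06 05:24:27', 'US'), (4, '2020-01-06 06:47:06', 'Canada'),
    (5, '2020-01-07 05:57:20', 'Canada'), (6, '2020-01-07 23:55:00', 'Canada'),
    (7, '2020-01-08 21:16:25', 'US'), (8, '2020-01-10 21:16:19', 'US'),
    (9, '2020-01-11 02:02:49', 'US'), (10, '2020-01-11 07:00:19', 'Mexico'),
]
actions = [
    (3, '2020-01-03 23:28:36', 'viewed'), (3, '2020-01-03 23:32:05', 'viewed'),
    (1, '2020-01-08 03:02:15', 'viewed'), (10, '2020-01-08 05:07:58', 'viewed'),
    (7, '2020-01-08 13:44:02', 'clicked'), (3, '2020-01-10 06:34:01', 'clicked'),
    (10, '2020-01-13 02:12:42', 'clicked'), (6, '2020-01-13 14:28:43', 'viewed'),
    (8, '2020-01-14 01:29:32', 'viewed'), (10, '2020-01-14 10:52:38', 'viewed'),
]

demo_con = sqlite3.connect(":memory:")
demo_con.execute(
    "CREATE TABLE user_summary (user_id INTEGER, join_ts TEXT, country TEXT)")
demo_con.execute(
    "CREATE TABLE page_actions (user_id INTEGER, ts TEXT, action TEXT)")
demo_con.executemany("INSERT INTO user_summary VALUES (?, ?, ?)", joins)
demo_con.executemany("INSERT INTO page_actions VALUES (?, ?, ?)", actions)

# The time window lives inside CASE, so the LEFT JOIN still keeps users
# whose views fall outside the window (they contribute 0).
window_query = """
WITH u_counts AS (
    SELECT us.user_id,
           us.country,
           SUM(CASE WHEN pa.action = 'viewed'
                     AND pa.ts >= us.join_ts
                     AND pa.ts <= datetime(us.join_ts, '+4 days')
                    THEN 1 ELSE 0 END) AS n_views
    FROM user_summary us
    LEFT JOIN page_actions pa
    ON us.user_id = pa.user_id
    GROUP BY us.user_id, us.country)
SELECT country,
       AVG(n_views) AS avg_views,
       SUM(n_views) AS total_views
FROM u_counts
GROUP BY country
ORDER BY country;
"""
views_by_country = demo_con.execute(window_query).fetchall()
print(views_by_country)
```

With full timestamps, only users 8 and 10 have a view inside their 4-day window, so the totals are smaller than in the unrestricted output.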


### Question 3

For the US, if someone clicked on a page within 4 hours after joining, what % of them joined via android-native? Trend this by the date someone joined.

In [77]:
# remove country and time restriction, take out distinct
sql_query = """

WITH user_summary (user_id, join_ts, join_client, country)
AS (VALUES
( 1 , CAST('2020-01-03 21:42:44' AS date), 'android-native', 'Canada'),
( 2 , CAST('2020-01-04 13:57:34' AS date), 'android-native', 'US'),
( 3 , CAST('2020-01-06 05:24:27' AS date), 'mobile-browser', 'US'),
( 4 , CAST('2020-01-06 06:47:06' AS date), 'android-native', 'Canada'),
( 5 , CAST('2020-01-07 05:57:20' AS date), 'mobile-browser', 'Canada'),
( 6 , CAST('2020-01-07 23:55:00' AS date), 'desktop-browser', 'Canada'),
( 7 , CAST('2020-01-08 21:16:25' AS date), 'mobile-browser', 'US'),
( 8 , CAST('2020-01-10 21:16:19' AS date), 'mobile-browser', 'US'),
( 9 , CAST('2020-01-11 02:02:49' AS date), 'desktop-browser', 'US'),
( 10 , CAST('2020-01-11 07:00:19' AS date), 'android-native', 'Mexico')),

page_actions (user_id, page_id, ts, action)
AS (VALUES
( 3,2, CAST('2020-01-03 23:28:36 ' AS date), 'viewed'),
( 3,1, CAST('2020-01-03 23:32:05 ' AS date), 'viewed'),
( 1,3, CAST('2020-01-08 03:02:15 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-08 05:07:58 ' AS date), 'viewed'),
( 7,3, CAST('2020-01-08 13:44:02 ' AS date), 'clicked'),
( 3,1, CAST('2020-01-10 06:34:01 ' AS date), 'clicked'),
( 10,2, CAST('2020-01-13 02:12:42 ' AS date), 'clicked'),
( 6,1, CAST('2020-01-13 14:28:43 ' AS date), 'viewed'),
( 8,3, CAST('2020-01-14 01:29:32 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-14 10:52:38 ' AS date), 'viewed'))


--limit to US
--limit to time interval
--numerator android, denominator all devices

SELECT us.user_id,
       us.join_client
FROM user_summary us
JOIN page_actions pa
ON us.user_id=pa.user_id
--WHERE pa.action = 'clicked'
--AND us.country='US'
--AND pa.ts <= us.join_ts + interval '4 hours'


"""

df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,user_id,join_client
0,1,android-native
1,3,mobile-browser
2,3,mobile-browser
3,3,mobile-browser
4,6,desktop-browser
5,7,mobile-browser
6,8,mobile-browser
7,10,android-native
8,10,android-native
9,10,android-native


In [78]:
# remove country and time restriction
sql_query = """

WITH user_summary (user_id, join_ts, join_client, country)
AS (VALUES
( 1 , CAST('2020-01-03 21:42:44' AS date), 'android-native', 'Canada'),
( 2 , CAST('2020-01-04 13:57:34' AS date), 'android-native', 'US'),
( 3 , CAST('2020-01-06 05:24:27' AS date), 'mobile-browser', 'US'),
( 4 , CAST('2020-01-06 06:47:06' AS date), 'android-native', 'Canada'),
( 5 , CAST('2020-01-07 05:57:20' AS date), 'mobile-browser', 'Canada'),
( 6 , CAST('2020-01-07 23:55:00' AS date), 'desktop-browser', 'Canada'),
( 7 , CAST('2020-01-08 21:16:25' AS date), 'mobile-browser', 'US'),
( 8 , CAST('2020-01-10 21:16:19' AS date), 'mobile-browser', 'US'),
( 9 , CAST('2020-01-11 02:02:49' AS date), 'desktop-browser', 'US'),
( 10 , CAST('2020-01-11 07:00:19' AS date), 'android-native', 'Mexico')),

page_actions (user_id, page_id, ts, action)
AS (VALUES
( 3,2, CAST('2020-01-03 23:28:36 ' AS date), 'viewed'),
( 3,1, CAST('2020-01-03 23:32:05 ' AS date), 'viewed'),
( 1,3, CAST('2020-01-08 03:02:15 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-08 05:07:58 ' AS date), 'viewed'),
( 7,3, CAST('2020-01-08 13:44:02 ' AS date), 'clicked'),
( 3,1, CAST('2020-01-10 06:34:01 ' AS date), 'clicked'),
( 10,2, CAST('2020-01-13 02:12:42 ' AS date), 'clicked'),
( 6,1, CAST('2020-01-13 14:28:43 ' AS date), 'viewed'),
( 8,3, CAST('2020-01-14 01:29:32 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-14 10:52:38 ' AS date), 'viewed'))


--limit to US (don't worry about this)
--limit to time interval (don't worry about this)
--numerator android, denominator all devices

SELECT DISTINCT(us.user_id),
       us.join_client
FROM user_summary us
JOIN page_actions pa
ON us.user_id=pa.user_id
"""

df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,user_id,join_client
0,1,android-native
1,10,android-native
2,3,mobile-browser
3,8,mobile-browser
4,7,mobile-browser
5,6,desktop-browser


In [85]:
# remove country and time restriction
sql_query = """

WITH user_summary (user_id, join_ts, join_client, country)
AS (VALUES
( 1 , CAST('2020-01-03 21:42:44' AS date), 'android-native', 'Canada'),
( 2 , CAST('2020-01-04 13:57:34' AS date), 'android-native', 'US'),
( 3 , CAST('2020-01-06 05:24:27' AS date), 'mobile-browser', 'US'),
( 4 , CAST('2020-01-06 06:47:06' AS date), 'android-native', 'Canada'),
( 5 , CAST('2020-01-07 05:57:20' AS date), 'mobile-browser', 'Canada'),
( 6 , CAST('2020-01-07 23:55:00' AS date), 'desktop-browser', 'Canada'),
( 7 , CAST('2020-01-08 21:16:25' AS date), 'mobile-browser', 'US'),
( 8 , CAST('2020-01-10 21:16:19' AS date), 'mobile-browser', 'US'),
( 9 , CAST('2020-01-11 02:02:49' AS date), 'desktop-browser', 'US'),
( 10 , CAST('2020-01-11 07:00:19' AS date), 'android-native', 'Mexico')),

page_actions (user_id, page_id, ts, action)
AS (VALUES
( 3,2, CAST('2020-01-03 23:28:36 ' AS date), 'viewed'),
( 3,1, CAST('2020-01-03 23:32:05 ' AS date), 'viewed'),
( 1,3, CAST('2020-01-08 03:02:15 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-08 05:07:58 ' AS date), 'viewed'),
( 7,3, CAST('2020-01-08 13:44:02 ' AS date), 'clicked'),
( 3,1, CAST('2020-01-10 06:34:01 ' AS date), 'clicked'),
( 10,2, CAST('2020-01-13 02:12:42 ' AS date), 'clicked'),
( 6,1, CAST('2020-01-13 14:28:43 ' AS date), 'viewed'),
( 8,3, CAST('2020-01-14 01:29:32 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-14 10:52:38 ' AS date), 'viewed'))


--limit to US (don't worry about this)
--limit to time interval (don't worry about this)
--numerator android, denominator all devices


SELECT *
FROM user_summary us1
JOIN
    (SELECT DISTINCT(us.user_id),
           us.join_client
    FROM user_summary us
    JOIN page_actions pa
    ON us.user_id=pa.user_id) AS t
ON us1.user_id=t.user_id
    
"""

df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,user_id,join_ts,join_client,country,user_id.1,join_client.1
0,1,2020-01-03,android-native,Canada,1,android-native
1,3,2020-01-06,mobile-browser,US,3,mobile-browser
2,6,2020-01-07,desktop-browser,Canada,6,desktop-browser
3,7,2020-01-08,mobile-browser,US,7,mobile-browser
4,8,2020-01-10,mobile-browser,US,8,mobile-browser
5,10,2020-01-11,android-native,Mexico,10,android-native


In [93]:
# remove country and time restriction
sql_query = """

WITH user_summary (user_id, join_ts, join_client, country)
AS (VALUES
( 1 , CAST('2020-01-03 21:42:44' AS date), 'android-native', 'Canada'),
( 2 , CAST('2020-01-04 13:57:34' AS date), 'android-native', 'US'),
( 3 , CAST('2020-01-06 05:24:27' AS date), 'mobile-browser', 'US'),
( 4 , CAST('2020-01-06 06:47:06' AS date), 'android-native', 'Canada'),
( 5 , CAST('2020-01-07 05:57:20' AS date), 'mobile-browser', 'Canada'),
( 6 , CAST('2020-01-07 23:55:00' AS date), 'desktop-browser', 'Canada'),
( 7 , CAST('2020-01-08 21:16:25' AS date), 'mobile-browser', 'US'),
( 8 , CAST('2020-01-10 21:16:19' AS date), 'mobile-browser', 'US'),
( 9 , CAST('2020-01-11 02:02:49' AS date), 'desktop-browser', 'US'),
( 10 , CAST('2020-01-11 07:00:19' AS date), 'android-native', 'Mexico')),

page_actions (user_id, page_id, ts, action)
AS (VALUES
( 3,2, CAST('2020-01-03 23:28:36 ' AS date), 'viewed'),
( 3,1, CAST('2020-01-03 23:32:05 ' AS date), 'viewed'),
( 1,3, CAST('2020-01-08 03:02:15 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-08 05:07:58 ' AS date), 'viewed'),
( 7,3, CAST('2020-01-08 13:44:02 ' AS date), 'clicked'),
( 3,1, CAST('2020-01-10 06:34:01 ' AS date), 'clicked'),
( 10,2, CAST('2020-01-13 02:12:42 ' AS date), 'clicked'),
( 6,1, CAST('2020-01-13 14:28:43 ' AS date), 'viewed'),
( 8,3, CAST('2020-01-14 01:29:32 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-14 10:52:38 ' AS date), 'viewed'))


--limit to US (don't worry about this)
--limit to time interval (don't worry about this)
--numerator android, denominator all devices


SELECT us1.join_ts,
       SUM(CASE WHEN us1.join_client='android-native' THEN 1 ELSE 0 END) AS num,
       COUNT(*) AS den
FROM user_summary us1
JOIN
    (SELECT DISTINCT us.user_id,
           us.join_client
    FROM user_summary us
    JOIN page_actions pa
    ON us.user_id=pa.user_id) AS t
ON us1.user_id=t.user_id
GROUP BY us1.join_ts
ORDER BY us1.join_ts
    
"""

df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,join_ts,num,den
0,2020-01-03,1,1
1,2020-01-06,0,1
2,2020-01-07,0,1
3,2020-01-08,0,1
4,2020-01-10,0,1
5,2020-01-11,1,1


In [94]:
# finalize query
sql_query = """

WITH user_summary (user_id, join_ts, join_client, country)
AS (VALUES
( 1 , CAST('2020-01-03 21:42:44' AS date), 'android-native', 'Canada'),
( 2 , CAST('2020-01-04 13:57:34' AS date), 'android-native', 'US'),
( 3 , CAST('2020-01-06 05:24:27' AS date), 'mobile-browser', 'US'),
( 4 , CAST('2020-01-06 06:47:06' AS date), 'android-native', 'Canada'),
( 5 , CAST('2020-01-07 05:57:20' AS date), 'mobile-browser', 'Canada'),
( 6 , CAST('2020-01-07 23:55:00' AS date), 'desktop-browser', 'Canada'),
( 7 , CAST('2020-01-08 21:16:25' AS date), 'mobile-browser', 'US'),
( 8 , CAST('2020-01-10 21:16:19' AS date), 'mobile-browser', 'US'),
( 9 , CAST('2020-01-11 02:02:49' AS date), 'desktop-browser', 'US'),
( 10 , CAST('2020-01-11 07:00:19' AS date), 'android-native', 'Mexico')),

page_actions (user_id, page_id, ts, action)
AS (VALUES
( 3,2, CAST('2020-01-03 23:28:36 ' AS date), 'viewed'),
( 3,1, CAST('2020-01-03 23:32:05 ' AS date), 'viewed'),
( 1,3, CAST('2020-01-08 03:02:15 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-08 05:07:58 ' AS date), 'viewed'),
( 7,3, CAST('2020-01-08 13:44:02 ' AS date), 'clicked'),
( 3,1, CAST('2020-01-10 06:34:01 ' AS date), 'clicked'),
( 10,2, CAST('2020-01-13 02:12:42 ' AS date), 'clicked'),
( 6,1, CAST('2020-01-13 14:28:43 ' AS date), 'viewed'),
( 8,3, CAST('2020-01-14 01:29:32 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-14 10:52:38 ' AS date), 'viewed'))


--limit to US (don't worry about this)
--limit to time interval (don't worry about this)
--numerator android, denominator all devices


SELECT us1.join_ts,
       (SUM(CASE WHEN us1.join_client='android-native' THEN 1 ELSE 0 END)::numeric / 
       COUNT(*)) AS pct_android
FROM user_summary us1
JOIN
    (SELECT DISTINCT us.user_id,
           us.join_client
    FROM user_summary us
    JOIN page_actions pa
    ON us.user_id=pa.user_id) AS t
ON us1.user_id=t.user_id
GROUP BY us1.join_ts
ORDER BY us1.join_ts
    
"""

df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,join_ts,pct_android
0,2020-01-03,1.0
1,2020-01-06,0.0
2,2020-01-07,0.0
3,2020-01-08,0.0
4,2020-01-10,0.0
5,2020-01-11,1.0
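
The `::numeric` cast in the final query is what keeps the ratio from being truncated: `SUM(...)` and `COUNT(*)` are both integers, and integer division discards the fractional part. A minimal sketch of that behaviour, assuming an in-memory SQLite database (Python's standard `sqlite3` module) as a stand-in for PostgreSQL so that it runs without a server; `CAST(... AS REAL)` plays the role of `::numeric` here.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; both truncate integer / integer.
conn = sqlite3.connect(":memory:")

int_ratio, real_ratio = conn.execute(
    "SELECT 1 / 3, CAST(1 AS REAL) / 3"
).fetchone()
conn.close()

print(int_ratio)   # integer division drops the fraction
print(real_ratio)  # casting one operand first keeps it
```

Casting only one operand is enough, because the other is promoted before the division happens.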


In [153]:
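# query done with Mike
# Note: SUM(...) / COUNT(*) is integer division here (no ::numeric cast), and the
# WHERE condition on pa.ts drops users with no page actions despite the LEFT JOIN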
sql_query = """

WITH user_summary (user_id, join_ts, join_client, country)
AS (VALUES
( 1 , CAST('2020-01-03 21:42:44' AS date), 'android-native', 'Canada'),
( 2 , CAST('2020-01-04 13:57:34' AS date), 'android-native', 'US'),
( 3 , CAST('2020-01-06 05:24:27' AS date), 'mobile-browser', 'US'),
( 4 , CAST('2020-01-06 06:47:06' AS date), 'android-native', 'Canada'),
( 5 , CAST('2020-01-07 05:57:20' AS date), 'mobile-browser', 'Canada'),
( 6 , CAST('2020-01-07 23:55:00' AS date), 'desktop-browser', 'Canada'),
( 7 , CAST('2020-01-08 21:16:25' AS date), 'mobile-browser', 'US'),
( 8 , CAST('2020-01-10 21:16:19' AS date), 'mobile-browser', 'US'),
( 9 , CAST('2020-01-11 02:02:49' AS date), 'desktop-browser', 'US'),
( 10 , CAST('2020-01-11 07:00:19' AS date), 'android-native', 'Mexico')),

page_actions (user_id, page_id, ts, action)
AS (VALUES
( 3,2, CAST('2020-01-03 23:28:36 ' AS date), 'viewed'),
( 3,1, CAST('2020-01-03 23:32:05 ' AS date), 'viewed'),
( 1,3, CAST('2020-01-08 03:02:15 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-08 05:07:58 ' AS date), 'viewed'),
( 7,3, CAST('2020-01-08 13:44:02 ' AS date), 'clicked'),
( 3,1, CAST('2020-01-10 06:34:01 ' AS date), 'clicked'),
( 10,2, CAST('2020-01-13 02:12:42 ' AS date), 'clicked'),
( 6,1, CAST('2020-01-13 14:28:43 ' AS date), 'viewed'),
( 8,3, CAST('2020-01-14 01:29:32 ' AS date), 'viewed'),
( 10,1, CAST('2020-01-14 10:52:38 ' AS date), 'viewed'))

SELECT t2.join_ts,
       (SUM(CASE WHEN t2.join_client='android-native' THEN 1 ELSE 0 END) /
       COUNT(*)) AS den
FROM 
(SELECT DISTINCT us.user_id,
        SUM(CASE WHEN pa.action='clicked' THEN 1 ELSE 0 END)
FROM user_summary us
LEFT JOIN page_actions pa
ON us.user_id=pa.user_id
WHERE pa.ts-us.join_ts <= 4
AND us.country='US'
GROUP BY us.user_id) t1
JOIN user_summary t2
ON t1.user_id=t2.user_id
GROUP BY t2.join_ts
ORDER BY t2.join_ts


"""

df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,join_ts,den
0,2020-01-06,0
1,2020-01-08,0
2,2020-01-10,0
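
One thing worth checking in the query above is the `LEFT JOIN` followed by a `WHERE` on `pa.ts`: for users with no page actions `pa.ts` is NULL, the comparison is not true, and those rows are dropped, so the join behaves like an inner join. A small sketch of the difference, assuming SQLite (`sqlite3`) as a stand-in for PostgreSQL and a three-user subset of the sample data; `julianday()` is SQLite's way of getting a day difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_summary (user_id INTEGER, join_ts TEXT)")
conn.execute("CREATE TABLE page_actions (user_id INTEGER, ts TEXT)")
# Subset of the sample rows: user 2 has no page actions at all.
conn.executemany("INSERT INTO user_summary VALUES (?, ?)",
                 [(2, "2020-01-04"), (3, "2020-01-06"), (7, "2020-01-08")])
conn.executemany("INSERT INTO page_actions VALUES (?, ?)",
                 [(3, "2020-01-10"), (7, "2020-01-08")])

# Time filter in WHERE: the NULL row for user 2 fails the test and disappears.
where_users = [r[0] for r in conn.execute("""
    SELECT DISTINCT us.user_id
    FROM user_summary us
    LEFT JOIN page_actions pa ON us.user_id = pa.user_id
    WHERE julianday(pa.ts) - julianday(us.join_ts) <= 4
    ORDER BY us.user_id
""")]

# Time filter in ON: user 2 survives with NULLs on the page_actions side.
on_users = [r[0] for r in conn.execute("""
    SELECT DISTINCT us.user_id
    FROM user_summary us
    LEFT JOIN page_actions pa
      ON us.user_id = pa.user_id
     AND julianday(pa.ts) - julianday(us.join_ts) <= 4
    ORDER BY us.user_id
""")]
conn.close()

print(where_users)
print(on_users)
```

Which version is wanted depends on whether users without any activity should count in the denominator.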


In [None]:
# Checking dates

In [100]:
# check join_ts when the strings are cast to timestamp instead of date
sql_query = """

WITH user_summary (user_id, join_ts, join_client, country)
AS (VALUES
( 1 , CAST('2020-01-03 21:42:44' AS timestamp), 'android-native', 'Canada'),
( 2 , CAST('2020-01-04 13:57:34' AS timestamp), 'android-native', 'US'),
( 3 , CAST('2020-01-06 05:24:27' AS timestamp), 'mobile-browser', 'US'),
( 4 , CAST('2020-01-06 06:47:06' AS timestamp), 'android-native', 'Canada'),
( 5 , CAST('2020-01-07 05:57:20' AS timestamp), 'mobile-browser', 'Canada'),
( 6 , CAST('2020-01-07 23:55:00' AS timestamp), 'desktop-browser', 'Canada'),
( 7 , CAST('2020-01-08 21:16:25' AS timestamp), 'mobile-browser', 'US'),
( 8 , CAST('2020-01-10 21:16:19' AS timestamp), 'mobile-browser', 'US'),
( 9 , CAST('2020-01-11 02:02:49' AS timestamp), 'desktop-browser', 'US'),
( 10 , CAST('2020-01-11 07:00:19' AS timestamp), 'android-native', 'Mexico')),

page_actions (user_id, page_id, ts, action)
AS (VALUES
( 3,2, CAST('2020-01-03 23:28:36 ' AS timestamp), 'viewed'),
( 3,1, CAST('2020-01-03 23:32:05 ' AS timestamp), 'viewed'),
( 1,3, CAST('2020-01-08 03:02:15 ' AS timestamp), 'viewed'),
( 10,1, CAST('2020-01-08 05:07:58 ' AS timestamp), 'viewed'),
( 7,3, CAST('2020-01-08 13:44:02 ' AS timestamp), 'clicked'),
( 3,1, CAST('2020-01-10 06:34:01 ' AS timestamp), 'clicked'),
( 10,2, CAST('2020-01-13 02:12:42 ' AS timestamp), 'clicked'),
( 6,1, CAST('2020-01-13 14:28:43 ' AS timestamp), 'viewed'),
( 8,3, CAST('2020-01-14 01:29:32 ' AS timestamp), 'viewed'),
( 10,1, CAST('2020-01-14 10:52:38 ' AS timestamp), 'viewed'))


--limit to US (don't worry about this)
--limit to time interval (don't worry about this)
--numerator android, denominator all devices


SELECT *
FROM user_summary us1
"""

df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,user_id,join_ts,join_client,country
0,1,2020-01-03 21:42:44,android-native,Canada
1,2,2020-01-04 13:57:34,android-native,US
2,3,2020-01-06 05:24:27,mobile-browser,US
3,4,2020-01-06 06:47:06,android-native,Canada
4,5,2020-01-07 05:57:20,mobile-browser,Canada
5,6,2020-01-07 23:55:00,desktop-browser,Canada
6,7,2020-01-08 21:16:25,mobile-browser,US
7,8,2020-01-10 21:16:19,mobile-browser,US
8,9,2020-01-11 02:02:49,desktop-browser,US
9,10,2020-01-11 07:00:19,android-native,Mexico
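
Casting to `date` versus `timestamp` changes what a "day difference" means. In PostgreSQL, `date - date` is an integer number of days, while `timestamp - timestamp` is an interval. The same distinction can be sketched in plain Python with user 8's values from the tables above (join `2020-01-10 21:16:19`, page action `2020-01-14 01:29:32`):

```python
from datetime import datetime

join_ts = datetime(2020, 1, 10, 21, 16, 19)   # user 8's join_ts
action_ts = datetime(2020, 1, 14, 1, 29, 32)  # user 8's page action

# Truncate to dates first: counts calendar-day boundaries crossed.
date_gap = (action_ts.date() - join_ts.date()).days

# Keep the time of day: elapsed time, which is just over three days.
exact_gap = action_ts - join_ts

print(date_gap)
print(exact_gap)
```

A cutoff such as `<= 4` days can therefore include or exclude a row depending on which type the columns were cast to.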


## Mike's problem 2

Two tables: users and events. What fraction of users who accessed feature 2 (F2) upgraded to premium (event type P) within the first month of signing up?

In [5]:
sql_query = """

WITH users (user_id, name, signup)
AS (VALUES
(1, 'Jon', CAST('2-14-20' AS date)),
(2, 'Jane', CAST('2-14-20' AS date)),
(3, 'Jill', CAST('2-15-20' AS date)),
(4, 'Josh', CAST('2-15-20' AS date)),
(5, 'Jean', CAST('2-16-20' AS date)),
(6, 'Justin', CAST('2-17-20' AS date)),
(7, 'Jeremy', CAST('2-18-20' AS date))),

events (user_id, type, access_date)
AS (VALUES
(1, 'F1', CAST('3-1-20' AS date)),
(2, 'F2', CAST('3-2-20' AS date)),
(3, 'F2', CAST('3-15-20' AS date)),
(4, 'F2', CAST('3-15-20' AS date)),
(1, 'P', CAST('3-16-20' AS date)),
(2, 'P', CAST('3-18-20' AS date)),
(3, 'P', CAST('3-22-20' AS date)))

SELECT *
FROM users
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,user_id,name,signup
0,1,Jon,2020-02-14
1,2,Jane,2020-02-14
2,3,Jill,2020-02-15
3,4,Josh,2020-02-15
4,5,Jean,2020-02-16
5,6,Justin,2020-02-17
6,7,Jeremy,2020-02-18


In [6]:
sql_query = """

WITH users (user_id, name, signup)
AS (VALUES
(1, 'Jon', CAST('2-14-20' AS date)),
(2, 'Jane', CAST('2-14-20' AS date)),
(3, 'Jill', CAST('2-15-20' AS date)),
(4, 'Josh', CAST('2-15-20' AS date)),
(5, 'Jean', CAST('2-16-20' AS date)),
(6, 'Justin', CAST('2-17-20' AS date)),
(7, 'Jeremy', CAST('2-18-20' AS date))),

events (user_id, type, access_date)
AS (VALUES
(1, 'F1', CAST('3-1-20' AS date)),
(2, 'F2', CAST('3-2-20' AS date)),
(3, 'F2', CAST('3-15-20' AS date)),
(4, 'F2', CAST('3-15-20' AS date)),
(1, 'P', CAST('3-16-20' AS date)),
(2, 'P', CAST('3-18-20' AS date)),
(3, 'P', CAST('3-22-20' AS date)))

SELECT *
FROM events
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,user_id,type,access_date
0,1,F1,2020-03-01
1,2,F2,2020-03-02
2,3,F2,2020-03-15
3,4,F2,2020-03-15
4,1,P,2020-03-16
5,2,P,2020-03-18
6,3,P,2020-03-22


Strategy:
- limit to users who fit time window
- join tables
- use case statements

In [12]:
sql_query = """

WITH users (user_id, name, signup)
AS (VALUES
(1, 'Jon', CAST('2-14-20' AS date)),
(2, 'Jane', CAST('2-14-20' AS date)),
(3, 'Jill', CAST('2-15-20' AS date)),
(4, 'Josh', CAST('2-15-20' AS date)),
(5, 'Jean', CAST('2-16-20' AS date)),
(6, 'Justin', CAST('2-17-20' AS date)),
(7, 'Jeremy', CAST('2-18-20' AS date))),

events (user_id, type, access_date)
AS (VALUES
(1, 'F1', CAST('3-1-20' AS date)),
(2, 'F2', CAST('3-2-20' AS date)),
(3, 'F2', CAST('3-15-20' AS date)),
(4, 'F2', CAST('3-15-20' AS date)),
(1, 'P', CAST('3-16-20' AS date)),
(2, 'P', CAST('3-18-20' AS date)),
(3, 'P', CAST('3-22-20' AS date)))

SELECT *
FROM users u
JOIN events e
ON u.user_id=e.user_id;


"""
df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,user_id,name,signup,user_id.1,type,access_date
0,1,Jon,2020-02-14,1,P,2020-03-16
1,1,Jon,2020-02-14,1,F1,2020-03-01
2,2,Jane,2020-02-14,2,P,2020-03-18
3,2,Jane,2020-02-14,2,F2,2020-03-02
4,3,Jill,2020-02-15,3,P,2020-03-22
5,3,Jill,2020-02-15,3,F2,2020-03-15
6,4,Josh,2020-02-15,4,F2,2020-03-15


In [19]:
# Testing date difference (in PostgreSQL, date - date returns an integer number of days)

sql_query = """

WITH users (user_id, name, signup)
AS (VALUES
(1, 'Jon', CAST('2-14-20' AS date)),
(2, 'Jane', CAST('2-14-20' AS date)),
(3, 'Jill', CAST('2-15-20' AS date)),
(4, 'Josh', CAST('2-15-20' AS date)),
(5, 'Jean', CAST('2-16-20' AS date)),
(6, 'Justin', CAST('2-17-20' AS date)),
(7, 'Jeremy', CAST('2-18-20' AS date))),

events (user_id, type, access_date)
AS (VALUES
(1, 'F1', CAST('3-1-20' AS date)),
(2, 'F2', CAST('3-2-20' AS date)),
(3, 'F2', CAST('3-15-20' AS date)),
(4, 'F2', CAST('3-15-20' AS date)),
(1, 'P', CAST('3-16-20' AS date)),
(2, 'P', CAST('3-18-20' AS date)),
(3, 'P', CAST('3-22-20' AS date)))

SELECT e.access_date-u.signup
FROM users u
JOIN events e
ON u.user_id=e.user_id

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,?column?
0,31
1,16
2,33
3,17
4,36
5,29
6,29
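
As a sanity check on the first value above, the same subtraction can be done in plain Python for user 1 (signup `2-14-20`, premium event `3-16-20`):

```python
from datetime import date

signup = date(2020, 2, 14)    # Jon's signup
upgrade = date(2020, 3, 16)   # Jon's 'P' event

# Subtracting two dates gives a timedelta; .days is the whole-day count.
days_between = (upgrade - signup).days
print(days_between)  # 31 (2020 is a leap year, so February has 29 days)
```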


In [31]:
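# Try getting month (difference in month numbers)
# Note: the IN subquery selects e.user_id rather than e1.user_id, so all 7 joined rows come back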

sql_query = """

WITH users (user_id, name, signup)
AS (VALUES
(1, 'Jon', CAST('2-14-20' AS date)),
(2, 'Jane', CAST('2-14-20' AS date)),
(3, 'Jill', CAST('2-15-20' AS date)),
(4, 'Josh', CAST('2-15-20' AS date)),
(5, 'Jean', CAST('2-16-20' AS date)),
(6, 'Justin', CAST('2-17-20' AS date)),
(7, 'Jeremy', CAST('2-18-20' AS date))),

events (user_id, type, access_date)
AS (VALUES
(1, 'F1', CAST('3-1-20' AS date)),
(2, 'F2', CAST('3-2-20' AS date)),
(3, 'F2', CAST('3-15-20' AS date)),
(4, 'F2', CAST('3-15-20' AS date)),
(1, 'P', CAST('3-16-20' AS date)),
(2, 'P', CAST('3-18-20' AS date)),
(3, 'P', CAST('3-22-20' AS date)))

SELECT DATE_PART('month', e.access_date)-DATE_PART('month', u.signup)
FROM users u
JOIN events e
ON u.user_id=e.user_id
WHERE u.user_id IN (SELECT e.user_id
                    FROM events e1
                    WHERE e1.type='F2')


"""
df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,?column?
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0
5,1.0
6,1.0
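
Every row comes out as 1.0 because subtracting month numbers only says that the two dates fall in adjacent calendar months, not how far apart they are. A short Python sketch with two events from the sample data (both users signed up on `2-14-20`); the year-boundary dates at the end are hypothetical and only there to show the wrap-around.

```python
from datetime import date

signup = date(2020, 2, 14)
events = [("F1", date(2020, 3, 1)), ("P", date(2020, 3, 18))]

# (label, month-number difference, day difference) for each event
gaps = [(label, d.month - signup.month, (d - signup).days) for label, d in events]
print(gaps)

# Hypothetical dates: a 16-day gap across New Year gives a negative "month" gap.
wrap = date(2021, 1, 5).month - date(2020, 12, 20).month
print(wrap)
```

A 16-day gap and a 33-day gap both look like "1 month" here, which is why the later cells switch to a day difference.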


In [39]:
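# Use a date difference of at most 31 days as the "first month" window.
# Caveat: the IN subquery selects e.user_id (the outer alias) instead of e1.user_id,
# so it never filters anything; the 1 in num comes from user 1, who never accessed F2.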

sql_query = """

WITH users (user_id, name, signup)
AS (VALUES
(1, 'Jon', CAST('2-14-20' AS date)),
(2, 'Jane', CAST('2-14-20' AS date)),
(3, 'Jill', CAST('2-15-20' AS date)),
(4, 'Josh', CAST('2-15-20' AS date)),
(5, 'Jean', CAST('2-16-20' AS date)),
(6, 'Justin', CAST('2-17-20' AS date)),
(7, 'Jeremy', CAST('2-18-20' AS date))),

events (user_id, type, access_date)
AS (VALUES
(1, 'F1', CAST('3-1-20' AS date)),
(2, 'F2', CAST('3-2-20' AS date)),
(3, 'F2', CAST('3-15-20' AS date)),
(4, 'F2', CAST('3-15-20' AS date)),
(1, 'P', CAST('3-16-20' AS date)),
(2, 'P', CAST('3-18-20' AS date)),
(3, 'P', CAST('3-22-20' AS date)))

SELECT SUM(CASE WHEN (e.access_date-u.signup <= 31) AND (e.type='P') THEN 1 END) AS num,
       SUM(CASE WHEN e.type='F2' THEN 1 END) AS den
FROM users u
JOIN events e
ON u.user_id=e.user_id
WHERE u.user_id IN (SELECT e.user_id
                    FROM events e1
                    WHERE e1.type='F2')


"""
df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,num,den
0,1,3


In [29]:
# Same 31-day window, returned as a single proportion (::numeric avoids integer division)

sql_query = """

WITH users (user_id, name, signup)
AS (VALUES
(1, 'Jon', CAST('2-14-20' AS date)),
(2, 'Jane', CAST('2-14-20' AS date)),
(3, 'Jill', CAST('2-15-20' AS date)),
(4, 'Josh', CAST('2-15-20' AS date)),
(5, 'Jean', CAST('2-16-20' AS date)),
(6, 'Justin', CAST('2-17-20' AS date)),
(7, 'Jeremy', CAST('2-18-20' AS date))),

events (user_id, type, access_date)
AS (VALUES
(1, 'F1', CAST('3-1-20' AS date)),
(2, 'F2', CAST('3-2-20' AS date)),
(3, 'F2', CAST('3-15-20' AS date)),
(4, 'F2', CAST('3-15-20' AS date)),
(1, 'P', CAST('3-16-20' AS date)),
(2, 'P', CAST('3-18-20' AS date)),
(3, 'P', CAST('3-22-20' AS date)))

SELECT (SUM(CASE WHEN (e.access_date-u.signup <= 31) AND (e.type='P') THEN 1 END)::numeric / 
        SUM(CASE WHEN e.type='F2' THEN 1 END)) AS proportion
FROM users u
JOIN events e
ON u.user_id=e.user_id
WHERE u.user_id IN (SELECT e.user_id
                    FROM events e1
                    WHERE e1.type='F2')


"""
df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,proportion
0,0.333333
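
The `IN` subquery in the cells above selects `e.user_id`, which refers to the outer `events e` rather than to `e1`, so the filter is always true and user 1 (who only accessed F1) is still counted. Below is a sketch of one way to restrict to F2 users and count per user instead of per event row. It assumes SQLite (`sqlite3`) as a stand-in for PostgreSQL, uses `julianday()` for the day difference, and keeps the notebook's 31-day window; it is one possible reading of the problem, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, name TEXT, signup TEXT)")
conn.execute("CREATE TABLE events (user_id INTEGER, type TEXT, access_date TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    (1, "Jon", "2020-02-14"), (2, "Jane", "2020-02-14"), (3, "Jill", "2020-02-15"),
    (4, "Josh", "2020-02-15"), (5, "Jean", "2020-02-16"), (6, "Justin", "2020-02-17"),
    (7, "Jeremy", "2020-02-18"),
])
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, "F1", "2020-03-01"), (2, "F2", "2020-03-02"), (3, "F2", "2020-03-15"),
    (4, "F2", "2020-03-15"), (1, "P", "2020-03-16"), (2, "P", "2020-03-18"),
    (3, "P", "2020-03-22"),
])

# Same join and filter as the notebook; only the selected column differs.
join_sql = """
SELECT COUNT(*)
FROM users u
JOIN events e ON u.user_id = e.user_id
WHERE u.user_id IN (SELECT {col} FROM events e1 WHERE e1.type = 'F2')
"""
rows_outer_alias = conn.execute(join_sql.format(col="e.user_id")).fetchone()[0]
rows_inner_alias = conn.execute(join_sql.format(col="e1.user_id")).fetchone()[0]

# Per-user counts: den = users who accessed F2, num = those with a 'P' event
# within 31 days of signup. LEFT JOIN keeps F2 users who never upgraded.
num, den = conn.execute("""
SELECT COUNT(DISTINCT CASE
           WHEN julianday(p.access_date) - julianday(u.signup) <= 31
           THEN f2.user_id END) AS num,
       COUNT(DISTINCT f2.user_id) AS den
FROM (SELECT DISTINCT user_id FROM events WHERE type = 'F2') AS f2
JOIN users u ON u.user_id = f2.user_id
LEFT JOIN events p ON p.user_id = f2.user_id AND p.type = 'P'
""").fetchone()
conn.close()

print(rows_outer_alias, rows_inner_alias)  # 7 rows vs 5 rows after filtering
print(num, den, num / den)                 # 0 of 3 F2 users on this sample
```

On this sample the F2 users who upgraded (users 2 and 3) did so 33 and 36 days after signup, so none fall inside the 31-day window once user 1 is excluded.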


# QotD 4/30/20

A table logs every post each user sees, along with the time it was seen.
 
Write queries to find:
 

A) last time each post was seen, in reverse chronological order
B) last post each user has seen, in chronological order

In [27]:
# Build the posts table and inspect it

sql_query = """
WITH posts (user_id, post, time)
AS (VALUES
(1, 'A', CAST('2-14-20' AS date)),
(2, 'B', CAST('2-14-20' AS date)),
(3, 'C', CAST('2-15-20' AS date)),
(1, 'B', CAST('2-15-20' AS date)),
(2, 'A', CAST('2-16-20' AS date)),
(3, 'B', CAST('2-17-20' AS date)),
(1, 'D', CAST('2-18-20' AS date)))

SELECT *
FROM posts
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query

Unnamed: 0,user_id,post,time
0,1,A,2020-02-14
1,2,B,2020-02-14
2,3,C,2020-02-15
3,1,B,2020-02-15
4,2,A,2020-02-16
5,3,B,2020-02-17
6,1,D,2020-02-18


In [12]:
# A) last time each post was seen, in reverse chronological order

sql_query = """
WITH posts (user_id, post, time)
AS (VALUES
(1, 'A', CAST('2-14-20' AS date)),
(2, 'B', CAST('2-14-20' AS date)),
(3, 'C', CAST('2-15-20' AS date)),
(1, 'B', CAST('2-15-20' AS date)),
(2, 'A', CAST('2-16-20' AS date)),
(3, 'B', CAST('2-17-20' AS date)),
(1, 'D', CAST('2-18-20' AS date)))

SELECT post,
       MAX(time) AS last_post
FROM posts
GROUP BY post
ORDER BY last_post DESC
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,post,last_post
0,D,2020-02-18
1,B,2020-02-17
2,A,2020-02-16
3,C,2020-02-15


In [17]:
# B) first attempt: grouping by (user_id, post) gives the last time each user saw each post, not the last post per user

sql_query = """
WITH posts (user_id, post, time)
AS (VALUES
(1, 'A', CAST('2-14-20' AS date)),
(2, 'B', CAST('2-14-20' AS date)),
(3, 'C', CAST('2-15-20' AS date)),
(1, 'B', CAST('2-15-20' AS date)),
(2, 'A', CAST('2-16-20' AS date)),
(3, 'B', CAST('2-17-20' AS date)),
(1, 'D', CAST('2-18-20' AS date)))

-- | user | post | date |

SELECT user_id, post,
       MAX(time) AS last_post
FROM posts
GROUP BY user_id, post
ORDER BY last_post DESC
"""
df_query = pd.read_sql_query(sql_query,con)    
df_query



Unnamed: 0,user_id,post,last_post
0,1,D,2020-02-18
1,3,B,2020-02-17
2,2,A,2020-02-16
3,3,C,2020-02-15
4,1,B,2020-02-15
5,2,B,2020-02-14
6,1,A,2020-02-14


In [19]:
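# B) last post each user has seen: rank each user's rows by time (newest first) and keep rank 1
# (no outer ORDER BY yet, so chronological order is not guaranteed)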

sql_query = """
WITH posts (user_id, post, time)
AS (VALUES
(1, 'A', CAST('2-14-20' AS date)),
(2, 'B', CAST('2-14-20' AS date)),
(3, 'C', CAST('2-15-20' AS date)),
(1, 'B', CAST('2-15-20' AS date)),
(2, 'A', CAST('2-16-20' AS date)),
(3, 'B', CAST('2-17-20' AS date)),
(1, 'D', CAST('2-18-20' AS date)))

-- | user | post | date |

SELECT *
FROM
    (SELECT *,
           RANK() OVER(PARTITION BY user_id ORDER BY time DESC ) AS rank
    FROM posts) t
WHERE t.rank=1

"""
df_query = pd.read_sql_query(sql_query,con)    
df_query



Unnamed: 0,user_id,post,time,rank
0,1,D,2020-02-18,1
1,2,A,2020-02-16,1
2,3,B,2020-02-17,1
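
The window-function query above picks the right rows, but without an outer `ORDER BY` the "in chronological order" part of B is left to chance. A sketch of the same idea with an explicit ordering, assuming SQLite (`sqlite3`, which needs SQLite 3.25 or newer for window functions) as a stand-in for PostgreSQL, and using `ROW_NUMBER()` so each user yields exactly one row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (user_id INTEGER, post TEXT, time TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", [
    (1, "A", "2020-02-14"), (2, "B", "2020-02-14"), (3, "C", "2020-02-15"),
    (1, "B", "2020-02-15"), (2, "A", "2020-02-16"), (3, "B", "2020-02-17"),
    (1, "D", "2020-02-18"),
])

# Number each user's rows newest-first, keep the newest, then sort oldest-first.
last_seen = conn.execute("""
SELECT user_id, post, time
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY time DESC) AS rn
      FROM posts) AS t
WHERE rn = 1
ORDER BY time
""").fetchall()
conn.close()

for row in last_seen:
    print(row)
```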


In [28]:
# Testing Mike's (ORDER BY 3 sorts by time_rank, which is 1 on every row, so it does not order by time)
sql_query = """
WITH posts (user_id, post, time) 
AS (VALUES(1, 'A', CAST('2-14-20' AS date)),
          (2, 'B', CAST('2-14-20' AS date)),
          (3, 'C', CAST('2-15-20' AS date)),
          (1, 'B', CAST('2-15-20' AS date)),
          (2, 'A', CAST('2-16-20' AS date)),
          (3, 'B', CAST('2-17-20' AS date)),
          (1, 'D', CAST('2-18-20' AS date))),

t1 AS (
SELECT *, rank() OVER (PARTITION by user_id ORDER BY TIME desc) AS time_rank
FROM posts)

SELECT user_id, post, time_rank
FROM t1
WHERE time_rank = 1
ORDER BY 3


"""
df_query = pd.read_sql_query(sql_query,con)    
df_query


Unnamed: 0,user_id,post,time_rank
0,1,D,1
1,2,A,1
2,3,B,1


In [21]:
# B) last post each user has seen, in chronological order

# Kongrath: join each user's MAX(time) back to posts, then order by that time (chronological)

sql_query = """
WITH posts (user_id, post, time)
AS (VALUES
(1, 'A', CAST('2-14-20' AS date)),
(2, 'B', CAST('2-14-20' AS date)),
(3, 'C', CAST('2-15-20' AS date)),
(1, 'B', CAST('2-15-20' AS date)),
(2, 'A', CAST('2-16-20' AS date)),
(3, 'B', CAST('2-17-20' AS date)),
(1, 'D', CAST('2-18-20' AS date)))

-- | user | post | date |

SELECT temp.User_id, Posts.Post
FROM 
	(SELECT User_id, MAX(Time) AS maxT
	FROM Posts
	GROUP BY User_id) AS temp
	
	JOIN Posts
    ON temp.User_id = Posts.User_id AND temp.maxT = Posts.Time
ORDER BY temp.maxT


"""
df_query = pd.read_sql_query(sql_query,con)    
df_query



Unnamed: 0,user_id,post
0,2,A
1,3,B
2,1,D
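
The approaches above agree on this data, but they can differ when a user has two posts with the same latest time: `RANK()` and the join on `MAX(time)` return both rows, while `ROW_NUMBER()` returns one. A sketch of that difference, assuming SQLite (`sqlite3`, version 3.25 or newer) as a stand-in for PostgreSQL; the extra row `(1, 'E', '2020-02-18')` is hypothetical and added only to create a tie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (user_id INTEGER, post TEXT, time TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", [
    (1, "A", "2020-02-14"), (2, "B", "2020-02-14"), (3, "C", "2020-02-15"),
    (1, "B", "2020-02-15"), (2, "A", "2020-02-16"), (3, "B", "2020-02-17"),
    (1, "D", "2020-02-18"),
    (1, "E", "2020-02-18"),  # hypothetical tie with post D for user 1
])

# Count the rows each window function keeps when filtering on the top rank.
window_sql = """
SELECT COUNT(*)
FROM (SELECT user_id,
             {fn}() OVER (PARTITION BY user_id ORDER BY time DESC) AS rnk
      FROM posts) AS t
WHERE rnk = 1
"""
rank_rows = conn.execute(window_sql.format(fn="RANK")).fetchone()[0]
row_number_rows = conn.execute(window_sql.format(fn="ROW_NUMBER")).fetchone()[0]

# Join each user's latest time back to posts, as in the cell above.
join_rows = conn.execute("""
SELECT COUNT(*)
FROM (SELECT user_id, MAX(time) AS max_t FROM posts GROUP BY user_id) AS m
JOIN posts p ON p.user_id = m.user_id AND p.time = m.max_t
""").fetchone()[0]
conn.close()

print(rank_rows, row_number_rows, join_rows)
```

With date-only values like these, ties are plausible; with full timestamps they are rarer but still possible, so it helps to decide up front whether a tie should produce one row or several.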


# --