### Cumulative ER Arrivals

We have a table showing the number of patients who arrive at the ER every day. How can we calculate the cumulative number of patients for each day, using SQL?

To answer this, let's first create some sample data. You'll find that pandas, datetime, and random are useful libraries for populating tabular structures with dummy data. 

For this example, we'll create a sample table with the date in one column and the number of ER arrivals on that date in another column.

In [1]:
import random
import pandas as pd

In [2]:
#help(random)
#help(datetime)

We'll model the number of arrivals as a random integer chosen uniformly between 50 and 150, inclusive. This probably isn't realistic: real data will likely be closer to normally distributed and will fluctuate in response to external events, some of which are predictable. But it will generate the sample data we need to write and test a SQL query.

In [3]:
def num_arrivals():
    return random.randint(50, 150)
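If you'd like the sample data to look a little closer to normally distributed, here is a minimal sketch of an alternative generator. The function name num_arrivals_normal and its parameters (a mean of 100 and a standard deviation of 20) are assumptions chosen for illustration, not part of the original notebook.

```python
import random

def num_arrivals_normal(mean=100, std_dev=20, minimum=0):
    """Draw a daily arrival count from a normal distribution.

    The draw is rounded to a whole number of patients and clamped
    so it never falls below `minimum`.
    """
    return max(minimum, round(random.gauss(mean, std_dev)))

# Seed the generator so repeated runs produce the same sample
random.seed(42)
sample_counts = [num_arrivals_normal() for _ in range(5)]
print(sample_counts)
```

Either generator works for the rest of the notebook, since the SQL only needs one integer per day.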

Next up, we'll create a pandas dataframe holding 100 consecutive dates and a randomly generated number of patients arriving at the ER on each date. Note that pandas has a nice method, date_range, for generating a series of dates. We'll generate the dates and the number of arrivals as two separate sequences, and assemble them into a dataframe using pandas.

In [4]:
# Generate one random arrival count per day
er_arrivals = [num_arrivals() for _ in range(100)]

# Use pandas to generate 100 consecutive dates starting on January 1, 2020
er_dates = pd.date_range(start='1/1/2020', periods=100)

In [5]:
df = pd.DataFrame({'er_date':er_dates, 'er_arrivals':er_arrivals})

In [6]:
df.head(10)

Unnamed: 0,er_date,er_arrivals
0,2020-01-01,136
1,2020-01-02,106
2,2020-01-03,56
3,2020-01-04,85
4,2020-01-05,132
5,2020-01-06,92
6,2020-01-07,103
7,2020-01-08,94
8,2020-01-09,108
9,2020-01-10,145


Alright! We're ready to write some SQL. 

Rather than building a database, I'll use a module, pandasql, that allows us to run SQL directly against a pandas dataframe as if it were a table in a relational database, using SQLite syntax.

In [7]:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

Try it out! 

I recommend trying this out before jumping straight to the answer. There's value in sticking with it to the point of mild frustration :) Even if you end up looking at the answer, the time you spend puzzling over it will help you remember the solution and apply it to a related problem in the future.

Hint - One solution to this problem involves a self-join. Remember, you're not limited to joining tables on an equality operator! Think about how you'd use an inequality along with a self-join...

Here's an example of using a self-join to get all the er_arrivals up to each date in the system. 

In [8]:
pysqldf("""
SELECT 
    date(a.er_date), date(b.er_date), b.er_arrivals
FROM 
    df a
JOIN 
    df b 
ON 
    a.er_date >= b.er_date
""").head(20)

Unnamed: 0,date(a.er_date),date(b.er_date),er_arrivals
0,2020-01-01,2020-01-01,136
1,2020-01-02,2020-01-01,136
2,2020-01-02,2020-01-02,106
3,2020-01-03,2020-01-01,136
4,2020-01-03,2020-01-02,106
5,2020-01-03,2020-01-03,56
6,2020-01-04,2020-01-01,136
7,2020-01-04,2020-01-02,106
8,2020-01-04,2020-01-03,56
9,2020-01-04,2020-01-04,85
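To see why the inequality join fans out this way, the pairing can be mimicked in plain Python. This is only a sketch of the join's logic using the first four rows of the sample output above, not how pandasql or SQLite executes the query.

```python
# First four rows of the sample table, as (er_date, er_arrivals)
rows = [
    ("2020-01-01", 136),
    ("2020-01-02", 106),
    ("2020-01-03", 56),
    ("2020-01-04", 85),
]

# Pair each date (a) with every date on or before it (b),
# mirroring the condition a.er_date >= b.er_date.
# ISO-formatted date strings compare correctly as plain strings.
pairs = [
    (a_date, b_date, b_arrivals)
    for a_date, _ in rows
    for b_date, b_arrivals in rows
    if a_date >= b_date
]

for pair in pairs:
    print(pair)
```

Each date is matched with itself and all earlier dates, so n input rows produce n * (n + 1) / 2 joined rows: 10 rows for the 4 dates here.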


#### Solution 1: SQL: Use an INNER JOIN
    
You can modify the query above, with an aggregation, to get the cumulative value for each date. 

In [9]:
sql = """
SELECT 
    date(a.er_date) as er_date, 
    SUM(b.er_arrivals) as er_cumulative 
FROM
    df a
JOIN 
    df b 
ON
    a.er_date >= b.er_date 
GROUP BY 
    a.er_date 
ORDER BY 
    a.er_date ASC
"""

In [10]:
pysqldf(sql)

Unnamed: 0,er_date,er_cumulative
0,2020-01-01,136
1,2020-01-02,242
2,2020-01-03,298
3,2020-01-04,383
4,2020-01-05,515
5,2020-01-06,607
6,2020-01-07,710
7,2020-01-08,804
8,2020-01-09,912
9,2020-01-10,1057


#### Solution 2: SQL: Use an OVER clause

In [11]:
over_sql = """
SELECT 
    date(er_date) as er_date, 
    SUM(er_arrivals) OVER (ORDER BY date(er_date)) as er_cumulative 
FROM
    df 
"""

In [12]:
pysqldf(over_sql)

Unnamed: 0,er_date,er_cumulative
0,2020-01-01,136
1,2020-01-02,242
2,2020-01-03,298
3,2020-01-04,383
4,2020-01-05,515
5,2020-01-06,607
6,2020-01-07,710
7,2020-01-08,804
8,2020-01-09,912
9,2020-01-10,1057
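With an ORDER BY inside the OVER clause, SUM becomes a window function whose default frame runs from the first row through the current row, which is what yields a running total. As a hedged cross-check, the sketch below runs both approaches against an in-memory SQLite database using Python's built-in sqlite3 module and compares the results. The table name er and the four sample rows are assumptions for illustration, and window functions require SQLite 3.25 or later.

```python
import sqlite3

# First four rows of the sample data, as (er_date, er_arrivals)
rows = [
    ("2020-01-01", 136),
    ("2020-01-02", 106),
    ("2020-01-03", 56),
    ("2020-01-04", 85),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE er (er_date TEXT, er_arrivals INTEGER)")
conn.executemany("INSERT INTO er VALUES (?, ?)", rows)

# Solution 1: self-join on an inequality, then aggregate
join_sql = """
SELECT a.er_date, SUM(b.er_arrivals) AS er_cumulative
FROM er a
JOIN er b ON a.er_date >= b.er_date
GROUP BY a.er_date
ORDER BY a.er_date
"""

# Solution 2: window function with a running frame
window_sql = """
SELECT er_date, SUM(er_arrivals) OVER (ORDER BY er_date) AS er_cumulative
FROM er
ORDER BY er_date
"""

join_result = conn.execute(join_sql).fetchall()
window_result = conn.execute(window_sql).fetchall()
conn.close()

# Both queries should return the same running totals
print(join_result == window_result)
```

The window version reads the table once, whereas the self-join compares every row with every earlier row, so the window version tends to scale better as the table grows.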


#### Solution 3: Use pandas

pandas has a built-in cumulative sum method, cumsum, which returns the running total of a column.

In [13]:
df['er_arrivals'].cumsum()

0       136
1       242
2       298
3       383
4       515
5       607
6       710
7       804
8       912
9      1057
10     1118
11     1194
12     1250
13     1386
14     1484
15     1570
16     1666
17     1757
18     1815
19     1948
20     2011
21     2073
22     2173
23     2308
24     2399
25     2540
26     2676
27     2736
28     2856
29     2949
      ...  
70     7251
71     7314
72     7428
73     7492
74     7615
75     7685
76     7810
77     7890
78     7980
79     8102
80     8177
81     8285
82     8336
83     8406
84     8510
85     8619
86     8692
87     8826
88     8924
89     9009
90     9122
91     9222
92     9288
93     9359
94     9470
95     9614
96     9692
97     9821
98     9944
99    10047
Name: er_arrivals, Length: 100, dtype: int64
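To get output shaped like the SQL results, with the date next to its running total, the cumulative sum can be stored as a new column. This is a minimal sketch on a small stand-in dataframe built from the first four sample rows; the names sample and er_cumulative are assumptions for illustration.

```python
import pandas as pd

# Small stand-in for the notebook's dataframe
sample = pd.DataFrame({
    "er_date": pd.date_range(start="2020-01-01", periods=4),
    "er_arrivals": [136, 106, 56, 85],
})

# cumsum follows row order, so sort by date before accumulating
sample = sample.sort_values("er_date")
sample["er_cumulative"] = sample["er_arrivals"].cumsum()

print(sample)
```

Sorting first matters because cumsum accumulates in row order, not date order.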

### Exercise

In this workbook, we generated a sample table with the total number of ER arrivals for each date, and used this to calculate a cumulative sum for each date. 

As a follow-up exercise, how would you model the arrival of each patient? Try logging the arrival date for each patient, and use this table to generate a cumulative sum of total patient arrivals in the ER for each day in your date range.