In [1]:
import warnings
warnings.filterwarnings('ignore')

# Lab 4 - Data (ETL)

In [2]:
%matplotlib inline

## General Instructions

In this course, Labs are your chance to apply the concepts and methods discussed in the module.
They are a low stakes (pass/fail) opportunity for you to try your hand at *doing*.
Please make sure you follow the general Lab instructions, described in the Syllabus.
The summary is:

* Discussions should start as students work through the material, beginning on the first Wednesday of the new Module week.
* Labs are due by Sunday. 
* Lab solutions are released Monday.  
* Post Self Evaluation and Lab to Lab Group on Blackboard and Lab to Module on Blackboard on Monday.

The last part is important because the Problem Sets will require you to perform the same or similar tasks without guidance.
Problem Sets are your opportunity to demonstrate that you understand how to apply the concepts and methods discussed in the relevant Modules and Labs.

## Specific Instructions

1.  For Blackboard submissions, if there are no accompanying files, you should submit *only* your notebook and it should be named using *only* your JHED id: fsmith79.ipynb for example if your JHED id were "fsmith79". If the assignment requires additional files, you should name the *folder/directory* your JHED id and put all items in that folder/directory, ZIP it up (only ZIP...no other compression), and submit it to Blackboard.

    * do **not** use absolute paths in your notebooks. All resources should be located in the same directory as the rest of your assignments.
    * the directory **must** be named your JHED id and **only** your JHED id.
    * do **not** return files provided by us (data files, .py files)

2. Data Science is as much about what you write (communicating) as the code you execute (researching). In many places, you will be required to execute code and discuss both the purpose and the result. Additionally, Data Science is about reproducibility and transparency. This includes good communication with your team and possibly with yourself. Therefore, you must show **all** work.

3. Avail yourself of the Markdown/Codecell nature of the notebook. If you don't know about Markdown, look it up. Your notebooks should not look like ransom notes. Don't make everything bold. Clearly indicate what question you are answering.

4. Submit a cleanly executed notebook. The first code cell should say `In [1]` and each successive code cell should increase by 1 throughout the notebook.

**Note** This assignment will have multiple files. Follow those instructions.

## Lab

**Reid's** is a small breakfast stand that sells drinks (coffee, tea, sodas) and food (egg & sausage, oatmeal) in a commercial downtown area, Monday through Friday, from 8am until 11am.
Although their menu is small, they do try to cater to a wide variety of diets and thus provide both vegan and keto options for most of their meals.
They started using Ordr as their Point of Sale (POS) system about two months ago and are on the Basic Plan.

Under the Basic Plan, they are able to use the Ordr API to access orders.
This order information comes in the form of a denormalized JSON document.
In order to make any sense of things, you need to normalize it in the Datawarehouse.

1. You are not actually going to access an external API. Use the provided JSON file as the data that the API would return.
2. **You are doing "ETL in the Large" in this assignment.** You are going to build a datawarehouse in SQLite, *not* an application database. This difference is substantial. Refer to the draft chapter of Fundamentals for some of the differences.

**Note** We sometimes get strange questions about the use of SQLite, such as "do you really use SQLite in production?" We use SQLite for this Lab for the following reasons:

1. SQLite is a real RDBMS.
2. SQLite uses real SQL. SQL may be the most important skill you can have as a Data Scientist doing Data Science.
3. Most importantly, the database itself is a standalone file that you can submit to us.

That being said, I have used SQLite on real projects before, under some admittedly unusual circumstances. However, the learning objective is not SQLite; SQLite is a tool.

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Note</strong>
    <p>You may need to install <tt>sqlite</tt>. It is normally preinstalled on macOS and may already be on Linux.</p>
</div>

**Note**
We assume you know the basics of RDBMS and SQL DDL in this course (it is in the course prerequisites!).
We also assume that you understand what "normalized" and "denormalized" data means and that you know about primary and foreign keys.
This [article](https://www3.ntu.edu.sg/home/ehchua/programming/sql/Relational_Database_Design.html) covers the major points.
Additionally, we assume you know SQL and DDL.
If you do not, this Lab will be more challenging than usual and you should start early.

**Important - You must not use Pandas for any part of this assignment.**
Why not?
Because you should know how to do these things without relying on Pandas.

## Part 1

### Learning Objectives

* investigate the structure of data acquired from a 3rd party.
* convert denormalized data into normalized data, according to common data warehouse practices.
* design a data warehouse to store production data acquired from a 3rd party.
* write data to a data warehouse.

This assignment is not about tools *per se* but about broader skills and concepts.

You will be creating the following files:

1. **reids.sql** - this file will create the database structure using DDL. Make sure you review the data and sketch out your design.
2. **reids.db** - this is the actual database, our data warehouse.

```
> sqlite3 reids.db < reids.sql
```

will create the database and all the tables.
The database will be empty at this point.

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Note</strong>
    <p>"<tt>></tt>" represents the command line. Your sqlite executable may have a different name.</p>
</div>

3. **reids.py** - this program will parse the JSON file and fill the database.

```
> python reids.py
```

Unfortunately, the documentation for the Ordr API is sparse. Here is an example of one order:

```
{'items': [
    {'name': 'coffee', 'price': 2.75},
    {'name': 'flavor shot', 'price': 1.0}
    ],
'charges': {
    'date': '01/04/21 10:22',
    'subtotal': 3.75,
    'taxes': 0.26,
    'total': 4.01},
'payment': {
    'card_type': 'visa',
    'last_4_card_number': '0465',
    'zip': '21217',
    'cardholder': 'Christina Sampson',
    'method': 'credit_card'}}
```

**The date format is Month/Day/Year**
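One way to handle the date is to parse it once in `reids.py` and store it in a form that sorts and groups well. The following is a minimal sketch, assuming the timestamp always looks like the example above; the function name `split_order_date` and the choice to store ISO `YYYY-MM-DD` strings are illustrative assumptions, not part of the Ordr API.

```python
from datetime import datetime


def split_order_date(raw):
    """Convert an Ordr timestamp such as '01/04/21 10:22' into (date, time) strings.

    Assumes Month/Day/Year with a two-digit year and a 24-hour clock.
    """
    parsed = datetime.strptime(raw, "%m/%d/%y %H:%M")
    return parsed.strftime("%Y-%m-%d"), parsed.strftime("%H:%M")


order_date, order_time = split_order_date("01/04/21 10:22")
print(order_date, order_time)  # 2021-01-04 10:22
```

Storing the date as `YYYY-MM-DD` has the side benefit that SQLite's built-in date functions can read it directly.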

Make sure you look through the data to see what values are possible for each of the fields.
The standards of normalization/denormalization for data warehouses can differ from those of regular production RDBMSs.
For example, we might be tempted to create a `menu_items` table:

```
id    name                 price
1     coffee               2.75
2     flavor shot          1.00
3     egg salad sandwich   4.50
```

An issue arises if we change the name to "Kona Coffee" because it will change it *for past purchases*.
That is, customers in the past bought based on the name "Coffee" and not "Kona Coffee".
This might be important.
Even worse, if we change the price to \\$3.00, it changes it for all past purchases and that is clearly wrong.

In a production RDBMS we often want the data to change everywhere it is used.
If "Steve" changes his name to "Sam", we want that to be reflected in any query and report.
For datawarehousing, though, we want to preserve the historical fidelity of the data.
This means we have a tendency to normalize *less* than we would otherwise do.
It's worth noting that there is a trend to preserve historical fidelity in production databases as well, through techniques such as soft deletes.

This means the main issue for the Ordr data is storing the three main entities and creating primary/foreign keys.
You will need to create these.

All of this "parsing and massaging" work will be done in the `reids.py` file.
It will contain the code to parse the JSON file and fill the database, performing whatever normalization and standardization is required as well as creating whatever primary and foreign key relationships seem reasonable.

You must create the following tables in the database:

1. `items`
2. `charges`
3. `payments`

but you can add additional tables as necessary (it is not uncommon to include tables in data warehouses that support analytics, such as information about business dates).
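To make the three required tables concrete, here is a minimal sketch of one possible layout, run against an in-memory database rather than `reids.db`. Every column name and key choice below is an assumption for illustration (for example, treating each charge as the order and pointing `items` and `payments` at it); it is not the official solution. Each item row keeps the name and price as sold, which preserves historical fidelity as discussed above.

```python
import sqlite3

# Hypothetical DDL; the same statements could live in reids.sql.
DDL = """
CREATE TABLE charges (
    charge_id INTEGER PRIMARY KEY,
    date      TEXT NOT NULL,   -- stored as 'YYYY-MM-DD'
    time      TEXT NOT NULL,
    subtotal  REAL,
    taxes     REAL,
    total     REAL
);
CREATE TABLE items (
    item_id   INTEGER PRIMARY KEY,
    charge_id INTEGER NOT NULL REFERENCES charges(charge_id),
    name      TEXT NOT NULL,   -- name as sold, not a pointer to a menu table
    price     REAL NOT NULL    -- price as sold
);
CREATE TABLE payments (
    payment_id         INTEGER PRIMARY KEY,
    charge_id          INTEGER NOT NULL REFERENCES charges(charge_id),
    method             TEXT NOT NULL,
    card_type          TEXT,
    last_4_card_number TEXT,   -- TEXT keeps leading zeros such as '0465'
    zip                TEXT,
    cardholder         TEXT
);
"""

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
con.executescript(DDL)

tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # ['charges', 'items', 'payments']
```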

**Note** Feel free to develop reids.py as a Notebook and then generate the .py from the .ipynb file...just make sure you only include the .py file and that it will run from the command line as specified above and you have commented out any debug/chatter.

**Important**
There are some "gotchas".
1. When inserting data into the database, don't forget to commit.
2. If you must reconstruct your database, make sure you "free" all references to it. If you use a script to change it while it's open in your notebook, the open version in the notebook won't necessarily see those changes. You'll need to get a new connection.
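Both gotchas can be seen in a small experiment. The sketch below uses a throwaway database file (not `reids.db`) and a one-column table invented for the demonstration: a second connection does not see an inserted row until the first connection commits, and the writer is closed afterwards so no stale handle lingers.

```python
import os
import sqlite3
import tempfile

# Throwaway file so the experiment never touches reids.db.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(db_path)
writer.execute("CREATE TABLE charges (charge_id INTEGER PRIMARY KEY, total REAL)")
writer.commit()
writer.execute("INSERT INTO charges (total) VALUES (4.01)")  # pending, not committed

reader = sqlite3.connect(db_path)
before = reader.execute("SELECT COUNT(*) FROM charges").fetchall()[0][0]

writer.commit()  # gotcha 1: without this the insert is never saved
writer.close()   # gotcha 2: release the handle before rebuilding the file

after = reader.execute("SELECT COUNT(*) FROM charges").fetchall()[0][0]
reader.close()
print(before, after)
```

In a notebook, the equivalent of the last step is to call `close()` on the old connection and create a new one with `sqlite3.connect` after the database has been rebuilt.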

When you are done with this part, you should be able to proceed to Part 2.

**Everything having to do with parsing the JSON file from Ordr and setting up the "data warehouse" should be done in the three files described above and not in this Notebook.**

## Part 2

### Learning Objectives

You almost never start out with a Notebook and start pulling data. 
The idea that you launch Jupyter Notebook and load a readily available CSV is an incredibly artificial artifact of school (if you had to pull data The Real Way(tm) for every assignment, we'd never get anything done).

Instead, you are more likely to start out with a database and you run queries directly against the database, finding out where and what everything is, answering some initial questions.

* Run queries against an RDBMS to answer basic business questions.

Some data science projects are literally just this: someone asks a question, you investigate the data, you run a query using something like [MySQL Workbench](https://www.mysql.com/products/workbench/), [Toad](https://www.toadworld.com/products/toad-for-sql-server) (Windows Only) or [Postico](https://eggerapps.at/postico/) (MacOS Only). There are also generic SQL clients. For example, [VSCode](https://code.visualstudio.com/) has SQL extensions.

You will mimic that experience here by using only the [sqlite3](https://docs.python.org/3/library/sqlite3.html) Python library (included in the base installation; the link is to the documentation).
As with Part 1, you may *not* use Pandas for this part.
Additionally, you *must* not print out native Python data structures.
[Tabulate](https://pypi.org/project/tabulate/) has been provided in the environment.yml for your use.


For Part 2, everything should be done here, in this notebook.

**Note**
The general format is discuss/code/discuss.
For the questions below, you should be able to:

1. explain what the query does (discuss)
2. execute and display the query result (code)
3. interpret the result (discuss)

All three are required for full credit on something like a Problem Set, so you should practice the triad here. It is permissible to use a query to get raw data (and show it in a table) and then perform a calculation with that raw data (just add a code cell). However, you should do as much as possible in SQL.
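The "code" step of the triad tends to repeat the same pattern, so a small helper can keep the cells short. This is a sketch only: `run_query` is a hypothetical name, and the table and rows are made up so the example runs on its own. It takes column headers from `cursor.description`, so they follow whatever aliases the SQL uses.

```python
import sqlite3


def run_query(con, sql, params=()):
    """Run a query and return (headers, rows), ready for a table formatter."""
    cursor = con.execute(sql, params)
    headers = [column[0] for column in cursor.description]
    return headers, cursor.fetchall()


# Made-up data so the sketch is self-contained.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE charges (date TEXT, total REAL)")
con.executemany("INSERT INTO charges VALUES (?, ?)",
                [("2021-04-01", 4.01), ("2021-04-01", 2.75)])

headers, rows = run_query(
    con, "SELECT date, COUNT(*) AS orders FROM charges GROUP BY date")
print(headers, rows)  # ['date', 'orders'] [('2021-04-01', 2)]
```

In the notebook, the result would be displayed with `print(tabulate(rows, headers=headers))` rather than printed as a raw Python structure.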

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Note</strong>
    <p>
There is a significant work/payoff imbalance here and this reflects real life.
Setting up the data warehouse is 80% of the effort but only 20% of the credit.
Your boss just doesn't care about your struggles with the data, they only care about answering the queries.
As a result, the queries (Part 2) may be 20% of the effort but they're 80% of the grade.
They're proof that you did Part 1 correctly.
If you don't get to the queries, if you don't do them right, there's no proof.
    </p>
</div>

Using the database `reids.db` and SQL please answer the following questions:

In [3]:
from tabulate import tabulate
import sqlite3

In [6]:
con = sqlite3.connect('reids.db')
crsr = con.cursor()

### Question 1.

What were Reid's order count and gross revenue by day for the two month period?







For order count, we can simply count the rows in the `charges` table. For gross revenue, we can sum the `total` of each charge.

In [21]:
orders = crsr.execute('''SELECT COUNT(*) FROM charges''').fetchall()
data = [[orders[0][0]]]
print(tabulate(data, headers=['Orders']))

  Orders
--------
    2071


So there were 2071 orders for this period.

Note that I split the date and time into separate columns in the database, which allows us to group by day for this query.

In [22]:
revenue = crsr.execute('''SELECT date, SUM(total) FROM charges GROUP BY date''').fetchall()

print(tabulate(revenue, headers=['Date', 'Gross Revenue']))

Date          Gross Revenue
----------  ---------------
2021/04/01           188.87
2021/04/02           265.11
2021/04/05           339.77
2021/04/06           276.09
2021/04/07           188.63
2021/04/08           345.92
2021/04/09           341.58
2021/04/12           294.54
2021/04/13           167.5
2021/04/14           288.41
2021/04/15           262.69
2021/04/16           402.63
2021/04/19           276.1
2021/04/20           265.65
2021/04/21           207.62
2021/04/22           316.76
2021/04/23           408.49
2021/04/26           256.32
2021/04/27           240.78
2021/04/28           267.53
2021/04/29           259.79
2021/04/30           284.13
2021/05/03           285.47
2021/05/04           263.49
2021/05/05           188.86
2021/05/06           359.81
2021/05/07           355.28
2021/05/10           301.24
2021/05/11           171.73
2021/05/12           202.26
2021/05/13           315.68
2021/05/14           263.23
2021/05/17           170.68
2021/05/18           2

Here we can see the gross revenue summed by date.
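Since the question asks for order count and gross revenue by day, both can come from a single grouped query. The sketch below runs on a few made-up rows in an in-memory table that assumes the same `date` and `total` columns used above; it is an illustration, not the result for Reid's data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE charges (charge_id INTEGER PRIMARY KEY, date TEXT, total REAL)")
con.executemany(
    "INSERT INTO charges (date, total) VALUES (?, ?)",
    [("2021/04/01", 4.01), ("2021/04/01", 6.50), ("2021/04/02", 3.25)],  # made-up rows
)

# One row per day: how many charges there were and what they summed to.
daily = con.execute(
    """SELECT date, COUNT(*) AS orders, ROUND(SUM(total), 2) AS gross_revenue
       FROM charges
       GROUP BY date
       ORDER BY date"""
).fetchall()
print(daily)
```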

### Question 2.

What is Reid's average order count and gross revenue by day of the week?







In [49]:
# doesn't work: DATEPART and GETDATE are SQL Server functions, not SQLite functions
revenue = crsr.execute('''SELECT date, count(*) FROM charges
                        WHERE DATEPART(WEEKDAY, GETDATE()) = 2 ''')
revenue = crsr.fetchall()
print(revenue)
data = []
for day in revenue:
    data.append(day)



OperationalError: no such column: WEEKDAY

I'm not sure how to approach this problem without going back into the database and adding the day of the week to the tables. All the documentation I could find online only shows how to get the weekday for a manually entered date.
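For reference, SQLite can derive the weekday from a stored date with `strftime('%w', ...)`, which returns `'0'` for Sunday through `'6'` for Saturday. It expects `YYYY-MM-DD`, so dates stored with slashes, as in the output above, need `replace(date, '/', '-')` first. The sketch below is an assumption-laden illustration on made-up rows: it aggregates per day first, then averages those daily figures by weekday.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE charges (charge_id INTEGER PRIMARY KEY, date TEXT, total REAL)")
con.executemany(
    "INSERT INTO charges (date, total) VALUES (?, ?)",
    [("2021/04/01", 4.00), ("2021/04/01", 6.00),   # a Thursday, two orders
     ("2021/04/08", 5.00),                         # the next Thursday, one order
     ("2021/04/02", 3.00)],                        # a Friday, one order
)

query = """
WITH daily AS (
    SELECT date,
           strftime('%w', replace(date, '/', '-')) AS weekday,
           COUNT(*)   AS orders,
           SUM(total) AS revenue
    FROM charges
    GROUP BY date
)
SELECT weekday, AVG(orders) AS avg_orders, AVG(revenue) AS avg_revenue
FROM daily
GROUP BY weekday
ORDER BY weekday
"""
rows = con.execute(query).fetchall()
print(rows)  # [('4', 1.5, 7.5), ('5', 1.0, 3.0)]
```

Averaging the daily totals, rather than all charges at once, is what makes the result "per day of the week" instead of "per order".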





### Question 3.

How many cups of coffee does Reid's sell per day, on average?







In [51]:
# foreign keys?
coffee = crsr.execute('''SELECT count(*) FROM items WHERE name = 'coffee' ''')
coffee = crsr.fetchall()
print(coffee)

[(1,)]


This would be found by joining the items table with the charges table (or another order table). I tried my best to create the tables, and they have unique ids, but I couldn't figure out how to set up the foreign keys that link the tables to each other. This is my first time using SQL.
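A sketch of how the linking could work: insert the charge first, read back its generated id with `cursor.lastrowid`, and write that id into every item row for the order. The table layout, the `insert_order` helper, and the sample orders are all hypothetical, and the query assumes each `coffee` item row represents one cup.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE charges (charge_id INTEGER PRIMARY KEY, date TEXT, total REAL);
CREATE TABLE items (
    item_id   INTEGER PRIMARY KEY,
    charge_id INTEGER NOT NULL REFERENCES charges(charge_id),
    name TEXT, price REAL
);
""")


def insert_order(con, date, item_rows):
    """Insert one charge, then its items carrying the new charge_id as foreign key."""
    total = sum(price for _, price in item_rows)
    cursor = con.execute("INSERT INTO charges (date, total) VALUES (?, ?)", (date, total))
    charge_id = cursor.lastrowid
    con.executemany(
        "INSERT INTO items (charge_id, name, price) VALUES (?, ?, ?)",
        [(charge_id, name, price) for name, price in item_rows])


# Made-up orders across three days.
insert_order(con, "2021-04-01", [("coffee", 2.75), ("flavor shot", 1.0)])
insert_order(con, "2021-04-01", [("coffee", 2.75)])
insert_order(con, "2021-04-02", [("coffee", 2.75), ("coffee", 2.75), ("coffee", 2.75)])
insert_order(con, "2021-04-05", [("oatmeal", 3.0)])
con.commit()

# LEFT JOIN keeps days with zero coffees so they still count toward the average.
avg_coffee = con.execute("""
    SELECT AVG(cups) FROM (
        SELECT c.date, COUNT(i.item_id) AS cups
        FROM charges AS c
        LEFT JOIN items AS i ON i.charge_id = c.charge_id AND i.name = 'coffee'
        GROUP BY c.date
    )
""").fetchall()[0][0]
print(avg_coffee)
```

With these sample rows the daily counts are 2, 3, and 0, so the average is 5/3.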

### Question 4.

What proportion of orders contain "up charges" like flavor shots, vegan or keto substitutions?







In [32]:
# couldn't figure out how to create the order table properly; looking forward to the solution
upcharge = crsr.execute('''SELECT count(*) FROM items WHERE name = 'flavor shot'
                        AND name = 'vegan' AND name='keto' ''')
upcharge = crsr.fetchall()
print(upcharge)

[(0,)]


I couldn't get the ORDER table working because I couldn't set up my foreign keys. I understand the principle: we want a table that links each order to a unique charge, to the ids of its items in the items table, and to an id in the payments table. However, I could not figure out how to do this, as this is my first time working with SQL. Looking forward to reviewing the solution this week.
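One reason the query above returns 0 regardless of the data is that a single row's `name` cannot equal three different values at once, so the conditions need `OR` rather than `AND`. The sketch below shows the shape of a proportion query on made-up rows. The item names for the vegan and keto substitutions are assumptions (matched here with `LIKE`); the real names would need to be checked in the JSON file.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE charges (charge_id INTEGER PRIMARY KEY, date TEXT);
CREATE TABLE items (
    item_id   INTEGER PRIMARY KEY,
    charge_id INTEGER REFERENCES charges(charge_id),
    name TEXT, price REAL
);
INSERT INTO charges (charge_id, date) VALUES
    (1, '2021-04-01'), (2, '2021-04-01'), (3, '2021-04-02'), (4, '2021-04-02');
INSERT INTO items (charge_id, name, price) VALUES
    (1, 'coffee', 2.75), (1, 'flavor shot', 1.0),
    (2, 'oatmeal', 3.0), (2, 'vegan substitution', 1.0),  -- hypothetical item name
    (3, 'coffee', 2.75),
    (4, 'tea', 2.0);
""")

# Orders with at least one up charge, divided by all orders.
proportion = con.execute("""
    SELECT CAST(COUNT(DISTINCT i.charge_id) AS REAL) / (SELECT COUNT(*) FROM charges)
    FROM items AS i
    WHERE i.name = 'flavor shot'
       OR i.name LIKE '%vegan%'
       OR i.name LIKE '%keto%'
""").fetchall()[0][0]
print(proportion)  # 0.5
```

`COUNT(DISTINCT i.charge_id)` matters because an order with two up charges should still count once.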

### Question 5.

Reid's considers someone to be a "regular" if they come at least 3 out of 5 days per week. How many regulars do you estimate there are and what are their names? How many days per week do they each come on average? What are the limits of this calculation based on the available data?







In [36]:
customer = crsr.execute('''SELECT * FROM payments GROUP BY cardholder ''')
customer = crsr.fetchall()
data = []
for payment in customer:
    data.append(payment)

print(tabulate(data, headers=['cust_id', 'method', 'card', 'last_4', 'zip', 'name']))

  cust_id  method       card          last_4    zip  name
---------  -----------  ----------  --------  -----  ----------------------
        3  cash
       94  credit_card  mastercard      2460  21217  Aaron Bridges
      276  credit_card  visa            3660  21201  Aaron Compton
      553  credit_card  mastercard      1369  21226  Aaron Crane
      347  credit_card  mastercard      1505  21270  Aaron Ellis
      337  credit_card  mastercard      1511  21281  Abigail Carter
      118  credit_card  mastercard      4229  21278  Abigail Hall
      747  credit_card  visa            2280  21205  Adam Brooks
      115  credit_card  mastercard      2581  21274  Adam Thompson
       86  credit_card  visa            5136  21230  Adrian Herman
      270  credit_card  visa            6840  21205  Adrian James
      567  credit_card  visa             946  21231  Adriana Young
       66  credit_card  visa            9698  21223  Alexander Atkinson
      358  credit_card  mastercard      2556  21

Again, I couldn't quite figure this out because it relies on foreign keys to link the charges with the cardholder/payment ids. Essentially, we would select the orders for each cardholder name and work out how many days per week each person comes in.
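A rough sketch of that idea, under several assumptions: payments carry a `charge_id` foreign key, dates are stored as `YYYY-MM-DD`, a customer is identified by cardholder name alone, and "days per week" is the number of distinct visit days divided by the number of weeks present in the data. The names and visits are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE charges (charge_id INTEGER PRIMARY KEY, date TEXT);
CREATE TABLE payments (
    payment_id INTEGER PRIMARY KEY,
    charge_id  INTEGER REFERENCES charges(charge_id),
    method TEXT, cardholder TEXT
);
""")

# Made-up visits over two weeks (Mon 2021-04-05 through Fri 2021-04-16).
visits = [
    ("2021-04-05", "Pat Example"), ("2021-04-06", "Pat Example"), ("2021-04-07", "Pat Example"),
    ("2021-04-12", "Pat Example"), ("2021-04-13", "Pat Example"), ("2021-04-15", "Pat Example"),
    ("2021-04-05", "Sam Sample"), ("2021-04-14", "Sam Sample"),
    ("2021-04-06", None),  # cash order: no cardholder, so it cannot be attributed
]
for date, cardholder in visits:
    charge_id = con.execute("INSERT INTO charges (date) VALUES (?)", (date,)).lastrowid
    method = "credit_card" if cardholder else "cash"
    con.execute("INSERT INTO payments (charge_id, method, cardholder) VALUES (?, ?, ?)",
                (charge_id, method, cardholder))
con.commit()

regulars = con.execute("""
    SELECT cardholder, days_per_week FROM (
        SELECT p.cardholder AS cardholder,
               CAST(COUNT(DISTINCT c.date) AS REAL)
                   / (SELECT COUNT(DISTINCT strftime('%Y-%W', date)) FROM charges)
                   AS days_per_week
        FROM payments AS p
        JOIN charges AS c ON c.charge_id = p.charge_id
        WHERE p.cardholder IS NOT NULL
        GROUP BY p.cardholder
    )
    WHERE days_per_week >= 3
    ORDER BY cardholder
""").fetchall()
print(regulars)  # [('Pat Example', 3.0)]
```

The sketch also hints at the limits the question asks about: cash customers are invisible, two people can share a name, and one person can pay with different cards or methods.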

Once the lab solution is released, I'm going to review how it sets up foreign keys, since that seems to be the most important piece for these questions. I believe most of my tables are set up correctly apart from the foreign keys. Thank you.