<img src="https://github.com/christopherhuntley/BUAN6510/blob/master/img/Dolan.png?raw=true" width="180px" align="right">

# **DATA 6510**
# **Homework 3: A Decade of Laundry Tags** 
_Building a Data Warehouse from Source Data._

This assignment is part tutorial and part exercises. The data is provided by the real company behind the DeluxCare example in Lesson 1. The company "dry cleans" (launders) expensive shirts, dresses, jackets, suits, etc. It has been around for decades and has about a dozen locations in southeastern Connecticut.

In part 1 you will "ride along" as we build the data warehouse from scratch using files provided by the company. For those of you who wanted to see how SQL DDL and DML work, this is your chance. &#128578; For everybody else, **do not skip this part;** it illustrates a few things that will almost certainly appear on Quiz 3 or Quiz 4.

In part 2 you will explore the business (and perhaps a bit about upper-class Connecticut residents) by crafting a few analytical queries. 

## **Part 1. Build a data warehouse from scratch.**

### Domain: Anatomy of a Sale

Before going any further, it may be helpful to understand the process of executing one sale:

1. The customer brings in a pile of garments in need of cleaning. This marks the start of a sale. If the customer is new, then the employee collects information about the customer. 
2. The employee sorts the garments by type (shirts, pants, dresses, suits, etc.) and scans each garment's "garment tag" (if there is one) to determine the brand and any special handling instructions. Items with the same garment tag are grouped together. 
3. Each group of items from step 2 is given a service code and prepped for cleaning. 
4. The entire order (invoice) is priced according to service types, numbers of items, and special handling instructions. Each group (garment tag) appears as one line item on the invoice. 
5. The garments are cleaned as requested and prepped for pickup. 
6. The customer returns to pick up their clothes and pays their bill. The employee closes out the invoice, marking it as paid. 

Everything else about our data warehouse derives from this basic process description. It forms the _business language_ used to describe our data.  

### Source Data 

The data comes to us from two sources:
- [garment_tags.csv](./data/DeluxCare/HW9_garment_tags.csv), a moderately large CSV file with data from ~90000 scanned garment tags; each tag represents a bundle of items of the same customer, sale, type, brand, etc. as they appear in the master sales records kept by company management. 
- [service_types.csv](./data/DeluxCare/HW9_ServiceTypes.csv), a description of the service codes used to describe how the garments were to be handled; the codes are used by the point of sale system (cash register) to price each line item on the invoice. 


### Data Warehouse Design

The ERD below lays out our data warehouse as a classic "star schema" design, with 
- `GarmentFacts` (fact table) in the middle 
- `Customer`, `Brand`, `Sale`, and `ServiceType` dimensions on the periphery

![HW9 ERD](./img/HW9_GarmentTags_DW.png)

Remarks: 
- `ServiceType` is a _conforming_ dimension that is _applied_ by the point of sale system. The employee selects a service code (type) from a fixed list at the time of sale. In other words, the data collection is forced to "conform" to this pre-defined list of service codes. 
- `Brand` and `Customer` are said to be _slow moving_ dimensions. They are less static than `ServiceType` but are updated only when needed (e.g., for a new customer or brand). 
- `Sale` is a _fast moving_ dimension, updated with each sale. It will have almost as many rows as the `GarmentFacts` table. 
- The _granularity_ (see Lesson 7) of the `GarmentFacts` table is determined by the four dimensions. In other words, no matter how many rows our `GarmentFacts` table has, the maximum number of groups we can create with a `GROUP BY` is  
**(Number of Customers) x (Number of Brands) x (Number of Sales) x (Number of Service Types)**. 
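
The bound on group counts can be checked on a toy table. The sketch below is only an illustration: the table `toy_facts` and its values are made up, and it uses an in-memory SQLite database rather than the homework data. It counts the groups produced by a `GROUP BY` over all four dimension keys and compares that count with the product of the distinct key counts.

```python
import sqlite3

# In-memory database with a hypothetical, tiny fact table (not the homework data)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_facts ("
    "cust_id INTEGER, brand_id INTEGER, sale_id INTEGER, service_code INTEGER)"
)
rows = [
    (1, 10, 100, 1),
    (1, 10, 100, 1),  # same key combination as the row above
    (1, 11, 100, 2),
    (2, 10, 101, 1),
    (2, 11, 101, 1),
]
conn.executemany("INSERT INTO toy_facts VALUES (?, ?, ?, ?)", rows)

# Number of groups when grouping by every dimension key
n_groups = conn.execute(
    """
    SELECT COUNT(*) FROM (
        SELECT 1 FROM toy_facts
        GROUP BY cust_id, brand_id, sale_id, service_code
    )
    """
).fetchone()[0]

# Product of the distinct counts of each dimension key
n_cust, n_brand, n_sale, n_service = conn.execute(
    """
    SELECT COUNT(DISTINCT cust_id), COUNT(DISTINCT brand_id),
           COUNT(DISTINCT sale_id), COUNT(DISTINCT service_code)
    FROM toy_facts
    """
).fetchone()
upper_bound = n_cust * n_brand * n_sale * n_service

print(n_groups, upper_bound, len(rows))
```

In this toy case the grouped count is smaller than both the row count and the product of the distinct counts, which is what the bound predicts: the product is a ceiling, not an expected value.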

### Initializing the Database

In [None]:
# Load %%sql magic
!pip install jupysql
%load_ext sql
%config SqlMagic.displaylimit = None

# Standard Imports
import sqlite3
import pandas as pd

# SQLite database connection
%sql sqlite:///hw3.db

### Extracting and Loading the Data from the Source Files into the Database

In [None]:
# extract from files (Cust_Zip is read as text so ZIP codes keep their leading zeros)
garments_df = pd.read_csv("./data/DeluxCare/HW9_garment_tags.csv", dtype={'Cust_Zip': 'str'})
service_types_df = pd.read_csv("./data/DeluxCare/HW9_ServiceTypes.csv")


# connect to database
conn = sqlite3.connect('hw3.db')

garments_df.to_sql("garment_tags_import", conn, if_exists='replace')
service_types_df.to_sql("service_types_import", conn, if_exists='replace')

garments_df.dtypes


In [None]:
%%sql 
SELECT * FROM garment_tags_import LIMIT 10;

In [None]:
%%sql 
SELECT * FROM service_types_import LIMIT 10;

### Creating New DW Tables with SQL DDL

SQL DDL is fairly easy to read, even without any formal training. A few things to look for: 
- Each table includes either `facts` or `dim` in the name to indicate its purpose.
- The `count` column was renamed to `item_count` to avoid a naming conflict with the `COUNT()` function.
- In SQLite, an "autoincrement" surrogate key (called `rowid`) is silently generated for each table _unless_ `WITHOUT ROWID` is specified.
- For data quality auditing (traceability), original source keys are kept whenever possible. 
- SQLite assigns each column one of only five type affinities: INTEGER, TEXT, REAL, BLOB, NUMERIC.
- Dates are stored as TEXT because SQLite does not have a Date data type. 
- Foreign keys are defined as both columns (`cust_id`) and constraints (`FOREIGN KEY (cust_id) REFERENCES ...`).
- Since SQLite supports only a limited subset of `ALTER TABLE` (for example, it cannot add or change constraints on an existing table), `DROP TABLE IF EXISTS` is used instead to re-create each table from scratch each time the script is run. This would not work if we were importing data incrementally over time, but since we are doing this just once, it's okay. 
- A benefit of using `DROP TABLE IF EXISTS` is that we can run the notebook from top to bottom whenever we want to re-create the database. 
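
The `rowid` behavior mentioned in the list above can be checked directly. The following is a minimal sketch on an in-memory database; the tables `demo_brand` and `demo_service` and their contents are hypothetical stand-ins, not the homework tables. It shows that an `INTEGER PRIMARY KEY` column mirrors `rowid` and that a `WITHOUT ROWID` table has no `rowid` column to query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Ordinary table: the INTEGER PRIMARY KEY column is an alias for rowid
conn.execute(
    "CREATE TABLE demo_brand (brand_id INTEGER PRIMARY KEY, label TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO demo_brand (label) VALUES (?)",
    [("ACME",), ("EXAMPLE CO",)],  # made-up labels; brand_id is filled in by SQLite
)
rows = conn.execute(
    "SELECT rowid, brand_id, label FROM demo_brand ORDER BY brand_id"
).fetchall()
print(rows)

# WITHOUT ROWID table: selecting rowid raises an error
conn.execute(
    "CREATE TABLE demo_service ("
    "service_code INTEGER PRIMARY KEY, description TEXT NOT NULL) WITHOUT ROWID"
)
try:
    conn.execute("SELECT rowid FROM demo_service")
    has_rowid = True
except sqlite3.OperationalError:
    has_rowid = False
print(has_rowid)
```

One consequence worth keeping in mind: in a `WITHOUT ROWID` table the key values are not generated automatically, so they have to come from the source data.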

In [None]:
%%sql

DROP TABLE IF EXISTS garment_facts;
CREATE TABLE garment_facts (
    garm_id INTEGER PRIMARY KEY,   -- an alias for the `rowid` surrogate key automatically created by SQLite
    
    -- FK columns 
    cust_id INTEGER NOT NULL,      -- fk to customer_dim
    sale_id INTEGER NOT NULL,      -- fk to sale_dim
    brand_id INTEGER,              -- fk to brand_dim; null --> no label
    service_code INTEGER NOT NULL, -- fk to service_dim
    
    -- non-FK columns
    item_count INTEGER,            -- number of items in the group
    garm_id_src INTEGER,            -- garment id from original source data (used for auditing)
    
    -- FK constraints
    FOREIGN KEY (cust_id) REFERENCES customer_dim (cust_id),
    FOREIGN KEY (sale_id) REFERENCES sale_dim (sale_id),
    FOREIGN KEY (brand_id) REFERENCES brand_dim (brand_id),
    FOREIGN KEY (service_code) REFERENCES service_dim (service_code)
    
);

DROP TABLE IF EXISTS customer_dim;
CREATE TABLE customer_dim (
    cust_id INTEGER PRIMARY KEY, -- pk provided by src
    city TEXT,                   -- city name (may be null)
    zip TEXT                     -- zipcode / postal region code (may be null)
) WITHOUT ROWID;

DROP TABLE IF EXISTS brand_dim;
CREATE TABLE brand_dim (
    brand_id INTEGER PRIMARY KEY, -- surrogate pk
    label TEXT NOT NULL           -- maker / manufacturer / designer
);

DROP TABLE IF EXISTS sale_dim;
CREATE TABLE sale_dim (
    sale_id INTEGER PRIMARY KEY, -- pk provided by src
    date TEXT NOT NULL,          -- date of sale as YY-MM-DD text string
    amount REAL NOT NULL         -- sale amount in USD
) WITHOUT ROWID;

DROP TABLE IF EXISTS service_dim;
CREATE TABLE service_dim (
    service_code INTEGER PRIMARY KEY,   -- PK provided by conforming dimension source
    description TEXT NOT NULL           -- human readable TEXT (English)
) WITHOUT ROWID;

### Populating the New DW Tables via SQL Transformations

**Whenever possible, we try to load the dimension tables first, starting with any conforming dimensions before the slow moving dimensions. Why? Because they represent the strongest entities.**
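
One practical consequence of this load order shows up when foreign key enforcement is switched on. SQLite normally leaves enforcement off unless `PRAGMA foreign_keys = ON` is issued on the connection, so the notebook itself may never hit this error. The sketch below uses hypothetical tables `demo_dim` and `demo_facts` in an in-memory database to show a fact row being rejected until its dimension row exists.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the PRAGMA
# takes effect immediately (it is ignored inside an open transaction)
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE demo_dim ("
    "service_code INTEGER PRIMARY KEY, description TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE demo_facts ("
    "fact_id INTEGER PRIMARY KEY, "
    "service_code INTEGER NOT NULL, "
    "FOREIGN KEY (service_code) REFERENCES demo_dim (service_code))"
)

# Loading the fact first: the referenced dimension row does not exist yet
try:
    conn.execute("INSERT INTO demo_facts (service_code) VALUES (1)")
    fact_first_ok = True
except sqlite3.IntegrityError:
    fact_first_ok = False

# Loading the dimension first, then the fact, succeeds
conn.execute("INSERT INTO demo_dim VALUES (1, 'Hypothetical service')")
conn.execute("INSERT INTO demo_facts (service_code) VALUES (1)")
fact_rows = conn.execute("SELECT COUNT(*) FROM demo_facts").fetchone()[0]

print(fact_first_ok, fact_rows)
```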

In [None]:
%%sql
INSERT INTO service_dim (service_code, description) 
    SELECT `Service Code`, `Description`
    FROM service_types_import;
SELECT * FROM service_dim LIMIT 10;

In [None]:
%%sql 
INSERT INTO customer_dim (cust_id, city, zip)
    SELECT DISTINCT CUST_ID, CUST_CITY, CUST_ZIP
    FROM garment_tags_import;
SELECT * FROM customer_dim LIMIT 10;

In [None]:
%%sql
INSERT INTO sale_dim (sale_id, date, amount)
    SELECT DISTINCT SALE_ID, SALE_DATE, SALE_AMOUNT
    FROM garment_tags_import;
SELECT * FROM sale_dim LIMIT 10; 

In [None]:
%%sql
INSERT INTO brand_dim (label)
    SELECT DISTINCT Garm_Brand
    FROM garment_tags_import
    WHERE Garm_Brand IS NOT NULL;
SELECT * FROM brand_dim LIMIT 10;

**Once we have the dimensions worked out we can load the fact table. In the code below we had to use a `LEFT JOIN` to match up the brands with the labels as specified in the `brand_dim` table. Why a `LEFT JOIN`? Because not every garment had a label.** 

In [None]:
%%sql 
INSERT INTO garment_facts (cust_id, sale_id, brand_id, service_code, item_count, garm_id_src)
    SELECT Cust_ID as cust_id,
           Sale_ID as sale_id,
           brand_id,
           Garm_Type_ID as service_code,
           Garm_Count as item_count,
           Garm_ID as garm_id_src
    FROM garment_tags_import
            LEFT JOIN brand_dim ON (garment_tags_import.Garm_Brand = brand_dim.label);
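
To see why the join type matters for the fact load, here is a small sketch on made-up data. The tables `demo_tags` and `demo_brand` are hypothetical and live in an in-memory database. A `LEFT JOIN` keeps the tag that has no brand (with a null `brand_id`), while an inner `JOIN` silently drops it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE demo_tags (garm_id INTEGER, garm_brand TEXT);
    CREATE TABLE demo_brand (brand_id INTEGER PRIMARY KEY, label TEXT NOT NULL);
    -- tag 2 has no brand label
    INSERT INTO demo_tags VALUES (1, 'ACME'), (2, NULL), (3, 'EXAMPLE CO');
    INSERT INTO demo_brand (label) VALUES ('ACME'), ('EXAMPLE CO');
    """
)

# LEFT JOIN: every tag survives; unmatched tags get brand_id = NULL
left_rows = conn.execute(
    """
    SELECT garm_id, brand_id
    FROM demo_tags
        LEFT JOIN demo_brand ON (demo_tags.garm_brand = demo_brand.label)
    ORDER BY garm_id
    """
).fetchall()

# Inner JOIN: the tag without a brand disappears
inner_rows = conn.execute(
    """
    SELECT garm_id, brand_id
    FROM demo_tags
        JOIN demo_brand ON (demo_tags.garm_brand = demo_brand.label)
    ORDER BY garm_id
    """
).fetchall()

print(left_rows)
print(inner_rows)
```

With an inner join the fact table would end up with fewer rows than the import table, which is exactly the kind of mismatch a row-count comparison is meant to catch.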

### Validating that everything worked

In [None]:
%%sql
-- garment_facts should have the same number of rows as garment_tags_import
SELECT 
    (SELECT count(*) FROM garment_facts) as fact_count,
    (SELECT count(*) FROM garment_tags_import) as tag_count

## **Part 2. Analytical Queries**

**How many sales were made in each town?**

In [None]:
%%sql

**Which town had the highest total sales in 2012?** 

In [None]:
%%sql

**Which 10 brands were the most common?**

In [None]:
%%sql

**How many 'RALPH LAUREN' garments were cleaned in each year?** 

In [None]:
%%sql

**How popular was `RALPH LAUREN` in each town in 2012?**

In [None]:
%%sql