# DEPA Final Project (Fall 2019)

## Benedict Au | Nov 2, 2019

## 0. Introduction

### 0.1. Executive summary

Alcohol is one of the most popular purchases in the US, but does consumers’ love of beer vary with their economic condition? In this project, we seek to derive insights on alcohol consumption patterns with respect to changes in US economic metrics. 

Questions that this project can potentially answer:

- Do preferences for beer types change during a recession? Do consumers favor particular beers, price points, or alcohol content?
- How does unemployment affect beer purchasing? Does it significantly change the total amount purchased, or does it just shift the product mix?

### 0.2. Business use case

This project can provide insight into consumer purchasing habits, which is valuable to a variety of businesses. Breweries can use this data to plan production around the beers with the highest expected demand. Retailers and restaurants can use this data to plan their inventory, stocking particular products ahead of expected increases in demand.

### 0.3. Data sources

- The IRI Academic Marketing Data Set (Bronnenberg, et al, 2012) - 130 GB unzipped - NDA required, access through The University of Chicago Office of Research and National Laboratories Research Computing Center 
- St. Louis Fed Federal Reserve Economic Data (FRED) - through FRED API

### 0.4 Prerequisites

Section 4 requires an empty schema `beer` in MySQL 8. The code is provided in `section 4.0`. 

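The notebook uses the standard-library modules `os`, `glob`, `functools`, and `time`, which need no installation. The following third-party packages are also required and can be installed using `pip` or `conda`:  
`NumPy`, `pandas` (0.24 or later; 0.25 was used here), `sqlalchemy`, `pymysql` (the driver named in the connection string in Section 4), `tqdm`, and `fredapi`.

**IMPORTANT**: pandas `0.24` or later is required because that release added nullable integer dtypes (`Int64`), which can hold missing values.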

### 0.5. Sections in this notebook

The following sections in this notebook progress as follows: 

Section 1 explains the procedure to access the IRI dataset on UChicago Research Computing Center and documents the steps taken to extract the necessary files and directories pertinent to this project, given the limitations of the memory size of personal laptop computers. 

Section 2 provides an overview of the IRI dataset, its various dimensions, and their limitations. 

Section 3 describes the fact-dimension schema in MySQL. 

Section 4 contains the code for data intake and manipulation. It also pushes pandas dataframes into an empty MySQL `beer` schema. 

---

## 1. IRI Data extraction:

1. Connect to RCC `/project2` (see <https://rcc.uchicago.edu/docs/data-transfer/index.html>):

  - Address: `smb://midwaysmb.rcc.uchicago.edu/project2`
  - Hostname: `midwaysmb.rcc.uchicago.edu`
  - Username: `ADLOCAL\CNetID`
  - Password: CNet password

2. Navigate to `/projects/databases/IRIData/`

3. Unzip `zYearXX` files and extract beer directory

4. Relocate contents of `BEER` directory into `YearXX` directory

5. Rename `zparsed stub files.zip` as `parsed stub files.zip`

6. Collect all BEER product attribute files from the `parsed stub files` directories into the directory `beer_attributes`.

  - Rename `prod_beer.xlsx` and `prod_beer_sz.xlsx` from the `parsed stub files` directory to `prod01_beer.xlsx` and `prod01_beer_sz.xlsx`.
  - Rename `prod_beer.xlsx` from the `parsed stub files 2007` directory to `prod07_beer.xlsx`.

--------------------------------------------------------------------------------

## 2. Overview of the `IRI dataset`:

Dataset size: 8 GB unzipped

Dataset range: January 1, 2001 (week 1114) to December 30, 2012 (week 1739).

Data sets are separated by year. In each year, there are the following files:

- `ADB Measure Definitions.doc` defines store measures
- `Delivery_Stores` defines stores included in the year's files
- `demos.csv` identifies the demographics of the panelists
- `IRI week translation.xls` defines the conversion of week numbers to dates.
- `panel_measure_definition.doc` defines panel measures. This file is slightly different for years 2008-2011.
- `Category_outlet_startweek_endweek` with no file extension contains store-week level data
- `Category_PANEL_outlet_startweek_endweek.dat` DAT file contains panel data at transaction level.

**TO DO** For years 1, 2, 6, 7, and 12, the `DEMOS.csv` files are located in the directory `demo trips external`. Move these into each YearXX directory.

Note that for 2001-2007, the outlet categories are:
- {DR: drug, GR: groceries, MA: mass}. 

For 2008-2011, the outlet categories are:
- {DK: drug, GK: groceries, MK: mass}.

### 2.1. Week numbers - `IRI WEEK Translation` file description:

- End date = (weekNumber - 400) * 7 + 31900
- Start date = (weekNumber - 400) * 7 + 31900 - 6

**Suggestion:** week conversions can be calculated instead of creating a data table to house this.
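
The two formulas above produce Excel-style serial day numbers. The following is a minimal sketch of the calculation in Python, assuming the serials follow the Excel 1900 date system (serial 0 = 1899-12-30), which reproduces the January 1, 2001 start date given for week 1114. The function name `iri_week_to_dates` is illustrative and does not appear elsewhere in this notebook.

```python
from datetime import date, timedelta

# Assumption: serial day numbers follow the Excel 1900 date system,
# where serial 0 corresponds to 1899-12-30.
EXCEL_EPOCH = date(1899, 12, 30)


def iri_week_to_dates(week_number):
    """Return (start_date, end_date) for an IRI week number."""
    end_serial = (week_number - 400) * 7 + 31900
    end_date = EXCEL_EPOCH + timedelta(days=end_serial)
    start_date = end_date - timedelta(days=6)
    return start_date, end_date


start, end = iri_week_to_dates(1114)
print(start, end)  # 2001-01-01 2001-01-07
```

Computing the dates this way would avoid maintaining a separate week table, at the cost of repeating the arithmetic in every query that needs calendar dates.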

### 2.2. Sales data - `Category_outlet_startweek_endweek` file description:

The store data files are the largest files.

Both the store data and panel data files are keyed to the dimensional information (store, week, UPC fields, [panelist]).

Records within a file represent a transaction by store-week-upc (universal product code).

**Naming convention:** The naming convention for these is category name then outlet then start week and then end week, all separated by underscores, with no extension, so salted snacks drug data for the earliest year would be `saltsnck_drug_1114_1165`.
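
A short sketch of how a file name following this convention could be split into its parts. The helper `parse_sales_file_name` and the `SalesFileName` type are hypothetical names introduced for illustration only.

```python
from typing import NamedTuple


class SalesFileName(NamedTuple):
    category: str
    outlet: str
    start_week: int
    end_week: int


def parse_sales_file_name(name):
    """Split 'category_outlet_startweek_endweek' into its four parts."""
    # Split from the right so a category containing underscores stays intact.
    category, outlet, start_week, end_week = name.rsplit("_", 3)
    return SalesFileName(category, outlet, int(start_week), int(end_week))


parsed = parse_sales_file_name("saltsnck_drug_1114_1165")
print(parsed.category, parsed.outlet, parsed.start_week, parsed.end_week)
# saltsnck drug 1114 1165
```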

**Columns of interest:**

- IRI_Key FK: Delivery_Stores
- WEEK FK: IRI Week Translation.xls
- SY FK: UPC system code
- GE FK: UPC generation code
- VEND FK: UPC vendor code
- ITEM FK: UPC item code
- UNITS Units sold
- DOLLARS Amount (note below)

The dollars column reflects the retail price paid, on average, after retail features, displays and retail coupons. It does not include manufacturer coupons or any discount that might be applied by the retailer that is not applicable to the item. For example, if a retailer gave USD5 off if you purchased more than USD200, that discount is not applied. Sales taxes are not included.

The F column denotes whether there was a marketing feature within the store, such as a small or large ad. The D column denotes whether there was a marketing display of the product within the store.

### 2.3. `Delivery_Stores` file description:

The file lists each store, "masked" by a sequence key that serves as its identifier across the various tables. This file also contains the outlet, the estimated ACV, the market name (so data can be aggregated by market), an open and close week, and a "chain" number representing a particular retailer. For example, all the stores belonging to Chain8 are part of the same retailer that year.

**Columns of interest:**

- IRI_KEY: FK: masked store ID, **may be different from year to year**. Cross-reference in Appendix 2 of the data dictionary. <- ignore this for now
- OU: drug/groceries/mass market -> into its own NF table
- EST_ACV: estimate of annualized sales in MILLIONS for the store across ALL categories
- Market_Name: 50 markets total -> into its own NF table

### 2.4. Product attributes (in directory "beer_attributes"):

`prod01_beer.xlsx` and `prod01_beer_sz.xlsx` for 2001-2006.<br>
`prod07_beer.xlsx` for 2007.<br>
`prod11_beer.xlsx` for 2008-2011.<br>
`prod12_beer` for 2012.

`prod01_beer_sz.xlsx` describes additional size attribute information. No size information was provided for 2007 onward.

**Columns of interest:**

- L2 Small category (domestic or import) -> into its own NF table
- L4 Vendor -> into its own NF table
- L5 Brand -> into its own NF table
- SY UPC system code
- GE UPC generation code
- VEND UPC vendor code
- ITEM UPC item code -> for each UPC item, generate a surrogate key
- VOL_EQ Volume equivalent := ounces / 288, going by the sample rows in Section 4.1 (a 12 oz can has VOL_EQ 0.0417 and a 72 oz pack has 0.25). Denotes total beer per unit sold (e.g. total volume in bottle/can/4-pack/6-pack/case/keg) **TO DO:** figure out a way to determine volume of each individual package
- TYPE OF BEER/ALE **Admit PRODUCT TYPE if MISSING** -> into its own NF table
- PACKAGE Packaging (can/glass, single, box, carton, keg, etc...) -> into its own NF table
- FLAVOR/SCENT FLAVOR[FLAVOR = MISSING] <- NULL, -> into its own NF table

Note: Columns CALORIE LEVEL and COLOR have too many missing values to be useful for analysis.
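
As a first step toward the VOL_EQ **TO DO**, total ounces per unit sold can be recovered from VOL_EQ. This is a minimal sketch that assumes one volume equivalent is 288 oz (24 x 12 oz). That base is inferred from the sample rows in Section 4.1 (a 12 oz can has VOL_EQ 0.0417, a 72 oz pack has 0.25), not taken from the IRI documentation. The helper name is illustrative.

```python
# Assumption: one volume equivalent is 288 oz (24 x 12 oz), inferred from
# sample rows such as "CAN 12OZ" -> 0.0417 and "CAN 72OZ" -> 0.25.
OUNCES_PER_VOL_EQ = 288


def vol_eq_to_ounces(vol_eq):
    """Convert VOL_EQ to total ounces per unit sold, rounded to 0.1 oz."""
    return round(vol_eq * OUNCES_PER_VOL_EQ, 1)


for vol_eq in (0.0417, 0.25, 0.5):
    print(vol_eq, "->", vol_eq_to_ounces(vol_eq), "oz")
```

This gives only the total volume per unit. The volume of each individual container still requires the pack count, which might be parsed from the L9 description (for example `12CT 144OZ`) where it is present.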

### ~~2.5. Category_PANEL_outlet_startweek_endweek.dat~~:

Panel data is provided for two BehaviorScan markets, Eau Claire, Wisconsin and Pittsfield, Massachusetts.

Outside the scope of this project.

### ~~2.6. Panel trips~~:

These files represent the trips made by panelists who purchased at least one item.

Outside the scope of this project.

--------------------------------------------------------------------------------

## 3. MySQL DDL

### List of tables/entities:

1. Sales table

  - sale_id (PK)
  - IRI_KEY (FK)
  - WEEK (FK)
  - UPC_id (FK)
  - UNITS
  - DOLLARS

  For YEAR in range(1,13):  
      Combine drug and groceries and years


2. IRI_KEY (store) table

  - IRI_KEY (PK)
  - OUTLET_CAT_ID (FK)
  - Market_ID (FK)

  For YEAR in range(1,13):  
      Turn column 2,3 in /Year(YEAR)/Delivery_Stores into keys


3. UPC (product) table

  - UPC_id (PK)
  - Atomized UPC codes (SY, GW, VEND, ITEM) (FK)
  - Domestic/import: binary flag (1 = domestic, 0 = import)
  - Vendor_id (FK) <- from L4 column, proxy for "brand"
  - VOL_EQ
  - Beer_type_id (FK)
  - Packaging_id (FK)
  - Flavor_id (FK)

  For YEAR in (01,07,11,12):  
      Create surrogate key `UPC_id` in /beer_attributes/Prod(YEAR)_beer
      Create new columns from foreign key dictionaries
      Turn Prod(YEAR)_beer[["L2"]] into 1(domestic); else 0


4. Week table (optional)

  - week_id (PK)
  - Start_date

  Create a week table in Excel


5. Outlet_cat table

  - OUTLET_CAT_ID (PK)
  - Outlet category

  For YEAR in range(1,13):  
      Create dict of unique values in /Year(YEAR)/Delivery_Stores[["OU"]]


6. Market table

  - Market_ID (PK)
  - Market name

  For YEAR in range(1,13):  
      Create dict of unique values in /Year(YEAR)/Delivery_Stores[["Market_Name"]]


7. Vendor table

  - Vendor_id (PK)
  - Vendor name

  For YEAR in (01,07,11,12):  
      Create dict of unique values in /beer_attributes/Prod(YEAR)_beer[["L4"]]


8. Beer_type table

  - Beer_type_id (PK)
  - Beer_type name

  For YEAR in (01,07,11,12):  
      Create dict of unique values in /beer_attributes/Prod(YEAR)_beer[["TYPE OF BEER/ALE"]]

9. Packaging table

  - Packaging_id (PK)
  - Packaging name

  For YEAR in (01,07,11,12):  
      Create dict of unique values in /beer_attributes/Prod(YEAR)_beer[["PACKAGE"]]


10. Flavor table

  - Flavor_id (PK)
  - Flavor name

  For YEAR in (01,07,11,12):  
      Create dict of unique values in /beer_attributes/Prod(YEAR)_beer[["FLAVOR/SCENT"]]
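
The entity list above describes a fact table (`sales`) surrounded by dimension tables. The following is a minimal sketch of a subset of that schema, using Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server. The column types, the lowercase column names, and the single sale row are assumptions made for illustration; the project itself targets MySQL 8.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL `beer` schema.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

conn.executescript("""
CREATE TABLE market (
    market_id   INTEGER PRIMARY KEY,
    market_name TEXT NOT NULL
);
CREATE TABLE outlet_cat (
    outlet_cat_id   INTEGER PRIMARY KEY,
    outlet_cat_name TEXT NOT NULL
);
CREATE TABLE store (
    store_id      INTEGER PRIMARY KEY,
    outlet_cat_id INTEGER REFERENCES outlet_cat (outlet_cat_id),
    market_id     INTEGER REFERENCES market (market_id)
);
CREATE TABLE upc (
    upc_id   INTEGER PRIMARY KEY,
    domestic INTEGER NOT NULL,
    vol_eq   REAL
);
CREATE TABLE sales (
    sale_id  INTEGER PRIMARY KEY,
    store_id INTEGER NOT NULL REFERENCES store (store_id),
    week_id  INTEGER NOT NULL,
    upc_id   INTEGER NOT NULL REFERENCES upc (upc_id),
    units    INTEGER,
    dollars  REAL
);
""")

conn.execute("INSERT INTO market VALUES (1, 'NEW YORK')")
conn.execute("INSERT INTO outlet_cat VALUES (1, 'groceries')")
conn.execute("INSERT INTO store VALUES (200032, 1, 1)")
conn.execute("INSERT INTO upc VALUES (1, 1, 0.0417)")
# Synthetic sale row, for illustration only (not real IRI data).
conn.execute("INSERT INTO sales VALUES (1, 200032, 1114, 1, 3, 2.97)")

# Aggregate the fact table by a dimension attribute.
rows = conn.execute("""
    SELECT m.market_name, SUM(s.dollars) AS total_dollars
    FROM sales AS s
    JOIN store AS st ON st.store_id = s.store_id
    JOIN market AS m ON m.market_id = st.market_id
    GROUP BY m.market_name
""").fetchall()
print(rows)
```

With foreign keys enforced, a sale that refers to an unknown store or UPC is rejected at insert time rather than surfacing later as an unmatched join.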


## 4. Data prep and populating MySQL database `beer`

In [1]:
import os
import glob
import numpy as np
import pandas as pd
from functools import reduce

from sqlalchemy import create_engine
from tqdm import tqdm
import time

from fredapi import Fred

# MySQL server credentials
engine = create_engine("mysql+pymysql://{user}:{pw}@localhost/{db}".format(user="root", pw="rootroot", db="beer"))

### 4.0 [IMPORTANT] Create schema in MySQL server

In MySQL Workbench, first create the empty schema `beer`. We will populate the empty schema with data, then go back and modify the properties of each table. 

**SQL CODE CHUNK**

SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;  
SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;  
SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='ALLOW_INVALID_DATES';  
SET SQL_SAFE_UPDATES=0;  

DROP SCHEMA IF EXISTS `beer`;  
CREATE SCHEMA `beer` DEFAULT CHARACTER SET utf8;  

### 4.1 Create unique UPC, flavor, packaging, beer_type, brand tables

In [2]:
# Product tables

prod_all_beer_df = pd.concat([pd.read_excel(f) for f in glob.glob("./IRI BEER DATASET/beer_attributes/prod*_beer.xls*")], ignore_index = True, sort=False)
glob.glob("./IRI BEER DATASET/beer_attributes/prod*_beer.xls*")
prod_all_beer_df.shape

(56938, 23)

In [3]:
prod_all_beer_df.head()

Unnamed: 0,L1,L2,L3,L4,L5,L9,Level,UPC,SY,GE,...,VOL_EQ,PRODUCT TYPE,TYPE OF BEER/ALE,PACKAGE,FLAVOR/SCENT,SIZE,CALORIE LEVEL,COLOR,*AG C=1+ CATEGORY 00004,*STUBSPEC 1416RC 00004
0,CATEGORY - BEER/ALE/ALCOHOLIC CID,DOMESTIC BEER/ALE (INC NON-ALCOH,ABC WINE & SPIRITS,ABC WINE & SPIRITS,ABC ALE,+ABCAL ALE BEER CAN 12OZ,9,00-01-85674-60002,0,1,...,0.0417,BEER,ALE,CAN,MISSING,MISSING,MISSING,MISSING,,
1,CATEGORY - BEER/ALE/ALCOHOLIC CID,DOMESTIC BEER/ALE (INC NON-ALCOH,ABC WINE & SPIRITS,ABC WINE & SPIRITS,ABC ALE,+ABCAL ALE BEER CAN 72OZ,9,00-01-85674-60001,0,1,...,0.25,BEER,ALE,CAN,MISSING,MISSING,MISSING,MISSING,,
2,CATEGORY - BEER/ALE/ALCOHOLIC CID,DOMESTIC BEER/ALE (INC NON-ALCOH,ABITA BREWING CO INC,ABITA BREWING CO INC,ABITA AMBER,+ABTAM LAGER BEER GB 12OZ,9,27-01-15502-01124,27,1,...,0.0417,BEER,LAGER,GLASS BOTTLE,MISSING,MISSING,MISSING,AMBER,,
3,CATEGORY - BEER/ALE/ALCOHOLIC CID,DOMESTIC BEER/ALE (INC NON-ALCOH,ABITA BREWING CO INC,ABITA BREWING CO INC,ABITA AMBER,+ABTAM LAGER BEER GB 12OZ,9,00-01-80020-00001,0,1,...,0.0417,BEER,LAGER,GLASS BOTTLE,MISSING,MISSING,MISSING,AMBER,,
4,CATEGORY - BEER/ALE/ALCOHOLIC CID,DOMESTIC BEER/ALE (INC NON-ALCOH,ABITA BREWING CO INC,ABITA BREWING CO INC,ABITA AMBER,+ABTAM LAGER BEER GBCRT 72OZ,9,00-01-80020-24221,0,1,...,0.25,BEER,LAGER,GLASS BOTTLE IN CRTN,MISSING,MISSING,MISSING,AMBER,,


In [4]:
# Unique flavors

flavor_df = prod_all_beer_df[["FLAVOR/SCENT"]].replace("MISSING", np.nan).replace("REGULAR", np.nan).dropna().drop_duplicates()

flavor_df["flavor_id"] = np.arange(1,1+len(flavor_df))
flavor_df.rename(columns={"FLAVOR/SCENT": "flavor_name"}, inplace=True)
flavor_df.reset_index(drop=True, inplace=True)
flavor_df.dtypes

flavor_name    object
flavor_id       int64
dtype: object

In [5]:
flavor_dict = flavor_df.set_index("flavor_name")["flavor_id"].to_dict()
flavor_df.head()

Unnamed: 0,flavor_name,flavor_id
0,ASSORTED,1
1,RASPBERRY,2
2,RUM,3
3,WATERMELON,4
4,PEACH,5


In [11]:
# Import into MySQL server
flavor_df.to_sql('flavor', con = engine, if_exists = 'replace', chunksize = 1000, index = False)

In [6]:
# Unique packaging

packaging_df = prod_all_beer_df[["PACKAGE"]].replace("MISSING", np.nan).dropna().drop_duplicates()

packaging_df["packaging_id"] = np.arange(1,1+len(packaging_df))
packaging_df.rename(columns={"PACKAGE": "packaging_name"}, inplace=True)
packaging_df.reset_index(drop=True, inplace=True)
packaging_df.dtypes

packaging_name    object
packaging_id       int64
dtype: object

In [7]:
packaging_dict = packaging_df.set_index("packaging_name")["packaging_id"].to_dict()
packaging_df.head()

Unnamed: 0,packaging_name,packaging_id
0,CAN,1
1,GLASS BOTTLE,2
2,GLASS BOTTLE IN CRTN,3
3,GLASS BOTTLE IN BOX,4
4,LONG NECK BTL CRTN,5


In [14]:
# Import into MySQL server
packaging_df.to_sql('packaging', con = engine, if_exists = 'replace', chunksize = 1000, index = False)

In [8]:
# Unique beer_type

beer_type_df = prod_all_beer_df[["TYPE OF BEER/ALE"]].replace("MISSING", np.nan).dropna().drop_duplicates()

beer_type_df["beer_type_id"] = np.arange(1,1+len(beer_type_df))
beer_type_df.rename(columns={"TYPE OF BEER/ALE": "beer_type_name"}, inplace=True)
beer_type_df.reset_index(drop=True, inplace=True)
beer_type_df.dtypes

beer_type_name    object
beer_type_id       int64
dtype: object

In [9]:
beer_type_dict = beer_type_df.set_index("beer_type_name")["beer_type_id"].to_dict()
beer_type_df.dtypes

beer_type_name    object
beer_type_id       int64
dtype: object

In [17]:
# Import into MySQL server
beer_type_df.to_sql('beer_type', con = engine, if_exists = 'replace', chunksize = 1000, index = False)

In [10]:
# Unique vendor

vendor_df = prod_all_beer_df[["L4"]].replace("MISSING", np.nan).replace("ALL OTHERS", np.nan).replace("PRIVATE LABEL", np.nan).dropna().drop_duplicates()

vendor_df["vendor_id"] = np.arange(1,1+len(vendor_df))
vendor_df.rename(columns={"L4": "vendor_name"}, inplace=True)
vendor_df.reset_index(drop=True, inplace=True)
vendor_df.dtypes

vendor_name    object
vendor_id       int64
dtype: object

In [11]:
vendor_dict = vendor_df.set_index("vendor_name")["vendor_id"].to_dict()
vendor_df.dtypes

vendor_name    object
vendor_id       int64
dtype: object

In [20]:
# Import into MySQL server
vendor_df.to_sql('vendor', con = engine, if_exists = 'replace', chunksize = 1000, index = False)

In [12]:
# Unique UPC products

prod_all_beer_unique_df = prod_all_beer_df.drop_duplicates(subset="UPC")
prod_all_beer_unique_df.dtypes

L1                                                                                   object
L2                                                                                   object
L3                                                                                   object
L4                                                                                   object
L5                                                                                   object
L9                                                                                   object
Level                                                                                 int64
UPC                                                                                  object
SY                                                                                    int64
GE                                                                                    int64
VEND                                                                            

In [13]:
prod_all_beer_unique_df.tail()

Unnamed: 0,L1,L2,L3,L4,L5,L9,Level,UPC,SY,GE,...,VOL_EQ,PRODUCT TYPE,TYPE OF BEER/ALE,PACKAGE,FLAVOR/SCENT,SIZE,CALORIE LEVEL,COLOR,*AG C=1+ CATEGORY 00004,*STUBSPEC 1416RC 00004
56900,CATEGORY - BEER/ALE/ALCOHOLIC CID,IMPORTED BEER/ALE (INC NON-ALCOH,WETTEN IMPORTERS INC,WETTEN IMPORTERS INC,LIEFMANS GOUDENBAND,+LFMGD BEER LNBTL 8% 1CT 12.7OZ,9,00-01-82153-33109,0,1,...,0.0441,BEER,MISSING,LONG NECK BOTTLE,MISSING,MISSING,MISSING,MISSING,,+LFMGD BEER LNBTL 8% 1CT 12.7OZ 0 1 8...
56911,CATEGORY - BEER/ALE/ALCOHOLIC CID,IMPORTED BEER/ALE (INC NON-ALCOH,ALL OTHERS,ALL OTHERS,ALL BRAND,+ALBND BEER GB 1CT 14OZ,9,27-01-00001-61879,27,1,...,0.0486,BEER,MISSING,GLASS BOTTLE,MISSING,MISSING,MISSING,MISSING,,+ALBND BEER GB 1CT 14OZ27 1 ...
56914,CATEGORY - BEER/ALE/ALCOHOLIC CID,IMPORTED BEER/ALE (INC NON-ALCOH,PRIVATE LABEL,PRIVATE LABEL,PRIVATE LABEL,+PRV * BEER LNBBX 12CT 144OZ,9,88-04-99998-65500,88,4,...,0.5,BEER,MISSING,LONG NECK BTL IN BOX,MISSING,MISSING,MISSING,GLD,,+PRV * BEER LNBBX 12CT 144OZ88 4 9...
56915,CATEGORY - BEER/ALE/ALCOHOLIC CID,IMPORTED BEER/ALE (INC NON-ALCOH,PRIVATE LABEL,PRIVATE LABEL,PRIVATE LABEL,+PRV * LAGER BEER LNBTL 72OZ,9,88-04-99998-65501,88,4,...,0.25,BEER,LAGER,LONG NECK BOTTLE,,,,,,+PRV * LAGER BEER LNBTL 72OZ88 4 9...
56934,CATEGORY - BEER/ALE/ALCOHOLIC CID,PLU - ALL BRANDS BEER,ALL OTHERS,ALL OTHERS,ALL BRAND,+ALBND BEER 40OZ,9,27-01-00001-63530,27,1,...,0.1389,BEER,MISSING,GLASS BOTTLE,MISSING,MISSING,MISSING,MISSING,,+ALBND BEER 40OZ27 1 ...


In [14]:
prod_all_beer_unique_df = prod_all_beer_unique_df[prod_all_beer_unique_df.L4 != "ALL OTHERS"]
prod_all_beer_unique_df = prod_all_beer_unique_df[prod_all_beer_unique_df.L4 != "PRIVATE LABEL"]
prod_all_beer_unique_df["domestic"] = [1 if x == "DOMESTIC BEER/ALE (INC NON-ALCOH" else 0 for x in prod_all_beer_unique_df["L2"]]
prod_all_beer_unique_df = prod_all_beer_unique_df[["UPC", "SY", "GE", "VEND", "ITEM", "domestic", "L4", "VOL_EQ", "TYPE OF BEER/ALE", "PACKAGE", "FLAVOR/SCENT"]]
prod_all_beer_unique_df.rename(columns={"L4": "vendor_id", "TYPE OF BEER/ALE": "beer_type_id", "PACKAGE": "packaging_id", "FLAVOR/SCENT": "flavor_id"}, inplace=True)

prod_all_beer_unique_df["vendor_id"] = prod_all_beer_unique_df["vendor_id"].map(vendor_dict)
prod_all_beer_unique_df["beer_type_id"] = prod_all_beer_unique_df["beer_type_id"].map(beer_type_dict)
prod_all_beer_unique_df["beer_type_id"] = prod_all_beer_unique_df["beer_type_id"].astype('Int64')
prod_all_beer_unique_df["packaging_id"] = prod_all_beer_unique_df["packaging_id"].map(packaging_dict)
prod_all_beer_unique_df["packaging_id"] = prod_all_beer_unique_df["packaging_id"].astype('Int64')
prod_all_beer_unique_df["flavor_id"] = prod_all_beer_unique_df["flavor_id"].map(flavor_dict)
prod_all_beer_unique_df["flavor_id"] = prod_all_beer_unique_df["flavor_id"].astype('Int64')

prod_all_beer_unique_df.reset_index(drop=True, inplace=True)
prod_all_beer_unique_df["UPC_id"] = np.arange(1,1+len(prod_all_beer_unique_df))
prod_all_beer_unique_df.dtypes

UPC              object
SY                int64
GE                int64
VEND              int64
ITEM              int64
domestic          int64
vendor_id         int64
VOL_EQ          float64
beer_type_id      Int64
packaging_id      Int64
flavor_id         Int64
UPC_id            int64
dtype: object

In [15]:
# Import into MySQL server
prod_all_beer_unique_df.to_sql('upc', con = engine, if_exists = 'replace', chunksize = 1000, index = False)

In [25]:
# UPC codes in `sales` tables are atomized with no leading zeros.
# Will need to normalise sales tables by replacing atomized UPC codes by surrogate UPC_id codes instead. 
# Can't use UPC column in `prod_all_beer_unique_df` to make dictionary since values have leading zeros.
# Create df by concat-ing "SY", "GE", "VEND", "ITEM" columns with dash as separator. 
# Create dict for use in Section `4.3`

# Atomized UPC codes to UPC_id dict

atom_upc_upcid_df = prod_all_beer_unique_df[["UPC_id", "SY", "GE", "VEND", "ITEM"]]
concat_upc_atom = atom_upc_upcid_df[["SY", "GE", "VEND", "ITEM"]].apply(lambda row: '-'.join(row.values.astype(str)), axis=1)
atom_upc_upcid_df = pd.concat([atom_upc_upcid_df, concat_upc_atom], axis=1)
atom_upc_upcid_df.drop(["SY", "GE", "VEND", "ITEM"], axis = 1, inplace=True)
atom_upc_upcid_df.rename(columns = {0: "UPC_atom_concat"}, inplace=True)
atom_upc_upcid_dict = atom_upc_upcid_df.set_index("UPC_atom_concat")["UPC_id"].to_dict()

atom_upc_upcid_df.head()


Unnamed: 0,UPC_id,UPC_atom_concat
0,1,0-1-85674-60002
1,2,0-1-85674-60001
2,3,27-1-15502-1124
3,4,0-1-80020-1
4,5,0-1-80020-24221
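
The key format shown above follows from casting each UPC component to an integer, which drops leading zeros. This is a minimal sketch of the same normalization applied directly to a dashed UPC string. The helper name `atomize_upc` is illustrative and not part of the notebook.

```python
def atomize_upc(upc):
    """Strip leading zeros from each dash-separated UPC component."""
    return "-".join(str(int(part)) for part in upc.split("-"))


print(atomize_upc("00-01-85674-60002"))  # 0-1-85674-60002
print(atomize_upc("00-01-80020-00001"))  # 0-1-80020-1
```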


### 4.2 Create unique store, outlet_cat, market tables

In [26]:
# Store (Delivery_Stores) tables

#stores_all_df = pd.concat([pd.read_table(f) for f in glob.glob("./IRI BEER DATASET/Year*/Delivery_Stores")], ignore_index = True, sort=False)
stores_all_df = pd.concat([pd.read_csv(f, sep="\t") for f in glob.glob("./IRI BEER DATASET/Year*/Delivery_Stores")], ignore_index = True, sort=False)


stores_all_df.columns = ["string"]
stores_all_df.head()

Unnamed: 0,string
0,200032 GR 28.11499 NEW YORK 1...
1,200059 GR 20.80499 PHILADELPHIA 1...
2,200171 GR 25.282 MILWAUKEE ...
3,200197 GR 16.616 PEORIA/SPRINGFLD. ...
4,200272 GR 10.91199 LOS ANGELES ...


In [27]:
# Split column by character location

stores_all_df["store_id"] = stores_all_df.string.str[0:7].astype(str).astype(int)
stores_all_df["outlet_cat_name"] = stores_all_df.string.str[8:10]
stores_all_df["market_name"] = stores_all_df.string.str[20:45]

outlet_cat_convert_dict = {"DR": "drug", "GR": "groceries", "MA": "mass", "DK": "drug", "GK": "groceries", "MK": "mass"}
stores_all_df["outlet_cat_name"] = stores_all_df["outlet_cat_name"].map(outlet_cat_convert_dict)
stores_all_df.drop(["string"], axis = 1, inplace=True)
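# Assign the result back; apply() returns a new DataFrame and does not strip in place
stores_all_df = stores_all_df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)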
stores_all_df.drop_duplicates(subset="store_id", inplace=True)
stores_all_df.dtypes

store_id            int64
outlet_cat_name    object
market_name        object
dtype: object

In [28]:
stores_all_df.head()

Unnamed: 0,store_id,outlet_cat_name,market_name
0,200032,groceries,NEW YORK
1,200059,groceries,PHILADELPHIA
2,200171,groceries,MILWAUKEE
3,200197,groceries,PEORIA/SPRINGFLD.
4,200272,groceries,LOS ANGELES


In [29]:
# Unique outlet_cat

outlet_cat_df = stores_all_df[["outlet_cat_name"]].drop_duplicates()
outlet_cat_df["outlet_cat_id"] = np.arange(1,1+len(outlet_cat_df))
outlet_cat_df.reset_index(drop=True, inplace=True)
outlet_cat_dict = outlet_cat_df.set_index("outlet_cat_name")["outlet_cat_id"].to_dict()
outlet_cat_df.head()

Unnamed: 0,outlet_cat_name,outlet_cat_id
0,groceries,1
1,drug,2


In [30]:
# Import into MySQL server
outlet_cat_df.to_sql('outlet_cat', con = engine, if_exists = 'replace', chunksize = 1000, index = False)

In [31]:
# Unique market

market_df = stores_all_df[["market_name"]].drop_duplicates()
market_df["market_id"] = np.arange(1,1+len(market_df))
market_df.reset_index(drop=True, inplace=True)
market_dict = market_df.set_index("market_name")["market_id"].to_dict()
market_df.dtypes

market_name    object
market_id       int64
dtype: object

In [32]:
market_df.tail()

Unnamed: 0,market_name,market_id
45,KANSAS CITY,46
46,DETROIT,47
47,CLEVELAND,48
48,"PROVIDENCE,RI",49
49,DES MOINES,50


In [33]:
# Import into MySQL server
market_df.to_sql('market', con = engine, if_exists = 'replace', chunksize = 1000, index = False)

In [34]:
stores_all_df["outlet_cat_name"] = stores_all_df["outlet_cat_name"].map(outlet_cat_dict)
stores_all_df["market_name"] = stores_all_df["market_name"].map(market_dict)
stores_all_df.head()

Unnamed: 0,store_id,outlet_cat_name,market_name
0,200032,1,1
1,200059,1,2
2,200171,1,3
3,200197,1,4
4,200272,1,5


In [35]:
# Import into MySQL server
stores_all_df.to_sql('store', con = engine, if_exists = 'replace', chunksize = 1000, index = False)

### 4.3. Sales Data

In [36]:
# List of all sales data and total size

sales_file_list = glob.glob("./IRI BEER DATASET/Year*/beer_????_????_????")
#sales_file_list = glob.glob("./IRI BEER DATASET/Year/beer_drug_????_????")

sales_files_size_GB = round(sum([os.stat(file).st_size for file in sales_file_list])/(1024**3),2)
print("Total size of sales data:", sales_files_size_GB, "GB.")

Total size of sales data: 6.93 GB.


In [37]:
sales_file_list

['./IRI BEER DATASET/Year9/beer_groc_1531_1582',
 './IRI BEER DATASET/Year9/beer_drug_1531_1582',
 './IRI BEER DATASET/Year7/beer_groc_1427_1478',
 './IRI BEER DATASET/Year7/beer_drug_1427_1478',
 './IRI BEER DATASET/Year1/beer_drug_1114_1165',
 './IRI BEER DATASET/Year1/beer_groc_1114_1165',
 './IRI BEER DATASET/Year6/beer_drug_1374_1426',
 './IRI BEER DATASET/Year6/beer_groc_1374_1426',
 './IRI BEER DATASET/Year8/beer_groc_1479_1530',
 './IRI BEER DATASET/Year8/beer_drug_1479_1530',
 './IRI BEER DATASET/Year12/beer_drug_1687_1739',
 './IRI BEER DATASET/Year12/beer_groc_1687_1739',
 './IRI BEER DATASET/Year3/beer_drug_1218_1269',
 './IRI BEER DATASET/Year3/beer_groc_1218_1269',
 './IRI BEER DATASET/Year4/beer_drug_1270_1321',
 './IRI BEER DATASET/Year4/beer_groc_1270_1321',
 './IRI BEER DATASET/Year5/beer_drug_1322_1373',
 './IRI BEER DATASET/Year5/beer_groc_1322_1373',
 './IRI BEER DATASET/Year2/beer_drug_1166_1217',
 './IRI BEER DATASET/Year2/beer_groc_1166_1217',
 './IRI BEER DATAS

Because the files are large but share the same fixed-width layout, each sales file is processed as follows:
1. Read the file into a DataFrame with `pd.read_csv(..., sep="\t")`.
2. Split the single text column by character position (IRI_KEY, WEEK, SY, GE, VEND, ITEM, UNITS, DOLLARS).
3. Append the result to MySQL with `if_exists = 'append'`.

Time expected: 90 minutes. System: macOS, 8GB memory, Intel Core i5.
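
The split-by-position step can be sketched without pandas. The slice boundaries below are copied from the parsing cell in this section. The sample record is synthetic and merely laid out to match those boundaries, and the names `SALES_SLICES` and `parse_sales_line` are illustrative.

```python
# (start, stop, type) for each field, matching the slices used in this section.
SALES_SLICES = {
    "store_id": (0, 7, int),
    "week_id": (8, 12, int),
    "SY": (13, 15, int),
    "GE": (16, 18, int),
    "VEND": (19, 24, int),
    "ITEM": (25, 30, int),
    "UNITS": (31, 36, int),
    "DOLLARS": (37, 45, float),
}


def parse_sales_line(line):
    """Split one fixed-width sales record into typed fields."""
    return {
        name: cast(line[start:stop].strip())
        for name, (start, stop, cast) in SALES_SLICES.items()
    }


# Synthetic record laid out to match the slices above (not real IRI data).
sample = "{:>7} {:>4} {:>2} {:>2} {:>5} {:>5} {:>5} {:>8}".format(
    200032, 1114, 0, 1, 85674, 60002, 3, "2.97"
)
print(parse_sales_line(sample))
```

Keeping the boundaries in one table makes it easier to check them against a file's header row than when they are spread across eight separate assignments.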

**The following chunk appends to the `sales` table, so re-running it duplicates rows. Comment out the `to_sql` call to prevent this.**

In [38]:
for series in tqdm(sales_file_list): 
    sales_each_df = pd.read_csv(series, sep="\t")
    sales_each_df.columns = ["string"]
    sales_each_df.tail()
    
    # split string into columns
    sales_each_df["store_id"] = sales_each_df.string.str[0:7].astype(str).astype(int)
    sales_each_df["week_id"] = sales_each_df.string.str[8:12].astype(str).astype(int)
    sales_each_df["SY"] = sales_each_df.string.str[13:15].astype(str).astype(int)
    sales_each_df["GE"] = sales_each_df.string.str[16:18].astype(str).astype(int)
    sales_each_df["VEND"] = sales_each_df.string.str[19:24].astype(str).astype(int)
    sales_each_df["ITEM"] = sales_each_df.string.str[25:30].astype(str).astype(int)
    sales_each_df["UNITS"] = sales_each_df.string.str[31:36].astype(str).astype(int)
    sales_each_df["DOLLARS"] = sales_each_df.string.str[37:45].astype(str).astype(float)
    sales_each_df.drop(["string"], axis = 1, inplace=True)
    sales_each_df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
    concat_upc_atom = sales_each_df[["SY", "GE", "VEND", "ITEM"]].apply(lambda row: '-'.join(row.values.astype(str)), axis=1)
    sales_each_df = pd.concat([sales_each_df, concat_upc_atom], axis=1)
    sales_each_df.drop(["SY", "GE", "VEND", "ITEM"], axis = 1, inplace=True)
    sales_each_df.rename(columns = {0: "UPC_atom_concat"}, inplace=True)
    sales_each_df["upc_id"] = sales_each_df["UPC_atom_concat"].map(atom_upc_upcid_dict)
    # sales_each_df[sales_each_df.isna().any(axis=1)]
    # UPC code 0-1-11170-83511 does not exist in any of the `prod*_beer.xls*` tables so it's being dropped
    sales_each_df.dropna(inplace = True)
    sales_each_df.drop(["UPC_atom_concat"], axis = 1, inplace=True)
    sales_each_df["upc_id"] = sales_each_df["upc_id"].astype('Int64')
    
    # dump into MySQL (appends to `sales` on every run)
    sales_each_df.to_sql('sales', con = engine, if_exists = 'append', chunksize = 1000, index = False)

100%|██████████| 24/24 [1:30:29<00:00, 273.62s/it]


### 4.4 Week table

Create a week table to convert IRI week codes to calendar dates. Each date is the Monday on which the week starts. 

In [39]:
week_index = pd.date_range(start='12/30/2000', end='01/01/2013', freq='W-MON')
week_df = week_index.to_frame(index=False)
week_df.columns = ["date"]
week_df["week_id"] = np.arange(1114,1114+len(week_df))
week_df.shape

(627, 2)

In [40]:
week_df.head()

Unnamed: 0,date,week_id
0,2001-01-01,1114
1,2001-01-08,1115
2,2001-01-15,1116
3,2001-01-22,1117
4,2001-01-29,1118


In [41]:
# Import into MySQL server
week_df.to_sql('week', con = engine, if_exists = 'replace', chunksize = 1000, index = False)

### 4.5 Economic data and other datasets used

FRED API documentation: <https://research.stlouisfed.org/docs/api/fred/>  

#### 4.5.1 State abbreviations data

**[IMPORTANT]** Downloaded from: http://www.whypad.com/wp-content/uploads/us_states.zip and placed in notebook directory.

In [42]:
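# NOTE: read_csv treats the first row as a header. If us_states.csv has no header row,
# pass header=None; otherwise its first record is consumed as column names
# (the output below starts at Alaska, not Alabama).
state_code_df = pd.read_csv("us_states.csv")
state_code_df.columns = ["STATE", "state", "state_abbrev"]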
state_code_df.drop(["STATE"], axis = 1, inplace=True)
state_code_df.head()

Unnamed: 0,state,state_abbrev
0,Alaska,AK
1,Arizona,AZ
2,Arkansas,AR
3,California,CA
4,Colorado,CO


#### 4.5.2. FRED economic data

JSON Request (HTTPS GET) example:  
https://api.stlouisfed.org/fred/series/observations?series_id=GDPC1&api_key=85482ff982c94bb52ca2ae28568ee970&file_type=json&observation_start=2001-01-01&observation_end=2012-12-31


API key: 85482ff982c94bb52ca2ae28568ee970  

series_id:  

Real Gross Domestic Product: `GDPC1`  
State unemployment rate: `state_code` + `UR`; use state abbreviations dataframe from section `4.5.1`  
CPI (for All Urban Consumers: All Items in U.S. City Average): `CPIAUCSL`  
Long-Term Government Bond Yields: 10-year: Main (Including Benchmark): `IRLTLT01USM156N`  
S&P/Case-Shiller U.S. National Home Price Index: `CSUSHPISA`
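
A small sketch of how the series IDs and the request URL above can be assembled. The query parameter names are the ones in the example request. The helper names are illustrative, and `YOUR_API_KEY` is a placeholder rather than a working key.

```python
from urllib.parse import urlencode

FRED_OBSERVATIONS_URL = "https://api.stlouisfed.org/fred/series/observations"


def state_unemployment_series(state_abbrev):
    """Series ID for a state's unemployment rate, e.g. 'CA' -> 'CAUR'."""
    return state_abbrev.upper() + "UR"


def observations_url(series_id, api_key, start="2001-01-01", end="2012-12-31"):
    """Build the HTTPS GET URL for one series over the study period."""
    params = {
        "series_id": series_id,
        "api_key": api_key,
        "file_type": "json",
        "observation_start": start,
        "observation_end": end,
    }
    return FRED_OBSERVATIONS_URL + "?" + urlencode(params)


# "YOUR_API_KEY" is a placeholder; substitute a real key before requesting.
print(observations_url(state_unemployment_series("ca"), "YOUR_API_KEY"))
```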

In [43]:
# List of series to pull

rest_api_list = ["IRLTLT01USM156N", "CSUSHPISA", "GDPC1", "CPIAUCSL"]

us_unemploy_api = ["UNRATE"]
state_unemploy_api_list = state_code_df['state_abbrev'].astype(str) + "UR"
state_unemploy_api_list = state_unemploy_api_list.tolist()
all_unemploy_api = us_unemploy_api + state_unemploy_api_list

all_api = rest_api_list + all_unemploy_api

In [44]:
# FRED API setup

api_key = "85482ff982c94bb52ca2ae28568ee970"
fred = Fred(api_key=api_key)

In [None]:
# Pull data from FRED into a dictionary with format {series_id: series_data_df}.
# A dictionary is used instead of dynamically created variables (via exec or globals()),
# which are harder to create and cannot be iterated over conveniently.

# The API occasionally returns a 504 Gateway Timeout error. Re-run the cell if that happens.

dict_series_values = {series: fred.get_series_latest_release(series).to_frame() for series in all_api}
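
Because the API can time out intermittently, re-running by hand could be replaced with an automatic retry. This is a minimal sketch of a generic retry helper. `call_with_retries` and `flaky_fetch` are hypothetical names, and the failing call is simulated rather than a real FRED request.

```python
import time


def call_with_retries(func, attempts=5, delay_seconds=2.0):
    """Call func(); on an exception, wait and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:  # in practice, narrow this to the timeout error
            if attempt == attempts:
                raise  # out of retries: surface the last error
            time.sleep(delay_seconds * attempt)  # linear back-off


# Stand-in for a flaky API call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("504 Gateway Timeout (simulated)")
    return "ok"


print(call_with_retries(flaky_fetch, delay_seconds=0))  # ok
```

In this notebook, each request could be wrapped in the same way, for example `call_with_retries(lambda: fred.get_series_latest_release(series))` inside the dictionary comprehension.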

In [45]:
for series, df in dict_series_values.items():
    df.columns = [series]
    df.reset_index(level=0, inplace=True)
    df.rename(columns={"index": "date"}, inplace = True)
    df['date'] = pd.to_datetime(df['date'], format = "%Y-%m-%d")  # assign back; to_datetime does not modify in place
    dict_series_values[series] = df.loc[(df['date'] >= "2001-01-01") & (df['date'] <= "2012-12-31")]

In [46]:
econ_df = pd.concat([df.set_index('date') for (series, df) in dict_series_values.items()], axis=1, join='outer').reset_index()

# Import into MySQL server

econ_df.to_sql('econ', con = engine, if_exists = 'append', chunksize = 1000)


## 5. Set up MySQL entities and relationships 

## 6. Data analysis and dataviz