# Benchmark InfoGroup against other datasets

# Plan
- Download Census data
  - CBP and BDS
  - Store in GCS
- Convert Census data into comparable format
  - Select years and variables
  - Make properly indexed pandas dataframes
- Convert InfoGroup data into comparable format
  - SQL queries
  - pandas dataframes
- Benchmark
  - Compare with CBP and BDS
  - Report tables and figures

# TODO
- Understand difference in coverage between IG and CBP, BDS
  - different industry coverage
  - employers, self-employed
  - geographies (?)
- Adjust for difference, benchmark "apples to apples" and explore what's left behind
- Report sales data, by size or industry (watch for null)
- (minor) use US file to create CBP-by-industry tables.
- Benchmark against [Concentration Ratios](https://www.census.gov/econ/concentration.html), based on Economic Census data.

# Links
- [CBP data](https://www.census.gov/programs-surveys/cbp/data/datasets.html)
- [BDS](https://www.census.gov/ces/dataproducts/bds/), [data in CSV](https://www.census.gov/ces/dataproducts/bds/data.html), [codebook](https://www.census.gov/ces/pdf/BDS_2014_Codebook.pdf)
- [Census API](https://www.census.gov/data/developers/guidance/api-user-guide.html)
- [BDS API](https://www.census.gov/data/developers/data-sets/business-dynamics.html)
- [geo codes](https://www.census.gov/geo/)
- [SIC codes](https://www.osha.gov/pls/imis/sic_manual.html), [machine readable](https://www.census.gov/programs-surveys/cbp/technical-documentation/record-layouts/sic-code-descriptions.html)
- [BigQuery SQL reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/)
- [IG data](https://bigquery.cloud.google.com/dataset/info-group-162919:original) (BQ)

# 30-jun-17 Download CBP data
And push it to BigQuery

Ziqi downloaded 1986-2014 data to Storage buckets `cbp-txt` and `cbp-csv`.

Data documentation:
[1986-1997](https://www2.census.gov/programs-surveys/cbp/technical-documentation/records-layouts/full-layout/county_layout_sic.txt),
[1998-2006](https://www2.census.gov/programs-surveys/cbp/technical-documentation/records-layouts/full-layout/county_layout.txt),
[2007-2013](https://www2.census.gov/programs-surveys/cbp/technical-documentation/records-layouts/noise-layout/county_layout.txt),
[2014](https://www2.census.gov/programs-surveys/rhfs/cbp/technical%20documentation/2014_record_layouts/county_layout_2014.txt),
[2015](https://www2.census.gov/programs-surveys/rhfs/cbp/technical%20documentation/2015_record_layouts/county_layout_2015.txt)

I will start by looking at state - year - naics-2 breakdown, so I will download state CBP files.

# 02-jul-17 Explore CBP data

Thinking that BQ is probably overkill for my current task. Will just read from CSVs into pandas.

Columns changed over the years, so let's explore them.

InfoGroup data goes 1997-2015. CBP changes over this period:
- switched from SIC to NAICS in 1998 (IG has both NAICS and SIC in 1997)
- `nf` columns since 2007 - Noise Flag, don't even need at this point
- `lfo` column since 2010 - Legal Form of Organization, interesting but also not important now
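A minimal sketch of how the column changes listed above could be tabulated. The three CSV snippets are made-up stand-ins for CBP state files, and their column names (`fipstate`, `naics`, `emp_nf`, `emp`, `est`, `lfo`) are assumptions for illustration, not the full record layout.

```python
import io
import pandas as pd

# Toy stand-ins for CBP state files; real files have many more columns.
# Column sets are illustrative assumptions that follow the notes above.
samples = {
    2006: "fipstate,naics,emp,est\n01,11----,100,10\n",
    2007: "fipstate,naics,emp_nf,emp,est\n01,11----,G,100,10\n",
    2010: "fipstate,naics,lfo,emp_nf,emp,est\n01,11----,-,G,100,10\n",
}

def read_columns(text):
    """Return lower-cased column names, reading only the header row."""
    return [c.lower() for c in pd.read_csv(io.StringIO(text), nrows=0).columns]

columns = {year: read_columns(text) for year, text in samples.items()}

# Presence matrix: one row per column name, one column per year.
all_names = sorted(set().union(*columns.values()))
presence = pd.DataFrame(
    {year: [name in cols for name in all_names] for year, cols in columns.items()},
    index=all_names,
)

# First year in which each column appears (years are in ascending order).
first_seen = presence.astype(int).idxmax(axis=1)

print(presence)
print(first_seen)
```

Reading with `nrows=0` parses only the header, so the same loop stays cheap on full-size files.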

# 03-jul-17 Convert CBP to dataframe

For now, I will ignore 1997, since CBP does not have NAICS there. Later I can either translate SIC to NAICS or benchmark against SIC directly.

Upload raw data to GCS.

Read CBP data into pandas
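A rough sketch of this step, not the code from the notebook. It assumes a state file with `FIPSTATE`, `NAICS`, `EMP` and `EST` columns and dash-padded NAICS codes (`11----` for a 2-digit sector, `------` for the total). The rows are made up and `read_cbp` is a hypothetical helper.

```python
import io
import pandas as pd

# Made-up rows in the shape of a CBP state file (assumed columns).
raw = """FIPSTATE,NAICS,EMP,EST
01,------,1500,120
01,11----,100,10
01,111///,60,6
02,11----,40,5
"""

def read_cbp(source, year):
    """Read one year of CBP data, keep 2-digit sectors, index by year-state-sector."""
    # Read codes as strings so leading zeros in FIPS codes survive.
    df = pd.read_csv(source, dtype={"FIPSTATE": str, "NAICS": str})
    df.columns = df.columns.str.lower()
    # Keep 2-digit sectors only: two digits followed by four dashes.
    df = df[df["naics"].str.fullmatch(r"\d{2}-{4}")].copy()
    df["naics"] = df["naics"].str[:2]
    df["year"] = year
    return df.set_index(["year", "fipstate", "naics"])[["emp", "est"]].sort_index()

cbp = read_cbp(io.StringIO(raw), 2014)
print(cbp)
```

Frames for several years can then be stacked with `pd.concat` since they share the same index levels.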

# 05-jul-17 Download IG data
Get InfoGroup data from BQ and compare with CBP.

Create small random sample of data for testing.

```sql
select * from `original.data` where rand() < 0.01;
```

# 07-jul-17 Add establishments count to CBP
Makes sense to compare them together with employment.

# 11-jul-17 Compare CBP and IG

Download NAICS descriptions. Over time the only change in 2-digit codes was that "Mining" changed to "Mining, Quarrying, and Oil and Gas Extraction" between 2002 and 2007.

Align CBP and IG dataframes for comparison. Save comparison-ready tables.

Comparison conclusions:
- In aggregate, IG shows 50-100% more establishments and 10-30% more employees than CBP
- Difference grows over time
- There are not many outliers across states
- Very strong outliers in industries: too much 99, 92, 11, 61, too little 55.
- Patterns are similar between establishments and employment
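The comparison behind these conclusions can be sketched as IG/CBP ratios of totals over a shared index. All numbers below are made up, and the index level names and the `ratio_by` helper are assumptions for illustration.

```python
import pandas as pd

# Made-up, already aligned tables with a shared (year, state, naics) index.
idx = pd.MultiIndex.from_tuples(
    [(2014, "01", "11"), (2014, "01", "55"), (2014, "02", "11")],
    names=["year", "state", "naics"],
)
cbp = pd.DataFrame({"emp": [100, 200, 40], "est": [10, 20, 5]}, index=idx)
ig = pd.DataFrame({"emp": [150, 120, 50], "est": [25, 10, 9]}, index=idx)

# Join on the shared index; suffixes tell the two sources apart.
both = cbp.join(ig, lsuffix="_cbp", rsuffix="_ig", how="outer")

def ratio_by(level):
    """IG/CBP ratio of totals, aggregated to one index level."""
    totals = both.groupby(level=level).sum()
    return pd.DataFrame({
        "emp": totals["emp_ig"] / totals["emp_cbp"],
        "est": totals["est_ig"] / totals["est_cbp"],
    })

by_naics = ratio_by("naics")
by_state = ratio_by("state")
print(by_naics)
```

A ratio above 1 means IG reports more than CBP for that cell; outlier industries show up as ratios far from the aggregate one.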

# 25-jul-17 Interactive heatmaps
Made some interactive heatmaps with `seaborn` and `ipywidgets`, but could not find a way to save widgets and plots in a notebook for non-interactive view. `bokeh` is another option, might try it sometime.

# 09-aug-17 Compare IG and BDS

NAICS is not available in BDS, only the most aggregate SIC sectors.

There are no CSVs with state-by-sector aggregation. There is an HTTP API, but it looks like it only supports the same aggregations that are available for download: either by sector or by state. It is pretty cool though, so I will use it here.


# 21-aug-17 Clean up lab book

Removed all code, it should be in `benchmark.ipynb`.

# 22-aug-17 Improve raw data downloads

Update code to download CBP and BDS data to GCS.

For now, only state CBP files.

Metadata is stored in `census.json`.

# 23-aug-17 Refactor CBP processing

Remove CBP documentation download, not useful right now. Maybe download it if I choose to download all Census data.

Update CBP exploration and conversion to pandas.

# 24-aug-17 Read BDS, IG into pandas

Remove BDS download via API, get CSV files from GCS.

Read BDS into pandas.

Working with NAICS codes, got thinking: how important are changes introduced by code updates? Nothing happens at top 2-digit level, but how deep are the changes? Applies to SIC too.
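One way to gauge how deep the changes go is to compare the sets of code prefixes between two vintages at each digit length. A sketch, with made-up code lists that do not describe any real NAICS revision; `changed_share` is a hypothetical helper.

```python
# Made-up 6-digit code lists for two hypothetical vintages.
old_codes = {"111110", "111120", "517211", "517212"}
new_codes = {"111110", "111120", "517210", "517311"}

def changed_share(old, new, digits):
    """Share of code prefixes of a given length found in only one vintage."""
    a = {code[:digits] for code in old}
    b = {code[:digits] for code in new}
    return len(a ^ b) / len(a | b)

shares = {d: changed_share(old_codes, new_codes, d) for d in range(2, 7)}

for digits, share in shares.items():
    print(digits, round(share, 2))
```

This only catches codes that were added or dropped. A code that keeps its number but changes its definition would go unnoticed and would need the official concordance tables.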

Add state, SIC and NAICS code maps to GCS.

Read IG into pandas.

# 05-sep-17 `dill` module to save/load sessions

This module looks cool and should be very simple to use, but in my case it fails to load a pickled session. Tried to understand why, but could not trace back the error. It is somewhere between the `dill.Unpickler` and `pickle.Unpickler` classes.

For now, might be safer to use basic `pickle`.
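A minimal sketch of the plain `pickle` fallback: instead of saving the whole session, put the objects worth keeping into a dict and pickle that. The dict contents and file name are hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical objects worth keeping between sessions.
state = {
    "cbp": pd.DataFrame({"emp": [100, 40]}, index=["01", "02"]),
    "years": list(range(1998, 2016)),
}

# Temporary location for the example; a real path would live next to the notebook.
path = os.path.join(tempfile.mkdtemp(), "session.pkl")

with open(path, "wb") as f:
    pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["cbp"])
```

Unlike a full session dump, this does not restore functions or imports, so the notebook cells that define them still have to be rerun.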

# 06-sep-17 Finish cleaning up benchmarks notebook

Replaced old comparisons with new ones, removed interactive widgets.

# 08-sep-17 Explore IG

Learning more BQ.

It might be a good idea to [partition table by year](https://cloud.google.com/bigquery/docs/creating-partitioned-tables).

Query IG by size.

When comparing by size, keep in mind that BDS takes either `t-1` or average `(t-1, t)` size.
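To see how much the choice of size definition matters, the three variants can be computed side by side. A sketch with made-up employment numbers; the size bins are a simplified assumption, not the BDS size classes.

```python
import pandas as pd

# Made-up employment by establishment and year.
emp = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "year": [2013, 2014, 2013, 2014],
    "emp": [4, 12, 30, 10],
}).sort_values(["id", "year"])

# Employment in t-1 within each establishment, and the (t-1, t) average.
emp["emp_prev"] = emp.groupby("id")["emp"].shift(1)
emp["emp_avg"] = (emp["emp_prev"] + emp["emp"]) / 2

# Simplified size classes (assumption); intervals are closed on the left.
bins = [1, 5, 10, 20, float("inf")]
labels = ["1-4", "5-9", "10-19", "20+"]

def size_class(series):
    """Assign a size class; missing employment stays missing."""
    return pd.cut(series, bins=bins, labels=labels, right=False)

emp["size_current"] = size_class(emp["emp"])
emp["size_prev"] = size_class(emp["emp_prev"])
emp["size_avg"] = size_class(emp["emp_avg"])

print(emp)
```

In this toy data establishment 1 falls into a different class under each of the three definitions in 2014, which is the kind of mismatch to watch for when lining IG up with BDS.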

# 15-sep-17 Compare by size

Meta finder.

IG against BDS.

# 18-sep-17

Download US CBP files from Census.

Compare IG vs CBP by size.

Started computing entry and exit in BQ.
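The actual computation runs in BQ. As a small-scale illustration, here is a pandas sketch of one common definition, which is an assumption and not necessarily the one used in the queries: entry in `t` means observed in `t` but not in `t-1`, and exit after `t` means observed in `t` but not in `t+1`. The panel is made up.

```python
import pandas as pd

# Made-up panel: one row per establishment per year it is observed.
panel = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 3, 3],
    "year": [2012, 2013, 2014, 2012, 2013, 2013, 2014],
})

first_year, last_year = panel["year"].min(), panel["year"].max()
seen = set(zip(panel["id"], panel["year"]))
pairs = list(zip(panel["id"], panel["year"]))

# Entry in t: observed in t, not in t-1.
panel["entry"] = [(i, y - 1) not in seen for i, y in pairs]
# Exit after t: observed in t, not in t+1.
panel["exit_next"] = [(i, y + 1) not in seen for i, y in pairs]

# The first and last years are censored: entry and exit cannot be told there.
panel.loc[panel["year"] == first_year, "entry"] = False
panel.loc[panel["year"] == last_year, "exit_next"] = False

counts = panel.groupby("year")[["entry", "exit_next"]].sum()
print(counts)
```

With this definition an establishment that disappears for one year and comes back counts as an exit followed by an entry, so gaps in coverage inflate both series.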