# Table of Contents
 <p><div class="lev1 toc-item"><a href="#Move-YTS-data-to-BigQuery" data-toc-modified-id="Move-YTS-data-to-BigQuery-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Move YTS data to BigQuery</a></div><div class="lev2 toc-item"><a href="#TODO" data-toc-modified-id="TODO-11"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>TODO</a></div><div class="lev2 toc-item"><a href="#Costs-and-speed" data-toc-modified-id="Costs-and-speed-12"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Costs and speed</a></div><div class="lev2 toc-item"><a href="#Links" data-toc-modified-id="Links-13"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Links</a></div><div class="lev1 toc-item"><a href="#Explore-raw-data" data-toc-modified-id="Explore-raw-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Explore raw data</a></div><div class="lev1 toc-item"><a href="#Correct-raw-data" data-toc-modified-id="Correct-raw-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Correct raw data</a></div><div class="lev1 toc-item"><a href="#Upload-corrected-CSV-to-GCS" data-toc-modified-id="Upload-corrected-CSV-to-GCS-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Upload corrected CSV to GCS</a></div><div class="lev1 toc-item"><a href="#Import-CSV-from-GCS-into-BQ" data-toc-modified-id="Import-CSV-from-GCS-into-BQ-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Import CSV from GCS into BQ</a></div><div class="lev1 toc-item"><a href="#Convert-BQ-table-from-wide-to-long-format" data-toc-modified-id="Convert-BQ-table-from-wide-to-long-format-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Convert BQ table from wide to long format</a></div><div class="lev1 toc-item"><a href="#Prepare-lagged-values" data-toc-modified-id="Prepare-lagged-values-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Prepare lagged values</a></div><div class="lev1 toc-item"><a href="#Compute-aggregate-stats" data-toc-modified-id="Compute-aggregate-stats-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Compute aggregate 
stats</a></div><div class="lev2 toc-item"><a href="#Aggregate-on-FIPS-level" data-toc-modified-id="Aggregate-on-FIPS-level-81"><span class="toc-item-num">8.1&nbsp;&nbsp;</span>Aggregate on FIPS level</a></div><div class="lev2 toc-item"><a href="#Aggregate-on-state-level" data-toc-modified-id="Aggregate-on-state-level-82"><span class="toc-item-num">8.2&nbsp;&nbsp;</span>Aggregate on state level</a></div><div class="lev1 toc-item"><a href="#Query-results" data-toc-modified-id="Query-results-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Query results</a></div><div class="lev2 toc-item"><a href="#State,-year" data-toc-modified-id="State,-year-91"><span class="toc-item-num">9.1&nbsp;&nbsp;</span>State, year</a></div><div class="lev2 toc-item"><a href="#FIPS,-year" data-toc-modified-id="FIPS,-year-92"><span class="toc-item-num">9.2&nbsp;&nbsp;</span>FIPS, year</a></div><div class="lev1 toc-item"><a href="#Appendix" data-toc-modified-id="Appendix-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Appendix</a></div><div class="lev2 toc-item"><a href="#questions" data-toc-modified-id="questions-101"><span class="toc-item-num">10.1&nbsp;&nbsp;</span>questions</a></div><div class="lev2 toc-item"><a href="#&quot;Startup&quot;-variable" data-toc-modified-id="&quot;Startup&quot;-variable-102"><span class="toc-item-num">10.2&nbsp;&nbsp;</span>"Startup" variable</a></div>

# Move YTS data to BigQuery

## TODO

- Move in and out
- ExpStart and NewStart

- upload csv -> gcs -> bq using API

## Costs and speed

[BQ Pricing guide](https://cloud.google.com/bigquery/pricing)
- Storage: \$0.02 per 1 GB per month.
- Query: \$5 per 1 TB processed, first TB per month free.

The entire YTS dataset with all intermediate tables takes about 60 GB, which costs \$1.20 per month to store. All computation queries together process about 40 GB, which is within the free first TB per month.

Computation of aggregate stats takes less than 10 minutes. FIPS-year and state-year queries take 2-3 seconds.
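
As a rough check of these figures, the cost arithmetic can be sketched as below. The rates are the ones quoted above; the function names are illustrative, and 1 TB is taken as 1000 GB for simplicity.

```python
STORAGE_USD_PER_GB_MONTH = 0.02   # storage rate quoted above
QUERY_USD_PER_TB = 5.0            # on-demand query rate quoted above
FREE_QUERY_TB_PER_MONTH = 1.0     # first TB processed each month is free


def monthly_storage_cost(stored_gb):
    """Storage cost in USD for one month."""
    return stored_gb * STORAGE_USD_PER_GB_MONTH


def monthly_query_cost(processed_gb):
    """Query cost in USD for one month after the free tier (assumes 1 TB = 1000 GB)."""
    billable_tb = max(processed_gb / 1000 - FREE_QUERY_TB_PER_MONTH, 0.0)
    return billable_tb * QUERY_USD_PER_TB


print('storage: $%.2f' % monthly_storage_cost(60))   # storage: $1.20
print('queries: $%.2f' % monthly_query_cost(40))     # queries: $0.00
```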

## Links
Resources: [project](https://console.cloud.google.com/home/dashboard?project=info-group-162919) 
[storage](https://console.cloud.google.com/storage/browser?project=info-group-162919)
[bigquery](https://bigquery.cloud.google.com/dataset/info-group-162919:yts?pli=1)

[Variable definitions](https://docs.google.com/document/d/139zMJgQjDEwZLR40CiOMZIecklTKQ9Z4_pxWN7w9JZ0/edit)

Help:
[bq](https://cloud.google.com/bigquery/docs/how-to)
[bq sql](https://cloud.google.com/bigquery/docs/reference/standard-sql/)
[bq python api](https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html)

speed up queries: [bq partitions](https://cloud.google.com/bigquery/docs/partitioned-tables)

# Explore raw data

Gregg's files use `\r` as the newline character. The `csv` module recognizes it, but `bash` tools usually don't. To view the files, convert the newline characters first:
```bash
tr '\r' '\n' < YTS2_3_2.csv > YTS2_3_2_conv.csv
```

Some cells contain vertical tabs (`\v` or `^K`). Extract all rows that have such cells:
```bash
head -n1 YTS2_3_2_conv.csv > YTS2_3_2_vtab.csv
grep -nP '\v' YTS2_3_2_conv.csv >> YTS2_3_2_vtab.csv
```
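
The same check can be done in Python without converting the file first. This is a minimal sketch on a hypothetical in-memory sample; the sample values and the function name `find_vtab_cells` are made up for illustration.

```python
import csv
import io

# Hypothetical sample with '\r' line endings and one cell containing '\v'.
SAMPLE = 'ABI,Latitude,Longitude\r1,41.1\v41.2,-87.6\r2,40.0,-88.0\r'


def find_vtab_cells(text):
    """Return (line_number, field_name) for every cell containing a vertical tab."""
    # newline='' splits lines on '\r', '\n' or '\r\n' without translating them
    reader = csv.reader(io.StringIO(text, newline=''))
    header = next(reader)
    hits = []
    for line_number, row in enumerate(reader, start=2):
        for field_name, value in zip(header, row):
            if '\v' in value:
                hits.append((line_number, field_name))
    return hits


print(find_vtab_cells(SAMPLE))  # [(2, 'Latitude')]
```

For a real file, the `StringIO` object would be replaced by `open(filename, newline='')`.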

In [1]:
import csv, json
from time import time
import pandas as pd
from google.cloud import storage, bigquery

In [5]:
# BQ parameters
client = bigquery.Client()
dataset = client.dataset('yts_sl')
wide_table_ref = dataset.table('wide')
long_table_ref = dataset.table('long')
lag_table_ref = dataset.table('lag')
fips_table_ref = dataset.table('fips')
state_table_ref = dataset.table('state')
wide_table_path = wide_table_ref.dataset_id + '.' + wide_table_ref.table_id
long_table_path = long_table_ref.dataset_id + '.' + long_table_ref.table_id
lag_table_path = lag_table_ref.dataset_id + '.' + lag_table_ref.table_id
fips_table_path = fips_table_ref.dataset_id + '.' + fips_table_ref.table_id
state_table_path = state_table_ref.dataset_id + '.' + state_table_ref.table_id

In [3]:
def query_report(query_job, start_time):
    elapsed_time = time() - start_time
    mb_proc = query_job.total_bytes_processed / (1 << 20)
    mb_bill = query_job.total_bytes_billed / (1 << 20)
    tb_bill = mb_bill / (1 << 20)
    cost = 5 * tb_bill 
    print('Query finished in %d seconds. Processed %d MB, billed %d MB, cost $%.2f.' % (elapsed_time, mb_proc, mb_bill, cost))

# Correct raw data
Keep only the last value in fields with vertical tabs.

In [12]:
raw_filename = 'YTS10_23_17All.csv'
corrected_filename = 'YTS10_23_17All_corrected.csv'
corrections_filename = 'YTS10_23_17All_corrections.csv'
bq_schema_filename = 'bq_schema.json'

In [None]:
with open(raw_filename, mode='r', errors='ignore', newline='') as fp_in, \
    open(corrected_filename, mode='w', newline='') as fp_out, \
    open(corrections_filename, mode='w', newline='') as fp_cor:

    reader = csv.reader(fp_in)
    writer_out = csv.writer(fp_out)
    writer_cor = csv.writer(fp_cor)
    
    # header
    field_names_in = next(reader)
    _ = writer_out.writerow(field_names_in)
    field_names_cor = ['line_number', 'column_number', 'abi', 'field_name', 'old_value', 'new_value']
    _ = writer_cor.writerow(field_names_cor)
    
    # data rows
    line_i = 2
    for row_in in reader:
        row_out = []
        for col_i, value in enumerate(row_in):
            value_split = value.split('\v')
            new_value = value_split[-1]
            row_out.append(new_value)
            if len(value_split) > 1:
                abi = row_in[0]
                row_cor = [line_i, col_i, abi, field_names_in[col_i], value, new_value]
                _ = writer_cor.writerow(row_cor)
        _ = writer_out.writerow(row_out)
        
        line_i += 1
        if line_i % 1000000 == 0: print(line_i)
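
The per-row logic of the cell above can be tried on a single hypothetical row. The helper name `keep_last_values` and the sample values are illustrative only.

```python
def keep_last_values(header, row, line_number):
    """Return the corrected row and one correction record per changed cell."""
    corrected, corrections = [], []
    for col_i, value in enumerate(row):
        new_value = value.split('\v')[-1]   # last of the '\v'-separated values
        corrected.append(new_value)
        if new_value != value:              # true only when the cell contained '\v'
            corrections.append((line_number, col_i, row[0], header[col_i], value, new_value))
    return corrected, corrections


header = ['ABI', 'Latitude', 'Longitude']
row = ['1', '41.1\v41.2', '-87.6']          # hypothetical values
corrected, corrections = keep_last_values(header, row, line_number=2)
print(corrected)     # ['1', '41.2', '-87.6']
print(corrections)   # [(2, 1, '1', 'Latitude', '41.1\x0b41.2', '41.2')]
```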

In [19]:
# check which fields were corrected
df = pd.read_csv(corrections_filename, dtype='object', usecols=['field_name'])
df.field_name.value_counts()

Latitude      7732297
Longitude     7732297
HQFIPS2016          1
Name: field_name, dtype: int64

Some values may cause problems: the auto-detected schema sets ABI to integer, and the import then fails:
```
Errors:
gs://ig-anton/YTS10_23_17All_corrected.csv: CSV table encountered too many errors, giving up. Rows: 60519; errors: 1. (error code: invalid)
gs://ig-anton/YTS10_23_17All_corrected.csv: Could not parse 'IG4775007' as int for field ABI (position 0) starting at location 22211225052 (error code: invalid)
```
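
The error is consistent with a column type being guessed from a sample of rows and then applied to the whole file. The sketch below illustrates that mechanism only; it is not BigQuery's actual detection logic, `guess_type` is a hypothetical helper, and all sample values except `IG4775007` are made up.

```python
def guess_type(values):
    """Guess INTEGER if every sampled value parses as int, else STRING."""
    try:
        for v in values:
            int(v)
        return 'INTEGER'
    except ValueError:
        return 'STRING'


# Only the last value comes from the error log; the others are hypothetical.
abi_column = ['4775001', '4775002', '4775003', 'IG4775007']

print(guess_type(abi_column[:3]))  # INTEGER: the sample contains digits only
print(guess_type(abi_column))      # STRING: the full column does not
```

Declaring ABI as a string in an explicit schema avoids depending on which rows get sampled.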

# Upload corrected CSV to GCS
```bash
gsutil -m cp YTS10_23_17All_corrected.csv gs://ig-anton
```

# Import CSV from GCS into BQ
Import via the web UI and paste in the JSON schema prepared below.

In [13]:
# prepare BQ schema
df = pd.read_csv(corrected_filename, nrows=100, dtype='object')

bq_schema = []
for field_name in df:
    if field_name == 'FirstYear' or field_name == 'LastYear' or field_name[:3] == 'Emp' or field_name[:5] == 'Sales':
        bq_type = 'INTEGER'
    elif field_name == 'Latitude' or field_name == 'Longitude':
        bq_type = 'FLOAT'
    else:
        bq_type = 'STRING'
    bq_field = {'name': field_name, 'type': bq_type}
    bq_schema.append(bq_field)

with open(bq_schema_filename, 'w') as fp:
    json.dump(bq_schema, fp, indent=2)

# Convert BQ table from wide to long format

In [6]:
# construct query: union of tables for every year
query_select_year = '''
SELECT
  {y} AS year,
  ABI AS abi,
  Emp{y} AS emp,
  Sales{y} AS sales,
  FIPS{y} AS fips,
  NAICS{y} AS naics,
  {startup} as startup
FROM
  `{table}`
WHERE
  Emp{y} IS NOT NULL
'''

query_list = []
for y in range(1997, 2017):
    startup = 'Startup%d' % y if y > 1997 else 'null'
    query_list.append(query_select_year.format(y=y, table=wide_table_path, startup=startup))
query = '\nUNION ALL\n'.join(query_list)

# configure job
job_id = 'yts_wide_to_long_%d' % time()
job_config = bigquery.QueryJobConfig()
job_config.destination = long_table_ref

# start job
t = time()
query_job = client.query(query, job_config=job_config, job_id=job_id)
_ = query_job.result()

query_report(query_job, t)

Query finished in 93 seconds. Processed 10517 MB, billed 10518 MB, cost $0.05.
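
What the union of per-year queries does to a single wide row can be sketched in plain Python. The establishment below is hypothetical, and `wide_to_long` is an illustrative helper, not part of the pipeline.

```python
def wide_to_long(wide_row, years):
    """Yield one long-format record per year with non-null employment."""
    for y in years:
        emp = wide_row.get('Emp%d' % y)
        if emp is None:          # mirrors WHERE Emp{y} IS NOT NULL
            continue
        yield {
            'year': y,
            'abi': wide_row['ABI'],
            'emp': emp,
            'sales': wide_row.get('Sales%d' % y),
            'fips': wide_row.get('FIPS%d' % y),
            'naics': wide_row.get('NAICS%d' % y),
            # the query selects null for 1997 instead of a Startup1997 column
            'startup': wide_row.get('Startup%d' % y) if y > 1997 else None,
        }


# Hypothetical establishment observed in 1997 and 1998 only.
wide_row = {'ABI': '123', 'Emp1997': 5, 'Sales1997': 100, 'FIPS1997': '55025',
            'Emp1998': 7, 'Sales1998': 120, 'FIPS1998': '55025', 'Startup1998': None}
long_rows = list(wide_to_long(wide_row, range(1997, 2017)))
print([(r['year'], r['emp']) for r in long_rows])  # [(1997, 5), (1998, 7)]
```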


# Prepare lagged values

In [7]:
# break emp into size categories
q_size = '''
SELECT abi, year, fips, emp, sales, naics, startup,
    CASE
        WHEN emp = 0 THEN 0
        WHEN emp = 1 THEN 1
        WHEN emp BETWEEN 2 AND 9 THEN 2
        WHEN emp BETWEEN 10 AND 99 THEN 10
        WHEN emp BETWEEN 100 AND 499 THEN 100
        WHEN emp >= 500 THEN 500
        ELSE -1
    END AS size
FROM `{table}`
'''.format(table=long_table_path)

# add lag and lead variables
q_lag = '''
SELECT *,
    LAG(year) OVER (PARTITION BY abi ORDER BY year) AS year_preceding,
    LAG(emp) OVER (PARTITION BY abi ORDER BY year) AS emp_lag,
    LAG(size) OVER (PARTITION BY abi ORDER BY year) AS size_lag,
    LAG(FALSE, 1, TRUE) OVER (PARTITION BY abi ORDER BY year) AS birth,
    LEAD(FALSE, 1, TRUE) OVER (PARTITION BY abi ORDER BY year) AS death
FROM ({q})
'''.format(q=q_size)

# add emp change, set lags to null if there was gap in years
query = '''
SELECT * 
    REPLACE (
        IF(year_preceding = year - 1, emp_lag, NULL) AS emp_lag,
        IF(year_preceding = year - 1, size_lag, NULL) AS size_lag
    ),
    IF(year_preceding = year - 1, emp - emp_lag, NULL) AS emp_change
FROM ({q})
'''.format(q=q_lag)

# prepare job
job_id = 'yts_long_to_lag_%d' % time()
job_config = bigquery.QueryJobConfig()
job_config.destination = lag_table_ref

# start job
t = time()
query_job = client.query(query, job_config=job_config, job_id=job_id)
_ = query_job.result()

query_report(query_job, t)

Query finished in 145 seconds. Processed 15136 MB, billed 15137 MB, cost $0.07.
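
The window-function logic can be checked on one establishment in plain Python. This is a sketch under the assumption that the list passed in is the complete history of a single ABI; the helper names are illustrative, and the years and employment values are modeled on ABI 612580 from the appendix.

```python
def size_class(emp):
    """Lower bound of the employment size class, as in the CASE expression."""
    if emp is None or emp < 0:
        return -1
    for lower in (500, 100, 10, 2, 1, 0):
        if emp >= lower:
            return lower


def add_lags(observations):
    """observations: list of (year, emp) for one ABI; returns dicts with lag variables."""
    observations = sorted(observations)
    out = []
    for i, (year, emp) in enumerate(observations):
        prev = observations[i - 1] if i > 0 else None
        # lags are kept only when the preceding row is exactly one year earlier
        consecutive = prev is not None and prev[0] == year - 1
        out.append({
            'year': year,
            'emp': emp,
            'size': size_class(emp),
            'emp_lag': prev[1] if consecutive else None,
            'size_lag': size_class(prev[1]) if consecutive else None,
            'emp_change': emp - prev[1] if consecutive else None,
            'birth': i == 0,                       # no earlier row for this ABI
            'death': i == len(observations) - 1,   # no later row for this ABI
        })
    return out


for r in add_lags([(2006, 50), (2007, 50), (2012, 45), (2013, 45)]):
    print(r['year'], r['emp_lag'], r['emp_change'], r['birth'], r['death'])
# 2006 None None True False
# 2007 50 0 False False
# 2012 None None False False
# 2013 45 0 False True
```

Note that the 2012 row is neither a birth nor a continuation with a defined lag, which is the case raised in the appendix.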


# Compute aggregate stats

## Aggregate on FIPS level

In [8]:
# counts
q_count = '''
SELECT fips, year, size,
    COUNT(*) AS est,
    SUM(emp) AS emp,
    SUM(sales) AS sales
FROM `{table}`
GROUP BY fips, year, size
'''.format(table=lag_table_path)

# continuation
q_cont = '''
SELECT fips, year, size_lag AS size,
    COUNTIF(NOT birth AND emp_change > 0) AS est_expand,
    COUNTIF(NOT birth AND emp_change < 0) AS est_contract,
    SUM(IF(emp_change > 0, emp_change, 0)) AS emp_expand,
    SUM(IF(emp_change < 0, -emp_change, 0)) AS emp_contract
FROM `{table}`
WHERE size_lag IS NOT NULL
GROUP BY fips, year, size
'''.format(table=lag_table_path)

# birth
q_bir = '''
SELECT fips, year, size,
    COUNTIF(birth) AS est_birth,
    SUM(IF(birth, emp, 0)) AS emp_birth
FROM `{table}`
GROUP BY fips, year, size
'''.format(table=lag_table_path)

# death
q_dea = '''
SELECT fips, year + 1 AS year, size,
    COUNT(*) AS est_death,
    SUM(emp) AS emp_death
FROM `{table}`
WHERE death
GROUP BY fips, year, size
'''.format(table=lag_table_path)

# join continuation, birth and death into one table
q_join = '''
SELECT *
FROM
    ({q_count})
    FULL OUTER JOIN ({q_cont}) USING(fips, year, size)
    FULL OUTER JOIN ({q_bir}) USING(fips, year, size)
    FULL OUTER JOIN ({q_dea}) USING(fips, year, size)
'''.format(q_count=q_count, q_cont=q_cont, q_bir=q_bir, q_dea=q_dea)

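# replace NULL with zero in cells missing from some of the joined tables (typically deaths in large size classes)
# set flow variables to NULL in 1997, the first year of data, where they are undefined
# drop year 2017, which contains only deaths of establishments last observed in 2016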
query = '''
SELECT *
    REPLACE (
        IF(year = 1997, NULL, IF(est_birth is NULL, 0, est_birth)) AS est_birth,
        IF(year = 1997, NULL, IF(est_expand is NULL, 0, est_expand)) AS est_expand,
        IF(year = 1997, NULL, IF(est_contract is NULL, 0, est_contract)) AS est_contract,
        IF(year = 1997, NULL, IF(emp_birth is NULL, 0, emp_birth)) AS emp_birth,
        IF(year = 1997, NULL, IF(emp_expand is NULL, 0, emp_expand)) AS emp_expand,
        IF(year = 1997, NULL, IF(emp_contract is NULL, 0, emp_contract)) AS emp_contract,
        IF(year = 1997, NULL, IF(est_death is NULL, 0, est_death)) AS est_death,
        IF(year = 1997, NULL, IF(emp_death is NULL, 0, emp_death)) AS emp_death
    )
FROM ({q_join})
WHERE year != 2017 AND fips is not NULL
ORDER BY fips, year, size
'''.format(q_join=q_join)

# prepare job
job_id = 'yts_agg_fips_%d' % time()
job_config = bigquery.QueryJobConfig()
job_config.destination = fips_table_ref

# start job
t = time()
query_job = client.query(query, job_config=job_config, job_id=job_id)
_ = query_job.result()

query_report(query_job, t)

Query finished in 18 seconds. Processed 15555 MB, billed 15556 MB, cost $0.07.
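
How the four sub-queries attribute each establishment-year to a cell can be sketched in plain Python. This is an illustration only: the `aggregate` helper and the two records are hypothetical, and the sketch leaves out `sales`, the 1997 NULL handling and the removal of year 2017.

```python
from collections import Counter, defaultdict


def aggregate(records):
    """Aggregate establishment-year records into (fips, year, size) cells."""
    cells = defaultdict(Counter)
    for r in records:
        cell = cells[r['fips'], r['year'], r['size']]
        cell['est'] += 1
        cell['emp'] += r['emp']
        if r['birth']:
            cell['est_birth'] += 1
            cell['emp_birth'] += r['emp']
        if r['size_lag'] is not None:
            # expansions and contractions go to last year's size class
            cont = cells[r['fips'], r['year'], r['size_lag']]
            if r['emp_change'] > 0:
                cont['est_expand'] += 1
                cont['emp_expand'] += r['emp_change']
            elif r['emp_change'] < 0:
                cont['est_contract'] += 1
                cont['emp_contract'] -= r['emp_change']
        if r['death']:
            # a death is dated to the first year the establishment is absent
            dead = cells[r['fips'], r['year'] + 1, r['size']]
            dead['est_death'] += 1
            dead['emp_death'] += r['emp']
    return cells


# One hypothetical establishment: born in 2000, grows in 2001, gone in 2002.
records = [
    {'fips': '55025', 'year': 2000, 'size': 2, 'size_lag': None, 'emp': 8,
     'emp_change': None, 'birth': True, 'death': False},
    {'fips': '55025', 'year': 2001, 'size': 10, 'size_lag': 2, 'emp': 12,
     'emp_change': 4, 'birth': False, 'death': True},
]
cells = aggregate(records)
for key in sorted(cells):
    print(key, dict(cells[key]))
```

The growth from 8 to 12 employees is counted in size class 2 of year 2001, and the death in size class 10 of year 2002.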


## Aggregate on state level

In [9]:
# get state codes
df = pd.read_csv('https://www2.census.gov/geo/docs/reference/state.txt', delimiter='|', dtype='object')
state_codes = {
    'numcode_to_strcode': dict(zip(df.STATE, df.STUSAB)),
    'numcode_to_name': dict(zip(df.STATE, df.STATE_NAME)),
    'strcode_to_name': dict(zip(df.STUSAB, df.STATE_NAME))
}

# aggregate from FIPS by 2-digit state code
q_state_code = '''
SELECT
    SUBSTR(fips, 1, 2) as state_code,
    year,
    size,
    SUM(est) AS est,
    SUM(emp) AS emp,
    SUM(sales) AS sales,
    SUM(est_birth) AS est_birth,
    SUM(est_expand) AS est_expand,
    SUM(est_contract) AS est_contract,
    SUM(est_death) AS est_death,
    SUM(emp_birth) AS emp_birth,
    SUM(emp_contract) AS emp_contract,
    SUM(emp_expand) AS emp_expand,
    SUM(emp_death) AS emp_death
FROM `{table}`
GROUP BY state_code, year, size
'''.format(table=fips_table_path)

# add 2-letter state code
q_code_expr = 'CASE state_code\n'
for code_pair in state_codes['numcode_to_strcode'].items():
    q_code_expr += '  WHEN "%s" THEN "%s"\n' % code_pair
q_code_expr += '  ELSE "N/A"\nEND'

query = '''
SELECT
  {q_code_expr} AS state,
  *
FROM ({q_state_code})
ORDER BY state, year, size
'''.format(q_code_expr=q_code_expr, q_state_code=q_state_code)

# prepare job
job_id = 'yts_agg_state_%d' % time()
job_config = bigquery.QueryJobConfig()
job_config.destination = state_table_ref

# start job
t = time()
query_job = client.query(query, job_config=job_config, job_id=job_id)
_ = query_job.result()

query_report(query_job, t)

Query finished in 1 seconds. Processed 31 MB, billed 32 MB, cost $0.00.
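
The state mapping can be tried without BigQuery. In the sketch below, `state_case_expr` mirrors the loop that builds the `CASE` expression, and `state_code_of` is a hypothetical helper. The zero-padding is an assumption and is not something the queries above do: the appendix sample shows `fips` values such as `9003` without a leading zero, and for values stored that way the first two characters would not be the state code.

```python
def state_case_expr(numcode_to_strcode):
    """Build a SQL CASE expression mapping 2-digit state codes to postal abbreviations."""
    lines = ['CASE state_code']
    for numcode, strcode in sorted(numcode_to_strcode.items()):
        lines.append('  WHEN "%s" THEN "%s"' % (numcode, strcode))
    lines.append('  ELSE "N/A"')
    lines.append('END')
    return '\n'.join(lines)


def state_code_of(fips):
    """First two characters of the FIPS code, after padding it to five characters."""
    return fips.zfill(5)[:2]


print(state_code_of('53033'))  # 53
print(state_code_of('9003'))   # 09 (would be 90 without padding)
print(state_case_expr({'55': 'WI', '53': 'WA'}))
```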


# Query results

## State, year

In [12]:
query_state_year = '''
SELECT *
FROM `{table}`
WHERE state = '{st}' and year = {y}
'''.format(table=state_table_path, st='WI', y=1998)

vars_total = ['est', 'emp', 'sales', 'est_birth', 'est_expand', 'est_contract', 'est_death', 'emp_birth', 'emp_contract', 'emp_expand', 'emp_death']
query_agg = []
for v in vars_total:
    query_agg.append('  SUM(%s) as %s' % (v, v))
query_agg = ',\n'.join(query_agg)

query_total = '''
SELECT
  state, state_code, year,
  9999 as size,
{q_agg}
FROM state_year
GROUP BY state, state_code, year
'''.format(q_agg=query_agg)

query = '''
WITH state_year AS ({q_state_year})
SELECT * FROM state_year
UNION ALL
({q_total})
ORDER BY size
'''.format(q_state_year=query_state_year, q_total=query_total)

df = pd.read_gbq(query, verbose=False, dialect='standard', project_id=client.project)

df

| | state | state_code | year | size | est | emp | sales | est_birth | est_expand | est_contract | est_death | emp_birth | emp_contract | emp_expand | emp_death |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WI | 55 | 1998 | 1 | 38378 | 38378 | 4812488 | 3796 | 2246 | 0 | 5389 | 3796 | 0 | 6220 | 5389 |
| 1 | WI | 55 | 1998 | 2 | 128572 | 495115 | 70877932 | 8368 | 10928 | 11309 | 18369 | 28409 | 19455 | 40244 | 65527 |
| 2 | WI | 55 | 1998 | 10 | 47010 | 1188288 | 153653201 | 1505 | 5742 | 4824 | 3005 | 34370 | 34933 | 57969 | 69267 |
| 3 | WI | 55 | 1998 | 100 | 4185 | 740391 | 103059933 | 102 | 554 | 393 | 164 | 18774 | 24631 | 25231 | 30896 |
| 4 | WI | 55 | 1998 | 500 | 506 | 540530 | 63673221 | 9 | 62 | 55 | 23 | 6620 | 27721 | 13779 | 21557 |
| 5 | WI | 55 | 1998 | 9999 | 218651 | 3002702 | 396076775 | 13780 | 19532 | 16581 | 26950 | 91969 | 106740 | 143443 | 192636 |


![Results layout](results_layout.png)

## FIPS, year

In [14]:
q_fips_year = '''
SELECT *
FROM `{table}`
WHERE fips = '{fips}' and year = {y}
'''.format(table=fips_table_path, fips='53033', y=2005)

# UNION ALL matches columns by position, so this list must follow the column order of the FIPS table
vars_total = ['est', 'emp', 'sales', 'est_expand', 'est_contract', 'emp_expand', 'emp_contract', 'est_birth', 'emp_birth', 'est_death', 'emp_death']
q_agg = []
for v in vars_total:
    q_agg.append('  SUM(%s) as %s' % (v, v))
q_agg = ',\n'.join(q_agg)

q_total = '''
SELECT
  fips, year,
  9999 as size,
{q_agg}
FROM fips_year
GROUP BY fips, year
'''.format(q_agg=q_agg)

query = '''
WITH fips_year AS ({q_fips_year})
SELECT * FROM fips_year
UNION ALL
({q_total})
ORDER BY size
'''.format(q_fips_year=q_fips_year, q_total=q_total)

df = pd.read_gbq(query, verbose=False, dialect='standard', project_id=client.project)

df

| | fips | year | size | est | emp | sales | est_expand | est_contract | emp_expand | emp_contract | est_birth | emp_birth | est_death | emp_death |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 53033 | 2005 | 1 | 12662 | 12662 | 2254249 | 648 | 0 | 1802 | 0 | 835 | 835 | 1358 | 1358 |
| 1 | 53033 | 2005 | 2 | 60313 | 236389 | 47349785 | 3369 | 3379 | 16304 | 6069 | 7697 | 30331 | 5773 | 21477 |
| 2 | 53033 | 2005 | 10 | 17344 | 423722 | 75885622 | 1289 | 1440 | 15011 | 10837 | 1090 | 19934 | 1068 | 24472 |
| 3 | 53033 | 2005 | 100 | 1340 | 234540 | 61835626 | 61 | 97 | 3146 | 8197 | 38 | 7138 | 71 | 12018 |
| 4 | 53033 | 2005 | 500 | 137 | 212970 | 30775911 | 3 | 8 | 925 | 1672 | 5 | 3851 | 10 | 6606 |
| 5 | 53033 | 2005 | 9999 | 91796 | 1120283 | 218101193 | 5370 | 4924 | 37188 | 26775 | 9665 | 62089 | 8280 | 65931 |
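
`UNION ALL` matches columns by position and takes the column names from the first `SELECT`, so the total row lines up with the header only when the `SUM` list follows the table's column order. The sketch below shows the effect with SQLite from the standard library as a stand-in for BigQuery; the table `cells` and its values are hypothetical.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE cells (size INTEGER, est_expand INTEGER, est_birth INTEGER)')
con.executemany('INSERT INTO cells VALUES (?, ?, ?)', [(1, 10, 100), (2, 20, 200)])

QUERY = '''
SELECT size, est_expand, est_birth FROM cells
UNION ALL
SELECT 9999 AS size, {sums} FROM cells
ORDER BY size
'''

# SUM list in the same order as the table columns
aligned = con.execute(QUERY.format(
    sums='SUM(est_expand) AS est_expand, SUM(est_birth) AS est_birth')).fetchall()
# SUM list in a different order: aliases are ignored, position wins
swapped = con.execute(QUERY.format(
    sums='SUM(est_birth) AS est_birth, SUM(est_expand) AS est_expand')).fetchall()

print(aligned[-1])  # (9999, 30, 300)
print(swapped[-1])  # (9999, 300, 30): the birth total lands in the est_expand column
```

Listing the columns explicitly in both halves of the union, instead of using `SELECT *`, is one way to make the order visible in the query itself.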


# Appendix

## questions

emp == 0: what is this?

Null employment: is employment_category also null?

## "Startup" variable

When a firm with the same ABI reappears after a gap in years, is it a startup?

```
Row	abi	year	fips	emp	sales	naics	startup	emp_lag	birth	death	year_preceding	size	emp_change	emp_lag0	birth0	death0	 
11	612580	2007	9003	50	6950	33351711	null	50	false	false	2006	10	0	50	false	false	 
12	612580	2012	9003	45	6950	33351711	New	50	false	false	2007	10	null	null	false	false	 
13	612580	2013	9003	45	10615	33351711	null	45	false	false	2012	10	0	45	false	false	 

20	632026	2000	25017	297	null	33911203	null	297	false	false	1999	100	0	297	false	false	 
21	632026	2001	25017	275	null	33911203	null	297	false	false	2000	100	-22	297	false	false	 
22	632026	2003	25017	170	null	33451315	ExpStart	275	false	false	2001	100	null	null	false	false	 
23	632026	2004	25017	170	null	33451315	null	170	false	false	2003	100	0	170	false	false	 

46	929265	2007	9001	60	6600	33399920	null	60	false	false	2006	10	0	60	false	false	 
47	929265	2012	9001	60	12720	33211912	New	60	false	false	2007	10	null	null	false	false	 
48	929265	2013	9001	60	11054	33211912	null	60	false	false	2012	10	0	60	false	false	 
```