# Scaling up empirical research to bigger data with Python

**Anton Babkin**  
Assistant Research Professor  
Department of Agricultural and Resource Economics  
University of Connecticut  
anton.babkin@uconn.edu

# Introduction

- Why not **big** data?
- We will talk about working with data that is bigger than memory
- I have used the methods shown here to work with about 100 GB of InfoGroup CSV files and over 500 GB of administrative data at the Census Bureau

# Setup

- Launch Binder or start Jupyter locally  
https://github.com/antonbabkin/ds-bazaar-workshop
- Run code in setup notebook to test environment and get data

# General advice

- Tools presented here will add complexity to your code. Do not use them unless you have to. *Premature optimization is the root of all evil* (Donald Knuth)
- Make a small version of your data to develop and debug your code
- Use functions to modularize your code, so that intermediate data can be released from memory when a function returns
- Use tests and assertions to verify correctness
- Learn to navigate the diverse Python ecosystem of open source libraries
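
A minimal sketch of two of the points above, a small development copy of the data and an assertion to check the result, using only the standard library. The function name `make_small_copy`, the file names and the column names are hypothetical and chosen for illustration. Taking the first rows is the simplest option; it is not a random sample.

```python
import csv
import itertools
import tempfile
from pathlib import Path


def make_small_copy(src, dst, n_rows):
    """Copy the header and the first n_rows data rows of a CSV file."""
    with open(src, newline="") as f_in, open(dst, "w", newline="") as f_out:
        reader = csv.reader(f_in)
        writer = csv.writer(f_out)
        writer.writerow(next(reader))  # header line
        n_written = 0
        # islice stops after n_rows, so the rest of the file is never read.
        for row in itertools.islice(reader, n_rows):
            writer.writerow(row)
            n_written += 1
    return n_written


# Demo on a throwaway file standing in for a large CSV (hypothetical columns).
tmp_dir = Path(tempfile.mkdtemp())
full_path = tmp_dir / "full.csv"
small_path = tmp_dir / "small.csv"
with open(full_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ABI", "YEAR", "EMPLOYEES"])
    for i in range(1000):
        writer.writerow([i, 2001, i % 50])

n_copied = make_small_copy(full_path, small_path, n_rows=100)

# Assertion: fail early if the small copy is not what we expect.
assert n_copied == 100
```

Because the work happens inside a function, its local variables go out of scope when it returns and only the small file on disk remains.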

Get a bigger machine: university clusters, rented cloud servers...

# Overview

- Measuring resource usage
- Chunking, sampling, subsetting and split-apply-combine
- Data type optimization
- Using parquet for storage
- Parallelization
- Dask

# Data

- InfoGroup: proprietary dataset covering all US businesses since 1997
  - about 60 columns and 10M rows per year
- SynIG: synthetic random data that looks like InfoGroup
  - 20 years, 1M rows per year
  - location, industry and employment

### Longitudinal (panel) data

Table: hypothetical employment

| ABI | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | ... |
|-----|------|------|------|------|------|------|-----|
| 001 | 1    | 5    | 2    | .    | .    | .    |     |
| 002 | 1    | 2    | 2    | 10   | 6    | .    |     |
| 003 | .    | 3    | 6    | 5    | 7    | 8    |     |
| ... |      |      |      |      |      |      |     |
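
The same hypothetical numbers can be held in "long" form, with one record per ABI and year, and pivoted to the wide layout of the table. This is only an illustrative sketch in plain Python; it assumes that each ABI identifies one business and that `.` marks a year in which the business is not observed.

```python
# Long-form records (ABI, year, employment) copied from the table above.
records = [
    ("001", 2001, 1), ("001", 2002, 5), ("001", 2003, 2),
    ("002", 2001, 1), ("002", 2002, 2), ("002", 2003, 2),
    ("002", 2004, 10), ("002", 2005, 6),
    ("003", 2002, 3), ("003", 2003, 6), ("003", 2004, 5),
    ("003", 2005, 7), ("003", 2006, 8),
]

years = sorted({year for _, year, _ in records})

# Pivot to wide form: one entry per ABI, holding a {year: employment} mapping.
wide = {}
for abi, year, emp in records:
    wide.setdefault(abi, {})[year] = emp

print("ABI", *years)
for abi in sorted(wide):
    # Years without a record are printed as ".", as in the table.
    print(abi, *(wide[abi].get(year, ".") for year in years))
```

The long form stores only observed ABI-year pairs, which matters when businesses enter and exit and the wide table is mostly empty.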

show CSV tables

[NAICS sectors](https://www.census.gov/cgi-bin/sssd/naics/naicsrch?chart=2017)

### Requesting access to InfoGroup

*For University of Wisconsin affiliated researchers.*

InfoGroup serves as raw input for an enhanced dataset called YTS (Youreconomy Time Series).

General Info: https://wisconsinbdrc.org/data/  
DDaTSL: https://www.bdrcfm.org/DDATSL2020/ (authorized account required)

For more information about using the data, including the web-based tool, contact: jessica.nelson@business.wisconsin.edu

# Parting words

- Premature optimization is the root of all evil
- Do not optimize without measurement
- Inevitable memory-time-complexity trade-offs
- Python is extremely versatile and powerful
  - with great power comes great responsibility
  - keep learning

<script>
    document.querySelector('head').innerHTML += '<style>body {font-size: 200%}</style>';
</script>