# Scaling up empirical research to bigger data with Python

**Anton Babkin**  
Assistant Research Professor  
Department of Agricultural and Resource Economics  
University of Connecticut  
anton.babkin@uconn.edu

# Setup

- If you have not already, launch Binder or start Jupyter locally
- Run the code in the setup notebook to test the environment and get the data

# Introduction

- Why not **big** data?
- What can you do to analyze data that does not fit in memory?
  - Assuming that you use Python
- I've used methods shown here to work with ~100GB of CSV files, but they scale up

# General advice

- Tools presented here will add complexity to your code. Do not use them unless you have to. *Premature optimization is the root of all evil* (Donald Knuth)
- Make a small version of your data to develop and debug your code
- Use functions to modularize your code and let data be released from memory
- Use tests and assertions to verify correctness
- Learn to navigate the diverse Python ecosystem of open source libraries
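
A minimal sketch of the advice above: develop against a small dataset, wrap work in functions so that intermediate data can be freed when they return, and assert what must hold. The column names `YEAR` and `EMPLOYEES` and all function names are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd


def load_small_sample(n_rows=1_000, seed=0):
    """Build a small synthetic table to develop and debug against."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "YEAR": rng.integers(2001, 2006, size=n_rows),      # hypothetical column
        "EMPLOYEES": rng.integers(1, 100, size=n_rows),     # hypothetical column
    })


def employment_by_year(df):
    """Return total employment per year."""
    totals = df.groupby("YEAR")["EMPLOYEES"].sum()
    # Assertion: aggregation must neither lose nor invent employees.
    assert totals.sum() == df["EMPLOYEES"].sum()
    return totals


def main():
    df = load_small_sample()
    # df is local to main(); once main() returns, nothing references it
    # and its memory can be released.
    return employment_by_year(df)


totals = main()
print(totals)
```

Only the small `totals` result survives the call; the full table does not stay around in the notebook's global namespace.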

Get a bigger machine: use a university cluster or rent one in the cloud...

# Overview

- Measuring resource usage
- Chunking, sampling, subsetting and split-apply-combine
- Data type optimization
- Using parquet for storage
- Parallelization
- Dask

# Measuring resource usage

- CPU, memory, disk and network I/O
- Task Manager on Windows, Activity Monitor on Mac, `top` on Unix
- Every running program is a process (PID)
  - and its subprocesses
- Processes request memory from OS to store data (variables) and use CPU time to process them
- Under Jupyter, every notebook starts a kernel, which is a Python subprocess

- Running time
- `DataFrame.memory_usage()`
- Total usage by the process: `psutil`
- Advanced tools
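
The measurements listed above can be sketched as follows, assuming a synthetic table with hypothetical columns `EMPLOYEES` and `SECTOR`. `psutil` is a third-party package, so this sketch treats it as optional and falls back to `None` when it is not installed.

```python
import os
import time

import numpy as np
import pandas as pd

try:
    import psutil  # third-party; optional in this sketch
except ImportError:
    psutil = None


def process_memory_mb():
    """Resident memory of the current process in MB, or None without psutil."""
    if psutil is None:
        return None
    return psutil.Process(os.getpid()).memory_info().rss / 1e6


# Running time: wall-clock time around the code of interest.
start = time.perf_counter()
n = 100_000
df = pd.DataFrame({
    "EMPLOYEES": np.arange(n, dtype="int64"),
    "SECTOR": np.random.default_rng(0).choice(["11", "23", "44"], size=n),
})
elapsed = time.perf_counter() - start

# Per-column memory in bytes; deep=True also counts the string contents.
mem_by_column = df.memory_usage(deep=True)

print(f"build time: {elapsed:.3f} s")
print(mem_by_column)
print("process RSS (MB):", process_memory_mb())
```

In a notebook, the `%%time` and `%timeit` magics are a shorter way to get running times.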

# Data

- InfoGroup: proprietary dataset covering all US businesses since 1997
  - about 60 columns and 10M rows per year
- SynIG: synthetic random data that looks like InfoGroup
  - 20 years, 1M rows per year
  - location, industry and employment

Longitudinal (panel) data

Table: hypothetical employment

| ABI | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | ... |
|-----|------|------|------|------|------|------|-----|
| 001 | 1    | 5    | 2    | .    | .    | .    |     |
| 002 | 1    | 2    | 2    | 10   | 6    | .    |     |
| 003 | .    | 3    | 6    | 5    | 7    | 8    |     |
| ... |      |      |      |      |      |      |     |

show CSV tables

[NAICS sectors](https://www.census.gov/cgi-bin/sssd/naics/naicsrch?chart=2017)

# Subsetting and split-apply-combine

- Subset of rows and columns
- Read in chunks
- Split into subsets, apply a transformation to each subset, combine into the final result
- Memory-speed and memory-complexity trade-offs
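
A minimal sketch of chunked split-apply-combine with pandas, assuming a tiny in-memory CSV in place of a large file. The column names and values are made up for illustration.

```python
import io

import pandas as pd

# Tiny CSV standing in for a file that is too big to load at once.
csv_text = """YEAR,SECTOR,EMPLOYEES
2001,11,1
2001,23,5
2002,11,2
2002,23,10
2003,11,6
2003,23,3
"""

partials = []
# usecols reads only the needed columns; chunksize makes read_csv
# yield DataFrames of at most 2 rows instead of one big table.
reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["SECTOR", "EMPLOYEES"],
    dtype={"SECTOR": str},
    chunksize=2,
)
for chunk in reader:
    # Apply: aggregate each chunk on its own.
    partials.append(chunk.groupby("SECTOR")["EMPLOYEES"].sum())

# Combine: a sector can appear in several chunks, so sum the partial sums.
employment_by_sector = pd.concat(partials).groupby(level=0).sum()
print(employment_by_sector)
```

Sums and counts combine cleanly across chunks. A mean does not: keep the per-chunk sum and count, and divide only after combining.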

## Examples

- Representative random sample
- Establishments and employment by sector and year
- Size vs age
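
One way to approach the random sample example is to draw the same fraction of rows from every chunk, so the full table never has to be in memory. This is a sketch on synthetic data with hypothetical columns, not necessarily the method used in the notebooks.

```python
import io

import numpy as np
import pandas as pd

# Synthetic CSV built in memory; stands in for a large file on disk.
rng = np.random.default_rng(0)
full = pd.DataFrame({
    "ABI": np.arange(10_000),
    "EMPLOYEES": rng.integers(1, 100, size=10_000),
})
csv_text = full.to_csv(index=False)

pieces = []
for i, chunk in enumerate(pd.read_csv(io.StringIO(csv_text), chunksize=1_000)):
    # Same fraction from every chunk; a fixed per-chunk seed keeps the
    # sample reproducible across runs.
    pieces.append(chunk.sample(frac=0.01, random_state=i))

sample = pd.concat(pieces, ignore_index=True)
print(len(sample))
print(sample.head())
```

Because each chunk contributes in proportion to its size, the result is spread evenly over the file, and only one chunk plus the accumulated sample is held in memory at any time.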

<script>
    document.querySelector('head').innerHTML += '<style>body {font-size: 200%}</style>';
</script>