# Bigus datus: Working with bigger than memory data in Python

> The InfoGroup dataset comes as plain CSV files and is too large to fit in memory. This notebook explores alternatives for storage and processing.

In [None]:
# hide
# all_flag
%load_ext autoreload
%autoreload 2

from ig_format.core import lsdir


# Data


In [None]:
# Validated files
print(*lsdir('./out/valid'), sep='\n')

In [None]:
# Extracts: first 100k records
print(*lsdir('./out/extracts/100k'), sep='\n')

# Overview of options

## Google BigQuery

[Official page](https://cloud.google.com/bigquery/)

Pros

- Fast
- Easy to collaborate on
- SQL syntax
- Easy to ingest raw data: no need to normalize tables

Cons

- Costs
- Requires Internet connection
- Proprietary format, vendor lock-in
- Not available in RDC

## Relational database

- Can be on-premises or in the cloud.
- Many alternative implementations: SQLite, Postgres, MySQL, ...

Pros

- Robust standard
- Easily portable between providers
- SQL syntax

Cons

- Slow
  - Read-only access might improve speed
- Even slower if the schema is not optimized (normalization, indexing)
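The effect of indexing can be illustrated with SQLite from the Python standard library. This is a minimal sketch on a made-up table: the `firms` table, its columns, and the `query_plan` helper are hypothetical and are not the InfoGroup schema. It asks SQLite for the query plan before and after an index is created on the filtered column.

```python
import sqlite3

# Hypothetical toy table; names are illustrative, not the InfoGroup schema.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE firms (id INTEGER PRIMARY KEY, state TEXT, employees INTEGER)"
)
rows = [(i, "WI" if i % 2 else "IL", i % 50) for i in range(10_000)]
con.executemany("INSERT INTO firms VALUES (?, ?, ?)", rows)


def query_plan(con, sql):
    """Return SQLite's query plan for `sql` as a single string."""
    plan_rows = con.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in plan_rows)


sql = "SELECT COUNT(*) FROM firms WHERE state = 'WI'"

# Without an index on `state`, SQLite has to scan the whole table.
plan_before = query_plan(con, sql)

# With an index on the filtered column, SQLite can search the index instead.
con.execute("CREATE INDEX idx_firms_state ON firms (state)")
plan_after = query_plan(con, sql)

count_wi = con.execute(sql).fetchone()[0]
print(plan_before)
print(plan_after)
```

The exact wording of the plan differs between SQLite versions, so the sketch only prints it. On a table this small the timing difference is negligible; the point is that the plan switches from a scan to an index search.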

## Data warehouse

- Similar to Google BigQuery, but can be installed on-premises. Unlikely to be available in RDC.
- Columnar storage optimized for analytics.

[ClickHouse](https://clickhouse.yandex/), maybe others.

## Plain CSV + pandas or Stata

Pros

- Universally supported format
- Human readable on disk (unless compressed)

Cons

- Slow
- In-memory processing
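The in-memory limitation can be partly worked around by reading a CSV file in chunks, so that only one chunk is held in memory at a time. The sketch below is illustrative only: the sample file, its `state` and `employees` columns, and the chunk size are assumptions, not the InfoGroup layout. It uses the `chunksize` parameter of `pandas.read_csv` to aggregate per chunk and then combines the partial results.

```python
import csv
import os
import tempfile

import pandas as pd

# Write a small hypothetical CSV file to stand in for a large one.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["state", "employees"])
    for i in range(1000):
        writer.writerow(["WI" if i % 2 else "IL", i % 10])

# With `chunksize`, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
partials = [
    chunk.groupby("state")["employees"].sum()
    for chunk in pd.read_csv(path, chunksize=100)
]

# Combine the per-chunk sums into totals per state.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

This only works for aggregations that can be combined from partial results, such as sums and counts. Operations that need all rows at once, such as sorting or a median, do not fit this pattern directly.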

## Parquet, Arrow, dask

- Out-of-core processing
- Open standard
- The serialization format (on-disk storage) is distinct from the processing engine: pandas and dask can both process CSV and Parquet files.

# Installation

Create a `conda` environment and install the packages. The `conda-forge` channel must be enabled. Some packages are available only through pip.

```bash
conda create -n ig_format jupyterlab pandas dask matplotlib fastparquet nodejs python-snappy
pip install nbdev
pip install dask_labextension
```