<img src="images/dask_horizontal.svg" align="right" width="30%">

# Table of Contents
* [Data Storage](#Data-Storage)
	* [Setup](#Setup)
	* [Read CSV](#Read-CSV)
		* [CSV to parquet](#CSV-to-parquet)
	* [Remote files](#Remote-files)


# Data Storage

<img src="images/hdd.jpg" width="20%" align="right">
Efficient storage can dramatically improve performance, particularly when operating repeatedly from disk.

Decompressing text and parsing CSV files is expensive. One of the most effective strategies with big data is to use a binary storage format like parquet, which lets you:
- encode the data in the most appropriate format
- read only the columns you need
- read only the partitions you need
- get implicit partitioning

In this section we'll learn how to efficiently arrange and store your datasets in on-disk binary formats.

1.  Storage formats affect performance by an order of magnitude.
2.  A combination of binary formats, column storage, and partitioned data turns one-second wait times into 80 ms wait times.
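
To make the cost of text parsing more concrete, here is a minimal sketch that uses only the Python standard library. The data is a hypothetical stand-in (a column of synthetic floats, not the flights dataset), and `load_text` and `load_binary` are illustrative names. It loads the same values once from a CSV-style text blob and once from raw binary, and times both.

```python
import array
import csv
import io
import timeit

# Hypothetical stand-in data: 200,000 synthetic floats (not the flights data).
values = array.array('d', (i * 0.5 for i in range(200_000)))

# Text form: one number per line, as a single-column CSV file would store it.
text_blob = "\n".join(repr(v) for v in values)
# Binary form: the raw 8-byte doubles, back to back.
binary_blob = values.tobytes()


def load_text():
    # Every field has to be split out and converted from characters to a float.
    reader = csv.reader(io.StringIO(text_blob))
    return array.array('d', (float(row[0]) for row in reader))


def load_binary():
    # The bytes are already in machine format, so this is essentially a copy.
    out = array.array('d')
    out.frombytes(binary_blob)
    return out


# Both routes recover exactly the same values.
assert load_text() == load_binary()

t_text = timeit.timeit(load_text, number=1)
t_binary = timeit.timeit(load_binary, number=1)
print(f"text: {t_text:.4f}s, binary: {t_binary:.4f}s")
```

The exact timings depend on your machine, and real parquet files add encoding and compression on top of this, so treat the numbers as an illustration of the parsing overhead only.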

## Setup

Create the data if we don't have any.

In [None]:
from prep import extract_flight
extract_flight()

In [None]:
!du -sh data/nycflights/

## Read CSV

First we read our CSV data as before.

CSV and other text-based file formats are the most common storage for data from many sources, because they require minimal pre-processing, can be written line by line, and are human-readable. Since Pandas' `read_csv` is well optimized, CSVs are a reasonable input, but far from optimal, since reading them requires extensive text parsing.

In [None]:
import os
import dask.dataframe as dd
df = dd.read_csv(os.path.join('data', 'nycflights', '*.csv'),
                 parse_dates={'Date': [0, 1, 2]},
                 dtype={'TailNum': str,
                        'CRSElapsedTime': float,
                        'Cancelled': bool})

In [None]:
%time df.Cancelled.sum().compute()

### CSV to parquet

`fastparquet` is a library for interacting with parquet-format files, which are very common in the Big Data ecosystem and are used by tools such as Hadoop, Spark and Impala.

In [None]:
target = os.path.join('data', 'flights.parquet')
df.to_parquet(target, compression='SNAPPY')

Investigate the file structure in the resulting new directory. What do you suppose those files are for?

`to_parquet` comes with many options, such as compression, whether to explicitly write NULL information, and how to encode strings. You can experiment with these below to see what effect they have on the file size and the processing times.

In [None]:
!ls -l data/flights.parquet/

In [None]:
!du -sh data/flights.parquet/

In [None]:
df_p = dd.read_parquet(target)
# the column types are already defined
df_p.dtypes

Rerun the sum computation above for this version of the data, and time how long it takes.

In [None]:
%time df_p.Cancelled.sum().compute()

When archiving data, it is common to sort and partition by a column with unique identifiers or inherent order, to facilitate fast look-ups later. For this data, one possible index column is `Date`. Time how long it takes to count the number of rows corresponding to dates `< 1991` from the raw CSV and parquet versions, and finally from a new parquet version written after applying `set_index('Date')`.

In [None]:
# df_p.set_index('Date').to_parquet(...)

## Remote files

Dask can access data stored in Amazon S3 or GCS buckets, or on HDFS.

Advantages:
* scalable, secure storage

Disadvantages:
* network speed becomes bottleneck
* only utilize local compute resources
    * See dask.distributed


For this we'll need s3fs.

```
conda install s3fs
```

```python
taxi = dd.read_csv('s3://nyc-tlc/trip data/yellow_tripdata_2015-*.csv')
```

**Warning**: operations over the Internet can take a long time to run. Such operations work really well in a cloud cluster set-up, e.g. Amazon EC2 machines reading from S3 or Google Compute machines reading from GCS.