Scaling Scientific Data Analysis in Python
=======================


### Requirements

* Pandas
* concurrent.futures (standard in Python 3, `pip install futures` in Python 2)
* snakeviz (for profile visualization, `pip install snakeviz==0.4.1`)
* Dask (`pip install dask[complete]`)

**Additional dependencies:** `tables` (PyTables), `toolz`, `pandas_datareader`

**Note:** For better compatibility, prefer creating a conda environment from the `environment.yml` file. Alternatively, a `requirements.txt` file is provided, which installs all dependencies into your virtualenv (`pip install -r requirements.txt`).
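
One way to confirm that the environment is complete is to check that each package can be located for import. The snippet below is a minimal sketch, assuming Python 3; the `REQUIRED` list and the `find_missing` helper are illustrative names, not part of the tutorial code.

```python
import importlib.util

# Import names for the packages listed above (assumed list for illustration).
# PyTables is imported under the name "tables".
REQUIRED = ["pandas", "dask", "snakeviz", "tables", "toolz",
            "pandas_datareader", "concurrent.futures"]


def find_missing(modules):
    """Return the names in `modules` that cannot be located for import."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # ImportError is raised when a parent package is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


missing = find_missing(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If anything is reported as missing, recreating the environment from `environment.yml` or `requirements.txt` is likely the simplest fix.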

### Objective
Make sure the commands below run without errors and that you can see the Snakeviz plot with several layers. **This is required to follow the tutorial.**

## Before we start

We need some data to work with.
We generate fake stock data by inserting many extra points between real stock data points. This takes a few minutes the first time we run it.

In [2]:
%run ../prep.py
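
The idea of adding points between real data points can be sketched as follows. This is not the code in `prep.py`, which is not shown here. It is a hypothetical illustration that assumes linear interpolation with a small random perturbation, and `densify` and its parameters are made-up names.

```python
import random


def densify(prices, points_between=3, noise=0.01, seed=0):
    """Insert `points_between` synthetic values between consecutive prices.

    Each synthetic value lies on the straight line between its two real
    neighbours, scaled by a small random factor. Real points are kept as-is.
    """
    rng = random.Random(seed)
    out = []
    for left, right in zip(prices, prices[1:]):
        out.append(left)
        for i in range(1, points_between + 1):
            frac = i / (points_between + 1)
            base = left + (right - left) * frac
            out.append(base * (1 + rng.uniform(-noise, noise)))
    if prices:
        out.append(prices[-1])  # the last real point has no right neighbour
    return out


real = [100.0, 102.0, 101.0]
fake = densify(real, points_between=3)
print(len(real), "->", len(fake))  # 3 -> 9
```

With three real points and three synthetic points per gap, the series grows from 3 to 9 values, which is how a small real data set can be inflated into something large enough to be worth profiling.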

## Sequential Execution

In [1]:
%load_ext snakeviz
from glob import glob
import json
import pandas as pd
import os

filenames = sorted(glob(os.path.join('..', 'data', 'json', '*.json')))  # ../data/json/*.json
filenames[:5]

['../data/json/afl.json',
 '../data/json/aig.json',
 '../data/json/al.json',
 '../data/json/avy.json',
 '../data/json/bwa.json']

In [2]:
%%snakeviz

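for fn in filenames:
    print(fn)
    # Each line of the file is a separate JSON document
    with open(fn) as f:
        data = [json.loads(line) for line in f]

    df = pd.DataFrame(data)

    # Write next to the input file, swapping the .json extension for .h5
    out_filename = os.path.splitext(fn)[0] + '.h5'
    df.to_hdf(out_filename, key='/data')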

../data/json/afl.json


your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->unicode,key->axis0] [items->None]

  f(store)
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->unicode,key->block0_items] [items->None]

  f(store)
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->unicode,key->block1_values] [items->[u'timestamp']]

  return pytables.to_hdf(path_or_buf, key, self, **kwargs)
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->unicode,key->block1_items] [items->None]

  f(store)


../data/json/aig.json
../data/json/al.json
../data/json/avy.json
../data/json/bwa.json
../data/json/hal.json
../data/json/hp.json
../data/json/hpq.json
../data/json/ibm.json
../data/json/jbl.json
../data/json/jpm.json
../data/json/luv.json
../data/json/pcg.json
../data/json/usb.json
 
*** Profile stats marshalled to file u'/var/folders/tt/nxzmt8xd701cq_x0rp1n1hk8005dxz/T/tmp4OaAuu'. 
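
Outside a notebook, a similar profile can be collected with the standard library's `cProfile` module. The snippet below is a minimal sketch that profiles only the JSON parsing step; the in-memory `lines` list is a hypothetical stand-in for one of the data files, and `parse_lines` is an illustrative helper name.

```python
import cProfile
import io
import json
import pstats


def parse_lines(lines):
    """Parse line-delimited JSON: one JSON document per line."""
    return [json.loads(line) for line in lines]


# Hypothetical in-memory records standing in for one of the data files.
lines = ['{"close": %d.5, "volume": %d}' % (i, i * 10) for i in range(1000)]

profiler = cProfile.Profile()
profiler.enable()
records = parse_lines(lines)
profiler.disable()

# Sort by cumulative time and print the five most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)
print(buffer.getvalue())
```

Calling `profiler.dump_stats(path)` writes the collected statistics to a file, which can then be opened from the command line with `snakeviz path` to get the same layered plot.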
