# Examples - DataFrame
http://dask.pydata.org/en/latest/examples-tutorials.html#dataframe

In [1]:
import dask
import dask.dataframe as dd

data_path = '../dask-tutorial/data'

## Dataframes from CSV files
Suppose we have a collection of CSV files with data:

In [2]:
text = '''time,temperature,humidity
0,22,58
1,21,57
2,25,57
3,26,55
4,22,53
5,23,59
'''

with open('data1.csv', 'w') as f:
    f.write(text)

In [3]:
text = '''time,temperature,humidity
0,24,85
1,26,83
2,27,85
3,25,92
4,25,83
5,23,81
'''

with open('data2.csv', 'w') as f:
    f.write(text)

In [4]:
text = '''time,temperature,humidity
0,18,51
1,15,57
2,18,55
3,19,51
4,19,52
5,19,57
'''

with open('data3.csv', 'w') as f:
    f.write(text)

In [5]:
import dask.dataframe as dd

df = dd.read_csv('data*.csv')
df

,time,temperature,humidity
npartitions=3,,,
,int64,int64,int64
,...,...,...
,...,...,...
,...,...,...


In [6]:
type(df)

dask.dataframe.core.DataFrame

In [7]:
df.tail()

,time,temperature,humidity
1,1,15,57
2,2,18,55
3,3,19,51
4,4,19,52
5,5,19,57


In [8]:
type(df.compute())

pandas.core.frame.DataFrame

In [9]:
df.compute().tail()

,time,temperature,humidity
1,1,15,57
2,2,18,55
3,3,19,51
4,4,19,52
5,5,19,57
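The tail rows above are labeled 1 to 5 rather than 13 to 17. This is consistent with each CSV file becoming its own partition with its own default index starting at 0. The following pandas-only sketch mimics that behavior by concatenating one frame per file without resetting the index. It is an analogy under that assumption, not dask's implementation, and it writes its files to a scratch directory rather than the current one.

```python
import glob
import os
import tempfile

import pandas as pd

header = "time,temperature,humidity\n"
files = {
    "data1.csv": "0,22,58\n1,21,57\n2,25,57\n3,26,55\n4,22,53\n5,23,59\n",
    "data2.csv": "0,24,85\n1,26,83\n2,27,85\n3,25,92\n4,25,83\n5,23,81\n",
    "data3.csv": "0,18,51\n1,15,57\n2,18,55\n3,19,51\n4,19,52\n5,19,57\n",
}

with tempfile.TemporaryDirectory() as tmp:
    for name, body in files.items():
        with open(os.path.join(tmp, name), "w") as f:
            f.write(header + body)

    # One pandas frame per file, in sorted filename order (one "partition" each).
    paths = sorted(glob.glob(os.path.join(tmp, "data*.csv")))
    parts = [pd.read_csv(p) for p in paths]

# Concatenate without resetting the index, so every file keeps labels 0..5.
combined = pd.concat(parts)
tail_labels = combined.tail().index.tolist()

print(tail_labels)               # labels of the last five rows
print(combined.index.is_unique)  # labels repeat across files
```

If unique row labels are needed, resetting the index after concatenation (or after `compute()`) is one option.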


In [10]:
df.temperature.mean().compute()

22.055555555555557

In [11]:
df.humidity.std().compute()

14.710829233324224
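Reductions such as `mean()` and `std()` do not require the whole table in one place: each partition can be summarized separately and the partial results combined. The sketch below illustrates that idea in plain Python for the humidity column. The helper names `summarize` and `combine` are hypothetical, and this is not dask's internal code, only a model of a partition-wise reduction using the sample standard deviation (`ddof=1`).

```python
import math

# Humidity values from data1.csv, data2.csv and data3.csv, one list per partition.
partitions = [
    [58, 57, 57, 55, 53, 59],
    [85, 83, 85, 92, 83, 81],
    [51, 57, 55, 51, 52, 57],
]


def summarize(values):
    """Per-partition step: return (count, sum, sum of squares)."""
    return len(values), sum(values), sum(v * v for v in values)


def combine(summaries):
    """Aggregate step: merge partial summaries into a mean and a sample std."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = (total_sq - total * total / n) / (n - 1)  # ddof=1
    return mean, math.sqrt(variance)


humidity_mean, humidity_std = combine([summarize(p) for p in partitions])
print(humidity_mean, humidity_std)
```

The standard deviation computed this way agrees with the value returned by `df.humidity.std().compute()` above to within floating-point rounding.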

## Dataframes from HDF5 files
This section gives working examples of reading HDF5 files with dask.dataframe. HDF5 is a file format and library suite for managing large, complex data collections. To learn more about HDF5, see the HDF Group tutorial page. For an overview of dask.dataframe, including its scope and limitations, see the DataFrame overview section.

Important Note – dask.dataframe.read_hdf uses pandas.read_hdf, thereby inheriting its abilities and limitations. See pandas HDF5 documentation for more information.

### Examples Covered
Use dask.dataframe to:
- Create a dask DataFrame by loading a specific dataset (key) from a single HDF5 file
- Create a dask DataFrame from a single HDF5 file with multiple datasets (keys)
- Create a dask DataFrame by loading multiple HDF5 files with different datasets (keys)

### Generate Example Data
Here is some code to generate sample HDF5 files.



In [12]:
'my{:02d}.h5'.format(5)

'my05.h5'

In [13]:
import random
import string
import pandas as pd
import numpy as np

# dict to keep track of hdf5 filename and each key
fileKeys = {}

for i in range(10):
    # randomly pick letter as dataset key
    groupkey = random.choice(list(string.ascii_lowercase))

    # build the hdf5 filename from the loop index (my00.h5 ... my09.h5)
    filename = 'my{:02d}.h5'.format(i)

    # Make a dataframe; 26 rows, 2 columns
    df = pd.DataFrame({'x': np.random.randint(1, 1000, 26),
                       'y': np.random.randint(1, 1000, 26)},
                       index=list(string.ascii_lowercase))

    # Write hdf5 to current directory; to_hdf defaults to mode='a',
    # so keys written by earlier runs of this cell stay in the file
    df.to_hdf(filename, key='/' + groupkey, format='table')
    fileKeys[filename] = '/' + groupkey

print(fileKeys) # prints hdf5 filenames and keys for each

{'my00.h5': '/g', 'my01.h5': '/w', 'my02.h5': '/i', 'my03.h5': '/x', 'my04.h5': '/f', 'my05.h5': '/u', 'my06.h5': '/v', 'my07.h5': '/x', 'my08.h5': '/o', 'my09.h5': '/w'}


### Read single dataset from HDF5
The first step is to create a dask DataFrame from a single dataset in one HDF5 file:

In [14]:
import dask.dataframe as dd
filename = 'my06.h5'
df = dd.read_hdf(filename, key=fileKeys[filename])
df

,x,y
npartitions=1,,
,int64,int64
,...,...


In [15]:
len(df)

26

In [16]:
df.tail()

,x,y
v,273,448
w,995,415
x,327,21
y,883,972
z,205,837


### Load multiple datasets from single HDF5 file
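Loading multiple datasets from a single file requires only a wildcard in the key. The output below shows five partitions and 130 rows (5 × 26), which implies that my06.h5 held five datasets when this cell ran, presumably because the generator cell had been run several times. The generator writes one key per file per run, so a freshly generated file would give a single partition of 26 rows.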

In [17]:
import dask.dataframe as dd
filename = 'my06.h5'
df = dd.read_hdf(filename, key='/*')
df

,x,y
npartitions=5,,
,int64,int64
,...,...
...,...,...
,...,...
,...,...


In [18]:
len(df)

130

In [19]:
df.tail()

,x,y
v,61,997
w,262,593
x,286,39
y,47,263
z,922,76
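To make the effect of the key wildcard concrete, here is a small sketch of glob-style key selection using the standard library's `fnmatch`. It assumes that the key pattern follows shell-style wildcard rules. The list `stored_keys` is hypothetical (five keys, chosen only to mirror `npartitions=5` above), and `match_keys` is an illustrative helper, not part of dask or pandas.

```python
from fnmatch import fnmatch

# Hypothetical key listing for one HDF5 file; the real keys depend on
# what has been written to the file.
stored_keys = ["/c", "/k", "/q", "/v", "/z"]

ROWS_PER_DATASET = 26  # each generated dataset has one row per letter


def match_keys(keys, pattern):
    """Return the keys selected by a shell-style wildcard pattern."""
    return [k for k in keys if fnmatch(k, pattern)]


single = match_keys(stored_keys, "/v")       # an exact key selects one dataset
everything = match_keys(stored_keys, "/*")   # the wildcard selects all of them

print(single)
print(everything)
print(len(everything) * ROWS_PER_DATASET)    # expected total row count
```

With five matching datasets of 26 rows each, the expected length is 130, which matches `len(df)` above.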


### Create dask DataFrame from multiple HDF5 files
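The next example extends the previous one: use a wildcard in both the path and the key to read multiple files and multiple keys. The 47 partitions and 1222 rows (47 × 26) shown below again reflect files that each held more than one dataset. Ten freshly generated files with one key apiece, and no other .h5 files in the directory, would give 10 partitions and 260 rows.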

In [20]:
import dask.dataframe as dd
df = dd.read_hdf('*.h5', key='/*')
df

,x,y
npartitions=47,,
,int64,int64
,...,...
...,...,...
,...,...
,...,...


In [21]:
len(df)

1222

In [22]:
df.tail()

,x,y
v,486,705
w,727,6
x,38,278
y,859,212
z,412,905
