# DataFrames: Reading in messy data
     
In the [01-data-access](./01-data-access.ipynb) example we show how Dask Dataframes can read and store data in many of the same formats as Pandas dataframes. One key difference when using Dask Dataframes is that, instead of opening a single file with a function like [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), we typically open many files at once with [dask.dataframe.read_csv](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv). This enables us to treat a collection of files as a single dataset. Most of the time this works really well. But real data is messy, and in this notebook we will explore a more advanced technique for bringing messy datasets into a Dask dataframe.

## Start Dask Client for Dashboard

Starting the Dask Client is optional.  It will provide a dashboard which 
is useful to gain insight on the computation.  

The link to the dashboard will become visible when you create the client below.  We recommend having it open on one side of your screen while using your notebook on the other side.  It can take some effort to arrange your windows, but seeing both at the same time is very useful when learning.

In [1]:
from dask.distributed import Client
client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')
client

Client: Scheduler: inproc://10.1.0.4/8351/1, Dashboard: http://10.1.0.4:8787/status
Cluster: Workers: 1, Cores: 4, Memory: 2.00 GB


## Create artificial dataset

First we create an artificial dataset and write it to many CSV files.

You don't need to understand this section; we're just creating a dataset for the rest of the notebook.

In [2]:
import dask
df = dask.datasets.timeseries()
df

Unnamed: 0_level_0,id,name,x,y
npartitions=30,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01,int64,object,float64,float64
2000-01-02,...,...,...,...
...,...,...,...,...
2000-01-30,...,...,...,...
2000-01-31,...,...,...,...


In [3]:
import os
import datetime

if not os.path.exists('data'):
    os.mkdir('data')

def name(i):
    """ Provide date for filename given index
    
    Examples
    --------
    >>> name(0)
    '2000-01-01'
    >>> name(10)
    '2000-01-11'
    """
    return str(datetime.date(2000, 1, 1) + i * datetime.timedelta(days=1))
    
df.to_csv('data/*.csv', name_function=name, index=False);

## Read CSV files

We now have many CSV files in our data directory, one for each day from 2000-01-01 to 2000-01-30.  Each CSV file holds timeseries data for that day.  We can read all of them as one logical dataframe using the `dd.read_csv` function with a glob string.

In [4]:
!ls data/*.csv | head

data/2000-01-01.csv
data/2000-01-02.csv
data/2000-01-03.csv
data/2000-01-04.csv
data/2000-01-05.csv
data/2000-01-06.csv
data/2000-01-07.csv
data/2000-01-08.csv
data/2000-01-09.csv
data/2000-01-10.csv


In [5]:
import dask.dataframe as dd

df = dd.read_csv('data/2000-*-*.csv')
df

Unnamed: 0_level_0,id,name,x,y
npartitions=30,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
,int64,object,float64,float64
,...,...,...,...
...,...,...,...,...
,...,...,...,...
,...,...,...,...


In [6]:
df.head()

Unnamed: 0,id,name,x,y
0,1031,Norbert,-0.285591,-0.415443
1,1077,Victor,-0.125232,0.414574
2,1018,Tim,-0.530455,0.291435
3,1017,Edith,-0.334858,0.315679
4,1020,Sarah,-0.10648,-0.854046


Let's look at some statistics on the data

In [7]:
df.describe().compute()

Unnamed: 0,id,x,y
count,2592000.0,2592000.0,2592000.0
mean,1000.04,-0.0001865578,-0.0004504002
std,31.6233,0.5770384,0.5772731
min,853.0,-0.999999,-0.9999995
25%,979.0,-0.492414,-0.4945359
50%,1000.0,0.004400842,0.008770296
75%,1022.0,0.5044882,0.5054544
max,1166.0,0.9999993,0.9999999


# Make some messy data

This works great, and in most cases `dd.read_csv`, `dd.read_parquet` and similar functions are the preferred way to read large collections of data files into a Dask dataframe. But real-world data is often very messy, and some files may be broken or badly formatted. To simulate this, we are going to create some fake messy data by tweaking our example CSV files. We will overwrite the file `data/2000-01-05.csv` so that it contains no data, and we will remove the `y` column from the file `data/2000-01-07.csv`.

In [8]:
# corrupt the data in data/2000-01-05.csv
with open('data/2000-01-05.csv', 'w') as f:
    f.write('')

In [9]:
# remove y column from data/2000-01-07.csv
import pandas as pd
df = pd.read_csv('data/2000-01-07.csv')
del df['y']
df.to_csv('data/2000-01-07.csv', index=False)

In [10]:
!head data/2000-01-05.csv

In [11]:
!head data/2000-01-07.csv

id,name,x
967,Laura,0.4341572011626995
1022,George,0.13477427067650938
1037,Oliver,0.4025775422116451
1022,Bob,-0.9912760196055932
1006,Ray,-0.12982165632020234
960,George,-0.8806411585206141
1048,Bob,0.3086024326246655
949,Jerry,-0.4242706465864399
986,Wendy,-0.20820823386390508
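
As a side note, it may help to see how pandas itself reacts to an empty file like the one created above. The sketch below is not part of the original notebook: it writes a throwaway empty file to a temporary directory (the name `empty.csv` is only illustrative) and records the exception that `pd.read_csv` raises.

```python
import os
import tempfile

import pandas as pd

# Create an empty file, mirroring what was done to data/2000-01-05.csv.
with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty.csv")  # illustrative file name
    with open(empty_path, "w") as f:
        f.write("")

    try:
        pd.read_csv(empty_path)
        error_name = None
    except pd.errors.EmptyDataError as exc:
        # pandas cannot parse a file that has neither a header nor rows
        error_name = type(exc).__name__

print(error_name)
```

Knowing which exception to expect makes it easier to decide what a custom reader should catch.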


# Reading the messy data

Let's try reading in the collection of files again

In [12]:
df = dd.read_csv('data/2000-*-*.csv')

In [13]:
df.head()

Unnamed: 0,id,name,x,y
0,1031,Norbert,-0.285591,-0.415443
1,1077,Victor,-0.125232,0.414574
2,1018,Tim,-0.530455,0.291435
3,1017,Edith,-0.334858,0.315679
4,1020,Sarah,-0.10648,-0.854046


OK, this looks like it worked. Let us calculate the dataset statistics again.

In [14]:
df.describe().compute()

Function:  execute_task
args:      ((subgraph_callable, (<function pandas_read_text at 0x7fd286f13d30>, <function _make_parser_function.<locals>.parser_f at 0x7fd28d0693a0>, (<function read_block_from_file at 0x7fd284ef6550>, <OpenFile '/home/runner/work/dask-examples/dask-examples/dataframes/data/2000-01-07.csv'>, 0, 64000000, b'\n'), b'id,name,x,y\n', {}, {'id': dtype('int64'), 'name': dtype('O'), 'x': dtype('float64'), 'y': dtype('float64')}, ['id', 'name', 'x', 'y'], False, False, None), 'read-csv-be1fa3ac9b1647bb775e0bad009b4961'))
kwargs:    {}
Exception: ValueError('Length mismatch: Expected axis has 3 elements, new values have 4 elements')



ValueError: Length mismatch: Expected axis has 3 elements, new values have 4 elements

So what happened? 

When creating a Dask dataframe from a collection of files, `dd.read_csv` infers the column names and datatypes from a sample read at the start of the first file. It does not open the remaining files at this point, so it does not know whether any of them are corrupt. Hence `df.head()` works, since it only looks at the first file. `df.describe().compute()` has to read every file, and it fails as soon as a file does not match the expected structure. The traceback above shows the failure on `data/2000-01-07.csv`, which has three columns where four were expected.
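
Because the problem only surfaces at compute time, one option is to scan the file headers up front. The following is a minimal sketch, not something the notebook itself does: `check_header` and `EXPECTED_COLUMNS` are hypothetical names, and the sample files are written to a temporary directory purely for illustration.

```python
import os
import tempfile

import pandas as pd

EXPECTED_COLUMNS = ["id", "name", "x", "y"]  # assumed schema of a good file


def check_header(path, expected=EXPECTED_COLUMNS):
    """Return None if the header matches, else a short problem description."""
    try:
        # nrows=0 parses only the header line, so the check stays cheap
        columns = list(pd.read_csv(path, nrows=0).columns)
    except pd.errors.EmptyDataError:
        return "empty file"
    if columns != expected:
        return f"unexpected columns: {columns}"
    return None


with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "good.csv": "id,name,x,y\n1,Ann,0.1,0.2\n",
        "empty.csv": "",
        "no_y.csv": "id,name,x\n1,Ann,0.1\n",
    }
    problems = {}
    for fname, text in samples.items():
        path = os.path.join(tmp, fname)
        with open(path, "w") as f:
            f.write(text)
        problems[fname] = check_header(path)

print(problems)
```

A scan like this only reports problems. It does not repair them, so a custom reader is still needed for files that should be kept.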

# Building a delayed reader

To get around this problem we are going to use a more advanced technique to build our Dask dataframe. This method can also be used any time some custom logic is required when reading each file. Essentially, we are going to build a function that uses pandas plus some error checking and returns a pandas dataframe. If we find a bad data file, we will either find a way to fix/clean the data or return an empty pandas dataframe with the same structure as the good data.

In [15]:
import io

import numpy as np
import pandas as pd

def read_data(filename):
    
    # for this to work we need to explicitly set the datatypes of our pandas dataframe 
    dtypes = {'id': int, 'name': str, 'x': float, 'y': float}
    try:
        # try reading in the data with pandas 
        df = pd.read_csv(filename, dtype=dtypes)
    except Exception:
        # if this fails (for example on an empty file), create an empty pandas
        # dataframe with the same columns and dtypes as the good data
        df = pd.read_csv(io.StringIO(''), names=list(dtypes), dtype=dtypes)
    
    # for the case with the missing column, add a column filled with NaN
    if 'y' not in df.columns:
        df['y'] = np.nan
        
    return df

Let's test this function on a good file and the two bad files

In [16]:
# test function on a normal file
read_data('data/2000-01-01.csv').head()

Unnamed: 0,id,name,x,y
0,1031,Norbert,-0.285591,-0.415443
1,1077,Victor,-0.125232,0.414574
2,1018,Tim,-0.530455,0.291435
3,1017,Edith,-0.334858,0.315679
4,1020,Sarah,-0.10648,-0.854046


In [17]:
# test function on the empty file
read_data('data/2000-01-05.csv').head()

Unnamed: 0,id,name,x,y


In [18]:
# test function on the file missing the y column
read_data('data/2000-01-07.csv').head()

Unnamed: 0,id,name,x,y
0,967,Laura,0.434157,
1,1022,George,0.134774,
2,1037,Oliver,0.402578,
3,1022,Bob,-0.991276,
4,1006,Ray,-0.129822,


# Assembling the dask dataframe

First we take our `read_data` function and convert it to a dask delayed function

In [19]:
from dask import delayed
read_data = delayed(read_data)

Let us look at what the function does now

In [20]:
df = read_data('data/2000-01-01.csv')
df

Delayed('read_data-8a0e1fc2-94d9-4d86-81f0-092f89dc83df')

It creates a delayed object. To actually read the file we need to call `.compute()`.

In [21]:
df.compute()

Unnamed: 0,id,name,x,y
0,1031,Norbert,-0.285591,-0.415443
1,1077,Victor,-0.125232,0.414574
2,1018,Tim,-0.530455,0.291435
3,1017,Edith,-0.334858,0.315679
4,1020,Sarah,-0.106480,-0.854046
...,...,...,...,...
86395,984,Patricia,-0.036748,-0.032029
86396,986,Frank,0.784252,0.591889
86397,969,Yvonne,-0.572503,-0.743844
86398,977,Michael,-0.791680,0.052211
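
To see the same lazy behaviour in isolation, here is a small sketch with plain functions instead of a file reader. The function names `double` and `add_one` are made up for illustration; `dask.delayed` can be used either as a wrapper call, as in the notebook, or as a decorator.

```python
from dask import delayed


def double(x):
    return 2 * x


@delayed                        # decorator form, equivalent to delayed(add_one)
def add_one(x):
    return x + 1


lazy_double = delayed(double)   # wrapper form, as used for read_data
task = add_one(lazy_double(20)) # builds a task graph, nothing runs yet
result = task.compute()         # both functions execute now

print(type(task).__name__, result)
```

Delayed objects can be passed to other delayed functions, which is how larger task graphs are assembled.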


Now let's build a list of all the available csv files

In [22]:
# collect all matching files (glob returns them in arbitrary order)
from glob import glob
files = glob('data/2000-*-*.csv')
files

['data/2000-01-29.csv',
 'data/2000-01-11.csv',
 'data/2000-01-23.csv',
 'data/2000-01-28.csv',
 'data/2000-01-18.csv',
 'data/2000-01-26.csv',
 'data/2000-01-03.csv',
 'data/2000-01-14.csv',
 'data/2000-01-08.csv',
 'data/2000-01-15.csv',
 'data/2000-01-27.csv',
 'data/2000-01-09.csv',
 'data/2000-01-16.csv',
 'data/2000-01-21.csv',
 'data/2000-01-10.csv',
 'data/2000-01-07.csv',
 'data/2000-01-05.csv',
 'data/2000-01-01.csv',
 'data/2000-01-02.csv',
 'data/2000-01-20.csv',
 'data/2000-01-12.csv',
 'data/2000-01-13.csv',
 'data/2000-01-22.csv',
 'data/2000-01-06.csv',
 'data/2000-01-19.csv',
 'data/2000-01-24.csv',
 'data/2000-01-25.csv',
 'data/2000-01-17.csv',
 'data/2000-01-30.csv',
 'data/2000-01-04.csv']
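
Note that the list above is not in date order. If the order of the partitions matters, one possible approach (an addition to the notebook, shown here on throwaway files in a temporary directory) is to sort the paths, which works because ISO-formatted dates sort correctly as strings.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as tmp:
    # create a few empty files with date-based names, in scrambled order
    for day in ("2000-01-03", "2000-01-01", "2000-01-02"):
        open(os.path.join(tmp, f"{day}.csv"), "w").close()

    # sorted() gives a deterministic, chronological order for ISO dates
    files = sorted(glob(os.path.join(tmp, "2000-*-*.csv")))
    names = [os.path.basename(f) for f in files]

print(names)
```

The rest of this notebook keeps the unsorted list, which is why later outputs start with rows from a different day.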

Now we run the delayed read_data function on each file in the list

In [23]:
df = [read_data(file) for file in files]
df

[Delayed('read_data-5cd7a8c2-7f8b-4d8f-97db-0df972ef642e'),
 Delayed('read_data-0216fe95-5e08-4d4d-9dcb-ffdf051fc7ec'),
 Delayed('read_data-31f254a5-9b6d-4d81-8996-4871cd05f49a'),
 Delayed('read_data-445e8d49-f0ec-43b4-a538-82e0e845d7fe'),
 Delayed('read_data-ced50558-a8db-47a5-9c3c-992bca0473c5'),
 Delayed('read_data-c44b07d9-de6b-47ee-9fb3-7b40256fde71'),
 Delayed('read_data-e451e918-3b91-4e2d-8f39-e588147cc226'),
 Delayed('read_data-c8a3b987-3ce5-4b27-872b-165c424a8071'),
 Delayed('read_data-a3840716-1d80-4f2a-a959-727f9e3d725d'),
 Delayed('read_data-00e6cf77-daa3-4efa-8c1d-cab761058582'),
 Delayed('read_data-d4607c01-705a-426c-81ed-e2facc9dbe82'),
 Delayed('read_data-b2896e2e-5d61-45a5-a1e1-356d37c13170'),
 Delayed('read_data-03e7bc8d-9486-4a6e-bbb0-293a96fbd2ed'),
 Delayed('read_data-c3c72765-e483-4e61-b23e-19ab676dea88'),
 Delayed('read_data-a9437614-b720-4fe4-80f3-9c4edcc5fe5a'),
 Delayed('read_data-e64c56e3-6ebf-4a81-a2e3-6da1f6357669'),
 Delayed('read_data-132413d6-a225-4e26-a

Then we use [dask.dataframe.from_delayed](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.from_delayed). This function creates a Dask DataFrame from a list of delayed objects as long as each delayed object returns a pandas dataframe. The structure of each individual dataframe returned must also be the same.

In [24]:
df = dd.from_delayed(df, meta={'id': int, 'name': str, 'x': float, 'y': float})
df

Unnamed: 0_level_0,id,name,x,y
npartitions=30,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
,int64,object,float64,float64
,...,...,...,...
...,...,...,...,...
,...,...,...,...
,...,...,...,...


Note: we provided the dtypes in the `meta` keyword to explicitly tell Dask Dataframe what kind of dataframe to expect. If we did not do this, Dask would infer it from the first delayed object, which could be slow if that object reads a large CSV file.
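
Besides a dictionary of dtypes, `meta` can also be given as an empty pandas dataframe with the right columns and dtypes. The sketch below builds such a frame with pandas alone. The dtype strings are an assumption chosen to match the structure shown above, and `sample` is a made-up partition used only to compare column names.

```python
import pandas as pd

# assumed dtypes, matching the structure displayed above
dtypes = {"id": "int64", "name": "object", "x": "float64", "y": "float64"}

# an empty frame that carries only column names and dtypes
meta = pd.DataFrame({col: pd.Series(dtype=dt) for col, dt in dtypes.items()})

# a made-up partition, to check that a reader's output lines up with meta
sample = pd.DataFrame({"id": [1], "name": ["Ann"], "x": [0.1], "y": [0.2]})
same_columns = list(sample.columns) == list(meta.columns)

print(meta.dtypes)
print(same_columns)
```

An empty frame like this could be passed as `meta=meta` to `dd.from_delayed` in place of the dictionary.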

## Now let's see if this works

In [25]:
df.head()

Unnamed: 0,id,name,x,y
0,1032,Patricia,-0.980309,0.269225
1,935,Hannah,-0.846994,-0.182445
2,969,Oliver,0.482768,-0.641575
3,1079,Zelda,0.269655,0.043605
4,1022,Zelda,-0.059803,-0.727712


In [26]:
df.describe().compute()

Unnamed: 0,id,x,y
count,2505600.0,2505600.0,2419200.0
mean,1000.039,-0.0001115737,-0.0003991555
std,31.62804,0.5770438,0.5773141
min,853.0,-0.999999,-0.9999995
25%,979.0,-0.492414,-0.4945359
50%,1000.0,0.004400842,0.008770296
75%,1022.0,0.5044882,0.5054544
max,1157.0,0.9999993,0.9999999
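
The counts above differ from the first `describe()` output, and the differences can be checked with simple arithmetic. This sketch assumes each daily file holds 86,400 rows (one per second), which is consistent with the original count of 2,592,000 rows across 30 files. The small pandas series only illustrates that `count()` ignores missing values.

```python
import numpy as np
import pandas as pd

rows_per_file = 86_400   # assumption: one row per second of a day
total_files = 30

all_rows = total_files * rows_per_file          # before any file was altered
rows_after = (total_files - 1) * rows_per_file  # the empty file adds no rows
y_values = (total_files - 2) * rows_per_file    # one more file has only NaN in y

# count() skips missing values, which is why the y count is lower
nan_demo = pd.Series([1.0, np.nan, 3.0]).count()

print(all_rows, rows_after, y_values, nan_demo)
```

Under that assumption the three totals are 2,592,000, 2,505,600 and 2,419,200, matching the `count` rows reported before and after the files were modified.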


## Success!

To recap, in this example, we looked at an approach to create a Dask Dataframe from a collection of many data files. Typically you would use built-in functions like `dd.read_csv` or `dd.read_parquet` to do this. Sometimes, this is not possible because of messy/corrupted files in your dataset or some custom processing that might need to be done. 

In these cases, you can build a Dask DataFrame with the following steps.

1. Create a regular Python function that reads the data, performs any transformations, error checking etc., and always returns a pandas dataframe with the same structure
2. Convert this read function to a delayed function using `dask.delayed`
3. Call the delayed reader on each file in your dataset and collect the results as a list of delayed objects
4. Use `dd.from_delayed` to convert the list of delayed objects to a Dask Dataframe

This same technique can be used in other situations as well. Another example might be data files that require using a specialized reader, or several transformations before they can be converted to a pandas dataframe.