## Loading your Data from an AzureML Datastore

**Important**: Make sure to execute the steps to start the cluster in the notebook [StartDask.ipynb](StartDask.ipynb) before running this notebook.

In [1]:
from azureml.core import Workspace, Experiment
from azureml.core import VERSION
import time
VERSION

'1.0.41'

### Uploading the data to the AzureML Datastore
AzureML has the concept of a Datastore that can be mounted to a job, so your script does not have to deal with reading from Azure Blob storage. First, let's download some data and upload it to the blob store, so we can work with it in Dask
(parts of this code originate from https://github.com/dask/dask-tutorial).

In [2]:
import os
import tarfile
import urllib.request


cwd = os.getcwd()

data_dir = os.path.abspath(os.path.join(cwd, 'data'))
# create the data directory at its absolute path (no-op if it already exists)
os.makedirs(data_dir, exist_ok=True)

flights_raw = os.path.join(data_dir, 'nycflights.tar.gz')
flightdir = os.path.join(data_dir, 'nycflights')

if not os.path.exists(flights_raw):
    print("- Downloading NYC Flights dataset... ", end='', flush=True)
    url = "https://storage.googleapis.com/dask-tutorial-data/nycflights.tar.gz"
    urllib.request.urlretrieve(url, flights_raw)
    print("done", flush=True)

if not os.path.exists(flightdir):
    print("- Extracting flight data... ", end='', flush=True)
    # extract next to the archive, using the same absolute paths as above
    with tarfile.open(flights_raw, mode='r:gz') as flights:
        flights.extractall(data_dir)
    print("done", flush=True)

    
print("- Uploading flight data... ")
ws = Workspace.from_config()
ds = ws.get_default_datastore()

ds.upload(src_dir=flightdir,
          target_path='nycflights',
          show_progress=True)

print("** Finished! **")

- Uploading flight data... 


Target already exists. Skipping upload for nycflights/1990.csv
Target already exists. Skipping upload for nycflights/1991.csv
Target already exists. Skipping upload for nycflights/1992.csv
Target already exists. Skipping upload for nycflights/1993.csv
Target already exists. Skipping upload for nycflights/1994.csv
Target already exists. Skipping upload for nycflights/1995.csv
Target already exists. Skipping upload for nycflights/1996.csv
Target already exists. Skipping upload for nycflights/1997.csv
Target already exists. Skipping upload for nycflights/1998.csv
Target already exists. Skipping upload for nycflights/1999.csv


** Finished! **
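
Before uploading, it can be useful to confirm that the extraction produced the yearly CSV files you expect. The following is a minimal sketch using only the standard library; `list_csv_files` is a hypothetical helper introduced here for illustration and is not part of the AzureML SDK.

```python
import os


def list_csv_files(directory):
    """Return the sorted names of the CSV files directly inside `directory`."""
    return sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith('.csv')
        and os.path.isfile(os.path.join(directory, name))
    )


if __name__ == '__main__':
    # same location the download cell extracts to
    flightdir = os.path.join(os.getcwd(), 'data', 'nycflights')
    if os.path.isdir(flightdir):
        print(list_csv_files(flightdir))
```

For the dataset used here, the list should match the file names shown in the upload log above.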


### Using the Datastore on the Dask cluster

Now, let's make use of the data on the Dask cluster you created in [StartDask.ipynb](StartDask.ipynb).
You might have noticed that we launched the cluster with a `--data` parameter, which instructed AzureML to mount the workspace's default Datastore onto all the workers of the cluster.

```
est = Estimator('dask', 
                compute_target=dask_cluster, 
                entry_script='startDask.py', 
                conda_dependencies_file_path='environment.yml', 
                script_params=
                    {'--data': ws.get_default_datastore()},
                node_count=10,
                distributed_training=mpi_configuration)
```

When the job is submitted, the local path of the mount on the compute is not yet determined; it is only known once the job starts. We therefore log the path back to the run history, from which we can now retrieve it.
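
The entry script `startDask.py` is not shown in this notebook, so the following is only a sketch of how such a script could pick up the `--data` argument and report it under the metric name `data`. The function names `parse_data_path` and `report_data_path` and the example path are assumptions made for illustration. The `log` callable is a stand-in; in an AzureML job it would typically be `Run.get_context().log`.

```python
import argparse


def parse_data_path(argv):
    """Extract the value of --data, ignoring any other arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', required=True,
                        help='path where the datastore is mounted')
    args, _unknown = parser.parse_known_args(argv)
    return args.data


def report_data_path(argv, log):
    """Parse the mount path and pass it to `log(name, value)`."""
    path = parse_data_path(argv)
    # in a real job: Run.get_context().log('data', path)
    log('data', path)
    return path


# stand-in for the run history: a plain dict (hypothetical example path)
metrics = {}
report_data_path(['--data', '/mnt/example/datastore'], metrics.__setitem__)
print(metrics)
```

Logging under the key `data` is what makes the lookup `get_metrics()['data']` in the next cell work.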

In [None]:
## get the last run on the dask experiment which should be running 
## our dask cluster, and retrieve the data path from it
ws = Workspace.from_config()
exp = ws.experiments['dask']
cluster_run = next(exp.get_runs())

if cluster_run.status != 'Running':
    raise Exception("Cluster should be in state 'Running'")

data_path = cluster_run.get_metrics()['data'] + '/nycflights'
data_path

distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client
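
The cell above only inspects the most recent run of the experiment. If a newer run has failed or been canceled while an older one is still serving the cluster, you may want to scan the runs instead. This is a minimal sketch; `first_running` is a hypothetical helper, and the `SimpleNamespace` objects merely stand in for AzureML run objects with a `status` attribute.

```python
from types import SimpleNamespace


def first_running(runs):
    """Return the first run whose status is 'Running'."""
    for run in runs:
        if run.status == 'Running':
            return run
    raise RuntimeError("No run in state 'Running' found")


# stand-ins for the objects yielded by exp.get_runs()
example_runs = [
    SimpleNamespace(id='run_3', status='Failed'),
    SimpleNamespace(id='run_2', status='Running'),
    SimpleNamespace(id='run_1', status='Completed'),
]
print(first_running(example_runs).id)
```

With the real experiment, the call would be `first_running(exp.get_runs())`.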


In [4]:
# Get the dask cluster
from dask.distributed import Client

c = Client('tcp://localhost:8786')
c

| Client | Cluster |
|---|---|
| Scheduler: tcp://localhost:8786 | Workers: 19 |
| Dashboard: http://localhost:8787/status | Cores: 38 |
| | Memory: 138.86 GB |
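
Connecting to `tcp://localhost:8786` only works if the scheduler port is reachable from the machine running this notebook. The sketch below checks that with a plain TCP connection before creating the `Client`; `scheduler_reachable` is a hypothetical helper written for this example and is not part of Dask.

```python
import socket


def scheduler_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # refused, timed out, or host not resolvable
        return False


if __name__ == '__main__':
    print(scheduler_reachable('localhost', 8786))
```

If this prints `False`, check that the cluster is still running and that the port is forwarded as set up in StartDask.ipynb.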


In [5]:
# create a dask dataframe that loads the data from the path on the cluster
import dask.dataframe as dd
from dask import delayed

def load_data(path):
    df = dd.read_csv(path + '/*.csv',
                     parse_dates={'Date': [0, 1, 2]},
                     dtype={'TailNum': str,
                            'CRSElapsedTime': float,
                            'Cancelled': bool})
    return df
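
The `parse_dates={'Date': [0, 1, 2]}` argument asks the reader to combine the first three columns of each file into a single `Date` column. The sketch below shows the same idea with the standard library only. It assumes those three columns are named `Year`, `Month`, and `DayofMonth`, and the two sample rows are made up for illustration.

```python
import csv
import io
from datetime import date

# tiny made-up sample in the assumed column layout
sample = (
    "Year,Month,DayofMonth,Origin\n"
    "1990,1,1,EWR\n"
    "1990,1,2,EWR\n"
)


def combine_date(row):
    """Build one date from the separate year, month, and day fields."""
    return date(int(row['Year']), int(row['Month']), int(row['DayofMonth']))


rows = list(csv.DictReader(io.StringIO(sample)))
dates = [combine_date(row) for row in rows]
print(dates)
```

As far as we are aware, newer pandas releases (2.2 and later) deprecate the dict form of `parse_dates`, so on a recent environment you may need to build the `Date` column after reading instead.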

In [6]:
# we need to delay the execution of the read to make sure the path is
# evaluated on the cluster, not on the client
df = delayed(load_data)(data_path).compute()

In [29]:
# now run some interactive queries
print(len(df))
df.head()

2611892


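| | Date | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | TailNum | ActualElapsedTime | ... | AirTime | ArrDelay | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | Diverted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1990-01-01 | 1 | 1621.0 | 1540 | 1747.0 | 1701 | US | 33 | | 86.0 | ... | | 46.0 | 41.0 | EWR | PIT | 319.0 | | | False | 0 |
| 1 | 1990-01-02 | 2 | 1547.0 | 1540 | 1700.0 | 1701 | US | 33 | | 73.0 | ... | | -1.0 | 7.0 | EWR | PIT | 319.0 | | | False | 0 |
| 2 | 1990-01-03 | 3 | 1546.0 | 1540 | 1710.0 | 1701 | US | 33 | | 84.0 | ... | | 9.0 | 6.0 | EWR | PIT | 319.0 | | | False | 0 |
| 3 | 1990-01-04 | 4 | 1542.0 | 1540 | 1710.0 | 1701 | US | 33 | | 88.0 | ... | | 9.0 | 2.0 | EWR | PIT | 319.0 | | | False | 0 |
| 4 | 1990-01-05 | 5 | 1549.0 | 1540 | 1706.0 | 1701 | US | 33 | | 77.0 | ... | | 5.0 | 9.0 | EWR | PIT | 319.0 | | | False | 0 |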


In [30]:
df.Origin.unique().compute()

0    EWR
1    LGA
2    JFK
Name: Origin, dtype: object

In [31]:
df.groupby('Origin').Distance.mean().compute()

Origin
EWR     876.278885
JFK    1484.209596
LGA     712.546238
Name: Distance, dtype: float64
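
Conceptually, a grouped mean over a partitioned dataframe can be computed by collecting a sum and a count per group in every partition and only then dividing. The sketch below illustrates that idea in plain Python on made-up toy data; it is not Dask's actual implementation, and the names `partial_sums` and `combine` are hypothetical.

```python
from collections import defaultdict


def partial_sums(rows):
    """Per partition: accumulate [sum, count] of distance for each origin."""
    acc = defaultdict(lambda: [0.0, 0])
    for origin, distance in rows:
        acc[origin][0] += distance
        acc[origin][1] += 1
    return acc


def combine(partials):
    """Merge the per-partition results and divide to get the means."""
    total = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for key, (s, n) in part.items():
            total[key][0] += s
            total[key][1] += n
    return {key: s / n for key, (s, n) in total.items()}


# two toy partitions of (Origin, Distance) pairs, values made up
partitions = [
    [('EWR', 319.0), ('JFK', 2475.0)],
    [('EWR', 200.0), ('LGA', 733.0)],
]
means = combine(partial_sums(p) for p in partitions)
print(means)
```

Only the small per-group sums and counts need to travel between workers, which is why aggregations like this scale well across a cluster.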

In [60]:
df[~df.Cancelled].groupby('Origin').Origin.count().compute()

Origin
EWR    1139451
JFK     427243
LGA     974267
Name: Origin, dtype: int64

In [54]:
dest = df[~df.Cancelled].groupby('Dest').FlightNum.count().compute()
dest.sort_values(ascending=False)

Dest
ORD    219060
BOS    145105
ATL    128855
MIA    111001
LAX    109848
DFW    107606
DCA    106853
MCO    102730
DTW     88471
SFO     72681
PIT     72181
FLL     67950
DEN     62994
CLT     62714
CLE     61078
STL     58681
IAH     57230
PBI     56479
TPA     52879
MSP     52184
SJU     52173
BUF     51590
CMH     41376
CVG     39604
RDU     37855
GSO     32811
ORF     29093
PHX     28525
BWI     26859
ROC     25228
        ...  
AUS       776
HNL       569
SAV       543
OMA       528
EGE       508
BGR       504
MYR       366
LWB       306
ORH       237
ROA       200
CRW       166
TYS       166
HDN       146
PSE       129
ACK       129
BHM       112
ABE       101
MTJ        34
EWR        30
ANC        27
CHO        23
SWF        20
ICT        19
LGA         8
ISP         7
JFK         6
CRP         2
TUS         2
ABQ         1
STX         1
Name: FlightNum, Length: 99, dtype: int64