<center>
    <tr>
    <td><img src="./images/quansight-logo.png" width="25%"></img></td>
    <td><img src="./images/capital-one-logo.png" width="25%"></img></td>
    </tr>
</center>

# Verifying packages in the environment

In [None]:
import dask
import pathlib
import pandas as pd
import dask.dataframe as dd
print(f'Dask version: {dask.__version__}')
print(f'Pandas version: {pd.__version__}')

In [None]:
import dask_ml, dask_glm
print(f'Dask-ML version: {dask_ml.__version__}')
print(f'Dask-GLM version: {dask_glm.__version__}')

# Copying files from the `shared` folder

In each user's home folder (`/home/jovyan/`), there is a symbolic link to a `shared` folder. The `shared` folder contains two subfolders, `capitalone-users` and `admin`. We'll now copy data from `shared/capitalone-users/data` into a local folder called `data`.
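
The `prep.py` script used below is not shown in this notebook, so the following is only a sketch of the "copy if not already present" idea, not the script's actual logic. The helper name `copy_if_missing` and the throwaway directories are assumptions made for illustration.

```python
import pathlib
import shutil
import tempfile


def copy_if_missing(src: pathlib.Path, dst: pathlib.Path) -> bool:
    """Copy the directory tree `src` to `dst` unless `dst` already exists.

    Hypothetical helper for illustration; returns True if a copy was made.
    """
    if dst.exists():
        return False
    shutil.copytree(src, dst)  # creates `dst` and any missing parents
    return True


# Demo with temporary directories standing in for `shared` and `data`.
with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)
    shared = tmp / 'shared' / 'capitalone-users' / 'data'
    shared.mkdir(parents=True)
    (shared / 'example.csv').write_text('a,b\n1,2\n')

    local = tmp / 'data'
    first = copy_if_missing(shared, local)   # copies the tree
    second = copy_if_missing(shared, local)  # no-op: `data` already exists
    copied_names = sorted(p.name for p in local.iterdir())
```

Checking for the destination first makes the step safe to re-run, which matters in a notebook where cells are often executed more than once.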

In [None]:
topdir = pathlib.Path.cwd()
local_data = topdir / 'data'
# 1. Creating local sub-folder data if necessary
if not local_data.exists():
    print(f'Making local directory {local_data}...')
    local_data.mkdir()

In [None]:
# 2. Downloading files from shared into local data if necessary
%run prep.py -d flights
flight_data = local_data / 'nycflights'

# A few sample Dask dataframe computations

Having copied the data locally, let's check that a few Dask computations work as intended.

First, let's load and examine all the CSV files of delay information for 1990s flights to or from New York City airports. There are ten files (one per year), but a single Dask dataframe represents them all.

In [None]:
flight_csvs = list(flight_data.glob('*.csv'))

df = dd.read_csv(flight_csvs,
                 parse_dates={'Date': [0, 1, 2]},
                 dtype={'TailNum': str,
                        'CRSElapsedTime': float,
                        'Cancelled': bool},
                 assume_missing=True)

In [None]:
df.head()

### 1.) How many rows are in our dataset?

If you aren't familiar with Pandas, think about how you would count the records in a list of tuples.
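
As a small plain-Python illustration of that idea (the records below are made up for the example and are not taken from the flight data), the built-in `len` counts the tuples in a list:

```python
# Three invented (origin, destination, cancelled) records.
records = [
    ('JFK', 'LAX', False),
    ('LGA', 'ORD', True),
    ('EWR', 'SFO', False),
]

n_records = len(records)
print(n_records)  # → 3
```

The same built-in works on a Dask dataframe, with one difference worth keeping in mind: calling `len` on a Dask dataframe triggers a computation over all partitions rather than a cheap lookup.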

In [None]:
len(df)

### 2.) In total, how many non-cancelled flights were taken?

With Pandas, you would use [boolean indexing](https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing).
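
A minimal pandas sketch of boolean indexing, using an invented toy frame rather than the flight data: a boolean Series selects the rows where it is `True`, and `~` inverts it.

```python
import pandas as pd

# Toy data invented for this example.
toy = pd.DataFrame({
    'Origin': ['JFK', 'LGA', 'EWR', 'JFK'],
    'Cancelled': [False, True, False, False],
})

# Keep only the rows where Cancelled is False.
kept = toy.loc[~toy['Cancelled']]
n_kept = len(kept)
print(n_kept)  # → 3
```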


In [None]:
non_cancelled = df.loc[~df.Cancelled]
len(non_cancelled)

### 3.) In total, how many non-cancelled flights were taken from each airport?

Hint: use [`df.groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html).
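
To make the hint concrete, here is a small pandas sketch on an invented toy frame (not the flight data): filter first, then group by a column and count the rows in each group.

```python
import pandas as pd

# Toy data invented for this example.
toy = pd.DataFrame({
    'Origin': ['JFK', 'LGA', 'EWR', 'JFK'],
    'Cancelled': [False, True, False, False],
})

counts = (toy.loc[~toy['Cancelled']]  # drop cancelled flights
             .groupby('Origin')       # one group per origin airport
             ['Origin']               # select a single column to count
             .count())                # number of rows per group
print(counts.to_dict())
```

In this toy frame, `LGA` does not appear in the result because its only flight was cancelled and filtered out before grouping.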


In [None]:
result = (non_cancelled
            .groupby('Origin')   # groups rows according to Origin column
            .Origin              # extracts only Origin column
            .count()             # aggregates count
         )

In [None]:
result.compute()