# Australian Rainfall Data Acquisition and EDA

This notebook is a quick run-through of data acquisition and exploratory analysis that evaluates the impact of big data tools: Dask for out-of-core processing, and the Feather/Arrow/Parquet file formats to reduce memory usage and file size.

To run this notebook, use the `environment.yml` file in the root of the project to set up a Conda environment. You'll also need R > 4.0.0 installed, along with the R `arrow` package, which is required for the advanced file formats:

```r
install.packages("arrow")
```

In [1]:
import re
import io
import os
import glob
import gc
import requests
import json
from urllib.request import urlretrieve
import zipfile
import pandas as pd
import numpy as np
import dask.dataframe as dd
# For different file types
import pyarrow.parquet as pq
import rpy2.rinterface
import rpy2_arrow.pyarrow_rarrow as pyra
import pyarrow.feather as feather

# Get helper functions
from scripts.utils import combine_australia_rainfall

%load_ext rpy2.ipython
%load_ext memory_profiler
%load_ext autoreload


# Constants
raw_data_directory_path = os.path.join("..", "data", "raw")
processed_data_directory_path = os.path.join("..", "data", "processed")

# Setup folders if not existing
os.makedirs(processed_data_directory_path, exist_ok=True)



First, download the zipped archive of all rainfall data. Note that this can take ~45 minutes to run:

In [None]:
import sys  # needed by download_progress; not imported in the first cell


def download_progress(block_num, block_size, total_size):
    """Report hook for urlretrieve: print download progress on a single line."""
    progress_size = int(block_num * block_size)
    # total_size can be -1 (or 0) when the server does not report a length
    percent = int(progress_size * 100 / total_size) if total_size > 0 else 0
    sys.stdout.write("\rDownloading... %d%%, %d MB of %d MB" %
                    (percent, progress_size / (1024 * 1024), total_size / (1024 * 1024)))
    sys.stdout.flush()


def download_from_figshare(article_id, files_to_dl, output_dir):
    """Download the named files attached to a figshare article into output_dir."""
    url = f"https://api.figshare.com/v2/articles/{article_id}"
    response = requests.get(url, headers={"Content-Type": "application/json"})
    response.raise_for_status()  # fail early on an HTTP error
    data = response.json()
    files = data["files"]
    for file in files:
        if file["name"] in files_to_dl:
            os.makedirs(output_dir, exist_ok=True)
            urlretrieve(file["download_url"], os.path.join(output_dir, file["name"]), download_progress)


download_from_figshare("14096681", ["data.zip"], raw_data_directory_path)

Next, iterate through the zipped archive, extracting the raw CSV files:

In [2]:
with zipfile.ZipFile(os.path.join(raw_data_directory_path, "data.zip"), "r") as zf:
    file_list = zf.namelist()
    for f in file_list:
        if "__MACOSX" not in f:
            print(f"Processing File: {f}")
            zf.extract(f, path = raw_data_directory_path)


Processing File: MPI-ESM-1-2-HAM_daily_rainfall_NSW.csv
Processing File: AWI-ESM-1-1-LR_daily_rainfall_NSW.csv
Processing File: NorESM2-LM_daily_rainfall_NSW.csv
Processing File: ACCESS-CM2_daily_rainfall_NSW.csv
Processing File: FGOALS-f3-L_daily_rainfall_NSW.csv
Processing File: CMCC-CM2-HR4_daily_rainfall_NSW.csv
Processing File: MRI-ESM2-0_daily_rainfall_NSW.csv
Processing File: GFDL-CM4_daily_rainfall_NSW.csv
Processing File: BCC-CSM2-MR_daily_rainfall_NSW.csv
Processing File: EC-Earth3-Veg-LR_daily_rainfall_NSW.csv
Processing File: CMCC-ESM2_daily_rainfall_NSW.csv
Processing File: NESM3_daily_rainfall_NSW.csv
Processing File: MPI-ESM1-2-LR_daily_rainfall_NSW.csv
Processing File: ACCESS-ESM1-5_daily_rainfall_NSW.csv
Processing File: FGOALS-g3_daily_rainfall_NSW.csv
Processing File: INM-CM4-8_daily_rainfall_NSW.csv
Processing File: MPI-ESM1-2-HR_daily_rainfall_NSW.csv
Processing File: TaiESM1_daily_rainfall_NSW.csv
Processing File: NorESM2-MM_daily_rainfall_NSW.csv
Processing File:

Next we'll evaluate the impact of combining multiple large CSV files using Pandas vs. Dask. We'll check how fast Dask is when creating the task graph as well as materializing to a Pandas dataframe.

In [2]:
%%time
%%memit
df_rainfall_pandas = combine_australia_rainfall(base_folder_path=raw_data_directory_path)

peak memory: 14989.59 MiB, increment: 14733.35 MiB
Wall time: 12min 24s


In [15]:
%%time
%%memit
combine_australia_rainfall(base_folder_path=raw_data_directory_path, method="dask", delay_dask_compute=False)

In [5]:
%%time
%%memit
df_rainfall_pandas.to_csv(os.path.join(processed_data_directory_path, "df_rainfall.csv"), index=False)

peak memory: 8657.47 MiB, increment: 4060.77 MiB
Wall time: 12min 15s


In [4]:
%%time
%%memit
feather.write_feather(df_rainfall_pandas, os.path.join(processed_data_directory_path, "df_rainfall.feather") )

peak memory: 13136.87 MiB, increment: 2609.02 MiB
Wall time: 39 s


In [7]:
%%time
%%memit
df_rainfall_pandas.to_parquet(os.path.join(processed_data_directory_path, "df_rainfall.parquet"),engine='pyarrow', compression='gzip')

peak memory: 11986.00 MiB, increment: 3184.09 MiB
Wall time: 2min 58s


Using Dask appears to be ~30% faster for reading and concatenating multiple large CSV files. Letting Dask parallelize and chunk the CSV reads also kept memory usage lower.
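The `combine_australia_rainfall` helper lives in `scripts/utils.py` and is not shown in this notebook. As a rough illustration of the Pandas path only, the sketch below concatenates per-model CSV files and tags each row with a model name taken from the file name. The function name `combine_csvs_sketch`, the file name pattern, and the synthetic data are assumptions made for this example, not the project's actual implementation.

```python
import glob
import os
import re
import tempfile

import pandas as pd


def combine_csvs_sketch(base_folder_path, pattern="*_daily_rainfall_NSW.csv"):
    """Hypothetical stand-in for the project's helper: concatenate per-model CSVs."""
    frames = []
    for path in sorted(glob.glob(os.path.join(base_folder_path, pattern))):
        # Assumption: the model name is the file name prefix, e.g. "ACCESS-CM2".
        model = re.sub(r"_daily_rainfall_NSW\.csv$", "", os.path.basename(path))
        frames.append(pd.read_csv(path).assign(model=model))
    # Every file is fully loaded before concatenation, so peak memory grows
    # with the total size of the dataset.
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration on synthetic files (values are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("ACCESS-CM2", "NESM3"):
        pd.DataFrame(
            {
                "time": ["1889-01-01 12:00:00", "1889-01-02 12:00:00"],
                "rain (mm/day)": [0.0, 1.5],
            }
        ).to_csv(os.path.join(tmp, f"{name}_daily_rainfall_NSW.csv"), index=False)
    combined = combine_csvs_sketch(tmp)

print(combined["model"].value_counts())
```

Because each file is read eagerly, this approach holds the whole dataset in memory at once, which is the behavior the Dask path avoids by reading in partitions.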


# Reading Large Files

Next, we'll read the combined dataset back in, comparing different file types, chunking options, and column selections.

In [8]:
%%time
%%memit
df_rainfall_pandas = pd.read_parquet(os.path.join(processed_data_directory_path, "df_rainfall.parquet"),engine='pyarrow')

peak memory: 17860.63 MiB, increment: 8927.25 MiB
Wall time: 47.3 s


In [9]:
%%time
%%memit
df_rainfall_pandas = pd.read_parquet(os.path.join(processed_data_directory_path, "df_rainfall.parquet"),columns=["time", "rain (mm/day)", "model"],engine='pyarrow')

peak memory: 14024.64 MiB, increment: 2371.00 MiB
Wall time: 32 s


In [5]:
%%time
%%memit
df_rainfall_pandas = pd.read_feather(os.path.join(processed_data_directory_path, "df_rainfall.feather"))

peak memory: 17895.05 MiB, increment: 7767.96 MiB
Wall time: 23.2 s


In [6]:
%%time
%%memit
df_rainfall_pandas = pd.read_feather(os.path.join(processed_data_directory_path, "df_rainfall.feather"), columns=["time", "rain (mm/day)", "model"])

peak memory: 18995.83 MiB, increment: 1104.16 MiB
Wall time: 22 s


In [4]:
%%time
%%memit
df_rainfall_pandas = pd.read_csv(os.path.join(processed_data_directory_path, "df_rainfall.csv"), index_col=False)

peak memory: 8678.25 MiB, increment: 8382.47 MiB
Wall time: 1min 42s


In [27]:
%%time
%%memit
df_rainfall_pandas = dd.read_csv(os.path.join(processed_data_directory_path, "df_rainfall.csv")).compute()

peak memory: 24421.03 MiB, increment: 5395.96 MiB
Wall time: 1min 57s


In [2]:
# For delayed calculations
df_rainfall_dask_delay = dd.read_csv(os.path.join(processed_data_directory_path, "df_rainfall.csv"))

# EDA

Next, we'll look at the speed of determining counts of values by model from the raw dataframe.


In [9]:
%%time
%%memit
df_rainfall_pandas["model"].value_counts()


peak memory: 10871.04 MiB, increment: 90.81 MiB
Wall time: 15.6 s


In [3]:
%%time
%%memit
model_counts = df_rainfall_dask_delay["model"].value_counts().compute()

peak memory: 1873.88 MiB, increment: 1615.46 MiB
Wall time: 1min 19s


In [8]:
df_rainfall_pandas.shape

(62467843, 3)

In [7]:
df_rainfall_pandas.head()

Unnamed: 0,time,rain (mm/day),model
0,1889-01-01 12:00:00,3.293256e-13,ACCESS-CM2
1,1889-01-02 12:00:00,0.0,ACCESS-CM2
2,1889-01-03 12:00:00,0.0,ACCESS-CM2
3,1889-01-04 12:00:00,0.0,ACCESS-CM2
4,1889-01-05 12:00:00,0.01047658,ACCESS-CM2


We can see that letting Dask optimize the task graph noticeably speeds up counting values by model. Loading the CSV with Pandas and then computing the value counts took ~1 min 57 s (1 min 42 s + ~15 s), while the delayed Dask computation took only 1 min 19 s, a ~30% improvement.
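The comparison above can be reproduced as a short calculation. The timings are the wall times reported in the cell outputs of this notebook; the variable names are only for illustration.

```python
# Wall times reported in the cell outputs above, converted to seconds.
pandas_read_csv_s = 1 * 60 + 42   # pd.read_csv on df_rainfall.csv
pandas_value_counts_s = 15.6      # value_counts on the in-memory dataframe
dask_lazy_total_s = 1 * 60 + 19   # dd.read_csv + value_counts + compute()

pandas_total_s = pandas_read_csv_s + pandas_value_counts_s
improvement = (pandas_total_s - dask_lazy_total_s) / pandas_total_s

print(f"Pandas total: {pandas_total_s:.1f} s")
print(f"Dask total:   {dask_lazy_total_s:.1f} s")
print(f"Relative improvement: {improvement:.0%}")
```

These are single-run timings from one machine, so the result is best read as a rough indication rather than a benchmark.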

# Challenges Encountered

Working with larger datasets on a laptop can be difficult due to memory restrictions when loading a dataset. In the exercises in this lab we ran into difficulties with:

- Downloading large zip archives:
    - Getting the initial data downloaded was slow because everything was in a single zip archive, so we couldn't parallelize the download. If there had been multiple files, we could have used `multiprocessing` to download them much more quickly.
- Combining the dataset / EDA:
    - When combining the files with `Pandas` we loaded the entire dataset into memory, which caused issues for group members with less RAM.
    - When saving the dataset, run times and memory usage were highly dependent on file type. Writing the dataset to CSV had the largest memory increment of the three formats and took about 4x as long as writing gzip-compressed Parquet (12 min 15 s vs. 2 min 58 s) and about 19x as long as writing a Feather file (39 s).
    - Reading a single large file is much quicker with the Feather or Parquet file types. Even parallelizing the CSV read with Dask is slower than reading these more optimized formats.
    - If we knew in advance what calculations we wanted to carry out, we could have delayed execution, used Dask to build the task graph, and then run an optimized `compute()` step at the end. This can be seen above when calculating counts of readings per model using lazy evaluation. If our calculation can be done by chunk, we could also read the CSV using Pandas in chunks and calculate our summary statistic.
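The chunked Pandas approach mentioned in the last point could look roughly like the sketch below. It counts rows per model by reading only the `model` column, a few rows at a time, and summing the per-chunk counts. The helper name `chunked_value_counts` and the in-memory CSV are assumptions for this example; on the real data, `source` would be the path to `df_rainfall.csv` and `chunksize` would be much larger.

```python
import io

import pandas as pd

# Synthetic stand-in for df_rainfall.csv (values are made up for illustration).
csv_text = """time,rain (mm/day),model
1889-01-01 12:00:00,0.0,ACCESS-CM2
1889-01-02 12:00:00,1.2,ACCESS-CM2
1889-01-01 12:00:00,0.4,NESM3
1889-01-02 12:00:00,0.0,NESM3
1889-01-03 12:00:00,2.1,NESM3
"""


def chunked_value_counts(source, column, chunksize):
    """Count values of `column` without holding the whole file in memory."""
    total = pd.Series(dtype="float64")
    # usecols limits parsing to the one column we need; chunksize makes
    # read_csv return an iterator of dataframes instead of a single one.
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunksize):
        counts = chunk[column].value_counts()
        # fill_value=0 handles values that appear in only one of the two sides.
        total = total.add(counts, fill_value=0)
    return total.astype(int).sort_index()


model_counts = chunked_value_counts(io.StringIO(csv_text), "model", chunksize=2)
print(model_counts)
```

This works because counts are additive across chunks. Statistics that are not additive (a median, for example) would need a different strategy.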