## 5. EDA with Python

In [1]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd

In [2]:
%load_ext rpy2.ipython
%load_ext memory_profiler

### Load Data

In [3]:
%%timeit -n 1 -r 1
%%memit

import dask.dataframe as dd

combined_data = dd.read_csv("../data/combined_data.csv/*")
combined_data = combined_data.drop(combined_data.columns[[0, -1]], axis=1)

peak memory: 164.56 MiB, increment: 22.70 MiB
1.03 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [4]:
print(combined_data.columns)

Index(['time', 'lat_min', 'lat_max', 'lon_min', 'lon_max', 'rain (mm/day)',
       'model'],
      dtype='object')


### Method 1: Using Dask only

Compute the mean of each latitude/longitude bound and the mean rainfall across all records

In [23]:
%%timeit -n 1 -r 1
%%memit

# Each statistic below is a column mean; every .compute() call re-reads the CSV partitions.
print("Mean lat_min is %0.4f" %
      combined_data['lat_min'].astype('float64').mean().compute())
print("Mean lat_max is %0.4f" %
      combined_data['lat_max'].astype('float64').mean().compute())
print("Mean lon_min is %0.4f" %
      combined_data['lon_min'].astype('float64').mean().compute())
print("Mean lon_max is %0.4f" %
      combined_data['lon_max'].astype('float64').mean().compute())
print("Mean rainfall is %0.4f" %
      combined_data['rain (mm/day)'].astype('float64').mean().compute())

Mean lat_min is -33.1048
Mean lat_max is -31.9776
Mean lon_min is 146.9059
Mean lon_max is 148.2150
Mean rainfall is 1.9018
peak memory: 1403.47 MiB, increment: 1213.18 MiB
6min 33s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Method 2: Using Dask and change datatypes

#### Default datatypes:

In [13]:
print(combined_data.dtypes)

time             object
lat_min          object
lat_max          object
lon_min          object
lon_max          object
rain (mm/day)    object
model            object
dtype: object
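
An alternative to casting after loading is to declare the dtypes when the files are read. The sketch below uses `pandas.read_csv` on a hypothetical two-row sample (the values and the model name `MODEL-A` are made up); `dask.dataframe.read_csv` accepts the same `dtype` argument. It assumes every value in the numeric columns parses as a number. If the real files contain non-numeric entries, which may be why the columns were read as `object`, the read would raise an error instead.

```python
import io
import pandas as pd

# Hypothetical two-row sample using the column names printed above.
sample_csv = io.StringIO(
    "time,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model\n"
    "1889-01-01 12:00:00,-36.25,-35.0,140.625,142.5,0.5,MODEL-A\n"
    "1889-01-02 12:00:00,-36.25,-35.0,140.625,142.5,0.0,MODEL-A\n"
)

numeric_cols = ["lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)"]
float32_dtypes = {col: "float32" for col in numeric_cols}

# Parse the numeric columns directly as float32 instead of casting later.
sample = pd.read_csv(sample_csv, dtype=float32_dtypes)
print(sample.dtypes)
```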


#### Changing to `float32`:

In [25]:
numeric_cols = ['lat_min', 'lat_max', 'lon_min', 'lon_max', 'rain (mm/day)']

print("Memory usage with 'object' dtype:")
print(combined_data[numeric_cols].memory_usage().compute())

# Cast the numeric columns from object to float32
combined_data[numeric_cols] = combined_data[numeric_cols].astype('float32')

print("\nMemory usage after changing to float32:")
print(combined_data[numeric_cols].memory_usage().compute())

Memory usage with 'object' dtype:
Index                22016
lat_max          500110904
lat_min          500110904
lon_max          500110904
lon_min          500110904
rain (mm/day)    500110904
dtype: int64

Memory usage after changing to float32:
Index                22016
lat_max          250055452
lat_min          250055452
lon_max          250055452
lon_min          250055452
rain (mm/day)    250055452
dtype: int64
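
The byte counts above are consistent with 62,513,863 rows: 8 bytes per row for `object` (500,110,904 / 8) and 4 bytes per row for `float32` (250,055,452 / 4). For `object` columns, `memory_usage()` without `deep=True` counts only the pointer to each Python object, not the object itself. The sketch below illustrates the difference in pandas on a hypothetical toy column of numeric strings, assuming a 64-bit Python build.

```python
import pandas as pd

# Hypothetical toy column: numbers stored as strings in an `object` Series.
n_rows = 1_000
as_object = pd.Series([f"{i * 0.5:.4f}" for i in range(n_rows)], dtype="object")
as_float32 = as_object.astype("float32")

# Shallow count: only the 8-byte pointers to the Python objects.
shallow_object = as_object.memory_usage(index=False)
# Deep count: the pointers plus the Python string objects they refer to.
deep_object = as_object.memory_usage(index=False, deep=True)
# float32 stores 4 bytes per value directly in the array.
float32_bytes = as_float32.memory_usage(index=False)

print(f"object (shallow): {shallow_object} bytes")
print(f"object (deep):    {deep_object} bytes")
print(f"float32:          {float32_bytes} bytes")
```

Passing `deep=True` to the `memory_usage()` calls in the cell above would give a closer estimate of what the `object` columns really occupy, at the cost of a slower computation.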


#### Perform the same EDA with `float32`

In [30]:
%%timeit -n 1 -r 1
%%memit

# Each statistic below is a column mean; the columns are already cast to float32.
print("Mean lat_min is %0.4f" %
      combined_data['lat_min'].mean().compute())
print("Mean lat_max is %0.4f" %
      combined_data['lat_max'].mean().compute())
print("Mean lon_min is %0.4f" %
      combined_data['lon_min'].mean().compute())
print("Mean lon_max is %0.4f" %
      combined_data['lon_max'].mean().compute())
print("Mean rainfall is %0.4f" %
      combined_data['rain (mm/day)'].mean().compute())

Mean lat_min is -33.1048
Mean lat_max is -31.9776
Mean lon_min is 146.9059
Mean lon_max is 148.2150
Mean rainfall is 1.9018
peak memory: 1522.01 MiB, increment: 1335.64 MiB
8min 16s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Observation
- Because `combined_data` was saved by Dask as a set of partition files, we also loaded it with Dask.
- With Dask alone, the EDA peaked at about 1,403 MiB of memory, which is modest for a dataset of this size because Dask processes the data partition by partition and in parallel. Computing the statistics for all 5 columns took about 6.5 minutes.
- Dask initially read every column as `object`. After casting the numeric columns to `float32`, the reported memory usage fell from about 500 MB to about 250 MB per column. Note that `memory_usage()` without `deep=True` counts only the 8-byte pointer for each `object` value, so the real saving is likely larger.
- Repeating the EDA on the `float32` columns gave a similar peak of about 1,522 MiB, and the run took longer (8 min 16 s versus 6 min 33 s). `%memit` reports the peak memory of the process while the cell runs, not the size of the stored data. Because Dask evaluates lazily, each `.compute()` still re-reads the CSV partitions as `object` columns before casting them, so the smaller dtype does little to lower the peak. The small increases in memory and time may be run-to-run variation, for example from other processes running at the same time.