# Imports

In [1]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd

# Dask
import dask.dataframe as dd

# pyarrow and feather
import pyarrow.feather as feather
import pyarrow.dataset as ds
import pyarrow as pa
import pyarrow.parquet as pq
import rpy2_arrow.pyarrow_rarrow as pyra

In [2]:
%load_ext rpy2.ipython
%load_ext memory_profiler

# 1. Teamwork Contract
The teamwork contract for our team, group 7, can be found [here](https://docs.google.com/document/d/1u4e5Z5C-uwTTSvCEyOYy-I30Fb8OEPYM6frM0NBEVVc/edit).

# 2. Create repository and project structure
The repository URL: https://github.com/UBC-MDS/DSCI525-Group7

# 3. Downloading the data

Using Python **requests** Library

We are using article id #14096681, which contains the data of **Daily rainfall over NSW, Australia.**

In [3]:
# Setup
article_id = 14096681  
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "rainfall/"

Review the files within the article:

In [4]:
response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)  # this contains all of the article's data, feel free to check it out
files = data["files"]             # this is just the data about the files, which is what we want
files

[{'id': 26579150,
  'name': 'daily_rainfall_2014.png',
  'size': 58863,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26579150',
  'supplied_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'computed_md5': 'fd32a2ffde300a31f8d63b1825d47e5e'},
 {'id': 26579171,
  'name': 'environment.yml',
  'size': 192,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26579171',
  'supplied_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'computed_md5': '060b2020017eed93a1ee7dd8c65b2f34'},
 {'id': 26586554,
  'name': 'README.md',
  'size': 5422,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26586554',
  'supplied_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'computed_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c'},
 {'id': 26766812,
  'name': 'data.zip',
  'size': 814041183,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26766812',
  'supplied_md5': 'b517383f76e77bd03755a63a8f

# 3.1 Unzipping the data

In [5]:
%%time

files_to_dl = ["data.zip"]  
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])

CPU times: user 3.64 s, sys: 3.42 s, total: 7.07 s
Wall time: 1min 8s
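Before unzipping, the download can be checked against the `supplied_md5` value that figshare reports in the file listing. The sketch below is one way to do that with the standard library; `md5_of_file` and `matches_supplied_md5` are hypothetical helper names and not part of the original notebook.

```python
import hashlib


def md5_of_file(path, block_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_supplied_md5(path, file_entry):
    """Compare a local file with the 'supplied_md5' field of one file entry."""
    return md5_of_file(path) == file_entry["supplied_md5"]
```

Inside the download loop this could be called as `matches_supplied_md5(output_directory + file["name"], file)`. Reading in blocks keeps memory use flat even for the roughly 814 MB archive.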


In [6]:
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)
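The cell above extracts every member of the archive. As an alternative sketch, assuming we only want the model CSVs at this stage, the archive can be inspected first and only the members ending in `NSW.csv` extracted. `extract_matching` is a hypothetical helper, and the suffix is an assumption based on the file names in this dataset.

```python
import zipfile


def extract_matching(zip_path, dest_dir, suffix="NSW.csv"):
    """Extract only archive members whose name ends with `suffix`; return their names."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        wanted = [name for name in zf.namelist() if name.endswith(suffix)]
        # extractall accepts a subset of members to extract.
        zf.extractall(dest_dir, members=wanted)
    return wanted
```

Note that this would skip observed_daily_rainfall_SYD.csv, so it only fits if that file is extracted separately when it is needed.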

In [7]:
%ls -ltr rainfall/

total 12294480
-rw-r--r--   1 Rada  staff  814041183 30 Mar 10:08 data.zip
-rw-r--r--   1 Rada  staff   95376895 30 Mar 10:08 MPI-ESM-1-2-HAM_daily_rainfall_NSW.csv
-rw-r--r--   1 Rada  staff   94960113 30 Mar 10:08 AWI-ESM-1-1-LR_daily_rainfall_NSW.csv
-rw-r--r--   1 Rada  staff   82474546 30 Mar 10:08 NorESM2-LM_daily_rainfall_NSW.csv
-rw-r--r--   1 Rada  staff  127613760 30 Mar 10:08 ACCESS-CM2_daily_rainfall_NSW.csv
-rw-r--r--   1 Rada  staff  232118894 30 Mar 10:09 FGOALS-f3-L_daily_rainfall_NSW.csv
-rw-r--r--   1 Rada  staff  330360682 30 Mar 10:09 CMCC-CM2-HR4_daily_rainfall_NSW.csv
-rw-r--r--   1 Rada  staff  254009247 30 Mar 10:09 MRI-ESM2-0_daily_rainfall_NSW.csv
-rw-r--r--   1 Rada  staff  235661418 30 Mar 10:09 GFDL-CM4_daily_rainfall_NSW.csv
-rw-r--r--   1 Rada  staff  294260911 30 Mar 10:09 BCC-CSM2-MR_daily_rainfall_NSW.csv
-rw-r--r--   1 Rada  staff  295768615 30 Mar 10:09 EC-Earth3-Veg-LR_daily_rainfall_NSW.csv
-rw-r--r--   1 Rada  staff  328852379 30 Mar 10:09 CMCC-ES

# Comparison of Performance on Different Machines

The summary of all team members' time taken to unzip the data is recorded below. Each team member's Operating System, RAM, Processor and SSD are also recorded to check if they have any effect on the time taken.

| Team Member | Operating System     | RAM   | Processor                                         | Is SSD | Time Taken |
| ----------- | -------------------- | ----- | ------------------------------------------------- | ------ | ---------- |
| Jessie      | Windows 10 Education | 16GB  | Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99 GHz | Yes    | CPU times: total: 8.48 s <br> Wall time: 1min 35s |
| Adrianne    | Windows 10 Pro       | 16GB  | Intel(R) Core(TM) i7-1165G7 @ 2.80GHz 2.80 GHz    | Yes    | CPU times: total: 6.23 s <br> Wall time: 1min 7s |
| Rada        |                      |       |                                                   |        |            |
| Moid        |                      |       |                                                   |        |            |
>**Discussion of results above**

# 4. Combining data CSVs

- Combine data CSVs into a single CSV using pandas.

- When combining the CSV files, add an extra column called "model" that identifies the model.
  - Tip 1: You can populate this column from the file name, e.g. for the file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON.
  - Tip 2: Remember how we added year when we combined airline CSVs.
  - Tip 3: You can use a regex generator.

_Note: There is a file called observed_daily_rainfall_SYD.csv in the data folder that you downloaded. Make sure you exclude this file (programmatically or just take out that file from folder) before you combine CSVs. We will use this file in our next milestone._

- Compare run times on different machines within your team and summarize your observations.
Warning: Some of you might not be able to do it on your laptop. It's fine if you're unable to do it. Just make sure you discuss the reasons why you might not have been able to run this on your laptop.

Let's first view the data and the columns:

In [8]:
%%time

df_1 = pd.read_csv(output_directory+"/MPI-ESM-1-2-HAM_daily_rainfall_NSW.csv")
df_2 = pd.read_csv(output_directory+"/CMCC-CM2-SR5_daily_rainfall_NSW.csv")
df_3 = pd.read_csv(output_directory+"/SAM0-UNICON_daily_rainfall_NSW.csv")

CPU times: user 7.21 s, sys: 1 s, total: 8.21 s
Wall time: 8.46 s


Even loading three of the individual files is taking a little time.

In [9]:
df_1.head(2)

Unnamed: 0,time,lat_min,lat_max,lon_min,lon_max,rain (mm/day)
0,1889-01-01 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.244226e-13
1,1889-01-02 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.217326e-13


In [10]:
df_2.head(2)

Unnamed: 0,time,lat_min,lat_max,lon_min,lon_max,rain (mm/day)
0,1889-01-01 12:00:00,-35.811518,-34.86911,140.625,141.875,0.000424
1,1889-01-02 12:00:00,-35.811518,-34.86911,140.625,141.875,0.006158


In [11]:
df_3.head(2)

Unnamed: 0,time,lat_min,lat_max,lon_min,lon_max,rain (mm/day)
0,1889-01-01 12:00:00,-35.811518,-34.86911,140.625,141.875,3.04565e-13
1,1889-01-02 12:00:00,-35.811518,-34.86911,140.625,141.875,0.0003572392


In [12]:
df_3.tail(2)

Unnamed: 0,time,lat_min,lat_max,lon_min,lon_max,rain (mm/day)
3541151,2014-12-30 12:00:00,-30.157068,-29.21466,153.125,154.375,8.541592
3541152,2014-12-31 12:00:00,-30.157068,-29.21466,153.125,154.375,68.117489


In [13]:
%%time

files = glob.glob('rainfall/*NSW.csv')
df = pd.concat((pd.read_csv(file, index_col=0)
                .assign(model=re.findall(r'/([^_]*)', file)[0])
                for file in files)
              )
df.to_csv("rainfall/combined_data.csv")

CPU times: user 7min 21s, sys: 21 s, total: 7min 42s
Wall time: 7min 50s


In [14]:
# For Windows users
##  On Windows, glob returns paths with backslashes, so the regex in the cell above finds no match and raises an IndexError. The cell below works around this by globbing './rainfall/...' and taking the model name from the file name.

In [15]:
%%time
%memit
files = glob.glob('./rainfall/*NSW.csv')
df = pd.concat((pd.read_csv(file, index_col=0)
                .assign(model=os.path.basename(file).split('_')[0])  # basename drops the directory on any OS; str.strip removes a set of characters, not a prefix
                for file in files)
              )
df.to_csv("rainfall/combined_data.csv")

peak memory: 3939.19 MiB, increment: 0.10 MiB
CPU times: user 7min 14s, sys: 21.5 s, total: 7min 36s
Wall time: 7min 47s


Wow, this felt like an eternity!
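Part of the cost above comes from holding every file in memory at once before writing. A minimal sketch of an alternative, assuming the output file does not itself match the glob pattern, is to append each CSV to the combined file as soon as it is read. `model_from_path` and `combine_csvs` are hypothetical helper names; `ntpath.basename` is used because it splits on both forward slashes and backslashes, so the same code should work on Windows and macOS.

```python
import glob
import ntpath

import pandas as pd


def model_from_path(path):
    """Model name is the file name up to the first underscore."""
    return ntpath.basename(path).split("_")[0]


def combine_csvs(pattern, out_path):
    """Append each matching CSV to out_path one file at a time, adding a 'model' column."""
    first = True
    for path in sorted(glob.glob(pattern)):
        part = pd.read_csv(path, index_col=0).assign(model=model_from_path(path))
        # Write the header only for the first file, then append without it.
        part.to_csv(out_path, mode="w" if first else "a", header=first)
        first = False
    return out_path
```

With this approach only one source file is in memory at a time, at the price of not having the combined dataframe available afterwards. We have not timed it on the rainfall data, so it is a suggestion rather than a measured improvement.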

Let's take a look at the combined file, see if head and tail are as we expect them to be:

In [16]:
df.head()

Unnamed: 0_level_0,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1889-01-01 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.244226e-13,MPI-ESM-1-2-HAM
1889-01-02 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.217326e-13,MPI-ESM-1-2-HAM
1889-01-03 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.498125e-13,MPI-ESM-1-2-HAM
1889-01-04 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.251282e-13,MPI-ESM-1-2-HAM
1889-01-05 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.270161e-13,MPI-ESM-1-2-HAM


In [17]:
df.tail()

Unnamed: 0_level_0,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2014-12-27 12:00:00,-30.157068,-29.21466,153.125,154.375,6.689683,SAM0-UNICON
2014-12-28 12:00:00,-30.157068,-29.21466,153.125,154.375,7.862555,SAM0-UNICON
2014-12-29 12:00:00,-30.157068,-29.21466,153.125,154.375,10.005026,SAM0-UNICON
2014-12-30 12:00:00,-30.157068,-29.21466,153.125,154.375,8.541592,SAM0-UNICON
2014-12-31 12:00:00,-30.157068,-29.21466,153.125,154.375,68.117489,SAM0-UNICON


## Comparison of Performance on Different Machines

The summary of all team members' time taken to combine the CSV files is recorded below. Each team member's Operating System, RAM, Processor and SSD are also recorded to check if they have any effect on the time taken.

| Team Member | Operating System     | RAM   | Processor                                         | Is SSD | Time Taken |
| ----------- | -------------------- | ----- | ------------------------------------------------- | ------ | ---------- |
| Jessie      | Windows 10 Education | 16GB  | Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99 GHz | Yes    | peak memory: 800.19 MiB, increment: 0.00 MiB <br> CPU times: total: 8min 33s <br> Wall time: 8min 40s |
| Adrianne    | Windows 10 Pro       | 16GB  | Intel(R) Core(TM) i7-1165G7 @ 2.80GHz 2.80 GHz    | Yes    | peak memory: 120.25 MiB, increment: 0.30 MiB <br> CPU times: total: 7min 26s <br> Wall time: 7min 31s |
| Rada        | Macbook Pro 2013 15" | 16GB  | 2.3 GHz Intel Core i7                             | No     | peak memory: 3939.19 MiB, increment: 0.10 MiB <br> CPU times: user 7min 14s, sys: 21.5 s, total: 7min 36s <br> Wall time: 7min 47s |
| Moid        |                      |       |                                                   |        |            |

> **Discussion of results above** 

# 5. Load the combined CSV to memory and perform a simple EDA

1. Investigate at least two of the following approaches to reduce memory usage while performing the EDA (e.g., value_counts).

- Changing dtype of your data
- Loading just the columns we want
- Loading in chunks
- Dask

2. Compare run times on different machines within your team and summarize your observations.

**The EDA will be to use value_counts() to count the number of data points that came from each .csv file, as recorded in the model column of combined_data.csv.**

### 5.1 Load the Entire Dataframe to Memory Using Pandas (Baseline for Comparison)

In [18]:
%%time
%%memit

df_pandas = pd.read_csv("rainfall/combined_data.csv")
print(df_pandas["model"].value_counts())

MPI-ESM1-2-HR       5154240
CMCC-CM2-HR4        3541230
CMCC-ESM2           3541230
CMCC-CM2-SR5        3541230
NorESM2-MM          3541230
TaiESM1             3541230
SAM0-UNICON         3541153
GFDL-ESM4           3219300
FGOALS-f3-L         3219300
GFDL-CM4            3219300
MRI-ESM2-0          3037320
EC-Earth3-Veg-LR    3037320
BCC-CSM2-MR         3035340
MIROC6              2070900
ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
INM-CM4-8           1609650
INM-CM5-0           1609650
FGOALS-g3           1287720
KIOST-ESM           1287720
AWI-ESM-1-1-LR       966420
MPI-ESM1-2-LR        966420
NESM3                966420
MPI-ESM-1-2-HAM      966420
NorESM2-LM           919800
BCC-ESM1             551880
CanESM5              551880
Name: model, dtype: int64
peak memory: 7451.22 MiB, increment: 3863.56 MiB
CPU times: user 1min 6s, sys: 18.9 s, total: 1min 25s
Wall time: 1min 36s


**Observations**
>Our baseline approach is to use Pandas to load the entire dataset into memory. The code above loads combined_data.csv and performs a simple EDA by counting the values in the "model" column. Peak memory is 7451.22 MiB, total CPU time is 1min 25s, and wall time is 1min 36s. We will explore some other approaches to see if we can reduce the time and memory usage.

### 5.2 Python EDA

We will perform an EDA on value counts and compare the execution time and memory required on 4 approaches:
- Change data type to float32
- Dask
- Load data in chunks of 10 million and 1 million rows
- Select only columns of interest


### 5.2.1 Changing dtypes of data:

- We will attempt to change time column from datetime to date
- We will attempt to read the numerical columns using float32 format

Memory comparison for format changes adapted from Lecture notes:

In [19]:
df.index = pd.to_datetime(df.index).date  # a DatetimeIndex exposes .date directly; the .dt accessor exists only on Series

In [20]:
print(f"Memory usage with float64: {df[['lat_min','lat_max', 'lon_min', 'lon_max', 'rain (mm/day)']].memory_usage().sum() / 1e6:.2f} MB")
print(f"Memory usage with float32: {df[['lat_min','lat_max', 'lon_min', 'lon_max', 'rain (mm/day)']].astype('float32', errors='ignore').memory_usage().sum() / 1e6:.2f} MB")

Memory usage with float64: 2998.46 MB
Memory usage with float32: 1749.10 MB


In [21]:
df_float32 = df.copy()
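float_cols = ['lat_min','lat_max','lon_min', 'lon_max', 'rain (mm/day)']
# astype returns a new object, so the result has to be assigned back
df_float32[float_cols] = df_float32[float_cols].astype('float32')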

#saving the dataframe of float32 to file
df_float32.to_csv("rainfall/combined_data_float32.csv")

In [22]:
%%time
%%memit

#loading the float32 dataframe to memory and perform a simple EDA for value counts of model column
df_float32 = pd.read_csv("rainfall/combined_data_float32.csv")
print(df_float32["model"].value_counts())

MPI-ESM1-2-HR       5154240
CMCC-CM2-HR4        3541230
CMCC-ESM2           3541230
CMCC-CM2-SR5        3541230
NorESM2-MM          3541230
TaiESM1             3541230
SAM0-UNICON         3541153
GFDL-ESM4           3219300
FGOALS-f3-L         3219300
GFDL-CM4            3219300
MRI-ESM2-0          3037320
EC-Earth3-Veg-LR    3037320
BCC-CSM2-MR         3035340
MIROC6              2070900
ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
INM-CM4-8           1609650
INM-CM5-0           1609650
FGOALS-g3           1287720
KIOST-ESM           1287720
AWI-ESM-1-1-LR       966420
MPI-ESM1-2-LR        966420
NESM3                966420
MPI-ESM-1-2-HAM      966420
NorESM2-LM           919800
BCC-ESM1             551880
CanESM5              551880
Name: model, dtype: int64
peak memory: 4715.13 MiB, increment: 1308.29 MiB
CPU times: user 1min 9s, sys: 24.7 s, total: 1min 34s
Wall time: 1min 43s


**Observations:**
> When we changed the data type of the five numeric columns from float64 to float32, their in-memory size fell from 2998.46 MB to 1749.10 MB. This is because float32 stores each value in 32 bits while float64 uses 64 bits; the saving is a little under half because the index is unchanged. In the EDA run, peak memory decreased from 7451.22 MiB to 4715.13 MiB and the increment from 3863.56 MiB to 1308.29 MiB, while the CPU and wall times stayed about the same as the baseline (1min 43s versus 1min 36s wall time). Changing the dtype is therefore mainly effective in reducing the memory required, and should be used when we have a large amount of data that does not require very high precision.
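One caveat worth illustrating: CSV is a text format and does not store dtypes, so `pd.read_csv` infers float64 for numeric columns unless a `dtype` is passed at read time. The sketch below uses a tiny in-memory CSV (made-up rows in the same layout as our data, with fewer columns) to show the difference; the column subset and values are assumptions for illustration only.

```python
import io

import pandas as pd

# Tiny stand-in for combined_data.csv; the rows are illustrative, not real results.
csv_text = (
    "time,lat_min,lat_max,rain (mm/day),model\n"
    "1889-01-01 12:00:00,-35.439867,-33.574619,0.25,MPI-ESM-1-2-HAM\n"
    "1889-01-02 12:00:00,-35.439867,-33.574619,0.50,MPI-ESM-1-2-HAM\n"
    "1889-01-03 12:00:00,-30.157068,-29.21466,8.50,SAM0-UNICON\n"
)
float_cols = ["lat_min", "lat_max", "rain (mm/day)"]

# Default read: numeric columns come back as float64.
default = pd.read_csv(io.StringIO(csv_text))

# Explicit dtype mapping: the same columns are parsed straight into float32.
compact = pd.read_csv(io.StringIO(csv_text), dtype={col: "float32" for col in float_cols})

print(default[float_cols].dtypes)
print(compact[float_cols].dtypes)
print(default[float_cols].memory_usage(index=False).sum(),
      compact[float_cols].memory_usage(index=False).sum())
```

Applied to the real file, the same `dtype` mapping would be passed to `pd.read_csv("rainfall/combined_data.csv", ...)` so that the saving applies during the load itself.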


### 5.2.2 Dask:

- We will attempt to read dataframe using dask

In [23]:
%%time
%%memit
# Dask
df_dask = dd.read_csv('rainfall/combined_data.csv')
print(df_dask["model"].value_counts().compute())

MPI-ESM1-2-HR       5154240
TaiESM1             3541230
NorESM2-MM          3541230
CMCC-CM2-HR4        3541230
CMCC-CM2-SR5        3541230
CMCC-ESM2           3541230
SAM0-UNICON         3541153
FGOALS-f3-L         3219300
GFDL-CM4            3219300
GFDL-ESM4           3219300
EC-Earth3-Veg-LR    3037320
MRI-ESM2-0          3037320
BCC-CSM2-MR         3035340
MIROC6              2070900
ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
INM-CM5-0           1609650
INM-CM4-8           1609650
KIOST-ESM           1287720
FGOALS-g3           1287720
MPI-ESM1-2-LR        966420
NESM3                966420
AWI-ESM-1-1-LR       966420
MPI-ESM-1-2-HAM      966420
NorESM2-LM           919800
BCC-ESM1             551880
CanESM5              551880
Name: model, dtype: int64
peak memory: 3904.75 MiB, increment: 1306.16 MiB
CPU times: user 49.2 s, sys: 13.4 s, total: 1min 2s
Wall time: 23.2 s


**Observations:**
> Using a Dask dataframe is much faster and lighter on memory. Compared with the Pandas baseline, peak memory fell from 7451.22 MiB to 3904.75 MiB, the increment from 3863.56 MiB to 1306.16 MiB, and wall time from 1min 36s to 23.2 s when calling the value_counts() function. This is likely because Dask splits the file into partitions and processes them in parallel, which would also explain why the wall time (23.2 s) is well below the total CPU time (1min 2s). Thus, for large-scale data calculation, we could use Dask instead of Pandas to improve the code efficiency with minimal syntax change.

### 5.2.3 Loading in Chunks:

- We will attempt to read dataframe in chunks

#### Chunksize = 10 million:

In [24]:
%%time
%%memit
counts = pd.Series(dtype=int)
for chunk in pd.read_csv("rainfall/combined_data.csv", chunksize=10_000_000):
    counts = counts.add(chunk["model"].value_counts(), fill_value=0)
print(counts.astype(int))

ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
AWI-ESM-1-1-LR       966420
BCC-CSM2-MR         3035340
BCC-ESM1             551880
CMCC-CM2-HR4        3541230
CMCC-CM2-SR5        3541230
CMCC-ESM2           3541230
CanESM5              551880
EC-Earth3-Veg-LR    3037320
FGOALS-f3-L         3219300
FGOALS-g3           1287720
GFDL-CM4            3219300
GFDL-ESM4           3219300
INM-CM4-8           1609650
INM-CM5-0           1609650
KIOST-ESM           1287720
MIROC6              2070900
MPI-ESM-1-2-HAM      966420
MPI-ESM1-2-HR       5154240
MPI-ESM1-2-LR        966420
MRI-ESM2-0          3037320
NESM3                966420
NorESM2-LM           919800
NorESM2-MM          3541230
SAM0-UNICON         3541153
TaiESM1             3541230
dtype: int64
peak memory: 2729.70 MiB, increment: 1497.36 MiB
CPU times: user 1min 6s, sys: 11.6 s, total: 1min 17s
Wall time: 1min 20s


#### Chunksize = 1 million:

In [25]:
%%time
%%memit
counts = pd.Series(dtype=int)
for chunk in pd.read_csv("rainfall/combined_data.csv", chunksize=1_000_000):
    counts = counts.add(chunk["model"].value_counts(), fill_value=0)
print(counts.astype(int))

ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
AWI-ESM-1-1-LR       966420
BCC-CSM2-MR         3035340
BCC-ESM1             551880
CMCC-CM2-HR4        3541230
CMCC-CM2-SR5        3541230
CMCC-ESM2           3541230
CanESM5              551880
EC-Earth3-Veg-LR    3037320
FGOALS-f3-L         3219300
FGOALS-g3           1287720
GFDL-CM4            3219300
GFDL-ESM4           3219300
INM-CM4-8           1609650
INM-CM5-0           1609650
KIOST-ESM           1287720
MIROC6              2070900
MPI-ESM-1-2-HAM      966420
MPI-ESM1-2-HR       5154240
MPI-ESM1-2-LR        966420
MRI-ESM2-0          3037320
NESM3                966420
NorESM2-LM           919800
NorESM2-MM          3541230
SAM0-UNICON         3541153
TaiESM1             3541230
dtype: int64
peak memory: 1303.84 MiB, increment: 158.27 MiB
CPU times: user 1min 4s, sys: 9.08 s, total: 1min 13s
Wall time: 1min 15s


**Observations:**
> When loading the data in chunks, the peak memory is significantly lower than when loading everything at once. Loading 10 million rows per chunk peaks at 2729.70 MiB, well below the 7451.22 MiB of the Pandas baseline, and loading 1 million rows per chunk peaks at only 1303.84 MiB. With 1 million rows per chunk, the memory increment (158.27 MiB) is more than 20 times smaller than the baseline's 3863.56 MiB. However, the CPU and wall times are only slightly lower than the baseline (1min 20s and 1min 15s versus 1min 36s wall time), likely because every row of the file still has to be parsed.
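The cells above accumulate counts with `Series.add(..., fill_value=0)`, which produces floats and is why the result is cast back with `.astype(int)`. As a sketch of an alternative that keeps integer counts throughout, a `collections.Counter` can be updated per chunk. `chunked_value_counts` is a hypothetical helper name.

```python
import io
from collections import Counter

import pandas as pd


def chunked_value_counts(source, column, chunksize):
    """Count values of `column` chunk by chunk, holding one chunk in memory at a time."""
    totals = Counter()
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Counter.update adds the per-chunk counts to the running totals.
        totals.update(chunk[column].value_counts().to_dict())
    return pd.Series(totals, dtype="int64").sort_values(ascending=False)


# Small in-memory example with made-up rows.
toy_csv = "model,rain\n" + "A,1\n" * 5 + "B,2\n" * 3 + "A,0\n" * 2
print(chunked_value_counts(io.StringIO(toy_csv), "model", chunksize=4))
```

For the rainfall data the call would be `chunked_value_counts("rainfall/combined_data.csv", "model", 1_000_000)`.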


### 5.2.4 Selecting columns:

Since we only want the model for EDA, we will import just the model column. This is faster and uses less memory than loading the whole dataframe.


In [26]:
%%time
%%memit
df = pd.read_csv("rainfall/combined_data.csv", 
                 usecols = ["model"])

peak memory: 1962.19 MiB, increment: 951.45 MiB
CPU times: user 26.1 s, sys: 1.85 s, total: 28 s
Wall time: 29.5 s


In [27]:
%%time
%%memit
df["model"].value_counts()

peak memory: 1484.82 MiB, increment: 0.21 MiB
CPU times: user 3.55 s, sys: 64.7 ms, total: 3.61 s
Wall time: 4.17 s


**Observations:**
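>Loading only the model column took 29.5 s and peaked at 1962.19 MiB, compared with 1min 36s and 7451.22 MiB for the baseline, which loads every column and also runs value_counts(). The value_counts() call itself took 4.17 s; it still has to iterate through the same number of rows, so the saving comes from reading and storing less data. This should be done whenever possible because it reduces the memory required and speeds up loading the data.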
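Column selection can be combined with a dtype change. Since the model column holds only a few distinct strings, reading it as a `category` should shrink it further. The sketch below checks this on synthetic rows; the three model names and the row count are assumptions for illustration.

```python
import io

import pandas as pd

# Synthetic CSV: 3000 rows cycling through three model names.
names = ["ACCESS-CM2", "SAM0-UNICON", "TaiESM1"] * 1000
csv_text = "lat_min,rain (mm/day),model\n" + "".join(
    f"-35.4,{i * 0.1},{name}\n" for i, name in enumerate(names)
)

# Only the model column is parsed in both cases; the second also stores it as a category.
as_default = pd.read_csv(io.StringIO(csv_text), usecols=["model"])
as_category = pd.read_csv(io.StringIO(csv_text), usecols=["model"],
                          dtype={"model": "category"})

print(as_default.memory_usage(deep=True).sum())
print(as_category.memory_usage(deep=True).sum())
print(as_category["model"].value_counts())
```

A categorical column stores each distinct string once plus a small integer code per row, which is why it suits a column with 27 distinct model names repeated over millions of rows.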

## 5.3 Summary

### 5.3.1 Summary of different approaches in terms of memory usage and execution time on one machine (Macbook Pro):

| Approach                    | Peak Memory Usage (MiB)          | Execution Wall Time             |
| ----------------------------| -------------------------------- |---------------------------------|
| Baseline                    | 7451.22                          | 1min 36s                        |
| Change dtype to float32     | 4715.13                          | 1min 43s                        |
| Dask                        | 3904.75                          | 23.2 s                          |
| Load in chunks 10 million   | 2729.70                          | 1min 20s                        |
| Load in chunks 1 million    | 1303.84                          | 1min 15s                        |
| Select single column        | 1962.19 (load), 1484.82 (count)  | 29.5 s (load) + 4.17 s (count)  |

- In terms of memory usage, loading in chunks of 1 million rows and selecting only the column of interest show the best performance.

- In terms of execution time, Dask (23.2 s) and the single-column approach (about 34 s once the 29.5 s load is included) are much faster than the baseline. The other methods had similar time performance to the baseline.

- To summarize, if the analysis does not involve many columns of the dataset, we could consider loading only the columns of interest to achieve low memory usage and a short run time. If more than one column is required in the analysis, we would consider using Dask, which combines reasonable memory usage with a fast execution time.


### 5.3.2 Comparison of different machines on two approaches - changing data type and using Dask:

| Team Member | Operating System     | RAM   | Processor                                         | Is SSD | Time Taken (changing dtype to float32) | Time Taken (Dask) |
| ----------- | -------------------- | ----- | ------------------------------------------------- | ------ | -------------------------------------- | ----------------- |
| Jessie      | Windows 10 Education | 16GB  | Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99 GHz | Yes    | peak memory: 8487.61 MiB, increment: 2461.12 MiB <br> CPU times: total: 1min 44s <br> Wall time: 1min 46s | peak memory: 4749.06 MiB, increment: 1255.98 MiB <br> CPU times: total: 1min 8s <br> Wall time: 23.5s |
| Adrianne    | Windows 10 Pro       | 16GB  | Intel(R) Core(TM) i7-1165G7 @ 2.80GHz 2.80 GHz    | Yes    | peak memory: 13214.80 MiB, increment: 9233.91 MiB <br> CPU times: total: 55.4s <br> Wall time: 57s | peak memory: 4240.98 MiB, increment: 1255.65 MiB <br> CPU times: total: 57.2s <br> Wall time: 20.6s |
| Rada        | Macbook Pro 2013 15" | 16GB  | 2.3 GHz Intel Core i7                             | No     | peak memory: 4715.13 MiB, increment: 1308.29 MiB <br> CPU times: user 1min 9s, sys: 24.7 s, total: 1min 34s <br> Wall time: 1min 43s | peak memory: 3904.75 MiB, increment: 1306.16 MiB <br> CPU times: user 49.2 s, sys: 13.4 s, total: 1min 2s <br> Wall time: 23.2 s |
| Moid        |                      |       |                                                   |        |            |            |

# 6. Perform a simple EDA in R

To perform EDA in R, we first need to transfer the dataframe from Python to R.
In this section, we will pass data from Python to R in various ways and assess each method.

## 6.1 Store the Data in Different Formats

### 6.1.1 Arrow file format

In [28]:
%%R
#Loading library
library(arrow);
library(dplyr);

R[write to console]: 
Attaching package: ‘dplyr’


R[write to console]: The following objects are masked from ‘package:stats’:

    filter, lag


R[write to console]: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [29]:
%%time
%%memit

dataset = ds.dataset("rainfall/combined_data.csv", format="csv")

table = dataset.to_table()

peak memory: 3466.32 MiB, increment: 1859.34 MiB
CPU times: user 23.6 s, sys: 4.62 s, total: 28.2 s
Wall time: 26.5 s


### 6.1.2 Feather format


In [30]:
%%time

feather.write_feather(table, 'rainfall/combined_data.feather')

CPU times: user 5.4 s, sys: 17.7 s, total: 23.1 s
Wall time: 7.31 s


> Writing the Feather file took 7.31 s of wall time, over 3x less than the 26.5 s needed to load the CSV into an Arrow table.

### 6.1.3 Parquet format


In [31]:
%%time
## writing as a single parquet 
pq.write_table(table, 'rainfall/combined_data.parquet')

CPU times: user 10.5 s, sys: 365 ms, total: 10.9 s
Wall time: 11.2 s


In [32]:
%%time
## writing as a partitioned parquet 
pq.write_to_dataset(table, 'rainfall/combined_data_partitioned.parquet',partition_cols=['model'])

CPU times: user 24.7 s, sys: 14.4 s, total: 39.1 s
Wall time: 35.1 s


In [33]:
%%sh
# Check the size of different format
du -sh rainfall/combined_data.csv
du -sh rainfall/combined_data.feather
du -sh rainfall/combined_data.parquet
du -sh rainfall/combined_data_partitioned.parquet

5.6G	rainfall/combined_data.csv
1.0G	rainfall/combined_data.feather
544M	rainfall/combined_data.parquet
548M	rainfall/combined_data_partitioned.parquet


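>We can see that both Feather and Parquet have reduced the file size significantly: from 5.6G for the CSV to 1.0G for Feather and about 0.5G for Parquet (544M as a single file, 548M partitioned). Writing the Feather file (7.31 s) and the single Parquet file (11.2 s) took much less wall time than loading the CSV into an Arrow table (26.5 s). Writing the partitioned Parquet dataset took somewhat longer (35.1 s) but produced almost the same on-disk size as the single Parquet file.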

## 6.2 Transfer the Data in Different Formats

### 6.2.1 Pandas Exchange

In [34]:
%%time
%%memit
#simple pandas: read the entire dataset into memory
df = pd.read_csv("rainfall/combined_data.csv")

peak memory: 7035.93 MiB, increment: 6672.44 MiB
CPU times: user 1min 3s, sys: 19.9 s, total: 1min 23s
Wall time: 1min 33s


In [None]:
%%time
%%R -i df
start_time <- Sys.time()
library(dplyr)
# print(class(df))
result <- df |> count(model)
#print(result)
end_time <- Sys.time()
print(end_time - start_time)

### 6.2.2 Arrow Exchange

In [None]:
%%time
%%memit
dataset = ds.dataset("rainfall/combined_data.csv", format="csv")
table = dataset.to_table()

In [None]:
%%time
%%memit
## Here we are converting arrow table so it can be passed to R
r_table = pyra.converter.py2rpy(table)

In [None]:
%%time
%%R -i r_table
# Pass r_table from python

start_time <- Sys.time()
library(dplyr)
counts <- r_table %>% collect() %>% count(model)
end_time <- Sys.time()

print(counts)
print(end_time - start_time)

### 6.2.3 Feather File

In [None]:
%%time
%%R
library(arrow)
start_time <- Sys.time()
r_table <- arrow::read_feather("rainfall/combined_data.feather")
print(class(r_table))
library(dplyr)
result <- r_table %>% count(model) 
end_time <- Sys.time()
print(result)
print(end_time - start_time)

### 6.2.4 Parquet File

In [None]:
%%time
%%R
library(arrow)
start_time <- Sys.time()
r_table <- arrow::read_parquet("rainfall/combined_data.parquet")
print(class(r_table))
library(dplyr)
result <- r_table %>% count(model)
end_time <- Sys.time()
print(result)
print(end_time - start_time)

### Observations