# Imports

In [None]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd
import altair as alt

# Dask
import dask.dataframe as dd

# pyarrow and feather
import pyarrow.feather as feather
import pyarrow.dataset as ds
import pyarrow as pa
import pyarrow.parquet as pq
import rpy2_arrow.pyarrow_rarrow as pyra

In [None]:
%load_ext rpy2.ipython
%load_ext memory_profiler

# 1. Teamwork Contract
The teamwork contract for our team, group 7, can be found [here](https://docs.google.com/document/d/1u4e5Z5C-uwTTSvCEyOYy-I30Fb8OEPYM6frM0NBEVVc/edit).

# 2. Create repository and project structure
The repository URL: https://github.com/UBC-MDS/DSCI525-Group7

# 3. Downloading the data

Using Python **requests** Library

We are using article id #14096681, which contains the data of **Daily rainfall over NSW, Australia.**

In [None]:
# Setup
article_id = 14096681  
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "rainfall/"

Review the files within the article:

In [None]:
response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)  # this contains all the articles data, feel free to check it out
files = data["files"]             # this is just the data about the files, which is what we want
files
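The file records can be hard to scan in raw form. The sketch below shows one way to summarise them; `summarize_files` is a hypothetical helper, the example records are made up, and the `size` field (in bytes) is an assumption about the record layout, so it is read defensively.

```python
def summarize_files(files):
    """Return (name, size in MB) pairs for a list of Figshare-style file records."""
    summary = []
    for record in files:
        # "name" is used elsewhere in this notebook; "size" (bytes) is assumed
        # and may be missing, in which case None is reported.
        size_bytes = record.get("size")
        size_mb = None if size_bytes is None else round(size_bytes / 1e6, 2)
        summary.append((record["name"], size_mb))
    return summary


# Hypothetical records shaped like the entries of `files`
example_files = [
    {"name": "data.zip", "size": 814_000_000, "download_url": "https://example.org/data.zip"},
    {"name": "README.txt", "download_url": "https://example.org/readme"},
]
for name, size_mb in summarize_files(example_files):
    print(name, size_mb)
```

In the notebook, `summarize_files(files)` could be called on the real list instead of the example records.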

## 3.1 Downloading and unzipping the data

In [None]:
%%time

files_to_dl = ["data.zip"]  
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])

In [None]:
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)

In [None]:
%ls -ltr rainfall/

## Comparison of Performance on Different Machines

The time each team member took to download and unzip the data is summarized below. Each team member's operating system, RAM, processor and SSD status are also recorded to check whether they have any effect on the time taken.

| Team Member | Operating System | RAM | Processor | Is SSD | Time Taken |
|-------------|------------------|-----|-----------|--------|------------|
| Jessie | Windows 10 Education | 16GB | Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99 GHz | Yes | CPU times: total: 8.48s <br> Wall time: 1min 35s |
| Adrianne | Windows 10 Pro | 16GB | Intel(R) Core(TM) i7-1165G7 @ 2.80GHz 2.80 GHz | Yes | CPU times: total: 6.23s <br> Wall time: 1min 7s |
| Rada | Macbook Pro 2013 15" | 16GB | 2.3 GHz Intel Core i7 | No | CPU times: total: 10.4 s<br>Wall time: 3min 5s |
| Moid | Windows 11 Education | 12GB | 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz 1.38 GHz | Yes | CPU times: total: 6.81s <br> Wall time: 1min 31s |

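The MacBook Pro took noticeably longer than the other machines for this step: about 3 minutes of wall time against 1 to 1.5 minutes. It is the only machine without an SSD, which may contribute, but the reason would need further investigation.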
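To compare the machines numerically, the recorded wall times can be converted to seconds. This is a minimal sketch: `to_seconds` is a hypothetical helper that only understands the `Xmin Ys` style used in the table above.

```python
import re


def to_seconds(duration):
    """Convert strings such as '1min 35s' or '23.5s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)\s*min)?\s*(?:([\d.]+)\s*s)?", duration.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {duration!r}")
    minutes = int(match.group(1) or 0)
    seconds = float(match.group(2) or 0)
    return minutes * 60 + seconds


# Wall times copied from the table above
wall_times = {
    "Jessie": "1min 35s",
    "Adrianne": "1min 7s",
    "Rada": "3min 5s",
    "Moid": "1min 31s",
}
seconds = {name: to_seconds(value) for name, value in wall_times.items()}
fastest = min(seconds.values())
for name, value in seconds.items():
    print(f"{name}: {value:.0f} s ({value / fastest:.2f}x the fastest)")
```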

# 4. Combining data CSVs

- Combine data CSVs into a single CSV using pandas.

- When combining the CSV files, add an extra column called "model" that identifies the model.
  - Tip 1: you can get this column populated from the file name, e.g. for file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON.
  - Tip 2: Remember how we added year when we combined airline CSVs.
  - Tip 3: You can use a regex generator.

_Note: There is a file called observed_daily_rainfall_SYD.csv in the data folder that you downloaded. Make sure you exclude this file (programmatically or just take out that file from folder) before you combine CSVs. We will use this file in our next milestone._

- Compare run times on different machines within your team and summarize your observations.
Warning: Some of you might not be able to do it on your laptop. It's fine if you're unable to do it. Just make sure you discuss the reasons why you might not have been able to run this on your laptop.

Let's first view the data and the columns:

In [None]:
%%time

df_1 = pd.read_csv(os.path.join(output_directory, "MPI-ESM-1-2-HAM_daily_rainfall_NSW.csv"))
df_2 = pd.read_csv(os.path.join(output_directory, "CMCC-CM2-SR5_daily_rainfall_NSW.csv"))
df_3 = pd.read_csv(os.path.join(output_directory, "SAM0-UNICON_daily_rainfall_NSW.csv"))

Even loading three of the individual files is taking a little time.

In [None]:
df_1.head(2)

In [None]:
df_2.head(2)

In [None]:
df_3.head(2)

In [None]:
df_3.tail(2)

In [None]:
%%time

files = glob.glob('rainfall/*NSW.csv')
df = pd.concat((pd.read_csv(file, index_col=0)
                .assign(model=re.findall(r'/([^_]*)', file)[0])
                for file in files)
              )
df.to_csv("rainfall/combined_data.csv")

**For Windows users:**   
Windows users will run into an index error when running the code above to combine the CSVs.   
This can be solved by adding a `./` to the filename as below.

In [None]:
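%%time
%%memit
files = glob.glob('./rainfall/*NSW.csv')
# os.path.basename drops the folder part whichever path separator glob returns;
# the model name is the part of the file name before the first underscore
df = pd.concat((pd.read_csv(file, index_col=0)
                .assign(model=os.path.basename(file).split('_')[0])
                for file in files)
              )
df.to_csv("rainfall/combined_data.csv")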
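A separator-independent way to derive the model name from a file path can be sketched in isolation as follows. `model_from_path` is a hypothetical helper, and it assumes, as in the assignment tip, that the model name is everything in the file name before the first underscore.

```python
import re


def model_from_path(path):
    """Return the model name, i.e. the file name up to the first underscore."""
    # Split on either separator so Windows and POSIX style paths both work
    file_name = re.split(r"[\\/]", path)[-1]
    return file_name.split("_")[0]


print(model_from_path("./rainfall/SAM0-UNICON_daily_rainfall_NSW.csv"))
print(model_from_path(".\\rainfall\\MPI-ESM-1-2-HAM_daily_rainfall_NSW.csv"))
```

Because the folder part is removed before the name is split, the result does not depend on how the glob pattern was written.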

Wow, this felt like an eternity!

Let's take a look at the combined file, see if head and tail are as we expect them to be:

In [None]:
df.head()

In [None]:
df.tail()

## Comparison of Performance on Different Machines

The time each team member took to combine the CSV files is summarized below. Each team member's operating system, RAM, processor and SSD status are also recorded to check whether they have any effect on the time taken.

| Team Member | Operating System | RAM | Processor | Is SSD | Time Taken |
|-------------|------------------|-----|-----------|--------|------------|
| Jessie | Windows 10 Education | 16GB | Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99 GHz | Yes | peak memory: 800.19 MiB <br> increment: 0.00 MiB <br> CPU times: total: 8min 33s <br> Wall time: 8min 40s |
| Adrianne | Windows 10 Pro | 16GB | Intel(R) Core(TM) i7-1165G7 @ 2.80GHz 2.80 GHz | Yes | peak memory: 120.25 MiB <br> increment: 0.30 MiB <br> CPU times: total: 7min 26s <br> Wall time: 7min 31s |
| Rada | Macbook Pro 2013 15" | 16GB | 2.3 GHz Intel Core i7 | No | peak memory: 3939.19 MiB <br> increment: 0.10 MiB <br> CPU times: user 7min 14s, sys: 21.5 s, total: 7min 36s  <br> Wall time: 7min 47s |
| Moid | Windows 11 Education | 12GB | 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz 1.38 GHz | Yes | peak memory: 3622.16 MiB <br> increment: 0.72 MiB <br> CPU times: total: 7min 1s <br> Wall time: 7min 5s |

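**Observations:**
Combining the files is slow in general, at around 7-8 minutes, but quite consistent across machines. The peak memory is much higher on the MacBook Pro (3939 MiB) and the Windows i5 machine (3622 MiB) than on the other two. Note that every recorded memory increment is below 1 MiB, which suggests the memory figures do not capture the combining step itself, so the peak values probably reflect the state of each session rather than the cost of the concatenation.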

# 5. Load the combined CSV to memory and perform a simple EDA

1. Investigate at least two of the following approaches to reduce memory usage while performing the EDA (e.g., value_counts).

- Changing dtype of your data
- Loading just the columns we want
- Loading in chunks
- Dask

2. Compare run times on different machines within your team and summarize your observations.

**The EDA will be to use value_counts() to count the number of data points that came from each .csv file, as recorded in the model column of combined_data.csv.**

### 5.1 Load the Entire Dataframe to Memory Using Pandas (Baseline for Comparison)

In [None]:
%%time
%%memit

df_pandas = pd.read_csv("rainfall/combined_data.csv")
print(df_pandas["model"].value_counts())

**Observations**
>Our baseline approach is to use Pandas to load the entire data to memory. The above code loads the combined_data.csv to memory and performs a simple EDA to calculate counts of values in the "model" column. We see that the peak memory is 9060 MiB and the CPU and wall time is 1min 31s. We will explore some other approaches to see if we can reduce the time and memory usage.

### 5.2 Changing dtypes of data:

- We will attempt to read the numerical columns using float32 format

Memory comparison for format changes adapted from Lecture notes:

In [None]:
print(f"Memory usage with float64: {df[['lat_min','lat_max', 'lon_min', 'lon_max', 'rain (mm/day)']].memory_usage().sum() / 1e6:.2f} MB")
print(f"Memory usage with float32: {df[['lat_min','lat_max', 'lon_min', 'lon_max', 'rain (mm/day)']].astype('float32', errors='ignore').memory_usage().sum() / 1e6:.2f} MB")

In [None]:
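float_cols = ['lat_min', 'lat_max', 'lon_min', 'lon_max', 'rain (mm/day)']
df_float32 = df.copy()
# astype returns a new object, so the result has to be assigned back
df_float32[float_cols] = df_float32[float_cols].astype('float32')

# saving the float32 dataframe to file
df_float32.to_csv("rainfall/combined_data_float32.csv")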

In [None]:
%%time
%%memit

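# loading the float32 dataframe to memory and performing a simple EDA for value counts of the model column
float_cols = ['lat_min', 'lat_max', 'lon_min', 'lon_max', 'rain (mm/day)']
# a CSV file does not store dtypes, so float32 has to be requested when reading
df_float32 = pd.read_csv("rainfall/combined_data_float32.csv",
                         dtype={col: 'float32' for col in float_cols})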
print(df_float32["model"].value_counts())

**Observations:**
> When we changed the data type from float64 to float32, the memory usage of the numerical columns dropped by nearly half. This is because a float32 value is stored in 32 bits, while a float64 value is stored in 64 bits, twice as much. In the EDA, after converting the dtypes to float32, the peak memory usage decreased and the memory increment was halved. Both the CPU and wall times also decreased. Changing the dtype is effective in reducing the time and memory required to load data, and should be considered when we have a large amount of data that does not require very high precision.
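The effect of the dtype can be checked on a tiny synthetic dataframe. This is a sketch with made-up values: it shows that float32 columns take half the bytes of float64 columns in memory, and that a CSV round trip falls back to float64 unless `dtype` is passed to `pd.read_csv`, because CSV stores plain text.

```python
import io

import pandas as pd

# Made-up values; only the column names follow the rainfall data
float_cols = ["lat_min", "lat_max", "rain (mm/day)"]
small = pd.DataFrame({col: [0.5, 1.5, 2.5, 3.5] for col in float_cols})

# In memory, float32 columns take half the bytes of float64 columns
bytes_64 = small.memory_usage(index=False).sum()
bytes_32 = small.astype("float32").memory_usage(index=False).sum()
print(bytes_64, bytes_32)

# A CSV file stores text only, so the dtype has to be requested again on load
csv_text = small.astype("float32").to_csv(index=False)
default_load = pd.read_csv(io.StringIO(csv_text))
float32_load = pd.read_csv(
    io.StringIO(csv_text), dtype={col: "float32" for col in float_cols}
)
print(default_load.dtypes.astype(str).tolist())
print(float32_load.dtypes.astype(str).tolist())
```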


### 5.3 Dask:

- We will attempt to read dataframe using dask

In [None]:
%%time
%%memit
# Dask
df_dask = dd.read_csv('rainfall/combined_data.csv')
print(df_dask["model"].value_counts().compute())

**Observations:**
> Using a Dask dataframe is much faster and lighter on memory. Compared to loading the CSV into a Pandas dataframe, the peak memory, memory increment, and wall time were all significantly lower when calling value_counts() on the Dask dataframe. This is likely because Dask partitions the dataframe by rows and performs the calculation on the partitions in parallel. Thus, for large-scale computations, we could use Dask instead of Pandas to improve efficiency with minimal syntax changes.
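The partition, count, and merge pattern can be illustrated with the standard library. This sketch is not how Dask is implemented; it only mirrors the structure of the computation, and threads in CPython do not give real CPU parallelism for this kind of work. `parallel_value_counts` is a hypothetical helper and the sample values are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_partition(rows):
    """Count the values within one partition."""
    return Counter(rows)


def parallel_value_counts(values, n_partitions=4):
    """Split values into partitions, count each one, then merge the counts."""
    size = max(1, -(-len(values) // n_partitions))  # ceiling division
    partitions = [values[i:i + size] for i in range(0, len(values), size)]
    total = Counter()
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(count_partition, partitions):
            total.update(partial)
    return total


models = [
    "SAM0-UNICON", "CMCC-CM2-SR5", "SAM0-UNICON",
    "MPI-ESM-1-2-HAM", "CMCC-CM2-SR5", "SAM0-UNICON",
]
print(parallel_value_counts(models, n_partitions=3))
```

Each partition only needs its own rows in memory, and the partial counts are small, which is why this pattern keeps peak memory low.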

### 5.4 Loading in Chunks:

- We will attempt to read dataframe in chunks

#### Chunksize = 10 million:

In [None]:
%%time
%%memit
counts = pd.Series(dtype=int)
for chunk in pd.read_csv("rainfall/combined_data.csv", chunksize=10_000_000):
    counts = counts.add(chunk["model"].value_counts(), fill_value=0)
print(counts.astype(int))

#### Chunksize = 1 million:

In [None]:
%%time
%%memit
counts = pd.Series(dtype=int)
for chunk in pd.read_csv("rainfall/combined_data.csv", chunksize=1_000_000):
    counts = counts.add(chunk["model"].value_counts(), fill_value=0)
print(counts.astype(int))

**Observations:**
> When loading the data in chunks, the peak memory is significantly lower than without chunking. Loading 10 million rows per chunk requires nearly 6800 MiB of peak memory, which is less than loading everything at once with Pandas, and loading 1 million rows per chunk requires only 3740 MiB. The memory increment is almost 10 times smaller than when loading everything at once, so chunking is considerably more memory-efficient. However, the CPU and wall times remain roughly the same across all these approaches.
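The reason the chunk size does not change the result is that per-chunk counts simply add up. A minimal standard-library sketch of the same accumulate-per-chunk idea follows; `chunked_value_counts` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io
from collections import Counter
from itertools import islice


def chunked_value_counts(handle, column, chunksize):
    """Count values of one CSV column, holding at most `chunksize` rows at a time."""
    reader = csv.DictReader(handle)
    counts = Counter()
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            break
        counts.update(row[column] for row in chunk)
    return counts


# Made-up rows; only the column names follow the rainfall data
SAMPLE = (
    "time,rain (mm/day),model\n"
    "2000-01-01,0.1,SAM0-UNICON\n"
    "2000-01-02,0.0,SAM0-UNICON\n"
    "2000-01-01,0.3,CMCC-CM2-SR5\n"
    "2000-01-02,1.2,CMCC-CM2-SR5\n"
    "2000-01-03,0.4,SAM0-UNICON\n"
)
small_chunks = chunked_value_counts(io.StringIO(SAMPLE), "model", chunksize=2)
one_chunk = chunked_value_counts(io.StringIO(SAMPLE), "model", chunksize=100)
print(small_chunks)
```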


### 5.5 Selecting columns:

Since we only want the model for EDA, we will import just the model column. This is faster and uses less memory than loading the whole dataframe.


In [None]:
%%time
%%memit
df = pd.read_csv("rainfall/combined_data.csv", 
                 usecols = ["model"])

In [None]:
%%time
%%memit
df["model"].value_counts()

**Observations:**
>Running value_counts takes the same time as it did using the entire data set, probably because it has to iterate through the same number of rows. However, this should still be done whenever possible because it reduces memory required and speeds up loading data.
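The memory saving from `usecols` can be seen even on a toy file. The sketch below uses made-up rows and compares the in-memory size of the full dataframe with one that holds only the `model` column.

```python
import io

import pandas as pd

# Made-up rows; only the column names follow the rainfall data
csv_text = (
    "time,lat_min,rain (mm/day),model\n"
    "2000-01-01,-36.0,0.1,SAM0-UNICON\n"
    "2000-01-02,-36.0,0.0,SAM0-UNICON\n"
    "2000-01-01,-36.0,0.3,CMCC-CM2-SR5\n"
)
full = pd.read_csv(io.StringIO(csv_text))
model_only = pd.read_csv(io.StringIO(csv_text), usecols=["model"])

# deep=True also counts the bytes of the strings themselves
print(full.memory_usage(deep=True).sum(), model_only.memory_usage(deep=True).sum())
print(model_only["model"].value_counts())
```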

**Here is a comparison of different machines on two approaches - changing data type and using Dask:**

| Team Member | Operating System | RAM | Processor | Is SSD | Time Taken (changing dtype to float32) | Time Taken (Dask) |
|-------------|------------------|-----|-----------|--------|----------------------------------------|-------------------|
| Jessie | Windows 10 Education | 16GB | Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99 GHz | Yes | peak memory: 8487.61 MiB, increment: 2461.12 MiB <br> CPU times: total: 1min 44s <br> Wall time: 1min 46s | peak memory: 4749.06 MiB, increment: 1255.98 MiB <br> CPU times: total: 1min 8s <br> Wall time: 23.5s |
| Adrianne | Windows 10 Pro | 16GB | Intel(R) Core(TM) i7-1165G7 @ 2.80GHz 2.80 GHz | Yes | peak memory: 13214.80 MiB, increment: 9233.91 MiB <br> CPU times: total: 55.4s <br> Wall time: 57s | peak memory: 4240.98 MiB, increment: 1255.65 MiB <br> CPU times: total: 57.2s <br> Wall time: 20.6s |
| Rada | Macbook Pro 2013 15" | 16GB | 2.3 GHz Intel Core i7 | No | peak memory: 4715.13 MiB, increment: 1308.29 MiB <br> CPU times: user 1min 9s, sys: 24.7 s, total: 1min 34s <br> Wall time: 1min 43s | peak memory: 3904.75 MiB, increment: 1306.16 MiB <br> CPU times: user 49.2 s, sys: 13.4 s, total: 1min 2s <br> Wall time: 23.2s |
| Moid | Windows 11 Education | 12GB | 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz 1.38 GHz | Yes | peak memory: 9341.97 MiB, increment: 2808.60 MiB <br> CPU times: total: 1min 4s <br> Wall time: 1min 8s | peak memory: 3671.34 MiB, increment: 1263.45 MiB <br> CPU times: total: 1min 6s <br> Wall time: 23s |

### 5.6 Aggregation & Plot

In [None]:
df = pd.read_csv("rainfall/combined_data.csv")

**Extracting Month and Year from Date**

In [None]:
df_eda = df.copy()
df_eda = df_eda.reset_index()
df_eda.head(2)

In [None]:
%%time
%%memit
# parse the dates once and reuse the result for both columns
time_index = pd.DatetimeIndex(df_eda['time'])
df_eda['year'] = time_index.year
df_eda['month'] = time_index.month

In [None]:
df_eda.head(2)

**Aggregation**

In [None]:
%%time
%%memit
df_eda = df_eda[['model','year','rain (mm/day)']]
df_eda = df_eda.groupby(['model', 'year']).agg('mean')

**<center>Extracting Month and Year from Date, and Aggregation Times Comparison</center>**

| Team Member | Operating System | RAM | Processor | Is SSD | Time Taken (Extracting) | Time Taken (Aggregation) |
|-------------|------------------|-----|-----------|--------|-------------------------|--------------------------|
| Jessie | Windows 10 Education | 16GB | Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99 GHz | Yes | 1min 45s | 13.1s |
| Adrianne | Windows 10 Pro | 16GB | Intel(R) Core(TM) i7-1165G7 @ 2.80GHz 2.80 GHz | Yes | 1min 27s | 12.3s |
| Rada | Macbook Pro 2013 15" | 16GB | 2.3 GHz Intel Core i7 | No | 27.3s | 12.2s |
| Moid | Windows 11 Education | 12GB | 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz 1.38 GHz | Yes | | |

**Plotting**

In [None]:
df_eda = df_eda.reset_index()
df_eda.tail(2)

In [None]:
%%time
%%memit

alt.data_transformers.disable_max_rows()

plot = alt.Chart(df_eda).mark_line().encode(
    x='year',
    y='rain (mm/day)',
    color='model'
)

In [None]:
%%time
%%memit

alt.data_transformers.disable_max_rows()

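# df_eda holds one row per model and year, so average over the years;
# without an aggregate the bars would stack (sum) the yearly means
plot2 = alt.Chart(df_eda).mark_bar().encode(
    x=alt.X('rain (mm/day)', aggregate='mean', type='quantitative'),
    y=alt.Y('model:N', sort='-x')
)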

In [None]:
plot

In [None]:
plot2

**<center>Plotting Times Comparison</center>**

| Team Member | Operating System | RAM | Processor | Is SSD | Time Taken (Plot1) | Time Taken (Plot2) |
|-------------|------------------|-----|-----------|--------|--------------------|--------------------|
| Jessie | Windows 10 Education | 16GB | Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99 GHz | Yes | 2.22s | 2.45s |
| Adrianne | Windows 10 Pro | 16GB | Intel(R) Core(TM) i7-1165G7 @ 2.80GHz 2.80 GHz | Yes | 2.11s | 2.06s |
| Rada | Macbook Pro 2013 15" | 16GB | 2.3 GHz Intel Core i7 | No | 1.43s | 1.82s |
| Moid | Windows 11 Education | 12GB | 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz 1.38 GHz | Yes | | |

# 6. Perform a simple EDA in R

To perform EDA in R, we first need to transfer the dataframe from Python to R.
In this section, we will pass data from Python to R in various ways and assess each method.

## 6.1 Store the Data in Different Formats

### 6.1.1 Arrow file format

In [None]:
%%R
#Loading library
library(arrow);
library(dplyr);

In [None]:
%%time
%%memit

dataset = ds.dataset("rainfall/combined_data.csv", format="csv")

table = dataset.to_table()

### 6.1.2 Feather format


In [None]:
%%time

feather.write_feather(table, 'rainfall/combined_data.feather')

Evidently, the Feather format comes with an over 3x wall time improvement compared with the Arrow step above.

### 6.1.3 Parquet format


In [None]:
%%time
## writing as a single parquet 
pq.write_table(table, 'rainfall/combined_data.parquet')

In [None]:
%%time
## writing as a partitioned parquet 
pq.write_to_dataset(table, 'rainfall/combined_data_partitioned.parquet',partition_cols=['model'])

In [None]:
%%sh
# Check the size of different format
du -sh rainfall/combined_data.csv
du -sh rainfall/combined_data.feather
du -sh rainfall/combined_data.parquet
du -sh rainfall/combined_data_partitioned.parquet

>We can see that both Feather and Parquet have reduced the file size significantly. The wall time taken for feather and single parquet was much less than Arrow. Partitioned parquet took similar wall time as Arrow but it significantly reduced the file size.
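Since `du` is not available on every platform, the same size comparison can be sketched in Python. `size_on_disk_mb` is a hypothetical helper; it sums the files inside a directory, which matters for the partitioned Parquet output because that is a folder rather than a single file.

```python
import os


def size_on_disk_mb(path):
    """Return the size of a file, or the total size of a directory tree, in MB."""
    if os.path.isfile(path):
        return os.path.getsize(path) / 1e6
    total = 0
    for root, _dirs, names in os.walk(path):
        total += sum(os.path.getsize(os.path.join(root, name)) for name in names)
    return total / 1e6


for path in [
    "rainfall/combined_data.csv",
    "rainfall/combined_data.feather",
    "rainfall/combined_data.parquet",
    "rainfall/combined_data_partitioned.parquet",
]:
    # Skip anything that has not been written yet
    if os.path.exists(path):
        print(f"{path}: {size_on_disk_mb(path):.1f} MB")
```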

## 6.2 Transfer the Data in Different Formats

### 6.2.1 Pandas Exchange

In [None]:
%%time
%%memit
#simple pandas: read the entire dataset into memory
df = pd.read_csv("rainfall/combined_data.csv")

In [None]:
%%time
%%R -i df
start_time <- Sys.time()
library(dplyr)
# print(class(df))
result <- df |> count(model)
#print(result)
end_time <- Sys.time()
print(end_time - start_time)

### 6.2.2 Arrow Exchange

In [None]:
%%time
%%memit
dataset = ds.dataset("rainfall/combined_data.csv", format="csv")
table = dataset.to_table()

In [None]:
%%time
%%memit
## Here we are converting arrow table so it can be passed to R
r_table = pyra.converter.py2rpy(table)

In [None]:
%%time
%%R -i r_table
# Pass r_table from python

start_time <- Sys.time()
library(dplyr)
counts <- r_table %>% collect() %>% count(model)
end_time <- Sys.time()

print(counts)
print(end_time - start_time)

### 6.2.3 Feather File

In [None]:
%%time
%%R
library(arrow)
start_time <- Sys.time()
r_table <- arrow::read_feather("rainfall/combined_data.feather")
print(class(r_table))
library(dplyr)
result <- r_table %>% count(model) 
end_time <- Sys.time()
print(result)
print(end_time - start_time)

### 6.2.4 Parquet File

In [None]:
%%time
%%R
library(arrow)
start_time <- Sys.time()
r_table <- arrow::read_parquet("rainfall/combined_data.parquet")
print(class(r_table))
library(dplyr)
result <- r_table %>% count(model)
end_time <- Sys.time()
print(result)
print(end_time - start_time)

## 6.3 Aggregation & Plot

Rename the column because it's easier to work that way in R:

In [None]:
%%time
%%R
r_table <- r_table %>% 
  rename(rain = `rain (mm/day)`)
start_time <- Sys.time()
glimpse(r_table)
end_time <- Sys.time()
print(end_time - start_time)

**Aggregate by Model only:**

In [None]:
%%time
%%R
start_time <- Sys.time()
summ_rain <- r_table %>% 
  group_by(model) %>%
  summarise(mean_rain = mean(rain, na.rm = TRUE))
end_time <- Sys.time()
print(end_time - start_time)
summ_rain

**Extract Month and Year**

In [None]:
%%time
%%R
library(lubridate)
start_time <- Sys.time()
year_month_table <- r_table %>% 
  mutate(year = year(time), month = month(time))
end_time <- Sys.time()
print(end_time - start_time)
year_month_table

**Aggregate by Model, Year and Month:**

In [None]:
%%time
%%R
start_time <- Sys.time()
summ_rain_2 <- year_month_table %>% 
  group_by(model, year, month) %>%
  summarise(mean_rain = mean(rain, na.rm = TRUE))
end_time <- Sys.time()
print(end_time - start_time)
summ_rain_2

**<center>Extracting Month and Year from Date, and Aggregation Times Comparison</center>**

| Team Member | Operating System | RAM | Processor | Is SSD | Time Taken (Extracting) | Time Taken (Aggregation) |
|-------------|------------------|-----|-----------|--------|-------------------------|--------------------------|
| Jessie | Windows 10 Education | 16GB | Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99 GHz | Yes | 25.3s | 6.02s |
| Adrianne | Windows 10 Pro | 16GB | Intel(R) Core(TM) i7-1165G7 @ 2.80GHz 2.80 GHz | Yes | 18.6s | 5.05s |
| Rada | Macbook Pro 2013 15" | 16GB | 2.3 GHz Intel Core i7 | No | 53.2s | 5.57s |
| Moid | Windows 11 Education | 12GB | 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz 1.38 GHz | Yes | | |

In [None]:
%%time
%%R
library(ggplot2)
start_time <- Sys.time()
plot <- summ_rain_2 %>% ggplot(aes(x = year, y = mean_rain, color = model)) + geom_line()
end_time <- Sys.time()
print(end_time - start_time)
plot

In [None]:
%%time
%%R
library(ggplot2)
start_time <- Sys.time()
plot2 <- summ_rain %>% ggplot(aes(x = mean_rain, y = model)) + geom_bar(stat = "identity")
end_time <- Sys.time()
print(end_time - start_time)
plot2

**<center>Plotting Times Comparison</center>**

| Team Member | Operating System | RAM | Processor | Is SSD | Time Taken (Plot1) | Time Taken (Plot2) |
|-------------|------------------|-----|-----------|--------|--------------------|--------------------|
| Jessie | Windows 10 Education | 16GB | Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz 1.99 GHz | Yes | 1.55s | 1.14s |
| Adrianne | Windows 10 Pro | 16GB | Intel(R) Core(TM) i7-1165G7 @ 2.80GHz 2.80 GHz | Yes | 1.28s | 0.139s |
| Rada | Macbook Pro 2013 15" | 16GB | 2.3 GHz Intel Core i7 | No | 1.22s | 1.43s |
| Moid | Windows 11 Education | 12GB | 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz 1.38 GHz | Yes | | |

## Observations Summary

### File manipulation:  

Downloading and unzipping the rainfall file took 6-11sec of CPU time and between about 1 and 3 minutes of wall time, with the Windows machines faring better than the Macbook.

Combining the data by simple concatenation took about 7-8min. The Windows 10 Education machine took the longest, but the Macbook Pro and the Windows i5 machine showed a strikingly high peak memory, at about 1/4 of their total RAM. Another interesting thing to note is that the memory increment on the Windows i5 machine is at least double that on the i7 machines.
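The share of total RAM taken by these peaks can be checked with a quick calculation. The peak values come from the combining table above; treating the advertised "GB" of RAM as GiB (1 GiB = 1024 MiB) is an assumption.

```python
# Peak memory (MiB) and installed RAM (GB) as recorded in the combining table
peak_mib = {"Rada": 3939.19, "Moid": 3622.16}
ram_gb = {"Rada": 16, "Moid": 12}


def fraction_of_ram(peak_mib_value, ram_gb_value):
    """Return peak memory as a fraction of installed RAM."""
    # Assumption: the advertised "GB" of RAM is treated as GiB (1024 MiB each)
    return peak_mib_value / (ram_gb_value * 1024)


for name in peak_mib:
    share = fraction_of_ram(peak_mib[name], ram_gb[name])
    print(f"{name}: {share:.0%} of RAM")
```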

### Loading Data into R:

We attempted several methods of loading the data into R, after the previous work in pandas.   
Loading only the columns we needed reduced the loading time from a bit over 1min for the raw file to ~5sec. Loading the numerical columns as the more memory-efficient float32 instead of float64 cut the loading time roughly in half.

For the value counts EDA, raw **Pandas** exchange took a very long time, upwards of 40 minutes for each of us.   
This baseline looked intimidating. Fortunately, every alternative method improved the processing time significantly.   

**Arrow** exchange took ~26sec, **Feather** loading ~8sec, **Parquet** ~11sec, and **partitioned Parquet**, which comes with some more optimization, ~35sec.   
And this is for a very large file with upwards of 62 million observations. Parquet is clearly a very good tool for loading the data.

With that in mind, for files of this magnitude in the future, we would lean towards loading the data using **Parquet**, because it is optimized for working with large files and its loading times are comparable to those of the alternative techniques.

### EDA Comparisons

The different file loading techniques resulted in different time savings for the simple EDA of computing value counts.   
When loaded with **Arrow**, counting took ~53sec, with **Feather** ~20sec, and with **Parquet** ~10sec. Again, **Parquet** is proving to be a good tool for this kind of processing.

For the extraction of year/month from the date, times varied a lot from run to run, but overall Pandas took ~12sec and R took ~40sec.
For the simple aggregation process, Pandas took ~6-10sec, and R took ~3-5sec to perform the same operation.
Plotting was simple and fast once the data was aggregated, at roughly 2.5sec or less per plot in both Python and R. Of course, plotting usually requires aggregation in the first place: it is rare that a user can make sense of millions of non-aggregated data points. So the aggregation times are the more important ones to quantify here, because plotting millions of points is unlikely to come up in practice.


### Machine Comparison Overall: 

In general, we didn't have a particularly wide variety of machines: all of us are on 12-16GB of RAM and Intel Core processors.   
We did have both Macbook and Windows machines to test the results.   
It appears that the times are fairly consistent between operating systems.
In fact, while we only recorded the final times and memory usage for each step per laptop, rerunning the notebook several times resulted in a variation in times and memory usage similar to what we got from using different laptops.   
However, for most of the processes, the Macbook performed worse than the others. Many more tests on similar computers would be needed to get a meaningful performance comparison, because the Macbook is very old (2013), so besides the components of the build, wear and tear could affect the timings.
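Run-to-run variation of the kind described above can be quantified by timing a step several times on the same machine. This is a minimal sketch: `time_repeated` and `summarize` are hypothetical helpers, and the timed function here is a cheap stand-in for a real notebook step.

```python
import statistics
import time


def time_repeated(func, repeats=5):
    """Run `func` several times and return the wall time of each run in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Summarise a list of durations; stdev needs at least two runs."""
    return {
        "mean": statistics.mean(durations),
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "min": min(durations),
        "max": max(durations),
    }


# Cheap stand-in workload; in the notebook this would be one of the timed steps
durations = time_repeated(lambda: sum(range(100_000)), repeats=5)
summary = summarize(durations)
print(summary)
```

If the spread between runs on one laptop is as large as the spread between laptops, differences between machines cannot be attributed to hardware with confidence.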