# Milestone 1

In this milestone, we will be reading in the data via Pandas.

## 1. Downloading the data

In [1]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd

In [2]:
# Necessary metadata
article_id = 14096681
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "rainfallNSW/"

In [3]:
response = requests.get(url, headers=headers)
response.raise_for_status()       # stop early if the request failed
data = response.json()            # this contains all of the article's data, feel free to check it out
files = data["files"]             # this is just the data about the files, which is what we want
files

[{'id': 26579150,
  'name': 'daily_rainfall_2014.png',
  'size': 58863,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26579150',
  'supplied_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'computed_md5': 'fd32a2ffde300a31f8d63b1825d47e5e'},
 {'id': 26579171,
  'name': 'environment.yml',
  'size': 192,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26579171',
  'supplied_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'computed_md5': '060b2020017eed93a1ee7dd8c65b2f34'},
 {'id': 26586554,
  'name': 'README.md',
  'size': 5422,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26586554',
  'supplied_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'computed_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c'},
 {'id': 26766812,
  'name': 'data.zip',
  'size': 814041183,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26766812',
  'supplied_md5': 'b517383f76e77bd03755a63a8f

In [4]:
files_to_dl = ["data.zip"]  # feel free to add other files here
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])
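
Each entry in the metadata above carries a `supplied_md5` field, so the download can be checked before it is extracted. The following is a minimal sketch, not part of the original notebook: the helper names `md5_of_file` and `matches_supplied_md5` are hypothetical, and it assumes the file has already been saved locally.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b"" (end of file)
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_supplied_md5(path, file_entry):
    """Compare a local file against the 'supplied_md5' field of its metadata entry."""
    return md5_of_file(path) == file_entry["supplied_md5"]
```

With the variables from the cells above, a call would look like `matches_supplied_md5(output_directory + file["name"], file)`; a `False` result suggests a corrupted or partial download.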

In [5]:
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)

## 2. Combining the data

In [8]:
%%time
import pandas as pd

path = "rainfallNSW"  # use your path

# os.path.join uses the correct separator on macOS, Linux, and Windows
all_files = glob.glob(os.path.join(path, "*.csv"))
all_files.remove(os.path.join(path, "observed_daily_rainfall_SYD.csv"))

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    # drop the folder and the 23-character file-name suffix to get the model name
    df['model'] = os.path.basename(filename)[:-23]
    li.append(df)

combined_df = pd.concat(li, axis=0, ignore_index=True)
combined_df

CPU times: total: 2min 30s
Wall time: 2min 31s


|          | time                | lat_min    | lat_max   | lon_min | lon_max | rain (mm/day) | model      |
|:--------:|:-------------------:|:----------:|:---------:|:-------:|:-------:|:-------------:|:----------:|
| 0        | 1889-01-01 12:00:00 | -36.250000 | -35.00000 | 140.625 | 142.500 | 3.293256e-13  | ACCESS-CM2 |
| 1        | 1889-01-02 12:00:00 | -36.250000 | -35.00000 | 140.625 | 142.500 | 0.000000e+00  | ACCESS-CM2 |
| 2        | 1889-01-03 12:00:00 | -36.250000 | -35.00000 | 140.625 | 142.500 | 0.000000e+00  | ACCESS-CM2 |
| 3        | 1889-01-04 12:00:00 | -36.250000 | -35.00000 | 140.625 | 142.500 | 0.000000e+00  | ACCESS-CM2 |
| 4        | 1889-01-05 12:00:00 | -36.250000 | -35.00000 | 140.625 | 142.500 | 1.047658e-02  | ACCESS-CM2 |
| ...      | ...                 | ...        | ...       | ...     | ...     | ...           | ...        |
| 62467838 | 2014-12-27 12:00:00 | -30.157068 | -29.21466 | 153.125 | 154.375 | 5.543748e-01  | TaiESM1    |
| 62467839 | 2014-12-28 12:00:00 | -30.157068 | -29.21466 | 153.125 | 154.375 | 7.028577e+00  | TaiESM1    |
| 62467840 | 2014-12-29 12:00:00 | -30.157068 | -29.21466 | 153.125 | 154.375 | 2.347570e-01  | TaiESM1    |
| 62467841 | 2014-12-30 12:00:00 | -30.157068 | -29.21466 | 153.125 | 154.375 | 2.097459e+00  | TaiESM1    |


**Saving the combined CSV**

In [12]:
combined_df.to_csv("../data/processed/figshare/combined_data.csv")
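
By default `to_csv` also writes the row index, which is why the reloaded file later shows an `Unnamed: 0` column in `df.info()`. The sketch below illustrates this on a tiny made-up frame written to an in-memory buffer (the values are illustrative, not taken from the dataset); passing `index=False` is one way to avoid the extra column if it is not wanted.

```python
import io

import pandas as pd

# Tiny illustrative frame; the values are made up.
df = pd.DataFrame(
    {"time": ["1889-01-01 12:00:00", "1889-01-02 12:00:00"],
     "rain (mm/day)": [0.0, 1.5]}
)

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the row index out of the file.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```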

## 3. Time taken to combine the CSV files

| Team Member | Operating System | RAM (GB) | Processor     | Is SSD | Time taken |
|:-----------:|:----------------:|:--------:|:-------------:|:------:|:----------:|
| Ruben       |       MacOS      |    8     |      M1       |   Yes  |            |
| Jacqueline  |       MacOS      |    8     | Intel Core i5 |   Yes  |  1min 8s   |
| Kyle        |       Windows 10 |    16    | Intel Core i7 |   Yes  |  2min 30s  |
| Sanchit     |       MacOS      |  8  |    M1     |   Yes  |  40s       |
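
The times above come from the `%%time` cell magic, which reports both CPU time and wall time. Outside a notebook, a comparable measurement could be sketched with the standard library as below. The `timed` helper and the `timings` dictionary are hypothetical names introduced for illustration, and the summed range is only a stand-in workload.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record wall-clock and CPU seconds for the enclosed block under `label`."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # CPU time used by this process
    try:
        yield
    finally:
        results[label] = {
            "wall_s": time.perf_counter() - wall_start,
            "cpu_s": time.process_time() - cpu_start,
        }


timings = {}
with timed("sum", timings):
    total = sum(range(1_000_000))  # stand-in for the CSV-combining step

print(timings)
```

Collecting the results in a dictionary makes it straightforward to compare several steps, or several machines, side by side.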

### 3.1 Observations

TO BE FILLED IN ... 

## 4. Exploratory Data Analysis

**Comparing the run times for loading the CSV**

In [2]:
%%time
df = pd.read_csv("../data/processed/figshare/combined_data.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62467843 entries, 0 to 62467842
Data columns (total 8 columns):
 #   Column         Dtype  
---  ------         -----  
 0   Unnamed: 0     int64  
 1   time           object 
 2   lat_min        float64
 3   lat_max        float64
 4   lon_min        float64
 5   lon_max        float64
 6   rain (mm/day)  float64
 7   model          object 
dtypes: float64(5), int64(1), object(2)
memory usage: 3.7+ GB
CPU times: total: 4min 16s
Wall time: 4min 25s


**Time taken to load the dataset**

| Team Member | Operating System | RAM (GB) | Processor     | Is SSD | Time taken |
|:-----------:|:----------------:|:--------:|:-------------:|:------:|:----------:|
| Ruben       |       MacOS      |    8     |      M1       |   Yes  |            |
| Jacqueline  |       MacOS      |    8     | Intel Core i5 |   Yes  |            |
| Kyle        |                  |     |           |        |            |
| Sanchit     |       MacOS      |  8  |    M1     |   Yes  |  1min 2s   |

**The first method we will use to reduce the runtime will be to change the datatype. We will convert float64 to float32 for the numerical columns.**

In [None]:
%%time
float32_cols = ['lat_min', 'lat_max', 'lon_min', 'lon_max', 'rain (mm/day)']  # numerical columns to downcast
combined_df_f32 = pd.read_csv("../data/processed/figshare/combined_data.csv", dtype={col: 'float32' for col in float32_cols})
combined_df_f32.info()

| Team Member | Operating System | RAM (GB) | Processor     | Is SSD | Time taken |
|:-----------:|:----------------:|:--------:|:-------------:|:------:|:----------:|
| Ruben       |       MacOS      |    8     |      M1       |   Yes  |            |
| Jacqueline  |       MacOS      |    8     | Intel Core i5 |   Yes  |            |
| Kyle        |                  |     |           |        |            |
| Sanchit     |       MacOS      |  8  |    M1     |   Yes  |  1min      |
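
Converting `float64` to `float32` halves the bytes stored per value, so its main effect is on memory rather than on how long the CSV text takes to parse. The sketch below shows the memory side on a small made-up frame; the column names mirror the dataset, but the values and the row count are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative frame with 1000 rows; the values are made up.
df64 = pd.DataFrame(
    {"lat_min": np.linspace(-36.25, -30.0, 1000),
     "rain (mm/day)": np.linspace(0.0, 10.0, 1000)}
)
df32 = df64.astype("float32")

# memory_usage reports bytes per column; index=False excludes the row index.
bytes64 = int(df64.memory_usage(index=False).sum())
bytes32 = int(df32.memory_usage(index=False).sum())

print(bytes64, bytes32)
```

The trade-off is precision: `float32` keeps roughly seven significant decimal digits, which may or may not be acceptable depending on the analysis.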

**The total time did not decrease considerably. We will now try to load individual columns and perform EDA on them.**

In [None]:
combined_df.head()

In [None]:
%%time
column = ['lat_max']
combined_df_value = pd.read_csv("../data/processed/figshare/combined_data.csv", usecols=column)
print(combined_df_value.value_counts())

| Team Member | Operating System | RAM (GB) | Processor     | Is SSD | Time taken |
|:-----------:|:----------------:|:--------:|:-------------:|:------:|:----------:|
| Ruben       |       MacOS      |    8     |      M1       |   Yes  |            |
| Jacqueline  |       MacOS      |    8     | Intel Core i5 |   Yes  |            |
| Kyle        |                  |     |           |        |            |
| Sanchit     |       MacOS      |  8  |    M1     |   Yes  |  25s       |

It took only about 25 seconds, but this is not a feasible solution because it is not scalable.
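
For reference, the following is a minimal sketch of what `usecols` does, using a tiny in-memory CSV with made-up rows instead of the real file. Only the requested column ends up in the resulting frame, which is why the load is faster, and also why every additional column of interest needs another read.

```python
import io

import pandas as pd

# Three made-up rows standing in for the combined CSV.
csv_text = (
    "time,lat_min,lat_max\n"
    "1889-01-01 12:00:00,-36.25,-35.0\n"
    "1889-01-02 12:00:00,-36.25,-35.0\n"
    "1889-01-03 12:00:00,-30.0,-29.0\n"
)

# usecols restricts the resulting frame to the listed columns.
lat_max_only = pd.read_csv(io.StringIO(csv_text), usecols=["lat_max"])
counts = lat_max_only["lat_max"].value_counts()

print(counts)
```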

In [None]:
%%time
combined_df.describe()

| Team Member | Operating System | RAM (GB) | Processor     | Is SSD | Time taken |
|:-----------:|:----------------:|:--------:|:-------------:|:------:|:----------:|
| Ruben       |       MacOS      |    8     |      M1       |   Yes  |            |
| Jacqueline  |       MacOS      |    8     | Intel Core i5 |   Yes  |            |
| Kyle        |                  |     |           |        |            |
| Sanchit     |       MacOS      |  8  |    M1     |   Yes  |  10s       |

### Code for Parquet file

In [None]:
%load_ext rpy2.ipython

In [None]:
%%time
combined_df.to_parquet("rainfallNSW/combined_data.parquet")

In [None]:
%%R
suppressMessages(library(arrow, warn.conflicts = FALSE))  # provides open_dataset()
suppressMessages(library(data.table))
suppressMessages(library(dplyr, warn.conflicts = FALSE))
suppressMessages(library(ggplot2))

In [None]:
%%R
r_parquet <- open_dataset("rainfallNSW/combined_data.parquet")
r_df <- r_parquet |> collect()

> We decided to use a Parquet file to transfer our data frame for several reasons.
> 1. **Time:** This approach allows for the quickest and most convenient recovery in the event that we have to restart the kernel or the notebook crashes. Essentially, this allows us to pick up where we left off without having to repeat the earlier steps in Python.
> 2. **Flexibility:** Parquet allows us to easily read our data into many different languages without needing to know the interactions between said languages. Furthermore, we can save time and memory by using partitioning to read in only what is needed, rather than loading the entire data frame into memory and then filtering it.
> 3. **Memory:** Having consolidated our data into a Parquet file, we no longer need to keep our original CSV files. This essentially allows us to store the same information with a fraction of the disk space, in a format that is arguably more desirable than CSV.
>  
> - We did experiment with Arrow Exchange before ultimately deciding to use Parquet. Arrow Exchange code is commented out at the bottom of the notebook

### EDA in R

In [None]:
%%time
%%R
r_df |> str()

In [None]:
%%time
%%R
r_df |> head()

In [None]:
%%time
%%R
r_df |> summary()

### Code for Arrow Exchange

In [None]:
#import pyarrow as pa
#import pyarrow 
#from pyarrow import csv
#import rpy2_arrow.pyarrow_rarrow as pyra

In [None]:
#pyarrow_table = pa.Table.from_pandas(combined_df)
#r_table = pyra.converter.py2rpy(pyarrow_table)

In [None]:
#%%time
#%%R -i r_table
#suppressMessages(library(arrow, warn.conflicts = FALSE))
#suppressMessages(library(dplyr, warn.conflicts = FALSE))
#r_df <- r_table |> collect()