# DSCI 525: Milestone 1 Group 20

## Group Members
- Lauren Zung
- Xinru Lu
- Spencer Gerlach

# Parts 1 & 2: Contract & Repo

- Completed by Lauren Zung

# Part 3: Downloading the Data

- Spencer Gerlach

In [None]:
# Imports

import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd

- Change to the directory where the files will be stored.

- We assume the data cannot be saved to our repo, so it is kept outside it.

> The path below must be updated for whoever is running the notebook.
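
One way to avoid editing the path for each user is to read it from the environment. The following is a minimal sketch, not part of the original workflow: `FIGSHARE_DATA_DIR` is a hypothetical environment variable and `resolve_data_dir` is a hypothetical helper, with a folder under the home directory assumed as the fallback.

```python
import os
from pathlib import Path


def resolve_data_dir(env=os.environ, default=Path.home() / "figshare"):
    """Return the data directory for the current user.

    FIGSHARE_DATA_DIR is a hypothetical variable name chosen for this sketch;
    if it is not set, the (assumed) default under the home directory is used.
    """
    return Path(env.get("FIGSHARE_DATA_DIR", default)).expanduser()


data_dir = resolve_data_dir()
print(data_dir)
```

Calling `os.chdir(data_dir)` afterwards would have the same effect as the `%cd` cell, provided the directory already exists.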

In [None]:
%cd /Users/spencergerlach/Desktop/figshare

In [None]:
# Complete metadata required for API request

article_id = 14096681
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "figshare-nswrain" # update depending on user

In [None]:
# GET request

response = requests.get(url, headers=headers)
response.raise_for_status()  # stop early if the API call failed
data = response.json()
files = data["files"]

- Now, download the file `data.zip`
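
The file-selection logic can be checked separately from the network call. The sketch below assumes metadata entries shaped like those in `data["files"]` (a `name` and a `download_url` key); `select_downloads` is a hypothetical helper and the example URLs are placeholders, not real figshare links.

```python
def select_downloads(files, wanted):
    """Map each wanted file name to its download URL.

    Raises KeyError if a wanted file is missing from the metadata, so a typo
    in the file name does not silently result in nothing being downloaded.
    """
    urls = {f["name"]: f["download_url"] for f in files if f["name"] in wanted}
    missing = set(wanted) - set(urls)
    if missing:
        raise KeyError(f"not found in article metadata: {sorted(missing)}")
    return urls


# Made-up metadata in the same shape as data["files"]; URLs are placeholders.
example_files = [
    {"name": "data.zip", "download_url": "https://example.org/files/1"},
    {"name": "README.md", "download_url": "https://example.org/files/2"},
]
print(select_downloads(example_files, ["data.zip"]))
```

Each returned URL could then be passed to `urlretrieve` as in the next cell.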

In [None]:
%%time
files_to_dl = ["data.zip"]
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], os.path.join(output_directory, file["name"]))

In [None]:
%%time
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), "r") as f:
    f.extractall(output_directory)

In [None]:
# check all the file names
%ls -ltr /Users/spencergerlach/Desktop/figshare/figshare-nswrain/

# Part 4: Combine the Files with Python

- Spencer Gerlach

In [None]:
df_test = pd.read_csv("/Users/spencergerlach/Desktop/figshare/figshare-nswrain/AWI-ESM-1-1-LR_daily_rainfall_NSW.csv")
df_test.head()

In [None]:
df_test2 = pd.read_csv("/Users/spencergerlach/Desktop/figshare/figshare-nswrain/ACCESS-CM2_daily_rainfall_NSW.csv")
df_test2.head()

- From these results, we can now proceed with reading and combining all CSVs (except `observed_daily_rainfall_SYD.csv`).

- The combined file keeps the same columns as the test CSVs above, plus a `model` column.
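
The combining step can be tried on small stand-in files first. This is a sketch under stated assumptions: `combine_csvs` is a hypothetical helper, the model name is taken to be the part of the file name before the first underscore, and files are excluded by name instead of being removed from the folder by hand. Unlike the cell below, it also resets the row index with `ignore_index=True`.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_csvs(folder, exclude=("observed_daily_rainfall_SYD.csv", "combined_data.csv")):
    """Concatenate every CSV in `folder`, tagging each row with its model name."""
    paths = sorted(
        p for p in glob.glob(os.path.join(folder, "*.csv"))
        if os.path.basename(p) not in exclude
    )
    frames = (
        # Model name = file name up to the first underscore
        pd.read_csv(p).assign(model=os.path.basename(p).split("_")[0])
        for p in paths
    )
    return pd.concat(frames, ignore_index=True)


# Tiny made-up files standing in for the real data
with tempfile.TemporaryDirectory() as tmp:
    for name in ["ACCESS-CM2_daily_rainfall_NSW.csv", "observed_daily_rainfall_SYD.csv"]:
        pd.DataFrame({"rain (mm/day)": [0.1, 0.2]}).to_csv(os.path.join(tmp, name), index=False)
    combined = combine_csvs(tmp)

print(combined["model"].unique())
```

Excluding `combined_data.csv` by name also keeps a second run from reading its own output back in.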

In [None]:
%%time
# Combine into one CSV
files = glob.glob('/Users/spencergerlach/Desktop/figshare/figshare-nswrain/*.csv')
# observed_daily_rainfall_SYD.csv was manually removed from the data folder
# Take the model name from the file name (text before the first underscore), not from the full path
df = pd.concat((pd.read_csv(file).assign(model=os.path.basename(file).split("_")[0]) for file in files))
df.to_csv("/Users/spencergerlach/Desktop/figshare/figshare-nswrain/combined_data.csv") # Use absolute path for now

### Part 4: Time Taken to Combine CSV files

| Team Member | Operating System | RAM (GB) | Processor | Is SSD | Time Taken |
|-------------|------------------|----------|-----------|--------|------------|
|  Spencer    |   macOS 12.6     |    8     | Intel i5  |   Yes  |  16m 5s    |
|  Xinru      |   macOS 13.2     |   16     | Apple M2  |   Yes  |  3m 50s    |
|             |                  |          |           |        |            |
|             |                  |          |           |        |            |


# Part 5: Load the combined CSV to memory and perform a simple EDA

- Xinru Lu

1. Change the dtype of the data
2. Load only the columns that we want
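
To see why the dtype change helps, the sketch below compares the memory footprint of `float64` and `float32` columns. The values are synthetic stand-ins for the coordinate columns, not the real data.

```python
import numpy as np
import pandas as pd

# Synthetic coordinates standing in for the lat/lon columns
n = 1_000
df64 = pd.DataFrame({
    "lat_min": np.linspace(-36.0, -30.0, n),
    "lon_min": np.linspace(140.0, 153.0, n),
})
df32 = df64.astype(np.float32)

# Bytes used by the column data only (index excluded)
bytes64 = df64.memory_usage(index=False).sum()
bytes32 = df32.memory_usage(index=False).sum()
print(bytes64, bytes32)
```

Each `float32` value takes 4 bytes instead of 8, so these columns need half the memory, at the cost of roughly 7 significant digits of precision instead of about 16.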

In [None]:
import numpy as np


# local path to combined data (to be updated per user)
combined_data_path = 'data/figshare-nswrain/combined_data.csv'

# define column dtypes and columns to load
column_dtype = {'lat_min': np.float32, 'lat_max': np.float32, 'lon_min': np.float32, 'lon_max': np.float32, 'model': str}
use_columns = ['time', 'lat_min', 'lat_max', 'lon_min', 'lon_max', 'rain (mm/day)', 'model']

In [None]:
%%time

df = pd.read_csv(combined_data_path, dtype=column_dtype, parse_dates=['time'], usecols=use_columns)
print(df[['lat_min', 'lat_max', 'lon_min', 'lon_max']].describe())

### Part 5: Time Taken to Load CSV files

| Team Member | Operating System | RAM (GB) | Processor | Is SSD | Time Taken |
|-------------|------------------|----------|-----------|--------|------------|
|  Spencer    |   macOS 12.6     |    8     | Intel i5  |   Yes  |            |
|  Xinru      |   macOS 13.2     |   16     | Apple M2  |   Yes  |  46.9 s    |
|             |                  |          |           |        |            |
|             |                  |          |           |        |            |


# Part 6: Perform a simple EDA in R

- Xinru Lu

I chose **Arrow exchange** because it minimizes the time-consuming serialization/deserialization step when passing the data from Python to R.

In [None]:
%reset -f

In [None]:
%load_ext rpy2.ipython

In [None]:
import pyarrow.dataset as ds
import pyarrow as pa
import pandas as pd
import pyarrow 
from pyarrow import csv
import rpy2_arrow.pyarrow_rarrow as pyra

In [None]:
filepathparquet = "data/figshare-nswrain/combined_data.parquet"
filepathparquetr = "data/figshare-nswrain/combined_data_r.parquet"

In [None]:
%%time
# NOTE: `%reset -f` above clears `df`; reload the combined data before running this cell
# Convert the pandas DataFrame to a pyarrow Table
table = pa.Table.from_pandas(df)
# Convert the pyarrow Table to an R arrow Table
r_table = pyra.converter.py2rpy(table)