# Milestone 1 Worksheet

In [1]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd

## Downloading the data

In [2]:
article_id = 14096681  # this is the unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "figshare/"

In [3]:
response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)
files = data["files"]
files

[{'id': 26579150,
  'name': 'daily_rainfall_2014.png',
  'size': 58863,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26579150',
  'supplied_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'computed_md5': 'fd32a2ffde300a31f8d63b1825d47e5e'},
 {'id': 26579171,
  'name': 'environment.yml',
  'size': 192,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26579171',
  'supplied_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'computed_md5': '060b2020017eed93a1ee7dd8c65b2f34'},
 {'id': 26586554,
  'name': 'README.md',
  'size': 5422,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26586554',
  'supplied_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'computed_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c'},
 {'id': 26766812,
  'name': 'data.zip',
  'size': 814041183,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26766812',
  'supplied_md5': 'b517383f76e77bd03755a63a8f

In [4]:
files_to_dl = ["data.zip"]  
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])

In [5]:
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)

In [6]:
os.remove("figshare/observed_daily_rainfall_SYD.csv")

In [7]:
%%time

use_cols = ["time", "lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)"]
files = glob.glob('figshare/*.csv')

df = pd.concat((pd.read_csv(file, index_col=0, usecols=use_cols)
                .assign(model=re.findall(r"/([^_]*)", file)[0])
                for file in files))

df.to_csv("figshare/combined_data.csv")

CPU times: user 5min 48s, sys: 9.82 s, total: 5min 57s
Wall time: 6min 2s



| Team Member  | Operating System | RAM  | Processor                             | Is SSD | Time Taken          |
| ------------ | ---------------- | ---- | ------------------------------------- | ------ | ------------------- |
| Junrong Zhu  | macOS Monterey   | 8GB  | CPU - Apple M1 chip 8-core            | Yes    | Total time 5min 57s |
| Amelia Tang  | macOS Monterey   | 8GB  | CPU - 2.2 GHz Dual-Core Intel Core i7 | Yes    | Total time 10min 1s |
| Chaoran Wang | macOS Big Sur    | 16GB | Intel Core i7-7700k                   | Yes    | Total time 5min 39s |

## EDA

To understand our data better, we performed the following exploratory data analysis steps:

- observing and changing the `dtype` of the data
- loading in chunks
- loading the columns of interest

We present the EDA first in `Python` and then in `R`.

### Python Section

In [8]:
df = pd.read_csv("figshare/combined_data.csv")

Reading in the data takes quite some time. Let's see how large the data is by checking its shape.

In [9]:
df.shape

(62467843, 7)

In [10]:
df.dtypes

time              object
lat_min          float64
lat_max          float64
lon_min          float64
lon_max          float64
rain (mm/day)    float64
model             object
dtype: object

As the output shows, `time` has dtype *object*. Converting it to `datetime64[ns]` is preferable because it lets us apply time-series functions to the values.

In [11]:
df['time'] = pd.to_datetime(df['time'])

In [14]:
# check on the dtype after converting

df.dtypes

time             datetime64[ns]
lat_min                 float64
lat_max                 float64
lon_min                 float64
lon_max                 float64
rain (mm/day)           float64
model                    object
dtype: object

`time` is now shown as *datetime64[ns]* instead of *object*.
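
The other two steps listed at the start of this EDA section, loading only the columns of interest and loading in chunks, might look roughly like the sketch below. It is not the code we ran on the full dataset: to stay self-contained it writes a tiny made-up CSV (`combined_sample.csv`, a hypothetical stand-in for `combined_data.csv`) with the same column names, and the `chunksize` of 2 is chosen only so that the toy file is split into more than one chunk.

```python
import os
import tempfile
from collections import Counter

import pandas as pd

# Build a tiny stand-in for combined_data.csv so the sketch runs anywhere.
# The column names mirror the worksheet; the values and model names are made up.
sample = pd.DataFrame(
    {
        "time": ["1889-01-01 12:00:00", "1889-01-02 12:00:00",
                 "1889-01-01 12:00:00", "1889-01-02 12:00:00"],
        "lat_min": [-35.0, -35.0, -36.0, -36.0],
        "lat_max": [-33.0, -33.0, -34.0, -34.0],
        "lon_min": [150.0, 150.0, 151.0, 151.0],
        "lon_max": [152.0, 152.0, 153.0, 153.0],
        "rain (mm/day)": [0.5, 1.5, 2.0, 4.0],
        "model": ["MODEL-A", "MODEL-A", "MODEL-B", "MODEL-B"],
    }
)
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "combined_sample.csv")
sample.to_csv(csv_path, index=False)

# Loading the columns of interest: only three columns are read from disk,
# and `time` is parsed to a datetime dtype while reading.
subset = pd.read_csv(
    csv_path,
    usecols=["time", "rain (mm/day)", "model"],
    parse_dates=["time"],
)

# Loading in chunks: only `chunksize` rows are held in memory at a time,
# and a running row count per model is accumulated across chunks.
rows_per_model = Counter()
for chunk in pd.read_csv(csv_path, usecols=["model"], chunksize=2):
    rows_per_model.update(chunk["model"].value_counts().to_dict())

print(subset.dtypes)
print(dict(rows_per_model))
```

On the real file, a much larger `chunksize` would be appropriate, and the per-chunk work can be any aggregation that can be combined across chunks, such as counts or sums.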

### R Section

... Reasoning of the approach ...

## Challenges

1. One challenge we had with Q5 was the long running time. For example, we wanted a general overview of the dataframe using `.info()`, as we did in other courses. However, it took a long time to output the dtype of each variable along with other information that we were not particularly interested in. As an alternative, we used `.dtypes` to get the data type of each column, and it returned the results immediately.

2. 