# Milestone 1 Notebook

*Note: Steps 1 (team work contract) and 2 (creating repository) for this milestone are not included in this notebook.*

## Imports

In [1]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd

## 3. Downloading the data

### Download using figshare's API

In [None]:
# figshare article metadata
article_id = 14096681  
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}

# directories for output
output_directory = "figsharerain/"
output_directory_files = "figsharerain/data/"

# get the file listing from figshare and fail early on an HTTP error
response = requests.get(url, headers=headers)
response.raise_for_status()
data = response.json()
files = data["files"]

files

In [None]:
%%time

files_to_dl = ['data.zip']

# download the selected files; this takes some time
os.makedirs(output_directory, exist_ok=True)
for file in files:
    if file["name"] in files_to_dl:
        urlretrieve(file["download_url"], os.path.join(output_directory, file["name"]))

### Extract and view files

In [None]:
%%time

# extract files
os.makedirs(output_directory_files, exist_ok=True)
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory_files)
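
To view what was extracted, the archive's member names can be listed right after extraction. The following is a minimal sketch on a throwaway archive; the helper name `extract_and_list` and the demo file names are hypothetical and only stand in for `data.zip` and its contents.

```python
import os
import tempfile
import zipfile


def extract_and_list(zip_path, dest_dir):
    """Extract an archive into dest_dir and return its member names, sorted."""
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(dest_dir)
        return sorted(zf.namelist())


# build a tiny demo archive so the sketch runs without the real download
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("a.csv", "x\n1\n")
        zf.writestr("b.csv", "x\n2\n")
    extracted = extract_and_list(demo_zip, os.path.join(tmp, "data"))
    on_disk = sorted(os.listdir(os.path.join(tmp, "data")))

print(extracted)
```

With the real data, the same call would take `os.path.join(output_directory, "data.zip")` and `output_directory_files` as arguments.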

## 4. Combine data CSVs with Pandas

### Combine CSVs and add "model" column

In [None]:
%%time
# files to combine: skip the observed data and any combined file left by an earlier run
exclude = {'observed_daily_rainfall_SYD.csv', 'combined_data.csv'}
files = [
    f for f in glob.glob('figsharerain/data/*.csv')
    if os.path.basename(f) not in exclude
]

# combine with pandas; the model name is the part of the file name before the first underscore
df = pd.concat(
    pd.read_csv(file, index_col=0).assign(model=os.path.basename(file).split('_')[0])
    for file in files
)

# save combined file
df.to_csv("figsharerain/data/combined_data.csv")
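
The "model" column is derived from each file name: everything before the first underscore is treated as the model name. The sketch below isolates that step so it can be checked on its own. The helper `model_from_path` and the example model file name are hypothetical; only `observed_daily_rainfall_SYD.csv` and the model names themselves appear in this notebook.

```python
import os
import re


def model_from_path(path):
    """Return the part of the file name that precedes the first underscore."""
    name = os.path.basename(path)
    return re.findall(r"^[^_]*", name)[0]


# hypothetical model file name, following the pattern of the observed-data file
example = "figsharerain/data/MPI-ESM1-2-HR_daily_rainfall_NSW.csv"
print(model_from_path(example))
```

Taking the base name first avoids having to strip the directory prefix from the model column afterwards.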

In [None]:
df.head()

In [None]:
print(df.shape)

| Team Member | Operating System | RAM   | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:-----:|:---------:|:------:|:----------:|
| Nico        | Mac              | 32 GB | Intel i9  | Yes    | 7min 11s   |
| Kristin     | Mac              | 8 GB  | Intel i5  | Yes    | 9min 8s    |
| Jennifer    | Mac              | 8 GB  | Intel i5  | Yes    | 9min 10s   |
| Morgan      |                  |       |           |        |            |

*Summary of observations of runtimes*

TO ADD AT END

## 5. Load the combined CSV and perform EDA

First, we'll check whether the entire dataset can be loaded into memory.

In [2]:
%%time
df = pd.read_csv("figsharerain/data/combined_data.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62467843 entries, 0 to 62467842
Data columns (total 7 columns):
 #   Column         Dtype  
---  ------         -----  
 0   time           object 
 1   lat_min        float64
 2   lat_max        float64
 3   lon_min        float64
 4   lon_max        float64
 5   rain (mm/day)  float64
 6   model          object 
dtypes: float64(5), object(2)
memory usage: 3.3+ GB
CPU times: user 1min 11s, sys: 25.8 s, total: 1min 37s
Wall time: 1min 57s


| Team Member | Operating System | RAM   | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:-----:|:---------:|:------:|:----------:|
| Nico        | Mac              | 32 GB | Intel i9  | Yes    |            |
| Kristin     | Mac              | 8 GB  | Intel i5  | Yes    |            |
| Jennifer    | Mac              | 8 GB  | Intel i5  | Yes    | 1min 57s   |
| Morgan      |                  |       |           |        |            |

We were able to load the entire dataset (62,467,843 rows, 3.3+ GB in memory), although this was fairly slow at about 2 minutes of wall time.

One approach to make the data more manageable is to reduce the memory requirement by reducing the precision of the numerical columns from float64 to float32.

In [3]:
%%time
float_cols = ['lat_min', 'lat_max', 'lon_min', 'lon_max', 'rain (mm/day)']
df_float32 = pd.read_csv("figsharerain/data/combined_data.csv", dtype={col: 'float32' for col in float_cols})
df_float32.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62467843 entries, 0 to 62467842
Data columns (total 7 columns):
 #   Column         Dtype  
---  ------         -----  
 0   time           object 
 1   lat_min        float32
 2   lat_max        float32
 3   lon_min        float32
 4   lon_max        float32
 5   rain (mm/day)  float32
 6   model          object 
dtypes: float32(5), object(2)
memory usage: 2.1+ GB
CPU times: user 1min 8s, sys: 18.1 s, total: 1min 26s
Wall time: 1min 34s


| Team Member | Operating System | RAM   | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:-----:|:---------:|:------:|:----------:|
| Nico        | Mac              | 32 GB | Intel i9  | Yes    |            |
| Kristin     | Mac              | 8 GB  | Intel i5  | Yes    |            |
| Jennifer    | Mac              | 8 GB  | Intel i5  | Yes    | 1min 34s   |
| Morgan      |                  |       |           |        |            |

Changing the data type reduced the in-memory size of the data from 3.3+ GB to 2.1+ GB, but it shortened the loading time only slightly (from 1min 57s to 1min 34s of wall time).
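
The reported sizes can be checked with a little arithmetic. This is a rough sketch that assumes `df.info()` counts 8 bytes per float64 value, 4 bytes per float32 value, and 8 bytes per object pointer (the "+" indicates that the strings themselves are not counted), and that it reports sizes in 1024-based units. The helper `shallow_gb` is hypothetical.

```python
import numpy as np
import pandas as pd

N_ROWS = 62_467_843  # row count reported by df.info() above


def shallow_gb(n_rows, n_float, float_bytes, n_object):
    """Shallow footprint in 1024-based GB: float values plus 8-byte object pointers."""
    return n_rows * (n_float * float_bytes + n_object * 8) / 1024**3


est_float64 = shallow_gb(N_ROWS, n_float=5, float_bytes=8, n_object=2)
est_float32 = shallow_gb(N_ROWS, n_float=5, float_bytes=4, n_object=2)
print(f"float64: {est_float64:.1f} GB, float32: {est_float32:.1f} GB")

# the same halving of the numeric columns on a small synthetic frame
demo = pd.DataFrame(np.random.default_rng(0).random((1_000, 5)))
bytes_64 = int(demo.memory_usage(index=False).sum())
bytes_32 = int(demo.astype("float32").memory_usage(index=False).sum())
print(bytes_64, bytes_32)
```

Under these assumptions the estimates round to 3.3 GB and 2.1 GB, matching the two `info()` outputs. Only the five numeric columns shrink, which is why the total drops by roughly a third rather than by half.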

Next, let's check the number of observations per model. We can try loading only the "model" column to further reduce memory usage.

In [4]:
%%time
use_cols = ['model']
model = pd.read_csv("figsharerain/data/combined_data.csv", usecols=use_cols)
print(model.value_counts())

model           
MPI-ESM1-2-HR       5154240
TaiESM1             3541230
NorESM2-MM          3541230
CMCC-CM2-HR4        3541230
CMCC-CM2-SR5        3541230
CMCC-ESM2           3541230
SAM0-UNICON         3541153
FGOALS-f3-L         3219300
GFDL-CM4            3219300
GFDL-ESM4           3219300
EC-Earth3-Veg-LR    3037320
MRI-ESM2-0          3037320
BCC-CSM2-MR         3035340
MIROC6              2070900
ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
INM-CM5-0           1609650
INM-CM4-8           1609650
KIOST-ESM           1287720
FGOALS-g3           1287720
MPI-ESM1-2-LR        966420
NESM3                966420
AWI-ESM-1-1-LR       966420
MPI-ESM-1-2-HAM      966420
NorESM2-LM           919800
BCC-ESM1             551880
CanESM5              551880
dtype: int64
CPU times: user 36.2 s, sys: 4.78 s, total: 41 s
Wall time: 42.5 s


| Team Member | Operating System | RAM   | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:-----:|:---------:|:------:|:----------:|
| Nico        | Mac              | 32 GB | Intel i9  | Yes    |            |
| Kristin     | Mac              | 8 GB  | Intel i5  | Yes    |            |
| Jennifer    | Mac              | 8 GB  | Intel i5  | Yes    | 43s        |
| Morgan      |                  |       |           |        |            |

Loading a single column and summarizing it with `value_counts` was fairly quick: at 43s it took less than half the time needed to load the entire dataframe (1min 57s). However, repeating this column by column would re-read the CSV each time, so it may not be an efficient approach for EDA on all columns in the data.
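
Another way to keep memory use low, not used in this notebook, is to read the CSV in chunks and accumulate the counts as each chunk arrives. The sketch below runs on a small throwaway file; the helper `count_by_chunk`, the demo values, and the chunk size are all hypothetical choices.

```python
import os
import tempfile
from collections import Counter

import pandas as pd


def count_by_chunk(csv_path, column, chunksize=100_000):
    """Count the values of one column while holding only one chunk in memory."""
    counts = Counter()
    for chunk in pd.read_csv(csv_path, usecols=[column], chunksize=chunksize):
        # add this chunk's counts to the running totals
        for value, n in chunk[column].value_counts().items():
            counts[value] += int(n)
    return pd.Series(counts).sort_values(ascending=False)


# small demo file standing in for combined_data.csv
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = os.path.join(tmp, "demo.csv")
    pd.DataFrame({"model": ["A", "A", "B", "A", "C", "B"]}).to_csv(demo_csv, index=False)
    chunk_counts = count_by_chunk(demo_csv, "model", chunksize=2)

print(chunk_counts)
```

For the real file, the call would be `count_by_chunk("figsharerain/data/combined_data.csv", "model")`. Peak memory is then bounded by the chunk size rather than by the 62 million rows, though the file is still read once in full.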

We'll summarize all the numeric columns from the original combined dataframe below:

In [5]:
%%time
df.describe()

CPU times: user 14.6 s, sys: 9.67 s, total: 24.3 s
Wall time: 28.9 s


|       | lat_min    | lat_max    | lon_min    | lon_max    | rain (mm/day) |
|:------|-----------:|-----------:|-----------:|-----------:|--------------:|
| count | 59248540.0 | 62467840.0 | 59248540.0 | 62467840.0 | 59248540.0    |
| mean  | -33.10482  | -31.97757  | 146.9059   | 148.215    | 1.90117       |
| std   | 1.963549   | 1.992067   | 3.793784   | 3.809994   | 5.585735      |
| min   | -36.46739  | -36.0      | 140.625    | 141.25     | -3.807373e-12 |
| 25%   | -34.86911  | -33.66221  | 143.4375   | 145.0      | 3.838413e-06  |
| 50%   | -33.0      | -32.04188  | 146.875    | 148.125    | 0.06154947    |
| 75%   | -31.4017   | -30.15707  | 150.1875   | 151.3125   | 1.020918      |
| max   | -29.9      | -27.90606  | 153.75     | 155.625    | 432.9395      |


| Team Member | Operating System | RAM   | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:-----:|:---------:|:------:|:----------:|
| Nico        | Mac              | 32 GB | Intel i9  | Yes    |            |
| Kristin     | Mac              | 8 GB  | Intel i5  | Yes    |            |
| Jennifer    | Mac              | 8 GB  | Intel i5  | Yes    | 29s        |
| Morgan      |                  |       |           |        |            |

*Summary of observations of runtimes*



## 6. Perform EDA in R