# Daily Rainfall Prediction in Australia - 525 Group 28

This notebook is meant to run in the DSCI525 conda environment. Download the [conda environment file](https://github.ubc.ca/mds-2021-22/DSCI_525_web-cloud-comp_students/blob/master/525.yml), then create and activate the environment as follows.

```bash
conda env create -f 525.yml
conda activate 525
```

Before running the notebook, make sure you have cloned the [GitHub Repo](https://github.com/UBC-MDS/525-group28). Below we install and load some extra dependencies; you will need to restart the kernel after installing them for the first time.

# Import packages

In [1]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd
import numpy as np

# Downloading data 

In [2]:
# Necessary metadata
article_id = 14096681  # this is the unique identifier of the article
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "figshareclimate_data/"

In [3]:
response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)
files = data["files"]

In [4]:
%%time
files_to_dl = ["data.zip"]
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])

CPU times: total: 9.41 s
Wall time: 2min 44s


## Extract contents of zipped file

In [5]:
%%time
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(os.path.join(output_directory, "data"))

CPU times: total: 24.8 s
Wall time: 25.1 s


## Remove unused file

In [6]:
unused_file = os.path.join(
    output_directory,
    "data/observed_daily_rainfall_SYD.csv")
if os.path.exists(unused_file):
    os.remove(unused_file)

# Combine data CSVs

In [7]:
%%time
files = glob.glob('figshareclimate_data/data/*.csv')
df = pd.concat(
    (pd.read_csv(file, index_col=0, parse_dates=['time'])
     .assign(model=re.findall(r'[^\/]+(?=_daily_rainfall_NSW\.)', file)[0])
     for file in files)
)
df.to_csv("figshareclimate_data/combined_data.csv")

CPU times: total: 15min 13s
Wall time: 15min 23s


In [8]:
print(df.shape)

(62467843, 6)


In [9]:
df.head()

| time | lat_min | lat_max | lon_min | lon_max | rain (mm/day) | model |
|:-----|--------:|--------:|--------:|--------:|--------------:|:------|
| 1889-01-01 12:00:00 | -36.25 | -35.0 | 140.625 | 142.5 | 3.293256e-13 | data\ACCESS-CM2 |
| 1889-01-02 12:00:00 | -36.25 | -35.0 | 140.625 | 142.5 | 0.0 | data\ACCESS-CM2 |
| 1889-01-03 12:00:00 | -36.25 | -35.0 | 140.625 | 142.5 | 0.0 | data\ACCESS-CM2 |
| 1889-01-04 12:00:00 | -36.25 | -35.0 | 140.625 | 142.5 | 0.0 | data\ACCESS-CM2 |
| 1889-01-05 12:00:00 | -36.25 | -35.0 | 140.625 | 142.5 | 0.01047658 | data\ACCESS-CM2 |
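
The `model` values above carry a `data\` prefix because the pattern `[^\/]+` stops only at forward slashes, while `glob` returns backslash-separated paths on Windows. Below is a minimal sketch of one way to get the same label on every operating system, assuming the file names follow the `<model>_daily_rainfall_NSW.csv` pattern used in the combine step. `model_name` is a hypothetical helper, not part of the original notebook.

```python
import re

def model_name(path):
    """Return the model label from a path such as 'data/ACCESS-CM2_daily_rainfall_NSW.csv'."""
    # Exclude both '/' and '\' so Windows and POSIX paths give the same label.
    match = re.search(r"[^\\/]+(?=_daily_rainfall_NSW\.)", path)
    return match.group(0) if match else None

# POSIX-style path and the mixed-separator path that glob produces on Windows
print(model_name("figshareclimate_data/data/ACCESS-CM2_daily_rainfall_NSW.csv"))
print(model_name(r"figshareclimate_data/data\ACCESS-CM2_daily_rainfall_NSW.csv"))
```

Both calls print `ACCESS-CM2`. Passing `model_name(file)` to `.assign(model=...)` would be one way to use it when combining the files.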


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 62467843 entries, 1889-01-01 12:00:00 to 2014-12-31 12:00:00
Data columns (total 6 columns):
 #   Column         Dtype  
---  ------         -----  
 0   lat_min        float64
 1   lat_max        float64
 2   lon_min        float64
 3   lon_max        float64
 4   rain (mm/day)  float64
 5   model          object 
dtypes: float64(5), object(1)
memory usage: 3.3+ GB
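
The 3.3+ GB figure can be checked by hand. The five float64 columns, the datetime index, and the pointers of the object column each take 8 bytes per row, and the `+` signals that the string contents of `model` are not counted. The sketch below assumes a 64-bit Python build (8-byte object pointers) and uses the row count from `df.shape`.

```python
n_rows = 62_467_843          # from df.shape above
bytes_per_value = 8          # float64, datetime64[ns], and 64-bit object pointers

# 5 float64 columns + 1 object column (pointers only) + the DatetimeIndex
n_arrays = 5 + 1 + 1
shallow_bytes = n_rows * bytes_per_value * n_arrays
# pandas divides by 1024 at each step when formatting the size
print(f"{shallow_bytes / 1024**3:.1f} GB")        # prints 3.3 GB

# Storing the five float columns as float32 would halve their footprint
saved_bytes = n_rows * 5 * (8 - 4)
print(f"{saved_bytes / 1024**3:.1f} GB saved")    # prints 1.2 GB saved
```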


In [11]:
df.dtypes

lat_min          float64
lat_max          float64
lon_min          float64
lon_max          float64
rain (mm/day)    float64
model             object
dtype: object

## Combining the data CSVs on different machines

| Team Member           | Operating System | RAM  | Processor | Is SSD | Time taken |
|:---------------------:|:----------------:|:----:|:---------:|:------:|:----------:|
| Kingslin Lv           | Windows          | 16GB | i7        | Yes    | 15min 15s  |
| Sufang Tan            | Mac OS           | 16GB | i7        | Yes    | 08min 16s  |
| Amir Abbas Shojakhani | Mac OS           | 16GB | i7        | Yes    | 08min 50s  |

# EDA-Python

## 1st Approach - Changing dtype of the data

### Value counts with base data types

In [11]:
%%time
df.value_counts()

CPU times: total: 2min 34s
Wall time: 2min 38s


lat_min  lat_max  lon_min   lon_max   rain (mm/day)  model             
-30.625  -29.375  141.5625  143.4375  0.000000       data\ACCESS-ESM1-5    15271
-31.875  -30.625  141.5625  143.4375  0.000000       data\ACCESS-ESM1-5    13850
-30.625  -29.375  143.4375  145.3125  0.000000       data\ACCESS-ESM1-5    13615
-31.875  -30.625  143.4375  145.3125  0.000000       data\ACCESS-ESM1-5    12638
-33.125  -31.875  141.5625  143.4375  0.000000       data\ACCESS-ESM1-5    12112
                                                                           ...  
-34.000  -33.000  148.7500  150.0000  0.000702       data\GFDL-CM4             1
                                      0.000704       data\GFDL-CM4             1
                                                     data\GFDL-CM4             1
                                                     data\GFDL-CM4             1
-29.900  -29.100  152.7250  153.5250  199.089043     data\FGOALS-f3-L          1
Length: 55839634, dtype: int64

### Value counts with dtype conversion from float64 to float32

In [12]:
float_cols = ['lat_min', 'lat_max', 'lon_min', 'lon_max']
# astype returns a new frame, so df itself keeps its float64 columns
df_conv = df.astype({col: np.float32 for col in float_cols})

In [13]:
%%time
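# Time the float32 copy; the run recorded below called df.value_counts() on the original float64 frame
df_conv.value_counts()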

CPU times: total: 2min 13s
Wall time: 2min 16s


lat_min  lat_max  lon_min   lon_max   rain (mm/day)  model             
-30.625  -29.375  141.5625  143.4375  0.000000       data\ACCESS-ESM1-5    15271
-31.875  -30.625  141.5625  143.4375  0.000000       data\ACCESS-ESM1-5    13850
-30.625  -29.375  143.4375  145.3125  0.000000       data\ACCESS-ESM1-5    13615
-31.875  -30.625  143.4375  145.3125  0.000000       data\ACCESS-ESM1-5    12638
-33.125  -31.875  141.5625  143.4375  0.000000       data\ACCESS-ESM1-5    12112
                                                                           ...  
-34.000  -33.000  148.7500  150.0000  0.000702       data\GFDL-CM4             1
                                      0.000704       data\GFDL-CM4             1
                                                     data\GFDL-CM4             1
                                                     data\GFDL-CM4             1
-29.900  -29.100  152.7250  153.5250  199.089043     data\FGOALS-f3-L          1
Length: 55839634, dtype: int64
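
One way to check what a float32 downcast buys is to compare `memory_usage()` before and after the conversion. The sketch below runs on a small synthetic frame (random values, not the rainfall data) and also downcasts the rain column, which is an assumption beyond what the cells above do. The names `toy` and `toy32` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the combined frame (random values, not real rainfall data)
rng = np.random.default_rng(0)
cols = ["lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)"]
toy = pd.DataFrame(rng.random((100_000, len(cols))), columns=cols)

# Downcast every float64 column in one call; astype returns a new frame
toy32 = toy.astype({c: "float32" for c in cols})

before = toy.memory_usage(index=False).sum()
after = toy32.memory_usage(index=False).sum()
print(f"float64: {before:,} bytes, float32: {after:,} bytes")
```

float32 keeps roughly seven significant decimal digits, so rainfall values that differ only beyond that precision can collapse into one value and change the result of `value_counts()`.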

## 2nd Approach - Use just the columns we want

In [14]:
# Keep only the coordinate columns and discard the time index
df_reduced = df[['lat_min', 'lat_max', 'lon_min', 'lon_max']]
df_reduced = df_reduced.reset_index(drop=True)
df_reduced.head()

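|   | lat_min | lat_max | lon_min | lon_max |
|--:|--------:|--------:|--------:|--------:|
| 0 | -36.25 | -35.0 | 140.625 | 142.5 |
| 1 | -36.25 | -35.0 | 140.625 | 142.5 |
| 2 | -36.25 | -35.0 | 140.625 | 142.5 |
| 3 | -36.25 | -35.0 | 140.625 | 142.5 |
| 4 | -36.25 | -35.0 | 140.625 | 142.5 |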


In [16]:
%%time
df_reduced.value_counts()

CPU times: total: 7.33 s
Wall time: 7.34 s


lat_min     lat_max     lon_min    lon_max  
-32.984293  -32.041885  148.12500  149.37500    275939
-32.041885  -31.099476  146.87500  148.12500    275939
                        143.12500  144.37500    275939
-32.984293  -32.041885  146.87500  148.12500    275939
-32.041885  -31.099476  144.37500  145.62500    275939
                                                 ...  
-33.000000  -32.000000  143.75000  145.00000     45990
                        142.50000  143.75000     45990
                        141.25000  142.50000     45990
-33.487232  -30.696652  150.46875  153.28125     45990
-29.900000  -29.100000  152.72500  153.52500     45990
Length: 897, dtype: int64
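
The cell above selects columns from a frame that is already in memory. To avoid parsing the unneeded columns in the first place, `pd.read_csv` accepts `usecols` and `dtype`. The sketch below reads a tiny in-memory CSV standing in for one of the data files; `sample_csv` and `df_small` are hypothetical names, and the two rows are copied from the head shown earlier.

```python
import io
import pandas as pd

# In-memory stand-in for one data file (hypothetical sample with two rows)
sample_csv = io.StringIO(
    "time,lat_min,lat_max,lon_min,lon_max,rain (mm/day)\n"
    "1889-01-01 12:00:00,-36.25,-35.0,140.625,142.5,3.293256e-13\n"
    "1889-01-02 12:00:00,-36.25,-35.0,140.625,142.5,0.0\n"
)

wanted = ["lat_min", "lat_max", "lon_min", "lon_max"]
# Only the listed columns are parsed, and they are stored as float32 directly
df_small = pd.read_csv(sample_csv, usecols=wanted, dtype={c: "float32" for c in wanted})

print(df_small.value_counts())
```

With the real files, the same two arguments could be passed to the `pd.read_csv` call inside the `pd.concat` generator.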

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken for EDA on base dataframe | Time taken for EDA after changing dtypes | Time taken for EDA after reducing features |
|:---------------------:|:----------------:|:----:|:---------:|:------:|:---------:|:---------:|:---------:|
| Kingslin Lv           | Windows | 16GB | i7 | Yes | 02min 34s | 02min 13s | 7.33s     |
| Sufang Tan            | Mac OS  | 16GB | i7 | Yes |           |           |           |
| Amir Abbas Shojakhani | Mac OS  | 16GB | i7 | Yes | 01min 47s | 01min 49s | 01min 53s |