# Rainfall Prediction Project

## Data Loading, combining and EDA

*Group 12*

------------

In [3]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd

## Download the data

In [2]:
article_id = 14096681
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "../data/raw/figsharerainfall/"

In [3]:
response = requests.get(url, headers=headers)
response.raise_for_status()  # stop early if the API call failed
data = response.json()
files = data["files"]

In [4]:
files_to_dl = ["data.zip"]
os.makedirs(output_directory, exist_ok=True)
for file in files:
    if file["name"] in files_to_dl:
        urlretrieve(file["download_url"], os.path.join(output_directory, file["name"]))

In [5]:
with zipfile.ZipFile(os.path.join(output_directory, files_to_dl[0]), 'r') as f:
    f.extractall(output_directory)
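The selection step above can also be written as a small helper that maps file names to download URLs. This is a hypothetical refactoring rather than the notebook's code: `select_download_urls` and the sample metadata (including its URLs) are invented for illustration, and the only assumption about the figshare response is the `name` and `download_url` keys already used above.

```python
def select_download_urls(file_entries, wanted_names):
    """Map each wanted file name to its download URL.

    `file_entries` is a list of dicts shaped like the `files` list above,
    i.e. each entry has a "name" and a "download_url" key.
    """
    wanted = set(wanted_names)
    return {
        entry["name"]: entry["download_url"]
        for entry in file_entries
        if entry["name"] in wanted
    }


# Hypothetical metadata for illustration only (not a real figshare response).
sample_files = [
    {"name": "data.zip", "download_url": "https://example.com/files/1"},
    {"name": "README.md", "download_url": "https://example.com/files/2"},
]

urls = select_download_urls(sample_files, ["data.zip"])
print(urls)
```

Keeping the selection separate from the download makes it easy to check which files would be fetched before any network transfer starts.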

## Combine the data

In [6]:
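%%time
exclude = "observed_daily_rainfall_SYD.csv"
files = glob.glob('../data/raw/figsharerainfall/*.csv')
df = pd.concat(
    # Tag each file's rows with the model name parsed from its file name.
    pd.read_csv(file, index_col=0)
      .assign(model=re.findall(r'[A-Z][^_]+', os.path.basename(file))[0])
    for file in files
    # Compare base names by value; `is not` tests object identity,
    # so it would not exclude the observed-data file.
    if os.path.basename(file) != exclude
)
df.to_csv("../data/processed/combined_data.csv")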

CPU times: user 9min 27s, sys: 16.8 s, total: 9min 44s
Wall time: 9min 56s
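The combine step relies on two details: the model label is parsed from each file name with a regular expression, and the observed-data file is filtered out. The sketch below isolates both on a few example paths. The helper `model_name` is hypothetical, and the `_daily_rainfall_NSW.csv` suffix on the model files is an assumption made for illustration.

```python
import os
import re

EXCLUDE = "observed_daily_rainfall_SYD.csv"


def model_name(path):
    """Return the model label encoded in a file name, e.g. 'BCC-CSM2-MR'."""
    # Work on the base name so directory names cannot match the pattern.
    base = os.path.basename(path)
    # First run that starts with a capital letter and stops at an underscore.
    return re.findall(r"[A-Z][^_]+", base)[0]


# Hypothetical paths; the suffix after the model name is an assumption.
paths = [
    "../data/raw/figsharerainfall/MPI-ESM-1-2-HAM_daily_rainfall_NSW.csv",
    "../data/raw/figsharerainfall/BCC-CSM2-MR_daily_rainfall_NSW.csv",
    "../data/raw/figsharerainfall/observed_daily_rainfall_SYD.csv",
]

# Compare base names by value (!=); `is not` compares object identity.
model_files = [p for p in paths if os.path.basename(p) != EXCLUDE]
labels = [model_name(p) for p in model_files]
print(labels)
```

Filtering on the base name matters because `glob` returns full relative paths, which never equal the bare file name.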


In [10]:
df = pd.read_csv("../data/processed/combined_data.csv")
df.head()

Unnamed: 0,time,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model
0,1889-01-01 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.244226e-13,MPI-ESM-1-2-HAM
1,1889-01-02 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.217326e-13,MPI-ESM-1-2-HAM
2,1889-01-03 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.498125e-13,MPI-ESM-1-2-HAM
3,1889-01-04 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.251282e-13,MPI-ESM-1-2-HAM
4,1889-01-05 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.270161e-13,MPI-ESM-1-2-HAM


In [8]:
df.tail()

Unnamed: 0,time,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model
19283042,1912-12-04 12:00:00,-33.644675,-32.523187,152.4375,153.5625,0.133328,BCC-CSM2-MR
19283043,1912-12-05 12:00:00,-33.644675,-32.523187,152.4375,153.5625,20.639483,BCC-CSM2-MR
19283044,1912-12-06 12:00:00,-33.644675,-32.523187,152.4375,153.5625,1.329393,BCC-CSM2-MR
19283045,1912-12-07 12:00:00,-33.644675,-32.523187,152.4375,153.5625,0.000261,BCC-CSM2-MR
19283046,1912-12-08 12:00:00,-33.644675,-32.523187,152.4375,153.5625,0.015799,BCC-CSM2-MR


## Combining the CSV files on different machines

- Compare observations:

| Team Member    | Operating System | RAM  | Processor                       | Is SSD | Time taken |
|:--------------:|:----------------:|:----:|:-------------------------------:|:------:|:----------:|
| Vera Cui       | macOS            | 16GB | M1                              | No     | 6min 39s   |
| Lynn Wu        | macOS            | 8GB  | M1                              | Yes    | 6min 7s    |
| Jasmine Ortega | macOS            | 8GB  | M1                              | Yes    | 9min 56s   |
| Maeve Shi      | macOS Big Sur    | 8GB  | 2.3 GHz Dual-Core Intel Core i5 | Yes    | 7min 30s   |

--------------

## Load CSV and perform EDA on different machines

#### Baseline `read_csv` time

In [8]:
%%time
df = pd.read_csv("../data/processed/combined_data.csv")

CPU times: user 1min 7s, sys: 17.1 s, total: 1min 24s
Wall time: 1min 37s


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62513863 entries, 0 to 62513862
Data columns (total 7 columns):
 #   Column         Dtype  
---  ------         -----  
 0   time           object 
 1   lat_min        float64
 2   lat_max        float64
 3   lon_min        float64
 4   lon_max        float64
 5   rain (mm/day)  float64
 6   model          object 
dtypes: float64(5), object(2)
memory usage: 3.3+ GB


| Team Member   | Operating System | RAM | Processor | Is SSD | Time taken |
|:-------------:|:----------------:|:---:|:---------:|:------:|:----------:|
| Vera Cui      |                  |     |           |        |            |
| Lynn Wu       |                  |     |           |        |            |
| Jasmine Ortega| macOS            | 8GB | M1        | Yes    | 1min 46s   |
| Yike Shi      |                  |     |           |        |            |

As is, the CSV file took 1min 46s to load in the run recorded in the table (the cell output above shows a wall time of 1min 37s). From `.info()` we can see that the DataFrame has 7 columns: 5 of dtype `float64` and 2 of dtype `object`. To reduce memory usage, we will convert the `float64` columns to `float32` or `float16`, both of which reduce the memory used, as shown below.

In [13]:
print(f"Memory usage with float64: {df.memory_usage().sum() / 1e6:.2f} MB")
print(f"Memory usage with float32: {df.astype('float32', errors='ignore').memory_usage().sum() / 1e6:.2f} MB")
print(f"Memory usage with float16: {df.astype('float16', errors='ignore').memory_usage().sum() / 1e6:.2f} MB")

Memory usage with float64: 3500.78 MB
Memory usage with float32: 2250.50 MB
Memory usage with float16: 1625.36 MB
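These figures can be reproduced by hand from the row count and the per-column byte widths. The sketch below assumes what `.info()` reports (62,513,863 rows, 5 numeric columns, 2 `object` columns) and that `memory_usage()` without `deep=True` counts each `object` value as an 8-byte pointer. It ignores the small index overhead, and `estimated_mb` is a helper name introduced here.

```python
N_ROWS = 62_513_863  # row count reported by df.info()


def estimated_mb(n_rows, bytes_per_row):
    """Shallow memory estimate in MB (1 MB = 1e6 bytes), ignoring the index."""
    return n_rows * bytes_per_row / 1e6


# 7 columns: 5 numeric columns plus 2 object columns stored as 8-byte pointers.
for label, float_bytes in [("float64", 8), ("float32", 4), ("float16", 2)]:
    bytes_per_row = 5 * float_bytes + 2 * 8
    print(f"{label}: {estimated_mb(N_ROWS, bytes_per_row):.2f} MB")
```

The two `object` columns account for a fixed 16 bytes per row in every case, which is why halving the float width does not halve the total.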


#### Approaches to reduce memory usage while performing the EDA: *changing datatype*
##### Convert to float32

In [14]:
%%time
dtypes = {
    "lat_min": "float32",
    "lat_max": "float32",
    "lon_min": "float32",
    "lon_max": "float32",
    "rain (mm/day)": "float32",
    "model": "string",
}

df_float32 = pd.read_csv('../data/processed/combined_data.csv', dtype=dtypes)

CPU times: user 1min 6s, sys: 13.9 s, total: 1min 19s
Wall time: 1min 26s


In [15]:
df_float32.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62513863 entries, 0 to 62513862
Data columns (total 7 columns):
 #   Column         Dtype  
---  ------         -----  
 0   time           object 
 1   lat_min        float32
 2   lat_max        float32
 3   lon_min        float32
 4   lon_max        float32
 5   rain (mm/day)  float32
 6   model          object 
dtypes: float32(5), object(2)
memory usage: 2.1+ GB
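The float columns shrink, but `time` and `model` are still reported as `object`. One option that this notebook does not use is pandas' `category` dtype for `model`, which holds only one distinct label per source file. The sketch below is an illustration on toy data, not a measurement on the rainfall dataset. The same dtype can be requested at load time with `dtype={"model": "category"}` in `read_csv`.

```python
import pandas as pd

# Toy stand-in for the `model` column: a few labels repeated many times.
models = pd.Series(["MPI-ESM-1-2-HAM", "BCC-CSM2-MR"] * 500, dtype="object")
as_category = models.astype("category")

# deep=True counts the string payloads, not just the 8-byte pointers.
object_bytes = models.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

A categorical column stores each distinct label once plus a small integer code per row, so the saving grows with the number of repeated values.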


##### Convert to float16

In [9]:
%%time
dtypes = {
    "lat_min": "float16",
    "lat_max": "float16",
    "lon_min": "float16",
    "lon_max": "float16",
    "rain (mm/day)": "float16",
    "model": "string",
}

df_float16 = pd.read_csv('../data/processed/combined_data.csv', dtype=dtypes)

CPU times: user 1min 10s, sys: 8.54 s, total: 1min 18s
Wall time: 1min 22s


In [17]:
df_float16 

Unnamed: 0,time,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model
0,1889-01-01 12:00:00,-35.43750,-33.56250,141.500,143.500,0.000000,MPI-ESM-1-2-HAM
1,1889-01-02 12:00:00,-35.43750,-33.56250,141.500,143.500,0.000000,MPI-ESM-1-2-HAM
2,1889-01-03 12:00:00,-35.43750,-33.56250,141.500,143.500,0.000000,MPI-ESM-1-2-HAM
3,1889-01-04 12:00:00,-35.43750,-33.56250,141.500,143.500,0.000000,MPI-ESM-1-2-HAM
4,1889-01-05 12:00:00,-35.43750,-33.56250,141.500,143.500,0.000000,MPI-ESM-1-2-HAM
...,...,...,...,...,...,...,...
62513858,2014-12-27 12:00:00,-30.15625,-29.21875,153.125,154.375,6.691406,SAM0-UNICON
62513859,2014-12-28 12:00:00,-30.15625,-29.21875,153.125,154.375,7.863281,SAM0-UNICON
62513860,2014-12-29 12:00:00,-30.15625,-29.21875,153.125,154.375,10.007812,SAM0-UNICON
62513861,2014-12-30 12:00:00,-30.15625,-29.21875,153.125,154.375,8.539062,SAM0-UNICON


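As demonstrated, changing `float64` to less precise datatypes reduced the load time slightly, from 1min 37s to under 1min 30s. Interestingly, `float16` (1min 22s) took almost as long to load as the more precise `float32` (1min 26s). Moving forward, we will use the `float16` DataFrame because it takes up less memory, accepting the rounding visible in the output above (for example, `lon_min` 141.5625 becomes 141.5 and very small rainfall values become 0).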
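The rounding in the `df_float16` output can be reproduced directly with NumPy, which helps when judging whether half precision is acceptable for a given column. The sketch below converts three values taken from the `float64` output earlier in the notebook. It is only a spot check, not an assessment of the whole dataset.

```python
import numpy as np

# Values taken from the float64 output shown earlier in the notebook.
samples = {
    "lat_min": -35.439867,
    "lon_min": 141.5625,
    "rain (mm/day)": 4.244226e-13,
}

for name, value in samples.items():
    as_half = float(np.float16(value))
    print(f"{name}: float64={value!r} -> float16={as_half!r}")

# float16 keeps about 3 significant decimal digits and tops out at 65504.
info = np.finfo(np.float16)
print(f"max={float(info.max)}, decimal precision={info.precision}")
```

Coordinates shift in the second decimal place and rainfall values below the smallest representable `float16` collapse to zero, so `float32` may be the safer choice where those differences matter.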

#### Approaches to reduce memory usage while performing the EDA: *Loading in chunks*

In [6]:
%%time

# With chunksize set, read_csv returns an iterator of DataFrames.
chunks = pd.read_csv("../data/processed/combined_data.csv", chunksize=10_000_000, iterator=True)
df = pd.concat(chunks)
df

CPU times: user 1min 10s, sys: 15.8 s, total: 1min 25s
Wall time: 1min 34s


Unnamed: 0,time,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model
0,1889-01-01 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.244226e-13,MPI-ESM-1-2-HAM
1,1889-01-02 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.217326e-13,MPI-ESM-1-2-HAM
2,1889-01-03 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.498125e-13,MPI-ESM-1-2-HAM
3,1889-01-04 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.251282e-13,MPI-ESM-1-2-HAM
4,1889-01-05 12:00:00,-35.439867,-33.574619,141.5625,143.4375,4.270161e-13,MPI-ESM-1-2-HAM
...,...,...,...,...,...,...,...
62513858,2014-12-27 12:00:00,-30.157068,-29.214660,153.1250,154.3750,6.689683e+00,SAM0-UNICON
62513859,2014-12-28 12:00:00,-30.157068,-29.214660,153.1250,154.3750,7.862555e+00,SAM0-UNICON
62513860,2014-12-29 12:00:00,-30.157068,-29.214660,153.1250,154.3750,1.000503e+01,SAM0-UNICON
62513861,2014-12-30 12:00:00,-30.157068,-29.214660,153.1250,154.3750,8.541592e+00,SAM0-UNICON
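Because `pd.concat` reassembles every chunk, the full DataFrame still ends up in memory at the end of the cell above. When only a summary is needed, each chunk can be reduced and discarded instead. The sketch below shows that pattern on a tiny in-memory CSV with made-up values. The per-model mean is just an example statistic, and the `totals` bookkeeping is an assumption about how one might structure it.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for combined_data.csv (values are made up).
csv_text = """time,rain (mm/day),model
1889-01-01,0.0,A
1889-01-02,2.0,A
1889-01-03,4.0,B
1889-01-04,6.0,B
1889-01-05,8.0,B
"""

totals = {}  # model -> [sum of rainfall, row count]
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
for chunk in reader:
    # Reduce each chunk to per-model sums and counts, then discard it.
    grouped = chunk.groupby("model")["rain (mm/day)"].agg(["sum", "count"])
    for model, row in grouped.iterrows():
        running = totals.setdefault(model, [0.0, 0])
        running[0] += row["sum"]
        running[1] += int(row["count"])

mean_rain = {model: s / n for model, (s, n) in totals.items()}
print(mean_rain)
```

Accumulating sums and counts, rather than averaging per-chunk means, keeps the result exact when a model's rows are split across chunks.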


Loading the data in chunks of 10,000,000 rows reduced the loading time only marginally, from 1min 37s to 1min 34s. Let's combine the `float16` strategy with loading in chunks!

In [10]:
%%time

dtypes = {
    "lat_min": "float16",
    "lat_max": "float16",
    "lon_min": "float16",
    "lon_max": "float16",
    "rain (mm/day)": "float16",
    "model": "string",
}

# Read in chunks with the reduced dtypes, then stitch the chunks together.
chunks = pd.read_csv("../data/processed/combined_data.csv", chunksize=10_000_000, iterator=True, dtype=dtypes)
final_df = pd.concat(chunks)

CPU times: user 1min 13s, sys: 9.08 s, total: 1min 23s
Wall time: 1min 28s


In [8]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62513863 entries, 0 to 62513862
Data columns (total 7 columns):
 #   Column         Dtype  
---  ------         -----  
 0   time           object 
 1   lat_min        float64
 2   lat_max        float64
 3   lon_min        float64
 4   lon_max        float64
 5   rain (mm/day)  float64
 6   model          object 
dtypes: float64(5), object(2)
memory usage: 3.3+ GB


We successfully reduced the load time from 1min 46s to 1min 28s. 


**Optimized data loading:**

| Team Member   | Operating System | RAM | Processor | Is SSD | Time taken |
|:-------------:|:----------------:|:---:|:---------:|:------:|:----------:|
| Vera Cui      |                  |     |           |        |            |
| Lynn Wu       |                  |     |           |        |            |
| Jasmine Ortega| macOS            | 8GB | M1        | Yes    | 1min 28s   |
| Yike Shi      |                  |     |           |        |            |

#### EDA