# MDS DSCI 525 - Group 15 Milestone 1

**Authors**: Lennon Lok Lam Au-Yeung, Ken Wang, Ty Andrews, Peng Zhang

## Step 0 Importing libraries

In [None]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd

## Step 1 Downloading the data via API

Navigate to the directory on your computer where you would like to download the files.

In [None]:
%cd ~/MDS/525_labs/figshareexp
## Change it to the location that you want to download your files to.

In [None]:
# Necessary metadata
article_id = 14096681  # this is the unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "figsharerainfall/"

In [None]:
response = requests.get(url, headers=headers)
response.raise_for_status()  # stop early if the API call failed
data = response.json()       # metadata for the whole article
files = data["files"]        # metadata for the files only, which is what we want
files
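
The file selection used in the next cell can be made a little more defensive. The sketch below is only an illustration under stated assumptions: `pick_files` is a hypothetical helper that is not part of this notebook, and `sample_files` is made-up metadata that mimics only the `name` and `download_url` keys used here.

```python
def pick_files(files, wanted):
    """Return the metadata entries whose "name" is in `wanted`.

    Raises a KeyError if any wanted file is missing, so a typo in the
    file name fails loudly instead of silently downloading nothing.
    """
    by_name = {entry["name"]: entry for entry in files}
    missing = [name for name in wanted if name not in by_name]
    if missing:
        raise KeyError(f"not listed in the article metadata: {missing}")
    return [by_name[name] for name in wanted]


# Made-up metadata for illustration only (placeholder URLs).
sample_files = [
    {"name": "README.md", "download_url": "https://example.com/readme"},
    {"name": "data.zip", "download_url": "https://example.com/data"},
]

selected = pick_files(sample_files, ["data.zip"])
print([entry["name"] for entry in selected])
```

With the real metadata, the same helper could be called as `pick_files(files, files_to_dl)` before any download starts.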

In [None]:
%%time
files_to_dl = ["data.zip"]
os.makedirs(output_directory, exist_ok=True)
for file in files:
    if file["name"] in files_to_dl:
        urlretrieve(file["download_url"], os.path.join(output_directory, file["name"]))

In [None]:
%%time
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)
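
Before extracting a large archive, it can help to list its members and their uncompressed size. The following is a minimal sketch that builds a tiny in-memory archive so it runs without the real `data.zip`; the member names are placeholders, not the actual contents of the figshare archive.

```python
import io
import zipfile

# Build a tiny in-memory archive so the example runs without the real data.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("modelA_daily_rainfall_NSW.csv", "time,rain (mm/day)\n")
    zf.writestr("modelB_daily_rainfall_NSW.csv", "time,rain (mm/day)\n")

# List the CSV members and the total uncompressed size before extracting anything.
with zipfile.ZipFile(buffer) as zf:
    csv_members = [info.filename for info in zf.infolist()
                   if info.filename.endswith(".csv")]
    total_bytes = sum(info.file_size for info in zf.infolist())

print(csv_members, total_bytes)
```

For the real archive, replacing `buffer` with `os.path.join(output_directory, "data.zip")` should give the same kind of listing.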

In [None]:
%ls -ltr figsharerainfall

## Step 2 Combining data CSVs

Combine the CSV files into one file. Note that `observed_daily_rainfall_SYD.csv` has been manually removed as per the Milestone 1 requirement.

In [None]:
%%time
# Merge the CSVs with plain pandas and add a "model" column taken from
# the file name (the text before the first underscore).
use_cols = ["time", "lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)"]
files = [file for file in glob.glob("figsharerainfall/*.csv")
         if os.path.basename(file) != "combined_data.csv"]  # skip earlier output on re-runs
df = pd.concat(
    pd.read_csv(file, index_col=0, usecols=use_cols)
      .assign(model=os.path.basename(file).split("_")[0])  # independent of the path separator
    for file in files
)
df.to_csv("figsharerainfall/combined_data.csv")

Compare the time taken to combine the CSVs on each team member's local computer. The results are in the following table.

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|
| Lennon Lok Lam | macOS Ventura 13.2.1 | 16GB | Apple M1 Pro | Yes | 3min 25s |
| Ken | Ubuntu 18.04 | 4GB + 8GB swap | Intel N4020 @ 1.10GHz | Yes | 28min |
| Ty | Windows 11 | 32GB | Intel 11th Gen i7 | Yes | 13min 41s |
| Peng | macOS Ventura 13.2.1 | 16GB | Apple M2 | Yes | 3min 8s |

From the table above, the machines with Apple ARM processors combined the files much faster than Ty's Windows machine, even though Ty's machine has more RAM and a recent-generation CPU. This might be due to the efficiency of the Apple chips and how well macOS is optimized for them. All of our computers have SSDs, so we cannot compare SSD and HDD performance.

Challenges faced on Ubuntu 18.04, Intel N4020 @ 1.10GHz:
Combining the files takes a lot of memory. Since my laptop only has 4GB of RAM, the notebook kernel crashed when it ran out of memory. After I added 8GB of swap to the system, the combining code ran fine. The processing took 28 minutes, and I think it was so slow mainly because it relied heavily on swap.
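
One possible way to lower peak memory in the combine step is to append each CSV to the output file one at a time instead of concatenating everything first. The sketch below is an assumption, not the method used for the timings above, and it has not been benchmarked on the full dataset. The stand-in file names, dates, and values are made up.

```python
import os
import tempfile

import pandas as pd

# Create two tiny stand-in CSVs; the file names and values are made up.
workdir = tempfile.mkdtemp()
for model, rain in [("modelA", 1.5), ("modelB", 2.5)]:
    pd.DataFrame(
        {"time": ["2000-01-01", "2000-01-02"], "rain (mm/day)": [rain, rain]}
    ).to_csv(os.path.join(workdir, f"{model}_daily_rainfall_NSW.csv"), index=False)

combined_path = os.path.join(workdir, "combined_data.csv")
sources = sorted(
    name for name in os.listdir(workdir) if name.endswith("_NSW.csv")
)

# Append one file at a time so only a single file is held in memory.
for i, name in enumerate(sources):
    part = pd.read_csv(os.path.join(workdir, name))
    part["model"] = name.split("_")[0]
    part.to_csv(
        combined_path,
        mode="w" if i == 0 else "a",   # overwrite first, then append
        header=(i == 0),               # write the header only once
        index=False,
    )

combined = pd.read_csv(combined_path)
print(combined["model"].value_counts().to_dict())
```

The trade-off is that the combined DataFrame is never available in memory, so any later analysis has to read `combined_data.csv` back from disk.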

## Step 3 Load combined CSV to memory and perform a simple EDA in Python

We tried two approaches to reduce memory usage: changing the `dtype` of our data and loading the data in chunks.

In [None]:
df = pd.read_csv("figsharerainfall/combined_data.csv", index_col="time")
float_cols = ["lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/day)"]
df2 = pd.read_csv("figsharerainfall/combined_data.csv", index_col="time",
                  dtype={col: "float32" for col in float_cols})

In [None]:
df.info()

In [None]:
df2.info()

In [None]:
print(f"Memory usage with float64: {df.memory_usage().sum() / 1e6:.2f} MB")
print(f"Memory usage with float32: {df2.memory_usage().sum() / 1e6:.2f} MB")

As the output above shows, using `float32` results in lower memory usage than the default `float64`.
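
The saving comes from the storage size per value: a `float64` takes 8 bytes and a `float32` takes 4. The small sketch below illustrates this on a synthetic column (the values and row count are arbitrary, not taken from the rainfall data) and also shows the precision that is given up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in column; values and size are arbitrary.
n_rows = 1_000
df64 = pd.DataFrame({"rain (mm/day)": np.linspace(0.0, 50.0, n_rows)})
df32 = df64.astype({"rain (mm/day)": "float32"})

# index=False counts only the column data, not the index.
bytes64 = df64.memory_usage(index=False).sum()
bytes32 = df32.memory_usage(index=False).sum()
print(bytes64, bytes32)  # 8 bytes vs 4 bytes per value

# The saving costs precision: float32 keeps roughly 7 significant digits.
max_error = (
    df64["rain (mm/day)"] - df32["rain (mm/day)"].astype("float64")
).abs().max()
print(max_error)
```

Columns that are not floats, such as the `model` strings and the `time` index, are unaffected by this change, so the total saving for the full DataFrame is smaller than one half.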

Next, we loaded the data and ran the EDA the way we usually do.

In [None]:
%%time
df = pd.read_csv("figsharerainfall/combined_data.csv", index_col = 'time')
print(df["model"].value_counts())

This time we loaded only the `model` column in chunks and counted the values chunk by chunk.

In [None]:
%%time
counts = pd.Series(dtype=int)
for chunk in pd.read_csv("figsharerainfall/combined_data.csv",
                         chunksize=10_000_000, usecols=["model"]):
    counts = counts.add(chunk["model"].value_counts(), fill_value=0)
print(counts.astype(int).sort_values(ascending=False))

As the comparison above shows, reading only the column we need and loading it in chunks reduced the time required to complete the EDA.
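
Accumulating per-chunk counts gives the same totals as counting on the fully loaded data. The sketch below checks this on a tiny synthetic CSV; the model names and the chunk size of 4 are placeholders chosen only to force several chunks.

```python
import io

import pandas as pd

# Tiny synthetic CSV; the model names are placeholders.
csv_text = "time,model\n" + "".join(
    f"t{i},{'modelA' if i % 3 else 'modelB'}\n" for i in range(10)
)

# Full load, as in the first timing cell.
full_counts = pd.read_csv(io.StringIO(csv_text))["model"].value_counts()

# Chunked load: accumulate per-chunk counts, aligning on the model name.
chunk_counts = pd.Series(dtype="int64")
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["model"], chunksize=4):
    chunk_counts = chunk_counts.add(chunk["model"].value_counts(), fill_value=0)
chunk_counts = chunk_counts.astype("int64")

print(chunk_counts.sort_index().to_dict())  # {'modelA': 6, 'modelB': 4}
```

`fill_value=0` matters here: without it, a model that is absent from one chunk would turn its running total into `NaN`.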

Compare the time taken by `value_counts` on each team member's local computer. The results are in the following table.

| Team Member | Operating System | RAM | Processor | Is SSD | Time taken |
|:-----------:|:----------------:|:---:|:---------:|:------:|:----------:|
| Lennon | macOS Ventura 13.2.1 | 16GB | Apple M1 Pro | Yes | 6s |
| Ken | Ubuntu 18.04 | 4GB + 8GB swap | Intel N4020 @ 1.10GHz | Yes | |
| Ty | Windows 11 | 32GB | Intel 11th Gen i7 | Yes | |
| Peng | macOS Ventura 13.2.1 | 16GB | Apple M2 | Yes | 16s |

## Step 4 Perform a simple EDA in R

In [None]:
import os
os.environ['R_HOME'] = '/opt/miniconda3/envs/525_2023/lib/R'

In [None]:
%load_ext rpy2.ipython

In [None]:
%cd ~/Desktop/MDS/Block6/DSCI525/figsharerainfall