# DSCI 525 - Web and Cloud Computing
## Milestone 1: Tackling big data on your laptop
### Group 14
Group Members: Sasha Babicki, Cheuk Ho, Sakshi Jain, Zeliha Ural Merpez

#### 1. Download the data
1. Download the data from figshare to your local computer using the figshare API (you can use the `requests` library).
2. Extract the zip file programmatically, similar to how we did it in class.

#### Note: The code blocks in this section follow the lecture notes.

In [2]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd
from memory_profiler import memory_usage

In [3]:
%load_ext rpy2.ipython
%load_ext memory_profiler



In [5]:
article_id = 14096681 # this is the unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "figshareairline/"

In [6]:
response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)  # this contains all of the article's metadata, feel free to check it out
files = data["files"]             # this is just the data about the files, which is what we want
files

[{'is_link_only': False,
  'name': 'daily_rainfall_2014.png',
  'supplied_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'computed_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'id': 26579150,
  'download_url': 'https://ndownloader.figshare.com/files/26579150',
  'size': 58863},
 {'is_link_only': False,
  'name': 'environment.yml',
  'supplied_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'computed_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'id': 26579171,
  'download_url': 'https://ndownloader.figshare.com/files/26579171',
  'size': 192},
 {'is_link_only': False,
  'name': 'README.md',
  'supplied_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'computed_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'id': 26586554,
  'download_url': 'https://ndownloader.figshare.com/files/26586554',
  'size': 5422},
 {'is_link_only': False,
  'name': 'data.zip',
  'supplied_md5': 'b517383f76e77bd03755a63a8ff83ee9',
  'computed_md5': 'b517383f76e77bd03755a63a8ff83ee9',
  'id': 26766812,
  'download_url': 'https://

In [7]:
%%time
files_to_dl = ["data.zip"]  # feel free to add other files here
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])

Wall time: 1min 37s
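
The figshare metadata above lists a `computed_md5` for each file, so the download can be checked for corruption before it is extracted. The following is a minimal sketch of such a check; the helper name `md5_of_file` and its `block_size` parameter are hypothetical and not part of the lecture code. It hashes the file in blocks so that a large `data.zip` never has to fit in memory.

```python
import hashlib


def md5_of_file(path, block_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (EOF).
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Inside the download loop, `md5_of_file(output_directory + file["name"]) == file["computed_md5"]` should hold for an intact download.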


In [9]:
%%time
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)

Wall time: 58.4 s


#### 2. Combining data CSVs
1. Use one of the following options to combine data CSVs into a single CSV. (pandas, Dask)
2. When combining the CSV files, make sure to add an extra column called "model" that identifies the model (tip: you can populate this column from the file name, e.g., for the file name "SAM0-UNICON_daily_rainfall_NSW.csv", the model name is SAM0-UNICON).
3. Compare run times and memory usages of these options on different machines within your team, and summarize your observations in your milestone notebook.

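#### Note: The `model` column should contain only the model name (e.g., `ACCESS-CM2`); the regex below still needs to be adjusted to strip the directory prefix and the `_daily_rainfall_NSW` suffix.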

In [4]:
%%time
%memit
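# Combine all per-model CSVs into one DataFrame with plain pandas and time the merge.
# Note: `%memit` above is a line magic with no statement to profile, which likely
# explains the small increment reported below; `%%memit` would profile the whole cell.
# Note: the regex below keeps the `_daily_rainfall_NSW` suffix (and, with Windows
# path separators, the directory prefix) in the `model` column.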
import pandas as pd
files = glob.glob('figshareairline/*.csv')
df = pd.concat((pd.read_csv(file, index_col=0)
                .assign(model=re.findall(r'[^\/]+(?=\.)', file)[0])
                for file in files)
              )
df.to_csv("figshareairline/combined_data.csv")

peak memory: 130.89 MiB, increment: 0.18 MiB
Wall time: 14min 21s


In [5]:
%%sh
du -sh figshareairline/combined_data.csv

7.7G	figshareairline/combined_data.csv


In [6]:
%%time
df = pd.read_csv("figshareairline/combined_data.csv")

Wall time: 5min 27s


In [7]:
print(df.shape)

(62513863, 7)


In [8]:
df.head()

|  | time | lat_min | lat_max | lon_min | lon_max | rain (mm/day) | model |
|---|---|---|---|---|---|---|---|
| 0 | 1889-01-01 12:00:00 | -36.25 | -35.0 | 140.625 | 142.5 | 3.293256e-13 | figshareairline\ACCESS-CM2_daily_rainfall_NSW |
| 1 | 1889-01-02 12:00:00 | -36.25 | -35.0 | 140.625 | 142.5 | 0.0 | figshareairline\ACCESS-CM2_daily_rainfall_NSW |
| 2 | 1889-01-03 12:00:00 | -36.25 | -35.0 | 140.625 | 142.5 | 0.0 | figshareairline\ACCESS-CM2_daily_rainfall_NSW |
| 3 | 1889-01-04 12:00:00 | -36.25 | -35.0 | 140.625 | 142.5 | 0.0 | figshareairline\ACCESS-CM2_daily_rainfall_NSW |
| 4 | 1889-01-05 12:00:00 | -36.25 | -35.0 | 140.625 | 142.5 | 0.01047658 | figshareairline\ACCESS-CM2_daily_rainfall_NSW |


In [9]:
df['model'].nunique()

28

In [10]:
df['model'].unique()

array(['figshareairline\\ACCESS-CM2_daily_rainfall_NSW',
       'figshareairline\\ACCESS-ESM1-5_daily_rainfall_NSW',
       'figshareairline\\AWI-ESM-1-1-LR_daily_rainfall_NSW',
       'figshareairline\\BCC-CSM2-MR_daily_rainfall_NSW',
       'figshareairline\\BCC-ESM1_daily_rainfall_NSW',
       'figshareairline\\CanESM5_daily_rainfall_NSW',
       'figshareairline\\CMCC-CM2-HR4_daily_rainfall_NSW',
       'figshareairline\\CMCC-CM2-SR5_daily_rainfall_NSW',
       'figshareairline\\CMCC-ESM2_daily_rainfall_NSW',
       'figshareairline\\EC-Earth3-Veg-LR_daily_rainfall_NSW',
       'figshareairline\\FGOALS-f3-L_daily_rainfall_NSW',
       'figshareairline\\FGOALS-g3_daily_rainfall_NSW',
       'figshareairline\\GFDL-CM4_daily_rainfall_NSW',
       'figshareairline\\GFDL-ESM4_daily_rainfall_NSW',
       'figshareairline\\INM-CM4-8_daily_rainfall_NSW',
       'figshareairline\\INM-CM5-0_daily_rainfall_NSW',
       'figshareairline\\KIOST-ESM_daily_rainfall_NSW',
       'figshareairline\\
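
The `model` values above still carry the directory prefix and the `_daily_rainfall_NSW` suffix. One possible way to keep only the model name is sketched below. It assumes every input file is named `<model>_daily_rainfall_NSW.csv`; the names `MODEL_PATTERN` and `model_from_path` are hypothetical. The character class excludes both `/` and `\`, so the result does not depend on the operating system's path separator.

```python
import re

# Assumption: every input file is named "<model>_daily_rainfall_NSW.csv".
MODEL_PATTERN = re.compile(r"([^\\/]+)_daily_rainfall_NSW\.csv$")


def model_from_path(path):
    """Return the model name embedded in a rainfall CSV path."""
    match = MODEL_PATTERN.search(path)
    if match is None:
        raise ValueError(f"unexpected file name: {path!r}")
    return match.group(1)


print(model_from_path("figshareairline\\ACCESS-CM2_daily_rainfall_NSW.csv"))  # ACCESS-CM2
print(model_from_path("figshareairline/SAM0-UNICON_daily_rainfall_NSW.csv"))  # SAM0-UNICON
```

In the merge cell this could replace the `re.findall(...)[0]` expression, i.e. `.assign(model=model_from_path(file))`. Because the helper raises on any other file name, it would also flag `combined_data.csv` if the glob picked it up on a rerun.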

#### 3. Load the combined CSV to memory and perform a simple EDA
1. Investigate at least two of the following approaches to reduce memory usage while performing the EDA (e.g., value_counts).
    - Changing the dtype of your data
    - Loading just the columns we want
    - Loading in chunks
    - Dask
2. Discuss your observations.
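
Three of the approaches listed above (changing dtypes, loading selected columns, and loading in chunks) can be sketched with `pandas.read_csv` alone. The example below is a small illustration rather than the notebook's measured workflow: it reads a tiny in-memory stand-in for `combined_data.csv` whose column names come from the `df.head()` output, while the rows in `SAMPLE_CSV` are made up and the helper names `load_slim` and `model_counts_in_chunks` are hypothetical.

```python
import io
from collections import Counter

import pandas as pd

# Tiny stand-in for combined_data.csv. Column names follow the df.head()
# output; the rows themselves are made up for illustration.
SAMPLE_CSV = """time,lat_min,lat_max,lon_min,lon_max,rain (mm/day),model
1889-01-01 12:00:00,-36.25,-35.0,140.625,142.5,0.0,ACCESS-CM2
1889-01-02 12:00:00,-36.25,-35.0,140.625,142.5,0.5,ACCESS-CM2
1889-01-03 12:00:00,-36.25,-35.0,140.625,142.5,1.5,ACCESS-CM2
1889-01-01 12:00:00,-36.25,-35.0,140.625,142.5,0.0,BCC-ESM1
1889-01-02 12:00:00,-36.25,-35.0,140.625,142.5,2.0,BCC-ESM1
"""


def load_slim(source):
    """Load only two columns, with a narrower float and a categorical model."""
    return pd.read_csv(
        source,
        usecols=["rain (mm/day)", "model"],
        dtype={"rain (mm/day)": "float32", "model": "category"},
    )


def model_counts_in_chunks(source, chunksize):
    """Count rows per model while holding only one chunk in memory at a time."""
    counts = Counter()
    for chunk in pd.read_csv(source, usecols=["model"], chunksize=chunksize):
        counts.update(chunk["model"].value_counts().to_dict())
    return pd.Series(counts).sort_index()


full = pd.read_csv(io.StringIO(SAMPLE_CSV))
slim = load_slim(io.StringIO(SAMPLE_CSV))
print("all columns, default dtypes:", full.memory_usage(deep=True).sum(), "bytes")
print("two columns, narrow dtypes: ", slim.memory_usage(deep=True).sum(), "bytes")
print(model_counts_in_chunks(io.StringIO(SAMPLE_CSV), chunksize=2))
```

On the real file, `source` would be the path to `combined_data.csv` and `chunksize` something on the order of millions of rows. Whether `float32` precision is acceptable for the rainfall values is a judgment call for the discussion.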

Discussion:

#### 4. Perform a simple EDA in R
1. Pick an approach to transfer the dataframe from Python to R.
    - Parquet file
    - Feather file
    - Pandas exchange
    - Arrow exchange
2. Discuss why you chose this approach over others.