# DSCI: 525 Milestone 1 - Group 8

## Rachel Wong, Rui Wang, Daniel Ortiz, Santiago Rugeles Schoonewolff

### Imports

In [1]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd
from memory_profiler import memory_usage

# Dask
import dask.dataframe as dd

# pyarrow and feather
import pyarrow.feather as feather
import pyarrow.dataset as ds

In [2]:
%load_ext rpy2.ipython
%load_ext memory_profiler

### Downloading the data

In [3]:
# Necessary metadata
article_id = 14096681  # unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "figshare/"

In [4]:
response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)  # this contains all the articles data, feel free to check it out
files = data["files"]             # this is just the data about the files, which is what we want
files

[{'is_link_only': False,
  'name': 'daily_rainfall_2014.png',
  'supplied_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'computed_md5': 'fd32a2ffde300a31f8d63b1825d47e5e',
  'id': 26579150,
  'download_url': 'https://ndownloader.figshare.com/files/26579150',
  'size': 58863},
 {'is_link_only': False,
  'name': 'environment.yml',
  'supplied_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'computed_md5': '060b2020017eed93a1ee7dd8c65b2f34',
  'id': 26579171,
  'download_url': 'https://ndownloader.figshare.com/files/26579171',
  'size': 192},
 {'is_link_only': False,
  'name': 'README.md',
  'supplied_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'computed_md5': '61858c6cc0e6a6d6663a7e4c75bbd88c',
  'id': 26586554,
  'download_url': 'https://ndownloader.figshare.com/files/26586554',
  'size': 5422},
 {'is_link_only': False,
  'name': 'data.zip',
  'supplied_md5': 'b517383f76e77bd03755a63a8ff83ee9',
  'computed_md5': 'b517383f76e77bd03755a63a8ff83ee9',
  'id': 26766812,
  'download_url': 'https://

### Unzipping the data

In [None]:
%%time
files_to_dl = ["data.zip"]  # feel free to add other files here
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])

In [None]:
%%time
with zipfile.ZipFile(os.path.join(output_directory, "data.zip"), 'r') as f:
    f.extractall(output_directory)

### Combining data CSVs using Pandas

In [None]:
df = pd.read_csv("./figshare/ACCESS-CM2_daily_rainfall_NSW.csv")
df

In [None]:
%%time
%memit
# Shows time that regular python takes to merge file
# Join all data together
## here we are using a normal python way of merging the data 

files = glob.glob('./figshare/*.csv') # load all the CSVs
df = pd.concat((pd.read_csv(file, index_col=0) # combine them all
                .assign(model=re.findall(r'/([^_]*)', file)[0])
                for file in files)
              )
df.to_csv("./figshare/combined_data.csv")

In [None]:
df_combined = pd.read_csv("./figshare/combined_data.csv")
df_combined # combined dataframe 

### Summary of Performance on Different Machines

Everyone in the team tried to run the combined files section and we recorded our time consumption and detailed `RAM`, `processor`, and `IF SSD` to check if they are relevant.

Below is the summarized table:

| Team Member      | RAM |Processor |Is SSD|Time used to combine csv files |Time used to load combined csv to memory |
| ----------- | ----------- |----------- |----------- |----------- |----------- |
| Rachel      | 16GB of 3733MHz       |2 GHz Quad-Core Intel Core i5       |Yes    |peak memory: 404.39 MiB, increment: 0.05 MiB, CPU times: user 5min 29s, sys: 19.5 s, total: 5min 48s Wall time: 6min       |peak memory: 7112.71 MiB, increment: 3458.57 MiB CPU times: user 58.7 s, sys: 15.7 s, total: 1min 14s Wall time: 1min 23s       |
| Daniel   | Text        |Text        |Text        |Text        |Text        |
| Santiago   | Text        |Text        |Text        |Text        |Text        |
| Rui   | 16 GB 2133 MHz |2.9 GHz Quad-Core Intel Core i7        |Yes        |peak memory: 356.82 MiB, increment: 0.31 MiB, CPU times: user 6min 43s, sys: 19.4 s, total: 7min 3s Wall time: 7min 13s        |peak memory: 2983.91 MiB, increment: 0.19 MiB,CPU times: user 1min, sys: 15 s, total: 1min 15s, Wall time: 1min 21s        |




In [None]:
df_combined["model"].unique() # print out the unique models

In [None]:
%%sh
du -sh figshare/combined_data.csv

We can see from our combined dataframe that we have 28 unique models to continue our analysis with.

### Load the combined CSV to memory and perform a simple EDA

In [None]:
%%time
%%memit
#simple pandas - This is how we do normally ,which means we are loading the entire data to the memory
df = pd.read_csv("figshare/combined_data.csv")
print(df["model"].value_counts())

In [None]:
df.head()

In [None]:
# checking datatypes for columns
df.dtypes

We can see that we have object and float64 type columns in our dataframe. This makes sense that `time` and `model` are object types and the rest such as latitude, longitude, and rain are float64 types.

### Investigate changing the `dtype` of our data

In [None]:
print(f"The memory usage with the original float64 dtype: {df[['lat_min','lat_max','rain (mm/day)']].memory_usage().sum() / 1e6:.2f} MB")
print(f"The memory usage after changing to float32 dtype: {df[['lat_min','lat_max','rain (mm/day)']].astype('float32', errors='ignore').memory_usage().sum() / 1e6:.2f} MB")

### Observation1:
> As we switch the data type from `float64` to `float32` the memory usage reduced by a half. This is because `float32` is stored as a 32-bit number, while `float64` is stored as twice as much memory as `float32`. If we have a large amount of data and we don't have a specific requirement on the precision or our original data is not accurate to a certain number of decimal places, `float32` is sufficient enough for us to process the data, which is not only faster but also resource-saving.

### Loading our data in chunks using Pandas and checking the value counts of models

In [None]:
%%time
%%memit
counts = pd.Series(dtype=int)
for chunk in pd.read_csv("figshare/combined_data.csv", chunksize=10_000_000):
    counts = counts.add(chunk["model"].value_counts(), fill_value=0)
print(counts.astype(int))

### Observation 2:
> By loading in chunks, the value counts are exactly the same as other methods we tried. The peak memory for chunks is significantly lower than that without using chunking method. From our observation, we can conclude that for large-scaled data, if we choose to load in chunk, we can gain the competitive edge for lower memory usage and faster processing speed.



### Loading our data using Dask and checking the value counts of models

In [None]:
%%time
%%memit
# Dask
df_dask = dd.read_csv('figshare/combined_data.csv')
print(df_dask["model"].value_counts().compute())

### Observation3:
> So far, Dask is the best pick for us to read large csv file to dataframe. Compared with loading the csv to pandas data frame, when we load the csv file to `dask`, the `peak memory`, `increment memory`, and `wall time` all reduced dramatically for the `value_count()` operation. This is likely because `dask` partitioned the dataframe based on row index and did the calculation in parallel to improve the efficiency. Thus, for large-scale data calculation, we could use dask instead of pandas to improve the code efficiency with minimum syntax change. 

### Transfering the dataframe from Python to R using Feather

In [None]:
%%time
%%memit
dataset = ds.dataset("figshare/combined_data.csv", format="csv")
## this is of arrow table format
table = dataset.to_table()

In [None]:
%%time
# writing in feather format
feather.write_feather(table, 'figshare/combined_data.feather')

### Reason for choosing `feather`
> Our team did a comprehensive comparision and research among the four data formats online and testing in practice, in the end `feather` is our best pick for this project scenario. Our reasoning is listed below:

> - `Feather` enable us to store and read the data from raw arrow format without much serialization and deserialization which renders it faster (higher I/O speed) than Parquet, although parquet can take less storage memory which is more suitable for long term data storage.
    
> - Feather is a columnar dataframe which can speed up the data analytics queries. 
    
> - It has the unique competitive advantage for not taking too much memory without the need to unpacking the data before loading to RAM.




### Simple EDA in R

In [None]:
%%R
library(tidyr)

In [None]:
%%time
%%R
library(arrow)
start_time <- Sys.time()
r_table <- arrow::read_feather("figshare/combined_data.feather")
print(class(r_table))
library(dplyr)
result <- r_table %>% count(model) # showing the different counts of the models 
end_time <- Sys.time()
print(result)
print(end_time - start_time)

In [None]:
%%R
result <- r_table %>% count(time) # showing the different counts of the time
print(result)

### Observation From EDA: 
> The counts for models in R is the same as the counts for models we did previously in python, time to count is faster, which double confirmed the accuracy of EDA analysis.

In [None]:
%%R
r_table_d <- r_table %>% drop_na() # drop NA values

In [None]:
%%R
r_table_d <- r_table_d %>% rename(rain_mmperday = `rain (mm/day)`) # rename the column for rain

In [None]:
%%R
# function to calculate the mode
mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

In [None]:
%%R
Columns <- c("lat_min", "lat_max", "lon_min", "lon_max", "rain (mm/perday)")
Mean <- c(mean(r_table_d$lat_min), mean(r_table_d$lat_max), mean(r_table_d$lon_min), mean(r_table_d$lon_max), mean(r_table_d$rain_mmperday))
Mode <- c(mode(r_table_d$lat_min), mode(r_table_d$lat_max), mode(r_table_d$lon_min), mode(r_table_d$lon_max), mode(r_table_d$rain_mmperday))
Median <- c(median(r_table_d$lat_min), median(r_table_d$lat_max), median(r_table_d$lon_min), median(r_table_d$lon_max), median(r_table_d$rain_mmperday))

result <- data.frame(Columns, Mean, Mode, Median)
print(result)

### Observation From Mean, Mode, Mean
> The mean and median of location data(`lagtitude` and `longitude`) are very close, which means the data collected are mostly from the same area. The median and mean of `rain` is not quite close which indicates that they are not normally distributed and there might be outliers in the for `rain`. The `mode` are close to the median which means if we randomly sample a value we are likely to sample a value close to the median. 

### Challenges and difficulties when dealing with large data

> Since we are running the code in the local machine, it took a long time to run. We combatted errors by restarting from scratch if there's anything we want to modify from the start which is frustrating. 
> Everytime we rerun the notebook we have to delete the downloaded files and redownload it again, which is quite challenging for large-scale data processing.
> As we only have one single machine, our EDA was very simple, we can hardly visualize our data or calculate correlation matrices. We were unable to do deep EDA like plots (histograms, correlation matrices, etc.) to show relationships between features because the data was so large and taking a sample of the data would not be ideal.