# DSCI 525 - Web and Cloud Computing

In this milestone we moved our project to the cloud. Our team set up a server in the cloud and a collaborative environment, and then moved our data to the cloud. After that, we wrangled the data in preparation for machine learning.

## Milestone 2 checklist  

- To set up a collaborative environment:  
    - [x] Set up your EC2 instance with JupyterHub.  
    - [x] Install everything you need on your UNIX server (Amazon EC2 instance).
    - [x] Set up your S3 bucket.  
    - [x] Move the data that you wrangled in your last milestone to s3.  
    - [x] To move data from s3.  
- Wrangle the data in preparation for machine learning  
    - [x] Get the data from S3 in your notebook and make data ready for machine learning.  

### 1. Setup your EC2 instance

rubric={correctness:20}

#### Please attach these screenshots from your group for grading.
![ec2](../Notebooks/images/EC2_instances.png)

### 2. Setup your JupyterHub

rubric={correctness:20}

#### Please attach these screenshots from your group for grading
I want to see all the group members in this screenshot: https://github.ubc.ca/mds-2021-22/DSCI_525_web-cloud-comp_students/blob/master/release/milestone2/image/2_result.png

![jupyter](../Notebooks/images/hub_users.png)

### 3. Setup the server 

rubric={correctness:20}

- [x] Add your team members to EC2 instance.

- [x] Setup a common data folder to download data, and this folder should be accessible by all users in the JupyterHub.

- [x] Install and configure AWS CLI.

#### Please attach these screenshots from your group for grading

Make sure you mask the IP address; refer to [this guide](https://www.anysoftwaretools.com/blur-part-picture-mac/).

https://github.ubc.ca/mds-2021-22/DSCI_525_web-cloud-comp_students/blob/master/release/milestone2/image/3_result.png

![shared](../Notebooks/images/shared_data_folder.png)

### 4. Get the data that we wrangled in our first milestone.

You have to install the packages that are needed. Refer to this TLJH [document](https://tljh.jupyter.org/en/latest/howto/env/user-environment.html), in particular the ```pip``` section.

Don't forget to add the -E option. This way, all packages that you install will be available to other users in your JupyterHub.
You must install the packages below, along with any other packages needed for your wrangling.

    sudo -E pip install pandas
    sudo -E pip install pyarrow
    sudo -E pip install s3fs
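
After installing, it can help to confirm from a notebook that every user's kernel actually sees the packages. The following is a minimal sketch; the helper name `missing_packages` is hypothetical, and the package list simply mirrors the install commands above.

```python
import importlib.util

# Packages named in the install commands above; extend as needed.
REQUIRED = ["pandas", "pyarrow", "s3fs"]


def missing_packages(names):
    """Return the names that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("missing:", missing if missing else "none")
```

If the list is not empty for some user, the package was probably installed without `-E` or into a different environment.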

In the last milestone, we looked at transferring the data from Python to R and found several solutions. For this milestone, I have uploaded the data in Parquet format, which we will use moving forward.

In [1]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd

Remember that the output directory here is the folder we created in Step 3.2, which we made available to all the users in the group.

In [2]:
# Necessary metadata
article_id = 14226968  # this is the unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "/srv/data/my_shared_data_folder/"

In [3]:
response = requests.request("GET", url, headers=headers)
data = json.loads(response.text)  # this contains all the articles data, feel free to check it out
files = data["files"]             # this is just the data about the files, which is what we want
files

[{'id': 26844650,
  'name': 'allyears.csv.zip',
  'size': 2405908113,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26844650',
  'supplied_md5': '9e046ac05ecd2c32a256a47dd1098b81',
  'computed_md5': '9e046ac05ecd2c32a256a47dd1098b81'},
 {'id': 26863682,
  'name': 'individual_years.zip',
  'size': 1896206676,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26863682',
  'supplied_md5': '921da748974b07b2a70bbfcc04535a77',
  'computed_md5': '921da748974b07b2a70bbfcc04535a77'},
 {'id': 27515426,
  'name': 'combined_model_data.csv.zip',
  'size': 821308997,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/27515426',
  'supplied_md5': '7638434c44a7d29cbb29fe200b4fd65d',
  'computed_md5': '7638434c44a7d29cbb29fe200b4fd65d'},
 {'id': 27520682,
  'name': 'combined_model_data_parti.parquet.zip',
  'size': 519743915,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/

In [4]:
files_to_dl = ["combined_model_data_parti.parquet.zip"]  ## Please download the partitioned 
for file in files:
    if file["name"] in files_to_dl:
        os.makedirs(output_directory, exist_ok=True)
        urlretrieve(file["download_url"], output_directory + file["name"])
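
The figshare metadata above includes a `supplied_md5` for each file, so a large download can be checked before it is unzipped. This is a minimal sketch under that assumption; the helper names `md5_of_file` and `matches_supplied_md5` are hypothetical and not part of the original notebook.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_supplied_md5(path, file_record):
    """Compare a local file against the 'supplied_md5' field of a figshare file record."""
    return md5_of_file(path) == file_record["supplied_md5"]


# Possible usage inside the download loop (names taken from the cell above):
# assert matches_supplied_md5(output_directory + file["name"], file)
```

Reading in chunks keeps memory use flat, which matters for archives of several hundred megabytes.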

In [5]:
with zipfile.ZipFile(os.path.join(output_directory, "combined_model_data_parti.parquet.zip"), 'r') as f:
    f.extractall(output_directory)

### 5. Setup your S3 bucket and move data

rubric={correctness:20}

- [x]  Create a bucket; the name should be mds-s3-xxx. Replace xxx with your "groupnumber".

- [x]  Create your first folder called "output".

- [x] Move the "observed_daily_rainfall_SYD.csv" file from the Milestone1 data folder to your s3 bucket from your local computer.

- [x] Move the parquet file we downloaded in step 4 (combined_model_data_parti.parquet) to S3 using the CLI that we installed in step 3.4.

#### Please attach these screenshots from your group for grading
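
Because the partitioned parquet "file" is a directory tree, it can be useful to preview which objects a recursive copy would create before running it. The sketch below only builds that plan and performs no upload; `plan_uploads` is a hypothetical helper, and the bucket name and `output` prefix are the placeholders from the checklist above.

```python
import os


def plan_uploads(local_dir, bucket, prefix=""):
    """Map every file under local_dir to the S3 URI it would be copied to.

    The top-level directory name is kept as part of the key, so the
    partition layout on disk is mirrored inside the bucket.
    """
    local_dir = os.path.abspath(local_dir)
    base = os.path.basename(local_dir)
    plan = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, local_dir).replace(os.sep, "/")
            key = "/".join(part for part in (prefix.strip("/"), base, rel) if part)
            plan.append((path, f"s3://{bucket}/{key}"))
    return sorted(plan)


# Possible usage (paths are assumptions based on step 4):
# for local_path, uri in plan_uploads(
#     "/srv/data/my_shared_data_folder/combined_model_data_parti.parquet",
#     "mds-s3-xxx", "output"):
#     print(local_path, "->", uri)
```

Each pair in the plan corresponds to one object that the CLI copy would place in the bucket.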

Make sure it has 3 objects.

https://github.ubc.ca/mds-2021-22/DSCI_525_web-cloud-comp_students/blob/master/release/milestone2/image/4_result.png

![s3](../Notebooks/images/S3.png)

### 6. Wrangle the data in preparation for machine learning

rubric={correctness:20}

Our data currently covers all of NSW, but say that our client wants us to create a machine learning model to predict rainfall over Sydney only. There's a bit of wrangling that needs to be done for that:
1. We need to query our data for only the rows that contain information covering Sydney
2. We need to wrangle our data into a format suitable for training a machine learning model. That will require pivoting, resampling, grouping, etc.

To train an ML algorithm we need it to look like this:

||model-1_rainfall|model-2_rainfall|model-3_rainfall|...|observed_rainfall|
|---|---|---|---|---|---|
|0|0.12|0.43|0.35|...|0.31|
|1|1.22|0.91|1.68|...|1.34|
|2|0.68|0.29|0.41|...|0.57|
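
The reshaping from one row per (time, model) to one column per model can be sketched on toy data. The values and dates below are made up for illustration; only the column names `time`, `model`, and `rain (mm/day)` follow the dataset.

```python
import pandas as pd

# Toy long-format data: one row per (time, model) pair.
long_df = pd.DataFrame({
    "time": ["2000-01-01", "2000-01-01", "2000-01-02", "2000-01-02"],
    "model": ["model-1", "model-2", "model-1", "model-2"],
    "rain (mm/day)": [0.12, 0.43, 1.22, 0.91],
})

# Pivot to wide format: one row per time, one column per model.
wide_df = long_df.pivot(index="time", columns="model", values="rain (mm/day)")
print(wide_df)
```

Note that `pivot` raises an error if a (time, model) pair appears more than once, so timestamps usually need to be normalised to a common daily value first.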

6.1) Get the data from s3 (```combined_model_data_parti.parquet``` and ```observed_daily_rainfall_SYD.csv```)

6.2) First query for Sydney data and then drop the lat and lon columns (we don't need them).
```
syd_lat = -33.86
syd_lon = 151.21
```
Expected shape ```(1150049, 2)```.
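
Querying for Sydney means keeping the grid cells whose latitude and longitude bounds enclose the point. A minimal sketch of that containment test follows; `cell_contains` is a hypothetical helper and the two example cells are invented for illustration.

```python
SYD_LAT = -33.86
SYD_LON = 151.21


def cell_contains(lat_min, lat_max, lon_min, lon_max, lat, lon):
    """Return True if the point (lat, lon) lies inside the grid cell bounds."""
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


# A hypothetical cell around Sydney, and one further inland that misses it.
print(cell_contains(-34.5, -33.5, 150.625, 151.875, SYD_LAT, SYD_LON))
print(cell_contains(-35.8, -34.9, 140.625, 141.875, SYD_LAT, SYD_LON))
```

Each model uses its own grid resolution, so a different cell matches in each model, but the same test applies to all of them.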

6.3) Save this processed file to s3 for later use:

  Save as a CSV file ```ml_data_SYD.csv``` to ```s3://mds-s3-xxx/output/```.
  Expected shape ```(46020, 26)```: this includes all the models as columns plus an additional column ```Observed``` loaded from ```observed_daily_rainfall_SYD.csv``` on S3.

### Data wrangling code:

In [10]:
import json
import urllib.parse
import pandas as pd

In [7]:
aws_credentials = {
    # real values redacted: never commit AWS credentials to a notebook or repository
    "key": "<AWS_ACCESS_KEY_ID>",
    "secret": "<AWS_SECRET_ACCESS_KEY>",
    "token": "<AWS_SESSION_TOKEN>",
}
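
Credentials pasted into a notebook are easy to leak through version control. One alternative, sketched below, is to build the same dictionary from environment variables. The function name `credentials_from_env` is hypothetical; the variable names are the standard AWS ones, and the `key`/`secret`/`token` keys match the dictionary used in this notebook.

```python
import os


def credentials_from_env(env=os.environ):
    """Build a storage_options-style dict from standard AWS environment variables."""
    creds = {
        "key": env["AWS_ACCESS_KEY_ID"],
        "secret": env["AWS_SECRET_ACCESS_KEY"],
    }
    # The session token only exists for temporary credentials.
    token = env.get("AWS_SESSION_TOKEN")
    if token:
        creds["token"] = token
    return creds


# Possible usage, assuming the variables are exported in the server environment:
# aws_credentials = credentials_from_env()
```

A missing key or secret raises `KeyError` immediately, which is easier to diagnose than a failed S3 request later on.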

In [15]:
# reading observed rainfall csv dataset from s3
observed = pd.read_csv('s3://mds-s3-arlincherian/observed_daily_rainfall_SYD.csv',
                    storage_options = aws_credentials)

In [16]:
# reading combined data parquet file from s3
df = pd.read_parquet('s3://mds-s3-arlincherian/output/model=CMCC-CM2-HR4/bd6cc563dd314a97bc8e98274d16be49.parquet',
                    storage_options = aws_credentials)

In [13]:
df.head()

||time|lat_min|lat_max|lon_min|lon_max|rain (mm/day)|
|---|---|---|---|---|---|---|
|0|1889-01-01 12:00:00|-35.811518|-34.86911|140.625|141.875|1.162277|
|1|1889-01-02 12:00:00|-35.811518|-34.86911|140.625|141.875|4.016328|
|2|1889-01-03 12:00:00|-35.811518|-34.86911|140.625|141.875|1.639138e-17|
|3|1889-01-04 12:00:00|-35.811518|-34.86911|140.625|141.875|2.554377e-20|
|4|1889-01-05 12:00:00|-35.811518|-34.86911|140.625|141.875|0.006764351|


In [17]:
# defining lat and long of Sydney for filtering
syd_lat = -33.86
syd_lon = 151.21

df_syd = df.query("lat_min <= @syd_lat and lat_max >= @syd_lat and \
                   lon_min <= @syd_lon and lon_max >= @syd_lon")
df_syd = df_syd.drop(columns=["lat_min", "lat_max", "lon_min", "lon_max"])

# labelling the observed rainfall rows so they form their own column after the pivot
observed["model"] = "observed_rainfall"

# combining dataset
df_concat = pd.concat([df_syd, observed])
df_concat["time"] = pd.to_datetime(df_concat["time"]).dt.date
df_concat.set_index("time", inplace=True)

# new dataset output
ml_df = df_concat.pivot(values="rain (mm/day)", columns="model").reset_index().drop(columns=["time"])

In [19]:
#save dataset to output folder on s3
ml_df.to_csv("s3://mds-s3-arlincherian/output/ml_data_SYD.csv", storage_options = aws_credentials)

What the final file format looks like:
https://github.ubc.ca/mds-2021-22/DSCI_525_web-cloud-comp_students/blob/master/release/milestone2/image/finaloutput.png

Shape ```(46020, 26)```

(***OPTIONAL***) If you are interested in doing some benchmarking, how much time does it take to read:
- the Parquet file from your local disk?
- the Parquet file from S3?
- the CSV file from S3?
    For that, upload the CSV file (```combined_model_data.csv```) to S3 and try to read it instead of the Parquet file.
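
For the optional benchmark, a small timing wrapper keeps the three comparisons consistent. This is a minimal sketch; `time_call` is a hypothetical helper, and the commented usage assumes the reader functions and paths used elsewhere in this notebook.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Call func several times and return (best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


# Possible usage (paths and credentials are placeholders):
# seconds, df = time_call(pd.read_parquet, "s3://mds-s3-xxx/...",
#                         storage_options=aws_credentials, repeats=1)
# print(f"parquet from S3: {seconds:.1f} s")
```

Taking the best of several runs reduces noise from network variation, although for very large files a single run may be all that is practical.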