# DSCI 525 - Web and Cloud Computing

Milestone 2: Your team is planning to migrate to the cloud. AWS gave your team $400 ($100 each) to support this. As part of this initiative, your team needs to set up a server in the cloud and a collaborative environment for your team, and later move your data to the cloud. After that, your team can wrangle the data in preparation for machine learning.

## Milestone 2 checklist  
You will have two main tasks. Here is the checklist:
- Set up a collaborative environment
    - Set up your EC2 instance with JupyterHub.
    - Install everything you need on your UNIX server (Amazon EC2 instance).
    - Set up your S3 bucket.
    - Move the data that you wrangled in your last milestone to S3.
    - Read the data back from S3.
- Wrangle the data in preparation for machine learning
    - Get the data from S3 in your notebook and make it ready for machine learning.

**Keep in mind:**

- _All services you use are in region us-west-2._

- _Don't store anything in these servers or storage that represents your identity as a student (like your student ID number)._

- _Use only default VPC and subnet._
    
- _Make sure no IP addresses are visible in the screenshots you provide._

- _Budget properly so that you don't run out of credits._

- _We want one single notebook for grading, and how you produce it is at your discretion. ***So only one person in your group needs to spin up a big instance, and a ```t2.xlarge``` is of decent size.***_

- _Please stop the instance when it is not in use. This can save you some bucks, but it's again up to you and how you budget your money. Maybe stop it if you or your team won't use it for the next 5 hours?_

- _Your AWS lab will shut down after 3 hours 30 minutes. When you start it again, your AWS credentials (***access key***, ***secret***, and ***session token***) will change, and you will need to update your credentials file with the new ones._

- _Say something went wrong and you want to spin up another EC2 instance, then make sure you terminate the previous one._

- _We will be choosing the storage to be ```Delete on Termination```, which means that data stored in your instance will be lost upon termination. Make sure you save any data to S3 and download the notebooks to your laptop, so that the next time you have your JupyterHub on a different instance, you can upload your notebooks there._
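
Since the lab credentials change on every restart, refreshing the credentials file is a recurring chore. The following is only a minimal sketch of one way to do it from Python. It assumes the standard AWS credentials file layout (a `[default]` profile with `aws_access_key_id`, `aws_secret_access_key`, and `aws_session_token`); the helper name `write_aws_credentials` is hypothetical and not part of the course material.

```python
import configparser
from pathlib import Path


def write_aws_credentials(path, access_key, secret_key, session_token, profile="default"):
    """Overwrite one profile in an AWS credentials file with fresh session values.

    Hypothetical helper: any other profiles already in the file are kept.
    """
    path = Path(path)
    # interpolation=None so that special characters in tokens are stored verbatim
    config = configparser.ConfigParser(interpolation=None)
    if path.exists():
        config.read(path)
    config[profile] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "aws_session_token": session_token,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as f:
        config.write(f)


# Example call (placeholder values; the AWS CLI reads ~/.aws/credentials by default):
# write_aws_credentials(Path.home() / ".aws" / "credentials",
#                       "<AWS_ACCESS_KEY_ID>", "<AWS_SECRET_ACCESS_KEY>", "<AWS_SESSION_TOKEN>")
```

Editing the file by hand works just as well; the point is only that all three values must be replaced together after each lab restart.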

_***Outside of Milestone:*** If you are working individually just to practice setting up EC2 instances, make sure you select a ```t2.large``` instance (nothing bigger than that, as it can cost you money). I strongly recommend that you spin up your own instance and experiment with an S3 bucket (the lectures, additional instructions, and video series cover many things you can try) to get comfortable with AWS. But we won't be looking at it for grading purposes._

***NOTE:*** Everything you need for this notebook is discussed in lecture 3, lecture 4, and the setup instructions.

### 1. Set up your EC2 instance

rubric={correctness:20}

#### Please attach this screenshot from your group for grading.
https://github.ubc.ca/mds-2021-22/DSCI_525_web-cloud-comp_students/blob/master/release/milestone2/image/1_result.png

![ec2.png](attachment:ec2.png)

### 2. Set up your JupyterHub

rubric={correctness:20}

#### Please attach this screenshot from your group for grading
I want to see all the group members in this screenshot: https://github.ubc.ca/mds-2021-22/DSCI_525_web-cloud-comp_students/blob/master/release/milestone2/image/2_result.png

![jupyter.png](attachment:jupyter.png)

### 3. Set up the server

rubric={correctness:20}

3.1) Add your team members to the EC2 instance.

3.2) Set up a common data folder to download data into. This folder should be accessible by all users in the JupyterHub.

3.3) (***OPTIONAL***) Set up a shared notebook environment.

3.4) Install and configure the AWS CLI.

#### Please attach this screenshot from your group for grading

Make sure you mask the IP address; see [here](https://www.anysoftwaretools.com/blur-part-picture-mac/) for how to blur part of a picture.

https://github.ubc.ca/mds-2021-22/DSCI_525_web-cloud-comp_students/blob/master/release/milestone2/image/3_result.png

![aws_cli.png](attachment:aws_cli.png)

### 4. Get the data that we wrangled in our first milestone

You have to install the packages that are needed. Refer to the ```pip``` section of this TLJH [document](https://tljh.jupyter.org/en/latest/howto/env/user-environment.html).

Don't forget to add the -E option. This way, all packages that you install will be available to other users in your JupyterHub.
You must install the packages below, plus any other packages needed for your wrangling.

    sudo -E pip install pandas
    sudo -E pip install pyarrow
    sudo -E pip install s3fs
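
To confirm that the installation is visible from a notebook, a quick check like the one below can help. This is only a sketch: the helper name `missing_packages` is hypothetical, and it only tests whether each package can be found by the current Python environment, not whether its version is suitable.

```python
import importlib.util


def missing_packages(names):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means all three packages are importable by this kernel.
print(missing_packages(["pandas", "pyarrow", "s3fs"]))
```

If a package shows up as missing for another user, it was most likely installed without the -E option.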

In the last milestone, we looked at transferring the data from Python to R and came up with different solutions. Going forward, I have uploaded the data in parquet format, which we can use from here on.

In [2]:
import re
import os
import glob
import zipfile
import requests
from urllib.request import urlretrieve
import json
import pandas as pd

Remember that here we use the folder that we created in step 3.2, since we made it available to all the users in the group.

In [3]:
# Necessary metadata
article_id = 14226968  # this is the unique identifier of the article on figshare
url = f"https://api.figshare.com/v2/articles/{article_id}"
headers = {"Content-Type": "application/json"}
output_directory = "/srv/data/my_shared_data_folder/"

In [4]:
response = requests.get(url, headers=headers)
data = response.json()  # this contains all of the article's data, feel free to check it out
files = data["files"]   # this is just the data about the files, which is what we want
files

[{'id': 26844650,
  'name': 'allyears.csv.zip',
  'size': 2405908113,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26844650',
  'supplied_md5': '9e046ac05ecd2c32a256a47dd1098b81',
  'computed_md5': '9e046ac05ecd2c32a256a47dd1098b81'},
 {'id': 26863682,
  'name': 'individual_years.zip',
  'size': 1896206676,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/26863682',
  'supplied_md5': '921da748974b07b2a70bbfcc04535a77',
  'computed_md5': '921da748974b07b2a70bbfcc04535a77'},
 {'id': 27515426,
  'name': 'combined_model_data.csv.zip',
  'size': 821308997,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/files/27515426',
  'supplied_md5': '7638434c44a7d29cbb29fe200b4fd65d',
  'computed_md5': '7638434c44a7d29cbb29fe200b4fd65d'},
 {'id': 27520682,
  'name': 'combined_model_data_parti.parquet.zip',
  'size': 519743915,
  'is_link_only': False,
  'download_url': 'https://ndownloader.figshare.com/

In [23]:
files_to_dl = ["combined_model_data_parti.parquet.zip"]  # download only the partitioned parquet file
os.makedirs(output_directory, exist_ok=True)
for file in files:
    if file["name"] in files_to_dl:
        urlretrieve(file["download_url"], os.path.join(output_directory, file["name"]))

In [22]:
with zipfile.ZipFile(os.path.join(output_directory, "combined_model_data_parti.parquet.zip"), 'r') as f:
    f.extractall(output_directory)
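
The figshare metadata shown above includes a `supplied_md5` value for each file, so the download can be checked before it is extracted. The following is a minimal sketch of such a check; the helper name `md5_of_file` is hypothetical, and the file is read in chunks because the archive is several hundred megabytes.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Possible usage with the variables from the cells above (not run here):
# zip_path = os.path.join(output_directory, "combined_model_data_parti.parquet.zip")
# expected = next(f["supplied_md5"] for f in files if f["name"] in files_to_dl)
# assert md5_of_file(zip_path) == expected, "download looks corrupted, try again"
```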

### 5. Set up your S3 bucket and move data

rubric={correctness:20}

5.1) Create a bucket named mds-s3-xxx, replacing xxx with your group number.

5.2) Create your first folder, called "output".

5.3) Move the "observed_daily_rainfall_SYD.csv" file from the Milestone 1 data folder on your local computer to your S3 bucket.

5.4) Move the parquet file that we downloaded in step 4 (combined_model_data_parti.parquet) to S3 using the CLI that we installed in step 3.4.
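
For step 5.4, the partitioned parquet data is a folder, so the copy has to be recursive. The sketch below builds an `aws s3 cp ... --recursive` command and can optionally run it through `subprocess`. It rests on a few assumptions: the helper names are hypothetical, the local path assumes the archive was extracted into the shared folder from step 4, and `mds-s3-xxx` stands for your own bucket name.

```python
import subprocess


def s3_copy_command(local_path, bucket, key_prefix="", recursive=False):
    """Build the argument list for `aws s3 cp` (hypothetical helper)."""
    command = ["aws", "s3", "cp", local_path, f"s3://{bucket}/{key_prefix}"]
    if recursive:
        command.append("--recursive")  # needed when local_path is a folder
    return command


def run_s3_copy(command):
    """Run the command; requires a configured AWS CLI on this machine."""
    subprocess.run(command, check=True)


cmd = s3_copy_command(
    "/srv/data/my_shared_data_folder/combined_model_data_parti.parquet",  # assumed location
    "mds-s3-xxx",
    "combined_model_data_parti.parquet/",
    recursive=True,
)
print(" ".join(cmd))
# run_s3_copy(cmd)  # uncomment once the credentials file is up to date
```

Typing the printed command directly in a terminal on the server does the same thing.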

#### Please attach this screenshot from your group for grading

Make sure it has 3 objects.

https://github.ubc.ca/mds-2021-22/DSCI_525_web-cloud-comp_students/blob/master/release/milestone2/image/4_result.png

![step5.png](attachment:step5.png)

### 6. Wrangle the data in preparation for machine learning

rubric={correctness:20}

Our data currently covers all of NSW, but say that our client wants us to create a machine learning model to predict rainfall over Sydney only. There's a bit of wrangling that needs to be done for that:
1. We need to query our data for only the rows that contain information covering Sydney
2. We need to wrangle our data into a format suitable for training a machine learning model. That will require pivoting, resampling, grouping, etc.

To train an ML algorithm we need it to look like this:

||model-1_rainfall|model-2_rainfall|model-3_rainfall|...|observed_rainfall|
|---|---|---|---|---|---|
|0|0.12|0.43|0.35|...|0.31|
|1|1.22|0.91|1.68|...|1.34|
|2|0.68|0.29|0.41|...|0.57|
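
As a rough illustration of the pivoting and resampling described above, here is a toy example with made-up numbers (two hypothetical models, two days). It assumes long-format input with one row per time and model, and it is only a sketch of the idea, not the required solution.

```python
import pandas as pd

# Toy long-format data: one row per (time, model). Values are invented.
long_df = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2000-01-01 12:00", "2000-01-01 12:00", "2000-01-02 12:00", "2000-01-02 00:00"]
        ),
        "model": ["model-1", "model-2", "model-1", "model-2"],
        "rain (mm/day)": [0.1, 0.2, 0.3, 0.4],
    }
).set_index("time")

# One column per model; daily means align models that report at different hours.
wide = long_df.pivot(columns="model", values="rain (mm/day)").resample("1D").mean()

# Toy observations on the same daily index, added as one extra column.
observed = pd.DataFrame(
    {"Observed": [0.15, 0.35]},
    index=pd.to_datetime(["2000-01-01", "2000-01-02"]),
)
ml_ready = pd.concat([wide, observed], axis=1)
print(ml_ready)
```

Without the resampling step, the two models on 2000-01-02 would land on different rows because their timestamps differ.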

6.1) Get the data from s3 (```combined_model_data_parti.parquet``` and ```observed_daily_rainfall_SYD.csv```)

6.2) First query for Sydney data and then drop the lat and lon columns (we don't need them).
```
syd_lat = -33.86
syd_lon = 151.21
```
Expected shape ```(1150049, 2)```.

6.3) Save this processed file to S3 for later use:

  Save it as a CSV file named ```ml_data_SYD.csv``` in ```s3://mds-s3-xxx/output/```.
  Expected shape: ```(46020, 26)```. This includes all the models as columns, plus an additional column ```Observed``` loaded from ```observed_daily_rainfall_SYD.csv``` in S3.
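
The Sydney query in step 6.2 boils down to asking which grid cell contains the point `(syd_lat, syd_lon)`. The sketch below spells out that test for a single cell in plain Python. It assumes each row describes a cell by `lat_min`, `lat_max`, `lon_min`, and `lon_max`; the helper name and the example cell bounds are hypothetical.

```python
def cell_contains(lat_min, lat_max, lon_min, lon_max, lat, lon):
    """Return True if the point (lat, lon) lies inside the grid cell (bounds inclusive)."""
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


syd_lat = -33.86
syd_lon = 151.21

# Made-up cell bounds, only to show the comparison in both directions.
print(cell_contains(-34.5, -33.5, 150.5, 151.5, syd_lat, syd_lon))  # cell around Sydney
print(cell_contains(-30.0, -29.0, 150.5, 151.5, syd_lat, syd_lon))  # cell further north
```

In pandas, the same four comparisons are applied to whole columns at once and combined with `&` to build a boolean mask.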

In [3]:
# Temporary AWS lab credentials: paste the current values here, and never commit or share real ones
credentials = {'key': '<AWS_ACCESS_KEY_ID>',
               'secret': '<AWS_SECRET_ACCESS_KEY>',
               'token': '<AWS_SESSION_TOKEN>'}

df1 = pd.read_csv('s3://mds-s3-group10/observed_daily_rainfall_SYD.csv',
                  storage_options=credentials)
df2 = pd.read_parquet('s3://mds-s3-group10/combined_model_data_parti.parquet/',
                      storage_options=credentials)

In [4]:
syd_lat = -33.86
syd_lon = 151.21

# Keep only the grid cell that contains Sydney; .copy() avoids pandas' SettingWithCopyWarning
sydney = df2[
    (df2['lat_min'] <= syd_lat) & (df2['lat_max'] >= syd_lat)
    & (df2['lon_min'] <= syd_lon) & (df2['lon_max'] >= syd_lon)
].copy()

sydney = sydney.drop(columns=["lat_min", "lat_max", "lon_min", "lon_max"])
sydney = sydney.set_index("time")
sydney.index = pd.to_datetime(sydney.index)


In [5]:
sydney

|time|rain (mm/day)|model|
|---|---|---|
|1889-01-01 12:00:00|0.040427|ACCESS-CM2|
|1889-01-02 12:00:00|0.073777|ACCESS-CM2|
|1889-01-03 12:00:00|0.232656|ACCESS-CM2|
|1889-01-04 12:00:00|0.911319|ACCESS-CM2|
|1889-01-05 12:00:00|0.698013|ACCESS-CM2|
|...|...|...|
|2014-12-27 12:00:00|17.444923|TaiESM1|
|2014-12-28 12:00:00|1.569647|TaiESM1|
|2014-12-29 12:00:00|1.444630|TaiESM1|
|2014-12-30 12:00:00|0.716019|TaiESM1|


In [6]:
sydney.shape

(1150049, 2)

In [7]:
sydney.head()

|time|rain (mm/day)|model|
|---|---|---|
|1889-01-01 12:00:00|0.040427|ACCESS-CM2|
|1889-01-02 12:00:00|0.073777|ACCESS-CM2|
|1889-01-03 12:00:00|0.232656|ACCESS-CM2|
|1889-01-04 12:00:00|0.911319|ACCESS-CM2|
|1889-01-05 12:00:00|0.698013|ACCESS-CM2|


In [9]:
sydney = sydney.pivot(columns="model", 
                      values="rain (mm/day)").resample("1D").mean()

In [10]:
observed = pd.read_csv("s3://mds-s3-group10/observed_daily_rainfall_SYD.csv",
    storage_options=credentials,
)

observed.set_index("time", inplace = True)

observed.index = pd.to_datetime(observed.index)
observed.columns = ["Observed"]

In [12]:
ml_data_SYD = pd.concat([sydney, observed], 
                        axis=1)

In [13]:
ml_data_SYD.head()

|time|ACCESS-CM2|ACCESS-ESM1-5|AWI-ESM-1-1-LR|BCC-CSM2-MR|BCC-ESM1|CMCC-CM2-HR4|CMCC-CM2-SR5|CMCC-ESM2|CanESM5|EC-Earth3-Veg-LR|...|MPI-ESM-1-2-HAM|MPI-ESM1-2-HR|MPI-ESM1-2-LR|MRI-ESM2-0|NESM3|NorESM2-LM|NorESM2-MM|SAM0-UNICON|TaiESM1|Observed|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|1889-01-01|0.040427|1.814552|35.579336|4.268112|0.001107466|11.410537|3.322009e-08|2.6688|1.321215|1.515293|...|4.244226e-13|1.390174e-13|6.537884e-05|3.445495e-06|15.76096|4.759651e-05|2.451075|0.221324|2.257933|0.006612|
|1889-01-02|0.073777|0.303965|4.59652|1.190141|0.0001015323|4.014984|1.3127|0.946211|2.788724|4.771375|...|4.409552|0.1222283|1.049131e-13|4.791993e-09|0.367551|0.4350863|0.477231|3.757179|2.287381|0.090422|
|1889-01-03|0.232656|0.019976|5.927467|1.003845e-09|1.760345e-05|9.660565|9.10372|0.431999|0.003672|4.23398|...|0.22693|0.3762301|9.758706e-14|0.6912302|0.1562869|9.561101|0.023083|0.253357|1.199909|1.401452|
|1889-01-04|0.911319|13.623777|8.029624|0.08225225|0.1808932|3.951528|13.1716|0.368693|0.013578|15.252495|...|0.02344586|0.4214019|0.007060915|0.03835721|2.472226e-07|0.5301038|0.002699|2.185454|2.106737|14.869798|
|1889-01-05|0.698013|0.021048|2.132686|2.496841|4.708019e-09|2.766362|18.2294|0.339267|0.002468|11.920356|...|4.270161e-13|0.1879692|4.504985|3.506923e-07|1.949792e-13|1.460928e-10|0.001026|2.766507|1.763335|0.467628|


In [19]:
ml_data_SYD.to_csv(
    "s3://mds-s3-group10/output/ml_data_SYD.csv", storage_options=credentials
)

What the final file looks like:
https://github.ubc.ca/mds-2021-22/DSCI_525_web-cloud-comp_students/blob/master/release/milestone2/image/finaloutput.png

Shape ```(46020, 26)```

(***OPTIONAL***) If you are interested in doing some benchmarking: how long does it take to read...
- a parquet file from your local disk?
- a parquet file from S3?
- a CSV file from S3?
    For that, upload the CSV file (```combined_model_data.csv```) to S3 and try to read it instead of the parquet file.
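
For the optional benchmarking, a small timing helper keeps the three comparisons consistent. The following is only a sketch: `time_call` is a hypothetical helper that reports the best wall-clock time over a few repeats, and the commented calls assume the `credentials` dictionary and bucket paths used earlier in this notebook.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Call func several times and return (best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


# Self-contained demonstration with a cheap function:
seconds, total = time_call(sum, range(1_000_000))
print(f"sum took {seconds:.4f} s")

# Possible usage for the benchmark (not run here):
# seconds, df = time_call(pd.read_parquet, "s3://mds-s3-xxx/combined_model_data_parti.parquet/",
#                         repeats=1, storage_options=credentials)
```

Reading from S3 also depends on network conditions, so a single run is only a rough indication.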