# Lab: Schedule Data Ingest

## Overview
We will schedule the data download to run periodically.

## Run time
20 minutes

## Depends on
[2-data-upload](1-data-upload-console.ipynb)


## Step 1 - Go to Google Cloud Console

## Step 2 - Open Google Cloud Shell and clone repository
<img src="../assets/images/Setting_up_Cloud_Shell_Lab-2df6f86d.png" style="max-width:100%;border-width:3px;  border-style: solid;" />

The steps below are adapted from the textbook Data Science on the Google Cloud Platform (https://learning.oreilly.com/library/view/data-science-on/9781491974551/), and we will be using the example code from this source as well. You will create a Cloud Function and have Cloud Scheduler call it to update our data periodically.


Start by cloning the repository below in Cloud Shell.

```bash
git clone \
   https://github.com/GoogleCloudPlatform/data-science-on-gcp/
```

Next we will move into that directory.

```bash
cd data-science-on-gcp
```

<img src="../assets/images/scheduling-9.png" style="max-width:100%;border-width:3px;  border-style: solid;" />



---

## Step 3 - Making API calls using Cloud Shell

Now we will use the command line in Cloud Shell to make API calls to the BTS (Bureau of Transportation Statistics) and then ingest that data into our bucket.


Navigate into the monthlyupdate directory and run the ingest_flights.py script to get the data for the first month of 2016.

```bash
cd monthlyupdate/

python3 ingest_flights.py --bucket [bucketname] --year 2016 --month 01
```
<img src="../assets/images/scheduling-8.png" style="max-width:100%;border-width:3px;  border-style: solid;" />

This uploads the CSV file of flight data for the first month of 2016 into your bucket. From here, you will create a Cloud Function that adds a new file to the bucket each month by making calls to the BTS.

---

## Step 4 - Creating a Cloud Function 



We will be using two new GCP tools: Cloud Functions and Cloud Scheduler. First we will create a Cloud Function that runs the script that made our API call in the last section.

Then we will use the Scheduler to call the function periodically so that the data stored in our bucket stays up to date.




In the monthlyupdate directory in our Cloud Shell environment, run the generate_token.sh script to create a random access token.

```bash
./generate_token.sh
```
<img src="../assets/images/scheduling-10.png" style="max-width:100%;border-width:3px;  border-style: solid;" />
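The contents of generate_token.sh are not reproduced in this lab, so the following is only a minimal sketch of one way a random token could be produced in Cloud Shell. It assumes /dev/urandom, head, tr, and cut are available, as they are on typical Linux systems. The 32-character length and the variable name TOKEN are illustrative choices, not taken from the repository.

```bash
#!/bin/bash
# Sketch only: this is not the repository's generate_token.sh.
# Read 1024 random bytes, keep only letters and digits,
# and cut the result down to the first 32 characters.
TOKEN=$(head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c1-32)
echo "$TOKEN"
```

Restricting the token to letters and digits keeps it safe to paste into both a Python string and a JSON message without any escaping.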

Then open main.py and replace the previous token with the one you just generated. Its location is highlighted below. Our requests will include this token, which is what allows us to access our function once it is hosted.


<img src="../assets/images/scheduling-11.png" style="max-width:100%;border-width:3px;  border-style: solid;" />

While main.py is open, take a look at the rest of the file. This is the code of the main function that will execute in the Cloud Function when its URL is called. It uses the functions from ingest_flights.py to download the flight data and then upload it to the bucket.

<img src="../assets/images/scheduling-13.png" style="max-width:100%;border-width:3px;  border-style: solid;" />

Open deploy_cf.sh and change the file to match the code below:

```bash
#!/bin/bash

URL=ingest_flights
echo $URL

gcloud functions deploy $URL --runtime python37 --trigger-http --timeout 480s
```

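This deploys the Cloud Function under the name held in URL (ingest_flights), which also becomes the last part of the function's URL. The --trigger-http flag makes the function callable over HTTP, and --timeout 480s allows each invocation to run for up to 8 minutes. If the deployment is rejected because the python37 runtime is no longer supported, substitute a currently supported Python runtime.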


<img src="../assets/images/scheduling-12.png" style="max-width:100%;border-width:3px;  border-style: solid;" />

Now let's run the script using the command:

```bash
./deploy_cf.sh
```
When you are asked whether to allow unauthenticated invocations of the new function, enter y for yes.

<img src="../assets/images/scheduling-14.png" style="max-width:100%;border-width:3px;  border-style: solid;" />

The script uploads our function, which can then be invoked with an API call.

We will take a look at the Cloud Function from the Console next.

Return to the Google Cloud Platform Console. On the left side, you will now see Cloud Functions on the Resources tab underneath Storage. Click on it to go to the Cloud Functions page.


<img src="../assets/images/Data_Ingestion_Using_the_Cloud_Shell-e45a65fa.png" style="max-width:100%;border-width:3px;  border-style: solid;" />

You will now see a function named ingest_flights. Select it and double-click to open it.


<img src="../assets/images/scheduling-15.png" style="max-width:100%;border-width:3px;  border-style: solid;" />

The main page shows usage information.

<img src="../assets/images/scheduling-16.png" style="max-width:100%;border-width:3px;  border-style: solid;" />

There are additional tabs. Select Trigger to see the full URL of your Cloud Function as well as the trigger type.


<img src="../assets/images/scheduling-17.png" style="max-width:100%;border-width:3px;  border-style: solid;" />
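The URL shown on the Trigger tab follows the same pattern that the scripts later in this lab rely on: region, project ID, and function name. As a small sketch, assuming the us-central1 region used in this lab and a hypothetical project ID my-project, the URL can be assembled like this:

```bash
#!/bin/bash
# Sketch: assemble the HTTP trigger URL of a Cloud Function from its parts.
REGION='us-central1'
PROJECT='my-project'     # hypothetical project ID; in Cloud Shell use $(gcloud config get-value project)
NAME='ingest_flights'    # the name passed to "gcloud functions deploy"

FUNCTION_URL="https://${REGION}-${PROJECT}.cloudfunctions.net/${NAME}"
echo "$FUNCTION_URL"
```

If the printed URL does not match the one on the Trigger tab, the region or project ID is the most likely difference.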

Now select Source to see that all of the scripts in the monthlyupdate directory have been uploaded for this Cloud Function.


<img src="../assets/images/scheduling-18.png" style="max-width:100%;border-width:3px;  border-style: solid;" />

---

## Step 5 - Creating a Scheduler

When it receives an API call, the function runs main.py, which calls ingest_flights to pull in the flight data for the specified year and month and then upload it into your bucket. If no dates are specified, it tries to get the next month of data based on what it already has.

We will therefore use the Scheduler to call this function each month so that it populates the bucket with the new flight data.
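How main.py works out the next month is not reproduced here. As a rough illustration of the idea, the sketch below assumes the latest month already ingested is known as a YYYYMM string (a hypothetical convention) and computes the year and month to request next, rolling over from December to January. The function name next_month is also hypothetical.

```bash
#!/bin/bash
# Sketch: given the latest ingested month as YYYYMM, print the next "YYYY MM".
next_month() {
  year=${1%??}        # drop the last two characters, leaving YYYY
  month=${1#????}     # drop the first four characters, leaving MM
  month=${month#0}    # remove a leading zero so 08 and 09 are not read as octal
  if [ "$month" -eq 12 ]; then
    year=$((year + 1))
    month=1
  else
    month=$((month + 1))
  fi
  printf '%04d %02d\n' "$year" "$month"
}

next_month 201601   # prints "2016 02"
next_month 201612   # prints "2017 01"
```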

First, let's call the function ourselves to see if it works.

Go back to the Cloud Shell Environment.


Open the call_cf.sh script and replace the code with the code below:

```bash
#!/bin/bash

REGION='us-central1'
PROJECT=$(gcloud config get-value project)
BUCKET=[bucketname]
URL=ingest_flights
TOKEN=[the token in main.py]

echo {\"year\":\"2016\"\,\"month\":\"02\"\,\"bucket\":\"${BUCKET}\", \"token\":\"${TOKEN}\"} > /tmp/message
cat /tmp/message

curl -X POST "https://${REGION}-${PROJECT}.cloudfunctions.net/$URL" -H "Content-Type:application/json" --data-binary @/tmp/message
```

Make sure to add your bucket and generated token to the appropriate locations.


<img src="../assets/images/scheduling-19.png" style="max-width:100%;border-width:3px;  border-style: solid;" />
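The echo line in call_cf.sh relies on backslash-escaping every quote, which is easy to get wrong when editing by hand. A possible alternative, shown here only as a sketch, is to build the same JSON body with printf and a single-quoted format string. The bucket name and token values below are hypothetical placeholders, and the sketch assumes neither contains characters that need JSON escaping (such as quotes or backslashes).

```bash
#!/bin/bash
# Sketch: write the request body with printf instead of an escaped echo.
BUCKET='my-bucket'                 # hypothetical bucket name
TOKEN='REPLACE_WITH_YOUR_TOKEN'    # hypothetical placeholder for the token in main.py
MESSAGE=$(mktemp)                  # temporary file instead of a fixed /tmp/message

printf '{"year":"%s","month":"%s","bucket":"%s","token":"%s"}\n' \
       2016 02 "$BUCKET" "$TOKEN" > "$MESSAGE"
cat "$MESSAGE"
```

The resulting file can be passed to curl with --data-binary "@$MESSAGE" in place of @/tmp/message.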

Run the script using:
```bash
./call_cf.sh
```

<img src="../assets/images/scheduling-20.png" style="max-width:100%;border-width:3px;  border-style: solid;" />

This script calls our Cloud Function through the URL we saw on the Cloud Functions page. The function then runs with the year 2016 and the month 02 as inputs and requests that month of flight data from the BTS website. If the call succeeds, we will see a new CSV file in our bucket.


<img src="../assets/images/scheduling-21.png" style="max-width:100%;border-width:3px;  border-style: solid;" />

Finally, we will create a Scheduler job to run this function every month.

Open setup_cron.sh and edit the file to match the code below:

```bash
#!/bin/bash

REGION='us-central1'
PROJECT=$(gcloud config get-value project)
BUCKET=[bucketname]

URL="https://${REGION}-${PROJECT}.cloudfunctions.net/ingest_flights"
TOKEN=[the token in main.py]
echo {\"bucket\":\"${BUCKET}\", \"token\":\"${TOKEN}\"} > /tmp/message

gcloud pubsub topics create cron-topic
gcloud pubsub subscriptions create cron-sub --topic cron-topic

gcloud beta scheduler jobs create http monthlyupdate \
       --schedule="8 of month 10:00" \
       --uri=$URL \
       --max-backoff=7d \
       --max-retry-attempts=5 \
       --max-retry-duration=3h \
       --min-backoff=1h \
       --time-zone="US/Eastern" \
       --message-body-from-file=/tmp/message
```



Again, make sure to add your bucket name and the token in the appropriate locations.



<img src="../assets/images/scheduling-22.png" style="max-width:100%;border-width:3px;  border-style: solid;" />

Now we will run the script:

```bash
./setup_cron.sh
```

<img src="../assets/images/scheduling-23.png" style="max-width:100%;border-width:3px;  border-style: solid;" />

This script uses the gcloud command-line tool to create a Pub/Sub topic (cron-topic), a Pub/Sub subscription (cron-sub), and a Scheduler job with the specified settings. Specifically, the job calls our ingest_flights Cloud Function on the 8th of every month at 10:00 US/Eastern, which serves the purpose of checking the BTS for new data every month.
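The schedule string "8 of month 10:00" is an English-like form. For comparison, the same timing written as a standard five-field cron expression would be 0 10 8 * * (minute 0, hour 10, day 8, any month, any day of the week). The sketch below only splits such an expression into its fields to make the mapping explicit. The variable names are illustrative, and nothing in this lab depends on switching to this form.

```bash
#!/bin/bash
# Sketch: split a five-field cron expression into named fields.
SCHEDULE='0 10 8 * *'   # 10:00 on the 8th of every month

# The here-document hands the string to read without the shell expanding the * characters.
read -r MINUTE HOUR DAY_OF_MONTH MONTH DAY_OF_WEEK <<EOF
$SCHEDULE
EOF

printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
       "$MINUTE" "$HOUR" "$DAY_OF_MONTH" "$MONTH" "$DAY_OF_WEEK"
# prints: minute=0 hour=10 day-of-month=8 month=* day-of-week=*
```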


---

## Step 6 - Verify Scheduler and clean up

Return to the GCP Console. Open the drop-down menu in the top left, scroll down to Tools, and select Cloud Scheduler.

<img src="../assets/images/Data_Ingestion_Using_the_Cloud_Shell-424feff8.png" style="max-width:100%;border-width:3px;  border-style: solid;" />


We can now see the scheduled job named monthlyupdate, whose target is the URL of our Cloud Function.

We are done with the Scheduler job and the Function for now, so we will delete both.

Select the check box and then select delete.


<img src="../assets/images/scheduling-25.png" style="max-width:100%;border-width:3px;  border-style: solid;" />


Then go back to the Cloud Functions page and delete the ingest_flights function.

<img src="../assets/images/scheduling-26.png" style="max-width:100%;border-width:3px;  border-style: solid;" />

Now go to the search bar and search for subscriptions to get to the Pub/Sub page.

<img src="../assets/images/scheduling-27.png" style="max-width:100%;border-width:3px;  border-style: solid;" />

We will now delete our cron-sub subscription and our cron-topic topic to finish cleaning everything up.


<img src="../assets/images/scheduling-28.png" style="max-width:100%;border-width:3px;  border-style: solid;" />
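The same cleanup can also be done from Cloud Shell. The sketch below uses the resource names created earlier in this lab (monthlyupdate, ingest_flights, cron-sub, cron-topic). The run helper and the DRY_RUN variable are hypothetical additions for safety: by default the script only prints the commands, and setting DRY_RUN=0 executes them. Depending on your gcloud version, some commands may prompt for confirmation or for a region or location.

```bash
#!/bin/bash
# Sketch: delete the lab resources with gcloud instead of the Console.
# DRY_RUN=1 (the default here) only prints each command; DRY_RUN=0 runs it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

run gcloud scheduler jobs delete monthlyupdate
run gcloud functions delete ingest_flights
run gcloud pubsub subscriptions delete cron-sub
run gcloud pubsub topics delete cron-topic
```

The subscription is deleted before the topic so that it is not left behind pointing at a topic that no longer exists.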
