<h1>Google Cloud Platform Practice</h1>

In this assignment, you will examine the NYC 311 complaints data, compute a month-by-month average time to completion, and show in a bar chart that the 311 system has been getting better at handling complaints. The assignment is fairly straightforward and is designed to give you a feel for using the cloud cluster. Follow the steps below and you should be fine.

<s><h2>STEP 1: Create a Plotly account</h2></s>
<s><li>If you don't already have one, create an account at <a href="https://plotly.com/api_signup">https://plotly.com/api_signup</a></li></s>
<s><li>Get an API key and copy it somewhere safe</li></s> (NOT NECESSARY - I found a library that doesn't need an account)

<h2>STEP 2: Create a cluster</h2>
<li>Follow the steps in the document on our canvas page</li>

<h2>STEP 3: Take a deep breath</h2>
<li>You'll need it!</li>

<h2>STEP 4: Upload data</h2>
<li>Upload the file nyc_311_2022_big.csv to the data folder in your cloud storage bucket</li>
<li>The file is at <a href="https://drive.google.com/file/d/1iGMlHBwycGG3Z9lUInVsF939UP0gLD9_/view?usp=sharing">https://drive.google.com/file/d/1iGMlHBwycGG3Z9lUInVsF939UP0gLD9_/view?usp=sharing</a> (you need to be logged into Google with your lion mail account to access this 4GB file)</li>


<h2>STEP 5: Create a notebook on your cluster</h2>
<li>Create a new notebook (see instructions on canvas)</li>
<li>In the first cell, you must have the following commands</li>
<pre>
%%init_spark
launcher.packages=["org.plotly-scala:plotly-jupyter-scala_2.12:0.4.0"]
</pre>
<li>These commands load the plotly package and must be the first thing executed in the notebook. If you restart the notebook, run this cell again</li>
<li>The following is an example of the code for constructing a plotly bar chart:</li>
<pre>
//Plotly barchart example
import plotly._
import plotly.element._
import plotly.layout.Layout 
import plotly.Plotly._

val data = Seq(
  Bar(
    Seq("giraffes", "orangutans", "monkeys"),
    Seq(20, 14, 23)
  )
)

data.plot()
</pre>
<li>Run it from a cell. Note that you won't see anything on GCP after you run it, so try this on your local Jupyter first. A new webpage with the bar chart should open</li>
<li>plotly-scala writes each plot to an HTML file. When you run locally, that file is stored on your local machine</li>
<li>On GCP, that file is stored in the root directory of the master node. To find it, do the following:</li>
<ol>
    <li>Go to the Jupyter file navigation page</li>
    <li>Navigate to the top of the tree (click on the folder icon in the top bar)</li>
    <li>Choose "Local Disk"</li>
    <li>In the file listing, scroll down until you find plot-1.html (or plot-2.html, ...)</li>
    <li>These plot-n.html files are where your plots are. Click on any one and it should open in your browser</li>
</ol>
<li><b>IMPORTANT</b>: Note that when you delete the cluster, everything in the master, including these html files <s>and the .credentials file</s>, will be gone! Download and save them if necessary</li>

<h2>STEP 6: Write code!</h2>
<li>Read the data file into an RDD</li>
<li>Extract the date and the processing time from the RDD into a new RDD</li>
<li>Modify the date so that it is in the form yyyymm (e.g., 202004)</li>
<li>Group the data by this modified date</li>
<li>Calculate the average processing time for each group</li>
<li>At this point, you should have an RDD that contains data like the following sample (note that it is sorted by date)</li>
<pre>
(201001,21.9500192299817)
(201002,19.845731532163732)
(201003,25.077314883297213)
(201004,30.277893547725235)
(201005,29.459862935681343)
(201006,29.17964166482454)
(201007,26.227416736921505)
(201008,26.24405141747815)
(201009,30.43295355377878)
(201010,24.145453184279294)
(201011,21.589596849143458)
(201012,20.51565411056409)
(201101,15.898290248909372)
(201102,14.422152387906875)
(201103,19.963673360824025)
</pre>
<li>Finally, construct a plotly bar chart of this data using the example above as a guide</li>
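
The date conversion, grouping, and averaging steps above can be sketched on ordinary Scala collections, which makes the logic easy to try without a cluster. This is only an illustration under stated assumptions, not the expected solution: the date pattern `MM/dd/yyyy`, the sample values, and the names `toYearMonth`, `records`, and `monthlyAverages` are all hypothetical, so check the actual column format in the CSV before reusing any of it.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// ASSUMPTION: dates look like "01/15/2010". Adjust the pattern to the real file.
val inFmt  = DateTimeFormatter.ofPattern("MM/dd/yyyy")
val outFmt = DateTimeFormatter.ofPattern("yyyyMM")

// Convert a date string to a yyyymm key, e.g. "04/09/2020" becomes 202004.
def toYearMonth(date: String): Int =
  LocalDate.parse(date, inFmt).format(outFmt).toInt

// Hypothetical (date, processing time) pairs standing in for the extracted RDD.
val records = Seq(
  ("01/15/2010", 20.0),
  ("01/20/2010", 24.0),
  ("02/03/2010", 18.0)
)

// Key each record by yyyymm, group by that key, average each group, sort by key.
val monthlyAverages: Seq[(Int, Double)] =
  records
    .map { case (date, time) => (toYearMonth(date), time) }
    .groupBy(_._1)
    .map { case (month, rows) => (month, rows.map(_._2).sum / rows.size) }
    .toSeq
    .sortBy(_._1)

monthlyAverages.foreach(println)
// prints:
// (201001,22.0)
// (201002,18.0)

// Split into month labels and averages, one sequence per chart axis.
val (months, averages) =
  monthlyAverages.map { case (month, avg) => (month.toString, avg) }.unzip
```

On an RDD, the same shape could be expressed with `map` to build the (yyyymm, time) pairs, `mapValues` and `reduceByKey` to accumulate a sum and a count per month, and `sortByKey` for the ordering. The `months` and `averages` sequences have the same form as the two sequences in the bar chart example from Step 5, so they should be usable in their place.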

<h2>STEP 7: Submission requirements</h2>
<li>Your notebook</li>
<li>The appropriate plot-n.html with the bar chart</li>