<a href="https://www.eventsforce.net/turingevents/frontend/reg/thome.csp?pageID=89551&eventID=249&traceRedir=2"> <img src="images/turing.png" alt="Header" style="height: 200px;" align="left"/> </a> <a href="https://www.baskerville.ac.uk/"> <img src="images/baskerville.png" alt="Header" style="height: 200px;" /> </a> <a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;" align="right"/> </a>

# Challenge Notebook
## Baskerville — Accelerate your research with GPUs 2023 

This challenge is based on part of the NVIDIA DLI course for Accelerating Data Engineering Pipelines.

Here's the story: you have inherited this Jupyter notebook from a coworker who has gone away on vacation, but *ughhhh* it is so slow! They have kindly left you a handoff list with ideas on how to improve the code.

---

*Dear Colleague,*

*I do not know how long I will stay on this tropical island. Hopefully forever. Here is a small list of things I wish I could have done:*

- *right now I am only reading data for one station - how do I get all of them?*
- *for efficiency, I only want to load the columns "STATION", "LATITUDE", "LONGITUDE", "DlySum" and "DATE" from the data*
- *I think I need to clean the data - I'm pretty sure there are NA or anomalous values*
- *if I did load all of the data, I had better check somewhere that it loaded correctly using `df.describe()`*
- *the rainfall data is in units of hundredths of inches, I should probably convert this into centimetres*
- *I'd like to visualise a single day's worth of rainfall on a map - but I forget how to use the Boolean conditions to filter the dataframe*
- *before plotting I should send the data to the host*
- *I need to figure out how to use `Scattergeo` to plot rainfall on a map - at least I have the "LATITUDE" and "LONGITUDE" data*

If I implement the above, then I believe I can run this notebook in under **70** seconds using the ⏩  button.

*Best wishes,*

*Madame Kay Oss*

---

In [None]:
##NOTE: don't change this cell, we will use this to keep track of how long your notebook takes to execute
from time import time
time_start = time()

### Start Dask Cluster

Here I wanted to set up a Dask cluster but I'm not sure how to configure it for more than one GPU.

In [None]:
import numpy as np
import pandas as pd
import dask_cudf

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from dask_cuda.initialize import initialize
import rmm

visible_devices = [0] # FIXME: I want the maximum 2 GPUs per node
temp_data_directory = '/tmp/' # define directory to buffer data
device_memory_limit = 2**32 * .9 # use a fraction of 4GiB capacity to prevent memory errors

cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES=FIXME,
    local_directory=FIXME,
    protocol='ucx', # allows direct GPU-to-GPU data transfer over NVLink
    enable_tcp_over_ucx=True,
    enable_nvlink=True,
    rmm_pool_size=device_memory_limit
)
client = Client(cluster)

# Initialize RMM pool on ALL workers
def _rmm_pool():
    rmm.reinitialize()
client.run(_rmm_pool)


If it is not running already, I highly recommend setting up a terminal on the side to monitor the GPUs with the following command:

`watch -n0.1 nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv`

In [None]:
time_read = time()

ddf = pd.read_csv(
    "input_data/AQC00914594.csv",
    usecols=["STATION", "FIXME", "FIXME", "FIXME", "DATE"],
    dtype={
        "STATION": "object",
        "FIXME": FIXME,
        "FIXME": FIXME,
        "FIXME": FIXME,
        "DATE": "datetime64[ns]",
    },
    na_values=["FIXME"],
)
ddf = ddf.FIXME() # what's the method to ignore NA values?

print('This took {:.3f} seconds'.format(time()-time_read))

Let's check if the data was loaded correctly
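
As a rough hint, the sketch below shows the general load, clean and check pattern on a tiny in-memory CSV using plain pandas on the CPU. The toy rows, the `EXTRA` column and the `-9999` sentinel are made up for illustration and are assumptions, not values taken from the real station files. The GPU dataframe libraries aim to mirror this API, but check their documentation before relying on that.

```python
import io

import pandas as pd

# Hypothetical toy data: three stations, one with a made-up sentinel value.
csv_text = (
    "STATION,LATITUDE,DlySum,EXTRA\n"
    "A,10.0,5,x\n"
    "B,20.0,-9999,y\n"
    "C,30.0,15,z\n"
)

toy = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["STATION", "LATITUDE", "DlySum"],  # skip columns we do not need
    na_values=["-9999"],  # assumed sentinel: treat it as missing (NaN)
)

# Drop every row that contains at least one missing value.
toy = toy.dropna()

# Summary statistics (count, mean, min, max, ...) for the numeric columns.
summary = toy[["LATITUDE", "DlySum"]].describe()
print(summary)
```

If the `count` row is lower than expected or `min`/`max` look implausible, that usually points at a sentinel or anomalous value that still needs cleaning.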

In [None]:
ddf[["FIXME", "FIXME", "FIXME"]].FIXME()

### Convert rainfall to centimetres

How do you convert to centimetres again?
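
As a reminder of the arithmetic only, here is a minimal sketch on plain Python numbers. It assumes the standard definition of 1 inch = 2.54 cm. The function name `hundredths_inch_to_cm` and the sample readings are hypothetical and not part of this notebook.

```python
INCH_TO_CM = 2.54  # exact by definition


def hundredths_inch_to_cm(value):
    """Convert a reading in hundredths of an inch to centimetres."""
    inches = value / 100  # hundredths of an inch -> inches
    return inches * INCH_TO_CM  # inches -> centimetres


# Hypothetical sample readings in hundredths of an inch.
readings = [0, 50, 100]
converted = [hundredths_inch_to_cm(v) for v in readings]
print(converted)
```

The same two-step expression can be applied to a whole dataframe column at once, since column arithmetic is element-wise.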

In [None]:
ddf["cm"] = ddf["FIXME"] / FIXME * FIXME

## Filter Data

I'd like to visualise rainfall data from one day on a map... I had better filter my data here to the date 2021-01-01.
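
For anyone who has also forgotten how Boolean conditions work, here is a small sketch using plain pandas on made-up data. The `VALUE` column and the toy rows are assumptions for illustration. The idea is that a comparison on a column yields a Boolean series (a mask), and indexing the dataframe with that mask keeps only the rows where it is `True`.

```python
import pandas as pd

# Hypothetical toy data: two rows on 2021-01-01 and one on the next day.
toy = pd.DataFrame(
    {
        "DATE": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-01"]),
        "VALUE": [1.0, 2.0, 3.0],
    }
)

# Comparing a column with a scalar gives one True/False per row.
mask = toy["DATE"] == pd.Timestamp(2021, 1, 1)

# Indexing with the mask keeps only the rows where it is True.
one_day = toy[mask]
print(one_day)
```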

In [None]:
# Pick the date using pd.Timestamp(2021, 1, 1)
precip_one_day = ddf[FIXME] # how does the Boolean condition work here?

# Calculate a result and send to host
precip_one_day = precip_one_day.FIXME().to_FIXME()

## Visualise Big Data with Plotly

Hm... I'd quite like to make a geographic map like in this [example](https://plotly.com/python/scatter-plots-on-maps/#us-airports-map) but with the rainfall data in centimetres.
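
One possible way to build a hover-text column is plain string concatenation, sketched here with pandas on made-up data. The station names, the `cm` values and the `"<name>: <value> cm"` label format are assumptions for illustration, not a required format.

```python
import pandas as pd

# Hypothetical toy data already on the host (a regular pandas dataframe).
toy = pd.DataFrame({"STATION": ["A", "B"], "cm": [1.27, 0.0]})

# Numbers must be converted to strings before they can be concatenated.
toy["TEXT"] = toy["STATION"] + ": " + toy["cm"].round(2).astype(str) + " cm"
print(toy["TEXT"].tolist())
```

A column built this way can then be passed to the `text` argument of `go.Scattergeo`, as the cell below already does with `precip_one_day["TEXT"]`.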

In [None]:
import plotly.graph_objects as go

# Create text column for the hovertext in the figure
precip_one_day["TEXT"] = (
    precip_one_day["FIXME"] + "FIXME"
)

fig = go.Figure([go.Scattergeo(
    lon=precip_one_day['FIXME'],
    lat=precip_one_day['FIXME'],
    mode='markers',
    marker_color=precip_one_day['DlySum'], # is there any way to make the markers look nicer?
    text=precip_one_day["TEXT"])])

fig.update_layout(
        title = 'USA Precipitation',
        geo = dict(
            scope='north america',
            projection_type='albers usa',
            landcolor = "rgb(225, 225, 225)",
            subunitcolor = "rgb(200, 200, 200)",
        ),
    )

fig.show()

In [None]:
##NOTE: don't change this cell, we will use this to keep track of how long your notebook takes to execute
time_end = time()-time_start
print('This notebook took {:.2f} seconds to run'.format(time_end))

In [None]:
# Save to results folder

import os
user = os.getenv('USER')
with open("results/{}-time.txt".format(user), "w") as f:
    f.write(str(time_end))

fig.write_image("results/{}-image.png".format(user))


#### Shut down the kernel

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
