<img src="images/coiled-logo.svg"
     align="right"
     width="5%"
     alt="Coiled logo">

### Sign up for the next live session https://www.coiled.io/tutorials


<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">
     
# Parallelize your Python code

In this lesson you will learn how to parallelize custom Python code with Dask's Futures API. We will take normal for-loopy Python code that looks like this:

```python
urls = [...]
results = []
for url in urls:
    page = download(url)
    result = process(page)
    results.append(result)
```

or more dynamic Python code that looks like this:

```python
urls = [...]
results = []
while urls:
    url = urls.pop()
    page = download(url)
    result = process(page)
    results.append(result)
    
    new_urls = scrape(page)
    urls.extend(new_urls)
```

and parallelize it using [Dask Futures](https://docs.dask.org/en/stable/futures.html). 


## Futures: a low-level collection

Dask's low-level collections are the best tools when you need fine-grained control to build custom parallel and distributed computations.

The `futures` interface (derived from the built-in `concurrent.futures`) provides fine-grained real-time execution for custom situations. It allows you to submit arbitrary functions for computation in a parallelized, eager, and non-blocking way. 

### Why use Futures?

The `futures` API offers a work submission style that can easily emulate the map/reduce paradigm. If that is familiar to you, then futures might be the simplest entry point into Dask.
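
As a rough illustration of that submission style, here is a minimal sketch that uses the standard library's `concurrent.futures` instead of Dask, so it runs without a cluster. The function `square` and the list `data` are made up for this example; with Dask, `client.submit` would play the role of `pool.submit`.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


data = [1, 2, 3, 4, 5]  # hypothetical inputs

with ThreadPoolExecutor(max_workers=4) as pool:
    # "Map": submit one task per input; each call hands back a future right away.
    futures = [pool.submit(square, x) for x in data]
    # "Reduce": wait for each value and combine them.
    total = sum(f.result() for f in futures)

print(total)
```

The map step only creates futures; nothing is collected until the reduce step asks for the values.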

The other big benefit of futures is that intermediate results, represented by futures, can be passed to new tasks without pulling data from the cluster into your local session. A submission call such as `client.submit` **returns immediately**, giving one or more *futures* whose status begins as "pending" and later becomes "finished", so the local Python session is never blocked. Computation starts as soon as the inputs are ready and a worker is available.
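
The non-blocking behavior can be observed with a small sketch. It again uses the standard library's `concurrent.futures` as a stand-in for Dask, and the names `gate` and `slow_inc` are invented for the example. Dask futures offer a similar `done()` check, but treat the details here as an illustration rather than a description of Dask internals.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

gate = threading.Event()  # lets us control when the task may finish


def slow_inc(x):
    gate.wait()  # block inside the worker until the gate is opened
    return x + 1


pool = ThreadPoolExecutor(max_workers=1)

future = pool.submit(slow_inc, 10)  # returns immediately with a future
done_before = future.done()         # the task has not finished yet

gate.set()                          # allow the task to complete
value = future.result()             # block until the value is available
done_after = future.done()

pool.shutdown()
print(done_before, value, done_after)
```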

## Outline

We will learn how to use futures, and then use them on a real-world example, first in a simple case, and then in a complex case:

1.  How to use Futures 
2.  Use futures to download and parse webpages
3.  Dynamic/changing workloads
4.  Crawl and scrape a website


### Parallel Code with low-level Futures

The first loop above is an example of an embarrassingly parallel computation: we want to run the same Python code on many pieces of data. This is a very simple and very common case.

First we'll look at a very simple example, and then we'll parallelize the code above.

### Set up a Dask cluster locally

In [None]:
from dask.distributed import Client

client = Client()
client

### Dask Futures introduction

In [None]:
import time
import random

def inc(x):
    time.sleep(random.random())
    return x + 1

def double(x):
    time.sleep(random.random())
    return 2 * x

def add(x, y):
    time.sleep(random.random())
    return x + y
    

In [None]:
%%time

y = inc(10)
z = double(y)
z

Dask futures let us run Python functions remotely on parallel hardware.  Rather than calling the function directly, as in the cell above, we can ask Dask to run the function `inc` on the data `10` by passing both as arguments to the `client.submit` method.  The first argument is the function to call, and the remaining arguments are passed to that function.

Normal Execution

```python
result = function(*args, **kwargs)   # e.g. inc(10)
```

Submit function for remote execution

```python
future = client.submit(function, *args, **kwargs)  # instantaneously fire off work
...
result = future.result()  # when we need, block until done and collect the result
```

In [None]:
%%time

y = client.submit(inc, 10)
z = client.submit(double, y)
z

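You'll notice that this happened immediately.  That's because all we did was submit the `inc` and `double` functions to run on Dask and get back `Future` objects, which are pointers to where the data will eventually be. Note that we passed the future `y` straight into the second `submit` call; Dask waits for `inc` to finish before running `double`.

We can retrieve the value by calling `future.result()`, which blocks until the computation has finished.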

In [None]:
z

In [None]:
z.result()

### Submit many tasks in a loop

We can submit lots of functions to run at once, and then gather them when we're done.  This allows us to easily parallelize simple for loops.

*This section uses the following API*:

-  [Client.submit and Future.result](https://docs.dask.org/en/stable/futures.html#submit-tasks)

#### Sequential code

In [None]:
%%time 

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
results = []

for x in data:
    y = inc(x)
    z = double(y)
    results.append(z)
    
results

#### Parallel code

In [None]:
%%time 

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
results = []

for x in data:
    y = client.submit(inc, x)
    z = client.submit(double, y)
    results.append(z)
    
results = client.gather(results)
results

### Lessons:

1.  Submit a function to run elsewhere

    ```python
    y = f(x)
    future = client.submit(f, x)
    ```
    
    
2.  Get results when you're done

    ```python
    y = future.result()
    # or 
    results = client.gather(futures)
    ```

## Use futures to download and parse webpages

### Sequential Code

The code below downloads 50 question pages from a Stack Overflow tag, parses those pages, and collects the title and list of tags from each page.

We then count up all the tags to see which kinds of questions are most popular.  We divide this code into four sections:

1.  Define useful functions
2.  Get a list of pages to download and scrape
3.  Download and scrape
4.  Analyze results

#### Define useful functions

You don't need to study these.  Feel free to skip.

In [None]:
import re
import requests
from bs4 import BeautifulSoup
import time

def download(url: str, delay=0) -> str:
    time.sleep(delay)
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    else:
        response.raise_for_status()
        
        
def scrape_title(body: str) -> str:
    html = BeautifulSoup(body, "html.parser")
    return str(html.html.title)


def scrape_links(body: str, base_url="") -> list[str]:
    html = BeautifulSoup(body, "html.parser")
    
    return [
        str(base_url + link.attrs["href"]).split("?")[0]
        for link in html.find_all("a") 
        if re.match(r"/questions/\d{5}", link.attrs.get("href", ""))
    ]


def scrape_tags(body: str) -> list[str]:
    html = BeautifulSoup(body, "html.parser")
    
    return sorted({
        str(list(link.children)[0])
        for link in html.find_all("a", class_="post-tag")
    })
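
To see which links `scrape_links` keeps, here is a quick check of its regular expression against a few made-up `href` values. The sample paths are hypothetical and are not taken from a real page.

```python
import re

pattern = r"/questions/\d{5}"  # same pattern as in scrape_links

# Hypothetical href values of the kind found on a tag listing page
hrefs = [
    "/questions/12345678/some-title?foo=bar",  # question page: kept, query string dropped
    "/questions/tagged/dask",                  # tag listing: no digits after /questions/
    "/users/42",                               # unrelated page
    "/questions/123",                          # fewer than five digits
]

kept = [href.split("?")[0] for href in hrefs if re.match(pattern, href)]
print(kept)
```

Only paths that start with `/questions/` followed by at least five digits pass the filter, which is how the function separates question pages from tag listings.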

### Serial for-loopy code

#### Get list of pages to download and scrape

In [None]:
url = "https://stackoverflow.com/questions/tagged/dask"
body = download(url)
urls = scrape_links(body, base_url="https://stackoverflow.com")
urls[:5]

In [None]:
len(urls)

#### Download and scrape

In [None]:
%%time

all_tags = []
titles = []

for url in urls:
    page = download(url)
    print(".", end="")
    tags = scrape_tags(page)
    title = scrape_title(page)
    
    all_tags.append(tags)
    titles.append(title)
print()

#### Analyze Results

Aggregate tags to find related topics

In [None]:
import collections

tag_counter = collections.defaultdict(int)

for tags in all_tags:
    for tag in tags:
        tag_counter[tag] += 1
        
sorted(tag_counter.items(), key=lambda kv: kv[1], reverse=True)[:10]
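
The same aggregation can also be sketched with `collections.Counter`, which bundles the counting and the "top N" step. The tag lists below are invented sample data, and the names `sample_tags` and `sample_counter` are chosen so that they do not overwrite the notebook's own variables.

```python
from collections import Counter

# Hypothetical stand-in for all_tags: one list of tags per question
sample_tags = [
    ["dask", "python"],
    ["dask", "pandas"],
    ["dask", "numpy", "python"],
]

# Flatten the nested lists and count each tag
sample_counter = Counter(tag for tags in sample_tags for tag in tags)

top = sample_counter.most_common(2)
print(top)
```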

### Exercise: Parallelize this code

Take the code above and use Dask futures to run it in parallel.

Which sections should we think about parallelizing?

In [None]:
url = "https://stackoverflow.com/questions/tagged/dask"
body = download(url)
urls = scrape_links(body, base_url="https://stackoverflow.com")

In [None]:
%%time

# TODO: parallelize me

all_tags = []
titles = []

for url in urls:
    page = download(url)
    tags = scrape_tags(page)
    title = scrape_title(page)
    
    all_tags.append(tags)
    titles.append(title)
print()

#### Solution

Expand the three dots below if you want to see the answer

In [None]:
%%time

all_tags = []
titles = []

for url in urls:
    page = client.submit(download, url)
    tags = client.submit(scrape_tags, page)
    title = client.submit(scrape_title, page)
    
    all_tags.append(tags)
    titles.append(title)
    
all_tags = client.gather(all_tags)
titles = client.gather(titles)

In [None]:
import collections

tag_counter = collections.defaultdict(int)

for tags in all_tags:
    for tag in tags:
        tag_counter[tag] += 1
        
sorted(tag_counter.items(), key=lambda kv: kv[1], reverse=True)[:10]

### Exercise:  Scale out

There are different reasons to scale out for this problem:

1.  Parallelize bandwidth
2.  Stack Overflow's rate limits won't affect us as much if we spread our requests across many different machines
3.  ~CPU Processing speed~ (not really an issue here)

Let's ask for some machines from Coiled, and switch our Dask client to use that cluster.

In [None]:
client.close()

In [None]:
import coiled

cluster = coiled.Cluster(
    n_workers=20,
    account="dask-tutorials",
)

client = cluster.get_client()

In [None]:
client

**Rerun your computation and see.**

#### Solution

In [None]:
%%time

all_tags = []
titles = []

for url in urls:
    page = client.submit(download, url)
    tags = client.submit(scrape_tags, page)
    title = client.submit(scrape_title, page)
    
    all_tags.append(tags)
    titles.append(title)
    
all_tags = client.gather(all_tags)
titles = client.gather(titles)

## Evolving computations

Dask futures are flexible.  There are many ways to coordinate them including ...

1.  Distributed locks and semaphores
2.  Distributed queues
3.  Launching tasks from tasks
4.  Global variables
5.  ... [and lots more](https://docs.dask.org/en/stable/futures.html)

We're going to get a taste of this by learning about one Dask futures feature, [`as_completed`](https://docs.dask.org/en/stable/futures.html#distributed.as_completed), which lets us dynamically build up a computation as it completes.

We will use this to build a parallel web crawler over Stack Overflow.  

1.  First, we'll build this sequentially.
2.  Second, we'll learn how `as_completed` works in a simple example
3.  Third, we'll convert the sequential code into parallel code
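
As a warm-up for the second step, here is a minimal sketch of the "handle results as they finish" idea. It uses the standard library's `concurrent.futures.as_completed` as a stand-in, so it runs without a cluster. The function `slow_square` and the delays are made up. Unlike this standard library version, Dask's `as_completed` also lets you add new futures while iterating, which is what the crawler relies on.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_square(x, delay):
    time.sleep(delay)  # simulate work that takes a varying amount of time
    return x * x


jobs = [(1, 0.3), (2, 0.1), (3, 0.2)]  # hypothetical (input, delay) pairs

finished = []
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(slow_square, x, delay) for x, delay in jobs]
    # Iterate in completion order, not submission order
    for future in as_completed(futures):
        finished.append(future.result())

print(finished)  # faster tasks usually appear first; the exact order is not guaranteed
```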

### Sequential Code to Crawl Stack Overflow

In [None]:
%%time
from collections import deque

urls = deque()
urls.append("https://stackoverflow.com/questions/tagged/dask")  # seed with a single page

all_tags = []
titles = []
seen = set()
i = 0

while urls and i < 10: 
    url = urls.popleft()
    
    # Don't scrape the same page twice
    if url in seen:  
        continue
    else:
        seen.add(url)
    
    print(".", end="")
    i += 1
    
    # This is like before
    page = download(url)
    tags = scrape_tags(page)
    title = scrape_title(page)
    all_tags.append(tags)
    titles.append(title)

    # This is new!  
    # We scrape links on this page, and add them to the list of URLs
    new_urls = scrape_links(page, base_url="https://stackoverflow.com")
    urls.extend(new_urls)

### Exercise: Parallelize code to crawl Stack Overflow

Expand the sequential code that we saw above. Parallelize it with futures and `as_completed`.

In [None]:
from collections import deque
from dask.distributed import as_completed

In [None]:
%%time

urls = deque()
urls.append("https://stackoverflow.com/questions/tagged/dask")  # seed with a single page

all_tags = []
titles = []
url_futures = as_completed()
seen = set()
i = 0

# Keep going while there is work queued or in flight, up to 1000 pages
while (urls or not url_futures.is_empty()) and i < 1000:
    
    # TODO: If urls is empty, 
    #   get the next future from url_futures
    #   collect those new url results to the local notebook
    #   and add those new urls to urls

    url = urls.popleft()

    if url in seen:
        continue
    else:
        seen.add(url)
    
    print(".", end="")
    i += 1

    # This is like before
    # TODO: Submit this work to happen in parallel
    page = download(url, delay=0.25)
    tags = scrape_tags(page)
    title = scrape_title(page)
    
    all_tags.append(tags)
    titles.append(title)

    # We scrape links on this page, and add them to the list of URLs
    # TODO: Submit this work to happen in parallel.  Add the future to url_futures
    new_urls = scrape_links(page, base_url="https://stackoverflow.com")
    urls.extend(new_urls)

#### Solution

In [None]:
%%time

urls = deque()
urls.append("https://stackoverflow.com/questions/tagged/dask")  # seed with a single page

all_tags = []
titles = []
url_futures = as_completed()
seen = set()
i = 0

while (urls or not url_futures.is_empty()) and i < 1000:
    
    # If urls is empty,
    #   get the next future from url_futures
    #   collect those new url results to the local notebook
    #   and add those new urls to urls
    if not urls:
        future = url_futures.next()
        new_urls = future.result()
        urls.extend(new_urls)
        continue
    
    url = urls.popleft()
    
    if url in seen:
        continue
    else:
        seen.add(url)
    
    print(".", end="")
    i += 1

    # This is like before, but each step is submitted to run in parallel
    page = client.submit(download, url, delay=0.25)
    tags = client.submit(scrape_tags, page)
    title = client.submit(scrape_title, page)

    all_tags.append(tags)
    titles.append(title)
    
    # We scrape links on this page in parallel too, and add the future to url_futures
    new_urls = client.submit(scrape_links, page, base_url="https://stackoverflow.com")
    url_futures.add(new_urls)
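
The control flow of this loop can be tried offline with a small sketch. It is not the Dask solution: it uses the standard library's `concurrent.futures`, with `wait(..., return_when=FIRST_COMPLETED)` standing in for `as_completed`, and a made-up in-memory link graph (`LINKS`, `fake_scrape_links`) standing in for real pages.

```python
from collections import deque
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# Hypothetical link graph: page name -> pages it links to
LINKS = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def fake_scrape_links(page):
    return LINKS[page]


urls = deque(["a"])  # seed with a single page
seen = set()
pending = set()      # futures whose links have not been collected yet

with ThreadPoolExecutor(max_workers=4) as pool:
    while urls or pending:
        if not urls:
            # Nothing to submit: wait for at least one task, then queue its links
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                urls.extend(future.result())
            continue

        url = urls.popleft()
        if url in seen:
            continue  # don't visit the same page twice
        seen.add(url)
        pending.add(pool.submit(fake_scrape_links, url))

print(sorted(seen))
```

The loop ends only when there are no queued URLs and no tasks in flight, which mirrors the `urls or not url_futures.is_empty()` condition in the solution.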

### Analyze results

At this point `titles` and `all_tags` are likely lists of futures.  Let's gather them and analyze the results.

In [None]:
titles = client.gather(titles)

In [None]:
len(titles)

In [None]:
titles[:20]

In [None]:
all_tags = client.gather(all_tags)

In [None]:
import collections

tag_counter = collections.defaultdict(int)

for tags in all_tags:
    for tag in tags:
        tag_counter[tag] += 1
        
sorted(tag_counter.items(), key=lambda kv: kv[1], reverse=True)[:20]

## Clean up

In [None]:
cluster.shutdown()
client.close()

### Useful links

- [Dask tutorial: Futures](https://tutorial.dask.org/05_futures.html)
- [Futures documentation](https://docs.dask.org/en/latest/futures.html)
- [Futures screencast](https://www.youtube.com/watch?v=07EiCpdhtDE)
- [Futures examples](https://examples.dask.org/futures.html)

### More Dask Tutorials

Coiled also runs regular Dask tutorials.  See [coiled.io/tutorials](https://www.coiled.io/tutorials) for more information. 
