<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">
     
# Parallelize your Python code

In this lesson you will learn how to parallelize custom Python code using Dask. You will learn about the Futures and Delayed APIs, and how to use them to parallelize custom functions.

The example we will tackle consists of scraping and cleaning some data from the Stack Overflow website. But first, let's do a quick recap on Futures and Delayed objects.

## Futures: a low-level collection.

Dask's low-level collections are the best tools when you need fine-grained control to build custom parallel and distributed computations.

**NOTE:** For an introductory lesson on futures revisit:
- https://tutorial.dask.org/05_futures.html


## Recap the Basics
### Futures

Submit arbitrary functions for computation in a parallelized, eager, and non-blocking way. 

The `futures` interface (derived from the built-in `concurrent.futures`) provides fine-grained, real-time execution for custom situations. We can submit an individual function for evaluation with one set of inputs using `submit()`, or evaluate it over a sequence of inputs using `map()`. The call returns immediately, giving one or more *futures*, whose status begins as "pending" and later becomes "finished". The local Python session is not blocked. With futures, the computation starts as soon as the inputs and compute resources are available.
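The pending-then-finished life cycle can be observed without a cluster by using the standard-library `concurrent.futures`, which the paragraph above names as the model for Dask's interface. This is only a minimal sketch, not Dask itself: `slow_inc` and the `release` event are illustrative names, and a thread pool stands in for Dask workers.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Illustrative only: the event lets us control when the task may finish.
release = threading.Event()

def slow_inc(x):
    release.wait()  # block until the main thread sets the event
    return x + 1

with ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(slow_inc, 1)  # returns immediately
    pending_before = not future.done()     # the task is still blocked

    release.set()                          # allow the task to finish
    result = future.result()               # blocks until the value is ready
    finished_after = future.done()         # the future is now finished

print(pending_before, result, finished_after)
```

The call to `submit` does not wait for `slow_inc`; only `result()` blocks, which is the same division of labor the Dask cells in this lesson rely on.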

In [1]:
from dask.distributed import Client

In [2]:
client = Client(n_workers=4)
client


Connection method: Cluster object, Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Workers: 4, Total threads: 8, Total memory: 16.00 GiB
Status: running, Using processes: True
Per worker: 2 threads, 4.00 GiB memory


Let's make a toy function, `inc`, that sleeps for a second to simulate work. We'll then run this function normally.

In [None]:
from time import sleep

def inc(x):
    sleep(1)
    return x + 1

We can run it locally:

In [None]:
inc(1)

**`client.submit()`**

Or we can submit it to run remotely with Dask. This immediately returns a future that points to the ongoing computation, and eventually to the stored result.

In [None]:
future = client.submit(inc, 1)  # returns immediately with pending future
future

If you wait a second, and then check on the future again, you’ll see that it has finished.

In [None]:
future

You can block on the computation and gather the result with the `.result()` method.

In [None]:
future.result()

**`client.map()`**

In [None]:
futures = client.map(inc, range(8))  # returns immediately with a list of pending futures
futures

In [None]:
future_sum = client.submit(sum, futures)
future_sum.result()

**Useful links: futures**
* [Futures documentation](https://docs.dask.org/en/latest/futures.html)
* [Futures screencast](https://www.youtube.com/watch?v=07EiCpdhtDE)
* [Futures examples](https://examples.dask.org/futures.html)

## Grown-up example: Scraping and cleaning Stack Overflow data

In the recap, as in plenty of introductory tutorials, we used toy examples. In this section, we graduate to a grown-up example. You will learn how to parallelize a scraping and cleaning workflow.

You'll scrape multiple pages from https://stackoverflow.com/questions/, find the links to every post on each page, and finally extract some data from each post. You'll first see how the sequential code works, and then you'll use `futures` to do this in parallel.

Note about throttling: 

Stack Exchange has a throttling limit.

> Every application is subject to an IP based concurrent request throttle. If a single IP is making more than 30 requests a second, new requests will be dropped. The exact ban period is subject to change, but will be on the order of 30 seconds to a few minutes typically. Note that exactly what response an application gets (in terms of HTTP code, text, and so on) is undefined when subject to this ban; we consider > 30 request/sec per IP to be very abusive and thus cut the requests off very harshly.

> If an application does not have an access_token, then the application shares an IP based quota with all other applications on that IP. This quota is based on the key being passed by the applications; it is the max of the daily request limit for the applications involved, which by default is 10,000. This quota scheme is essentially unchanged from earlier versions of the API.

The following code stays well within those limits by sleeping after each request. If you want to explore further, keep in mind the limits you'll run into: https://api.stackexchange.com/docs/throttle
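One way to reason about the sleep value is to compute a worst-case request rate. The sketch below assumes every worker thread sleeps for a fixed time after each request and treats the request itself as instantaneous, so it overestimates the real rate. `max_request_rate` is a hypothetical helper, not part of Dask or the Stack Exchange API, and the thread count of 8 is taken from the local cluster shown earlier.

```python
def max_request_rate(n_threads, sleep_seconds):
    """Upper bound on requests per second when each of `n_threads`
    sleeps `sleep_seconds` after every request (request time ignored)."""
    if sleep_seconds <= 0:
        raise ValueError("sleep_seconds must be positive")
    return n_threads / sleep_seconds

LIMIT_PER_SECOND = 30  # from the Stack Exchange throttle quoted above

# 8 threads (4 workers x 2 threads) sleeping 0.5 s after each request
rate = max_request_rate(n_threads=8, sleep_seconds=0.5)
within_limit = rate <= LIMIT_PER_SECOND

print(rate, within_limit)
```

This is a back-of-the-envelope bound only. A larger cluster or a shorter sleep changes the result, so it is worth recomputing before scaling up.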

In [12]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
import time

## Scrape, crawl, get data

We wrote some functions that fetch a page, clean the data from all the posts on that page, and return it as a list of dictionaries.

In [13]:
def request_html_page(url):
    """Fetch a URL and return its HTML parsed with BeautifulSoup."""
    req = requests.get(url)
    html = bs(req.text, "html.parser")

    # Sleep after each request to avoid being throttled. Requests started
    # failing when fetching 10 pages with Dask, probably because we exceeded
    # 30 requests per second; the right sleep value is still an open question.
    time.sleep(0.5)

    return html

In [14]:
def get_page_html_links(page_num, tag="dask", query_filter="MostVotes"):
    base_url = "https://stackoverflow.com/questions/tagged/"
    
    page_url = f"{base_url}{tag}?sort={query_filter}&page={page_num}"

    page_html = request_html_page(page_url)
    
    return page_html

In [15]:
def get_post_links_per_page(html_page):
    question_href = html_page.find_all("a", class_="s-link")[3:-1]
    
    question_link = [f"https://stackoverflow.com{q['href']}" for q in question_href]
    
    return question_link

In [16]:
def get_data(post_link):
    
    html_post = request_html_page(post_link)
    post_info = {}
    
    
    post_info["title"] = html_post.title.text
    post_info["question"] = html_post.find("div", class_="s-prose js-post-body").text
    
    # First (most-voted) answer; find() returns None if there is no match
    answ = html_post.find("div", class_="answer")
    
    # post_info["best_answer"] = answ.find("div", class_="s-prose js-post-body").text  # might not be needed
    post_info["best_answer_votes"] = int(answ["data-score"])
    
    best_answer_author_obj = answ.find("span", itemprop="name")
    
    if best_answer_author_obj:
        best_answer_author = best_answer_author_obj.text
    else:
        best_answer_author = "community_post"
    
    post_info["best_answer_usrname"] = best_answer_author
    
    return post_info

## Serial

In [17]:
%%time

df_list = []
for page_num in range(1, 3):
    page_html = get_page_html_links(page_num)
    posts_links = get_post_links_per_page(page_html)
    list_post_data = []
    
    for link in posts_links:
        p_data = get_data(link)
        list_post_data.append(p_data)

    df = pd.DataFrame(list_post_data)
    df_list.append(df)

CPU times: user 15.4 s, sys: 1.28 s, total: 16.6 s
Wall time: 58.4 s


In [None]:
df_list[0].head()

## Parallel

In [18]:
from dask.distributed import wait, as_completed

### Get pages and links of posts

In [19]:
%%time
pages_futures = client.map(get_page_html_links, range(1,5))
wait(pages_futures)

CPU times: user 94.1 ms, sys: 33 ms, total: 127 ms
Wall time: 893 ms


DoneAndNotDoneFutures(done={<Future: finished, type: bs4.BeautifulSoup, key: get_page_html_links-433c8ad887d19ff77cc58fc9f88a09f7>, <Future: finished, type: bs4.BeautifulSoup, key: get_page_html_links-80b866616107c2b2e21c538f74960d23>, <Future: finished, type: bs4.BeautifulSoup, key: get_page_html_links-f88590512862e1e2b52a94095f93fd6f>, <Future: finished, type: bs4.BeautifulSoup, key: get_page_html_links-4c5bc0256e725d0fbb1982dbeaec5ed8>}, not_done=set())

**`wait()`**

Notice that we used `wait()` here. You can wait on a future or a collection of futures using the `wait` function, which blocks until all futures are finished or have erred. This is useful when you need all the futures to be completed before proceeding with your computations. 

**`as_completed()`**

In other situations you might need to iterate over the futures as they complete. To do so, use the `as_completed` function.
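The difference between the two functions can be tried out without a cluster, because the standard-library `concurrent.futures` module provides functions with the same names. The sketch below is an illustration under that assumption, not Dask code: `slow_square` is a made-up task, and the module is imported as `cf` so that it does not shadow the Dask imports used in this lesson.

```python
import time
import concurrent.futures as cf

def slow_square(x):
    time.sleep(0.05 * x)  # larger inputs take longer
    return x * x

with cf.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(slow_square, x) for x in (3, 1, 2)]

    # as_completed: handle each result as soon as its future finishes
    seen = [f.result() for f in cf.as_completed(futures)]

    # wait: block until every future is done, then inspect the two sets
    done, not_done = cf.wait(futures)

print(sorted(seen), len(done), len(not_done))
```

`as_completed` suits pipelines where each result triggers more work, as in the crawling step of this lesson, while `wait` suits steps that need everything finished first.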

In [None]:
%%time

posts_links_futures = client.map(get_post_links_per_page, pages_futures)
crawling = as_completed(posts_links_futures)

dfs_data = []
for future in crawling:
    list_links = future.result() # list of links per page
    df_data = []
    for link in list_links:
        fut_data = client.submit(get_data, link) 
        df_data.append(fut_data)

    dfs_data.append(df_data)
_ = wait(dfs_data)

Key:       get_data-800c4edc9996df3101cbd4a1e24f3f7c
Function:  get_data
args:      ('https://stackoverflow.com/questions/52374936/xarray-dask-limiting-the-number-of-threads-cpus')
kwargs:    {}
Exception: 'TypeError("\'NoneType\' object is not subscriptable")'

Key:       get_data-90099324ebfeb0e990f5c496b4004b3c
Function:  get_data
args:      ('https://stackoverflow.com/questions/50569171/how-do-i-find-the-length-of-a-dataframe-in-dask')
kwargs:    {}
Exception: 'AttributeError("\'NoneType\' object has no attribute \'text\'")'

Key:       get_data-e54752318527dae568662431776af9c4
Function:  get_data
args:      ('https://stackoverflow.com/questions/50809462/sorting-in-dask')
kwargs:    {}
Exception: 'AttributeError("\'NoneType\' object has no attribute \'text\'")'

Key:       get_data-0a78ffdfa2dfde5226f3ba1da59169db
Function:  get_data
args:      ('https://stackoverflow.com/questions/37444943/dask-array-from-dataframe')
kwargs:    {}
Exception: 'AttributeError("\'NoneType\' object ha
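Some `get_data` tasks in the output above failed because a post lacked an element the parser expects. One possible way to keep going is to separate the futures that raised from those that succeeded before building dataframes. The sketch below uses the standard-library `concurrent.futures` with a hypothetical `parse` function standing in for `get_data`. Dask futures also offer an `exception()` method, but check the Dask documentation before assuming identical behavior.

```python
from concurrent.futures import ThreadPoolExecutor

def parse(record):
    # Hypothetical stand-in for get_data: fails when a field is missing (None)
    return {"title": record["title"], "votes": int(record["votes"])}

records = [
    {"title": "a", "votes": "3"},
    {"title": "b", "votes": None},  # int(None) raises TypeError
    {"title": "c", "votes": "7"},
]

with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(parse, r) for r in records]

# exception() waits for the future and returns None when the task succeeded
good = [f.result() for f in futures if f.exception() is None]
failed = [type(f.exception()).__name__ for f in futures if f.exception() is not None]

print(good, failed)
```

Filtering this way keeps one bad post from breaking the dataframe step, at the cost of silently dropping rows, so it is worth logging what was skipped.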

At this point, we have the data to build each page dataframe. 

In [None]:
dfs_data[0][:3]

To get a dataframe per page, we can do:

In [None]:
df_futures = client.map(pd.DataFrame, dfs_data)

In [None]:
df_futures[0]

We can do multiple operations on these dataframes using `futures`, but since we are now working with dataframes we can use `dask.dataframe`. 

In [None]:
import dask.dataframe as dd  # needed for dd.from_delayed

ddf_so = dd.from_delayed(df_futures)

In [None]:
ddf_so

### Dask dataframes API

Now that we are in dataframe world, we can do pandas-like operations. For example:

In [None]:
ddf_so.columns

We can check which of the users who wrote a best answer has the most "best answers":

In [None]:
ddf_so.best_answer_usrname.value_counts().compute()[:6]

We can also check how many votes these users got:

In [None]:
ddf_so.groupby("best_answer_usrname")['best_answer_votes'].sum().compute()

## Exercise:

Modify the following code to get a different tag (e.g. `tag="python"`) and re-run the experiment.

```python
pages_futures = client.map(get_page_html_links, range(1, 5))
wait(pages_futures)
```

Hint: You can use `client.map()` with `lambda` functions


In [None]:
# Solution
pages_futures = client.map(lambda p: get_page_html_links(p, tag="python"), range(1, 5))
wait(pages_futures)
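An alternative to the `lambda` in the solution is `functools.partial`, which binds the keyword argument and yields a reusable callable. The sketch below uses a hypothetical `build_page_url` that only mirrors the URL construction in `get_page_html_links`, so it runs without network access or a cluster. The built-in `map` stands in for `client.map`.

```python
from functools import partial

def build_page_url(page_num, tag="dask", query_filter="MostVotes"):
    # Hypothetical helper mirroring the URL logic of get_page_html_links
    base_url = "https://stackoverflow.com/questions/tagged/"
    return f"{base_url}{tag}?sort={query_filter}&page={page_num}"

# Bind tag="python" once; the result takes only the page number
python_pages = partial(build_page_url, tag="python")

urls = list(map(python_pages, range(1, 3)))
print(urls)
```

In the notebook, the same idea would presumably read `client.map(partial(get_page_html_links, tag="python"), range(1, 5))`.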

### Useful links

* [Dask tutorial: Futures](https://tutorial.dask.org/05_futures.html)
* [Futures documentation](https://docs.dask.org/en/latest/futures.html)
* [Futures screencast](https://www.youtube.com/watch?v=07EiCpdhtDE)
* [Futures examples](https://examples.dask.org/futures.html)

### Next lesson

In the next lesson, you will get better at `dask.dataframe`. We will recap the basics, but dive deeper into data formats (CSV vs. Parquet), learn about `pyarrow-strings`, shuffle operations, and other useful content that is not usually covered in introductory material.
