# Parallelize your Python code

In this lesson you will learn how to parallelize custom Python code using Dask. You will learn about the Futures and Delayed APIs and how to use them to parallelize custom functions.

The example we will tackle consists of scraping and cleaning data from the Stack Overflow website. But first, let's do a quick recap of Futures and Delayed objects.

## Futures and Delayed: low-level collections

Dask's low-level collections are the best tools when you need fine-grained control to build custom parallel and distributed computations.

**NOTE:** For an introductory lesson on futures and delayed revisit:
- https://tutorial.dask.org/03_dask.delayed.html
- https://tutorial.dask.org/05_futures.html


## Recap the Basics
### Futures

Submit arbitrary functions for computation in a parallelized, eager, and non-blocking way. 

The `futures` interface (derived from the built-in `concurrent.futures`) provides fine-grained, real-time execution for custom situations. We can submit an individual function for evaluation with one set of inputs using `submit()`, or evaluate it over a sequence of inputs using `map()`. The call returns immediately, giving one or more *futures*, whose status begins as "pending" and later becomes "finished". The local Python session is not blocked. With futures, the computation starts as soon as the inputs and compute resources are available.
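
Since the interface mirrors `concurrent.futures`, the submit/map pattern can be sketched with the standard library alone. The snippet below is only an analogy, not Dask code: it assumes a local thread pool instead of a Dask cluster, and the standard library's `map()` yields results directly, whereas Dask's `client.map()` returns a list of futures.

```python
from concurrent.futures import ThreadPoolExecutor


def inc(x):
    # Toy function standing in for real work
    return x + 1


with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() schedules one call and returns a Future immediately
    future = executor.submit(inc, 1)

    # map() schedules one call per input; results come back in input order
    mapped = list(executor.map(inc, range(8)))

    # result() blocks until the value is ready
    single = future.result()

print(single)
print(mapped)
```

The Dask cells that follow use the same two verbs, `submit` and `map`, but send the work to the workers managed by the `Client`.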

In [None]:
import dask
from dask.distributed import Client

In [None]:
client = Client(n_workers=4)
client

Let's make a toy function, `inc`, that sleeps for a while to simulate work. We'll then time running this function normally.

In [None]:
from time import sleep

def inc(x):
    sleep(1)
    return x + 1

We can run it locally:

In [None]:
inc(1)

**`client.submit()`**

Or we can submit it to run remotely with Dask. This immediately returns a future that points to the ongoing computation, and eventually to the stored result.

In [None]:
future = client.submit(inc, 1)  # returns immediately with pending future
future

If you wait a second, and then check on the future again, you’ll see that it has finished.

In [None]:
future

You can block on the computation and gather the result with the `.result()` method.

In [None]:
future.result()

**`client.map()`**

In [None]:
futures = client.map(inc, range(8))  # returns immediately with a list of pending futures
futures

In [None]:
future_sum = client.submit(sum, futures)
future_sum.result()

**NOTE:** For an introductory lesson on futures revisit:
- https://tutorial.dask.org/05_futures.html

**Useful links: futures**
* [Futures documentation](https://docs.dask.org/en/latest/futures.html)
* [Futures screencast](https://www.youtube.com/watch?v=07EiCpdhtDE)
* [Futures examples](https://examples.dask.org/futures.html)

### Delayed 

Like `futures`, `delayed` supports arbitrary task scheduling, but with `delayed` this happens **lazily**. This is the important difference between the two: `delayed` constructs a task graph that runs only when you ask for it, while `futures` are eager.


In [None]:
@dask.delayed
def inc(x):
    sleep(1)
    return x + 1

In [None]:
%%time
x = inc(1)
x

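This runs immediately, since nothing has really happened yet: the call only records the task.

To get the result, call `compute`. A single call takes about as long as the original function (roughly one second); the speedup appears when several independent tasks can run in parallel.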

In [None]:
x.compute()

The same example we did with `client.map()` would look like this using `delayed`:

In [None]:
%%time
results = []
for x in range(8):
    y = inc(x)
    results.append(y)
    
total = sum(results)

In [None]:
total.compute()

**NOTE:** For an introductory lesson on delayed revisit:
- https://tutorial.dask.org/03_dask.delayed.html

**Useful links: delayed**

* [Delayed documentation](https://docs.dask.org/en/latest/delayed.html)
* [Delayed screencast](https://www.youtube.com/watch?v=SHqFmynRxVU)
* [Delayed API](https://docs.dask.org/en/latest/delayed-api.html)
* [Delayed examples](https://examples.dask.org/delayed.html)
* [Delayed best practices](https://docs.dask.org/en/latest/delayed-best-practices.html)

## Grown-up example: Scraping and cleaning SO data

In the recap, as in plenty of introductory tutorials, we used toy examples. In this section, we graduate to a grown-up example. You will learn how to parallelize a scraping and cleaning workflow.

You'll be scraping multiple pages from https://stackoverflow.com/questions/ , cleaning the data you receive, and doing some computations with it. You'll first see how the sequential code works, and then you'll use `futures` and `delayed` to do this in parallel.

Note about throttling: 

Stack Exchange has a throttling limit.

> If an application does not have an access_token, then the application shares an IP based quota with all other applications on that IP. This quota is based on the key being passed by the applications; it is the max of the daily request limit for the applications involved, which by default is 10,000. This quota scheme is essentially unchanged from earlier versions of the API.

In the following code we will be working well within those limits, but if you want to explore more, keep in mind the limitations you'll run into: https://api.stackexchange.com/docs/throttle

In [None]:
import pandas as pd
import re
import requests
from requests_html import HTML

### Stack Overflow walkthrough

Before we start, let's quickly see what kind of data we are going to retrieve.

In [None]:
#Example url
base_url = "https://stackoverflow.com/questions/tagged/"
tag = "dask"
query_filter = "Newest"
url = f"{base_url}{tag}?tab={query_filter}"
url

In [None]:
r = requests.get(url)
html_str = r.text
html = HTML(html=html_str)


`.s-post-summary`: Base parent container for a post summary

https://stackoverflow.design/product/components/post-summary/

`html.find(".s-post-summary")` returns a list of 50 elements. Let's explore one of them.

In [None]:
html.find(".s-post-summary")[0].text

In [None]:
print(html.find(".s-post-summary")[0].text)

In [None]:
question = html.find(".s-post-summary")[0]

In [None]:
question.text

In [None]:
question.find(".s-post-summary--content-title", first=True).text

In [None]:
question.find(".s-post-summary--stats", first=True).text

In [None]:
question.find(".s-post-summary--meta-tags", first=True).text

## Scrape and clean 

With this information, we wrote some functions that get a page, clean the data from all the posts on that page, and return it as a list of dictionaries.

In [None]:
def extract_url_and_parse_html(url):
    """
    Given a SO url gets questions summary and cleans data.
    Returns a list of dictionaries with the clean data.
    
    see also: clean_scraped_data
    """
    # function that will parse a single page
    r = requests.get(url)
    
    if r.status_code not in range(200, 300):  # any 2xx status counts as success
        return []
    
    #get html
    html = HTML(html=r.text)
        
    questions = html.find(".s-post-summary")
    
    key_class_dict = {
        "title": ".s-post-summary--content-title",
        "stats": ".s-post-summary--stats",
        "tags": ".s-post-summary--meta-tags",
    }

    datas = []

    for q_el in questions:
        q_data = {}
        for k, v in key_class_dict.items():
            q_data[k] = clean_scraped_data(q_el.find(v, first=True).text, k)
        
        q_data["votes"] = q_data["stats"][0]
        q_data["answers"] = q_data["stats"][1]
        q_data["views"] = q_data["stats"][2]
        datas.append(q_data)
    return datas

In [None]:
def clean_scraped_data(text, keyname=None):
    """
    Cleans the scraped data once in text format
    """
    if keyname == "stats":
        # '12415 votes\n51 answers\n3.1m views' -> ['12415 votes', '51 answers', '3.1m views']
        # t.split()[0] grabs what is before the space and we apply the str_to_num func
        
        text = [str_to_num(t.split()[0]) for t in text.split("\n")]

    elif keyname == "tags":
        text  = text.split("\n")
        
    return text

In [None]:
def str_to_num(x):
    """
    Converts strings of the form '1.5k' or '3.1m' into numbers
    """
    
    mult = {'k': 1e3, 'm': 1e6}

    if ('k' in x) or ('m' in x):
        num = int(float(x[:-1]) * mult[x[-1]])
    else:
        num = int(float(x))
        
    return num
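
As a quick sanity check, the two cleaning helpers can be exercised on hand-written strings before touching the network. The sketch below repeats the helpers so that it runs on its own; the sample strings are assumptions about what a post summary looks like, not real scraped data.

```python
def str_to_num(x):
    """Convert strings such as '1.5k' or '3.1m' into integers."""
    mult = {"k": 1e3, "m": 1e6}
    if ("k" in x) or ("m" in x):
        return int(float(x[:-1]) * mult[x[-1]])
    return int(float(x))


def clean_scraped_data(text, keyname=None):
    """Clean one scraped field; same logic as the function above."""
    if keyname == "stats":
        # keep the number in front of each label and convert it
        text = [str_to_num(t.split()[0]) for t in text.split("\n")]
    elif keyname == "tags":
        text = text.split("\n")
    return text


# Hand-written sample strings (assumed shapes, not real scraped data)
stats = clean_scraped_data("12415 votes\n51 answers\n3.1m views", "stats")
tags = clean_scraped_data("python\ndask\npandas", "tags")
title = clean_scraped_data("How do I use Dask?", "title")

print(stats)
print(tags)
print(title)
```

Fields with any other key, such as `title`, pass through unchanged.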

## Let's scrape 

The following function will scrape a page with a specific `tag` and `query_filter` and return a `pandas.DataFrame`.

In [None]:
def scrape_tag(page_num, tag="dask", query_filter="Newest", pagesize=50):
    
    base_url = "https://stackoverflow.com/questions/tagged/"
    url = f"{base_url}{tag}?tab={query_filter}&page={page_num}&pagesize={pagesize}"
    
    datas_page = extract_url_and_parse_html(url)

    return pd.DataFrame(datas_page)

## Sequential

Let's get some data for the first 32 pages, where we have 50 posts per page. 

In [None]:
%%time
pd_res = []
for page in range(1, 33):
    pd_res.append(scrape_tag(page))

##pd_res

In [None]:
len(pd_res)

In [None]:
pd_res[0].head()

## Parallel
The sequential version took about 12 s. Let's see how long it takes if we do it in parallel using futures and delayed.

### `client.submit`

Notice that the code changes are minimal compared to the sequential version.

In [None]:
%%time
fut = []
for page_num in range(1, 33):
    future = client.submit(scrape_tag, page_num)
    fut.append(future)

res = client.gather(fut)

In [None]:
fut[0]

In [None]:
res[0].head(3)

### `client.map()`

We can do the same thing with `client.map()` in fewer lines of code. We first delete the futures in `fut`: Dask reuses the result of a task it has already computed for the same function and arguments, so without this step the timing below would not reflect a fresh run.

In [None]:
del fut

In [None]:
%%time
fut_map = client.map(scrape_tag, range(1, 33)) #this returns a list of futures
res_map = client.gather(fut_map)

In [None]:
res_map[0].head(3)

### `client.map()` with lambda functions

Let's suppose that we want to explore a different `tag`. We can use `client.map` along with a `lambda` function to pass the variable along. 

In [None]:
%%time
fut_py = client.map(lambda p: scrape_tag(p, tag="python"), range(1, 33))
res_py = client.gather(fut_py)

In [None]:
res_py[0].head(3)
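
An alternative to the `lambda` is `functools.partial`, which binds the keyword argument once and returns a callable that only needs the page number. The sketch below uses a hypothetical stand-in, `fake_scrape_tag`, and a standard-library thread pool in place of the Dask client, so that it runs without a cluster or network access.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def fake_scrape_tag(page_num, tag="dask"):
    # Hypothetical stand-in for scrape_tag: no network access, just a label
    return f"{tag}-page-{page_num}"


# Bind the keyword argument once; the result is a one-argument callable
scrape_python = partial(fake_scrape_tag, tag="python")

with ThreadPoolExecutor(max_workers=4) as executor:
    labels = list(executor.map(scrape_python, range(1, 4)))

print(labels)
```

With Dask, the same partial object could be passed to `client.map()`. `client.map()` also forwards extra keyword arguments to the function, so `client.map(scrape_tag, range(1, 33), tag="python")` should give the same result as the `lambda` version.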

## `delayed` 

Another approach to this problem is to use delayed. Let's create a separate function that is **exactly the same** as `scrape_tag`, but this time with the `@dask.delayed` decorator.

For simplicity, I will rename the function, but keep in mind that nothing else has changed.

In [None]:
@dask.delayed
def scrape_tag_delayed(page_num, tag="dask", query_filter="Newest", pagesize=50):
    
    base_url = "https://stackoverflow.com/questions/tagged/"
    url = f"{base_url}{tag}?tab={query_filter}&page={page_num}&pagesize={pagesize}"
    
    datas_page = extract_url_and_parse_html(url)

    return pd.DataFrame(datas_page)

In [None]:
%%time
res = []
for page in range(1, 33):
    res.append(scrape_tag_delayed(page))

Notice that nothing happened in the dashboard, and things ran very fast. This is because no computation has happened yet. Remember, `delayed` is **lazy**.

In [None]:
res[:3]

In [None]:
%%time
r = dask.compute(res)
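
Note that `dask.compute` always returns a tuple with one entry per argument, so passing the list itself gives back a 1-tuple wrapping the list of results. The sketch below shows one way to unpack it and combine the pages; `fake_page` is a hypothetical stand-in for `scrape_tag_delayed` with made-up rows, and the synchronous scheduler is used only so the snippet runs without a cluster.

```python
import dask
import pandas as pd


@dask.delayed
def fake_page(page_num):
    # Hypothetical stand-in for scrape_tag_delayed: two made-up rows per page
    return pd.DataFrame({"page": [page_num, page_num], "views": [10, 20]})


pages = [fake_page(p) for p in range(1, 4)]

# Passing the list itself returns a 1-tuple that wraps the list of results
(frames,) = dask.compute(pages, scheduler="synchronous")

# Stack the per-page DataFrames into a single one
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```

Calling `dask.compute(*pages)` instead would return a tuple with one DataFrame per page.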

## Waiting on Futures: `wait` and `as_completed` 

You can wait on a future or a collection of futures using the `wait` function, which blocks until all futures have finished or erred. This is useful when you need all the futures to be completed before proceeding with your computations.

In other situations you might need to iterate over the futures as they complete. To do so, use the `as_completed` function.


In [None]:
from dask.distributed import as_completed, wait

### `wait()`

Notice that the cell below does not block until the future finishes; it triggers the computation and lets you proceed.

In [None]:
fut_example = client.submit(scrape_tag, 100)
fut_example

If you want to block until the futures are done, use `wait`:

In [None]:
%%time
some_futures = client.map(scrape_tag, range(50, 90))
wait(some_futures)
some_futures[-1]

In [None]:
# clear all existing futures by restarting the workers
client.restart()

### `as_completed()`

The loop below processes each result as soon as its future finishes, instead of waiting for all of them.

In [None]:
%%time
futures = client.map(scrape_tag, range(1, 33))

tot_views = 0

for future in as_completed(futures):
    views = future.result()['views']
    tot_views += views.sum()

tot_views

### `from_delayed`

In our example we have a list of `futures` or `delayed` objects, where each element of the list is a `pandas.DataFrame`. This looks like a great opportunity to move from a list of `pandas.DataFrame` objects to a Dask DataFrame and exploit the advantages of the DataFrame API.


In [None]:
import dask.dataframe as dd

In [None]:
my_ddf = dd.from_delayed(futures)

In [None]:
my_ddf

In [None]:
my_ddf['views'].sum()

In [None]:
%%time
my_ddf['views'].sum().compute()

In [None]:
%%time
my_ddf.groupby('votes').views.sum().compute()

In [None]:
client.shutdown()

### Useful links

- https://tutorial.dask.org/05_futures.html
- https://tutorial.dask.org/03_dask.delayed.html


* [Futures documentation](https://docs.dask.org/en/latest/futures.html)
* [Futures screencast](https://www.youtube.com/watch?v=07EiCpdhtDE)
* [Futures examples](https://examples.dask.org/futures.html)
* [Delayed documentation](https://docs.dask.org/en/latest/delayed.html)
* [Delayed screencast](https://www.youtube.com/watch?v=SHqFmynRxVU)
* [Delayed API](https://docs.dask.org/en/latest/delayed-api.html)
* [Delayed examples](https://examples.dask.org/delayed.html)
* [Delayed best practices](https://docs.dask.org/en/latest/delayed-best-practices.html)

### Next lesson

In the next lesson, you will get better at Dask DataFrames. We will recap the basics, dive deeper into data formats (CSV vs Parquet), and learn about `pyarrow-strings`, shuffle operations, and other useful content that is not usually covered in the introductory material.
