<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">
     
# Parallelize your Python code

In this lesson you will learn how to parallelize custom Python code using Dask. You will learn about the Futures and Delayed APIs, and how to use them to parallelize custom functions.

The example we will tackle consists of scraping and cleaning some data from the Stack Overflow website. But first, let's do a quick recap of Futures and Delayed objects.

## Futures: a low-level collection.

Dask's low-level collections are the best tools when you need fine-grained control to build custom parallel and distributed computations.

**NOTE:** For an introductory lesson on futures revisit:
- https://tutorial.dask.org/05_futures.html


## Recap the Basics
### Futures

Submit arbitrary functions for computation in a parallelized, eager, and non-blocking way. 

The `futures` interface (derived from the built-in `concurrent.futures`) provides fine-grained, real-time execution for custom situations. We can submit an individual function for evaluation with one set of inputs using `submit()`, or evaluate it over a sequence of inputs using `map()`. The call returns immediately, giving one or more *futures*, whose status begins as "pending" and later becomes "finished". There is no blocking of the local Python session. With futures, the computation starts as soon as the inputs and compute resources are available.
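
Because the interface mirrors the built-in `concurrent.futures`, the same submit/map pattern can be tried with the standard library alone. The sketch below is an analogy rather than Dask code: it uses a `ThreadPoolExecutor` and a hypothetical `slow_inc` helper, and it runs in local threads instead of on a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep


def slow_inc(x):
    """Hypothetical stand-in for real work: wait briefly, then add one."""
    sleep(0.1)
    return x + 1


with ThreadPoolExecutor(max_workers=4) as pool:
    # submit() returns right away with a future for a single call
    single = pool.submit(slow_inc, 1)
    done_immediately = single.done()  # usually False: the work is still running

    # Executor.map() applies the function to every input; unlike Dask's
    # client.map(), it yields results rather than futures
    mapped = list(pool.map(slow_inc, range(4)))

    # result() blocks until the value is ready
    single_result = single.result()
```

Here `single_result` holds the value of `slow_inc(1)` and `mapped` holds one result per input, in input order.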

In [None]:
from dask.distributed import Client

In [None]:
client = Client(n_workers=4)
client

Let's make a toy function, `inc`, that sleeps for a while to simulate work. We'll then time running this function normally.

In [None]:
from time import sleep

def inc(x):
    sleep(1)
    return x + 1

We can run it locally:

In [None]:
inc(1)

**`client.submit()`**

Or we can submit them to run remotely with Dask. This immediately returns a future that points to the ongoing computation, and eventually to the stored result.

In [None]:
future = client.submit(inc, 1)  # returns immediately with pending future
future

If you wait a second, and then check on the future again, you’ll see that it has finished.

In [None]:
future

You can block on the computation and gather the result with the `.result()` method.

In [None]:
future.result()

**`client.map()`**

In [None]:
futures = client.map(inc, range(8))  # returns immediately with pending list of futures
futures

In [None]:
future_sum = client.submit(sum, futures)
future_sum.result()

**`as_completed()`**

In [None]:
from dask.distributed import as_completed
import numpy as np

In [None]:
def random_score(x):
    return np.random.uniform(low=0.5, high=10.0)

In [None]:
score_futures = client.map(random_score, range(20))

best = 0  # highest score seen so far
for future in as_completed(score_futures):
    score = future.result()
    if score > best:
        best = score

In [None]:
print(best)

**Useful links: futures**
* [Futures documentation](https://docs.dask.org/en/latest/futures.html)
* [Futures screencast](https://www.youtube.com/watch?v=07EiCpdhtDE)
* [Futures examples](https://examples.dask.org/futures.html)

## Grown-up example: Scraping and cleaning SO data

In the recap, as in plenty of introductory tutorials, we used toy examples. In this section, we graduate to a grown-up example. You will learn how to parallelize a scraping and cleaning workflow.

You'll scrape multiple pages from https://stackoverflow.com/questions/, then find the links to every post on each page, and finally get some data from each post. You'll first see how the sequential code works, and then you'll use `futures` to do this in parallel.

Note about throttling:

When scraping directly from the pages instead of using the API, the throttling limits are not clearly documented, but in our experience you run into them pretty quickly.

The following examples work as they are. If you change the number of pages, you will likely hit a limit and be banned for a few minutes. We will work around this towards the end; in the meantime, avoid changing the number of pages.

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
import time

## Scrape, crawl, get data

We wrote some functions that get a page, clean the data from all the posts on that page, and return it as a list of dictionaries.

In [None]:
def request_html_page(url):
    req = requests.get(url)
    html = bs(req.text, "html.parser")
    return html

In [None]:
def get_page_html_links(page_num, tag="dask", query_filter = "MostVotes"):
    base_url = "https://stackoverflow.com/questions/tagged/"
    
    page_url = f"{base_url}{tag}?sort={query_filter}&page={page_num}"

    page_html = request_html_page(page_url)
    
    return page_html

In [None]:
def get_post_links_per_page(html_page):
    question_href = html_page.find_all("a", class_="s-link")[3:-1]
    
    question_link = [f"https://stackoverflow.com{q['href']}" for q in question_href]
    
    return question_link

In [None]:
def get_data(post_link):
    html_post = request_html_page(post_link)
    post_info = {}
    
    
    post_info["title"] = html_post.title.text
    post_info["question"] = html_post.find("div", class_="s-prose js-post-body").text
    
    answ = html_post.find("div", class_="answer")  # this gets us the first (most voted) answer
    
    if answ:
        post_info["best_answer_votes"] = int(answ["data-score"])
    
        best_answer_author_obj = answ.find("span", itemprop="name")
        
        if best_answer_author_obj:
            best_answer_author = best_answer_author_obj.text
        else:
            best_answer_author = "comunity_post"

        post_info["best_answer_usrname"] = best_answer_author
    else:
        post_info["best_answer_votes"] = 0
        post_info["best_answer_usrname"] = "no-answer"

    
    return post_info

## Serial

In [None]:
%%time

df_list = []
for page_num in range(1, 3):  # more than 2 pages risks throttling issues
    page_html = get_page_html_links(page_num)
    posts_links = get_post_links_per_page(page_html)
    list_post_data = []
    
    for link in posts_links:
        p_data = get_data(link)
        list_post_data.append(p_data)

    df = pd.DataFrame(list_post_data)
    df_list.append(df)

In [None]:
df_list[0].head()

## Parallel

In [None]:
from dask.distributed import wait, as_completed

### Get pages and links of posts

In [None]:
%%time
pages_futures = client.map(get_page_html_links, range(1,3))
wait(pages_futures)

**`wait()`**

Notice that here we used `wait()`. You can wait on a future or a collection of futures using the `wait` function, which blocks until all futures are finished or have erred. This is useful when you need all the futures to be completed before proceeding with your computations.

**`as_completed()`**

In other situations you might need to iterate over the futures as they complete. To do so, use the `as_completed` function.

In [None]:
%%time

posts_links_futures = client.map(get_post_links_per_page, pages_futures)
crawling = as_completed(posts_links_futures)

dfs_data = []
for future in crawling:
    list_links = future.result() # list of links per page
    df_data = []
    for link in list_links:
        fut_data = client.submit(get_data, link) 
        df_data.append(fut_data)

    dfs_data.append(df_data)
_ = wait(dfs_data)
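
The difference between `wait()` and `as_completed()` can also be seen in a small standalone sketch. It uses the standard-library functions of the same names from `concurrent.futures` as an analogy for the Dask ones, and a hypothetical `slow_square` task whose delays are chosen so that tasks tend to finish in a different order than they were submitted.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed, wait
from time import sleep


def slow_square(x, delay):
    """Hypothetical task: sleep for `delay` seconds, then square x."""
    sleep(delay)
    return x * x


with ThreadPoolExecutor(max_workers=3) as pool:
    # The slowest task is submitted first, the fastest last
    jobs = [(1, 0.3), (2, 0.2), (3, 0.1)]
    futures = [pool.submit(slow_square, x, delay) for x, delay in jobs]

    # as_completed: handle each result as soon as it is ready
    arrival_order = [f.result() for f in as_completed(futures)]

    # wait: block until every future is finished, then read them in submission order
    done, not_done = wait(futures)
    submission_order = [f.result() for f in futures]
```

`submission_order` always follows the order of `jobs`, while `arrival_order` depends on which task finished first, so it will typically differ.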

## Avoid throttling using a cluster. 

In the example above, if you try to work with more pages, you will hit throttling issues. Sometimes you can avoid this problem by adding a sleep to the function that makes the requests, but in our case it is not clear how long that sleep should be, as the limits are not well documented. Besides, sleeping makes things slower.
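
As a rough illustration of the sleep approach, the sketch below pauses between consecutive calls. The helper name `fetch_politely`, the `delay_seconds` value, and the stand-in `fake_fetch` are assumptions made for illustration; the right delay for Stack Overflow is not known here.

```python
import time


def fetch_politely(fetch, urls, delay_seconds=1.0):
    """Call `fetch` on each URL, sleeping between consecutive calls.

    `fetch` is any callable that takes a URL (for example `request_html_page`).
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results


def fake_fetch(url):
    """Stand-in for a real request so the sketch runs without network access."""
    return f"<html>{url}</html>"


start = time.perf_counter()
pages = fetch_politely(fake_fetch, ["page-1", "page-2", "page-3"], delay_seconds=0.05)
elapsed = time.perf_counter() - start
```

The cost shows up in `elapsed`: three requests incur two pauses, which is why this approach slows the scrape down as the number of pages grows.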

However, there is a nice way to scrape more pages. When using a cluster, each worker has its own public IP address, so it is as if we were requesting from different machines.

Let's create a Coiled cluster and scrape a larger number of pages:

In [None]:
# Shut down the LocalCluster
client.shutdown()

In [None]:
import coiled

In [None]:
cluster = coiled.Cluster(n_workers=10, 
                        package_sync=True)

In [None]:
client = Client(cluster)
client

In [None]:
%%time
pages_10_futures = client.map(get_page_html_links, range(1,11))
wait(pages_10_futures)

posts_links_futures = client.map(get_post_links_per_page, pages_10_futures)
crawling = as_completed(posts_links_futures)

dfs_data = []
for future in crawling:
    list_links = future.result() # list of links per page
    df_data = []
    for link in list_links:
        fut_data = client.submit(get_data, link) 
        df_data.append(fut_data)

    dfs_data.append(df_data)
_ = wait(dfs_data)

In [None]:
len(dfs_data) #10 pages

At this point, we have the data to build a dataframe for each page.

In [None]:
dfs_data[0]

To get a dataframe per page, we can do:

In [None]:
df_futures = client.map(pd.DataFrame, dfs_data)

In [None]:
df_futures[0]

We can do multiple operations on these dataframes using `futures`, but since we are now working with dataframes, we can use `dask.dataframe`.

In [None]:
import dask.dataframe as dd

In [None]:
ddf_so = dd.from_delayed(df_futures)

In [None]:
ddf_so

### Dask dataframes API

Now that we are in dataframe world, we can do pandas-like operations. For example:

In [None]:
ddf_so.columns

We can check which users have the most "best answers":

In [None]:
ddf_so.best_answer_usrname.value_counts().compute()[:6]

We can also check how many votes these users got:

In [None]:
# Total votes per user, sorted from most to fewest
ddf_so.groupby("best_answer_usrname")["best_answer_votes"].sum().compute().sort_values(ascending=False)

## Exercise:

Modify the following code to get a different tag (e.g. `tag="python"`) and re-run the experiment.

```python
pages_futures = client.map(get_page_html_links, range(1,3))
wait(pages_futures)
```

Hint: You can use `client.map()` with `lambda` functions


In [None]:
###Solution
pages_py_futures = client.map(lambda p: get_page_html_links(p, tag="python"), range(1, 5))
wait(pages_py_futures)
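
A `lambda` is one option; `functools.partial` is another way to fix the `tag` argument before mapping. The sketch below uses a hypothetical `build_page_url` stand-in that only builds the URL string (no request is made) and the built-in `map` in place of `client.map`, so it runs without a cluster.

```python
from functools import partial


def build_page_url(page_num, tag="dask", query_filter="MostVotes"):
    """Hypothetical stand-in for get_page_html_links: returns the URL only."""
    base_url = "https://stackoverflow.com/questions/tagged/"
    return f"{base_url}{tag}?sort={query_filter}&page={page_num}"


# Bind tag="python" once; the result is a function of page_num alone
python_pages = partial(build_page_url, tag="python")

urls = list(map(python_pages, range(1, 3)))
```

With a running client, a `partial` built the same way from `get_page_html_links` should be usable as the first argument to `client.map`.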

### Useful links

* [Dask tutorial: Futures](https://tutorial.dask.org/05_futures.html)
* [Futures documentation](https://docs.dask.org/en/latest/futures.html)
* [Futures screencast](https://www.youtube.com/watch?v=07EiCpdhtDE)
* [Futures examples](https://examples.dask.org/futures.html)

### Next lesson

In the next lesson, you will get better at `dask.dataframe`. We will recap the basics, but also dive deeper into data formats (CSV vs. Parquet), learn about `pyarrow-strings`, shuffle operations, and other useful content that is not usually covered in the introductory material.
