# Chapter 3 - Why isn’t distributed computing built into my favorite language? 

## Language extensions for distributed computing

Paul E. Anderson

## Ice Breaker

If you were stranded on a desert island, what two items would you bring with you?

While this text can be viewed as a PDF, it is most useful in a Jupyter environment. I have an environment ready for each of you, but you can set up your own local environment in several ways. One popular way is with Anaconda (<a href="https://www.anaconda.com/">https://www.anaconda.com/</a>). Because of the limited time, you can use my server.

In [25]:
%load_ext autoreload
%autoreload 2
    
import os
from pathlib import Path
home = str(Path.home())

import pandas as pd

## Motivation: What about programming Languages?
For the sake of discussion, let's focus on imperative programming languages such as Python and Java. These languages let you specify your program algorithmically, step by step. We're comfortable and familiar with this paradigm. There are other approaches, of course, such as functional and logic programming. Most of our languages are designed with software engineers in mind. I'm oversimplifying the world here, but our main programming languages are used to build software.

Consider the Python programming language: Python has a multiprocessing library (https://docs.python.org/2/library/multiprocessing.html). So why isn't that good enough? Among other things, this library does not efficiently solve these common problems of distributed computing:
* Running the same code on more than one machine.
* Building microservices and actors that have state and can communicate.
* Gracefully handling machine failures.
* Efficiently handling large objects and numerical data.

There are languages such as Julia that are gaining popularity. These languages provide better support for **distributed computing at a lower level of abstraction** than we need in this course. Still there are arguments for their adoption. We will not unpack those arguments at the moment. We are building towards domain specific tools such as:
* TensorFlow for neural network training
* Spark for data processing and SQL
* Flink for stream processing

These provide **higher level abstractions** for neural networks, datasets, and streams. One of the biggest hurdles to their adoption is that they often require rewriting a lot of our code and rethinking how we compose solutions. This leads to an obvious question:

What about something in between? **Can't we just improve our favorite languages?**

We will consider two popular distributed computing platforms for Python: Ray and Dask.

## Python Library 1: Ray
Ray is a Python library that translates the traditional ideas of functions and classes into their distributed counterparts: **tasks** and **actors**. We've already seen the idea of tasks with our GNU Parallel work. Ray is designed to take similar ideas and bring them into Python as easily as possible.

### Starting Ray

In [2]:
import ray
ray.init()

2021-10-06 09:03:29,718	INFO services.py:1092 -- View the Ray dashboard at http://127.0.0.1:8265


{'node_ip_address': '129.65.17.223',
 'raylet_ip_address': '129.65.17.223',
 'redis_address': '129.65.17.223:6379',
 'object_store_address': '/tmp/ray/session_2021-10-06_09-03-29_261391_16049/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2021-10-06_09-03-29_261391_16049/sockets/raylet',
 'webui_url': '127.0.0.1:8265',
 'session_dir': '/tmp/ray/session_2021-10-06_09-03-29_261391_16049',
 'metrics_export_port': 55995,
 'node_id': '5e84d48289fb04962e31f0a31446a4a893f00f6f'}

If you were connecting to a cluster, then you would pass additional arguments with connection specifics.

`ray.init()` does the following:
* Starts a number of worker processes for executing Python calls in parallel (approximately one per core in our environment).
* Starts a scheduler process for assigning tasks to workers.
* Starts a shared-memory object store for sharing objects without making copies.
* Starts an in-memory database for storing the metadata needed to rerun tasks in the event of failure.

A **task** is the unit of work scheduled by Ray and corresponds to a function call.

Ray workers are separate processes, as opposed to threads.

### Simple parallel example
All we need is a little decoration.

**Detour:** <a href="https://realpython.com/primer-on-python-decorators/">For more information on decorators in Python</a>
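To make the idea of a decorator concrete, here is a toy version written only for illustration; it is not how Ray is implemented. The names `toy_remote` and `_pool` are hypothetical, and a thread pool from the standard library stands in for Ray's worker processes. It shows the one thing a decorator does: take a function and hand back an augmented version of it.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a pool of workers (Ray uses separate processes).
_pool = ThreadPoolExecutor(max_workers=4)

def toy_remote(func):
    """Toy decorator: attach a .remote() method that runs func in the pool."""
    def remote(*args, **kwargs):
        # submit() returns a Future immediately instead of the result.
        return _pool.submit(func, *args, **kwargs)
    func.remote = remote
    return func

@toy_remote
def double(x):
    return 2 * x

future = double.remote(21)   # schedules the call and returns right away
print(future.result())       # blocks until the value is ready: 42
```

Writing `@toy_remote` above `def double` is shorthand for `double = toy_remote(double)`.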

In [3]:
import time

In [4]:
@ray.remote
def f(x):
    time.sleep(1)
    return x

A common pattern in distributed computing is that we immediately get back a handle to the eventual output. This is because we don't want our main program to stop and wait unless we explicitly ask it to. In most settings, we set up the distributed computation first and then decide for ourselves when to fetch the data. In Ray, we get back futures (https://en.wikipedia.org/wiki/Futures_and_promises).
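The same pattern exists in Python's standard library, which makes it easy to try without a cluster. The sketch below uses `concurrent.futures` rather than Ray, so the objects are not Ray futures, but the shape is analogous: submitting work returns a handle at once, and asking for the result is the explicit waiting step. The function `slow_identity` is a made-up placeholder for real work.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

def slow_identity(x):
    time.sleep(0.2)   # pretend this is expensive work
    return x

with ThreadPoolExecutor(max_workers=4) as pool:
    start = time.perf_counter()
    # submit() hands back a Future right away; the work runs in the background.
    futures = [pool.submit(slow_identity, i) for i in range(4)]
    submit_seconds = time.perf_counter() - start

    # result() is the explicit "stop and wait" step.
    results = [f.result() for f in futures]
    total_seconds = time.perf_counter() - start

print(f"submitting took {submit_seconds:.3f}s, collecting took {total_seconds:.3f}s")
print(results)
```

Submitting should take almost no time, while collecting the results takes roughly as long as one call to `slow_identity`.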

### Start 4 tasks in parallel.

In [5]:
result_ids = []
for i in range(4):
    result_ids.append(f.remote(i))

### Wait for the tasks to complete and retrieve the results.
**With at least 4 cores, this takes about 1 second instead of 4.**

In [6]:
results = ray.get(result_ids)  # [0, 1, 2, 3]
results

[0, 1, 2, 3]

### Class exercise
Using Ray, create a simple parallel search for prime numbers. Take 5-10 minutes to design and test your solution. We'll run each solution for 30 seconds to see whose solution finds the most prime numbers. Here is naive code to test for prime numbers:

In [7]:
def is_prime(n):
    if n <= 1:
        return False
    for i in range(2,n):
        if n % i == 0:
            return False
    return True

In [8]:
is_prime(6),is_prime(7)

(False, True)

In [9]:
import signal

class TimeoutException(Exception):   # Custom exception class
    pass

def timeout_handler(signum, frame):   # Custom signal handler
    raise TimeoutException
    
# Your solution/code here!
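One possible shape for a solution, offered as a sketch and not as the intended answer: a function `find(start)` that collects primes from `start` upward until it runs out of time. The `budget_seconds` parameter and the deadline check are assumptions made for illustration, standing in for the signal-based timeout set up above. In the notebook the function would also need the `@ray.remote` decorator for calls like `find.remote(3)` to work; it is left off here so the sketch runs without Ray.

```python
import time

def is_prime(n):
    """Naive primality test, same as the version above."""
    if n <= 1:
        return False
    for i in range(2, n):
        if n % i == 0:
            return False
    return True

def find(start, budget_seconds=0.5):
    """Collect primes from `start` upward until the time budget is spent."""
    deadline = time.monotonic() + budget_seconds
    primes = []
    n = start
    while time.monotonic() < deadline:
        if is_prime(n):
            primes.append(n)
        n += 1
    return primes

primes = find(3, budget_seconds=0.5)
print(len(primes), primes[:5])
```

Starting several such searches at different offsets gives each worker its own stretch of the number line, and the lists they return can be merged into one set afterwards.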

In [10]:
result_ids = [find.remote(3)]
result_ids

[ObjectRef(88866c7daffdd00effffffff0100000001000000)]

In [11]:
import itertools
import numpy as np
all_results = set(itertools.chain(*ray.get(result_ids)))

In [12]:
max(all_results)

19457

In [13]:
len(all_results)

2205

In [14]:
result_ids = [find.remote(3),find.remote(19428)]
result_ids

[ObjectRef(d251967856448cebffffffff0100000001000000),
 ObjectRef(3bf0c856ace5a4d8ffffffff0100000001000000)]

In [15]:
import itertools
import numpy as np
all_results = set(itertools.chain(*ray.get(result_ids)))

In [16]:
max(all_results)

27961

In [17]:
len(all_results)

3051

### Details (and order) make a big difference
Consider the problem of adding a large set of integers together. If we structure our aggregation incorrectly, we don't see the benefits of our distributed environment. 
<img src="https://miro.medium.com/max/1400/1*vHz3troEmr4uLns0V8VmdA.jpeg">
* Both executions produce the same result, but the one on the right shows how to turn a linear number of sequential steps into a logarithmic one!

**Slow approach in action**

In [18]:
@ray.remote
def add(x, y):
    time.sleep(1)
    return x + y

values = [1, 2, 3, 4, 5, 6, 7, 8]
while len(values) > 1:
    #import pdb; pdb.set_trace()
    values = [add.remote(values[0], values[1])] + values[2:]
result = ray.get(values[0])
result

36

**STOP and THINK:** This code produces the execution on the left (figure above). How would you change it to work like the execution on the right?

**Fast approach in action**

In [19]:
values = [1, 2, 3, 4, 5, 6, 7, 8]
while len(values) > 1:
    values = values[2:] + [add.remote(values[0], values[1])] # remove 2 and put at end
result = ray.get(values[0])
result

36
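To see why the order matters, we can track when each value becomes available, without using Ray at all. The sketch below assumes every `add` takes one time step and that an idle worker is always available; `finish_time` is a hypothetical helper written for this illustration.

```python
def finish_time(n_values, append_to_end):
    """Time step at which the final sum is ready, given unlimited workers."""
    # ready[i] is the time step at which the i-th pending value is available.
    ready = [0] * n_values
    while len(ready) > 1:
        # An add can start once both inputs are ready and takes one step.
        done = max(ready[0], ready[1]) + 1
        if append_to_end:
            ready = ready[2:] + [done]      # fast approach
        else:
            ready = [done] + ready[2:]      # slow approach
    return ready[0]

print(finish_time(8, append_to_end=False))  # 7
print(finish_time(8, append_to_end=True))   # 3
```

Both versions perform seven additions. Only the longest chain of dependent additions changes: 7 in the slow approach versus 3 (that is, log2 of 8) in the fast one.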

## Python Library 2: Dask (and Pandas)

One of the most widely used tools for processing structured data is the Python library Pandas. We won't cover all of Pandas or Dask in this chapter, but we will cover some very common use cases of the two; hopefully enough for you to know **when Pandas+Dask might be appropriate for your problem.**

### Pandas in 10 minutes
There are many resources out there for this purpose. This section is inspired by: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html

In [20]:
import pandas as pd

**pd.Series**
* a list with more power
* elements have an integer 0-based index and a named index
* a series has a name

In [21]:
s1 = pd.Series(["A","B","C",3.14],index=["i","ii","iii","iv"],name="first_example")
s1

i         A
ii        B
iii       C
iv     3.14
Name: first_example, dtype: object

**You can access elements either by 0-based position (iloc) or by named index (loc)**

In [22]:
s1.loc["ii"],s1.iloc[1]

('B', 'B')
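The distinction matters most when the labels are themselves integers that do not line up with the positions. A small made-up Series shows the difference:

```python
import pandas as pd

# Integer labels that do not match the positions.
s = pd.Series(["a", "b", "c"], index=[2, 0, 1])

print(s.loc[0])    # label 0    -> "b"
print(s.iloc[0])   # position 0 -> "a"
```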

But a lot of other things still work as expected:

In [23]:
len(s1),s1.shape

(4, (4,))

So the main concept (at this moment) is that Pandas provides extended functionality for common things such as lists and dictionaries. We'll see this in more detail as we examine **pd.DataFrame** by loading COVID-19 data.

> The data set contains daily reports of Covid-19 cases and deaths in countries worldwide. The data also shows the country’s population and the number of cases per 100,000 people on a rolling 14 day average.

https://corgis-edu.github.io/corgis/python/covid/

In [27]:
import sys
sys.path.insert(0,f'{home}/csc-369-student/data/covid')
import covid

**Let's see how it looks**

In [28]:
report = covid.get_report()
report

[{'Date': {'Day': 5, 'Month': 11, 'Year': 2020},
  'Data': {'Cases': 121,
   'Deaths': 6,
   'Population': 38041757,
   'Rate': 3.74588377},
  'Location': {'Country': 'Afghanistan', 'Code': 'AFG', 'Continent': 'Asia'}},
 {'Date': {'Day': 4, 'Month': 11, 'Year': 2020},
  'Data': {'Cases': 86,
   'Deaths': 4,
   'Population': 38041757,
   'Rate': 3.78268543},
  'Location': {'Country': 'Afghanistan', 'Code': 'AFG', 'Continent': 'Asia'}},
 {'Date': {'Day': 3, 'Month': 11, 'Year': 2020},
  'Data': {'Cases': 95,
   'Deaths': 3,
   'Population': 38041757,
   'Rate': 3.78794281},
  'Location': {'Country': 'Afghanistan', 'Code': 'AFG', 'Continent': 'Asia'}},
 {'Date': {'Day': 2, 'Month': 11, 'Year': 2020},
  'Data': {'Cases': 132,
   'Deaths': 5,
   'Population': 38041757,
   'Rate': 3.76691329},
  'Location': {'Country': 'Afghanistan', 'Code': 'AFG', 'Continent': 'Asia'}},
 {'Date': {'Day': 1, 'Month': 11, 'Year': 2020},
  'Data': {'Cases': 76,
   'Deaths': 0,
   'Population': 38041757,
   'Ra

That's not great. It's a list of nested dictionaries that is very machine readable, but not easy for us to ask questions of. Let's throw it into a DataFrame and see what happens.

In [29]:
report_df = pd.DataFrame(report)
report_df

Unnamed: 0,Date,Data,Location
0,"{'Day': 5, 'Month': 11, 'Year': 2020}","{'Cases': 121, 'Deaths': 6, 'Population': 3804...","{'Country': 'Afghanistan', 'Code': 'AFG', 'Con..."
1,"{'Day': 4, 'Month': 11, 'Year': 2020}","{'Cases': 86, 'Deaths': 4, 'Population': 38041...","{'Country': 'Afghanistan', 'Code': 'AFG', 'Con..."
2,"{'Day': 3, 'Month': 11, 'Year': 2020}","{'Cases': 95, 'Deaths': 3, 'Population': 38041...","{'Country': 'Afghanistan', 'Code': 'AFG', 'Con..."
3,"{'Day': 2, 'Month': 11, 'Year': 2020}","{'Cases': 132, 'Deaths': 5, 'Population': 3804...","{'Country': 'Afghanistan', 'Code': 'AFG', 'Con..."
4,"{'Day': 1, 'Month': 11, 'Year': 2020}","{'Cases': 76, 'Deaths': 0, 'Population': 38041...","{'Country': 'Afghanistan', 'Code': 'AFG', 'Con..."
...,...,...,...
53585,"{'Day': 25, 'Month': 3, 'Year': 2020}","{'Cases': 0, 'Deaths': 0, 'Population': 146454...","{'Country': 'Zimbabwe', 'Code': 'ZWE', 'Contin..."
53586,"{'Day': 24, 'Month': 3, 'Year': 2020}","{'Cases': 0, 'Deaths': 1, 'Population': 146454...","{'Country': 'Zimbabwe', 'Code': 'ZWE', 'Contin..."
53587,"{'Day': 23, 'Month': 3, 'Year': 2020}","{'Cases': 0, 'Deaths': 0, 'Population': 146454...","{'Country': 'Zimbabwe', 'Code': 'ZWE', 'Contin..."
53588,"{'Day': 22, 'Month': 3, 'Year': 2020}","{'Cases': 1, 'Deaths': 0, 'Population': 146454...","{'Country': 'Zimbabwe', 'Code': 'ZWE', 'Contin..."


That is not bad, but there are still dictionaries inside each cell of this DataFrame.
* A DataFrame is a table-like data structure.
* A DataFrame has column names (e.g., report_df.columns) and can be indexed using iloc or loc.
* You can also access columns by name (e.g., report_df['Date']).

Let's try to process one of those columns.

In [30]:
report_df['Date'] # By name

0        {'Day': 5, 'Month': 11, 'Year': 2020}
1        {'Day': 4, 'Month': 11, 'Year': 2020}
2        {'Day': 3, 'Month': 11, 'Year': 2020}
3        {'Day': 2, 'Month': 11, 'Year': 2020}
4        {'Day': 1, 'Month': 11, 'Year': 2020}
                         ...                  
53585    {'Day': 25, 'Month': 3, 'Year': 2020}
53586    {'Day': 24, 'Month': 3, 'Year': 2020}
53587    {'Day': 23, 'Month': 3, 'Year': 2020}
53588    {'Day': 22, 'Month': 3, 'Year': 2020}
53589    {'Day': 21, 'Month': 3, 'Year': 2020}
Name: Date, Length: 53590, dtype: object

In [31]:
type(report_df['Date'])

pandas.core.series.Series

So now we can say a DataFrame is made up of a collection of pd.Series objects.

Because each column is a Series made up of individual dictionaries, let's throw each column into its own DataFrame and see what happens. Notice that I first convert the Series to a plain Python list.

In [32]:
date_df = pd.DataFrame(list(report_df['Date']))
date_df

Unnamed: 0,Day,Month,Year
0,5,11,2020
1,4,11,2020
2,3,11,2020
3,2,11,2020
4,1,11,2020
...,...,...,...
53585,25,3,2020
53586,24,3,2020
53587,23,3,2020
53588,22,3,2020


This looks much better. We can now see clearly what data we have. (Trust me. We are getting to distributed computing.) But first, let's process the ``Data`` and ``Location`` columns the same way.

In [33]:
data_df = pd.DataFrame(list(report_df['Data']))
location_df = pd.DataFrame(list(report_df['Location']))
data_df

Unnamed: 0,Cases,Deaths,Population,Rate
0,121,6,38041757,3.745884
1,86,4,38041757,3.782685
2,95,3,38041757,3.787943
3,132,5,38041757,3.766913
4,76,0,38041757,3.575019
...,...,...,...,...
53585,0,0,14645473,0.000000
53586,0,1,14645473,0.000000
53587,0,0,14645473,0.000000
53588,1,0,14645473,0.000000


In [34]:
location_df

Unnamed: 0,Country,Code,Continent
0,Afghanistan,AFG,Asia
1,Afghanistan,AFG,Asia
2,Afghanistan,AFG,Asia
3,Afghanistan,AFG,Asia
4,Afghanistan,AFG,Asia
...,...,...,...
53585,Zimbabwe,ZWE,Africa
53586,Zimbabwe,ZWE,Africa
53587,Zimbabwe,ZWE,Africa
53588,Zimbabwe,ZWE,Africa


Finally, we can join all of this together:

In [35]:
report_df2 = date_df.join(data_df).join(location_df)
report_df2

Unnamed: 0,Day,Month,Year,Cases,Deaths,Population,Rate,Country,Code,Continent
0,5,11,2020,121,6,38041757,3.745884,Afghanistan,AFG,Asia
1,4,11,2020,86,4,38041757,3.782685,Afghanistan,AFG,Asia
2,3,11,2020,95,3,38041757,3.787943,Afghanistan,AFG,Asia
3,2,11,2020,132,5,38041757,3.766913,Afghanistan,AFG,Asia
4,1,11,2020,76,0,38041757,3.575019,Afghanistan,AFG,Asia
...,...,...,...,...,...,...,...,...,...,...
53585,25,3,2020,0,0,14645473,0.000000,Zimbabwe,ZWE,Africa
53586,24,3,2020,0,1,14645473,0.000000,Zimbabwe,ZWE,Africa
53587,23,3,2020,0,0,14645473,0.000000,Zimbabwe,ZWE,Africa
53588,22,3,2020,1,0,14645473,0.000000,Zimbabwe,ZWE,Africa
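As an aside, pandas also provides `pd.json_normalize`, which flattens nested dictionaries in one step. The sketch below is an alternative to the column-by-column approach used above, not a replacement for it; the two records are abbreviated copies of the first entries in the report, trimmed to a few fields.

```python
import pandas as pd

# Two records shaped like the entries returned by covid.get_report().
records = [
    {"Date": {"Day": 5, "Month": 11, "Year": 2020},
     "Data": {"Cases": 121, "Deaths": 6},
     "Location": {"Country": "Afghanistan", "Code": "AFG"}},
    {"Date": {"Day": 4, "Month": 11, "Year": 2020},
     "Data": {"Cases": 86, "Deaths": 4},
     "Location": {"Country": "Afghanistan", "Code": "AFG"}},
]

# Nested keys become column names joined by a dot, e.g. "Data.Cases".
flat = pd.json_normalize(records)
print(sorted(flat.columns))
print(flat["Data.Cases"].tolist())   # [121, 86]
```

Note that the resulting column names carry their parent key as a prefix (`Date.Day`, `Location.Country`), unlike the joined DataFrame above.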


There is a lot more to show about Pandas and related data science tools, but let's consider the following:
* This sample dataset has 50,000+ rows. 
* How large could this dataset be if we approached collecting all of the data in the world? 
* Could that fit into your computer's memory?
* Even if it could fit into your computer's memory, processing it with pure Python would make use of only a single core unless you did some major coding with the multiprocessing library.

**And here comes the Dask library**

### Dask DataFrame
* Implements a blocked parallel DataFrame object that mimics a large subset of the Pandas DataFrame API. 
* One Dask DataFrame is composed of many in-memory pandas DataFrames separated along the index.
* One operation on a Dask DataFrame triggers many pandas operations on the constituent pandas DataFrames.
* This is done to exploit potential parallelism and to work within memory constraints.
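The idea behind these bullets can be sketched in plain Python, with no Dask involved. The helper `split_into_partitions` and the numbers are made up for illustration: one logical operation (here, `max`) turns into one small operation per partition plus a cheap combining step.

```python
def split_into_partitions(rows, npartitions):
    """Split rows into contiguous chunks, like a DataFrame split along its index."""
    size = -(-len(rows) // npartitions)   # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

rates = [3.7, 0.0, 12.5, 1.2, 9.9, 4.4, 7.1]   # made-up numbers
partitions = split_into_partitions(rates, npartitions=3)

# One logical operation (max) becomes one small operation per partition ...
partial_maxes = [max(part) for part in partitions]
# ... followed by a cheap step that combines the partial answers.
overall_max = max(partial_maxes)

print(partitions)       # [[3.7, 0.0, 12.5], [1.2, 9.9, 4.4], [7.1]]
print(partial_maxes)    # [12.5, 9.9, 7.1]
print(overall_max)      # 12.5
```

Each per-partition `max` is independent of the others, so each could run on a different core, and only one partition at a time needs to be held in memory.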

In [36]:
from dask import dataframe as dd 
report_dd2 = dd.from_pandas(report_df2,npartitions=3)
report_dd2

npartitions=3,Day,Month,Year,Cases,Deaths,Population,Rate,Country,Code,Continent
0,int64,int64,int64,int64,int64,int64,float64,object,object,object
17864,...,...,...,...,...,...,...,...,...,...
35728,...,...,...,...,...,...,...,...,...,...
53589,...,...,...,...,...,...,...,...,...,...


In [39]:
report_dd2.head()

Unnamed: 0,Day,Month,Year,Cases,Deaths,Population,Rate,Country,Code,Continent
0,5,11,2020,121,6,38041757,3.745884,Afghanistan,AFG,Asia
1,4,11,2020,86,4,38041757,3.782685,Afghanistan,AFG,Asia
2,3,11,2020,95,3,38041757,3.787943,Afghanistan,AFG,Asia
3,2,11,2020,132,5,38041757,3.766913,Afghanistan,AFG,Asia
4,1,11,2020,76,0,38041757,3.575019,Afghanistan,AFG,Asia


In [40]:
report_dd2.tail()

Unnamed: 0,Day,Month,Year,Cases,Deaths,Population,Rate,Country,Code,Continent
53585,25,3,2020,0,0,14645473,0.0,Zimbabwe,ZWE,Africa
53586,24,3,2020,0,1,14645473,0.0,Zimbabwe,ZWE,Africa
53587,23,3,2020,0,0,14645473,0.0,Zimbabwe,ZWE,Africa
53588,22,3,2020,1,0,14645473,0.0,Zimbabwe,ZWE,Africa
53589,21,3,2020,1,0,14645473,0.0,Zimbabwe,ZWE,Africa


### What do you think the following returns and why?
Hint: remember Ray.

In [41]:
report_dd2['Rate'].max()

dd.Scalar<series-..., dtype=float64>

### And just like Ray, we can evaluate and get a result

In [42]:
report_dd2['Rate'].max().compute()

1900.83621

**So what is this doing behind the scenes?**

<img src="https://camo.githubusercontent.com/349ed6d3048da7d324ef6fa8b07f66f5072ad2eb/687474703a2f2f6461736b2e7079646174612e6f72672f656e2f6c61746573742f5f696d616765732f6461736b2d646174616672616d652e737667" width=400>

Let's assume our dataset was large enough that computing a statistic such as the maximum or the mean took some time. Let's also assume we aren't 100% sure what we want to calculate. We need to be able to quickly experiment with a dataset that "can't" fit into memory. This is where Dask can really shine. Think of Dask if:
* You are already using Pandas.
* Your data may or may not fit into memory.
* You want to make better use of distributed computing (i.e., cores or processors).

**Example: Calculate the mean rate for each country**

In [43]:
report_dd2.groupby('Country')['Rate'].mean()

Dask Series Structure:
npartitions=1
    float64
        ...
Name: Rate, dtype: float64
Dask Name: truediv, 12 tasks
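The task name `truediv` in the output above suggests how a mean can be assembled from partitions: per-partition averages cannot simply be averaged, but per-partition sums and counts can be added up and then divided once. The plain-Python sketch below illustrates that idea. It is a simplification under my own assumptions, not Dask's actual code, and the rows are made up.

```python
from collections import defaultdict

# Two partitions of (country, rate) rows; values are made up for illustration.
partitions = [
    [("A", 1.0), ("A", 3.0), ("B", 10.0)],
    [("A", 8.0), ("B", 20.0)],
]

def partial_sums_and_counts(rows):
    """Per-partition step: sum and count of the rate for each country."""
    sums, counts = defaultdict(float), defaultdict(int)
    for country, rate in rows:
        sums[country] += rate
        counts[country] += 1
    return sums, counts

# Combine step: add up the partial sums and counts, then divide once at the end.
total_sums, total_counts = defaultdict(float), defaultdict(int)
for sums, counts in map(partial_sums_and_counts, partitions):
    for country in sums:
        total_sums[country] += sums[country]
        total_counts[country] += counts[country]

means = {c: total_sums[c] / total_counts[c] for c in total_sums}
print(means)   # {'A': 4.0, 'B': 15.0}
```

Averaging the two partition means for country A would give (2.0 + 8.0) / 2 = 5.0, which is wrong; carrying sums and counts separately gives the correct 4.0.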

In [44]:
country_means = report_dd2.groupby('Country')['Rate'].mean().compute()
country_means_df = country_means.sort_values(ascending=False).to_frame()
country_means_df

Country,Rate
Andorra,342.905588
Aruba,255.999914
Bahrain,222.996526
Qatar,212.499388
Holy_See,180.409691
...,...
United_Republic_of_Tanzania,0.051734
Laos,0.014750
Marshall_Islands,0.000000
Wallis_and_Futuna,0.000000


Next, I want to merge the continent back into the analysis.

In [45]:
continent_group = report_dd2.groupby('Country')['Continent'].apply(lambda s: s.iloc[0], meta="").compute()
continent_group_df = continent_group.sort_values(ascending=False).to_frame()
continent_group_df.columns = ["Continent"]
continent_group_df

Country,Continent
Cases_on_an_international_conveyance_Japan,Other
Northern_Mariana_Islands,Oceania
Wallis_and_Futuna,Oceania
Australia,Oceania
Marshall_Islands,Oceania
...,...
Lesotho,Africa
Eswatini,Africa
Malawi,Africa
Mauritius,Africa


* ``apply`` applies a function of our choosing to each group. It's basically a for loop over the groups.
* ``meta`` is a Dask parameter that is needed because we are running a custom function. Here it tells Dask we are planning on returning a string.
* ``lambda`` defines an anonymous function in Python.

Note: Both ``continent_group_df`` and ``country_means_df`` have the same index, so we can join them and then reset the index. The right mindset to get into with Pandas is to think like a database.

In [46]:
plot_df = country_means_df.join(continent_group_df).reset_index()
plot_df

Unnamed: 0,Country,Rate,Continent
0,Andorra,342.905588,Europe
1,Aruba,255.999914,America
2,Bahrain,222.996526,Asia
3,Qatar,212.499388,Asia
4,Holy_See,180.409691,Europe
...,...,...,...
208,United_Republic_of_Tanzania,0.051734,Africa
209,Laos,0.014750,Asia
210,Marshall_Islands,0.000000,Oceania
211,Wallis_and_Futuna,0.000000,Oceania


In [47]:
import altair as alt
g=alt.Chart(plot_df).mark_bar().encode(
    y='Country',
    x='Rate',
    color='Continent',
    row='Continent'
)

In [48]:
g

That's a lot, and basically a great big mess. What if we want to radically change the algorithm and calculation? This is where a good distributed framework can start to pay off.
* Running something one time in a serial manner is often OK.
* We almost never need to run something only a single time.
* More often than not, development takes hundreds of iterations before we are happy with our results.
* Strategy: prototype on small datasets, and use a flexible, easy-to-use distributed system.

## Wrapping up
There is much more to Ray and Dask than we presented in this chapter. They are both flexible and wonderful additions to the Python language. You can get a long way with these two libraries without leaving the comfort of a language you know. But in both cases, you need to be aware that you are executing your programs in a distributed manner and structure them accordingly, or you will not see the benefits.

## Questions

### 1. What is a decorator, and what do we use decorators to accomplish in Ray?


### 2. Can you call a function that has a Ray decorator in the normal Python way?


### 3. Explain what futures are in Ray and why they are important.


### 4. How would you compare Ray and Dask?


### 5. Does Dask use futures? How do you compare them to Ray futures? How do you tell Dask to execute and produce a result?


**Thank you!**