# Chapter 6: Utilizing Parallel Python	

* Parallel Programming with Python
* 김무성

# Contents

* Understanding interprocess communication	
    - Exploring named pipes	
    - Using named pipes with Python	
        - Writing in a named pipe	
        - Reading named pipes	
* Discovering PP	
* Using PP to calculate the Fibonacci series term on SMP architecture	
* Using PP to make a distributed Web crawler	
* Summary

# Understanding interprocess communication

* Exploring named pipes
* Using named pipes with Python

### Basic concepts

* Interprocess communication (<font color="red">IPC</font>) consists of mechanisms that allow the exchange
of information among processes.
* When processes are physically distributed in clusters, for instance, we could use sockets and Remote Procedure
Call (<font color="red">RPC</font>).
* In Chapter 5, Using Multiprocessing and ProcessPoolExecutor, we verified the use
of <font color="red">regular pipes</font> among other things. 
* We also studied the communication among processes that have a <font color="red">common parent process</font>.
* But sometimes it is necessary to perform <font color="red">communication between unrelated processes</font> (processes with different parent processes). In that case, we must use mechanisms called <font color="red">named pipes</font>.

## Exploring named pipes

* Within POSIX systems, such as Linux, <font color="red">we should keep in mind that everything, absolutely everything, is treated as a file</font>. For each task we perform, there is a file somewhere, and a <font color="red">file descriptor</font> attached to it lets us manipulate that file.
* Named pipes are simply mechanisms that allow IPC through file descriptors associated with special files that implement, for instance, a First-In, First-Out (<font color="red">FIFO</font>) scheme <font color="red">for writing and reading the data</font>.
* While the <font color="red">named pipes make use of the file descriptors and special files in a file system</font>, regular pipes are created in memory.
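
To make the "special file" idea concrete, here is a minimal sketch (Python 3, POSIX only) that creates a FIFO and asks the file system what kind of file it is. The helper names `make_fifo` and `is_named_pipe` and the temporary directory are assumptions for illustration, not part of the book's example.

```python
import os
import stat
import tempfile


def make_fifo(path):
    """Create a FIFO (named pipe) at path if nothing exists there yet."""
    if not os.path.exists(path):
        os.mkfifo(path)
    return path


def is_named_pipe(path):
    """Return True if path refers to a FIFO special file."""
    return stat.S_ISFIFO(os.stat(path).st_mode)


if __name__ == "__main__":
    # Hypothetical scratch location so the sketch leaves nothing behind.
    tmp_dir = tempfile.mkdtemp()
    fifo_path = make_fifo(os.path.join(tmp_dir, "my_pipe"))
    print(is_named_pipe(fifo_path))
    os.remove(fifo_path)
    os.rmdir(tmp_dir)
```

The FIFO has a name in the file system, which is what lets two unrelated processes find it; a regular pipe has no such name.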

## Using named pipes with Python

* Writing in a named pipe
* Reading named pipes

### Writing in a named pipe

In [9]:
import os

In [10]:
def write_message(input_pipe, message):
    # Opening a FIFO write-only blocks until another process opens it for reading.
    fd = os.open(input_pipe, os.O_WRONLY)
    s = "%s from pid [%d]" % (message, os.getpid())
    os.write(fd, s)
    os.close(fd)

In [11]:
named_pipe = "my_pipe"
if not os.path.exists(named_pipe):
    os.mkfifo(named_pipe)

In [12]:
%ls

ch6_Utilizing_Parallel_Python.ipynb  read_from_pipe.py
my_pipe|                             write_to_named_pipe.py


### Reading named pipes

In [13]:
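def read_message(input_pipe):
    # Opening a FIFO read-only blocks until a writer opens it.
    fd = os.open(input_pipe, os.O_RDONLY)
    # Read at most 22 bytes, enough for "hello from pid [NNNNN]" with a five-digit PID.
    message = (
        "I pid [%d] received a message => %s"
        % (os.getpid(), os.read(fd, 22)))

    os.close(fd)

    return message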

<img src='figures/eg_pipe.png' width=600 />

### Run

In [1]:
%cat write_to_named_pipe.py

import os

def write_message(input_pipe, message):
    fd = os.open(input_pipe, os.O_WRONLY)
    s = "%s from pid [%d]" %(message, os.getpid())
    os.write(fd, (s)) 
    os.close(fd)


if __name__ == '__main__' :
    named_pipe = "my_pipe"
    if not os.path.exists(named_pipe):
        os.mkfifo(named_pipe)

    write_message(named_pipe, 'hello')





In [3]:
%cat read_from_pipe.py

import os

def read_message(input_pipe):
    fd = os.open(input_pipe, os.O_RDONLY)
    message = (
        "I pid [%d] received a message => %s"
        % (os.getpid(), os.read(fd, 22)))
        
    os.close(fd)
    
    return message


if __name__ == '__main__' :
    named_pipe = "my_pipe"

    msg = read_message(named_pipe)
    print msg
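
The two scripts above have to be started in separate terminals, because each `os.open` call blocks until the other end of the pipe is opened. As a rough single-file sketch of the same round trip (Python 3, POSIX only), the writer below runs in a background thread of the same process. That is an assumption made purely to keep the demo self-contained; real use involves unrelated processes. The `demo` function, the temporary directory, and the read-until-EOF loop are illustrative choices, not the book's code.

```python
import os
import tempfile
import threading


def write_message(pipe_path, message):
    # Blocks here until a reader has opened the same FIFO.
    fd = os.open(pipe_path, os.O_WRONLY)
    try:
        payload = "%s from pid [%d]" % (message, os.getpid())
        os.write(fd, payload.encode("utf-8"))
    finally:
        os.close(fd)


def read_message(pipe_path, chunk_size=1024):
    # Blocks here until a writer has opened the same FIFO.
    fd = os.open(pipe_path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:  # empty read: every writer has closed its end
                break
            chunks.append(chunk)
    finally:
        os.close(fd)
    return b"".join(chunks).decode("utf-8")


def demo():
    # Hypothetical scratch FIFO, removed again at the end.
    tmp_dir = tempfile.mkdtemp()
    pipe_path = os.path.join(tmp_dir, "my_pipe")
    os.mkfifo(pipe_path)
    try:
        writer = threading.Thread(target=write_message,
                                  args=(pipe_path, "hello"))
        writer.start()
        received = read_message(pipe_path)
        writer.join()
    finally:
        os.remove(pipe_path)
        os.rmdir(tmp_dir)
    return received


if __name__ == "__main__":
    print(demo())
```

Reading until an empty chunk arrives avoids hard-coding the message length, at the cost of waiting for the writer to close its end.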
    




# Discovering PP

* Now we will use a Python module, PP, to establish IPC not only <font color="red">among
local processes</font>, but also among processes <font color="red">physically distributed throughout a computer network</font>.

### PP (Parallel Python)

* Parallel Python - http://www.parallelpython.com/
* The most important advantage of using PP is the abstraction that this module
provides. Some important features of PP are as follows:
    - Automatic detection of the number of processors to improve load balance
    - The number of allocated processors can be changed at runtime
    - <font color="red">Load balancing at runtime</font>
    - <font color="red">Auto-discovery of resources throughout the network</font>
* The PP module implements the execution of <font color="red">parallel code in two ways</font>.
    - The first way targets the <font color="red">SMP architecture</font>, where there are multiple processors/cores in <font color="red">the same machine</font>.
    - The second way is <font color="red">distributing the tasks across machines in a network</font>, configuring them and thus <font color="red">forming a cluster</font>.

#### Reference: Quick start guide

* http://www.parallelpython.com/content/view/15/30/

### Reference: Installation

> pip install http://www.parallelpython.com/downloads/pp/pp-1.6.4.tar.gz

# Using PP to calculate the Fibonacci series term on SMP architecture

In [17]:
import os, pp

In [18]:
input_list = [4, 3, 8, 6, 10]
result_dict = {}

In [19]:
def fibo_task(value):
    a, b = 0, 1
    for item in range(value):
        a, b = b, a + b
    message = "the fibonacci calculated by pid %d was %d" \
        % (os.getpid(), a)
    return (value, message)

In [20]:
def aggregate_results(result):
    print "Computing results with PID [%d]" % os.getpid()
    result_dict[result[0]] = result[1]

In [21]:
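# With no arguments, pp.Server() autodetects the number of local CPUs.
job_server = pp.Server()
for item in input_list:
    # args must be a tuple; modules lists the imports the worker needs;
    # the callback receives the task's return value.
    job_server.submit(fibo_task, (item,), modules=('os',), callback=aggregate_results)

# Block until every submitted job has finished.
job_server.wait()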

Computing results with PID [10604]
Computing results with PID [10604]
Computing results with PID [10604]
Computing results with PID [10604]
Computing results with PID [10604]


In [22]:
print "Main process PID [%d]" % os.getpid() 
for key, value in result_dict.items():
    print "For input %d, %s" % (key, value)

Main process PID [10604]
For input 8, the fibonacci calculated by pid 11054 was 21
For input 10, the fibonacci calculated by pid 11053 was 55
For input 3, the fibonacci calculated by pid 11053 was 2
For input 4, the fibonacci calculated by pid 11052 was 3
For input 6, the fibonacci calculated by pid 11052 was 8
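
The output shows the pattern at work: tasks run in worker processes with their own PIDs, while the results are aggregated in the main process. For readers without `pp` installed, the following is a rough stand-in for that pattern using only the standard library's `multiprocessing` module (Python 3, POSIX only, unlike the Python 2 cells above). It is not how PP is implemented; `_worker`, `run_parallel`, and the choice of the `fork` start method are assumptions made for this sketch.

```python
import multiprocessing
import os


def fibo_task(value):
    # Same computation as the notebook cell: the value-th Fibonacci term.
    a, b = 0, 1
    for _ in range(value):
        a, b = b, a + b
    message = "the fibonacci calculated by pid %d was %d" % (os.getpid(), a)
    return (value, message)


def _worker(value, queue):
    # Runs in a child process and ships its result back to the parent.
    queue.put(fibo_task(value))


def run_parallel(input_list):
    # Assumption: the "fork" start method, available on POSIX systems only.
    ctx = multiprocessing.get_context("fork")
    queue = ctx.Queue()
    workers = [ctx.Process(target=_worker, args=(value, queue))
               for value in input_list]
    for worker in workers:
        worker.start()

    result_dict = {}
    # Aggregation happens in the parent process, as with the PP callback.
    for _ in workers:
        value, message = queue.get()
        result_dict[value] = message

    for worker in workers:
        worker.join()
    return result_dict


if __name__ == "__main__":
    print("Main process PID [%d]" % os.getpid())
    for key, value in run_parallel([4, 3, 8, 6, 10]).items():
        print("For input %d, %s" % (key, value))
```

Unlike PP, this sketch starts one process per input rather than sizing a worker pool to the number of CPUs, and it has no notion of remote nodes.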


# Using PP to make a distributed Web crawler

### Distributed server environment in the book's example

* Iceman-Thinkad-X220: Ubuntu 13.10
* Iceman-Q47OC-500P4C: Ubuntu 12.04 LTS
* Asgard-desktop: Elementary OS

### code

In [2]:
import os, re, requests, pp

In [3]:
url_list = ['http://www.google.com/', 'http://gizmodo.uol.com.br/',
            'https://github.com/', 'http://br.search.yahoo.com/',
           ]

In [4]:
result_dict = {}

In [5]:
def aggregate_results(result):
    print "Computing results in main process PID [%d]" % os.getpid()
    message = "PID %d in hostname [%s] the following links were found: %s"\
        % (result[2], result[3], result[1])
    result_dict[result[0]] = message

In [6]:
def crawl_task(url):
    # Raw string so the regex escapes (such as \s) reach the re module unchanged.
    html_link_regex = \
        re.compile(r'<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')

    request_data = requests.get(url)
    # Limit to the first 3 links.
    links = html_link_regex.findall(request_data.text)[:3]
    return (url, links, os.getpid(), os.uname()[1])
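
To see what the link regex in `crawl_task` extracts without touching the network, here is a small sketch that applies the same pattern to an inline HTML string. `extract_links` and `SAMPLE_HTML` are hypothetical names and data made up for illustration.

```python
import re

# Same pattern as crawl_task, written as a raw string.
HTML_LINK_REGEX = re.compile(r'<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')


def extract_links(html, limit=3):
    """Return the href values of the first `limit` anchor tags in html."""
    return HTML_LINK_REGEX.findall(html)[:limit]


# Made-up markup: four anchors, one with an extra attribute before href.
SAMPLE_HTML = (
    '<p><a href="https://example.com/">Home</a> '
    '<a class="nav" href="/about">About</a> '
    "<a href='#top'>Top</a> "
    '<a href="/fourth">Fourth</a></p>'
)

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML))
```

The pattern returns href values exactly as written, so relative paths and fragment links such as `#top` come back unresolved, which matches the `/join` and `#start-of-content` entries a crawl of a real page can produce.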

### For practice, first on a single machine

In [None]:
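# To use specific remote nodes, list their addresses instead:
# ppservers = ("192.168.25.21", "192.168.25.9")
# '*' asks PP to auto-discover ppserver nodes on the local network.
ppservers = ('*',)
# ncpus=1 limits the number of local worker processes to one.
job_dispatcher = pp.Server(ncpus=1, ppservers=ppservers, socket_timeout=60000)
for url in url_list:
    job_dispatcher.submit(crawl_task, (url,),
                          modules=('os', 're', 'requests',),
                          callback=aggregate_results)

# Block until every submitted job has finished.
job_dispatcher.wait()

for key, value in result_dict.items():
    print "** For url %s, %s\n" % (key, value)

# print_stats() prints the job statistics itself, so no print statement is needed.
job_dispatcher.print_stats()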

### With the environment from the book's example

<img src="figures/eg_crawler_1.png" />

<img src="figures/eg_crawler_2.png" />

<img src="figures/eg_crawler_3.png" />

# Run

In [2]:
%run web_crawler_pp_cluster.py

Computing results in main process PID [12035]
Computing results in main process PID [12035]
Computing results in main process PID [12035]
Computing results in main process PID [12035]
** For url http://br.search.yahoo.com/, PID 12201 in hostname [moosung-com] the following links were found: [u'https://br.yahoo.com/', u'https://mail.yahoo.com/?.intl=br&.lang=pt-BR', u'https://br.noticias.yahoo.com/']

** For url http://gizmodo.uol.com.br/, PID 12201 in hostname [moosung-com] the following links were found: [u'http://trivela.uol.com.br/', u'http://extratime.uol.com.br/', u'http://gizmodo.uol.com.br/']

** For url https://github.com/, PID 12201 in hostname [moosung-com] the following links were found: [u'#start-of-content', u'https://github.com/', u'/join']

** For url http://www.google.com/, PID 12201 in hostname [moosung-com] the following links were found: [u'http://www.google.co.kr/imghp?hl=ko&tab=wi', u'http://maps.google.co.kr/maps?hl=ko&tab=wl', u'https://play.google.com/?hl=ko&tab

# Summary

# References

* [1] Parallel Programming with Python - http://www.amazon.com/Parallel-Programming-Python-Jan-Palach/dp/1505492092
* [2] Parallel Python - http://www.parallelpython.com/