# Background

Over the last few newsletters, I have been showing you how the "multiprocessing" module makes it possible not only to use processes in a way that feels like you're using threads, but also to share data using a Queue. That's right -- while we normally think of queues as having to do with threads, and giving us a thread-safe mechanism for passing data among threads, it turns out that "multiprocessing" offers its own version of a queue, which we can use quite similarly.

Queues depend on Python's "pickle" module, which allows us to turn almost any Python object into a sequence of bytes. In other languages, this is known as "serialization" or "marshalling," but for reasons that are somewhat beyond me, Python people talk about "pickling." All built-in data structures can be pickled and unpickled, although things like open files and lambda functions cannot. We can store a pickled object in a bytes object, or we can write it to disk. And of course, we can unpickle those bytes, or a disk file, back into an object. Moreover, pickle works with your own classes, unless they do truly strange things.
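As a quick illustration of that round trip, here is a minimal sketch that uses only the standard "pickle" module. The dictionary and its contents are made up for the example.

```python
import pickle

# A nested structure built only from built-in types (made-up sample data).
original = {'name': 'example', 'sizes': [10, 20, 30], 'ratio': 0.5}

# dumps() serializes the object to an in-memory bytes object...
data = pickle.dumps(original)

# ...and loads() rebuilds an equal, but separate, object from those bytes.
restored = pickle.loads(data)

print(type(data).__name__)                          # bytes
print(restored == original, restored is original)   # True False
```

A queue does essentially the same thing behind the scenes: the sending process pickles the object, and the receiving process unpickles its own copy.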

The good news, then, is that you can basically send any kind of Python-based data structure through a queue.

The bad news is that if you're passing lots of data, then the process of pickling and unpickling might incur some overhead. Also, queues are designed for sending data from one process to another. What if more than two processes want to share data?  What if you have lots of pieces of data that need to be updated or monitored regularly?

If we were working within a single process, we could use a shared piece of data to handle such things. But when we have multiple processes, the point is that each process has its own variables and memory space, and is insulated from the other processes.

One solution is to use the "shared memory" feature of the "multiprocessing" module. This feature relies on shared memory capabilities within the operating system, allowing processes to communicate with one another.

Since we're going through the operating system, we won't have the flexibility of using Python data types, as provided by our queue. Instead, we'll be restricted to several low-level data types, corresponding to individual values and arrays at the C level. The types that are available to us are defined in the "ctypes" library, which describes the ways in which Python can use C-compatible data structures.

For example, let's assume that we want to add up the sizes of a number of files. As we've seen before, we can launch a new process from "multiprocessing" to open a file and get its size. When we've done this in the past, we've placed each file's size in a queue; when all of the processes are done, we can then retrieve and add all of the values from the queue.
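For comparison, the queue-based approach might be sketched roughly as follows. The function names "put_size" and "total_size" and the throwaway files are my own inventions for this illustration, and I use os.path.getsize rather than opening each file. Note that this sketch empties the queue before joining the processes, which is the order the "multiprocessing" documentation recommends.

```python
import os
import tempfile
from multiprocessing import Process, Queue


def put_size(queue, filename):
    """Put one file's size in bytes on the queue, or -1 if it can't be read."""
    try:
        queue.put(os.path.getsize(filename))
    except OSError:
        queue.put(-1)


def total_size(filenames):
    """Launch one process per file, then add up the sizes they report."""
    queue = Queue()
    processes = [Process(target=put_size, args=(queue, name))
                 for name in filenames]
    for p in processes:
        p.start()

    # Every process puts exactly one item, so we get exactly one per process.
    sizes = [queue.get() for _ in processes]

    for p in processes:
        p.join()
    return sum(size for size in sizes if size >= 0)


if __name__ == '__main__':
    # Three throwaway files of 1, 2, and 3 bytes, so the total is predictable.
    with tempfile.TemporaryDirectory() as tmp:
        names = []
        for index, text in enumerate(['a', 'bb', 'ccc']):
            path = os.path.join(tmp, 'file{}.txt'.format(index))
            with open(path, 'w') as f:
                f.write(text)
            names.append(path)

        print("Total = {}".format(total_size(names)))   # Total = 6
```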

Using shared memory, we can instead have each process add its file's size directly to a single shared value. No adding will be necessary afterwards, although we will want to "join" the processes in order to ensure that all of them have finished.

In this example, I create a new "Value" object in the main process that launches everything. The "Value" object needs to be given a type and an initial value, much like a variable declaration in C or any other statically typed language. In this case, I've given it an "i" type (i.e., a signed C integer) and an initial value of 0. Despite its name, "word_count" accumulates the number of characters in each file, since that is what len() returns for a file's contents. To us, "word_count.value" behaves like any other integer. However, "multiprocessing" keeps it in shared memory, such that changes made in one process are visible in the others.
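Before involving any other processes, it may help to see a "Value" on its own. This is a minimal sketch; the names "counter" and "ratio" are just for illustration.

```python
from multiprocessing import Value

counter = Value('i', 0)      # 'i' is a signed C int, starting at 0
ratio = Value('d', 0.25)     # 'd' is a C double

# The wrapper is not itself a number; the number lives in its .value attribute.
counter.value = 41
counter.value += 1

print(counter.value, ratio.value)   # 42 0.25
```

The type codes are the same single-letter codes used by the "array" module; a ctypes type such as ctypes.c_int should work in their place.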

We can then update it from each individual process, as follows: 



#!/usr/bin/env python3

from multiprocessing import Process, Value
import glob

def count_words(counter, filename):
    # Despite the name, this adds the file's length in characters.
    try:
        with open(filename) as f:
            counter.value += len(f.read())
    except OSError:
        print("\tProblem with {}".format(filename))

if __name__ == '__main__':
    word_count = Value('i', 0)

    processes = [ ]
    for filename in glob.glob('/etc/*.conf'):
        print(filename)
        p = Process(target=count_words, args=(word_count,filename))
        p.start()
        processes.append(p)

    for one_process in processes:
        one_process.join()

    print("Total = {}".format(word_count.value))




  


Notice that our main program code is in an "if __name__ == '__main__'" block, which is there in part to keep things organized, and in part because it is needed for the program to work on both Windows and Unix: Windows starts each new process by importing the main module afresh, so without the guard every child would try to launch processes of its own. We create our new "Value" object, we create a list in which to store our processes, we launch the processes (one for each file matched by "glob.glob"), and then we join them. In the end, we have the final value -- without the need for queues.
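If you're curious which approach your own system takes, "multiprocessing" can tell you. The following is a small sketch; the "report" function is hypothetical and exists only to give the child process something to do.

```python
import multiprocessing


def report(label):
    # This runs in the child process.
    name = multiprocessing.current_process().name
    print("{} running in {}".format(label, name))


if __name__ == '__main__':
    # 'fork', 'spawn', or 'forkserver', depending on platform and Python version.
    print("Start method: {}".format(multiprocessing.get_start_method()))

    p = multiprocessing.Process(target=report, args=('child',))
    p.start()
    p.join()
```

With the "spawn" method, the child imports this file again under a different module name, which is why everything that launches processes needs to sit behind the guard.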

But wait a second: How do we know that the processes aren't somehow colliding when they try to update the word_count variable? It's probably a good idea to use a mutex or other form of lock to ensure that only one process can modify the variable at a time -- especially since multiple processes have the potential to cause lots of trouble.

You might think that += provides a sort of lock. But it doesn't: "counter.value += n" is not a single operation. Python first reads counter.value, then adds to it, and then assigns the result back to counter.value. A "Value" does come with a lock by default, and each individual read or write of ".value" acquires it, but nothing holds that lock across the whole read-add-write sequence. This means that if enough processes are updating our count at the same time, two of them can read the same old value, and one of the updates will be lost.
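To see the problem for yourself, you can try a sketch along these lines. The "bump" function and the numbers of workers and iterations are arbitrary choices of mine. Because this is a race, the result will vary from run to run, and on some runs it may even come out right.

```python
from multiprocessing import Process, Value


def bump(counter, times):
    """Add 1 to the shared counter `times` times, without holding a lock."""
    for _ in range(times):
        # Read counter.value, add 1, write it back. Another process can
        # write between our read and our write, and its update is then lost.
        counter.value += 1


if __name__ == '__main__':
    workers, times = 4, 50000
    counter = Value('i', 0)

    processes = [Process(target=bump, args=(counter, times))
                 for _ in range(workers)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    print("Expected {}, got {}".format(workers * times, counter.value))
```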

We can get around this by adding a single line: 

    with counter.get_lock():
to the start of count_words, and indenting the rest of the function's body beneath it. Our values, it would seem, are more versatile than simple numeric storage would be. They have the "get_lock" method, which returns the lock that protects the value. When we use that lock in a context manager ("with") block, it automatically does precisely what we would like: It waits until no one else holds the lock, lets us update the variable without interference, and then releases the lock.

The updated program can thus look like this: 

#!/usr/bin/env python3

from multiprocessing import Process, Value
import glob

def count_words(counter, filename):
    # Hold the Value's lock so that the read-add-write below can't be interleaved.
    with counter.get_lock():
        try:
            with open(filename) as f:
                counter.value += len(f.read())
        except OSError:
            print("\tProblem with {}".format(filename))

if __name__ == '__main__':
    word_count = Value('i', 0)

    processes = [ ]
    for filename in glob.glob('/etc/*.conf'):
        print(filename)
        p = Process(target=count_words, args=(word_count,filename))
        p.start()
        processes.append(p)

    for one_process in processes:
        one_process.join()

    print("Total = {}".format(word_count.value))
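One thing to keep in mind about this version is that each process holds the lock while it reads its file, so the processes end up taking turns. A possible variation, sketched here under the hypothetical name "add_length", does the slow file reading first and takes the lock only for the update itself. I've also chosen to catch UnicodeDecodeError, in case a file isn't valid text; that is my addition, not part of the program above.

```python
#!/usr/bin/env python3

import glob
from multiprocessing import Process, Value


def add_length(counter, filename):
    """Add the number of characters in filename to the shared counter."""
    try:
        # The slow part: no lock is needed just to read the file.
        with open(filename) as f:
            length = len(f.read())
    except (OSError, UnicodeDecodeError):
        print("\tProblem with {}".format(filename))
        return

    # The fast part: hold the lock only for the read-add-write.
    with counter.get_lock():
        counter.value += length


if __name__ == '__main__':
    total = Value('i', 0)

    processes = []
    for filename in glob.glob('/etc/*.conf'):
        p = Process(target=add_length, args=(total, filename))
        p.start()
        processes.append(p)

    for one_process in processes:
        one_process.join()

    print("Total = {}".format(total.value))
```

The total depends on which files match and are readable on your system; if nothing matches the pattern (as on Windows), it will be 0.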



If you want, you can have an entire "Array" of values, rather than a single "Value" object. Like "Value", an "Array" takes a type code or a ctypes type as its first argument; the "ctypes" manual page lists the many types you can specify.
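Here is a minimal sketch of an "Array" in action. The "store_square" function is made up for the example. Each process writes only to its own slot, so no two processes ever update the same element and no explicit locking is needed here.

```python
from multiprocessing import Process, Array


def store_square(results, index):
    """Write index squared into this worker's own slot of the shared array."""
    results[index] = index * index


if __name__ == '__main__':
    results = Array('i', 5)      # five signed C ints, all initialized to 0

    processes = [Process(target=store_square, args=(results, i))
                 for i in range(len(results))]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    print(results[:])            # [0, 1, 4, 9, 16]
```

If several processes did need to update the same element, the array's own get_lock method could be used in the same way as the one on "Value".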

    https://docs.python.org/3/library/ctypes.html#module-ctypes

Next time, we'll talk about process pools -- and then finally wrap up this very (!) long series I've been writing about threads and processes.

Until then,

Reuven