# 17) Data in Space - Networks <a class="tocSkip">

Notebook 15 taught us concurrency: how to do more than one thing at a time. Now we will try to do things in more than one place: distributed computing or networking. There are many good reasons to do this:

    Performance
        Your goal is to keep fast components busy, not waiting for slow ones.
        
    Robustness
        There is safety in numbers, so you want to duplicate tasks to work around hardware and software failures.
        
    Simplicity
        It is best practice to break complex tasks into many little ones that are easier to create, understand and fix.

### Networking patterns

You can build networking applications from some basic patterns:

    The most common pattern is request-reply (client-server). This pattern is synchronous: the client waits until the server responds. A web browser is a client, making an HTTP request to a web server, which returns a reply.
    
    Another common pattern is push or fanout: you send data to any available worker in a pool of processes. An example is a web server behind a load balancer.
    
    The opposite of push is pull or fanin: you accept data from one or more sources. An example would be a logger that takes text messages from multiple processes and writes them to a single log file.
    
    One pattern is similar to television broadcasting: publish-subscribe. With this pattern, a publisher sends out data. In a simple pub-sub system, all subscribers receive a copy. More often, subscribers can indicate that they are interested only in certain types of data, and the publisher will send just those. Unlike the push pattern, more than one subscriber might receive a given piece of data.
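The request-reply pattern described above can be sketched with nothing but the standard library. This is a minimal illustration, not a production server: the function names `serve_once` and `request_reply` are made up for this example, the server handles exactly one client, and a single `recv()` call is assumed to be enough because the messages are tiny.

```python
import socket
import threading


def serve_once(server_sock):
    """Accept one client, read its request, and send back a reply."""
    conn, _addr = server_sock.accept()
    with conn:
        request = conn.recv(1024)             # block until the request arrives
        conn.sendall(b"reply to " + request)  # answer it


def request_reply(message, host="127.0.0.1"):
    """Send one small request to a throwaway local server; return its reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_sock:
        server_sock.bind((host, 0))           # port 0: let the OS pick a free port
        server_sock.listen(1)
        port = server_sock.getsockname()[1]

        # Run the server in a background thread so one script can play both roles
        server = threading.Thread(target=serve_once, args=(server_sock,))
        server.start()

        with socket.create_connection((host, port)) as client:
            client.sendall(message)
            reply = client.recv(1024)         # the client waits here (synchronous)

        server.join()
    return reply


if __name__ == "__main__":
    print(request_reply(b"ping"))
```

The client blocks in `recv()` until the server answers, which is what makes the pattern synchronous. In a real deployment the server and client would run in separate processes, usually on separate machines.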

#### Publish-Subscribe pattern

Publish-Subscribe is not a queue but a broadcast. One or more processes publish messages. Each subscriber process indicates what type of messages it would like to receive. A copy of each message is sent to each subscriber that matched its type. Thus, a given message might be processed once, more than once, or not at all. The Redis package contains a pub-sub system. The publisher emits a message with a topic and a value, and subscribers say which topics they want to receive.

In [1]:
import redis
import random

In [2]:
# Define a publisher that broadcasts letters and numbers

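def publisher():
    """Publish ten random numbers, each on a randomly chosen letter topic."""
    conn = redis.Redis()  # assumes a Redis server running on localhost:6379
    letters = ['A', 'B', 'C', 'D']
    numbers = [1, 2, 3, 4]
    
    for _ in range(10):
        letter = random.choice(letters)  # the topic (channel)
        number = random.choice(numbers)  # the value sent on that topic
        print('Publish: {} and {}'.format(letter, number))
        conn.publish(letter, number)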

In [3]:
# Define a subscriber that is only interested in certain letters

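def subscriber():
    """Print every message published on topics B and D."""
    # decode_responses=True returns str instead of bytes for channel and data
    conn = redis.Redis(decode_responses=True)
    topics = ['B', 'D']
    sub = conn.pubsub()
    sub.subscribe(topics)
    
    # listen() blocks and yields one dictionary per event
    for msg in sub.listen():
        # skip subscription confirmations; keep only published messages
        if msg['type'] == 'message':
            letter = msg['channel']
            number = msg['data']
            print('Subscribe: {} and {}'.format(letter, number))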

Here the subscriber wants all messages for letters B and D and no others. The listen() method yields a dictionary for each event. If its type is 'message', it was sent by a publisher and matches our criteria. The 'channel' key is the topic (letter), and the 'data' key contains the message (number). You can have as many subscribers and publishers as you want. Redis pub-sub is fire-and-forget: a message is delivered only to the subscribers that are connected when it is published, so if there is no subscriber for a message, it disappears. To try this out, make sure a Redis server is running, then start the subscriber and the publisher in separate terminal windows (subscriber first, so that it does not miss messages), or put the subscriber in the background with a final "&".
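To see the delivery semantics without a running Redis server, here is a toy in-process sketch. The `Broker` class is hypothetical and exists only for illustration; it is not how Redis is implemented. Its `publish()` returns the number of subscribers reached, much as the Redis PUBLISH command reports how many clients received a message.

```python
from collections import defaultdict


class Broker:
    """Toy in-process broker: maps each topic to a list of callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register callback to be called for every message on topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, value):
        """Deliver value to every subscriber of topic; return how many got it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, value)
        return len(callbacks)


broker = Broker()
first, second = [], []

# The first subscriber wants topics B and D, the second only B
for topic in ("B", "D"):
    broker.subscribe(topic, lambda t, v: first.append((t, v)))
broker.subscribe("B", lambda t, v: second.append((t, v)))

delivered = [broker.publish(letter, number)
             for letter, number in [("A", 1), ("B", 2), ("D", 3)]]
print(delivered)  # [0, 2, 1]
```

The message on topic A reaches nobody and is dropped, the one on B is processed twice, and the one on D once: "once, more than once, or not at all".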

### Web services and APIs

If data is published only on a website, anyone who wants to access and structure the data needs to write scrapers and rewrite them each time a page format changes. In contrast, if a website offers an API to its data, the data becomes directly available to client programs. APIs change less often than web page layouts, so client rewrites are less common. APIs are especially useful for mining well-known social media sites such as Twitter, Facebook and LinkedIn. These sites provide APIs, but they require you to register and get a key to use when connecting, and their access terms and pricing vary.
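A client program typically attaches its key to each request and decodes a structured reply, most often JSON. The sketch below uses only the standard library. The endpoint `api.example.com`, the key, the query parameters and the shape of the reply are all invented for illustration, and the `Bearer` header scheme is an assumption, so check each provider's documentation for the real details.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint and key, for illustration only
BASE_URL = "https://api.example.com/v1/posts"
API_KEY = "YOUR-API-KEY"


def build_request(base_url, params, api_key):
    """Build (but do not send) a GET request that carries the API key."""
    query = urllib.parse.urlencode(params)
    request = urllib.request.Request(f"{base_url}?{query}")
    # Many APIs expect the key in an Authorization header; the exact
    # scheme varies by provider, so "Bearer" here is an assumption.
    request.add_header("Authorization", f"Bearer {api_key}")
    return request


def parse_reply(body):
    """Decode a JSON reply body into Python objects."""
    return json.loads(body)


request = build_request(BASE_URL, {"user": "alice", "limit": 2}, API_KEY)
print(request.full_url)

# To actually send the request to a real service:
# with urllib.request.urlopen(request) as response:
#     data = parse_reply(response.read())

# A made-up reply body, standing in for what a server might return
sample_body = b'{"posts": [{"id": 1, "text": "hello"}]}'
print(parse_reply(sample_body)["posts"][0]["text"])
```

Unlike a scraper, this code does not depend on page layout: it keeps working as long as the field names in the reply stay the same.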

### Remote management tools

As Google and other internet companies grew, they found that traditional computing solutions did not scale. Software that worked for single machines, or even a few dozen, could not keep up with thousands. Disk storage for databases and files involved too much seeking, which requires mechanical movement of disk heads, whereas streaming consecutive segments of the disk was much quicker. Developers found that it was faster to distribute and analyze data on many networked machines than on individual ones. Big data often means "data too big to fit on my machine": data that exceeds the available disk space, memory, CPU time or all of the above. A selection of packages for this kind of distributed processing includes:

    - Hadoop
    - Spark
    - Disco
    - Dask
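The core idea behind these packages is to split the data, process the pieces in parallel, and merge the partial results. The sketch below imitates that on a single machine with a thread pool standing in for a cluster. It is only an illustration of the split/map/reduce idea with made-up function names, not the API of any of the packages listed above, which also handle data placement, scheduling and failures.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, n_chunks):
    """Deal the lines out into n_chunks roughly equal lists."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def count_words(chunk):
    """Map step: count the words in one chunk of lines."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(lines, n_workers=3):
    """Count words by farming chunks out to workers and merging the results."""
    chunks = split_into_chunks(lines, n_workers)
    total = Counter()
    # Threads stand in for the networked machines of a real cluster
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(count_words, chunks):  # map step
            total.update(partial)                      # reduce step
    return total


lines = ["the quick brown fox", "the lazy dog", "The fox"]
print(distributed_word_count(lines).most_common(2))
```

Because each chunk is counted independently, the result does not depend on how many workers are used, which is what makes this kind of job easy to spread over many machines.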

### Clouds

Not so long ago, you would buy your own servers, bolt them into racks in data centers and install layers of software on them: operating systems, device drivers, filesystems, databases, web servers, email servers, name servers, load balancers, monitors and more. Many hosting services offered to take care of your servers for a fee, but you still leased the physical devices and had to pay for your peak load configuration at all times. Today, instead of buying and building, you can rent servers in the cloud. With this model, maintenance is someone else's problem and you can concentrate on your service. Using web dashboards and APIs, you can spin up servers with whatever configuration you need, quickly and easily. You can monitor their status and be alerted if some metric exceeds a given threshold. The big cloud vendors are:

    - Amazon (AWS) (pip install boto3)
    - Google
    - Microsoft Azure

### Containers

Containers are much lighter than virtual machines and a bit heavier than Python virtualenvs. They allow you to package an application separately from other applications on the same machine, sharing only the operating system kernel. Docker is a widely used container platform; to install its Python client library, run pip install docker. Containers caught on and spread through the computing world. Eventually, people needed ways to manage multiple containers and wanted to automate some of the manual steps usually required in large distributed systems, including failover, load balancing and scaling up and down. Kubernetes is leading the pack in container orchestration.