### FUTURES 

A future, or promise, is something that represents a pending opearion and returns straight away. One can then query their state of completion, or register callbacks to be called on successful completion or error.

Adapted from Fluent Python.

### NOTES

yield was this concept of ...

Concept of futures: If you have a computation or process that will take a long time. Instead of blocking or waiting for it to complete, return the token to you right now even if process is not done yet. Ask for tokens to indicate whether complete or not. Or register a callback and when done, call a function that maybe puts stuff into database or changes global variable to change whatever. 

Context of coroutine or thread having finished its work. Notion of having completed the work. 

There's a module called concurrent. 

Way to doing concurrent processes and threads. 

Multiple processes going on, etc. Or multiple websites. There's some unit of I/O or computation. 

So let's start with threads. We have a vague notion of thread - some notion of level of process, but can have multiple threads. 

If you do time.sleep, will give up lock and lets other threads happen. Good thing to play around

This is what happens when go to website? 

### NOTES - K

Future - a value that will be available at some point in the future

Don't conflate with the Python implementation of Future 

Often use future to represent the result of an asyncronous process.

Sort of explicit form of lazy evaluations

Allows you to do other useful work  before you block

In [1]:
import uuid, time
def get_thing(it):
    sleep(1)
    return uuid.uuid4()

In [2]:
# Examples from Fluent Python 
# get updated code from Rahul

import time, uuid, functools
def get_thing_maker(secs, items):
    time.sleep(secs)
    return str(uuid.uuid4())+str(items)
get_thing = functools.partial(get_thing_maker, 1)
def get_many(lots):
    counter = 0
    for t in lots: 
        thing = get_thing(t)
        counter += 1
    return counter

def serial_main(it):
    t0 = time.time()
    count = get_many(it)
    elapsed = time.time() - t0
    msg = '\n{} things got in {:.2f}s'
    print(msg.format(count, elapsed))

#### Serial sleeping

In [3]:
serial_main(range(20))


20 things got in 20.08s


#### concurrent sleeping using threads

### NOTES

creates 10 threads for us usin this and stored in executor 

10 threats, 20 functions to run. It will take care of all the details of spawning and joining threads. 

Some of these functions 

By doing these threads, because when running thread, any other thread cannot run. But all this thread is doing is sleeping. As I/O goes on, and then tell threads to go, and then pass along locks. 

That's the game that's happening over here. 

May think that global interpter lock (GIL) is a problem, but it's not. For most part of doing I/O type stuff, it's not a problem at all. Can always get away with doing it. 

In [4]:
from concurrent import futures
def get_many_threaded1(it):
    workers = 10
    with futures.ThreadPoolExecutor(max_workers=workers) as executor:
        # TODO This is a cool pattern, parallelism, can map pretty much anthing that blocks
        # this will nicel parallezlie and schedule for ou
        
        res = executor.map(get_thing, it)
    return len(list(res))
def threaded_main1(it):
    t0 = time.time()
    count = get_many_threaded1(it)
    elapsed = time.time() - t0
    msg = '\n{} things got in {:.2f}s' 
    print(msg.format(count, elapsed))

In [5]:
threaded_main1(range(20))


20 things got in 2.01s


One might think that the concurrent IO (or sleeping) case is limited by the GIL, but in both cases, the GIL is yielded. Thus there is no waiting around.

The GIL is harmless if code is being run in the context of python library IO or code running in properly coded C extensions like numpy.  The time.sleep() function also releases the GIL. Python threads are totally usable in I/O-bound applications.

### NOTES - Threads vs. Processes

This is how all processes work in all unix machines work. It just keeps forking to create more processes and memory space keeps duplicating. 

parent waits on child, then child finishes, then exits status. 

What's the characteristic of process? Stack, registers allocated, memory allocation. 

A process is a container. Every process has at least one thread. Could ahve mlutiple threads. What does this mean? Could have multiple units of execution and to have multiple units of execution, each one has its own stack to have its own call stack and possibly to ahve regisers. Needs to have a program counter to tell you where you are in. These are things that are properties of processes. All share the same memory address as all the threads. So this thread idea can think of idea of decoupling of this notion of what is a process here and what the resources that a process has. 

These are the things that a process has: address space, global variables, etc. 

Thread level: registers. 

Can have user vs. kernel threads .... kernrel threads are actually managed by OS for you. Don't have to worry about these things. Generally, as a writer of a program, have a user space library. 

### Threads

threads vs processes

On linux

- processes created by fork()
- have a primary thread
- thread is the unit of execution
- process is a container, can have more threads
- can be scheduled across different cores/cpus

```c
int pid;
int status = 0;
/* fork returns pid of child to parent and 0 to child*/
if (pid = fork()) {
    /* parent code */
    pid = wait(&status);
    /*wait returns child pid and status*/
} else {
    /* child  code*/
    exit(status);
} 
```

- threads in a process share same address space (share it entirely)
- thread abstraction decouples resource allocation from control
- defines a single sequential execution stream with PC, stack, register values
- process handles: address space, global variables, open files, child processes, pending alarms, signals and signal handlers, accounting info
- thread handles program counter, registers, stack, and state
- user vs kernel threads

In [13]:
# here's our old fib as one liner. This is worst way...exponential complexity. Expemplar of CPU bound process.
# Burgeoninng large stack. Not using any other memory
def fib(n):
    return fib(n - 1) + fib(n - 2) if n > 1 else n

In [14]:
from threading import Thread
from time import sleep
from time import time


# sleeps for 3 seconds. Goes in a loop. Loop runs 10 times, sleeps 3 seconds each time. 
# Print statement with flush = True because it flushes the standard out first. Want to do that because there's 
# a cell to run but not much to do, but wondering why the other cell hasn't gone yet... has * because trying to 
# send to stdout but waiting for other process to do. Flush will force out stdout
# stdout, don't want to be waiting to write stuff out all the time. 

# IO bound: spending most of program getting stuff from disk or network, not computationally heavy. 
# CPU bound (compute-bound): fibonacci - long running computational process and long chain of stuff
# if have program that is IO bound, just use threads (for asyncronous, less bug prone) or coroutines (for thread-based), 
# will not be a problem
# In python, shoudl not use threads to write compute-bound program, separate threads.. will hold onto GIL and not let 
# other threads go
def sleepy(): #like io
    i=0
    while i < 10:
        print("{} -- {} Sleepy!".format(i, int(time())), flush=True)
        sleep(3)
        i += 1

# cpuy and cpuy2 are identical, just wanted to separate out
def cpuy():
    for i in range(35):
        val = fib(i)
        print("fib({}) is {}".format(i, val))

def cpuy2():
    for i in range(35):
        val = fib(i)
        print("cpuy2 fib({}) is {}".format(i, val))
        

# let's see how much time it takes to run cpuy 2 threads one after the other
def main():
    # Second thread will print the hello message. Starting as a daemon means
    # the thread will not prevent the process from exiting.
    
    # run these calculations in serial
    start = time()
    cpuy()
    cpuy2()
    print("serial elapsed:", time() - start)
    
    
    # now try running as threads 
    # TODO understand what this is doing
    start=time()
    #t = Thread(target=sleepy)
    #t.start()
    # start a thread... simple to do 
    t2 = Thread(target=cpuy2)
    t2.start()
    # Main thread will read and process input
    cpuy()
    print("thread elapsed:", time() - start)
    
if __name__ == '__main__':
    main()
    
# NOTES
# there was no advantage to running as 2 threads. Actaully took longer than running in serial
# naively, thought this would take time to run in any one functions... like 10 seconds which is half of running in serial
# the GIL is forcing the threads to run in serial, that's why it's taking 20 seconds 
# the way this works in CPU bound process, doesn't give up any GIL 
# in IO bound processes, it gives up the GIL 
# in Python, about every 100-200 instructions of byte code, it will give up GIL to see if anything else runs
# there's only one runing at any time. So you can't really do anything

# Sklearn - njobs, njobs tells you how many jobs you can run. Does not use threads. Uses multiprocessing 

fib(0) is 0
fib(1) is 1
fib(2) is 1
fib(3) is 2
fib(4) is 3
fib(5) is 5
fib(6) is 8
fib(7) is 13
fib(8) is 21
fib(9) is 34
fib(10) is 55
fib(11) is 89
fib(12) is 144
fib(13) is 233
fib(14) is 377
fib(15) is 610
fib(16) is 987
fib(17) is 1597
fib(18) is 2584
fib(19) is 4181
fib(20) is 6765
fib(21) is 10946
fib(22) is 17711
fib(23) is 28657
fib(24) is 46368
fib(25) is 75025
fib(26) is 121393
fib(27) is 196418
fib(28) is 317811
fib(29) is 514229
fib(30) is 832040
fib(31) is 1346269
fib(32) is 2178309
fib(33) is 3524578
fib(34) is 5702887
cpuy2 fib(0) is 0
cpuy2 fib(1) is 1
cpuy2 fib(2) is 1
cpuy2 fib(3) is 2
cpuy2 fib(4) is 3
cpuy2 fib(5) is 5
cpuy2 fib(6) is 8
cpuy2 fib(7) is 13
cpuy2 fib(8) is 21
cpuy2 fib(9) is 34
cpuy2 fib(10) is 55
cpuy2 fib(11) is 89
cpuy2 fib(12) is 144
cpuy2 fib(13) is 233
cpuy2 fib(14) is 377
cpuy2 fib(15) is 610
cpuy2 fib(16) is 987
cpuy2 fib(17) is 1597
cpuy2 fib(18) is 2584
cpuy2 fib(19) is 4181
cpuy2 fib(20) is 6765
cpuy2 fib(21) is 10946
cpuy2 fib(22) is 177

### Processes with concurrent futures.

CPU based processing wont release the gil, and is thus best done in a separate process. For illustration, we show what this looks like.

### NOTES

Sklearn - njobs, njobs tells you how many jobs you can run. Does not use threads. Uses multiprocessing 
Using multiple processes here shouldn't be useful. 

Multiple processes and sleeping 1 second

Don't really need to do multiple processes. 

Because threads will give up the gil. Don't have to do multiple processes. Just doing this to show it's the same. 

In [15]:
import time 

def get_many_process(it, workers=None):
    if workers:
        with futures.ProcessPoolExecutor(max_workers=workers) as executor:
            res = executor.map(get_thing, it)
    else:
        with futures.ProcessPoolExecutor() as executor:
            res = executor.map(get_thing, it)
    return len(list(res))

def process_main(it, workers=None):
    t0 = time.time()
    count = get_many_process(it, workers)
    elapsed = time.time() - t0
    msg = '\n{} things got in {:.2f}s' 
    print(msg.format(count, elapsed))

In [16]:
process_main(range(20))


20 things got in 3.05s


In [17]:
# for the same reasons. 
process_main(range(20), workers=10)


20 things got in 2.05s


In [18]:
# this is just demosntrating name is __main__
print(__name__)

__main__


In [19]:
# identical example with threads
# instead of multi-threading 
# doing multi-processing 

# now it takes 14 seconds, not 10 seconds because of startup and overhead
# in cpubound process, this woul otherwise tak 20 seconds, reduce to 14 seconds. 

# Takeaway 
# cpu bound - use multiprocessing, because it will be faster 
# io bound - use multithreading 
import multiprocessing
start = time.time()
p=multiprocessing.Process(target=cpuy2)
p.start()
cpuy()
p.join()
print("mp elapsed:", time() - start)

fib(0) is 0
fib(1) is 1
fib(2) is 1
fib(3) is 2
fib(4) is 3
fib(5) is 5
fib(6) is 8
fib(7) is 13
fib(8) is 21
fib(9) is 34
fib(10) is 55
fib(11) is 89
fib(12) is 144
fib(13) is 233
fib(14) is 377
fib(15) is 610
fib(16) is 987
fib(17) is 1597
fib(18) is 2584
fib(19) is 4181
fib(20) is 6765
fib(21) is 10946
fib(22) is 17711
fib(23) is 28657cpuy2 fib(0) is 0
cpuy2 fib(1) is 1
cpuy2 fib(2) is 1
cpuy2 fib(3) is 2
cpuy2 fib(4) is 3
cpuy2 fib(5) is 5
cpuy2 fib(6) is 8
cpuy2 fib(7) is 13
cpuy2 fib(8) is 21
cpuy2 fib(9) is 34
cpuy2 fib(10) is 55
cpuy2 fib(11) is 89
cpuy2 fib(12) is 144
cpuy2 fib(13) is 233
cpuy2 fib(14) is 377
cpuy2 fib(15) is 610
cpuy2 fib(16) is 987
cpuy2 fib(17) is 1597
cpuy2 fib(18) is 2584
cpuy2 fib(19) is 4181
cpuy2 fib(20) is 6765
cpuy2 fib(21) is 10946
cpuy2 fib(22) is 17711

fib(24) is 46368
fib(25) is 75025cpuy2 fib(23) is 28657
cpuy2 fib(24) is 46368

fib(26) is 121393cpuy2 fib(25) is 75025

fib(27) is 196418cpuy2 fib(26) is 121393

fib(28) is 317811cpuy2 fib(27) is 

TypeError: 'module' object is not callable

### NOTES Takeaway 
cpu bound - use multiprocessing, because it will be faster 

io bound - use multithreading 

If I wanted to write a process that kept doing some long standing computation, what's the simplest code? 

While 1: 
    ... 
   
That front end process, which is basically this repl loop, you could run in main process 

Background computation could do... 

Problem with this is that ... 

What happens when you wait for input? Can't go forward and do any computations. That * means your' elocked until you do anything. What's going to happen is going to have these callbacks, but your main thread is blocked. Nothing can be done with this. Need to find a way to unlocking this, by using threads to give up GILs or something, or do something asyncronous. It's not enough just to have it run in process in back. This is a process you'll be implementing in your careers. It requires two things: 1) multi-threaded front end with process in back or requires 2) coroutine front end with process in back end. Doesn't have to be coroutine just anything that can be asyncronous. 

Otherwise callback would get stuck. 

In [29]:
input('>')

>a


'a'

### NOTES sockets

very simple notion, there's a notion of opening a pipe to get data from that. 

For sockets, distinction between client and server

client socket -- browser opens to get data from elsewhere. Loop exists which allowed to exist beacuse of server socket. Server socket waits for requests from client sockets. 

Again do that using threads or fork to handle threads to handle client socket on server side. Or can do asyncronous. 3 different ways to do it. Apache used to do it in old asycronous fashion. Apache added in way to do with threads. Node.js is completely asycnronous in one process. 

There's a disticntion between client and server socket. A stream and internet domain socket. 

Can have non blocking sockets using select system call. 

### sockets

- distinction between "client socket" and "server socket"
- default `socket.socket(family=AF_INET, type=SOCK_STREAM, proto=0, fileno=None)`
- server socket sits and creates client sockets
- non-blocking sockets and the `select` system call

Read: https://docs.python.org/3.5/howto/sockets.html

### Writing a web page fetcher

We'll eventually use the asyncio module to play with web page fetching and crawling, but lets build up to that by writing a simple fetcher. We'll start with blocking, then move to non-blocking, and finally to co-routines, and even more finally to `yield from` based co-routines.

Adapted from http://aosabook.org/en/500L/a-web-crawler-with-asyncio-coroutines.html

#### Blocking fetch

In [30]:
import socket
def fetch(host, url):
    sock = socket.socket()
    sock.connect((host, 80))
    request = 'GET {} HTTP/1.0\r\nHost: {}\r\n\r\n'.format(url, host)
    sock.send(request.encode('ascii'))
    response = b''
    chunk = sock.recv(4096)
    # if chunk comes back empty, then nothing else wmore to do
    while chunk:
        response += chunk
        chunk = sock.recv(4096)
    return response

In [31]:
from IPython.display import HTML, IFrame
HTML(str(fetch("www.example.com","/")))
# converting a bunch of bytes to string, the passing it through Ipython HTML

In [None]:
# TODO stopped paying attention here until rest

#### Basic non-blocking

In [32]:
# underlying operating system, they look at a bunch of file descriptors, actual files have file descriptors 
# basically do I have any data on this? 
# 
host="www.example.com"
url="/"
request = 'GET {} HTTP/1.0\r\nHost: {}\r\n\r\n'.format(url, host)
encoded = request.encode('ascii')
sock = socket.socket()
sock.setblocking(False)
try:
    sock.connect(('xkcd.com', 80))
except BlockingIOError:
    pass
while True:
    try:
        sock.send(encoded)
        break  # Done.
    except OSError as e:
        pass

print('sent')

sent


This has only been implemented partially. Notice how the `sock.send` spins in a loop.

This eats cycles. the solution is to use select/kqueue/epoll from a small number of connections to a large number of them. The basic idea behind `select` is to wait for an event to occur on a small set of non-blocking sokets.

We'll use python's `DefaultSelector`, an addition from python 3.4 that automatically chooses the "best" select like implementation on your system.


In [33]:
from selectors import DefaultSelector, EVENT_WRITE

# set up selector here 
selector = DefaultSelector()
host="www.example.com"
sock = socket.socket()
sock.setblocking(False)
try:
    sock.connect((host, 80))
except BlockingIOError:
    pass

# 
def connected():
    selector.unregister(sock.fileno())
    print('connected!', flush=True)

# connected can be a function, class, or None or anything
# finding some way to communicate to select loop that if had an event that something was written, then give connected, 
# basically prints 'connected'
selector.register(sock.fileno(), EVENT_WRITE, connected)

SelectorKey(fileobj=56, fd=56, events=2, data=<function connected at 0x1072c3620>)

`connected` is the **callback** run when the connection happens.

In [38]:

def loop():
    start = time.time()
    while True:
        if time.time() - start > 10:
            break
        events = selector.select()
        # if any events, then this for loop will happen. Mask is not useful now. Key is main important thing. 
        # this is called an event loop where just sit and wait for something to happen
        # 100 sockets and websites, then call some function 
        # now there's this notion of callbacks
        for event_key, event_mask in events:
            callback = event_key.data
            callback()

Such a loop is called an "event loop". An async frameworkhas two parts: (a) such an event loop and (b) non-blocking sockets. It all runs on one thread. This is a system, it should be obvious for I/O bound problems.

What have we demonstrated already? We showed how to begin an operation and execute a callback when the operation is ready. An async framework builds on the two features we have shown—non-blocking sockets and the event loop—to run concurrent operations on a single thread.

Guido:
>We have achieved "concurrency" here, but not what is traditionally called "parallelism". What asynchronous I/O is right for, is applications with many slow or sleepy connections with infrequent events.

In [39]:
loop() #loop will destruct after 10 secs

connected!


#### async with response reading

In [6]:
from selectors import DefaultSelector, EVENT_READ, EVENT_WRITE

selector = DefaultSelector()
class Fetcher:
    def __init__(self, url, host):
        self.response = b''  # Empty array of bytes.
        self.host = host
        self.url = url
        self.sock = None
        
    # Method on Fetcher class.
    def fetch(self):
        self.sock = socket.socket()
        self.sock.setblocking(False)
        try:
            self.sock.connect((self.host, 80))
        except BlockingIOError:
            pass

        # Register next callback.
        selector.register(self.sock.fileno(),
                          EVENT_WRITE,
                          self.connected)

    def connected(self, key, mask):
        print('connected!', flush=True)
        selector.unregister(key.fd)
        request = 'GET {} HTTP/1.0\r\nHost: {}\r\n\r\n'.format(self.url, self.host)
        self.sock.send(request.encode('ascii'))

        # Register the next callback.
        selector.register(key.fd,
                          EVENT_READ,
                          self.read_response)
        
    def read_response(self, key, mask):
        global stopped
        
        chunk = self.sock.recv(128)  # USUALLY 4k chunk size, here small
        if chunk:
            print("read chunk", flush=True)
            self.response += chunk
        else:
            print("all read", flush=True)
            selector.unregister(key.fd)  # Done reading.
            stopped=True
            
stopped = False

def loop():
    while not stopped:
        events = selector.select()
        for event_key, event_mask in events:
            callback = event_key.data
            callback(event_key, event_mask)

In [7]:
fetcher = Fetcher('/353/', 'xkcd.com')
fetcher.fetch()
loop()

NameError: name 'socket' is not defined

You can see how the control-flow is chained together by having the connected callback do the resposing. Beyond a 2-3 ladder, this gets confusing and onerous (see some node.js code). As compared to a blocking program, where the continuation of the program is stored and adressed via the instruction pointer in a sequential fashiom, here the cintinuation is stored by registering the callbacks.'

Since the current frame is popped out of the stack, exceptions have a hard time figuring the origin This is called stack-ripping.

So, even apart from the long debate about the relative efficiencies of multithreading and async, there is this other debate regarding which is more error-prone: threads are susceptible to data races if you make a mistake synchronizing them, but callbacks are stubborn to debug due to stack ripping. And within a bit, we get callback soup.

https://thesynchronousblog.wordpress.com/tag/stack-ripping/

Threads seem to offer a more natural way of programming as the programmer with all state in thread’s single stack.


So why not use them. As we said last time: synchronization and overhead. 

But we can do better with Coroutines!

Guido:
>We entice you with a promise. It is possible to write asynchronous code that combines the efficiency of callbacks with the classic good looks of multithreaded programming. This combination is achieved with a pattern called "coroutines". Using Python 3.4's standard asyncio library, and a package called "aiohttp", fetching a URL in a coroutine is very direct7:

    @asyncio.coroutine
    def fetch(self, url):
        response = yield from self.session.get(url)
        body = yield from response.read()
        
In 3.5 its even more clear:

async def fetch(self, url):
        response = await self.session.get(url)
        body = await response.read()

### Back to the Future with co-routines

In [22]:
from selectors import DefaultSelector, EVENT_READ, EVENT_WRITE
import socket
selector = DefaultSelector()

The future, as you might expect is something with callbacks...

In [43]:
class MyFuture:
    def __init__(self):
        self.result = None
        self._callbacks = []

    def add_done_callback(self, fn):
        self._callbacks.append(fn)

    def set_result(self, result):
        self.result = result
        for fn in self._callbacks:
            fn(self)

We need a "main" to yield to.

In [44]:
class Fetcher:
    
    def __init__(self, url, host):
        self.url = url
        self.host = host
        self.response = b''  # Empty array of bytes.

        
    def fetch(self):
        global stopped
        sock = socket.socket()

        sock.setblocking(False)
        try:
            sock.connect((self.host, 80))
        except BlockingIOError:
            pass

        f = MyFuture()

        #resolves the future by setting a result on it
        def on_connected():
            print('on connected cb ran', flush=True)
            f.set_result(None)
        
        
        
        selector.register(sock.fileno(),
                          EVENT_WRITE,
                          on_connected)
        print("about to yield connection future", flush=True)
        yield f#this makes it look like fetch has returned the "future"
        #bit we have not lost the state (or have to have carried it in obj)
        #a send in will continue us here
        print('we were connected! now back in gen', flush=True)
        selector.unregister(sock.fileno())
        request = 'GET {} HTTP/1.0\r\nHost: {}\r\n\r\n'.format(self.url, self.host)
        sock.send(request.encode('ascii'))
        while True:
            print("in loop")
            #now create a new future for the data-recieving call
            f = MyFuture()
            def on_response():
                chunky = sock.recv(4096)  # 4k chunk size.
                f.set_result(chunky)
            selector.register(sock.fileno(),
                              EVENT_READ,
                              on_response)
            #now to restart the gen, we will from the main
            #throw the data right back in
            chunk = yield f
            selector.unregister(sock.fileno())
            if chunk:
                print("len(chunk)",len(chunk))
                self.response += chunk
            else:
                print("all read")
                stopped= True
                break


        
    

In [45]:
#But when the future resolves, what resumes the generator? We need a coroutine driver. Let us call it "task":
#(this is our main)
class Task:
    def __init__(self, coro):
        self.coro = coro
        f = MyFuture()
        print(">>sending none to initial future",f)
        f.set_result(None)
        print("...stepping")
        self.step(f)
        print(">>>after priming")

    def step(self, future):
        try:
            print("sending", type(future.result))
            next_future = self.coro.send(future.result)
            print('got next future', next_future)

        except StopIteration:
            print("si")
            return None
        next_future.add_done_callback(self.step)

In [46]:
stopped=False
def loop():
    while not stopped:
        events = selector.select()
        for event_key, event_mask in events:
            callback = event_key.data
            callback()

In [47]:
fetcher = Fetcher('/353/', 'xkcd.com')
Task(fetcher.fetch())
stopped=False
loop()

>>sending none to initial future <__main__.MyFuture object at 0x107284550>
...stepping
sending <class 'NoneType'>
about to yield connection future
got next future <__main__.MyFuture object at 0x107284e10>
>>>after priming
on connected cb ran
sending <class 'NoneType'>
we were connected! now back in gen
in loop
got next future <__main__.MyFuture object at 0x107284550>
sending <class 'bytes'>
len(chunk) 1396
in loop
got next future <__main__.MyFuture object at 0x107284588>
sending <class 'bytes'>
len(chunk) 1396
in loop
got next future <__main__.MyFuture object at 0x107284d30>
sending <class 'bytes'>
len(chunk) 4096
in loop
got next future <__main__.MyFuture object at 0x107284278>
sending <class 'bytes'>
len(chunk) 626
in loop
got next future <__main__.MyFuture object at 0x107284470>
sending <class 'bytes'>
all read
si


#### Refactoring using generators

In [48]:
#But when the future resolves, what resumes the generator? We need a coroutine driver. Let us call it "task":
#(this is our main)
class Task:
    def __init__(self, coro):
        self.coro = coro
        f = MyFuture()
        print(">>sending none to initial future",f)
        f.set_result(None)
        print("...stepping")
        self.step(f)
        print(">>>after priming")

    def step(self, future):
        try:
            print("sending", type(future.result))
            next_future = self.coro.send(future.result)
            print('got next future', next_future)

        except StopIteration:
            print("si")
            return None
        next_future.add_done_callback(self.step)

In [49]:
def read(sock):
    f = MyFuture()

    def on_readable():
        f.set_result(sock.recv(4096))

    selector.register(sock.fileno(), EVENT_READ, on_readable)
    chunk = yield f  # Read one chunk.
    selector.unregister(sock.fileno())
    return chunk

In [50]:
def read_all(sock):
    global stopped
    response = []
    # Read whole response.
    chunk = yield from read(sock)
    while chunk:
        response.append(chunk)
        chunk = yield from read(sock)
    stopped=True
    return b''.join(response)

>If you squint and make the yield from statements disappear it looks like  conventional functions doing blocking I/O. But in fact, read and read_all are coroutines. Yielding from read pauses read_all until the I/O completes. While read_all is paused, asyncio's event loop does other work and awaits other I/O events; read_all is resumed with the result of read on the next loop tick once its event is ready.

In [51]:
class Fetcher:
    
    def __init__(self, url, host):
        self.url = url
        self.host = host
        self.response = b''  # Empty array of bytes.

        
    def fetch(self):
        global stopped
        sock = socket.socket()

        sock.setblocking(False)
        try:
            sock.connect((host, 80))
        except BlockingIOError:
            pass

        f = MyFuture()

        def on_connected():
            print('on connected cb ran')
            f.set_result(None)
        
        
        
        selector.register(sock.fileno(),
                          EVENT_WRITE,
                          on_connected)
        print("about to yield connection future")
        yield f
        print('connected!')
        selector.unregister(sock.fileno())
        request = 'GET {} HTTP/1.0\r\nHost: xkcd.com\r\n\r\n'.format(self.url)
        sock.send(request.encode('ascii'))
        yield from read_all(sock)

In [52]:
fetcher = Fetcher('/353/', 'xkcd.com')
Task(fetcher.fetch())
stopped = False
loop()

>>sending none to initial future <__main__.MyFuture object at 0x107293a90>
...stepping
sending <class 'NoneType'>
about to yield connection future
got next future <__main__.MyFuture object at 0x107293ac8>
>>>after priming
on connected cb ran
sending <class 'NoneType'>
connected!
got next future <__main__.MyFuture object at 0x107293f60>
sending <class 'bytes'>
got next future <__main__.MyFuture object at 0x107293630>
sending <class 'bytes'>
si


![](http://aosabook.org/en/500L/crawler-images/yield-from.png)

There is one yield left amongst the yield froms. For consistency, this can be fixed...it also lets us change implementations under the hood..

In [53]:
def read(sock):
    f = MyFuture()

    def on_readable():
        f.set_result(sock.recv(4096))

    selector.register(sock.fileno(), EVENT_READ, on_readable)
    chunk = yield from f  # Read one chunk.
    selector.unregister(sock.fileno())
    return chunk

In [54]:
class MyFuture:
    def __init__(self):
        self.result = None
        self._callbacks = []

    def add_done_callback(self, fn):
        self._callbacks.append(fn)

    def set_result(self, result):
        self.result = result
        print("cblist", self._callbacks)
        for fn in self._callbacks:
            fn(self)
            
    def __iter__(self):
        yield self
        return self.result

In [55]:
class Fetcher:
    
    def __init__(self, url, host):
        self.url = url
        self.host = host
        self.response = b''  # Empty array of bytes.

        
    def fetch(self):
        global stopped
        sock = socket.socket()

        sock.setblocking(False)
        try:
            sock.connect((self.host, 80))
        except BlockingIOError:
            pass

        f = MyFuture()

        def on_connected():
            print('on connected cb ran')
            f.set_result(None)
        
        
        
        selector.register(sock.fileno(),
                          EVENT_WRITE,
                          on_connected)
        print("about to yield connection future")
        yield from f
        print('connected!')
        selector.unregister(sock.fileno())
        request = 'GET {} HTTP/1.0\r\nHost: xkcd.com\r\n\r\n'.format(self.url)
        sock.send(request.encode('ascii'))
        yield from read_all(sock)

In [56]:
fetcher = Fetcher('/353/','xkcd.com')
Task(fetcher.fetch())
stopped = False
loop()

>>sending none to initial future <__main__.MyFuture object at 0x1072b07b8>
cblist []
...stepping
sending <class 'NoneType'>
about to yield connection future
got next future <__main__.MyFuture object at 0x1072b0ef0>
>>>after priming
on connected cb ran
cblist [<bound method Task.step of <__main__.Task object at 0x1072a9be0>>]
sending <class 'NoneType'>
connected!
got next future <__main__.MyFuture object at 0x1072b0dd8>
cblist [<bound method Task.step of <__main__.Task object at 0x1072a9be0>>]
sending <class 'bytes'>
got next future <__main__.MyFuture object at 0x1072b0d68>
cblist [<bound method Task.step of <__main__.Task object at 0x1072a9be0>>]
sending <class 'bytes'>
got next future <__main__.MyFuture object at 0x1072b08d0>
cblist [<bound method Task.step of <__main__.Task object at 0x1072a9be0>>]
sending <class 'bytes'>
got next future <__main__.MyFuture object at 0x1072b0dd8>
cblist [<bound method Task.step of <__main__.Task object at 0x1072a9be0>>]
sending <class 'bytes'>
got nex

## Lab

Implement a URL fetcher using Beautiful Soup in the callback version. We will implement a similar one using coroutines on wednesday. 

The implimentation will extend the read_response method by parsing for URL's using `bs4` . Start by creating globals:
```
urls_todo = set(['/'])
seen_urls = set(['/'])
```

then:

```
links = self.parse_links()#write this
```
(using self.response)

Then use the set `difference` method  to add new links to `urls_todo` and recursively set up a `Fetcher` instance.

Now update the `seen_urls` and `urls_todo` thus:
```
seen_urls.update(links)
urls_todo.remove(self.url)
if not urls_todo:
    stopped = True
```

In [29]:
from selectors import DefaultSelector, EVENT_READ, EVENT_WRITE
from bs4 import BeautifulSoup
import socket

selector = DefaultSelector()
class Fetcher:
    def __init__(self,  url, host):
        self.response = b''  # Empty array of bytes.
        self.host = host
        self.url = url
        self.sock = None
        
    # Method on Fetcher class.
    def fetch(self):
        self.sock = socket.socket()
        self.sock.setblocking(False)
        try:
            self.sock.connect((self.host, 80))
        except BlockingIOError:
            pass

        # Register next callback.
        selector.register(self.sock.fileno(),
                          EVENT_WRITE,
                          self.connected)

    def connected(self, key, mask):
        print('connected!', flush=True)
        selector.unregister(key.fd)
        request = 'GET {} HTTP/1.0\r\nHost: {}\r\n\r\n'.format(self.url, self.host)
        self.sock.send(request.encode('ascii'))

        # Register the next callback.
        selector.register(key.fd,
                          EVENT_READ,
                          self.read_response)
        
    def read_response(self, key, mask):
        global stopped
        global counter
        
        chunk = self.sock.recv(128)  # USUALLY 4k chunk size, here small
        if chunk:
            print(".", end="", flush=True)
            self.response += chunk
        else:
            print("all read", flush=True)
            selector.unregister(key.fd)  # Done reading. Still want to keep this
            # remove completed url
            urls_todo.remove(self.url)
            
            # create things update globals
            links = self.parse_links()
            
            # filter already seend
            # Then use the set difference method to add new links to urls_todo and recursively set up a Fetcher instance.
            # loop through the links in to do
            for link in links.difference(seen_urls):
                counter += 1
                if counter < maxit: 
                    urls_todo.add(link)
                    # create new Fetcher and call fetch
                    Fetcher(link, self.host).fetch()  # <- New Fetcher.
                else: 
                    break

            seen_urls.update(links)
            # do not stopped to True until urls_to_do is empty
            if not urls_todo:
                stopped = True                                
                
    def parse_links(self):
        # empty list of URLS
        # look through self.response to get a list of URLs, add to the list
        # return the list
        links = []
        soup = BeautifulSoup(self.response,  "lxml")
        links = [link['href'] for link in soup.find_all('a', href=True)]
        return set(links)
            
stopped = False

def loop():
    while not stopped:
        events = selector.select()
        for event_key, event_mask in events:
            callback = event_key.data
            callback(event_key, event_mask)

In [30]:
# url added here when Fetcher created.  Better name would be urls_fetching
# set up globals
urls_todo = set(['/'])
# superset of urls_todo
seen_urls = set(['/'])
maxit = 10  
counter = 0

fetcher = Fetcher('/', 'xkcd.com')
fetcher.fetch()
loop()

connected!
......................................................all read
connected!
connected!
connected!
connected!
connected!
connected!
connected!
connected!
connected!
.................................all read
..............................all read
....................all read
....................................................................................................................................................................................all read
......................................................................all read
....all read
all read
...all read
all read


In [None]:
# Not that long, basically start at some page 
# extract links from page 
# use this callback system to fetch those links 
# parse response and create new links 
# let the whole thing complete

# at the end of it, will have a depth1 parser, don't want to keep going down and won
# might want to add level argument to this class to tell you what level ought to be 
# say that level = 0, as a default, in the Fetcher, if level = 0, then let them go into it 
# class Fetcher:
#    def __init__(self, host, url, level = 0):

# don't worry about getting all the urls, just get the ones from xkcd, within one domain

# There are 3 versions of the Fetcher class in the lecture notebook.  One of them (the first one) registers the 
# callbacks inside it, so the instructions mean rewrite the first Fetcher example (the callback one) using Beautiful 
# Soup to implement a URL fetcher.

# See "Programming with Callbacks" section: 
# http://aosabook.org/en/500L/a-web-crawler-with-asyncio-coroutines.html

In [None]:
# call seen_urls.difference(links) to see what links I haven't seen yet 
# add them to urls_to_do
# TODO recursively set up a fetcher instance??? 
# want some max # of active Fetchers at any given time, then add more Fetchers if Urls_to_do
# naive thing to do is to ...?

# urls_fetching or urls_acting
# every time add to urls_to_do, you should also construct a Fetcher and call its Fetch method, only do that if it's not 
# already in urls_to_do

# seen_urls is superset of urls to do, 
# always add all the links to seen_urls. 