# Iterators, iterables, generators

Python has three concepts which are often confused: **iterator**, **iterable**, and **generator**.

They're actually fairly straightforward.  Confusion arises because examples of each can be
created in a variety of ways, and because their relationship is not clearly spelled out
in the Python type system.

## Iterators

Let's start with the most basic concept, an iterator.  An iterator is an object that produces
the members of a sequence of values one by one.  It does this by
supporting a `__next__` method that produces the next member of the sequence.  This
means it has to remember where it currently is in the sequence, which is called keeping **state**.
We illustrate this with a string, which is not an iterator, but can be used to create an iterator with
its `__iter__` method.

In [113]:
from string import ascii_lowercase
from collections.abc import Iterator

it1 = ascii_lowercase.__iter__()
print(it1.__next__())
print(it1.__next__())
print(it1.__next__())
print(next(it1))       # Can equivalently use the Python builtin function `next`
print(it1.__next__())  # Using `next()` has same effect on state as using .__next__()

a
b
c
d
e


Want to start reciting the alphabet again?  Create another iterator.

In [34]:
it2 = ascii_lowercase.__iter__()
print(it2.__next__())
print(it2.__next__())
print(it2.__next__())

a
b
c


What happens when we reach the end of the sequence?

In [39]:
it3 = ascii_lowercase.__iter__()
for _i in range(26):
    print(it3.__next__(), end = ' ')
print()
it3.__next__()



a b c d e f g h i j k l m n o p q r s t u v w x y z 


StopIteration: 

An exhausted iterator raises the `StopIteration` exception when `next` is called.
Another way to exhaust an iterator is to call `list` on it;
this creates a list of all the elements remaining in the sequence.  Again the iterator, having been enumerated,
raises the `StopIteration` exception if called upon to produce a
next element.

In [133]:
it4 = ascii_lowercase.__iter__()
print(list(it4))
print()
print('Calling next')
next(it4)

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

Calling next


StopIteration: 

There is no reason why an iterator has to produce a finite sequence.  This is very convenient when an unknown number of values may be needed, all appropriately sequenced.

In [67]:
class Counter:
    
    def __init__(self, val = -1):
        self.val = val

    def __next__ (self):
        self.val += 1
        return self.val

    def __iter__ (self):
        return self

In [134]:
ctr = Counter()
print(next(ctr))
print(next(ctr))
print(ctr.val)  # Look up val without incrementing
print(next(ctr))


0
1
1
2


Note that the `Counter` instance formally qualifies as an iterator. An instance qualifies as an iterator if it implements the right methods, namely `__next__` and `__iter__` (which will be discussed below).  This is the so-called **iterator protocol**.

In [None]:
from collections.abc import Iterator
isinstance(ctr,Iterator)
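
The `Counter` above never stops. As a complement, here is a minimal sketch of a finite iterator. The class name `Countdown` is hypothetical (it does not appear elsewhere in this notebook). It shows the other half of the protocol: `__next__` raises `StopIteration` once the values run out.

```python
from collections.abc import Iterator


class Countdown:
    """Hypothetical finite iterator: produces start, start-1, ..., 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__.
        return self

    def __next__(self):
        if self.current <= 0:
            # Signal exhaustion; for-loops and list() catch this.
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


cd = Countdown(3)
print(isinstance(cd, Iterator))  # True: both protocol methods are present
print(list(cd))                  # [3, 2, 1]
print(list(cd))                  # []: the iterator is already exhausted
```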

## Iterables

The behavior we illustrated with the string `ascii_lowercase` could have been illustrated with any container: lists, tuples, dictionaries, sets.  All support an `__iter__` method that returns an iterator capable of stepping through the elements of the container.  In the case of sequences like lists, strings, and tuples, the elements will be produced in sequence order.  Dictionaries produce their keys in insertion order (guaranteed since Python 3.7), while the order for sets is arbitrary.

In fact, the way Python `for`-loops are implemented is with iterators.  Let C be the object we want to loop through. 

```
for x in C:
   code line 1
   code line 2
   ...
```

Under the hood we call the `__iter__` method of `C` to produce an iterator.  That iterator is then used to set the value of the loop variable `x` each time we want to execute the code block in the loop; the loop ends when the iterator raises `StopIteration`.  This means that any object we want to be able to loop through has to support an `__iter__` method.
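
The loop machinery can be written out by hand. The following is only a rough sketch of what a `for`-loop does, not the interpreter's actual implementation; the function name `loop_by_hand` and its `body` parameter are made up for illustration.

```python
def loop_by_hand(iterable, body):
    """Roughly what `for x in iterable: body(x)` does."""
    iterator = iter(iterable)      # calls iterable.__iter__()
    while True:
        try:
            x = next(iterator)     # calls iterator.__next__()
        except StopIteration:
            break                  # the iterator is exhausted: leave the loop
        body(x)


collected = []
loop_by_hand('abc', collected.append)
print(collected)   # ['a', 'b', 'c']
```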

The concept is so important we have a name for it. Objects that support an `__iter__` method are called **iterables**.  Notice that iterables are not the same thing as iterators.  We illustrated that above
in the `ascii_lowercase` example.  There was the iterable string `ascii_lowercase` and there were distinct iterators produced from it, `it1`, `it2`, and `it3`.  Each iterator independently kept track of its own state (`it1` still knows that the next letter it needs to produce is `'f'`).  Once iterated through, an iterator is empty and cannot be re-used. The iterable `ascii_lowercase` has no state.  It is the alphabet.  We can loop through it as many times as needed.

A somewhat subtle but important point is that iterators can be looped through too.

The idea is that when we are at a certain position in a sequence, we want to compute something that involves looping through the rest of the sequence.

For example, we are in the state represented by `it1`, where the next letter is `'f'`, 
and we want to loop through `'f'` through `'z'`.  We do:

In [114]:
for let in it1:
    print(let, end = ' ')
print(next(it1))

f g h i j k l m n o p q r s t u v w x y z 

StopIteration: 

In order to loop through the iterator, an iterator had to be generated to provide values for the loop variable.  Isn't that too many iterators?  Yes, so when the `__iter__`-method of `it1` is called it returns `it1`, and the loop simply steps through all the next states of `it1`.

As the final line shows, after being looped through, the iterator is then empty and raises
a `StopIteration` error if asked for a next element.

We can summarize the discussion thus far as follows:

1. An **iterator** is a state-preserving object that produces the next member of a sequence of values on demand.  In particular, an iterator produces the next member in the sequence when its `__next__` method is called.  If the last member of the sequence has already been produced, an iterator raises a `StopIteration` exception when `__next__` is called.
2. An **iterable** is an object that supports an `__iter__` method, which is the same as saying it can be looped through in a `for`-loop.  All containers are iterables.  For containers, each time the `__iter__` method is called, a fresh iterator is created.
3. Because iterators support an `__iter__` method, they are iterables too. They may themselves be looped through.  In contrast to containers, the `__iter__` method of an iterator just returns the iterator, and therefore an iterator can be looped through only once.
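
The three points above can be checked directly with the abstract base classes in `collections.abc`. This is a small sketch using a list as the container; the variable names are illustrative.

```python
from collections.abc import Iterable, Iterator

letters = ['a', 'b', 'c']     # a container
letters_it = iter(letters)    # an iterator derived from it

# Every iterator is an iterable, but not every iterable is an iterator.
print(isinstance(letters, Iterable), isinstance(letters, Iterator))        # True False
print(isinstance(letters_it, Iterable), isinstance(letters_it, Iterator))  # True True

# A container hands out a fresh iterator on every call to iter() ...
print(iter(letters) is iter(letters))   # False
# ... while an iterator just returns itself.
print(iter(letters_it) is letters_it)   # True
```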

We conclude this preliminary discussion by noting that there are other kinds of iterables besides containers.
This was shown in our original picture of the Python type hierarchy, displayed again below.

![python_type_tree.png](attachment:python_type_tree.png)

The picture shows probably the most important case of non-container iterables, input-output streams such as file streams, but it leaves many other instances out. 

Among the standard Python iterables not shown are iterators such as the
container-generated iterators discussed above, and `zip` and `enumerate`
instances.  We discuss these below.

## Relationship of iterables to their iterators

In general, iterators will lack or significantly alter many of the properties of the iterables they are derived from.  For example, all containers are iterables that have a length and support the `in` test.  Their iterators either do not support these operations or exhibit somewhat surprising state-changing behavior.

In [149]:
print(len(it1))

TypeError: object of type 'str_iterator' has no len()

In [79]:
it6 = ascii_lowercase.__iter__()
for let in 'abcdef':
    it6.__next__()
print('x' in it6)
next(it6)

True


'y'

What is going on in the `in` case is that, under the hood, the test iterates through the
iterator and stops as soon as the requested element is found.  Hence the next call to `next` produces `'y'`. This also means that when an `in` test is `False`, the iterator is exhausted.

In [100]:
print('aa' in it6)
next(it6)

False


StopIteration: 

The general moral here is that when using iterators explicitly in code (as opposed to the implicit iterators of a `for`-loop) one must take into account any operations that will change the state of the iterator.
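
The state change is not always a nuisance. A well-known idiom, sketched here with a hypothetical helper name `is_subsequence`, relies on `in` consuming an iterator up to the match, so each search resumes where the previous one stopped.

```python
def is_subsequence(small, big):
    """True if the items of `small` occur in `big` in the same order."""
    it = iter(big)
    # Each `in` test advances `it` past the match, so the next search
    # starts from there; a failed test exhausts `it`.
    return all(item in it for item in small)


print(is_subsequence('ace', 'abcde'))  # True
print(is_subsequence('aec', 'abcde'))  # False
```

Calling `in` on the string `big` itself would not work here, because a string has no state and every test would start from the beginning.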

#### Special facts about dictionary iterators

Some iterables logically connect to more than one kind of iterator. In the case of dictionaries, for example, one may want to loop through the keys, the values, or paired keys and values. Of course, only one of these ways of
looping can be the definition of a dictionary's `__iter__` method.  Python chooses to let looping through
keys be the definition of a dictionary's `__iter__` method, consistent with how `in` is defined for dictionaries, and to define separate container types for dictionary keys, dictionary values, and key-value pairs.

In [110]:
from collections.abc import Iterator
dd = dict(zip(ascii_lowercase, range(26)))
for l in dd:
    print(l, l in dd, end = ' ')
print()
print()
dd_k,dd_v,dd_i = dd.keys(),dd.values(),dd.items()
print(type(dd_k),type(dd_v), type(dd_i))
print(isinstance(dd_k,Iterator),isinstance(dd_v,Iterator),isinstance(dd_i,Iterator))
print()
print(len(dd_k),len(dd_v), len(dd_i))
print('a' in dd_k, 0 in dd_v, ('a',0) in dd_i)

a True b True c True d True e True f True g True h True i True j True k True l True m True n True o True p True q True r True s True t True u True v True w True x True y True z True 

<class 'dict_keys'> <class 'dict_values'> <class 'dict_items'>
False False False

26 26 26
True True True


Each of these types can be looped through, meaning each supports its own `__iter__` method.

In [90]:
# Loop through the dict_keys instances once
for l in dd_k:
    print(l, end = ' ')
print()
# Loop through dd_k a second time
for l in dd_k:
    print(l, end = ' ')
print()
# loop through the dict_values instance
for v in dd_v:
    print(v, end = ' ')
print()
# loop through the dict_items instance
for (k,v) in dd_i:
    print(k, v, end = '  ')
print()

a b c d e f g h i j k l m n o p q r s t u v w x y z 
a b c d e f g h i j k l m n o p q r s t u v w x y z 
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
a 0  b 1  c 2  d 3  e 4  f 5  g 6  h 7  i 8  j 9  k 10  l 11  m 12  n 13  o 14  p 15  q 16  r 17  s 18  t 19  u 20  v 21  w 22  x 23  y 24  z 25  


Importantly, these dictionary-derived container types are not themselves iterators.  They support no notion of state.  Thus, as illustrated above with the `dict_keys` object, the same `dict_keys` instance can be looped
through multiple times.

Like the basic Python container types they support looping by having an `__iter__` method that creates a fresh
iterator each time they are looped through.

An important motivation for having these special dictionary-derived containers is that they are **views** of the original dictionary.  That is, they represent the same data, and when the dictionary changes, they change.

In [97]:
dd['aa'] = 100
print(dd_k)
print(dd_v)
print(dd_i)

dict_keys(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'aa'])
dict_values([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 100])
dict_items([('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 4), ('f', 5), ('g', 6), ('h', 7), ('i', 8), ('j', 9), ('k', 10), ('l', 11), ('m', 12), ('n', 13), ('o', 14), ('p', 15), ('q', 16), ('r', 17), ('s', 18), ('t', 19), ('u', 20), ('v', 21), ('w', 22), ('x', 23), ('y', 24), ('z', 25), ('aa', 100)])


Finally, note that the iterator of a dictionary is no longer usable if the dictionary changes size after the iterator was created.  This is essentially because a dictionary is a set of key value pairs, and if the underlying set has been changed, all bets are off on what `next` should mean.

In [108]:
it200 = dd.__iter__()
dd['bb'] = 200
next(it200)

RuntimeError: dictionary changed size during iteration

This is the same error that will be raised if a dictionary's size changes while looping through it.

In [109]:
for k in dd:
    dd[k+k] = 100 * dd[k]

RuntimeError: dictionary changed size during iteration

Because the dictionary-derived containers are views of the same data, they will raise the same
error if the original dictionary size changes during looping:

In [111]:
# Looping the keys object
for k in dd_k:
    dd[k+k] = 100 * dd[k]

RuntimeError: dictionary changed size during iteration

The moral here: don't do this.  

Size-changing updates of a dictionary can be done other ways, for example through the `update` method, or by a loop that doesn't loop through the dictionary or a derived container, but some copy.
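
Both alternatives can be sketched on a small dictionary. The three-key dictionary below is a stand-in for the 26-letter `dd` used above, chosen only to keep the output short.

```python
dd = {'a': 0, 'b': 1, 'c': 2}

# Option 1: loop over a snapshot of the keys, not over the dictionary itself.
for k in list(dd):
    dd[k + k] = 100 * dd[k]
print(dd)  # {'a': 0, 'b': 1, 'c': 2, 'aa': 0, 'bb': 100, 'cc': 200}

# Option 2: build the new entries separately, then merge them with update().
dd2 = {'a': 0, 'b': 1, 'c': 2}
new_entries = {k + k: 100 * v for k, v in dd2.items()}
dd2.update(new_entries)
print(dd2 == dd)  # True
```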

## Generators

We are now ready to discuss **generators**.  

Generators are a special case of iterators distinguished by how they are created.  There are two
principal ways of creating a generator:

1.  A generator can be created by calling a function which contains a `yield` statement.
2.  A generator can be created by a **generator expression**, which is written like a list comprehension but produces its values one at a time, on demand.

For simplicity, we illustrate with the `ascii_lowercase` example.  We describe a more compelling use case below.

In [155]:
from string import ascii_lowercase
from collections.abc import Iterator

def get_elem_generator (seq):
    for x in seq:
        yield x

def get_elem_function (seq):
    for x in seq:
        return x
 

let_generator = get_elem_generator(ascii_lowercase)
print(let_generator, type(let_generator))
print()
X = get_elem_function(ascii_lowercase)
print(X, type(X))
print()
print(f'Is a generator an iterator? {isinstance(let_generator, Iterator)}')
print(f'next(let_generator): {next(let_generator)}')
print(f'next(let_generator): {next(let_generator)}')
print(f'next(let_generator): {next(let_generator)}')
print()
print(f"Make a string from the generator: {''.join(list(let_generator))}")
print()
try:
  print(next(let_generator))
except StopIteration as e:
  print(f'next(let_generator): ***Error*** (StopIteration)')

<generator object get_elem_generator at 0x7fbd998bd040> <class 'generator'>

a <class 'str'>

Is a generator an iterator? True
next(let_generator): a
next(let_generator): b
next(let_generator): c

Make a string from the generator: defghijklmnopqrstuvwxyz

next(let_generator): ***Error*** (StopIteration)


We see that the value obtained by calling the function defined with `return` is a single character,
while calling the function defined with `yield` produces a `generator` instance,
`let_generator`.  We can then call the `next` function on `let_generator` to produce successive letters of the alphabet.
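
It can help to see when the body of a generator function actually runs. The sketch below is a hypothetical variant of `get_elem_generator` (the name `traced_generator` and its `log` parameter are assumptions added for illustration) that records its progress in a list.

```python
def traced_generator(seq, log):
    """Like get_elem_generator, but records each step in `log`."""
    log.append('body started')
    for x in seq:
        log.append(f'about to yield {x}')
        yield x
    log.append('body finished')


log = []
gen = traced_generator('ab', log)
print(log)        # []: calling the function has not run any of its body
print(next(gen))  # a
print(log)        # ['body started', 'about to yield a']
print(list(gen))  # ['b']
print(log[-1])    # body finished
```

Each call to `next` runs the body only as far as the next `yield`, where the generator pauses with its local state intact.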

The problem with the example above is that much of the appeal of using a generator to produce the elements of
a sequence is lost when the entire sequence is already in memory, as is the case with the string `ascii_lowercase`.

A better use-case for a generator is looping through the contents of a **very** large file, say a Wikipedia dump,
or a Google ngram dump.  In this case we might very well want to write a function like the following:

```
def read_very_large_file (file_path):
    with open(file_path, 'r') as fh:
        for line in fh:
            yield line

reader = read_very_large_file('googleplex.txt')
```

`reader` is now a generator that generates the file line by line.
Here are the first 1000 lines (a mere scratch on the surface of this
monstrous stream):

```
first_1000_lines = [next(reader) for _ in range(1000)]
```

Do some computing, have lunch, return, and do:

```
second_1000_lines = [next(reader) for _ in range(1000)]
```

and that is the second 1000 lines.  Want to sample from the middle?

```
elsewhere_1000_lines = [line for i in range(100001) if (line:=next(reader)) and i > 99000]
```

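Note that each of these steps advances the generator.  The code above consumes a further 100,001 lines on top of the 2,000 already read, keeping only the last 1,000 of them, and any later sampling with this generator continues from where it left off.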

I tested the above code by executing the following cell; you can try it yourself with any suitably large local file.

In [156]:

def read_very_large_file (file_path):
    with open(file_path, 'r') as fh:
        for line in fh:
            yield line

# Call the generator function on a file.
file_path = '/Users/gawron/ext/corpus/bnc/written.txt'
reader = read_very_large_file(file_path)
print(reader)

## Read and check first 2000 lines 
first_1000_lines = [next(reader) for _ in range(1000)]
second_1000_lines = [next(reader) for _ in range(1000)]
print(len(first_1000_lines),first_1000_lines[100])
print()
print(len(second_1000_lines),second_1000_lines[100])

## Read the chunk from the middle and check
elsewhere_1000_lines = [line for i in range(100001) if (line:=next(reader)) and i > 99000]
print(len(elsewhere_1000_lines),elsewhere_1000_lines[100])

<generator object read_very_large_file at 0x7fbd998bd900>
1000 Race and Racism


1000 Over the last year we have moved therefore from care support to urgent training of Romanian people who can then become effective educators .

1000 The powerful balance of these figure compositions is highlighted when they are transposed into tubes and sheets of metal .
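
The standard library's `itertools.islice` offers another way to take slices from a generator without calling `next` by hand. The sketch below uses a small in-memory stream (`io.StringIO`) as a stand-in for a large file; the line counts are illustrative assumptions.

```python
import io
from itertools import islice

# Stand-in for a very large file: 50 short lines held in memory.
fake_file = io.StringIO(''.join(f'line {i}\n' for i in range(50)))
reader = (line for line in fake_file)

first_5 = list(islice(reader, 5))        # lines 0-4
second_5 = list(islice(reader, 5))       # lines 5-9
# Skip the next 20 lines, then take 5; positions are relative to where
# the generator currently stands, not to the start of the file.
later_5 = list(islice(reader, 20, 25))   # lines 30-34

print(first_5[0], second_5[0], later_5[0], sep='')
```

Skipped lines are still read and discarded, so this saves typing rather than I/O.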



Everything done above can also be done with generator expressions, which are written like list comprehensions, but enclosed in parentheses instead of square brackets.

The code is somewhat shorter and has much to recommend it.

In [136]:
# Replaces defining [and executing] `read_very_large_file`
reader = (line for line in open(file_path,'r'))
reader

<generator object <genexpr> at 0x7fbd99989c80>

Now, as before, we can use the generator we just created for whatever number of steps meets our needs,
in a list comprehension.

In [137]:
first_1000_lines = [next(reader) for _ in range(1000)]
second_1000_lines = [next(reader) for _ in range(1000)]
## Check out results thus far
print(len(first_1000_lines),first_1000_lines[100])
print()
print(len(second_1000_lines),second_1000_lines[100])

## Do the chunk from the middle and check
elsewhere_1000_lines = [line for i in range(100001) if (line:=next(reader)) and i > 99000]
print(len(elsewhere_1000_lines),elsewhere_1000_lines[100])

1000 Race and Racism


1000 Over the last year we have moved therefore from care support to urgent training of Romanian people who can then become effective educators .

1000 The powerful balance of these figure compositions is highlighted when they are transposed into tubes and sheets of metal .



The choice between writing a generator function and a generator expression is a little like the choice between a `for`-loop and a list comprehension.  A generator function definition is more flexible.  

But the generator expression is often far more convenient.
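
As one illustration of that extra flexibility, a generator function can keep state between `yield`s, which a single generator expression cannot easily do. The helper name `in_chunks` below is hypothetical, and the sketch is just one possible way to group a stream into fixed-size pieces.

```python
def in_chunks(iterable, size):
    """Hypothetical helper: yield successive lists of up to `size` items."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        # Emit the final, possibly shorter, chunk.
        yield chunk


# A generator expression covers the simple one-value-in, one-value-out case.
squares = (n * n for n in range(7))

# The generator function accumulates items across iterations before yielding.
print(list(in_chunks(squares, 3)))  # [[0, 1, 4], [9, 16, 25], [36]]
```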

## Zip, enumerate, and range

We turn to some important examples of builtin Python iterators and iterables not discussed above (and one important non-example), namely
`zip` instances, `enumerate` instances, and `range` instances.  The first two are iterators; the third, while similar in many respects to the first two, is not an iterator.  This discussion assumes
you basically know about the operation of `zip`, `enumerate`, and `range`.

`zip`, `enumerate`, and `range` instances are created by calling the respective Python functions:

In [5]:
R,Z,E = range(4,20,2), zip('abcde',range(5)), enumerate('abcde')
print(R,Z,E)
print()
print('Enumerate contents')
print('R: ', list(R))
print('Z: ', list(Z))
print('E: ', list(E))

range(4, 20, 2) <zip object at 0x7ff46882a388> <enumerate object at 0x7ff467fd6c60>

Enumerate contents
R:  [4, 6, 8, 10, 12, 14, 16, 18]
Z:  [('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 4)]
E:  [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e')]


In each case we have sequential information which can be enumerated on demand and is not computed until a demand is made.  We created such a demand above by calling `list`.
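
The on-demand behavior is easiest to see when one of the inputs is endless. This sketch pairs a string with `itertools.count`, a standard-library iterator that never stops; the variable name `numbered` is illustrative.

```python
from itertools import count

# count(1) is an endless iterator: 1, 2, 3, ...
numbered = zip(count(1), 'abc')

# Nothing has been paired yet; each next() computes exactly one tuple.
print(next(numbered))   # (1, 'a')
print(list(numbered))   # [(2, 'b'), (3, 'c')]
```

If `zip` built its result eagerly, the first line would never finish; because it is lazy, it stops as soon as the shorter input runs out.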

We illustrate the basic memory saving with a `range` instance.
Like `zip` and `enumerate` objects, a `range` object does not immediately expand the entire sequence it represents into memory.  A range with 1 million elements has the same size as a range of 100,000 elements.

In [141]:
import sys
N = 100000
R = range(N)
L = list(R)

print(f'{N:9,} integers List size: {sys.getsizeof(L):9,} Range size: {sys.getsizeof(R)}')

N = 1000000
R = range(N)
L = list(R)

print(f'{N:9,} integers List size: {sys.getsizeof(L):9,} Range size: {sys.getsizeof(R)}')
print()
R,Z,E = range(4,20,2), zip('abcde',range(5)), enumerate('abcde')



  100,000 integers List size:   800,056 Range size: 48
1,000,000 integers List size: 8,000,056 Range size: 48



However, the `range` object is not quite like the other two.  For one thing, it supports indexing
and `len`.  For many purposes, it behaves exactly like a list while being
more memory efficient. 

We illustrate:

In [2]:
from collections.abc import Iterator
R,Z,E = range(4,20,2), zip('abcde',range(5)), enumerate('abcde')
print('Range, Zip, Enumerator Iterators?')
print([isinstance(x, Iterator) for x in (R,Z,E)])
print()
print('Demoing len, integer-index access for range')
print(len(R), R[-3])
print()
print('Demoing len zip obj')
try:
    len(Z)
except TypeError as e:
    print('**Type Error**: ', e)
print('Demoing index zip obj')
try:
    Z[0]
except TypeError as e:
    print('**Type Error**: ', e)
print()
print('Demoing len enumerate obj')
try:
    len(E)
except TypeError as e:
    print('**Type Error**: ', e)
print('Demoing index enumerate obj')
try:
    E[0]
except TypeError as e:
    print('**Type Error**: ', e)

Range, Zip, Enumerator Iterators?
[False, True, True]

Demoing len, integer-index access for range
8 14

Demoing len zip obj
**Type Error**:  object of type 'zip' has no len()
Demoing index zip obj
**Type Error**:  'zip' object is not subscriptable

Demoing len enumerate obj
**Type Error**:  object of type 'enumerate' has no len()
Demoing index enumerate obj
**Type Error**:  'enumerate' object is not subscriptable


Unlike the two iterators, the range instance can also be looped through more than once.

In [143]:
print('Iterate on Zip obj try 1')
for (let,num) in Z:
    print(let,num)
print('Iterate on Zip obj try 2')  
for (let,num) in Z:
    print(let,num)
print('Iterate on enumerate obj try 1')    
for (i,let) in E:
    print(i,let)
print('Iterate on enumerate obj try 2')    
for (i,let) in E:
    print(i,let)
print('Iterate on range obj try 1')    
for i in R:
    print(i)
print('Iterate on range obj try 2')   
for i in R:
    print(i)

Iterate on Zip obj try 1
a 0
b 1
c 2
d 3
e 4
Iterate on Zip obj try 2
Iterate on enumerate obj try 1
0 a
1 b
2 c
3 d
4 e
Iterate on enumerate obj try 2
Iterate on range obj try 1
4
6
8
10
12
14
16
18
Iterate on range obj try 2
4
6
8
10
12
14
16
18


Consistent with allowing multiple iterations and being an indexable sequence, a range has no notion of state:

In [3]:
for i in R[:3]:
    print(i)

4
6
8


We have looped through part of the range.  If we start a second loop
on the same range, it starts from the beginning.

In [4]:
for i in R:
    print(i)

4
6
8
10
12
14
16
18


Hence `next` does not work:

In [12]:
next(R)

TypeError: 'range' object is not an iterator

This error message says it all. In contrast to `Z` and `E`, there is no `__next__` method for `R`. Hence it is not an iterator.

In [19]:
'__next__'  in dir(R)

False

In [18]:
'__next__'  in dir(Z)

True
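
Although a range is not an iterator, it is an iterable, so `iter` can produce iterators from it, just as with the string `ascii_lowercase` at the start of this section. A short sketch, with illustrative variable names:

```python
R = range(4, 20, 2)

it_a = iter(R)   # a fresh range iterator
it_b = iter(R)   # a second, independent one

print(next(it_a), next(it_a))   # 4 6
print(next(it_b))               # 4: it_b has its own state
print(hasattr(R, '__next__'), hasattr(it_a, '__next__'))  # False True
print(len(R))                   # 8: the range itself is unchanged
```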