### Credits:

<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

This notebook is created by Zhuo Chen based on the notebooks created by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org<br />

Reused and modified for internal use at Università Cattolica del Sacro Cuore di Milano, by Deborah Grbac, email deborah.grbac@unicatt.it and Valentina Schiariti, email valentina.schiariti-collaboratore@unicatt.it, released under CC BY License.

This repository is built on **Constellate notebooks**. The original Jupyter notebooks repository was designed by the educators at **ITHAKA's Constellate project**. The project was sunset on July 1, 2025. The current repository uses and reuses Constellate notebooks as Open Educational Resources (OER), free for re-use under a Creative Commons CC BY License.
___


# Python Intermediate 5

**Description:** This notebook describes:
* What a generator is
* How to write a generator comprehension
* The advantages of using a generator
 

This is part 5 of 5 in the series *Python Intermediate* that will prepare you to do text analysis using the Python programming language. 

**Note**: Running this notebook locally will give you full control to test, modify, and save your work. We strongly recommend downloading it before you begin.
___

# What is a generator?

## A quick review of iterables in Python
Any Python object that allows its members to be iterated over in a for-loop is an **iterable**. Strings, lists, sets and dictionaries are all iterables. 

In [2]:
# Use a for loop to iterate over a list
ls = [1, 2, 3]
for num in ls:
    print(num)

1
2
3


In [3]:
# Use a for loop to iterate over a string
s = 'abc'
for l in s:
    print(l)

a
b
c


## Iterator
An **iterator** is an object that can be created from an iterable that allows access to the elements of the iterable **one at a time**, remembering its position as you go.

Python has a built-in function `iter()` which takes an **iterable** and returns an **iterator**. The iterator can be used to iterate over the input iterable.

In [7]:
# Use the built-in iter function to create an iterator out of the list stored in ls
my_ls = iter(ls)
type(my_ls)

list_iterator

To access the values in the original list from this iterator, we need to use the `next()` function to get one value at a time.

In [8]:
# Use next() to get the first element from the list
next(my_ls)

1

In [9]:
# Use next() to get the second element from the list
next(my_ls)

2
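The phrase "remembering its position" can be made concrete with a small sketch. The names `numbers`, `first_iter` and `second_iter` are made up for this illustration. Each call to `iter()` returns a separate iterator with its own position, and the original list is left untouched.

```python
# Two iterators over the same list keep track of their positions separately
numbers = [10, 20, 30]

first_iter = iter(numbers)
second_iter = iter(numbers)

print(next(first_iter))   # 10
print(next(first_iter))   # 20

# second_iter has not been advanced yet, so it starts from the beginning
print(next(second_iter))  # 10

# The list itself is unchanged by iteration
print(numbers)            # [10, 20, 30]
```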

# Generator
We can create an iterator by defining a **generator** in the following way:

In [13]:
# Define a very simple generator
def simple_gen():
    yield 1
    yield 2
    yield 3

You can use a for loop to iterate through the items in an iterator created by a generator. In this sense, an iterator is also an iterable in Python. 

In [14]:
# Use a for loop to iterate through the items 
# and print them out
for i in simple_gen():
    print(i)

1
2
3
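One way to check that the object returned by a generator function is its own iterator is to pass it to `iter()` and compare the result with the original object. The sketch below repeats the `simple_gen` definition so that it runs on its own; `gen_a`, `gen_b` and `same_object` are illustrative names.

```python
def simple_gen():
    yield 1
    yield 2
    yield 3

gen_a = simple_gen()

# iter() on a generator returns the very same object
same_object = iter(gen_a) is gen_a
print(same_object)   # True

# Each call to the generator function creates a new, independent generator
gen_b = simple_gen()
print(list(gen_a))   # [1, 2, 3]
print(list(gen_b))   # [1, 2, 3]
```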


You can use the `next()` function to see that this simple generator actually yields one item at a time. 

In [15]:
# Assign the iterator to a variable
gen = simple_gen()

In [16]:
# yield the first item
next(gen)

1

In [17]:
# yield the second item
next(gen)

2

In [18]:
# yield the third item
next(gen)

3

## Lists and generators

On the surface, generators look like ordinary functions, but they are actually very different. Let's use a simple example to understand the difference. 

In [19]:
# Create a Python function which takes a list of numbers

def two_times(ls):
    """Take a list of numbers and return a new list in which
    each number is two times the corresponding number in the input list."""
    new_ls = []
    for n in ls:
        new_ls.append(2*n) # add two times the current number to the new list
    return new_ls 

two_times([1, 2, 3])

[2, 4, 6]

If we feed a list of numbers to this function, we get a new list back. Most importantly, the entire new list of numbers is stored in memory.


We can also create a Python generator to give us the same sequence of values. Note that a generator uses the `yield` statement. 

In [20]:
# Create a Python generator

def gen(ls): 
    """Take a list of numbers and yield, one at a time,
    two times each number in the input list."""
    for n in ls:
        yield 2*n 
        
my_gen = gen([1, 2, 3]) 

Since a generator creates an iterator, the values will be yielded one at a time. 

In [21]:
# Use next() to yield one element from the generator at a time
next(my_gen) 

2

In [22]:
# Use next() to yield one element from the generator at a time
next(my_gen)

4

In [23]:
# Use next() to yield one element from the generator at a time
next(my_gen)

6

The generator is exhausted when all the items have been used. If we call the `next()` function again, Python raises a `StopIteration` exception.

In [24]:
# Call next() once more, after the generator has been exhausted
next(my_gen)

StopIteration: 
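There are two common ways to deal with the `StopIteration` raised by an exhausted generator. The sketch below shows both. The generator function `doubled` is a hypothetical stand-in that mirrors the doubling generator above, and `leftover` and `exhausted` are illustrative names.

```python
def doubled(numbers):
    """Yield two times each number in the input, one at a time."""
    for n in numbers:
        yield 2 * n

my_gen = doubled([1, 2, 3])
values = [next(my_gen), next(my_gen), next(my_gen)]
print(values)      # [2, 4, 6]

# Option 1: give next() a default value to return instead of raising
leftover = next(my_gen, None)
print(leftover)    # None

# Option 2: catch the exception explicitly
try:
    next(my_gen)
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)   # True
```

A for loop handles `StopIteration` automatically, which is why looping over a generator ends cleanly without an error.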

### Takeaways
When we used a for loop to create a new list, the new list **stores all of its members**. We can access any of its members via indexing. 

A generator, however, does not store any items. What it stores are the instructions for how to generate each of its members as well as the iteration state (noticeable when we use the `next()` function). For example, if a generator has generated its first member, it knows that it should generate its second member the next time. 
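The practical consequences of this difference can be seen in a short sketch. The names `squares`, `squares_list` and `squares_gen` are made up for this illustration. A list supports indexing and can be traversed repeatedly, while a generator supports neither indexing nor a second pass.

```python
def squares(limit):
    """Yield the squares of 0, 1, ..., limit - 1 one at a time."""
    for n in range(limit):
        yield n * n

squares_list = list(squares(5))
squares_gen = squares(5)

# A list stores its members, so indexing and len() both work
print(squares_list[2])    # 4
print(len(squares_list))  # 5

# A generator stores no members, so indexing raises a TypeError
try:
    squares_gen[2]
    indexable = True
except TypeError:
    indexable = False
print(indexable)          # False

# A generator can be consumed only once
first_pass = list(squares_gen)
second_pass = list(squares_gen)
print(first_pass)         # [0, 1, 4, 9, 16]
print(second_pass)        # []
```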


## The built-in generators
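Python has some built-in functions that behave like generators. You may not be aware of it, but you have already used one in the Python Basics series: the `enumerate()` function. Strictly speaking, `enumerate()` returns an iterator object rather than a generator created with `yield`, but it produces its items one at a time in the same way. 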

In [26]:
# An example from Python basics 3 
# which uses enumerate()
staff = ['Tara Richards',
 'John Smith',
 'Justin Douglas',
 'Lauren Marquez',
 'John Smith']

# Use the enumerate function
for index, name in enumerate(staff): # enumerate() yields (index, name) tuples
    if name == 'John Smith':
        print(index)

1
4


In [27]:
# Confirm that enumerate() produces its items one at a time, like a generator
staff_gen = enumerate(staff)

In [28]:
# yield the first item
next(staff_gen) 

(0, 'Tara Richards')

In [29]:
# yield the second item
next(staff_gen) 

(1, 'John Smith')
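As a side note on terminology, the object returned by `enumerate()` is an iterator, but Python does not classify it as a generator object in the strict sense. The sketch below checks this with the standard library's `inspect.isgenerator()` and `collections.abc.Iterator`. The generator function `staff_pairs` is a hypothetical equivalent written with `yield` for comparison.

```python
import inspect
from collections.abc import Iterator

staff = ['Tara Richards', 'John Smith']
staff_enum = enumerate(staff)

def staff_pairs(names):
    """A hand-written generator that mimics enumerate()."""
    index = 0
    for name in names:
        yield index, name
        index += 1

# enumerate() gives an iterator, but not a generator object
print(isinstance(staff_enum, Iterator))         # True
print(inspect.isgenerator(staff_enum))          # False

# A function that uses yield gives a true generator object
print(inspect.isgenerator(staff_pairs(staff)))  # True

# Both produce the same pairs, one at a time
print(list(staff_enum) == list(staff_pairs(staff)))  # True
```

For everyday use the distinction rarely matters: both kinds of object work with `next()` and with for loops.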

# Generator comprehension

Python provides a shorter way to create a generator: the generator comprehension (called a generator expression in the official Python documentation).
Generator comprehensions have essentially the same syntax as list comprehensions, except that they use parentheses `()` instead of square brackets `[]`.

Let's first quickly review how to write a list comprehension.

In [30]:
# Create a list comprehension using square brackets []
numbers = [5,6,7,8,9]
new_list = [num for num in numbers if num > 5]
print(new_list)

[6, 7, 8, 9]


Then, let's create a generator which will generate the same sequence of values as the new list above, but only one at a time.  

In [31]:
# Create a generator using parentheses
new_gen = (num for num in numbers if num > 5)

In [32]:
# Yield the values one at a time
next(new_gen)

6

In [33]:
next(new_gen)

7

In [34]:
next(new_gen)

8

In [35]:
next(new_gen)

9
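Generator comprehensions are often passed straight to a function that consumes an iterable, such as `sum()`, `max()` or `any()`. When the comprehension is the only argument, the extra pair of parentheses can be omitted. The following is a small sketch using the same `numbers` list; `total`, `largest` and `has_even` are illustrative names.

```python
numbers = [5, 6, 7, 8, 9]

# Add up the numbers greater than 5 without building an intermediate list
total = sum(num for num in numbers if num > 5)
print(total)      # 30

# Find the largest doubled value
largest = max(2 * num for num in numbers)
print(largest)    # 18

# any() stops pulling items as soon as it finds a match
has_even = any(num % 2 == 0 for num in numbers)
print(has_even)   # True
```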

Recall that a list comprehension can create a list based on any kind of iterable in Python. This is true for generator comprehensions as well. In the previous example, we created a generator based on a list. In the next code cell, let's create a generator based on a dictionary using a generator comprehension.

In [37]:
# Create a generator based on a dictionary using 
# generator comprehension
contacts ={
 'Amanda Bennett': 'Engineer, electrical',
 'Bryan Miller': 'Radiation protection practitioner',
 'Christopher Garrison': 'Planning and development surveyor',
 'Debra Allen': 'Intelligence analyst',
 'Donna Decker': 'Architect',
 'Heather Bullock': 'Media planner',
 'Jason Brown': 'Energy manager',
 'Jason Soto': 'Lighting technician, broadcasting/film/video',
 'Marissa Munoz': 'Further education lecturer',
 'Matthew Mccall': 'Chief Technology Officer',
 'Michael Norman': 'Translator',
 'Nicole Leblanc': 'Financial controller',
 'Noah Delgado': 'Engineer, land',
 'Rachel Charles': 'Physicist, medical',
 'Stephanie Petty': 'Architect'}

contact_gen = (name for name, occupation in contacts.items() if 'Engineer' in occupation)

In [38]:
# Yield the first item
next(contact_gen)

'Amanda Bennett'

In [39]:
# Yield the second item
next(contact_gen)

'Noah Delgado'

# The advantages of generators

A generator does not hold the entire result in memory; it yields one item at a time. Because only one item has to exist at any moment, a generator can lead to significant savings in memory usage. 

In [40]:
# Demonstrate the memory size difference of 
# a list comprehension vs generator comprehension

# Import getsizeof which measures memory usage in bytes
from sys import getsizeof
  
list_comprehension = [i for i in range(10000)]
generator_comprehension = (i for i in range(10000))
  
# Print the size of the list comprehension
print('List comprehension memory usage: ', getsizeof(list_comprehension))

# Print the size of the generator comprehension
print('Generator comprehension memory usage: ', getsizeof(generator_comprehension))

List comprehension memory usage:  85176
Generator comprehension memory usage:  192
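To see how the two sizes scale, the comparison can be repeated for several lengths. This is a rough sketch: the helper `make_gen` and the chosen lengths are assumptions made for this illustration, and the exact byte counts depend on the Python version and platform. Note also that `getsizeof` measures only the list object itself, not the items it refers to.

```python
from sys import getsizeof

def make_gen(n):
    """Return a generator comprehension over range(n)."""
    return (i for i in range(n))

lengths = [10, 1_000, 100_000]
list_sizes = [getsizeof([i for i in range(n)]) for n in lengths]
gen_sizes = [getsizeof(make_gen(n)) for n in lengths]

# The list grows with the number of items; the generator should not
for n, list_size, gen_size in zip(lengths, list_sizes, gen_sizes):
    print(f'{n:>7} items: list = {list_size} bytes, generator = {gen_size} bytes')
```

On CPython, the generator size is expected to stay the same for every length, because the generator stores only its code and current state.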


Because a generator occupies less memory and produces items only when they are requested, using a generator instead of a fully built iterable such as a list can improve performance. This advantage is especially helpful when you have a very large dataset, with hundreds of thousands or even millions of items to loop through. 

In [41]:
# import the time module to calculate the processing time
import time

In [42]:
# Calculate the processing time in milliseconds when we create a list with 1m items
def ml(n):
    ls = []
    for i in range(n):
        ls.append(i)
    return ls

start = time.process_time()*1000
ml(1000000)
end = time.process_time()*1000
print(end - start)

62.5


In [43]:
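# Calculate the processing time in milliseconds when we create a generator with 1m items
# Note: calling ml_gen() only creates the generator object; no items are produced until it is iterated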
def ml_gen(n):
    for i in range(n):
        yield i
        
start = time.process_time()*1000
ml_gen(1000000)
end = time.process_time()*1000
print(end - start)

0.0
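The `0.0` reading reflects the fact that calling a generator function does not run its body. The work is postponed until items are requested, so the cost is spread over the iteration rather than paid up front. The sketch below makes this visible by recording each step in a list. The names `log` and `noisy_gen` are hypothetical and exist only for this demonstration.

```python
log = []

def noisy_gen(n):
    """Yield 0, 1, ..., n - 1 and record every step in log."""
    log.append('started')
    for i in range(n):
        log.append(f'producing {i}')
        yield i

numbers = noisy_gen(3)
log_after_creation = list(log)
print(log_after_creation)   # [] because nothing has run yet

first = next(numbers)
log_after_first = list(log)
print(log_after_first)      # ['started', 'producing 0']

# Consuming the rest of the generator runs the remaining steps
total = first + sum(numbers)
print(total)                # 3
print(log)                  # ['started', 'producing 0', 'producing 1', 'producing 2']
```

For a fair timing comparison with the list version, the generator would have to be consumed inside the timed region, for example with a for loop.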


Using a generator makes sense in scenarios where loading an entire list, dictionary, or set could fill all available memory. This could be because each item is large, the list is large, or both. 

If you want to take one item at a time, do a lot of calculations based on that item, and then move on to the next item, then use a generator. 
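The pattern described above can be sketched as follows. The data source `readings` is a hypothetical stand-in for a large dataset, and `running_stats` is an illustrative helper. Only one value and a few running totals exist in memory at any moment.

```python
def readings(count):
    """Hypothetical data source: yield one made-up reading at a time."""
    for i in range(count):
        yield (i % 10) * 1.5

def running_stats(values):
    """Compute the mean and maximum of an iterable with at least one item."""
    total = 0.0
    largest = float('-inf')
    n = 0
    for value in values:   # take one item, update the totals, move on
        total += value
        largest = max(largest, value)
        n += 1
    return total / n, largest

mean, largest = running_stats(readings(100_000))
print(mean)      # 6.75
print(largest)   # 13.5
```

Because `running_stats` accepts any iterable, the same function would also work with a list. The generator only changes how much data is held in memory while it runs.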

## Using a generator on files
The following variations of code create generators to read through a file named sample.txt, which we have created in a subfolder called data. The file contains the following poem: 

The sun dips low, the sky turns gold, <br />
Whispers of evening softly unfold. <br />
Shadows stretch, the world grows still, <br />
Night descends over the quiet hill. <br />

In [65]:
from pathlib import Path # import the Path class
def my_file_gen(path): # define a generator function
    file_path = Path(path)
    with file_path.open() as f:
        for line in f:
            yield line
sample_file_path = Path("./data/sample.txt") # create a Path object
sample_file_gen = my_file_gen(sample_file_path)
next(sample_file_gen) # returns the first line

'The sun dips low, the sky turns gold,\n'

In [66]:
next(sample_file_gen) # returns the second line

'Whispers of evening softly unfold.\n'

In [67]:
sample_file_path = Path("./data/sample.txt") # create a Path object
gen_poem = (line for line in sample_file_path.open())

while True: 
    try:
        print(next(gen_poem))
    except StopIteration:
        print("the end")
        break

The sun dips low, the sky turns gold,

Whispers of evening softly unfold.

Shadows stretch, the world grows still

Night descends over the quiet hill.
the end
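The blank lines in the output above appear because each line read from the file still ends with `\n`, and `print()` adds a newline of its own. The sketch below is one possible variation that strips the trailing newline and processes the file with a for loop. So that it runs on its own, it writes the poem to a temporary file instead of assuming that `./data/sample.txt` exists. The names `read_lines`, `demo_path` and `word_counts` are made up for this illustration.

```python
import tempfile
from pathlib import Path

def read_lines(path):
    """Yield each line of a text file without its trailing newline."""
    with Path(path).open(encoding='utf-8') as f:   # the file is closed when the loop ends
        for line in f:
            yield line.rstrip('\n')

poem = (
    "The sun dips low, the sky turns gold,\n"
    "Whispers of evening softly unfold.\n"
    "Shadows stretch, the world grows still,\n"
    "Night descends over the quiet hill.\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'sample.txt'
    demo_path.write_text(poem, encoding='utf-8')

    # Count the words in each line, reading one line at a time
    word_counts = [len(line.split()) for line in read_lines(demo_path)]

print(word_counts)   # [8, 5, 6, 6]
```

Opening the file inside a `with` block in the generator function means the file is closed once the generator has been fully consumed, which the comprehension-based version above does not guarantee.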


___
## Lesson Complete