### The zip() Function

Let's see some simple examples first.

In [1]:
l = [1, 2, 3, 4, 5]
t = (10, 20, 30)

In [2]:
result = zip(l, t)

In [3]:
result

<zip at 0x7fbf606c41c0>

As you can see, what gets returned is a `zip` object - which is an **iterator**.

This means we can call `next()` on it:

In [4]:
next(result)

(1, 10)

In [5]:
next(result)

(2, 20)

In [6]:
next(result)

(3, 30)

And now of course, there are no more elements in that iterator (the shortest sequence `t` had 3 elements), so if we call `next()` again we should get a `StopIteration` exception:

In [7]:
try:
    next(result)
except StopIteration:
    print('StopIteration')

StopIteration
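
Because `zip()` stops at the shortest input, the last two elements of `l` were never produced. As a small sketch of the alternative (the variable names below are just for illustration), the standard library's `itertools.zip_longest` keeps going until the longest input is exhausted and pads the gaps with a fill value:

```python
from itertools import zip_longest

numbers = [1, 2, 3, 4, 5]
tens = (10, 20, 30)

# zip() stops as soon as the shortest iterable is exhausted
shortest = list(zip(numbers, tens))

# zip_longest() continues to the longest iterable and pads with fillvalue
longest = list(zip_longest(numbers, tens, fillvalue=None))

print(shortest)
print(longest)
```

Which behavior is appropriate depends on whether unmatched elements are meaningful in your data.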


So, just like any iterator, if we want to re-iterate over that result, we have to **re-create** the iterator:

In [8]:
for item in zip(l, t):
    print(item)

(1, 10)
(2, 20)
(3, 30)


In [9]:
combo = list(zip(l, t))

In [10]:
combo

[(1, 10), (2, 20), (3, 30)]

Now `combo` is a list, so an iterable, and we can iterate over that multiple times.
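
To make the difference concrete, here is a minimal sketch (the names `zipped`, `first_pass` and so on are only for illustration) showing that a `zip` object can be consumed just once, while a list built from it can be traversed repeatedly:

```python
l = [1, 2, 3, 4, 5]
t = (10, 20, 30)

zipped = zip(l, t)          # an iterator: can be consumed only once
first_pass = list(zipped)   # consumes every tuple
second_pass = list(zipped)  # nothing is left, so this is empty

combo = list(zip(l, t))     # a list: can be iterated as often as we like
pass_1 = [pair for pair in combo]
pass_2 = [pair for pair in combo]

print(first_pass, second_pass)
print(pass_1 == pass_2)
```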

But creating a `zip` object has almost zero cost associated with it since the sequence of tuples is not actually created - they are produced one at a time when we iterate through the zip object.

In [11]:
from time import perf_counter

In [12]:
start = perf_counter()
l1 = range(100_000_000_000)
l2 = range(100_000_000_000)
combo = zip(l1, l2)
end = perf_counter()
print(f'elapsed: {end - start}')

elapsed: 0.0001346449999999333


As you can see, even though our range objects are huge, they do not create the values ahead of time (they yield them one by one), and zip does the same - so the creation time for all three was extremely fast.
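
One way to convince yourself that the tuples really are produced on demand is to pull only a handful of them. The sketch below assumes we only care about the first five pairs and uses `itertools.islice` to take them from the huge zip:

```python
from itertools import islice

l1 = range(100_000_000_000)
l2 = range(100_000_000_000)
combo = zip(l1, l2)

# islice() pulls only the first five tuples from the zip iterator;
# the remaining pairs are never produced
first_five = list(islice(combo, 5))
print(first_five)
```

This returns immediately, since only five tuples are ever created.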

If we were to convert this zip to a list on the other hand, things would be very different. First we would have to iterate over the entire sequence of tuples, and then create a list out of that.

I don't want to be sitting here the entire day, so I'm going to cut back on those numbers a bit:

In [13]:
start = perf_counter()
l1 = range(10_000_000)
l2 = range(10_000_000)
combo = list(zip(l1, l2))
end = perf_counter()
print(f'elapsed: {end - start}')

elapsed: 1.098045639


As you can see, that took about one second - and it will get worse as those numbers increase.

We'll run across `zip()` frequently in this course, but let's see at least one practical example right now.

`zip()` basically provides us an easy mechanism for iterating through two or more iterables in parallel.
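
As a quick illustration of the "two or more" part, here is a sketch that walks three lists in parallel. The lists and their values are made up for this example:

```python
# hypothetical sample values, one entry per widget in each list
names = ['item1', 'item2', 'item3']
quantities = [10, 5, 100]
prices = [100.0, 25.0, 0.25]

totals = {}
# each iteration yields one element from every iterable, as a 3-tuple
for name, qty, price in zip(names, quantities, prices):
    totals[name] = qty * price

print(totals)
```

Unpacking the tuple directly in the `for` statement avoids indexing into three lists with a shared counter.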

Remember how we could create dictionaries using the `dict()` function and passing an iterable containing tuples of `(key, value)` to it?

In [14]:
d = dict([('a', 1), ('b', 2), ('c', 3)])

In [15]:
d

{'a': 1, 'b': 2, 'c': 3}

Suppose we have some service somewhere that provides us data in tuple format:

In [16]:
data = [
    ('item1', 10, 100.0),
    ('item2', 5, 25.0),
    ('item3', 100, 0.25)
]

And suppose that the **schema** of this data is:

In [17]:
schema = ('widget', 'num_sold', 'unit_price')

Now our goal is to turn this `data` and `schema` into a dictionary whose keys are the `widget` names, and corresponding values are themselves dictionaries containing keys for `num_sold` and `unit_price`, so something like this:

In [18]:
d = {
    'item1': {'num_sold': 10, 'unit_price': 100.0},
    'item2': {'num_sold': 5, 'unit_price': 25.0},
    'item3': {'num_sold': 100, 'unit_price': 0.25}
}

Suppose furthermore that over time this schema may change, the only constant we have is that the first item in the tuple is guaranteed to be the widget's name.

So, 3 months from now, we may be getting this schema instead:

In [19]:
('widget', 'manufacturer', 'num_sold', 'unit_price', 'discount')

('widget', 'manufacturer', 'num_sold', 'unit_price', 'discount')

So we can't "hardcode" our schema and do it this way:

In [20]:
d = {}
for item in data:
    d[item[0]] = {'num_sold': item[1], 'unit_price': item[2]}
print(d)

{'item1': {'num_sold': 10, 'unit_price': 100.0}, 'item2': {'num_sold': 5, 'unit_price': 25.0}, 'item3': {'num_sold': 100, 'unit_price': 0.25}}


Although this works, if the schema changes we will have to change our code - not a good design.

Instead we are going to leverage the `schema`: if the data layout ever changes, we only update the schema, and our code handles the rest automatically, without changes.

To do this, we are going to `zip` each row with the schema:

In [21]:
for row in data:
    print(list(zip(schema, row)))

[('widget', 'item1'), ('num_sold', 10), ('unit_price', 100.0)]
[('widget', 'item2'), ('num_sold', 5), ('unit_price', 25.0)]
[('widget', 'item3'), ('num_sold', 100), ('unit_price', 0.25)]


As you can see we now know what each value in the data represents, based on the schema.

Since we are guaranteed that the first element of each row is the item name, we're not really interested in zipping that one up:

In [22]:
for row in data:
    widget_name = row[0]
    remaining = zip(schema[1:], row[1:])
    print(widget_name, list(remaining))

item1 [('num_sold', 10), ('unit_price', 100.0)]
item2 [('num_sold', 5), ('unit_price', 25.0)]
item3 [('num_sold', 100), ('unit_price', 0.25)]


Let's make the items in that zip into a dictionary:

In [23]:
for row in data:
    widget_name = row[0]
    sub_dict = dict(zip(schema[1:], row[1:]))
    print(widget_name, sub_dict)

item1 {'num_sold': 10, 'unit_price': 100.0}
item2 {'num_sold': 5, 'unit_price': 25.0}
item3 {'num_sold': 100, 'unit_price': 0.25}


So now we are ready to actually populate a dictionary, starting with an empty one, and adding each `widget_name` as a key, with the remaining values transformed into a dictionary (again all based on the `schema` and `data` which may change over time):

In [24]:
data_dict = {}
for row in data:
    widget_name = row[0]
    sub_dict = dict(zip(schema[1:], row[1:]))
    data_dict[widget_name] = sub_dict
print(data_dict)

{'item1': {'num_sold': 10, 'unit_price': 100.0}, 'item2': {'num_sold': 5, 'unit_price': 25.0}, 'item3': {'num_sold': 100, 'unit_price': 0.25}}


Of course, we can simplify this by not using temporary variables for `widget_name` and `sub_dict`:

In [25]:
data_dict = {}
for row in data:
    data_dict[row[0]] = dict(zip(schema[1:], row[1:]))
print(data_dict)

{'item1': {'num_sold': 10, 'unit_price': 100.0}, 'item2': {'num_sold': 5, 'unit_price': 25.0}, 'item3': {'num_sold': 100, 'unit_price': 0.25}}


And you should realize that we can actually use a dictionary comprehension for this!

In [26]:
data_dict = {row[0]: dict(zip(schema[1:], row[1:])) for row in data}
print(data_dict)

{'item1': {'num_sold': 10, 'unit_price': 100.0}, 'item2': {'num_sold': 5, 'unit_price': 25.0}, 'item3': {'num_sold': 100, 'unit_price': 0.25}}


There is a "pretty-printing" function available in Python that can print this dictionary in a more human-readable format:

In [27]:
from pprint import pprint

In [28]:
pprint(data_dict)

{'item1': {'num_sold': 10, 'unit_price': 100.0},
 'item2': {'num_sold': 5, 'unit_price': 25.0},
 'item3': {'num_sold': 100, 'unit_price': 0.25}}


So the nice thing about our solution is that it is extensible in that if the data (and the corresponding schema) changes, we can still handle it with no code changes except for updating the schema:

In [29]:
data = [
    ('item1', 'manuf-1', 10, 100.0, 0.2),
    ('item2', 'manuf-2', 5, 25.0, 0),
    ('item3', 'manuf-3', 100, 0.25, 0.025)
]

In [30]:
schema = ('widget', 'manufacturer', 'num_sold', 'unit_price', 'discount')

In [31]:
data_dict = {row[0]: dict(zip(schema[1:], row[1:])) for row in data}

In [32]:
pprint(data_dict)

{'item1': {'discount': 0.2,
           'manufacturer': 'manuf-1',
           'num_sold': 10,
           'unit_price': 100.0},
 'item2': {'discount': 0,
           'manufacturer': 'manuf-2',
           'num_sold': 5,
           'unit_price': 25.0},
 'item3': {'discount': 0.025,
           'manufacturer': 'manuf-3',
           'num_sold': 100,
           'unit_price': 0.25}}


As you can see our code was able to handle the new schema with no code changes.
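
One thing to keep in mind is that `zip()` stops at the shortest input, so a row with a missing or extra field would be truncated silently. As a possible hardening, the sketch below wraps the comprehension logic in a helper that checks the row length first. The function name `rows_to_dict` and the error message are assumptions made for this example, not part of the code above:

```python
def rows_to_dict(schema, rows):
    """Build {first_field_value: {field: value, ...}} from rows of tuples.

    Hypothetical helper: assumes the first schema field identifies the row.
    """
    result = {}
    for row in rows:
        # zip() would silently drop unmatched fields, so compare lengths first
        if len(row) != len(schema):
            raise ValueError(
                f'expected {len(schema)} fields, got {len(row)}: {row!r}'
            )
        result[row[0]] = dict(zip(schema[1:], row[1:]))
    return result


schema = ('widget', 'manufacturer', 'num_sold', 'unit_price', 'discount')
data = [
    ('item1', 'manuf-1', 10, 100.0, 0.2),
    ('item2', 'manuf-2', 5, 25.0, 0),
    ('item3', 'manuf-3', 100, 0.25, 0.025)
]

data_dict = rows_to_dict(schema, data)
print(data_dict['item2'])
```

On Python 3.10 and later, passing `strict=True` to `zip()` is another way to get an error when the inputs have different lengths.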