# Data Science Toolbox - Part 2

## Iterators

Lists, strings, dictionaries, file connections, etc. are all iterables. That is, they are objects that can be passed to **iter()** to obtain an iterator, which is what lets us loop over them, for instance with a for loop.
We can create an iterator from an iterable with the **iter()** function, and then call the **next()** function to retrieve its values one at a time.

In [2]:
# Create a list of strings: flash
flash = ['jay garrick', 'barry allen', 'wally west', 'bart allen']

# Print each list item in flash using a for loop
for item in flash:
    print(item)

jay garrick
barry allen
wally west
bart allen


In [3]:
# Create an iterator for flash: superspeed
superspeed = iter(flash)

# Print each item from the iterator
print(next(superspeed))
print(next(superspeed))
print(next(superspeed))
print(next(superspeed))

jay garrick
barry allen
wally west
bart allen
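
The cells above stop exactly at the last item. As a small sketch of what happens one step further (using a shortened copy of the flash list, with variable names chosen only for illustration): calling **next()** on an exhausted iterator raises StopIteration, which is the signal a for loop uses internally to know when to stop. Passing a default value as the second argument to **next()** avoids the exception.

```python
# Shortened copy of the flash list, for illustration only
flash = ['jay garrick', 'barry allen']
superspeed = iter(flash)

first = next(superspeed)
second = next(superspeed)

# The iterator is now exhausted: a bare next() raises StopIteration
try:
    next(superspeed)
    exhausted = False
except StopIteration:
    exhausted = True

# A default passed as the second argument is returned instead of raising
fallback = next(superspeed, None)

print(first)
print(second)
print(exhausted)
print(fallback)
```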


### Playing with iterators

**Enumerate:** takes any iterable as an argument and returns a special enumerate object, an iterator that yields one (index, value) tuple per item.
**Zip:** receives an arbitrary number of iterables and creates an iterator of tuples, where the i-th tuple holds the i-th item of each iterable.


In [5]:
# Create a list of strings: mutants
mutants = ['charles xavier', 
            'bobby drake', 
            'kurt wagner', 
            'max eisenhardt', 
            'kitty pryde']

# Create a list of tuples: mutant_list
mutant_list = list(enumerate(mutants))

# Print the list of tuples
print(mutant_list)

[(0, 'charles xavier'), (1, 'bobby drake'), (2, 'kurt wagner'), (3, 'max eisenhardt'), (4, 'kitty pryde')]


In [6]:
# Unpack and print the tuple pairs
for index1, value1 in enumerate(mutants):
    print(index1, value1)

0 charles xavier
1 bobby drake
2 kurt wagner
3 max eisenhardt
4 kitty pryde


In [7]:
# Change the start index
for index2, value2 in enumerate(mutants, start = 1):
    print(index2, value2)

1 charles xavier
2 bobby drake
3 kurt wagner
4 max eisenhardt
5 kitty pryde
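
The cells above cover enumerate only, so here is a minimal sketch of zip for comparison. The aliases list is illustrative data that does not appear in the original notebook. Like enumerate, zip returns a lazy iterator, so it is wrapped in list() to inspect its contents; it stops as soon as the shortest input is exhausted.

```python
mutants = ['charles xavier', 'bobby drake', 'kurt wagner']
aliases = ['prof x', 'iceman', 'nightcrawler']  # illustrative second list

# zip pairs up the items position by position
pair_list = list(zip(mutants, aliases))
print(pair_list)

# Unpack the tuples directly in a for loop
for mutant, alias in zip(mutants, aliases):
    print(mutant, alias)

# zip stops at the shortest iterable: only two pairs are produced here
short_list = list(zip(mutants, ['a', 'b']))
print(short_list)

# The * operator "unzips" a list of tuples back into separate tuples
names, nicknames = zip(*pair_list)
print(names)
print(nicknames)
```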


### Loading data in chunks into memory

If a dataset is too large to fit into memory, we can use an iterator to load it in chunks. The read_csv function from pandas does this when given the chunksize parameter: instead of a single DataFrame, it returns an iterator that yields one DataFrame per chunk.

In [None]:
## Example won't work as the file does not exist

import pandas as pd
# Initialize a list to store the per-chunk sums
result = []

for chunk in pd.read_csv('data.csv', chunksize=1000):
    # With chunksize, read_csv returns an iterator; each chunk is a DataFrame of up to 1000 rows
    # As an example, we sum column x for each chunk and store the sums in a list
    # (silly example: we could just keep an int variable and use += sum(chunk['x']))
    result.append(sum(chunk['x']))
# Add up the sums of all chunks
total = sum(result)
total = sum(result)
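
Since the cell above cannot run without data.csv, the following is a rough, runnable sketch of the same chunked-iteration idea using only the standard library. It is not how pandas implements chunksize. The in-memory CSV text is a made-up stand-in for the file, and itertools.islice pulls a fixed number of rows at a time from the csv.DictReader iterator.

```python
import csv
import io
from itertools import islice

# Made-up in-memory CSV standing in for 'data.csv'
csv_text = "x,y\n1,a\n2,b\n3,c\n4,d\n5,e\n"

# DictReader is an iterator over rows; nothing is read until we ask for it
reader = csv.DictReader(io.StringIO(csv_text))

total = 0
chunk_sizes = []
while True:
    # Take at most 2 rows from the iterator; an empty list means it is exhausted
    chunk = list(islice(reader, 2))
    if not chunk:
        break
    chunk_sizes.append(len(chunk))
    # Only the current chunk is held in memory while we aggregate
    total += sum(int(row['x']) for row in chunk)

print(chunk_sizes)
print(total)
```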


### Generators

A generator expression looks just like a list comprehension, except that it does not build the whole list in memory; instead, it produces each value only when it is requested.
This helps when working with extremely large sequences. The only difference in syntax is the change from [] to () in the definition.
We can also create generator functions. They are defined with the **def** keyword, and the only difference from a regular function is that instead of using **return** to return a value, they use the keyword **yield**.


In [8]:
# Create generator object: result
result = (num for num in range(31))

# Print the first 5 values
print(next(result))
print(next(result))
print(next(result))
print(next(result))
print(next(result))

# Print the rest of the values
for value in result:
    print(value)


0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
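
To make the memory claim about generator expressions concrete, here is a small sketch comparing the two forms with sys.getsizeof. The size n and the variable names are arbitrary choices for illustration. Note that getsizeof reports only the size of the container object itself, not of the integers it refers to, so the real gap is even larger than the one measured here.

```python
import sys

n = 100_000  # arbitrary size chosen for illustration

squares_list = [num ** 2 for num in range(n)]  # builds every value up front
squares_gen = (num ** 2 for num in range(n))   # produces values on demand

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)

# The generator object stays small no matter how large n is
print(list_size > gen_size)

# Both forms produce the same values, so aggregate results match
list_total = sum(squares_list)
gen_total = sum(squares_gen)
print(list_total == gen_total)
```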


In [9]:
# Create a list of strings
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Define generator function get_lengths
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""

    # Yield the length of a string
    for person in input_list:
        yield len(person)

# Print the values generated by get_lengths()
for value in get_lengths(lannister):
    print(value)


6
5
5
6
7
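
One consequence of producing values on demand is that a generator can be consumed only once. The sketch below repeats the get_lengths function on a shortened list to show this; the variable names are illustrative. To iterate again, call the generator function again to get a fresh generator object.

```python
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""
    for person in input_list:
        yield len(person)

# Calling the function runs no code yet; it returns a generator object
lengths = get_lengths(['cersei', 'jaime', 'tywin'])

first = next(lengths)   # runs the body up to the first yield
rest = list(lengths)    # consumes the remaining values
again = list(lengths)   # the generator is exhausted, so nothing is left

print(first)
print(rest)
print(again)

# A new call creates a new, independent generator
fresh = list(get_lengths(['cersei', 'jaime', 'tywin']))
print(fresh)
```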
