# Python Data Science Toolbox (Part 2)

These are my notes for DataCamp's course [_Python Data Science Toolbox (Part 2)_](https://www.datacamp.com/courses/python-data-science-toolbox-part-2).

This course is presented by Hugo Bowne-Anderson (former data scientist at DataCamp). Collaborators are Yashas Roy and Francisco Castro.

Prerequisite:

- [_Python Data Science Toolbox (Part 1)_](../Python%20Data%20Science%20Toolbox%20Part%201/Python%20Data%20Science%20Toolbox%20Part%201.ipynb)

This course is part of these tracks:

- Data Scientist with Python
- Data Scientist Professional with Python
- Python Fundamentals
- Python Programmer

## Data Sets

| Name | File | Notes |
|:---|:---|:---|
| Tweets | tweets.csv | Scurrilous tweet data |
| World Bank World Development Indicators | world_ind_pop_data.csv | Data from the World Bank |

## Imports

All imports are located here for convenience and clarity.

In [None]:
import csv
import os

import matplotlib.pyplot as plt
import pandas as pd

## Using Iterators in PythonLand

### Introduction to Iterators

An iterable is an object that can return its members one at a time. See https://docs.python.org/3.10/glossary.html#term-iterable.

An iterator is an object representing a stream of data. See https://docs.python.org/3.10/glossary.html#term-iterator. Many objects are already iterators. Create an iterator using Python's iter() function. See https://docs.python.org/3.10/library/functions.html#iter.

For iterator types, see https://docs.python.org/3.10/library/stdtypes.html#typeiter.

Call next() on the iterator to get the next member. If calling next() cannot return another value, the object raises a StopIteration exception.
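To make the protocol concrete, here is a minimal sketch of a hand-written iterator. The class name `Countdown` is hypothetical and exists only for illustration; it implements `__iter__()` and `__next__()`, the two methods the iterator protocol requires.

```python
class Countdown:
    """Hypothetical iterator that counts down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__().
        return self

    def __next__(self):
        # Signal exhaustion by raising StopIteration.
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


countdown = Countdown(3)
first = next(countdown)
rest = list(countdown)
# next() accepts a default that is returned instead of raising StopIteration.
after_end = next(countdown, None)
print(first, rest, after_end)
```

Passing a default to `next()` is a convenient way to avoid a try/except block when running out of values is an expected outcome.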

In [None]:
# A for loop allows iteration of an iterable.
word1 = "Data"
for letter in word1:
    print(letter)

In [None]:
# An iterator allows obtaining the next member using Python's next() builtin.
# However, a str object is an iterable but not an iterator.
word2 = "Da"
try:
    print(next(word2))
except Exception as ex:
    print("Exception:", ex)

In [None]:
# Create an iterator for word2.
it = iter(word2) # class str_iterator
print(type(it))
try:
    print(next(it))
    print(next(it))
    print(next(it)) # Raise exception
except StopIteration as ex:
    # StopIteration has a single attribute, value, which defaults to None.
    # str(ex) is '' because the exception was raised without a message.
    print("A StopIteration exception occurred: " + str(ex) + ".")

In [None]:
# Obtain all values from the iterator with * ("star" or "splat").
it = iter(word1)
print(*it)
# All values have been extracted from the iterator at this point.
# Create a new iterator if you want to iterate again.

In [None]:
# Extract all items into a list.
it2 = iter(word1)
letters = [*it2] # or letters = list(it2)
print(type(letters))
print(len(letters))
print(letters)

In [None]:
it3 = iter(word1)
letters = list(it3) # Not list(*it3)
print(type(letters))
print(len(letters))
print(letters)

In [None]:
# Iterate over a dictionary using its methods.
pythonistas = {"hugo": "bowne-anderson", "francis": "castro"}
for key, value in pythonistas.items():
    print(key, value)
# An alternative way to iterate over a dict.
for key in pythonistas.keys():
    print(key, pythonistas[key])

In [None]:
# Read data from a file connection using its methods.
with open("baseball.csv") as file:
    print(type(file)) # class _io.TextIOWrapper
    # Read individual lines using either of these approaches.
    print(file.readline())
    print(next(file))
    print(next(file))
    # Read all remaining items to a list.
    lines = file.readlines() # or: lines = list(file) or: lines = [*file]
    print(type(lines))
    print(len(lines))

In [None]:
# Use a file connection as an iterator.
with open("baseball.csv") as file2:
    data_lines2 = [*file2] # or data_lines2 = list(file2)
    print(len(data_lines2))

In [None]:
# It is not necessary to call iter on a file connection.
with open("baseball.csv") as file3:
    it3 = iter(file3)
    print(type(it3)) # _io.TextIOWrapper (same as before)
    # These don't work:
    # data_lines = list(*it3) # list expects at most one argument; use list(it3)
    # data_lines = tuple(*it3) # tuple expects at most one argument; use tuple(it3)
    # data_lines = (*it3) # cannot use starred expression here
    data_lines3 = list(it3) # or data_lines3 = [*it3]
    print(len(data_lines3))

In [None]:
# Copy the data into a tuple.
with open("baseball.csv") as file4:
    data_lines4 = tuple(file4) # not (file4), which is just the file object
    print(type(data_lines4))
    print(len(data_lines4))

#### Exercises

In [None]:
# Iterating over iterables.
flash = ['jay garrick', 'barry allen', 'wally west', 'bart allen']
# Print each list item in flash using a for loop
for person in flash:
    print(person)

# Create an iterator for flash: superhero
superhero = iter(flash)
# Print each item from the iterator
print(next(superhero))
print(next(superhero))
print(next(superhero))
print(next(superhero))

In [None]:
# Iterating over iterables (2).
# A range is an iterable but not an iterator. It is a lazy sequence
# that computes its values as needed rather than storing them.
# To iterate over a range when not in a for loop, call iter(range(...)).
# Create an iterator for range(3): small_value
small_value = iter(range(3))

# Print the values in small_value
print(next(small_value))
print(next(small_value))
print(next(small_value))

# Loop over range(3) and print the values
for num in range(3):
    print(num)

In [None]:
# Create an iterator for range(10 ** 100): googol
# This shows that an enormous range can be created without needing to
# allocate massive amounts of memory.
googol = iter(range(10 ** 100))

# Print the first 5 values from googol
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))

In [None]:
# Iterators as function arguments.
# Functions such as list and sum take iterables (including iterators) as arguments.
values = range(10, 21)
values_list = list(values)
print(values_list)
values_sum = sum(values)
print(values_sum)
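One practical difference is worth a quick sketch: a range can be passed to several functions in turn, while an iterator is used up by the first function that consumes it. The variable names below are chosen only for illustration.

```python
values = range(10, 21)
# A range is a reusable iterable: both calls see all eleven values.
range_list = list(values)
range_sum = sum(values)

values_it = iter(values)
# An iterator is consumed by the first call...
iter_sum = sum(values_it)
# ...so a second pass finds nothing left.
leftover = list(values_it)

print(range_sum, iter_sum, leftover)
```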

### Playing with Iterators

#### enumerate()


In [None]:
# enumerate() returns an enumerate object, an iterator that yields
# tuples with an index and a value.
avengers = ["hawkeye", "iron man", "thor", "quicksilver"]
e = enumerate(avengers)
print(type(e))
print(e)
print(*e)

In [None]:
# Get the enumerated items in a list.
e = enumerate(avengers)
e_list = list(e)
print(e_list)

In [None]:
# The enumerate object is also an iterable.
# Use enumerate in a for loop:
for index, item in enumerate(avengers):
    print(str(index) + ": " + str(item))

In [None]:
# Change the starting index of an enumerate object.
for index, value in enumerate(avengers, start=10):
    print(str(index) + ": " + str(value))

#### zip()

In [None]:
# Zip accepts an arbitrary number of iterables and joins them piecewise
# to create an iterable of tuples.
names = ["barton", "stark", "odinson", "maximoff"]
z = zip(avengers, names)
print(type(z))
print(z)
print(list(z))

In [None]:
# Use a for loop to iterate over a zip iterable.
for z1, z2 in zip(avengers, names):
    print(str(z1) + ": " + str(z2))

In [None]:
# Use the splat operator to print the elements.
z = zip(avengers, names)
print(*z)

My experimentation here: zip excludes extra elements if the input iterables are not the same length.

In [None]:
avengers = ["hawkeye", "iron man", "thor", "quicksilver"]
names = ["barton", "stark", "odinson"]
z = zip(avengers, names)
print(list(z))

My experimentation here: Emulate enumerate using range and zip.

In [None]:
indexes = range(0, len(avengers))
z = zip(indexes, avengers)
print(list(z))
print(list(enumerate(avengers)))

#### Exercises

In [None]:
# Play with enumerate().
mutants = [
    "charles xavier",
    "bobby drake",
    "kurt wagner",
    "max eisenhardt",
    "kitty pride",
]
mutant_list = list(enumerate(mutants))
print(mutant_list)
for index1, value1 in enumerate(mutants):
    print(index1, value1)
for index2, value2 in enumerate(mutants, start=1):
    print(index2, value2)

In [None]:
# Zip three iterables together.
mutants = [
    "charles xavier",
    "bobby drake",
    "kurt wagner",
    "max eisenhardt",
    "kitty pride",
]
aliases = [
    "prof x",
    "iceman",
    "nightcrawler",
    "magneto",
    "shadowcat",
]
powers = [
    "telepathy",
    "thermokinesis",
    "teleportation",
    "magnetokinesis",
    "intangibility",
]
mutant_data = list(zip(mutants, aliases, powers))
print(mutant_data)

In [None]:
# zip the lists and iterate through the tuples in a for loop.
mutant_zip = zip(mutants, aliases, powers)
print(mutant_zip)
for value1, value2, value3 in mutant_zip:
    print(value1, value2, value3)

In [None]:
# "unzip" equivalent.
print()
print("unzip equivalent:")
mutants = (
    "charles xavier",
    "bobby drake",
    "kurt wagner",
    "max eisenhardt",
    "kitty pride",
)
powers = (
    "telepathy",
    "thermokinesis",
    "teleportation",
    "magnetokinesis",
    "intangibility",
)
# Create an iterable of tuples (2-tuples) from the two input tuples.
z1 = zip(mutants, powers)
print(z1)
print(*z1)

# Recreate the zip object because the old one is exhausted.
z1 = zip(mutants, powers)

# 'Unzip' the tuples in z1 by unpacking with *; this produces 5 2-tuples.
# "zip returns a zip object whose .__next__() method returns a tuple where
# the i-th element comes from the i-th iterable argument."
# Calling zip on the 5 2-tuples, each of which is an iterable, used as
# function arguments, returns 2 tuples of length 5,
# where result1 is a tuple equivalent to mutants and result2 is a tuple
# equivalent to powers.
# This is analogous to transposing a 2D array.
z2 = zip(*z1)
print(z2)  # a zip object, which is an iterable.
result1, result2 = z2
print(result1)
print(result2)

# Check if unpacked tuples are equivalent to original tuples
print(result1 == mutants)
print(result2 == powers)

In [None]:
# Another example:
z3 = zip((1, 2), (3, 4), (5, 6), (7, 8), (9, 10))
# Calling next() on z3 returns an iterable where the i-th element of the
# iterable comes from the i-th iterable argument. So next() first returns
# the iterable (1, 3, 5, 7, 9). Calling next again returns (2, 4, 6, 8, 10).
for result in z3:
    print(result)

# Create a list of tuples.
z3 = zip((1, 2), (3, 4), (5, 6), (7, 8), (9, 10))
z3list = list(z3)
print(z3list)

# Create the same list of tuples.
z3 = zip((1, 2), (3, 4), (5, 6), (7, 8), (9, 10))
z3list = [x for x in z3]
print(z3list)

# Create a list of lists.
z3 = zip((1, 2), (3, 4), (5, 6), (7, 8), (9, 10))
z3list2 = [list(x) for x in z3]
print(z3list2)

### Loading Large Files into Memory

When a file is too large to hold in memory, the solution is to process the data in chunks using `pd.read_csv(file, chunksize=1000)`.

An alternative is to use the csv module, open the file, and read it line by line to obtain the data and generate the sum.

In [None]:
# Use the pandas read_csv function, which can do this by specifying
# the chunksize parameter.
# Here the example is to calculate the sum of column 'Weight' from the file.
try:
    result = []
    # Each chunk is a DataFrame object with 1000 rows.
    for chunk in pd.read_csv("baseball.csv", chunksize=1000):
        result.append(sum(chunk["Weight"]))
    total = sum(result)
    print(total)
except Exception as ex:
    print(ex)

In [None]:
# This is a slight modification that creates a running sum.
try:
    result = 0
    for chunk in pd.read_csv("baseball.csv", chunksize=1000):
        result += sum(chunk["Weight"])
    print(result)
except Exception as ex:
    print(ex)
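The csv-module alternative mentioned above can be sketched as follows. To keep the example self-contained, it reads from an in-memory `io.StringIO` holding made-up rows rather than from `baseball.csv`; the `Weight` column name matches the pandas example, but the data and the `Name` column are assumptions.

```python
import csv
import io

# Made-up sample standing in for a large CSV file on disk.
sample = io.StringIO(
    "Name,Weight\n"
    "Player A,180\n"
    "Player B,215\n"
    "Player C,210\n"
)

total_weight = 0
# DictReader yields one row at a time, so only one row is held in memory.
for row in csv.DictReader(sample):
    total_weight += int(row["Weight"])
print(total_weight)
```

With a real file, the `io.StringIO` object would be replaced by `open("baseball.csv", newline="")` inside a `with` block.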

In [None]:
# My work.
# In the DataCamp shell, this is how to list the files in the current working
# directory and get other file information:
files = [f for f in os.listdir(".") if os.path.isfile(f)]
for file in files:
    print(str(file) + ": " + str(os.stat(file).st_size))

# Use shell commands to print file information.
!ls -l tweets.csv
# -rw-r--r-- 1 repl repl 498214 Feb 17 14:51 tweets.csv
!wc -l tweets.csv
# 131 tweets.csv

#### Exercises

In [None]:
# Read a file in chunks and count entries in the 'lang' column.
counts_dict = {}
for chunk in pd.read_csv("tweets.csv", chunksize=10):
    for entry in chunk["lang"]:
        if entry in counts_dict.keys():
            counts_dict[entry] += 1
        else:
            counts_dict[entry] = 1
print(counts_dict)

In [None]:
# Create a function to do the same thing.
# Inputs are path to file, chunk size, and column name.
def count_entries(csvfile, chunksize, colname):
    """
    Return a dictionary with counts of
    occurrences as value for each key.
    """
    counts_dict = {}
    for chunk in pd.read_csv(csvfile, chunksize=chunksize):
        for entry in chunk[colname]:
            if entry in counts_dict.keys():
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1
    return counts_dict

result_counts = count_entries("tweets.csv", 10, "lang")
print(result_counts)
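The if/else counting pattern can also be written with `collections.Counter` from the standard library. The helper name `count_entries_counter` is hypothetical, and plain lists stand in for the DataFrame chunks so that the sketch runs without a data file.

```python
from collections import Counter


def count_entries_counter(chunks):
    """Count values across an iterable of chunks (hypothetical helper)."""
    counts = Counter()
    for chunk in chunks:
        # update() adds one to the count of every element in the chunk.
        counts.update(chunk)
    return dict(counts)


# Plain lists stand in for the "lang" column of each DataFrame chunk.
sample_chunks = [["en", "en", "et"], ["en", "und"]]
lang_counts = count_entries_counter(sample_chunks)
print(lang_counts)
```

With pandas, the argument could be a generator expression such as `(chunk["lang"] for chunk in pd.read_csv("tweets.csv", chunksize=10))`, since iterating over a Series yields its values.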

## List Comprehensions & Generators
### List Comprehensions

List comprehensions collapse for loops for building lists into a single line.

The components of a list comprehension are:
- an iterable
- an iterator variable (representing members of the iterable)
- an output expression

In [None]:
# Take a list and add one to each element inefficiently.
nums = [12, 8, 21, 3, 16]
new_nums = []
for num in nums:
    new_nums.append(num + 1)
print(new_nums)

In [None]:
# Simpler: Use map and a lambda here.
new_nums = list(map(lambda x: x + 1, nums))
print(new_nums)

In [None]:
# Simplest: Use a list comprehension.
new_nums = [x + 1 for x in nums]
print(new_nums)

In [None]:
# We can write a list comprehension over any iterable (e.g., list, tuple,
# range, dict, etc.)
result = [num for num in range(11)]
print(result)

In [None]:
# We can use list comprehensions in the place of nested for loops.
# This is the original code with two loops.
pairs_1 = []
for num1 in range(0, 2):
    for num2 in range(6, 8):
        pairs_1.append((num1, num2))
print(pairs_1)

In [None]:
# We can do this with a list comprehension.
pairs_1 = [(num1, num2) for num1 in range(0, 2) for num2 in range(6, 8)]
print(pairs_1)

#### Exercises

In [None]:
# Print a list containing the first character of each doctor.
doctor = ["house", "cuddy", "chase", "thirteen", "wilson"]
print([doc[0] for doc in doctor])

In [None]:
# Print the squares of 0-9.
squares = [i ** 2 for i in range(0, 10)]
print(squares)

In [None]:
# Use list comprehension to create a matrix.
# These were my solutions:
matrix = [row for rownum in range(0, 5) for row in [[0, 1, 2, 3, 4]]]
print(matrix)
matrix = [row for rownum in range(0, 5) for row in [list(range(0, 5))]]
print(matrix)
matrix = [row for rownum in range(0, 5) for row in list((list(range(0, 5)),))]
print(matrix)

In [None]:
# The correct (and simpler) answer is:
# Create a 5 x 5 matrix using a list of lists: matrix.
matrix = [[colval for colval in range(0, 5)] for rownum in range(0, 5)]
print(matrix)

### Advanced Comprehensions
#### Conditionals in Comprehensions

In [None]:
# Conditionals in comprehensions on the iterable (filtering).
nums = [num ** 2 for num in range(10) if num % 2 == 0]
print(nums)

In [None]:
# Conditionals on the output expression.
# Note where if .. else is placed here.
nums = [num ** 2 if num % 2 == 0 else 0 for num in range(10)]
print(nums)

#### Dictionary Comprehensions

In [None]:
# Create dictionaries.
pos_neg = {num: -num for num in range(9)}
print(pos_neg)
print(type(pos_neg))

#### Exercises

In [None]:
fellowship = ["frodo", "samwise", "merry", "aragorn", "legolas", "boromir", "gimli"]
new_fellowship = [member for member in fellowship if len(member) >= 7]
print(new_fellowship)

In [None]:
# This does not work; the else is required after the "if" expression.
# new_fellowship = [member if len(member) >= 7 for member in fellowship]
# print(new_fellowship)

In [None]:
new_fellowship = [member if len(member) >= 7 else "" for member in fellowship]
print(new_fellowship)

In [None]:
new_fellowship = {member: len(member) for member in fellowship}
print(new_fellowship)

### Introduction to Generator Expressions

A list comprehension returns a list, with every element stored in memory. A generator expression returns a generator object instead; it does not store its results, and it is an iterator from which you can obtain values one at a time.

See https://realpython.com/introduction-to-python-generators/.

In [None]:
# Start with this list comprehension:
nums = [2 * num for num in range(10)]
print(nums)

In [None]:
# ...and replace the [] with (). This creates a generator object.
# numg is a generator object, which is an iterable and an iterator.
numg = (2 * num for num in range(10))
print(type(numg))
print(numg)
print(next(numg))
print(*numg)

In [None]:
# Another generator.
result = (num for num in range(6))
for num in result:
    print(num)

In [None]:
# Create a list from a generator. This stores the results in memory.
result = (num for num in range(6))
print(list(result))

A generator allows lazy evaluation. We can use `next` to get each element
from the generator only when we need it. A list comprehension stores the
elements of the list in memory, which is a problem if the list is very
large. A generator generates each element only when needed.

Generators are useful in data science when working with very large amounts of
data, so much data that it won't fit in memory.
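A rough way to see the difference is to compare object sizes. This is a sketch: `sys.getsizeof` reports only the size of the container object itself, not of the integers it refers to, and the value of `n` is an arbitrary choice.

```python
import sys

n = 100_000
squares_list = [num ** 2 for num in range(n)]
squares_gen = (num ** 2 for num in range(n))

# The list object holds references to all n results at once.
list_size = sys.getsizeof(squares_list)
# The generator holds only its paused state, regardless of n.
gen_size = sys.getsizeof(squares_gen)
print(list_size > gen_size)

# Values are computed only when requested.
first_three = [next(squares_gen) for _ in range(3)]
print(first_three)
```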

Hugo doesn't mention that a `range` object is also lazy -- it produces its values on demand, although it is a sequence rather than a generator. But when list is called on a range, all data is stored in memory.
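A short sketch of how a range and a generator compare in practice: both produce values lazily, but a range behaves like a sequence and can be reused, while a generator can be consumed only once. The variable names are chosen for illustration.

```python
numbers = range(1, 10)
gen = (num for num in range(1, 10))

# A range is a sequence: it has a length and supports indexing,
# and it can be iterated over more than once.
range_len = len(numbers)
range_third = numbers[2]
range_twice = (list(numbers), list(numbers))

# A generator has no length and can be consumed only once.
try:
    len(gen)
    gen_has_len = True
except TypeError:
    gen_has_len = False
gen_twice = (list(gen), list(gen))

print(range_len, range_third, gen_has_len)
print(gen_twice[1])
```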


In [None]:
# This expression nearly kills my MacBook Pro (10 billion ints).
# [num for num in range(1, 10000000000)]
# But this does not:
(num for num in range(1, 10000000000))

In [None]:
# But a generator can do things a range object can't: it accepts conditionals
# in generator expressions.
even_nums = (num for num in range(10) if num % 2 == 0)
print(list(even_nums))

Generator functions are defined like a regular function using def, but they yield a sequence of values instead of returning a single value. The generator generates a value with the `yield` keyword.

In [None]:
# Build a generator function that yields values from 0 to n - 1.
def num_sequence(n):
    """
    Generate values from 0 to n - 1.
    """
    i = 0
    while i < n:
        yield i
        i += 1

# result is a generator object.
result = num_sequence(5)
print(result)
for item in result:
    print(item)

#### Exercises

In [None]:
# Create a generator that yields 0 .. 30.
result = (num for num in range(31))
print(next(result))
print(next(result))
print(next(result))
print(next(result))
print(next(result))
#for value in result:
#    print(value)
print(*result)

In [None]:
# Note: The items, keys, and values methods of a dict return view objects
# (dict_items, dict_keys, dict_values), which are iterables, not generators.
mydict = {1: "one", 2: "two"}
my_keys = mydict.keys()
print(my_keys)
my_values = mydict.values()
print(my_values)
my_items = mydict.items()
print(my_items)

In [None]:
# Changing the output in generator expressions.
# Here, generate the lengths of the strings in a list.
lannister = ["cersei", "jaime", "tywin", "tyrion", "joffrey"]
lengths = (len(person) for person in lannister)
for value in lengths:
    print(value)

In [None]:
# Create a generator function.
def get_lengths(input_list):
    """
    Yield the length of the strings in input_list.
    """
    for person in input_list:
        yield len(person)

lannister = ["cersei", "jaime", "tywin", "tyrion", "joffrey"]
for value in get_lengths(lannister):
    print(value)

### Wrapping Up Comprehensions and Generators

A basic list comprehension has this structure:

`[output expression for iterator variable in iterable]`

An advanced list comprehension has this structure:

`[output expression + conditional on output for iterator variable
in iterable + conditional on iterable]`
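The two kinds of conditional can be sketched side by side. The list `words` and the specific conditions are chosen only for illustration.

```python
words = ["frodo", "samwise", "merry", "aragorn", "legolas"]

# Conditional on the iterable: keeps only some members (a filter).
long_words = [word for word in words if len(word) >= 7]

# Conditional on the output: keeps every member but changes its value.
flagged = [word.upper() if len(word) >= 7 else word for word in words]

# Both at once: filter first, then choose the output value.
combined = [
    word.upper() if word.startswith("a") else word
    for word in words
    if len(word) >= 7
]
print(long_words)
print(flagged)
print(combined)
```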

#### Exercises

In [None]:
# Create a list comprehension for time-stamped data.
# This uses a pandas Series data structure.
# The time stamps look like this: "Tue Mar 29 23:40:19 +0000 2016"

df = pd.read_csv("tweets.csv")
tweet_time = df["created_at"]
tweet_clock_time = [entry[11:19] for entry in tweet_time]
print(tweet_clock_time)

In [None]:
# Conditional list comprehensions for time-stamped data.
tweet_clock_time = [entry[11:19] for entry in tweet_time if entry[17:19] == "19"]
print(tweet_clock_time)

## Case Study

The data set, the World Bank World Development Indicators Data Set, is from http://data.worldbank.org/data-catalog/world-development-indicators. See also https://datacatalog.worldbank.org/search/dataset/0037712.

The data set contains data for 217 countries for more than half a century, 1960-2015. The data set contains hundreds of indicators, including population, electricity consumption, CO2 emissions, literacy rates, unemployment, mortality rates, and more. In the exercises, we will use the techniques learned in this course to wrangle the data. These include zip, functions, comprehensions, and generators.

### Loading a DataFrame from Lists

In [None]:
# The data we need are not in the world_ind_pop_data.csv file provided by
# the course. These data were copied from the IPython shell.
feature_names = [
    "CountryName",
    "CountryCode",
    "IndicatorName",
    "IndicatorCode",
    "Year",
    "Value",
]

row_lists = [
    [
        "Arab World",
        "ARB",
        "Adolescent fertility rate (births per 1,000 women ages 15-19)",
        "SP.ADO.TFRT",
        "1960",
        "133.56090740552298",
    ],
    [
        "Arab World",
        "ARB",
        "Age dependency ratio (% of working-age population)",
        "SP.POP.DPND",
        "1960",
        "87.7976011532547",
    ],
    [
        "Arab World",
        "ARB",
        "Age dependency ratio, old (% of working-age population)",
        "SP.POP.DPND.OL",
        "1960",
        "6.634579191565161",
    ],
    [
        "Arab World",
        "ARB",
        "Age dependency ratio, young (% of working-age population)",
        "SP.POP.DPND.YG",
        "1960",
        "81.02332950839141",
    ],
    [
        "Arab World",
        "ARB",
        "Arms exports (SIPRI trend indicator values)",
        "MS.MIL.XPRT.KD",
        "1960",
        "3000000.0",
    ],
    [
        "Arab World",
        "ARB",
        "Arms imports (SIPRI trend indicator values)",
        "MS.MIL.MPRT.KD",
        "1960",
        "538000000.0",
    ],
    [
        "Arab World",
        "ARB",
        "Birth rate, crude (per 1,000 people)",
        "SP.DYN.CBRT.IN",
        "1960",
        "47.697888095096395",
    ],
    [
        "Arab World",
        "ARB",
        "CO2 emissions (kt)",
        "EN.ATM.CO2E.KT",
        "1960",
        "59563.9892169935",
    ],
    [
        "Arab World",
        "ARB",
        "CO2 emissions (metric tons per capita)",
        "EN.ATM.CO2E.PC",
        "1960",
        "0.6439635478877049",
    ],
    [
        "Arab World",
        "ARB",
        "CO2 emissions from gaseous fuel consumption (% of total)",
        "EN.ATM.CO2E.GF.ZS",
        "1960",
        "5.041291753975099",
    ],
    [
        "Arab World",
        "ARB",
        "CO2 emissions from liquid fuel consumption (% of total)",
        "EN.ATM.CO2E.LF.ZS",
        "1960",
        "84.8514729446567",
    ],
    [
        "Arab World",
        "ARB",
        "CO2 emissions from liquid fuel consumption (kt)",
        "EN.ATM.CO2E.LF.KT",
        "1960",
        "49541.707291032304",
    ],
    [
        "Arab World",
        "ARB",
        "CO2 emissions from solid fuel consumption (% of total)",
        "EN.ATM.CO2E.SF.ZS",
        "1960",
        "4.72698138789597",
    ],
    [
        "Arab World",
        "ARB",
        "Death rate, crude (per 1,000 people)",
        "SP.DYN.CDRT.IN",
        "1960",
        "19.7544519237187",
    ],
    [
        "Arab World",
        "ARB",
        "Fertility rate, total (births per woman)",
        "SP.DYN.TFRT.IN",
        "1960",
        "6.92402738655897",
    ],
    [
        "Arab World",
        "ARB",
        "Fixed telephone subscriptions",
        "IT.MLT.MAIN",
        "1960",
        "406833.0",
    ],
    [
        "Arab World",
        "ARB",
        "Fixed telephone subscriptions (per 100 people)",
        "IT.MLT.MAIN.P2",
        "1960",
        "0.6167005703199",
    ],
    [
        "Arab World",
        "ARB",
        "Hospital beds (per 1,000 people)",
        "SH.MED.BEDS.ZS",
        "1960",
        "1.9296220724398703",
    ],
    [
        "Arab World",
        "ARB",
        "International migrant stock (% of population)",
        "SM.POP.TOTL.ZS",
        "1960",
        "2.9906371279862403",
    ],
    [
        "Arab World",
        "ARB",
        "International migrant stock, total",
        "SM.POP.TOTL",
        "1960",
        "3324685.0",
    ],
]

#### Exercises

For practice (not a course exercise):

Given the feature names in one list and a row of data in a second list, efficiently create a dict for which the feature names are the keys and the row data are the values. (Do not iterate over the columns.)

See https://docs.python.org/3/library/stdtypes.html#mapping-types-dict for various approaches for creating a dictionary.

See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame for the pandas API for creating a DataFrame.

In [None]:
# Use zip to create a list of tuples, where the first value in each tuple
# is the feature name and the second value of each tuple is the value.
# Then call dict on that list.
data_tuples = zip(feature_names, row_lists[0])
data_dict1 = dict(data_tuples)
print(data_dict1)

In [None]:
# As an alternative, use a dict comprehension to create the dict.
# Show that the result is identical to that obtained above.
# This is a little clearer to read but is less efficient.
data_dict2 = {key: val for (key, val) in zip(feature_names, row_lists[0])}
print(data_dict2)
print(data_dict2 == data_dict1)

In [None]:
# Write a function to aid this process and test it.
def lists2dict(list1, list2):
    return dict(zip(list1, list2))

data_dict3 = lists2dict(feature_names, row_lists[0])
print(data_dict3)
print(data_dict3 == data_dict1)

Create a list of dicts, where each dict contains the keys from feature_names and the values represent a row of values. Use the lists2dict function to create each dict. Make this efficient by using a list comprehension.

This is an intuitive way to collect the data; each dict is like an object with attribute names and values, like JSON objects. What is inefficient is that each object contains the attribute names, so if the data are large, there is a great deal of redundant data from storing the keys with each object.

In [None]:
# Create a list of dicts from the inputs.
list_of_dicts1 = [lists2dict(feature_names, sublist) for sublist in row_lists]
# Print the first two dicts from the list.
print(list_of_dicts1[0])
print(list_of_dicts1[1])
# Show that the first dict in the list is identical to what was produced
# above.
print(list_of_dicts1[0] == data_dict1)
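The redundancy noted above can be avoided by storing the data column by column, so that each feature name appears once. This is a sketch under assumptions: `sample_names` and `sample_rows` are made-up stand-ins for `feature_names` and `row_lists`.

```python
sample_names = ["CountryName", "CountryCode", "Year"]
# Made-up rows for illustration only.
sample_rows = [
    ["Arab World", "ARB", "1960"],
    ["Arab World", "ARB", "1961"],
    ["Aruba", "ABW", "1960"],
]

# zip(*sample_rows) transposes the rows into columns; each key is stored once.
columns = {
    name: list(col) for name, col in zip(sample_names, zip(*sample_rows))
}
print(columns["Year"])
```

`pd.DataFrame` also accepts a dict of lists in this shape, so either layout can be used to build the DataFrame.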

In [None]:
# Do the same as above without using the function
# I feel like the function lists2dict is not necessary since it contains
# only one line of code.
list_of_dicts2 = [dict(zip(feature_names, sublist)) for sublist in row_lists]
list_of_dicts2 == list_of_dicts1

In [None]:
# Convert these data to a DataFrame and show the first five rows.
# Take advantage of Jupyter's nice formatting of a pandas DataFrame.
df1 = pd.DataFrame(list_of_dicts2)
df1.head()

In [None]:
# While we're demonstrating the capabilities of pandas, show the tail and
# the dtypes.
df1.tail()

In [None]:
# Note that every column has dtype object, which is probably not what is wanted.
df1.dtypes

In [None]:
df1.info()

In [None]:
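# Convert the dtypes explicitly. infer_objects() would not help here: it
# returns a new DataFrame and does not parse numbers stored as strings.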
df1["CountryName"] = df1["CountryName"].astype("string")
df1["CountryCode"] = df1["CountryCode"].astype("string")
df1["IndicatorName"] = df1["IndicatorName"].astype("string")
df1["IndicatorCode"] = df1["IndicatorCode"].astype("string")
df1["Year"] = df1["Year"].astype("float64")
df1["Value"] = df1["Value"].astype("float64")
df1.info()

### Using Python Generators for Streaming Data

The problem to solve here is that the data may be too big to load into memory using the techniques given above. So we're going to stream the data in chunks as we learned earlier. We will use a generator to load a file line by line.

#### Exercises

In [None]:
# This is bad: We are parsing a CSV file by splitting on ','.
# The exercise uses world_dev_ind.csv, which the user doesn't have.
# This code contains an improvement to watch for EOF, which the
# original code did not do, resulting in the addition of "" keys
# in the dict if we read more lines than there were in the file.
counts_dict1 = {}
with open("world_ind_pop_data.csv") as file:
    # Skip the header line.
    file.readline()
    # Process the first 1000 rows.
    for j in range(1000):
        dataline = file.readline()
        # Exit the loop if the file is out of data.
        if dataline == "":
            break
        col_values = dataline.split(",")
        first_col = col_values[0]
        if first_col in counts_dict1.keys():
            counts_dict1[first_col] += 1
        else:
            counts_dict1[first_col] = 1
print(counts_dict1)

In [None]:
# An improved script (my work).
# Read the *entire* file using a for loop; a file object is an iterator over its lines.
counts_dict2 = {}
with open("world_ind_pop_data.csv") as file:
    # Skip the header line.
    file.readline()
    # Iterate over the file object, which returns data one line at a time.
    for data_line in file:
        col_values = data_line.split(",")
        first_col = col_values[0]
        if first_col in counts_dict2.keys():
            counts_dict2[first_col] += 1
        else:
            counts_dict2[first_col] = 1
print(counts_dict2)

In [None]:
# Write a generator to load data in chunks, one line at a time.
# Demonstrate this by printing the first three lines of the file.
def read_large_file(file_object):
    """A generator function to read a large file lazily."""
    while True:
        # readline returns "" if no more data is available.
        data = file_object.readline()
        if not data:
            break
        yield data

with open("world_ind_pop_data.csv") as file:
    gen_file = read_large_file(file)
    # Use next to get the data line by line.
    # Of course, we can just call next on the file object.
    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))
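For comparison, the two-argument form of `iter()` builds an equivalent line iterator without a generator function: it calls `readline()` repeatedly until the sentinel `""` is returned. This sketch uses an in-memory `io.StringIO` with made-up rows in place of the data file.

```python
import io

# In-memory stand-in for an open file; the rows are made up.
sample_file = io.StringIO("CountryName,Year\nAruba,1960\nAruba,1961\n")

# iter(callable, sentinel) calls readline() until it returns "".
line_iter = iter(sample_file.readline, "")
header = next(line_iter)
data_lines = [line.rstrip("\n") for line in line_iter]
print(header.strip(), data_lines)
```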

In [None]:
# Count the values in the CountryName column.
# Show that we get the same result as before.
counts_dict3 = {}
with open("world_ind_pop_data.csv") as file:
    # Create the generator for obtaining lines from the file.
    file_gen = read_large_file(file)
    header = next(file_gen)
    for dataline in file_gen:
        row = dataline.split(",")
        first_col = row[0]
        if first_col in counts_dict3.keys():
            counts_dict3[first_col] += 1
        else:
            counts_dict3[first_col] = 1
print(counts_dict3)
print(counts_dict3 == counts_dict2)

In [None]:
# Extra credit.
# Now, do this using the csv module. The results are not the same,
# revealing a problem with reading a CSV file and splitting on the ","
# character instead of using the csv module.

# Looking at the keys, we find that the keys from using csv
# include keys with "," characters in them! This is a lesson for
# why you should not split a CSV file using the "," character!

# Grepping the file reveals 770 lines containing '"' characters.
#   grep "\"" world_ind_pop_data.csv | wc -l
#   770
# These are lines such as:
#   "Bahamas, The",BHS,1960,109526.0,59.711999999999996
#   "Congo, Dem. Rep.",ZAR,1960,15248246.0,22.3
#   "Congo, Rep.",COG,1960,1013581.0,31.601

# Using the csv module gives correct results.
counts_dict4 = {}
with open("world_ind_pop_data.csv", newline="") as csv_file:
    csvreader = csv.reader(csv_file)
    header = next(csvreader)
    for row in csvreader:
        first_col = row[0]
        if first_col in counts_dict4.keys():
            counts_dict4[first_col] += 1
        else:
            counts_dict4[first_col] = 1
print(counts_dict4)
print(counts_dict4 == counts_dict2)
# Get the keys from the two dictionaries and use sets to identify
# which keys are different in each set.
keys2 = set(counts_dict2)
keys4 = set(counts_dict4)
diff2 = keys2.difference(keys4)
print(diff2)
diff4 = keys4.difference(keys2)
print(diff4)
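
To see the problem in isolation, here is a small sketch that parses the three quoted lines listed in the comments above, once with `str.split(",")` and once with `csv.reader`. The variable names are illustrative only.

```python
import csv

# The three example lines quoted in the comments above.
lines = [
    '"Bahamas, The",BHS,1960,109526.0,59.711999999999996',
    '"Congo, Dem. Rep.",ZAR,1960,15248246.0,22.3',
    '"Congo, Rep.",COG,1960,1013581.0,31.601',
]

# Naive splitting cuts each quoted name at its embedded comma.
naive_keys = [line.split(",")[0] for line in lines]

# csv.reader accepts any iterable of lines and honors the quoting.
csv_keys = [row[0] for row in csv.reader(lines)]

print(naive_keys)
print(csv_keys)
```

With naive splitting, both Congo entries collapse into the same truncated key, so their counts are merged as well as mislabeled.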

In [None]:
# Extra credit.
# Load the data using pandas.read_csv.
# pandas.read_csv should give the same results as the csv module.
# pandas.read_csv reads the data into a DataFrame.
# We want to count the distinct items in the CountryName column.
df2 = pd.read_csv("world_ind_pop_data.csv")
df2.head()

In [None]:
# Extra credit. Look at the DataFrame.
df2.tail()

In [None]:
df2.shape

In [None]:
df2.info()

In [None]:
# Change the dtypes.
df2["CountryName"] = df2["CountryName"].astype("string")
df2["CountryCode"] = df2["CountryCode"].astype("string")
df2.info()
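
The conversion above happens after loading. As an alternative sketch, `pandas.read_csv` accepts a `dtype` mapping so the columns are read as strings from the start. The one-row CSV text here is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical one-row sample with the same leading columns as the data file.
sample_csv = "CountryName,CountryCode,Year\nArab World,ARB,1960\n"

# Request the pandas string dtype for the two text columns at read time.
df = pd.read_csv(
    io.StringIO(sample_csv),
    dtype={"CountryName": "string", "CountryCode": "string"},
)
print(df.dtypes)
```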

In [None]:
# Review the CountryName column of the DataFrame.
df2["CountryName"]

In [None]:
# Extra credit.
# Extract the country names from the DataFrame object produced by
# pd.read_csv and count them. The results should be the same as those
# produced using the csv module (and they are the same).
country_names = list(df2["CountryName"])
counts_dict5 = {}
for name in country_names:
    if name in counts_dict5:
        counts_dict5[name] += 1
    else:
        counts_dict5[name] = 1
print(counts_dict5)
print(counts_dict5 == counts_dict4)
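
The loop above is written out by hand to match the earlier cells. For comparison, a sketch of the same count using `Series.value_counts`, run on made-up in-memory data that includes one quoted name with an embedded comma:

```python
import io

import pandas as pd

# Hypothetical sample; the quoted name checks that read_csv honors quoting.
sample_csv = (
    "CountryName,CountryCode,Year\n"
    "Arab World,ARB,1960\n"
    '"Bahamas, The",BHS,1960\n'
    "Arab World,ARB,1961\n"
)

df = pd.read_csv(io.StringIO(sample_csv))

# value_counts returns a Series indexed by the distinct values.
counts = df["CountryName"].value_counts().to_dict()
print(counts)
```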

### Using pandas.read_csv() Iterator for Streaming Data

#### Exercises

In [None]:
# Pretend the file is too big to load into memory (because the DataFrame
# won't fit into existing memory). Use pandas' read_csv iterator with the
# chunksize parameter for streaming data, getting sub-DataFrames and
# processing them.

# First, show how to read the first chunk of the data file.
df_reader = pd.read_csv("world_ind_pop_data.csv", chunksize=10)
next(df_reader)

In [None]:
# ...and the next chunk...
next(df_reader)
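
The object returned by `read_csv` with `chunksize` can also be consumed with a for loop, and the last chunk may be shorter than the rest. A small sketch on generated in-memory data follows; using the reader as a context manager assumes pandas 1.2 or later.

```python
import io

import pandas as pd

# Hypothetical five-row CSV so the final chunk is smaller than chunksize.
sample_csv = "Year,Value\n" + "".join(f"{1960 + i},{i}\n" for i in range(5))

# The reader closes its underlying handle when the with block exits.
with pd.read_csv(io.StringIO(sample_csv), chunksize=2) as reader:
    chunk_lengths = [len(chunk) for chunk in reader]

print(chunk_lengths)
```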

In [None]:
# Extra credit.
# Chunk the data into DataFrames of size 100 to make the counts.
# Show that the result is the same as before from loading all of the data
# into memory.
counts_dict6 = {}
df_reader2 = pd.read_csv("world_ind_pop_data.csv", chunksize=100)
for df_chunk in df_reader2:
    country_names = list(df_chunk["CountryName"])
    for name in country_names:
        if name in counts_dict6:
            counts_dict6[name] += 1
        else:
            counts_dict6[name] = 1
print(counts_dict6)
print(counts_dict6 == counts_dict5)
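
Another way to merge per-chunk results, sketched here on made-up data, is to count each chunk with `value_counts` and add the partial counts into a `collections.Counter`. `Counter.update` adds counts when given a mapping, so no key ever needs to be initialized by hand.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical sample standing in for the real data file.
sample_csv = (
    "CountryName,CountryCode,Year\n"
    "Arab World,ARB,1960\n"
    '"Bahamas, The",BHS,1960\n'
    "Arab World,ARB,1961\n"
    '"Bahamas, The",BHS,1961\n'
    "Arab World,ARB,1962\n"
)

# Accumulate partial counts chunk by chunk.
chunked_counts = Counter()
for chunk in pd.read_csv(io.StringIO(sample_csv), chunksize=2):
    chunked_counts.update(chunk["CountryName"].value_counts().to_dict())

# Compare against counting the whole file in one pass.
full_counts = (
    pd.read_csv(io.StringIO(sample_csv))["CountryName"].value_counts().to_dict()
)
print(dict(chunked_counts) == full_counts)
```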

In [None]:
# OK, we're back to the exercise again.
# Write an iterator to load data in chunks.
df_reader3 = pd.read_csv("world_ind_pop_data.csv", chunksize=1000)
df_urb_pop = next(df_reader3)
df_urb_pop.head()

In [None]:
# Limit the data to CountryCode CEB.
df_pop_ceb = df_urb_pop[df_urb_pop["CountryCode"] == "CEB"]

# Zip together two columns to prepare the data for plotting.
# Make a list of tuples.
pops = zip(df_pop_ceb["Total Population"], df_pop_ceb["Urban population (% of total)"])
pops_list = list(pops)
print(pops_list)

In the rest of this exercise, we calculate a new column, 'Total Urban Population', from the 'Total Population' and 'Urban population (% of total)' columns by multiplying the two values together and dividing by 100, and then plot the result.
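
The exercise builds this column with `zip` and a list comprehension. As a sketch of the same arithmetic done column-wise, pandas can multiply the two Series directly. The numbers below are made up so the result is easy to check by hand; on real data, floating-point rounding may occasionally make `/ 100` and `* 0.01` truncate to different integers.

```python
import pandas as pd

# Hypothetical values chosen so the arithmetic is exact.
df = pd.DataFrame(
    {
        "Total Population": [1000000.0, 200.0],
        "Urban population (% of total)": [50.0, 25.0],
    }
)

# Multiply the columns element-wise, divide by 100, and truncate to int.
df["Total Urban Population"] = (
    df["Total Population"] * df["Urban population (% of total)"] / 100
).astype(int)

print(df["Total Urban Population"].tolist())
```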

To avoid subtle problems with references vs. copies (such as pandas' SettingWithCopyWarning), call the .copy method when subsetting a DataFrame that you intend to modify.

DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; use pd.concat instead, as shown below.

In [None]:
# Write an iterator to load data in chunks (3).
# This extracts the first chunk from the input file, subsets the data for
# country code "CEB", calculates the urban population as an int, adds
# a new "Total Urban Population" column to the subset DataFrame, and plots
# the "Total Urban Population" vs. "Year".

# Because we have processed only the first chunk of 1000 lines, we have a
# small subset (years 1960-1964) of the total data.
urb_pop_reader = pd.read_csv("world_ind_pop_data.csv", chunksize=1000)

# Read the first chunk from the file.
df_urb_pop = next(urb_pop_reader)

# Limit the dataset to CountryCode 'CEB' and copy the
# data to a new DataFrame. This eliminates a subtle problem
# with references to original data.
df_pop_ceb = df_urb_pop[df_urb_pop["CountryCode"] == "CEB"].copy()

# Zip the data together for convenience, calculate the Total Urban Population
# value, and add the column to the DataFrame.
pops = zip(df_pop_ceb["Total Population"], df_pop_ceb["Urban population (% of total)"])
pops_list = list(pops)

# Use a list comprehension to calculate the values for the new column.
# "tups" stands for "total urban populations".
tups = [int(tpop * uppct * 0.01) for tpop, uppct in pops_list]

# Add the new column to the DataFrame.
df_pop_ceb.loc[:, 'Total Urban Population'] = tups

# Plot the urban population data.
df_pop_ceb.plot(kind="scatter", x="Year", y="Total Urban Population")
plt.show()

In [None]:
# Writing an iterator to load data in chunks (4).
# Improve the previous code by processing the entire file using a
# for loop.

# Initialize an empty DataFrame for appending results.
data = pd.DataFrame()

urb_pop_reader = pd.read_csv("world_ind_pop_data.csv", chunksize=1000)
for df_urb_pop in urb_pop_reader:
    # Subset on CountryCode == 'CEB'
    df_pop_ceb = df_urb_pop[df_urb_pop["CountryCode"] == "CEB"].copy()

    # Zip DataFrame columns of interest: pops
    pops = zip(
        df_pop_ceb["Total Population"],
        df_pop_ceb["Urban population (% of total)"])

    # Turn zip object into list: pops_list
    pops_list = list(pops)

    # Use list comprehension to create new DataFrame column 'Total Urban Population'.
    # Because df_pop_ceb is an explicit copy, this assignment should not
    # trigger SettingWithCopyWarning.
    df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]

    # Append DataFrame chunk to data. This is the first time we have done this.
    # data = data.append(df_pop_ceb)  # DataFrame.append was removed in pandas 2.0
    data = pd.concat([data, df_pop_ceb])

# Plot urban population data
data.plot(kind="scatter", x="Year", y="Total Urban Population")
plt.show()
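
Calling `pd.concat` inside the loop copies the accumulated rows on every iteration. A common alternative, sketched here on made-up in-memory data rather than the World Bank file, is to collect the filtered chunks in a list and concatenate once after the loop. Note that `pd.concat` raises `ValueError` for an empty list, so this sketch assumes at least one chunk is read.

```python
import io

import pandas as pd

# Hypothetical sample with the same column names as the data file.
sample_csv = (
    "CountryCode,Year,Total Population,Urban population (% of total)\n"
    "CEB,1960,1000.0,50.0\n"
    "ARB,1960,2000.0,25.0\n"
    "CEB,1961,1100.0,50.0\n"
    "ARB,1961,2100.0,25.0\n"
)

# Collect one filtered DataFrame per chunk.
frames = []
for chunk in pd.read_csv(io.StringIO(sample_csv), chunksize=2):
    frames.append(chunk[chunk["CountryCode"] == "CEB"].copy())

# Concatenate once; ignore_index renumbers the rows from zero.
data = pd.concat(frames, ignore_index=True)
print(data[["CountryCode", "Year"]])
```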

In [None]:
# Writing an iterator to load data in chunks (5).
# Write a single function that will plot the urban population of the
# chosen country.

def plot_pop(filename, country_code):
    # Get an iterator for the input file.
    urb_pop_reader = pd.read_csv(filename, chunksize=1000)

    # Initialize a DataFrame for containing the final results.
    data = pd.DataFrame()
    for df_urb_pop in urb_pop_reader:
        # Get a subset of the input DataFrame for the specified CountryCode.
        df_pop_cc = df_urb_pop[df_urb_pop["CountryCode"] == country_code].copy()

        # Zip the columns together for convenience.
        pops = zip(
            df_pop_cc["Total Population"],
            df_pop_cc["Urban population (% of total)"])
        pops_list = list(pops)

        # Add the new column to the DataFrame.
        df_pop_cc['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]

        # data = data.append(df_pop_cc)  # DataFrame.append was removed in pandas 2.0
        data = pd.concat([data, df_pop_cc])

    # print(data)
    data.plot(kind="scatter", x="Year", y="Total Urban Population", title=country_code)
    plt.show()

# Use the function.
fn = "world_ind_pop_data.csv"
plot_pop(fn, "CEB")
plot_pop(fn, "ARB")
