# Data Types for Data Science in Python

## Introduction

This course is presented by Jason Myers, who is the co-author (with Rick Copeland) of _Essential SQLAlchemy_ and is a software engineer.

Prerequisite:
- Intermediate Python

This course is part of this track:
- Python Programmer

A good resource is section 5 of "The Python Tutorial: Data Structures." See https://docs.python.org/3/tutorial/datastructures.html.

Jason moves quickly through the demonstrations, making his presentations very efficient to watch, but it takes a lot of work to copy the examples and experiment with them. In addition, we are not given the source data files for some of the data used in the examples. The variable `art_galleries`, a dictionary, is used in inconsistent ways. The code for creating timezone-aware datetime objects is incorrect (see detailed notes below). The section titles are sometimes cute but not useful.

I learned a lot, but I think that this course needs revision.

## Datasets

| Name | File |
| :--- | :--- |
| Baby names | baby_names.csv |
| Chicago crime | crime_sampler.csv |
| CTA daily station totals | cta_daily_station_totals.csv |
| CTA daily summary totals | cta_daily_summary_totals.csv |

## Imports

Imports are collected here for convenience and clarity.

In [None]:
from collections import Counter, defaultdict, namedtuple, OrderedDict
import csv
import datetime
import os
import pprint
import sys
import traceback

import pendulum
import pytz

## Fundamental Data Types

### Introduction

The data type system of a programming language sets the capabilities of the language. Understanding the Python data types empowers you as a data scientist who uses Python.

A container holds a collection of other objects. Containers are used for aggregation, ordering, sorting, and more. Python's built-in containers include lists, sets, and tuples (to name a few). Lists and tuples are sequences, meaning that they keep their items in order; sets are unordered. Containers can be mutable (list, set) or immutable (tuple). They are all iterable, which enables grouping, aggregating, and processing data.
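
The mutable/immutable distinction described above can be sketched as follows. This is a minimal illustration; the variable names are made up for this note and do not come from the course.

```python
# Illustrative names; not taken from the course material.
flavors_list = ["vanilla", "chocolate"]
flavors_set = {"vanilla", "chocolate"}
flavors_tuple = ("vanilla", "chocolate")

# Lists and sets are mutable: they can be changed in place.
flavors_list.append("strawberry")
flavors_set.add("strawberry")

# Tuples are immutable: they have no method for adding an item in place.
tuple_has_append = hasattr(flavors_tuple, "append")
print(tuple_has_append)

# All three are iterable, so the same loop works on each of them.
lengths = [sum(1 for _ in container)
           for container in (flavors_list, flavors_set, flavors_tuple)]
print(lengths)
```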

### Lists

#### Accessing Single Items in a List (Demonstration)

In [None]:
# Add an item to a list. Access a single item in the list.
cookies = ["chocolate chip", "peanut butter", "sugar"]
cookies.append("Tirggel")
print(cookies)
print(cookies[2])

#### Combining Lists (Demonstration)

In [None]:
# Lists can be combined in different ways.
# Use the + operator to concatenate two lists to create a new list.
cakes = ["strawberry", "vanilla"]
desserts = cookies + cakes
print(desserts)

# The list.extend method merges a second list into the first list;
# this does not return a new list.
fruits = ["blueberries", "bananas", "melons"]
desserts.extend(fruits)
print(desserts)

#### Finding Elements in a List (Demonstration)

In [None]:
# Find elements in a list
position = cookies.index("sugar")
print(position)
print(cookies[position])

In [None]:
# Extra: A search that fails raises an exception. The value
# of position does not change.
try:
    position = cookies.index("oatmeal")
except ValueError as exc:
    # Passing only the exception object requires Python 3.10 or later.
    traceback.print_exception(exc, file=sys.stdout)
print(position)

#### Removing Elements in a List (Demonstration)

In [None]:
# Remove an element from the list (and save it).
position = cookies.index("sugar")
name = cookies.pop(position)
print(name)
print(cookies)

#### Iterating over Lists (Demonstration)

In [None]:
# Iterate over the list.
for cookie in cookies:
    print(cookie)

#### Sorting Lists (Demonstration)

In [None]:
# sorted() returns a new list with the items in numerical or alphabetical
# order; the original list is not changed.
sorted_cookies = sorted(cookies)
print(sorted_cookies)

#### Manipulating Lists (Exercise)

In [None]:
# Create a list containing the names: baby_names
baby_names = ["Ximena", "Aliza", "Ayden", "Calvin"]

# Extend baby_names with 'Rowen' and 'Sandeep'
baby_names.extend(["Rowen", "Sandeep"])

# Print baby_names
print(baby_names)

# Find the position of 'Aliza': position
position = baby_names.index("Aliza")

# Remove 'Aliza' from baby_names
baby_names.pop(position)

# Print baby_names
print(baby_names)

# Extra: By default, pop() removes the last item from the list.
print(baby_names.pop())
print(baby_names)

#### Looping over Lists (Exercise)

In [None]:
# Load the data from the CSV file.
# header: ['BRITH_YEAR', 'GENDER', 'ETHNICTY', 'NAME', 'COUNT', 'RANK']
# data:   ['2011', 'FEMALE', 'HISPANIC', 'GERALDINE', '13', '75']
records = []
with open("baby_names.csv", newline="") as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    print("header: {}".format(header))
    # Load the records using a list comprehension.
    records = [record for record in csv_reader]
    # Or get the records one at a time using a for loop.
    # for record in csv_reader:
    #     records.append(record)
print("Read {} records from CSV file.".format(len(records)))
print("The first record:")
print(records[0])

# Collect the baby names in a list.
baby_names = [record[3] for record in records]

# Omit printing all of the 13,962 names.
# for name in sorted(baby_names):
#     print(name)
print("Collected " + str(len(baby_names)) + " baby names.")

### Tuples

Tuples hold data in order, and the data can be accessed using an index. Tuples are immutable; you cannot add to them or delete from them. This makes them more efficient than lists for some operations (read-only).

We can create tuples by pairing up elements. We can unpack tuples into named variables.
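
A minimal sketch of what immutability means in practice: reading an element by index works, but assigning to an index raises a `TypeError`. The variable names `pair`, `first`, and `error_message` are illustrative only.

```python
# A tuple pairing two cookie names.
pair = ("Chocolate Chip", "Punjabi")

# Reading by index works the same way as with a list.
first = pair[0]
print(first)

# Assigning to an index is not allowed for a tuple.
try:
    pair[0] = "Brownies"
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```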

#### Zipping Tuples (Demonstration)

In [None]:
# Tuples are commonly created by zipping lists together with zip().
# Often we'll have lists that we want to match up, element by element.
# Use zip to create tuples with elements that belong together.
us_cookies = ["Chocolate Chip", "Brownies", "Peanut Butter", "Oreos", "Oatmeal Raisin"]
in_cookies = ["Punjabi", "Fruit Cake Rusk", "Marble Cookies", "Kaju Pista Cookies", "Almond Cookies"]
top_pairs = list(zip(us_cookies, in_cookies))
# Show object types. zip returns an object that is an iterable.
print(zip(us_cookies, in_cookies))
print(type(zip(us_cookies, in_cookies)))
print(type(top_pairs))
print(top_pairs)

#### Unpacking Tuples (Demonstration)

In [None]:
# Tuple unpacking or tuple expansion
# Assign members of a tuple to named variables for later use.
us_num_1, in_num_1 = top_pairs[0]
print(us_num_1)
print(in_num_1)

#### Unpacking Tuples in Loops (Demonstration)

In [None]:
# Tuple unpacking in loops. Unpacking is especially powerful in loops.
for us_cookie, in_cookie in top_pairs:
    print(in_cookie + " + " + us_cookie)

# The last values linger in the variables.
print(in_cookie)
print(us_cookie)

#### Enumerating Positions (Demonstration)

In [None]:
# Use the enumerate function to provide the index of each element of an
# iterable. Enumeration is used in loops to return the position and the
# data in that position while looping.
for idx, item in enumerate(top_pairs):
    us_cookie, in_cookie = item
    print(idx, us_cookie, in_cookie)

In [None]:
# Extra: Here's another way to unpack the data.
for idx, (us_cookie, in_cookie) in enumerate(top_pairs):
    print(idx, us_cookie, in_cookie)

#### Be Careful When Making Tuples (Demonstration)

In [None]:
# Use zip(), enumerate(), or () to make tuples.
items = ("vanilla", "chocolate")
print(type(items))
print(items)
print()
# It is the comma that creates the tuple.
items2 = "butter", "milk"
print(type(items2))
print(items2)
print()
# The extra comma creates a tuple.
items3 = "butter",
print(type(items3))
print(items3)
print()
# Make an empty tuple.
items4 = ()
print(type(items4))
print(items4)

#### Data Type Usage (Exercise)

The data type to use for immutable, ordered data is the tuple.

#### Using and Unpacking Tuples (Exercise)

In [None]:
# Zip together two lists of ranked names.
# The input lists have different lengths; zip stops at the end of the
# shorter list, so the number of pairs equals the length of the shorter list.
girl_names = [
    'JADA', 'Emily', 'Ava', 'SERENITY', 'Claire', 'SOPHIA', 'Sarah', 'ASHLEY', 'CHAYA',
    'ABIGAIL', 'Zoe', 'LEAH', 'HAILEY', 'AVA', 'Olivia', 'EMMA', 'CHLOE', 'Sophia',
    'AALIYAH', 'Angela', 'Camila', 'Savannah', 'Serenity', 'Chloe', 'Fatoumata',
    'ISABELLA', 'MIA', 'FIONA', 'Skylar', 'Ashley', 'Rachel', 'Sofia', 'Alina',
    'MADISON', 'RACHEL', 'CAMILA', 'CHANA', 'TAYLOR', 'Kayla', 'Miriam', 'Leah',
    'Grace', 'ANGELA', 'Isabella', 'Emma', 'KAYLA', 'SOFIA', 'Madison', 'Aaliyah',
    'Taylor', 'GENESIS', 'Esther', 'MAKAYLA', 'Victoria', 'Chaya', 'Brielle', 'Anna',
    'Samantha', 'ESTHER', 'GRACE', 'Mariam', 'Mia', 'NEVAEH', 'GABRIELLE', 'EMILY',
    'London', 'TIFFANY', 'Chana', 'Valentina', 'OLIVIA', 'LONDON', 'MIRIAM', 'SARAH',
    'ELLA']
boy_names = [
    'JOSIAH', 'ETHAN', 'David', 'Jayden', 'MASON', 'RYAN', 'CHRISTIAN', 'ISAIAH',
    'JAYDEN', 'Michael', 'NOAH', 'SAMUEL', 'SEBASTIAN', 'Noah', 'Dylan', 'LUCAS',
    'JOSHUA', 'ANGEL', 'Jacob', 'Matthew', 'Josiah', 'JACOB', 'Muhammad', 'ALEXANDER',
    'Jason', 'Ethan', 'DANIEL', 'Joseph', 'AIDEN', 'Moshe', 'Jeremiah', 'William',
    'Alexander', 'Sebastian', 'ERIC', 'MOSHE', 'Jack', 'Eric', 'MUHAMMAD', 'Lucas',
    'BENJAMIN', 'Aiden', 'Ryan', 'Liam', 'JASON', 'KEVIN', 'Elijah', 'Angel', 'JAMES',
    'Daniel', 'Samuel', 'Amir', 'Mason', 'Joshua', 'ANTHONY', 'JOSEPH', 'Benjamin',
    'JUSTIN', 'JEREMIAH', 'MATTHEW', 'Carter', 'James', 'TYLER', 'DAVID', 'JACK',
    'ELIJAH', 'MICHAEL', 'CHRISTOPHER']
print("Length of girl names: {}".format(len(girl_names)))
print("Length of boy names: {}".format(len(boy_names)))
pairs = list(zip(girl_names, boy_names))
print("Length of pairs: {}".format(len(pairs)))
# Print the first 5 of the 68 pairs:
for idx, pair in enumerate(pairs[:5]):
    girl_name, boy_name = pair
    print("Rank {}: {} and {}".format(idx, girl_name, boy_name))

#### Making Tuples by Accident (Exercise)

In [None]:
# Making tuples by accident.
normal = "simple"
error = "trailing comma",
print(type(normal))
print(type(error))

### Sets

Sets contain unique, unordered, mutable data, so they are good for finding all unique items from a source.
Sets are usually created from a list.

#### Creating Sets (Demonstration)

In [None]:
# Create a set.
cookies_eaten_today = ["chocolate chip", "peanut butter", "chocolate chip",
    "oatmeal cream", "chocolate chip"]
types_of_cookies_eaten = set(cookies_eaten_today)
print(type(types_of_cookies_eaten))
print(types_of_cookies_eaten)

#### Modifying Sets (Demonstration)

In [None]:
# Add a single element using .add().
types_of_cookies_eaten.add("biscotti")
types_of_cookies_eaten.add("chocolate chip")
print(types_of_cookies_eaten)

#### Updating Sets (Demonstration)

In [None]:
# Merge in another set or list with .update().
cookies_hugo_ate = ["chocolate chip", "anzac"]
types_of_cookies_eaten.update(cookies_hugo_ate)
print(types_of_cookies_eaten)

#### Remove Data from Sets (Demonstration)

In [None]:
# .discard() safely removes an element from the set by value.
types_of_cookies_eaten.discard("biscotti")
print(types_of_cookies_eaten)

# Note: There is no error when the element to remove doesn't exist.
types_of_cookies_eaten.discard("oreo")
print(types_of_cookies_eaten)

# .pop() removes and returns an *arbitrary* element from the set (KeyError when empty).
# Pop items from the set.
print(types_of_cookies_eaten)
print(types_of_cookies_eaten.pop())
print(types_of_cookies_eaten.pop())
print(types_of_cookies_eaten)

#### Set Operations - Similarities (Demonstration)

In [None]:
# .union() method returns a set of all names (or).
cookies_jason_ate = set(["chocolate chip", "oatmeal cream", "peanut butter"])
cookies_hugo_ate = set(["chocolate chip", "anzac"])
print(cookies_jason_ate.union(cookies_hugo_ate))
print(cookies_hugo_ate.union(cookies_jason_ate))

# .intersection() method identifies overlapping data (and).
print(cookies_jason_ate.intersection(cookies_hugo_ate))
print(cookies_hugo_ate.intersection(cookies_jason_ate))

#### Set Operations - Differences (Demonstration)

In [None]:
# .difference() method identifies data present in the set on which the
# method was used that is not in the arguments (minus)
print(cookies_jason_ate.difference(cookies_hugo_ate))
print(cookies_hugo_ate.difference(cookies_jason_ate))

#### Finding all the Data and the Overlapping Data between Sets (Exercise)

In [None]:
# Build two sets of baby names for 2011 and 2014.
# Use the str.title method to fix the names so capitalization is consistent.
baby_names_2011 = set([x[3].title() for x in records if x[0] == "2011"])
baby_names_2014 = set([x[3].title() for x in records if x[0] == "2014"])
print("Length of baby names from 2011: {}".format(len(baby_names_2011)))
print("Length of baby names from 2014: {}".format(len(baby_names_2014)))

# Find all names from 2011 and 2014.
all_names = baby_names_2011.union(baby_names_2014)
print("Length of all names from 2011 and 2014: {}".format(len(all_names)))

# Find the count of names in common from 2011 and 2014.
overlapping_names = baby_names_2011.intersection(baby_names_2014)
print("Length of common names from 2011 and 2014: {}".format(len(overlapping_names)))

#### Determining Set Differences (Exercise)

In [None]:
# The exercise includes building the set.
baby_names_2011 = set()
for row in records:
    if row[0] == "2011":
        baby_names_2011.add(row[3].title())
differences = baby_names_2011.difference(baby_names_2014)
print("Found {} differences.".format(len(differences)))
print()
# A set is unordered. Print 5 elements.
print(list(differences)[:5])
print()

# The set is printed in arbitrary order, so sort it.
print(sorted(differences)[:5])

## Dictionaries

Dictionaries are containers that hold key/value pairs. The key is usually a string or a number (it must be hashable), but the value can be any Python object. A value can itself be a dictionary, so dictionaries can be nested. You can iterate over a dictionary's keys, its values, or its items, which are tuples of a key and its value.

Dictionaries are useful for grouping data by time or structuring hierarchical data such as organizational charts. (Perhaps they would be useful also for tracking prerequisites for courses.)
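
As a sketch of the prerequisite idea, a dictionary can map each course to a list of the courses it requires. Only the first entry below comes from these notes; the other entries and the helper `all_prerequisites` are hypothetical, and the helper assumes the data contains no circular prerequisites.

```python
# Hypothetical prerequisite data; only the first entry is taken from these notes.
prerequisites = {
    "Data Types for Data Science in Python": ["Intermediate Python"],
    "Intermediate Python": ["Introduction to Python"],  # assumed for illustration
    "Introduction to Python": [],
}


def all_prerequisites(course, prereqs):
    """Return every direct and indirect prerequisite of a course.

    Assumes that the prerequisite data has no cycles.
    """
    found = []
    for required in prereqs.get(course, []):
        if required not in found:
            found.append(required)
        for indirect in all_prerequisites(required, prereqs):
            if indirect not in found:
                found.append(indirect)
    return found


chain = all_prerequisites("Data Types for Data Science in Python", prerequisites)
print(chain)
```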

The standard documentation: https://docs.python.org/3/library/stdtypes.html#mapping-types-dict.

A good tutorial: https://www.datacamp.com/community/tutorials/python-dictionary-comprehension.

### Using Dictionaries

#### Creating and Looping through Dictionaries (Demonstration)

We are not provided with the data source for the examples. In the course, the instructor uses different versions of the art_galleries variable. I have tried to straighten this out with the following variables:

| Variable | Data |
| :--- | :--- |
| art_galleries | keys are gallery names, values are zip codes |
| art_galleries2 | keys are gallery names, values are zip codes; used to demonstrate creating a dictionary from a dict comprehension |
| art_galleries3 | keys are zip codes, values are dictionaries where the keys are gallery names and the values are phone numbers |
| art_galleries4 | keys are gallery names, values are phone numbers |

In [None]:
# I copied the following data from one of the slides.
galleries = (
    ("Paige's Art Gallery", "10027"),
    ("Triple Candie", "10027"),
    ("Africart Motherland Inc", "10027"),
    ("Inner City Art Gallery Inc", "10027"),
    ("Zarre Andre Gallery", "10011"),
)
art_galleries = {}
for name, zip_code in galleries:
    art_galleries[name] = zip_code
print(art_galleries)

#### Create a Dictionary Using a Dictionary Comprehension (Extra)

In [None]:
# Use a dict comprehension to create a dict.
art_galleries2 = {x[0]: x[1] for x in galleries}
print(art_galleries2)
print(art_galleries == art_galleries2)

#### Printing in the Loop (Demonstration)

In [None]:
# Print the keys of a dictionary.
for name in art_galleries:
    print(name)

#### Overloading of `{}` for Creating Data Types (Extra)

In [None]:
# Note overloading of "{}" for creating dictionaries and sets.
var = {}
print(type(var))
var = {1}
print(type(var))
var = {"key": 1}
print(type(var))

#### Safely Finding by Key (Demonstration)

In [None]:
# Use a key to get the value.
# If the key is not found, a KeyError exception is raised.
print(art_galleries)
print(art_galleries["Africart Motherland Inc"])
try:
    print(art_galleries["Louvre"])
except KeyError as ex:
    # In a Jupyter notebook, must send the output to sys.stdout here.
    traceback.print_exc(limit=1, file=sys.stdout)

In [None]:
# Use the .get() method to safely access a key without error or
# exception handling; the method returns None by default, or you can
# supply a value to return.
print(art_galleries.get("Louvre"))
print(art_galleries.get("Louvre", "Not Found"))
print(art_galleries.get("Zarre Andre Gallery"))

#### Iterate through the Keys, Values, and Items of a Dictionary (Extra)

This is not well-explained by the course.

In [None]:
# Iterate through the keys, values, and items of a dict.
for name in art_galleries:
    print(name)
print()
for key in art_galleries.keys():
    print(key)
print()
for value in art_galleries.values():
    print(value)
print()
for key, value in art_galleries.items():
    print(key, value)

#### Working with Nested Dictionaries (Demonstration)

Nested data structures are a common way to deal with repeating data structures such as yearly data or hierarchical or grouped data.

In [None]:
# In this demonstration, the data is structured differently, with the key
# being a zip code and the value being a dictionary.
art_galleries3 = {}
galleries_10027 = {
    "Paige's Art Gallery": "(212) 531-1577",
    "Triple Candie": "(212) 865-0783",
    "Africart Motherland Inc": "(212) 368-6802",
    "Inner City Art Gallery Inc": "(212) 368-4941"
}
art_galleries3["10027"] = galleries_10027
print(art_galleries3.keys())
pprint.pprint(art_galleries3["10027"])

#### Accessing Nested Data (Demonstration)

In [None]:
# Print the telephone number of a specific gallery.
print(art_galleries3["10027"]["Inner City Art Gallery Inc"])

#### Creating and Looping through Dictionaries (Exercise)

In [None]:
# female_baby_names_2012 is a dict containing RANK and NAME, with the items
# sorted in order by rank.
# Create female_baby_names_2012 using the records variable that was loaded
# above when working with lists.
female_baby_names_2012 = {}
for record in records:
    # header: ['BRITH_YEAR', 'GENDER', 'ETHNICTY', 'NAME', 'COUNT', 'RANK']
    birth_year, sex, ethnicity, name, count, rank = record
    birth_year = int(birth_year)
    rank = int(rank)
    if birth_year == 2012 and sex == "FEMALE":
        female_baby_names_2012[rank] = name

# Sort the keys (ranks) in reverse order and print the first 10 values.
names_by_rank = {}
for rank, name in female_baby_names_2012.items():
    names_by_rank[rank] = name
for rank in sorted(female_baby_names_2012.keys(), reverse=True)[:10]:
    print(names_by_rank[rank])

#### Safely Finding by Key (Exercise)

In [None]:
# Use the .get() method. There are no values associated with ranks or keys
# 100 or 105.
print(female_baby_names_2012.get(7))
print(type(female_baby_names_2012.get(100)))
print(female_baby_names_2012.get(105, "Not Found"))

#### Dealing with Nested Data (Exercise)

In [None]:
# Build the data record containing boy names from 2013 and 2014; set
# the value for key 2012 to an empty dict.
unsorted_boy_names = {2012: {}, 2013: {}, 2014: {}}
for record in records:
    birth_year, sex, ethnicity, name, count, rank = record
    birth_year = int(birth_year)
    rank = int(rank)
    if sex == "MALE":
        if birth_year == 2013 or birth_year == 2014:
            unsorted_boy_names[birth_year][rank] = name

# Build a data structure sorted by birth year and rank.
boy_names = {2012: {}, 2013: {}, 2014: {}}
for birth_year in unsorted_boy_names.keys():
    boy_names[birth_year] = \
        {rank: unsorted_boy_names[birth_year][rank] for rank in sorted(unsorted_boy_names[birth_year])}

# Perform the exercise.
# Print the birth years.
print(boy_names.keys())
# Print the ranks for year 2013.
print(boy_names[2013].keys())
# Print the name ranked 3 for each year.
for birth_year in boy_names:
    print(birth_year, boy_names[birth_year].get(3, "Unknown"))

### Altering Dictionaries

#### Adding and Extending Dictionaries (Demonstration)

In [None]:
# Use assignment to add a new key/value to a dictionary.
# Once again, the instructor has changed the structure of the data in
# a variable.
galleries_10007 = {"Nyabinghi African Gift Shop": "(212) 566-3336"}
art_galleries3["10007"] = galleries_10007
print(art_galleries3["10007"])

#### Updating a Dictionary (Demonstration)

In [None]:
# Use .update() to update a dictionary from another dictionary, a list of
# tuples, or keywords.
# Here .update() is called on the nested dictionary for a zip code, so that
# nested dictionary must already exist; create an empty one first.
art_galleries3["11234"] = {}
galleries_11234 = [
    ("A J ARTS LTD", "(718) 763-5473"),
    ("Doug Meyer Fine Art", "(718) 375-8006"),
    ("Portrait Gallery", "(718) 377-8762")]
art_galleries3["11234"].update(galleries_11234)
print(art_galleries3["11234"])

#### Popping and Deleting from Dictionaries (Demonstration)

Use `del` to delete a key/value pair. (`del` is a statement, not a function.) When using `del`, the key must exist; otherwise a `KeyError` exception is raised.

The `.pop()` method also raises a `KeyError` exception if the key doesn't exist, unless you pass an optional second argument, which is returned instead when the key is missing.
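
The failure cases described above can be sketched as follows. The dictionary `galleries_by_zip` is a hypothetical stand-in with the same shape as `art_galleries3`, and the zip code `"99999"` is chosen only because it is missing.

```python
# Hypothetical data in the same shape as art_galleries3 (zip code -> galleries).
galleries_by_zip = {"10027": {"Triple Candie": "(212) 865-0783"}}

# del raises KeyError when the key is missing.
try:
    del galleries_by_zip["99999"]
except KeyError as exc:
    missing_key = exc.args[0]
print(missing_key)

# .pop() with a second argument returns that value instead of raising.
popped_default = galleries_by_zip.pop("99999", {})
print(popped_default)

# .pop() without a second argument raises KeyError, just like del.
try:
    galleries_by_zip.pop("99999")
    pop_raised = False
except KeyError:
    pop_raised = True
print(pop_raised)
```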

In [None]:
# Prepare the data by adding an art gallery for zip code 10310.
art_galleries3["10310"] = {"New Dorp Village Antiques Ltd": "(718) 815-2526"}

# Delete from a dictionary using the key.
del art_galleries3["11234"]

# Pop an item from a dictionary using the key.
galleries_10310 = art_galleries3.pop("10310")
print(galleries_10310)

#### Adding and Extending Dictionaries (Exercise)

In [None]:
# Build the needed data record.
unsorted_boy_names_2011 = {}
for record in records:
    birth_year, sex, ethnicity, name, count, rank = record
    birth_year = int(birth_year)
    rank = int(rank)
    if sex == "MALE" and birth_year == 2011:
        unsorted_boy_names_2011[rank] = name.title()
names_2011 = {rank: unsorted_boy_names_2011[rank]
              for rank in sorted(unsorted_boy_names_2011)}

# Perform the exercise.
boy_names[2011] = names_2011

# Update the 2012 key in boy_names with data in a list of tuples.
print(boy_names.get(2012, None))
boy_names[2012].update([(1, "Casey"), (2, "Aiden")])
print(boy_names.get(2012, None))

# Loop over boy_names. Sort the data for each year by descending rank
# and take the first result, which will be the lowest ranked name.
# Print the year and the least popular name or "Not Available" if it
# is not found, using the .get() method.
for year in sorted(boy_names):
    lowest_ranked = sorted(boy_names[year], reverse=True)[0]
    print(year, boy_names[year].get(lowest_ranked, "Not Available"))

#### Popping and Deleting from Dictionaries (Exercise)

In [None]:
# Set up the data record.
# This can also be done by creating a list of 2-tuples and using the
# list as the argument to dict().
female_names = {}
female_names[2011] = {
    1: 'Olivia', 2: 'Esther', 3: 'Rachel', 4: 'Leah', 5: 'Emma',
    6: 'Chaya', 7: 'Sarah', 8: 'Sophia', 9: 'Ava', 10: 'Miriam'}
female_names[2012] = {}
female_names[2013] = {
    1: 'Olivia', 2: 'Emma', 3: 'Esther', 4: 'Sophia', 5: 'Sarah',
    6: 'Leah', 7: 'Rachel', 8: 'Chaya', 9: 'Miriam', 10: 'Chana'}
female_names[2014] = {
    1: 'Olivia', 2: 'Esther', 3: 'Rachel', 4: 'Leah', 5: 'Emma',
    6: 'Chaya', 7: 'Sarah', 8: 'Sophia', 9: 'Ava', 10: 'Miriam'}

# Pop an item from the dictionary.
female_names_2011 = female_names.pop(2011)
# Key 2015 doesn't exist; return an empty dictionary in this case.
female_names_2015 = female_names.pop(2015, {})
# Delete the record for 2012.
del female_names[2012]
# Print what's left.
pprint.pprint(female_names)

### Pythonically Using Dictionaries

Use `.items()` and tuple unpacking to iterate through the keys and values of a dictionary.

Use `in` to test whether a key is present in a dictionary; returns `True` or `False`.

#### Getting Items from a Dictionary (Demonstration)

In [None]:
# Create a dictionary and print its items.
# It is annoying that the instructor keeps changing the data in
# the variables.
art_galleries4 = {"Miakey Art Gallery": "(718) 686-0788",
                  "Morning Star Gallery Ltd": "(212) 334-9330",
                  "New York Art Expo Inc": "(212) 363-8280"}
for gallery, phone_num in art_galleries4.items():
    print(gallery)
    print(phone_num)

#### Checking Dictionaries for Data (Demonstration)

In [None]:
# Use in to check for data.
# I had to correct the example to use "10007" instead of "10010"
# for finding the "Nyabinghi African Gift Shop".
print("11234" in art_galleries3)
if "10007" in art_galleries3:
    print("I found %s" % art_galleries3["10007"])
else:
    print("No galleries found")

#### Iterating through Dictionary Items (Exercise)

In [None]:
# Set up the data record.
baby_names = {
    2012: {},
    2013: {1: 'David', 2: 'Joseph', 3: 'Michael', 4: 'Moshe', 5: 'Daniel',
           6: 'Benjamin', 7: 'James', 8: 'Jacob', 9: 'Jack', 10: 'Alexander'},
    2014: {1: 'Joseph', 2: 'David', 3: 'Michael', 4: 'Moshe', 5: 'Jacob',
           6: 'Benjamin', 7: 'Alexander', 8: 'Daniel', 9: 'Samuel', 10: 'Jack'}}

# Perform the exercise.
for rank, name in baby_names[2014].items():
    print(rank, name)
for rank, name in baby_names[2012].items():
    print(rank, name)

#### Checking Dictionaries for Data (Exercise)

In [None]:
# Look for year in dictionary.
if 2011 in baby_names:
    print("Found 2011")

# Look for ranks in sub-dictionaries.
if 1 in baby_names[2012]:
    print('Found Rank 1 in 2012')
else:
    print('Rank 1 missing from 2012')
    
if 5 in baby_names[2013]:
    print('Found Rank 5')

### Working with CSV Files

We are not given the ART_GALLERY.csv file. In addition, the examples shown are not consistent. I created ART_GALLERY.csv using data from the first example. The second example using the DictReader contained different data.

I have revised the code to use `with`.

#### Reading from a CSV (Example)

In [None]:
# Use the code recommended by the documentation for the csv module.
with open("ART_GALLERY.csv", newline="") as csvfile:
    for row in csv.reader(csvfile):
        print(row)

#### Creating a Dictionary from a File (Example)

`csv.DictReader` reads each row of a CSV file into a dictionary. By default, the header row supplies the keys; if the file has no header row, pass the column names with the `fieldnames` argument. Each row is returned as an `OrderedDict` in Python 3.6 and 3.7 and as a regular `dict` in Python 3.8 and later.
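
A minimal sketch of both ways to supply the keys, using in-memory text in place of a file. The sample rows and the column names `name` and `zip` are assumptions made for illustration; `fieldnames` is the `csv.DictReader` parameter for passing column names.

```python
import csv
import io

# Hypothetical CSV text standing in for a file; the column names are assumptions.
with_header = io.StringIO(
    "name,zip\nTriple Candie,10027\nZarre Andre Gallery,10011\n")
# The first line is used as the keys.
rows_from_header = list(csv.DictReader(with_header))

# The same data without a header line; supply the keys with fieldnames.
no_header = io.StringIO("Triple Candie,10027\nZarre Andre Gallery,10011\n")
rows_from_fieldnames = list(
    csv.DictReader(no_header, fieldnames=["name", "zip"]))

print(rows_from_header[0]["name"])
print(rows_from_header == rows_from_fieldnames)
```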

In [None]:
# Use a csv.DictReader to process the file.
with open("ART_GALLERY.csv", newline="") as csvfile:
    for row in csv.DictReader(csvfile):
        print(row)

#### Reading from a File Using CSV Reader (Exercise)

In [None]:
# Collect some of the data from baby_names.csv.
# I revised the code for best practices.
# Note that the values in the baby_names dictionary are overwritten a few
# times before producing the final values. This is not a good example for
# reading the CSV file.
baby_names = {}
file_path = "baby_names.csv"
with open(file_path, newline="") as csvfile:
    csv_reader = csv.reader(csvfile)
    # Skip the header line.
    header = next(csv_reader)
    for row in csv_reader:
        # print(row)
        baby_names[int(row[5])] = row[3]
print(baby_names)
# pprint.pprint sorts the dict keys for pretty printing.
pprint.pprint(baby_names)

#### Creating a Dictionary from a File (Exercise)

In [None]:
# This uses csv.DictReader.
# There are 102 entries in the dict.
baby_names2 = {}
file_path = "baby_names.csv"
with open(file_path, newline="") as csvfile:
    for row in csv.DictReader(csvfile):
        baby_names2[int(row["RANK"])] = row["NAME"]
print(baby_names2)
# Produce the same result as pprint.pprint().
for key in sorted(baby_names2):
    print(str(key) + ": " + str(baby_names2[key]))

## The `collections` Module

In the course, this chapter was called "Meet the collections module".

### collections.Counter

The collections module contains advanced containers. `Counter` is a special dictionary used for counting data and measuring frequency. `Counter` is great for frequency problems.

#### Counter (Demonstration)

We are not given the data for the demonstration.

```Python
from collections import Counter
# nyc_eatery_types is a list.
nyc_eatery_count_by_types = Counter(nyc_eatery_types)
print(nyc_eatery_count_by_types)
# Output: Counter({"Mobile Food Truck": 114, "Food Cart": 74, ....})
print(nyc_eatery_count_by_types["Restaurant"])
# Output: 15
# .most_common() method returns counter values in descending order.
print(nyc_eatery_count_by_types.most_common(3))
# Output: [("Mobile Food Truck", 114), ("Food Cart", 74), ...]
```
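
Because the course data is not available, here is a runnable sketch of the same calls on a small made-up list. The list `eatery_types` and its contents are hypothetical.

```python
from collections import Counter

# Hypothetical sample; the course's nyc_eatery_types list is not available.
eatery_types = [
    "Food Cart", "Mobile Food Truck", "Food Cart",
    "Restaurant", "Food Cart", "Mobile Food Truck",
]
counts = Counter(eatery_types)

# Look up the count for one value.
print(counts["Food Cart"])

# A value that never occurred has a count of 0; no KeyError is raised.
print(counts["Snack Bar"])

# The two most frequent values, as (value, count) tuples.
top_two = counts.most_common(2)
print(top_two)
```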

#### Using Counter on Lists (Exercise)

In [None]:
# Read the data from the CSV file cta_daily_station_totals.csv
# and create the stations list.
# The fields in the CSV file are:
#   station_id, stationname, date, daytype, rides
# Note that in the exercise, the header is included in the stations list.
with open("cta_daily_station_totals.csv", newline="") as data_file:
    csv_reader = csv.reader(data_file)
    cta_daily_station_totals = [x for x in csv_reader]
stations = [x[1] for x in cta_daily_station_totals]
print(len(stations))

# Pass an iterable (list, set, tuple) or a dictionary to the Counter.
print(stations[:10])
station_count = Counter(stations)
print(station_count)

#### Finding the Most Common Items (Exercise)

In [None]:
# Find the 5 most common elements in station_count.
# Since everything has a count of 700, this isn't very useful.
print(station_count.most_common(5))

### collections.defaultdict

A good resource is https://realpython.com/python-defaultdict/.

This discusses how to use a default dict for
- grouping items in a collection
- counting items in a collection
- accumulating the values in a collection
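
Of the three uses listed above, accumulating is the one that the course examples do not show, so here is a minimal sketch. The `daily_riders` pairs are made-up values, not rows from the CTA file.

```python
from collections import defaultdict

# Hypothetical (stop, riders) pairs; not taken from the CTA data file.
daily_riders = [
    ("Austin-Forest Park", 700),
    ("Harlem-Lake", 1000),
    ("Austin-Forest Park", 300),
]

# int() returns 0, so every new stop starts with a running total of 0.
total_by_stop = defaultdict(int)
for stop, riders in daily_riders:
    total_by_stop[stop] += riders

print(dict(total_by_stop))
```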

#### Dictionary Handling (Demonstration)

The use case is complex data where we don't know what all the keys will be, but a list value is associated with each key. 

The code example iterates over a list of tuples. The code initializes the empty list for each new key. This is tedious, and a little autovivification would be useful here.

```Python
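# Start with an empty dictionary; nyc_eateries_parks is a list of
# (park_id, name) tuples from the course data.
eateries_by_park = {}
for park_id, name in nyc_eateries_parks: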
    if park_id not in eateries_by_park:
        eateries_by_park[park_id] = []
    eateries_by_park[park_id].append(name)
print(eateries_by_park["M010"])
```

#### Using `defaultdict` (Demonstration)

Pass `defaultdict` a factory, such as `list` or `int`, that is called with no arguments to create the value for any key that doesn't exist yet. The argument must be a callable or `None`. Otherwise, a `defaultdict` works like a regular dictionary.

```Python
from collections import defaultdict
eateries_by_park = defaultdict(list)
for park_id, name in nyc_eateries_parks:
    eateries_by_park[park_id].append(name)
print(eateries_by_park["M010"])
```
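
A runnable version of the grouping example, using made-up pairs because the course data is not available. It also shows one behavior worth knowing: reading a missing key with `[]` inserts that key with the default value, while `in` and `.get()` do not.

```python
from collections import defaultdict

# Hypothetical (park_id, name) pairs; the course data is not available.
nyc_eateries_parks = [
    ("M010", "Cart A"),
    ("B100", "Truck B"),
    ("M010", "Kiosk C"),
]
eateries_by_park = defaultdict(list)
for park_id, name in nyc_eateries_parks:
    eateries_by_park[park_id].append(name)
print(eateries_by_park["M010"])

# Testing with `in` or reading with .get() does not add a key.
found_before = "X999" in eateries_by_park
value_from_get = eateries_by_park.get("X999")
size_before = len(eateries_by_park)

# Reading a missing key with [] creates it with the default value.
empty_list = eateries_by_park["X999"]
size_after = len(eateries_by_park)
print(size_before, size_after)
```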

It is also common to use a defaultdict as a counter to count keys from a list of dictionaries. The use case is to count the number of eateries that have published phone numbers or a website.

Note that `int()` returns 0, so the default value of a key is 0. If you want a different value, use a regular dict and the `.setdefault()` method of the dict; see https://docs.python.org/3/library/stdtypes.html#dict.setdefault.

```Python
from collections import defaultdict
eatery_contact_types = defaultdict(int)
for eatery in nyc_eateries:
    if eatery.get("phone"):
        eatery_contact_types["phones"] += 1
    if eatery.get("website"):
        eatery_contact_types["websites"] += 1
print(eatery_contact_types)
```
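
The note about a non-zero starting value can be sketched as follows. The first option uses `.setdefault()` on a regular dict, as suggested above. The second option, a `defaultdict` with a `lambda` factory, is an alternative that the course does not mention. The cookie names and the starting value of 10 are made up.

```python
from collections import defaultdict

eaten = ["anzac", "biscotti", "anzac"]

# Option 1: a regular dict with .setdefault(); each cookie starts at 10.
stock = {}
for cookie in eaten:
    stock[cookie] = stock.setdefault(cookie, 10) - 1

# Option 2: a defaultdict whose factory is a zero-argument callable.
stock_dd = defaultdict(lambda: 10)
for cookie in eaten:
    stock_dd[cookie] -= 1

print(stock)
print(dict(stock_dd))
```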

It's annoying that we don't have the data behind these examples for experimentation.

#### Creating Dictionaries of Unknown Structure (Exercise)

In [None]:
# entries is a list of tuples, where a tuple looks like this:
# ("08/15/2016", "Jackson/Dearborn", 6919)
# This data can be obtained from some of the fields in cta_daily_station_totals.csv.
# We need fields 2, 1, and 4 (counting from 0) from the CSV file to get the
# data; these are date, stop, riders.
data_file_path = "cta_daily_station_totals.csv"
with open(data_file_path, newline="") as data_file:
    csv_reader = csv.reader(data_file)
    header = next(csv_reader)
    entries = [(x[2], x[1], int(x[4])) for x in csv_reader]
print(header)
print(len(entries))
print(entries[:5])
print()

# For each date, record the stop and the number of riders in a list
# of tuples. This part of the exercise does not use defaultdict.
ridership = {}
for date, stop, riders in entries:
    if date not in ridership:
        ridership[date] = []
    ridership[date].append((stop, riders))
print(ridership["03/09/2016"])
print()

#### Safely Appending to a Key's Value List Using defaultdict (Exercise)

In [None]:
# This time, use a defaultdict to initialize values for new keys.
# The result is identical to what we did above.
# When creating a defaultdict, pass it the type you want the value
# to have, such as list, tuple, set, int, str, dict, or any other
# valid type object.
ridership = defaultdict(list)
for date, stop, riders in entries:
    ridership[date].append((stop, riders))
print(ridership["03/09/2016"])
print()

# OK, that wasn't the exercise. This course needed an editor.
# Build a dict where the key is stop and the value is a list of riders.
# Print the first 10 items in the dict. Since this output is very large,
# print the number of riders values for each stop.
ridership = defaultdict(list)
for _, stop, riders in entries:
    ridership[stop].append(riders)
#print(list(ridership.items())[:10])
for stop in list(ridership.keys())[:10]:
    print(stop, len(ridership[stop]))

### collections.OrderedDict

Dictionaries preserve insertion order in Python 3.7 and later (and in CPython 3.6 as an implementation detail), reducing the need for this class.

```Python
from collections import OrderedDict
nyc_eatery_permits = OrderedDict()
for eatery in nyc_eateries:
    nyc_eatery_permits[eatery["end_date"]] = eatery
print(list(nyc_eatery_permits.items())[:3])
```

The `.popitem()` method of a dict object returns items in LIFO (last in first out) order. The `.popitem()` method of an OrderedDict object returns items in LIFO order by default; use `.popitem(last=False)` to return items in FIFO (first in first out) order.

```Python
print(nyc_eatery_permits.popitem())
print(nyc_eatery_permits.popitem(last=False))
```
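
OrderedDict still offers a few things a plain dict does not. The sketch below, which uses made-up permit data rather than the course's `nyc_eateries`, shows order-sensitive equality, `.move_to_end()`, and popping from either end.

```python
from collections import OrderedDict

# Hypothetical permit end dates mapped to made-up eatery names.
permits = OrderedDict()
permits["2017-01-31"] = "Eatery A"
permits["2018-06-30"] = "Eatery B"
permits["2019-12-31"] = "Eatery C"

# Equality between two OrderedDicts is order-sensitive; plain dicts ignore order.
reordered = OrderedDict(reversed(list(permits.items())))
print(permits == reordered)
print(dict(permits) == dict(reordered))

# move_to_end() repositions an existing key without changing its value.
permits.move_to_end("2017-01-31")
print(list(permits))

# Pop from the end (LIFO) and then from the front (FIFO).
last_item = permits.popitem()
first_item = permits.popitem(last=False)
print(last_item)
print(first_item)
```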

#### Working with Ordered Dictionaries (Exercise)

In [None]:
# This version uses the entries variable from above, which contains values
# for date, stop, and riders.
ridership_date = OrderedDict()
for date, _, riders in entries:
    if date not in ridership_date:
        ridership_date[date] = 0
    ridership_date[date] += riders
print(list(ridership_date.items())[:31])

#### Ordered Popping (Exercise)

In [None]:
# Powerful ordered popping. Popping removes a key and value from the
# dictionary and returns them as a tuple.
# Print the first key.
print(list(ridership_date.keys())[0])
# Pop an item in FIFO order.
print(ridership_date.popitem(last=False))
# Print the last key.
print(list(ridership_date.keys())[-1])
# Pop an item in LIFO order.
print(ridership_date.popitem())

### collections.namedtuple

See https://docs.python.org/3/library/collections.html#collections.namedtuple. See also Python Cookbook, Section 1.18.

A namedtuple can be used to create lightweight class objects. These objects are immutable (since they are subclasses of the tuple class). To change an attribute, use `._replace()`, which creates and returns a new namedtuple object.

A namedtuple is a tuple where each position (column) has a name. This is a container alternative to a dict or to a pandas DataFrame row.

In this example, Eatery is a subclass of tuple created by `namedtuple()`, and eateries is a list; the Eatery class is used to create the objects stored in the list.
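
As a minimal sketch of the immutability point, the following uses the same field names as the course example with made-up values. It shows `._replace()` returning a new object, direct assignment failing, and two helpers, `._fields` and `._asdict()`.

```python
from collections import namedtuple

Eatery = namedtuple("Eatery", ["name", "location", "park_id", "type_name"])
# Made-up values for illustration only.
original = Eatery("Eatery A", "Pier 1", "M010", "Food Cart")

# _replace() returns a new object; the original is unchanged.
moved = original._replace(location="Pier 2")
print(original.location, moved.location)

# Direct assignment fails because tuples are immutable.
try:
    original.location = "Pier 3"
except AttributeError as exc:
    print("AttributeError:", exc)

# Field names are available on the class, and an instance converts to a dict.
print(Eatery._fields)
print(moved._asdict())

# Positional indexing still works, as with any tuple.
print(moved[0] == moved.name)
```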

#### Creating a Namedtuple (Example)

```Python
from collections import namedtuple
Eatery = namedtuple("Eatery", ["name", "location", "park_id", "type_name"])
eateries = []
for eatery in nyc_eateries:
    details = Eatery(eatery["name"],
                     eatery["location"],
                     eatery["park_id"],
                     eatery["type_name"])
    eateries.append(details)
print(eateries[0])
```

#### Using a Namedtuple (Example)

Fields have names and are attributes. This makes code cleaner where you find yourself indexing a list or a tuple to get to fields of a data record. Each field is available as an attribute of the namedtuple.

```Python
for eatery in eateries:
    print(eatery.name)
    print(eatery.park_id)
    print(eatery.location)
```

#### Creating Namedtuples for Storing Data (Exercise)

In [None]:
# Use a namedtuple instead of a dict for storing data.
DateDetails = namedtuple("DateDetails", ["date", "stop", "riders"])
print(type(DateDetails))
print(DateDetails)
labeled_entries = []
for date, stop, riders in entries:
    labeled_entries.append(DateDetails(date=date, stop=stop, riders=riders))
print(type(labeled_entries[0]))
print(labeled_entries[:5])
print(len(labeled_entries))

In [None]:
# Repeat creating namedtuple objects without using keyword arguments.
# Show that labeled_entries and labeled_entries2 contain the same data.
labeled_entries2 = []
for entry in entries:
    labeled_entries2.append(DateDetails(*entry))
print(labeled_entries2[:5])
print(labeled_entries == labeled_entries2)

#### Using Attributes of Namedtuples (Exercise)

In [None]:
# How to access attributes of a namedtuple object.
for item in labeled_entries[:20]:
    print(item.stop)
    print(item.date)
    print(item.riders)

#### Loading CSV Data into Namedtuple Objects (Extra)

In [None]:
# Here's code that reads the CSV file and returns a list of namedtuple objects.
# I have replicated Python Cookbook section 6.1 here.
# Convert the rides value from a string to an int.
data_file_path = "cta_daily_station_totals.csv"
with open(data_file_path, newline="") as data_file:
    csv_reader = csv.reader(data_file)
    header = next(csv_reader)
    print(header)
    DailyTotal = namedtuple("DailyTotal", header)
    entries2 = [DailyTotal(row[0], row[1], row[2], row[3], int(row[4])) for row in csv_reader]
print(len(entries2))
print(entries2[:5])

## Handling Dates and Times

See https://docs.python.org/3.8/library/datetime.html#strftime-strptime-behavior. I had a lot of trouble with time zones when I worked through this chapter the first time. From my notes in 2020:

> pytz doesn't do what's expected, and it corrupts datetime objects if you don't use the timezone objects correctly. The course teaches the wrong way to do things. You get funny UTC offsets, and then you get odd times when you convert to other timezones, because pytz uses historical timezones as the default (so stupid!).

See also https://stackoverflow.com/questions/13866926/is-there-a-list-of-pytz-timezones, where people blame the API of pytz and the implementation of datetime.datetime for the problems I and others have observed. And see https://stackoverflow.com/questions/11473721/weird-timezone-issue-with-pytz/11474330. And see https://yongweiwu.wordpress.com/2019/09/01/time-zones-in-python/.

Python 3.9 added the zoneinfo module; see https://docs.python.org/3/library/zoneinfo.html.
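
Here is a minimal sketch of the zoneinfo approach, assuming Python 3.9 or later and a system that provides the IANA time zone database (on platforms without one, the `tzdata` package from PyPI supplies it). Unlike pytz, a `ZoneInfo` object can be attached with `tzinfo=` or `.replace()`, and no `.localize()` step is involved.

```python
import datetime
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# A naive timestamp that we know was recorded in New York City.
record_dt = datetime.datetime(2016, 7, 12, 16, 39)

# Attach the zone, then convert to other zones.
ny_dt = record_dt.replace(tzinfo=ZoneInfo("America/New_York"))
la_dt = ny_dt.astimezone(ZoneInfo("America/Los_Angeles"))
utc_dt = ny_dt.astimezone(datetime.timezone.utc)

print(ny_dt.isoformat())
print(la_dt.isoformat())
print(utc_dt.isoformat())
```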

### The datetime Module

#### Parsing Strings into Datetimes (Example)

In [None]:
# strptime stands for string parse time.
# strftime stands for string format time.
parking_violations_date = "06/11/2016"
print(parking_violations_date)
date = datetime.datetime.strptime(parking_violations_date, "%m/%d/%Y")
print(type(date))
print(date)

#### Date to String (Example)

In [None]:
# Convert a datetime.datetime object into a string.
print(date.strftime("%m/%d/%Y"))
print(date.isoformat())
# Create a string for use with MySQL.
print(date.isoformat(sep=" "))

#### Strings to `datetime` Objects (Exercise)

In [None]:
# Convert strings to datetime objects. I copied the input strings from
# the DataCamp console. To support the exercise after this, collect the
# datetime objects into a new list, datetimes_list.
dates_list = ['02/19/2001',
             '04/10/2001',
             '05/30/2001',
             '07/19/2001',
             '09/07/2001',
             '10/27/2001',
             '12/16/2001',
             '02/04/2002',
             '03/26/2002',
             '05/15/2002',
             '07/04/2002',
             '08/23/2002',
             '10/12/2002',
             '12/01/2002',
             '01/20/2003',
             '03/11/2003',
             '04/30/2003',
             '06/19/2003',
             '08/08/2003',
             '09/27/2003',
             '11/16/2003',
             '01/05/2004',
             '02/24/2004',
             '04/14/2004',
             '06/03/2004',
             '07/23/2004',
             '09/11/2004',
             '10/31/2004',
             '12/20/2004',
             '02/08/2005',
             '03/30/2005',
             '05/19/2005',
             '07/08/2005',
             '08/27/2005',
             '10/16/2005',
             '12/05/2005',
             '01/24/2006',
             '03/15/2006',
             '05/04/2006',
             '06/23/2006',
             '08/12/2006',
             '10/01/2006',
             '11/20/2006',
             '01/09/2007',
             '02/28/2007',
             '04/19/2007',
             '06/08/2007',
             '07/28/2007',
             '09/16/2007',
             '11/05/2007',
             '12/25/2007',
             '02/13/2008',
             '04/03/2008',
             '05/23/2008',
             '07/12/2008',
             '08/31/2008',
             '10/20/2008',
             '12/09/2008',
             '01/28/2009',
             '03/19/2009',
             '05/08/2009',
             '06/27/2009',
             '08/16/2009',
             '10/05/2009',
             '11/24/2009',
             '01/13/2010',
             '03/04/2010',
             '04/23/2010',
             '06/12/2010',
             '08/01/2010',
             '09/20/2010',
             '11/09/2010',
             '12/29/2010',
             '02/17/2011',
             '04/08/2011',
             '05/28/2011',
             '07/17/2011',
             '09/05/2011',
             '10/24/2011',
             '11/12/2011',
             '01/01/2012',
             '02/20/2012',
             '04/10/2012',
             '05/30/2012',
             '07/19/2012',
             '09/07/2012',
             '10/27/2012',
             '12/16/2012',
             '02/04/2013',
             '03/26/2013',
             '05/15/2013',
             '07/04/2013',
             '08/23/2013',
             '10/12/2013',
             '12/01/2013',
             '01/20/2014',
             '03/11/2014',
             '04/30/2014',
             '06/19/2014',
             '08/08/2014',
             '09/27/2014',
             '11/16/2014',
             '07/05/2014',
             '01/24/2015',
             '03/15/2015',
             '05/04/2015',
             '06/23/2015',
             '08/12/2015',
             '10/01/2015',
             '11/20/2015',
             '01/09/2016',
             '02/28/2016',
             '04/18/2016',
             '06/07/2016',
             '07/27/2016',
             '09/15/2016',
             '11/04/2016']
datetimes_list = []
for date_str in dates_list[:5]:
    date_dt = datetime.datetime.strptime(date_str, "%m/%d/%Y")
    print(date_dt)
    datetimes_list.append(date_dt)

#### `datetime` Objects to Strings (Exercise)

In [None]:
# Convert a datetime object to a string.
for item in datetimes_list:
    print(item.strftime("%m/%d/%Y"))
    print(item.isoformat())

### Working with datetime Components and Current Time

#### Using Datetime Components (Example)

Datetime components (day, month, year, hour, minute, second) are great for grouping data.

Here, given NYC parking violations data, count the number of violations for each day of the month. Sort the days of the month and print the number of violations.

```Python
daily_violations = defaultdict(int)
for violation in parking_violations:
    violation_date = datetime.datetime.strptime(violation[4], "%m/%d/%Y")
    daily_violations[violation_date.day] += 1
print(sorted(daily_violations.items()))
print(sorted(daily_violations.items(), key=lambda x: x[0]))
```

#### Sorting a List of Tuples (Extra)

What is the behavior of `sorted()` when sorting a list of tuples? See https://docs.python.org/3/howto/sorting.html and https://docs.python.org/3/reference/expressions.html#value-comparisons.

Sorting compares objects using their `__lt__` method. Pass a `key` function to `sorted()` when you want to sort on something other than the items' natural order:

```Python
mysortedlist = sorted(mylist, key=lambda tup: tup[1])
```

Without a key, tuples are compared item by item.

In [None]:
# Experiment with sorting.
daily_violations = {31: 44125, 15: 99122, 1: 80986}
print(sorted(daily_violations.items()))
# Here's a better key for the sorting. The key must be a
# callable that takes one argument and returns one value.
# The value returned can be a tuple.
print(sorted(daily_violations.items(), key=lambda x: x[0]))
# Sort by multiple keys
print(sorted(daily_violations.items(), key=lambda x: (x[0], x[1])))
# Reverse the keys.
print(sorted(daily_violations.items(), key=lambda x: (x[1], x[0])))
# reverse sort using only one key
print(sorted(daily_violations.items(), key=lambda x: x[0], reverse=True))

#### Getting the Current Time (Example)

Working with timezones is problematic. See https://pypi.org/project/pytz/ and https://docs.python.org/3/library/datetime.html#datetime-objects.

The best way to do this is to use the pytz module, but use only the `.localize()` and `.astimezone()` methods, as documented, to add timezone information to a datetime.datetime object. DO NOT set the tzinfo attribute directly, since this does not work!

Do not call `datetime.datetime.utcnow()` to get the current UTC time, because it returns a naive object. Call `datetime.datetime.now()`, localize it, and call `.astimezone(pytz.utc)` to convert it to UTC.

In [None]:
# Get the current local time and the current UTC time.
local_dt = datetime.datetime.now()
print(local_dt)
utc_dt = datetime.datetime.utcnow()
print(utc_dt)

#### Getting a Timezone-Aware Current Time (Extra)

Big digression.

In [None]:
# Get a timezone-aware local time using pytz. This requires knowing the
# local time zone.
dt_now = datetime.datetime.now()
dt_now_local = pytz.timezone('US/Eastern').localize(dt_now)
dt_now_utc = dt_now_local.astimezone(pytz.UTC)
print(dt_now)
print(dt_now_local)
print(dt_now_utc)

In [None]:
# From https://stackoverflow.com/questions/1111056/get-time-zone-information-of-the-system-in-python
# This works correctly without needing to know the current timezone, but
# the name of the timezone is not in pytz.
dt_now3 = datetime.datetime.now().astimezone(None)
print(dt_now3)
print(dt_now3.tzinfo)
print(dt_now3.tzname())
print(dt_now3.isoformat())
print()

dt_utc3 = dt_now3.astimezone(datetime.timezone.utc)
print(dt_utc3)
print(dt_utc3.tzinfo)
print(dt_utc3.tzname())
print(dt_utc3.isoformat())
print()

# Get the UTC offset for local time and use it to convert
# a recorded time to local time. I don't know if this works
# correctly for DST.
utcoffset = dt_now3.tzinfo.utcoffset(dt_now3)
print(utcoffset)
print()

converted_now3 = dt_now3.astimezone(datetime.timezone(utcoffset))
print(converted_now3)
print(converted_now3.tzname())

#### Conversions That Don't Work Correctly (Extra)

These conversions are wrong! I'm saving these here to document examples that fail.

In [None]:
# This sets the wrong time.
dt_utc2 = datetime.datetime.utcnow().astimezone()
print(dt_utc2)
print(dt_utc2.tzinfo)
print(dt_utc2.tzname())
print()

# This returns the wrong time.
dt_utc4 = datetime.datetime.utcnow()
print(dt_utc4)
dt_utc4 = dt_utc4.astimezone(pytz.UTC)
print(dt_utc4)
print(dt_utc4.tzinfo)
print(dt_utc4.tzname())
print()

# This returns the wrong time.
dt_utc5 = datetime.datetime.utcnow().astimezone(datetime.timezone.utc)
print(dt_utc5)
print(dt_utc5.tzinfo)
print(dt_utc5.tzname())
print()

#### More Conversions (Extra)

In [None]:
# The .now() method returns the current local datetime as a naive object
# having no timezone information.
print("datetime.datetime.now() [naive]...")
now = datetime.datetime.now()
print(now)
print(now.isoformat())
print(now.isoformat(sep=" "))
print(now.tzinfo)
print()

# The .utcnow() method returns the current UTC datetime as a naive object
# having no timezone information.
print("datetime.datetime.utcnow() [naive]...")
utcnow = datetime.datetime.utcnow()
print(utcnow)
print(utcnow.isoformat())
print(utcnow.tzinfo)
print()

# How to get an aware datetime.
# Note we call datetime.datetime.now(), passing timezone
# information to get the equivalent of an aware utcnow value.
print("datetime.datetime.now() [aware]...")
utcnow2 = datetime.datetime.now(datetime.timezone.utc)
print(utcnow2)
print(utcnow2.isoformat())
print(utcnow2.tzinfo)
print(type(utcnow2.tzinfo))
print(utcnow2.tzinfo.utcoffset(utcnow2))
print()

# How to get an aware datetime using pytz.
print("datetime.datetime.utcnow() [aware]...")
utcnow3 = datetime.datetime.utcnow()
utcnow3 = pytz.utc.localize(utcnow3)
print(utcnow3)
print(utcnow3.isoformat())
print(utcnow3.tzinfo)
print(type(utcnow3.tzinfo))
print(utcnow3.tzinfo.utcoffset(utcnow3))

#### Timezones (Example)

We return to the course presentation here.

Naive datetime objects have no timezone data; aware objects have a timezone.

Timezone data is available from the pytz module through its `timezone` objects. pytz needs to be installed; it is not part of the core installation.

Aware objects can be converted to another timezone with the `.astimezone()` method. [Extra] Since Python 3.6 you can also call `.astimezone()` on a naive object, which is presumed to be in the system local timezone; called without a tzinfo argument, it returns an aware object in the local timezone. See https://docs.python.org/3/library/datetime.html#datetime.datetime.astimezone.
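
The naive versus aware distinction can be checked in code. This is a small sketch using only the standard library; `is_aware()` is a hypothetical helper name (it follows the definition in the datetime documentation), and the fixed UTC-4 offset is a simplification that ignores daylight saving rules.

```python
import datetime

def is_aware(dt):
    """Return True if dt carries usable timezone information."""
    return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None

naive_dt = datetime.datetime(2016, 7, 12, 16, 39)
# A fixed UTC-4 offset stands in for New York daylight time.
edt = datetime.timezone(datetime.timedelta(hours=-4), "EDT")
aware_dt = naive_dt.replace(tzinfo=edt)

print(is_aware(naive_dt), is_aware(aware_dt))
print(aware_dt.astimezone(datetime.timezone.utc).isoformat())

# Mixing naive and aware objects in arithmetic raises TypeError.
try:
    aware_dt - naive_dt
except TypeError as exc:
    print("TypeError:", exc)
```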

In [None]:
# The slides are wrong. I reported this in 2020, and as of 2022-08-19,
# the slides are still wrong.
# I have spent hours and hours understanding how to do timezones correctly.

# strptime() should use "%m/%d/%Y %I:%M%p" ("%I" instead of "%H").
# From the documentation:
# When used with the strptime() method, the %p directive only affects the
# output hour field if the %I directive is used to parse the hour.

# Running the code as given in slides 13 and 14 gives a different result.
# The code below gives the correct result, knowing that the timezone for
# the data is "US/Eastern" or "America/New_York".
print("Create naive datetime object...")
record_dt = datetime.datetime.strptime("07/12/2016 04:39PM", "%m/%d/%Y %I:%M%p")
print("record_dt = {}".format(record_dt))
print("record_dt.tzinfo = {}".format(record_dt.tzinfo))
print()

# Create timezone-aware objects. We know that the times were recorded in
# New York City.
print("Create aware datetime_objects...")
ny_tz = pytz.timezone("US/Eastern")
la_tz = pytz.timezone("US/Pacific")
# Convert naive datetime to aware datetime using localize().
ny_dt = ny_tz.localize(record_dt)
# Convert aware datetime to a different timezone using astimezone().
la_dt = ny_dt.astimezone(la_tz)
print("record_dt = {}".format(record_dt))
print("record_dt = {}".format(record_dt.isoformat()))
print("    ny_dt = {}".format(ny_dt))
print("            {}".format(ny_dt.tzinfo))
print("    la_dt = {}".format(la_dt))
print("            {}".format(la_dt.tzinfo))
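
The `%I` versus `%H` point can be checked directly with the standard library. In this sketch the same string is parsed with both directives; with `%H` the `PM` marker is accepted but has no effect on the hour.

```python
import datetime

stamp = "07/12/2016 04:39PM"

# %H reads "04" as a 24-hour value, so %p does not change the hour.
with_24h = datetime.datetime.strptime(stamp, "%m/%d/%Y %H:%M%p")
# %I reads "04" as a 12-hour value, so %p shifts it to the afternoon.
with_12h = datetime.datetime.strptime(stamp, "%m/%d/%Y %I:%M%p")

print(with_24h)
print(with_12h)
print(with_12h - with_24h)
```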

#### Grouping by Parts of a datetime Object (Exercise)

In [None]:
# The course calls this exercise "Pieces of Time".

# Read the data into objects. Let's use a namedtuple here!
# I renamed entries to entries3 here to avoid problems with the
# entries variable above.
data_file_path = "cta_daily_summary_totals.csv"
print("Reading data from {}...".format(data_file_path))
with open(data_file_path, newline="") as data_file:
    csv_reader = csv.reader(data_file)
    header = next(csv_reader)
    DailyTotal = namedtuple("DailyTotal", header)
    # Use a list comprehension to build the list.
    entries3 = [DailyTotal(*row) for row in csv_reader]

# Iterate through the objects representing the rows of data.
# Convert the service_date attribute to a datetime object called service_datetime.
print("Getting monthly total rides...")
monthly_total_rides = defaultdict(int)
for entry in entries3:
    service_datetime = datetime.datetime.strptime(entry.service_date, "%m/%d/%Y")
    monthly_total_rides[service_datetime.month] += int(entry.total_rides)
print(monthly_total_rides)

#### Grouping by Parts of a datetime Object (Extra)

Repeat the exercise, but convert the data as it's being read.

In [None]:
# Repeat the exercise, converting the data as it's being read.
print("Reading data from {}...".format(data_file_path))
with open(data_file_path, newline="") as data_file2:
    csv_reader2 = csv.reader(data_file2)
    header = next(csv_reader2)
    DailyTotal2 = namedtuple("DailyTotal", header)
    # Use a list comprehension to build the list.
    entries4 = [
        DailyTotal2(
            datetime.datetime.strptime(row[0], "%m/%d/%Y"),
            row[1],
            row[2],
            int(row[3]),
            int(row[4])
        ) for row in csv_reader2]
# print(entries4)

# Iterate through the objects to calculate the totals.
print("Getting monthly total rides...")
monthly_total_rides2 = defaultdict(int)
for entry in entries4:
    # print(entry.service_date.month, entry.total_rides)
    monthly_total_rides2[entry.service_date.month] += entry.total_rides

# Show that the results are identical. Converting to plain dicts makes
# the comparison independent of the defaultdict type.
print(dict(monthly_total_rides) == dict(monthly_total_rides2))

#### Creating datetime Objects Now (Exercise)

In [None]:
# Get the current time.
print()
local_dt = datetime.datetime.now()
print(local_dt)
utc_dt = datetime.datetime.utcnow()
print(utc_dt)

#### Timezones (Exercise)

The conversion documented by the course is not correct. The cell below shows two ways not to make aware datetime objects, followed by a correct way.

In [None]:
# Note that the course's code corrupts the times because it uses
# pytz incorrectly. This is because datetime.datetime and pytz do
# not work as expected due to poor APIs.
daily_summaries = [
    (datetime.datetime(2001, 1, 1, 21, 36), '126455'),
    (datetime.datetime(2001, 1, 2, 3, 0), '501952'),
    (datetime.datetime(2001, 1, 3, 18, 29), '536432'),
    (datetime.datetime(2001, 1, 4, 18, 53), '550011'),
    (datetime.datetime(2001, 1, 5, 3, 10), '557917'),
    (datetime.datetime(2001, 1, 6, 15, 53), '255356'),
    (datetime.datetime(2001, 1, 7, 8, 18), '169825'),
    (datetime.datetime(2001, 1, 8, 8, 56), '590706'),
    (datetime.datetime(2001, 1, 9, 6, 0), '599905')
]
# Create pytz.timezone objects.
chicago_usa_tz = pytz.timezone("US/Central")
ny_usa_tz = pytz.timezone("US/Eastern")

# Bad conversion using .replace():
print("Bad conversion using .replace()...")
for orig_dt, ridership in daily_summaries:
    # This does not work as expected.
    # From the pytz documentation:
    # Unfortunately using the tzinfo argument of the standard datetime
    # constructors "does not work" with pytz for many timezones.
    chicago_dt = orig_dt.replace(tzinfo=chicago_usa_tz)
    ny_dt = chicago_dt.astimezone(ny_usa_tz)
    print('Original: %s, Chicago: %s, NY: %s, Ridership: %s' % (orig_dt, chicago_dt, ny_dt, ridership))
print()

# Bad conversion using .astimezone():
print("Bad conversion using .astimezone()...")
# This uses the system time zone for the naive datetime object before the
# conversion, which may give wrong results, as it does here when I'm
# in the Eastern or Pacific time zones.
for orig_dt, ridership in daily_summaries:
    chicago_dt = orig_dt.astimezone(chicago_usa_tz)
    ny_dt = chicago_dt.astimezone(ny_usa_tz)
    print('Original: %s, Chicago: %s, NY: %s, Ridership: %s' % (orig_dt, chicago_dt, ny_dt, ridership))

# Here's how it should be done.
# This localizes the time to the Chicago time zone, then performs
# the conversion.
print()
print("Good conversion using .localize()...")
for orig_dt, ridership in daily_summaries:
    chicago_dt = chicago_usa_tz.localize(orig_dt)
    ny_dt = chicago_dt.astimezone(ny_usa_tz)
    print('Original: %s, Chicago: %s, NY: %s, Ridership: %s' % (orig_dt, chicago_dt, ny_dt, ridership))

### Adding and Subtracting Time

#### Using timedelta (Example)

In [None]:
# Use the datetime.timedelta class to represent an amount of change in time.
# Can add or subtract a set amount of time from a datetime object using
# a timedelta object.
flashback = datetime.timedelta(days=90)
print(record_dt)
print(record_dt - flashback)
print(record_dt + flashback)

#### Datetime Differences (Example)

In [None]:
# Calculate datetime differences.
# The course's example doesn't make sense since record2_dt should be later
# than record_dt. I created a string to convert to record2_dt to get the
# same answer as the course.
record2_dt = datetime.datetime.strptime("07/12/2016 04:38:56PM", "%m/%d/%Y %I:%M:%S%p")
time_diff = record_dt - record2_dt
print(type(time_diff)) # datetime.timedelta
print(time_diff) # 0:00:04
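
A few more timedelta operations are sketched below, using the first date range from the differences exercise. It shows `.days`, `.total_seconds()`, what a negative difference looks like, and division of one timedelta by another.

```python
import datetime

start_dt = datetime.datetime(2001, 1, 30)
end_dt = datetime.datetime(2001, 3, 1)

# Subtracting two datetimes yields a timedelta.
elapsed = end_dt - start_dt
print(elapsed.days)
print(elapsed.total_seconds())

# Reversing the operands gives a negative timedelta; abs() normalizes it.
reverse = start_dt - end_dt
print(reverse)
print(abs(reverse) == elapsed)

# Dividing by another timedelta returns a float (here, the number of weeks).
print(elapsed / datetime.timedelta(weeks=1))
```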

#### Finding a Time in the Future and from the Past (Exercise)

In [None]:
# Reuse entries4 from above. This is a list of namedtuple objects with the
# data converted to the correct types.
# entry.day_type contains the day type; entry.total_rides contains
# total_ridership.
# Build a dictionary with a datetime key and a value of a dictionary
# with "day_type" and "total_ridership" keys.
daily_summaries = {}
for entry in entries4:
    daily_summaries[entry.service_date] = \
        {"day_type": entry.day_type, "total_ridership": entry.total_rides}

review_dates = [
    datetime.datetime(2013, 12, 22, 0, 0),
    datetime.datetime(2013, 12, 23, 0, 0),
    datetime.datetime(2013, 12, 24, 0, 0),
    datetime.datetime(2013, 12, 25, 0, 0),
    datetime.datetime(2013, 12, 26, 0, 0),
    datetime.datetime(2013, 12, 27, 0, 0),
    datetime.datetime(2013, 12, 28, 0, 0),
    datetime.datetime(2013, 12, 29, 0, 0),
    datetime.datetime(2013, 12, 30, 0, 0),
    datetime.datetime(2013, 12, 31, 0, 0)
]

glanceback = datetime.timedelta(days=30)
for date in review_dates:
    prior_period_dt = date - glanceback
    print("Date: {}, Type: {}, Total Ridership: {}".format(
        date,
        daily_summaries[date]["day_type"],
        daily_summaries[date]["total_ridership"]))
    print("Date: {}, Type: {}, Total Ridership: {}".format(
        prior_period_dt, 
        daily_summaries[prior_period_dt]["day_type"],
        daily_summaries[prior_period_dt]["total_ridership"]))
    print()

#### Find Differences in datetimes (Exercise)

In [None]:
# Build the zip object. This is truncated compared to the zip object
# used in the course.
start_dates = (
    datetime.datetime(2001, 1, 30, 0, 0),
    datetime.datetime(2001, 3, 31, 0, 0),
    datetime.datetime(2001, 5, 30, 0, 0),
    datetime.datetime(2001, 7, 29, 0, 0),
    datetime.datetime(2001, 9, 27, 0, 0),
)
end_dates = (
    datetime.datetime(2001, 3, 1, 0, 0),
    datetime.datetime(2001, 4, 30, 0, 0),
    datetime.datetime(2001, 6, 29, 0, 0),
    datetime.datetime(2001, 8, 28, 0, 0),
    datetime.datetime(2001, 10, 27, 0, 0)
)
date_ranges = zip(start_dates, end_dates)

# Subtract one datetime object from another.
for start_date, end_date in date_ranges:
    print(end_date, start_date)
    print(end_date - start_date)

### Libraries for Date Manipulations

This section was called "HELP! Libraries to make it easier" in the course.

The pendulum module (https://pypi.org/project/pendulum/) is a drop-in replacement for datetime, except for some problems with MySQL and sqlite3 (which check the type of a datetime object).

The pendulum.parse() method does not require a format string for conversion, but it is limited to certain input formats.

#### Parsing Time with pendulum (Example)

In [None]:
# I am replicating the example here.
# I could not get pendulum.parse() to parse the datetime string, but
# I did not know the original values of violation[4] and violation[5].
violation = ["", "", "", "", "6/11/2016", "2:38:00 P"]
occurred = violation[4] + " " + violation[5] + "M"
print(occurred)
occurred_dt = pendulum.from_format(occurred, "M/DD/YYYY h:m:ss A", tz="US/Eastern")
print(occurred_dt)
print()

# Parse works with ISO8601 strings.
violation2 = ["", "", "", "", "2016-06-11", "14:38:00"]
occurred2 = violation2[4] + "T" + violation2[5]
print(occurred2)
occurred2_dt = pendulum.parse(occurred2, tz="US/Eastern")
print(occurred2_dt)

#### Converting Timezones with pendulum (Example)

The `.in_timezone()` method converts a pendulum object to a desired timezone.

The `.now()` method accepts a timezone you want the current time in.

In [None]:
# Recreate the data. The behavior on printing is very different from
# the example presented in the movie.
violation_dts = [
    pendulum.parse("2016-06-11T14:38:00-04:00"),
    pendulum.parse("2016-06-25T14:09:00-04:00"),
    pendulum.parse("2016-01-04T09:52:00-05:00"),
]
pprint.pprint(violation_dts)
print()

# Convert the datetimes to Japan time.
for violation_dt in violation_dts:
    print(violation_dt.in_timezone("Asia/Tokyo"))
print()

# Get the current time in Tokyo.
print(pendulum.now("Asia/Tokyo"))

#### Humanizing Time Differences (Example)

Pendulum provides methods for natural time differences.

In [None]:
# Replicate the example.
# Once again, my output doesn't match what is presented (sign difference).
pdt1 = pendulum.parse("2016-04-26T07:09:00-04:00")
pdt2 = pendulum.parse("2016-04-23T07:49:00-04:00")
diff = pdt2 - pdt1
print(diff)
print(diff.in_words())
print(diff.in_days())
print(diff.in_hours())

#### Localizing Time with pendulum (Exercise)

In [None]:
# Create a now datetime for Tokyo: tokyo_dt
tokyo_dt = pendulum.now(tz="Asia/Tokyo")

# Convert the tokyo_dt to Los Angeles: la_dt
la_dt = tokyo_dt.in_timezone("America/Los_Angeles")

# Print the ISO 8601 string of la_dt
print(la_dt)
print(la_dt.to_iso8601_string())

#### Humanizing Differences with pendulum (Exercise)

In [None]:
# Replicate the data.
start_dates = (
    "01/30/2001",
    "03/31/2001",
    "05/30/2001",
    "07/29/2001",
    "09/27/2001"
)
end_dates = (
    "03/01/2001",
    "04/30/2001",
    "06/29/2001",
    "08/28/2001",
    "10/27/2001"
)
date_ranges = zip(start_dates, end_dates)
for start_date, end_date in date_ranges:
    start_dt = pendulum.parse(start_date, strict=False)
    end_dt = pendulum.parse(end_date, strict=False)
    print(end_dt, start_dt)
    diff_period = end_dt - start_dt
    print(diff_period.in_days())    

## Answering Data Science Questions

This chapter applies what we have learned to answering questions about the data for crime in Chicago.

### Case Study - Counting Crimes

Counting within date ranges.

#### Dataset Overview

The data is stored in a CSV file named crime_sampler.csv. The dataset has been reduced to make it manageable. The full dataset is available on Chicago's Open Data Portal at https://data.cityofchicago.org/.

The fields in the data file, crime_sampler.csv, are:
| Field | Description |
| :--- | :--- |
| Date | datetime |
| Block | block |
| Primary Type | primary type of crime |
| Description | description of the crime |
| Location Description | description of the location |
| Arrest | if an arrest was made ("true" or "false") |
| Domestic | if the crime was a domestic case ("true" or "false") |
| District | the city district number |

What we're going to do is find the locations with the most crimes each month.

#### Read Data and Establish Data Containers (Exercise)

In [None]:
# Read the data. There is a header line and many rows of data.
# This course does not use pandas.read_csv() to load the data into a
# pandas DataFrame.
# I made some revisions to this code.
crime_data = []
with open("crime_sampler.csv", newline="") as csvfile:
    for row in csv.reader(csvfile):
        # Keep data for the "Date", "Primary Type", "Location Description",
        # and "Arrest" fields.
        crime_data.append((row[0], row[2], row[4], row[5]))
# Remove the header row.
del crime_data[0]
# crime_data.pop(0)
print(crime_data[:10])
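
An alternative to deleting the header after the fact is to consume it with next() before the loop. This is a sketch only: it reads from an in-memory string instead of crime_sampler.csv, and the two data rows are hypothetical values that follow the column order in the field table above.

```python
import csv
import io

# Hypothetical rows; the column order matches the field table.
sample_csv = io.StringIO(
    "Date,Block,Primary Type,Description,Location Description,Arrest,Domestic,District\n"
    "05/23/2016 05:35:00 PM,001XX N STATE ST,THEFT,RETAIL THEFT,DEPARTMENT STORE,true,false,1\n"
    "03/28/2016 08:00:00 PM,0000X W TERMINAL ST,ASSAULT,SIMPLE,AIRPORT,false,false,16\n"
)

reader = csv.reader(sample_csv)
# Consume the header row so that it never enters the data list.
header = next(reader)
# Keep the Date, Primary Type, Location Description, and Arrest fields.
sample_crime_data = [(row[0], row[2], row[4], row[5]) for row in reader]

print(header)
print(sample_crime_data)
```

Keeping the header in its own variable also documents which index corresponds to which field.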

#### Find the Months with the Highest Number of Crimes (Exercise)

Use collections.Counter for counting.

In [None]:
# Count the number of crimes per month, and print the months and
# the number of crimes for the three months with the most crimes.
crimes_by_month = Counter()
for crime in crime_data:
    date = datetime.datetime.strptime(crime[0], "%m/%d/%Y %I:%M:%S %p")
    crimes_by_month[date.month] += 1
print(crimes_by_month.most_common(3))
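
One thing to keep in mind is that counting on date.month alone merges the same month from different years. The sketch below uses made-up date strings (not rows from the real file) to compare counting by month with counting by (year, month).

```python
from collections import Counter
import datetime

DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"

# Made-up date strings in the format assumed for crime_sampler.csv.
sample_dates = [
    "01/05/2016 09:00:00 AM",
    "01/17/2016 11:30:00 PM",
    "01/20/2017 01:15:00 PM",
    "02/02/2016 07:45:00 AM",
    "02/14/2016 12:00:00 PM",
    "03/03/2016 03:10:00 PM",
]
parsed_dates = [datetime.datetime.strptime(d, DATE_FORMAT) for d in sample_dates]

# Counting by month number combines January 2016 with January 2017.
month_counts = Counter(dt.month for dt in parsed_dates)
# Counting by (year, month) keeps the years separate.
year_month_counts = Counter((dt.year, dt.month) for dt in parsed_dates)

print(month_counts.most_common(2))
print(year_month_counts)
```

Which key is appropriate depends on the question: month alone shows seasonality, while (year, month) shows a timeline.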

#### Create Month and Location Data Containers (Exercise)

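Frustratingly, Jason changed the data structure of the crime_data variable for this exercise! Each element is now the complete row (a list of all eight fields) instead of a four-field tuple, so the location description moves from index 2 to index 4.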

In [None]:
# Reread the data, this time keeping every field in each row.
crime_data = []
with open("crime_sampler.csv", newline="") as csvfile:
    for row in csv.reader(csvfile):
        crime_data.append(row)
# Remove the header row.
del crime_data[0]
print(crime_data[:3])

# Build a data structure using a defaultdict with the keys being the
# month number and the values being a list of crime locations.
locations_by_month = defaultdict(list)
for crime in crime_data:
    date = datetime.datetime.strptime(crime[0], "%m/%d/%Y %I:%M:%S %p")
    # Keep only crimes from 2016. Index 4 is "Location Description".
    if date.year == 2016:
        locations_by_month[date.month].append(crime[4])
# Don't print the voluminous output.
# print(locations_by_month)
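
The grouping pattern is easier to see on a tiny input. This is a minimal sketch using hypothetical (month, location) pairs in place of parsed rows from the file.

```python
from collections import defaultdict

# Hypothetical (month, location) pairs standing in for parsed rows.
sample_rows = [(1, "STREET"), (1, "APARTMENT"), (2, "STREET"), (1, "STREET")]

sample_locations_by_month = defaultdict(list)
for month, location in sample_rows:
    # A missing month key is created with an empty list on first access,
    # so no "if month not in ..." check is needed.
    sample_locations_by_month[month].append(location)

print(dict(sample_locations_by_month))
```

Duplicates are kept in each list on purpose, because the next step counts how often each location appears.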

#### Find the Most Common Crimes by Location Type by Month in 2016 (Exercise)

For each month in 2016, we have a list of locations. Use collections.Counter to count the locations and show the five most common ones for each month.

In [None]:
for month, locations in locations_by_month.items():
    location_count = Counter(locations)
    print(month)
    print(location_count.most_common(5))
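
The loop above visits months in the order they were first seen in the file, which is not necessarily calendar order. Here is a sketch of iterating in calendar order with readable month names, using a hypothetical stand-in for locations_by_month.

```python
import calendar
from collections import Counter

# Hypothetical stand-in for locations_by_month, with keys out of order.
sample_locations_by_month = {
    3: ["STREET", "STREET", "SIDEWALK"],
    1: ["APARTMENT", "STREET", "APARTMENT"],
}

top_by_month = {}
# sorted() on a dictionary yields its keys in ascending order.
for month in sorted(sample_locations_by_month):
    counts = Counter(sample_locations_by_month[month])
    month_name = calendar.month_name[month]
    top_by_month[month_name] = counts.most_common(1)
    print(month_name, top_by_month[month_name])
```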

### Case Study - Crimes by District and Differences by Block

This exercise uses dictionaries with time windows for keys. We will figure out how many crimes occurred per district and how types of crimes differed between city blocks.

Step 1: The code will use a csv.DictReader to read each row of the file as a dictionary. It will pop the "District" key and its value from each row and use that value as the key under which the rest of the row is stored.

Step 2: We will iterate over the resulting dictionary with .items() and use collections.Counter to determine the number of arrests in each district for each year.

Step 3: Finally, we will build a set of the unique crime types for each of two blocks and use the set .difference() method to find the crime types that occurred on one block but not the other.

In my opinion, this was poorly explained.

#### Read Data with csv.DictReader and Create Data Containers (Exercise)

In [None]:
crimes_by_district = defaultdict(list)
with open("crime_sampler.csv", newline="") as csvfile:
    for row in csv.DictReader(csvfile):
        # We want to use the "District" field as the key for the rest of the
        # data. This is where row.pop() is useful. We remove the district
        # key and value from the dictionary and save the dictionary as the
        # value associated with the key.
        district = row.pop("District")
        crimes_by_district[district].append(row)
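
To see what row.pop() does to each row, here is a small sketch that reads hypothetical rows from an in-memory string with a reduced set of columns. The values are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows with a reduced set of columns.
sample_csv = io.StringIO(
    "Date,Primary Type,Arrest,District\n"
    "05/23/2016 05:35:00 PM,THEFT,true,1\n"
    "03/28/2016 08:00:00 PM,ASSAULT,false,16\n"
    "07/04/2016 10:00:00 AM,BATTERY,true,1\n"
)

sample_by_district = defaultdict(list)
for row in csv.DictReader(sample_csv):
    # pop() returns the value and removes the key from the row.
    district = row.pop("District")
    sample_by_district[district].append(row)

print(sorted(sample_by_district))
print(sample_by_district["16"])
```

Note that the district keys are strings ("1", "16"), because the csv module does not convert values to numbers.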

If I try to print the entire crimes_by_district dictionary, Jupyter reports this error:

```
IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)
```
To raise the limit, see https://stackoverflow.com/questions/52730839/how-do-i-change-notebookapp-iopub-data-rate-limit-for-jupyter

In [None]:
print(crimes_by_district)
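
Rather than raising the rate limit, it may be enough to print a summary or a small slice of the dictionary. Here is a sketch of that approach, using a hypothetical stand-in for crimes_by_district.

```python
import pprint

# Hypothetical stand-in for crimes_by_district.
sample_by_district = {
    "1": [{"Primary Type": "THEFT"}, {"Primary Type": "BATTERY"}],
    "16": [{"Primary Type": "ASSAULT"}],
}

# Summarize with the number of records per district.
record_counts = {
    district: len(crimes) for district, crimes in sample_by_district.items()
}
print(record_counts)

# Look at only the first record of a single district.
pprint.pprint(sample_by_district["1"][:1])
```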

#### Determine the Arrests by District by Year (Exercise)

This is Step 2 from above.

In [None]:
for district, crimes in crimes_by_district.items():
    print(district)
    year_count = Counter()
    for crime in crimes:
        if crime["Arrest"] == "true":
            year = datetime.datetime.strptime(
                crime["Date"], "%m/%d/%Y %I:%M:%S %p"
            ).year
            year_count[year] += 1
    print(year_count)
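
The loop above prints each district's counts and then discards them. If the counts are needed later, one option is a defaultdict of Counters. This is a sketch under that assumption, not the course's solution, and the records are hypothetical.

```python
from collections import Counter, defaultdict
import datetime

DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"

# Hypothetical stand-in for crimes_by_district.
sample_by_district = {
    "1": [
        {"Date": "05/23/2016 05:35:00 PM", "Arrest": "true"},
        {"Date": "07/04/2017 10:00:00 AM", "Arrest": "true"},
        {"Date": "08/01/2017 09:00:00 AM", "Arrest": "false"},
    ],
    "16": [{"Date": "03/28/2016 08:00:00 PM", "Arrest": "true"}],
}

# Outer key: district. Inner Counter: year -> number of arrests.
arrests_by_district = defaultdict(Counter)
for district, crimes in sample_by_district.items():
    for crime in crimes:
        # The Arrest field is the string "true" or "false", not a bool.
        if crime["Arrest"] == "true":
            year = datetime.datetime.strptime(crime["Date"], DATE_FORMAT).year
            arrests_by_district[district][year] += 1

print(arrests_by_district["1"])
print(arrests_by_district["16"][2016])
```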

#### Unique Crimes by City Block (Exercise)

In [None]:
# This requires a dictionary for which the keys are blocks and the values are
# a list of the crimes observed for the block.
crimes_by_block = defaultdict(list)
with open("crime_sampler.csv", newline="") as csvfile:
    for row in csv.DictReader(csvfile):
        block = row["Block"]
        crimes_by_block[block].append(row["Primary Type"])
# print(crimes_by_block)

# Identify crime types that occurred on 001XX N State St but not
# on 0000X W Terminal St.
n_state_st_crimes = set(crimes_by_block["001XX N STATE ST"])
print(n_state_st_crimes)
w_terminal_st_crimes = set(crimes_by_block["0000X W TERMINAL ST"])
print(w_terminal_st_crimes)
crime_differences = n_state_st_crimes.difference(w_terminal_st_crimes)
print(crime_differences)
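
Note that .difference() is not symmetric: it returns only the items in the first set that are missing from the second. Here is a short sketch of the related set operations, using hypothetical crime types for the two blocks.

```python
# Hypothetical crime types for two blocks.
state_st = {"THEFT", "ROBBERY", "BATTERY"}
terminal_st = {"THEFT", "NARCOTICS"}

# On State St but not on Terminal St.
only_state = state_st.difference(terminal_st)
# The reverse direction, written with the - operator.
only_terminal = terminal_st - state_st
# On exactly one of the two blocks.
either_not_both = state_st.symmetric_difference(terminal_st)
# On both blocks.
shared = state_st & terminal_st

# Sets are unordered, so sort before printing for stable output.
print(sorted(only_state))
print(sorted(only_terminal))
print(sorted(either_not_both))
print(sorted(shared))
```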