# Working with Dates and Times in Python

## Introduction

These are my notes for DataCamp's course [_Working with Dates and Times in Python_](https://www.datacamp.com/courses/working-with-dates-and-times-in-python).

This course is presented by Max Shron, data scientist and author. Collaborators are Chester Ismay and Sumedh Panchadhar.

Prerequisite:

- [_Data Manipulation with Pandas_](../Data%20Manipulation%20with%20Pandas/Data%20Manipulation%20with%20Pandas.ipynb)

This course is part of these tracks:

- Data Scientist with Python
- Data Scientist Professional with Python
- Python Programmer
- Python Toolbox

### Notes

Modules for managing timezone data in Python include `zoneinfo`, `dateutil`, and `pytz`. See https://developers.home-assistant.io/blog/2021/05/07/switch-pytz-to-python-dateutil/ about problems using `pytz`. I also experienced problems using `pytz` in the "Data Types for Data Science in Python" course.

This course uses the `dateutil` module.

With the release of Python 3.9, it is recommended to use Python's `zoneinfo` module (https://docs.python.org/3/library/zoneinfo.html) together with the `tzdata` package from PyPI (https://pypi.org/project/tzdata/). `tzdata` is generally needed only where the system provides no timezone database, as on Windows servers.
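
A minimal sketch of the `zoneinfo` approach, assuming Python 3.9 or later. The variable names and sample dates are illustrative only, and the `try`/`except` covers systems that have no timezone database installed.

```python
import datetime
import zoneinfo

try:
    eastern = zoneinfo.ZoneInfo("America/New_York")
except zoneinfo.ZoneInfoNotFoundError:
    # No system timezone database was found; installing tzdata provides one.
    eastern = None
    print("No timezone database found; try installing the tzdata package.")

if eastern is not None:
    # The same tzinfo object yields different offsets in summer and winter.
    summer = datetime.datetime(2017, 7, 1, 12, 0, tzinfo=eastern)
    winter = datetime.datetime(2017, 12, 30, 12, 0, tzinfo=eastern)
    print(summer.isoformat())
    print(winter.isoformat())
```

When the database is available, the first line printed should carry the daylight saving offset and the second the standard offset.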

For more information about the problems using `pytz`, see https://blog.ganssle.io/tag/timezones.html.

If you're collecting data, store datetime values in UTC or with a fixed UTC offset!
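
A minimal sketch of storing a timestamp in UTC, using only the standard library. The names `recorded`, `stored`, and `restored` are illustrative.

```python
import datetime

# Record the current moment as a timezone-aware datetime in UTC.
recorded = datetime.datetime.now(datetime.timezone.utc)

# Serialize to an ISO 8601 string; the "+00:00" suffix keeps the offset explicit.
stored = recorded.isoformat()
print(stored)

# Parse the string back; the result is still timezone-aware and compares equal.
restored = datetime.datetime.fromisoformat(stored)
print(restored == recorded)
```

Keeping the offset in the stored string means the value can be read back without guessing which timezone it was recorded in.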

This is a very good course, but it should be updated to use the `zoneinfo` and `tzdata` modules.

Chapter 4 (Dates and Times in pandas) of this course requires completion of the [_Data Manipulation with Pandas_](../Data%20Manipulation%20with%20Pandas/Data%20Manipulation%20with%20Pandas.ipynb) course.

## Imports

Imports are collected here for clarity and convenience.

In [None]:
import calendar
import collections
import datetime
import pickle
import pprint
import sys
import traceback

import dateutil.tz
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pytz.exceptions # for pandas support

## Datasets

| Name | File |
| :--- | :--- |
| Florida Hurricanes | florida_hurricane_dates.pkl |
| W20529 Bike Data (Capital Bikeshare) | capital-onebike.csv |

The bikeshare data is data for a single bicycle, W20529, for all trips in October, November, and December of 2017.

### Load Florida Hurricane Data from a Pickle File

In [None]:
# Read the data from the .pkl file.
# This contains a list of datetime.date objects.
print("Loading Florida hurricane dates...")
with open("florida_hurricane_dates.pkl", "rb") as file:
    florida_hurricane_dates = pickle.load(file)
print(florida_hurricane_dates[:5])

### Read the Bikeshare CSV File into a pandas DataFrame

The code below specifies the columns containing dates by their names. The columns can also be specified by column numbers:

```Python
onebike_df = \
    pd.read_csv(
        "capital-onebike.csv",
        parse_dates=[0, 1])
```

In [None]:
# Convert the CSV file to a pandas DataFrame, parsing the dates.
# Can also use parse_dates=[0, 1].
onebike_df = \
    pd.read_csv(
        "capital-onebike.csv",
        parse_dates=["Start date", "End date"])
print(onebike_df.info())
print()
print(onebike_df.head())

## Dates and Calendars

### Dates in Python

#### Creating Date Objects (Example)

In [None]:
# Create dates.
# The positional arguments correspond to year, month, and day.
two_hurricanes_dates = [datetime.date(2016, 10, 7), datetime.date(2017, 6, 21)]
print(two_hurricanes_dates)

#### Attributes of Dates (Example)

Weekdays in Python begin with 0 for Monday.

In [None]:
print("year:", two_hurricanes_dates[0].year)
print("month:", two_hurricanes_dates[0].month)
print("day:", two_hurricanes_dates[0].day)
print("weekday:", two_hurricanes_dates[0].weekday())

#### Getting the Name of the Weekday (Extra)

See https://stackoverflow.com/questions/9847213/how-do-i-get-the-day-of-week-given-a-date/29519293#29519293.

In [None]:
# Get the name of the weekday.
weekday = two_hurricanes_dates[0].strftime("%A")
print(weekday)
weekday2 = calendar.day_name[two_hurricanes_dates[0].weekday()]
print(weekday2)

#### Get the Weekday (Exercise)

In [None]:
# On what day of the week did Hurricane Andrew occur?
hurricane_andrew = datetime.date(1992, 8, 24)
print(hurricane_andrew.weekday())
# Bonus answer.
print(hurricane_andrew.strftime("%A"))

#### Count Early Florida Hurricanes (Exercise)

In [None]:
# Count the number of early hurricanes (hurricanes arriving before June).
early_hurricanes = 0
for hurricane in florida_hurricane_dates:
    if hurricane.month < 6:
        early_hurricanes = early_hurricanes + 1
print(early_hurricanes)

### Math with Dates

#### Math with Dates (Example)

In [None]:
# Create a datetime.timedelta object from two datetime.date objects.
d1 = datetime.date(2017, 11, 5)
d2 = datetime.date(2017, 12, 4)
l = [d1, d2]
print(min(l))
delta = d2 - d1
print(type(delta))
print(delta)
print(delta.days)

# Create a datetime.timedelta object and add it to a datetime.date.
td = datetime.timedelta(days=29)
print(d1 + td)

#### Subtracting Dates (Exercise)

Find the number of days between the first and last hurricanes of 2007.

In [None]:
# Find the number of days from May 9, 2007, to December 13, 2007.
start = datetime.date(2007, 5, 9)
end = datetime.date(2007, 12, 13)
print((end - start).days)

#### Counting Events per Calendar Month (Exercise)

In [None]:
# First approach, used by the course.
# This has the advantage that all keys exist and they are sorted.
hurricanes_each_month = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0,
                         7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0}
for hurricane in florida_hurricane_dates:
    month = hurricane.month
    hurricanes_each_month[month] += 1
print(list(hurricanes_each_month.items()))

In [None]:
# Second approach, using defaultdict.
# The defaultdict is missing keys for months with no hurricanes.
hurricanes_each_month2 = collections.defaultdict(int)
for hurricane in florida_hurricane_dates:
    hurricanes_each_month2[hurricane.month] += 1
print(sorted(hurricanes_each_month2.items()))

In [None]:
# Third approach using Counter.
# The Counter is missing keys for months with no hurricanes.
hurricanes_each_month3 = collections.Counter([x.month for x in florida_hurricane_dates])
print(sorted(hurricanes_each_month3.items()))

#### Sorting Dates (Exercise)

datetime.date objects can be sorted using `sorted()`.

In [None]:
# Replicate the exercise with a subset of the data.
dates_scrambled = [
     datetime.date(1988, 8, 4),
     datetime.date(1990, 10, 12),
     datetime.date(2003, 4, 20),
     datetime.date(1971, 9, 1),
     datetime.date(1988, 8, 23),
     datetime.date(1950, 8, 31),
     datetime.date(2017, 10, 29),
     datetime.date(2011, 7, 18),
]
print(dates_scrambled[0])
print(dates_scrambled[-1])
dates_ordered = sorted(dates_scrambled)
print(dates_ordered[0])
print(dates_ordered[-1])

### Turning Dates into Strings

#### Convert a Date to ISO 8601 Format (Example)

By default, a datetime.date object is printed in ISO 8601 format, YYYY-MM-DD.

In [None]:
# Print or convert a datetime.date object in ISO format.
d = datetime.date(2017, 11, 5)
print(d)
print([d.isoformat()])
# Sort date strings that computers once had trouble with.
# Date strings in ISO 8601 format sort correctly.
# This is handy for file names.
some_dates = ["2000-01-01", "1999-12-31"]
print(sorted(some_dates))

#### Convert Dates to Other Formats

Use the `.strftime()` method of a datetime.date object to convert a date to a string in a format that is not ISO 8601. See https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior for the documentation of `.strftime()` and `.strptime()`.

In [None]:
print(d.strftime("%Y"))
print(d.strftime("Year is %Y"))
print(d.strftime("%Y/%m/%d"))

#### Print Dates in a Friendly Format (Exercise)

In [None]:
# Print the earliest date in ISO format and in US format.
first_date = sorted(florida_hurricane_dates)[0]
iso = "Our earliest hurricane date: " + first_date.isoformat()
us = "Our earliest hurricane date: " + first_date.strftime("%m/%d/%Y")
# Extra: print the month and day without the leading 0.
us2 = "Our earliest hurricane date: " +\
    "{}/{}/{}".format(first_date.month, first_date.day, first_date.year)
print("ISO: " + iso)
print("US: " + us)
print("US: " + us2)

#### Represent Dates in Different Ways (Exercise)

> Astronomers usually use the 'day number' out of 366 instead of the month and date, to avoid ambiguities between languages.

In [None]:
# Print the date for Hurricane Andrew in various formats.
andrew = datetime.date(1992, 8, 26)
# Print the date in the format 'YYYY-MM'.
print(andrew.strftime("%Y-%m"))
# Print the date in the format 'Month (YYYY)'.
print(andrew.strftime("%B (%Y)"))
# Print the date in the format 'YYYY-DDD'.
print(andrew.strftime("%Y-%j"))

## Combining Dates and Times

### Dates and Times

#### Create a datetime.datetime Object (Demonstration)

In [None]:
# Create a datetime.datetime object.
dt = datetime.datetime(2017, 10, 1, 15, 23, 25)
print(dt)
print(dt.isoformat())
# Add microseconds.
# pandas supports billionths of seconds (nanoseconds) using the
# pandas.Timestamp class.
dt = datetime.datetime(2017, 10, 1, 15, 23, 25, 500000)
print(dt)
# Use named arguments.
dt = datetime.datetime(year=2017, month=10, day=1, hour=15, minute=23, second=25, microsecond=500000)
print(dt)

#### Replacing Parts of a datetime.datetime Object (Demonstration)

In [None]:
# Create a new datetime.datetime by replacing some attributes.
dt_hr = dt.replace(minute=0, second=0, microsecond=0)
print(dt_hr)

#### Creating datetime.datetime Objects by Hand (Exercise)

In [None]:
# Create and print a datetime.datetime.
dt = datetime.datetime(2017, 10, 1, 15, 26, 26)
print(dt.isoformat())
dt_old = dt.replace(year=1917)
print(dt_old.isoformat())

#### Count Events before Noon Using pandas and NumPy Methods (Extra)

In [None]:
# Count the number of start events before and after noon using the pandas
# Timestamp objects in the DataFrame. This is very efficient.
start_hours = onebike_df["Start date"].array.hour
trip_counts = {"AM": np.sum(start_hours < 12), "PM": np.sum(start_hours >= 12)}
print(trip_counts)

#### Create a List of Dictionaries from the DataFrame (Extra)

In [None]:
# Create onebike_datetimes, a list of dicts with keys "start" and "end"
# and datetime.datetime values. It took me about an hour to figure out
# how to do this, which was good practice with iterating through a pandas
# DataFrame.

# Extract the "Start date" and "End date" values into a list of dicts.
# Set index=False to avoid including the index in the namedtuple.
# In the namedtuple objects, "Start date" and "End date" were not valid
# field names, so the fields were named "_0" and "_1".
# In the pandas Series, the datetimes are stored as pandas Timestamp objects,
# which can be converted to datetime.datetime objects.
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html
onebike_datetimes = []
for row in onebike_df.itertuples(index=False, name="Onebike"):
    start_end = {"start": row._0.to_pydatetime(), "end": row._1.to_pydatetime()}
    onebike_datetimes.append(start_end)
# print(onebike_datetimes[:5])

#### Counting Events before Noon (Exercise)

In [None]:
# Finish the course's exercise, counting trips that started before noon or 
# after noon. This requires the onebike_datetimes list of dictionaries.
trip_counts = {'AM': 0, 'PM': 0}
for trip in onebike_datetimes:
    if trip['start'].hour < 12:
        trip_counts["AM"] += 1
    else:
        trip_counts["PM"] += 1
print(trip_counts)

### Printing and Parsing Datetimes

#### Print Datetimes (Example)

In [None]:
# Print datetimes.
dt = datetime.datetime(2017, 12, 30, 15, 19, 13)
print(dt.strftime("%Y-%m-%d"))
print(dt.strftime("%Y-%m-%d %H:%M:%S"))
print(dt.strftime("%H:%M:%S on %Y/%m/%d"))

#### Parsing Datetimes from Strings (Example)

In [None]:
# Parse a string to create a datetime.datetime object.
dt = datetime.datetime.strptime("12/30/2017 15:19:13", "%m/%d/%Y %H:%M:%S")
print(dt)

# If parsing fails, a ValueError exception is raised.
try:
    dt = datetime.datetime.strptime("12/30/2017 15:19:13", "%m/%d/%Y")
except Exception as exc:
    # These create the same output.
    print(traceback.format_exc(limit=0))
    # traceback.print_exception(exc, limit=0, file=sys.stdout)

#### Convert a Unix Timestamp into a Datetime (Example)

In [None]:
# Unix timestamps count the seconds since January 1, 1970 UTC.
# Without a tz argument, fromtimestamp() returns a naive local time.
ts = 1514665153
dt = datetime.datetime.fromtimestamp(ts)
print(dt)
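
Because `fromtimestamp()` without a `tz` argument returns local time, the value printed above depends on the machine's timezone. A minimal sketch of a machine-independent conversion, passing `tz=datetime.timezone.utc` (the name `dt_utc` is illustrative):

```python
import datetime

ts = 1514665153

# Passing tz returns an aware datetime in that timezone instead of local time.
dt_utc = datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc)
print(dt_utc.isoformat())  # 2017-12-30T20:19:13+00:00

# An aware datetime converts back to the same Unix timestamp.
print(dt_utc.timestamp() == ts)  # True
```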

#### Turning Strings into DateTimes (Exercise)

> Python does not have the ability to parse non-zero-padded dates and times out of the box (such as "1/2/2018"). If needed, you can use other string methods to create zero-padded strings suitable for `strptime()`.

In [None]:
# Convert strings to datetime.datetime objects.
s = '2017-02-03 00:00:01'
fmt = '%Y-%m-%d %H:%M:%S'
d = datetime.datetime.strptime(s, fmt)
print(d)

s = '2030-10-15'
fmt = '%Y-%m-%d'
d = datetime.datetime.strptime(s, fmt)
print(d)

s = '12/15/1986 08:00:00'
fmt = '%m/%d/%Y %H:%M:%S'
d = datetime.datetime.strptime(s, fmt)
print(d)

# Non-zero-padded strings. Contrary to the note above, strptime() parses
# these without any extra string handling.
s = "1/2/2018 4:7:2"
fmt = "%m/%d/%Y %H:%M:%S"
d = datetime.datetime.strptime(s, fmt)
print(d)

#### Parsing Pairs of Strings as Datetimes (Exercise)

In [None]:
# Work with a subset of the data since I used pandas.read_csv and parsed the
# dates already.
onebike_datetime_strings = [
    ('2017-10-01 15:23:25', '2017-10-01 15:26:26'),
    ('2017-10-01 15:42:57', '2017-10-01 17:49:59'),
    ('2017-10-02 06:37:10', '2017-10-02 06:42:53'),
    ('2017-10-02 08:56:45', '2017-10-02 09:18:03'),
    ('2017-10-02 18:23:48', '2017-10-02 18:45:05')
]

# Convert each pair of strings into a dictionary with keys "start" and "end".
fmt = "%Y-%m-%d %H:%M:%S"
onebike_datetimes2 = []
for (start, end) in onebike_datetime_strings:
    trip = {'start': datetime.datetime.strptime(start, fmt),
            'end': datetime.datetime.strptime(end, fmt)}
    onebike_datetimes2.append(trip)
print(onebike_datetimes2)

#### Recreating ISO 8601 Format with `.strftime()` (Exercise)

In [None]:
# Use the .strftime() method to format a datetime in ISO 8601 format.
first_start = onebike_datetimes[0]['start']
fmt = "%Y-%m-%dT%H:%M:%S"
print(first_start.isoformat())
print(first_start.strftime(fmt))

#### Convert Unix Timestamps to Datetimes (Exercise)

In [None]:
# Convert Unix timestamps into datetime.datetime objects.
timestamps = [1514665153, 1514664543]
dts = []
for ts in timestamps:
    dts.append(datetime.datetime.fromtimestamp(ts))
print(dts)
# Output is different when printing an individual datetime.datetime object.
for dt in dts:
    print(dt)

### Working with Durations

#### Working with Durations (Example)

In [None]:
start = datetime.datetime(2017, 10, 8, 23, 46, 47)
end = datetime.datetime(2017, 10, 9, 0, 10, 57)
duration = end - start
print(duration.total_seconds())

#### Creating Timedeltas (Example)

In [None]:
# Create and use datetime.timedelta objects.
delta1 = datetime.timedelta(seconds=1)
print(start)
print(start + delta1)
delta2 = datetime.timedelta(days=1, seconds=1)
print(start)
print(start + delta2)
delta3 = datetime.timedelta(weeks=-1)
print(start)
print(start + delta3)
delta4 = datetime.timedelta(weeks=1)
print(start)
print(start - delta4)

#### Turning Pairs of Datetimes into Durations (Exercise)

Internally, datetime.timedelta objects store days, seconds, and microseconds. Use the `.total_seconds()` method to get the whole duration in seconds.
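
A short sketch of the difference between the stored fields and `.total_seconds()`, reusing the start and end values from the example above. The names `forward` and `backward` are illustrative.

```python
import datetime

start = datetime.datetime(2017, 10, 8, 23, 46, 47)
end = datetime.datetime(2017, 10, 9, 0, 10, 57)

# A positive duration: the fields read as expected.
forward = end - start
print(forward.days, forward.seconds, forward.microseconds)  # 0 1450 0
print(forward.total_seconds())  # 1450.0

# A negative duration is normalized so that only .days is negative.
backward = start - end
print(backward.days, backward.seconds, backward.microseconds)  # -1 84950 0
print(backward.total_seconds())  # -1450.0
```

Reading `.seconds` alone on a negative timedelta gives a misleading number, which is why `.total_seconds()` is the safer choice.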

In [None]:
# Calculate the length of time the bicycle was out on each trip.
onebike_durations = []
for trip in onebike_datetimes:
    trip_duration = trip["end"] - trip["start"]
    trip_length_seconds = trip_duration.total_seconds()
    onebike_durations.append(trip_length_seconds)

#### Determine Average Trip Time (Exercise)

In [None]:
# Determine average trip time.
total_elapsed_time = sum(onebike_durations)
number_of_trips = len(onebike_durations)
print(total_elapsed_time / number_of_trips)

#### Calculate Longest and Shortest Trips (Exercise)

The shortest trip, with negative duration, happened during the conversion from daylight saving time to standard time.
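
A minimal sketch of how a negative duration can arise, using hypothetical trip times (not values read from the dataset) and fixed UTC offsets from the standard library.

```python
import datetime

# Hypothetical wall-clock readings around the end of daylight saving time.
start = datetime.datetime(2017, 11, 5, 1, 56, 50)
end = datetime.datetime(2017, 11, 5, 1, 1, 4)

# Naive subtraction compares the wall-clock readings only.
naive_seconds = (end - start).total_seconds()
print(naive_seconds)  # -3346.0

# Attach the offsets that applied: EDT before the change, EST after it.
EDT = datetime.timezone(datetime.timedelta(hours=-4))
EST = datetime.timezone(datetime.timedelta(hours=-5))
aware_seconds = (end.replace(tzinfo=EST) - start.replace(tzinfo=EDT)).total_seconds()
print(aware_seconds)  # 254.0
```

The two results differ by exactly one hour, the amount the clocks were set back.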

In [None]:
# Some of the results look suspicious.
shortest_trip = min(onebike_durations)
longest_trip = max(onebike_durations)
print("The shortest trip was " + str(shortest_trip) + " seconds")
print("The longest trip was " + str(longest_trip) + " seconds")

## Time Zones and Daylight Saving

### UTC Offsets

Clocks west of UTC are set behind UTC (UTC-xx:00); clocks east of UTC are set ahead of UTC (UTC+xx:00). Because every offset is measured from the same reference, UTC offsets allow us to compare times from different timezones.
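
A small sketch of such a comparison, using fixed offsets from the standard library. The names `in_new_york` and `in_kolkata` are illustrative.

```python
import datetime

EST = datetime.timezone(datetime.timedelta(hours=-5))
IST = datetime.timezone(datetime.timedelta(hours=5, minutes=30))

# Two different wall-clock readings that describe the same instant.
in_new_york = datetime.datetime(2017, 12, 30, 15, 9, 3, tzinfo=EST)
in_kolkata = datetime.datetime(2017, 12, 31, 1, 39, 3, tzinfo=IST)

# Aware datetimes with different offsets are compared in absolute time.
print(in_new_york == in_kolkata)  # True
print(in_kolkata - in_new_york)  # 0:00:00
```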

#### UTC (Example)

In [None]:
# datetime.timezone objects.
# Eastern standard time.
EST = datetime.timezone(datetime.timedelta(hours=-5))
print(EST)
# India standard time.
IST = datetime.timezone(datetime.timedelta(hours=5, minutes=30))
print(IST)
# Create a timezone-aware datetime.datetime object.
dt = datetime.datetime(2017, 12, 30, 15, 9, 3, tzinfo=EST)
print(dt)
print(dt.isoformat())
# Convert the time to another timezone.
print(dt.astimezone(IST))

#### Adjusting Timezone versus Changing `tzinfo` (Example)

In [None]:
# Use the .replace() method to change the tzinfo of a datetime.datetime
# object. This does not adjust the other attributes.
# Note: Don't do this with timezones obtained from pytz!
print(dt)
print(dt.replace(tzinfo=datetime.timezone.utc))
print(dt.astimezone(datetime.timezone.utc))

#### Creating Timezone-Aware Datetimes (Exercise)

In [None]:
# UTC (Coordinated Universal Time)
dt_utc = datetime.datetime(2017, 10, 1, 15, 26, 26, tzinfo=datetime.timezone.utc)
print(dt_utc.isoformat())
# PST (Pacific Standard Time)
pst = datetime.timezone(datetime.timedelta(hours=-8))
dt_pst = datetime.datetime(2017, 10, 1, 15, 26, 26, tzinfo=pst)
print(dt_pst.isoformat())
# AEDT (Australian Eastern Daylight Time)
aedt = datetime.timezone(datetime.timedelta(hours=11))
dt_aedt = datetime.datetime(2017, 10, 1, 15, 26, 26, tzinfo=aedt)
print(dt_aedt.isoformat())

#### Setting Timezones (Exercise)

In [None]:
# Set the timezone for the first 10 rows of data.
edt = datetime.timezone(datetime.timedelta(hours=-4))
for trip in onebike_datetimes[:10]:
    print("start:", trip["start"], "end:", trip["end"])
    trip['start'] = trip['start'].replace(tzinfo=edt)
    trip['end'] = trip['end'].replace(tzinfo=edt)
    print("start:", trip["start"], "end:", trip["end"])

#### What Time Did the Bike Leave in UTC? (Exercise)

In [None]:
# Display times for UTC.
for trip in onebike_datetimes[:10]:
    dt = trip['start']
    dt = dt.astimezone(datetime.timezone.utc)
    print('Original:', trip['start'], '| UTC:', dt.isoformat())

### Timezone Database

Use the `dateutil` timezone database, which, because it must be updated several times a year, is not packaged with the standard Python distribution. The `dateutil.tz` database formats the names of timezones as "Region/City" (e.g., "America/New_York").

#### Timezone Database (Example)

In [None]:
et = dateutil.tz.gettz("America/New_York")
print(type(et))
print(et)

In [None]:
last = datetime.datetime(2017, 12, 30, 15, 9, 3, tzinfo=et)
print(last)

In [None]:
# The object returned by dateutil.tz.gettz() adjusts for daylight saving time
# automatically.
first = datetime.datetime(2017, 10, 1, 15, 23, 25, tzinfo=et)
print(first)

#### Putting the Bike Trips into the Right Timezone (Exercise)

We are using the Internet Assigned Numbers Authority (IANA) timezone database (https://www.iana.org/time-zones).

In [None]:
# Loop over trips, updating the datetimes to be in Eastern Time
for trip in onebike_datetimes[:10]:
    trip['start'] = trip['start'].replace(tzinfo=et)
    trip['end'] = trip['end'].replace(tzinfo=et)

#### What Time Did the Bike Leave? (Global Edition) (Exercise)

In [None]:
# Convert the timezone of a datetime.datetime object to different
# timezones.
# London.
uk = dateutil.tz.gettz("Europe/London")
local = onebike_datetimes[0]["start"]
notlocal = local.astimezone(uk)
print(local.isoformat())
print(notlocal.isoformat())

# India Standard Time.
ist = dateutil.tz.gettz("Asia/Kolkata")
notlocal = local.astimezone(ist)
print(notlocal.isoformat())

# Samoa.
sm = dateutil.tz.gettz("Pacific/Apia")
notlocal = local.astimezone(sm)
print(notlocal.isoformat())

### Starting Daylight Saving Time

#### Start of Daylight Saving Time (Example)

Watch out for how datetime.timedelta objects are calculated! See https://blog.ganssle.io/articles/2018/02/aware-datetime-arithmetic.html. When two aware datetimes share the same `tzinfo` object, Python subtracts them in wall time and ignores the UTC offsets. When their `tzinfo` objects differ, Python converts both to UTC first, which gives absolute time. This behavior is documented here: https://docs.python.org/3/library/datetime.html#datetime-objects. The course is misleading about this. See also https://stackoverflow.com/questions/71428364/python-timedelta-spanning-dst-changeover-returns-incorrect-result.

In [None]:
# Create naive datetime.datetime objects and compare them.
# Do not use datetime.timedelta.total_seconds() to obtain the difference in
# absolute time, because it calculates wall time. One way to deal with this
# is to use the difference of the .timestamp() values, which return seconds.
spring_ahead_159am = datetime.datetime(2017, 3, 12, 1, 59, 59)
print(spring_ahead_159am.isoformat())
spring_ahead_300am = datetime.datetime(2017, 3, 12, 3, 0, 0)
print(spring_ahead_300am.isoformat())
# Calling .timestamp() on a naive datetime.datetime object assumes the
# system's local timezone, so this result depends on where the code runs.
print(spring_ahead_300am.timestamp() - spring_ahead_159am.timestamp())

# Manually add timezone information, using UTC offsets.
EST = datetime.timezone(datetime.timedelta(hours=-5))
EDT = datetime.timezone(datetime.timedelta(hours=-4))
spring_ahead_159am = spring_ahead_159am.replace(tzinfo=EST)
spring_ahead_300am = spring_ahead_300am.replace(tzinfo=EDT)
print(spring_ahead_159am.isoformat())
print(spring_ahead_300am.isoformat())
# This is correct: .timestamp() on an aware datetime.datetime object returns
# absolute time, and the two objects carry different UTC offsets.
print(spring_ahead_300am.timestamp() - spring_ahead_159am.timestamp())

# Use dateutil.
eastern = dateutil.tz.gettz("America/New_York")
eastern_spring_ahead_159am = datetime.datetime(2017, 3, 12, 1, 59, 59, tzinfo=eastern)
eastern_spring_ahead_300am = datetime.datetime(2017, 3, 12, 3, 0, 0, tzinfo=eastern)

# The objects appear to be identical.
print(eastern_spring_ahead_159am.isoformat())
print(eastern_spring_ahead_300am.isoformat())
print(spring_ahead_159am == eastern_spring_ahead_159am)
print(spring_ahead_300am == eastern_spring_ahead_300am)

# Because both objects share the same tzinfo, this difference is calculated
# as wall time, which gives a misleading result across the DST change.
print((eastern_spring_ahead_300am - eastern_spring_ahead_159am).total_seconds())

# Here is a way to calculate the absolute time difference:
print(eastern_spring_ahead_300am.timestamp() - eastern_spring_ahead_159am.timestamp())

# Another way is to convert to UTC before subtracting.
utc = datetime.timezone.utc
print((spring_ahead_300am.astimezone(utc) - spring_ahead_159am.astimezone(utc)).total_seconds())
print((eastern_spring_ahead_300am.astimezone(utc) - eastern_spring_ahead_159am.astimezone(utc)).total_seconds())

#### How Many Hours Elapsed Around Daylight Saving? (Exercise)

In [None]:
# Python uses wall time here: adding 6 hours moves the clock from 00:00 to
# 06:00, even though only 5 hours elapse across the spring-forward change.
# Start on March 12, 2017, midnight, then add 6 hours.
start = datetime.datetime(2017, 3, 12, tzinfo = dateutil.tz.gettz('America/New_York'))
end = start + datetime.timedelta(hours=6)
print(start.isoformat() + " to " + end.isoformat())
# How many hours have elapsed?
print((end - start).total_seconds() / (60 * 60))
# What if we move to UTC?
# The timezone can be dateutil.tz.UTC or datetime.timezone.utc.
td = end.astimezone(dateutil.tz.UTC) - start.astimezone(dateutil.tz.UTC)
print(td.total_seconds() / (60 * 60))

#### March 29, throughout a Decade (Exercise)

In [None]:
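# Look at the UTC offset on March 29 for the years 2000 through 2010.
# The offset varies because the UK switches to daylight saving time on the
# last Sunday of March, whose date changes from year to year.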
dt = datetime.datetime(2000, 3, 29, tzinfo = dateutil.tz.gettz("Europe/London"))

# Loop over the dates, replacing the year, and print the ISO timestamp
for y in range(2000, 2011):
    print(dt.replace(year=y).isoformat())

### Ending Daylight Saving Time

When ending daylight saving time, there are two 1:00 am times in local time. This creates an ambiguity. Here's how to check it.

> Python often tries to be helpful by glossing over daylight saving time difference, and oftentimes that's what you want. However, when you do care about it, use dateutil to set the timezone information correctly and then switch into UTC for the most accurate comparisons between events.

#### Ending Daylight Saving Time (Example)

In [None]:
eastern = dateutil.tz.gettz("US/Eastern")
first_1am = datetime.datetime(2017, 11, 5, 1, 0, 0, tzinfo=eastern)
print(dateutil.tz.datetime_ambiguous(first_1am))

# Use the enfold() function to indicate the second 1:00 am moment.
second_1am = datetime.datetime(2017, 11, 5, 1, 0, 0, tzinfo=eastern)
second_1am = dateutil.tz.enfold(second_1am)
print(dateutil.tz.datetime_ambiguous(second_1am))

# This doesn't change the behavior during subtraction, because the
# subtraction calculates wall time.
print((second_1am - first_1am).total_seconds())

# Convert to UTC and do the subtraction. This returns the correct result.
first_1am = first_1am.astimezone(tz=dateutil.tz.UTC)
second_1am = second_1am.astimezone(tz=dateutil.tz.UTC)
print((second_1am - first_1am).total_seconds())

# It's up to the code to check the fold and do the correct thing with it.
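
The same idea can be sketched with the standard library alone, assuming Python 3.9 or later, where `zoneinfo` is available and `fold` is an attribute of `datetime.datetime`. The helper `fold_gap_seconds` is a hypothetical name used only for this illustration, and the `try`/`except` covers systems without a timezone database.

```python
import datetime
import zoneinfo


def fold_gap_seconds(zone_name, wall_time):
    """Return (wall, absolute) seconds between both occurrences of wall_time."""
    tz = zoneinfo.ZoneInfo(zone_name)
    first = wall_time.replace(tzinfo=tz, fold=0)
    second = wall_time.replace(tzinfo=tz, fold=1)
    # Same tzinfo object: subtraction ignores the offsets (wall time).
    wall = (second - first).total_seconds()
    # .timestamp() honors fold, so this difference is absolute time.
    absolute = second.timestamp() - first.timestamp()
    return wall, absolute


try:
    gaps = fold_gap_seconds("America/New_York", datetime.datetime(2017, 11, 5, 1, 0, 0))
    print(gaps)
except zoneinfo.ZoneInfoNotFoundError:
    gaps = None
    print("No timezone database found; try installing the tzdata package.")
```

When the database is available, the wall-time difference should be zero while the absolute difference is one hour.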

#### Find Ambiguous Datetimes (Exercise)

In [None]:
# Reload the onebike_datetimes data and apply the "America/New_York" timezone.
onebike_datetimes = []
et = dateutil.tz.gettz("America/New_York")
for row in onebike_df.itertuples(index=False, name="Onebike"):
    start_end = {
        "start": row._0.to_pydatetime().replace(tzinfo=et),
        "end": row._1.to_pydatetime().replace(tzinfo=et)}
    onebike_datetimes.append(start_end)
# print(onebike_datetimes[10:20])

In [None]:
# Search for ambiguous datetimes.
# This check identifies one record that needs to be fixed.
for trip in onebike_datetimes:
    if dateutil.tz.datetime_ambiguous(trip["start"]):
        print("Ambiguous start at " + str(trip["start"]))
        print("with end at " + str(trip["end"]))
    if dateutil.tz.datetime_ambiguous(trip["end"]):
        print("Ambiguous end at " + str(trip["end"]))
        print("with start at " + str(trip["start"]))
        print(trip["end"] - trip["start"])

Avoid ambiguous datetimes in practice by storing datetimes in UTC.

#### Clean Daylight Saving Data with Fold (Exercise)

The exercise above revealed a trip with ambiguous start and end times. Since the start time was later than the end time, the start time has fold=0 and the end time has fold=1.

In [None]:
# The shortest trip is always 116.0 seconds; a negative duration is
# not seen.
trip_durations = []
for trip in onebike_datetimes:
    # When the start is later than the end, set the fold to be 1
    if trip["start"] > trip["end"]:
        trip['end'] = dateutil.tz.enfold(trip['end'])
    # Convert to UTC
    start = trip['start'].astimezone(dateutil.tz.UTC)
    end = trip['end'].astimezone(dateutil.tz.UTC)

    # Subtract the difference
    trip_length_seconds = (end - start).total_seconds()
    trip_durations.append(trip_length_seconds)

# Take the shortest trip duration
print("Shortest trip: " + str(min(trip_durations)))

## Dates and Times in pandas

### Reading Date and Time Data in pandas

The pandas `read_csv()` method can attempt to parse dates in specific columns. If this does not work, read the data without parsing the dates and then call `pd.to_datetime()` to parse the dates using format strings.

The value stored in the DataFrame is a pandas Timestamp, which is to all intents and purposes equivalent to a datetime.datetime object.

#### Loading Datetimes with `parse_dates` (Example)

In [None]:
# Load the bikeshare data from the CSV file, parsing the dates.
rides = pd.read_csv("capital-onebike.csv", parse_dates=["Start date", "End date"])
print(rides.info())

#### Converting Datetimes with `pd.to_datetime()` (Example)

In [None]:
# Load the bikeshare data from the CSV file, and convert the date strings
# afterwards.
rides2 = pd.read_csv("capital-onebike.csv")
rides2["Start date"] = pd.to_datetime(rides2["Start date"], format="%Y-%m-%d %H:%M:%S")
rides2["End date"] = pd.to_datetime(rides2["End date"], format="%Y-%m-%d %H:%M:%S")
print(rides2.info())
# Show that the results are identical.
print(rides.equals(rides2))

#### Datetime Arithmetic (Example)

Now that we have datetime objects, we can compute a column named "Duration". (Technically, we need to fix the one row of data where the start datetime is greater than the end datetime because of the switch from daylight saving time.)

In [None]:
rides["Duration"] = rides["End date"] - rides["Start date"]
print(rides["Duration"].info())
print()
print(type(rides["Duration"][0]))
print()
print(rides["Duration"][:5])
print()
print(rides["Duration"].head(5))

In [None]:
# Here's the bad duration.
print(rides["Duration"].min())

pandas has convenient methods for conversion of Timedelta objects; here, we use `.dt.total_seconds()`.

In [None]:
print(rides["Duration"].dt.total_seconds())

In [None]:
# Look for the bad row of data caused by the shift from daylight saving time
# to standard time.
print(rides["Duration"][rides["Duration"].dt.total_seconds() < 0])

#### Loading a CSV file in pandas (Exercise)

In [None]:
# Read the CSV file, parsing the dates.
rides3 = pd.read_csv('capital-onebike.csv', 
                    parse_dates = ["Start date", "End date"])
print(rides3.iloc[0])

In [None]:
# Compute the "Duration" column.
ride_durations = rides3["End date"] - rides3["Start date"]
rides3["Duration"] = ride_durations.dt.total_seconds()
print(rides3['Duration'].head())

### Summarizing Datetime Data in pandas

This course is old; the instructor advises using at least pandas version 0.23.

For the strings used for the first argument to `.resample()`, see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects.

#### Summarizing Data in pandas (Example)

In [None]:
# Average time out of dock.
print(rides["Duration"].mean())
print(rides["Duration"].median())
print(rides["Duration"].min())
print(rides["Duration"].sum())

# Calculate the fraction of time out of the dock.
print(rides["Duration"].sum() / datetime.timedelta(days=91))

# Count the number of types of riders.
print(rides["Member type"].value_counts())
print(rides["Member type"].value_counts() / len(rides))

In [None]:
# Add and use a column named "Duration seconds".
rides["Duration seconds"] = rides["Duration"].dt.total_seconds()
print(rides.groupby("Member type")["Duration seconds"].mean())
# Ride duration per month. This looks like magic.
# In pandas 2.2 and later, the "M" alias is deprecated in favor of "ME".
print(rides.resample("M", on="Start date")["Duration seconds"].mean())
# Size of groups.
print(rides.groupby("Member type").size())
# First ride for each group.
print(rides.groupby("Member type").first())
# Quick plotting.
rides.resample("M", on="Start date")["Duration seconds"].mean().plot()
plt.show()
# Resample by days and plot.
# The outlier was possibly a bicycle repair. Evidence supporting this is
# that the bike was not used for several days before the outlier was
# recorded, suggesting that the bicycle needed repair.
rides.resample("D", on="Start date")["Duration seconds"].mean().plot()
plt.show()

#### How Many Joyrides? (Exercise)

A joyride is defined as a ride that starts and stops at the same dock.

In [None]:
# Create the boolean series.
joyrides = rides["Start station"] == rides["End station"]
print("{} rides were joyrides".format(joyrides.sum()))
# Median of all rides.
print("The median duration overall was {:.2f} seconds"\
      .format(rides['Duration'].dt.total_seconds().median()))
# Median of joyrides.
print("The median duration for joyrides was {:.2f} seconds"\
      .format(rides[joyrides]['Duration'].dt.total_seconds().median()))

#### Plot Rides per Unit Time (Exercise)

In [None]:
# Is there a trend in fewer rides per day as winter approaches?
rides.resample("D", on="Start date")\
    .size()\
    .plot(ylim=[0, 15])
plt.show()

In [None]:
# The data are noisy. Try plotting the number of rides per month.
rides.resample("M", on="Start date")\
    .size()\
    .plot(ylim=[0, 150])
plt.show()

#### Members versus Casual Riders over Time (Exercise)

> Note that by default, `.resample()` labels Monthly resampling with the last day in the month and not the first. It certainly looks like the fraction of Casual riders went down as the number of rides dropped. With a little more digging, you could figure out if keeping Member rides only would be enough to stabilize the usage numbers throughout the fall.

In [None]:
# Resample rides to be monthly on the basis of Start date.
monthly_rides = rides.resample("M", on="Start date")['Member type']
# Take the ratio of the .value_counts() over the total number of rides
print(monthly_rides.value_counts() / monthly_rides.size())
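
To see the month-end labeling described in the note above, here is a small sketch on made-up rides. It passes the offset object `pd.offsets.MonthEnd()` rather than an alias string, because the spelling of the month-end alias differs between pandas versions; that substitution is my assumption, not something the course does.

```python
import pandas as pd

# Made-up rides: two in October 2017 and one in November 2017.
demo = pd.DataFrame(
    {
        "Start date": pd.to_datetime(["2017-10-02", "2017-10-20", "2017-11-07"]),
        "Member type": ["Member", "Casual", "Member"],
    }
)

# Count rides per month; each bin is labeled with the month's last day.
counts = demo.resample(pd.offsets.MonthEnd(), on="Start date").size()
print(counts)
```

The index holds 2017-10-31 and 2017-11-30, not the first day of each month.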

#### Combining `.groupby()` and `.resample()` (Exercise)

> Whereas `.resample()` groups rows by some time or date information, `.groupby()` groups rows based on the values in one or more columns.

> It looks like casual riders consistently took longer rides, but that both groups took shorter rides as the months went by. Note that, by combining grouping and resampling, you can answer a lot of questions about nearly any data set that includes time as a feature. Keep in mind that you can also group by more than one column at once.

In [None]:
# Group rides by member type, and resample to the month.
grouped = rides.groupby('Member type')\
  .resample("M", on="Start date")
# Print the median duration for each group.
print(grouped["Duration seconds"].median())
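
The note above mentions grouping by more than one column at once. A minimal sketch, using a made-up frame with hypothetical station names `"A"` and `"B"`:

```python
import pandas as pd

# Made-up rides (hypothetical data, not from the course).
demo = pd.DataFrame(
    {
        "Member type": ["Member", "Member", "Casual", "Casual"],
        "Start station": ["A", "A", "A", "B"],
        "Duration seconds": [300.0, 500.0, 1200.0, 900.0],
    }
)

# Passing a list of column names groups by each observed combination.
medians = demo.groupby(["Member type", "Start station"])["Duration seconds"].median()
print(medians)
```

The result is indexed by a `MultiIndex` with one level per grouping column.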

### Additional Datetime Methods in pandas

#### Timezones in pandas (Example)

Finally, we deal with row 129, where the start date is in daylight time but the end date is in standard time.

The datetime values we have so far are timezone-naïve; they have no associated timezone information. In pandas, use `.dt.tz_localize()` to attach a timezone to naïve datetime values. When these notes were written, this used `pytz` under the hood. The plan for pandas 2.0 appeared to be a switch to the `zoneinfo` module.

In [None]:
print(rides["Start date"][:3])

In [None]:
print(rides["Start date"].head(3)\
    .dt.tz_localize("America/New_York"))

In [None]:
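# pytz is not in the Imports cell, but the except clause below needs it.
import pytz

# One ride has a negative duration because its naive timestamps straddle
# the end of daylight saving time.
print(rides["Duration"][rides["Duration"].dt.total_seconds() < 0])
# Try to set a timezone for all values.
# This raises a pytz.exceptions.AmbiguousTimeError.
try:
    tz_rides = rides["Start date"].dt.tz_localize("America/New_York")
except pytz.exceptions.AmbiguousTimeError:
    print(traceback.format_exc(limit=0))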
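
The same failure can be reproduced without the course data. The sketch below uses made-up timestamps, one of which falls in the hour that occurred twice in New York on 2017-11-05. It catches a broad `Exception` on purpose, since I am assuming the exact exception class may vary with the pandas version.

```python
import pandas as pd

# 01:30 on 2017-11-05 happened twice in New York (clocks fell back at 02:00).
naive = pd.Series(
    pd.to_datetime(["2017-11-04 12:00", "2017-11-05 01:30", "2017-11-06 12:00"])
)

try:
    # By default, ambiguous wall-clock times raise an error.
    naive.dt.tz_localize("America/New_York")
except Exception as exc:  # exact class depends on the pandas version
    print(type(exc).__name__)

# ambiguous="NaT" replaces the ambiguous value with NaT instead of raising.
localized = naive.dt.tz_localize("America/New_York", ambiguous="NaT")
print(localized)
```

Only the ambiguous value becomes `NaT`; the other two values get their normal UTC offsets.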

In [None]:
# We could have fixed the Start date value using dateutil.tz.enfold().
# The course sets the ambiguous times to "NaT", which pandas skips in
# aggregations such as .min().
# Recalculate the "Duration" and "Duration seconds" columns.
print(rides["Duration seconds"].min())
rides["Start date"] = rides["Start date"].dt.tz_localize("America/New_York", ambiguous="NaT").copy()
rides["End date"] = rides["End date"].dt.tz_localize("America/New_York", ambiguous="NaT").copy()
rides["Duration"] = rides["End date"] - rides["Start date"]
rides["Duration seconds"] = (rides["End date"] - rides["Start date"]).dt.total_seconds()

# Show the values for row 129.
print(rides.iloc[129])

# Now the minimum duration is positive.
print(rides["Duration seconds"].min())
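
For comparison, the enfold approach mentioned in the cell above works on standard-library `datetime` objects through `dateutil.tz.enfold()`. This is only a sketch with a made-up wall-clock time, not the course's solution:

```python
import datetime

import dateutil.tz

eastern = dateutil.tz.gettz("America/New_York")

# Made-up wall-clock time inside the repeated hour on 2017-11-05.
first = datetime.datetime(2017, 11, 5, 1, 30, tzinfo=eastern)
print(dateutil.tz.datetime_ambiguous(first))

# enfold() marks the datetime as the second occurrence of that wall-clock time.
second = dateutil.tz.enfold(first)
print(first.utcoffset(), second.utcoffset())

# Convert to UTC before subtracting to get the true elapsed time.
utc = datetime.timezone.utc
elapsed = second.astimezone(utc) - first.astimezone(utc)
print(elapsed)
```

The conversion to UTC matters here: subtracting two aware datetimes that share the same `tzinfo` object compares wall-clock values only and would report zero elapsed time.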

#### Other Datetime Operations in pandas (Example)

In [None]:
# Display the start year for each row.
print(rides["Start date"].head(3).dt.year)
# Display the day of the week for each row.
print(rides["Start date"].head(3).dt.day_name())
# These results can be aggregated by a .groupby() call.

In [None]:
# Shift the values down one row, padding the first row with NaT.
print(rides["End date"].shift(1).head(3))
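
Shifting is mostly useful for comparing each row with the one before it. A minimal sketch on made-up rides (the `demo` frame is hypothetical):

```python
import pandas as pd

# Made-up rides (hypothetical data, not from the course).
demo = pd.DataFrame(
    {
        "Start date": pd.to_datetime(
            ["2017-10-01 08:00", "2017-10-01 09:00", "2017-10-01 12:00"]
        ),
        "End date": pd.to_datetime(
            ["2017-10-01 08:10", "2017-10-01 09:30", "2017-10-01 12:05"]
        ),
    }
)

# After shift(1), each row holds the End date of the previous ride.
previous_end = demo["End date"].shift(1)
# Idle time between the end of one ride and the start of the next.
gap = demo["Start date"] - previous_end
print(gap)
```

The first row has no previous ride, so its gap is `NaT`.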

#### Timezones in pandas (Exercise)

In [None]:
# This repeats work we've already accomplished.
# Localize the Start date column to America/New_York
# rides['Start date'] = rides['Start date'].dt.tz_localize("America/New_York", ambiguous="NaT")
# Note two equivalent ways of getting to the scalar value.
print(rides['Start date'].iloc[0])
print(rides.iloc[0]["Start date"])
# Convert the Start date column to Europe/London.
rides['Start date'] = rides['Start date'].dt.tz_convert("Europe/London")
print(rides['Start date'].iloc[0])
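
The difference between the two methods is that `.dt.tz_localize()` attaches a timezone to naive values, while `.dt.tz_convert()` re-expresses already-aware values in another timezone without changing the instant. A small sketch with one made-up timestamp:

```python
import pandas as pd

# One made-up naive timestamp (hypothetical data).
starts = pd.Series(pd.to_datetime(["2017-10-01 15:23:25"]))

# Attach a timezone: the wall-clock time stays the same.
new_york = starts.dt.tz_localize("America/New_York")
# Convert to another timezone: the wall-clock time changes, the instant does not.
london = new_york.dt.tz_convert("Europe/London")

print(new_york.iloc[0])
print(london.iloc[0])
print(new_york.iloc[0] == london.iloc[0])
```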

#### How Long per Weekday? (Exercise)

Calculate the median ride length for each day of the week.

In [None]:
# Add a column for the weekday of the start of the ride and
# print the weekday and the median ride length.
# The results don't match the course's results because the course
# is using some different rows (e.g., row 79).
rides['Ride start weekday'] = rides['Start date'].dt.day_name()
print(rides.groupby("Ride start weekday")['Duration seconds'].median())
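
Grouping by day name sorts the result alphabetically, which is awkward to read. One possible way to restore calendar order is to reindex the result, sketched here on made-up rides; the `weekday_order` list is my own helper, not part of the course.

```python
import pandas as pd

# Made-up rides: two Mondays, one Tuesday, one Sunday.
demo = pd.DataFrame(
    {
        "Start date": pd.to_datetime(
            ["2017-10-02", "2017-10-03", "2017-10-08", "2017-10-09"]
        ),
        "Duration seconds": [300.0, 600.0, 1200.0, 500.0],
    }
)
demo["Ride start weekday"] = demo["Start date"].dt.day_name()

# groupby() sorts the weekday names alphabetically.
medians = demo.groupby("Ride start weekday")["Duration seconds"].median()

# Hypothetical helper list giving the desired calendar order.
weekday_order = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
]
# Reindex into calendar order and drop weekdays with no rides.
ordered = medians.reindex(weekday_order).dropna()
print(ordered)
```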

#### How Long between Rides? (Exercise)

This is where shifting a column is useful. We want to compare the End date of a ride with the Start date of the next ride.

In [None]:
# Shift the end dates down one row, then subtract them from the start dates.
rides['Time since'] = rides['Start date'] - (rides['End date'].shift(1))
# Move from a timedelta to a number of seconds, which is easier to work with.
rides['Time since'] = rides['Time since'].dt.total_seconds()
# Resample to the month.
monthly = rides.resample("M", on="Start date")
# Print the average hours between rides each month
print(monthly['Time since'].mean() / (60 * 60))