# **S01: DATETIMES IN PYTHON**

Date and time data in Python comes in a few flavors:

- *Time stamps* reference particular moments in time (e.g., July 4th, 2015 at 7:00am).
- *Periods* reference a length of time between a particular beginning and end point; for example, the year 2015.
- *Time deltas* or *durations* reference an exact length of time (e.g., a duration of 22.56 seconds).

## Dates and Times in Python

The Python world has a number of available representations of dates, times, deltas, and timespans.
While the time series tools provided by Pandas tend to be the most useful for data science applications, it is helpful to see their relationship to other packages used in Python.

### Native Python dates and times: ``datetime`` and ``dateutil``

Python's basic objects for working with dates and times reside in the built-in ``datetime`` module.
Together with the third-party ``dateutil`` module, you can use it to quickly perform a host of useful operations on dates and times.
For example, you can manually build a date using the ``datetime`` type:

In [17]:
from datetime import datetime
a = datetime(2023, 11, hour=9, day=7, minute=6) 

In [18]:
type(a)

datetime.datetime

Or, using the ``dateutil`` module, you can parse dates from a variety of string formats:

In [19]:
from dateutil import parser
date = parser.parse("December 12th, 2024")
date

datetime.datetime(2024, 12, 12, 0, 0)
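Numeric date strings can be ambiguous, so it helps to know how the parser resolves them. The following is a minimal sketch, assuming an illustrative string (`ambiguous`) that is not part of the original example; it relies on the `dayfirst` argument of `parser.parse`:

```python
from dateutil import parser

# Illustrative ambiguous string: is it July 8th or August 7th?
ambiguous = "07-08-2015"

# By default the first number is read as the month.
month_first = parser.parse(ambiguous)

# With dayfirst=True the first number is read as the day.
day_first = parser.parse(ambiguous, dayfirst=True)

print(month_first.date(), day_first.date())
```

When the source of the data is known, stating the convention explicitly is usually safer than relying on the default.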

### The `strftime` method

The name of this method stands for "string from time". It is very useful for transforming a `datetime` variable into a string that follows the date and time format we want. All the available format codes are listed here: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes

In [26]:
date.strftime("%V") # %V is the ISO 8601 week number of the year

'50'

In [4]:
date.strftime('%d-%B-%Y')

'12-December-2024'
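The inverse operation is `strptime` ("string parse time"), which reads a string using the same format codes. A small round-trip sketch, assuming an English locale for the month name produced by `%B`:

```python
from datetime import datetime

fmt = "%d-%B-%Y"
original = datetime(2024, 12, 12)

# datetime -> formatted string
text = original.strftime(fmt)

# formatted string -> datetime, using the same format codes
parsed = datetime.strptime(text, fmt)

print(text, parsed)
```

Keeping the format in a single variable, as done here, avoids the two directions drifting apart.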

## Dealing with timeseries in Pandas

This section will introduce the fundamental Pandas data structures for working with time series data:

- For *time stamps*, Pandas provides the ``Timestamp`` type. It is essentially a replacement for Python's native ``datetime``, but is based on the more efficient ``numpy.datetime64`` data type. The associated index structure is ``DatetimeIndex``.
- For *time periods*, Pandas provides the ``Period`` type. This encodes a fixed-frequency interval based on ``numpy.datetime64``. The associated index structure is ``PeriodIndex``.
- For *time deltas* or *durations*, Pandas provides the ``Timedelta`` type. ``Timedelta`` is a more efficient replacement for Python's native ``datetime.timedelta`` type, and is based on ``numpy.timedelta64``. The associated index structure is ``TimedeltaIndex``.

**So, in general, for date and time manipulation in pandas bear in mind `Timestamp`, `Period` and `Timedelta`**
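As a quick orientation, here is a minimal sketch that builds one object of each type. The variable names are illustrative, and the values are taken from the examples at the top of this notebook:

```python
import pandas as pd

# A time stamp: one particular moment.
moment = pd.Timestamp("2015-07-04 07:00")

# A period: the whole year 2015, with a start and an end.
year_2015 = pd.Period("2015")

# A time delta: an exact duration.
duration = pd.Timedelta(seconds=22.56)

print(moment, year_2015, duration)
print(year_2015.start_time, year_2015.end_time)
print(moment + duration)  # Timestamp + Timedelta gives another Timestamp
```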

In [5]:
import numpy as np
import pandas as pd

### Operating with `Timestamp` and `Period`

One of the useful things we can do with datetimes in pandas is check whether a specific timestamp falls within a specific period.

In [30]:
pd.Period('2022') # a period spanning the whole year 2022

Period('2022', 'A-DEC')

In [31]:
p = pd.Period('2022-08-01') # a one-day period

timestamp = pd.Timestamp('2022-08-01 20:00')

p.start_time < timestamp < p.end_time

True

In [32]:
p.start_time, p.end_time

(Timestamp('2022-08-01 00:00:00'), Timestamp('2022-08-01 23:59:59.999999999'))
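The same check can be wrapped in a small helper. This is only a sketch; `falls_within` is a hypothetical name, not a pandas function, and it treats both ends of the period as inclusive:

```python
import pandas as pd

def falls_within(timestamp: pd.Timestamp, period: pd.Period) -> bool:
    """Return True if the timestamp lies inside the period (ends included)."""
    return period.start_time <= timestamp <= period.end_time

day = pd.Period("2022-08-01", freq="D")

inside = falls_within(pd.Timestamp("2022-08-01 20:00"), day)
outside = falls_within(pd.Timestamp("2022-08-02 00:00"), day)

print(inside, outside)
```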

### Creating datetimes with `pd.to_datetime` and `pd.to_timedelta` functions

The `pd.to_datetime` function tries to convert the provided input into pandas datetime objects. Its most common use is to convert a **formatted string** into a **datetime**. Without it, we would have to parse each string by hand:

In [35]:
dt_s = pd.Series(["2023-01-01", "2023-01-02"])

def parse_date(element):
    year = int(element.split("-")[0])
    month = int(element.split("-")[1])
    day = int(element.split("-")[2])
    return datetime(year, month, day)

dt_s_converted = dt_s.map(parse_date)

dt_s_converted.dtype
dt_s.dtype

dtype('O')

Or, more concisely, with `pd.to_datetime`:

In [40]:
dt_s = pd.Series(["2023-01-01", "2023-01-02"])

pd.to_datetime(dt_s)

0   2023-01-01
1   2023-01-02
dtype: datetime64[ns]

In [41]:
# convert a string date into a pandas datetime
date = pd.to_datetime("23rd of July, 2024")
date

Timestamp('2024-07-23 00:00:00')

In [42]:
# convert an array of string dates into a pandas datetime array
date = pd.to_datetime(["24th of July, 2024", "25th of July, 2024"])
date

DatetimeIndex(['2024-07-24', '2024-07-25'], dtype='datetime64[ns]', freq=None)

In [48]:
# create a list of datetimes with different formats
dates = pd.to_datetime([
    '4th of July, 2015',
    #'2015-Jul-6',
    datetime(2015, 7, 3),
    '07-07-2015',
    '20150708'
])

dates

ValueError: time data "07-07-2015" doesn't match format "%dth of %B, %Y", at position 2. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

The format is detected automatically, but sometimes the detection fails, as in the error above. To be safe, we can pass the format explicitly through the **format** argument. This only works if every element follows the same format.

In [52]:
date_format = "%d/%m/%Y"
date_format_2 = '%d of %B, %Y'

# use the "format" argument to provide the datetime format 
dates = pd.to_datetime(['2 of December, 2024', "3 of December, 2024"], format=date_format_2) 
dates

DatetimeIndex(['2024-12-02', '2024-12-03'], dtype='datetime64[ns]', freq=None)
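When the formats genuinely differ from element to element, one possible workaround is to parse each element on its own, and when some entries may be invalid, `errors="coerce"` turns them into `NaT` instead of raising. A minimal sketch under those assumptions (the `"not a date"` entry is made up for illustration):

```python
import pandas as pd
from datetime import datetime

# The mixed-format list from the failing cell above.
mixed = ["4th of July, 2015", datetime(2015, 7, 3), "07-07-2015", "20150708"]

# Parse each element independently, then collect the results in an index.
per_element = pd.DatetimeIndex([pd.to_datetime(item) for item in mixed])

# With an explicit format, unparseable entries become NaT when coerced.
coerced = pd.to_datetime(
    ["2 of December, 2024", "not a date"],
    format="%d of %B, %Y",
    errors="coerce",
)

print(per_element)
print(coerced)
```

Parsing element by element is slower than a single vectorized call, so it is best kept for small or messy inputs.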

In [None]:
dates.to_period("H")

PeriodIndex(['2024-12-02 00:00', '2024-12-03 00:00'], dtype='period[H]')

Additionally, we can create timedeltas (time spans) with `pd.to_timedelta`:

In [62]:
# create a timedelta of roughly 1 hour and 40 minutes
span = pd.to_timedelta(1.666667, unit="H")
span

Timedelta('0 days 01:40:00.001200')

Timedeltas can be used to perform operations with datetime objects in pandas. For example:

In [60]:
pd.to_datetime("3rd of September, 2024") + span

Timestamp('2024-09-03 01:40:00.001200')

In [61]:
pd.to_datetime('2024-09-03 16:44') - span

Timestamp('2024-09-03 15:03:59.998800')

The same works with arrays of timedeltas:

In [63]:
list(np.arange(12))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

In [64]:
spans = pd.to_timedelta(np.arange(12), 'H')
spans 

TimedeltaIndex(['0 days 00:00:00', '0 days 01:00:00', '0 days 02:00:00',
                '0 days 03:00:00', '0 days 04:00:00', '0 days 05:00:00',
                '0 days 06:00:00', '0 days 07:00:00', '0 days 08:00:00',
                '0 days 09:00:00', '0 days 10:00:00', '0 days 11:00:00'],
               dtype='timedelta64[ns]', freq=None)

In [67]:
datetimes = pd.to_datetime("23rd of July, 2024") + spans[3:6]

datetimes

DatetimeIndex(['2024-07-23 03:00:00', '2024-07-23 04:00:00',
               '2024-07-23 05:00:00'],
              dtype='datetime64[ns]', freq=None)
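The arithmetic also works in the other direction: subtracting two timestamps yields a `Timedelta`. A short sketch with illustrative values:

```python
import pandas as pd

start = pd.Timestamp("2024-09-03 08:00")
end = pd.Timestamp("2024-09-03 16:44")

# Timestamp - Timestamp gives a Timedelta.
elapsed = end - start

# Convert the duration to a plain number of hours.
hours = elapsed.total_seconds() / 3600

# Dividing one Timedelta by another gives a plain float ratio.
half_hours = elapsed / pd.Timedelta(minutes=30)

print(elapsed, hours, half_hours)
```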

In [69]:
# create a dataframe and convert one column into the index

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

df.set_index("b")

| b | a |
|---|---|
| 3 | 1 |
| 4 | 2 |


### Indexing by Time

Where the Pandas time series tools really become useful is when you begin to *index data by timestamps*.
For example, we can construct a ``Series`` object that has time indexed data:

In [70]:
index = pd.DatetimeIndex([
    '2014-07-04',
    '2014-08-04',
    '2015-07-04',
    '2015-08-04'
])

data = pd.Series([0, 1, 2, 3], index=index)
data

2014-07-04    0
2014-08-04    1
2015-07-04    2
2015-08-04    3
dtype: int64

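Now that we have this data in a ``Series``, we can make use of any of the ``Series`` indexing patterns we discussed in previous sections, passing values that can be coerced into dates. Note that a bare month string such as ``'07'`` does not select every July in the index; the lookup below raises a ``KeyError``:

In [74]:
data['07'] # a month on its own is not a usable date key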

KeyError: '7'

There are additional special date-only indexing operations, such as passing a year to obtain a slice of all data from that year:

In [72]:
data['2015']

2015-07-04    2
2015-08-04    3
dtype: int64
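To select by month or by a date range, a few alternatives are available. The sketch below rebuilds the same `data` series so that it runs on its own, and uses `.loc` for label-based selection:

```python
import pandas as pd

index = pd.DatetimeIndex(["2014-07-04", "2014-08-04", "2015-07-04", "2015-08-04"])
data = pd.Series([0, 1, 2, 3], index=index)

# Partial string indexing: everything in July 2015.
one_month = data.loc["2015-07"]

# Slicing between two dates (both ends are included).
date_slice = data.loc["2014-08-04":"2015-07-04"]

# Every July, regardless of the year, via the index's month attribute.
all_julys = data[data.index.month == 7]

print(one_month, date_slice, all_julys, sep="\n\n")
```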

### Create sequences with `pd.date_range()`, `pd.period_range()` and `pd.timedelta_range()`

To make the creation of regular date sequences more convenient, Pandas offers a few functions for this purpose: ``pd.date_range()`` for timestamps, ``pd.period_range()`` for periods, and ``pd.timedelta_range()`` for time deltas.

In [75]:
# create a daily range between two dates
pd.date_range('2015-07-03', '2015-07-20', freq="D")

DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-05', '2015-07-06',
               '2015-07-07', '2015-07-08', '2015-07-09', '2015-07-10',
               '2015-07-11', '2015-07-12', '2015-07-13', '2015-07-14',
               '2015-07-15', '2015-07-16', '2015-07-17', '2015-07-18',
               '2015-07-19', '2015-07-20'],
              dtype='datetime64[ns]', freq='D')

In [76]:
# create an hourly range between two dates
pd.date_range('2015-07-03', '2015-07-10', freq="H")

DatetimeIndex(['2015-07-03 00:00:00', '2015-07-03 01:00:00',
               '2015-07-03 02:00:00', '2015-07-03 03:00:00',
               '2015-07-03 04:00:00', '2015-07-03 05:00:00',
               '2015-07-03 06:00:00', '2015-07-03 07:00:00',
               '2015-07-03 08:00:00', '2015-07-03 09:00:00',
               ...
               '2015-07-09 15:00:00', '2015-07-09 16:00:00',
               '2015-07-09 17:00:00', '2015-07-09 18:00:00',
               '2015-07-09 19:00:00', '2015-07-09 20:00:00',
               '2015-07-09 21:00:00', '2015-07-09 22:00:00',
               '2015-07-09 23:00:00', '2015-07-10 00:00:00'],
              dtype='datetime64[ns]', length=169, freq='H')

Alternatively, the date range can be specified not with a start and endpoint, but with a startpoint and a number of periods:

In [77]:
pd.date_range('2015-07', periods=8)

DatetimeIndex(['2015-07-01', '2015-07-02', '2015-07-03', '2015-07-04',
               '2015-07-05', '2015-07-06', '2015-07-07', '2015-07-08'],
              dtype='datetime64[ns]', freq='D')

The spacing can be modified by altering the ``freq`` argument, which defaults to ``D``.
For example, here we will construct a range of timestamps spaced two days apart:

In [79]:
pd.date_range('2015-07-03', periods=8, freq='2D') 

DatetimeIndex(['2015-07-03', '2015-07-05', '2015-07-07', '2015-07-09',
               '2015-07-11', '2015-07-13', '2015-07-15', '2015-07-17'],
              dtype='datetime64[ns]', freq='2D')

To create regular sequences of ``Period`` or ``Timedelta`` values, the very similar ``pd.period_range()`` and ``pd.timedelta_range()`` functions are useful.
Here are some monthly periods:

In [80]:
pd.period_range('2015-07', periods=8, freq='M')

PeriodIndex(['2015-07', '2015-08', '2015-09', '2015-10', '2015-11', '2015-12',
             '2016-01', '2016-02'],
            dtype='period[M]')

And a sequence of durations increasing by an hour:

In [84]:
pd.timedelta_range(0, periods=10, freq='H')

TimedeltaIndex(['0 days 00:00:00', '0 days 01:00:00', '0 days 02:00:00',
                '0 days 03:00:00', '0 days 04:00:00', '0 days 05:00:00',
                '0 days 06:00:00', '0 days 07:00:00', '0 days 08:00:00',
                '0 days 09:00:00'],
               dtype='timedelta64[ns]', freq='H')

All of these require an understanding of Pandas frequency codes, which we'll summarize in the next section.

#### Frequencies and Offsets

Fundamental to these Pandas time series tools is the concept of a frequency or date offset.
Just as we saw the ``D`` (day) and ``H`` (hour) codes above, we can use such codes to specify any desired frequency spacing.
The following table summarizes the main codes available:

| Code   | Description         | Code   | Description          |
|--------|---------------------|--------|----------------------|
| ``D``  | Calendar day        | ``B``  | Business day         |
| ``W``  | Weekly              |        |                      |
| ``M``  | Month end           | ``BM`` | Business month end   |
| ``Q``  | Quarter end         | ``BQ`` | Business quarter end |
| ``A``  | Year end            | ``BA`` | Business year end    |
| ``H``  | Hours               | ``BH`` | Business hours       |
| ``T``  | Minutes             |        |                      |
| ``S``  | Seconds             |        |                      |
| ``L``  | Milliseconds        |        |                      |
| ``U``  | Microseconds        |        |                      |
| ``N``  | Nanoseconds         |        |                      |

The monthly, quarterly, and annual frequencies are all marked at the end of the specified period.
By adding an ``S`` suffix to any of these, they instead will be marked at the beginning:

| Code    | Description            | Code    | Description            |
|---------|------------------------|---------|------------------------|
| ``MS``  | Month start            |``BMS``  | Business month start   |
| ``QS``  | Quarter start          |``BQS``  | Business quarter start |
| ``AS``  | Year start             |``BAS``  | Business year start    |
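As a small illustration of the start-anchored codes, the sketch below begins in the middle of a month; the first value rolls forward to the next month or quarter start. The start date is chosen only for illustration:

```python
import pandas as pd

# Month starts: the first value is the next first-of-month after the start date.
month_starts = pd.date_range("2015-01-15", periods=3, freq="MS")

# Quarter starts (quarters beginning in January, April, July, October).
quarter_starts = pd.date_range("2015-01-15", periods=3, freq="QS")

print(month_starts)
print(quarter_starts)
```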

Additionally, you can change the month used to mark any quarterly or annual code by adding a three-letter month code as a suffix:

- ``Q-JAN``, ``BQ-FEB``, ``QS-MAR``, ``BQS-APR``, etc.
- ``A-JAN``, ``BA-FEB``, ``AS-MAR``, ``BAS-APR``, etc.

In the same way, the split-point of the weekly frequency can be modified by adding a three-letter weekday code:

- ``W-SUN``, ``W-MON``, ``W-TUE``, ``W-WED``, etc.
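A brief sketch of an anchored weekly frequency next to the business-day code; the start dates are illustrative (July 1st, 2015 was a Wednesday and July 3rd a Friday):

```python
import pandas as pd

# Weekly frequency anchored on Mondays: starts at the first Monday on or after the start date.
mondays = pd.date_range("2015-07-01", periods=3, freq="W-MON")

# Business days skip Saturday and Sunday.
business_days = pd.date_range("2015-07-03", periods=3, freq="B")

print(mondays)
print(business_days)
```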

On top of this, codes can be combined with numbers to specify other frequencies.
For example, for a frequency of 2 hours 30 minutes, we can combine the hour (``H``) and minute (``T``) codes as follows:

In [85]:
pd.timedelta_range(0, periods=9, freq="2H30T")

TimedeltaIndex(['0 days 00:00:00', '0 days 02:30:00', '0 days 05:00:00',
                '0 days 07:30:00', '0 days 10:00:00', '0 days 12:30:00',
                '0 days 15:00:00', '0 days 17:30:00', '0 days 20:00:00'],
               dtype='timedelta64[ns]', freq='150T')
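The step can also be passed as a `Timedelta` object instead of a string code. This sketch produces the same 2 hours 30 minutes spacing and does not depend on the spelling of the alias, which is convenient because recent pandas releases prefer lowercase aliases such as ``h`` and ``min`` over ``H`` and ``T``:

```python
import pandas as pd

# The step expressed as a duration rather than a frequency string.
step = pd.Timedelta(hours=2, minutes=30)

spans = pd.timedelta_range(start="0 days", periods=4, freq=step)

print(spans)
```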

## Resampling, Shifting, and Windowing

The ability to use dates and times as indices to intuitively organize and access data is an important piece of the Pandas time series tools.
The benefits of indexed data in general (automatic alignment during operations, intuitive data slicing and access, etc.) still apply, and Pandas provides several additional time series-specific operations.

We will take a look at a few of those here, using some stock price data as an example. Install the `yfinance` package (for example, via ``pip install yfinance``) and download Google's stock price history:

In [90]:
!pip install yfinance



In [86]:
import yfinance as yf

goog = yf.download('GOOG', start='2023-01-01', end='2024-09-03')
goog.head()

[*********************100%%**********************]  1 of 1 completed


| Date       | Open      | High      | Low       | Close     | Adj Close | Volume   |
|------------|-----------|-----------|-----------|-----------|-----------|----------|
| 2023-01-03 | 89.830002 | 91.550003 | 89.019997 | 89.699997 | 89.598038 | 20738500 |
| 2023-01-04 | 91.010002 | 91.239998 | 87.800003 | 88.709999 | 88.609169 | 27046500 |
| 2023-01-05 | 88.07     | 88.209999 | 86.559998 | 86.769997 | 86.671371 | 23136100 |
| 2023-01-06 | 87.360001 | 88.470001 | 85.57     | 88.160004 | 88.059799 | 26612600 |
| 2023-01-09 | 89.195    | 90.830002 | 88.580002 | 88.800003 | 88.699066 | 22996700 |


For simplicity, we'll use just the closing price:

In [87]:
goog = goog['Close']

goog

Date
2023-01-03     89.699997
2023-01-04     88.709999
2023-01-05     86.769997
2023-01-06     88.160004
2023-01-09     88.800003
                 ...    
2024-08-26    167.929993
2024-08-27    166.380005
2024-08-28    164.500000
2024-08-29    163.399994
2024-08-30    165.110001
Name: Close, Length: 418, dtype: float64

In [88]:
import matplotlib.pyplot as plt

import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"

In [89]:
px.line(goog, title="GOOG Stock")




### Resampling and converting frequencies

One common need for time series data is resampling at a higher or lower frequency.
This can be done using the ``resample()`` method, or the much simpler ``asfreq()`` method.
The primary difference between the two is that ``resample()`` is fundamentally a *data aggregation*, while ``asfreq()`` is fundamentally a *data selection*.

Taking a look at the Google closing price, let's compare what the two return when we down-sample the data.
Here we will resample the data at the end of each month:

In [95]:
goog.head()

Date
2023-01-03    89.699997
2023-01-04    88.709999
2023-01-05    86.769997
2023-01-06    88.160004
2023-01-09    88.800003
Name: Close, dtype: float64

In [91]:
goog_resample = goog.resample('M').mean()
goog_freq = goog.asfreq('M')

In [None]:
goog_resample

Date
2023-01-31     94.016001
2023-02-28     96.808948
2023-03-31     98.558696
2023-04-30    106.348422
2023-05-31    116.745682
2023-06-30    123.228096
2023-07-31    123.553499
2023-08-31    131.149131
2023-09-30    135.196502
2023-10-31    135.354091
2023-11-30    134.868570
2023-12-31    136.907500
2024-01-31    145.425714
2024-02-29    144.067999
2024-03-31    143.481499
2024-04-30    158.730909
2024-05-31    173.573636
2024-06-30    179.243684
2024-07-31    186.645001
Freq: M, Name: Close, dtype: float64

In [96]:
goog_freq

Date
2023-01-31     99.870003
2023-02-28     90.300003
2023-03-31    104.000000
2023-04-30           NaN
2023-05-31    123.370003
2023-06-30    120.970001
2023-07-31    133.110001
2023-08-31    137.350006
2023-09-30           NaN
2023-10-31    125.300003
2023-11-30    133.919998
2023-12-31           NaN
2024-01-31    141.800003
2024-02-29    139.779999
2024-03-31           NaN
2024-04-30    164.639999
2024-05-31    173.960007
2024-06-30           NaN
2024-07-31    173.149994
Freq: M, Name: Close, dtype: float64
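The `NaN` values appear because ``asfreq()`` only selects the value found at each target date, and some month ends fall on days with no trading. A self-contained sketch on synthetic data (not the GOOG prices) shows the same effect with a weekly frequency:

```python
import pandas as pd

# Ten business days starting Monday 2024-01-01, with values 0.0 to 9.0.
days = pd.date_range("2024-01-01", periods=10, freq="B")
prices = pd.Series(range(10), index=days, dtype="float64")

# Aggregation: the mean of each week, labelled by the Sunday that closes it.
weekly_mean = prices.resample("W").mean()

# Selection: the value found exactly on each Sunday, which does not exist here.
weekly_pick = prices.asfreq("W")

print(weekly_mean)
print(weekly_pick)
```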

In [94]:
figure = px.line(
    {"resample": goog_resample, "as_freq": goog_freq},
    line_dash_sequence=["dashdot"],
    title="GOOG Stock (Monthly)",
)
# overlay the original daily series in green
figure.add_trace(px.line({"daily": goog}, color_discrete_sequence=["green"]).data[0])


The behavior of DatetimeProperties.to_pydatetime is deprecated, in a future version this will return a Series containing python datetime objects instead of an ndarray. To retain the old behavior, call `np.array` on the result





In this case we have down-sampled the time series data.

For up-sampling, ``resample()`` and ``asfreq()`` are largely equivalent, though resample has many more options available.
In this case, the default for both methods is to leave the up-sampled points empty, that is, filled with NA values.
Just as with the ``fillna()`` method discussed previously, ``asfreq()`` accepts a ``method`` argument to specify how values are imputed.
Here, we will resample the business day data at a daily frequency (i.e., including weekends):

In [97]:
goog_d = goog.asfreq('D')
goog_d_fill = goog.asfreq('D', method='bfill')

In [98]:
px.line({
    "empty_weekends":goog_d+10, 
    "filled_weekends":goog_d_fill
}, title="GOOG Stock (Daily)")
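The effect of the ``method`` argument is easier to see on a tiny synthetic series. The sketch below uses two made-up values, a Friday and the following Monday, and compares no filling, forward filling and backward filling:

```python
import pandas as pd

# Friday 2024-01-05 and Monday 2024-01-08 (illustrative values).
trading = pd.Series([1.0, 2.0], index=pd.to_datetime(["2024-01-05", "2024-01-08"]))

with_gaps = trading.asfreq("D")                  # weekend left as NaN
forward = trading.asfreq("D", method="ffill")    # weekend takes Friday's value
backward = trading.asfreq("D", method="bfill")   # weekend takes Monday's value

print(pd.DataFrame({"gaps": with_gaps, "ffill": forward, "bfill": backward}))
```

For price data, forward filling is often the safer choice because backward filling copies a future value into the past.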





### Time-shifts with `shift()`

Another common time series-specific operation is shifting data in time, which is done with the `shift()` method.

In [101]:
goog.head()

Date
2023-01-03    89.699997
2023-01-04    88.709999
2023-01-05    86.769997
2023-01-06    88.160004
2023-01-09    88.800003
Name: Close, dtype: float64

In [109]:
goog_sh = goog.shift(periods=1)

goog_sh.head()

Date
2023-01-03          NaN
2023-01-04    89.699997
2023-01-05    88.709999
2023-01-06    86.769997
2023-01-09    88.160004
Name: Close, dtype: float64

In [111]:
goog_df = pd.DataFrame({
    "original": goog,
    "original_shift_1": goog_sh
})

goog_df.corr()

|                  | original | original_shift_1 |
|------------------|----------|------------------|
| original         | 1.0      | 0.995496         |
| original_shift_1 | 0.995496 | 1.0              |


In [108]:
px.line({"original":goog, "shifted":goog_sh}, title="GOOG Stock (Daily)")





This feature is very useful in machine learning forecasting problems, where the target variable is often a future value of the same series.
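A minimal sketch of that idea on synthetic prices: `shift(1)` brings the previous value onto each row, while `shift(-1)` brings the next value, which can serve as the target. The column names are illustrative:

```python
import pandas as pd

prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

frame = pd.DataFrame({
    "price": prices,
    "previous": prices.shift(1),      # yesterday's price (a feature)
    "target_next": prices.shift(-1),  # tomorrow's price (the target)
})

# Daily return computed from the shifted column.
frame["daily_return"] = frame["price"] / frame["previous"] - 1

# The first and last rows have no previous or next value, so drop them.
train = frame.dropna()

print(train)
```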

### Rolling windows

Rolling statistics are a third type of time series-specific operation implemented by Pandas.
These can be accomplished via the ``rolling()`` method of ``Series`` and ``DataFrame`` objects, which returns a view similar to what we saw with the ``groupby`` operation.
This rolling view makes available a number of aggregation operations by default.

For example, here are the rolling mean and standard deviation of the Google stock prices over a window of 30 trading days:

In [116]:
rolling = goog.rolling(30)

data = pd.DataFrame({
    'input': goog,
    'rolling_mean': rolling.mean(),
    'rolling_std': rolling.std()
})

data

| Date       | input      | rolling_mean | rolling_std |
|------------|------------|--------------|-------------|
| 2023-01-03 | 89.699997  | NaN          | NaN         |
| 2023-01-04 | 88.709999  | NaN          | NaN         |
| 2023-01-05 | 86.769997  | NaN          | NaN         |
| 2023-01-06 | 88.160004  | NaN          | NaN         |
| 2023-01-09 | 88.800003  | NaN          | NaN         |
| ...        | ...        | ...          | ...         |
| 2024-08-26 | 167.929993 | 169.989999   | 7.272992    |
| 2024-08-27 | 166.380005 | 169.352666   | 6.680600    |
| 2024-08-28 | 164.500000 | 168.748666   | 6.244622    |
| 2024-08-29 | 163.399994 | 168.221332   | 5.992757    |


In [117]:
px.line(data)
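The leading `NaN` values in the rolling columns come from windows that are not yet full. A short sketch on synthetic numbers shows this, together with the `min_periods` argument, which allows partial windows:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# A window of 3: the first two results are NaN because the window is incomplete.
strict = values.rolling(3).mean()

# min_periods=1 computes the mean over whatever observations are available.
lenient = values.rolling(3, min_periods=1).mean()

print(pd.DataFrame({"values": values, "strict": strict, "lenient": lenient}))
```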





## The `dt` attribute in Series

The `dt` attribute of a pandas Series is an accessor that exposes the datetime properties and methods of the values in the series, much like those available on a `DatetimeIndex`. It provides many convenient functions for working with dates and times.

The `dt` attribute is only available for Series objects that contain datetime values. If the series does not contain datetime values, attempting to access the dt attribute will raise an `AttributeError`.

In [118]:
today = pd.to_datetime("2024-09-05")

In [122]:
today.day_of_week

3

In [128]:
# Create a series with datetime values
s = pd.Series(['2022-01-01', '2022-02-01', '2022-03-01'], dtype='datetime64[ns]')

s.dt.day_of_week

0    5
1    1
2    1
dtype: int32

The `dt` attribute provides access to properties such as the following:

- `year`: the year of the datetime
- `month`: the month of the datetime
- `day`: the day of the datetime
- `hour`: the hour of the datetime
- `minute`: the minute of the datetime
- `second`: the second of the datetime
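Besides these properties, the accessor also offers methods. The sketch below uses `dt.day_name()` and `dt.strftime()`, and also shows the `AttributeError` raised on a series of plain strings; the English day and month names assume the default locale:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2022-01-01", "2022-02-01", "2022-03-01"]))

names = s.dt.day_name()               # weekday names
labels = s.dt.strftime("%d-%B-%Y")    # formatted strings, element by element
is_weekend = s.dt.day_of_week >= 5    # Saturday is 5, Sunday is 6

# A series of strings has no datetime values, so .dt is not available.
text = pd.Series(["2022-01-01"])
try:
    text.dt.year
    raised = False
except AttributeError:
    raised = True

print(names.tolist(), labels.tolist(), is_weekend.tolist(), raised)
```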

In [129]:
# Get the year of each datetime
s.dt.year

0    2022
1    2022
2    2022
dtype: int32

In [130]:
# Get the month of each datetime
s.dt.month

0    1
1    2
2    3
dtype: int32

In [131]:
# Get the day of each datetime
s.dt.day

0    1
1    1
2    1
dtype: int32

In [135]:
df = pd.DataFrame({
    "date": pd.to_datetime(['2022-01-01', '2022-02-01', '2022-03-01']),
    "values": [12, 23, 435]
})

df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["day_of_week"] = df["date"].dt.day_of_week

df.drop("date", axis=1)

|   | values | year | month | day | day_of_week |
|---|--------|------|-------|-----|-------------|
| 0 | 12     | 2022 | 1     | 1   | 5           |
| 1 | 23     | 2022 | 2     | 1   | 1           |
| 2 | 435    | 2022 | 3     | 1   | 1           |
