# [Introduction to Date and Time in Pandas](#)

Time series data is a sequence of data points indexed in time order. Pandas provides powerful tools for working with time-based data, making it an essential library for tasks such as financial analysis, scientific research, and business intelligence.


Working with dates and times is crucial in many data analysis tasks. Here's why datetime functionality is so important:

- **Data Organization**: Time-based indexing allows for intuitive data slicing and selection.
- **Trend Analysis**: Enables the identification of patterns and trends over time.
- **Forecasting**: Facilitates predictive modeling based on historical time series data.
- **Event Analysis**: Helps in studying the impact of specific events on time-dependent variables.
- **Data Aggregation**: Allows for easy grouping and summarizing of data by various time periods (e.g., daily, monthly, yearly).
- **Cross-Dataset Comparison**: Enables alignment and comparison of multiple datasets based on timestamps.


<img src="../images/time-series.png" width="800">

Pandas offers a rich set of tools and functions for handling time series data:


1. **Datetime objects**: Pandas provides `Timestamp` for representing a single point in time and `DatetimeIndex` for indexing time series.


In [1]:
import pandas as pd
import numpy as np

In [2]:
# Creating a DatetimeIndex
dates = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
s = pd.Series(np.random.randn(len(dates)), index=dates)
s

2023-01-01   -0.132048
2023-01-02   -0.317583
2023-01-03   -0.824561
2023-01-04   -0.431450
2023-01-05    0.025972
2023-01-06   -0.790383
2023-01-07   -0.251666
2023-01-08   -1.096422
2023-01-09   -0.253983
2023-01-10   -1.458300
Freq: D, dtype: float64

2. **Date ranges**: Easy creation of regular time series.


In [3]:
# Creating a date range
pd.date_range(start='2023-01-01', periods=7, freq='D')

DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06', '2023-01-07'],
              dtype='datetime64[ns]', freq='D')

3. **Flexible parsing**: Convert various string formats to datetime objects.


In [4]:
# Parsing dates
pd.to_datetime(['2023-01-01', '20230102', '01/03/2023'], format="mixed")

DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03'], dtype='datetime64[ns]', freq=None)

4. **Time zone handling**: Support for localization and conversion between time zones.


In [5]:
# Time zone conversion
ts = pd.Timestamp('2023-01-01 12:00:00', tz='UTC')
ts.tz_convert('US/Eastern')

Timestamp('2023-01-01 07:00:00-0500', tz='US/Eastern')

5. **Resampling**: Change the frequency of time series data.


In [6]:
# Resampling to monthly frequency
monthly = s.resample('ME').mean()
monthly

2023-01-31   -0.553042
Freq: ME, dtype: float64

6. **Rolling windows**: Compute moving statistics.


In [7]:
# Computing 3-day rolling mean
s.rolling(window=3).mean()

2023-01-01         NaN
2023-01-02         NaN
2023-01-03   -0.424731
2023-01-04   -0.524532
2023-01-05   -0.410013
2023-01-06   -0.398621
2023-01-07   -0.338692
2023-01-08   -0.712824
2023-01-09   -0.534024
2023-01-10   -0.936235
Freq: D, dtype: float64

7. **Shifting and lagging**: Easily offset your data in time.


In [8]:
# Shifting the series by 2 periods (2 days at daily frequency)
s.shift(2)

2023-01-01         NaN
2023-01-02         NaN
2023-01-03   -0.132048
2023-01-04   -0.317583
2023-01-05   -0.824561
2023-01-06   -0.431450
2023-01-07    0.025972
2023-01-08   -0.790383
2023-01-09   -0.251666
2023-01-10   -1.096422
Freq: D, dtype: float64

8. **Period functionality**: Work with time spans rather than specific timestamps.


In [9]:
# Creating a PeriodIndex
periods = pd.period_range(start='2023-01', end='2023-12', freq='M')
pd.Series(np.random.randn(len(periods)), index=periods)

2023-01   -0.367263
2023-02   -0.413871
2023-03    0.262793
2023-04    0.276197
2023-05    0.807989
2023-06   -1.085122
2023-07    0.456688
2023-08   -0.134236
2023-09   -0.540731
2023-10   -0.251733
2023-11   -1.831110
2023-12   -0.851892
Freq: M, dtype: float64

These capabilities make Pandas an extremely powerful tool for handling time series data. Throughout this lecture series, we'll explore these features in depth, providing you with the skills to effectively analyze and manipulate time-based data.


In the upcoming sections, we'll dive deeper into the specifics of working with datetime objects, indexing, and performing various operations on time series data in Pandas.

## <a id='toc1_'></a>[Datetime Data Types](#toc0_)

Pandas provides several specialized data types for working with dates and times. These types are crucial for effective time series analysis and manipulation.


### <a id='toc1_1_'></a>[Timestamp](#toc0_)


A `Timestamp` object represents a single point in time. It is the Pandas counterpart of Python's `datetime.datetime` (and a subclass of it), with additional features tailored for use in Pandas, such as nanosecond resolution.


In [10]:
# Creating a Timestamp
ts = pd.Timestamp('2023-06-15 14:30:00')
ts

Timestamp('2023-06-15 14:30:00')

When you create a Timestamp, Pandas parses the string and creates an object representing that exact moment in time. This Timestamp object has various attributes that you can access:


In [11]:
# Accessing attributes
ts.year

2023

In [12]:
ts.month

6

In [13]:
ts.day

15

In [14]:
ts.hour

14

In [15]:
ts.minute

30

In [16]:
ts.second

0

These attributes allow you to easily extract specific components of the date and time.


Timestamps can also include timezone information. This is particularly useful when working with data from different regions or when performing calculations across time zones:


In [17]:
# Creating Timestamp with timezone
ts_tz = pd.Timestamp('2023-06-15 14:30:00', tz='UTC')
ts_tz

Timestamp('2023-06-15 14:30:00+0000', tz='UTC')

In this case, we've created a Timestamp in the UTC timezone. The `tz` argument also accepts IANA timezone names such as `'US/Eastern'`, which makes it practical for international data analysis.
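The two timezone operations that usually appear together can be sketched as follows. This is a minimal illustration, assuming the zone name `'Europe/Berlin'` is available in the local timezone database; the variable names are made up for the example. `tz_localize()` attaches a zone to a naive Timestamp, while `tz_convert()` re-expresses an aware Timestamp in another zone.

```python
import pandas as pd

# A naive Timestamp carries no timezone information
naive = pd.Timestamp('2023-06-15 14:30:00')

# tz_localize attaches a timezone without changing the wall-clock time
utc_ts = naive.tz_localize('UTC')

# tz_convert keeps the same instant but displays it in another zone
berlin_ts = utc_ts.tz_convert('Europe/Berlin')

print(naive.tz)    # None
print(utc_ts)      # 2023-06-15 14:30:00+00:00
print(berlin_ts)   # 2023-06-15 16:30:00+02:00
```

Both aware Timestamps compare as equal because they describe the same instant.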


### <a id='toc1_2_'></a>[DatetimeIndex](#toc0_)


A `DatetimeIndex` is an Index object that contains Timestamp objects. It's commonly used as the index for time series data in Pandas, allowing for powerful time-based operations.


In [18]:
# Creating a DatetimeIndex
dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
dates

DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05'],
              dtype='datetime64[ns]', freq='D')

Here, we've created a DatetimeIndex spanning 5 days, starting from January 1, 2023. The 'D' frequency specifies daily intervals. This index can now be used to create a time series:


In [19]:
# Creating a Series with DatetimeIndex
s = pd.Series(range(5), index=dates)
s

2023-01-01    0
2023-01-02    1
2023-01-03    2
2023-01-04    3
2023-01-05    4
Freq: D, dtype: int64

This Series now has dates as its index, enabling intuitive time-based selection:


In [20]:
# Selecting data using DatetimeIndex
s['2023-01-03']  # Select a single date

2

In [21]:
s['2023-01-02':'2023-01-04']  # Select a range of dates

2023-01-02    1
2023-01-03    2
2023-01-04    3
Freq: D, dtype: int64

DatetimeIndex also enables powerful resampling operations. For example, we can easily aggregate our daily data into 2-day periods:


In [22]:
# Resampling with DatetimeIndex
s.resample('2D').sum()

2023-01-01    1
2023-01-03    5
2023-01-05    4
Freq: 2D, dtype: int64

This resampling operation sums the values for every two-day period, demonstrating how DatetimeIndex facilitates time-based data aggregation.


### <a id='toc1_3_'></a>[Timedelta and TimedeltaIndex](#toc0_)


`Timedelta` represents a duration or difference between two dates or times. It's useful for performing date arithmetic and representing time spans.


In [23]:
# Creating a Timedelta
td = pd.Timedelta(days=2, hours=3, minutes=30)
td

Timedelta('2 days 03:30:00')

This Timedelta represents a duration of 2 days, 3 hours, and 30 minutes. We can use it in arithmetic operations with Timestamps:


In [24]:
# Arithmetic with Timestamps
ts = pd.Timestamp('2023-06-15 14:30:00')
ts + td

Timestamp('2023-06-17 18:00:00')

This calculation adds our Timedelta to the Timestamp, giving us a new Timestamp 2 days, 3 hours, and 30 minutes later.
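The reverse operation also works: subtracting one Timestamp from another returns a Timedelta. A short sketch, with illustrative variable names:

```python
import pandas as pd

start = pd.Timestamp('2023-06-15 14:30:00')
end = pd.Timestamp('2023-06-17 18:00:00')

# Subtracting two Timestamps yields a Timedelta
elapsed = end - start

print(elapsed)                  # 2 days 03:30:00
print(elapsed.days)             # 2
print(elapsed.total_seconds())  # 185400.0
```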


A `TimedeltaIndex` is an Index of Timedelta objects, useful for representing sequences of durations:


In [25]:
# Creating a TimedeltaIndex
tdi = pd.timedelta_range(start='1 day', end='5 days', freq='D')
tdi

TimedeltaIndex(['1 days', '2 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq='D')

This creates a TimedeltaIndex representing 1 to 5 days. We can use this to create a Series:


In [26]:
# Creating a Series with TimedeltaIndex
s_td = pd.Series(range(5), index=tdi)
s_td

1 days    0
2 days    1
3 days    2
4 days    3
5 days    4
Freq: D, dtype: int64

Now we have a Series indexed by time durations. We can select data based on these durations:


In [27]:
# Selecting data using TimedeltaIndex
s_td['2 days':'4 days']

2 days    1
3 days    2
4 days    3
Freq: D, dtype: int64

This selects all data points with durations between 2 and 4 days.


Timedelta and TimedeltaIndex are particularly useful in scenarios where you're interested in time differences rather than absolute dates, such as calculating time to maturity in financial applications or tracking task durations in project management.


Here's a practical example combining these concepts:


In [28]:
# Example: Calculating time differences
dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
df = pd.DataFrame({'Date': dates, 'Value': range(5)})
df['TimeSinceStart'] = df['Date'] - df['Date'].min()
df

        Date  Value TimeSinceStart
0 2023-01-01      0         0 days
1 2023-01-02      1         1 days
2 2023-01-03      2         2 days
3 2023-01-04      3         3 days
4 2023-01-05      4         4 days


In this example, we create a DataFrame with a date column and a value column. We then calculate a new column 'TimeSinceStart' which shows the duration since the start date for each row. This demonstrates how Timedelta can be used to compute time differences within a dataset.
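A Timedelta column often needs to be turned into plain numbers before plotting or modeling. The sketch below rebuilds the same small DataFrame and converts the elapsed time with the `.dt` accessor; the names `elapsed`, `days`, and `hours` are illustrative.

```python
import pandas as pd

dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
df = pd.DataFrame({'Date': dates, 'Value': range(5)})
elapsed = df['Date'] - df['Date'].min()

# Whole days as integers
days = elapsed.dt.days

# Total duration in hours as floats
hours = elapsed.dt.total_seconds() / 3600

print(days.tolist())   # [0, 1, 2, 3, 4]
print(hours.tolist())  # [0.0, 24.0, 48.0, 72.0, 96.0]
```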


These datetime data types form the foundation for working with time series data in Pandas. They allow for intuitive representation of dates, times, and time periods, and enable powerful time-based operations and analyses. As you work more with time series data, you'll find these types invaluable for various data manipulation and analysis tasks.

## <a id='toc2_'></a>[Creating Datetime Objects](#toc0_)

Pandas provides several methods for creating datetime objects, which are essential for working with time series data. Let's explore the main approaches: using `to_datetime()`, `date_range()`, and specifying frequencies.


### <a id='toc2_1_'></a>[Using `to_datetime()`](#toc0_)


The `to_datetime()` function is a versatile tool for converting various input formats into Pandas datetime objects. It can handle strings, numbers, and even mixed-type data.


Let's start with a simple example:


In [29]:
# Converting a string to datetime
date_str = '2023-06-15'
pd.to_datetime(date_str)

Timestamp('2023-06-15 00:00:00')

This converts the string '2023-06-15' into a Timestamp object. But `to_datetime()` is much more flexible:


In [30]:
# Converting multiple strings with different formats
date_strings = ['2023-06-15', '20230616', '06/17/2023']
pd.to_datetime(date_strings, format="mixed")

DatetimeIndex(['2023-06-15', '2023-06-16', '2023-06-17'], dtype='datetime64[ns]', freq=None)

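Here, `format="mixed"` tells `to_datetime()` to infer the format of each element individually, so differently formatted strings can be parsed in a single call. This is particularly useful when dealing with inconsistent date representations in your data, but ambiguous strings such as '01/02/2023' are read month-first by default, so it is worth checking the results.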
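When inconsistent input also contains values that are not dates at all, one common option is `errors='coerce'`. A minimal sketch, using made-up sample strings:

```python
import pandas as pd

raw = pd.Series(['2023-06-15', 'not a date', '2023-06-17'])

# errors='coerce' turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, errors='coerce')

print(parsed.isna().tolist())  # [False, True, False]
```

The resulting `NaT` values can then be inspected or dropped like any other missing data.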


You can also specify the format explicitly for more control:


In [31]:
# Specifying a custom format
custom_date = '15-Jun-2023'
pd.to_datetime(custom_date, format='%d-%b-%Y')

Timestamp('2023-06-15 00:00:00')

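The `format` parameter uses the same codes as Python's `strftime`/`strptime` to describe the input: here `%d` is the day of the month, `%b` the abbreviated month name, and `%Y` the four-digit year.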


`to_datetime()` can also handle numeric timestamps such as Unix epoch times:


In [32]:
# Converting Unix timestamp
pd.to_datetime(1623766800, unit='s')

Timestamp('2021-06-15 14:20:00')

This converts a Unix timestamp (seconds since January 1, 1970, 00:00 UTC) to a Pandas Timestamp. The result is timezone-naive and expressed in UTC.
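The `unit` argument also covers other resolutions, and `utc=True` returns a timezone-aware result. A brief sketch, with illustrative variable names:

```python
import pandas as pd

# The same instant expressed in seconds and in milliseconds
from_seconds = pd.to_datetime(1623766800, unit='s')
from_millis = pd.to_datetime(1623766800000, unit='ms')

# utc=True returns a timezone-aware Timestamp instead of a naive one
aware = pd.to_datetime(1623766800, unit='s', utc=True)

print(from_seconds == from_millis)  # True
print(aware)                        # 2021-06-15 14:20:00+00:00
```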


### <a id='toc2_2_'></a>[Using `date_range()`](#toc0_)


While `to_datetime()` is great for converting existing data, `date_range()` is used to generate sequences of dates or timestamps. This is particularly useful for creating DatetimeIndex objects or generating date-based data.


Let's start with a basic example:


In [33]:
# Creating a date range for a month
pd.date_range(start='2023-01-01', end='2023-01-31')

DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08',
               '2023-01-09', '2023-01-10', '2023-01-11', '2023-01-12',
               '2023-01-13', '2023-01-14', '2023-01-15', '2023-01-16',
               '2023-01-17', '2023-01-18', '2023-01-19', '2023-01-20',
               '2023-01-21', '2023-01-22', '2023-01-23', '2023-01-24',
               '2023-01-25', '2023-01-26', '2023-01-27', '2023-01-28',
               '2023-01-29', '2023-01-30', '2023-01-31'],
              dtype='datetime64[ns]', freq='D')

This creates a DatetimeIndex with daily dates for January 2023. By default, `date_range()` uses a daily frequency.


You can also specify the number of periods instead of an end date:


In [34]:
# Creating a date range with a specific number of periods
pd.date_range(start='2023-01-01', periods=10)

DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08',
               '2023-01-09', '2023-01-10'],
              dtype='datetime64[ns]', freq='D')

This creates 10 daily timestamps starting from January 1, 2023.


### <a id='toc2_3_'></a>[Specifying frequency](#toc0_)


`date_range()` accepts a `freq` argument that sets the spacing between timestamps (`to_datetime()` has no such argument, because it converts existing values rather than generating new ones). Pandas uses frequency aliases to represent different time intervals.


Let's look at some examples with `date_range()`:


In [35]:
# Creating a range of business days
pd.date_range(start='2023-01-01', periods=5, freq='B')

DatetimeIndex(['2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05',
               '2023-01-06'],
              dtype='datetime64[ns]', freq='B')

This creates a range of 5 business days, skipping weekends. Because January 1, 2023 falls on a Sunday, the range starts on Monday, January 2.


In [36]:
# Creating a range of hourly timestamps
pd.date_range(start='2023-01-01', periods=24, freq='h')

DatetimeIndex(['2023-01-01 00:00:00', '2023-01-01 01:00:00',
               '2023-01-01 02:00:00', '2023-01-01 03:00:00',
               '2023-01-01 04:00:00', '2023-01-01 05:00:00',
               '2023-01-01 06:00:00', '2023-01-01 07:00:00',
               '2023-01-01 08:00:00', '2023-01-01 09:00:00',
               '2023-01-01 10:00:00', '2023-01-01 11:00:00',
               '2023-01-01 12:00:00', '2023-01-01 13:00:00',
               '2023-01-01 14:00:00', '2023-01-01 15:00:00',
               '2023-01-01 16:00:00', '2023-01-01 17:00:00',
               '2023-01-01 18:00:00', '2023-01-01 19:00:00',
               '2023-01-01 20:00:00', '2023-01-01 21:00:00',
               '2023-01-01 22:00:00', '2023-01-01 23:00:00'],
              dtype='datetime64[ns]', freq='h')

This creates 24 hourly timestamps starting from January 1, 2023.


You can also use more complex frequencies:


In [37]:
# Every 2 weeks on Monday
pd.date_range(start='2023-01-01', periods=5, freq='2W-MON')

DatetimeIndex(['2023-01-02', '2023-01-16', '2023-01-30', '2023-02-13',
               '2023-02-27'],
              dtype='datetime64[ns]', freq='2W-MON')

This creates a range of 5 dates, each 2 weeks apart, always landing on a Monday.


Here are some common frequency aliases:
- 'D': Calendar day
- 'B': Business day
- 'W': Weekly
- 'ME': Month end
- 'QE': Quarter end
- 'YE': Year end
- 'h': Hourly
- 'min': Minutely
- 's': Secondly
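Besides string aliases, the `freq` argument also accepts offset objects from `pd.offsets`, which can be easier to read and do not depend on alias spelling. A small sketch of this alternative, using `BDay` and `MonthEnd` as examples:

```python
import pandas as pd

# Offset objects are an alternative to string aliases for the freq argument
business_days = pd.date_range(start='2023-01-01', periods=3, freq=pd.offsets.BDay())
month_ends = pd.date_range(start='2023-01-01', periods=3, freq=pd.offsets.MonthEnd())

print(business_days.strftime('%Y-%m-%d').tolist())
# ['2023-01-02', '2023-01-03', '2023-01-04']
print(month_ends.strftime('%Y-%m-%d').tolist())
# ['2023-01-31', '2023-02-28', '2023-03-31']
```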


Understanding how to create datetime objects efficiently is crucial for working with time series data in Pandas. These methods provide flexibility in handling various input formats and generating date sequences, enabling you to prepare your time-based data for further analysis and manipulation.

## <a id='toc3_'></a>[Date and Time Components](#toc0_)

When working with datetime data in Pandas, it's often necessary to access or extract specific components of dates and times. Pandas provides convenient methods to access these components and create new columns based on them.


### <a id='toc3_1_'></a>[Accessing Components (year, month, day, etc.)](#toc0_)


Pandas datetime objects (whether in a Series or DatetimeIndex) have attributes for accessing various date and time components. Let's explore these using a sample DataFrame:


In [38]:
# Create a sample DataFrame with a date range
dates = pd.date_range(start='2023-01-01', periods=10, freq='D')
df = pd.DataFrame({'date': dates, 'value': np.random.rand(10)})
df

        date     value
0 2023-01-01  0.192302
1 2023-01-02  0.684846
2 2023-01-03  0.371756
3 2023-01-04  0.200897
4 2023-01-05  0.512510
5 2023-01-06  0.098722
6 2023-01-07  0.143695
7 2023-01-08  0.940325
8 2023-01-09  0.398690
9 2023-01-10  0.951168


This creates a DataFrame with a 'date' column containing 10 consecutive days starting from January 1, 2023, and a 'value' column with random numbers.


Now, let's access various components of the 'date' column:


In [39]:
# Accessing year
df['date'].dt.year

0    2023
1    2023
2    2023
3    2023
4    2023
5    2023
6    2023
7    2023
8    2023
9    2023
Name: date, dtype: int32

In [40]:
# Accessing month
df['date'].dt.month

0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: date, dtype: int32

In [41]:
# Accessing day
df['date'].dt.day

0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
Name: date, dtype: int32

In [42]:
# Accessing day name
df['date'].dt.day_name()

0       Sunday
1       Monday
2      Tuesday
3    Wednesday
4     Thursday
5       Friday
6     Saturday
7       Sunday
8       Monday
9      Tuesday
Name: date, dtype: object

In [43]:
# Accessing day of week (0 is Monday, 6 is Sunday)
df['date'].dt.dayofweek

0    6
1    0
2    1
3    2
4    3
5    4
6    5
7    6
8    0
9    1
Name: date, dtype: int32

In [44]:
# Accessing quarter
df['date'].dt.quarter

0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: date, dtype: int32

Each of these operations returns a Series with the respective component for each date in the 'date' column. The `.dt` accessor is used to access datetime-specific methods and attributes.
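The difference between a Series and a DatetimeIndex can be sketched as follows: the index exposes the components directly, while a Series needs the `.dt` accessor, which exists only for datetime-like data. Variable names are illustrative.

```python
import pandas as pd

dates = pd.date_range(start='2023-01-01', periods=3, freq='D')

# On a DatetimeIndex the components are plain attributes
index_days = dates.day

# On a Series the same components live behind the .dt accessor
series_days = pd.Series(dates).dt.day

print(index_days.tolist())   # [1, 2, 3]
print(series_days.tolist())  # [1, 2, 3]

# .dt is only available for datetime-like Series
try:
    pd.Series(['2023-01-01']).dt.day
except AttributeError as exc:
    print('AttributeError:', exc)
```

A column of date strings therefore has to be converted with `pd.to_datetime()` before `.dt` can be used on it.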


For time components, let's create a new DataFrame with a datetime column including time:


In [45]:
# Create a DataFrame with datetime including time
datetimes = pd.date_range(start='2023-01-01 00:00:00', periods=10, freq='h')
df_time = pd.DataFrame({'datetime': datetimes, 'value': np.random.rand(10)})
df_time

             datetime     value
0 2023-01-01 00:00:00  0.162446
1 2023-01-01 01:00:00  0.243746
2 2023-01-01 02:00:00  0.890413
3 2023-01-01 03:00:00  0.149612
4 2023-01-01 04:00:00  0.430437
5 2023-01-01 05:00:00  0.370105
6 2023-01-01 06:00:00  0.436552
7 2023-01-01 07:00:00  0.441679
8 2023-01-01 08:00:00  0.841236
9 2023-01-01 09:00:00  0.037252


Now we can access time components:


In [46]:
# Accessing hour
df_time['datetime'].dt.hour

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
Name: datetime, dtype: int32

In [47]:
# Accessing minute
df_time['datetime'].dt.minute

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: datetime, dtype: int32

In [48]:
# Accessing second
df_time['datetime'].dt.second

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: datetime, dtype: int32

These attributes make it easy to extract specific parts of datetime objects for analysis or filtering.


### <a id='toc3_2_'></a>[Extracting Components to New Columns](#toc0_)


Often, you'll want to create new columns in your DataFrame based on these datetime components. This is useful for grouping, filtering, or analyzing data based on specific time periods.


Let's extend our original DataFrame with new columns for various date components:


In [49]:
# Adding columns for the date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_name'] = df['date'].dt.day_name()
df['day_of_week'] = df['date'].dt.dayofweek
df['quarter'] = df['date'].dt.quarter

df

        date     value  year  month  day   day_name  day_of_week  quarter
0 2023-01-01  0.192302  2023      1    1     Sunday            6        1
1 2023-01-02  0.684846  2023      1    2     Monday            0        1
2 2023-01-03  0.371756  2023      1    3    Tuesday            1        1
3 2023-01-04  0.200897  2023      1    4  Wednesday            2        1
4 2023-01-05  0.512510  2023      1    5   Thursday            3        1
5 2023-01-06  0.098722  2023      1    6     Friday            4        1
6 2023-01-07  0.143695  2023      1    7   Saturday            5        1
7 2023-01-08  0.940325  2023      1    8     Sunday            6        1
8 2023-01-09  0.398690  2023      1    9     Monday            0        1
9 2023-01-10  0.951168  2023      1   10    Tuesday            1        1


Now our DataFrame has separate columns for year, month, day, day name, day of week, and quarter. This makes it easy to perform operations like grouping or filtering based on these components.


For example, we can now easily filter for all rows in a specific month:


In [50]:
# Filter for January data
january_data = df[df['month'] == 1]
january_data

        date     value  year  month  day   day_name  day_of_week  quarter
0 2023-01-01  0.192302  2023      1    1     Sunday            6        1
1 2023-01-02  0.684846  2023      1    2     Monday            0        1
2 2023-01-03  0.371756  2023      1    3    Tuesday            1        1
3 2023-01-04  0.200897  2023      1    4  Wednesday            2        1
4 2023-01-05  0.512510  2023      1    5   Thursday            3        1
5 2023-01-06  0.098722  2023      1    6     Friday            4        1
6 2023-01-07  0.143695  2023      1    7   Saturday            5        1
7 2023-01-08  0.940325  2023      1    8     Sunday            6        1
8 2023-01-09  0.398690  2023      1    9     Monday            0        1
9 2023-01-10  0.951168  2023      1   10    Tuesday            1        1


Or calculate the mean value for each day of the week:


In [51]:
# Calculate mean value by day of week
df.groupby('day_name')['value'].mean().sort_values(ascending=False)

day_name
Tuesday      0.661462
Sunday       0.566314
Monday       0.541768
Thursday     0.512510
Wednesday    0.200897
Saturday     0.143695
Friday       0.098722
Name: value, dtype: float64

For time-based components, let's add some columns to our df_time DataFrame:


In [52]:
# Adding hour and minute columns
df_time['hour'] = df_time['datetime'].dt.hour
df_time['minute'] = df_time['datetime'].dt.minute

df_time

             datetime     value  hour  minute
0 2023-01-01 00:00:00  0.162446     0       0
1 2023-01-01 01:00:00  0.243746     1       0
2 2023-01-01 02:00:00  0.890413     2       0
3 2023-01-01 03:00:00  0.149612     3       0
4 2023-01-01 04:00:00  0.430437     4       0
5 2023-01-01 05:00:00  0.370105     5       0
6 2023-01-01 06:00:00  0.436552     6       0
7 2023-01-01 07:00:00  0.441679     7       0
8 2023-01-01 08:00:00  0.841236     8       0
9 2023-01-01 09:00:00  0.037252     9       0


This allows for easy time-based analysis:


In [53]:
# Calculate mean value by hour
df_time.groupby('hour')['value'].mean()

hour
0    0.162446
1    0.243746
2    0.890413
3    0.149612
4    0.430437
5    0.370105
6    0.436552
7    0.441679
8    0.841236
9    0.037252
Name: value, dtype: float64

Extracting date and time components to separate columns can significantly simplify many common data analysis tasks. It allows you to easily group, filter, and aggregate your data based on various time periods, which is crucial in many time series analysis scenarios.


Remember that creating these additional columns increases memory usage, so it's a trade-off between convenience and efficiency. For large datasets, it can be more efficient to use the `.dt` accessor directly in your operations rather than creating new columns.
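Grouping on a component computed on the fly might look like the sketch below. It uses a small DataFrame with deterministic values (not the random data above) so that the result is easy to check by hand.

```python
import pandas as pd

dates = pd.date_range(start='2023-01-01', periods=10, freq='D')
df = pd.DataFrame({'date': dates, 'value': range(10)})

# Group by a component computed on the fly; no 'day_name' column is stored
mean_by_day = df.groupby(df['date'].dt.day_name())['value'].mean()

print(mean_by_day['Sunday'])  # 3.5
print(list(df.columns))       # ['date', 'value']
```

The two Sundays in this range (January 1 and January 8) hold the values 0 and 7, which gives the mean of 3.5, and the DataFrame keeps its original two columns.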


By mastering these techniques for accessing and extracting date and time components, you'll be well-equipped to handle a wide range of time series analysis tasks in Pandas.

## <a id='toc4_'></a>[Datetime Indexing and Slicing](#toc0_)

When working with time series data in Pandas, efficient indexing and slicing are crucial for data analysis and manipulation. Pandas provides powerful tools for selecting data based on dates and times, including partial string indexing and accessing specific date and time components.


Let's start by creating a sample DataFrame with a DatetimeIndex:


In [54]:
# Create a DataFrame with a DatetimeIndex
dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
df = pd.DataFrame({'value': np.random.randn(len(dates))}, index=dates)
df

Unnamed: 0,value
2023-01-01,0.586287
2023-01-02,-2.077276
2023-01-03,-1.372788
2023-01-04,-1.080971
2023-01-05,0.534736
...,...
2023-12-27,0.876468
2023-12-28,0.887156
2023-12-29,0.347792
2023-12-30,0.337033


This creates a DataFrame with daily data for the entire year of 2023.


### <a id='toc4_1_'></a>[Partial string indexing](#toc0_)


Partial string indexing is a powerful feature in Pandas that allows you to select data using incomplete date string labels. This is particularly useful when you want to select data for specific time periods without specifying the full date.


Let's explore some examples:


In [55]:
# Select all data for January 2023
df.loc['2023-01']

Unnamed: 0,value
2023-01-01,0.586287
2023-01-02,-2.077276
2023-01-03,-1.372788
2023-01-04,-1.080971
2023-01-05,0.534736
2023-01-06,-1.226532
2023-01-07,0.708189
2023-01-08,0.621466
2023-01-09,-0.835543
2023-01-10,-1.677096


In [56]:
# Select data for the first quarter of 2023
df.loc['2023-01':'2023-03']

Unnamed: 0,value
2023-01-01,0.586287
2023-01-02,-2.077276
2023-01-03,-1.372788
2023-01-04,-1.080971
2023-01-05,0.534736
...,...
2023-03-27,0.080388
2023-03-28,0.648832
2023-03-29,-1.443029
2023-03-30,-1.403090


In [57]:
# Select data for a specific day
df.loc['2023-01-04']

value   -1.080971
Name: 2023-01-04 00:00:00, dtype: float64

In [58]:
# Select data from a specific date to the end of the dataset
df.loc['2023-07-01':]

Unnamed: 0,value
2023-07-01,-0.517635
2023-07-02,0.349562
2023-07-03,-0.918661
2023-07-04,-0.138210
2023-07-05,-0.723003
...,...
2023-12-27,0.876468
2023-12-28,0.887156
2023-12-29,0.347792
2023-12-30,0.337033


In [59]:
# Select data up to a specific date
df.loc[:'2023-03-31']

Unnamed: 0,value
2023-01-01,0.586287
2023-01-02,-2.077276
2023-01-03,-1.372788
2023-01-04,-1.080971
2023-01-05,0.534736
...,...
2023-03-27,0.080388
2023-03-28,0.648832
2023-03-29,-1.443029
2023-03-30,-1.403090


Partial string indexing is very flexible. You can use various levels of precision:


In [60]:
# Select all data for June
df.loc['2023-06']

Unnamed: 0,value
2023-06-01,-0.930227
2023-06-02,-0.04763
2023-06-03,0.893778
2023-06-04,-0.184063
2023-06-05,0.035179
2023-06-06,-0.322121
2023-06-07,0.642049
2023-06-08,2.318969
2023-06-09,-1.142591
2023-06-10,0.436437


In [61]:
# Select data for the first half of the year
df.loc['2023-01':'2023-06']

Unnamed: 0,value
2023-01-01,0.586287
2023-01-02,-2.077276
2023-01-03,-1.372788
2023-01-04,-1.080971
2023-01-05,0.534736
...,...
2023-06-26,0.736874
2023-06-27,1.309112
2023-06-28,0.268463
2023-06-29,-0.428709


This feature also works with timestamps:


In [62]:
# Create a DataFrame with hourly data for a day
hourly_dates = pd.date_range(start='2023-06-15 00:00:00', end='2023-06-15 23:59:59', freq='h')
df_hourly = pd.DataFrame({'value': np.random.randn(len(hourly_dates))}, index=hourly_dates)
df_hourly

Unnamed: 0,value
2023-06-15 00:00:00,0.889814
2023-06-15 01:00:00,0.152884
2023-06-15 02:00:00,-0.861144
2023-06-15 03:00:00,-1.257476
2023-06-15 04:00:00,-0.75381
2023-06-15 05:00:00,-0.993356
2023-06-15 06:00:00,-0.594063
2023-06-15 07:00:00,-1.704967
2023-06-15 08:00:00,-0.438706
2023-06-15 09:00:00,-0.608612


In [63]:
# Select data for a specific hour
df_hourly.loc['2023-06-15 14:00:00']

value   -2.879855
Name: 2023-06-15 14:00:00, dtype: float64

In [64]:
# Select data for a range of hours
df_hourly.loc['2023-06-15 08:00:00':'2023-06-15 16:00:00']

Unnamed: 0,value
2023-06-15 08:00:00,-0.438706
2023-06-15 09:00:00,-0.608612
2023-06-15 10:00:00,0.440381
2023-06-15 11:00:00,0.677931
2023-06-15 12:00:00,0.283943
2023-06-15 13:00:00,0.830373
2023-06-15 14:00:00,-2.879855
2023-06-15 15:00:00,1.211274
2023-06-15 16:00:00,0.060807


Partial string indexing makes it easy to select data for specific time periods without needing to know the exact start and end dates in your dataset.
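One detail worth checking is how slice endpoints behave. The sketch below uses a deterministic daily DataFrame for 2023 (not the random data above) to show that a partial end label such as `'2023-03'` includes the whole month.

```python
import pandas as pd

dates = pd.date_range(start='2023-01-01', end='2023-12-31', freq='D')
df = pd.DataFrame({'value': range(len(dates))}, index=dates)

# A month string expands to every row in that month
january = df.loc['2023-01']

# In a slice, the end label is inclusive and covers the whole of March
first_quarter = df.loc['2023-01':'2023-03']

print(len(january))               # 31
print(len(first_quarter))         # 90
print(first_quarter.index.max())  # 2023-03-31 00:00:00
```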


### <a id='toc4_2_'></a>[Date and time components access](#toc0_)


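In addition to partial string indexing, Pandas allows you to select data based on specific components of dates and times. When the dates are in the index, a `DatetimeIndex` exposes these components directly as attributes and methods (for example `df.index.month` or `df.index.day_name()`); the `.dt` accessor is only needed for datetime columns.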


Here are some examples:


In [65]:
# Select all Mondays
df[df.index.day_name() == 'Monday']


Unnamed: 0,value
2023-01-02,-2.077276
2023-01-09,-0.835543
2023-01-16,0.657934
2023-01-23,0.982301
2023-01-30,0.527168
2023-02-06,2.179397
2023-02-13,-0.473396
2023-02-20,0.333065
2023-02-27,-0.292679
2023-03-06,-1.545982


In [66]:
# Select all days in March
df[df.index.month == 3]

Unnamed: 0,value
2023-03-01,-0.564322
2023-03-02,-0.863697
2023-03-03,-0.398906
2023-03-04,-0.964817
2023-03-05,0.707872
2023-03-06,-1.545982
2023-03-07,-0.699935
2023-03-08,-0.140399
2023-03-09,1.867963
2023-03-10,0.154971


In [67]:
# Select all Fridays in the second quarter
df[(df.index.day_name() == 'Friday') & (df.index.quarter == 2)]

Unnamed: 0,value
2023-04-07,0.909454
2023-04-14,0.243474
2023-04-21,-0.562639
2023-04-28,0.222666
2023-05-05,1.008951
2023-05-12,-1.339248
2023-05-19,-0.34312
2023-05-26,1.48807
2023-06-02,-0.04763
2023-06-09,-1.142591


In [68]:
# Select all data points where the day of the month is greater than 15
df[df.index.day > 15]

Unnamed: 0,value
2023-01-16,0.657934
2023-01-17,-0.600103
2023-01-18,0.758400
2023-01-19,0.030888
2023-01-20,-0.850621
...,...
2023-12-27,0.876468
2023-12-28,0.887156
2023-12-29,0.347792
2023-12-30,0.337033


For time-based selection, let's use our hourly DataFrame:


In [69]:
# Select all data points between 9 AM and 5 PM
df_hourly[(df_hourly.index.hour >= 9) & (df_hourly.index.hour <= 17)]

Unnamed: 0,value
2023-06-15 09:00:00,-0.608612
2023-06-15 10:00:00,0.440381
2023-06-15 11:00:00,0.677931
2023-06-15 12:00:00,0.283943
2023-06-15 13:00:00,0.830373
2023-06-15 14:00:00,-2.879855
2023-06-15 15:00:00,1.211274
2023-06-15 16:00:00,0.060807
2023-06-15 17:00:00,-0.913869


In [70]:
# Select all data points in the morning (before noon)
df_hourly[df_hourly.index.hour < 12]

Unnamed: 0,value
2023-06-15 00:00:00,0.889814
2023-06-15 01:00:00,0.152884
2023-06-15 02:00:00,-0.861144
2023-06-15 03:00:00,-1.257476
2023-06-15 04:00:00,-0.75381
2023-06-15 05:00:00,-0.993356
2023-06-15 06:00:00,-0.594063
2023-06-15 07:00:00,-1.704967
2023-06-15 08:00:00,-0.438706
2023-06-15 09:00:00,-0.608612


You can combine these methods for more complex selections:


In [71]:
# Select all Mondays in June where the value is positive
june_mondays_positive = df[(df.index.month == 6) & 
                           (df.index.day_name() == 'Monday') & 
                           (df['value'] > 0)]
june_mondays_positive

               value
2023-06-05  0.035179
2023-06-26  0.736874


These indexing and slicing techniques are powerful tools for working with time series data. They allow you to easily select and analyze data for specific time periods or based on particular date and time characteristics.
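For selections based on time of day, `DataFrame.between_time()` is an alternative to building a boolean mask from `index.hour`. The sketch below compares the two on a deterministic hourly DataFrame; the hourly spacing is passed as a `pd.Timedelta` only to keep the example independent of alias spelling.

```python
import pandas as pd

hourly = pd.date_range(start='2023-06-15', periods=24, freq=pd.Timedelta(hours=1))
df_hourly = pd.DataFrame({'value': range(24)}, index=hourly)

# Boolean mask built from the index's hour attribute
by_mask = df_hourly[(df_hourly.index.hour >= 9) & (df_hourly.index.hour <= 17)]

# between_time selects on time of day; both endpoints are included by default
by_method = df_hourly.between_time('09:00', '17:00')

print(by_mask.equals(by_method))  # True
print(len(by_method))             # 9
```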


Here's a more complex example that combines several concepts:


In [72]:
# Select the highest value for each month
monthly_max = df.resample('ME').max()

In [73]:
# Find the day of the week with the highest average value
day_of_week_avg = df.groupby(df.index.day_name())['value'].mean().sort_values(ascending=False)

In [74]:
print("Monthly maximum values:")
monthly_max


Monthly maximum values:


Unnamed: 0,value
2023-01-31,1.43951
2023-02-28,2.179397
2023-03-31,1.867963
2023-04-30,1.676344
2023-05-31,2.246705
2023-06-30,2.318969
2023-07-31,1.088492
2023-08-31,1.138809
2023-09-30,2.152956
2023-10-31,3.059004


In [75]:
print("\nAverage value by day of week:")
day_of_week_avg


Average value by day of week:


Thursday     0.127229
Friday       0.042777
Sunday       0.002105
Saturday    -0.062145
Wednesday   -0.070790
Monday      -0.088084
Tuesday     -0.268439
Name: value, dtype: float64

This example demonstrates how you can combine resampling, grouping, and date component access to perform more complex analyses on your time series data.
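Two small variations on this analysis are often useful: listing the weekday averages in calendar order rather than by value, and finding the date on which each monthly maximum occurred. The following is a minimal sketch under stated assumptions: `daily_demo` is a hypothetical seeded frame, not the `df` used above, so its numbers will differ from the outputs shown.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data covering January through October 2023
rng = np.random.default_rng(42)
idx = pd.date_range("2023-01-01", "2023-10-31", freq="D")
daily_demo = pd.DataFrame({"value": rng.standard_normal(len(idx))}, index=idx)

# Average by weekday name; groupby sorts the labels alphabetically
day_avg = daily_demo.groupby(daily_demo.index.day_name())["value"].mean()

# Reindex into calendar order instead of alphabetical or value order
calendar_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday", "Sunday"]
day_avg_calendar = day_avg.reindex(calendar_order)

# Date on which each month's maximum occurred (complements resample().max())
month_peak_dates = daily_demo["value"].groupby(daily_demo.index.month).idxmax()

print(day_avg_calendar)
print(month_peak_dates)
```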


By mastering these datetime indexing and slicing techniques, you'll be able to efficiently navigate and analyze your time series data in Pandas, extracting valuable insights with ease.

## <a id='toc5_'></a>[Parsing and Formatting Dates](#toc0_)

Working with dates in Pandas often involves converting between string representations and datetime objects. This process includes parsing strings into dates and formatting dates back into strings. Mastering these techniques is crucial for data cleaning, integration, and presentation.


### <a id='toc5_1_'></a>[Parsing Strings to Dates](#toc0_)


Pandas provides powerful tools for converting string representations of dates into datetime objects. The primary function for this task is `pd.to_datetime()`.


Let's start with some examples:


In [76]:
# Basic date parsing
date_str = '2023-06-15'
pd.to_datetime(date_str)

Timestamp('2023-06-15 00:00:00')

This converts a simple ISO format date string to a Timestamp object. However, `to_datetime()` is much more flexible and can handle various formats:


In [77]:
# Parsing different date formats
dates = ['2023-06-15', '15/06/2023', 'June 15, 2023', '20230615']
pd.to_datetime(dates, format="mixed")

DatetimeIndex(['2023-06-15', '2023-06-15', '2023-06-15', '2023-06-15'], dtype='datetime64[ns]', freq=None)

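Here, `format="mixed"` (available in pandas 2.0 and later) tells Pandas to infer the format of each string individually. This is convenient, but it can silently misread ambiguous dates such as `01/02/2023`. For ambiguous formats, or to guarantee correct parsing, specify the format explicitly: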


In [78]:
# Specifying format explicitly
custom_date = '15-Jun-2023'
pd.to_datetime(custom_date, format='%d-%b-%Y')

Timestamp('2023-06-15 00:00:00')
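Ambiguity matters most for all-numeric dates. As a short illustration (the string `01/02/2023` is a made-up example), the same input can be read month-first, day-first via the `dayfirst` argument, or pinned down with an explicit `format`:

```python
import pandas as pd

ambiguous = "01/02/2023"

# Default interpretation is month-first: 2 January 2023
month_first = pd.to_datetime(ambiguous)

# dayfirst=True reads the same string as 1 February 2023
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# An explicit format removes the guesswork entirely
explicit = pd.to_datetime(ambiguous, format="%d/%m/%Y")

print(month_first.date(), day_first.date(), explicit.date())
```

An explicit `format` is usually the safer choice for a whole column, since `dayfirst` is a parsing preference rather than a strict rule.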

The `format` parameter uses strftime-style codes to specify the input format. Here are some common format codes:
- `%Y`: Year with century (e.g., 2023)
- `%m`: Month as a zero-padded decimal number (01-12)
- `%d`: Day of the month as a zero-padded decimal number (01-31)
- `%H`: Hour (00-23)
- `%M`: Minute (00-59)
- `%S`: Second (00-59)
- `%b`: Month as locale's abbreviated name (e.g., Jan, Feb)
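These codes can be combined to describe a full timestamp. The sketch below uses made-up input strings to show a day-first date with a 24-hour time, and an abbreviated month name with hours and minutes. Note that `%b` matches month names in the active locale, assumed here to be English.

```python
import pandas as pd

# Day-first dates with a 24-hour time (illustrative strings)
raw = ["15/06/2023 14:30:00", "16/06/2023 09:45:30"]
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S")
print(parsed)

# Abbreviated month name plus hour and minute
stamp = pd.to_datetime("15-Jun-2023 09:05", format="%d-%b-%Y %H:%M")
print(stamp)
```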


Let's look at a more complex example:


In [79]:
# Parsing dates with time
datetime_strings = ['2023-06-15 14:30:00', '2023-06-16 09:45:30']
pd.to_datetime(datetime_strings)

DatetimeIndex(['2023-06-15 14:30:00', '2023-06-16 09:45:30'], dtype='datetime64[ns]', freq=None)

When working with a DataFrame, you can parse an entire column of date strings:


In [80]:
# Create a DataFrame with date strings
df = pd.DataFrame({
    'date_string': ['2023-06-15', '2023-06-16', '2023-06-17'],
    'value': [10, 20, 30]
})

# Parse the date_string column
df['date'] = pd.to_datetime(df['date_string'])
df

|  | date_string | value | date |
|---|---|---|---|
| 0 | 2023-06-15 | 10 | 2023-06-15 |
| 1 | 2023-06-16 | 20 | 2023-06-16 |
| 2 | 2023-06-17 | 30 | 2023-06-17 |
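Real columns often contain entries that cannot be parsed. By default `pd.to_datetime()` raises an error in that case; passing `errors="coerce"` converts such entries to `NaT` instead. A minimal sketch, assuming a hypothetical `messy` frame with one malformed entry and one missing value:

```python
import pandas as pd

# Hypothetical column with one malformed entry and one missing value
messy = pd.DataFrame({
    "date_string": ["2023-06-15", "not a date", None, "2023-06-18"],
    "value": [10, 20, 30, 40],
})

# errors="coerce" turns unparseable entries into NaT instead of raising
messy["date"] = pd.to_datetime(
    messy["date_string"], format="%Y-%m-%d", errors="coerce"
)

print(messy["date"].dtype)
print(messy[messy["date"].isna()])  # rows that need manual inspection
```

Inspecting the `NaT` rows afterwards is a simple way to catch bad input rather than silently dropping it.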


### <a id='toc5_2_'></a>[Formatting Dates as Strings](#toc0_)


Once you have datetime objects, you might need to convert them back to strings in a specific format. This is often necessary for data presentation or when integrating with other systems that expect dates in a particular string format.


The primary tool for this is `Series.dt.strftime()`, which is available on any Series with a datetime dtype.


Let's start with a simple example:


In [81]:
# Create a Series of dates
dates = pd.date_range(start='2023-06-15', periods=3)
s = pd.Series(dates)

# Format dates as strings
s.dt.strftime('%Y-%m-%d')

0    2023-06-15
1    2023-06-16
2    2023-06-17
dtype: object

You can create more complex date string formats:


In [82]:
# More complex formatting
s.dt.strftime('%A, %B %d, %Y')

0    Thursday, June 15, 2023
1      Friday, June 16, 2023
2    Saturday, June 17, 2023
dtype: object

This creates strings like "Thursday, June 15, 2023".


Here are some additional format codes:
- `%A`: Full weekday name
- `%B`: Full month name
- `%I`: Hour (12-hour clock)
- `%p`: Locale's equivalent of AM or PM


Let's look at an example with a DataFrame:


In [83]:
# Create a DataFrame with a datetime column
df = pd.DataFrame({
    'date': pd.date_range(start='2023-06-15', periods=3),
    'value': [10, 20, 30]
})

# Format the date column
df['formatted_date'] = df['date'].dt.strftime('%Y-%m-%d %H:%M:%S')
df

|  | date | value | formatted_date |
|---|---|---|---|
| 0 | 2023-06-15 | 10 | 2023-06-15 00:00:00 |
| 1 | 2023-06-16 | 20 | 2023-06-16 00:00:00 |
| 2 | 2023-06-17 | 30 | 2023-06-17 00:00:00 |


The names produced by `%A` and `%B` follow the active locale:


In [84]:
# Weekday and month names use the active locale (English here)
df['date_fr'] = df['date'].dt.strftime('%A %d %B %Y')
df['date_fr']

0    Thursday 15 June 2023
1      Friday 16 June 2023
2    Saturday 17 June 2023
Name: date_fr, dtype: object

The names are still in English because the format codes alone do not select a language. To get French names, change the locale first (the `fr_FR.UTF-8` locale must be installed on your system):


In [85]:
import locale

In [86]:
# Set locale to French
locale.setlocale(locale.LC_TIME, 'fr_FR.UTF-8')

'fr_FR.UTF-8'

In [87]:
df['date_fr'] = df['date'].dt.strftime('%A %d %B %Y')
df['date_fr']

0       Jeudi 15 juin 2023
1    Vendredi 16 juin 2023
2      Samedi 17 juin 2023
Name: date_fr, dtype: object

In [88]:
# Reset LC_TIME to the user's default locale (taken from the environment)
locale.setlocale(locale.LC_TIME, '')

'en_CA.UTF-8'

In [89]:
df

|  | date | value | formatted_date | date_fr |
|---|---|---|---|---|
| 0 | 2023-06-15 | 10 | 2023-06-15 00:00:00 | Jeudi 15 juin 2023 |
| 1 | 2023-06-16 | 20 | 2023-06-16 00:00:00 | Vendredi 16 juin 2023 |
| 2 | 2023-06-17 | 30 | 2023-06-17 00:00:00 | Samedi 17 juin 2023 |


`locale.setlocale()` changes the locale for the whole process, so remember to reset it when you're done to avoid affecting other parts of your code.
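One way to make the reset automatic is to wrap the switch in a context manager that restores the previous setting even if formatting fails. This is a sketch, not part of Pandas: `temporary_time_locale` is a hypothetical helper built on the standard `locale` and `contextlib` modules, and the French locale is assumed to be installed (the code falls back to a message if it is not).

```python
import locale
from contextlib import contextmanager

import pandas as pd


@contextmanager
def temporary_time_locale(name):
    """Switch LC_TIME to `name`, then restore the previous setting on exit."""
    previous = locale.setlocale(locale.LC_TIME)  # query only, no change
    try:
        locale.setlocale(locale.LC_TIME, name)
        yield
    finally:
        locale.setlocale(locale.LC_TIME, previous)


dates = pd.Series(pd.date_range("2023-06-15", periods=3))
before = locale.setlocale(locale.LC_TIME)

try:
    with temporary_time_locale("fr_FR.UTF-8"):
        french = dates.dt.strftime("%A %d %B %Y")
    print(french.tolist())
except locale.Error:
    # The French locale is not installed on every system
    print("fr_FR.UTF-8 is not available here")

# The original setting is back in place either way
print(locale.setlocale(locale.LC_TIME) == before)
```

Unlike `locale.setlocale(locale.LC_TIME, '')`, which switches to the user's default locale, this restores whatever setting was active before the block ran.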


Mastering date parsing and formatting is crucial for effective data manipulation and presentation. These techniques allow you to handle various date string formats in your input data and create custom date string representations for your output or analysis needs.

## <a id='toc6_'></a>[Conclusion](#toc0_)

In this lecture, we've explored the fundamental aspects of working with dates and times in Pandas, a crucial skill for any data analyst or scientist dealing with time series data. Let's recap the key points we've covered:

1. **Datetime Data Types**: We learned about Timestamp, DatetimeIndex, Timedelta, and TimedeltaIndex. These specialized data types form the backbone of time series functionality in Pandas.

2. **Creating Datetime Objects**: We explored various methods to create datetime objects, including `to_datetime()` for parsing strings and `date_range()` for generating sequences of dates. We also learned how to specify frequencies for these date ranges.

3. **Date and Time Components**: We discovered how to access and extract individual components of dates and times, such as year, month, day, hour, etc. This allows for detailed analysis and manipulation of time-based data.

4. **Datetime Indexing and Slicing**: We covered powerful techniques for selecting data based on dates and times, including partial string indexing and component-based selection. These methods enable efficient data extraction for specific time periods.

5. **Parsing and Formatting Dates**: We learned how to convert between string representations and datetime objects, which is essential for data cleaning, integration, and presentation.


These skills are fundamental to working with time series data in Pandas. They allow you to:

- Clean and standardize date and time data from various sources
- Perform time-based analysis and aggregations
- Create custom date ranges for analysis or reporting
- Extract specific time periods or components for detailed study
- Present date and time data in desired formats


As you continue to work with time series data, you'll find these techniques invaluable. They form the foundation for more advanced time series analysis, such as resampling, rolling windows, and time-based joins.


Remember, practice is key to mastering these concepts. Try applying these techniques to your own datasets or explore public time series datasets to reinforce your learning.


In upcoming lectures, we'll build on these fundamentals to explore more advanced time series functionality in Pandas, including a closer look at resampling, time zone handling, and periods. These topics will further extend your ability to analyze and manipulate time-based data.


Keep in mind that working with dates and times can sometimes be tricky, especially when dealing with time zones, daylight saving time, or inconsistent data formats. Always be mindful of the specific requirements of your data and analysis, and don't hesitate to refer back to the Pandas documentation for detailed information on these functions and methods.


By mastering these datetime manipulation techniques in Pandas, you're well-equipped to tackle a wide range of time series analysis tasks, opening up new possibilities for insights in your data science projects.