Link to Medium blog post: https://towardsdatascience.com/working-with-datetime-in-pandas-dataframe-663f7af6c587

# 1. Convert strings to datetime

In [1]:
import pandas as pd

## With default arguments

Pandas has a built-in function called to_datetime() that can be used to convert strings to datetime.

In [4]:
# Pandas to_datetime() can parse many common date string formats to datetime without any additional arguments

df = pd.DataFrame({'date': ['3/10/2000', '3/11/2000', '3/12/2000'], 'value': [2, 3, 4]})

df['date'] = pd.to_datetime(df['date'])

df

Unnamed: 0,date,value
0,2000-03-10,2
1,2000-03-11,3
2,2000-03-12,4


## Day first format

By default, to_datetime() parses strings in month-first (MM/DD, MM DD, or MM-DD) format, an arrangement used mainly in the United States.

In [5]:
# To consider day first instead of month, you can set the argument dayfirst to True

df = pd.DataFrame({'date': ['3/10/2000', '3/11/2000', '3/12/2000'], 'value': [2, 3, 4]})

df['date'] = pd.to_datetime(df['date'], dayfirst=True)

df

Unnamed: 0,date,value
0,2000-10-03,2
1,2000-11-03,3
2,2000-12-03,4
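When the layout of the strings is known in advance, an explicit format is an unambiguous alternative to dayfirst. The following is a minimal sketch, assuming every string is day/month/year; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

raw = pd.Series(['3/10/2000', '3/11/2000', '3/12/2000'])

# dayfirst=True asks the parser to prefer a day-first interpretation
preferred = pd.to_datetime(raw, dayfirst=True)

# An explicit format states the layout outright: day/month/4-digit year
explicit = pd.to_datetime(raw, format='%d/%m/%Y')

# Both approaches read the second field as the month
print(explicit.dt.month.tolist())
```

For these three strings both calls give the same result. The explicit format simply removes any guessing about which field is the day.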


## Custom format

By default, strings are parsed with the Pandas built-in parser, which builds on dateutil.parser.parse. Sometimes your strings are in a custom format, for example YYYY-DD-MM HH:MM:SS.

In [7]:
# Pandas to_datetime() has an argument called format that allows you to pass a custom format

df = pd.DataFrame({'date': ['2016-6-10 20:30:0', '2016-7-1 19:45:30', '2013-10-12 4:5:1'], 'value': [2, 3, 4]})

df['date'] = pd.to_datetime(df['date'], format='%Y-%d-%m %H:%M:%S')

df

Unnamed: 0,date,value
0,2016-10-06 20:30:00,2
1,2016-01-07 19:45:30,3
2,2013-12-10 04:05:01,4


## Speed up parsing

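Passing infer_datetime_format=True can often speed up parsing when the strings are not exactly ISO 8601 but still follow a regular format. According to [1], in some cases this can increase the parsing speed by 5–10x. In pandas 2.0 and later the argument is deprecated because a strict version of this inference is the default, which is why the two timings below are nearly identical.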

In [9]:
# Make up 3000 rows
df = pd.DataFrame({'date': ['3/11/2000', '3/12/2000', '3/13/2000'] * 1000 })

%timeit pd.to_datetime(df['date'])

%timeit pd.to_datetime(df['date'], infer_datetime_format=True) # deprecated since pandas 2.0

2.08 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)




2.21 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
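In pandas versions where infer_datetime_format is deprecated, one way to avoid any format guessing is to pass the format explicitly. The sketch below only checks that the explicit format gives the same result as the default call. It assumes month/day/year strings and makes no claim about the speed difference, which is best measured with %timeit on your own data.

```python
import pandas as pd

# Same made-up 3000 rows as in the timing cell
dates = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000'] * 1000)

# Default call: the format is worked out by the parser
inferred = pd.to_datetime(dates)

# Explicit call: the format is stated, so nothing has to be inferred
explicit = pd.to_datetime(dates, format='%m/%d/%Y')

# The two results should be identical for this data
same_result = inferred.equals(explicit)
print(same_result)
```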




## Handle parsing error

You will end up with a ValueError if a date string does not match the expected timestamp format.

In [10]:
df = pd.DataFrame({'date': ['3/10/2000', 'a/11/2000', '3/12/2000'], 'value': [2, 3, 4]})

df['date'] = pd.to_datetime(df['date'])

ValueError: time data "a/11/2000" doesn't match format "%m/%d/%Y", at position 1. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

In [11]:
# to_datetime() has an argument called errors that lets you ignore the error ('ignore') or force invalid values to NaT ('coerce')
# Note: errors='ignore' is deprecated as of pandas 2.2

df['date'] = pd.to_datetime(df['date'], errors='ignore')

df


Unnamed: 0,date,value
0,3/10/2000,2
1,a/11/2000,3
2,3/12/2000,4


In [12]:
df['date'] = pd.to_datetime(df['date'], errors='coerce')

df

Unnamed: 0,date,value
0,2000-03-10,2
1,NaT,3
2,2000-03-12,4
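After coercing, it is often useful to see which inputs failed to parse. A minimal sketch, assuming the parsed values are stored in a separate column (the name parsed is illustrative) so that the original strings stay available for inspection:

```python
import pandas as pd

df = pd.DataFrame({'date': ['3/10/2000', 'a/11/2000', '3/12/2000'], 'value': [2, 3, 4]})

# Parse into a new column so the original strings are kept
df['parsed'] = pd.to_datetime(df['date'], errors='coerce')

# Rows where parsing failed have NaT in the parsed column
bad_rows = df[df['parsed'].isna()]
print(bad_rows)
```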


# 2. Assemble a datetime from multiple columns

to_datetime() can also assemble a datetime from multiple columns. The keys (column labels) can be common abbreviations like ['year', 'month', 'day', 'minute', 'second', 'ms', 'us', 'ns'] or plurals of the same.

In [13]:
df = pd.DataFrame({'year': [2015, 2016], 'month': [2, 3], 'day': [4, 5]})

df['date'] = pd.to_datetime(df)

df

Unnamed: 0,year,month,day,date
0,2015,2,4,2015-02-04
1,2016,3,5,2016-03-05
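The same approach works when time components are present as well. The sketch below assumes extra hour and minute columns with made-up values, and selects the component columns explicitly so that unrelated columns in the frame do not interfere.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2015, 2016],
    'month': [2, 3],
    'day': [4, 5],
    'hour': [2, 3],
    'minute': [30, 45],
})

# Pass only the component columns; year, month, and day are required
df['timestamp'] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute']])

print(df['timestamp'])
```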


# 3. Get year, month, and day

dt.year, dt.month, and dt.day are the built-in attributes for getting the year, month, and day from a Pandas datetime column.

In [14]:
# First, let’s create a dummy DataFrame and parse DoB to datetime

df = pd.DataFrame({'name': ['Tom', 'Andy', 'Lucas'],
                 'DoB': ['08-05-1997', '04-28-1996', '12-16-1995']})

df['DoB'] = pd.to_datetime(df['DoB'])

In [15]:
# And to get year, month, and day separately, you can use dt accessor
df['year'] = df['DoB'].dt.year
df['month'] = df['DoB'].dt.month
df['day'] = df['DoB'].dt.day

df

Unnamed: 0,name,DoB,year,month,day
0,Tom,1997-08-05,1997,8,5
1,Andy,1996-04-28,1996,4,28
2,Lucas,1995-12-16,1995,12,16


# 4. Get the week of year, the day of week and leap year

Similarly, dt.isocalendar().week, dt.dayofweek, and dt.is_leap_year give the week of the year, the day of the week, and whether the year is a leap year.

In [17]:
df['week_of_year'] = df['DoB'].dt.isocalendar().week
df['day_of_week'] = df['DoB'].dt.dayofweek
df['is_leap_year'] = df['DoB'].dt.is_leap_year

df

Unnamed: 0,name,DoB,year,month,day,week_of_year,day_of_week,is_leap_year
0,Tom,1997-08-05,1997,8,5,32,1,False
1,Andy,1996-04-28,1996,4,28,17,6,True
2,Lucas,1995-12-16,1995,12,16,50,5,False


Note that dt.dayofweek assumes the week starts on Monday, denoted by 0, and ends on Sunday, denoted by 6.

In [19]:
# To replace the number with full name, we can create a mapping and pass it to map()
dw_mapping = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}

df['day_of_week_name'] = df['DoB'].dt.weekday.map(dw_mapping)

df

Unnamed: 0,name,DoB,year,month,day,week_of_year,day_of_week,is_leap_year,day_of_week_name
0,Tom,1997-08-05,1997,8,5,32,1,False,Tuesday
1,Andy,1996-04-28,1996,4,28,17,6,True,Sunday
2,Lucas,1995-12-16,1995,12,16,50,5,False,Saturday
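As an alternative to a hand-written mapping, the dt accessor also provides day_name() and month_name(). A short sketch using the same dates of birth; the explicit format is an assumption added here to keep parsing unambiguous.

```python
import pandas as pd

dob = pd.to_datetime(
    pd.Series(['08-05-1997', '04-28-1996', '12-16-1995']),
    format='%m-%d-%Y',
)

# Full weekday and month names, without building a mapping by hand
day_names = dob.dt.day_name()
month_names = dob.dt.month_name()

print(day_names.tolist())
print(month_names.tolist())
```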


# 5. Get the age from the date of birth

In [20]:
# The simplest solution to get age is by subtracting year
today = pd.to_datetime('today')
df['age'] = today.year - df['DoB'].dt.year

df

Unnamed: 0,name,DoB,year,month,day,week_of_year,day_of_week,is_leap_year,day_of_week_name,age
0,Tom,1997-08-05,1997,8,5,32,1,False,Tuesday,27
1,Andy,1996-04-28,1996,4,28,17,6,True,Sunday,28
2,Lucas,1995-12-16,1995,12,16,50,5,False,Saturday,29


However, this is not accurate, because some people might not have had their birthday yet this year.

In [21]:
# A more accurate solution would be to consider the birthday

# Year difference
today = pd.to_datetime('today')
diff_y = today.year - df['DoB'].dt.year

# True where the birthday has not yet occurred this year
b_md = df['DoB'].apply(lambda x: (x.month, x.day))
no_birthday = b_md > (today.month, today.day)
df['age'] = diff_y - no_birthday
df

Unnamed: 0,name,DoB,year,month,day,week_of_year,day_of_week,is_leap_year,day_of_week_name,age
0,Tom,1997-08-05,1997,8,5,32,1,False,Tuesday,26
1,Andy,1996-04-28,1996,4,28,17,6,True,Sunday,27
2,Lucas,1995-12-16,1995,12,16,50,5,False,Saturday,28
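The same birthday check can be written with the dt accessor only, without apply(). The sketch below is one possible variant, not the notebook's own method. It assumes a fixed reference date (2024-03-01, chosen purely for illustration) in place of today, so the result is reproducible.

```python
import pandas as pd

dob = pd.to_datetime(
    pd.Series(['08-05-1997', '04-28-1996', '12-16-1995']),
    format='%m-%d-%Y',
)

# Assumed reference date; replace with pd.Timestamp('today') for live data
ref = pd.Timestamp('2024-03-01')

# True where the birthday falls later in the year than the reference date
before_birthday = (dob.dt.month > ref.month) | (
    (dob.dt.month == ref.month) & (dob.dt.day > ref.day)
)

# Subtract one year for everyone whose birthday has not yet occurred
age = ref.year - dob.dt.year - before_birthday.astype(int)

print(age.tolist())
```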


# 6. Improve performance by setting date column as the index

In [None]:
# A common solution to select data by date is using a boolean mask

condition = (df['date'] > start_date) & (df['date'] <= end_date)
df.loc[condition]

If you are going to do a lot of selections by date, it is faster to set the date column as the index first so that you can take advantage of Pandas' built-in optimization. Then you can select data by date using df.loc[start_date:end_date].

In [None]:
'''df = pd.read_csv('data/city_sales.csv',parse_dates=['date'])
df.info()
RangeIndex: 1795144 entries, 0 to 1795143
Data columns (total 3 columns):
 #   Column  Dtype         
---  ------  -----         
 0   date    datetime64[ns]
 1   num     int64         
 2   city    object        
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 41.1+ MB'''

# To set the date column as the index
df = df.set_index(['date'])
df
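To make the two selection styles concrete, here is a small self-contained sketch on made-up daily data; the column names and dates are illustrative only. Note that a label slice on a DatetimeIndex includes both endpoints, so the mask here uses >= on the start date to match it.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2018-01-01', periods=10, freq='D'),
    'num': range(10),
})
start_date, end_date = '2018-01-03', '2018-01-06'

# Boolean mask on a regular column
mask = (df['date'] >= start_date) & (df['date'] <= end_date)
by_mask = df.loc[mask]

# Label slice on a sorted DatetimeIndex (both endpoints included)
indexed = df.set_index('date').sort_index()
by_slice = indexed.loc[start_date:end_date]

print(by_mask['num'].tolist())
print(by_slice['num'].tolist())
```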

# 7. Select data with a specific year and perform aggregation


In [None]:
# Let’s say we would like to select all data in the year 2018
df.loc['2018']

# Get the total num in 2018
df.loc['2018', 'num'].sum()

# Get the total num for each city in 2018
df.loc['2018'].groupby('city').sum()
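Since data/city_sales.csv is not included here, the following sketch reproduces the year selection and aggregation on a tiny made-up frame with the same column names (num and city). The values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'num': [1, 2, 3, 4, 5], 'city': ['A', 'A', 'B', 'A', 'B']},
    index=pd.to_datetime(
        ['2017-12-31', '2018-01-15', '2018-05-01', '2018-05-02', '2019-01-01']
    ),
)

# All rows whose timestamp falls in 2018
rows_2018 = df.loc['2018']

# Total of num in 2018
total_2018 = df.loc['2018', 'num'].sum()

# Total of num per city in 2018
per_city_2018 = df.loc['2018'].groupby('city')['num'].sum()

print(total_2018)
print(per_city_2018)
```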

# 8. Select data with a specific month and a specific day of the month

In [None]:
# To select data with a specific month, for example, May 2018
df.loc['2018-5']

# Similarly, to select data with a specific day of the month, for example, 1st May 2018
df.loc['2018-5-1']
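A small sketch of month and day selection on made-up data follows. It assumes the index holds timestamps with a time-of-day component, so that a date string is less precise than the index and selects every row within that day.

```python
import pandas as pd

idx = pd.to_datetime([
    '2018-04-30 23:00',
    '2018-05-01 09:00',
    '2018-05-01 18:30',
    '2018-05-02 10:35',
    '2018-06-01 00:00',
])
df = pd.DataFrame({'num': [1, 2, 3, 4, 5]}, index=idx)

# Every row in May 2018
may = df.loc['2018-05']

# Every row on 1st May 2018, whatever the time of day
may_first = df.loc['2018-05-01']

print(may['num'].tolist())
print(may_first['num'].tolist())
```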

# 9. Select data between two dates

To select data between two dates, you can use df.loc[start_date:end_date].

In [None]:
# Select data between 2016 and 2018
df.loc['2016' : '2018']

# Select data between 10 and 11 o'clock on the 2nd May 2018
df.loc['2018-5-2 10' : '2018-5-2 11']

# Select data between 10:30 and 10:45 on the 2nd May 2018
df.loc['2018-5-2 10:30' : '2018-5-2 10:45']

# To select data by time of day regardless of the date, use between_time(), for example, between 10:30 and 10:45
df.between_time('10:30','10:45')
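The difference between a datetime slice and between_time() is easiest to see on a few rows. The sketch below uses made-up timestamps spread over three days: the slice is limited to a single day, while between_time() matches the time of day on every date.

```python
import pandas as pd

idx = pd.to_datetime([
    '2018-05-01 10:40',
    '2018-05-02 10:30',
    '2018-05-02 10:45',
    '2018-05-02 11:00',
    '2018-05-03 10:35',
])
df = pd.DataFrame({'num': [1, 2, 3, 4, 5]}, index=idx)

# Slice between two timestamps on 2nd May 2018 (both endpoints included)
window = df.loc['2018-05-02 10:30':'2018-05-02 10:45']

# Rows between 10:30 and 10:45 on any date
daily_window = df.between_time('10:30', '10:45')

print(window['num'].tolist())
print(daily_window['num'].tolist())
```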

# 10. Handle missing values

We often need to compute window statistics such as a rolling mean or a rolling sum.

In [None]:
# Compute the rolling sum of the numeric column over a 3-period window and then have a look at the top 5 rows
df['rolling_sum'] = df['num'].rolling(3).sum()
df.head()

We can see that valid values only start once there are 3 periods to look back over. One way to handle this is to backfill the data.

In [None]:
df['rolling_sum_backfilled'] = df['rolling_sum'].bfill()
df.head()
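The effect of backfilling is easier to check on a short series. The sketch below uses made-up values and also shows min_periods=1 as another option, which computes a partial sum for the first rows instead of copying a later value backwards. Which one is appropriate depends on the analysis.

```python
import pandas as pd

s = pd.Series(
    [1, 2, 3, 4, 5],
    index=pd.date_range('2018-01-01', periods=5, freq='D'),
)

# The first two windows are incomplete, so they are NaN
rolling_sum = s.rolling(3).sum()

# Option 1: fill the leading NaN values with the first valid value
backfilled = rolling_sum.bfill()

# Option 2: allow partial windows instead of producing NaN
partial = s.rolling(3, min_periods=1).sum()

print(backfilled.tolist())
print(partial.tolist())
```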