# Dates and times in pandas

As mentioned previously, what makes a dataset a time series is its date index. When working with such data, we must make sure the dates are well formatted so that any later processing is aware of the date element.

In a dataset, the date interval could be yearly, monthly, weekly or daily. Intervals can also be finer, such as hours, minutes or even seconds. We need to make sure the integrity of these units is preserved throughout.

Both Python and pandas have built-in support for dates and times. In this course we will look specifically at handling dates in pandas dataframes.


# Reading data with dates and times

Datetimes in data are mostly in one of three formats:


*   **String** - A string representing a datetime, e.g. "2018/06/12", "2018-Jun-12"
*   **Epoch** - An integer specifying the number of seconds that have elapsed since a particular time (the origin), e.g. 1544572800, which translates to midnight UTC on 12 December 2018
*   **Spread out in different columns** - Occasionally the date is spread across several columns of the dataframe, with the year, month and day stored separately as integers or strings.

Pandas is built to handle all of the above scenarios. A datetime object in pandas is called a `Timestamp`.
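As a small illustration of what a pandas datetime object looks like, the sketch below builds one from a string and reads its parts. The date used is only an example value.

```python
import pandas as pd

# Build a single Timestamp from a string
ts = pd.Timestamp("2018-06-12 08:30")

# Each part of the date is available as an attribute
print(ts.year, ts.month, ts.day, ts.hour, ts.minute)
print(ts.day_name())  # name of the weekday

# pd.to_datetime returns the same type when given a single value
same = pd.to_datetime("2018-06-12 08:30")
print(type(same).__name__)
```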

Pandas provides the function `pandas.to_datetime` to convert all of the above options to datetime objects.

Please review the official documentation [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) to see the syntax of this function and to understand the parameters it accepts.


# Time Deltas
Timedeltas are differences in times, expressed in different units, e.g. days, hours, minutes, seconds. They can be both positive and negative.
Examples are a "difference of 2 days" or a "difference of -4 hours".

In pandas, when you subtract two dates you get a time delta specifying the time difference between them. This can then be converted into different units such as days, hours or minutes.
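A minimal sketch of this, using two made-up timestamps: the subtraction produces a `Timedelta`, which can then be expressed in whichever unit is convenient.

```python
import pandas as pd

start = pd.Timestamp("2018-06-10 06:00")
end = pd.Timestamp("2018-06-12 18:00")

# Subtracting two timestamps gives a Timedelta
delta = end - start
print(delta)

# Convert the difference into different units
days_part = delta.days                       # whole days only
total_hours = delta / pd.Timedelta(hours=1)  # full difference in hours
total_minutes = delta.total_seconds() / 60   # full difference in minutes
print(days_part, total_hours, total_minutes)
```

Note that `.days` keeps only the whole-day part, while dividing by another `Timedelta` converts the entire difference.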

Review the documentation to understand the aliases and units used: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Timedelta.html


In [None]:
import pandas as pd
import numpy as np

# From a string
pd.Timedelta('1 days')

# The same duration from a number and a unit
pd.Timedelta(1, 'D')

In [None]:
# integers with a unit
pd.Timedelta(50, unit='s')

# Converting string formatted dates
Strings are the most common form in which dates are presented. String-formatted dates can come in many different arrangements, and pandas can parse most of the common ones.

In the example below we create a dataframe with four different date formats. Pandas detects the format of each value and makes the right conversion. In pandas 2.0 and later, a column that mixes formats must be parsed with `format='mixed'`; older versions did this by default.

In [None]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'date': ['2005/11/23',
                            '2010.12.31',
                            'Jul 31, 2009',
                            '2010-01-10',
                            None]})
df

In [None]:
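# format='mixed' (pandas 2.0+) parses each value on its own; omit it on older versions
df['full_date'] = pd.to_datetime(df['date'], format='mixed')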
df

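Missing values such as `None` are converted to "NaT", which means "not a time". Strings that cannot be parsed raise an error by default; passing `errors='coerce'` turns them into "NaT" instead.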
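The sketch below shows one way invalid entries can end up as `NaT`. It assumes a small made-up series and uses the `errors='coerce'` option of `pd.to_datetime`, which replaces anything that cannot be parsed with `NaT` rather than raising an error.

```python
import pandas as pd

raw = pd.Series(["2005/11/23", "not a date", None])

# errors='coerce' replaces anything that cannot be parsed with NaT
parsed = pd.to_datetime(raw, errors="coerce")
print(parsed)

# isna() flags the rows that were missing or failed to parse
print(parsed.isna())
```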

When using string dates you can also pass a `format` argument. This states the format the original dates are in. Providing it makes the parsing explicit and can also speed up the conversion considerably.

Note: passing the `format` argument means all the dates will be parsed with the same format, so if you have a dataset with different date styles in the same column it is advisable not to provide this option.

In [None]:
import pandas as pd
df = pd.DataFrame({'date': ['2005/11/23 00:00',
                            '2010/12/31 03:45',
                            '2015/8/20 13:59',
                            '2018/5/7 19:12']})
df    

In [None]:
df['full_date'] = pd.to_datetime(df['date'], format='%Y/%m/%d %H:%M')
df

For more information on how to specify the `format` options, see https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior.
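As a further illustration of the format codes, here is a sketch with a different, made-up date style. It assumes English month abbreviations, since `%b` depends on the locale.

```python
import pandas as pd

dates = pd.Series(["23-Nov-2005 14:30", "07-May-2018 19:12"])

# %d = day, %b = abbreviated month name, %Y = four-digit year, %H:%M = 24-hour time
parsed = pd.to_datetime(dates, format="%d-%b-%Y %H:%M")
print(parsed)

# The same codes turn timestamps back into strings
as_text = parsed.dt.strftime("%Y/%m/%d")
print(as_text)
```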

# Converting dates from Epoch to timestamps

As mentioned before, an epoch is an integer that specifies the number of seconds that have elapsed since a particular time, which is defined as the origin. The default origin for computers is midnight on January 1, 1970 (UTC), so most epoch times represent the number of seconds since that moment. It is possible to use a different origin for private use; pandas lets you specify the origin if yours isn't the default "January 1, 1970".

Epoch time, while an interesting concept, does raise an interesting future problem similar to Y2K. There is nothing to worry about, though; you can read more on this [here](https://www.theguardian.com/technology/2014/dec/17/is-the-year-2038-problem-the-new-y2k-bug).

You can also experiment with epoch times using this [converter](https://www.epochconverter.com/).
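The custom origin mentioned above can be sketched with the `origin` parameter of `pd.to_datetime`. The origin date and the day counts here are illustrative values only.

```python
import pandas as pd

# Days counted from a custom origin instead of 1970-01-01
custom = pd.to_datetime([0, 1, 2], unit="D", origin=pd.Timestamp("1960-01-01"))
print(custom)

# The same integers interpreted with the default origin
default = pd.to_datetime([0, 1, 2], unit="D")
print(default)
```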

In [None]:
import pandas as pd
df = pd.DataFrame({'date': [1524460623, 1524560699, 151456057, 1544572800]})
df

In [None]:
df['full_date'] = pd.to_datetime(df['date'], unit='s')
df

# Reading dates spread out in different columns

In [None]:
import pandas as pd
df = pd.DataFrame({'year': [2015, 2016, 2017, 2018],
                   'month': [2, 3, 4, 5],
                   'day': [4, 5, 12, 23],
                   'count': [258, 356, 421, 578]})
df

The dataframe above has a column each for day, month and year. The pandas function `pandas.to_datetime` also accepts a dataframe of columns, which it assembles into a series of timestamps.

We will create a new column called "full_date" to hold the assembled date. Note that only the columns needed for the date are passed to the function.

In [None]:
df['full_date'] = pd.to_datetime(df[['year', 'day', 'month']])
df

Note: the names of the columns passed to the function matter. Pandas uses the keys to determine which part of the date is which.

**From the documentation:** The keys can be common abbreviations like ['year', 'month', 'day', 'minute', 'second', 'ms', 'us', 'ns'] or plurals of the same.
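As a sketch of how the keys extend beyond the date itself, the made-up dataframe below adds `hour` and `minute` columns, which `pd.to_datetime` folds into the assembled timestamps.

```python
import pandas as pd

parts = pd.DataFrame({"year": [2018, 2018],
                      "month": [5, 6],
                      "day": [23, 12],
                      "hour": [14, 9],
                      "minute": [30, 5]})

# year, month and day are required; the time columns are optional
stamps = pd.to_datetime(parts)
print(stamps)
```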

# Adding to and subtracting from dates

It's not possible to add two dates, but it is possible to add to and subtract from dates. You will use this to answer questions like:

*   What is the date 5 days from now?
*   What was the date 2 months ago?

To do this we add or subtract a time delta.
See the example below.
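The two questions in the list can be sketched directly against the current time. This is only an illustration: the month shift uses `pd.DateOffset`, a calendar-aware offset, because a month is not a fixed number of days.

```python
import pandas as pd

now = pd.Timestamp.now()

# What is the date 5 days from now?
five_days_from_now = now + pd.Timedelta(days=5)

# What was the date 2 months ago?
two_months_ago = now - pd.DateOffset(months=2)

print(now, five_days_from_now, two_months_ago)
```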



In [None]:
import pandas as pd
df = pd.DataFrame({'date': ['2005/11/23',
                            '2010.12.31',
                            'Jul 31, 2009', 
                            '2010-01-10']})
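# format='mixed' (pandas 2.0+) parses each value on its own; omit it on older versions
df['date'] = pd.to_datetime(df['date'], format='mixed')  # convert column to timestamps

df['plus_5_days'] = df['date'] + pd.Timedelta(5, 'D')
# Current pandas rejects the month unit 'M' for Timedelta, so approximate 2 months as 2 * 30.4167 days
df['minus_2_months'] = df['date'] - pd.Timedelta(days=2 * 30.4167)
df['minus_50_minutes'] = df['date'] - pd.Timedelta(50, 'min')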
df

Another way is to use `DateOffset`.

In [None]:
import pandas as pd
df = pd.DataFrame({'date': ['2005/11/23',
                            '2010.12.31',
                            'Jul 31, 2009', 
                            '2010-01-10']})
# format='mixed' (pandas 2.0+) parses each value on its own; omit it on older versions
df['date'] = pd.to_datetime(df['date'], format='mixed')  # convert column to timestamps

df['plus_5_days'] = df['date'] + pd.DateOffset(days=5)
df['minus_2_months'] = df['date'] - pd.DateOffset(months=2)
df['minus_50_minutes'] = df['date'] - pd.DateOffset(minutes=50)
df

There is a slight difference between `Timedelta` and `DateOffset`. A `Timedelta` holds a fixed time difference, so adding or subtracting it shifts the date as a whole, while a `DateOffset` adds to or subtracts from the part of the date that has the same unit as the offset.

You can see this in the examples above, where subtracting two months yields different results in the hours.
A `Timedelta` can only treat a month as an average length of about 30.4 days, while `DateOffset` simply subtracts 2 from the month part of the date.
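The difference shows up clearly at month ends. In this sketch, with an example date chosen for illustration, the calendar-aware offset lands on the last valid day of the previous month, while a fixed 30-day delta simply counts back 30 days.

```python
import pandas as pd

ts = pd.Timestamp("2009-03-31")

# DateOffset changes the month part and clips to the last valid day
by_offset = ts - pd.DateOffset(months=1)

# Timedelta subtracts a fixed duration of 30 days
by_delta = ts - pd.Timedelta(days=30)

print(by_offset)
print(by_delta)
```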

# Subtracting two dates
Subtracting two dates is as easy as using the subtraction sign. Subtracting two dates yields a timedelta object.

In [None]:
import pandas as pd
df = pd.DataFrame({'start_date': ['2005/11/23 14:00',
                                  '2010.12.25',
                                  'Jul 31, 2009',
                                  '2010-01-10'],
                   'end_date': ['2005/11/23 14:45',
                                '2010.12.31',
                                'Oct 31, 2009',
                                '2011-01-10']})
# format='mixed' (pandas 2.0+) parses each value on its own; omit it on older versions
df['start_date'] = pd.to_datetime(df['start_date'], format='mixed')  # convert column to timestamps
df['end_date'] = pd.to_datetime(df['end_date'], format='mixed')  # convert column to timestamps

df['difference'] = df['end_date'] - df['start_date']
print(df)

print(df.dtypes)
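A timedelta column like `difference` can be converted into plain numbers through the `.dt` accessor. The sketch below rebuilds a two-row version of the data so it runs on its own; the `days` and `minutes` column names are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2005-11-23 14:00", "2010-12-25 00:00"]),
    "end_date": pd.to_datetime(["2005-11-23 14:45", "2010-12-31 00:00"]),
})
df["difference"] = df["end_date"] - df["start_date"]

# Whole days in each difference
df["days"] = df["difference"].dt.days

# Full difference expressed in minutes
df["minutes"] = df["difference"].dt.total_seconds() / 60
print(df)
```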


# Creating date ranges

We can also create date ranges that run from one date to another at any interval we choose, using the function `date_range`.

In the example below we generate dates from 1 January 2018 to 31 July 2018 at a 5-day interval.

In [None]:
pd.date_range(start='2018-01-01', end='2018-07-31', freq='5D')
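Two variations on `date_range`, sketched with illustrative values: fixing the number of dates with `periods` instead of giving an end date, and using the `'MS'` (month start) frequency alias.

```python
import pandas as pd

# Fix the number of dates instead of the end date
by_count = pd.date_range(start="2018-01-01", periods=4, freq="5D")
print(by_count)

# 'MS' is the alias for "month start"
month_starts = pd.date_range(start="2018-01-01", end="2018-07-31", freq="MS")
print(month_starts)
```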