## Datetime Data
Datasets often contain datetimes that record when each observation occurred. When summarising data, we often want to use features of the datetime (e.g. hour, day of the week, month, year). These features are easy to access as long as the data is stored in datetime format.

***
## Import with datetime format
If you import data without any optional arguments, Pandas will import datetimes as strings (i.e. pieces of text). You can test this with the pollution dataset. When you run the code below, you should see that the 'date' column has the object data type, which is essentially a string.


In [1]:
import pandas as pd
pollution_data = pd.read_csv('LSTM-Multivariate_pollution.csv')
pollution_data.dtypes

date          object
pollution      int64
dew            int64
temp         float64
press        float64
wnd_dir       object
wnd_spd      float64
snow           int64
rain           int64
dtype: object

Having the dates in this format makes it difficult to access datetime features. For example, how could we find the day of the week (Mon, Tues, etc.) if Pandas views the dates as pieces of text whose characters have no meaning? To extract datetime features, we need to import the data so that this column is in datetime format. Revisiting our earlier import, <code>index_col = 'date'</code> sets date as the index of the dataframe. Adding the <code>parse_dates = True</code> argument tells Pandas to try to convert the index to datetime format.

In [2]:
pollution_data = pd.read_csv('LSTM-Multivariate_pollution.csv', index_col = 'date', parse_dates = True)
pollution_data.index

Index(['2/01/2010 0:00', '2/01/2010 1:00', '2/01/2010 2:00', '2/01/2010 3:00',
       '2/01/2010 4:00', '2/01/2010 5:00', '2/01/2010 6:00', '2/01/2010 7:00',
       '2/01/2010 8:00', '2/01/2010 9:00',
       ...
       '31/12/2014 14:00', '31/12/2014 15:00', '31/12/2014 16:00',
       '31/12/2014 17:00', '31/12/2014 18:00', '31/12/2014 19:00',
       '31/12/2014 20:00', '31/12/2014 21:00', '31/12/2014 22:00',
       '31/12/2014 23:00'],
      dtype='object', name='date', length=43800)

Looking at the index, we see that it still has the object (i.e. string) data type. In this instance Pandas has been unable to convert to datetime format by itself, and needs some more assistance. Looking at the read_csv documentation, you will notice that the <code>dayfirst</code> argument defaults to <code>False</code>. Our dates start with the day number, so we want to change this to <code>True</code>.
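To see why this setting matters, here is a minimal sketch that parses the first timestamp from the dataset with <code>pd.to_datetime</code>, once with the default and once with <code>dayfirst = True</code>. It uses <code>pd.to_datetime</code> rather than <code>read_csv</code> only so that it runs without the CSV file, and the variable names are illustrative.

```python
import pandas as pd

# First timestamp in the dataset, intended to mean 2 January 2010
raw = '2/01/2010 0:00'

# Default (dayfirst=False): the first number is read as the month
month_first = pd.to_datetime(raw)

# dayfirst=True: the first number is read as the day
day_first = pd.to_datetime(raw, dayfirst=True)

print(month_first)  # 2010-02-01 00:00:00
print(day_first)    # 2010-01-02 00:00:00
```

The same string produces two different dates, which is why an ambiguous day/month order needs to be stated explicitly.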


In [3]:
pollution_data = pd.read_csv('LSTM-Multivariate_pollution.csv', index_col = 'date', parse_dates = True, dayfirst = True)
pollution_data.index

DatetimeIndex(['2010-01-02 00:00:00', '2010-01-02 01:00:00',
               '2010-01-02 02:00:00', '2010-01-02 03:00:00',
               '2010-01-02 04:00:00', '2010-01-02 05:00:00',
               '2010-01-02 06:00:00', '2010-01-02 07:00:00',
               '2010-01-02 08:00:00', '2010-01-02 09:00:00',
               ...
               '2014-12-31 14:00:00', '2014-12-31 15:00:00',
               '2014-12-31 16:00:00', '2014-12-31 17:00:00',
               '2014-12-31 18:00:00', '2014-12-31 19:00:00',
               '2014-12-31 20:00:00', '2014-12-31 21:00:00',
               '2014-12-31 22:00:00', '2014-12-31 23:00:00'],
              dtype='datetime64[ns]', name='date', length=43800, freq=None)

Now you will notice the index is in datetime format. For more complicated cases than this one you may need to use the <code>date_format</code> argument, which allows you to specify the exact format the date is in. The date format is specified with a string. For this dataset the string is <code>'%d/%m/%Y %H:%M'</code>, where <code>%H</code> is the hour on a 24-hour clock. For further information you can consult the strftime documentation: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior.
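As a rough illustration of an explicit format string, the sketch below parses two timestamps copied from the index output above. It uses <code>pd.to_datetime</code> with its <code>format</code> argument rather than <code>read_csv</code>, purely so that it runs without the CSV file. The variable names are illustrative.

```python
import pandas as pd

# Two timestamps copied from the dataset's index (day/month/year, 24-hour time)
raw_dates = ['2/01/2010 0:00', '31/12/2014 23:00']

# %d = day, %m = month, %Y = four-digit year,
# %H = hour on a 24-hour clock, %M = minute
parsed = pd.to_datetime(raw_dates, format='%d/%m/%Y %H:%M')

print(parsed)
```

Because the hours in this dataset run from 0 to 23, the 24-hour directive <code>%H</code> is needed. The 12-hour directive <code>%I</code> would not match values such as 0 or 23.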

***
## Extracting datetime features
Once you have a datetime index or column for your dataframe you can easily extract features of the datetime with methods and attributes. Datetime methods and attributes are directly accessible for a datetime index. For a datetime column, you need to use the <code>dt</code> accessor first. 

| Attribute/Method	| Description |
|-------------------|-------------|
| <code>index.year</code> or <code>series.dt.year</code>	| The year |
| <code>index.month</code> or <code>series.dt.month</code>	| The month described as a number (1-12) |
| <code>index.month_name()</code> or <code>series.dt.month_name()</code>	| The month name described as a string |
| <code>index.day_of_year</code> or <code>series.dt.day_of_year</code>	| The day of the year described as a number (1-366) |
| <code>index.day</code> or <code>series.dt.day</code>	| The day of the month as a number (1-31) |
| <code>index.day_of_week</code> or <code>series.dt.day_of_week</code> | The day of the week as a number (0-6, where Monday is 0) |
| <code>index.day_name()</code> or <code>series.dt.day_name()</code>	| The day name as a string |
| <code>index.hour</code> or <code>series.dt.hour</code>	| The hour (0-23) |
| <code>index.minute</code> or <code>series.dt.minute</code>	 | The minute (0-59) |
| <code>index.second</code> or <code>series.dt.second</code>	| The second (0-59) |
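The difference between the index and column forms can be sketched with a small, hypothetical dataframe in which the datetimes are held in a regular column. The dataframe <code>df</code> and its column names are assumptions made for illustration. The two timestamps are copied from the pollution dataset.

```python
import pandas as pd

# Hypothetical dataframe where the datetimes are a regular column, not the index
df = pd.DataFrame({'date': ['2/01/2010 0:00', '31/12/2014 23:00']})
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y %H:%M')

# For a column (a Series), go through the .dt accessor
df['day_name'] = df['date'].dt.day_name()
df['hour'] = df['date'].dt.hour

# For a DatetimeIndex, the same features are available directly
idx = pd.DatetimeIndex(df['date'])
index_day_names = idx.day_name()

print(df)
print(index_day_names)
```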


For example, to add a new column containing the month name to the pollution dataframe:

In [4]:
pollution_data['month'] = pollution_data.index.month_name()

or if we wanted a column with the hour:

In [5]:
pollution_data['hour'] = pollution_data.index.hour

We can then use these columns for grouped aggregation. For example, to see how average pollution varies by hour of the day:

In [6]:
pollution_data.groupby('hour')['pollution'].mean()

hour
0     107.798356
1     108.714521
2     105.124384
3     103.306849
4      99.460822
5      95.355068
6      92.370411
7      91.499726
8      91.335890
9      90.211507
10     88.787945
11     86.755616
12     84.841644
13     84.326027
14     82.122740
15     81.503562
16     81.779178
17     83.588493
18     87.797260
19     92.968767
20     99.741918
21    104.221370
22    105.911233
23    106.801096
Name: pollution, dtype: float64
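A helper column is not strictly required for this kind of aggregation. As a minimal sketch, the example below groups directly on a feature of the datetime index. The dataframe <code>toy</code> and its pollution values are made up for illustration and are not taken from the pollution dataset.

```python
import pandas as pd

# Made-up readings on a small datetime index, for illustration only
idx = pd.to_datetime(['2010-01-02 00:00', '2010-01-02 12:00',
                      '2010-01-03 00:00', '2010-01-03 12:00'])
toy = pd.DataFrame({'pollution': [100, 80, 120, 60]}, index=idx)

# Group directly on the hour of the index, without adding an 'hour' column
hourly_mean = toy.groupby(toy.index.hour)['pollution'].mean()

print(hourly_mean)
```

Adding an explicit column, as above, is convenient when the feature will be reused. Grouping on the index attribute avoids modifying the dataframe for a one-off summary.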