# Time Series Analysis in Python

In [None]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline 

### Python Date and Time Data Types 

The Python standard library includes data types for date and time data, as well as calendar-related functionality. 
To represent dates, the *datetime* type is often used:

In [None]:
from datetime import datetime
d = datetime.now()
d

We can also extract individual parts of the date:

In [None]:
d.year, d.month, d.day

In [None]:
d.hour, d.minute

We can create a new datetime by specifying either the date alone or the date and time:

In [None]:
datetime(2016, 1, 7)

In [None]:
# specify a time too
datetime(2016, 1, 7, 13, 30)

Dates and times are stored to the microsecond. A *timedelta* is the temporal difference between two datetime values: 

In [None]:
from datetime import timedelta
d1 = datetime(2016, 1, 7, 11, 15)
d2 = datetime(2016, 2, 15, 11, 35)
diff = d2 - d1
print("Difference %s and %s = %d days, %d seconds" % (d1, d2, diff.days, diff.seconds) )
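A point that is easy to miss: the `seconds` attribute holds only the part of the difference left over after whole days are removed, not the full difference. The short sketch below reuses the same two dates and compares it with `total_seconds()`:

```python
from datetime import datetime

d1 = datetime(2016, 1, 7, 11, 15)
d2 = datetime(2016, 2, 15, 11, 35)
diff = d2 - d1

# .seconds is the remainder after whole days have been removed
print(diff.days, diff.seconds)   # 39 1200
# total_seconds() converts the entire difference to seconds
print(diff.total_seconds())      # 3370800.0
```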

You can add a timedelta to, or subtract one from, an existing datetime to get a new shifted date:

In [None]:
# add 0 days and 60 seconds to the existing date
d1 + timedelta(0,60)

In [None]:
# add 3 days and 0 seconds to the existing date
d1 + timedelta(3)

Python datetime values can be formatted as strings using special formatting codes. There are many different ways we can format the same datetime value:

In [None]:
d = datetime(2016, 3, 1, 9, 30)
print( d.strftime('%d/%m/%y') )
print( d.strftime('%Y-%m-%d') )
print( d.strftime('%Y-%m-%d %H:%M') )
print( d.strftime('%a %d %Y') )
print( d.strftime('%A %d %B, %Y') )

We can also use these codes to parse dates from strings:

In [None]:
value1 = "2016-01-03"
datetime.strptime(value1, '%Y-%m-%d')

In [None]:
value2 = "01/03/16 14:56"
datetime.strptime(value2, '%d/%m/%y %H:%M')
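As a small illustration of how formatting and parsing mirror each other, the sketch below parses a string, formats it back with the same codes, and shows that a string which does not match the format raises a `ValueError`. The variable names are only illustrative.

```python
from datetime import datetime

fmt = '%d/%m/%y %H:%M'
parsed = datetime.strptime("01/03/16 14:56", fmt)

# formatting with the same codes reproduces the original string
round_trip = parsed.strftime(fmt)
print(round_trip)

# a string that does not match the format raises ValueError
try:
    datetime.strptime("2016-03-01", fmt)
    matched = True
except ValueError:
    matched = False
print(matched)
```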

### Pandas Date and Time Data Types 
Pandas contains its own types and functionality for working with dates and times. The basic date/time type is a *Timestamp*:

In [None]:
t = pd.Timestamp(2018, 5, 1, 18, 35)
t

In [None]:
t.year, t.month, t.day, t.hour, t.minute, t.second

Pandas will also attempt to create a Timestamp by parsing a string containing a date:

In [None]:
t = pd.Timestamp( "13th April 2018" )
t

We can use the *pd.date_range()* function to generate a list of dates and times, based on a specified start date, number of time periods, and frequency. This is stored as a *DatetimeIndex*.

In [None]:
# generate 7 timestamps, incremented by 1 day
dates = pd.date_range('1st May 2018', periods=7, freq='D')
dates

In [None]:
# generate 10 timestamps, incremented by 1 week ('W' anchors each one to a Sunday)
dates = pd.date_range('1st May 2018', periods=10, freq='W')
dates
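Note that the weekly frequency is anchored to a day of the week, so the first timestamp is not necessarily the start date. The sketch below uses the same start date (a Tuesday) and compares the default `'W'` frequency with `'W-TUE'`, which anchors to Tuesdays instead:

```python
import pandas as pd

start = pd.Timestamp(2018, 5, 1)  # 1st May 2018 fell on a Tuesday

# 'W' is anchored to Sundays, so the range begins on the first Sunday on or after start
sundays = pd.date_range(start, periods=3, freq='W')
# 'W-TUE' is anchored to Tuesdays, so the range begins on the start date itself
tuesdays = pd.date_range(start, periods=3, freq='W-TUE')

print(sundays)
print(tuesdays)
```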

### Time Series in Pandas

The most basic kind of time series data in Pandas is a Series indexed by timestamps, which are often represented as Python strings or datetime values. 

As a simple example, below we create a time series with 12 dates and 12 corresponding random values:

In [None]:
import numpy as np
dates = pd.date_range('01 Jan 2017', periods=12, freq='D')
values = np.random.random(12)
ts = pd.Series(values,index=dates)
ts

In [None]:
p = ts.plot(figsize=(13,5),fontsize=14)

A Pandas time series can be indexed and sliced in the same way as a normal Series:

In [None]:
ts["2017-01-02"]

In [None]:
print(ts["2017-01-05":"2017-01-08"])

For longer time series, we can easily select slices of data for a specific month or year: 

In [None]:
# create random series with 500 points
lts = pd.Series(np.random.random(500), index=pd.date_range('1/1/2015', periods=500))
print(lts.head())
print(lts.tail())

In [None]:
lts["2016"].head()

In [None]:
lts["2016-03"].head()

In [None]:
p = lts["2016-03"].plot(figsize=(13,5),fontsize=14)

Time series data is ordered chronologically, so we can slice with timestamps not contained in a time series:

In [None]:
p = lts["2014-01":"2015-03"].plot(figsize=(13,5),fontsize=14)
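To see what such a slice actually returns, the sketch below rebuilds a series of the same shape and inspects the result. The slice starts in 2014, before any data exists, so only the overlap with the series (January to March 2015) comes back:

```python
import numpy as np
import pandas as pd

# 500 daily points starting on 1st January 2015, as above
lts = pd.Series(np.random.random(500), index=pd.date_range('1/1/2015', periods=500))

# the slice bounds need not appear in the index; only the overlap is returned
part = lts["2014-01":"2015-03"]
print(part.index[0], part.index[-1], len(part))
```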

*Resampling* is the process of converting time series data from one frequency to another. This is done via the resample() method of a Series or Data Frame.

We can downsample - aggregate higher frequency data to a lower frequency:

In [None]:
ts = pd.Series(np.random.randn(100), index=pd.date_range('1/1/2015', periods=100))
ts.head()

In [None]:
# Convert from day frequency to month (M) frequency, by averaging values
# (pandas 2.2 and later name this month-end alias "ME")
ts_monthly = ts.resample("M").mean()
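Downsampling is not limited to the mean. As a minimal sketch, the example below groups a small series with known values into 7-day bins and computes several aggregates at once. The `'7D'` bin width and the toy values are assumptions chosen so that the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# 14 daily values: 0, 1, ..., 13
daily = pd.Series(np.arange(14.0), index=pd.date_range('1/1/2015', periods=14))

# group into 7-day bins and compute several aggregates per bin
weekly = daily.resample('7D').agg(['mean', 'min', 'max'])
print(weekly)
```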

We can also upsample a time series, converting lower frequency to higher frequency data.

In [None]:
ts = pd.Series(np.random.randn(5), index=pd.date_range('1/1/2015', periods=5))
ts

For example, we could upsample by hour (H). Note that the rows that are added in between have missing values (NaNs).

In [None]:
ts_hourly = ts.resample('H').mean()
ts_hourly.head()
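The missing values introduced by upsampling can be filled in several ways. The sketch below shows two common options, forward filling and linear interpolation, on a toy series. It uses the `'60min'` frequency string rather than `'H'` only because the spelling of the hourly alias differs between pandas versions.

```python
import numpy as np
import pandas as pd

# 5 daily values: 0, 24, 48, 72, 96
daily = pd.Series(np.arange(5) * 24.0, index=pd.date_range('1/1/2015', periods=5))
hourly = daily.resample('60min')

gaps = hourly.asfreq()                  # new rows are NaN
filled = hourly.ffill()                 # repeat the last known value
interp = hourly.asfreq().interpolate()  # straight line between known values

print(gaps.head(3))
print(filled.head(3))
print(interp.head(3))
```

Forward filling suits values that stay constant until the next observation, while interpolation suits quantities that change gradually.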

### Analysing Temporal Datasets

To demonstrate the analysis of temporal data in Python, we will use an agricultural meat dataset (originally from the **ggplot** Python package). This dataset contains figures for livestock, dairy, and poultry production in the US over several decades.

We will load this data from a CSV file into a Pandas Data Frame:

In [None]:
# Note that we specify parse_dates to try to parse the index field (called "date") as a date.
df = pd.read_csv("agri-meat.csv",index_col="date",parse_dates=True)
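To check that the dates were parsed, we can inspect the type of the index. The sketch below does this on a tiny in-memory CSV so that it runs without the data file. The `pork` column and all of the values are made up purely for illustration.

```python
import io
import pandas as pd

# a made-up three-row CSV with the same layout: a "date" column plus value columns
csv_text = """date,beef,pork
1944-01-01,10.0,20.0
1944-02-01,11.0,21.0
1944-03-01,12.0,22.0
"""

sample = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=True)

# with parse_dates=True the index is a DatetimeIndex rather than plain strings
print(type(sample.index))
print(sample.index[0])
```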

We see that this dataset has one entry per month, from 1944 to 2012:

In [None]:
df.head()

In [None]:
df.tail()

We can produce a simple time series plot for the full dataset:

In [None]:
p = df.plot(figsize=(10, 5), fontsize=14)

We can also produce a plot for a specific time period:

In [None]:
p = df["1980":"2000"].plot(figsize=(10, 5), fontsize=14)

We can also look at a shorter time period, such as the months in a single year:

In [None]:
p = df["1980-1":"1980-12"].plot(figsize=(10, 5), fontsize=14)

Pandas has functionality for aggregating date and time based data. For example, we can group the data by year:

In [None]:
# aggregate the sum of values for each year
df_year = df.groupby(df.index.year).sum()
df_year.head()

In [None]:
p = df_year.plot(figsize=(10, 5), fontsize=14)

If we want to group the data by decade, we need to define a custom grouping function which will take the year of a date and "floor" it, i.e. round it down to the nearest 10. For example, 1957 becomes 1950. Pandas calls this function on each value of the index to decide which group a row belongs to.

In [None]:
def to_decade(date_value):
    return (date_value.year // 10) * 10

In [None]:
df_decade = df.groupby(to_decade).sum()

In [None]:
df_decade

Let's plot a comparison of production across the decades:

In [None]:
p = df_decade.plot(kind='bar',figsize=(12, 5), fontsize=14)
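The same grouping can also be written without a custom function, by computing the decade directly from the years of the index. The sketch below compares the two approaches on a toy Data Frame with made-up values:

```python
import pandas as pd

# toy data: two rows in the 1950s and one in the 1960s (values are made up)
idx = pd.to_datetime(['1957-03-01', '1958-07-01', '1961-01-01'])
toy = pd.DataFrame({'beef': [1.0, 2.0, 4.0]}, index=idx)

def to_decade(date_value):
    return (date_value.year // 10) * 10

# group using the function, which is called on each index value
by_func = toy.groupby(to_decade).sum()
# group using an array of decades computed from the whole index at once
by_array = toy.groupby((toy.index.year // 10) * 10).sum()

print(by_func)
print(by_array)
```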

#### Moving Averages

One way to extract a trend from a time series is to use a moving average. This divides the series into overlapping regions, called windows, and computes the average of the values in each window.

A *rolling mean* is a simple approach which computes the mean of the values in each window. The size of the window is the number of values it will include. In Pandas, a Series provides a rolling() method, which takes a window size; calling mean() on the result returns a new Series.

In [None]:
# calculate and plot a rolling mean of beef production over a window of 10 values (10 months for monthly data)
rm = df["beef"].rolling(10).mean()
p = rm.plot(figsize=(12,5),fontsize=14)

Increasing the window size produces a smoother plot, with less noise. But be careful not to "over-smooth" the data:

In [None]:
# calculate and plot a rolling mean of beef production over a window of 25 values (25 months for monthly data)
rm = df["beef"].rolling(25).mean()
p = rm.plot(figsize=(12,5),fontsize=14)
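The window size counts values, not calendar time, and the first few results are missing until a full window is available. The sketch below illustrates both points on a toy monthly series with made-up values. The `min_periods` argument allows partial windows at the start.

```python
import numpy as np
import pandas as pd

# 24 monthly values: 0, 1, ..., 23
monthly = pd.Series(np.arange(24.0),
                    index=pd.date_range('1/1/2000', periods=24, freq='MS'))

# a window of 12 values spans 12 months here; the first 11 results are NaN
rm = monthly.rolling(12).mean()
# min_periods=1 computes the mean over however many values are available so far
rm_partial = monthly.rolling(12, min_periods=1).mean()

print(rm.head(13))
print(rm_partial.head(3))
```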