# Time series analysis (Pandas)

Nikolay Koldunov

koldunovn@gmail.com

This is part of [**Python for Geosciences**](https://github.com/koldunovn/python_for_geosciences) notes.

================

Here I am going to show just some basic [pandas](http://pandas.pydata.org/) stuff for time series analysis, as I think it is the most interesting topic for Earth scientists. If you find this small tutorial useful, I encourage you to watch [this video](http://pyvideo.org/video/1198/time-series-data-analysis-with-pandas), where Wes McKinney gives an extensive introduction to time series data analysis with pandas.

On the official website you can find an explanation of what problems pandas solves in general, but I can tell you what problem pandas solves for me. It makes analysis and visualisation of 1D data, especially time series, MUCH faster. Before pandas, working with time series in Python was a pain for me; now it's fun. Ease of use stimulates in-depth exploration of the data: why wouldn't you do some additional analysis if it's just one line of code? I hope you will also find this great tool helpful and useful. So, let's begin.

As an example we are going to use time series of [Arctic Oscillation (AO)](http://en.wikipedia.org/wiki/Arctic_oscillation) and [North Atlantic Oscillation (NAO)](http://en.wikipedia.org/wiki/North_Atlantic_oscillation) data sets.

## Module import

First we have to import necessary modules:

In [None]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 15)  # limit the maximum number of rows displayed

And "switch on" inline graphics for the notebook:

In [None]:
%matplotlib inline

Pandas is developing very fast, and while we are going to use only basic functionality, some details may still change in newer versions.

In [None]:
pd.__version__

## Loading data

Now that we are done with the preparations, let's get some data. If you work on Windows, download the monthly AO data [from here](http://www.cpc.ncep.noaa.gov/products/precip/CWlink/daily_ao_index/monthly.ao.index.b50.current.ascii). If you are on a *nix machine, you can do it directly from the IPython notebook using a system call to the wget command:

In [None]:
!wget http://www.cpc.ncep.noaa.gov/products/precip/CWlink/daily_ao_index/monthly.ao.index.b50.current.ascii

Pandas has very good IO capabilities, but we are not going to use them in this tutorial in order to keep things simple. For now we open the file simply with numpy loadtxt:

In [None]:
ao = np.loadtxt('monthly.ao.index.b50.current.ascii')
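
For comparison, the same three-column layout could also be read with pandas itself. This is only a rough sketch: it parses a few made-up rows from a string instead of the downloaded file, and the column names `year`, `month` and `value` are my own labels, not something stored in the file.

```python
import io
import pandas as pd

# Made-up rows in the same "year month value" layout (not real AO values).
sample = """1950 1 -0.1
1950 2 0.6
1950 3 0.0
"""

# sep=r"\s+" splits on any run of whitespace; the data has no header row.
ao_df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None,
                    names=["year", "month", "value"])
print(ao_df)
```

With the real file you would pass its name instead of the `io.StringIO` object.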

Every line in the file consists of three elements: year, month, value:

In [None]:
ao[0:2]

And here is the shape of our array (note that the shape might differ in your case, since the data is updated monthly):

In [None]:
ao.shape

## Time Series

We would like to convert this data into a time series that can be manipulated naturally and easily. The first step is to create the range of dates for our time series. From the file it is clear that the record starts in January 1950 and ends in September 2013 (at the time I am writing this, of course). **You have to adjust the last date according to the values in your file!** The frequency of the data is one month (freq='M').

In [None]:
dates = pd.date_range('1950-01', '2014-01', freq='M')  # month-end dates; pandas 2.2 and later prefer the alias 'ME'

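As you see, the syntax is quite simple, and this is one of the reasons why I love pandas so much :) Another thing to mention is that we give January 2014 as the end date although the last month we want is December 2013: with a monthly frequency every generated date is the last day of a month, and '2014-01' is read as 1 January 2014, so the last date in the range is 2013-12-31. You can check that the range of dates is properly generated: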

In [None]:
dates

In [None]:
dates.shape
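
To see where such a range really starts and stops, here is a small sketch that can be run on its own. It uses the offset object `pd.offsets.MonthEnd()` instead of the `'M'` string (my choice, so that the example does not depend on how a given pandas version spells the alias), and `demo_dates` is just an illustrative name.

```python
import pandas as pd

# An offset object avoids depending on the spelling of the frequency alias.
month_end = pd.offsets.MonthEnd()
demo_dates = pd.date_range('1950-01', '2014-01', freq=month_end)

print(demo_dates[0])    # first month end in the range
print(demo_dates[-1])   # last month end on or before 1 January 2014
print(len(demo_dates))  # number of monthly time steps
```

The first date should be the end of January 1950 and the last one the end of December 2013, 768 months in total.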

Now we are ready to create our first time series. Dates from the *dates* variable will be our index, and AO values will be our, hm... values. We are going to use data only until the end of 2013, which is the first 768 rows (64 years of monthly values):

In [None]:
AO = pd.Series(ao[:768,2], index=dates)

In [None]:
AO
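
Since the end date has to follow whatever is in your file, one alternative is to build the index from the year and month columns themselves. The following is a sketch under my own assumptions: `ao_demo` holds a few made-up rows in the file's layout (not real AO values), and the dates are moved to the month end only to match the convention used above.

```python
import numpy as np
import pandas as pd

# Made-up rows in the file's "year, month, value" layout (not real AO values).
ao_demo = np.array([
    [1950, 1, -0.1],
    [1950, 2, 0.6],
    [1950, 3, 0.0],
])

# pd.to_datetime can assemble dates from year/month/day columns.
parts = pd.DataFrame({
    'year': ao_demo[:, 0].astype(int),
    'month': ao_demo[:, 1].astype(int),
    'day': 1,
})
month_start = pd.to_datetime(parts)

# MonthEnd(0) rolls each date forward to the last day of its month.
index = pd.DatetimeIndex(month_start + pd.offsets.MonthEnd(0))
series = pd.Series(ao_demo[:, 2], index=index)
print(series)
```

This way the series always has exactly as many dates as the file has rows, with no hard-coded length.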

Now we can plot the complete time series:

In [None]:
AO.plot()

or a part of it:

In [None]:
AO['1980':'1990'].plot()

or an even smaller part:

In [None]:
AO['1980-05':'1981-03'].plot()

Reference to time periods is done in a very natural way. You can, of course, also get individual values. By position:

In [None]:
AO.iloc[120]

or by index (date in our case):

In [None]:
AO['1960-01']

And what if we choose only one year?

In [None]:
AO['1960']

Isn't that great? :)

One bonus example :)

In [None]:
AO[AO > 0]
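
The different ways of selecting values can be tried out without the downloaded data. Below is a minimal sketch on a synthetic series (the values are just the integers from -12 to 11, and names like `s` are only for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: 24 month-end dates starting in 1960, values -12..11.
idx = pd.date_range('1960-01-31', periods=24, freq=pd.offsets.MonthEnd())
s = pd.Series(np.arange(24) - 12, index=idx)

first_value = s.iloc[0]    # by position
one_year = s.loc['1960']   # by partial date string: all months of 1960
positive = s[s > 0]        # by boolean mask: only values above zero

print(first_value, len(one_year), len(positive))
```

Using `.iloc` for positions and `.loc` for labels makes it explicit which kind of lookup is meant.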

## Data Frame

Now let's make life a bit more interesting and download more data. This will be the NAO time series (Windows users can get it [here](http://www.cpc.ncep.noaa.gov/products/precip/CWlink/pna/norm.nao.monthly.b5001.current.ascii)).

In [None]:
!wget http://www.cpc.ncep.noaa.gov/products/precip/CWlink/pna/norm.nao.monthly.b5001.current.ascii

Create a Series the same way as we did for AO:

In [None]:
nao = np.loadtxt('norm.nao.monthly.b5001.current.ascii')
dates_nao = pd.date_range('1950-01', '2014-01', freq='M')  # pandas 2.2 and later prefer the alias 'ME'
NAO = pd.Series(nao[:768,2], index=dates_nao)

Time period is the same:

In [None]:
NAO.index

Now we create a Data Frame that will contain both AO and NAO data. It is sort of an Excel table where the first row contains headers for the columns and the first column is an index:

In [None]:
aonao = pd.DataFrame({'AO' : AO, 'NAO' : NAO})

One can plot the data straight away:

In [None]:
aonao.plot()

Or have a look at the first several rows:

In [None]:
aonao.head()

We can reference each column by its name:

In [None]:
aonao['NAO']

or as an attribute of the Data Frame variable (if the name of the column is a valid Python name):

In [None]:
aonao.NAO

We can simply add a column to the Data Frame:

In [None]:
aonao['Diff'] = aonao['AO'] - aonao['NAO']
aonao.head()

And delete it:

In [None]:
del aonao['Diff']
aonao.tail()
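
If you would rather not modify the Data Frame in place, `drop` returns a copy without the column. A tiny sketch with made-up numbers (not real index values):

```python
import pandas as pd

# Made-up values, only to have two columns to work with.
df = pd.DataFrame({'AO': [0.5, -1.0, 0.2], 'NAO': [0.1, -0.4, 0.9]})

df['Diff'] = df['AO'] - df['NAO']        # add a derived column
without_diff = df.drop(columns='Diff')   # new frame; df itself keeps 'Diff'

print(df.columns.tolist())
print(without_diff.columns.tolist())
```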

Slicing will also work:

In [None]:
aonao['1981-01':'1981-03']

even in some crazy combinations:

In [None]:
import datetime
aonao.loc[(aonao.AO > 0) & (aonao.NAO < 0)
          & (aonao.index > datetime.datetime(1980,1,1))
          & (aonao.index < datetime.datetime(1989,1,1)),
          'NAO'].plot(kind='barh')

Here we use the label-based [indexing attribute .loc](http://pandas.pydata.org/pandas-docs/stable/indexing.html#advanced-indexing-with-labels) (older pandas versions offered .ix for this, which has since been removed). We choose all NAO values from 1980 through 1988 for months where AO is positive and NAO is negative, and then plot them. Magic :)
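
The same kind of combined selection can be sketched on synthetic data, which makes it easier to check what the mask really keeps. Everything here (the repeating values, the names `frame`, `mask` and `selected`) is made up for illustration:

```python
import numpy as np
import pandas as pd

# 36 month-end dates starting in January 1979, with simple repeating values.
idx = pd.date_range('1979-01-31', periods=36, freq=pd.offsets.MonthEnd())
frame = pd.DataFrame({'AO': np.tile([1.0, -1.0], 18),
                      'NAO': np.tile([-0.5, -0.5, 0.5], 12)}, index=idx)

# Each condition gives one boolean per row; & combines them element-wise.
mask = ((frame['AO'] > 0) & (frame['NAO'] < 0)
        & (frame.index > pd.Timestamp('1980-01-01')))

# Rows are chosen by the mask, the column by its label.
selected = frame.loc[mask, 'NAO']
print(selected)
```

Note the parentheses around each comparison: `&` binds more tightly than `>` and `<`, so they are required.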

## Statistics

Back to simple stuff. We can obtain statistical information over the elements of the Data Frame. The default is column-wise:

In [None]:
aonao.mean()

In [None]:
aonao.max()

In [None]:
aonao.min()

You can also do it row-wise:

In [None]:
aonao.mean(axis=1)

Or get everything at once:

In [None]:
aonao.describe()

By the way, getting correlation coefficients for members of the Data Frame is as simple as:

In [None]:
aonao.corr()
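
To make the difference between column-wise and row-wise statistics concrete, here is a small sketch on a made-up frame in which column `b` is simply twice column `a`:

```python
import pandas as pd

# Made-up data: b is exactly 2 * a, so the two columns are perfectly correlated.
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [2.0, 4.0, 6.0, 8.0]})

col_means = df.mean()         # one value per column
row_means = df.mean(axis=1)   # one value per row
corr = df.corr()              # pairwise Pearson correlation matrix

print(col_means)
print(row_means)
print(corr)
```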

## Resampling

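Pandas provides an easy way to resample data to a different time frequency. The two main ingredients of resampling are the time period you resample to and the method you use to aggregate the values. In current pandas versions resample only groups the data and the method is applied explicitly (in old versions the mean was applied by default). The following example calculates the annual ('A') mean:

In [None]:
AO_mm = AO.resample("A").mean()  # pandas 2.2 and later prefer 'YE' over 'A'
AO_mm.plot()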

or the median:

In [None]:
AO_mm = AO.resample("A").median()
AO_mm.plot()

You can also choose other methods, for example the maximum (in this case we also change the resampling frequency to 3 years); your own functions can be passed through .apply():

In [None]:
AO_mm = AO.resample("3A").max()
AO_mm.plot()

You can specify several functions at once as a list:

In [None]:
AO_mm = AO.resample("A").agg(['mean', 'min', 'max'])
#AO_mm['1900':'2020'].plot(subplots=True)
AO_mm['1900':'2020'].plot()
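
Here is a self-contained sketch of annual resampling on synthetic data, so that the numbers are easy to verify by hand. The monthly values are just 1 to 24, and I pass the offset object `pd.offsets.YearEnd()` instead of a string so that the example does not depend on how a given pandas version spells the annual alias; both choices are mine, not part of the data above.

```python
import numpy as np
import pandas as pd

# Two years of synthetic monthly data: 1..12 in 2000 and 13..24 in 2001.
idx = pd.date_range('2000-01-31', periods=24, freq=pd.offsets.MonthEnd())
monthly = pd.Series(np.arange(1.0, 25.0), index=idx)

year_end = pd.offsets.YearEnd()

# One aggregated value per calendar year.
annual_mean = monthly.resample(year_end).mean()

# Several statistics at once: the result is a Data Frame with one column each.
annual_stats = monthly.resample(year_end).agg(['mean', 'min', 'max'])

print(annual_mean)
print(annual_stats)
```

The annual means come out as 6.5 and 18.5, which is what you get by averaging 1 to 12 and 13 to 24.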

That's it. I hope you at least got a rough impression of what pandas can do for you. Comments are very welcome (below). If you have interesting examples of pandas usage in Earth science, we would be happy to put them on [EarthPy](http://earthpy.org).

## Links

[Time Series Data Analysis with pandas (Video)](http://www.youtube.com/watch?v=0unf-C-pBYE)

[Data analysis in Python with pandas (Video)](http://www.youtube.com/watch?v=w26x-z-BdWQ)

[Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do)