# Pandas and time series

[`pandas`](http://pandas.pydata.org/) is a Python library for doing statistics and working with time series. Like `numpy`, `pandas` is not part of the standard library, but it comes bundled with [Anaconda](01_anaconda.ipynb). `pandas` is conventionally imported as

    import pandas as pd
    
The main data structure in `pandas` is the __DataFrame__, which is a collection of __Series__. A __Series__ is similar to a one-dimensional `numpy` __array__, but has some added metadata and functionality. A __DataFrame__ resembles the way data are stored in SQL databases or spreadsheets. If you have seen data frames in `R`, they are quite similar.
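
As a minimal sketch of how the two structures relate, the toy example below builds a DataFrame from two Series and then pulls one column back out. The values and the name `toy` are made up for illustration and are not taken from the data files used later.

```python
import pandas as pd

# Hypothetical toy data, not taken from the data files used in this notebook
year = pd.Series([2010, 2011, 2012], name='Year')
price = pd.Series([9.5, 10.0, 11.25], name='Price')

# A DataFrame can be assembled from Series that share an index
toy = pd.concat([year, price], axis=1)

# Selecting one column gives back a Series
col = toy['Price']
print(type(toy).__name__, type(col).__name__)
print(toy.dtypes)
```

Each column keeps its own data type, which is one of the ways a DataFrame differs from a two-dimensional `numpy` array.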

In [None]:
import pandas as pd
pd.__version__

## Reading data with `pandas`

The `pandas` library comes with several functions for reading data in different formats. Try typing

    pd.read
    
and then hitting `<tab>` to see a list of `read`-functions in `pandas`. Here we will use the `pd.read_csv`-function for our examples. As with the `numpy`-functions, all the file handling is done by `pandas`, so we only need to pass it a filename. The following CSV-file is easily handled by the `pandas`-CSV-reader although it contains missing data, funky quotes and a newline in the middle of the description field.

In [None]:
!cat data/pandas_simple.csv

In [None]:
df = pd.read_csv('data/pandas_simple.csv')
df
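
The same behaviour can be tried without a file on disk. The sketch below feeds `pd.read_csv` a small, made-up CSV text through `io.StringIO`; the column names and values are assumptions for illustration only. It contains an empty field and a quoted field that spans two lines.

```python
import io
import pandas as pd

# Hypothetical CSV text: an empty Price field and a quoted description
# that spans two lines
text = (
    'Year,Price,Description\n'
    '2010,9.5,"plain"\n'
    '2011,,"first line\nsecond line"\n'
)

toy = pd.read_csv(io.StringIO(text))
print(toy)
print(toy['Price'].isna())   # the empty field is read as a missing value
```

The empty field becomes `NaN`, and the newline inside the quotes stays part of the description instead of starting a new row.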

Individual columns of the data frame (i.e. Series) can be accessed by name, using either dot- or square bracket-notation.

In [None]:
df.Year

In [None]:
df['Price']
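
The two notations are not always interchangeable. As a small sketch with a made-up frame, dot notation cannot be used when a column name contains a space, and it returns the method rather than the column when the name clashes with a DataFrame method such as `count`:

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, another is named
# like the DataFrame method `count`
toy = pd.DataFrame({'Live births': [10, 12], 'count': [1, 2]})

births = toy['Live births']   # bracket notation works for any column name
counts = toy['count']         # the column
method = toy.count            # the method, not the column
print(callable(method))
```

Square brackets are therefore the safer choice in scripts, while dot notation is convenient for interactive work.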

A Series supports many common operations directly, such as `min` and `median`.

In [None]:
df.Year.min()

In [None]:
df.Price.median()

## Time Series

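`pandas` has good support for working with time series. Below, the first column of the CSV file is parsed as dates and used as the index, which makes date-based operations such as changing the sampling frequency available.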

In [None]:
co2 = pd.read_csv('data/co2-ppm-mauna-loa-19651980.csv',
                  index_col=0, parse_dates=True)
co2.head()

In [None]:
co2['CO2 (ppm) mauna loa, 1965-1980'].mean()

In [None]:
# Convert to weekly frequency; 'pad' carries the last known value forward
weekly_co2 = co2.asfreq('1W', method='pad')
weekly_co2.head()
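
To see what `asfreq` with `method='pad'` does in isolation, here is a small sketch on a made-up monthly series (the values are hypothetical, not CO2 measurements). It assumes the monthly values are stamped on the first day of each month.

```python
import pandas as pd

# Hypothetical monthly values, indexed by the first day of each month
idx = pd.date_range('2000-01-01', periods=3, freq='MS')
monthly = pd.Series([1.0, 2.0, 3.0], index=idx)

# Weekly timestamps (Sundays) between the first and last index value;
# method='pad' carries the most recent monthly value forward
weekly = monthly.asfreq('W', method='pad')
print(weekly)
```

Note that `asfreq` only picks or fills values at the new timestamps; it does not aggregate, which is what `resample` is for.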

See the [`pandas` documentation](http://pandas.pydata.org/pandas-docs/stable/timeseries.html) for more information on time series.

## Dealing with bad structure

Data often comes in a structure that is different from how we want to work with it. Let us look at an example of data in a very plain format, and how we can make it more usable for analysis.

The following is another data set from [Statistics Norway](http://www.ssb.no/). This is the [Population 1 January and population changes during the calendar year. Whole country, 1951 - latest year](http://data.ssb.no/api/v0/dataset/49626.csv?lang=en) dataset available as one of the [ready-made datasets](http://data.ssb.no/api/v0/dataset/?lang=en). Let's have a look at it:

In [None]:
!head data/pop_norway.csv

We can of course read the data using `pandas`.

In [None]:
pop = pd.read_csv('data/pop_norway.csv')
pop.head(8)

However, working with data in this form will be a pain. The year is not a proper date, and each time we look at a population value, we also need to check the `contents`-column in order to figure out how to interpret the value. Let us instead turn the `contents`-column into column headers.

We start by reading the data again, creating a _multi-index_:

In [None]:
pop = pd.read_csv('data/pop_norway.csv', parse_dates=[1], index_col=[0, 1, 2])
pop.head(8)

The functions `stack` and `unstack` can be used to _pivot_ data, i.e. to move data from indices to columns or vice versa.

In [None]:
pop = pop.unstack()
pop.head()
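
The effect of `unstack` is easier to follow on a tiny table. The sketch below uses made-up numbers and hypothetical column names (`year`, `contents`, `value`) that only mimic the long layout of the population file.

```python
import pandas as pd

# Hypothetical long-format data: one row per (year, contents) pair
long = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001],
    'contents': ['Births', 'Deaths', 'Births', 'Deaths'],
    'value': [10, 7, 12, 8],
}).set_index(['year', 'contents'])

# Move the innermost index level ('contents') into the columns
wide = long.unstack()
print(wide)
```

The result has one row per year and a two-level column index, with the original column name `value` on top and the `contents` labels below it.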

The table now looks much nicer and more convenient to work with. However, we are left with some unnecessary levels of information, which also makes the data harder to get to. These can be removed with the `droplevel`-function.

In [None]:
pop.index = pop.index.droplevel(level=0)
pop.head()

In [None]:
pop.columns = pop.columns.droplevel(level=0)
pop.head()
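
As a minimal sketch of what `droplevel` does, the made-up frame below has a two-level column index whose outer level carries no information. The labels are hypothetical.

```python
import pandas as pd

# Hypothetical frame with a two-level column index
cols = pd.MultiIndex.from_product([['value'], ['Births', 'Deaths']])
toy = pd.DataFrame([[10, 7], [12, 8]], index=[2000, 2001], columns=cols)

# Drop the outer level, which is the same for every column
toy.columns = toy.columns.droplevel(level=0)
print(toy['Deaths'])
```

After the outer level is dropped, columns can be selected with a single label again.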

Here are some simple examples of analysis we can now easily do:

_In which years is __In-migration__ higher than __Live births__?_

In [None]:
pop[pop['In-migration'] > pop['Live births']]
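
This kind of selection works in two steps: the comparison produces a boolean Series, and indexing with that Series keeps only the rows where it is `True`. A small sketch with made-up numbers (not the Statistics Norway data):

```python
import pandas as pd

# Hypothetical numbers, not the Statistics Norway data
toy = pd.DataFrame({'In-migration': [5, 9, 12], 'Live births': [8, 9, 10]},
                   index=[2000, 2001, 2002])

mask = toy['In-migration'] > toy['Live births']   # boolean Series
print(mask)
print(toy[mask])   # only the rows where the mask is True
```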

_How has the number of deaths per 1000 inhabitants per year evolved?_

In [None]:
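# Keep the result in its own variable: assigning to pop.death_ratio would only
# set an attribute on the DataFrame, not create a column
death_ratio = pop.Deaths / (pop.Population / 1000)
death_ratio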
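
A note on storing such a derived quantity in the table itself: new columns are created with square-bracket assignment, while dot assignment of a new name only sets a Python attribute. The sketch below illustrates the difference on a made-up frame; the numbers and the names `toy` and `other` are hypothetical.

```python
import warnings
import pandas as pd

# Hypothetical numbers chosen to give round results
toy = pd.DataFrame({'Deaths': [40, 72], 'Population': [4000, 8000]})

# Bracket assignment creates a real column
toy['death_ratio'] = toy['Deaths'] / (toy['Population'] / 1000)

# Attribute assignment only sets a Python attribute; pandas warns about it,
# and the warning is silenced here to keep the output short
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    toy.other = toy['Deaths'] * 2

print(list(toy.columns))
```

Only `death_ratio` shows up among the columns.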

_Approximately when did Norway's population reach 5 million?_

We will do this analysis partly manually, to show off some of the ways of doing date indexing in `pandas`.

In [None]:
# Approximate daily population numbers by linear interpolation
daily = pop.asfreq('1d').interpolate().round()
daily.head()
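
The interpolation step can be checked on a series small enough to verify by hand. In this sketch the two yearly values are made up so that the daily increase is exactly one.

```python
import pandas as pd

# Hypothetical values on 1 January of two consecutive years
idx = pd.to_datetime(['2001-01-01', '2002-01-01'])
yearly = pd.Series([1000.0, 1365.0], index=idx)

# asfreq('D') inserts the missing days as NaN; interpolate() then fills
# them on a straight line between the known values
daily_toy = yearly.asfreq('D').interpolate()
print(daily_toy.head())
```

Linear interpolation assumes the change is spread evenly over the year, which is only a rough approximation for real population figures.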

From the yearly figures, we can see that the population passed 5 million some time in 2012. In `pandas` we can show all data for 2012 simply by indexing with the string `'2012'`:

In [None]:
daily.Population['2012']
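
Indexing with a partial date string works at several resolutions. The sketch below uses a made-up daily series and the explicit label-based accessor `.loc`, which for a Series with a date index selects the same rows as plain square brackets.

```python
import pandas as pd

# Hypothetical daily series covering two years
idx = pd.date_range('2011-01-01', '2012-12-31', freq='D')
s = pd.Series(range(len(idx)), index=idx)

year = s.loc['2012']       # every day in 2012
month = s.loc['2012-03']   # every day in March 2012
print(len(year), len(month))
```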

This is still a bit unruly. Let us resample again to monthly values:

In [None]:
# 'M' is the month-end alias; pandas 2.2 and later spell it 'ME'
daily.Population['2012'].resample('M').min()
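
Unlike `asfreq`, `resample` groups the values into bins and then aggregates each bin. As a sketch on a made-up series, the example below takes the monthly minimum. It uses the `'MS'` (month start) alias, an assumption made here because the spelling of the month-end alias differs between `pandas` versions; with `'MS'` each month is labelled by its first day instead of its last.

```python
import pandas as pd

# Hypothetical, steadily increasing daily series for January-March 2012
idx = pd.date_range('2012-01-01', '2012-03-31', freq='D')
s = pd.Series(range(len(idx)), index=idx)

# 'MS' groups by calendar month and labels each group by its first day
monthly_min = s.resample('MS').min()
print(monthly_min)
```

For an increasing series the monthly minimum is the value on the first day of the month, so the first month whose *next* minimum exceeds a threshold is the month in which the threshold was crossed.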

Here we show the minimum value for each month (each month is labelled by its last day, not by the date on which the minimum occurred). It thus seems that the population reached 5 million sometime in March 2012:

In [None]:
daily.Population['2012-03']

Based on this very simple method it seems that the population of Norway reached five million on March 20th, 2012.
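
The manual search can also be automated. One possible approach, sketched here on a made-up series, is to compare against the threshold and ask for the first label where the comparison is `True`. The numbers below are hypothetical.

```python
import pandas as pd

# Hypothetical daily series growing by 100 per day
idx = pd.date_range('2012-03-01', periods=10, freq='D')
s = pd.Series([4_999_500 + 100 * i for i in range(10)], index=idx)

# idxmax() on a boolean Series returns the label of the first True
first_day = (s >= 5_000_000).idxmax()
print(first_day)
```

Be aware that if no value reaches the threshold, `idxmax` returns the first label of the Series, so it is worth checking that the comparison is `True` somewhere before trusting the result.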

Statistics Norway did of course look into this milestone in more detail. Back in February 2012, they _predicted_ that 5 million people would be reached on March 19th, 2012. Their best estimate for when the number was actually passed is March 17th, 2012. See [Slik beregnet vi når Norge ville passere 5 millioner](http://www.ssb.no/befolkning/artikler-og-publikasjoner/slik-beregnet-vi-naar-norge-ville-passere-5-millioner) if you are interested in more details (in Norwegian).