# Pandas and time series

[`pandas`](http://pandas.pydata.org/) is a Python library for doing statistics and working with time series. Like `numpy`, `pandas` is not part of the standard library but comes bundled with [Anaconda](01_anaconda.ipynb). `pandas` is conventionally imported as

    import pandas as pd
    
The main data structure in `pandas` is the __DataFrame__, which is a collection of __Series__. A __Series__ is similar to a one-dimensional `numpy` array, but has added metadata (an index and a name) and extra functionality. A __DataFrame__ resembles the way data are stored in SQL databases or spreadsheets. If you have seen data frames in `R`, they are quite similar.

In [1]:
import pandas as pd
pd.__version__

'0.20.3'
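
To make the relationship between a DataFrame and its Series concrete, here is a minimal sketch. The variable names (`year`, `price`, `cars`) are chosen for illustration only, and the values are a small hand-typed sample.

```python
import pandas as pd

# Two Series with the same default integer index (illustrative values)
year = pd.Series([1997, 1999, 1996], name='Year')
price = pd.Series([3000.0, 4900.0, 4799.0], name='Price')

# A DataFrame built from the two Series; each Series becomes one column
cars = pd.DataFrame({'Year': year, 'Price': price})

# Selecting a single column gives back a Series
print(type(cars['Year']))

# Each column keeps its own dtype, much like a typed column in a database table
print(cars.dtypes)
```

Each column has its own data type, while all columns share the same row index.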

## Reading data with `pandas`

The `pandas` library comes with several functions for reading data in different formats. Try typing

    pd.read
    
and then hitting `<tab>` to see a list of `read` functions in `pandas`. Here we will use the `pd.read_csv` function for our examples. As with the `numpy` functions, all the file handling is done by `pandas`, so we only need to pass it a filename. The following CSV file is easily handled by the `pandas` CSV reader although it contains missing data, quotes inside quoted fields and a newline in the middle of a description field.

In [2]:
!cat data/pandas_simple.csv

Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00


In [3]:
df = pd.read_csv('data/pandas_simple.csv')
df

|   | Year | Make | Model | Description | Price |
|---|------|------|-------|-------------|-------|
| 0 | 1997 | Ford | E350 | ac, abs, moon | 3000.0 |
| 1 | 1999 | Chevy | Venture "Extended Edition" | NaN | 4900.0 |
| 2 | 1999 | Chevy | Venture "Extended Edition, Very Large" | NaN | 5000.0 |
| 3 | 1996 | Jeep | Grand Cherokee | MUST SELL!\nair, moon roof, loaded | 4799.0 |
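
The same parsing behaviour can be tried without the data file. The sketch below assumes the file content shown above and feeds it to `pd.read_csv` through an in-memory `io.StringIO` buffer; the variable name `csv_text` is only illustrative.

```python
import io
import pandas as pd

# Inline copy of the sample file, so the example does not depend on data/pandas_simple.csv
csv_text = '''Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
'''

# read_csv accepts a file-like object as well as a filename
df = pd.read_csv(io.StringIO(csv_text))

# Doubled quotes inside a quoted field are read as a single literal quote
print(df.loc[1, 'Model'])

# The newline inside the quoted description is kept as part of the value
print(repr(df.loc[3, 'Description']))

# Missing descriptions are flagged as missing values
print(df['Description'].isna().tolist())
```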


Individual columns of the data frame (i.e. Series) can be accessed by name, using either dot- or square bracket-notation.

In [4]:
df.Year

0    1997
1    1999
2    1999
3    1996
Name: Year, dtype: int64

In [5]:
df['Price']

0    3000.0
1    4900.0
2    5000.0
3    4799.0
Name: Price, dtype: float64
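
The two notations are not always interchangeable. The sketch below uses a small made-up frame whose column names (`list price`, `count`) are hypothetical and were picked to show where dot notation stops working.

```python
import pandas as pd

# Hypothetical frame: one ordinary column name and two awkward ones
df = pd.DataFrame({
    'Year': [1997, 1999],
    'list price': [3000.0, 4900.0],  # contains a space
    'count': [1, 2],                 # same name as the DataFrame.count method
})

# For a simple column name, both notations return the same Series
same = df.Year.equals(df['Year'])
print(same)

# A name with a space is not a valid Python attribute, so use brackets
print(df['list price'])

# df.count is the DataFrame method, not the column; brackets reach the column
print(callable(df.count))
print(df['count'])
```

Square brackets work for any column name, which makes them the safer choice in scripts.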

A Series supports basic aggregate operations, such as `min` and `median`, directly as methods.

In [6]:
df.Year.min()

1996

In [7]:
df.Price.median()

4849.5
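
As a small self-contained sketch, the prices from the table above can be typed in by hand to try a few more operations. The variable names `price` and `expensive` are illustrative.

```python
import pandas as pd

# The Price column from the example above, entered by hand
price = pd.Series([3000.0, 4900.0, 5000.0, 4799.0], name='Price')

# Aggregates reduce the Series to a single number
print(price.min(), price.max())
print(price.median())
print(price.mean())

# describe() collects several summary statistics at once
print(price.describe())

# A comparison gives a boolean Series that can be used to filter rows
expensive = price[price > 4800]
print(expensive)
```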

## Time Series

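`pandas` has good support for working with time series. Passing `index_col=0` and `parse_dates=True` to `pd.read_csv` makes the first column the index and parses it as dates, so that each row is labelled by a timestamp.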

In [8]:
co2 = pd.read_csv('data/co2-ppm-mauna-loa-19651980.csv',
                  index_col=0, parse_dates=True)
co2.head()

| Month | CO2 (ppm) mauna loa, 1965-1980 |
|-------|--------------------------------|
| 1965-01-01 | 319.32 |
| 1965-02-01 | 320.36 |
| 1965-03-01 | 320.82 |
| 1965-04-01 | 322.06 |
| 1965-05-01 | 322.17 |
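
Once the index holds dates, rows can be selected with date strings. The sketch below assumes the five rows shown above, typed inline with a shortened, hypothetical column name `CO2` so that it runs without the data file.

```python
import io
import pandas as pd

# First five rows of the series; the column name is shortened for illustration
csv_text = '''Month,CO2
1965-01-01,319.32
1965-02-01,320.36
1965-03-01,320.82
1965-04-01,322.06
1965-05-01,322.17
'''

co2 = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# The index is a DatetimeIndex rather than plain strings
print(type(co2.index))

# Partial date strings select all rows in that period
print(co2.loc['1965-02'])

# Date slices include both end points
print(co2.loc['1965-02-01':'1965-04-01'])

# Date components are available directly on the index
print(co2.index.month.tolist())
```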


In [9]:
co2['CO2 (ppm) mauna loa, 1965-1980'].mean()

328.4639583333334

A time series can be converted to a different frequency with `asfreq`. Below, the monthly data are converted to weekly frequency, and `method='pad'` fills each new row with the most recent earlier observation.

In [10]:
weekly_co2 = co2.asfreq('1W', method='pad')  # '1W' is a weekly frequency
weekly_co2.head()

| Month | CO2 (ppm) mauna loa, 1965-1980 |
|-------|--------------------------------|
| 1965-01-03 | 319.32 |
| 1965-01-10 | 319.32 |
| 1965-01-17 | 319.32 |
| 1965-01-24 | 319.32 |
| 1965-01-31 | 319.32 |
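
The effect of `method='pad'` is easiest to see on a tiny series. The sketch below assumes three monthly values copied from the data above; the names `monthly`, `unfilled` and `weekly` are illustrative.

```python
import pandas as pd

# Three monthly observations, indexed by the first day of each month
monthly = pd.Series(
    [319.32, 320.36, 320.82],
    index=pd.to_datetime(['1965-01-01', '1965-02-01', '1965-03-01']),
)

# 'W' produces weekly timestamps that fall on Sundays. None of the
# original dates is a Sunday, so without a fill method every value is missing.
unfilled = monthly.asfreq('W')
print(unfilled.isna().all())

# With method='pad', each weekly timestamp takes the last earlier observation
weekly = monthly.asfreq('W', method='pad')
print(weekly)
```

Padding repeats known values and does not interpolate, so the weekly series only changes value when a new monthly observation becomes available.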


See the [`pandas` documentation](http://pandas.pydata.org/pandas-docs/stable/timeseries.html) for more information on time series.