# Exercise 4 - Pandas data munging

In [1]:
import pandas as pd

### Dealing with dates

Start by reading in `data/gdp.csv`:

In [2]:
gdp = pd.read_csv('data/gdp.csv')

Take a look at its contents:

In [3]:
gdp.head()

,DATE,GDP
0,1947-01-01,243.1
1,1947-04-01,246.3
2,1947-07-01,250.1
3,1947-10-01,260.3
4,1948-01-01,266.2


What type is the `DATE` column?

In [4]:
gdp.DATE.dtype

dtype('O')

Can you parse those into `datetime` format?

_Hint_: `pd.to_...`

In [5]:
pd.to_datetime(gdp.DATE).head()

0   1947-01-01
1   1947-04-01
2   1947-07-01
3   1947-10-01
4   1948-01-01
Name: DATE, dtype: datetime64[ns]
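If some entries might be malformed, `pd.to_datetime` can be told the expected format and asked to coerce failures instead of raising. The sketch below uses a small made-up `raw_dates` Series (the bad entry is an assumption for illustration, not part of `gdp.csv`):

```python
import pandas as pd

# Hypothetical sample: two dates from the output above plus one invalid entry.
raw_dates = pd.Series(['1947-01-01', '1947-04-01', 'not a date'])

# An explicit format avoids guessing; errors='coerce' turns bad values into NaT.
parsed = pd.to_datetime(raw_dates, format='%Y-%m-%d', errors='coerce')

print(parsed)
print(parsed.isna().sum())  # number of entries that failed to parse
```

Counting the `NaT` values afterwards is a quick way to see how many rows did not parse.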

Now look at the arguments for `pd.read_csv()` and figure out how to parse the dates automatically when the file is read.

In [6]:
gdp = pd.read_csv('data/gdp.csv', parse_dates=['DATE'])
print(gdp.DATE.dtype)

datetime64[ns]
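A related option is to make the parsed dates the index while reading. This is only a sketch: `csv_text` is a hypothetical stand-in for `data/gdp.csv` built from the rows shown earlier, and `gdp_ts` is an illustrative name.

```python
import io
import pandas as pd

# Hypothetical stand-in for data/gdp.csv, using the first rows shown above.
csv_text = "DATE,GDP\n1947-01-01,243.1\n1947-04-01,246.3\n1947-07-01,250.1\n"

# parse_dates converts the column while reading; index_col makes it the index.
gdp_ts = pd.read_csv(io.StringIO(csv_text), parse_dates=['DATE'], index_col='DATE')

print(type(gdp_ts.index).__name__)
print(gdp_ts.loc[pd.Timestamp('1947-04-01'), 'GDP'])
```

With a `DatetimeIndex` in place, rows can be looked up directly by timestamp.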


When you have time, browse [the time series docs](http://pandas.pydata.org/pandas-docs/stable/timeseries.html) to see everything pandas can do with datetime data.

### Merging data

Load in `data/gdp.csv` (Gross Domestic Product), `data/cpi.csv` (Consumer Price Index), and `data/rec.csv` (Recessions).

_Note_: all three CSVs share the same format

In [7]:
gdp = pd.read_csv('data/gdp.csv', parse_dates=['DATE'])
cpi = pd.read_csv('data/cpi.csv', parse_dates=['DATE'])
rec = pd.read_csv('data/rec.csv', parse_dates=['DATE'])

Merge GDP and CPI into a single `DataFrame`

In [8]:
data = pd.merge(gdp, cpi)
data.head()

,DATE,GDP,CPIAUCSL
0,1947-01-01,243.1,21.48
1,1947-04-01,246.3,22.0
2,1947-07-01,250.1,22.23
3,1947-10-01,260.3,22.91
4,1948-01-01,266.2,23.68
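`pd.merge(gdp, cpi)` joins on every column name the two frames share, which here is just `DATE`. Spelling the key out makes the intent explicit. A minimal sketch, with `gdp_small` and `cpi_small` as illustrative frames built from the values above:

```python
import pandas as pd

dates = pd.to_datetime(['1947-01-01', '1947-04-01', '1947-07-01'])
gdp_small = pd.DataFrame({'DATE': dates, 'GDP': [243.1, 246.3, 250.1]})
cpi_small = pd.DataFrame({'DATE': dates, 'CPIAUCSL': [21.48, 22.0, 22.23]})

# on= names the join key; validate= raises if DATE is duplicated on either side.
merged = pd.merge(gdp_small, cpi_small, on='DATE', how='inner',
                  validate='one_to_one')
print(merged)
```

`validate='one_to_one'` is optional, but it catches accidental duplicate dates before they multiply rows.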


Now add recessions onto the `DataFrame`:

In [9]:
data = data.merge(rec)
data.head()

,DATE,GDP,CPIAUCSL,USREC
0,1947-01-01,243.1,21.48,0
1,1947-04-01,246.3,22.0,0
2,1947-07-01,250.1,22.23,0
3,1947-10-01,260.3,22.91,0
4,1948-01-01,266.2,23.68,0
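The default inner merge silently drops dates that appear in only one frame. One way to check for that, sketched here with two deliberately mismatched hypothetical frames (`left` and `right` are not the real files), is an outer merge with `indicator=True`:

```python
import pandas as pd

# Hypothetical frames whose dates only partly overlap.
left = pd.DataFrame({'DATE': pd.to_datetime(['1947-01-01', '1947-04-01']),
                     'GDP': [243.1, 246.3]})
right = pd.DataFrame({'DATE': pd.to_datetime(['1947-04-01', '1947-07-01']),
                      'USREC': [0, 0]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
checked = left.merge(right, on='DATE', how='outer', indicator=True)
print(checked['_merge'].value_counts())
```

If every row is tagged `both`, the inner merge lost nothing.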


What's the correlation between GDP and CPI?

In [10]:
data[['GDP','CPIAUCSL']].corr()

,GDP,CPIAUCSL
GDP,1.0,0.983289
CPIAUCSL,0.983289,1.0
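When only the single coefficient is needed, `Series.corr` returns it directly instead of the full matrix. A small sketch on the first five rows shown above (so the value will differ from the full-data result; `sample` is an illustrative name):

```python
import pandas as pd

sample = pd.DataFrame({'GDP': [243.1, 246.3, 250.1, 260.3, 266.2],
                       'CPIAUCSL': [21.48, 22.0, 22.23, 22.91, 23.68]})

# Pearson correlation between two columns, returned as a plain float.
r = sample['GDP'].corr(sample['CPIAUCSL'])

# The same number sits in the off-diagonal cell of the correlation matrix.
matrix = sample.corr()
print(r, matrix.loc['GDP', 'CPIAUCSL'])
```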


In how many periods was a recession recorded?

In [11]:
data['USREC'].sum()

40
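Summing works because `USREC` is a 0/1 flag, so the total equals the number of flagged periods. The sketch below shows two related summaries on a made-up flag series (`usrec` is hypothetical data, not the real column):

```python
import pandas as pd

# Hypothetical 0/1 recession flags.
usrec = pd.Series([0, 0, 1, 1, 0, 1], name='USREC')

n_recession = int(usrec.sum())  # count of periods flagged 1
counts = usrec.value_counts()   # how many 0s and how many 1s
share = usrec.mean()            # fraction of periods in recession

print(n_recession, share)
print(counts)
```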

Get a list of all of the `DATE`s during which there was a recession (`USREC == 1`)

In [12]:
recession_dates = data.loc[data['USREC'] == 1, 'DATE']
recession_dates.head()

8    1949-01-01
9    1949-04-01
10   1949-07-01
11   1949-10-01
27   1953-10-01
Name: DATE, dtype: datetime64[ns]
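The selection above combines a boolean mask with a column label in `.loc`. As a sketch of an equivalent spelling, `DataFrame.query` expresses the same filter as a string. `frame` is a three-row illustrative frame consistent with the dates shown:

```python
import pandas as pd

frame = pd.DataFrame({
    'DATE': pd.to_datetime(['1948-10-01', '1949-01-01', '1949-04-01']),
    'USREC': [0, 1, 1],
})

# Boolean mask: True where the row is flagged as a recession.
mask = frame['USREC'] == 1
via_loc = frame.loc[mask, 'DATE']

# Same filter written as a query string, then selecting the column.
via_query = frame.query('USREC == 1')['DATE']

print(via_loc.equals(via_query))
```

Both forms keep the original row labels, which is why the output above starts at index 8 rather than 0.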

Find the unique years in which there was a recession

_Hint_: Look at the attributes and methods under `recession_dates.dt.` (press Tab to autocomplete)

In [13]:
years = recession_dates.dt.year
print(years.head(10))

8     1949
9     1949
10    1949
11    1949
27    1953
28    1954
29    1954
43    1957
44    1958
45    1958
Name: DATE, dtype: int64


In [14]:
unique_years = years.unique()
print(unique_years)

[1949 1953 1954 1957 1958 1960 1961 1970 1974 1975 1980 1981 1982 1990 1991
 2001 2008 2009]
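`.unique()` returns a NumPy array in order of first appearance. If only the count is needed, `.nunique()` gives it directly. A minimal sketch on four of the recession dates (`sample_dates` is an illustrative name):

```python
import pandas as pd

sample_dates = pd.Series(pd.to_datetime(
    ['1949-01-01', '1949-04-01', '1953-10-01', '1954-01-01']))

sample_years = sample_dates.dt.year
unique_array = sample_years.unique()  # NumPy array, order of first appearance
n_years = sample_years.nunique()      # number of distinct years

print(unique_array, n_years)
```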


### Reshaping data

Start by adding separate `year` and `month` columns to the data:

In [15]:
data['year'] = data['DATE'].dt.year
data['month'] = data['DATE'].dt.month
data.head()

,DATE,GDP,CPIAUCSL,USREC,year,month
0,1947-01-01,243.1,21.48,0,1947,1
1,1947-04-01,246.3,22.0,0,1947,4
2,1947-07-01,250.1,22.23,0,1947,7
3,1947-10-01,260.3,22.91,0,1947,10
4,1948-01-01,266.2,23.68,0,1948,1
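Because the observations are quarterly (months 1, 4, 7 and 10), the `.dt` accessor can also label them by quarter. This is an optional sketch, not a step the exercise requires; `q_dates` is an illustrative name:

```python
import pandas as pd

q_dates = pd.Series(pd.to_datetime(
    ['1947-01-01', '1947-04-01', '1947-07-01', '1947-10-01']))

quarters = q_dates.dt.quarter       # quarter number of each date
periods = q_dates.dt.to_period('Q') # quarterly Period values

print(list(quarters))
print(periods)
```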


Index the data by `year` and `month`

In [16]:
data_indexed = data.set_index(['year','month'])
data_indexed.head()

year,month,DATE,GDP,CPIAUCSL,USREC
1947,1,1947-01-01,243.1,21.48,0
1947,4,1947-04-01,246.3,22.0,0
1947,7,1947-07-01,250.1,22.23,0
1947,10,1947-10-01,260.3,22.91,0
1948,1,1948-01-01,266.2,23.68,0
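Once the frame has a `(year, month)` MultiIndex, rows can be selected by one or both levels. A minimal sketch on a three-row illustrative frame (`indexed_small` is not the full data):

```python
import pandas as pd

indexed_small = pd.DataFrame({
    'year': [1947, 1947, 1948],
    'month': [1, 4, 1],
    'GDP': [243.1, 246.3, 266.2],
}).set_index(['year', 'month'])

year_1947 = indexed_small.loc[1947]            # all rows for 1947, indexed by month
april_gdp = indexed_small.loc[(1947, 4), 'GDP']  # one cell, addressed by both levels
januaries = indexed_small.xs(1, level='month')   # every January, indexed by year

print(year_1947)
print(april_gdp)
print(januaries)
```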


Drop the superfluous `DATE` column

In [17]:
data_indexed = data_indexed.drop(columns='DATE')
data_indexed.head()

year,month,GDP,CPIAUCSL,USREC
1947,1,243.1,21.48,0
1947,4,246.3,22.0,0
1947,7,250.1,22.23,0
1947,10,260.3,22.91,0
1948,1,266.2,23.68,0


Use `.stack()` to create a long version of the data:

In [18]:
long = data_indexed.stack()
long.head()

year  month          
1947  1      GDP         243.10
             CPIAUCSL     21.48
             USREC         0.00
      4      GDP         246.30
             CPIAUCSL     22.00
dtype: float64
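The stacked result is a Series whose innermost index level (the old column names) has no name. One possible way to turn it into a tidy frame is sketched below; the labels `variable` and `value` are arbitrary choices for illustration:

```python
import pandas as pd

wide_small = pd.DataFrame({
    'year': [1947, 1947],
    'month': [1, 4],
    'GDP': [243.1, 246.3],
    'CPIAUCSL': [21.48, 22.0],
}).set_index(['year', 'month'])

# stack() moves the column labels into a new innermost index level.
stacked = wide_small.stack()

# Name the index levels and the values, then flatten to ordinary columns.
tidy = (stacked
        .rename_axis(['year', 'month', 'variable'])
        .rename('value')
        .reset_index())
print(tidy)
```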

And then use `.unstack()` to make the data wide, with one column per month

_Hint_: you need to pass an argument to `.unstack()` to identify which index level to spread across the columns

In [19]:
long.unstack?

In [20]:
wide = long.unstack('month')
wide.head()

,month,1,4,7,10
year,,,,,
1947,GDP,243.1,246.3,250.1,260.3
1947,CPIAUCSL,21.48,22.0,22.23,22.91
1947,USREC,0.0,0.0,0.0,0.0
1948,GDP,266.2,272.9,279.5,280.7
1948,CPIAUCSL,23.68,23.82,24.4,24.31
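`.unstack()` accepts either a level name or a level position, and with no argument it moves the innermost level. A minimal sketch on an illustrative two-year frame (`frame_small` and the derived names are not from the exercise):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[1947, 1948], [1, 4]], names=['year', 'month'])
frame_small = pd.DataFrame({'GDP': [243.1, 246.3, 266.2, 272.9]}, index=idx)

long_small = frame_small.stack()       # index levels: year, month, (unnamed)

by_name = long_small.unstack('month')  # spread the level called 'month'
by_pos = long_small.unstack(1)         # same level, addressed by position
restored = long_small.unstack()        # default: innermost level, back to wide

print(by_name)
print(restored)
```

Referring to the level by name tends to be more robust than a position if the index is later reordered.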


Try using `data.pivot()` to produce a `DataFrame` with:
- A row for each year
- One column per month
- Values filled in with the GDP of the corresponding year/month

In [21]:
data.pivot(index='year', columns='month', values='GDP').head()

month,1,4,7,10
year,,,,
1947,243.1,246.3,250.1,260.3
1948,266.2,272.9,279.5,280.7
1949,275.4,271.7,273.3,271.0
1950,281.2,290.7,308.5,320.3
1951,336.4,344.5,351.8,356.6
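`pivot` only reshapes, so it needs each `(year, month)` pair to appear once. When values have to be aggregated, `pivot_table` is the usual alternative. The sketch below shows both on an illustrative four-row frame (`records` is not the full data):

```python
import pandas as pd

records = pd.DataFrame({
    'year': [1947, 1947, 1948, 1948],
    'month': [1, 4, 1, 4],
    'GDP': [243.1, 246.3, 266.2, 272.9],
})

# Pure reshape: one row per year, one column per month.
table = records.pivot(index='year', columns='month', values='GDP')

# Aggregating reshape: mean GDP per year across the available months.
annual = records.pivot_table(index='year', values='GDP', aggfunc='mean')

print(table)
print(annual)
```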
