## Data Set Construction

**Functions**

`pd.read_csv`, `pd.read_excel`, `np.diff` or `DataFrame.diff`, `DataFrame.resample`
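The two differencing functions listed above behave slightly differently. The following is a minimal sketch on made-up prices (the values and dates are hypothetical, not taken from the exercise data): `np.diff` returns a plain array that is one element shorter, while `DataFrame.diff` keeps the index and puts a missing value in the first row.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, used only to illustrate the two functions
prices = pd.DataFrame(
    {"Close": [100.0, 102.0, 101.0, 105.0]},
    index=pd.date_range("2024-01-01", periods=4),
)

# np.diff works on the raw values: the result has 3 elements and no index
np_changes = np.diff(prices["Close"].to_numpy())

# DataFrame.diff keeps the DatetimeIndex: 4 rows, the first one is NaN
pd_changes = prices.diff()
```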

### Exercise 1

1. Download all available daily data for the S&P 500 and the Hang Seng Index from Yahoo! Finance.
2. Import both data sets into Python. The final dataset should have a `DatetimeIndex`, and the date
   column should not be part of the `DataFrame`.
3. Construct weekly price series from each, using Tuesday prices (Tuesday is less likely to be a holiday).
4. Construct monthly price series from each, using the last day in the month.
5. Save the data to the HDF file "equity-indices.h5".
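As a rough illustration of the weekly and monthly steps above, the sketch below uses synthetic business-day prices rather than the Yahoo! Finance files. The dates, the `Close` values, and the dropped "holiday" are all made up. It shows that `resample("W-TUE").last()` labels each week by its Tuesday but falls back to the most recent earlier price when Tuesday is missing, and contrasts this with keeping only rows that actually fall on a Tuesday.

```python
import numpy as np
import pandas as pd

# Synthetic business-day prices (hypothetical data, not real index values)
idx = pd.bdate_range("2024-01-01", "2024-03-29")
prices = pd.DataFrame({"Close": np.arange(len(idx), dtype=float)}, index=idx)

# Pretend Tuesday 2024-01-16 was a holiday with no trading
prices = prices.drop(pd.Timestamp("2024-01-16"))

# Weeks ending on Tuesday; last() takes the final available price in each week,
# so the week labeled 2024-01-16 holds Monday's price
weekly = prices.resample("W-TUE").last()

# Strict alternative: keep only rows that fall on a Tuesday (dayofweek == 1)
tuesdays_only = prices[prices.index.dayofweek == 1]

# Last trading day of each calendar month, keeping the actual trading date
month_end = prices.groupby(prices.index.to_period("M")).tail(1)
```

Which variant is appropriate depends on whether a missing Tuesday should be filled from Monday or left out of the weekly series. Similarly, the `groupby` version keeps the actual last trading date, while resampling to month-end labels each row with the calendar month-end.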


In [None]:
import pandas as pd

sp500 = pd.read_csv("data/GSPC.csv", parse_dates=True, index_col="Date")
hsi = pd.read_csv("data/HSI.csv", parse_dates=True, index_col="Date")

weekly_sp500 = sp500.resample("W-TUE").last()
weekly_hsi = hsi.resample("W-TUE").last()

monthly_sp500 = sp500.resample("M").last()
monthly_hsi = hsi.resample("M").last()

h5file = pd.HDFStore("data/equity-indices.h5", mode="w")
h5file.append("sp500", sp500)
h5file.append("weekly_sp500", weekly_sp500)
h5file.append("monthly_sp500", monthly_sp500)
h5file.append("hsi", hsi)
h5file.append("weekly_hsi", weekly_hsi)
h5file.append("monthly_hsi", monthly_hsi)
h5file.close()

sp500.tail()

In [None]:
weekly_sp500.tail()

In [None]:
monthly_sp500.tail()

### Exercise 2

Write a function that correctly aggregates daily data to weekly or monthly frequency,
respecting the following aggregation rules:

* High: `max`
* Low: `min`
* Volume: `sum`

The signature should be:

```python
def yahoo_agg(data, freq):
    <code here>
    return resampled_data
```
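One possible way to express these rules is a per-column dictionary passed to `Resampler.agg`. The sketch below is an assumption-laden illustration, not the reference solution: the name `yahoo_agg_sketch` is hypothetical, the exercise only specifies rules for `High`, `Low` and `Volume`, and falling back to `"last"` for every other column is a choice made here. The data are synthetic.

```python
import numpy as np
import pandas as pd


def yahoo_agg_sketch(data, freq):
    """Resample Yahoo!-style price data using per-column rules (hypothetical helper)."""
    # Rules stated in the exercise; any other column falls back to "last"
    special = {"High": "max", "Low": "min", "Volume": "sum"}
    rules = {col: special.get(col, "last") for col in data.columns}
    return data.resample(freq).agg(rules)


# Synthetic daily data covering two full business weeks
idx = pd.bdate_range("2024-01-01", periods=10)
daily = pd.DataFrame(
    {
        "High": np.arange(10.0) + 1,
        "Low": np.arange(10.0) - 1,
        "Close": np.arange(10.0),
        "Volume": 100,
    },
    index=idx,
)

# Aggregate to weeks ending on Friday
weekly = yahoo_agg_sketch(daily, "W-FRI")
```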


In [None]:
def yahoo_agg(data, freq):
    resampler = data.resample(freq)

    high = resampler.High.max()
    low = resampler.Low.min()
    vol = resampler.Volume.sum()
    # Start with last for all columns
    resampled_data = resampler.last()
    # Insert columns that use a different rule
    resampled_data["High"] = high
    resampled_data["Low"] = low
    resampled_data["Volume"] = vol

    return resampled_data


better_monthly_sp500 = yahoo_agg(sp500, "M")

monthly_sp500[["High", "Low", "Volume"]].tail()

In [None]:
better_monthly_sp500[["High", "Low", "Volume"]].tail()

### Exercise 3

1. Import the Fama-French benchmark portfolios as well as the 25 sorted portfolios at both the
   monthly and daily horizon from [Ken French's Data Library](http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html).
   **Note**: It is much easier to clean the data file before importing than to find the precise
   command that will load the unmodified data.
2. Import daily FX rate data for USD against AUD, Euro, JPY and GBP from the [Federal Reserve Economic Database (FRED)](http://research.stlouisfed.org/fred2/categories/94). Use Excel (xls) rather than csv files.
3. Save the data to the HDF files "fama-french.h5" and "fx.h5".

In [None]:
yen_dollar = pd.read_excel(
    "data/DEXJPUS.xls", index_col="observation_date", skiprows=10
)
dollar_aud = pd.read_excel(
    "data/DEXUSAL.xls", index_col="observation_date", skiprows=10
)
dollar_euro = pd.read_excel(
    "data/DEXUSEU.xls", index_col="observation_date", skiprows=10
)
dollar_pound = pd.read_excel(
    "data/DEXUSUK.xls", index_col="observation_date", skiprows=10
)

fx = pd.concat([yen_dollar, dollar_aud, dollar_euro, dollar_pound], axis=1)
print(fx.tail())
fx.to_hdf("data/fx.h5", key="fx")

In [None]:
# These files have all been cleaned to have only the data and headers
ff_5x5 = pd.read_csv("data/25_Portfolios_5x5.CSV", index_col=0)
ff_factors = pd.read_csv("data/F-F_Research_Data_Factors.CSV", index_col=0)
ff = pd.concat([ff_factors, ff_5x5], axis=1)

dates = []
for value in ff.index:
    # Values are YYYYMM
    year = value // 100
    month = value % 100
    dates.append(pd.Timestamp(year=year, month=month, day=1))
ff.index = dates
ff.tail()

In [None]:
# This is a "trick" to get the index to have the last day in the month.
ff = ff.resample("M").last()

ff.to_hdf("data/ff.h5", key="ff")

ff.tail()

In [None]:
ff.index

In [None]:
# These files have all been cleaned to have only the data and headers
ff_5x5_daily = pd.read_csv("data/25_Portfolios_5x5_daily.CSV", index_col=0)
ff_factors_daily = pd.read_csv("data/F-F_Research_Data_Factors_daily.CSV", index_col=0)
ff_daily = pd.concat([ff_factors_daily, ff_5x5_daily], axis=1)


dates = []
for value in ff_daily.index:
    # Values are YYYYMMDD
    year = value // 10000
    month = (value // 100) % 100
    day = value % 100
    dates.append(pd.Timestamp(year=year, month=month, day=day))
ff_daily.index = dates
ff_daily.to_hdf("data/ff_daily.h5", key="ff_daily")

ff_daily.tail()
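The date-parsing loops and the month-end resampling trick used above can also be written in vectorized form. This is only a sketch under assumptions: the small frames and the `Mkt-RF` values below are made up, and it presumes the index holds clean `YYYYMM` or `YYYYMMDD` integers as in the cleaned files.

```python
import pandas as pd

# Hypothetical YYYYMM integer index, mimicking the cleaned monthly files
monthly = pd.DataFrame({"Mkt-RF": [1.0, 2.0, 3.0]}, index=[202312, 202401, 202402])

# Convert the integers to strings and parse them with an explicit format;
# each value becomes the first day of its month
monthly.index = pd.to_datetime(monthly.index.astype(str), format="%Y%m")

# Roll each first-of-month stamp forward to the last calendar day of that month
monthly.index = monthly.index + pd.offsets.MonthEnd(0)

# Hypothetical YYYYMMDD integer index, mimicking the cleaned daily files
daily = pd.DataFrame({"Mkt-RF": [0.1, 0.2]}, index=[20240228, 20240229])
daily.index = pd.to_datetime(daily.index.astype(str), format="%Y%m%d")
```

Unlike `resample`, adding `MonthEnd(0)` only relabels the existing rows, so it does not insert rows for months that are absent from the data.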

### Exercise 3 (Alternative method)

1. Install and use `pandas-datareader` to repeat the previous exercise.

#### Preliminary Step

You must first install the module using 

```
pip install pandas-datareader
``` 

from the command line. Then you can run this code. **Note**: Running this code requires access
to the internet.

In [None]:
import pandas_datareader as pdr

# Conservative start date to get all data
yen_dollar = pdr.get_data_fred("DEXJPUS", start="1950")
dollar_aud = pdr.get_data_fred("DEXUSAL", start="1950")
dollar_euro = pdr.get_data_fred("DEXUSEU", start="1950")
dollar_pound = pdr.get_data_fred("DEXUSUK", start="1950")
fx = pd.concat([yen_dollar, dollar_aud, dollar_euro, dollar_pound], axis=1)
fx.to_hdf("data/fx-pdr.h5", key="fx")
fx.tail()

In [None]:
ff_factors = pdr.get_data_famafrench("F-F_Research_Data_Factors", start="1920")
ff_5x5 = pdr.get_data_famafrench("25_Portfolios_5x5", start="1920")
# The function returns all of the tables in each file.  We want the first, [0]
ff_factors = ff_factors[0]
ff_5x5 = ff_5x5[0]
ff = pd.concat([ff_factors, ff_5x5], axis=1)
ff.to_hdf("data/ff-pdr.h5", key="ff")
ff.tail()

### Exercise 4
Download data on 1-year and 10-year US government bond rates from FRED, and
construct the term premium as the difference in yields on 10-year and 1-year
bonds. Combine the two yield series and the term premium into a `DataFrame`
and save it as HDF.

In [None]:
# No need to import here since pandas and pandas-datareader previously imported

# Conservative start date to get all data
gs10 = pdr.get_data_fred("GS10", start="1950")
gs1 = pdr.get_data_fred("GS1", start="1950")

term = gs10["GS10"] - gs1["GS1"]
term.name = "TERM"
combined = pd.DataFrame([term, gs10["GS10"], gs1["GS1"]]).T
combined.tail()

In [None]:
combined.index

In [None]:
# Trick to ensure the index has the frequency MS, Month Start
combined = combined.resample("MS").last()
combined.to_hdf("data/term-premium.h5", key="term_premium")
combined.index
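To see what the resampling trick above changes, the following offline sketch repeats the construction on synthetic month-start yields. The numbers are made up and are not FRED data. It assumes the index is built from explicit dates, in which case it carries no frequency until it is resampled.

```python
import pandas as pd

# Synthetic month-start yields (hypothetical values, not FRED data)
dates = pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"])
gs10 = pd.Series([4.1, 4.2, 4.3], index=dates, name="GS10")
gs1 = pd.Series([4.8, 4.9, 5.0], index=dates, name="GS1")

# Term premium: 10-year yield minus 1-year yield
term = (gs10 - gs1).rename("TERM")

# Combine column-wise; concat aligns the three series on their shared index
combined = pd.concat([term, gs10, gs1], axis=1)

# An index built from explicit dates has no frequency attached
freq_before = combined.index.freq

# Resampling to month start returns an index with freq "MS"
combined = combined.resample("MS").last()
freq_after = combined.index.freqstr
```

Here `pd.concat(..., axis=1)` is used in place of building a `DataFrame` from a list and transposing it. Both give the same columns, and the choice is a matter of taste.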