Fama-Macbeth Regression/ao-luo-hw7.Rmd

```{r setup, include=FALSE}
library(dplyr)
library(broom)
library(ggplot2)

load(file = './annual_set.Rdata')
load(file = './clean_annual_financials.Rdata')
load(file = './PanelData[1995-2014].Rdata')
load(file = './rel_error.Rdata')
load(file = './truncated_rel_error.Rdata')
load(file = './avg_coef.Rdata')
```

# Homework 7, by Ao Luo, March 9, 2018

## Executive Summary
In this homework, we are running Fama-Macbeth regression to predict cross-sectional stock returns with independent variables: firm-marketcap, price-normalized accruals, the earnings-price ratio, and 1/price. The data sets originally come from <a href="https://wrds-web.wharton.upenn.edu/wrds/">WRDS</a>. The prediction of year T+2 returns is based on the year T variable as a firm could end its fiscal year in May and report results in October. The performance of the prediction is not as poor as I expected. The the Fama-Macbeth regression runs on the panel data with 826 samples accross 20 years produces the coefficients of [0.00000133, -0.563, -3.02, 0.484] for the above variables respectively. Measurement error due to bid/ask gap, sample selection, assumption of uncorrelated crossectional samples and treatment of missing data may explain our findings.

## Introduction & Discussion
### Data Set

The WRDS Compustat data for all stocks contains the fundamental informations including, gvkey for each ticker, fiscal year, account receivable, accrual expense, income taxes payable, depreciation and amortization,net income, common shares outstanding, share price, marketcap and eps. The data is collected from 1973 through 2018, which contains 431858 observations in total.

All the data sets can be downloaded [HERE](https://drive.google.com/open?id=13aYEW4vYC4CWvekUYirM5Dtv1RsHBBCg)

### Data Processing
We notice that due to financial report format, items may be missing as the company report in different sections. We recalculate marketcap and eps based on the available fundamental financials. Then we take a union of two columns to maximize the avaiable samples. The fiscal years with missing values and infinite values are omitted.

### Performance Analysis
With the coefficients derived from panel data, we predict the time T+2 returns with the time T variables of the cleaned annual data set. As a result, the prediction is not as ideal as I expected in that it predicts reutrns for 15361 firms only using the data of 826 firms with a standard error of 366.87. We calculate the relative error in percentages for the predicted return over historical return. And the positively skewed distribution of the relative error is shown below (after truncation of top 5% and tail 5% values due to near-zero historical return). After truncation we have standard error of 5.13, which is far better than before. However, if we check the standard deviation of the coefficients, they are huge compared to the mean values. And this does hurt the prediction power. A more reliable result can be provided using larger size of panel data. Besides, the assumption of uncorrelated crossection samples may not be true in the selected samples.

And the showcase of the annual data can be seen in the figures below. The table for the above two stocks are also attached to illustrate the data structure. The Rdata for all stocks is stored online and you can access it from [HERE](https://drive.google.com/open?id=1r4DB9YSqYHzpy-8OpAEm0RT5kjYxqdeq).

## Tables and Figures
### Fama-Macbeth Coefficient
```{r coefficients, echo=FALSE}
knitr::kable(avg_coef, align = 'c',  format = "html")
```

### Distribution of Relative Error
```{r error-distribution,echo = FALSE}
knitr::kable(summary(errors), align = 'l', format = 'html', caption = 'Summary of Relative Error in Percentags and Residuals', col.names = c('given key', 'relative error', 'resid'))

knitr::kable(summary(truncated_errors), align = 'l', format = 'html', caption = 'Summary of Truncated Relative Error in Percentags and Residuals', col.names = c('given key', 'relative error', 'resid'))

truncated_errors %>%
  ggplot(aes(error_percent)) + geom_histogram(binwidth = 10) + ggtitle('Distribution of Relative Error (truncated)') + xlab("Relative Error [%]")
```
## Computer Code
<pre>
  library(data.table)
  library(dplyr)
  library(broom)
  library(ggplot2)
  
  ## Step 1: Raw Data Clean
  # read in raw data from wrds
  # data from 1973 - 2018
  raw_data <- fread('./58059f88545b4ce2.csv')
  
  # check basic characteristics of raw data
  str(raw_data)
  head(raw_data)
  summary(raw_data)
  dim(raw_data)
  
  # save if in conveience of faster retrieving
  save(raw_data, file = './annual_set.Rdata')
  
  # subset the values we are interested in
  annual_set <- subset(raw_data, select = c('gvkey', 'fyear', 'rect', 'xacc', 'dp', 'txp', 'epspi', 'ni', 'csho', 'prcc_f', 'mkvalt'))
  
  # transform into predicting variables
  finlset <- annual_set[, .((prcc_f*csho), (mkvalt), (rect-dp-txp-xacc), (epspi/prcc_f), (ni/csho/prcc_f), (prcc_f), (1/prcc_f)),
                       by = list(gvkey, fyear)]
  colnames(finlset) <- c('gvkey', 'fyear', 'mktcap1', 'mktcap2', 'netaccrual', 'epratio1', 'epratio2',  'price', 'invprice')
  
  # unionize to maximize the available financial data
  finlset <- finlset %>%
    mutate(mktcap = if_else(!is.na(mktcap1), mktcap1, mktcap2),
           epratio = if_else(!is.na(epratio1), epratio1, epratio2)) %>%
    mutate(normaccrual = netaccrual / mktcap)
  
  # create entries for return and forward return
  finlset <- finlset %>%
    group_by(gvkey) %>%
    mutate(return = price/lag(price),
           fwd_return = lead(price,2)/lead(price))
  
  # filter out all NA values
  finlset <- finlset %>%
    filter(!is.na(return), 
           !is.na(fwd_return), 
           !is.na(mktcap), 
           !is.na(epratio), 
           !is.na(invprice), 
           !is.na(normaccrual), 
           price > 0,
           mktcap > 0,
           is.finite(epratio),
           is.finite(invprice)) %>%
    select(gvkey, fyear, return, fwd_return, mktcap, normaccrual, epratio, invprice)
  
  save(finlset, file = './clean_annual_financials.Rdata')
  
  ## Step 2: Making panel data
  # as there is always a trade-off between number of samples and length of time interval
  # we pick 20 years as time window to collect as many samples as possible
  year_samples <- rep(0, 25)
  for(year in 1973:1997){
    ss <- finlset %>%
      group_by(gvkey) %>%
      filter(all(seq(year, year+19) %in% fyear))
    year_samples[year-1972] <- length(unique(ss$gvkey))
  }
  
  # as a result, we are interested in the 20 consecutive years from 1995 to 2014
  start_year <- 1972 + which(year_samples == max(year_samples))
  end_year <- start_year + 19
  
  # then make the panel data
  panelset <- finlset %>%
    group_by(gvkey) %>%
    filter(all(seq(start_year, end_year) %in% fyear), fyear >= start_year, fyear <= end_year)
  # save the panel data
  save(panelset, file = './PanelData[1995-2014].Rdata')
  
  ## Step 3: Fama-Macbeth Regression
  # substep 1: crossectional regression
  # run regression of R_T on the variables at t = T-2
  models <- panelset %>%
    group_by(fyear) %>%
    do(fit_fyear = lm(fwd_return ~ mktcap + normaccrual + epratio + invprice, data = .))
  
  # get the coefficients by group of fyear in a tidy data_frame
  coef_fyear = tidy(models, fit_fyear)
  
  # substep 2: average over time
  # str(coef_fyear)
  avg_coef <- coef_fyear %>%
    group_by(term) %>%
    summarize(Mean = mean(estimate), Std = sd(estimate))
  
  save(avg_coef, file = 'avg_coef.Rdata')
  # subsetp 3: running forecast of return
  # and compare it with historical return
  predictset <- finlset %>%
    group_by(gvkey) %>%
    mutate(predict_return = avg_coef[['Mean']][1] + 
             avg_coef[['Mean']][2] * lag(epratio,2) + 
             avg_coef[['Mean']][3] * lag(invprice,2) +
             avg_coef[['Mean']][4] * lag(mktcap,2) +
             avg_coef[['Mean']][5] * lag(normaccrual,2)) %>%
    mutate(error_percent = (predict_return/return - 1)*100)
  
  # exclude the errors coming from zero historical return
  errors <- predictset %>%
    filter(is.finite(error_percent), !is.na(error_percent)) %>%
    select(gvkey, error_percent)
  summary(errors$error_percent)
  
  # truncate the errors coming from almost zero return
  qbottom_errors <- quantile(errors$error_percent,0.05)
  qtop_errors <- quantile(errors$error_percent, 0.95)
  truncated_errors <- errors %>%
    filter(error_percent > qbottom_errors, error_percent < qtop_errors)
  summary(errors$error_percent)
  
  save(truncated_errors, file = 'rel_error.Rdata')
  
  # plot of histogram to show the distribution of errors
  truncated_errors %>%
    ggplot(aes(error_percent)) + geom_histogram(binwidth = 10)

</pre>