/
ao-luo-hw7.Rmd
175 lines (141 loc) · 8.54 KB
/
ao-luo-hw7.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
```{r setup, include=FALSE}
library(dplyr)
library(broom)
library(ggplot2)
load(file = './annual_set.Rdata')
load(file = './clean_annual_financials.Rdata')
load(file = './PanelData[1995-2014].Rdata')
load(file = './rel_error.Rdata')
load(file = './truncated_rel_error.Rdata')
load(file = './avg_coef.Rdata')
```
# Homework 7, by Ao Luo, March 9, 2018
## Executive Summary
In this homework, we are running Fama-Macbeth regression to predict cross-sectional stock returns with independent variables: firm-marketcap, price-normalized accruals, the earnings-price ratio, and 1/price. The data sets originally come from <a href="https://wrds-web.wharton.upenn.edu/wrds/">WRDS</a>. The prediction of year T+2 returns is based on the year T variable as a firm could end its fiscal year in May and report results in October. The performance of the prediction is not as poor as I expected. The the Fama-Macbeth regression runs on the panel data with 826 samples accross 20 years produces the coefficients of [0.00000133, -0.563, -3.02, 0.484] for the above variables respectively. Measurement error due to bid/ask gap, sample selection, assumption of uncorrelated crossectional samples and treatment of missing data may explain our findings.
## Introduction & Discussion
### Data Set
The WRDS Compustat data for all stocks contains the fundamental informations including, gvkey for each ticker, fiscal year, account receivable, accrual expense, income taxes payable, depreciation and amortization,net income, common shares outstanding, share price, marketcap and eps. The data is collected from 1973 through 2018, which contains 431858 observations in total.
All the data sets can be downloaded [HERE](https://drive.google.com/open?id=13aYEW4vYC4CWvekUYirM5Dtv1RsHBBCg)
### Data Processing
We notice that due to financial report format, items may be missing as the company report in different sections. We recalculate marketcap and eps based on the available fundamental financials. Then we take a union of two columns to maximize the avaiable samples. The fiscal years with missing values and infinite values are omitted.
### Performance Analysis
With the coefficients derived from panel data, we predict the time T+2 returns with the time T variables of the cleaned annual data set. As a result, the prediction is not as ideal as I expected in that it predicts reutrns for 15361 firms only using the data of 826 firms with a standard error of 366.87. We calculate the relative error in percentages for the predicted return over historical return. And the positively skewed distribution of the relative error is shown below (after truncation of top 5% and tail 5% values due to near-zero historical return). After truncation we have standard error of 5.13, which is far better than before. However, if we check the standard deviation of the coefficients, they are huge compared to the mean values. And this does hurt the prediction power. A more reliable result can be provided using larger size of panel data. Besides, the assumption of uncorrelated crossection samples may not be true in the selected samples.
And the showcase of the annual data can be seen in the figures below. The table for the above two stocks are also attached to illustrate the data structure. The Rdata for all stocks is stored online and you can access it from [HERE](https://drive.google.com/open?id=1r4DB9YSqYHzpy-8OpAEm0RT5kjYxqdeq).
## Tables and Figures
### Fama-Macbeth Coefficient
```{r coefficients, echo=FALSE}
knitr::kable(avg_coef, align = 'c', format = "html")
```
### Distribution of Relative Error
```{r error-distribution,echo = FALSE}
knitr::kable(summary(errors), align = 'l', format = 'html', caption = 'Summary of Relative Error in Percentags and Residuals', col.names = c('given key', 'relative error', 'resid'))
knitr::kable(summary(truncated_errors), align = 'l', format = 'html', caption = 'Summary of Truncated Relative Error in Percentags and Residuals', col.names = c('given key', 'relative error', 'resid'))
truncated_errors %>%
ggplot(aes(error_percent)) + geom_histogram(binwidth = 10) + ggtitle('Distribution of Relative Error (truncated)') + xlab("Relative Error [%]")
```
## Computer Code
<pre>
library(data.table)
library(dplyr)
library(broom)
library(ggplot2)
## Step 1: Raw Data Clean
# read in raw data from wrds
# data from 1973 - 2018
raw_data <- fread('./58059f88545b4ce2.csv')
# check basic characteristics of raw data
str(raw_data)
head(raw_data)
summary(raw_data)
dim(raw_data)
# save if in conveience of faster retrieving
save(raw_data, file = './annual_set.Rdata')
# subset the values we are interested in
annual_set <- subset(raw_data, select = c('gvkey', 'fyear', 'rect', 'xacc', 'dp', 'txp', 'epspi', 'ni', 'csho', 'prcc_f', 'mkvalt'))
# transform into predicting variables
finlset <- annual_set[, .((prcc_f*csho), (mkvalt), (rect-dp-txp-xacc), (epspi/prcc_f), (ni/csho/prcc_f), (prcc_f), (1/prcc_f)),
by = list(gvkey, fyear)]
colnames(finlset) <- c('gvkey', 'fyear', 'mktcap1', 'mktcap2', 'netaccrual', 'epratio1', 'epratio2', 'price', 'invprice')
# unionize to maximize the available financial data
finlset <- finlset %>%
mutate(mktcap = if_else(!is.na(mktcap1), mktcap1, mktcap2),
epratio = if_else(!is.na(epratio1), epratio1, epratio2)) %>%
mutate(normaccrual = netaccrual / mktcap)
# create entries for return and forward return
finlset <- finlset %>%
group_by(gvkey) %>%
mutate(return = price/lag(price),
fwd_return = lead(price,2)/lead(price))
# filter out all NA values
finlset <- finlset %>%
filter(!is.na(return),
!is.na(fwd_return),
!is.na(mktcap),
!is.na(epratio),
!is.na(invprice),
!is.na(normaccrual),
price > 0,
mktcap > 0,
is.finite(epratio),
is.finite(invprice)) %>%
select(gvkey, fyear, return, fwd_return, mktcap, normaccrual, epratio, invprice)
save(finlset, file = './clean_annual_financials.Rdata')
## Step 2: Making panel data
# as there is always a trade-off between number of samples and length of time interval
# we pick 20 years as time window to collect as many samples as possible
year_samples <- rep(0, 25)
for(year in 1973:1997){
ss <- finlset %>%
group_by(gvkey) %>%
filter(all(seq(year, year+19) %in% fyear))
year_samples[year-1972] <- length(unique(ss$gvkey))
}
# as a result, we are interested in the 20 consecutive years from 1995 to 2014
start_year <- 1972 + which(year_samples == max(year_samples))
end_year <- start_year + 19
# then make the panel data
panelset <- finlset %>%
group_by(gvkey) %>%
filter(all(seq(start_year, end_year) %in% fyear), fyear >= start_year, fyear <= end_year)
# save the panel data
save(panelset, file = './PanelData[1995-2014].Rdata')
## Step 3: Fama-Macbeth Regression
# substep 1: crossectional regression
# run regression of R_T on the variables at t = T-2
models <- panelset %>%
group_by(fyear) %>%
do(fit_fyear = lm(fwd_return ~ mktcap + normaccrual + epratio + invprice, data = .))
# get the coefficients by group of fyear in a tidy data_frame
coef_fyear = tidy(models, fit_fyear)
# substep 2: average over time
# str(coef_fyear)
avg_coef <- coef_fyear %>%
group_by(term) %>%
summarize(Mean = mean(estimate), Std = sd(estimate))
save(avg_coef, file = 'avg_coef.Rdata')
# subsetp 3: running forecast of return
# and compare it with historical return
predictset <- finlset %>%
group_by(gvkey) %>%
mutate(predict_return = avg_coef[['Mean']][1] +
avg_coef[['Mean']][2] * lag(epratio,2) +
avg_coef[['Mean']][3] * lag(invprice,2) +
avg_coef[['Mean']][4] * lag(mktcap,2) +
avg_coef[['Mean']][5] * lag(normaccrual,2)) %>%
mutate(error_percent = (predict_return/return - 1)*100)
# exclude the errors coming from zero historical return
errors <- predictset %>%
filter(is.finite(error_percent), !is.na(error_percent)) %>%
select(gvkey, error_percent)
summary(errors$error_percent)
# truncate the errors coming from almost zero return
qbottom_errors <- quantile(errors$error_percent,0.05)
qtop_errors <- quantile(errors$error_percent, 0.95)
truncated_errors <- errors %>%
filter(error_percent > qbottom_errors, error_percent < qtop_errors)
summary(errors$error_percent)
save(truncated_errors, file = 'rel_error.Rdata')
# plot of histogram to show the distribution of errors
truncated_errors %>%
ggplot(aes(error_percent)) + geom_histogram(binwidth = 10)
</pre>