# Time Series

We often deal with time series data: data that come from units of interest that are repeatedly measured over time. We have time series data whenever we do outcome monitoring, for example. Government agencies produce time series: they release information at regular intervals describing some relevant characteristic of a country or state. As we discussed in class and covered in readings, we need to deal with the statistical issues that arise from repeatedly measuring the same unit (or units, if we have panel, TSCS, or pooled CSTS data).

In this activity, we will cover the basics of analyzing time series data in R using data on the U.S. economy and data from the OECD. First, we'll examine the relationship between the unemployment rate and the [University of Michigan's Consumer Sentiment data series](https://fred.stlouisfed.org/series/UMCSENT), made available by FRED^[I used the `fredr` package to download the data. It does require registering with FRED to receive an API key.].

There are a number of packages we'll use to analyze time series data. Install them (only if you have not already done so), and load them.

In [None]:

#install.packages("tidyverse")
#install.packages("tseries")
#install.packages("plm")
#install.packages("modelsummary")
#install.packages("dynlm")
#install.packages("prais")
#install.packages("xts")
#install.packages("lmtest")

lib.list <- c("tidyverse", "tseries", "plm", "modelsummary")

# lapply means "list or vector apply". It lets you pass a
# list or vector of data, objects, names into a function
# here we pass our list of packages into the library() 
# function. library requires just the text, so we use
# the character.only argument. 
lapply(lib.list, library, character.only = TRUE)


Now we can import our data, currently stored on GitHub.

In [None]:
# load consumer sentiment data and check

economy <- read_csv("https://raw.githubusercontent.com/bowendc/512_labs/refs/heads/main/con.sent.csv")

head(economy)

`con.sent` is the consumer sentiment index score from the University of Michigan Consumer Sentiment survey data. Higher values describe more positive assessments of the economy. `un.rate` is the unemployment rate, from the Bureau of Labor Statistics. Notice that the observation is the *month* - we're using estimates of national consumer sentiment measured monthly over time. So `date` defines the observation in the dataset.

We can quickly visualize how consumer sentiment changes over time using `ggplot`.

In [None]:
ggplot(economy, mapping = aes(x = date, y = con.sent)) +
    geom_line(color = "maroon", linewidth = 1.5) +
    labs(
        y = "Consumer Sentiment Index (Monthly)",
        x = "Date"
    ) +
    theme_minimal() 

It looks to me like there is positive autocorrelation in this series. That is, if one month is high, then the next month will also tend to be high, on average. We can check for autocorrelation using the `acf()` function.

In [None]:
acf(economy$con.sent)

The graph above (known as a *correlogram*) shows the correlation between the value of the time series at time $t$ and lags of the series. In this case, there is a positive correlation ($r > .4$) even after 26 months! The steady decline in the correlation suggests that the current value of $y$ depends heavily on the previous value of $y$ ($y_{t-1}$). The series is *autoregressive*.
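To make the numbers behind a correlogram concrete, here is a minimal sketch that computes the lag-1 autocorrelation by hand and compares it with what `acf()` reports. It uses a simulated series rather than our data; the AR coefficient of 0.8 and the object names (`y`, `r1_hand`, `r1_acf`) are assumptions for illustration only.

```r
# Simulated series for illustration only (not the consumer sentiment data)
set.seed(7)
y <- as.numeric(arima.sim(model = list(ar = 0.8), n = 300))

n <- length(y)
m <- mean(y)

# lag-1 autocorrelation: covariance of y_t with y_{t-1}, divided by the
# variance of y (both use the overall mean, as acf() does by default)
r1_hand <- sum((y[-1] - m) * (y[-n] - m)) / sum((y - m)^2)

# acf() stores lag 0 in the first element, so lag 1 is the second
r1_acf <- acf(y, plot = FALSE)$acf[2]

c(by_hand = r1_hand, from_acf = r1_acf)
```

The two values should agree, which shows that each bar in the correlogram is just this calculation repeated at a longer lag.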

We could, however, also examine the *partial autocorrelation function* (PACF), which presents the correlation between $y_t$ and each lag of $y$ shown on the x-axis, after controlling for all shorter lags of $y$. So, lag = 3 shows the association between $y_t$ and $y_{t-3}$ (the regression coefficient on $y_{t-3}$) while controlling for $y_{t-1}$ and $y_{t-2}$.

In [None]:
pacf(economy$con.sent)
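The regression interpretation of the PACF can be checked with a short sketch. This uses a simulated AR(1) series (the coefficient 0.8 and all object names here are assumptions for illustration), regresses $y_t$ on its first three lags, and compares the coefficient on the third lag with the lag-3 value from `pacf()`. The two are computed slightly differently, so expect them to be close rather than identical.

```r
# Simulated AR(1) series for illustration only
set.seed(512)
y <- as.numeric(arima.sim(model = list(ar = 0.8), n = 1000))

# embed() builds a matrix whose columns are y_t, y_{t-1}, y_{t-2}, y_{t-3}
lags <- embed(y, 4)

# regress y_t on its first three lags
fit <- lm(lags[, 1] ~ lags[, 2] + lags[, 3] + lags[, 4])
ols_lag3 <- unname(coef(fit)[4])   # coefficient on y_{t-3}

# pacf() starts at lag 1, so the third element is lag 3
pacf_lag3 <- pacf(y, plot = FALSE)$acf[3]

c(ols = ols_lag3, pacf = pacf_lag3)
```

Because the simulated series is AR(1), both numbers should be near zero: once $y_{t-1}$ is controlled for, the third lag adds little.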

To continue modeling consumer sentiment over time, we need to process our data further using R's time series functions. This can get quite complex, because different functions expect data stored in different ways (some expect time series objects, some expect data frames or vectors), and creating lagged and differenced variables changes the number of observations in a vector. To add to the confusion, some functions share a name across packages. There is a `lag()` function in the `stats` package (part of base R) as well as in `dplyr` (part of `tidyverse`). Below I highlight the simplest way that I know of to model a time series. Note that we'll be using different functions when we model panel time series data later in this activity.

First, let's tell R the nature of the time series that we have and store our key variables as time series objects in separate vectors, not in a dataframe. 

In [None]:
# the ts() function stores a vector as a time series using
# the start and frequency arguments.
# note that we begin the series in January 1990 (year 1990, period 1)
# and that the series has 12 time periods per year (months)
# we grab each vector with the df$variable.name notation

con.sent <- ts(economy$con.sent, start = c(1990, 1), frequency = 12)
un.rate <- ts(economy$un.rate, start = c(1990, 1), frequency = 12)

# check to make sure the vector looks the way we want. 
# notice that R knows we have monthly data
print(con.sent)
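To see what the `start` and `frequency` arguments of `ts()` do, here is a small sketch on a made-up series of 24 values (the object name `toy.ts` and the values are assumptions for illustration). It shows how to confirm the first and last time periods and how to pull out a range of months.

```r
# 24 made-up monthly values beginning in January 1990
toy.ts <- ts(1:24, start = c(1990, 1), frequency = 12)

start(toy.ts)   # first period, as c(year, month)
end(toy.ts)     # last period, as c(year, month)

# window() subsets by time rather than by position:
# here, November 1990 through February 1991
window(toy.ts, start = c(1990, 11), end = c(1991, 2))
```

Checking `start()` and `end()` right after creating a series is a quick way to catch a mis-specified starting date before it affects lags and differences.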

Now that we have our two central variables stored as separate vectors and recognized as time series data, we can create differenced and lagged variables that we can use for analysis. I think it is easiest to make these variables and then use your created data rather than using the difference and lag functions directly inside the model.  

In [None]:
# note the lag() function. "-" denotes the previous value
# so -1 means the previous time period, -2 means two time
# periods in the past.
l.con.sent <- stats::lag(con.sent, -1)
l2.con.sent <- stats::lag(con.sent, -2)
d.con.sent <- con.sent - stats::lag(con.sent, -1)
ld.con.sent <- stats::lag(d.con.sent, -1)
l.un.rate <- stats::lag(un.rate, -1)
l2.un.rate <- stats::lag(un.rate, -2)
d.un.rate <- un.rate - stats::lag(un.rate, -1)
ld.un.rate <- stats::lag(d.un.rate, -1)


If you check your environment window, you should see several new vectors called *l.con.sent*, *l2.con.sent* and the like. *l* is being used to describe a lagged value, while *d* vectors are differenced. 

Now we have 8 new vectors, all coded as time series. But the time periods don't all line up, because some are differenced and some have lags. Differencing and lagging will lead to dropped data. We can use the `ts.intersect()` function to align our vectors to the time periods they have in common and then store the result as a data frame for analysis.

In [None]:
economy2 <- ts.intersect(con.sent, l.con.sent, l2.con.sent, 
                        d.con.sent, ld.con.sent, 
                        un.rate, l.un.rate, l2.un.rate, 
                        d.un.rate, ld.un.rate,
                        dframe = TRUE)

# check to make sure all the differencing and lagging works as you want it to!
head(economy2)
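If the alignment is hard to picture, a toy example may help. The sketch below builds a five-month series with made-up values (the names `x`, `l.x`, `d.x`, and `toy` are assumptions for illustration), lags and differences it, and then aligns everything with `ts.intersect()`.

```r
# five made-up monthly values, January to May 1990
x <- ts(c(10, 12, 15, 11, 14), start = c(1990, 1), frequency = 12)

l.x <- stats::lag(x, -1)        # previous month's value
d.x <- x - stats::lag(x, -1)    # change from the previous month

# keep only the months where all three series are observed
toy <- ts.intersect(x, l.x, d.x, dframe = TRUE)
print(toy)
```

January 1990 has no previous month, so it drops out and four rows remain. In each row, `l.x` holds the prior month's `x`, and `d.x` equals `x` minus `l.x`.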

## Autoregressive (AR) Models

$AR(p)$ models describe a time series as a function of $p$ lags of the series plus random error. The partial autocorrelation function above suggests that the simplest AR model, AR(1), might fit our series well, since after controlling for $y_{t-1}$, there is rarely a significant relationship between the series and higher-order lags. Let's estimate AR(1) and AR(2) models.


In [None]:
# we can estimate using OLS. Check out ar.ols() for fitting a more general ar(p) model

ar1 <- lm(con.sent ~ l.con.sent, data = economy2)
summary(ar1)

ar2 <- lm(con.sent ~ l.con.sent + l2.con.sent, data = economy2)
summary(ar2)
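As a sanity check on the approach, the following sketch simulates an AR(1) series with a known coefficient and then recovers it with the same lag-and-`lm()` steps used above. The coefficient 0.7, the sample size, and the object names (`y`, `l.y`, `sim`, `fit`) are assumptions chosen for illustration.

```r
# simulate an AR(1) process with a known coefficient of 0.7
set.seed(42)
y <- arima.sim(model = list(ar = 0.7), n = 2000)

# build the lag and align the two series, as we did for con.sent
l.y <- stats::lag(y, -1)
sim <- ts.intersect(y, l.y, dframe = TRUE)

# OLS estimate of the AR(1) coefficient
fit <- lm(y ~ l.y, data = sim)
coef(fit)
```

The coefficient on `l.y` should land near 0.7, which is some reassurance that the OLS setup estimates what we think it does.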

## Autoregressive Distributed Lag (ADL) Models

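Autoregressive distributed lag models add current and/or lagged values of predictors to the autoregressive framework. The first model below includes only lagged unemployment (a distributed lag); the second adds lagged consumer sentiment, which makes it autoregressive as well.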

In [None]:
adl <- lm(con.sent ~ l.un.rate, data = economy2)
summary(adl)

In [None]:
adl2 <- lm(con.sent ~ l.un.rate + l.con.sent, data = economy2)
summary(adl2)
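Written as an equation, the `adl2` specification looks roughly like the following. The Greek letters are my own labels for the intercept, the two slopes, and the error term; they are not output names from R.

$$
\text{con.sent}_t = \alpha + \phi \, \text{con.sent}_{t-1} + \beta \, \text{un.rate}_{t-1} + \varepsilon_t
$$

Here $\phi$ corresponds to the coefficient on `l.con.sent` and $\beta$ to the coefficient on `l.un.rate`. Dropping the $\phi$ term gives the simpler `adl` model.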


The Augmented Dickey-Fuller test evaluates whether a time series is stationary. The null hypothesis is non-stationarity (the series has a unit root). The code below runs the test twice; the first test uses the raw series, the second uses the differenced series. 

In [None]:
adf.test(economy2$con.sent)

In [None]:
adf.test(economy2$d.con.sent)

Interpret both tests in a Markdown text chunk. Which series should we be using in our analysis? 

We can estimate first-difference models by regressing the change (difference) in consumer sentiment on each of several candidate sets of predictors: 

1. lagged unemployment rate;
2. differenced unemployment rate;
3. differenced unemployment rate and the lagged differenced unemployment rate.

In [None]:

fd <- lm(d.con.sent ~ l.un.rate, data = economy2)
fd1 <- lm(d.con.sent ~ d.un.rate, data = economy2)
fd2 <- lm(d.con.sent ~ d.un.rate + ld.un.rate, data = economy2)

Create a regression table using `modelsummary()` that presents `fd`, `fd1`, and `fd2` model results. The table should have stars to denote p-values next to model coefficients and include standard errors in parentheses below the coefficients. Include a title for the table in the code chunk in Quarto that will show up in your pdf file when you render the document. Check out [this page](https://quarto.org/docs/authoring/tables.html#computations) for details on how to do this. 

Then, using Markdown, interpret your regression results from the three models. 

## Panel Data 

As policy researchers, we often deal with panel data. That is, we have data over time on multiple units. And we are interested in the over-time dynamics and the cross-sectional relationships across units. 

In this exercise, let's examine data from the Organisation for Economic Co-operation and Development (OECD). The outcome variable is economic productivity (GDP per hour).

Let's use a couple of `for` loops to get and process the data. Loops are a nice way to simplify your code: instead of copying and pasting multiple lines of code, you can loop over a set of objects and plug those objects into the lines of code. The first loop below imports the data using `paste0()` (a function that builds character strings without a separator character) and `assign()` (which gives a name to an object). The second loop uses `get()` and `assign()` to take the dataframes, select just the needed variables, rename the core variable, and store the result back in the dataframe.

In [None]:
# import the csv files from GitHub - each variable is stored in its own csv file

# loop over three datasets/variables
for(i in c("gdp_perhour", "life_exp", "ppp_spend")){
        file <- paste0("https://raw.githubusercontent.com/bowendc/512_labs/refs/heads/main/",
        i, "_2002-2020.csv") # store url to `file`
        df <- read_csv(file) # call our dataframe `df`
        assign(i, df)        # apply the name stored in i to df
}

# quick processing of the education spending data to include just 
# higher ed spending
ppp_spend <- ppp_spend |> filter(`Education level` == "Tertiary education")

# loop over dataframes to keep only necessary vars and rename main
# variable same as name of dataframe
# to dynamically change a variable name in rename() or mutate(),
# use !! in front of the looped object and use := instead of =
for(i in c("gdp_perhour", "life_exp", "ppp_spend")){
  df <- get(i)
  df <- df |> 
    select(REF_AREA, `Reference area`, TIME_PERIOD, OBS_VALUE) |>
    rename(!!i := OBS_VALUE)
  assign(i, df)
}
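If `assign()` and `get()` are new to you, this stripped-down sketch shows the same pattern without downloading anything. The object names (`first`, `second`), the column names, and the `example.com` address are all hypothetical; the URL is only pasted together as text and never fetched.

```r
# loop 1: build a string with paste0() and create an object named after `nm`
for (nm in c("first", "second")) {
  url <- paste0("https://example.com/", nm, ".csv")   # hypothetical URL, not fetched
  df <- data.frame(source = url, value = nchar(nm))
  assign(nm, df)      # creates objects called `first` and `second`
}

# loop 2: fetch each object by name, rename a column after it, store it back
for (nm in c("first", "second")) {
  df <- get(nm)
  names(df)[names(df) == "value"] <- nm
  assign(nm, df)
}

names(first)
names(second)
```

After the second loop, each data frame has a column named after the data frame itself, which is the same end state the `rename(!!i := OBS_VALUE)` step aims for.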

Now we have three separate dataframes, each containing one of our variables of interest. Let's combine them into one dataframe for analysis. Since we want observations to remain in the dataset if they are in ANY of the three dataframes, we'll use `full_join()`.

In [None]:
# combine dataframes into one. 
# full_join() keeps observations that lack a match in either dataset
# REF_AREA is the country code and TIME_PERIOD is the year.
# both are needed to match the data together.
prod <- full_join(gdp_perhour, life_exp, by = join_by(REF_AREA, TIME_PERIOD))

# we run it again to join in ppp
prod <- full_join(prod, ppp_spend, by = join_by(REF_AREA, TIME_PERIOD))

# let's check to make sure everything worked: 
head(prod)
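To see what "keep observations found in any of the dataframes" means in practice, here is a tiny sketch with made-up numbers. It uses base R's `merge()` with `all = TRUE`, which performs the same kind of full outer join, rather than `full_join()` itself, so that it runs without loading any packages. The values and the object names (`gdp_toy`, `life_toy`, `both`) are invented for illustration.

```r
# made-up values: CAN appears in both, AUS and DEU in only one each
gdp_toy <- data.frame(REF_AREA = c("AUS", "CAN"),
                      TIME_PERIOD = c(2002, 2002),
                      gdp_perhour = c(50, 55))
life_toy <- data.frame(REF_AREA = c("CAN", "DEU"),
                       TIME_PERIOD = c(2002, 2002),
                       life_exp = c(79, 78))

# all = TRUE keeps rows from both inputs, matched or not
both <- merge(gdp_toy, life_toy, by = c("REF_AREA", "TIME_PERIOD"), all = TRUE)
print(both)
```

All three countries survive the join, and the unmatched cells are filled with `NA`. Expect the same pattern in `prod` wherever a country-year is missing from one of the source files.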

Here we can estimate a number of different types of regression models using `plm()` (panel linear models). We can add fixed effects (FE) by country using `effect = c("individual")` and `model = "within"`. If we set `model = "pooling"`, we get ordinary pooled OLS estimates.

Also of note: `plm()` has a more intuitive lag and differencing notation, which allows us to easily use the `lag()` and `diff()` functions inside of a `plm()` call.

In [None]:

# OLS
m1 <- plm(gdp_perhour ~ lag(life_exp) + lag(ppp_spend), data = prod, 
                      model = "pooling",
                      index = c("REF_AREA", "TIME_PERIOD"))
# FE by country
m2 <- plm(gdp_perhour ~ lag(life_exp) + lag(ppp_spend), data = prod, 
                      model = "within",
                      index = c("REF_AREA", "TIME_PERIOD"),
                      effect = c("individual"))

# FE by country, includes lagged outcome var
m2b <- plm(gdp_perhour ~ lag(life_exp) + lag(ppp_spend) + lag(gdp_perhour), data = prod, 
                      model = "within",
                      index = c("REF_AREA", "TIME_PERIOD"),
                      effect = c("individual"))

# TWFE: FE by country and time period
m3 <- plm(gdp_perhour ~ lag(life_exp) + lag(ppp_spend), data = prod, 
                      model = "within",
                      index = c("REF_AREA", "TIME_PERIOD"),
                      effect = c("twoway"))

# TWFE, includes lagged outcome
m3b <- plm(gdp_perhour ~ lag(life_exp) + lag(ppp_spend) + lag(gdp_perhour), data = prod, 
                      model = "within",
                      index = c("REF_AREA", "TIME_PERIOD"),
                      effect = c("twoway"))

modelsummary(list(m1, m2, m2b, m3, m3b))
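The name "within" refers to using only variation within each unit. The sketch below illustrates the idea in base R on simulated data, without `plm()`: it subtracts each unit's mean from the outcome and predictor, then runs OLS on the demeaned data, and compares the slope with a regression that includes a dummy variable for each unit. The data, the true slope of 2, and all object names are assumptions for illustration.

```r
# simulated panel: 3 units observed 6 times each, true slope of x is 2
set.seed(1)
toy <- data.frame(unit = rep(c("A", "B", "C"), each = 6),
                  x = rnorm(18))
unit_effect <- c(A = 5, B = -2, C = 10)
toy$y <- 2 * toy$x + unname(unit_effect[toy$unit]) + rnorm(18, sd = 0.5)

# option 1: a dummy variable for each unit
lsdv <- lm(y ~ x + factor(unit), data = toy)

# option 2: subtract unit means ("within" transformation), then OLS
toy$x_dm <- toy$x - ave(toy$x, toy$unit)
toy$y_dm <- toy$y - ave(toy$y, toy$unit)
within_fit <- lm(y_dm ~ x_dm - 1, data = toy)

c(dummies = coef(lsdv)[["x"]], demeaned = coef(within_fit)[["x_dm"]])
```

The two slopes match. The standard errors from the hand-demeaned regression will not match, because `lm()` does not know that unit means were estimated, which is one reason to let `plm()` handle this for you.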


We can use the Breusch-Godfrey/Wooldridge test to check for serial correlation. This time the null hypothesis is *no serial correlation*. 

In [None]:
#Breusch/Godfrey test for serial correlation
pbgtest(m3)
pbgtest(m3b)

Do our models suffer from serially correlated errors? Interpret the test in a Markdown block.

In [None]:
# we can use plm with model = "fd" to run a first difference regression. This will 
# difference the outcome variable for us.
# I'm also nesting diff() inside of lag() to get change measures of the 
# predictor variables, lagged a year. 
m4 <- plm(gdp_perhour ~ lag(diff(life_exp)) + lag(diff(ppp_spend)) , data = prod, 
                      model = "fd",
                      index = c("REF_AREA", "TIME_PERIOD"))

pbgtest(m4)
summary(m4)

Interpret the results of the first-difference model (`m4`) in some Markdown text. Is a lagged change in life expectancy associated with increases in economic productivity? If so, how much? Does our first difference model have a serial correlation problem? 