Feature engineering cumulative impact
================================
*A companion notebook of R code to the articles:*

* [Part I](TBD)
* [Part II](TBD)

# Part I

In [81]:
library(dplyr)
library(splines)
library(ggplot2)

## Motivating Articles

[Fatigue and fitness modelled from the effects of training on performance](https://www.researchgate.net/profile/Robin_Candau/publication/15242395_Fatigue_and_fitness_modelled_from_the_effects_of_training_on_performance/links/55720f2608ae7536374cdc09/Fatigue-and-fitness-modelled-from-the-effects-of-training-on-performance.pdf) (Busso, Candau, and Lacour 1994, referred to as "BCL94" in the companion article)

[Modeling human performance in running](https://www.researchgate.net/publication/20910238_Modeling_human_performance_in_running) (Morton, Clarke and Banister 1990)

[Convolution notes from MIT's Jeremy Orloff](https://math.mit.edu/~jorloff/suppnotes/suppnotes03/i.pdf)


In [82]:
train_df <- data.frame(day = 1:259, day_of_week = 0:258 %% 7)
train_df$period <- ifelse(train_df$day <= 147, "build-up", "competition")
train_df$w <- with(train_df,
  -24 * (day_of_week == 0) +
   12 * (day_of_week == 1) +
    8 * (day_of_week == 2) +
    0 * (day_of_week == 3) +
    6 * (day_of_week == 4) +
   -8 * (day_of_week == 5) +
    6 * (day_of_week == 6))

train_df$w <- rpois(nrow(train_df),
                    train_df$w + ifelse(train_df$period == "build-up", 34, 24))

cat("Average training intensity during build up phase:",
    round(mean(train_df$w[train_df$day <= 147]), 2), "\n")
cat("Average training intensity during competition phase:",
    round(mean(train_df$w[train_df$day > 147]), 2))

ggplot(train_df, aes(x = day, y = w)) +
  geom_bar(aes(fill = period), stat = "identity") +
  ggtitle("Simulated daily training intensities for hammer thrower") +
  xlab("Day") + ylab("Training intensity") +
  theme(text = element_text(size = 16))


Average training intensity during build up phase: 34.45 
Average training intensity during competition phase: 23.55


Fitness and fatigue follow exponential decay: $g(t) = k \exp(-t / \tau)$

In [None]:
# Exponential decay and fitness-fatigue profiles
exp_decay <- function(t, tau) {
  exp(-t / tau)
}

grid_df <- rbind(data.frame(day = 1:259, level = 400 * exp_decay(1:259, 13),
                            type = "fatigue"),
                 data.frame(day = 1:259, level = 100 * exp_decay(1:259, 60),
                            type = "fitness"))

ggplot(grid_df, aes(x = day, y = level)) +
  geom_line(aes(color = type), size = 1.5) +
  ggtitle("Responses of fitness and fatigue to training impulse") +
  xlab("Day (n)") + ylab("Level of fitness or fatigue") +
  theme(text = element_text(size = 16))
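
To make the two time constants concrete: with $g(t) = k \exp(-t / \tau)$, the response falls to $1/e$ of its starting level after $\tau$ days and halves every $\tau \ln 2$ days. The short sketch below checks this for the time constants 13 and 60 used above. The helper names `decay` and `half_life` are introduced only for this illustration.

```r
# Exponential decay with time constant tau (same form as exp_decay above)
decay <- function(t, tau) exp(-t / tau)

# Days needed for the response to fall to half of its initial level
half_life <- function(tau) tau * log(2)

round(half_life(c(fatigue = 13, fitness = 60)), 1)
decay(half_life(13), 13)   # equals 0.5 by construction
```

Under these constants, fatigue halves in about 9 days while fitness takes about 42 days, which is why a training impulse hurts in the short run and helps in the long run.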


The effect of either fitness or fatigue at a given point in time is expressed as a convolution
of the training history with the relevant decay function:
$$
\sum_{i=1}^{n-1} w_i \exp \left(\frac{-(n-i)}{\tau} \right)
$$

In [None]:
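convolve_training <- function(training, n, tau) {
  # seq_len() yields an empty index on day 1, so the sum is 0 as in the formula
  # (1:(n - 1) would evaluate to c(1, 0) when n = 1 and pick up training[1])
  i <- seq_len(n - 1)
  sum(training[i] * exp_decay(n - i, tau))
}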

fitness <- sapply(1:nrow(train_df),
                  function(n) convolve_training(train_df$w, n, 60))

fatigue <- sapply(1:nrow(train_df),
                  function(n) convolve_training(train_df$w, n, 13))
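
Because the kernel is exponential, the sum above can also be built up one day at a time: writing $g_n$ for the feature on day $n$, we have $g_n = e^{-1/\tau} (g_{n-1} + w_{n-1})$ with $g_1 = 0$. The sketch below compares that recursion with a direct evaluation of the sum. The function names and the simulated `w_demo` vector are assumptions made for this illustration and are not used elsewhere in the notebook.

```r
# Direct O(n^2) evaluation of sum_{i=1}^{n-1} w_i * exp(-(n - i) / tau)
direct_feature <- function(w, tau) {
  sapply(seq_along(w), function(n) {
    i <- seq_len(n - 1)              # empty when n = 1, so the sum is 0
    sum(w[i] * exp(-(n - i) / tau))
  })
}

# Equivalent O(n) recursion: g_n = exp(-1 / tau) * (g_{n-1} + w_{n-1})
recursive_feature <- function(w, tau) {
  g <- numeric(length(w))
  for (n in seq_along(w)[-1]) {
    g[n] <- exp(-1 / tau) * (g[n - 1] + w[n - 1])
  }
  g
}

set.seed(1)
w_demo <- rpois(50, 30)              # hypothetical training intensities
all.equal(direct_feature(w_demo, 13), recursive_feature(w_demo, 13))
```

The recursion only works for exponential kernels; the direct sum is what generalizes to the arbitrary lag distributions considered in Part II.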


The final expected performance function for our simulated hammer thrower is:
$$
\text{E}(p_n) = 496 + 0.07 \sum_{i=1}^{n-1} w_i \exp \left(\frac{-(n-i)}{60} \right) - 0.27 \sum_{i=1}^{n-1} w_i \exp \left(\frac{-(n-i)}{13} \right) 
$$

In [None]:
E_perf <- 496 + .07 * fitness - .27 * fatigue

set.seed(45345)
train_df$perf <- E_perf + 7.0 * rnorm(nrow(train_df))

components_df <- rbind(
  data.frame(level = .27 * fatigue, day = train_df$day, type = "fatigue"),
  data.frame(level = .07 * fitness, day = train_df$day, type = "fitness"),
  data.frame(level = E_perf - 496, day = train_df$day, type = "performance"))

ggplot(components_df, aes(x = day, y = level)) +
  geom_col(data = train_df, aes(x = day, y = w), color = "grey", width = .2) +
  geom_line(aes(color = type), size = 1.5) +
  annotate("text", label = "training intensities", x = 70, y = 25,
           color = "grey32") +
  ggtitle("Modeled fitness, fatigue and relative performance") +
  xlab("Day (n)") + ylab("Component level on performance scale") +
  theme(text = element_text(size = 16))

In [None]:
# Recover parameters using non-linear regression
rss <- function(theta) {
  int  <- theta[1] # performance baseline
  k1   <- theta[2] # fitness weight
  k2   <- theta[3] # fatigue weight
  tau1 <- theta[4] # fitness decay
  tau2 <- theta[5] # fatigue decay

  fitness <- sapply(1:nrow(train_df),
                    function(n) convolve_training(train_df$w, n, tau1))

  fatigue <- sapply(1:nrow(train_df),
                    function(n) convolve_training(train_df$w, n, tau2))

  perf_hat <- int + k1 * fitness - k2 * fatigue
  return(sum((train_df$perf - perf_hat) ^ 2))
}


optim_results <- optim(c(400, .05, .15, 20, 5), rss, method = "BFGS",
                       hessian = TRUE, control = list(maxit = 1000))
                    
print(optim_results$convergence) # 0 means the algorithm has converged

In [None]:
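# The objective is a residual sum of squares, so the covariance of the
# estimates is approximately 2 * sigma^2 * H^{-1}, with H the Hessian of the RSS
n_obs <- nrow(train_df)
sigma2_hat <- optim_results$value / (n_obs - length(optim_results$par))
VarCov <- 2 * sigma2_hat * solve(optim_results$hessian)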
parm_names <- c("baseline", "fitness weight", "fatigue weight",
                "fitness time const", "fatigue time const")
for (i in 1:5) {
  cat(parm_names[i], "estimate:", round(optim_results$par[i], 2),
      ", std.err.:", round(sqrt(diag(VarCov))[i], 2), "\n")
}

In [None]:
get_performance <- function(theta) {
  int  <- theta[1] # performance baseline
  k1   <- theta[2] # fitness weight
  k2   <- theta[3] # fatigue weight
  tau1 <- theta[4] # fitness decay
  tau2 <- theta[5] # fatigue decay

  fitness <- sapply(1:nrow(train_df),
                    function(n) convolve_training(train_df$w, n, tau1))

  fatigue <- sapply(1:nrow(train_df),
                    function(n) convolve_training(train_df$w, n, tau2))

  int + k1 * fitness - k2 * fatigue
}
                    
train_df$perf_hat <- get_performance(optim_results$par)


In [None]:
ggplot(train_df) +
  geom_point(aes(x = day, y = perf)) +
  geom_line(aes(x = day, y = perf_hat), color = "blue", size = .9) +
  ggtitle("Performance, observed and modeled") +
  xlab("Day (n)") + ylab("Performance") +
  theme(text = element_text(size = 16))

# Part II
## Using splines to approximate the decay function

In Part I, we had *fitness* and *fatigue* features of the form:
$$
\sum_{i=1}^{n-1} w_i \exp \left(\frac{-(n-i)}{\tau} \right),
$$
where the exponential decay of both fitness and fatigue from their initial levels arose from a first-order linear dynamic system. But what if that model is not right? Suppose instead that the true "decay function," which I'll hereby refer to as a "lag distribution" (see [Sims 1971](https://www.jstor.org/stable/1913265?seq=1#page_scan_tab_contents)) since it might not be strictly decreasing, is an arbitrary smooth, continuous function $f(x)$. As a smooth, continuous function, $f(x)$ is a good candidate for approximation via basis functions taking the form
$$
\eta(t) = \theta_1 + \sum_{j=2}^p \theta_j g_j(t),
$$
and in this article we'll consider parametric cubic splines for the job. 
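
As a quick sanity check that such a basis expansion can represent a plausible lag distribution, the sketch below fits a natural cubic spline to the fitness-minus-fatigue curve implied by the Part I coefficients. Using `ns()` from the `splines` package and these particular knot locations are assumptions made for this illustration, not necessarily the choices made later in the analysis.

```r
library(splines)

lag <- 1:259

# Stand-in for an "unknown" lag distribution: the Part I fitness-minus-fatigue curve
target <- 0.07 * exp(-lag / 60) - 0.27 * exp(-lag / 13)

# Natural cubic spline basis g_2(t), ..., g_p(t); knots are packed near short
# lags, where the curve changes fastest (an arbitrary choice for this sketch)
basis <- ns(lag, knots = c(5, 10, 20, 40, 80, 160))

# Least-squares fit of eta(t) = theta_1 + sum_j theta_j g_j(t) to the target
fit <- lm(target ~ basis)
max(abs(resid(fit)))                 # worst-case approximation error
```

Here the intercept of `fit` plays the role of $\theta_1$ and the coefficients on the basis columns play the role of $\theta_2, \ldots, \theta_p$.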

Consider our general purpose convolution-based feature
$$
r_s = \sum_{i = 1}^{s-1} w_i \eta(s - i), \quad s = 1, \ldots, n.
$$

Expanding $\eta(t)$ leads to
$$
r_s = \theta_1 \sum_{i=1}^{s-1}  w_i + \sum_{j=2}^p \theta_j \sum_{i = 1}^{s-1} w_i g_j(s-i).
$$

Thus, implementing this convolution-based feature amounts to a linear regression on the following $p$ covariates, with $\theta_1, \ldots, \theta_p$ as the regression coefficients:

\begin{align*}
z_1 &= \sum_{i=1}^{s-1} w_i, \\
z_j &= \sum_{i = 1}^{s-1} w_i g_j(s-i), \; j = 2, \ldots, p.
\end{align*}

Having multiple convolution-based covariates built from training intensity is reminiscent of *fitness* and *fatigue*, which were exponentials with different decay time constants convolved with training intensity.
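
A minimal sketch of how these covariates could be assembled and passed to `lm()` follows. Everything in it is an assumption made for illustration: the simulated intensities `w`, the simulated performance `perf`, the helper `convolve_basis`, and the knot locations given to `ns()`.

```r
library(splines)

set.seed(1)
n <- 120
w <- rpois(n, 30)                    # hypothetical daily training intensities

# Spline basis g_2, ..., g_p evaluated at every possible lag 1, ..., n - 1
lags  <- 1:(n - 1)
basis <- ns(lags, knots = c(5, 10, 20, 40))   # arbitrary knot locations

# Convolve the training history with one lag function g, where g[k] = g(k)
convolve_basis <- function(w, g) {
  sapply(seq_along(w), function(s) {
    i <- seq_len(s - 1)              # empty on day 1, so the feature is 0
    sum(w[i] * g[s - i])
  })
}

z1 <- c(0, cumsum(w)[-n])            # z_1: running total of past training
Z  <- cbind(z1, apply(basis, 2, function(g) convolve_basis(w, g)))

# Simulated performance driven by the Part I lag distribution plus noise
true_lag <- 0.07 * exp(-lags / 60) - 0.27 * exp(-lags / 13)
perf <- 496 + convolve_basis(w, true_lag) + rnorm(n, sd = 7)

fit <- lm(perf ~ Z)                  # theta_1, ..., theta_p are the slopes on Z
```

The intercept of `fit` estimates the performance baseline, while the slopes on the columns of `Z` estimate $\theta_1, \ldots, \theta_p$.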


###TODO: confirm that the equally spaced, unit-interval disclaimer is made in Parts I and II.





In [None]:
combined_fn <- function(t) {
  0.07 * exp_decay(t, 60) - 0.27 * exp_decay(t, 13)
}

combined_fn(1:259)
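
The combined function is the net response to a single unit of training: negative at first, while fatigue dominates, and positive later. As a rough sketch, base R's `uniroot()` and `optimize()` can locate the day on which the net effect turns positive and the day on which it peaks. The name `impulse_response` and the search intervals are choices made for this illustration.

```r
# Combined impulse response with the Part I coefficients (same as combined_fn)
impulse_response <- function(t) {
  0.07 * exp(-t / 60) - 0.27 * exp(-t / 13)
}

# Day on which the net effect of one training session turns positive
break_even <- uniroot(impulse_response, interval = c(1, 100))$root

# Day on which the net benefit peaks
peak <- optimize(impulse_response, interval = c(1, 259), maximum = TRUE)$maximum

round(c(break_even = break_even, peak = peak), 1)
```

With these coefficients, the break-even point falls at about day 22 and the peak at about day 48, which matches solving $0.07 e^{-t/60} = 0.27 e^{-t/13}$ and its derivative counterpart by hand.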
