### The problem
- Suppose that we have two sets of data, $X$ and $Y$, each consisting of ordered streams of data points, $(x_1, x_2,\ldots, x_t, \ldots)$ and $(y_1, y_2,\ldots, y_s, \ldots)$ respectively, for example consecutive frames from a video.
- The goal is to learn a mapping from the domain $X$ to the domain $Y$.
- A naive approach would treat a set of sequences as the set of all elements of those sequences, disregarding the order in which the elements appear.
- For example, we would treat a set of video clips as the set of individual frames in those clips.
- It turns out that taking advantage of the temporal ordering yields much better results.

### The model
- Recurrent temporal predictor $P$:
    - Takes as input the sequence $x_1,...,x_t$ and predicts the next element $x_{t+1}$ conditioned on the previous elements. 
    $$x_{t+1} = P_X(x_{1:t})$$
    - Loss function
        $$L_\tau(P_X) = \sum_t\lVert x_{t+1} - P_X(x_{1:t})\rVert^2$$
- Based on this, a recycle loss is defined:
    $$L_r(G_X, G_Y, P_Y) = \sum_t\lVert x_{t+1} - G_X(P_Y(G_Y(x_{1:t})))\rVert^2$$
- Here the predictor takes as input a sequence of generated samples $G_Y(x_1),...,G_Y(x_t)$ to predict the next one.
- The generator $G_Y$ maps from $X$ to $Y$.
- The predicted samples are then mapped back to $X$ via $G_X$, and the loss between these and the elements of the original sequence is minimised.
- The complete loss function:
$$\begin{aligned}
&\underset{\text{generator loss for $G_X$}}{L_g(G_X, D_X)}
+ \underset{\text{generator loss for $G_Y$}}{L_g(G_Y, D_Y)}\\
&+ \underset{\text{recycle loss for the mapping $Y \longrightarrow X$}}{\lambda_{rx}L_r(G_X, G_Y, P_Y)}
+ \underset{\text{recycle loss for the mapping $X \longrightarrow Y$}}{\lambda_{ry}L_r(G_Y, G_X, P_X)}\\
&+ \underset{\text{recurrent loss for $X$}}{\lambda_{\tau x}L_\tau(P_X)}
+ \underset{\text{recurrent loss for $Y$}}{\lambda_{\tau y}L_\tau(P_Y)}
\end{aligned}$$
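
To make the two non-adversarial terms concrete, here is a minimal NumPy sketch, not the authors' implementation. It follows the description above: the recurrent loss compares each frame with the prediction from its predecessors, and the recycle loss translates the frames with $G_Y$, predicts the next translated frame with $P_Y$, maps the prediction back with $G_X$, and compares it with the original frame. The function names and the toy stand-ins (`repeat_last`, `g_y`, `g_x`) are hypothetical and chosen only for illustration; real models would be neural networks.

```python
import numpy as np

def recurrent_loss(frames, predictor):
    """Sum over t of ||x_{t+1} - P(x_{1:t})||^2 for one sequence."""
    total = 0.0
    for t in range(1, len(frames)):
        prediction = predictor(frames[:t])        # P(x_{1:t})
        total += np.sum((frames[t] - prediction) ** 2)
    return float(total)

def recycle_loss(frames, g_forward, g_back, predictor_other):
    """Sum over t of ||x_{t+1} - G_X(P_Y(G_Y(x_{1:t})))||^2 for one sequence."""
    # Translate every frame to the other domain: G_Y(x_1), ..., G_Y(x_T).
    translated = np.stack([g_forward(x) for x in frames])
    total = 0.0
    for t in range(1, len(frames)):
        predicted_other = predictor_other(translated[:t])  # next frame in Y
        recycled = g_back(predicted_other)                 # mapped back to X
        total += np.sum((frames[t] - recycled) ** 2)
    return float(total)

# Toy stand-ins (assumptions for illustration only).
repeat_last = lambda prefix: prefix[-1]   # "predicts" the last seen frame
g_y = lambda x: 2.0 * x                   # pretend X -> Y mapping
g_x = lambda y: y / 2.0                   # pretend Y -> X mapping

frames = np.arange(4, dtype=float).reshape(4, 1)  # frames 0, 1, 2, 3
loss_tau = recurrent_loss(frames, repeat_last)
loss_rec = recycle_loss(frames, g_y, g_x, repeat_last)
```

With these stand-ins `g_x` exactly inverts `g_y`, so the recycle loss reduces to the recurrent loss; with learned networks the two terms generally differ.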


### Generating sequences
- A naive approach is to generate a video frame by frame where $y_t = G_Y(x_t)$.
- However, we could also incorporate temporal information by using $P_Y$ to smooth the output:

    $$y_t = f(G_Y(x_t),P_Y(G_Y(x_{1:t-1})))$$ 
    
- Here $f$ could be simple averaging: 

    $$y_t = \frac{G_Y(x_t) + P_Y(G_Y(x_{1:t-1}))}{2}$$
    
- It could also be a non-linear function and possibly one that is learned.
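
The averaging variant can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' code: `generate_sequence`, `generator`, `predictor`, and `combine` are hypothetical names, and the first frame is passed through unsmoothed because there is no history to predict from (the notes do not say how that case is handled).

```python
import numpy as np

def generate_sequence(frames, generator, predictor, combine=None):
    """Compute y_t = f(G_Y(x_t), P_Y(G_Y(x_{1:t-1}))) for every frame."""
    if combine is None:
        # Default f: simple averaging of the two estimates.
        combine = lambda translated, predicted: (translated + predicted) / 2.0

    translated = [generator(x) for x in frames]   # G_Y(x_t) for all t
    outputs = []
    for t, current in enumerate(translated):
        if t == 0:
            # Assumption: no history for the first frame, so use G_Y(x_1) as is.
            outputs.append(current)
        else:
            history = np.stack(translated[:t])    # G_Y(x_{1:t-1})
            outputs.append(combine(current, predictor(history)))
    return np.stack(outputs)

# Toy stand-ins (assumptions for illustration only).
toy_generator = lambda x: x + 10.0        # pretend X -> Y mapping
toy_predictor = lambda prefix: prefix[-1]  # "predicts" the last seen frame

frames = np.array([[0.0], [2.0], [4.0]])
smoothed = generate_sequence(frames, toy_generator, toy_predictor)
```

Passing a different `combine` callable is one way to swap in a non-linear or learned $f$.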


### Implementation details
- Spatial translation model uses CycleGAN
- Temporal prediction model uses Pix2Pix
- Discriminator is a $70 \times 70$ PatchGAN
- Same network architecture for $G_X$ and $G_Y$
- Input size is $256 \times 256$
- Temporal predictors
    - U-Net architecture
    - Input is the last two frames (does this mean $P(x_{1:t}) = P(x_{t-1}, x_t)$?)
- All the loss weights $\lambda_s = 10$
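
One possible reading of "input is last two frames" is that the two most recent frames are concatenated along the channel axis before being fed to the U-Net. The sketch below illustrates only that reading; the channel-last layout, the concatenation itself, and the helper name `predictor_input` are assumptions, not details confirmed by these notes.

```python
import numpy as np

def predictor_input(frames):
    """Stack the two most recent frames along the channel axis.

    Assumes `frames` has shape (T, H, W, C) with T >= 2 (channel-last layout).
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    return np.concatenate([frames[-2], frames[-1]], axis=-1)

# A hypothetical clip of five 256 x 256 RGB frames.
clip = np.zeros((5, 256, 256, 3), dtype=np.float32)
stacked = predictor_input(clip)   # shape (256, 256, 6)
```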

