## Objectives

* Understand RNNs and the differences in structure from CNNs
* Cover a short history of RNNs for time series forecasting and prediction
* Cover challenges with training and deploying RNNs
* Cover CNN and Transformer models for time series prediction and how they address the challenges of RNNs, which were previously state of the art for text and time series

### RNN Architectures for forecasting and change detection in 1-D and imagery sequences

Up until now we have discussed dealing with fixed-length datasets consisting of 3-band images. CNNs work great with these fixed-length datasets, but when we need to understand the relationship between images in a sequence, we need to adopt a more complex structure that models the interaction along the time dimension.

Recurrent Neural Networks, or RNNs, model variable-length sequences. These can be text, 1-dimensional time series, or sequences of images, including video. Unlike CNNs, RNNs have a concept of "hidden state" for each time step, which is the output of the computation from all of the hidden layers in the RNN. This hidden state is passed as an input to the next time step in addition to the sequence input. Below is an example using an RNN for next-letter prediction to model the letter "s" in the word "dogs".

:::{figure-md} RNNFig
<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*dyfwJJuGT2Svy10iYAVDEQ.png" width="450px">

[Ben Khuong, The Basics of Recurrent Neural Networks (RNNs)](https://miro.medium.com/v2/resize:fit:720/format:webp/1*dyfwJJuGT2Svy10iYAVDEQ.png)
:::

Here we see that the hidden states, "h", must be computed from individual samples from the sequence. The nodes of the hidden layers, "a", represent the learned weights and biases for a particular hidden state. Because we need to compute the hidden state for each sample given the preceding hidden state, we cannot compute hidden states in parallel, and can only parallelize the computation of one hidden state at a time. In other words, the hidden states must be computed sequentially.

This has substantial implications for the computational burden of RNNs: it is difficult to parallelize RNN computation, so they take longer to train than non-recurrent architectures. We'll revisit this limitation when discussing alternatives to RNNs.
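To make the sequential dependency concrete, here is a minimal sketch of an RNN forward pass. It assumes a scalar input and a hidden state of size 1, and the function name `rnn_forward` and the weight values are hypothetical choices for illustration, not part of any library.

```python
import math


def rnn_forward(inputs, w_xh, w_hh, b_h, h0=0.0):
    """Run a scalar (hidden size 1) RNN over a sequence and return all hidden states."""
    h = h0
    hidden_states = []
    for x_t in inputs:
        # h_t depends on h_{t-1}, so this loop cannot run in parallel over time steps.
        h = math.tanh(w_xh * x_t + w_hh * h + b_h)
        hidden_states.append(h)
    return hidden_states


# Hypothetical weights and a three-step input sequence.
states = rnn_forward([0.5, -1.0, 0.25], w_xh=0.8, w_hh=0.5, b_h=0.0)
print(states)
```

Only the work inside a single time step (for example, the matrix multiplications in a real, vector-valued RNN) can be parallelized. The loop over time steps itself has to run in order.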

The recurrent connections above relate each time step to its adjacent time steps. This is akin to the network having a very short-term memory biased toward a lag of t=1. While inputs farther back than t=1 can influence a particular hidden state, regular RNNs suffer from the vanishing (or exploding) gradient problem: the partial derivatives (gradients) with respect to states many time steps away are products of many per-step factors, so they can become very small, eventually reaching 0, or very large. However, when dealing with land cover modeling in remote sensing time series, we often need to model processes that occur over many different time scales, such as the influence of seasonality, a preceding year's snowfall or precipitation effects on the current year's vegetation, and the influence of yesterday's commercial logging activity on today's forest cover. These complex, multi-scale interactions require architectures that can model both short-term and longer-term connections between samples without numerical instability. Long Short-Term Memory networks, or LSTMs, were developed as an answer to these more complex sequence modeling problems and to address the issue of vanishing gradients. While LSTMs do not reduce the computational burden of training, let's review them so we can compare how other architectures address that challenge.
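The vanishing and exploding gradient problem described above can be illustrated numerically. The sketch below assumes a scalar recurrence with no inputs, `h_t = f(w_hh * h_{t-1})`, and multiplies the per-step derivatives together. The function name and the weight values are hypothetical.

```python
import math


def gradient_through_time(w_hh, steps, h0=0.1, activation="tanh"):
    """Return d h_T / d h_0 for the scalar recurrence h_t = f(w_hh * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        pre = w_hh * h
        if activation == "tanh":
            h = math.tanh(pre)
            local = w_hh * (1.0 - h * h)  # chain rule: f'(pre) * w_hh
        else:  # identity activation
            h = pre
            local = w_hh
        grad *= local  # the gradient is a product of per-step factors
    return grad


vanishing = gradient_through_time(0.5, steps=50)
exploding = gradient_through_time(1.5, steps=50, activation="linear")
print(f"vanishing: {vanishing:.3e}, exploding: {exploding:.3e}")
```

With a recurrent weight below 1 the product shrinks toward zero after 50 steps, and with a weight above 1 (and no saturating activation) it grows without bound.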

### Long Short-Term Memory Networks (Hochreiter and Schmidhuber, 1997)

LSTMs introduce the concept of "state" to the RNN computation, which we'll refer to as the "memory block". The memory block allows an LSTM to keep track of what learned context to forget, what to add or update, and how much of it to pass on to the next time step. These three information flows are controlled by *gates*. These gates reduce the vanishing gradient problem by allowing inputs to flow through the network without computing uninformative gradients, and by reducing the distance of connections between pieces of a sequence that are far away. The gates are applied multiplicatively and are computed with sigmoid activation functions. A helpful analogy is that the memory block state is the RAM of your computer, keeping updated information readily accessible to the LSTM.

:::{figure-md} LSTMRNNFig
<img src="https://d2l.ai/_images/lstm-2.svg" width="450px">

[Long Short-Term Memory (LSTM)](https://d2l.ai/_images/lstm-2.svg) from "Dive into Deep Learning" by d2l.ai, used under CC BY-SA 4.0
:::

In the above figure we see:

1. The input gate controls how much of the input should be used to influence the internal state of the current memory cell.
1. The forget gate controls how much of the previous memory cell state is kept or forgotten, so that stale context does not influence subsequent hidden states.
1. The output gate determines how much the current memory cell should influence the output (the hidden state) at this time step.
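The three gates listed above can be sketched as a single LSTM time step. This is a minimal illustration that assumes scalar inputs and states. The function name `lstm_step`, the parameter dictionary layout, and the parameter values are hypothetical and chosen only to show the gating behavior.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step with scalar input, hidden state, and memory cell.

    `params` maps each gate name to hypothetical (w_x, w_h, b) values.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    i = sigmoid(pre_activation("input"))    # how much new content to write
    f = sigmoid(pre_activation("forget"))   # how much old memory to keep
    o = sigmoid(pre_activation("output"))   # how much memory to expose
    c_tilde = math.tanh(pre_activation("candidate"))
    c = f * c_prev + i * c_tilde            # gates are applied multiplicatively
    h = o * math.tanh(c)
    return h, c


# Forget gate wide open and input gate closed: the memory cell is carried forward.
params = {
    "input": (0.0, 0.0, -20.0),
    "forget": (0.0, 0.0, 20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.7, params=params)
print(h, c)
```

Because the memory cell is updated additively and scaled by the forget gate, a gate value close to 1 lets information (and gradients) pass across many time steps largely unchanged.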

LSTMs and related RNN architectures such as bi-directional RNNs (Schuster and Paliwal, 1997) were state of the art for sequence prediction from 2011 until 2017, when Transformers were introduced. They're still relevant today in that aspects of recurrent connections are influential in Transformer and CNN-based architectures. However, the core limitation of RNNs, that they cannot be trained in parallel across time steps, makes them a less desirable option for training larger models on large image datasets. Let's look at how Transformers and other, simpler architectures address this problem and show state of the art performance on sequence prediction in general and image sequence prediction in particular.


### Encoder Decoder Architectures and The Transformer (Vaswani et al. 2017)

For nearly 30 years, CNNs have dominated computer vision and LSTMs have dominated natural language processing. While many advancements have been made that we have discussed in prior lessons, including new activation functions (ReLU), training improvements like Batch Norm, and architectural improvements like residual connections, the same classical architectures (CNNs and RNNs) have maintained their presence in the state of the art. The deep learning landscape has experienced a somewhat recent sea change in fundamental computing architecture with the introduction of the Transformer and self-attention.

Vaswani et al. (2017) originally introduced the Transformer for transduction tasks: converting input natural language sequences to output natural language sequences, such as machine translation, where the output is generated by predicting one word at a time. A few years later, Dosovitskiy et al. (2020) showed that the Transformer architecture could achieve near state of the art results on image classification compared to more complex CNN-based architectures, with lower computational cost to train.



When dealing with image time series, we typically either want to

1. do pixel-wise segmentation of the time series at each time step, producing an equivalent length time series of maps
1. predict a change map for the time series, which can be of a different, variable length depending on how we measure change.

In either case, we are converting an image sequence to another image sequence. Even in the case of single-date imagery, we are converting a sequence of bands to a sequence of length one. Sequence-to-sequence problems can be addressed by Encoder-Decoder architectures, where an Encoder computes image features and a Decoder uses those image features to make a prediction. Encoder-Decoder models are powerful because each section can be trained and used for inference independently.










### References

[i] [Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need.](https://arxiv.org/abs/1706.03762)

[ii] [Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Uszkoreit, J. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.](https://arxiv.org/abs/2010.11929)

### ConvLSTM and alternatives for time series prediction

* The "curse of dimensionality" comes into play even more so when working with deep time series and deep learning. The term is used in machine learning, statistics, and data analysis for the challenges that arise when dealing with high-dimensional data: as the dimensionality of a dataset increases, the volume of the space increases exponentially, which leads to various issues.
* Challenges and aspects related to the curse of dimensionality:
  * Sparse Data: As the number of dimensions increases, the amount of data needed to fill the space grows exponentially. Consequently, data becomes sparse. This means that most of the possible combinations of values are not observed, making it harder to identify patterns.
  * Distance Measures Become Less Informative: In high dimensions, the distances between data points tend to converge, i.e., all pairs of points seem "equally far apart". This is problematic for algorithms like k-nearest neighbors, where distance is a crucial factor.
  * Increased Computation: With more dimensions, the computational complexity of many algorithms increases, often exponentially. This means that processing times become longer, and more memory and storage space are needed.
  * Risk of Overfitting: With many dimensions, there's an increased likelihood of overfitting the model to the training data. This is because with more features, the model can fit noise or random fluctuations in the training data, leading to poorer generalization to new or unseen data.
  * Decreased Model Performance: With an increasing number of irrelevant features, the performance of many machine learning algorithms can degrade.
  * Intuition Breaks Down: It's difficult to visualize or comprehend data in very high dimensions, which makes it challenging to understand the structure of the data or the behavior of a model.
  * Harder to Ensure Quality: As the number of dimensions increases, ensuring data quality across all dimensions becomes more challenging. Missing or erroneous data can impact results more drastically.
* Combating the Curse:
  * Feature Selection: This involves identifying and using only the most informative features or dimensions.
  * Dimensionality Reduction: Techniques like Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), and autoencoders can be used to reduce the number of dimensions while retaining most of the data's variability.
  * Regularization: Methods like L1 (Lasso) and L2 (Ridge) regularization can be applied to penalize complex models, especially in regression contexts, to prevent overfitting.
  * Sampling: Techniques like random sampling or clustering can be used to reduce data size, and thus dimensionality, without losing too much information.
  * Ensemble Methods: Combining multiple models (like in random forests) can mitigate some of the risks of high dimensionality.
  * Domain Knowledge: Incorporating knowledge from the domain of application can help in intelligently selecting features and understanding the data's underlying structure.
* Working with time series is difficult in deep learning: the curse of dimensionality comes into play if you try to learn on too many features with too little data
* Image time series and video prediction carry a high computational and dataset burden
* In the remote sensing domain, due to dataset size, it is best to first try smaller models that convolve over the time dimension instead of the spatial dimensions
* See https://developmentseed.org/blog/2023-06-29-time-travel-pixels and https://arxiv.org/abs/2207.13159 and https://arxiv.org/abs/2304.14065
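The point above that distance measures become less informative in high dimensions can be checked with a small experiment. The sketch below assumes points drawn uniformly from the unit hypercube and measures how much farther the farthest point is than the nearest one, relative to the nearest. The function name `relative_contrast` and the sample sizes are hypothetical choices.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min distance from a random query to random points in [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


low_dim = relative_contrast(2)
high_dim = relative_contrast(1000)
print(f"2 dimensions: {low_dim:.2f}, 1000 dimensions: {high_dim:.2f}")
```

In two dimensions the nearest and farthest neighbors differ by a large factor, while in 1000 dimensions all points sit at nearly the same distance from the query, which is why nearest-neighbor style reasoning degrades.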

### Vision Transformers and CNN alternatives

* The computational burden of self-attention is also quadratic in the sequence length (the number of image patches, or tokens), which makes large images and long image time series expensive
* ConvMixer is an alternative with lower computational burden that is CNN and MLP based, available from the Keras team: https://huggingface.co/keras-io/convmixer
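A quick back-of-the-envelope calculation shows how the self-attention cost grows with image size. This sketch assumes square images, non-overlapping square patches, and no extra class token. The function name `attention_cost` is hypothetical.

```python
def attention_cost(image_size, patch_size):
    """Return (num_tokens, attention_matrix_entries) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    tokens = (image_size // patch_size) ** 2
    # Every token attends to every other token, so the matrix is tokens x tokens.
    return tokens, tokens * tokens


for size in (224, 448, 896):
    tokens, entries = attention_cost(size, patch_size=16)
    print(f"{size}x{size} px -> {tokens} tokens, {entries:,} attention entries per head")
```

Doubling the image side length multiplies the number of tokens by 4 and the number of attention entries by 16, per attention head and per layer.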





U-Net:
Architecture for image-to-image tasks. It has no fully connected layers because it doesn't compress feature information into a single category.
Downsamples the image, then upsamples it into a segmentation mask.
Reference: Ronneberger, Fischer, and Brox (2015), "U-Net: Convolutional Networks for Biomedical Image Segmentation".


ViT (Vision Transformer):
Uses attention to focus on certain parts of images.
Image split into patches, flattened, and transformed to embeddings.
Pre-training and fine-tuning mechanisms are similar as they are for CNNs.
Notable Paper: "An image is worth 16x16 words: Transformers for image recognition at scale".
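The patch-splitting and flattening step described above can be sketched in a few lines. This assumes an image stored as a nested list of shape height x width x channels and non-overlapping square patches. The function name `patchify` is hypothetical, and the learned linear projection and positional embeddings that follow in a real ViT are left out.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    flat.extend(pixel)  # append all channel values of this pixel
            patches.append(flat)
    return patches


# Toy 4 x 4 image with 3 bands; each pixel stores (row, col, 0) so patches are easy to inspect.
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)
print(len(patches), len(patches[0]))
```

Each flattened patch has `patch_size * patch_size * channels` values and plays the role that a word token plays in a text Transformer.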


### Advanced Concepts: Deep learning architectures for segmentation and change detection in image time series

* ConvLSTM: effective for time series data in computer vision.
* Knowledge Distillation: teacher-student networks where the student (smaller) network tries to imitate the teacher (larger) network.
* SCCANet (Spatial and Spectral Channel Attention Network): an evolved form of U-Net with attention mechanisms.
* Stacked AutoEncoder (SAM?):
  * Focuses on lossless compression.
  * Comparison with PCA and deep autoencoders.
  * Introduction of Variational autoencoders.
  * RaVAEn: Unsupervised learning for change detection.
* Image Retrieval: using vector databases for similarity computations.
* Image Captioning: [Insert: Brief Explanation and Examples]
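The teacher-student idea behind knowledge distillation noted above can be sketched as a loss between softened class probabilities. This is one common formulation, assumed here for illustration: a KL divergence between temperature-scaled softmax outputs. The function names and the logit values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class probabilities."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))


teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]
poor_student = [0.5, 4.0, 1.0]
print(distillation_loss(good_student, teacher), distillation_loss(poor_student, teacher))
```

A student whose outputs match the teacher's gets a loss near zero. In practice this term is usually combined with the ordinary loss on the ground-truth labels.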


### Generative Models

* Generative Adversarial Networks (GANs): introduction to Generator and Discriminator networks.
* DCGANs (Deep Convolutional GANs).
* Applications:
  * Generating human images ("This person does not exist").
  * GAN-based style transfer.
  * Age transitions using GANs.
  * Super-resolution.
* Challenges: formulating networks to "compete" and achieve desired results.
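The way the generator and discriminator "compete" can be sketched through their loss functions. The sketch below assumes the discriminator outputs a probability that an image is real and uses binary cross-entropy with the commonly used non-saturating generator loss. The function names are hypothetical and no actual networks are trained here.

```python
import math


def bce(prediction, target):
    """Binary cross-entropy for a single probability, clipped away from 0 and 1."""
    eps = 1e-12
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def discriminator_loss(d_real, d_fake):
    """The discriminator wants real images scored as 1 and generated images as 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)


def generator_loss(d_fake):
    """Non-saturating generator loss: the generator wants its images scored as 1."""
    return bce(d_fake, 1.0)


# d_real / d_fake are the discriminator's outputs on a real and a generated image.
print(discriminator_loss(d_real=0.9, d_fake=0.1))  # confident, correct discriminator
print(generator_loss(d_fake=0.1))                  # generator is being caught
```

The two objectives pull in opposite directions on `d_fake`, which is what makes the training adversarial and also what makes it hard to balance.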

# Spatio-temporal transformer STT

Spatial transformer

Set and t models outperform spatial models

Adversarial label-efficient satellite image change detection
- Manual inspection is out of reach
- Fully automatic solutions reach their limits on irrelevant changes: illumination, occlusion, noise, blurring
- Remove irrelevant changes with normalization and registration
- Reducing the variability
- Or model the variability and remove it with ML (DNNs, SVMs)
- Hard to know how to model variability
- Change detection method with active learning
- Q&A model
- Finds relevant exemplars to label using adversarial criterion that mixes diversity, representativity

# Change detection

unet+MSOF

FC+Siamese

CLM: novel model for multi scale features and multi level context

EMS-CDNet = uses partition unit for faster feature extraction and merging

But it struggled with datasets that are not strictly registered

M2-CDNET addresses this. Deformable convolution learns offsets between given features to alleviate pixel displacement caused by registration errors. As the scale shrinks…

Future work
NeRF, reduce registration errors and projection bias

en.skyearth.org

Zhangyj


New talk: VHR change detection

HCGMNet-CD



# Challenges in forecasting and detection for natural disasters

Big data
Labeling - imbalance, noisy labels
Generalization
Uncertainty in forecasting
Stochastic nature, complex, non-linear interactions between earth system variables
EFFIS system for fire prediction in Europe assumes pine fuel everywhere

FireCube: a daily data cube for the modeling and analysis of wildfires in Greece. The dataset is on Zenodo

Model types
Pixel-wise = RF, XGBoost
Temporal pixel-wise: LSTM
Spatiotemporal: ConvLSTM

Showed ConvLSTM does best, better than XGBoost!

Predictive accuracy isn't enough; the fire brigade wants to know why. Explainable AI
Can see what model is paying attention to, which variables and which times

Bayesian neural network for uncertainty estimation

Instead of scalar weights that are constant, replace them with distributions
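Replacing a constant weight with a distribution can be sketched with a one-weight model. The example below assumes a linear model `y = w * x + b` whose weight follows a Gaussian, and estimates the predictive mean and spread by Monte Carlo sampling. The function name and all parameter values are hypothetical.

```python
import random
import statistics


def predict_with_uncertainty(x, weight_mean, weight_std, bias, n_samples=1000, seed=0):
    """Monte Carlo prediction for y = w * x + b with w ~ Normal(weight_mean, weight_std)."""
    rng = random.Random(seed)
    samples = [rng.gauss(weight_mean, weight_std) * x + bias for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)


mean_near, std_near = predict_with_uncertainty(x=1.0, weight_mean=2.0, weight_std=0.2, bias=0.5)
mean_far, std_far = predict_with_uncertainty(x=10.0, weight_mean=2.0, weight_std=0.2, bias=0.5)
print(f"x=1:  {mean_near:.2f} +/- {std_near:.2f}")
print(f"x=10: {mean_far:.2f} +/- {std_far:.2f}")
```

Each sampled weight gives a slightly different prediction, and the spread of those predictions serves as the uncertainty estimate. A real Bayesian neural network learns the mean and spread of every weight during training.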

Used a U-Net to segment wildfire forecast

U-Net doesn't account for teleconnections, e.g. memory effects from last year's drought

TeleViT: Teleconnection Vision Transformer

More resilient to long forecasting windows!

AUPRC drops from 62% to 61% for TeleViT, instead of from 62% to 58% for U-Net++, over a long forecasting window

Annotated interferograms of volcanos, the segmentation, labels for deformation, activity type

ODC and ESDL are full data cube frameworks

Datacube challenges, autocorrelation
Difficult to model high-dimensional data. Easy to overfit and need to regularize
Computation issues

Float32 values are 4 bytes each.
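The 4-bytes-per-value figure makes it easy to estimate how much memory a data cube needs. The cube dimensions below (one year of daily steps, 10 variables, 1024 x 1024 pixels) are hypothetical and chosen only to show the arithmetic.

```python
def datacube_bytes(time_steps, bands, height, width, bytes_per_value=4):
    """Memory needed to hold a dense data cube, float32 by default."""
    return time_steps * bands * height * width * bytes_per_value


size = datacube_bytes(time_steps=365, bands=10, height=1024, width=1024)
print(f"{size / 1024**3:.1f} GiB")  # prints "14.3 GiB"
```

Even this modest cube does not fit in the memory of many GPUs, which is one reason data cubes are usually read in chunks.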

Probabilistic ML hands on

Discard a fraction of the most uncertain samples; inference on the remaining, more certain samples should lead to lower loss. If the loss decreases as more of the uncertain samples are discarded, the uncertainty estimate is informative.
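This check can be sketched as a small function that drops the most uncertain samples and recomputes the mean loss. The per-sample losses and uncertainties below are made-up illustrative values, and the function name is hypothetical.

```python
def loss_after_rejection(losses, uncertainties, reject_fraction):
    """Mean loss after discarding the given fraction of most uncertain samples."""
    if not 0.0 <= reject_fraction < 1.0:
        raise ValueError("reject_fraction must be in [0, 1)")
    # Sort sample indices from most certain to least certain.
    order = sorted(range(len(losses)), key=lambda i: uncertainties[i])
    n_keep = max(1, round(len(losses) * (1.0 - reject_fraction)))
    kept = order[:n_keep]
    return sum(losses[i] for i in kept) / n_keep


# Made-up per-sample losses and predicted uncertainties.
losses = [0.1, 0.2, 0.9, 0.3, 1.5, 0.05, 0.4, 1.1]
uncertainties = [0.2, 0.3, 0.8, 0.4, 0.9, 0.1, 0.5, 0.7]
curve = [loss_after_rejection(losses, uncertainties, f) for f in (0.0, 0.25, 0.5)]
print(curve)
```

A curve that falls as the rejection fraction grows indicates that high uncertainty really does flag the samples the model gets wrong. A flat or rising curve suggests the uncertainty estimate is not useful.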

SeasFire

Uncertainty estimation

1. Copernicus foundation models for dimensionality reduction, multi-modal data fusion
2. Out-of-distribution generalization
3. Earth as a graph, capture teleconnections. SSL for Earth as a graph, long-term interactions
4. Causality and physics-guided ML

* Structure of intro to deep learning
* Convolutional NNs
    * Convolution + subsampling + () + … + fully connected
* Convolved feature creation graphic gif
* Activation
    * Simplifies learning, makes learning faster
    * Avoids saturation issues
* Subsampling, example: max pooling. Scale invariance. Params: type, filter size, stride
* LeCun et al. 1998, LeNet, first CNN
* Inception (GoogLeNet, 2014): convolutions with 3x3 and 5x5
* Residuals = bypass some convolutions and add skip connections. ResNet
* Allows very deep networks. 150 layers
* Kaiming He 2016 CVPR "Deep Residual Learning for Image Recognition"
* DenseNet. Pushes residual connections by connecting everything to everything
* U-Net - arch that goes from images to images (paper?)
* Shrinks image down, then samples it back up
* ViT
    * Attention makes system focus on specific instances
    * Split image into fixed patches, sizes
    * Flatten image patches
    * Create features, embeddings for patches
    * Positional embeddings
    * Feed embedding sequence to transformer encoder
    * Pretrain it with image labels
    * Retrain many samples, fine-tune
    * An image is worth 16x16 words: Transformers for image recognition at scale
* ConvLSTM for time series
* Knowledge distillation, teacher student networks. smaller student network tries to approximate teacher
* Show different data types, classification to instance segmentation
* SegNet = CNN encoder-decoder. Conv + batch norm + ReLU. Pooling. Upsampling. Softmax
* SegNet differs slightly from U-Net. No bottleneck for SegNet. SegNet is fully convolutional. 2017
* U-Net more popular than SegNet
* 3D Unet possible. 4D Unet possible
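The subsampling step from the notes above (max pooling with a filter size and stride) can be sketched without any framework. This assumes a single-channel feature map stored as a nested list and no padding. The function name `max_pool_2d` is hypothetical.

```python
def max_pool_2d(feature_map, size=2, stride=2):
    """Max pooling over a 2-D nested list without padding."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, height - size + 1, stride):
        row = []
        for left in range(0, width - size + 1, stride):
            window = [feature_map[r][c]
                      for r in range(top, top + size)
                      for c in range(left, left + size)]
            row.append(max(window))  # keep only the strongest response per window
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 6],
]
pooled = max_pool_2d(feature_map)
print(pooled)  # [[4, 5], [3, 7]]
```

With a 2x2 filter and a stride of 2, each spatial dimension is halved, which is the "shrinks image down" half of encoder-decoder architectures such as U-Net.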

* SCCANet Spatial and Spectral Channel Attention Network
* The size of intermediate layers is very trial and error, and it is empirical whether it works
* Basically still a Unet
* There’s a lot of work that is basically just Unet or X Architecture that is published
* Focus on using established architectures published from reliable sources in easy to use libraries/ml frameworks

Stacked AutoEncoder (SAM?)
* Lossless compression
* Comparison to PCA, retrieving input in the output
* Even with 730 or 30 intermediate dimensions
* Deep auto encoders
* Intermediate representations are decoded
* Network extracts higher level features 
* Variational auto encoders
    * Encoding distribution is regularized during training to ensure latent space can generate new data
    * Difference between auto encoder to variational auto encoder
* RaVAEn unsupervised learning for change detection
    * Detecting floods without labels
    * Show many examples of unlabeled flooding
    * Time series of images that include after event image
    * Is it necessary to order the after event image as the final image in the time series?
    * Need geographic representation
    * Used Sentinel-2. 4 before event 1 after event
    * Cosine baseline vs cosine embeddings?
    * Cosine embedding matched the true label more
* Image retrieval - vector database
    * Query image to do similarity computation
* Image captioning
* Generators?
    * GANs start from random inputs and generate images
    * The adversarial part is that two networks compete: one generates images, the other predicts whether its counterpart's outputs are fake or not
    * Generator network and discriminator network
    * DCGANs, deep convolutional generative adversarial networks
    * This person does not exist, old news, better models now
    * GAN-based style transfer
    * GAN transitions (aging)
    * No ground truth!
    * GANs for super resolution. Nothing new in terms of arch. CNNs, max pooling, ReLU, sigmoid, etc.
    * Problem statement with GANs is formulating networks to connect and battle each other to arrive at a specific objective
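The embedding-based comparisons in the notes above (cosine similarity for change detection and for image retrieval) can be sketched as follows. This is an illustrative scoring rule, not necessarily the exact procedure used by RaVAEn: it assumes each image has already been encoded to a vector and scores change as the smallest cosine distance between the after-event embedding and any before-event embedding. The embeddings are made-up values.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def change_score(before_embeddings, after_embedding):
    """Smallest cosine distance between the after-event embedding and any earlier one."""
    return min(cosine_distance(e, after_embedding) for e in before_embeddings)


# Made-up 3-D embeddings for two before-event images and two candidate after-event images.
before = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]]
unchanged = change_score(before, [1.0, 0.05, 0.05])
changed = change_score(before, [0.0, 1.0, 0.0])
print(f"unchanged: {unchanged:.3f}, changed: {changed:.3f}")
```

The same distance function is what a vector database evaluates during image retrieval: the query image's embedding is compared against stored embeddings and the closest ones are returned.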


# GeoFM SAM notes

Presto
* Operates on pixel time series from multiple sensors and multiple channels
* Channel group embedding, positional embedding, embedding for lat/lon, and month embedding
* Somewhat marginal but clear improvement over other approach (also time series approach)
* Geographic representation across hemispheres and ecoregions
* Used Dynamic World as an input and to stratify
* Presto can be used as a feature extractor (for random forest and regression) or by fine-tuning the encoder and a linear transformation
* Main comparison is to Task Informed Meta Learning
* Presto is fully self supervised, computationally more efficient than image based approaches
* Also beats SatMAE
* Also tested fully supervised presto to measure effect of arch vs the self supervision training regime
* Didn’t really understand all presto tables, especially table 6
* Self supervised presto beats fully supervised presto for particular tasks


MedSAM
* Strong qualitative improvement in fuzzy boundary objects
* Text encoding with CLIP, todo read
* SAM can generate object, part, and subpart masks, 4 each. Todo enable?
* The mode matters a lot. What’s best for wall to wall mapping…. Segment anything mode has no semantic labels
* What’s best for fuzzy objects? Probably box? Sometimes multi point? When does part and subpart matter?
* Box == less trial and error
* They freeze the encoder
* Only mask decoder fine tuned
* Pregenerated all training image embeddings
* Image encoder resizes images by default to 3x1024x1024, yuck?

Through the fine-tuning of SAM on medical image datasets, MedSAM has greatly enhanced the model’s ability to identify challenging segmentation targets. Specifically, MedSAM has demonstrated three significant improvements over the pre-trained SAM. Firstly, MedSAM has improved the model’s ability to identify small objects, even when multiple segmentation targets are present within the bounding box prompt. Secondly, the model has shown more robustness towards weak boundaries in various modalities, such as lesion and left ventricle segmentation in ultrasound and brain MR images, respectively. Finally, MedSAM has effectively reduced interference from high-contrast objects surrounding the segmentation target, resulting in fewer outliers.

* Dice similarity saw huge improvement for nearly all categories, table 1 and 2
* Scribble based prompt, interesting idea
https://github.com/MIC-DKFZ/napari-sam 
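The Dice similarity mentioned in the MedSAM notes above measures the overlap between a predicted mask and a reference mask. A minimal sketch, assuming binary masks flattened to lists of 0 and 1, with a hypothetical function name:

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice similarity between two binary masks given as flat lists of 0/1."""
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    # Two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))  # 2 * 1 / (2 + 1)
```

A value of 1 means the masks match exactly and 0 means they do not overlap at all. Because it is normalized by mask size, Dice is sensitive to errors on small objects, which is relevant to the small-target improvements described above.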

Sam-adapter-med
* Different group almost same time
* Different “fine-tuning” method with prompting. Throws away the SAM decoder, and the encoder is frozen
* Uses two MLP layers to supply prompts to the encoder
* They seem to get great results, but it is unclear whether they use any prompts at all. It seems like these predictions are promptless.
* So better results on specific tasks with adapter fine-tuning for each task. Less generalizability for each adapted model, and you lose promptability?

Personalize-SAM
- Lightning-fast fine-tuning variant PerSAM-F
- Training-free variant PerSAM using one example image and mask
- Personalizing SAM to segment unique visual concepts (your pet dog)
- Learns the best mask scale for a particular problem, handling the object/part/subpart selection choice that SAM offers
- This seems unique from the adapter and traditional fine-tuning approaches
- “achieves leading performance on our annotated PerSeg dataset” lol
- Mainly compares to SegGPT on their own dataset, not other segmentation approaches like SAM adapter
