# TensorFlow advanced applications with deep learning: object detection, regression, time-series analysis, and hyperparameter tuning

## Objectives

- Understand the limits of deep learning models built to work with 3D imagery, and the variants that have been developed for 4D time series.
- Understand theory, use cases, and architecture options for time series modeling and change detection.
- Understand the difference between semantic segmentation and object detection, and the labeling and modeling challenges of each approach.
- Cover the R-CNN and YOLO families of object detection architectures.
- Understand theory, use cases, and architecture options for regression.
- Understand when to tune hyperparameters, and the pros and cons of different approaches.

### Semantic Segmentation with U-Net

In Lesson 1a, Introduction to ML Neural Networks and Deep Learning, we introduced the U-Net architecture. U-Net has been one of the most popular architectures for segmentation of remote sensing images because:

1. Since U-Net is fully convolutional, it has fewer parameters to train than models with multiple heads (such as R-CNNs) or models with many fully connected layers.
2. U-Net's skip connections let the decoder combine features learned at many spatial scales while preserving high resolution detail.
3. The U-Net architecture can handle very high resolution images without saturating GPU memory, unlike frameworks that learn many features for each section of an image.
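
To make the first point concrete, the sketch below counts trainable parameters for a 3x3 convolution and for a fully connected layer applied to the same feature map. The feature map size (128 x 128 pixels, 64 channels) and the 64 output units are illustrative assumptions, not values taken from any particular U-Net.

```python
def conv2d_params(in_channels, out_channels, kernel_size=3, bias=True):
    """Trainable parameters in a 2D convolution (weights are shared across positions)."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def dense_params(in_features, out_features, bias=True):
    """Trainable parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# Hypothetical feature map: 128 x 128 pixels with 64 channels.
height, width, channels = 128, 128, 64

conv_count = conv2d_params(channels, channels)
dense_count = dense_params(height * width * channels, 64)

print(f"3x3 conv, 64 -> 64 channels: {conv_count:,} parameters")
print(f"dense, flattened map -> 64 units: {dense_count:,} parameters")
```

The convolution's parameter count does not depend on the image size, which is why a fully convolutional network can stay small even when the input tiles are large.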

U-Net is therefore a common choice for segmentation in very high resolution imagery (and in low resolution imagery as well). At Development Seed, we've used CNN-based U-Net architectures for [segmentation of supraglacial lakes](https://developmentseed.org/blog/2022-12-13-segmenting-supraglacial-lakes).

The traditional U-Net architecture has been adapted and improved to work with high resolution 2D and 3D imagery. Among these adaptations, SSCA-Net, which uses self- and channel attention, has been a popular option.

:::{figure-md} SCCANetFig
<img src="./images/scca.jpg" width="450px">

[SCCA Image](./images/scca.jpg) from "SSCA-Net: Simultaneous Self- and Channel-Attention Neural Network for Multiscale Structure-Preserving Vessel Segmentation"
:::

In the above figure, SSCA-Net follows the structure of a U-Net with some modifications:

1. The typical initial residual blocks in the encoder are replaced by an RFU block, which is simply a 3x3 convolution followed by batch normalization and ReLU.
2. A squeeze and excitation pyramid pooling (SEPP) module at the end of the encoder uses atrous convolutions and a spatial pyramid to account for multiscale features.
3. The decoder includes SCA modules that use self- and channel attention to efficiently model long range dependencies in the feature maps learned by previous convolution operations.
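
As a rough illustration of why atrous (dilated) convolutions help with multiscale features, the sketch below computes how many pixels a 3x3 kernel spans at different dilation rates. The rates used here are assumptions chosen for illustration, not the rates used in the SSCA-Net paper.

```python
def effective_kernel_size(kernel_size, dilation):
    """Number of pixels spanned along one side of a dilated (atrous) kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Hypothetical dilation rates for a pyramid of 3x3 atrous convolutions.
for rate in (1, 2, 4, 8):
    span = effective_kernel_size(3, rate)
    print(f"dilation {rate}: 3x3 kernel spans {span}x{span} pixels, still 9 weights")
```

Each branch keeps the same number of weights while looking at a wider neighborhood, so a pyramid of such branches can combine context from several scales at little extra parameter cost.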


However, a traditional U-Net built for 3D inputs (bands, height, width) won't work for image time series, since its structure does not account for the time dimension. Extensions of U-Net and other fully convolutional networks have been developed to address 4D data cubes (time, bands, height, width). These approaches model relationships along the time sequence either with CNNs or with variants of attention.
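
The sketch below shows the shape problem in plain Python. A common workaround for feeding a 4D cube to a 3D network is to stack the time steps into the band dimension, which keeps all the values but no longer gives the network an explicit time axis. The cube dimensions are illustrative assumptions, and the nested lists stand in for the arrays or tensors a real pipeline would use.

```python
def shape(nested):
    """Return the shape of a regular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


def stack_time_into_bands(cube):
    """Flatten (time, bands, height, width) into (time * bands, height, width)."""
    return [band for time_step in cube for band in time_step]


# Hypothetical cube: 6 dates, 4 bands, 8 x 8 pixels, filled with zeros.
t, c, h, w = 6, 4, 8, 8
cube = [[[[0.0] * w for _ in range(h)] for _ in range(c)] for _ in range(t)]

stacked = stack_time_into_bands(cube)
print("4D cube:", shape(cube))
print("stacked for a 3D network:", shape(stacked))
```

After stacking, the first convolution treats the 24 date-band combinations as ordinary channels, which is one reason dedicated temporal architectures are used instead.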

Recent approaches in this vein include:

1. [Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning](https://arxiv.org/pdf/2212.14532.pdf)
2. [SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery](https://arxiv.org/pdf/2207.08051.pdf)

Of these, Scale-MAE is more recent and reports favorable performance relative to SatMAE.

:::{figure-md} ScaleMAEFig
<img src="https://ai-climate.berkeley.edu/scale-mae-website/static/images/scale-teaser.png" width="450px">

[ScaleMAE Image](https://ai-climate.berkeley.edu/scale-mae-website/static/images/scale-teaser.png) from "A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning"
:::

Scale-MAE has some important constraints:
* The resolution of the sensor inputs must match.
* GPU memory use is large relative to a traditional MAE.
* Training takes much longer than for networks that operate on 3D inputs (this is a constraint for most networks that work with the space, time, and band dimensions).
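
One general reason transformer-based models become expensive once a time dimension is added can be sketched with simple arithmetic: a ViT-style encoder splits each image into patch tokens, and full self-attention compares every token with every other token. The image size, patch size, and number of time steps below are illustrative assumptions, not Scale-MAE settings, and real models often reduce this cost (for example by encoding only the unmasked patches).

```python
def num_tokens(height, width, patch_size, time_steps=1):
    """Patch tokens produced by tiling each image in a series into square patches."""
    patches_per_image = (height // patch_size) * (width // patch_size)
    return patches_per_image * time_steps


def attention_matrix_entries(tokens):
    """Entries in one full self-attention matrix (every token attends to every token)."""
    return tokens * tokens


# Hypothetical inputs: a single 224 x 224 image versus a 12-date series of them.
single_image = num_tokens(224, 224, patch_size=16)
image_series = num_tokens(224, 224, patch_size=16, time_steps=12)

growth = attention_matrix_entries(image_series) / attention_matrix_entries(single_image)
print(f"tokens: {single_image} (single image) vs {image_series} (12 dates)")
print(f"attention matrix grows by a factor of {growth:.0f}")
```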


### Object Detection and Instance Segmentation: Counting and mapping extent of instances


* YOLOv5 for object detection, for example wildlife detection
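
Counting instances from detector output usually depends on removing duplicate boxes first. The sketch below is a minimal pure-Python version of intersection over union (IoU) and greedy non-maximum suppression, the kind of post-processing detectors such as YOLO apply. The boxes, scores, and the 0.5 threshold are made-up values for illustration, and this is not the YOLOv5 implementation.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical detections: the first two boxes cover the same animal.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.75]

kept = non_max_suppression(boxes, scores)
print(f"{len(kept)} instances counted from {len(boxes)} raw detections")
```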




### Mask R-CNN for object detection and instance segmentation

1. Useful when your mapping targets are objects, meaning they have relatively simple boundaries, are not extremely disjoint, and occur within a well-defined range of spatial scales and aspect ratios.
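
The "well-defined range of spatial scales and aspect ratios" matters because R-CNN style detectors propose regions from a fixed set of anchor boxes. The sketch below generates such a set. The scales and ratios are illustrative assumptions, and the convention that ratio means height divided by width is a choice made for this example.

```python
import math


def anchor_sizes(scales, aspect_ratios):
    """Return (width, height) for every scale and aspect ratio pair.

    Each anchor has area scale**2 and height / width equal to the aspect ratio.
    """
    anchors = []
    for scale in scales:
        for ratio in aspect_ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((width, height))
    return anchors


# Hypothetical anchor configuration: three scales and three aspect ratios.
anchors = anchor_sizes(scales=(32, 64, 128), aspect_ratios=(0.5, 1.0, 2.0))

for width, height in anchors:
    print(f"{width:6.1f} x {height:6.1f}")
```

Targets that are much larger, much smaller, or much more elongated than any anchor are matched poorly during training, so the anchor set is typically adjusted to the objects being mapped.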

https://meetingorganizer.copernicus.org/EGU23/EGU23-16932.html
https://docs.google.com/presentation/d/18wM5h4qxR3wev3Ix9HS0K3Z3E8Cjr2z2KoFXgrVCeF4/edit#slide=id.g23992ac5da2_4_25

### Time series and change detection


TinyCD, a change detection model with mixing mask attention, applied to fires: https://github.com/developmentseed/chabud2023/blob/main/chabud/tinycd_model.py

Presto: https://arxiv.org/pdf/2304.14065.pdf

* Operates on pixel time series from multiple sensors and multiple channels.
* Uses a channel group embedding, a positional embedding, an embedding for lat/lon, and a month embedding.
* Shows a small but consistent improvement over other approaches, including other time series approaches.
* The data is geographically representative across hemispheres and ecoregions.
* Dynamic World is used both as an input and to stratify samples.
* Presto can be used as a feature extractor (for random forest and regression models), or the encoder can be fine-tuned together with a linear transformation.
* The main comparison is to Task-Informed Meta-Learning.
* Presto is fully self-supervised and computationally more efficient than image-based approaches.
* It also beats SatMAE.
* The authors also tested a fully supervised Presto to separate the effect of the architecture from the effect of the self-supervised training regime.
* Self-supervised Presto beats fully supervised Presto on particular tasks.
* Note: not all of the Presto results tables were clear to us, especially Table 6.
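
To illustrate what a month embedding can contribute to a pixel time series model, the sketch below encodes the month as a point on a circle so that December and January end up close together. This is a generic cyclical encoding written for illustration. It is an assumption, not Presto's actual implementation.

```python
import math


def month_encoding(month):
    """Map a month (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * (month - 1) / 12
    return (math.sin(angle), math.cos(angle))


def distance(a, b):
    """Euclidean distance between two 2D encodings."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


january, july, december = month_encoding(1), month_encoding(7), month_encoding(12)

print(f"December to January: {distance(december, january):.3f}")
print(f"January to July:     {distance(january, july):.3f}")
```

With a plain integer month, December (12) and January (1) would look maximally far apart. The cyclical form keeps seasonal neighbors close, which suits phenology and other seasonal signals.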

Spatio-temporal transformer (STT)

* ConvLSTM: effective for time series data in computer vision.
* Knowledge distillation: teacher-student networks where the student (smaller) network tries to imitate the teacher (larger) network.
* SSCA-Net (Simultaneous Self- and Channel-Attention Neural Network): an evolved form of U-Net with attention mechanisms.
* Stacked autoencoder (SAM?): focuses on lossless compression. Comparison with PCA and deep autoencoders. Introduction of variational autoencoders.
* RaVAEn: unsupervised learning for change detection.
* Image retrieval: using vector databases for similarity computations.
* Image captioning: [Insert: Brief Explanation and Examples]
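
The teacher-student idea behind knowledge distillation can be sketched as a loss that measures how far the student's softened class probabilities are from the teacher's. The logits and the temperature of 2.0 below are made-up values, and practical setups usually combine this term with the ordinary supervised loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher means softer)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the student's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(teacher_probs, student_probs))


# Hypothetical logits for a three-class problem.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
poor_student = [0.5, 1.0, 4.0]

print(f"close student loss: {distillation_loss(teacher, close_student):.4f}")
print(f"poor student loss:  {distillation_loss(teacher, poor_student):.4f}")
```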

FireCube: a daily data cube for the modeling and analysis of wildfires in Greece. The dataset is on Zenodo.

Model types:

* Pixel-wise: RF, XGBoost
* Temporal pixel-wise: LSTM
* Spatiotemporal: ConvLSTM

The results showed that ConvLSTM does best, better than XGBoost.
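
The three model types differ mainly in the shape of a single training sample drawn from the data cube. The sketch below spells that out. The axis order and the example sizes (10 time steps, 8 variables, 25 x 25 pixel patches) are assumptions for illustration and are not taken from the FireCube work.

```python
def sample_shape(model_type, time_steps, variables, patch_size):
    """Shape of one training sample for each model family (assumed layout)."""
    if model_type == "pixel-wise":
        # RF, XGBoost: one flat feature vector per pixel, no time or space axis.
        return (variables,)
    if model_type == "temporal pixel-wise":
        # LSTM: a sequence of feature vectors for a single pixel.
        return (time_steps, variables)
    if model_type == "spatiotemporal":
        # ConvLSTM: a sequence of multi-variable image patches.
        return (time_steps, variables, patch_size, patch_size)
    raise ValueError(f"unknown model type: {model_type}")


model_types = ("pixel-wise", "temporal pixel-wise", "spatiotemporal")
shapes = {
    name: sample_shape(name, time_steps=10, variables=8, patch_size=25)
    for name in model_types
}

for name, sample in shapes.items():
    print(f"{name:20s} -> {sample}")
```

Only the spatiotemporal sample carries both the history and the neighborhood of a pixel, at the cost of far more values per sample.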


M2-CDNET 