# Text Detection in Video Using Deep Learning
### 2018/02/01 Xiaolong Li

## 1. Datasets
Currently we have these available datasets:
#### Video 
- 2007 [Merino](https://www.cs.bris.ac.uk/Research/Vision/texttrack/) Scene text Video 
- 2011 [Minetto](http://www.liv.ic.unicamp.br/~minetto/datasets/text/VIDEOS/)
- 2013 [ICDAR 2013](http://dagdata.cvc.uab.es/icdar2013competition/?ch=3&com=introduction) 
- 2014 [Merino-Gracia](http://nf.ull.es/research/eav/text/tracking) 
- 2014 [YouTube Video Text](http://academictorrents.com/details/156802226bcf5747e0bea4e4f14c03b3b952de80) 
- 2015 [ICDAR 2015](http://rrc.cvc.uab.es/?ch=3&com=introduction)

#### Video frames
- 2010 [SVT](http://vision.ucsd.edu/~kai/grocr)
- 2017 [CURE-TSR: Challenging Unreal and Real Environments for Traffic Sign Recognition](https://openreview.net/forum?id=Hy4q48h3Z)

#### Notes
The `YouTube Video Text` dataset is used for text detection, tracking, and recognition in video. Some videos contain several text regions, which are sometimes affected by natural noise, distortion, blurring, substantial changes in illumination, and occlusion. The dataset contains 30 videos collected from YouTube. The text content falls into two categories: graphic text (e.g., captions, song titles, and logos) and scene text (e.g., street signs, business signs, and words on t-shirts).

The `ICDAR 2013 dataset` (Robust Reading Competition Challenge 3: Text in Videos) is used to evaluate video scene text detection, tracking, and recognition. It includes 28 video sequences, of which 13 are for training and the rest are for testing. These videos cover different scripts and languages (Spanish, French, English, and Japanese) and were captured with different types of cameras.

More recently, the ICDAR 2015 Robust Reading Competition released an updated version of the `ICDAR 2013 video dataset`. The `ICDAR 2015 dataset` includes a training set of 25 videos (13,450 frames in total) and a test set of 24 videos (14,374 frames in total). The dataset was collected by organizers from different countries and includes text in different languages. The video sequences correspond to 7 high-level tasks in both indoor and outdoor scenarios, and 4 different cameras were used to capture the sequences.

## 2. Research Papers
- 2017 [Attention-based Extraction of Structured Information from Street View Imagery](https://research.googleblog.com/2017/05/updating-google-maps-with-deep-learning.html) Google Inc.
- 2017 [Tracking Based Multi-Orientation Scene Text Detection: A Unified Framework With Dynamic Programming](http://ieeexplore.ieee.org/document/7903596/) USTC
- 2017 [Deep Image Prior](https://dmitryulyanov.github.io/deep_image_prior): might be useful for denoising
- 2016 [Spatial Transformer Networks](https://github.com/tensorflow/models/tree/master/research/transformer) Google DeepMind: might be useful for 3D manipulation

## 3. Observations
Previous video text detection methods are mainly `tracking based text detection methods`, which fall into two categories:
- temporal-spatial information based methods
- fusion based methods

### Pros and Cons:
- The first paper listed in Section 2 (Attention-based Extraction of Structured Information from Street View Imagery) takes four image frames from different viewing angles and uses a CNN + RNN to produce refined text detection results, which boosts accuracy.

However, it does not use the temporal information in sequential video frames, and it does not account for spatial transformations.

- The second paper (Tracking Based Multi-Orientation Scene Text Detection)
  - proposes a dynamic-programming pipeline for handling video data;
  - adopts multiple feature extraction, matching, and filtering techniques to increase accuracy;
  - uses global graph optimization to choose the optimal tracking trajectory;
  - uses a simple CNN for candidate filtering and classification.

However, the computation appears to be very expensive, and the method does not take an end-to-end deep learning approach, which would be more powerful.
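As a rough illustration of the second approach's idea of choosing an optimal tracking trajectory, the sketch below links one candidate box per frame with a Viterbi-style dynamic program. It is not the paper's algorithm: the scoring (detection confidence plus IoU overlap between consecutive boxes), the `iou` and `best_trajectory` helpers, and the `(x1, y1, x2, y2, score)` box format are all assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, ...)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_trajectory(frames):
    """Pick one candidate index per frame, maximizing the sum of detection
    scores plus the IoU between consecutive picks (hypothetical objective).

    frames: list of non-empty lists of boxes (x1, y1, x2, y2, score).
    """
    # best[t][j] = best total score of any path ending at candidate j of frame t
    best = [[box[4] for box in frames[0]]]
    back = []  # back[t - 1][j] = chosen predecessor index in frame t - 1
    for t in range(1, len(frames)):
        row, ptr = [], []
        for box in frames[t]:
            scores = [best[t - 1][i] + iou(prev, box)
                      for i, prev in enumerate(frames[t - 1])]
            i_best = max(range(len(scores)), key=scores.__getitem__)
            row.append(scores[i_best] + box[4])
            ptr.append(i_best)
        best.append(row)
        back.append(ptr)
    # Backtrack from the best final candidate to recover the path
    j = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return path


# Toy input: frame 1 has a higher-scoring but spatially unrelated candidate.
frames = [
    [(10, 10, 50, 30, 0.9)],
    [(12, 10, 52, 30, 0.6), (200, 200, 240, 220, 0.7)],
    [(14, 10, 54, 30, 0.8)],
]
path = best_trajectory(frames)
```

In this toy input the overlap term keeps the trajectory on the spatially consistent box in the middle frame, even though the other candidate has a higher detection score. The cost grows with the number of frames times the square of the candidates per frame, which hints at why richer features and global graph optimization become expensive.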

   

## 4. Current Implementation

Zero-padding adds zeros around the border of an image:

<img src="images/nn.png" style="width:1000px;height:600px;">
<caption><center> <u> <font color='purple'> **Figure 1** </font></u><font color='purple'> : **FFnet Example**<br> Model (3 channels, RGB) with a 2-stage FFnet network. </font></center></caption>

The main idea tested here is to explore which classifier works most efficiently with the feature extractor:

- 2 Conv layers + 2 stages of FFnet + 3 FC layers

- Zero-padding helps keep more of the information at the border of an image. Without padding, very few values in the next layer would be affected by pixels at the edges of an image.
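To make the padding remark concrete, here is a minimal pure-Python sketch of zero-padding a single-channel image, together with the standard convolution output-size formula, floor((n + 2p - f) / s) + 1. The helper names `zero_pad` and `conv_output_size` are illustrative and are not taken from the implementation described here; in practice a call such as `numpy.pad` would do the same job.

```python
def zero_pad(image, pad):
    """Return a copy of a 2-D list `image` with `pad` rows and columns
    of zeros added on every side."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]          # top border
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))      # bottom border
    return padded


def conv_output_size(n, f, pad=0, stride=1):
    """Spatial output size of a convolution over an n x n input with an
    f x f filter: floor((n + 2 * pad - f) / stride) + 1."""
    return (n + 2 * pad - f) // stride + 1


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
padded = zero_pad(image, 1)           # 5 x 5, original values in the center
no_pad = conv_output_size(3, 3)       # 3 x 3 filter, no padding
same = conv_output_size(3, 3, pad=1)  # 3 x 3 filter, one pixel of padding
```

With a 3x3 filter and no padding, the 3x3 input shrinks to a single output value; with `pad=1` the output stays 3x3, so corner pixels contribute to more than one output value.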

**Training Details**: 
<img src="images/training.png" style="width:800px;height:400px;">

**Current classification results**: 
<img src="images/accu.png" style="width:800px;height:400px;">

## Next steps
- Learn more about TensorFlow model reuse and retraining.
- Use Dr. Ahmed's code to generate heat-maps and test bounding box generation.
- Explore sequence data processing with deep learning methods such as RNNs and Bayesian networks.
- Run the spatial transformer model and assess whether it can be integrated.
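
For the heat-map item, one common way to turn a per-pixel text score map into boxes is to threshold it and take the bounding box of each connected region. The sketch below does this with a plain flood fill. It is not Dr. Ahmed's code, and the `heatmap_to_boxes` name, the 0.5 threshold, and the choice of 4-connectivity are assumptions for illustration.

```python
from collections import deque


def heatmap_to_boxes(heatmap, threshold=0.5):
    """Return (row_min, col_min, row_max, col_max) for each 4-connected
    region of pixels whose score is at or above `threshold`."""
    rows, cols = len(heatmap), len(heatmap[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or heatmap[r][c] < threshold:
                continue
            # Breadth-first flood fill over one connected region
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and heatmap[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes


# Toy score map with two separate high-score regions.
heatmap = [
    [0.0, 0.9, 0.8, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.9],
]
boxes = heatmap_to_boxes(heatmap)
```

Boxes are returned in row-major order of the first pixel found in each region, with inclusive pixel coordinates.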

## Reference