
FlowText: Synthesizing Realistic Scene Text Video with Optical Flow Estimation [ICME'23]

Abstract

Current video text spotting methods achieve strong performance when powered by sufficient labeled training data. However, labeling data manually is time-consuming and labor-intensive. Low-cost synthetic data is a promising alternative. This paper introduces a novel video text synthesis technique called FlowText, which uses optical flow estimation to synthesize large amounts of text video data at low cost for training robust video text spotters. Unlike existing methods that focus on image-level synthesis, FlowText concentrates on synthesizing the temporal information of text instances across consecutive frames using optical flow. This temporal information is crucial for accurately tracking and spotting text in video sequences, covering text movement, distortion, appearance, disappearance, occlusion, and blur. Experiments show that combining general detectors like TransDETR with the proposed FlowText produces remarkable results on various datasets, such as ICDAR2015video and ICDAR2013video.

Get Started

Environment Setup

FlowText builds on the segmentation model Mask2Former, the depth estimation model Monodepth2, the optical flow estimation model GMA, and the synthesis engine SynthText. We use conda to manage dependencies; our experiments were run with CUDA 11.1. Specify the cudatoolkit version appropriate for your machine in the requirements.txt file, then run the following commands to install FlowText:

conda create -n flowtext python=3.8
conda activate flowtext

pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

git clone https://github.com/callsys/FlowText
cd FlowText
pip install -r requirements.txt

cd segmentation/mask2former/modeling/pixel_decoder/ops/
sh make.sh
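
After installation, a quick sanity check confirms that PyTorch was built against CUDA 11.1 and can see the GPU (a minimal sketch; the exact version strings depend on your install):

import torch

# Expect "1.9.0+cu111" and True on a correctly configured machine
print(torch.__version__)
print(torch.cuda.is_available())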

Download Models

To run FlowText, you need to download some files (Google Drive), which mainly contain the font files for the synthesized text, the text source, and the weights of the models. Once you have downloaded the files, link them to the FlowText directory:

ln -s path/to/FlowText_data FlowText/data
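
You can verify the link before running the pipeline (a minimal sketch; the exact contents of the data archive, such as the font and weight subdirectories, depend on the downloaded files):

from pathlib import Path

data = Path("FlowText/data")
assert data.exists(), "data symlink is missing or broken"
# List the downloaded assets (fonts, text source, model weights)
for entry in sorted(data.iterdir()):
    print(entry.name)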

Generate Synthetic Videos with Full Annotations

Generate a synthetic video from the demo video assets/demo.mp4 and write the result to assets:

python gen.py

Generate a synthetic video from a given video video.mp4, with frame range start,end,interval, save path save, and random seed seed:

python gen.py --video video.mp4 --range start,end,interval --save save --seed seed

For example:

python gen.py --video assets/demo.mp4 --range 0,400,5 --save assets/result --seed 16
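
To synthesize many clips at scale, gen.py can be driven from a small wrapper script. The sketch below assumes only the flags shown above; the input directory videos/ and the per-clip seed policy are hypothetical placeholders:

import subprocess
from pathlib import Path

videos = sorted(Path("videos").glob("*.mp4"))  # hypothetical input directory
for i, video in enumerate(videos):
    save_dir = Path("results") / video.stem
    subprocess.run([
        "python", "gen.py",
        "--video", str(video),
        "--range", "0,400,5",      # frames 0-400, every 5th frame
        "--save", str(save_dir),
        "--seed", str(16 + i),     # vary the seed per clip for diverse text
    ], check=True)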

Output Format

The output directory produced by gen.py is organized as follows:

result
├── 00000000.jpg
├── 00000001.jpg
├── 00000002.jpg
├── ......
├── 00000079.jpg
├── ann.json
├── viz.mp4
└── viz_ann.mp4

where xxx.jpg denotes the synthetic video frames, ann.json is the annotation file, viz.mp4 is the synthetic video, and viz_ann.mp4 is the synthetic video with visualized annotations.
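
A quick way to inspect the annotations is to load ann.json and print its top-level structure. Since the schema is not documented here, the sketch below deliberately avoids assuming any field names:

import json

with open("assets/result/ann.json") as f:
    ann = json.load(f)

# Print the top-level structure without assuming a particular schema
if isinstance(ann, dict):
    for key in list(ann)[:5]:
        print(key, "->", type(ann[key]).__name__)
elif isinstance(ann, list):
    print(len(ann), "entries; first entry:", ann[0])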

Citation

If you use FlowText in your research or wish to refer to its results, please use the following BibTeX entries.

@inproceedings{zhao2023flowtext,
  title={FlowText: Synthesizing Realistic Scene Text Video with Optical Flow Estimation},
  author={Yuzhong Zhao and Weijia Wu and Zhuang Li and Jiahong Li and Weiqiang Wang},
  booktitle={ICME},
  year={2023}
}

@article{zhao2023flowtextarxiv,
  title={FlowText: Synthesizing Realistic Scene Text Video with Optical Flow Estimation},
  author={Yuzhong Zhao and Weijia Wu and Zhuang Li and Jiahong Li and Weiqiang Wang},
  journal={arXiv preprint arXiv:2305.03327},
  year={2023}
}

Organization

Affiliations: University of Chinese Academy of Sciences, Zhejiang University, MMU of Kuaishou Technology

Authors: Yuzhong Zhao (zhaoyuzhong20@mails.ucas.ac.cn), Weijia Wu (weijiawu@zju.edu.cn), Zhuang Li (lizhuang@kuaishou.com), Jiahong Li (lijiahong@kuaishou.com), and Weiqiang Wang (wqwang@ucas.ac.cn)

Acknowledgement

Code is largely based on SynthText, and models are borrowed from Mask2Former, Monodepth2, and GMA.

This work is fully supported by MMU of Kuaishou Technology.
