Skip to content

X-PLUG/Youku-mPLUG

Repository files navigation

Youku-mPLUG 10M Chinese Large-Scale Video Text Dataset

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Pre-training Dataset and Benchmarks Download Link HERE

Paper

examples for youku-mplug

What is Youku-mPLUG?

We release the public largest Chinese high-quality video-language dataset (10 million) named Youku-mPLUG, which is collected from a well-known Chinese video-sharing website, named Youku, with strict criteria of safety, diversity, and quality.

examples for youku-mplug

examples for youku-mplug

Examples of video clips and titles in the proposed Youku-mPLUG dataset.

We provide 3 different downstream multimodal video benchmark datasets to measure the capabilities of pre-trained models. The 3 different tasks include:

  • Video Category Prediction:Given a video and its corresponding title, predict the category of the video.
  • Video-Text Retrieval:In the presence of some videos and some texts, use video for text retrieval and text for video retrieval.
  • Video Captioning:In the presence of a video, describe the content of the video.

examples for youku-mplug downstream dataset

Data statistics

The dataset contains 10 million videos in total, which are of high quality and distributed in 20 super categories can 45 categories.

statistics

The distribution of categories in Youku-mPLUG dataset.

Zero-shot Capability

case1 case2

Download

You can download all the videos and annotation files through this link

Setup

Note: Due to a bug in megatron_util, after installing megatron_util, it is necessary to replace conda/envs/youku/lib/python3.10/site-packages/megatron_util/initialize.py with the initialize.py in the current directory.

conda env create -f environment.yml
conda activate youku
pip install megatron_util==1.3.0 -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

# For caption evaluation
apt-get install default-jre

mPLUG-Video (1.3B / 2.7B)

Pre-train

First you should download GPT-3 1.3B & 2.7B checkpoint from Modelscope. The pre-trained model can be downloaded Here (1.3B) and Here (2.7B).

Running the pre-training of mPLUG-Video as:

exp_name='pretrain/gpt3_1.3B/pretrain_gpt3_freezeGPT_youku_v0'
PYTHONPATH=$PYTHONPATH:./ \
python -m torch.distributed.launch --nproc_per_node=8 --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  --nnodes=$WORLD_SIZE \
  --node_rank=$RANK \
  --use_env run_pretrain_distributed_gpt3.py \
  --config ./configs/${exp_name}.yaml \
  --output_dir ./output/${exp_name} \
  --enable_deepspeed \
  --bf16
  2>&1 | tee ./output/${exp_name}/train.log

Benchmarking

To perform downstream fine-tuning. We take Video Category Prediction as an example:

exp_name='cls/cls_gpt3_1.3B_youku_v0_sharp_2'
PYTHONPATH=$PYTHONPATH:./ \
python -m torch.distributed.launch --nproc_per_node=8 --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  --nnodes=$WORLD_SIZE \
  --node_rank=$RANK \
  --use_env downstream/run_cls_distributed_gpt3.py \
  --config ./configs/${exp_name}.yaml \
  --output_dir ./output/${exp_name} \
  --enable_deepspeed \
  --resume path/to/1_3B_mp_rank_00_model_states.pt \
  --bf16
  2>&1 | tee ./output/${exp_name}/train.log

Experimental results

Below we show the results on the validation sets for reference.

Video category prediction results on the validation set. Video retrieval results on the validation set.

mPLUG-Video (BloomZ-7B)

We build the mPLUG-Video model based on mPLUG-Owl. To use the model, you should first clone the mPLUG-Owl repo as

git clone https://github.com/X-PLUG/mPLUG-Owl.git
cd mPLUG-Owl/mPLUG-Owl

The instruction-tuned checkpoint is available on HuggingFace. For finetuning the model, you can refer to mPLUG-Owl Repo. To perform video inference you can use the following code:

import torch
from mplug_owl_video.modeling_mplug_owl import MplugOwlForConditionalGeneration
from transformers import AutoTokenizer
from mplug_owl_video.processing_mplug_owl import MplugOwlImageProcessor, MplugOwlProcessor

pretrained_ckpt = 'MAGAer13/mplug-youku-bloomz-7b'
model = MplugOwlForConditionalGeneration.from_pretrained(
    pretrained_ckpt,
    torch_dtype=torch.bfloat16,
    device_map={'': 0},
)
image_processor = MplugOwlImageProcessor.from_pretrained(pretrained_ckpt)
tokenizer = AutoTokenizer.from_pretrained(pretrained_ckpt)
processor = MplugOwlProcessor(image_processor, tokenizer)

# We use a human/AI template to organize the context as a multi-turn conversation.
# <|video|> denotes an video placehold.
prompts = [
'''The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: <|video|>
Human: 视频中的女人在干什么?
AI: ''']

video_list = ['yoga.mp4']

# generate kwargs (the same in transformers) can be passed in the do_generate()
generate_kwargs = {
    'do_sample': True,
    'top_k': 5,
    'max_length': 512
}
inputs = processor(text=prompts, videos=video_list, num_frames=4, return_tensors='pt')
inputs = {k: v.bfloat16() if v.dtype == torch.float else v for k, v in inputs.items()}
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
    res = model.generate(**inputs, **generate_kwargs)
sentence = tokenizer.decode(res.tolist()[0], skip_special_tokens=True)
print(sentence)

Citing Youku-mPLUG

If you find this dataset useful for your research, please consider citing our paper.

@misc{xu2023youku_mplug,
    title={Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks},
    author={Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Chenliang Li, Qi Qian, Que Maofei, Ji Zhang, Xiao Zeng, Fei Huang},
    year={2023},
    eprint={2306.04362},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}