This repository contains the implementation of the methods in "Cross-Lingual Cross-Platform Rumor Verification Pivoting on Multimedia Content". It depends on:
- Python 2.7
- Pytorch
- scikit-learn
- Theano
- Keras (with Theano backend)
- Pandas
- ...
Three sub-datasets of our CCMR dataset are saved in the folder CCMR as three JSON files (each a list of JSON objects): "CCMR/CCMR_Twitter.txt", "CCMR/CCMR_Google.txt", and "CCMR/CCMR_Baidu.txt".
For CCMR Twitter, each tweet is saved as a JSON object with the keys "tweet_id", "content", "image_id", "event", and "timestamp". For CCMR Google and CCMR Baidu, each webpage is saved as a JSON object with the keys "url", "title", "image_id", and "event". The values of "image_id" are lists of image or video names from the VMU 2015 dataset. All of these image files and video URLs are available in "images.zip".
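For example, a sub-dataset can be loaded with the standard json module (a minimal sketch, assuming each file holds a single JSON array as described above):

```python
import json

# Load the Twitter sub-dataset: a list of JSON objects, one per tweet.
with open('CCMR/CCMR_Twitter.txt') as f:
    tweets = json.load(f)

for tweet in tweets[:3]:
    # "image_id" links the tweet to image/video names from VMU 2015.
    print('%s (%s): %s' % (tweet['tweet_id'], tweet['event'], tweet['image_id']))
```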
- To reproduce the experiment results, simply run main.py. Alternatively, the steps below rebuild the full pipeline from scratch:
- Download the parallel English and Mandarin sentences of news and microblogs from UM-Corpus and save them in a folder named 'UM_Corpus'.
- Run prepare_UM_Corpus.py to split and tokenize the data in UM-Corpus.
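The script defines the actual preprocessing; the sketch below only illustrates the kind of splitting and tokenization involved, assuming alternating English/Mandarin lines in the corpus files, nltk for English, and jieba for Mandarin (the file name is hypothetical):

```python
# -*- coding: utf-8 -*-
import io
import jieba                              # assumed Mandarin segmenter
from nltk.tokenize import word_tokenize   # assumed English tokenizer

with io.open('UM_Corpus/Bi-News.txt', encoding='utf-8') as f:
    lines = [line.strip() for line in f if line.strip()]

# Assume each sentence pair is stored as an English line followed by
# its Mandarin translation on the next line.
pairs = []
for i in range(0, len(lines) - 1, 2):
    en_tokens = word_tokenize(lines[i])        # word-level English tokens
    zh_tokens = list(jieba.cut(lines[i + 1]))  # segmented Mandarin words
    pairs.append((en_tokens, zh_tokens))
```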
- Run train_multilingual_embedding.py to train the multilingual sentence embedding.
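The embedding model itself is defined in the script. Purely as intuition for what a shared cross-lingual sentence space means, here is a toy alignment sketch (not the paper's method): given vectors for the parallel sentence pairs, learn a linear map from the Mandarin side into the English space by least squares.

```python
import numpy as np

# Stand-in data: random "sentence vectors" for 1000 parallel pairs.
rng = np.random.RandomState(0)
en_vecs = rng.randn(1000, 100)
zh_vecs = en_vecs.dot(rng.randn(100, 100)) + 0.1 * rng.randn(1000, 100)

# Solve min_W ||zh_vecs.dot(W) - en_vecs||_F^2 via the pseudo-inverse.
W = np.linalg.pinv(zh_vecs).dot(en_vecs)

def to_shared_space(zh_vec):
    # Project a Mandarin sentence vector into the shared (English) space.
    return zh_vec.dot(W)
```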
- Run prepare_FNC_split.py to tokenize, embed, and split the data from the Fake News Challenge.
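As a rough illustration of the splitting step, assuming the official FNC-1 release files (train_stances.csv and train_bodies.csv) under an 'fnc-1' folder; the script's actual split may differ:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# FNC-1 ships stances and article bodies in separate CSV files.
stances = pd.read_csv('fnc-1/train_stances.csv')  # Headline, Body ID, Stance
bodies = pd.read_csv('fnc-1/train_bodies.csv')    # Body ID, articleBody
data = stances.merge(bodies, on='Body ID')

# Hold out a validation portion, stratified by the stance label.
train, val = train_test_split(data, test_size=0.2,
                              stratify=data['Stance'], random_state=42)
```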
- Run train_agreement_classifier.py to train the agreement classifier.
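The classifier architecture lives in the script; a minimal stand-in, assuming sentence embeddings as input, is a logistic regression over standard sentence-pair features (concatenation, elementwise product, and difference):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(headline_vec, body_vec):
    # A common featurization for sentence-pair classification.
    return np.concatenate([headline_vec, body_vec,
                           headline_vec * body_vec,
                           headline_vec - body_vec])

# Dummy embeddings and labels, just to make the sketch runnable.
rng = np.random.RandomState(0)
h, b = rng.randn(200, 100), rng.randn(200, 100)
X = np.array([pair_features(h[i], b[i]) for i in range(200)])
y = rng.randint(0, 3, size=200)  # e.g. agree / disagree / discuss

clf = LogisticRegression(max_iter=1000).fit(X, y)
```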
- Run prepare_CCMR.py to tokenize the CCMR dataset.
- Run extract_clcp_feats.py to extract all the cross-lingual cross-platform features and data splits needed for the experiments; the available output files are saved in the folder CLCP.
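For intuition, one family of such features compares a tweet's embedding with the embeddings of Google/Baidu webpages that share its images; a hedged sketch (function names are illustrative, not the script's API):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def cross_platform_feats(tweet_vec, page_vecs):
    # page_vecs: embeddings of webpages retrieved via a shared "image_id",
    # already projected into the shared cross-lingual space.
    sims = [cosine(tweet_vec, p) for p in page_vecs]
    return {'max_sim': max(sims), 'mean_sim': sum(sims) / len(sims)}
```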
- Use main.py and the other scripts to reproduce everything from the paper.