# Multi-class classification of Cofacts articles

This Colab notebook trains a classifier that predicts the article tags defined by Cofacts.

For details, see: https://github.com/cofacts/rumors-ai/tree/master/ai_model/models/model_A

### Tag definitions

![Definition](https://github.com/cofacts/rumors-ai/blob/master/data_exploration/img/Tags_definition.png?raw=true)

## A. Get the raw data and preprocess it for the model input

* For the data exploration behind this notebook, see: https://github.com/cofacts/rumors-ai/tree/master/data_exploration

### Step1: Download the labeled raw data from the Cofacts GitHub repo and unzip it

In [0]:
!wget https://github.com/cofacts/rumors-ai/raw/master/ai_model/data/raw_data/raw_data.zip

--2020-05-14 11:09:59--  https://github.com/cofacts/rumors-ai/raw/master/ai_model/data/raw_data/raw_data.zip
Resolving github.com (github.com)... 140.82.118.3
Connecting to github.com (github.com)|140.82.118.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/cofacts/rumors-ai/master/ai_model/data/raw_data/raw_data.zip [following]
--2020-05-14 11:09:59--  https://raw.githubusercontent.com/cofacts/rumors-ai/master/ai_model/data/raw_data/raw_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14700548 (14M) [application/zip]
Saving to: ‘raw_data.zip’


2020-05-14 11:10:00 (53.3 MB/s) - ‘raw_data.zip’ saved [14700548/14700548]



In [0]:
!unzip -o raw_data.zip

Streaming output truncated to the last 5000 lines.
  inflating: raw_data/28156.json     
  inflating: __MACOSX/raw_data/._28156.json  
  inflating: raw_data/25764.json     
  inflating: __MACOSX/raw_data/._25764.json  
  inflating: raw_data/27759.json     
  inflating: __MACOSX/raw_data/._27759.json  
  inflating: raw_data/24876.json     
  inflating: __MACOSX/raw_data/._24876.json  
  inflating: raw_data/30970.json     
  inflating: __MACOSX/raw_data/._30970.json  
  inflating: raw_data/20589.json     
  inflating: __MACOSX/raw_data/._20589.json  
  inflating: raw_data/25271.json     
  inflating: __MACOSX/raw_data/._25271.json  
  inflating: raw_data/20073.json     
  inflating: __MACOSX/raw_data/._20073.json  
  inflating: raw_data/24899.json     
  inflating: __MACOSX/raw_data/._24899.json  
  inflating: raw_data/17300.json     
  inflating: __MACOSX/raw_data/._17300.json  
  inflating: raw_data/28443.json     
  inflating: __MACOSX/raw_data/._28443.json  
  inflating

### Step2: Preprocess the raw data from JSON to CSV format
* This step first loads all the JSON files
* It then puts their information (file name, article content, tags) into the corresponding CSV columns
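
As a minimal sketch of the mapping described above, the helper below turns one parsed record into a CSV row. It assumes each file holds a `text` string and a non-empty `tags` list, which are the two fields this notebook reads. The name `record_to_row` and the sample record are hypothetical and are not part of the Cofacts code.

```python
def record_to_row(file_name, record):
    """Map one parsed JSON record to an [id, text, label] row.

    Only the first tag is kept, so each article ends up with one label.
    """
    return [file_name, record['text'], record['tags'][0]]


# Hypothetical record, for illustration only (not taken from the dataset).
sample = {'text': 'example article body', 'tags': [3, 7]}
row = record_to_row('00000.json', sample)
print(row)
```

Keeping only the first tag is what makes the task multi-class (one label per article) rather than multi-label.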

In [0]:
# coding=utf-8

# import packages
import os
import json

import pandas as pd


data_dir = 'raw_data'
output_dir = 'processed_data'


# keep only the JSON files in the raw data directory
files = os.listdir(data_dir)
files = [file for file in files if file.endswith('.json')]


# load tags and text information of the json files
define_columns = ['id', 'text', 'label']

data_list = []
for file in files:

    # read as UTF-8 explicitly, since the articles contain Chinese text
    with open(os.path.join(data_dir, file), 'r', encoding='utf-8') as f:
        data = json.load(f)

    # keep only the first tag, so each article has a single label
    label = [data['tags'][0]]

    data_list.append([file, data['text']] + label)

df_data = pd.DataFrame(data_list, columns=define_columns)


# create train/dev/test csv files for modularized BERT
num_files = df_data.shape[0]
train_ratio = 0.7
dev_ratio = 0.1
test_ratio = 0.2  # for reference only: the test split takes all remaining rows

# create the output directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

# rows are split in file-listing order (no shuffling)
df_data[:int(train_ratio*num_files)].to_csv(os.path.join(output_dir, 'train.csv'))
df_data[int(train_ratio*num_files):int((train_ratio+dev_ratio)*num_files)].to_csv(os.path.join(output_dir, 'dev.csv'))
df_data[int((train_ratio+dev_ratio)*num_files):].to_csv(os.path.join(output_dir, 'test.csv'))
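
The slicing above can be checked in isolation. The sketch below reproduces only the index arithmetic. `split_bounds` is a hypothetical helper and the dataset size is made up, so treat it as an illustration of the 70/10/20 split rather than the notebook's own code.

```python
def split_bounds(num_files, train_ratio=0.7, dev_ratio=0.1):
    """Return the (train_end, dev_end) row indices used to slice the data."""
    train_end = int(train_ratio * num_files)
    dev_end = int((train_ratio + dev_ratio) * num_files)
    return train_end, dev_end


# Hypothetical dataset size, chosen only to make the arithmetic easy to follow.
num_files = 1000
train_end, dev_end = split_bounds(num_files)
sizes = (train_end, dev_end - train_end, num_files - dev_end)
print(sizes)
```

Because the ratios are floats and `int()` truncates, a boundary may land one row earlier than the nominal ratio suggests. The three splits still cover every row exactly once.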



In [0]:
# take a look at the first rows of df_data
df_data.head()

|  | id | text | label |
|---|---|---|---|
| 0 | 22666.json | 癌細胞遍佈全身卻奇蹟痊癒　癌末男羞認吃了狗狗的「這個」\nhttp://sg.newsrep... | 3 |
| 1 | 19441.json | 昨晚九點左右 有人打爆我的車窗😭\n拿掉車裡所有值錢的東西\n幸虧老天有眼！他的電話竟然留在... | 14 |
| 2 | 20626.json | 袁世凱當皇帝時，將“元宵”改名為“湯圓”，為的是避諱“袁消”之意。\n毛澤東到過幾次河南，但... | 11 |
| 3 | 21838.json | http://t.aloaqw.com/index/index/details/goods_... | 14 |
| 4 | 20016.json | 你以為中國大陸可怕的是遼寧艦！？\n不、中國大陸可怕的是全球視野！\n中國大陸悄悄建了兩個人... | 0 |
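
Since the train/dev/test files are cut from the rows in listing order without shuffling, it can be worth checking how often each label occurs before training. A minimal sketch using only the standard library follows. `label_counts` is a hypothetical helper, and the list passed to it holds just the five labels visible in the rows above.

```python
from collections import Counter


def label_counts(labels):
    """Count how many articles carry each label, most frequent first."""
    return Counter(labels).most_common()


# The five labels shown in the preview; in the notebook, pass df_data['label'].
counts = label_counts([3, 14, 11, 14, 0])
print(counts)
```

Running the same count on each split separately shows whether the three files have comparable label distributions.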


## B. Use the Hugging Face framework to build a BERT model for our task

### Step1: Install Transformers from Hugging Face's GitHub repo

In [0]:
!git clone https://github.com/huggingface/transformers
!pip install ./transformers/

Cloning into 'transformers'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 26433 (delta 12), reused 18 (delta 8), pack-reused 26409
Receiving objects: 100% (26433/26433), 15.89 MiB | 13.56 MiB/s, done.
Resolving deltas: 100% (18408/18408), done.
Processing ./transformers
Collecting tokenizers==0.7.0
  Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)
     |████████████████████████████████| 3.8MB 5.9MB/s 
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/3b/88/49e772d686088e1278766ad68a463513642a2a877487decbd691dec02955/sentencepiece-0.1.90-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
     |████████████████████████████████| 1.1MB 38.5MB/s 
Collecting sacremoses
  Downloading https://files.pytho

### Step2: Train your own model from Google's pretrained BERT
* This step takes about 15 to 20 minutes on a single T4 GPU
* Please **do not** use a CPU-only machine; training may take **forever**
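
Before starting, it can help to confirm that the runtime actually exposes a GPU. The sketch below is a rough heuristic that only checks whether the `nvidia-smi` tool is on the `PATH`; that is an assumption about how the driver is installed, not a guarantee that a GPU is attached. `gpu_probably_available` is a hypothetical helper.

```python
import shutil


def gpu_probably_available():
    """Heuristic: True if the nvidia-smi tool can be found on PATH."""
    return shutil.which('nvidia-smi') is not None


if not gpu_probably_available():
    print('No NVIDIA driver tools found; training on CPU will be very slow.')
```

In Colab, the hardware accelerator is selected under Runtime > Change runtime type.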

In [0]:
# Download the Python files customized by Cofacts to make life much easier
!wget https://raw.githubusercontent.com/cofacts/rumors-ai/master/ai_model/models/model_A/utils_multi_label_classification.py
!wget https://raw.githubusercontent.com/cofacts/rumors-ai/master/ai_model/models/model_A/run_multi_label_classification.py

--2020-05-14 11:35:05--  https://raw.githubusercontent.com/cofacts/rumors-ai/master/ai_model/models/model_A/utils_multi_label_classification.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9394 (9.2K) [text/plain]
Saving to: ‘utils_multi_label_classification.py’


2020-05-14 11:35:05 (86.8 MB/s) - ‘utils_multi_label_classification.py’ saved [9394/9394]

--2020-05-14 11:35:07--  https://raw.githubusercontent.com/cofacts/rumors-ai/master/ai_model/models/model_A/run_multi_label_classification.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Lengt

In [0]:
# train the BERT model; training for 3 epochs takes about 15 to 20 minutes
!python ./run_multi_label_classification.py \
--task_name cofacts \
--model_name_or_path bert-base-chinese \
--do_train \
--do_eval \
--data_dir ./processed_data/ \
--learning_rate 1e-4 \
--num_train_epochs 3 \
--max_seq_length 128 \
--output_dir models_bert/ \
--per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
--gradient_accumulation_steps 2 \
--overwrite_output_dir

2020-05-14 11:36:48.587682: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
05/14/2020 11:36:50 - INFO - transformers.training_args -   PyTorch: setting up devices
05/14/2020 11:36:50 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='models_bert/', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluate_during_training=False, per_gpu_train_batch_size=16, per_gpu_eval_batch_size=16, gradient_accumulation_steps=2, learning_rate=0.0001, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir=None, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False)
05/14/2020 11:36:51 - INFO - filelock -   Lock 139815188044544 acquired on /root/.cache/torch/transforme
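
The batch-size flags in the training command above interact: with `--per_gpu_train_batch_size=16` and `--gradient_accumulation_steps 2`, gradients from two batches are accumulated before each optimizer update. The sketch below works through that arithmetic. It assumes a single GPU and ignores an incomplete accumulation window at the end of an epoch; the helper names and the training-set size are hypothetical.

```python
import math


def effective_batch_size(per_gpu_batch_size, accumulation_steps, n_gpu=1):
    """Number of examples that contribute to one optimizer update."""
    return per_gpu_batch_size * accumulation_steps * n_gpu


def updates_per_epoch(num_examples, per_gpu_batch_size, accumulation_steps, n_gpu=1):
    """Approximate optimizer updates per epoch (incomplete final window ignored)."""
    batches = math.ceil(num_examples / (per_gpu_batch_size * n_gpu))
    return batches // accumulation_steps


batch = effective_batch_size(16, 2)       # flag values from the command above
updates = updates_per_epoch(7000, 16, 2)  # 7000 is a hypothetical training-set size
print(batch, updates)
```

Accumulation trades wall-clock time for memory: the update behaves like a batch of 32 while only 16 examples need to fit on the GPU at once.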

### Step3: Predict labels for the train/dev/test CSV files to check model performance
* Training set:
  * Accuracy = 0.915
  * Inference time = 13.3 ms/article
* Dev set:
  * Accuracy = 0.755
  * Inference time = 12.7 ms/article
* Test set:
  * Accuracy = 0.802
  * Inference time = 13.1 ms/article
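
For reference, the two metrics reported above can be defined as follows. This is a sketch of the standard definitions, not the code inside `run_multi_label_classification.py`, whose internals are not shown here. The helper names, labels and timing are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of articles whose predicted label equals the gold label."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    if len(y_true) == 0:
        raise ValueError('cannot compute accuracy on an empty set')
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def ms_per_article(total_seconds, num_articles):
    """Average inference time per article, in milliseconds."""
    return 1000.0 * total_seconds / num_articles


# Hypothetical labels and timing, for illustration only.
acc = accuracy([3, 14, 11, 0], [3, 14, 0, 0])
latency = ms_per_article(2.0, 160)
print(acc, latency)
```

The gap between training accuracy and dev/test accuracy in the list above is the usual thing to watch when judging overfitting.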

In [0]:
# training csv prediction
!python ./run_multi_label_classification.py \
--task_name cofacts \
--model_name_or_path models_bert \
--do_eval \
--data_dir ./processed_data/ \
--predict_file train.csv \
--per_gpu_eval_batch_size=16 \
--output_dir ./prediction/

2020-05-14 11:59:57.250181: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
05/14/2020 11:59:59 - INFO - transformers.training_args -   PyTorch: setting up devices
05/14/2020 11:59:59 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./prediction/', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, per_gpu_train_batch_size=8, per_gpu_eval_batch_size=16, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir=None, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False)
05/14/2020 11:59:59 - INFO - transformers.configuration_utils -   loading configuration file models_ber

In [0]:
# dev csv prediction
!python ./run_multi_label_classification.py \
--task_name cofacts \
--model_name_or_path models_bert \
--do_eval \
--data_dir ./processed_data/ \
--predict_file dev.csv \
--per_gpu_eval_batch_size=16 \
--output_dir ./prediction/

2020-05-14 12:02:55.240103: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
05/14/2020 12:02:57 - INFO - transformers.training_args -   PyTorch: setting up devices
05/14/2020 12:02:57 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./prediction/', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, per_gpu_train_batch_size=8, per_gpu_eval_batch_size=16, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir=None, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False)
05/14/2020 12:02:57 - INFO - transformers.configuration_utils -   loading configuration file models_ber

In [0]:
# testing csv prediction
!python ./run_multi_label_classification.py \
--task_name cofacts \
--model_name_or_path models_bert \
--do_eval \
--data_dir ./processed_data/ \
--predict_file test.csv \
--per_gpu_eval_batch_size=16 \
--output_dir ./prediction/

2020-05-14 12:03:27.512145: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
05/14/2020 12:03:29 - INFO - transformers.training_args -   PyTorch: setting up devices
05/14/2020 12:03:29 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='./prediction/', overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluate_during_training=False, per_gpu_train_batch_size=8, per_gpu_eval_batch_size=16, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir=None, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False)
05/14/2020 12:03:29 - INFO - transformers.configuration_utils -   loading configuration file models_ber

### Step4: We are done! You have trained a multi-class classifier for rumor articles. You can check the prediction result files for details. 