# Tweet Turing Test: Detecting Disinformation on Twitter  

|          | Group #2 - Disinformation Detectors                     |
|---------:|---------------------------------------------------------|
| Members  | John Johnson, Katy Matulay, Justin Minnion, Jared Rubin |
| Notebook | `xx_fine_tuner.ipynb`                                   |
| Purpose  | A notebook to fine-tune BERT models.                    |

(todo: description)

Based on tutorial from: https://huggingface.co/docs/transformers/training

# 1 - Setup

In [33]:
# imports from Python standard library

# imports requiring installation
#   connection to Google Cloud Storage
from google.cloud import storage            # pip install google-cloud-storage
from google.oauth2 import service_account   # pip install google-auth

#  data science packages
import numpy as np
import pandas as pd
import tensorflow as tf

import evaluate
from datasets import Dataset, ClassLabel
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer

In [3]:
# imports from tweet_turing.py
import tweet_turing as tur      # note - different import approach from prior notebooks

# imports from tweet_turing_paths.py
from tweet_turing_paths import local_data_paths, local_snapshot_paths, gcp_data_paths, \
    gcp_snapshot_paths, gcp_project_name, gcp_bucket_name, gcp_key_file

In [4]:
# pandas options
pd.set_option('display.max_colwidth', None)

## Local or Cloud?

Decide here whether to run the notebook with local data or GCP bucket data:
 - If the working directory of this notebook has a "../data/" folder with data loaded (e.g. working on a local computer, or data files have been copied to a cloud VM), use the "local files" option and comment out the "gcp bucket files" option.
 - If this notebook is being run from a GCP VM (preferably in the `us-central1` location), use the "gcp bucket files" option and comment out the "local files" option.

In [5]:
# option: local files
#local_or_cloud: str = "local"   # comment/uncomment this line or next

# option: gcp bucket files
local_or_cloud: str = "cloud"   # comment/uncomment this line or previous

# don't comment/uncomment for remainder of cell
if (local_or_cloud == "local"):
    data_paths = local_data_paths
    snapshot_paths = local_snapshot_paths
elif (local_or_cloud == "cloud"):
    data_paths = gcp_data_paths
    snapshot_paths = gcp_snapshot_paths
else:
    raise ValueError("Variable 'local_or_cloud' can only take on one of two values, 'local' or 'cloud'.")
    # subsequent cells will not do this final "else" check
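
Since later cells skip the final "else" check, one option is to validate the choice once inside a small helper. The sketch below is only an illustration: `select_snapshot_paths` is a hypothetical name, and the two dictionaries are placeholder stand-ins for the ones imported from `tweet_turing_paths.py` (only the `parq_snapshot` key is taken from this notebook; the values are made up).

```python
# Hypothetical stand-ins for the dictionaries imported from tweet_turing_paths.py;
# the real values live in that module and are not reproduced here.
local_snapshot_paths = {"parq_snapshot": "local/placeholder/"}
gcp_snapshot_paths = {"parq_snapshot": "gcp/placeholder/"}


def select_snapshot_paths(local_or_cloud: str) -> dict:
    """Return the snapshot-path dictionary for the chosen mode, or raise."""
    options = {"local": local_snapshot_paths, "cloud": gcp_snapshot_paths}
    if local_or_cloud not in options:
        raise ValueError(
            "Variable 'local_or_cloud' can only take on one of two values, "
            "'local' or 'cloud'."
        )
    return options[local_or_cloud]


# any other value (including typos such as "Cloud ") raises ValueError
snapshot_paths = select_snapshot_paths("cloud")
```

The same lookup could return `data_paths` as well; it is kept to one dictionary here to stay short.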

In [6]:
# this cell only needs to run its code if local_or_cloud=="cloud"
#   (though it is harmless if run when local_or_cloud=="local")
gcp_storage_client: storage.Client = None
gcp_bucket: storage.Bucket = None

if (local_or_cloud == "cloud"):
    #gcp_storage_client = tur.get_gcp_storage_client(project_name=gcp_project_name, key_file=gcp_key_file)
    gcp_storage_client = tur.get_gcp_storage_client(project_name=gcp_project_name)
    gcp_bucket = tur.get_gcp_bucket(storage_client=gcp_storage_client, bucket_name=gcp_bucket_name)

# 2 - Load Dataset

Start with the ten-percent sample, for which NLP preprocessing was completed in notebook **`04_nlp_preprocess.ipynb`**.

In [7]:
# note this cell requires package `pyarrow` to be installed in environment
parq_filename: str = "data_sample_ten_percent_NLP_preprocessed.parquet.gz"
parq_path: str = f"{snapshot_paths['parq_snapshot']}{parq_filename}"

if (local_or_cloud == "local"):
    df = pd.read_parquet(parq_path, engine='pyarrow')
elif (local_or_cloud == "cloud"):
    df = tur.get_gcp_object_from_parq_as_df(bucket=gcp_bucket, object_name=parq_path)

In [8]:
df.head(3)

Unnamed: 0,external_author_id,author,content,region,language,following,followers,updates,post_type,is_retweet,...,has_url,emoji_text,emoji_count,publish_date,class,following_ratio,class_numeric,RUS_lett_count,content_demoji,content_no_emoji
0,23785050,radiowoody,"To live dangerously on Friday the 13th, we're doing the radio show from the UNLUCKIEST place on earth! The @TennesseeTitans Locker Room!",Nashville Tennessee,en,2585,5710,2,,0.0,...,0,[],0,2013-12-13 10:03:43+00:00,Verified,0.452635,0,0,"To live dangerously on Friday the 13th, we're doing the radio show from the UNLUCKIEST place on earth! The @TennesseeTitans Locker Room!","To live dangerously on Friday the 13th, we're doing the radio show from the UNLUCKIEST place on earth! The @TennesseeTitans Locker Room!"
1,59020162,matthewpouliot,@legsanity I like it. Almost like a free Gio. Pujols is still about as good of a bet as Gonzalez the rest of the way.,Florida,en,999,12637,0,replied_to,0.0,...,0,[],0,2015-04-26 20:13:58+00:00,Verified,0.079047,0,0,@legsanity I like it. Almost like a free Gio. Pujols is still about as good of a bet as Gonzalez the rest of the way.,@legsanity I like it. Almost like a free Gio. Pujols is still about as good of a bet as Gonzalez the rest of the way.
2,1656024374,IMISSOBAMA,Man servants can have a good purpose as long as they come with cash and don't touch me ever.,United States,en,473,760,4122,RETWEET,1.0,...,0,[],0,2016-12-24 13:12:00+00:00,Troll,0.621551,1,0,Man servants can have a good purpose as long as they come with cash and don't touch me ever.,Man servants can have a good purpose as long as they come with cash and don't touch me ever.


In [9]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 362314 entries, 0 to 362313
Data columns (total 24 columns):
 #   Column              Non-Null Count   Dtype              
---  ------              --------------   -----              
 0   external_author_id  362314 non-null  string             
 1   author              362314 non-null  string             
 2   content             362314 non-null  string             
 3   region              344249 non-null  string             
 4   language            362314 non-null  category           
 5   following           362314 non-null  uint64             
 6   followers           362314 non-null  uint64             
 7   updates             362314 non-null  uint64             
 8   post_type           154729 non-null  category           
 9   is_retweet          362314 non-null  float64            
 10  account_category    362314 non-null  category           
 11  tweet_id            362314 non-null  string             
 12  tco1_step1      

In [10]:
df['class'].unique()

['Verified', 'Troll']
Categories (2, object): ['Troll', 'Verified']

In [11]:
df.loc[df['emoji_count'] > 0, ['content', 'content_demoji', 'content_no_emoji']].sample(5)

Unnamed: 0,content,content_demoji,content_no_emoji
218744,The future is female &amp; we make history everyday. Happy #WomensHistoryMonth �✊� https://t.co/MbKatuKSpW,The future is female &amp; we make history everyday. Happy #WomensHistoryMonth �:raised fist:� https://t.co/MbKatuKSpW,The future is female &amp; we make history everyday. Happy #WomensHistoryMonth �� https://t.co/MbKatuKSpW
273179,@melindagates Thx for highlighting - we 💛this! Maybe the next Millie is here! https://t.co/evyuBRJgUW 💡,@melindagates Thx for highlighting - we :yellow heart:this! Maybe the next Millie is here! https://t.co/evyuBRJgUW :light bulb:,@melindagates Thx for highlighting - we this! Maybe the next Millie is here! https://t.co/evyuBRJgUW
232730,"Shhh, you! I thought we were friends 😜 https://t.co/eOu0IOVmD5","Shhh, you! I thought we were friends :winking face with tongue: https://t.co/eOu0IOVmD5","Shhh, you! I thought we were friends https://t.co/eOu0IOVmD5"
304524,It smells like garlic bread at the @ymca right now word to @DMX 👃🍞,It smells like garlic bread at the @ymca right now word to @DMX :nose::bread:,It smells like garlic bread at the @ymca right now word to @DMX
35606,"RT ericbolling: 💥BOOM💥 .nytimes influential Best Seller list: ""The Swamp"" earned its 5th straight week! Thanks to… https://t.co/eBdETSSjHM","RT ericbolling: :collision:BOOM:collision: .nytimes influential Best Seller list: ""The Swamp"" earned its 5th straight week! Thanks to… https://t.co/eBdETSSjHM","RT ericbolling: BOOM .nytimes influential Best Seller list: ""The Swamp"" earned its 5th straight week! Thanks to… https://t.co/eBdETSSjHM"


# 3 - Choose Dataset Fields and Model

## 3.1 - Set Args

To make the subsequent encoding/training code more modular, set as many args as we can within this cell.

In [56]:
# where to store the output of the model (as a subfolder of ../data/models/)
output_dir_name = 'dist-test1'

# which columns from dataframe will be used
content_column = 'content_no_emoji'
class_column = 'class_numeric'        # Assumes: 0=authentic, 1=troll

# select pre-trained model
#pretrained_model_name = 'Twitter/twhin-bert-base'    # https://huggingface.co/Twitter/twhin-bert-base
#pretrained_model_name = 'bert-base-cased'            # https://huggingface.co/bert-base-cased
pretrained_model_name = 'distilbert-base-uncased'    # https://huggingface.co/distilbert-base-uncased

# these are passed on to model object as keyword args
extra_model_args = {
    'num_labels': 2
}

# these are passed on to trainer object as keyword args
extra_train_args = {
    'output_dir': f'../data/models/{output_dir_name}',
    'num_train_epochs': 2,
}

# maximum tweets (per class) used for fine tuning (set to None for no limit)
#  e.g. if this value is 5000, a maximum of 5000 troll and 5000 authentic tweets will be used
#       for a total of 10,000 tweets used for fine tuning
max_tweets_per_class = 5000
sampling_random_seed = 42

# for train/test split
train_test_random_seed = 3    # for reproducibility, and "the number of the counting shall be three"
test_fraction = 0.20          # within range (0.0, 1.0)

In [57]:
# for model summary we can track how long it took to encode and train
time_encoding = None
time_training = None

## 3.2 - Convert Pandas Dataframe to 🤗 Dataset

In [58]:
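# select only the needed columns, down-sampling each class if a limit is set
#   (note: both branches produce a new dataframe, not a view of `df`)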
if (max_tweets_per_class is None):
    df_view = df[[content_column, class_column]]
else:
    df_view = pd.concat(
        [
            df.loc[df[class_column] == 1, [content_column, class_column]].sample(n=max_tweets_per_class, random_state=sampling_random_seed),
            df.loc[df[class_column] == 0, [content_column, class_column]].sample(n=max_tweets_per_class, random_state=sampling_random_seed)
        ], 
        ignore_index=True
    )

# convert to 🤗 Dataset object
dataset = Dataset.from_pandas(df_view) \
            .rename_columns({content_column: "text", class_column: "label"}) \
            .cast_column("label", ClassLabel(names=['authentic', 'troll']))

# check results
assert (dataset.features['label'].str2int('authentic') == 0) and (dataset.features['label'].str2int('troll') == 1), 'class labels mismatched'
dataset[0]

Casting the dataset:   0%|          | 0/10 [00:00<?, ?ba/s]

{'text': 'On #MuslimWomensDay, these women are empowering themselves and fighting back against Islamophobia. https://t.co/Y5NXaTHjZi',
 'label': 1}
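
One caveat on the per-class sampling above: `DataFrame.sample(n=...)` raises a `ValueError` when a class has fewer than `n` rows (sampling is without replacement by default). A minimal sketch of a guard follows, assuming we would rather take every row of a small class than fail. The helper name `sample_per_class` and the toy dataframe are hypothetical and not part of this project.

```python
import pandas as pd


def sample_per_class(frame, label_col, max_per_class, seed):
    """Sample at most `max_per_class` rows from each class of `label_col`."""
    parts = []
    for _, group in frame.groupby(label_col):
        # never request more rows than the class actually has
        n = min(max_per_class, len(group))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts, ignore_index=True)


# toy data: 7 rows of class 0 and 3 rows of class 1
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(10)],
    "label": [0] * 7 + [1] * 3,
})

balanced = sample_per_class(toy, "label", max_per_class=5, seed=42)
counts = balanced["label"].value_counts().to_dict()   # 5 rows of class 0, 3 of class 1
```

Unlike the cell above, the classes come out in sorted label order, which should not matter because the train/test split shuffles the rows.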

## 3.3 - Train/Test Split

In [59]:
dataset_split = dataset.train_test_split(
    test_size=test_fraction,
    seed=train_test_random_seed,
)

# check output
dataset_split

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

## 3.4 - Tokenize / Encode

In [61]:
# create the tokenizer to prepare text for model
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [62]:
# create a tokenizer function
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, return_tensors='pt')
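
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a toy illustration in plain Python. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and the pad id of 0 is taken from the `pad_token_id` shown in the model config later in this notebook. Every sequence ends up the same length, and `attention_mask` marks real tokens with 1 and padding with 0.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy version of fixed-length encoding: cut long inputs, pad short ones."""
    ids = list(token_ids[:max_length])      # truncation (simplified)
    mask = [1] * len(ids)                   # 1 = real token
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,   # 0 = padding
    }


short = pad_or_truncate([101, 2023, 102], max_length=5)
# short["input_ids"]      -> [101, 2023, 102, 0, 0]
# short["attention_mask"] -> [1, 1, 1, 0, 0]

long = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
# long["input_ids"]       -> [101, 1, 2, 3, 4]
```

A real tokenizer also keeps the closing special token when it truncates, which this sketch skips. Since no `max_length` is passed in the cell above, the padded length presumably falls back to the model's maximum (512 for this model, going by `max_position_embeddings`), so short tweets are mostly padding.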

In [63]:
time_encoding_start = pd.Timestamp.now()

# encode the training and test sets
#tokenized_datasets = dataset_split.map(tokenize_function, batched=True, fn_kwargs={'tokenizer': tokenizer})
tokenized_datasets = dataset_split.map(tokenize_function, batched=True)

time_encoding_stop = pd.Timestamp.now()
time_encoding = time_encoding_stop - time_encoding_start

  0%|          | 0/8 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [64]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 2000
    })
})

## 3.5 - Model

In [65]:
# create the model
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name,
    **extra_model_args,
)

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifi

In [66]:
# setup the training arguments
training_args = TrainingArguments(
    **extra_train_args
)

In [67]:
# setup training metric
metric = evaluate.load('accuracy')   # TODO -> study this more

def compute_metrics(eval_pred):    # TODO -> convert to pure function
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)    # TODO -> study this more
    return metric.compute(predictions=predictions, references=labels)
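
As a note on the `np.argmax` step: the model returns one row of logits per tweet, with one column per class, and the predicted class is the column with the largest value. A small worked example with made-up logits follows (computing accuracy directly with NumPy rather than through `evaluate`).

```python
import numpy as np

# made-up logits for four tweets; columns are [authentic, troll]
logits = np.array([
    [ 2.1, -0.3],
    [ 0.2,  1.7],
    [-1.0,  0.4],
    [ 0.9,  0.8],
])
labels = np.array([0, 1, 0, 0])

# index of the largest logit in each row = predicted class
predictions = np.argmax(logits, axis=-1)            # array([0, 1, 1, 0])

# accuracy = fraction of predictions that match the labels
accuracy = float((predictions == labels).mean())    # 0.75
```

No softmax is needed before the argmax, since softmax does not change which column is largest.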

In [68]:
time_training_start = pd.Timestamp.now()

# setup the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    compute_metrics=compute_metrics,
)

# execute the training
trainer.train()

time_training_stop = pd.Timestamp.now()
time_training = time_training_stop - time_training_start

print(time_training)

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 8000
  Num Epochs = 2
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2000
  Number of trainable parameters = 66955010


Step,Training Loss
500,0.4355
1000,0.3653
1500,0.2584
2000,0.2295


Saving model checkpoint to ../data/models/dist-test1/checkpoint-500
Configuration saved in ../data/models/dist-test1/checkpoint-500/config.json
Model weights saved in ../data/models/dist-test1/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ../data/models/dist-test1/checkpoint-1000
Configuration saved in ../data/models/dist-test1/checkpoint-1000/config.json
Model weights saved in ../data/models/dist-test1/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to ../data/models/dist-test1/checkpoint-1500
Configuration saved in ../data/models/dist-test1/checkpoint-1500/config.json
Model weights saved in ../data/models/dist-test1/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to ../data/models/dist-test1/checkpoint-2000
Configuration saved in ../data/models/dist-test1/checkpoint-2000/config.json
Model weights saved in ../data/models/dist-test1/checkpoint-2000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




0 days 00:12:25.324531
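
The step counts in the log above can be reproduced with a little arithmetic. The numbers come from the log; the only assumption is that `save_steps` was left at 500, which matches the checkpoint folders written during training.

```python
import math

num_examples = 8000            # "Num examples" in the log
per_device_batch_size = 8      # "Instantaneous batch size per device"
num_epochs = 2                 # set via extra_train_args
save_steps = 500               # assumed (not set explicitly in this notebook)

steps_per_epoch = math.ceil(num_examples / per_device_batch_size)   # 1000
total_steps = steps_per_epoch * num_epochs                          # 2000

# steps at which a checkpoint is written
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))
# -> [500, 1000, 1500, 2000]
```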


## 3.6 - Save fine-tuned model

In [73]:
trainer.save_model()    # defaults to self.args.output_dir

Saving model checkpoint to ../data/models/dist-test1
Configuration saved in ../data/models/dist-test1/config.json
Model weights saved in ../data/models/dist-test1/pytorch_model.bin


## 3.7 - Evaluate fine-tuned model

In [75]:
# if evaluating immediately after fine-tuning
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 8


{'eval_loss': 0.4639328718185425,
 'eval_accuracy': 0.8565,
 'eval_runtime': 30.43,
 'eval_samples_per_second': 65.725,
 'eval_steps_per_second': 8.216,
 'epoch': 2.0}
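
As a quick sanity check, the throughput figures in this output follow from the example count, batch size, and runtime, and the accuracy corresponds to a whole number of correct predictions. The inputs below are copied from the output above (the runtime is already rounded there).

```python
num_examples = 2000
batch_size = 8
eval_runtime = 30.43      # seconds, as reported
eval_accuracy = 0.8565

num_batches = -(-num_examples // batch_size)                      # ceiling division -> 250
samples_per_second = round(num_examples / eval_runtime, 3)        # 65.725
steps_per_second = round(num_batches / eval_runtime, 3)           # 8.216
num_correct = round(eval_accuracy * num_examples)                 # 1713
```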

In [77]:
predictions = trainer.predict(tokenized_datasets['test'])

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2000
  Batch size = 8


In [81]:
predictions.metrics

{'test_loss': 0.4639328718185425,
 'test_accuracy': 0.8565,
 'test_runtime': 30.4455,
 'test_samples_per_second': 65.691,
 'test_steps_per_second': 8.211}

In [74]:
# reload model, if evaluation is being performed separately from training
model_dir = "../data/models/dist-test1"

model = AutoModelForSequenceClassification.from_pretrained(model_dir)

loading configuration file ../data/models/dist-test1/config.json
Model config DistilBertConfig {
  "_name_or_path": "../data/models/dist-test1",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

loading weights file ../data/models/dist-test1/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at ../data/models/dist-test1.

In [None]:
# TODO - set up eval for re-loaded model
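
One possible shape for this TODO, as an untested sketch rather than a settled approach: wrap the reloaded model in a fresh `Trainer` and call `evaluate()`. The helper name `evaluate_reloaded_model` and the separate `output_dir` argument are assumptions. In this notebook `eval_dataset` would be `tokenized_datasets['test']` and `compute_metrics` the function defined in section 3.5.

```python
def evaluate_reloaded_model(model_dir, eval_dataset, compute_metrics, output_dir):
    """Reload a saved sequence-classification model and evaluate it."""
    # imported here so that defining the helper does not require the library
    from transformers import (
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    model = AutoModelForSequenceClassification.from_pretrained(model_dir)

    # output_dir is needed to build TrainingArguments; nothing is trained here
    args = TrainingArguments(output_dir=output_dir)

    trainer = Trainer(
        model=model,
        args=args,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )
    return trainer.evaluate()
```

Note that `trainer.save_model()` above only saved the model (no tokenizer was passed to the `Trainer`), so evaluating in a fresh session would also mean re-creating the tokenizer from `pretrained_model_name` and re-encoding the test set.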

TODO: set up a way to archive the saved model files.

For now:
`tar -czvf dist-test1.tar.gz --exclude='*checkpoint*' dist-test1`
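
A rough Python counterpart to the `tar` command, in case the archiving ends up living in the notebook. This is a sketch using the standard-library `tarfile` module; `archive_model` and `skip_checkpoints` are hypothetical names, and the exclusion rule (skip any entry whose name contains "checkpoint") only approximates the `--exclude='*checkpoint*'` pattern. The demo runs against a throwaway folder, not the real model directory.

```python
import os
import tarfile
import tempfile


def skip_checkpoints(member):
    """Return None to leave an entry (and anything under it) out of the archive."""
    if "checkpoint" in os.path.basename(member.name):
        return None
    return member


def archive_model(model_dir, archive_path):
    """Write model_dir to a .tar.gz, excluding checkpoint folders."""
    arcname = os.path.basename(os.path.normpath(model_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_dir, arcname=arcname, filter=skip_checkpoints)


# demo on a temporary folder that mimics the saved-model layout
with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "dist-test1")
    os.makedirs(os.path.join(model_dir, "checkpoint-500"))
    for rel in ("config.json", "pytorch_model.bin",
                os.path.join("checkpoint-500", "config.json")):
        with open(os.path.join(model_dir, rel), "w") as fh:
            fh.write("{}")

    archive_path = os.path.join(tmp, "dist-test1.tar.gz")
    archive_model(model_dir, archive_path)

    with tarfile.open(archive_path, "r:gz") as tar:
        archived_names = sorted(tar.getnames())
# archived_names -> ['dist-test1', 'dist-test1/config.json', 'dist-test1/pytorch_model.bin']
```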