# Tweet Turing Test: Detecting Disinformation on Twitter  

|          | Group #2 - Disinformation Detectors                     |
|---------:|---------------------------------------------------------|
| Members  | John Johnson, Katy Matulay, Justin Minnion, Jared Rubin |
| Notebook | `xx_fine_tuner.ipynb`                                   |
| Purpose  | A notebook to fine-tune BERT models.                    |

(todo: description)

Based on the Hugging Face fine-tuning tutorial: https://huggingface.co/docs/transformers/training

# 1 - Setup

In [2]:
# imports from Python standard library
import os

# imports requiring installation
#   connection to Google Cloud Storage
from google.cloud import storage            # pip install google-cloud-storage
from google.oauth2 import service_account   # pip install google-auth

#  data science packages
import numpy as np
import pandas as pd

# 🤗 (huggingface) packages
import evaluate
from datasets import Dataset, ClassLabel
from transformers import AutoTokenizer, BertTokenizerFast, BertweetTokenizer, DistilBertTokenizerFast, RobertaTokenizerFast, XLMRobertaTokenizerFast
from transformers import AutoModelForSequenceClassification, BertForSequenceClassification
from transformers import TrainingArguments, Trainer

In [3]:
# imports from tweet_turing.py
import tweet_turing as tur      # note - different import approach from prior notebooks

# imports from tweet_turing_paths.py
from tweet_turing_paths import local_data_paths, local_snapshot_paths, gcp_data_paths, \
    gcp_snapshot_paths, gcp_project_name, gcp_bucket_name, gcp_key_file

In [4]:
# pandas options
pd.set_option('display.max_colwidth', None)

## Local or Cloud?

Decide here whether to run the notebook with local data or GCP bucket data:
 - If the working directory of this notebook has a "../data/" folder with data loaded (e.g. working on a local computer, or data files copied to a cloud VM), use the "local files" option and comment out the "gcp bucket files" option.
 - If this notebook is being run from a GCP VM (preferably in the `us-central1` location), use the "gcp bucket files" option and comment out the "local files" option.

In [5]:
# option: local files
local_or_cloud: str = "local"   # comment/uncomment this line or next

# option: gcp bucket files
#local_or_cloud: str = "cloud"   # comment/uncomment this line or previous

# don't comment/uncomment for remainder of cell
if (local_or_cloud == "local"):
    data_paths = local_data_paths
    snapshot_paths = local_snapshot_paths
elif (local_or_cloud == "cloud"):
    data_paths = gcp_data_paths
    snapshot_paths = gcp_snapshot_paths
else:
    raise ValueError("Variable 'local_or_cloud' can only take on one of two values, 'local' or 'cloud'.")
    # subsequent cells will not do this final "else" check

In [6]:
# this cell only needs to run its code if local_or_cloud=="cloud"
#   (though it is harmless if run when local_or_cloud=="local")
gcp_storage_client: storage.Client = None
gcp_bucket: storage.Bucket = None

if (local_or_cloud == "cloud"):
    gcp_storage_client = tur.get_gcp_storage_client(project_name=gcp_project_name, key_file=gcp_key_file)
    gcp_bucket = tur.get_gcp_bucket(storage_client=gcp_storage_client, bucket_name=gcp_bucket_name)

# 2 - Load Dataset

This notebook starts from the ten-percent sample, with NLP preprocessing already completed in notebook **`04_nlp_preprocess.ipynb`**.

In [7]:
# note this cell requires package `pyarrow` to be installed in environment
parq_filename: str = "data_sample_ten_percent_NLP_preprocessed.parquet.gz"
parq_path: str = f"{snapshot_paths['parq_snapshot']}{parq_filename}"

if (local_or_cloud == "local"):
    df = pd.read_parquet(parq_path, engine='pyarrow')
elif (local_or_cloud == "cloud"):
    df = tur.get_gcp_object_from_parq_as_df(bucket=gcp_bucket, object_name=parq_path)

In [8]:
df.head(3)

Unnamed: 0,external_author_id,author,content,region,language,following,followers,updates,post_type,is_retweet,...,has_url,emoji_text,emoji_count,publish_date,class,following_ratio,class_numeric,RUS_lett_count,content_demoji,content_no_emoji
0,23785050,radiowoody,"To live dangerously on Friday the 13th, we're doing the radio show from the UNLUCKIEST place on earth! The @TennesseeTitans Locker Room!",Nashville Tennessee,en,2585,5710,2,,0.0,...,0,[],0,2013-12-13 10:03:43+00:00,Verified,0.452635,0,0,"To live dangerously on Friday the 13th, we're doing the radio show from the UNLUCKIEST place on earth! The @TennesseeTitans Locker Room!","To live dangerously on Friday the 13th, we're doing the radio show from the UNLUCKIEST place on earth! The @TennesseeTitans Locker Room!"
1,59020162,matthewpouliot,@legsanity I like it. Almost like a free Gio. Pujols is still about as good of a bet as Gonzalez the rest of the way.,Florida,en,999,12637,0,replied_to,0.0,...,0,[],0,2015-04-26 20:13:58+00:00,Verified,0.079047,0,0,@legsanity I like it. Almost like a free Gio. Pujols is still about as good of a bet as Gonzalez the rest of the way.,@legsanity I like it. Almost like a free Gio. Pujols is still about as good of a bet as Gonzalez the rest of the way.
2,1656024374,IMISSOBAMA,Man servants can have a good purpose as long as they come with cash and don't touch me ever.,United States,en,473,760,4122,RETWEET,1.0,...,0,[],0,2016-12-24 13:12:00+00:00,Troll,0.621551,1,0,Man servants can have a good purpose as long as they come with cash and don't touch me ever.,Man servants can have a good purpose as long as they come with cash and don't touch me ever.


In [9]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 362314 entries, 0 to 362313
Data columns (total 24 columns):
 #   Column              Non-Null Count   Dtype              
---  ------              --------------   -----              
 0   external_author_id  362314 non-null  string             
 1   author              362314 non-null  string             
 2   content             362314 non-null  string             
 3   region              344249 non-null  string             
 4   language            362314 non-null  category           
 5   following           362314 non-null  uint64             
 6   followers           362314 non-null  uint64             
 7   updates             362314 non-null  uint64             
 8   post_type           154729 non-null  category           
 9   is_retweet          362314 non-null  float64            
 10  account_category    362314 non-null  category           
 11  tweet_id            362314 non-null  string             
 12  tco1_step1      

In [10]:
df['class'].unique()

['Verified', 'Troll']
Categories (2, object): ['Troll', 'Verified']

In [11]:
df.loc[df['emoji_count'] > 3, ['content', 'content_demoji', 'content_no_emoji', 'emoji_count']].sample(5, random_state=3)

Unnamed: 0,content,content_demoji,content_no_emoji,emoji_count
232301,RT DebAlwaystrump: Congress looking CORRUPT♨ 2 America😠thinks we are STUPID⤵ MUELLER is a DIRTY COP🚨 Russia was 4 … https://t.co/sVXXXBIjSh,RT DebAlwaystrump: Congress looking CORRUPT:hot springs: 2 America:angry face:thinks we are STUPID:right arrow curving down: MUELLER is a DIRTY COP:police car light: Russia was 4 … https://t.co/sVXXXBIjSh,RT DebAlwaystrump: Congress looking CORRUPT 2 Americathinks we are STUPID MUELLER is a DIRTY COP Russia was 4 … https://t.co/sVXXXBIjSh,4
203651,RT Laura_K69: GOOD LUCK!! 🤣🤣🤣🤣 https://t.co/pc4FYSEscE,RT Laura_K69: GOOD LUCK!! :rolling on the floor laughing::rolling on the floor laughing::rolling on the floor laughing::rolling on the floor laughing: https://t.co/pc4FYSEscE,RT Laura_K69: GOOD LUCK!! https://t.co/pc4FYSEscE,4
32502,"TRIGGER WARNING: Beware before reading my new @Townhallcom column, ""Dating Tips For Prominent Democrats."" It yields maximum kurtness. 💥🔥💥🔥💥🔥💥🤷🏼‍♂️🔥💥🔥💥🔥💥 https://t.co/9EGBRNlw1j","TRIGGER WARNING: Beware before reading my new @Townhallcom column, ""Dating Tips For Prominent Democrats."" It yields maximum kurtness. :collision::fire::collision::fire::collision::fire::collision::man shrugging: medium-light skin tone::fire::collision::fire::collision::fire::collision: https://t.co/9EGBRNlw1j","TRIGGER WARNING: Beware before reading my new @Townhallcom column, ""Dating Tips For Prominent Democrats."" It yields maximum kurtness. https://t.co/9EGBRNlw1j",14
186365,RT @BNPPARIBASOPEN: How does @EVesnina001 lose her #BNPPO17 🏆 midway through her champion's press conference? 🎥😂➡️ https://t.co/RbDT7nlRt7,RT @BNPPARIBASOPEN: How does @EVesnina001 lose her #BNPPO17 :trophy: midway through her champion's press conference?\n\n:movie camera::face with tears of joy::right arrow: https://t.co/RbDT7nlRt7,RT @BNPPARIBASOPEN: How does @EVesnina001 lose her #BNPPO17 midway through her champion's press conference?\n\n https://t.co/RbDT7nlRt7,4
303189,RT @funsized411: 💯💯💯💯💯 RT @TheMisterMarcus: I'm okay with Duke losing even if it ruins my bracket.,RT @funsized411: :hundred points::hundred points::hundred points::hundred points::hundred points: RT @TheMisterMarcus: I'm okay with Duke losing even if it ruins my bracket.,RT @funsized411: RT @TheMisterMarcus: I'm okay with Duke losing even if it ruins my bracket.,5


# 3 - Choose Dataset Fields and Model

## 3.1 - Set Args

To make the subsequent encoding/training code more modular, set as many args as we can within this cell.

In [23]:
# descriptive name for this fine-tuning run
run_name = "distilbert-base-unc-2k - 20feb2023"

# where to store the output of the model (as a subfolder of ../data/models/)
output_dir_name = 'distilbert-base-unc-2k'

# which columns from dataframe will be used
content_column = 'content_no_emoji'
class_column = 'class_numeric'          # Assumes: 0=authentic, 1=troll
pk_column = 'tweet_id'                  # Used to identify which tweets were used for fine tuning and exclude them from later testing

# create choices for pre-trained model
pretrained_models = {
    'bert-base-uncased': {                          # https://huggingface.co/bert-base-uncased
        'name': 'bert-base-uncased',
        'tokenizer': BertTokenizerFast,
        'model': BertForSequenceClassification,
    },
    # note: the remaining checkpoints use AutoModelForSequenceClassification so that each one
    #       loads its own architecture (DistilBERT, RoBERTa, ...) instead of being forced into BERT
    'distilbert-base-uncased': {                    # https://huggingface.co/distilbert-base-uncased
        'name': 'distilbert-base-uncased',
        'tokenizer': DistilBertTokenizerFast,
        'model': AutoModelForSequenceClassification,
    },
    'roberta-base': {                               # https://huggingface.co/roberta-base
        'name': 'roberta-base',
        'tokenizer': RobertaTokenizerFast,          # note: roberta-base is case sensitive
        'model': AutoModelForSequenceClassification,
    },
    'vinai/bertweet-base': {                        # https://huggingface.co/vinai/bertweet-base
        'name': 'vinai/bertweet-base',
        'tokenizer': BertweetTokenizer,             # note: bertweet-base is case sensitive
        'model': AutoModelForSequenceClassification,
    },
    'Twitter/twhin-bert-base': {                    # https://huggingface.co/Twitter/twhin-bert-base
        'name': 'Twitter/twhin-bert-base',
        'tokenizer': XLMRobertaTokenizerFast,       # twhin-bert's pre-training tokenizer is 'xlm-roberta-base' according to https://arxiv.org/pdf/2209.07562v1.pdf
        'model': AutoModelForSequenceClassification,
    },
}

# select the pre-trained model
pretrained_model_choice = 'bert-base-uncased'
pretrained_model = pretrained_models[pretrained_model_choice]

# these are passed on to tokenizer object as keyword args
common_tokenizer_args = {
    'padding': 'max_length', 
    'truncation': True, 
    'return_tensors': 'pt', 
    'max_length': 256,
}

# these are passed on to model object as keyword args
common_model_args = {
    'num_labels': 2,
    #'output_hidden_states': True,
}

# these are passed on to trainer object as keyword args
#   docs: https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments
common_train_args = {
    #### model output
    'run_name': run_name,
    'output_dir': f'../data/models/{output_dir_name}',
    'save_strategy': 'epoch',
    'save_total_limit': 1,
    #### training hyperparams
    'num_train_epochs': 5,
    # 'per_device_train_batch_size': 16,
    # 'per_device_eval_batch_size': 16,
    # 'warmup_steps': 500,
    # 'weight_decay': 0.01,
    #### evaluation during training
    # 'evaluation_strategy': 'steps',
    # 'eval_steps': 100,
    'evaluation_strategy': 'epoch',
    'logging_strategy': 'epoch',
    'log_level': 'warning',
}

# maximum tweets (per class) used for fine tuning but not including evaluation (set to None for no limit)
#  e.g. If this value is 5000, a maximum of 5000 troll and 5000 authentic tweets will be used
#       for a total of 10,000 tweets used for fine tuning. If `test_fraction` is 0.2, then 2,000 additional
#       tweets will be used for testing the fine tuned model, so 12,000 total tweets will be ingested.
max_tweets_per_class = 1000     # assumed to be less than total number of tweets per class in `df` (or else pandas yells)
sampling_random_seed = 42

# for train/test split
train_test_random_seed = 3    # for reproducibility, and "the number of the counting shall be three"
test_fraction = 0.20          # within range (0.0, 1.0)
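
The sampling arithmetic described in the comments above can be checked with a small worked example. This is only a sketch: the helper name `planned_sizes` is hypothetical, and it simply mirrors the "gross up" and test-size formulas used in sections 3.2 and 3.3, assuming two balanced classes.

```python
def planned_sizes(max_tweets_per_class, test_fraction, n_classes=2):
    """Mirror the notebook's sampling arithmetic (hypothetical helper, for illustration only)."""
    # tweets sampled per class, "grossed up" to leave room for the test set
    sampled_per_class = int(max_tweets_per_class * (1.0 + test_fraction))
    total = sampled_per_class * n_classes
    # absolute number of rows held out for testing, across all classes
    test = int(max_tweets_per_class * test_fraction * n_classes)
    return {
        "sampled_per_class": sampled_per_class,
        "total": total,
        "test": test,
        "train": total - test,
    }


# settings used in this notebook
sizes_this_run = planned_sizes(max_tweets_per_class=1000, test_fraction=0.20)

# the example from the comment in the args cell
sizes_comment_example = planned_sizes(max_tweets_per_class=5000, test_fraction=0.20)
```

With the values set in this cell, the split should contain 2,000 training rows and 400 test rows, which is what the `DatasetDict` output in section 3.3 reports.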

## 3.2 - Convert Pandas Dataframe to 🤗 Dataset

In [24]:
# for model summary we can track how long it took to encode and train
time_encoding = None
time_training = None

# select only the columns needed for fine tuning (note: pandas returns a copy here, not a view)
if (max_tweets_per_class is None):
    df_view = df[[content_column, class_column, pk_column]]
else:
    n_tweets = int(max_tweets_per_class * (1.0 + test_fraction))     # "gross up" the number of tweets ingested (see section 3.1 above)

    df_view = pd.concat(
        [
            df.loc[df[class_column] == 1, [content_column, class_column, pk_column]].sample(n=n_tweets, random_state=sampling_random_seed),
            df.loc[df[class_column] == 0, [content_column, class_column, pk_column]].sample(n=n_tweets, random_state=sampling_random_seed)
        ], 
        ignore_index=True
    )

# convert to 🤗 Dataset object
dataset = Dataset.from_pandas(df_view) \
            .rename_columns({content_column: "text", class_column: "label"}) \
            .cast_column("label", ClassLabel(names=['authentic', 'troll']))

# check results
assert (dataset.features['label'].str2int('authentic') == 0) and (dataset.features['label'].str2int('troll') == 1), 'class labels mismatched'
dataset[0]

Casting the dataset:   0%|          | 0/3 [00:00<?, ?ba/s]

{'text': 'On #MuslimWomensDay, these women are empowering themselves and fighting back against Islamophobia. https://t.co/Y5NXaTHjZi',
 'label': 1,
 'tweet_id': '846549212082909184'}

## 3.3 - Train/Test Split

In [25]:
if (max_tweets_per_class is None):
    test_size = test_fraction
else:
    test_size = int(max_tweets_per_class * test_fraction * 2)   # "2" for our two classes

dataset_split = dataset.train_test_split(
    test_size=test_size,
    shuffle=True,
    seed=train_test_random_seed,
    stratify_by_column='label'
)

# check output
dataset_split

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'tweet_id'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'tweet_id'],
        num_rows: 400
    })
})
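
To illustrate what `stratify_by_column='label'` is expected to guarantee, here is a plain-Python stand-in for a stratified split. It is not the 🤗 `datasets` implementation; the function name `stratified_split` and the proportional-rounding rule are assumptions made for this sketch, and the rounding is only exact when class sizes divide evenly, as they do here.

```python
import random
from collections import Counter


def stratified_split(labels, test_size, seed):
    """Split row indices so each class keeps its share in the test set (illustrative only)."""
    rng = random.Random(seed)

    # group row indices by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # each class contributes test rows in proportion to its size
        n_test = round(test_size * len(indices) / len(labels))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# same shape as this run: 1,200 troll (1) and 1,200 authentic (0) tweets, 400 held out
labels = [1] * 1200 + [0] * 1200
train_idx, test_idx = stratified_split(labels, test_size=400, seed=3)

test_counts = Counter(labels[i] for i in test_idx)
train_counts = Counter(labels[i] for i in train_idx)
```

Under these assumptions the test set holds 200 tweets of each class and the training set holds 1,000 of each, matching the 2,000 / 400 row counts shown above.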

## 3.4 - Tokenize / Encode

In [26]:
# create the tokenizer to prepare text for model
#tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
tokenizer = pretrained_model['tokenizer'].from_pretrained(pretrained_model['name'])

In [27]:
# create a tokenizer function
def tokenize_function(examples):    # todo -> convert to pure function
    return tokenizer(examples['text'], **common_tokenizer_args)
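
The effect of `'padding': 'max_length'` and `'truncation': True` can be pictured with a toy function. This sketch does not call the real tokenizer: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and a real tokenizer also keeps its special tokens (such as `[CLS]` and `[SEP]`) when truncating, which this toy ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy version of fixed-length encoding (not the Hugging Face implementation)."""
    # truncation: drop anything beyond max_length
    ids = list(token_ids[:max_length])
    # attention mask: 1 for real tokens, 0 for padding
    mask = [1] * len(ids)
    # padding: fill the remainder with the pad token id
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,
    }


short_example = pad_or_truncate([101, 7592, 102], max_length=5)
long_example = pad_or_truncate(list(range(1, 11)), max_length=4)
```

Every encoded tweet therefore ends up with the same length (256 in this notebook), which is why the encoded datasets gain `input_ids` and `attention_mask` columns of uniform size.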

In [28]:
time_encoding_start = pd.Timestamp.now()

# encode the training and test sets
#tokenized_datasets = dataset_split.map(tokenize_function, batched=True, fn_kwargs={'tokenizer': tokenizer})
tokenized_datasets = dataset_split.map(tokenize_function, batched=True)

time_encoding_stop = pd.Timestamp.now()
time_encoding = time_encoding_stop - time_encoding_start

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [29]:
print(time_encoding)
print(tokenized_datasets)

0 days 00:00:00.839275
DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'tweet_id', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label', 'tweet_id', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 400
    })
})


## 3.5 - Model

In [30]:
# create the model
# model = AutoModelForSequenceClassification.from_pretrained(
#     pretrained_model_name,
#     **extra_model_args,
# )

model = pretrained_model['model'].from_pretrained(
    pretrained_model['name'],
    **common_model_args,
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [31]:
# setup the training arguments
training_args = TrainingArguments(
    **common_train_args
)

In [32]:
# setup training / evaluation metric
#   Docs: https://huggingface.co/docs/evaluate/package_reference/main_classes#evaluate.combine
#   Each of these metrics corresponds to a script from huggingface, below are the links for each script.
#       accuracy:   https://huggingface.co/spaces/evaluate-metric/accuracy
#       f1:         https://huggingface.co/spaces/evaluate-metric/f1
#       precision:  https://huggingface.co/spaces/evaluate-metric/precision
#       recall:     https://huggingface.co/spaces/evaluate-metric/recall
metric_list = ['accuracy', 'f1', 'precision', 'recall']

metric = evaluate.combine(evaluations=metric_list)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
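
As a sanity check on what `compute_metrics` reports, the four metrics can be reproduced by hand in plain Python. The helper names below are hypothetical, and the confusion counts are inferred rather than logged: they assume 200 troll and 200 authentic tweets in the test set and treat troll (label 1) as the positive class, which is consistent with the epoch 1 evaluation line in the training log.

```python
def argmax_rows(logits):
    """Pick the highest-scoring class index for each row of logits."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def binary_metrics(predictions, references, positive=1):
    """Accuracy, precision, recall and F1 for a binary task (illustrative only)."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    correct = sum(p == r for p, r in pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# logits -> predicted class, as in compute_metrics
toy_predictions = argmax_rows([[0.2, 1.3], [2.0, -1.0]])

# inferred epoch 1 counts: 174 trolls caught, 26 missed, 56 false alarms, 144 correct authentic
references = [1] * 200 + [0] * 200
predictions = [1] * 174 + [0] * 26 + [1] * 56 + [0] * 144
epoch1 = binary_metrics(predictions, references)
```

These counts give accuracy 0.795, precision of about 0.7565, recall 0.87 and F1 of about 0.8093, the same values logged for epoch 1.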

In [33]:
time_training_start = pd.Timestamp.now()

# setup the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    compute_metrics=compute_metrics,
)

# execute the training
trainer.train()

time_training_stop = pd.Timestamp.now()
time_training = time_training_stop - time_training_start

print(time_training)



  0%|          | 0/1250 [00:00<?, ?it/s]

{'loss': 0.5135, 'learning_rate': 4e-05, 'epoch': 1.0}


  0%|          | 0/50 [00:00<?, ?it/s]

{'eval_loss': 0.49622851610183716, 'eval_accuracy': 0.795, 'eval_f1': 0.8093023255813954, 'eval_precision': 0.7565217391304347, 'eval_recall': 0.87, 'eval_runtime': 3.7264, 'eval_samples_per_second': 107.342, 'eval_steps_per_second': 13.418, 'epoch': 1.0}
{'loss': 0.3693, 'learning_rate': 3e-05, 'epoch': 2.0}


  0%|          | 0/50 [00:00<?, ?it/s]

{'eval_loss': 0.5828663110733032, 'eval_accuracy': 0.805, 'eval_f1': 0.8133971291866028, 'eval_precision': 0.7798165137614679, 'eval_recall': 0.85, 'eval_runtime': 3.7277, 'eval_samples_per_second': 107.306, 'eval_steps_per_second': 13.413, 'epoch': 2.0}
{'loss': 0.2102, 'learning_rate': 2e-05, 'epoch': 3.0}


  0%|          | 0/50 [00:00<?, ?it/s]

{'eval_loss': 0.798835813999176, 'eval_accuracy': 0.825, 'eval_f1': 0.8356807511737089, 'eval_precision': 0.7876106194690266, 'eval_recall': 0.89, 'eval_runtime': 3.6724, 'eval_samples_per_second': 108.921, 'eval_steps_per_second': 13.615, 'epoch': 3.0}
{'loss': 0.071, 'learning_rate': 1e-05, 'epoch': 4.0}


  0%|          | 0/50 [00:00<?, ?it/s]

{'eval_loss': 0.9535841941833496, 'eval_accuracy': 0.815, 'eval_f1': 0.8140703517587939, 'eval_precision': 0.8181818181818182, 'eval_recall': 0.81, 'eval_runtime': 3.7126, 'eval_samples_per_second': 107.742, 'eval_steps_per_second': 13.468, 'epoch': 4.0}
{'loss': 0.0237, 'learning_rate': 0.0, 'epoch': 5.0}


  0%|          | 0/50 [00:00<?, ?it/s]

{'eval_loss': 1.0114675760269165, 'eval_accuracy': 0.825, 'eval_f1': 0.8167539267015708, 'eval_precision': 0.8571428571428571, 'eval_recall': 0.78, 'eval_runtime': 3.6425, 'eval_samples_per_second': 109.815, 'eval_steps_per_second': 13.727, 'epoch': 5.0}
{'train_runtime': 338.8163, 'train_samples_per_second': 29.515, 'train_steps_per_second': 3.689, 'train_loss': 0.2375522357940674, 'epoch': 5.0}
0 days 00:05:39.287998
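
The step count and learning rates in the log above follow from the training setup. The sketch below assumes the `TrainingArguments` defaults that the args cell leaves untouched (initial learning rate 5e-5, linear decay, no warmup, batch size 8 on a single device); the function name `linear_lr` is hypothetical.

```python
import math


def linear_lr(step, total_steps, base_lr=5e-5):
    """Linear decay from base_lr to zero with no warmup (assumed default schedule)."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


train_rows, batch_size, epochs = 2000, 8, 5

# optimizer steps per epoch and in total
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

# learning rate at the end of each epoch
lr_by_epoch = {
    epoch: linear_lr(epoch * steps_per_epoch, total_steps)
    for epoch in range(1, epochs + 1)
}
```

This gives 250 steps per epoch and 1,250 steps overall, matching the progress bar, and learning rates of roughly 4e-05, 3e-05, 2e-05, 1e-05 and 0.0 at the end of epochs 1 to 5, matching the logged `learning_rate` values.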


## 3.6 - Save fine-tuned model

In [22]:
trainer.save_model()    # defaults to self.args.output_dir

Saving model checkpoint to ../data/models/distilbert-base-unc-2k
Configuration saved in ../data/models/distilbert-base-unc-2k\config.json
Model weights saved in ../data/models/distilbert-base-unc-2k\pytorch_model.bin


## 3.7 - Evaluate fine-tuned model

In [23]:
# if evaluating immediately after fine-tuning
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 400
  Batch size = 8


  0%|          | 0/50 [00:00<?, ?it/s]

{'eval_loss': 0.7843887805938721,
 'eval_accuracy': 0.815,
 'eval_f1': 0.809278350515464,
 'eval_precision': 0.8870056497175142,
 'eval_recall': 0.7440758293838863,
 'eval_runtime': 3.9316,
 'eval_samples_per_second': 101.741,
 'eval_steps_per_second': 12.718,
 'epoch': 3.0}

In [27]:
predictions = trainer.predict(tokenized_datasets['test'])

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 400
  Batch size = 8


  0%|          | 0/50 [00:00<?, ?it/s]

In [28]:
predictions.metrics

{'test_loss': 0.8962348699569702,
 'test_accuracy': 0.8,
 'test_runtime': 3.8037,
 'test_samples_per_second': 105.162,
 'test_steps_per_second': 13.145}

In [74]:
# reload model, if evaluation is being performed separately from training
model_dir = "../data/models/dist-test1"

model = AutoModelForSequenceClassification.from_pretrained(model_dir)

loading configuration file ../data/models/dist-test1/config.json
Model config DistilBertConfig {
  "_name_or_path": "../data/models/dist-test1",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

loading weights file ../data/models/dist-test1/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at ../data/models/dist-test1.

In [None]:
# TODO - set up evaluation for the re-loaded model

TODO: set up a way to archive the saved model files.

For now:
`tar -czvf dist-test1.tar.gz --exclude='*checkpoint*' dist-test1`
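
One possible way to do the same thing from inside the notebook is Python's standard `tarfile` module. This is a sketch, not a settled approach: the function name `archive_model_dir` is hypothetical, and the `fnmatch` filter only approximates the behavior of `tar --exclude='*checkpoint*'`.

```python
import fnmatch
import os
import tarfile


def archive_model_dir(model_dir, archive_path, exclude_pattern="*checkpoint*"):
    """Write model_dir to a .tar.gz, skipping members whose path matches exclude_pattern."""
    model_dir = os.path.normpath(str(model_dir))
    archive_path = str(archive_path)

    def skip_excluded(tarinfo):
        # returning None drops the member (and, for a directory, everything under it)
        if fnmatch.fnmatch(tarinfo.name, exclude_pattern):
            return None
        return tarinfo

    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the folder name, like running tar from the parent directory
        tar.add(model_dir, arcname=os.path.basename(model_dir), filter=skip_excluded)
    return archive_path


# example call (paths follow the notebook's conventions):
# archive_model_dir("../data/models/dist-test1", "../data/models/dist-test1.tar.gz")
```

Intermediate checkpoint folders are left out of the archive, so only the final saved model files (for example `config.json` and `pytorch_model.bin`) are packaged.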