<a href="https://colab.research.google.com/github/devindatt/NLP_Transfer_Learning/blob/main/MMAI_894_Assignment_Ex3_TL_DevinDatt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Info and Instructions

## 1 Your Objective (**Read Carefully**)
The year is 2051, fake news has drastically destabilized our political system.
70% of Canadians believe at least one of the major conspiracy theories circulating the country at any given time.

A new type of political party has developed (lets call them Party Q), that relies heavily on fake news and country division and fear
to win the support of voters. 
Party Q has been gaining major traction in Canadian politics at all levels, 
and certain provinces are currently being occupied by Party Q Political candidates.

The current prime-minister is a stable moderate but there are fears around the rising support for the
a Party Q opposition even at the federal level.

You've been asked by the Canadian Government to build a proof of concept model to detect fake news. 
If successful this model will be deployed and applied to every political speech/comment/post made in this country
at all levels of government, it will be used for both real-time fact checking, and flagging of facts to be sent to proffessional fact checkers.

The fate of our nation rests in your capable hands.

The prime minister needs 3 results from your model:
1. Needs to flag false posts ("pants-fire" or "false") with a recall of at least 70% (these will be sent to proffessional fact checkers)
2. Needs to flag "true" posts with a precision of at least 95% (these will be used in real-time to verify facts during presentations)
3. Needs to flag "pants-fire" posts with a precision of at least 95% (these will be used in real-time to contradict statements during presentations)
(See dataset information for more clarification around labels)

## 2 Dataset Information:
"We consider six fine-grained labels for
the truthfulness ratings: pants-fire, false, barelytrue, half-true, mostly-true, and true. The distribution of labels in the LIAR dataset is relatively
well-balanced: except for 1,050 pants-fire cases,
the instances for all other labels range from 2,063
to 2,638." - https://arxiv.org/pdf/1705.00648.pdf

## 3 Submission Instructions (**Read Carefully**)
- To submit:
  1. you cannot edit this notebook directly. **Save a copy** to your drive, and make sure to identify yourself in the title using name and student number
  2. **Ensure** you have implemented all the nececessary functions
  3. **Provide answers** to the questions in the conclusion cell
  4. Unlike previous assignments, please **submit all three formats: .py, .ipynb, and html** (see https://torbjornzetterlund.com/how-to-save-a-google-colab-notebook-as-html/)
    - The notebook and html submissions should show the completion of your best performing run
  5. **Ensure** your nNotebook can _restart and run all_
  6. The mark will be assessed on the implementation of the functions with #TODO
  7. **Do not change anything outside the marked functions**  unless in the further exploration section
  8.  Do not use any additional libraries than the ones listed below (you may import additional modules from those libraries if needed)
  9. The mark is primarily based on correctness. However, since you are responsible for optimally tuning this model, meeting high performance is required, you should be able to at least match the results given in the paper.

Changing your run time in colab to GPU will speed up the training drastically


In [61]:
!pip install datasets
!pip install transformers
!pip install pandas



In [63]:
from datasets import load_dataset
import matplotlib.pyplot as plt
import tensorflow.keras as keras
import pandas as pd

try: # this is only working on the 2nd try in colab :)
  from transformers import DistilBertTokenizer, TFDistilBertModel
except Exception as err: # so we catch the error and import it again
  from transformers import DistilBertTokenizer, TFDistilBertModel

import numpy as np
import tensorflow.keras as keras
from tensorflow.keras.layers import Dense, Input, Dropout
from pandas_profiling import ProfileReport

dbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')


# Data Preparation

## Clean the text and your targets
Hints: 
1. Use the exploration cell to explore the data and identify cleaning steps
2. Inspect the tokenized sentences and ensure they make sense and can leverage already trained word embeddings
3. These resources will help you understand what type of cleaning will be required and how you can encode your text for the network:
    - a) Preprocessing: https://huggingface.co/transformers/preprocessing.html
    - b) Summary of tokenizers (DistilBERT uses WordPiece): https://huggingface.co/transformers/tokenizer_summary.html#wordpiece
4. Consider the text length, is this too big/small for DistilBERT? what impact would padding/truncation have?
5. In load data you generated a profiling report of this dataset, might be helpful to review that as well

In [64]:
def prepare_raw_data(df):
  raw_data = df.loc[:, ["id", "statement", "label"]]
  raw_data["label"] = raw_data["label"].astype('category')
  return raw_data

def load_data(save_dir="./"):
  dataset = load_dataset("liar")
  train = prepare_raw_data(pd.DataFrame(dataset["train"]))
  val = prepare_raw_data(pd.DataFrame(dataset["validation"]))
  test = prepare_raw_data(pd.DataFrame(dataset["test"]))
  return train, val, test
         
def clean_data(raw_data):
  # TODO: What data cleaning/filtering should you consider?
  # Hint: check for duplicates or contradictions
  # Hint: What is the minimum and maximum lengths of the statements?
  # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION



  clean_data = raw_data

  return clean_data

def extract_raw_text_and_y(clean_data):
  raw_text, raw_y = clean_data["statement"].values, clean_data["label"].values
  return raw_text, raw_y

def encode_text(text):
    # TODO: encode text using dbert_tokenizer
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION

    encoded_input = dbert_tokenizer(text, return_tensors="tf")
#    outputs = model(**inputs)
    print(encoded_input)
    input_ids = encoded_input['input_ids']
    attention_mask = encoded_input['attention_mask']

    return input_ids, attention_mask
#    return 0

#encoded_input = tokenizer("Hello, I'm a single sentence!")
#{'input_ids': [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102],
# 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
# 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


def prepare_target(raw_y):
    # TODO: convert labels to 0/1
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    # NOTE: labels map as follows: ['false', 'half-true', 'mostly-true', 'true', 'barely-true', 'pants-fire']
    # y should have:
    # column 0 = "pants-fire" or "false" posts
    # column 1 = "true" posts
    # column 2 = "pants-fire"

    df = pd.DataFrame(raw_y, columns=['labels'])
  
    df['column_0'] = np.select([df['labels'].isin([0,1])], [1])#, default=0)
    df['column_1'] = np.select([df['labels'].isin([2,3,4,5])], [1])#, default=0)
    df['column_2'] = np.select([df['labels'].eq(0)], [1])#, default=0)

    y = df.drop('labels', axis=1)

    return y


In [65]:
load_data()

Using custom data configuration default
Reusing dataset liar (/root/.cache/huggingface/datasets/liar/default/1.0.0/1a6abd9863f27194da30fcb66988477abfa3780df3b0ad1d0032979c48ec7918)


(               id                                          statement label
 0       2635.json  Says the Annies List political group supports ...     0
 1      10540.json  When did the decline of coal start? It started...     1
 2        324.json  Hillary Clinton agrees with John McCain "by vo...     2
 3       1123.json  Health care reform legislation is likely to ma...     0
 4       9028.json  The economic turnaround started at the end of ...     1
 ...           ...                                                ...   ...
 10264   5473.json  There are a larger number of shark attacks in ...     2
 10265   3408.json  Democrats have now become the party of the [At...     2
 10266   3959.json  Says an alternative to Social Security that op...     1
 10267   2253.json  On lifting the U.S. Cuban embargo and allowing...     0
 10268   1155.json  The Department of Veterans Affairs has a manua...     5
 
 [10269 rows x 3 columns],
               id                                          

In [66]:
train, val, test = load_data()

Using custom data configuration default
Reusing dataset liar (/root/.cache/huggingface/datasets/liar/default/1.0.0/1a6abd9863f27194da30fcb66988477abfa3780df3b0ad1d0032979c48ec7918)


In [71]:
raw_train = prepare_raw_data(train)
raw_val = prepare_raw_data(val)
raw_test = prepare_raw_data(test)

In [12]:
raw_train['label'].value_counts()

1    2123
0    1998
2    1966
3    1683
4    1657
5     842
Name: label, dtype: int64

In [72]:
train_raw_x, train_raw_y = extract_raw_text_and_y(clean_data(raw_train))
val_raw_x, val_raw_y = extract_raw_text_and_y(clean_data(raw_val))
test_raw_x, test_raw_y = extract_raw_text_and_y(clean_data(raw_test))

In [73]:
train_raw_x

array(['Says the Annies List political group supports third-trimester abortions on demand.',
       'When did the decline of coal start? It started when natural gas took off that started to begin in (President George W.) Bushs administration.',
       'Hillary Clinton agrees with John McCain "by voting to give George Bush the benefit of the doubt on Iran."',
       ...,
       'Says an alternative to Social Security that operates in Galveston County, Texas, has meant that participants will retire with a whole lot more money than under Social Security.',
       'On lifting the U.S. Cuban embargo and allowing travel to Cuba.',
       "The Department of Veterans Affairs has a manual out there telling our veterans stuff like, 'Are you really of value to your community?' You know, encouraging them to commit suicide."],
      dtype=object)

In [68]:
prepare_target(train_raw_y)

Unnamed: 0,column_0,column_1,column_2
0,1,0,1
1,1,0,0
2,0,1,0
3,1,0,1
4,1,0,0
...,...,...,...
10264,0,1,0
10265,0,1,0
10266,1,0,0
10267,1,0,1


In [75]:
prepare_target(val_raw_y)

Unnamed: 0,column_0,column_1,column_2
0,0,1,0
1,0,1,0
2,1,0,1
3,1,0,0
4,1,0,0
...,...,...,...
1279,1,0,0
1280,0,1,0
1281,0,1,0
1282,1,0,1


In [74]:
prepare_target(test_raw_y)

Unnamed: 0,column_0,column_1,column_2
0,0,1,0
1,1,0,1
2,1,0,1
3,1,0,0
4,0,1,0
...,...,...,...
1278,1,0,0
1279,0,1,0
1280,0,1,0
1281,0,1,0


In [76]:
raw_val = prepare_raw_data(val)
raw_val.head()

Unnamed: 0,id,statement,label
0,12134.json,We have less Americans working now than in the...,4
1,238.json,"When Obama was sworn into office, he DID NOT u...",5
2,7891.json,Says Having organizations parading as being so...,0
3,8169.json,Says nearly half of Oregons children are poor.,1
4,929.json,On attacks by Republicans that various program...,1


In [None]:
raw_test = prepare_raw_data(test)
raw_test.head()

Unnamed: 0,id,statement,label
0,11972.json,Building a wall on the U.S.-Mexico border will...,3
1,11685.json,Wisconsin is on pace to double the number of l...,0
2,11096.json,Says John McCain has done nothing to help the ...,0
3,5209.json,Suzanne Bonamici supports a plan that will cut...,1
4,9524.json,When asked by a reporter whether hes at the ce...,5


In [None]:
test_text = "Hello, I'm a single sentence!"
input_ids, attention_mask = encode_text(test_text)
print(input_ids)
print(attention_mask)

{'input_ids': <tf.Tensor: shape=(1, 11), dtype=int32, numpy=
array([[ 101, 7592, 1010, 1045, 1005, 1049, 1037, 2309, 6251,  999,  102]],
      dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 11), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>}
tf.Tensor([[ 101 7592 1010 1045 1005 1049 1037 2309 6251  999  102]], shape=(1, 11), dtype=int32)
tf.Tensor([[1 1 1 1 1 1 1 1 1 1 1]], shape=(1, 11), dtype=int32)


# Modelling

## Build and Train Model

Resources:
- DistilBERT paper: https://arxiv.org/abs/1910.01108
- DistilBERT Tensorflow Documentation: https://huggingface.co/transformers/model_doc/distilbert.html#tfdistilbertmodel

In [None]:
def build_model(base_model, trainable=False, params={}):
    # TODO: build the model, with the option to freeze the parameters in distilBERT
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    # Hint 1: the cls token (token for classification in bert / distilBERT)  corresponds to the first element in the sequence in DistilBERT
    # Hint 2: this guide may be helpful for parameter freezing: https://keras.io/guides/transfer_learning/
    # Hint 3: double check your number of parameters make sense
    # Hint 4: carefully consider your final layer activation and loss function

    # Refer to https://keras.io/api/layers/core_layers/input/
    inputs = Input(shape = (max_seq_len,), dtype='int64', name='inputs')
    masks  = Input(shape = (max_seq_len,), dtype='int64', name='masks')

    base_model.trainable = trainable

    dbert_output = base_model(inputs, attention_mask=masks)
    dbert_last_hidden_state = dbert_output.last_hidden_state

    # Any additional layers should go here
    # use the 'params' as a dictionary for hyper parameter to facilitate experimentation
    my_outputs = ???
    probs = Dense(3, ???)(my_outputs)

    model = keras.Model(inputs=[inputs, masks], outputs=probs)
    model.summary()
    return model



In [None]:
def compile_model(model):
    # TODO: compile the model, include relevant auc metrics when training
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    # Hint: you may want to read up on the "multi_label" parameter in the keras AUC metrics

    return model

In [None]:
def train_model(model, model_inputs_and_masks_train, model_inputs_and_masks_val,
    y_train, y_val, batch_size, num_epochs):
    # TODO: train the model
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION

    return model, history

In [None]:
def evaluate_model(model, model_inputs_and_masks_test, y_test):
    # TODO: evaluate the model
    # DO NOT CHANGE THE INPUTS OR OUTPUTS TO THIS FUNCTION
    # HINT: for pr_auc: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html 

    eval_dict = {
        "false": {
            "pr_auc": ???, "pr_auc_random_guess": ???, 
            "roc_auc": ???, "roc_auc_random_guess": ???, 
            "precision": ???, "recall": ???
        }, 
        "true": {
            "pr_auc": ???, "pr_auc_random_guess": ???, 
            "roc_auc": ???, "roc_auc_random_guess": ???, 
            "precision": ???, "recall": ???
        }, 
        "pants": {
            "pr_auc": ???, "pr_auc_random_guess": ???, 
            "roc_auc": ???, "roc_auc_random_guess": ???, 
            "precision": ???, "recall": ???
        }
    }
    return eval_dict

# Execution



In [77]:
## DO NOT Change
train, val, test = load_data()
train_raw_x, train_raw_y = extract_raw_text_and_y(clean_data(train))
val_raw_x, val_raw_y = extract_raw_text_and_y(clean_data(val))
test_raw_x, test_raw_y = extract_raw_text_and_y(clean_data(test))

train_input, train_mask = encode_text(train_raw_x)
train_y = prepare_target(train_raw_y)

val_input, val_mask = encode_text(val_raw_x)
val_y = prepare_target(val_raw_y)

test_input, test_mask = encode_text(test_raw_x)
test_y = prepare_target(test_raw_y)

train_model_inputs_and_masks = {
    'inputs' : train_input,
    'masks' : train_mask
}

val_model_inputs_and_masks = {
    'inputs' : val_input,
    'masks' : val_mask
}

test_model_inputs_and_masks = {
    'inputs' : test_input,
    'masks' : test_mask
}


Using custom data configuration default
Reusing dataset liar (/root/.cache/huggingface/datasets/liar/default/1.0.0/1a6abd9863f27194da30fcb66988477abfa3780df3b0ad1d0032979c48ec7918)


AssertionError: ignored


Use the cell below to execute and experiment with your model

In [None]:
dbert_model = TFDistilBertModel.from_pretrained('distilbert-base-uncased')
model, pretrained_weights = (TFDistilBertModel, 'distilbert-base-uncased')


model = build_model(dbert_model, params={})
#model = build_model(dbert_model, params={})

model = compile_model(model)
#model = compile_model(model)

#model          = train_model(model, train_model_inputs_and_masks, val_model_inputs_and_masks, train_y, val_y, batch_size, num_epochs)
model, history = train_model(model, train_model_inputs_and_masks, val_model_inputs_and_masks, train_y, val_y, batch_size, num_epochs)

eval_dict = evaluate_model(model, test_model_inputs_and_masks, test_y)
#eval_dict = evaluate_model(model, test_model_inputs_and_masks, test_y)

## Conclusions (TODO)
TODO: Make Your Final Conclusions About Your Model (Answer questions below, answer in this cell)
- a) What is driving your model's decisions?
- b) Is your model biased in some ways? If so how? 
- c) Does your model accomplish the objectives? If not, is your model useful and how can you justify this?

# Further exploration (REMOVE ALL CODE AFTER THIS CELL BEFORE SUBMISSION)
Any code after this is not evaluated, and must be removed before submission.
Leaving code below will result in losing marks.