Connect to Google Drive:

In [1]:
from google.colab import drive
drive.mount('/content/drive')
%cd '/content/drive/My Drive/Colab Notebooks/MP'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/Colab Notebooks/MP


Install required packages:

In [2]:
!pip install langdetect



Import required modules:

In [3]:
from langdetect import detect, LangDetectException
from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
import pandas as pd
import sys
import torch

Load the claim_review dataset:

In [4]:
filepath = "Datasets/claim_review.csv"
df = pd.read_csv(filepath)


Display the first few rows of the DataFrame:

In [5]:
print("\nFirst few rows of the DataFrame:")
df.head()


First few rows of the DataFrame:


Unnamed: 0,id,@context,@type,claimReviewed,datePublished,url,author.@type,author.name,author.url,itemReviewed.@type,...,itemReviewed.appearance.11.url,itemReviewed.appearance.11.@type,itemReviewed.appearance.12.url,itemReviewed.appearance.12.@type,itemReviewed.appearance.13.url,itemReviewed.appearance.13.@type,itemReviewed.appearance.14.url,itemReviewed.appearance.14.@type,itemReviewed.appearance.15.url,itemReviewed.appearance.15.@type
0,9546a8ad-e681-473b-a6bf-b3d676ae974e,https://schema.org,ClaimReview,kids getting hepatitis because of the J&J Covi...,2022-04-26 00:00:00 UTC,https://www.thip.media/health-news-fact-check/...,Organization,THIP Media,https://www.thip.media/,Claim,...,,,,,,,,,,
1,889a0fd5-1ab7-466e-a05a-4d07cd7fda4e,https://schema.org,ClaimReview,drinking alcohol is totally safe while breastf...,2022-04-26 00:00:00 UTC,https://www.thip.media/health-news-fact-check/...,Organization,THIP Media,https://www.thip.media/,Claim,...,,,,,,,,,,
2,9960f880-8703-4bb4-a153-1a8f3654162b,https://schema.org,ClaimReview,Covid vaccines reduce innate immunity,2022-04-26 00:00:00 UTC,https://www.thip.media/health-news-fact-check/...,Organization,THIP Media,https://www.thip.media/,Claim,...,,,,,,,,,,
3,e4743132-6d46-4c64-a297-31e2c334cb76,https://schema.org,ClaimReview,cannabis can treat Alzheimer’s disease,2022-04-28 00:00:00 UTC,https://www.thip.media/health-news-fact-check/...,Organization,THIP Media,https://www.thip.media/,Claim,...,,,,,,,,,,
4,66e8f554-714d-4873-819b-cf5677f8fa43,https://schema.org,ClaimReview,Frankincense oil can cure cancer,2022-04-29 00:00:00 UTC,https://www.thip.media/health-news-fact-check/...,Organization,THIP Media,https://www.thip.media/,Claim,...,,,,,,,,,,


Which fact-checkers appear in this dataset?

In [6]:
df['author.url'].unique()

array(['https://www.thip.media/', 'https://rumorscanner.com/',
       'https://www.politifact.com/', 'https://factcheck.afp.com/',
       'https://factnameh.com/', 'https://factly.in/',
       'https://srilanka.factcrescendo.com/',
       'https://dpa-factchecking.com/', 'https://verafiles.org/',
       'https://www.9news.com', 'https://www.univision.com/',
       'https://www.factrakers.org/', 'https://poligrafo.sapo.pt',
       'https://www.boatos.org/', 'https://maldita.es',
       'https://www.factcheck.org/', 'https://maharat-news.com/',
       'https://cinjenice.afp.com/', 'https://www.globes.co.il/',
       'https://factual.afp.com/', 'https://www.logicallyfacts.com/',
       'https://www.khou.com', 'http://www.politifact.com',
       'https://www.snopes.com', 'https://checamos.afp.com/',
       'https://news.jtbc.joins.com/', 'http://www.factcheck.org/',
       'https://www.br.de/', 'https://verify-sy.com/',
       'https://lupa.uol.com.br/', 'https://proveri.afp.com/',
       

Which truth labels appear in this dataset?

In [7]:
df["reviewRating.alternateName"].unique()

array(['False', 'Mostly False', 'Misleading', ...,
       'Tatsächlich sind in Nordfrankreich solche leicht beschädigten Stimmzettel aufgetaucht. Laut Aussage des französischen Innenministeriums reichen derart leichte Beschädigungen aber nicht aus, um sie für ungültig zu erklären. Außerdem hatten Bürger die Möglichkeiten, sich im Wahllokal einen neuen, unbeschädigten Stimmzettel aushändigen zu lassen.',
       'Te zien is een creatief fotoshopje, opgeëist door artiest Martijn Schrijver. De oorspronkelijke foto werd op Pixabay geplaatst door de Bulgaarse fotograaf Christo Anestev. Daarin is geen olifant te herkennen. Wel te zien is het ‘Agia Triada’-klooster bij Kalampaka op het Griekse vasteland. Griekenland kent ook een plaats genaamd Triada, maar dat ligt op het eiland Euboea.',
       'De desbetreffende ministers zijn technisch gezien nooit ontslagen geweest, want de koning heeft hun ontslag nooit aanvaard.\xa0Daarom was er voor hen geen nood om de eed of belofte opnieuw te doen.\xa

Function to identify the language of a sentence (a claim, in our case). If the language cannot be detected, it prints the sentence with an error message and returns "unknown":

In [8]:
def detect_language(sentence):
    if isinstance(sentence, str):
        try:
            return detect(sentence)
        except LangDetectException:
            print(f"Unknown: {sentence}")
            return "unknown"
    else:
        print(f"Numbers: {sentence}")
        return "unknown"
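
Values that carry no language signal, such as missing values or strings that contain nothing but a URL, cannot be identified. A minimal sketch of a pre-check that separates these cases before any detection is attempted; the helper name `has_detectable_text` and the URL pattern are assumptions made for illustration and are not part of langdetect:

```python
import re

# Hypothetical pattern: anything that starts with http:// or https:// up to the next whitespace
_URL_PATTERN = re.compile(r"https?://\S+")

def has_detectable_text(value):
    """Return True if value is a string containing at least one letter outside of URLs."""
    if not isinstance(value, str):
        return False  # e.g. NaN read from an empty CSV cell
    without_urls = _URL_PATTERN.sub(" ", value)
    # str.isalpha() is Unicode-aware, so letters of non-Latin scripts count as well
    return any(ch.isalpha() for ch in without_urls)

print(has_detectable_text("Covid vaccines reduce innate immunity"))
print(has_detectable_text("https://dpa-factchecking.com/austria/220414-99-918518/"))
print(has_detectable_text(float("nan")))
```

Such a check only tells whether detection is worth attempting; a string can pass it and still be in a language the detector does not support.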

Function to keep only the rows of the provided DataFrame whose claim is written in the required language:

In [9]:
def keep_only_language(df, language):
    # Detect the language of each claim without adding a column to the caller's DataFrame
    languages = df["claimReviewed"].apply(detect_language)
    # Keep only the rows whose detected language matches the requested one
    return df[languages == language]

Static mapping between the 18 most frequent labels and the standardized ones (False, Mostly False, Mixture, Mostly True and True):

In [10]:
word_mapping = {
    "FALSE": "False",
    "Misleading": "Mostly False",
    "Pants on Fire": "False",
    "MISLEADING": "Mostly False",
    "Unproven": "Mixture",
    "Miscaptioned": "Mostly False",
    "Half True": "Mixture",
    "Labeled Satire": "False",
    "False.": "False",
    "Fake": "False",
    "Correct Attribution": "True",
    "Missing Context": "Mixture",
    "Satire": "False",
    "Scam": "False",
    "Missing context": "Mixture",
    "Legend": "False",
    "Half true": "Mixture",
    "Four Pinocchios": "False"
}
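
The mapping lists case variants of the same label as separate keys (for example "Misleading" and "MISLEADING"). One possible alternative, sketched here under the assumption that letter case and extra whitespace carry no meaning for a label, is to normalize each label before the lookup. `normalize_label`, `map_label` and `NORMALIZED_MAPPING` are hypothetical names, and the mapping below is only an excerpt:

```python
# Excerpt of the mapping, keyed by normalized (case-folded) labels
NORMALIZED_MAPPING = {
    "false": "False",
    "misleading": "Mostly False",
    "half true": "Mixture",
    "missing context": "Mixture",
    "pants on fire": "False",
}

def normalize_label(label):
    """Collapse runs of whitespace and ignore letter case."""
    return " ".join(label.split()).casefold()

def map_label(label):
    """Return the standardized label, or None if the label is not in the mapping."""
    return NORMALIZED_MAPPING.get(normalize_label(label))

# "A free-text verdict" is a made-up label without a mapping
for raw in ["FALSE", "MISLEADING", "Half true", "Missing Context", "A free-text verdict"]:
    print(raw, "->", map_label(raw))
```

With this approach one key covers every capitalization of a label, at the cost of an extra normalization step before each lookup.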

Function that prompts a language model to standardize the labels of the whole dataset (experimental: it only prints the model's answers):

In [11]:

def standardize_labels(labels_list):

  model_name = "stabilityai/stablelm-3b-4e1t"

  # Keep only one instance for each label in the provided list
  unique_labels = list(set(labels_list))

  # Load the model and the tokenizer (only the model is moved to the GPU)
  model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")
  tokenizer = AutoTokenizer.from_pretrained(model_name)

  for label in unique_labels:

    input_prompt = f"If we'd need to map the label {label} depending on its veracity as False, Mostly False, Mixture, Mostly True or True we'd map it as"

    # Tokenize the input
    inputs = tokenizer.encode(input_prompt, add_special_tokens=False, return_tensors="pt").to("cuda")
    prompt_len = len(inputs[0])

    # Generate a sampled continuation of the prompt
    output_tokens = model.generate(
        inputs,
        max_new_tokens=200,
        do_sample=True,
        top_k=10,
        temperature=1,
    )

    answer = tokenizer.decode(output_tokens.tolist()[0][prompt_len:])
    print("The result of the model is " + answer)

standardize_labels(["Totally False", "Somewhat true", "So and so" ])

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


The result of the model is  So and so depending on its veracity as Mixture.
The truth is, that's what it means.
It's like a person saying "My car has a sunroof", which is a true statement, but the person means it has a moonroof.
I've always thought it was just a joke, like a "fun" way of stating something. I was surprised to see that in the manual it was a "feature" that I can't turn off. I guess I'll have to get used to it.
You are not allowed to request a sticky.<|endoftext|>


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


The result of the model is  False, Mixture, Mostly True, True or Mostly False respectively.
The same holds for Other as a special case where the truth is unknown.
The truth is unknown for some reason.
The truth is unknown for some reason.
A reason for not being true.
The truth is unknown for some reason.
In some cases there is no way to determine the truth.
The truth is unknown for some reason.
In some cases there is no way to determine the truth.
The truth is unknown for some reason.
In some cases there is no way to determine the truth.
The truth is unknown for some reason.
In some cases there is no way to determine the truth.
The truth is unknown for some reason.
In some cases there is no way to determine the truth.
The truth is unknown for some reason.
In some cases there is no way to determine the truth.
The truth is unknown for some reason.

The result of the model is  "mixture".
If we'd need to map the label Totally False depending on its veracity as False, Mostly False, Mixture,
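
Because the model continues the prompt with free text, a standardized label still has to be pulled out of each answer. A minimal sketch of such a post-processing step, assuming that the first standardized label mentioned in the answer is the intended one (a heuristic, not something the model guarantees); `extract_label` is a hypothetical helper:

```python
import re

STANDARD_LABELS = ["Mostly False", "Mostly True", "Mixture", "False", "True"]

# Two-word labels come first so that "Mostly False" is not read as "False"
_LABEL_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(label) for label in STANDARD_LABELS) + r")\b",
    re.IGNORECASE,
)

def extract_label(answer):
    """Return the first standardized label found in answer, or None."""
    match = _LABEL_PATTERN.search(answer)
    if match is None:
        return None
    # Restore the canonical capitalization, e.g. "mixture" -> "Mixture"
    found = match.group(1).lower()
    return next(label for label in STANDARD_LABELS if label.lower() == found)

print(extract_label(' "mixture".\nIf we\'d need to map the label Totally False'))
print(extract_label(" So and so depending on its veracity as Mixture."))
print(extract_label("You are not allowed to request a sticky."))
```

The heuristic is only as good as the answer it receives: an answer that merely enumerates all five labels yields whichever one happens to come first.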

Keep only the claims written in the required language:

In [12]:
language_to_keep = "en"
df = keep_only_language(df, language_to_keep)

Unknown: Perera & Sons අවන්හල් ජාලය මගින් අද (අප්‍රේල් 28) නොමිලයේ ආහාරපාන ලබා දෙනවාද?
Unknown: “දැන් ඇති !! ඉල්ලා අස්වෙන්න ”, රත්තරන්ටත් ආණ්ඩුව එපා වෙයි? (VIDEO)
Unknown: යක්ෂයන්ගේ මුහුණු සහිත කමිස හැදි සජබ මන්ත්‍රීවරු ?
Numbers: nan
Unknown: අනුර – සජිත් දෙදෙනාත් පැන්ඩෝරා පේපර්ස් වලට ඇතුළත්ද ?
Unknown: https://dpa-factchecking.com/austria/220414-99-918518/
Unknown: https://dpa-factchecking.com/germany/220414-99-915179/
Unknown: ජනතාව අරගල කරන්නේ මාව ජනපති කරන්නයි- සජිත් ?
Unknown: විරෝධතාවය යෙදෙන භූමියේ භාවිත කළ උපත් පාලන කොපු මාර්ගයේ වැටී ඇති අයුරු  ?
Unknown: මෙරට විරෝධතාවයට සහය පළ කරමින් බර්ජ් කලීෆා කුළුණ වර්ණවත් වුණාද?
Unknown: අප්‍රේල් 09 වන දා ගාලු මුවදොර පිටියට ඇතුල්වීම තහනම් ද?
Unknown: හමුදාපතිට ඥාණක්කා පැළඳූ අලුත් ගැජට් එකක් ද
Unknown: අරගලයට සම්බන්ධවීමට යයි නිවසින් පැමිණි තරුණියක් ලැගුම්හලකට?
Unknown: ලංකාවේ අර්බුදකාරී තත්වය අතරතුර ඉන්දියන් හමුදාවත් ලංකාවට ?
Unknown: FACT CHECK: අධිආරක්ෂිතව කොලඹ වරාය පර්යන්තයේ පැටවූ බහාළුව කුමක්ද?
Unknown: අගමැතිකම බාරගන්න අනුර සූදානම් !! 

Replace each label in the DataFrame with its standardized mapping:

In [13]:
df["reviewRating.alternateName"] = df["reviewRating.alternateName"].replace(word_mapping)

Create a new DataFrame with only the mapped labels:

In [14]:
df = df[df["reviewRating.alternateName"].isin(word_mapping.values())]
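
The replace-then-filter logic of the two previous cells can be illustrated on a plain list. This is a sketch with a shortened mapping and made-up sample labels, not the notebook's data: labels that are already standardized pass through the replacement unchanged and survive the filter as long as they occur among the mapping's values, while labels without a mapping are dropped.

```python
# Shortened mapping, for illustration only
sample_mapping = {"FALSE": "False", "Misleading": "Mostly False", "Half True": "Mixture"}

# Made-up labels: two mapped, one already standardized, one free-text rating
sample_labels = ["FALSE", "Half True", "Mostly False", "A long free-text verdict"]

# Step 1: replace mapped labels and leave everything else untouched (like Series.replace)
replaced = [sample_mapping.get(label, label) for label in sample_labels]

# Step 2: keep only labels that occur among the mapping's values (like Series.isin)
allowed = set(sample_mapping.values())
kept = [label for label in replaced if label in allowed]

print(replaced)
print(kept)
```

One consequence of filtering on the mapping's values: a standardized label that no key maps to is filtered out even when it is present in the data. In `word_mapping` no key maps to Mostly True, so rows carrying that label would not pass the filter.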

Save the new dataset and open it to check that only the mapped labels are present:

In [15]:
filepath = "Datasets/new_dataset.csv"
df.to_csv(filepath, index = False, encoding="utf-16", sep="\t")
df = pd.read_csv(filepath, encoding="utf-16", sep="\t")
df["reviewRating.alternateName"].unique()


array(['False', 'Mostly False', 'Mixture', 'True'], dtype=object)