<a href="https://colab.research.google.com/github/eugeneyan/visualizing-finetunes/blob/main/1_prep_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
%load_ext watermark
%watermark --conda -p torch,transformers,peft,datasets,sklearn

torch       : 2.3.1+cu121
transformers: 4.42.4
peft        : 0.11.1
datasets    : 2.19.1
sklearn     : 1.2.2

conda environment: n/a



In [1]:
import pandas as pd
import logging
import re

from collections import Counter
from datasets import load_dataset
from sklearn.model_selection import train_test_split

In [2]:
# Set up logger
logger = logging.getLogger('1-prep-data')
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
    force=True
)

logger.info('Running notebook to prep data')

2024-07-22 12:55:49 - INFO - Running notebook to prep data


In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

In [4]:
pd.set_option('display.max_colwidth', 1000)

## Prepare FIB data
- FIB contains one-sentence summaries on CNN/DM & XSUM news articles.
- Note: We exclude the CNN/Daily Mail data because its quality is pretty bad.
- https://huggingface.co/datasets/r-three/fib
![](images/fib.png)

In [5]:
fib_ds = load_dataset('r-three/fib', split='test')
fib_df = fib_ds.to_pandas()
logger.info(f'No. of rows in FIB: {len(fib_df):,}')

Downloading readme:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/7.11M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/3579 [00:00<?, ? examples/s]

2024-07-22 12:56:27 - INFO - No. of rows in FIB: 3,579


In [6]:
# Visualize the CNN/DM data
# fib_df.loc[fib_df['dataset'] == 'cnn_dm', ['input', 'list_choices', 'correct_choice']].head(5)
fib_df.loc[fib_df['dataset'] == 'cnn_dm'].head(5)

Unnamed: 0,id,input,correct_choice,list_choices,lbl,distractor_model,dataset
3122,b48858ca911327bc7cc4d6ec66e3be2d041513fc,"( cnn ) the american pharmacists association is discouraging its members from participating in executions . on monday , the group voted at its annual meeting to adopt a ban as an official policy , stating that `` such activities are fundamentally contrary to the role of pharmacists as healthcare providers . '' this bolsters the association 's previous positions to oppose the use of the term `` drug '' for chemicals used in lethal injection and to oppose laws that require or prohibit pharmacists from participation in lethal injection cases . the group acted this week because of increased public attention on lethal injection , said michelle spinnler , spokeswoman for the american pharmacists association . that spotlight includes a january supreme court decision to stay the execution for three death row inmates in oklahoma . this was prompted by clayton lockett 's execution by lethal injection nearly one year ago in which he writhed on a gurney for 43 minutes before he died from a hea...",<t> the american pharmacists association passed a new policy banning members from participating in lethal injections . </t> <t> pharmacists say role as health care providers conflicts with participation in lethal injection . </t> <t> the pharmacy association had already adopted a policy against lethal injection . </t>,"[<t> the american pharmacists association passed a new policy banning members from participating in lethal injections . </t> <t> pharmacists say role as health care providers conflicts with participation in lethal injection . </t> <t> the pharmacy association had already adopted a policy against lethal injection . </t>, <t> -lrb- cnn -rrb- the american pharmacists association is discouraging its members from participating in executions . 
</t> <t> on monday , the group voted at its annual meeting to adopt a ban as an official policy , stating that `` such activities are fundamentally contrary to the role of pharmacists as healthcare providers . </t> <t> '' she says . </t>]",0,banditsumm,cnn_dm
3123,ef181dab5f3e1ff34eeee1334201d111a6d6498b,"( cnn ) oprah 's in there . so 's bill murray , george clooney , scarlett johansson , jerry seinfeld , howard stern , tina fey , michael keaton and ray romano . on tuesday , `` the late show with david letterman '' announced some of the guests for the talk show host 's final month of broadcasts . the last `` late show '' will air wednesday , may 20 . among the notables are oprah winfrey , with whom letterman has had an on-and-off faux feud for years ; clooney , who 's starring in `` tomorrowland , '' which will be released on may 22 ; and stern , who 's always an engaging letterman guest . but longtime fans may be even more intrigued by the appearances of keaton , an old acquaintance who once shared a stage with letterman as players on mary tyler moore 's short-lived 1978 variety show , and murray , who was the very first guest on letterman 's old nbc show , `` late night with david letterman . '' steve martin , who 's taken part in some of the `` late show 's '' best bits , will a...",<t> `` the late show with david letterman '' concludes may 20 . </t> <t> letterman 's guests will include oprah winfrey and bill murray . </t> <t> stephen colbert takes over the slot september 8 . </t>,"[<t> `` the late show with david letterman '' concludes may 20 . </t> <t> letterman 's guests will include oprah winfrey and bill murray . </t> <t> stephen colbert takes over the slot september 8 . </t>, <t> so 's bill murray , george clooney , scarlett johansson , jerry seinfeld , howard stern , tina fey , michael keaton and ray romano . </t> <t> on tuesday , `` the late show with david letterman '' announced some of the guests for the talk show host 's final month of broadcasts . </t> <t> the last `` late show '' will air wednesday , may 20 . </t>]",0,banditsumm,cnn_dm
3124,b3506f2df5559e4a5469a5be78fee17ee7423cbc,"( cnn ) feeling so happy you just ca n't stand it ? you might want to pop some acetaminophen . a new study has found that acetaminophen , the main ingredient in tylenol , most forms of midol and more than 600 other medicines , reduces not only pain but pleasure , as well . the authors of the study , which was published this week in psychological science , say that it was already known that acetaminophen blunted psychological pain . but their new research led them to the conclusion that it also blunted joy -- in other words , that it narrowed the range of feelings experienced . `` this means that using tylenol or similar products might have broader consequences than previously thought , '' said geoffrey durso , a doctoral student in social psychology at ohio state university and the lead author of the study . `` rather than just being a pain reliever , acetaminophen can be seen as an all-purpose emotion reliever . '' the researchers tested their thesis by showing 82 college students...","<t> subjects taking acetaminophen reacted less strongly to both pleasant and unpleasant photos . </t> <t> each week , 52 million americans use the pain reliever . </t> <t> unknown whether other pain products produce the same effect . </t>","[<t> a new study has found that acetaminophen , the main ingredient in tylenol , most forms of midol and more than 600 other medicines , reduces not only pain but pleasure , as well . </t> <t> the authors of the study , which was published this week in psychological science , say that it was already known that acetaminophen blunted psychological pain . </t> <t> this group showed the same blunting of emotional reactions . </t>, <t> subjects taking acetaminophen reacted less strongly to both pleasant and unpleasant photos . </t> <t> each week , 52 million americans use the pain reliever . </t> <t> unknown whether other pain products produce the same effect . </t>]",1,banditsumm,cnn_dm
3125,fd6bc93561e36fea7ca9a4785f2f4fe2b16828ab,"( cnn ) love it or hate it , jared leto 's interpretation of the joker is an internet sensation . the oscar winner put on white makeup ( and a lot of tattoos this time ) to portray the clown prince of crime in the upcoming movie `` suicide squad . '' set for release august 5 , 2016 , `` suicide squad '' is based on the dc comics series and also stars will smith , margot robbie and viola davis . twitter users got their first look at leto in character friday night , and the memes started almost immediately . from comparisons to `` home alone '' to an imagining of ben affleck tatted up , people on social media put their photoshopping skills to work all weekend . which is your favorite ?",<t> leto will play the clown prince of crime in 2016 's `` suicide squad '' the first picture of leto in character led to a series of spoof photos . </t>,"[<t> the oscar winner put on white makeup -lrb- and a lot of tattoos this time -rrb- to portray the clown prince of crime in the upcoming movie `` suicide squad . </t> <t> '' set for release august 5 , 2016 , `` suicide squad '' is based on the dc comics series and also stars will smith , margot robbie and viola davis . </t> <t> which is your favorite ? </t>, <t> leto will play the clown prince of crime in 2016 's `` suicide squad '' the first picture of leto in character led to a series of spoof photos . </t>]",1,banditsumm,cnn_dm
3126,0b66cb7ebcfd16de6d6d7623036c2dd9161f032b,"( the hollywood reporter ) the original cast of twin peaks is backing david lynch in his salary standoff with showtime . the stars have teamed together for a video backing the show 's co-creator with a # savetwinpeaks campaign that says doing the revival without lynch is `` like pies without cherries , '' among other nods to the original drama series . sherilyn fenn , sheryl lee , james marshall , peggy lipton and other familiar faces from the series appear in the video . ( some members have also set up a facebook page . ) showtime renews ' shameless , ' orders ' happyish ' to series . lynch announced sunday that he was exiting showtime 's nine-episode revival over a salary dispute . he originally signed on to direct the project but noted that there was `` not enough money offered to do the script the way i felt needed to be done . '' showtime already had a deal in place with lynch and co-creator mark frost to bring back the cult hit with star kyle maclachlan for a run in 2016 , wi...",<t> `` twin peaks '' creator david lynch announced he was departing the showtime revival of the cult series sunday . </t> <t> cast members of the show posted a video pleading for him to return .</t>,"[<t> -lrb- the hollywood reporter -rrb- the original cast of twin peaks is backing david lynch in his salary standoff with showtime . </t> <t> sherilyn fenn , sheryl lee , james marshall , peggy lipton and other familiar faces from the series appear in the video . </t>, <t> `` twin peaks '' creator david lynch announced he was departing the showtime revival of the cult series sunday . </t> <t> cast members of the show posted a video pleading for him to return .</t>]",1,banditsumm,cnn_dm


In [7]:
# Only keep xsum data
fib_df = fib_df[fib_df['dataset'] == 'xsum']
logger.info(f'No. of rows in FIB: {len(fib_df):,}')

2024-07-22 12:59:57 - INFO - No. of rows in FIB: 3,122


In [8]:
fib_df[['input', 'list_choices', 'correct_choice']].head(5)

Unnamed: 0,input,list_choices,correct_choice
0,"Vehicles and pedestrians will now embark and disembark the Cowes ferry separately following Maritime and Coastguard Agency (MCA) guidance.\nIsle of Wight Council said its new procedures were in response to a resident's complaint.\nCouncillor Shirley Smart said it would ""initially result in a slower service"".\nOriginally passengers and vehicles boarded or disembarked the so called ""floating bridge"" at the same time.\nMs Smart, who is the executive member for economy and tourism, said the council already had measures in place to control how passengers and vehicles left or embarked the chain ferry ""in a safe manner"".\nHowever, it was ""responding"" to the MCA's recommendations ""following this complaint"".\nShe added: ""This may initially result in a slower service while the measures are introduced and our customers get used to the changes.""\nThe service has been in operation since 1859.","[ A new service on the Isle of Wight's chain ferry has been launched following a complaint from a resident., Passengers using a chain ferry have been warned crossing times will be longer because of new safety measures.]",Passengers using a chain ferry have been warned crossing times will be longer because of new safety measures.
1,"If you leave your mobile phone somewhere do you worry you will not be able to check it?\nIf any of this sounds familiar, there is a chance you could be spending too much time on social networks.\nAn exclusive online Newsbeat poll suggests that a quarter of 15 to 18-year-olds in the UK feel happier online than they do in real life.\nDr Radha from The Surgery on Radio 1 has dealt with patients who have displayed ""a lot of social anxiety"" because they are using social networks too much.\n""Being online can provoke a sense of 'I'm not good enough, everyone else is having an amazing life',"" she explained.\n""It doesn't give us a sense of reality and actually what you will find is most people are probably doing the same thing as you are.""\nThe survey, carried out last month, also suggests a third of 15 to 18-year-olds have met someone in person they originally met through social media.\nDr Radha has said it is important people carefully consider what information they share with the online ...","[ You may be worried about your health, but what if you are online?, Do you ever feel lonely, stressed or jealous when you are online?]","Do you ever feel lonely, stressed or jealous when you are online?"
2,"Speaking on TV, Maria Zakharova said Jews had told her they donated both to Mr Trump and Hillary Clinton.\nShe joked that American Jews were the best guide to US politics.\nThe diplomat's remarks caused shock. Anti-US propagandists in the last century peddled an idea that rich New York Jews controlled US politics.\nMs Zakharova was speaking on a chat show on Russian state TV at the weekend but her comments drew more attention after being picked up by media outlets on Thursday.\nShe said she had visited New York with an official Russian delegation at the time of the last UN General Assembly, in September.\n""I have a lot of friends and acquaintances there, of course I was interested to find out: how are the elections going, what are the American people's expectations?"" she said.\n""If you want to know what will happen in America, who do you need to talk to? You have to talk to the Jews, of course. It goes without saying.""\nAt this, the TV studio audience applauded loudly.\n""I went her...","[ The Russian foreign minister has said she has been ""settled"" by criticism from Jewish people for saying that the US election was a ""Jewish conspiracy""., A spokeswoman on Russian TV has said Jewish people in New York told her they had mainly backed Trump in the US election.]",A spokeswoman on Russian TV has said Jewish people in New York told her they had mainly backed Trump in the US election.
3,"A report by the organisation suggests men, women and children are being abused ""to eliminate public protest"".\nMany are subjected to virginity tests, rape and gang rape after arrest.\nEgypt's Interior Ministry said it would not comment until it had studied the report.\nThe study notes a surge in sexual violence after the Egyptian military takeover in July 2013.\nThe perpetrators are rarely held to account and the impunity points to a ""cynical political strategy aimed at silencing all opposition"".\nPolice, intelligence officers and members of the military are guilty of targeting male and female detainees, according to the report.\nAmong the victims are student demonstrators, human rights activists, gay people and children.\nStudent's ordeal\nI saw an officer who was grabbing a young woman by the breasts and I said to him: ""If you want to arrest her, then arrest her, but you have no right to touch her breasts.""\nHe grabbed me exactly as he had her, before calling two other police off...","[ Egyptian police are systematically abusing detainees, including women, in a campaign to end impunity, the Human Rights Watch says., Egyptian security forces are using sexual violence against detainees on a massive scale, it is reported.]","Egyptian security forces are using sexual violence against detainees on a massive scale, it is reported."
4,"Police in Australia and Europe were aware of a paedophile site called the Love Zone hidden in the so-called dark web.\nIt was protected by passwords, encryption and specialist software. Users were totally anonymous.\nThe images and videos there were particularly disturbing - showing the abuse of babies and very young children.\nMembers had to post increasingly graphic material to remain on the site. There were tens of thousands of accounts.\nOfficers with Task Force Argos in Australia knew the creator of the site used an unusual greeting - the word ""hiyas"".\nAfter exhaustively trawling chatrooms and forums in the open internet, they found a Facebook page of a man who used the same greeting.\nAlthough the Facebook page was fake, they identified a picture of a vehicle and that led them to a man called Shannon McCoole - a childcare worker in Adelaide.\nWhen officers went through his door, he was actually online running the site.\nThey took detailed photographs of McCoole's hands. This...","[One word and a freckle indirectly led to Huckle being tracked down., Police in the UK have uncovered a huge online paedophile network that was operating on the internet.]",One word and a freckle indirectly led to Huckle being tracked down.


In [9]:
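# Each list_choices entry holds one positive and one negative summary; we explode to one summary per row,
# strip stray whitespace, and drop duplicates (the same positive summary repeats across rows for a doc)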
fib_df = fib_df.explode('list_choices')
fib_df['list_choices'] = fib_df['list_choices'].apply(lambda x: x.strip())
fib_df = fib_df.drop_duplicates(subset=['input', 'list_choices'])
logger.info(f'No. of rows in FIB: {len(fib_df):,}')
fib_df[['input', 'list_choices', 'correct_choice']].head(5)

2024-07-22 13:01:28 - INFO - No. of rows in FIB: 3,534


Unnamed: 0,input,list_choices,correct_choice
0,"Vehicles and pedestrians will now embark and disembark the Cowes ferry separately following Maritime and Coastguard Agency (MCA) guidance.\nIsle of Wight Council said its new procedures were in response to a resident's complaint.\nCouncillor Shirley Smart said it would ""initially result in a slower service"".\nOriginally passengers and vehicles boarded or disembarked the so called ""floating bridge"" at the same time.\nMs Smart, who is the executive member for economy and tourism, said the council already had measures in place to control how passengers and vehicles left or embarked the chain ferry ""in a safe manner"".\nHowever, it was ""responding"" to the MCA's recommendations ""following this complaint"".\nShe added: ""This may initially result in a slower service while the measures are introduced and our customers get used to the changes.""\nThe service has been in operation since 1859.",A new service on the Isle of Wight's chain ferry has been launched following a complaint from a resident.,Passengers using a chain ferry have been warned crossing times will be longer because of new safety measures.
0,"Vehicles and pedestrians will now embark and disembark the Cowes ferry separately following Maritime and Coastguard Agency (MCA) guidance.\nIsle of Wight Council said its new procedures were in response to a resident's complaint.\nCouncillor Shirley Smart said it would ""initially result in a slower service"".\nOriginally passengers and vehicles boarded or disembarked the so called ""floating bridge"" at the same time.\nMs Smart, who is the executive member for economy and tourism, said the council already had measures in place to control how passengers and vehicles left or embarked the chain ferry ""in a safe manner"".\nHowever, it was ""responding"" to the MCA's recommendations ""following this complaint"".\nShe added: ""This may initially result in a slower service while the measures are introduced and our customers get used to the changes.""\nThe service has been in operation since 1859.",Passengers using a chain ferry have been warned crossing times will be longer because of new safety measures.,Passengers using a chain ferry have been warned crossing times will be longer because of new safety measures.
1,"If you leave your mobile phone somewhere do you worry you will not be able to check it?\nIf any of this sounds familiar, there is a chance you could be spending too much time on social networks.\nAn exclusive online Newsbeat poll suggests that a quarter of 15 to 18-year-olds in the UK feel happier online than they do in real life.\nDr Radha from The Surgery on Radio 1 has dealt with patients who have displayed ""a lot of social anxiety"" because they are using social networks too much.\n""Being online can provoke a sense of 'I'm not good enough, everyone else is having an amazing life',"" she explained.\n""It doesn't give us a sense of reality and actually what you will find is most people are probably doing the same thing as you are.""\nThe survey, carried out last month, also suggests a third of 15 to 18-year-olds have met someone in person they originally met through social media.\nDr Radha has said it is important people carefully consider what information they share with the online ...","You may be worried about your health, but what if you are online?","Do you ever feel lonely, stressed or jealous when you are online?"
1,"If you leave your mobile phone somewhere do you worry you will not be able to check it?\nIf any of this sounds familiar, there is a chance you could be spending too much time on social networks.\nAn exclusive online Newsbeat poll suggests that a quarter of 15 to 18-year-olds in the UK feel happier online than they do in real life.\nDr Radha from The Surgery on Radio 1 has dealt with patients who have displayed ""a lot of social anxiety"" because they are using social networks too much.\n""Being online can provoke a sense of 'I'm not good enough, everyone else is having an amazing life',"" she explained.\n""It doesn't give us a sense of reality and actually what you will find is most people are probably doing the same thing as you are.""\nThe survey, carried out last month, also suggests a third of 15 to 18-year-olds have met someone in person they originally met through social media.\nDr Radha has said it is important people carefully consider what information they share with the online ...","Do you ever feel lonely, stressed or jealous when you are online?","Do you ever feel lonely, stressed or jealous when you are online?"
2,"Speaking on TV, Maria Zakharova said Jews had told her they donated both to Mr Trump and Hillary Clinton.\nShe joked that American Jews were the best guide to US politics.\nThe diplomat's remarks caused shock. Anti-US propagandists in the last century peddled an idea that rich New York Jews controlled US politics.\nMs Zakharova was speaking on a chat show on Russian state TV at the weekend but her comments drew more attention after being picked up by media outlets on Thursday.\nShe said she had visited New York with an official Russian delegation at the time of the last UN General Assembly, in September.\n""I have a lot of friends and acquaintances there, of course I was interested to find out: how are the elections going, what are the American people's expectations?"" she said.\n""If you want to know what will happen in America, who do you need to talk to? You have to talk to the Jews, of course. It goes without saying.""\nAt this, the TV studio audience applauded loudly.\n""I went her...","The Russian foreign minister has said she has been ""settled"" by criticism from Jewish people for saying that the US election was a ""Jewish conspiracy"".",A spokeswoman on Russian TV has said Jewish people in New York told her they had mainly backed Trump in the US election.
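
The explode, strip, and de-duplicate step above can be sketched on a toy frame. The texts below are made up for illustration and are not FIB rows; only the column names mirror the notebook.

```python
import pandas as pd

# Toy stand-in for the FIB frame; the texts are invented for illustration.
toy = pd.DataFrame({
    'input': ['doc A', 'doc A', 'doc B'],
    'list_choices': [[' bad summary A1', 'good summary A'],
                     ['good summary A', ' bad summary A2'],
                     [' bad summary B', 'good summary B']],
    'correct_choice': ['good summary A', 'good summary A', 'good summary B'],
})

# One summary per row; the original row index is repeated for each choice.
exploded = toy.explode('list_choices')
# Strip before de-duplicating so ' x' and 'x' count as the same summary.
exploded['list_choices'] = exploded['list_choices'].str.strip()
deduped = exploded.drop_duplicates(subset=['input', 'list_choices'])

print(len(exploded), len(deduped))  # 6 5
```

The repeated positive summary for `doc A` collapses to a single row, which is why the row count after this step is lower than twice the number of original rows.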


In [11]:
# Create labels where factually consistent = 2 (entailment) and factually inconsistent = 0 (contradiction)
# What happened to label = 1? We drop it as it represents neutral in the NLI task

fib_df.loc[fib_df['correct_choice'] == fib_df['list_choices'], 'label'] = 2
fib_df.loc[fib_df['correct_choice'] != fib_df['list_choices'], 'label'] = 0
fib_df['label'] = fib_df['label'].astype(int)

logger.info(f'Label distribution:\n{fib_df["label"].value_counts()}')
fib_df[['input', 'list_choices', 'correct_choice', 'label']].head()

2024-07-22 13:03:43 - INFO - Label distribution:
label
0    3034
2     500
Name: count, dtype: int64


Unnamed: 0,input,list_choices,correct_choice,label
0,"Vehicles and pedestrians will now embark and disembark the Cowes ferry separately following Maritime and Coastguard Agency (MCA) guidance.\nIsle of Wight Council said its new procedures were in response to a resident's complaint.\nCouncillor Shirley Smart said it would ""initially result in a slower service"".\nOriginally passengers and vehicles boarded or disembarked the so called ""floating bridge"" at the same time.\nMs Smart, who is the executive member for economy and tourism, said the council already had measures in place to control how passengers and vehicles left or embarked the chain ferry ""in a safe manner"".\nHowever, it was ""responding"" to the MCA's recommendations ""following this complaint"".\nShe added: ""This may initially result in a slower service while the measures are introduced and our customers get used to the changes.""\nThe service has been in operation since 1859.",A new service on the Isle of Wight's chain ferry has been launched following a complaint from a resident.,Passengers using a chain ferry have been warned crossing times will be longer because of new safety measures.,0
0,"Vehicles and pedestrians will now embark and disembark the Cowes ferry separately following Maritime and Coastguard Agency (MCA) guidance.\nIsle of Wight Council said its new procedures were in response to a resident's complaint.\nCouncillor Shirley Smart said it would ""initially result in a slower service"".\nOriginally passengers and vehicles boarded or disembarked the so called ""floating bridge"" at the same time.\nMs Smart, who is the executive member for economy and tourism, said the council already had measures in place to control how passengers and vehicles left or embarked the chain ferry ""in a safe manner"".\nHowever, it was ""responding"" to the MCA's recommendations ""following this complaint"".\nShe added: ""This may initially result in a slower service while the measures are introduced and our customers get used to the changes.""\nThe service has been in operation since 1859.",Passengers using a chain ferry have been warned crossing times will be longer because of new safety measures.,Passengers using a chain ferry have been warned crossing times will be longer because of new safety measures.,2
1,"If you leave your mobile phone somewhere do you worry you will not be able to check it?\nIf any of this sounds familiar, there is a chance you could be spending too much time on social networks.\nAn exclusive online Newsbeat poll suggests that a quarter of 15 to 18-year-olds in the UK feel happier online than they do in real life.\nDr Radha from The Surgery on Radio 1 has dealt with patients who have displayed ""a lot of social anxiety"" because they are using social networks too much.\n""Being online can provoke a sense of 'I'm not good enough, everyone else is having an amazing life',"" she explained.\n""It doesn't give us a sense of reality and actually what you will find is most people are probably doing the same thing as you are.""\nThe survey, carried out last month, also suggests a third of 15 to 18-year-olds have met someone in person they originally met through social media.\nDr Radha has said it is important people carefully consider what information they share with the online ...","You may be worried about your health, but what if you are online?","Do you ever feel lonely, stressed or jealous when you are online?",0
1,"If you leave your mobile phone somewhere do you worry you will not be able to check it?\nIf any of this sounds familiar, there is a chance you could be spending too much time on social networks.\nAn exclusive online Newsbeat poll suggests that a quarter of 15 to 18-year-olds in the UK feel happier online than they do in real life.\nDr Radha from The Surgery on Radio 1 has dealt with patients who have displayed ""a lot of social anxiety"" because they are using social networks too much.\n""Being online can provoke a sense of 'I'm not good enough, everyone else is having an amazing life',"" she explained.\n""It doesn't give us a sense of reality and actually what you will find is most people are probably doing the same thing as you are.""\nThe survey, carried out last month, also suggests a third of 15 to 18-year-olds have met someone in person they originally met through social media.\nDr Radha has said it is important people carefully consider what information they share with the online ...","Do you ever feel lonely, stressed or jealous when you are online?","Do you ever feel lonely, stressed or jealous when you are online?",2
2,"Speaking on TV, Maria Zakharova said Jews had told her they donated both to Mr Trump and Hillary Clinton.\nShe joked that American Jews were the best guide to US politics.\nThe diplomat's remarks caused shock. Anti-US propagandists in the last century peddled an idea that rich New York Jews controlled US politics.\nMs Zakharova was speaking on a chat show on Russian state TV at the weekend but her comments drew more attention after being picked up by media outlets on Thursday.\nShe said she had visited New York with an official Russian delegation at the time of the last UN General Assembly, in September.\n""I have a lot of friends and acquaintances there, of course I was interested to find out: how are the elections going, what are the American people's expectations?"" she said.\n""If you want to know what will happen in America, who do you need to talk to? You have to talk to the Jews, of course. It goes without saying.""\nAt this, the TV studio audience applauded loudly.\n""I went her...","The Russian foreign minister has said she has been ""settled"" by criticism from Jewish people for saying that the US election was a ""Jewish conspiracy"".",A spokeswoman on Russian TV has said Jewish people in New York told her they had mainly backed Trump in the US election.,0
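
As a sketch, the same labels can be derived in one pass by mapping the boolean comparison onto the two NLI classes. This is an equivalent variant for illustration on made-up rows, not the notebook's own code.

```python
import pandas as pd

# Made-up rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    'list_choices': ['bad summary A', 'good summary A', 'good summary B', 'bad summary B'],
    'correct_choice': ['good summary A', 'good summary A', 'good summary B', 'good summary B'],
})

# True where the candidate summary is the reference (factually consistent) one.
is_correct = toy['correct_choice'] == toy['list_choices']
# 2 = entailment (consistent), 0 = contradiction (inconsistent); 1 (neutral) is unused.
toy['label'] = is_correct.map({True: 2, False: 0}).astype(int)

print(toy['label'].tolist())  # [0, 2, 2, 0]
```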


In [12]:
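# Split into train, val, and test (70/15/15) by source doc so the same doc never appears in more than one split
# Note: after the groupby, 'label' holds the number of summaries per doc; it is used only to stratify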
source_grouped = (fib_df.groupby('input')
                  .agg({'label': 'count'})
                  .reset_index())

input_train, input_val = train_test_split(source_grouped,
                                          test_size=0.3,
                                          stratify=source_grouped['label'],
                                          random_state=1368)

input_test, input_val = train_test_split(input_val,
                                         test_size=0.5,
                                         stratify=input_val['label'],
                                         random_state=1368)

fib_train = fib_df[fib_df['input'].isin(input_train['input'])]
fib_val = fib_df[fib_df['input'].isin(input_val['input'])]
fib_test = fib_df[fib_df['input'].isin(input_test['input'])]

logger.info(f'Rows in FIB train: {len(fib_train):,}, val: {len(fib_val):,}, test: {len(fib_test):,}')

2024-07-22 13:04:05 - INFO - Rows in FIB train: 2,474, val: 530, test: 530
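
Because the split is done on source documents rather than rows, it is cheap to verify that no document leaked across splits. A minimal sketch of such a check follows; `assert_disjoint_docs` is a hypothetical helper name, and the toy frame is made up.

```python
import pandas as pd

def assert_disjoint_docs(splits, key='input'):
    """Raise AssertionError if any value of `key` appears in more than one split."""
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = set(splits[a][key]) & set(splits[b][key])
            assert not shared, f'{len(shared)} docs shared between {a} and {b}'

# Toy data: three docs, each with one negative (0) and one positive (2) summary.
toy = pd.DataFrame({'input': ['d1', 'd1', 'd2', 'd2', 'd3', 'd3'],
                    'label': [0, 2, 0, 2, 0, 2]})
toy_splits = {
    'train': toy[toy['input'].isin(['d1'])],
    'val': toy[toy['input'].isin(['d2'])],
    'test': toy[toy['input'].isin(['d3'])],
}
assert_disjoint_docs(toy_splits)  # passes silently when there is no overlap
```

In the notebook, the equivalent call would be `assert_disjoint_docs({'train': fib_train, 'val': fib_val, 'test': fib_test})`.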


In [13]:
# NOTE: In FIB, each doc has 1 positive summary and 5-6 negative summaries.
# We balance to 1:1 by keeping one row per (input, label) pair.
fib_train = fib_train.drop_duplicates(subset=['input', 'label'])
fib_val = fib_val.drop_duplicates(subset=['input', 'label'])
fib_test = fib_test.drop_duplicates(subset=['input', 'label'])

logger.info(f'Rows in balanced FIB train: {len(fib_train)}, val: {len(fib_val)}, test: {len(fib_test)}')

2024-07-22 13:05:00 - INFO - Rows in balanced FIB train: 700, val: 150, test: 150
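
To make the balancing step concrete, here is a small sketch on a toy frame. The data is hypothetical, and using `2` for the correct summary is only for illustration. It shows that `drop_duplicates` keeps the first row per `(input, label)` pair, and that shuffling first would keep a random negative instead. The seed is arbitrary.

```python
import pandas as pd

# Toy frame: each doc has one correct summary and several distractors (hypothetical data)
toy = pd.DataFrame({
    'input': ['doc1'] * 4 + ['doc2'] * 3,
    'list_choices': ['good', 'bad-a', 'bad-b', 'bad-c', 'bad-d', 'good', 'bad-e'],
    'label': [2, 0, 0, 0, 0, 2, 0],
})

# Keeps the first row for every (input, label) pair, as in the cell above
first_kept = toy.drop_duplicates(subset=['input', 'label'])

# Alternative: shuffle first so the surviving negative is a random one
random_kept = (toy.sample(frac=1, random_state=0)
                  .drop_duplicates(subset=['input', 'label'])
                  .sort_index())

print(first_kept[['input', 'list_choices', 'label']])
```

Either way, each document ends up with exactly one row per label, which is what gives the 1:1 balance.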


In [15]:
fib_train.to_csv('./data/fib-train.csv', index=False)
fib_val.to_csv('./data/fib-val.csv', index=False)
fib_test.to_csv('./data/fib-test.csv', index=False)

In [18]:
# Test loading into dataset
fib_files = {'train': './data/fib-train.csv',
             'val': './data/fib-val.csv',
             'test': './data/fib-test.csv'}

fib_ds = load_dataset('csv', data_files=fib_files)
fib_ds = fib_ds.select_columns(['input', 'list_choices', 'label'])
fib_ds = fib_ds.rename_column('input', 'premise').rename_column('list_choices', 'hypothesis')

logger.info(
    f"Label distribution - Train: {Counter(fib_ds['train']['label'])}, Val: {Counter(fib_ds['val']['label'])},Test: {Counter(fib_ds['test']['label'])}"
)

Generating train split: 0 examples [00:00, ? examples/s]

Generating val split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

2024-07-22 13:10:45 - INFO - Label distribution - Train: Counter({0: 350, 2: 350}), Val: Counter({0: 75, 2: 75}),Test: Counter({0: 75, 2: 75})
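
Reading balance off the logged counters works for a handful of splits. As a sketch, the same check can be written as a small helper. `class_ratio` is a hypothetical name, not something the notebook defines.

```python
from collections import Counter


def class_ratio(labels):
    """Ratio of the rarest to the most common label; 1.0 means perfectly balanced."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


# Same shape as the train split logged above: 350 rows per label
print(class_ratio([0] * 350 + [2] * 350))  # 1.0
```

With the loaded dataset, `class_ratio(fib_ds['train']['label'])` would apply the same check to the real labels.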


## Prepare USB data
- Note: label = 0 is "after edit" (factually consistent); label = 1 is "before edit" (factually inconsistent)
- https://github.com/kukrishna/usb/blob/master/dataset_creators/usb_fac.py#L83
![](images/usb.png)

In [20]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [21]:
!git clone https://github.com/kukrishna/usb.git
!cd usb && tar -xf raw_annotations.tar.gz
!cd usb && pip install -r requirements.txt
!cd usb && bash create_all_datasets.sh

Cloning into 'usb'...
remote: Enumerating objects: 79, done.
remote: Counting objects: 100% (79/79), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 79 (delta 39), reused 49 (delta 22), pack-reused 0
Receiving objects: 100% (79/79), 9.04 MiB | 13.67 MiB/s, done.
Resolving deltas: 100% (39/39), done.
Collecting jsonlines>=3.1.0 (from -r requirements.txt (line 2))
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Collecting openai (from -r requirements.txt (line 5))
  Downloading openai-1.36.1-py3-none-any.whl.metadata (22 kB)
Collecting tenacity (from -r requirements.txt (line 6))
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting transformers==4.23.1 (from -r requirements.txt (line 7))
  Downloading transformers-4.23.1-py3-none-any.whl.metadata (88 kB)
Collecting tokenizers!=0.11

In [22]:
usb_train = pd.read_json('usb/task_datasets/all/factuality_classification/train.jsonl', lines=True)
usb_val = pd.read_json('usb/task_datasets/all/factuality_classification/validation.jsonl', lines=True)

logger.info(f'Rows in USB train: {len(usb_train):,}, val: {len(usb_val):,}')

2024-07-22 13:13:35 - INFO - Rows in USB train: 5,050, val: 2,668


In [23]:
# Join each example's list of source lines into a single string
usb_train['source'] = usb_train['input_lines'].apply(' '.join)
usb_val['source'] = usb_val['input_lines'].apply(' '.join)

In [25]:
# 0 = "after edit" / factually consistent
# 1 = "before edit" / factually inconsistent
usb_train[['source', 'summary_sent', 'label']].head(10)

Unnamed: 0,source,summary_sent,label
0,"Wendy Jane Crewson Crewson was born in Hamilton, Ontario, the daughter of June Doreen (née Thomas) and Robert Binnie Crewson. Also in 2012, Crewson began playing Dr. Dana Kinny in the CTV medical drama ""Saving Hope"", for which she received Canadian Screen Award for Best Supporting Actress in a Drama Program or Series in 2013.",Wendy Jane Crewson is a Canadian actress.,0
1,"Wendy Jane Crewson Crewson was born in Hamilton, Ontario, the daughter of June Doreen (née Thomas) and Robert Binnie Crewson. Also in 2012, Crewson began playing Dr. Dana Kinny in the CTV medical drama ""Saving Hope"", for which she received Canadian Screen Award for Best Supporting Actress in a Drama Program or Series in 2013.","Wendy Jane Crewson (born May 9, 1956) is a Canadian actress and producer.",1
2,"When she returned to Canada, Crewson landed a leading role in the television movie ""War Brides"" (1980) directed by Martin Lavut, for which she received her first ACTRA Award nomination. From 1980 to 1983, she starred in the CBC drama series, ""Home Fires"", a family saga set in Toronto during World War II. In 1991, Crewson appeared in her first breakthrough role in the American drama film ""The Doctor"" starring William Hurt.","She began her career appearing on Canadian television, before her breakthrough role in the 1991 dramatic film ""The Doctor"".",0
3,"When she returned to Canada, Crewson landed a leading role in the television movie ""War Brides"" (1980) directed by Martin Lavut, for which she received her first ACTRA Award nomination. From 1980 to 1983, she starred in the CBC drama series, ""Home Fires"", a family saga set in Toronto during World War II. In 1991, Crewson appeared in her first breakthrough role in the American drama film ""The Doctor"" starring William Hurt.","She began her career appearing on Canadian television, before her breakthrough role in 1991 dramatic film ""The Doctor"".",1
4,"In 1993, she starred in the psychological thriller ""The Good Son"" (1993), and in 1994 appeared opposite Whoopi Goldberg in ""Corrina, Corrina"". Also in 1994, Crewson starred alongside Tim Allen in the financially successful Christmas comedy film ""The Santa Clause"". The film grossed $189 million and its two sequels, The Santa Clause 2 (2002) and The Santa Clause 3: The Escape Clause (2006) also grossed $283 million worldwide together. In 1996, Crewson co-starred in the romantic drama film ""To Gillian on Her 37th Birthday"" as Peter Gallagher's unfortunate blind date, and the following year played Grace Marshall, First Lady to President James Marshall (Harrison Ford) in the political thriller ""Air Force One"" directed by Wolfgang Petersen. She also appeared in ""Gang Related"" (1997), played a leading role in ""Sleeping Dogs Lie"" (1998), and co-starred opposite Robin Williams in the science fiction film ""Bicentennial Man"" (1999). In 2000, she played Arnold Schwarzenegger's wife in ""The 6th...","Crewson has appeared in many films, including ""The Good Son"" (1993), ""The Santa Clause"" (1994) and its sequels ""The Santa Clause 2"" (2002) and ""The Santa Clause 3: The Escape Clause"" (2006), as well as ""Air Force One"" (1997), ""Bicentennial Man"" (1999), ""What Lies Beneath"" (2000), ""The 6th Day"" (2000), ""The Covenant"" (2006) and ""Eight Below"" (2006).",0
5,"In 1993, she starred in the psychological thriller ""The Good Son"" (1993), and in 1994 appeared opposite Whoopi Goldberg in ""Corrina, Corrina"". Also in 1994, Crewson starred alongside Tim Allen in the financially successful Christmas comedy film ""The Santa Clause"". The film grossed $189 million and its two sequels, The Santa Clause 2 (2002) and The Santa Clause 3: The Escape Clause (2006) also grossed $283 million worldwide together. In 1996, Crewson co-starred in the romantic drama film ""To Gillian on Her 37th Birthday"" as Peter Gallagher's unfortunate blind date, and the following year played Grace Marshall, First Lady to President James Marshall (Harrison Ford) in the political thriller ""Air Force One"" directed by Wolfgang Petersen. She also appeared in ""Gang Related"" (1997), played a leading role in ""Sleeping Dogs Lie"" (1998), and co-starred opposite Robin Williams in the science fiction film ""Bicentennial Man"" (1999). In 2000, she played Arnold Schwarzenegger's wife in ""The 6th...","Crewson has appeared in many Hollywood films, including ""The Good Son"" (1993), ""The Santa Clause"" (1994) and its sequels ""The Santa Clause 2"" (2002) and ""The Santa Clause 3: The Escape Clause"" (2006), as well as ""Air Force One"" (1997), ""Bicentennial Man"" (1999), ""What Lies Beneath"" (2000), ""The 6th Day"" (2000), ""The Covenant"" (2006) and ""Eight Below"" (2006).",1
6,"For the final season, she won ACTRA Award for Best Actress in a Drama Series in 1984. Crewson also won Gemini Awards for guest starring in ""Due South "" in 1998, and supporting role in ""ReGenesis"" in 2007. Also in 2012, Crewson began playing Dr. Dana Kinny in the CTV medical drama ""Saving Hope"", for which she received Canadian Screen Award for Best Supporting Actress in a Drama Program or Series in 2013. Also in 2017, she won another Canadian Screen Award for Best Supporting Actress for her recurring role on ""Slasher"".","Crewson has won Gemini Awards, two Canadian Screen Awards and an ACTRA Award for her performances on television.",0
7,"For the final season, she won ACTRA Award for Best Actress in a Drama Series in 1984. Crewson also won Gemini Awards for guest starring in ""Due South "" in 1998, and supporting role in ""ReGenesis"" in 2007. Also in 2012, Crewson began playing Dr. Dana Kinny in the CTV medical drama ""Saving Hope"", for which she received Canadian Screen Award for Best Supporting Actress in a Drama Program or Series in 2013. Also in 2017, she won another Canadian Screen Award for Best Supporting Actress for her recurring role on ""Slasher"".","Crewson has won six Gemini Awards, two Canadian Screen Awards and ACTRA Award for her performances on television.",1
8,"In 1973, thanks in part to the influence of his mentor, General William W. Dunn, the Commander, Lt-Gen Kenneth Schultz, assigned Parkinson to a floundering Air Force program called Project 621B. At that meeting, attended only by his officer-engineers and two people from the Aerospace Corporation, he led the re-architecture of the concept. He then assumed lead responsibility to sell the new configuration to the Air Force and to top Pentagon Officials. Parkinson then assumed full, direct control of the development of the demonstration system, which included satellites, a global ground control system, nine types of user receivers, and an extensive land, sea and air test program. In 1978, Parkinson was the launch Commander for the first prototype GPS satellite to be launched (forty-four months after go-ahead).","He is known as the lead architect, advocate and developer of the Air Force Project 621B program, better known as Global Positioning System.",0
9,"In 1973, thanks in part to the influence of his mentor, General William W. Dunn, the Commander, Lt-Gen Kenneth Schultz, assigned Parkinson to a floundering Air Force program called Project 621B. At that meeting, attended only by his officer-engineers and two people from the Aerospace Corporation, he led the re-architecture of the concept. He then assumed lead responsibility to sell the new configuration to the Air Force and to top Pentagon Officials. Parkinson then assumed full, direct control of the development of the demonstration system, which included satellites, a global ground control system, nine types of user receivers, and an extensive land, sea and air test program. In 1978, Parkinson was the launch Commander for the first prototype GPS satellite to be launched (forty-four months after go-ahead).","He is best known as the lead architect, advocate and developer, with early contributions from Ivan Getting and Roger Easton, of the Air Force NAVSTAR program, better known as Global Positioning System.",1


In [26]:
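# Remap to the 0/2 label values used for FIB above:
# 1 (factually inconsistent) -> 0, 0 (factually consistent) -> 2
usb_train['label'] = usb_train['label'].apply(lambda x: 0 if x == 1 else 2)
usb_val['label'] = usb_val['label'].apply(lambda x: 0 if x == 1 else 2)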

logger.info(f'Label distribution (train):\n{usb_train["label"].value_counts()}')
logger.info(f'Label distribution (val):\n{usb_val["label"].value_counts()}')

2024-07-22 13:16:56 - INFO - Label distribution (train):
label
2    2525
0    2525
Name: count, dtype: int64
2024-07-22 13:16:56 - INFO - Label distribution (val):
label
2    1334
0    1334
Name: count, dtype: int64
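
The lambda above sends any value other than `1` to `2`, so an unexpected label would be silently treated as consistent. A hedged alternative is an explicit mapping that fails loudly. `USB_TO_TARGET` and `remap_labels` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# USB convention from the note above: 0 = factually consistent, 1 = factually inconsistent
USB_TO_TARGET = {0: 2, 1: 0}


def remap_labels(labels: pd.Series) -> pd.Series:
    """Map USB labels to the 0/2 scheme, raising on values outside the mapping."""
    mapped = labels.map(USB_TO_TARGET)  # values missing from the dict become NaN
    if mapped.isna().any():
        unexpected = sorted(labels[mapped.isna()].unique())
        raise ValueError(f'Unexpected label values: {unexpected}')
    return mapped.astype(int)


toy_labels = pd.Series([0, 1, 1, 0])  # hypothetical labels
print(remap_labels(toy_labels).tolist())  # [2, 0, 0, 2]
```

On the real frames this could be used as `usb_train['label'] = remap_labels(usb_train['label'])`, applied once to the original 0/1 labels.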


In [27]:
usb_train.to_csv('./data/usb-train.csv', index=False)
usb_val.to_csv('./data/usb-val.csv', index=False)

In [28]:
# Test loading into dataset
usb_files = {'train': './data/usb-train.csv',
             'val': './data/usb-val.csv'}

usb_ds = load_dataset('csv', data_files=usb_files)
usb_ds = usb_ds.select_columns(['source', 'summary_sent', 'label'])
usb_ds = usb_ds.rename_column('source', 'premise').rename_column('summary_sent', 'hypothesis')

logger.info(f"Label distribution - Train: {Counter(usb_ds['train']['label'])}, Val: {Counter(usb_ds['val']['label'])}")

Generating train split: 0 examples [00:00, ? examples/s]

Generating val split: 0 examples [00:00, ? examples/s]

2024-07-22 13:18:18 - INFO - Label distribution - Train: Counter({2: 2525, 0: 2525}), Val: Counter({2: 1334, 0: 1334})


# Next: [2_ft_fib.ipynb](2_ft_fib.ipynb)