<a href="https://colab.research.google.com/github/eugeneyan/visualizing-finetunes/blob/main/1_prep_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%pip install -q transformers accelerate bitsandbytes datasets peft watermark

In [2]:
%load_ext watermark
%watermark --conda -p torch,transformers,peft,datasets,sklearn



torch       : 2.2.1+cu121
transformers: 4.23.1
peft        : 0.10.0
datasets    : 2.19.1
sklearn     : 1.2.2

conda environment: n/a



In [3]:
import pandas as pd
import logging
import re

from collections import Counter
from datasets import load_dataset
from sklearn.model_selection import train_test_split

In [4]:
# Set up logger
logger = logging.getLogger('1-prep-data')
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
    force=True
)

logger.info('Running notebook to prep data')

2024-05-11 20:37:13 - INFO - Running notebook to prep data


In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Prepare FIB data
- FIB contains one-sentence summaries of CNN/DM & XSUM news articles.
- Note: We exclude the CNN/Daily Mail data because its quality is pretty bad.
- https://huggingface.co/datasets/r-three/fib

In [6]:
fib_ds = load_dataset('r-three/fib', split='test')
fib_df = fib_ds.to_pandas()
logger.info(f'No. of rows in FIB: {len(fib_df):,}')

2024-05-11 20:37:18 - INFO - No. of rows in FIB: 3,579


In [7]:
# Visualize the CNN/DM data
fib_df.loc[fib_df['dataset'] == 'cnn_dm', ['input', 'list_choices', 'correct_choice']].head(5)

Unnamed: 0,input,list_choices,correct_choice
3122,( cnn ) the american pharmacists association i...,[<t> the american pharmacists association pass...,<t> the american pharmacists association passe...
3123,( cnn ) oprah 's in there . so 's bill murray ...,[<t> `` the late show with david letterman '' ...,<t> `` the late show with david letterman '' c...
3124,( cnn ) feeling so happy you just ca n't stand...,[<t> a new study has found that acetaminophen ...,<t> subjects taking acetaminophen reacted less...
3125,"( cnn ) love it or hate it , jared leto 's int...",[<t> the oscar winner put on white makeup -lrb...,<t> leto will play the clown prince of crime i...
3126,( the hollywood reporter ) the original cast o...,[<t> -lrb- the hollywood reporter -rrb- the or...,<t> `` twin peaks '' creator david lynch annou...


In [8]:
# Only keep xsum data
fib_df = fib_df[fib_df['dataset'] == 'xsum']
logger.info(f'No. of rows in FIB: {len(fib_df):,}')

2024-05-11 20:37:18 - INFO - No. of rows in FIB: 3,122


In [9]:
fib_df[['input', 'list_choices', 'correct_choice']].head(5)

Unnamed: 0,input,list_choices,correct_choice
0,Vehicles and pedestrians will now embark and d...,[ A new service on the Isle of Wight's chain f...,Passengers using a chain ferry have been warne...
1,If you leave your mobile phone somewhere do yo...,"[ You may be worried about your health, but wh...","Do you ever feel lonely, stressed or jealous w..."
2,"Speaking on TV, Maria Zakharova said Jews had ...",[ The Russian foreign minister has said she ha...,A spokeswoman on Russian TV has said Jewish pe...
3,"A report by the organisation suggests men, wom...",[ Egyptian police are systematically abusing d...,Egyptian security forces are using sexual viol...
4,Police in Australia and Europe were aware of a...,[One word and a freckle indirectly led to Huck...,One word and a freckle indirectly led to Huckl...


In [10]:
# Each list choice contains a positive and negative summary; we'll explode, clean, and drop duplicates
fib_df = fib_df.explode('list_choices')
fib_df['list_choices'] = fib_df['list_choices'].apply(lambda x: x.strip())
fib_df = fib_df.drop_duplicates(subset=['input', 'list_choices'])
logger.info(f'No. of rows in FIB: {len(fib_df):,}')
fib_df[['input', 'list_choices', 'correct_choice']].head(5)

2024-05-11 20:37:19 - INFO - No. of rows in FIB: 3,534


Unnamed: 0,input,list_choices,correct_choice
0,Vehicles and pedestrians will now embark and d...,A new service on the Isle of Wight's chain fer...,Passengers using a chain ferry have been warne...
0,Vehicles and pedestrians will now embark and d...,Passengers using a chain ferry have been warne...,Passengers using a chain ferry have been warne...
1,If you leave your mobile phone somewhere do yo...,"You may be worried about your health, but what...","Do you ever feel lonely, stressed or jealous w..."
1,If you leave your mobile phone somewhere do yo...,"Do you ever feel lonely, stressed or jealous w...","Do you ever feel lonely, stressed or jealous w..."
2,"Speaking on TV, Maria Zakharova said Jews had ...",The Russian foreign minister has said she has ...,A spokeswoman on Russian TV has said Jewish pe...


In [11]:
# Create labels where factually consistent = 2 (entailment) and factually inconsistent = 0 (contradiction)
# Label 1 is left unused: it represents neutral in the NLI task, which has no counterpart here
fib_df.loc[fib_df['correct_choice'] == fib_df['list_choices'], 'label'] = 2
fib_df.loc[fib_df['correct_choice'] != fib_df['list_choices'], 'label'] = 0
fib_df['label'] = fib_df['label'].astype(int)

logger.info(f'Label distribution:\n{fib_df["label"].value_counts()}')
fib_df[['input', 'list_choices', 'correct_choice', 'label']].head()

2024-05-11 20:37:19 - INFO - Label distribution:
label
0    3034
2     500
Name: count, dtype: int64


Unnamed: 0,input,list_choices,correct_choice,label
0,Vehicles and pedestrians will now embark and d...,A new service on the Isle of Wight's chain fer...,Passengers using a chain ferry have been warne...,0
0,Vehicles and pedestrians will now embark and d...,Passengers using a chain ferry have been warne...,Passengers using a chain ferry have been warne...,2
1,If you leave your mobile phone somewhere do yo...,"You may be worried about your health, but what...","Do you ever feel lonely, stressed or jealous w...",0
1,If you leave your mobile phone somewhere do yo...,"Do you ever feel lonely, stressed or jealous w...","Do you ever feel lonely, stressed or jealous w...",2
2,"Speaking on TV, Maria Zakharova said Jews had ...",The Russian foreign minister has said she has ...,A spokeswoman on Russian TV has said Jewish pe...,0
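The labelling rule in the cell above can be sanity-checked on a tiny frame. This is a minimal sketch, assuming made-up rows (the `toy` frame is hypothetical and not FIB data) with the same three columns as the exploded `fib_df`:

```python
import pandas as pd

# Hypothetical toy rows mirroring the exploded FIB frame (not real FIB data).
toy = pd.DataFrame({
    'input': ['doc a', 'doc a', 'doc b'],
    'list_choices': ['summary 1', 'summary 2', 'summary 3'],
    'correct_choice': ['summary 2', 'summary 2', 'summary 4'],
})

# 2 = entailment (the choice matches the reference summary),
# 0 = contradiction (it does not). Label 1 (neutral) is never assigned.
is_correct = toy['correct_choice'] == toy['list_choices']
toy['label'] = is_correct.map({True: 2, False: 0}).astype(int)

print(toy['label'].tolist())
```

Only the second row matches its reference summary, so it is the only row labelled 2.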


In [12]:
# Split into train and val, ensuring that the same source doc doesn't appear across train and val
source_grouped = (fib_df.groupby('input')
                  .agg({'label': 'count'})
                  .reset_index())

input_train, input_val = train_test_split(source_grouped,
                                          test_size=0.3,
                                          stratify=source_grouped['label'],
                                          random_state=1368)

# Split the 30% holdout evenly into test and val
input_test, input_val = train_test_split(input_val,
                                         test_size=0.5,
                                         stratify=input_val['label'],
                                         random_state=1368)

fib_train = fib_df[fib_df['input'].isin(input_train['input'])]
fib_val = fib_df[fib_df['input'].isin(input_val['input'])]
fib_test = fib_df[fib_df['input'].isin(input_test['input'])]

logger.info(f'Rows in FIB train: {len(fib_train):,}, val: {len(fib_val):,}, test: {len(fib_test):,}')

2024-05-11 20:37:19 - INFO - Rows in FIB train: 2,474, val: 530, test: 530
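Because the split is done on source documents rather than rows, no `input` value should appear in more than one split. A minimal sketch of how that could be verified follows; the helper `assert_no_source_overlap` and the toy frames are hypothetical and not part of the original notebook:

```python
import pandas as pd

def assert_no_source_overlap(splits, key='input'):
    """Raise ValueError if any source document appears in more than one split."""
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = set(splits[a][key]) & set(splits[b][key])
            if shared:
                raise ValueError(f'{len(shared)} source doc(s) shared by {a} and {b}')
    return True

# Hypothetical toy splits: each source doc lives in exactly one split.
train = pd.DataFrame({'input': ['doc a', 'doc a', 'doc b']})
val = pd.DataFrame({'input': ['doc c']})
test = pd.DataFrame({'input': ['doc d', 'doc d']})

ok = assert_no_source_overlap({'train': train, 'val': val, 'test': test})
print(ok)
```

In this notebook the same check would presumably be called with `{'train': fib_train, 'val': fib_val, 'test': fib_test}`.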


In [13]:
# NOTE: In FIB, each doc has 1 positive summary and 5-6 negative summaries.
# We balance to 1:1 by keeping only the first row per (doc, label) pair.
fib_train = fib_train.drop_duplicates(subset=['input', 'label'])
fib_val = fib_val.drop_duplicates(subset=['input', 'label'])
fib_test = fib_test.drop_duplicates(subset=['input', 'label'])

logger.info(f'Rows in balanced FIB train: {len(fib_train)}, val: {len(fib_val)}, test: {len(fib_test)}')

2024-05-11 20:37:19 - INFO - Rows in balanced FIB train: 700, val: 150, test: 150
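To see what `drop_duplicates(subset=['input', 'label'])` keeps, here is a minimal sketch on made-up rows (the `toy` frame and its summary names are hypothetical). For each document it retains the first row of each label in row order, so the surviving negative is the first one listed rather than a random one:

```python
import pandas as pd

# Hypothetical toy data: doc a has 3 negatives and 1 positive, doc b has 2 and 1.
toy = pd.DataFrame({
    'input': ['doc a'] * 4 + ['doc b'] * 3,
    'list_choices': ['a-neg1', 'a-pos', 'a-neg2', 'a-neg3',
                     'b-pos', 'b-neg1', 'b-neg2'],
    'label': [0, 2, 0, 0, 2, 0, 0],
})

# Keep the first row for every (document, label) pair.
balanced = toy.drop_duplicates(subset=['input', 'label'])

print(balanced['list_choices'].tolist())
print(balanced['label'].value_counts().to_dict())
```

If a random negative per document were preferred, shuffling the rows with a fixed seed before deduplicating would be one option; that is a variation, not what the notebook does.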


In [14]:
fib_train.to_csv('/content/drive/My Drive/fib-train.csv', index=False)
fib_val.to_csv('/content/drive/My Drive/fib-val.csv', index=False)
fib_test.to_csv('/content/drive/My Drive/fib-test.csv', index=False)

In [15]:
# Test loading into dataset
fib_files = {'train': '/content/drive/My Drive/fib-train.csv',
             'val': '/content/drive/My Drive/fib-val.csv',
             'test': '/content/drive/My Drive/fib-test.csv'}

fib_ds = load_dataset('csv', data_files=fib_files)
fib_ds = fib_ds.select_columns(['input', 'list_choices', 'label'])
fib_ds = fib_ds.rename_column('input', 'premise').rename_column('list_choices', 'hypothesis')

logger.info(f"Label distribution - Train: {Counter(fib_ds['train']['label'])}, Val: {Counter(fib_ds['val']['label'])}, Test: {Counter(fib_ds['test']['label'])}")

2024-05-11 20:37:20 - INFO - Label distribution - Train: Counter({0: 350, 2: 350}), Val: Counter({0: 75, 2: 75}), Test: Counter({0: 75, 2: 75})


## Prepare USB data
- Note: label = 0 is "after edit"/factual consistency; label = 1 is "before edit"/factual inconsistency
- https://github.com/kukrishna/usb/blob/master/dataset_creators/usb_fac.py#L83

In [16]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [17]:
!git clone https://github.com/kukrishna/usb.git
!cd usb && tar -xf raw_annotations.tar.gz
!cd usb && pip install -r requirements.txt
!cd usb && bash create_all_datasets.sh

fatal: destination path 'usb' already exists and is not an empty directory.
READING DATASET FROM REPOSITIORY HERE:  raw_annotations
WILL WRITE ALL TASK DATASETS TO :  task_datasets
Traceback (most recent call last):
  File "/content/usb/dataset_creators/usb_abs_comp.py", line 46, in <module>
    os.mkdir(f"{OUTPUT_ROOT}/{DS_VARIANT}")
FileExistsError: [Errno 17] File exists: 'task_datasets/biographies/abstractive_summarization'
Traceback (most recent call last):
  File "/content/usb/dataset_creators/usb_evext.py", line 48, in <module>
    os.mkdir(f"{OUTPUT_ROOT}/{DS_VARIANT}")
FileExistsError: [Errno 17] File exists: 'task_datasets/biographies/evidence_extraction'
Traceback (most recent call last):
  File "/content/usb/dataset_creators/usb_fac.py", line 49, in <module>
    os.mkdir(f"{OUTPUT_ROOT}/{DS_VARIANT}")
FileExistsError: [Errno 17] File exists: 'task_datasets/biographies/factuality_classification'
Traceback (most recent call last):
  File "/content/usb/dataset_creators/usb_fix

In [18]:
usb_train = pd.read_json('usb/task_datasets/all/factuality_classification/train.jsonl', lines=True)
usb_val = pd.read_json('usb/task_datasets/all/factuality_classification/validation.jsonl', lines=True)

logger.info(f'Rows in USB train: {len(usb_train):,}, val: {len(usb_val):,}')

2024-05-11 20:38:08 - INFO - Rows in USB train: 5,050, val: 2,668


In [19]:
# Join each document's list of input lines into a single source string
usb_train['source'] = usb_train['input_lines'].apply(lambda x: ' '.join(x))
usb_val['source'] = usb_val['input_lines'].apply(lambda x: ' '.join(x))

In [20]:
# 0 = "after edit" / factually consistent; 1 = "before edit" / factually inconsistent
usb_train[['source', 'summary_sent', 'label']].head(8)

Unnamed: 0,source,summary_sent,label
0,Wendy Jane Crewson Crewson was born in Hamilto...,Wendy Jane Crewson is a Canadian actress.,0
1,Wendy Jane Crewson Crewson was born in Hamilto...,"Wendy Jane Crewson (born May 9, 1956) is a Can...",1
2,"When she returned to Canada, Crewson landed a ...",She began her career appearing on Canadian tel...,0
3,"When she returned to Canada, Crewson landed a ...",She began her career appearing on Canadian tel...,1
4,"In 1993, she starred in the psychological thri...","Crewson has appeared in many films, including ...",0
5,"In 1993, she starred in the psychological thri...","Crewson has appeared in many Hollywood films, ...",1
6,"For the final season, she won ACTRA Award for ...","Crewson has won Gemini Awards, two Canadian Sc...",0
7,"For the final season, she won ACTRA Award for ...","Crewson has won six Gemini Awards, two Canadia...",1


In [21]:
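# Remap to the same labels as FIB: USB 1 (inconsistent) -> 0 (contradiction), USB 0 (consistent) -> 2 (entailment)
usb_train['label'] = usb_train['label'].apply(lambda x: 0 if x == 1 else 2)
usb_val['label'] = usb_val['label'].apply(lambda x: 0 if x == 1 else 2)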

logger.info(f'Label distribution (train):\n{usb_train["label"].value_counts()}')
logger.info(f'Label distribution (val):\n{usb_val["label"].value_counts()}')

2024-05-11 20:38:08 - INFO - Label distribution (train):
label
2    2525
0    2525
Name: count, dtype: int64
2024-05-11 20:38:08 - INFO - Label distribution (val):
label
2    1334
0    1334
Name: count, dtype: int64
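The remapping above sends every value other than 1 to label 2. A stricter variant, sketched here under the assumption that USB labels are only ever 0 or 1, uses an explicit dictionary so that any unexpected value surfaces as `NaN` instead of silently becoming 2. The name `USB_TO_NLI` and the toy series are hypothetical:

```python
import pandas as pd

# Assumed mapping: USB 0 ("after edit", consistent) -> 2 (entailment),
#                  USB 1 ("before edit", inconsistent) -> 0 (contradiction).
USB_TO_NLI = {0: 2, 1: 0}

toy = pd.Series([0, 1, 1, 0], name='label')
remapped = toy.map(USB_TO_NLI)

# Any label outside the mapping would show up as NaN here.
assert remapped.notna().all(), 'unexpected label value in USB data'

print(remapped.tolist())
```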


In [22]:
usb_train.to_csv('/content/drive/My Drive/usb-train.csv', index=False)
usb_val.to_csv('/content/drive/My Drive/usb-val.csv', index=False)

In [23]:
# Test loading into dataset
usb_files = {'train': '/content/drive/My Drive/usb-train.csv',
             'val': '/content/drive/My Drive/usb-val.csv'}

usb_ds = load_dataset('csv', data_files=usb_files)
usb_ds = usb_ds.select_columns(['source', 'summary_sent', 'label'])
usb_ds = usb_ds.rename_column('source', 'premise').rename_column('summary_sent', 'hypothesis')

logger.info(f"Label distribution - Train: {Counter(usb_ds['train']['label'])}, Val: {Counter(usb_ds['val']['label'])}")

2024-05-11 20:38:11 - INFO - Label distribution - Train: Counter({2: 2525, 0: 2525}), Val: Counter({2: 1334, 0: 1334})
