<a href="https://colab.research.google.com/github/alialhousseini/glimpse-mds/blob/main/NLP_main_notebook3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!git clone https://github.com/alialhousseini/glimpse-mds

Cloning into 'glimpse-mds'...
remote: Enumerating objects: 356, done.[K
remote: Counting objects: 100% (99/99), done.[K
remote: Compressing objects: 100% (75/75), done.[K
remote: Total 356 (delta 43), reused 68 (delta 23), pack-reused 257 (from 1)[K
Receiving objects: 100% (356/356), 32.95 MiB | 16.77 MiB/s, done.
Resolving deltas: 100% (193/193), done.


In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

⏬ Downloading https://github.com/conda-forge/miniforge/releases/download/23.11.0-0/Mambaforge-23.11.0-0-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:14
🔁 Restarting kernel...


In [None]:
# Define the path for the new environment
env_path = '/content/glimpse_env'

In [None]:
# Create the Conda environment in the specified folder
!conda create --prefix "$env_path" python=3.11.8 -y

Channels:
 - conda-forge
Platform: linux-64
Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ done
Solving environment: / - done


In [None]:
%%capture
!source activate /content/glimpse_env && pip install -r /content/glimpse-mds/requirements

________

# 1. Extension 1: Use an ensemble set of models for producing RSA scores

| **Model**           | **Small Definition**                                                                                                         | **Model Ability and Performance**                                                                   | **Why Choose This**                                                                                       | **Purpose**                                                                                      |
|----------------------|-----------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| **BART**            | A transformer model optimized for sequence-to-sequence tasks like summarization and translation.                           | Focuses on fluency and readability with strong abstractive summarization capabilities.              | Pretrained on a diverse dataset, excels in generating coherent and fluent summaries.                      | Abstractive summarization and text generation tasks.                                            |
| **PEGASUS**         | A transformer model designed for summarization with gap-sentence generation pretraining.                                    | Excels at abstractive summarization and capturing document essence.                                 | Strong performance on summarization benchmarks like CNN/DailyMail and XSum.                               | Creating concise and coherent abstractive summaries.                                            |
| **T5**              | A unified text-to-text transformer model capable of handling multiple NLP tasks.                                            | Balances fluency, factual consistency, and flexibility across tasks.                                | Suitable for multitask setups with state-of-the-art results on summarization and NLI.                     | Summarization, translation, and text classification.                                            |
| **LED**             | A long-context version of BART tailored for summarizing long documents.                                                     | Processes up to 16,384 tokens, maintaining coherence and relevance over long contexts.              | Handles lengthy academic or technical documents effectively.                                              | Summarization of research papers, reports, and other long-form texts.                          |
| **BigBird-Pegasus** | A sparse-attention model designed for efficient processing of long documents.                                               | Balances computational efficiency with performance on long-form summarization tasks.                | Efficiently handles long contexts while preserving critical information.                                  | Summarization and classification for long texts.                                                |
| **PEGASUS (XSUM)**  | A PEGASUS model fine-tuned specifically on the XSum dataset for abstractive summarization.                                  | Produces highly concise summaries but can sometimes lose factual consistency.                       | Ideal for tasks requiring short, highly abstractive summaries.                                            | Summarizing news articles or brief documents.                                                   |
| **DistilBART**      | A distilled version of BART with faster inference and smaller size.                                                         | Slightly reduced fluency compared to BART but much faster and more efficient.                       | Useful for scenarios with computational constraints.                                                      | Quick summarization and inference on resource-limited systems.                                 |
| **mBART**           | A multilingual version of BART pre-trained on multiple languages.                                                           | Handles cross-lingual summarization and translation tasks effectively.                              | Ideal for multilingual datasets and scenarios involving diverse languages.                                | Summarization and translation for non-English or multilingual texts.                           |
| **Flan-T5**         | A fine-tuned version of T5 for enhanced task performance, including summarization and reasoning.                            | Improves generalization and performance across multiple NLP tasks.                                  | Great for zero-shot and few-shot setups, with excellent task-specific adaptation.                         | Summarization, classification, and text reasoning.                                              |
| **BERTSUM**         | A fine-tuned BERT model for extractive summarization tasks.                                                                 | Excels at identifying and extracting key sentences from texts.                                      | Reliable for extracting main points without paraphrasing.                                                 | Extractive summarization of structured documents.                                               |
| **FactPEGASUS**     | A fine-tuned PEGASUS model for ensuring factual consistency in summaries.                                                    | Focuses on generating factually accurate abstractive summaries.                                     | Essential for tasks where factual correctness is critical.                                                | Summarization for sensitive or factual information.                                             |
| **CTRLsumm**        | A fine-tuned model designed for conversational summarization.                                                               | Captures conversational nuances and preserves dialogue structure in summaries.                      | Ideal for summarizing meetings, chats, or interview transcripts.                                          | Summarization of conversational or dialogue-based content.                                      |
| **SciBERT**         | A BERT model trained on scientific literature.                                                                              | Captures domain-specific vocabulary and syntax effectively.                                         | Tailored for summarizing or analyzing scholarly articles and scientific texts.                            | Text summarization, classification, and entity recognition in scientific domains.               |
| **ALL-MPNET**       | A sentence transformer optimized for semantic similarity tasks.                                                              | Measures semantic similarity with high accuracy.                                                    | Ideal for evaluating consistency between source text and summaries.                                       | Consistency checking and similarity analysis.                                                   |
| **XLM-ROBERTA**     | A multilingual transformer model with strong cross-lingual understanding capabilities.                                       | Excels in multilingual understanding and summarization tasks.                                       | Suitable for multilingual datasets or cross-lingual consistency evaluation.                               | Multilingual summarization and text classification.                                             |
| **LongFormer-LARGE**| A transformer optimized for handling long contexts efficiently using sparse attention.                                       | Processes long documents with improved computational efficiency and accuracy.                       | Handles dense academic or technical texts requiring long-form attention.                                  | Summarization and question answering for long-form texts.                                       |


In [3]:
def consensus_scores_based_summaries(sample, n_consensus=3, n_dissensus=3):
    consensus_samples = sample['consensuality_scores'].sort_values(ascending=True).head(n_consensus).index.tolist()
    disensus_samples = sample['consensuality_scores'].sort_values(ascending=False).head(n_dissensus).index.tolist()

    consensus = ".".join(consensus_samples)
    disensus = ".".join(disensus_samples)

    return consensus + "\n\n" + disensus


def rsa_scores_based_summaries(sample, n_consensus=3, n_rsa_speaker=3):
    consensus_samples = sample['consensuality_scores'].sort_values(ascending=True).head(n_consensus).index.tolist()
    rsa = sample['best_rsa'].tolist()[:n_rsa_speaker]

    consensus = ".".join(consensus_samples)
    rsa = ".".join(rsa)

    return consensus + "\n\n" + rsa

In [4]:

%cd glimpse-mds/

/content/glimpse-mds


Now path is fixed

In [5]:
#Testing
from pathlib import Path
test = Path('data/all_reviews_2017.csv')
test.is_file()

True

In [6]:
%%capture
!pip install pandas
!pip install torch
!pip install datasets
!pip install tqdm
!pip install transformers
!pip install nltk
!pip install pickle
!pip install numpy
!pip install operator

In [7]:
import pandas as pd
from torch.utils.data import DataLoader
from datasets import Dataset
from tqdm import tqdm
import datetime
import torch
from pathlib import Path
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch
from typing import List, Dict
import nltk
import pickle
import re
import numpy as np
from functools import reduce
import operator
import random
import json
import os
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [8]:
def set_random_seed(seed: int):
    random.seed(seed)  # For Python's random
    np.random.seed(seed)  # For NumPy
    torch.manual_seed(seed)  # For PyTorch on CPU
    torch.cuda.manual_seed(seed)  # For PyTorch on GPU


set_random_seed(42)

In [None]:

class SummaryGenerator:

    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

    def __init__(self, model_name: str = 'BART', model=model, tokenizer=tokenizer, device="cuda"):
        self.model_name = model_name
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        self.generation_config = {
            "max_new_tokens": 200,
            "do_sample": True,
            "top_p": 0.95,
            "temperature": 1.0,
            "num_return_sequences": 8,
            "num_beams": 1,
            "early_stopping": True,
            "min_length": 0,
        }

    def evaluate_summarizer(self, dataset_path: Path, batch_size: int, trimming: bool, checkpoint_path: str) -> Dataset:
        """
        @param model: The model used to generate the summaries
        @param tokenizer: The tokenizer used to tokenize the text and the summary
        @param dataset: A dataset with the text
        @param decoding_config: Dictionary with the decoding config
        @param batch_size: The batch size used to generate the summaries
        @return: The same dataset with the summaries added
        """
        try:
            dataset = pd.read_csv(dataset_path)
        except:
            raise ValueError(f"Unknown dataset {dataset_path}")

        # make a dataset from the dataframe
        dataset = Dataset.from_pandas(dataset)

        # create a dataloader
        dataloader = DataLoader(
            dataset, batch_size=batch_size, shuffle=False, drop_last=trimming)

        # Checkpoint file to save progress
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path, "r") as f:
                checkpoint = json.load(f)
        else:
            checkpoint = {"processed_batches": 0, "summaries": []}

        summaries = checkpoint["summaries"]
        print("Generating summaries...")

        self.model.to(self.device)

        for batch_idx, batch in enumerate(tqdm(dataloader)):
            # Skip already processed batches
            if batch_idx < checkpoint["processed_batches"]:
                continue

            text = batch["text"]

            inputs = self.tokenizer(
                text,
                max_length=min(self.tokenizer.model_max_length,
                               768),  # Adjust max_length
                padding="max_length",
                truncation=True,
                return_tensors="pt",
            )

            # move inputs to device
            inputs = {key: value.to(self.device)
                      for key, value in inputs.items()}

            # generate summaries
            try:
                with torch.amp.autocast('cuda'):
                    outputs = self.model.generate(
                        **inputs, **self.generation_config)
            except RuntimeError as e:
                print(f"Error during generation: {e}")
                print(f"Input shape: {inputs['input_ids'].shape}")
                # Save progress before raising an error
                checkpoint["processed_batches"] = batch_idx
                checkpoint["summaries"] = summaries
                with open(checkpoint_path, "w") as f:
                    json.dump(checkpoint, f)
                raise

            total_size = outputs.numel()  # Total number of elements in the tensor
            # Target size of the last dimension
            target_size = batch_size * outputs.shape[-1]
            # Calculate the required padding size to make the total number of elements divisible by the target size
            pad_size = (target_size - (total_size % target_size)) % target_size

            # Pad the tensor with zeros to make the total number of elements divisible by the target size
            if not trimming and pad_size != 0:
                outputs = torch.nn.functional.pad(
                    outputs, (0, 0, 0, pad_size // outputs.shape[-1]))

            # output : (batch_size * num_return_sequences, max_length)
            try:
                outputs = outputs.reshape(batch_size, -1, outputs.shape[-1])
            except Exception as e:
                print(f"Error reshaping outputs: {e}")
                raise ValueError(f"Cannot reshape tensor of size {outputs.numel()} into shape "
                                 f"({batch_size}, -1, {outputs.shape[-1]}).")

            # decode summaries
            for b in range(batch_size):
                summaries.append(
                    [
                        self.tokenizer.decode(
                            outputs[b, i],
                            skip_special_tokens=True,
                        )
                        for i in range(outputs.shape[1])
                    ]
                )

            # Save progress after processing each batch
            checkpoint["processed_batches"] = batch_idx + 1
            checkpoint["summaries"] = summaries
            with open(checkpoint_path, "w") as f:
                json.dump(checkpoint, f)

            # if trimming the last batch, remove them from the dataset
            if trimming:
                dataset = dataset.select(range(len(summaries)))

        # add summaries to the huggingface dataset
        dataset = dataset.map(lambda example: {"summary": summaries.pop(0)})

        # Clean up the checkpoint file after successful completion
        if os.path.exists(checkpoint_path):
            os.remove(checkpoint_path)

        return dataset

    def generate_abstractive_summary(self, dataset_path: Path, batch_size: int, trimming: bool):
        self.tokenizer.pad_token = self.tokenizer.unk_token
        self.tokenizer.pad_token_id = self.tokenizer.unk_token_id

        dataset = self.evaluate_summarizer(
            dataset_path, batch_size, trimming, checkpoint_path="summarizer_checkpoint.json"
        )

        df_dataset = dataset.to_pandas()
        df_dataset = df_dataset.explode('summary')
        df_dataset = df_dataset.reset_index()
        # add an idx with  the id of the summary for each example
        df_dataset['id_candidate'] = df_dataset.groupby(['index']).cumcount()

        output_path = Path(
            f"data/candidates/{self.model_name}_{dataset_path.stem}_-_abstr.csv")
        # create output dir if it doesn't exist
        if not output_path.parent.exists():
            output_path.parent.mkdir(parents=True, exist_ok=True)
        df_dataset.to_csv(output_path, index=False, encoding="utf-8")
        print('done')

    def generate_extractive_summary(self, dataset_path: Path):
        try:
            dataset = pd.read_csv(dataset_path)
        except:
            raise ValueError(f"Unknown dataset {dataset_path}")

        # make a dataset from the dataframe
        dataset = Dataset.from_pandas(dataset)

        summaries = []

        # (tqdm library for progress bar)
        for sample in tqdm(dataset):
            text = sample["text"]

            # Replace any set of successive dashes (e.g., --, ----, -----) with a newline
            text = re.sub(r'-{2,}', '\n', text)

            # Remove patterns like ".2-" or isolated numerics with hyphens
            text = re.sub(r'\.\d+-', '', text)

            # Replace multiple newlines or spaces with a single newline or space
            # Replace multiple newlines with one
            text = re.sub(r'\n+', '\n', text)
            # Replace multiple spaces with one
            text = re.sub(r'\s+', ' ', text)

            # Remove any remaining unwanted characters (e.g., control characters)
            # Remove non-ASCII characters
            text = re.sub(r'[^\x00-\x7F]+', '', text)

            # To be discussed
            text = text.replace("\n", " ")

            sentences = nltk.sent_tokenize(text)

            # remove empty sentences
            sentences = [sentence for sentence in sentences if sentence != ""]

            # Filter out short or meaningless sentences
            sentences = [sent for sent in sentences if len(sent) > 8]

            summaries.append(sentences)

        # add summaries to the huggingface dataset
        dataset = dataset.map(lambda example: {"summary": summaries.pop(0)})

        df_dataset = dataset.to_pandas()
        df_dataset = df_dataset.explode("summary")
        df_dataset = df_dataset.reset_index()
        # add an idx with  the id of the summary for each example
        df_dataset["id_candidate"] = df_dataset.groupby(["index"]).cumcount()

        output_path = f"data/candidates/{dataset_path.stem}_-_extr.csv"
        output_path = Path(output_path)
        # create output dir if it doesn't exist
        if not output_path.parent.exists():
            output_path.parent.mkdir(parents=True, exist_ok=True)
        df_dataset.to_csv(output_path, index=False, encoding="utf-8")
        print('done')

    def change_model(self, model_name, model, tokenizer):
        self.model_name = model_name
        self.model = model
        self.tokenizer = tokenizer
########################################################################

In [11]:

class RSAReranking:
    """
    Rerank a list of candidates according to the RSA model.
    """

    def __init__(
            self,
            model,
            tokenizer,
            candidates: List[str],
            source_texts: List[str],
            batch_size: int = 32,
            rationality: int = 1,
            device="cuda",
    ):
        """
        :param model: hf model used to compute the likelihoods (supposed to be a seq2seq model), is S0 in the RSA model
        :param tokenizer:
        :param candidates: list of candidates summaries
        :param source_texts: list of source texts
        :param batch_size: batch size used to compute the likelihoods (can be high since we don't need gradients and
        it's a single forward pass)
        :param rationality: rationality parameter of the RSA model
        :param device: device used to compute the likelihoods
        """
        self.model = model
        self.device = device
        self.tokenizer = tokenizer

        self.candidates = candidates
        self.source_texts = source_texts

        self.batch_size = batch_size
        self.rationality = rationality

        self.model.to(self.device)

    def compute_conditionned_likelihood(
            self, x: List[str], y: List[str], mean: bool = True
    ) -> torch.Tensor:
        """
        Compute the likelihood of y given x

        :param x: list of source texts len(x) = batch_size
        :param y: list of candidates summaries len(y) = batch_size
        :param mean: average the likelihoods over the tokens of y or take the sum
        :return: tensor of shape (batch_size) containing the likelihoods of y given x
        """
        # Dummy inputs
        # source_texts = ["The paper is interesting."] -> 7 tokens
        # candidate_summaries = ["Well-written summary."] -> 7 tokens not necessary to have the same number of tokens
        assert len(x) == len(y)

        # Define the loss function (cross-entropy for token-level predictions)
        loss_fn = torch.nn.CrossEntropyLoss(reduction="none")

        if 'pegasus' in self.model.name_or_path:
          y = [summary for summary in y if isinstance(summary, str) and summary]
          x = x[:len(y)]  # Adjust x to match the filtered y length


        # Tokenize the source texts (x) and summaries (y)
        x = self.tokenizer(x, return_tensors="pt", padding=True, max_length=512,
                           truncation=True).to(self.device)
        y = self.tokenizer(y, return_tensors="pt", padding=True, max_length=512,
                           truncation=True).to(self.device)

        # Extract token IDs for input and output
        x_ids = x.input_ids.to(self.device)
        y_ids = y.input_ids.to(self.device)
        x_attention_mask = x.attention_mask.to(self.device)
        y_attention_mask = y.attention_mask.to(self.device)
        # print(x_ids.shape, y_ids.shape) -> (1, 7) (1, 7)
        # print(x_ids) -> tensor([[0,133,2225,16,2679,4,2]])

        # Pass the inputs through the model
        logits = self.model(
            input_ids=x_ids,
            decoder_input_ids=y_ids,
            attention_mask=x_attention_mask,
            decoder_attention_mask=y_attention_mask,
        ).logits

        # print(logits.shape) -> (1, 7, 50265)

        # Shift logits and token IDs for loss computation
        shifted_logits = logits[..., :-1, :].contiguous()
        shifted_ids = y_ids[..., 1:].contiguous()

        # print(shifted_logits.shape, shifted_ids.shape)
        # Result: (1, 6, 50265) (1, 6)

        # Compute token-level negative log-likelihood
        # shifted logits has a size (batch_size, sequence_length, vocab_size)
        # WE FLATTEN IT TO (batch_size x sequence_length, vocab_size)
        likelihood = -loss_fn(
            shifted_logits.view(-1, shifted_logits.size(-1)  # (1x6, 50265)
                                ), shifted_ids.view(-1)  # (1x6,)
        )

        # print(likelihood.shape) -> [6] == (6,)

        # Reshape the likelihood to match the batch
        # Reshape back to (batch_size, sequence_length) then sum(-1) -> (batch_size,)
        likelihood = likelihood.view(len(x["input_ids"]), -1).sum(-1)

        # print(likelihood.shape) -> [1] == (1,)

        # Normalize likelihood by the number of tokens if `mean=True`
        if mean:
            likelihood /= (y_ids !=
                           self.tokenizer.pad_token_id).float().sum(-1)

        # print(likelihood) = tensor([-6.6653])
        return likelihood

    def score(self, x: List[str], y: List[str], **kwargs):
        return self.compute_conditionned_likelihood(x, y, **kwargs)

    def likelihood_matrix(self) -> torch.Tensor:
        """
        Compute a likelihood matrix where entry (i, j) is the likelihood of
        candidate j summarizing source text i.

        Returns:
            torch.Tensor: Likelihood matrix of shape (len(source_texts), len(candidates)).
        """

        # initialize the likelihood matrix of size (len(source_texts), len(candidates))
        likelihood_matrix = torch.zeros(
            (len(self.source_texts), len(self.candidates))
        ).to(self.device)

        # create a list of pairs (i: index source, j: index candidate, source_text, candidate)
        pairs = []
        for i, source_text in enumerate(self.source_texts):
            for j, candidate in enumerate(self.candidates):
                pairs.append((i, j, source_text, candidate))

        batches = [
            pairs[i: i + self.batch_size]
            for i in range(0, len(pairs), self.batch_size)
        ]

        for batch in tqdm(batches):
            # get the source texts and candidates
            source_texts = [pair[2] for pair in batch]
            candidates = [pair[3] for pair in batch]

            # compute the likelihoods
            with torch.no_grad():
                likelihoods = self.score(
                    source_texts, candidates, mean=True
                )

            # fill the matrix
            # update the likelihood matrix with the likelihoods
            for k, (i, j, _, _) in enumerate(batch):
                likelihood_matrix[i, j] = likelihoods[k].detach()

        # return the likelihood matrix
        return likelihood_matrix


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [12]:
model_for_likelihood_computation = {
    # 'model_1':
    #     {
    #         'model_id': "facebook/bart-large-cnn",
    #         'model_name': "BART"
    #     },
    # 'model_2':
    #     {
    #         'model_id': "Falconsai/text_summarization",
    #         'model_name': "Falcon"
    #     },
    'model_3':
        {
            'model_id': "google/pegasus-xsum",
            'model_name': "PEGASUS-XSUM"
        },
    'model_4':
        {
            'model_id': "google/bigbird-pegasus-large-arxiv",
            'model_name': "PEGASUS-BigBird-Arxiv"
        },
    'model_5':
        {
            'model_id': "google/pegasus-arxiv",
            'model_name': "PEGASUS-Arxiv"
        }
}

path_candidates = Path("/content/drive/MyDrive/candidates2")

for model_count, model_info in model_for_likelihood_computation.items():
    # Load the model and tokenizer
    model_id = model_info['model_id']
    model_name = model_info['model_name']
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    for file in path_candidates.glob('*.csv'):
        # For each dataset we want to do the following
        # Extract LM_probas for each dataframe
        # Save them in a pkl file for each 'paper' (id)
        # Save the pkl files in a folder named after the dataset and the model used

        results = []
        curr_ds = pd.read_csv(file)

        # Name is a tuple e.g. ('id_name',)
        # group is a GroupedByKey DataFrame

        for name, group in tqdm(curr_ds.groupby(["id"])):

            rsa_reranker = RSAReranking(
                model,  # model on which we want to compute the RSA
                tokenizer,  # tokenizer for the model
                device='cuda',
                candidates=group.summary.unique().tolist(),
                source_texts=group.text.unique().tolist(),
                batch_size=32,
                rationality=1,
            )
            # print(len(group.summary.unique().tolist()))
            # print(len(group.text.unique().tolist()))
            lm_probas = rsa_reranker.likelihood_matrix()
            # print(lm_probas.shape)
            lm_probas = lm_probas.cpu().numpy()
            lm_probas_df = pd.DataFrame(lm_probas)
            lm_probas_df.index = group.text.unique().tolist()
            lm_probas_df.columns = group.summary.unique().tolist()
            gold = group['gold'].tolist()[0]

            results.append(
                {
                    "id": name,
                    "language_model_proba_df": lm_probas_df,
                    "gold": gold,
                    "rationality": 1,  # hyperparameter
                    "text_candidates": group
                }
            )

        # Save the results
        opt_dir = Path(f'data/lm_probas/')
        if not opt_dir.exists():
            opt_dir.mkdir(parents=True, exist_ok=True)

        opt_path = Path(f"data/lm_probas/{file.stem}-_-{model_name}.pkl")
        results = {"results": results}
        with open(opt_path, 'wb') as f:
            pickle.dump(results, f)
    print(f'{file} is done')

print('ALL DONE :)')


config.json:   0%|          | 0.00/1.39k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/259 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/87.0 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.52M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

  0%|          | 0/226 [00:00<?, ?it/s]
  0%|          | 0/5 [00:00<?, ?it/s][A
 40%|████      | 2/5 [00:00<00:00,  3.59it/s][A
 60%|██████    | 3/5 [00:01<00:00,  2.62it/s][A
 80%|████████  | 4/5 [00:01<00:00,  2.28it/s][A
100%|██████████| 5/5 [00:02<00:00,  2.42it/s]
  0%|          | 1/226 [00:02<10:52,  2.90s/it]
  0%|          | 0/5 [00:00<?, ?it/s][A
 40%|████      | 2/5 [00:00<00:00,  3.29it/s][A
 60%|██████    | 3/5 [00:01<00:00,  2.41it/s][A
 80%|████████  | 4/5 [00:01<00:00,  2.41it/s][A
100%|██████████| 5/5 [00:02<00:00,  2.25it/s]
  1%|          | 2/226 [00:05<10:08,  2.72s/it]
  0%|          | 0/11 [00:00<?, ?it/s][A
 18%|█▊        | 2/11 [00:00<00:02,  3.58it/s][A
 27%|██▋       | 3/11 [00:00<00:02,  3.02it/s][A
 36%|███▋      | 4/11 [00:01<00:03,  2.31it/s][A
 45%|████▌     | 5/11 [00:02<00:03,  1.92it/s][A
 55%|█████▍    | 6/11 [00:02<00:02,  1.85it/s][A
 64%|██████▎   | 7/11 [00:03<00:02,  1.70it/s][A
 73%|███████▎  | 8/11 [00:04<00:01,  1.69it/s][A
 82%

IndexError: index 31 is out of bounds for dimension 0 with size 31

In [None]:
import os
from google.colab import auth
from google.colab import runtime

# Authenticate for access to Colab's secrets
auth.authenticate_user()
token = os.environ.get('github_tk')

if not token:
    raise ValueError("GitHub token not found in Colab Secrets.")

# Configure Git
repository_url = "https://github.com/alialhousseini/glimpse-mds.git"
tokenized_url = repository_url.replace("https://", f"https://{token}@")


# Make changes (if any), add, and commit
!git add -A
!git commit -m "Commit by colab!"

# Push changes
!git push origin main
# #############################

# import shutil

# # Define source and destination paths
# repo_path = "/content/glimpse-mds/data/lm_probas"  # Path to your data in the repository
# drive_path = "/content/drive/My Drive/lm_probas"

# # Check if the destination folder exists
# if not os.path.exists(drive_path):
#     os.makedirs(drive_path)  # Create the folder if it doesn't exist
#     print(f"Created folder: {drive_path}")

# # Copy the entire folder or specific files
# shutil.copytree(repo_path, drive_path)  # For folders
# # shutil.copy(f"{repo_path}/file.txt", drive_path)  # For individual files


ValueError: GitHub token not found in Colab Secrets.

In [None]:

########################################################################

def elementwise_max(dfs):
    """
    dfs: list of DataFrames (same index/columns)
    """
    return reduce(lambda x, y: x.combine(y, func=max), dfs)

# Now we can write a script that takes the set of LM_probas for each dataset and (set) of models
# and aggregate them to get the final ranking

# We define a set of model names, this set represents the set of models we want to aggregate their results
# In addition we define a methodology of aggregation(e.g. mean, max, weighted_avg, etc.)

model_names = ["BART", "PEGASUS"]

# We need to find for each set of common datasets, the models we are looking for:
lm_probas_path = Path("data/lm_probas")
lm_probas_files = list(lm_probas_path.glob("*.pkl"))
# Filter out the files that do not contain the models we are looking for
# So we keep only the files that contain the models we are looking for
lm_probas_files = [file for file in lm_probas_files if any(
    model_name in file.stem.split('-_-')[-1] for model_name in model_names)]

# Now for each file, we collect filenames together to be processed
files_and_pickles = {}
for file in lm_probas_files:
    filename = file.stem.split('-_-')[0]
    if filename not in files_and_pickles:
        files_and_pickles[filename] = [file]
    else:
        files_and_pickles[filename].append(file)

method = "mean"

# Now we can aggregate the results
# We will aggregate the results for each dataset
for filename, files in files_and_pickles.items():
    # We iterate over the dict
    # filename is the name of the dataset
    # files is a list of paths to the pkl files

    # Load the results for each model
    pkls = [pd.read_pickle(f) for f in files]
    # Go to results
    pkls = [f['results'] for f in pkls]
    # Now pkls is a list of lists of dictionaries [ [{},{},{}], [{},{},{}], ...]
    # We want to access the language_model_proba_df for each dictionary in parallel
    # i.e. [ [{a1},{b1},{c1}], [{a2},{b2},{c2}], ...] -> [ {a_i}, {b_i}, {c_i} ]

    # Results
    results = []
    for i in range(len(pkls[0])):  # iterate over the dictionaries
        # index 'i' is shared
        set_of_dicts = [pkls[j][i] for j in range(len(pkls))]
        # set_of_dicts is a list of dictionaries that share the same index
        # [{a1}, {a2}, {a3}, ...]
        # Now we want to aggregate the language_model_proba_df for each dictionary
        new_dict = {}
        new_dict['id'] = set_of_dicts[0]['id']
        new_dict['gold'] = set_of_dicts[0]['gold']
        new_dict['rationality'] = set_of_dicts[0]['rationality']
        new_dict['text_candidates'] = set_of_dicts[0]['text_candidates']
        # Now we want to aggregate the language_model_proba_df
        # THIS HAS TO BE DONE ACCORDING TO A METHOD (max, weighted_avg, etc.)
        set_of_dfs = [d['language_model_proba_df'] for d in set_of_dicts]

        # Additional check of consistency
        ref_index = set_of_dfs[0].index
        ref_columns = set_of_dfs[0].columns

        for t, df in enumerate(set_of_dfs[1:], start=2):
            # Compare sets OR compare ordered lists
            if not df.index.equals(ref_index):
                raise ValueError(
                    f"DataFrame #{i} index does not match the reference. "
                    f"Expected {list(ref_index)}, got {list(df.index)}."
                )
            if not df.columns.equals(ref_columns):
                raise ValueError(
                    f"DataFrame #{i} columns do not match the reference. "
                    f"Expected {list(ref_columns)}, got {list(df.columns)}."
                )

        if method == "mean":

            # To aggregation safely
            df_sum = reduce(operator.add, set_of_dfs)
            df_agg = df_sum / len(set_of_dfs)

        if method == "max":
            df_agg = elementwise_max(set_of_dfs)

        # Save it!
        new_dict['language_model_proba_df'] = df_agg

        # Save model names used as well
        new_dict['model_names'] = model_names

        results.append(new_dict)

    results = {"results": results}
    # Save the results
    opt_dir = Path(f'data/agg_lms/')
    if not opt_dir.exists():
        opt_dir.mkdir(parents=True, exist_ok=True)
    opt_path = Path(f"data/agg_lms/{filename}.pkl")
    with open(opt_path, 'wb') as f:
        pickle.dump(results, f)
########################################################################
