In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('data/processed/all_reviews_2017.csv')

In [3]:
df[['text', 'gold']].head()

Unnamed: 0,text,gold
0,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...
1,Results on the VQA task are good for this simp...,The program committee appreciates the authors'...
2,This work proposes to approximate the bilinear...,The program committee appreciates the authors'...
3,Summary:--------This paper proposes to use sur...,"Based on the feedback, I'm going to be rejecti..."
4,This paper proposes to use previous error sign...,"Based on the feedback, I'm going to be rejecti..."


The dataset for each year consists of the columns `['id', 'text', 'gold']`:
- `text`: the source document, i.e. a single review of the paper.
- `gold`: the reference summary. We assume that the area chair's motivation for their decision is a reasonable summary to compare against.

*Note*: Three reviews are extracted for each paper, so the `gold` value is the same for all reviews of that paper.
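
The note above can be checked mechanically. Below is a minimal sketch in plain Python, using made-up rows in place of the CSV (the ids, texts and the helper `golds_per_paper` are illustrative, not part of the repository):

```python
from collections import defaultdict

# Toy rows standing in for the CSV: (id, text, gold). All values are made up.
rows = [
    ("paper-a", "review 1 of A", "decision for A"),
    ("paper-a", "review 2 of A", "decision for A"),
    ("paper-a", "review 3 of A", "decision for A"),
    ("paper-b", "review 1 of B", "decision for B"),
    ("paper-b", "review 2 of B", "decision for B"),
]

def golds_per_paper(rows):
    """Map each paper id to the set of distinct gold texts attached to it."""
    golds = defaultdict(set)
    for paper_id, _text, gold in rows:
        golds[paper_id].add(gold)
    return dict(golds)

golds = golds_per_paper(rows)
# Papers whose reviews disagree on the gold summary (none expected)
inconsistent = [pid for pid, g in golds.items() if len(g) != 1]
print(inconsistent)
```

On the real dataframe, the same check should be expressible as `df.groupby('id')['gold'].nunique()`, which would be 1 for every paper if the note holds.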

______

We go through the directories in order and apply some of their functions.

We start with the directory `glimpse/baselines`, which produces the baseline results used for comparison.

1. `generate_llm_summaries.py`

In [4]:
import pandas as pd
from pathlib import Path

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import re
import argparse
from tqdm import tqdm

from glimpse.baselines import generate_llm_summaries



In [8]:
import os

# Replaces the original model, 'togethercomputer/Llama-2-7B-32K-Instruct'
model_name = "meta-llama/Llama-3.2-1B-Instruct"
# Read the Hugging Face access token from the environment; never hard-code it in a notebook
token = os.environ.get("HF_TOKEN")
tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.float16, token=token)

In [9]:
df = generate_llm_summaries.prepare_dataset('reviews_2017', dataset_path='data/processed/')

In [10]:
# Group rows by paper id and concatenate the review texts
df = generate_llm_summaries.group_text_by_id(df)

df.head(3)

# After grouping by id, `text` is the concatenation of all reviews and `gold` is unchanged.

Unnamed: 0_level_0,text,gold
id,Unnamed: 1_level_1,Unnamed: 2_level_1
https://openreview.net/forum?id=B1-Hhnslg,The paper is an extension of the matching netw...,The program committee appreciates the authors'...
https://openreview.net/forum?id=B1-q5Pqxl,The paper looks at the problem of locating the...,This paper provides two approaches to question...
https://openreview.net/forum?id=B16Jem9xe,I just noticed I submitted my review as a pre-...,"Hello Authors, Congratulations on the accepta..."


In [11]:
# Keep only the first 10 samples for testing
df = df.head(10)
len(df)

10

In [12]:
# Add pad token
tokenizer.pad_token = tokenizer.eos_token

df = generate_llm_summaries.generate_summaries(model, tokenizer, df, batch_size=2, device='cuda')

  0%|          | 0/10 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


['[INST]\nThe paper is an extension of the matching networks by Vinyals et al. in NIPS2016. Instead of using all the examples in the support set during test, the method represents each class by the mean of its learned embeddings. The training procedure and experimental setting are very similar to the original matching networks. I am not completely sure about its advantages over the original matching networks. It seems to me when dealing with 1-shot case, these two methods are identical since there is only one example seen in this class, so the mean of the embedding is the embedding itself. When dealing with 5-shot case, original matching networks compute the weighted average of all examples, but it is at most 5x cost. The experimental results reported for prototypical nets are only slightly better than matching networks. I  think it is a simple, straightforward,  novel extension, but I am not fully convinced its advantages.  This paper proposes an improved version of matching networks,

 10%|█         | 1/10 [00:21<03:17, 21.90s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


['[INST]\nThe paper looks at the problem of locating the answer to a question in a text (For this task the answer is always part of the input text). For this the paper proposes to combine two existing works: Match-LSTM to relate question and text representations and Pointer Net to predict the location of the answer in the text.----------------Strength:--------- The suggested approach makes sense for the task and achieves good performance, (although as the authors mention, recent concurrent works achieve better results)--------- The paper is evaluated on the SQuAD dataset and achieves significant improvements over prior work.------------------------Weaknesses:--------1. It is unclear from the paper how well it is applicable to other problem scenarios where the answer is not a subset of the input text.--------2. Experimental evaluation--------2.1. It is not clear why the Bi-Ans-Ptr in Table 2 is not used for the ensemble although it achieves the best performance.--------2.2. It would be 

 20%|██        | 2/10 [00:35<02:15, 16.91s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


['[INST]\nI just noticed I submitted my review as a pre-review question - sorry about this. Here it is again, with a few more thoughts added...----------------The authors present a great and - as far as I can tell - accurate and honest overview of the emerging theory about GANs from a likelihood ratio estimation/divergence minimisation perspective. It is well written and a good read, and one I would recommend to people who would like to get involved in GANs.----------------My main problem with this submission is that it is hard as a reviewer to pin down what precisely the novelty is - beyond perhaps articulating these views better than other papers have done in the past. A sentence from the paper "But it has left us unsatisfied since we have not gained the insight needed to choose between them.” summarises my feeling about this paper: this is a nice \'unifying review’ type paper that - for me - lacks a novel insight.----------------In summary, my assessment is mixed: I think this is a 

 30%|███       | 3/10 [00:41<01:22, 11.79s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


['[INST]\nThis paper proposed a novel adversarial framework to train a model from demonstrations in a third-person perspective, to perform the task in the first-person view. Here the adversarial training is used to extract a novice-expert (or third-person/first-person) independent feature so that the agent can use to perform the same policy in a different view point.----------------While the idea is quite elegant and novel (I enjoy reading it), more experiments are needed to justify the approach. Probably the most important issue is that there is no baseline, e.g., what if we train the model with the image from the same viewpoint? It should be better than the proposed approach but how close are they? How the performance changes when we gradually change the viewpoint from third-person to first-person? Another important question is that maybe the network just blindly remembers the policy, in this case, the extracted feature could be artifacts of the input image that implicitly counts the

 40%|████      | 4/10 [00:56<01:19, 13.32s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


['[INST]\nThe authors present a simple method to affix a cache to neural language models, which provides in effect a copying mechanism from recently used words. Unlike much related work in neural networks with copying mechanisms, this mechanism need not be trained with long-term backpropagation, which makes it efficient and scalable to much larger cache sizes. They demonstrate good improvements on language modeling by adding this cache to RNN baselines.----------------The main contribution of this paper is the observation that simply using the hidden states h_i as keys for words x_i, and h_t as the query vector, naturally gives a lookup mechanism that works fine without tuning by backprop. This is a simple observation and might already exist as folk knowledge among some people, but it has nice implications for scalability and the experiments are convincing.----------------The basic idea of repurposing locally-learned representations for large-scale attention where backprop would normal

 50%|█████     | 5/10 [01:06<00:59, 11.95s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


["[INST]\nThis paper investigates the hessian of small deep networks near the end of training. The main result is that many eigenvalues are approximately zero, such that the Hessian is highly singular, which means that a wide amount of theory does not apply.----------------The overall point that deep learning algorithms are singular, and that this undercuts many theoretical results, is important but it has already been made: Watanabe. “Almost All Learning Machines are Singular”, FOCI 2007. This is one paper in a growing body of work investigating this phenomenon. In general, the references for this paper could be fleshed out much further—a variety of prior work has examined the Hessian in deep learning, e.g., Dauphin et al. “Identifying and attacking the saddle point problem in high dimensional non-convex optimization” NIPS 2014 or the work of Amari and others.----------------Experimentally, it is hard to tell how results from the small sized networks considered here might translate to

 60%|██████    | 6/10 [01:36<01:13, 18.34s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


["[INST]\nThe authors proposes an interesting idea of connecting the energy-based model (descriptor) and --------the generator network to help each other. The samples from the generator are used as the initialization --------of the descriptor inference. And the revised samples from the descriptor is in turn used to update--------the generator as the target image. ----------------The proposed idea is interesting. However, I think the main flaw is that the advantages of having that --------architecture are not convincingly demonstrated in the experiments. For example, readers will expect --------quantative analysis on how initializing with the samples from the generator helps? Also, the only --------quantative experiment on the reconstruction is also compared to quite old models. Considering that --------the model is quite close to the model of Kim & Bengio 2016, readers would also expect a comparison --------to that model. ----------------** Minor--------- I'm wondering if the analysis 

 70%|███████   | 7/10 [01:48<00:48, 16.07s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


["[INST]\nThis is a parallel work with BiGAN.  The idea is using auto encoder to provide extra information for discriminator. This approach seems is promising from reported result. After reading the rebuttal, I decided to increase my score. I think ALI somehow stabilizes the GAN training as demonstrated in Fig. 8 and learns a reasonable inference network.---------------------------------------Initial Review:----------------This paper proposes a new method for learning an inference network in the GAN framework. ALI's objective is to match the joint distribution of hidden and visible units imposed by an encoder and decoder network. ALI is trained on multiple datasets, and it seems to have a good reconstruction even though it does not have an explicit reconstruction term in the cost function. This shows it is learning a decent inference network for GAN.----------------There are currently many ways to learn an inference network for GANs: One can learn an inference network after training th

 80%|████████  | 8/10 [01:52<00:24, 12.19s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


['[INST]\nThis paper proposes a multimodal neural machine translation that is based upon previous work using variational methods but attempts to ground semantics with images. Considering way to improve translation with visual information seems like a sensible thing to do when such data is available. ----------------As pointed out by a previous reviewer, it is not actually correct to do model selection in the way it was done in the paper. This makes the gains reported by the authors very marginal. In addition, as the author\'s also said in their question response, it is not clear if the model is really learning to capture useful image semantics. As such, it is unfortunately hard to conclude that this paper contributes to the direction that originally motivated it. The paper proposes an approach to the task of multimodal machine translation, namely to the case when an image is available that corresponds to both source and target sentences. ----------------The idea seems to be to use a la

 90%|█████████ | 9/10 [01:56<00:09,  9.87s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


['[INST]\nThis paper shows that extending deep RL algorithms to decide which action to take as well as how many times to repeat it leads to improved performance on a number of domains. The evaluation is very thorough and shows that this simple idea works well in both discrete and continuous actions spaces.----------------A few comments/questions:--------- Table 1 could be easier to interpret as a figure of histograms.--------- Figure 3 could be easier to interpret as a table.--------- How was the subset of Atari games selected?--------- The Atari evaluation does show convincing improvements over A3C on games requiring extended exploration (e.g. Freeway and Seaquest), but it would be nice to see a full evaluation on 57 games. This has become quite standard and would make it possible to compare overall performance using mean and median scores.--------- It would also be nice to see a more direct comparison to the STRAW model of Vezhnevets et al., which aims to solve some of the same probl

100%|██████████| 10/10 [02:06<00:00, 12.61s/it]


In [13]:
import textwrap
textwrap.wrap(df['summary'].iloc[0], width=100)

['[INST] The paper is an extension of the matching networks by Vinyals et al. in NIPS2016. Instead of',
 'using all the examples in the support set during test, the method represents each class by the mean',
 'of its learned embeddings. The training procedure and experimental setting are very similar to the',
 'original matching networks. I am not completely sure about its advantages over the original matching',
 'networks. It seems to me when dealing with 1-shot case, these two methods are identical since there',
 'is only one example seen in this class, so the mean of the embedding is the embedding itself. When',
 'dealing with 5-shot case, original matching networks compute the weighted average of all examples,',
 'but it is at most 5x cost. The experimental results reported for prototypical nets are only slightly',
 'better than matching networks. I  think it is a simple, straightforward,  novel extension, but I am',
 'not fully convinced its advantages.  This paper proposes an imp

In [14]:
df.columns

Index(['text', 'gold', 'instruction', 'summary'], dtype='object')

Conclusion:
- Each paper (document) has 3 reviews, which are concatenated into a single text.
- The concatenated reviews are fed to a model, which produces a summary of them.
- To do that, the `generate_summaries` function first adds an `instruction` column to `df`, in which the instruction is applied to `text`.
- The model then produces a summary for each text, and `df` is returned with a new `summary` column.
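
The grouping and instruction steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `PROMPT_TEMPLATE`, `group_reviews` and `add_instruction` are hypothetical names, and the real template in `generate_llm_summaries.py` may differ (the outputs above only show that prompts start with `[INST]`).

```python
# Hypothetical prompt template; the actual one is defined in generate_llm_summaries.py
PROMPT_TEMPLATE = "[INST]\n{text}\n[/INST]"

def group_reviews(rows):
    """Concatenate the review texts of each paper and keep its (shared) gold."""
    grouped = {}
    for paper_id, text, gold in rows:
        entry = grouped.setdefault(paper_id, {"texts": [], "gold": gold})
        entry["texts"].append(text)
    return {
        pid: {"text": " ".join(e["texts"]), "gold": e["gold"]}
        for pid, e in grouped.items()
    }

def add_instruction(grouped, template=PROMPT_TEMPLATE):
    """Add an `instruction` field by applying the template to each text."""
    for entry in grouped.values():
        entry["instruction"] = template.format(text=entry["text"])
    return grouped

rows = [
    ("paper-a", "First review.", "Decision A"),
    ("paper-a", "Second review.", "Decision A"),
    ("paper-b", "Only review.", "Decision B"),
]
prepared = add_instruction(group_reviews(rows))
print(prepared["paper-a"]["instruction"])
```

The `instruction` field is what would be tokenized and passed to the model; the generated text would then be stored under `summary`.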

2. `sumy_baselines.py`

In [15]:
from glimpse.baselines import sumy_baselines

In [17]:
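for N in [1]:
    summaries = []
    for text in df.text:
        summary = sumy_baselines.summarize('LSA', "english", N, "text", text)
        summaries.append(summary)

    df['summary'] = summaries
    df["metadata/method"] = 'LSA'
    df["metadata/sentence_count"] = N

    # Output file name. The original interpolated the whole DataFrame via `{df}`;
    # "all_reviews_2017" is assumed here as the dataset name.
    name = f"all_reviews_2017-_-LSA-_-sumy_{N}.csv"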

In [18]:
df

Unnamed: 0_level_0,text,gold,instruction,summary,metadata/method,metadata/sentence_count
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
https://openreview.net/forum?id=B1-Hhnslg,The paper is an extension of the matching netw...,The program committee appreciates the authors'...,[INST]\nThe paper is an extension of the match...,"On Cub 200, I thought that the state-of-the-ar...",LSA,1
https://openreview.net/forum?id=B1-q5Pqxl,The paper looks at the problem of locating the...,This paper provides two approaches to question...,[INST]\nThe paper looks at the problem of loca...,The authors might want to consider pointing to...,LSA,1
https://openreview.net/forum?id=B16Jem9xe,I just noticed I submitted my review as a pre-...,"Hello Authors, Congratulations on the accepta...",[INST]\nI just noticed I submitted my review a...,"It is well written and a good read, and one I ...",LSA,1
https://openreview.net/forum?id=B16dGcqlx,This paper proposed a novel adversarial framew...,pros: - new problem - huge number of experim...,[INST]\nThis paper proposed a novel adversaria...,I will list these concerns in the following (i...,LSA,1
https://openreview.net/forum?id=B184E5qee,The authors present a simple method to affix a...,Reviewers agree that this paper is based on a ...,[INST]\nThe authors present a simple method to...,They demonstrate good improvements on language...,LSA,1
https://openreview.net/forum?id=B186cP9gx,This paper investigates the hessian of small d...,This is quite an important topic to understand...,[INST]\nThis paper investigates the hessian of...,"Overall, the results feel preliminary but like...",LSA,1
https://openreview.net/forum?id=B1E7Pwqgl,The authors proposes an interesting idea of co...,While the paper may have an interesting theore...,[INST]\nThe authors proposes an interesting id...,"On the third in-painting tasks, baselines are ...",LSA,1
https://openreview.net/forum?id=B1ElR4cgg,This is a parallel work with BiGAN. The idea ...,The reviewers were positive about this paper a...,[INST]\nThis is a parallel work with BiGAN. T...,ALI's objective is to match the joint distribu...,LSA,1
https://openreview.net/forum?id=B1G9tvcgx,This paper proposes a multimodal neural machin...,The area chair agrees with the reviewers that ...,[INST]\nThis paper proposes a multimodal neura...,This paper proposes a multimodal neural machin...,LSA,1
https://openreview.net/forum?id=B1GOWV5eg,This paper shows that extending deep RL algori...,The basic idea of this paper is simple: run RL...,[INST]\nThis paper shows that extending deep R...,This has become quite standard and would make ...,LSA,1


Using this script, we can produce summaries with the following methods: 'LSA', 'Text Rank', 'LexRank', 'Edmundson', 'Luhn', 'KL-Sum' and 'Random'.
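
To give an idea of what the simplest of these baselines does, here is a sketch of a 'Random'-style extractive summarizer in plain Python. It is not sumy's implementation: the regex sentence splitter, the `seed` parameter and the choice to keep the selected sentences in document order are assumptions of this sketch.

```python
import random
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def random_baseline(text, sentence_count, seed=0):
    """Pick `sentence_count` sentences at random and return them in document order."""
    sentences = split_sentences(text)
    k = min(sentence_count, len(sentences))
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(sentences)), k))
    return " ".join(sentences[i] for i in chosen)

review = "The idea is simple. The experiments are thorough! Is the baseline fair? I lean towards accept."
print(random_baseline(review, 2))
```

The other methods differ only in how sentences are scored and selected; the input (one concatenated text) and the output (N sentences) have the same shape.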

_____

We now move to the second directory, `glimpse/data_loading`, which contains 3 scripts (one of them can be skipped).

1. `generate_abstractive_candidates.py`

In [19]:
GENERATION_CONFIGS = {
    "top_p_sampling": {
        "max_new_tokens": 200,
        "do_sample": True,
        "top_p": 0.95,
        "temperature": 1.0,
        "num_return_sequences": 8,
        "num_beams" : 1,

        #"num_beam_groups" : 4,
    },

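    **{
        f"sampling_topp_{str(topp).replace('.', '')}": {
            "max_new_tokens": 200,
            "do_sample": True,
            "num_return_sequences": 8,
            # NOTE: hard-coded, so all four `sampling_topp_*` configs are identical;
            # the loop variable `topp` was probably intended here.
            "top_p": 0.95,
        }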
        for topp in [0.5, 0.8, 0.95, 0.99]
    },
}

for key, value in GENERATION_CONFIGS.items():
    GENERATION_CONFIGS[key] = {
        # "max_length": 2048,
        "min_length": 0,
        "early_stopping": True,
        **value,
    }

In [20]:
GENERATION_CONFIGS['sampling_topp_05']

{'min_length': 0,
 'early_stopping': True,
 'max_new_tokens': 200,
 'do_sample': True,
 'num_return_sequences': 8,
 'top_p': 0.95}
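
Note that the key `sampling_topp_05` maps to `'top_p': 0.95` in the output above, because the comprehension sets a constant instead of the loop variable. If the intent was one config per nucleus threshold (an assumption about the intent, not something the script confirms), the construction could look like the sketch below. `build_sampling_configs` is a hypothetical helper; the other values are copied from the cell above.

```python
def build_sampling_configs(top_p_values, num_return_sequences=8, max_new_tokens=200):
    """Build one nucleus-sampling config per top_p value, keyed by that value."""
    configs = {}
    for topp in top_p_values:
        key = f"sampling_topp_{str(topp).replace('.', '')}"
        configs[key] = {
            "min_length": 0,
            "early_stopping": True,
            "max_new_tokens": max_new_tokens,
            "do_sample": True,
            "num_return_sequences": num_return_sequences,
            "top_p": topp,  # use the loop variable so each key gets its own threshold
        }
    return configs

configs = build_sampling_configs([0.5, 0.8, 0.95, 0.99])
print(configs["sampling_topp_05"]["top_p"])
```

With this change each key name matches the threshold it actually samples with, which matters when candidates from different configs are compared later.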

In [21]:
from glimpse.data_loading import generate_abstractive_candidates

In [22]:
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.unk_token_id
dataset = generate_abstractive_candidates.prepare_dataset('data/processed/all_reviews_2017.csv')

In [23]:
dataset = dataset.select(range(10))

In [None]:

tokenizer.pad_token = tokenizer.eos_token
dataset = generate_abstractive_candidates.evaluate_summarizer(
    model,
    tokenizer,
    dataset,
    GENERATION_CONFIGS['sampling_topp_05'],
    2,
    'cuda',
    True,
)


Generating summaries...


  0%|          | 0/5 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
 20%|██        | 1/5 [00:21<01:25, 21.39s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
 40%|████      | 2/5 [00:42<01:03, 21.02s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
 60%|██████    | 3/5 [01:02<00:41, 20.92s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
 80%|████████  | 4/5 [01:23<00:20, 20.89s/it]Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
100%|██████████| 5/5 [01:44<00:00, 20.95s/it]
Map: 100%|██████████| 10/10 [00:00<00:00, 480.69 examples/s]


In [25]:
dataset

Dataset({
    features: ['id', 'text', 'gold', 'summary'],
    num_rows: 10
})

In [26]:
dataset[0]

{'id': 'https://openreview.net/forum?id=r1rhWnZkg',
 'text': 'Summary: The paper presents low-rank bilinear pooling that uses Hadamard product (commonly known as element-wise multiplication). The paper implements low-rank bilinear pooling on an existing model (Kim et al., 2016b) and builds a model for Visual Question Answering (VQA) that outperforms the current state-of-art by 0.42%. The paper presents various ablation studies of the new VQA model they built.----------------Strengths:----------------1. The paper presents new insights into element-wise multiplication operation which has been previously used in VQA literature (such as Antol et al., ICCV 2015) without insights on why it should work. ----------------2. The paper presents a new model for the task of VQA that beats the current state-of-art by 0.42%. However, I have concerns about the statistical significance of the performance (see weaknesses below).----------------3. The various design choices made in model development have

In [27]:
df_dataset = dataset.to_pandas()
df_dataset = df_dataset.explode('summary')
df_dataset = df_dataset.reset_index()

In [28]:
df_dataset['id_candidate'] = df_dataset.groupby(['index']).cumcount()

In [29]:
df_dataset.head()

Unnamed: 0,index,id,text,gold,summary,id_candidate
0,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,Summary: The paper presents low-rank bilinear ...,0
1,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,Summary: The paper presents low-rank bilinear ...,1
2,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,Summary: The paper presents low-rank bilinear ...,2
3,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,Summary: The paper presents low-rank bilinear ...,3
4,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,Summary: The paper presents low-rank bilinear ...,4
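
The `explode` and `cumcount` steps above turn one row per paper (holding a list of candidates) into one row per candidate, numbered within each paper. A plain-Python sketch of the same reshaping, with made-up data and a hypothetical helper name, assuming `summary` holds a list of candidate strings:

```python
def explode_candidates(samples):
    """Emit one row per candidate summary, numbered within each sample."""
    rows = []
    for index, sample in enumerate(samples):
        for id_candidate, summary in enumerate(sample["summary"]):
            # Copy every field except the candidate list, then add the per-candidate fields
            row = {k: v for k, v in sample.items() if k != "summary"}
            row.update(index=index, summary=summary, id_candidate=id_candidate)
            rows.append(row)
    return rows

samples = [
    {"id": "paper-a", "text": "reviews of A", "gold": "decision A", "summary": ["cand 0", "cand 1"]},
    {"id": "paper-b", "text": "reviews of B", "gold": "decision B", "summary": ["cand 0"]},
]
for row in explode_candidates(samples):
    print(row["index"], row["id"], row["id_candidate"], row["summary"])
```

The pair (`index`, `id_candidate`) identifies a candidate uniquely, which is what the table above shows for the first paper.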


2. `generate_extractive_candidates.py`

This works like `generate_abstractive_candidates.py`, except that the candidate summaries are simply the individual sentences of the source text.
___

The next step is to explore `glimpse/evaluate`, which provides a set of evaluators. Nothing special happens here: the dataframe is filtered to pair the gold text with the summaries, and the chosen evaluator is applied.

1. `evaluate_bartbert_metrics.py`: computes BERTScore
2. `evaluate_common_metrics_samples.py`: computes ROUGE (ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-Lsum)
3. `evaluate_seahorse_metrics_samples.py`: computes SEAHORSE metrics


In [37]:
textwrap.wrap(df[['text','summary']].iloc[0]['text'], width=100)

['The paper is an extension of the matching networks by Vinyals et al. in NIPS2016. Instead of using',
 'all the examples in the support set during test, the method represents each class by the mean of its',
 'learned embeddings. The training procedure and experimental setting are very similar to the original',
 'matching networks. I am not completely sure about its advantages over the original matching',
 'networks. It seems to me when dealing with 1-shot case, these two methods are identical since there',
 'is only one example seen in this class, so the mean of the embedding is the embedding itself. When',
 'dealing with 5-shot case, original matching networks compute the weighted average of all examples,',
 'but it is at most 5x cost. The experimental results reported for prototypical nets are only slightly',
 'better than matching networks. I  think it is a simple, straightforward,  novel extension, but I am',
 'not fully convinced its advantages.  This paper proposes an improved v

In [38]:

textwrap.wrap(df_dataset[['text','summary']].iloc[0]['summary'], width=100)

['Summary: The paper presents low-rank bilinear pooling that uses Hadamard product (commonly known as',
 'element-wise multiplication). The paper implements low-rank bilinear pooling on an existing model',
 '(Kim et al., 2016b) and builds a model for Visual Question Answering (VQA) that outperforms the',
 'current state-of-art by 0.42%. The paper presents various ablation studies of the new VQA model they',
 'built.----------------Strengths:----------------1. The paper presents new insights into element-wise',
 'multiplication operation which has been previously used in VQA literature (such as Antol et al.,',
 'ICCV 2015) without insights on why it should work. ----------------2. The paper presents a new model',
 'for the task of VQA that beats the current state-of-art by 0.42%. However, I have concerns about the',
 'statistical significance of the performance (see weaknesses below).----------------3. The various',
 'design choices made in model development have been experimentally ver

In [41]:
from glimpse.evaluate import evaluate_common_metrics_samples

# This script is used to evaluate the quality of the generated summaries
# Using ROUGE

metrics = evaluate_common_metrics_samples.evaluate_rouge(df.head())

In [43]:
metrics

{'rouge1': [0.22764227642276422,
  0.27586206896551724,
  0.21513944223107567,
  0.29824561403508776,
  0.10416666666666666],
 'rouge2': [0.049586776859504134,
  0.05263157894736842,
  0.016064257028112452,
  0.017857142857142856,
  0.010443864229765011],
 'rougeL': [0.14634146341463414,
  0.15517241379310348,
  0.11155378486055777,
  0.17543859649122806,
  0.057291666666666664],
 'rougeLsum': [0.14634146341463414,
  0.15517241379310348,
  0.11155378486055777,
  0.17543859649122806,
  0.057291666666666664]}
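
To make these numbers easier to read, here is a simplified sketch of a ROUGE-1 F-measure: the unigram overlap between a reference and a candidate. It is not the implementation used by the evaluator; tokenization is a plain lowercase whitespace split with no stemming, and it is an assumption that the values above are F-measures.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat sat on the mat", "the cat lay on a mat"))
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L replaces the n-gram overlap with the longest common subsequence.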

Applying SEAHORSE as a test:

In [50]:
from glimpse.evaluate import evaluate_seahorse_metrics_samples
from transformers import AutoModelForSeq2SeqLM

model_id = "google/seahorse-large-q2"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.float16)
quest = "SHMetric/Repetition"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = model.to('cuda')
metrics = evaluate_seahorse_metrics_samples.evaluate_classification_task(model, tokenizer, quest, df, 2)

  0%|          | 0/5 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
100%|██████████| 5/5 [00:19<00:00,  3.83s/it]


In [51]:
metrics

{'SHMetric/Repetition/proba_1': [0.962890625,
  0.9873046875,
  0.99072265625,
  0.9892578125,
  0.99169921875,
  0.99169921875,
  0.9892578125,
  0.994140625,
  0.99267578125,
  0.986328125],
 'SHMetric/Repetition/proba_0': [0.036956787109375,
  0.01275634765625,
  0.009033203125,
  0.01068878173828125,
  0.00814056396484375,
  0.00835418701171875,
  0.0105743408203125,
  0.0060577392578125,
  0.007213592529296875,
  0.01345062255859375],
 'SHMetric/Repetition/guess': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
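
In the output above, `proba_1` and `proba_0` sum to roughly 1 for each sample, and `guess` is 1 wherever `proba_1` is the larger of the two. One way such values can arise is a softmax over the model's scores for the two answers "0" and "1". This is an assumption about the evaluator, not something confirmed from the script, and `binary_scores` is a hypothetical helper:

```python
import math

def binary_scores(logit_0, logit_1):
    """Softmax over two answer logits, plus the more likely answer as `guess`."""
    m = max(logit_0, logit_1)  # subtract the max for numerical stability
    e0 = math.exp(logit_0 - m)
    e1 = math.exp(logit_1 - m)
    proba_0 = e0 / (e0 + e1)
    proba_1 = e1 / (e0 + e1)
    guess = 1 if proba_1 > proba_0 else 0
    return {"proba_1": proba_1, "proba_0": proba_0, "guess": guess}

print(binary_scores(-2.0, 2.0))
```

Under this reading, a `proba_1` close to 1 for `SHMetric/Repetition` means the model answers "1" with high confidence for that summary.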

_____

Below is a small test of *generating extractive candidates*.



In [7]:
df = pd.read_csv('data/candidates/extractive_sentences-_-all_reviews_2017_tst-_-none-_-2024-12-18-20-15-06.csv')

In [8]:
df

Unnamed: 0,index,id,text,gold,summary,id_candidate
0,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,Summary: The paper presents low-rank bilinear ...,0
1,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,The paper implements low-rank bilinear pooling...,1
2,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,The paper presents various ablation studies of...,2
3,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,-Strengths:-1.,3
4,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,The paper presents new insights into element-w...,4
...,...,...,...,...,...,...
150,8,https://openreview.net/forum?id=BkCPyXm1l,This manuscript tries to tackle neural network...,The reviewers unanimously recommend rejection.,It's unclear what conclusions can be drawn abo...,8
151,8,https://openreview.net/forum?id=BkCPyXm1l,This manuscript tries to tackle neural network...,The reviewers unanimously recommend rejection.,-I have remaining reservations about data hygi...,9
152,8,https://openreview.net/forum?id=BkCPyXm1l,This manuscript tries to tackle neural network...,The reviewers unanimously recommend rejection.,"Relatedly, the regularization potential of ear...",10
153,8,https://openreview.net/forum?id=BkCPyXm1l,This manuscript tries to tackle neural network...,The reviewers unanimously recommend rejection.,"See, e.g.",11


In [9]:
df[df['index'] == 0]

Unnamed: 0,index,id,text,gold,summary,id_candidate
0,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,Summary: The paper presents low-rank bilinear ...,0
1,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,The paper implements low-rank bilinear pooling...,1
2,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,The paper presents various ablation studies of...,2
3,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,-Strengths:-1.,3
4,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,The paper presents new insights into element-w...,4
5,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,-2.,5
6,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,The paper presents a new model for the task of...,6
7,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,"However, I have concerns about the statistical...",7
8,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,-3.,8
9,0,https://openreview.net/forum?id=r1rhWnZkg,Summary: The paper presents low-rank bilinear ...,The program committee appreciates the authors'...,The various design choices made in model devel...,9


In [11]:
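import nltk

# The reviews use runs of dashes as separators; turn them into line breaks before splitting
text = df['text'][0].replace('-----', '\n')
sentc = nltk.sent_tokenize(text)

# One candidate row per sentence: the first paper yields 29 sentences
len(sentc) == len(df[df['index'] == 0]) == 29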

True
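
Candidates such as `-Strengths:-1.` and `-2.` in the tables above suggest that replacing fixed groups of five dashes leaves stray dashes behind when a separator's length is not a multiple of five. A possible alternative, sketched here with a hypothetical helper and an assumed separator length of 16 dashes, is to collapse any long run of dashes in one step:

```python
import re

def normalize_separators(text, min_run=4):
    """Replace every run of `min_run` or more dashes with a single newline."""
    return re.sub(r"-{%d,}" % min_run, "\n", text)

sep = "-" * 16  # assumed separator length, for illustration
sample = f"built.{sep}Strengths:{sep}1. The paper presents new insights."

# Fixed-width replacement: 16 dashes become three newlines plus one leftover dash
print(repr(sample.replace("-----", "\n")))
# Run-based replacement: the whole separator becomes one newline
print(repr(normalize_separators(sample)))
```

Whether this changes the number of extracted sentences on the real data would need to be checked against the 29 found above.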

____