<a href="https://colab.research.google.com/github/anushavasup/Generating-Titles-from-Abstracts/blob/main/Generating_Titles_from_Abstracts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook fine-tunes T5, a sequence-to-sequence model that frames every NLP task as text-to-text, to generate paper titles from abstracts.

In [10]:
# Install PyTorch, torchvision, and simpletransformers
!pip install torch torchvision
!pip install simpletransformers


In [11]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

In [12]:
# Mount Google Drive to access the data
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [13]:
# Unzip the arXiv metadata snapshot stored on the mounted Drive
import json
import zipfile
zip_file_path = '/content/drive/MyDrive/Paper Abstracts/arxiv-metadata-oai-snapshot.json.zip'
extract_path = os.path.dirname(zip_file_path)

with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)
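
Before extracting a large archive, it can help to check what it contains. The following is a minimal sketch using only the standard library; the `list_archive` helper and the tiny in-memory archive are illustrative assumptions, not part of this notebook's pipeline.

```python
import io
import zipfile


def list_archive(zip_source):
    """Return (name, uncompressed_size_in_bytes) for each member of a zip archive."""
    with zipfile.ZipFile(zip_source, "r") as zip_ref:
        return [(info.filename, info.file_size) for info in zip_ref.infolist()]


# Demo on a tiny in-memory archive (hypothetical content, for illustration only).
payload = '{"title": "demo"}\n'
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("sample.json", payload)

members = list_archive(buffer)
print(members)
```

With the real snapshot, the same helper could be called with `zip_file_path` to confirm the member name and size before running `extractall`.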



In [14]:
# Path to the extracted JSON file (one paper record per line)
extract_path = '/content/drive/MyDrive/Paper Abstracts/arxiv-metadata-oai-snapshot.json'

In [6]:
# Generator that yields the metadata file one line (one JSON record) at a time
def get_metadata():
    with open(extract_path, 'r') as f:
        for line in f:
            yield line
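
Because the generator yields one line at a time, the snapshot never has to fit in memory. The sketch below shows the same lazy pattern on a small in-memory sample; `iter_records` and the two sample records are hypothetical stand-ins, not names or data from the real file.

```python
import io
import json
from itertools import islice


def iter_records(lines):
    """Lazily parse an iterable of JSON lines into dicts, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-record sample standing in for the real snapshot file.
sample_file = io.StringIO(
    '{"title": "Paper A", "journal-ref": "J. Demo 1, 2019"}\n'
    '{"title": "Paper B", "journal-ref": null}\n'
)

# islice pulls only the first record; the second line is not parsed.
first = list(islice(iter_records(sample_file), 1))
print(first[0]["title"])
```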

In [7]:
metadata = get_metadata()
for paper in metadata:
    paper_dict = json.loads(paper)
    print('Title: {}\n\nAbstract: {}\nRef: {}'.format(paper_dict.get('title'), paper_dict.get('abstract'), paper_dict.get('journal-ref')))
    break

Title: Calculation of prompt diphoton production cross sections at Tevatron and
  LHC energies

Abstract:   A fully differential calculation in perturbative quantum chromodynamics is
presented for the production of massive photon pairs at hadron colliders. All
next-to-leading order perturbative contributions from quark-antiquark,
gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as
all-orders resummation of initial-state gluon radiation valid at
next-to-next-to-leading logarithmic accuracy. The region of phase space is
specified in which the calculation is most reliable. Good agreement is
demonstrated with data from the Fermilab Tevatron, and predictions are made for
more detailed tests with CDF and DO data. Predictions are shown for
distributions of diphoton pairs produced at the energy of the Large Hadron
Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs
boson are contrasted with those produced from QCD processes at the LHC, showing
tha

In [15]:
# Keep only papers whose journal-ref ends in a year from 2019 to 2022, to reduce run time.
titles = []
abstracts = []
years = []
metadata = get_metadata()
for paper in metadata:
    paper_dict = json.loads(paper)
    ref = paper_dict.get('journal-ref')
    try:
        year = int(ref[-4:])
    except (TypeError, ValueError):
        # journal-ref is missing (None) or does not end in an integer
        continue
    if 2018 < year < 2023:
        years.append(year)
        titles.append(paper_dict.get('title'))
        abstracts.append(paper_dict.get('abstract'))

len(titles), len(abstracts), len(years)

(11110, 11110, 11110)
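
The `ref[-4:]` slice keeps only references whose last four characters parse as a year, so a reference ending in a closing parenthesis or a page number is silently skipped. As a hedged alternative (not the method used in this notebook, and it would change the counts above), a regular expression can look for the last standalone year-like token. `year_from_suffix`, `year_from_regex`, and the sample strings are illustrative assumptions; the regex can also misread a page number that happens to look like a year.

```python
import re

YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def year_from_suffix(ref):
    """Mirror the notebook's approach: parse the last four characters as an int."""
    try:
        return int(ref[-4:])
    except (TypeError, ValueError):
        return None


def year_from_regex(ref):
    """Alternative: return the last standalone 19xx/20xx token, if any."""
    if not isinstance(ref, str):
        return None
    matches = YEAR_PATTERN.findall(ref)
    return int(matches[-1]) if matches else None


# Hypothetical journal-ref strings, for illustration only.
samples = ["Demo Journal 12:345,2019", "Demo Lett. 7, 101 (2021)", None]
for ref in samples:
    print(ref, year_from_suffix(ref), year_from_regex(ref))
```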

In [16]:
#converting to pandas data frame
papers = pd.DataFrame({
    'title': titles,
    'abstract': abstracts,
    'year': years
})
papers.head()

   title                                              abstract                                         year
0  Weight Reduction for Mod l Bianchi Modular Forms   Let K be an imaginary quadratic field with c...  2019
1  Spectroscopy and dissociative recombination of...  The dissociative recombination of the lowest...  2022
2  Nonequilibrium phase transition in a spreading...  We consider a nonequilibrium process on a ti...  2020
3  Quantum integrable systems in three-dimensiona...  In this paper we construct integrable three-...  2019
4  Numerical Performance of Compact Fourth Order ...  In this study the numerical performance of t...  2019


In [17]:
del titles, abstracts, years

We will use the simpletransformers library to train a T5 model.

In [18]:
import logging
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

In [19]:
papers = papers[['title','abstract']]
papers.columns = ['target_text', 'input_text']
papers = papers.dropna()

In [20]:
eval_df = papers.sample(frac=0.2, random_state=101)
train_df = papers.drop(eval_df.index)

In [21]:
train_df.shape, eval_df.shape

((8888, 2), (2222, 2))

We will train our T5 model with a bare-minimum configuration: num_train_epochs=4 and train_batch_size=16.
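
To get a feel for what this configuration means in practice, the number of batches per epoch can be worked out from the training-set size printed above. This is a small arithmetic sketch; it assumes one batch per step (no gradient accumulation), which is not set explicitly in this notebook.

```python
import math

train_rows = 8888        # rows in train_df, from the shapes printed above
train_batch_size = 16
num_train_epochs = 4

# The last batch may be partial, hence the ceiling.
batches_per_epoch = math.ceil(train_rows / train_batch_size)
total_batches = batches_per_epoch * num_train_epochs

print(batches_per_epoch, total_batches)
```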

In [22]:
import logging
import torch
import pandas as pd
from simpletransformers.t5 import T5Model

# simpletransformers expects a task prefix column alongside input_text and target_text
train_df['prefix'] = "summarize"
eval_df['prefix'] = "summarize"

# T5 model arguments
model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 512,
    "train_batch_size": 16,
    "num_train_epochs": 4,
}

# Create the T5 model from the t5-small checkpoint, running on GPU
model = T5Model(model_type="t5", model_name="t5-small", args=model_args, use_cuda=True)

# Fine-tune T5 on the title-generation task
model.train_model(train_df)

# Evaluate the fine-tuned model on the held-out set
results = model.eval_model(eval_df)



You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


  0%|          | 0/8888 [00:00<?, ?it/s]



Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Running Epoch 0 of 4:   0%|          | 0/556 [00:00<?, ?it/s]



Running Epoch 1 of 4:   0%|          | 0/556 [00:00<?, ?it/s]

Running Epoch 2 of 4:   0%|          | 0/556 [00:00<?, ?it/s]

Running Epoch 3 of 4:   0%|          | 0/556 [00:00<?, ?it/s]

  0%|          | 0/2222 [00:00<?, ?it/s]



Running Evaluation:   0%|          | 0/278 [00:00<?, ?it/s]
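
The training cell above fills the `prefix` column with "summarize", and the prediction cell further down builds its input as "summarize: " followed by the abstract. The sketch below mimics that "prefix: text" formatting in plain Python. `build_source_text` is a hypothetical helper, not a simpletransformers function, and the ": " separator reflects my understanding of how the library joins the two columns rather than something confirmed in this notebook.

```python
def build_source_text(prefix, input_text):
    """Join a task prefix and an input the way a "prefix: text" T5 prompt is written."""
    return f"{prefix}: {input_text}"


# Hypothetical row shaped like train_df (prefix, input_text, target_text).
row = {
    "prefix": "summarize",
    "input_text": "We study a toy problem and report a toy result.",
    "target_text": "A Toy Result for a Toy Problem",
}

source = build_source_text(row["prefix"], row["input_text"])
target = row["target_text"]
print(source)
print(target)
```

Keeping the prefix identical between training and prediction matters, since the model only ever sees the concatenated string.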

In [23]:
results

{'eval_loss': 1.461784690618515}
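
One way to read this number is to convert it to perplexity. The sketch below assumes that `eval_loss` is a mean per-token cross-entropy in nats, which is the usual convention for sequence-to-sequence losses but is not confirmed anywhere in this notebook.

```python
import math

eval_loss = 1.461784690618515  # value reported by eval_model above

# Assumption: eval_loss is a mean per-token cross-entropy in nats.
perplexity = math.exp(eval_loss)
print(f"perplexity ~ {perplexity:.2f}")
```

Under that assumption, a lower perplexity means the model spreads its probability over fewer candidate tokens at each step of the title.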

Generating paper titles

In [24]:
# Pick one held-out example by position and predict its title from the abstract
random_num = 456
actual_title = eval_df.iloc[random_num]['target_text']
actual_abstract = ["summarize: "+eval_df.iloc[random_num]['input_text']]
predicted_title = model.predict(actual_abstract)

print(f'Actual Title: {actual_title}')
print(f'Predicted Title: {predicted_title}')
print(f'Actual Abstract: {actual_abstract}')

Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]

`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of HuggingFace Transformers. Use the regular
`__call__` method to prepare your inputs and targets.

Here is a short example:

model_inputs = tokenizer(src_texts, text_target=tgt_texts, ...)

If you either need to use different keyword arguments for the source and target texts, you should do two calls like
this:

model_inputs = tokenizer(src_texts, ...)
labels = tokenizer(text_target=tgt_texts, ...)
model_inputs["labels"] = labels["input_ids"]

See the documentation of your specific tokenizer for more details on the specific arguments to the tokenizer of choice.
For a more complete example, see the implementation of `prepare_seq2seq_batch`.



Decoding outputs:   0%|          | 0/1 [00:00<?, ?it/s]

Actual Title: A Many-Objective Evolutionary Algorithm With Two Interacting Processes:
  Cascade Clustering and Reference Point Incremental Learning
Predicted Title: ['Many-Objective Adaptation for Cascade Clustering and Reference Point Incremental']
Actual Abstract: ['summarize:   Researches have shown difficulties in obtaining proximity while maintaining\ndiversity for many-objective optimization problems. Complexities of the true\nPareto front pose challenges for the reference vector-based algorithms for\ntheir insufficient adaptability to the diverse characteristics with no priori.\nThis paper proposes a many-objective optimization algorithm with two\ninteracting processes: cascade clustering and reference point incremental\nlearning (CLIA). In the population selection process based on cascade\nclustering (CC), using the reference vectors provided by the process based on\nincremental learning, the nondominated and the dominated individuals are\nclustered and sorted with different ma