Paper Title Generation from Abstracts

# ***Installing the dependencies***

In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-1.18.0-py3-none-any.whl (311 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.1.0-py3-none-any.whl (133 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting multidict<7.0,>=4.5
  Downloading multidict-6.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (94 kB)

In [None]:
!pip install simplet5

Collecting simplet5
  Downloading simplet5-0.1.3.tar.gz (7.2 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting transformers==4.10.0
  Downloading transformers-4.10.0-py3-none-any.whl (2.8 MB)
Collecting pytorch-lightning==1.4.5
  Downloading pytorch_lightning-1.4.5-py3-none-any.whl (919 kB)
Collecting PyYAML>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting torchmetrics>=0.4.0
  Downloading torchmetrics-0.7.0-py3-none-any.whl (396 kB)
Collecting future>=0.17.1
  Downloading future-0.18.2.tar.gz (

In [None]:
!pip install kaggle



# ***Reading the dataframe***

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import dask.bag as db
import json
import pandas as pd

docs = db.read_text('/content/drive/MyDrive/Blog_3/arxiv-metadata-oai-snapshot.json').map(json.loads)

In [None]:
# The dataset is very large, and I am not sure the whole set can be used, so I start prototyping with a subset of the data that is easier to handle.
# This procedure was recommended in the arXiv dataset itself.

get_latest_version = lambda x: x['versions'][-1]['created']


# get only necessary fields of the metadata file
trim = lambda x: {'id': x['id'],
                  'authors': x['authors'],
                  'title': x['title'],
                  'doi': x['doi'],
                  'category':x['categories'].split(' '),
                  'abstract':x['abstract'],}
# keep only papers whose latest version was created in 2019 or later
# (the year is the fourth space-separated token of the `created` string)
columns = ['id','category','abstract']
docs_df = (docs.filter(lambda x: int(get_latest_version(x).split(' ')[3]) > 2018)
           .map(trim)
           .compute())

# convert to pandas
docs_df = pd.DataFrame(docs_df)
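
The filter above reads the year by splitting the `created` string on spaces and taking the fourth token. The sketch below shows an alternative that parses the whole date with the standard library, assuming `created` is an RFC 2822 style string such as "Mon, 2 Apr 2007 19:18:42 GMT" (which is what the index-based split implies). The helper names `latest_version_year` and `is_recent` and the sample record are hypothetical, for illustration only.

```python
from email.utils import parsedate_to_datetime


def latest_version_year(record):
    """Return the year in which the most recent version of a record was created."""
    # Assumed format of `created`: "Mon, 2 Apr 2007 19:18:42 GMT"
    created = record["versions"][-1]["created"]
    return parsedate_to_datetime(created).year


def is_recent(record, min_year=2019):
    """True if the latest version was created in `min_year` or later."""
    return latest_version_year(record) >= min_year


# Hypothetical record, trimmed to the fields used here.
sample_record = {
    "id": "0000.00000",
    "versions": [
        {"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"},
        {"version": "v2", "created": "Tue, 5 Feb 2019 10:00:00 GMT"},
    ],
}

print(latest_version_year(sample_record), is_recent(sample_record))
```

A function like `is_recent` could be passed to `docs.filter(...)` in place of the lambda. Note that it selects on the date of the latest version, so older papers that were revised recently (for example ids starting with `0704.`) are kept as well.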

In [None]:
docs_df

Unnamed: 0,id,authors,title,doi,category,abstract
0,0704.0479,T.Geisser,The affine part of the Picard scheme,,"[math.AG, math.KT]",We describe the maximal torus and maximal un...
1,0704.1445,Yasha Gindikin and Vladimir A. Sablikov,Deformed Wigner crystal in a one-dimensional q...,10.1103/PhysRevB.76.045122,"[cond-mat.str-el, cond-mat.mes-hall]",The spatial Fourier spectrum of the electron...
2,0705.0033,"Nikos Frantzikinakis, Randall McCutcheon",Ergodic Theory: Recurrence,,[math.DS],We survey the impact of the Poincar\'e recur...
3,0705.0344,J. P. Pridham,Unifying derived deformation theories,,[math.AG],We develop a framework for derived deformati...
4,0705.0825,Ram Gopal Vishwakarma (Zacatecas University),Einstein's Theory of Gravity in the Presence o...,10.1007/s10509-009-0016-8,"[gr-qc, astro-ph, hep-th]",The mysterious `dark energy' needed to expla...
...,...,...,...,...,...,...
562117,quant-ph/0612050,"Igor Devetak, Jon Yard",The exact cost of redistributing multipartite ...,10.1103/PhysRevLett.100.230501,[quant-ph],How correlated are two quantum systems from ...
562118,quant-ph/0701163,Daegene Song,Does Observation Create Reality?,,[quant-ph],It has been suggested that the locality of i...
562119,quant-ph/0702160,"Andrew M. Childs, Richard Cleve, Stephen P. Jo...",Discrete-query quantum algorithm for NAND trees,10.4086/toc.2009.v005a005,[quant-ph],"Recently, Farhi, Goldstone, and Gutmann gave..."
562120,quant-ph/9606017,Arthur Jabs,Quantum Mechanics in Terms of Realism,,[quant-ph],.We expound an alternative to the Copenhagen...


In [None]:
docs_df = docs_df.drop(["id", "authors", "doi", "category"], axis=1)

In [None]:
docs_df

Unnamed: 0,title,abstract
0,The affine part of the Picard scheme,We describe the maximal torus and maximal un...
1,Deformed Wigner crystal in a one-dimensional q...,The spatial Fourier spectrum of the electron...
2,Ergodic Theory: Recurrence,We survey the impact of the Poincar\'e recur...
3,Unifying derived deformation theories,We develop a framework for derived deformati...
4,Einstein's Theory of Gravity in the Presence o...,The mysterious `dark energy' needed to expla...
...,...,...
562117,The exact cost of redistributing multipartite ...,How correlated are two quantum systems from ...
562118,Does Observation Create Reality?,It has been suggested that the locality of i...
562119,Discrete-query quantum algorithm for NAND trees,"Recently, Farhi, Goldstone, and Gutmann gave..."
562120,Quantum Mechanics in Terms of Realism,.We expound an alternative to the Copenhagen...


In [None]:
docs_df = docs_df.drop_duplicates()

In [None]:
docs_df = docs_df.rename(columns = {'title': 'target_text', 'abstract': 'source_text'})
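
The abstracts in the snapshot keep their leading spaces and hard line breaks, as the printed samples in this notebook show. The sketch below is an optional cleanup step, not part of the original pipeline: a hypothetical helper `normalize_whitespace` that collapses whitespace runs into single spaces before the text is passed to the model.

```python
import re

_WHITESPACE = re.compile(r"\s+")


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces and trim the ends."""
    return _WHITESPACE.sub(" ", text).strip()


# Shortened example in the style of the abstracts in this dataset.
raw_abstract = (
    "  This work describes the task of metaphoric paraphrase generation, in which we\n"
    "are given a literal sentence and are charged with generating a metaphoric\n"
    "paraphrase.\n"
)

print(normalize_whitespace(raw_abstract))
```

With the column names used in this notebook, it could be applied as `docs_df['source_text'] = docs_df['source_text'].map(normalize_whitespace)`, and likewise for `target_text`.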

In [None]:
# split the data into training and test
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(docs_df, test_size=0.2)

In [None]:
train_df

Unnamed: 0,target_text,source_text
210743,Classification of discrete modular symmetries ...,We classify discrete modular symmetries in t...
276787,Vortex lattice in two-dimensional chiral XY fe...,"In this Letter we will show that, in the pre..."
205026,Parameterizing the Energy Dissipation Rate in ...,We use a database of direct numerical simula...
309769,A General Superapproximation Result,A general superapproximation result is deriv...
20022,Distributed Simulation and Distributed Inference,Independent samples from an unknown probabil...
...,...,...
526534,Fundamental Wireless Performance of a Building,Over 80% of wireless traffic already takes p...
550228,Fifth-order Z-type weighted essentially non-os...,In this paper we propose the variant Z-type ...
442440,Front propagation of a sexual population with ...,The adaptation of biological species to thei...
560352,Equation-of-motion and Lorentz-invariance rela...,Structure functions of polarized spin-1 hadr...


In [None]:
test_df

Unnamed: 0,target_text,source_text
499755,Matching of Fracture Functions for SIDIS in Ta...,In the target fragmentation region of Semi-I...
494077,Explicit decay rate for the Gini index in the ...,We investigate the repeated averaging model ...
360040,Color-Dipole Picture versus Hard Pomeron in De...,For photon virtualities of $Q^2 \gsim 20 {\r...
339236,Second Order parallel tensor on generalized f....,The purpose of the present paper to study a ...
193824,Rigorous Theory of the Thin Vapor Layers Optic...,The theory of the thin vapor layers linear o...
...,...,...
175554,Duckiefloat: a Collision-Tolerant Resource-Con...,There are several challenges for search and ...
512385,Compound Krylov subspace methods for parametri...,"In this work, we propose a reduced basis met..."
557468,Gaussian Processes for Finite Size Extrapolati...,Key to being able to accurately model the pr...
135822,"Covering graphs, magnetic spectral gaps and ap...","In this article, we analyze the spectrum of ..."



In [None]:
!pip install torchmetrics==0.6.2

Collecting torchmetrics==0.6.2
  Downloading torchmetrics-0.6.2-py3-none-any.whl (332 kB)
[?25l[K     |█                               | 10 kB 22.8 MB/s eta 0:00:01[K     |██                              | 20 kB 10.6 MB/s eta 0:00:01[K     |███                             | 30 kB 8.9 MB/s eta 0:00:01[K     |████                            | 40 kB 8.2 MB/s eta 0:00:01[K     |█████                           | 51 kB 5.0 MB/s eta 0:00:01[K     |██████                          | 61 kB 5.2 MB/s eta 0:00:01[K     |███████                         | 71 kB 5.4 MB/s eta 0:00:01[K     |███████▉                        | 81 kB 6.0 MB/s eta 0:00:01[K     |████████▉                       | 92 kB 4.7 MB/s eta 0:00:01[K     |█████████▉                      | 102 kB 5.2 MB/s eta 0:00:01[K     |██████████▉                     | 112 kB 5.2 MB/s eta 0:00:01[K     |███████████▉                    | 122 kB 5.2 MB/s eta 0:00:01[K     |████████████▉                   | 133 kB 5.2 MB/s

In [None]:
from simplet5 import SimpleT5

model = SimpleT5()
model.from_pretrained(model_type="t5", model_name="t5-base")

Global seed set to 42



In [None]:
model.train(train_df=train_df[:5000],
            eval_df=test_df[:100], 
            source_max_token_len=128, 
            target_max_token_len=50, 
            batch_size=2, max_epochs=5, use_gpu=True)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                       | Params
-----------------------------------------------------
0 | model | T5ForConditionalGeneration | 222 M 
-----------------------------------------------------
222 M     Trainable params
0         Non-trainable params
222 M     Total params
891.614   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

  f"The dataloader, {name}, does not have many workers which may be a bottleneck."
Global seed set to 42
  f"The dataloader, {name}, does not have many workers which may be a bottleneck."



In [None]:
! ( cd outputs; ls )

simplet5-epoch-0-train-loss-2.3988  simplet5-epoch-3-train-loss-1.491
simplet5-epoch-1-train-loss-2.0225  simplet5-epoch-4-train-loss-1.3004
simplet5-epoch-2-train-loss-1.7344
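
Checkpoint directory names encode the epoch and the training loss, so the name to load can be derived from the listing instead of being typed by hand. The sketch below assumes the naming pattern shown in the listing above (`simplet5-epoch-<n>-train-loss-<loss>`); the helper `latest_checkpoint` is hypothetical and not part of SimpleT5.

```python
import re

# Pattern taken from the directory names in the listing above.
CHECKPOINT_PATTERN = re.compile(r"^simplet5-epoch-(\d+)-train-loss-([0-9.]+)$")


def latest_checkpoint(names):
    """Return the checkpoint directory name with the highest epoch number, or None."""
    parsed = []
    for name in names:
        match = CHECKPOINT_PATTERN.match(name)
        if match:
            parsed.append((int(match.group(1)), name))
    return max(parsed)[1] if parsed else None


names = [
    "simplet5-epoch-0-train-loss-2.3988",
    "simplet5-epoch-1-train-loss-2.0225",
    "simplet5-epoch-2-train-loss-1.7344",
    "simplet5-epoch-3-train-loss-1.491",
    "simplet5-epoch-4-train-loss-1.3004",
]

print(latest_checkpoint(names))
```

In the notebook, `names` could come from `os.listdir("outputs")`, and the result could be joined with the `outputs` folder path before it is passed to `model.load_model`.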


In [None]:
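# copy (rather than move) the final checkpoint to Drive so the local copy stays available for loading
!cp -r /content/outputs/simplet5-epoch-4-train-loss-1.3004 /content/drive/MyDrive/Blog_3/latest_runs

In [None]:
# load the trained model from the local output folder for inference
# (the directory name must match the `ls` listing of the outputs folder)
model.load_model("/content/outputs/simplet5-epoch-4-train-loss-1.3004", use_gpu=True)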

In [None]:
# let's see how it performed on 10 random test abstracts:
sample_abstracts = test_df.sample(10)

for _, row in sample_abstracts.iterrows():
    print("===== Abstract =====")
    print(row['source_text'])
    # predict() returns a list of generated strings; take the first one
    generated_title = model.predict(row['source_text'])[0]
    print("\n===== Actual Title =====")
    print(row['target_text'])
    print("\n===== Generated Title =====")
    print(generated_title)
    print("\n +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\n")

===== Abstract =====
  In this work we propose a robust methodology to mitigate the undesirable
effects caused by outliers to generate reliable physical models. In this way,
we formulate the inverse problems theory in the context of Kaniadakis
statistical mechanics (or $\kappa$-statistics), in which the classical approach
is a particular case. In this regard, the errors are assumed to be distributed
according to a finite-variance $\kappa$-generalized Gaussian distribution.
Based on the probabilistic maximum-likelihood method we derive a
$\kappa$-objective function associated with the finite-variance
$\kappa$-Gaussian distribution. To demonstrate our proposal's
outlier-resistance, we analyze the robustness properties of the
$\kappa$-objective function with help of the so-called influence function. In
this regard, we discuss the role of the entropic index ($\kappa$) associated
with the Kaniadakis $\kappa$-entropy in the effectiveness in inferring physical
parameters by using strongly noi

In [None]:
# let's see how it performed on another 10 random test abstracts:
sample_abstracts = test_df.sample(10)

for _, row in sample_abstracts.iterrows():
    print("===== Abstract =====")
    print(row['source_text'])
    # predict() returns a list of generated strings; take the first one
    generated_title = model.predict(row['source_text'])[0]
    print("\n===== Actual Title =====")
    print(row['target_text'])
    print("\n===== Generated Title =====")
    print(generated_title)
    print("\n +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\n")

===== Abstract =====
  This work describes the task of metaphoric paraphrase generation, in which we
are given a literal sentence and are charged with generating a metaphoric
paraphrase. We propose two different models for this task: a lexical
replacement baseline and a novel sequence to sequence model, 'metaphor
masking', that generates free metaphoric paraphrases. We use crowdsourcing to
evaluate our results, as well as developing an automatic metric for evaluating
metaphoric paraphrases. We show that while the lexical replacement baseline is
capable of producing accurate paraphrases, they often lack metaphoricity, while
our metaphor masking model excels in generating metaphoric sentences while
performing nearly as well with regard to fluency and paraphrase quality.


===== Actual Title =====
Metaphoric Paraphrase Generation

===== Generated Title =====
'Metaphor Masking' as a Replacement Baseline and a Sequence to Sequence Model for Paraphrase Generation

 ++++++++++++++++++++++++++
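
Reading generated titles next to the actual ones gives a qualitative impression only. As a rough quantitative proxy, the sketch below computes a unigram-overlap F1 score between an actual and a generated title. This is a simplified stand-in and not an official ROUGE implementation; the function names `tokenize` and `unigram_f1` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(reference, candidate):
    """F1 score of the unigram overlap between a reference and a candidate title."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Multiset intersection: each token counts at most as often as it occurs in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


actual = "Metaphoric Paraphrase Generation"
generated = ("'Metaphor Masking' as a Replacement Baseline and a Sequence to "
             "Sequence Model for Paraphrase Generation")

print(round(unigram_f1(actual, generated), 3))
```

Exact token matching penalises near misses such as "Metaphor" versus "Metaphoric", so a score like this is best treated as a coarse signal averaged over many test examples rather than as a verdict on a single title.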