In [2]:
import json
from tqdm import tqdm
import pandas as pd
from simpletransformers.t5 import T5Model
from sklearn.model_selection import train_test_split

Read the data

In [3]:
def get_metadata():
    with open("../input/arxiv/arxiv-metadata-oai-snapshot.json") as f:
        for line in f:
            yield line

In [4]:
metadata = get_metadata()

for paper in metadata:
    first_paper = json.loads(paper)
    break
    
for key in first_paper:
    print(key, ':', first_paper[key])

id : 0704.0001
submitter : Pavel Nadolsky
authors : C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan
title : Calculation of prompt diphoton production cross sections at Tevatron and
  LHC energies
comments : 37 pages, 15 figures; published version
journal-ref : Phys.Rev.D76:013009,2007
doi : 10.1103/PhysRevD.76.013009
report-no : ANL-HEP-PR-07-12
categories : hep-ph
license : None
abstract :   A fully differential calculation in perturbative quantum chromodynamics is
presented for the production of massive photon pairs at hadron colliders. All
next-to-leading order perturbative contributions from quark-antiquark,
gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as
all-orders resummation of initial-state gluon radiation valid at
next-to-next-to-leading logarithmic accuracy. The region of phase space is
specified in which the calculation is most reliable. Good agreement is
demonstrated with data from the Fermilab Tevatron, and predictions are made for
more detai
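
Note that `metadata` is a generator, so the loop that peeks at the first paper also consumes that record, and any later pass over the same `metadata` object starts from the second line. A minimal sketch of peeking without affecting the full pass, assuming a hypothetical helper `iter_records` and a small in-memory sample standing in for the real snapshot file:

```python
import io
import json

def iter_records(lines):
    """Yield one parsed JSON record per non-empty line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)

# Stand-in for the snapshot file (hypothetical two-record sample).
sample = io.StringIO(
    '{"id": "0001", "journal-ref": "J. Ex. 1"}\n'
    '{"id": "0002", "journal-ref": null}\n'
)

# Peek at the first record with a throwaway generator ...
first_paper = next(iter_records(sample))

# ... then rewind and create a fresh generator for the full pass.
sample.seek(0)
all_ids = [record["id"] for record in iter_records(sample)]
```

With the real file, calling `get_metadata()` again before the main loop should have the same effect, since each call reopens the file from the start.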

In [5]:
author = [] 
title = []
categories = []
abstract = []

n_journal_publicated = 0

for paper in tqdm(metadata):
    paper = json.loads(paper)
    # Keep only papers that have a journal reference
    if paper['journal-ref'] is not None:
        n_journal_publicated += 1
        author.append(paper['submitter'])
        title.append(paper['title'])
        categories.append(paper['categories'])
        abstract.append(paper['abstract'])

print(f'paper publicated on journals is: {n_journal_publicated}')

2061366it [01:15, 27275.43it/s]

paper publicated on journals is: 761112





In [6]:
df = pd.DataFrame({'author':author,
                   'title':title,
                   'categories':categories, 
                   'abstract':abstract})
df.head()

Unnamed: 0,author,title,categories,abstract
0,Alberto Torchinsky,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,math.CA math.FA,In this paper we show how to compute the $\L...
1,Alejandro Corichi,Polymer Quantum Mechanics and its Continuum Limit,gr-qc,A rather non-standard quantum representation...
2,Damian Swift,Numerical solution of shock and ramp compressi...,cond-mat.mtrl-sci,A general formulation was developed to repre...
3,Paul Harvey,"The Spitzer c2d Survey of Large, Nearby, Inste...",astro-ph,We discuss the results from the combined IRA...
4,Christian Stahn,Fermionic superstring loop amplitudes in the p...,hep-th,The pure spinor formulation of the ten-dimen...


In [7]:
df.shape

(761112, 4)

The Simple Transformers implementation of T5 expects the training data as a pandas DataFrame with three columns: prefix, input_text, and target_text.

prefix: A string indicating the task to perform

input_text: The input text sequence

target_text: The target sequence

In [8]:
summarize = ['summarize'] * df.shape[0]

df_t5 = pd.DataFrame({'prefix':summarize,
                   'input_text':abstract,
                   'target_text':title})

df_t5.head()

Unnamed: 0,prefix,input_text,target_text
0,summarize,In this paper we show how to compute the $\L...,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...
1,summarize,A rather non-standard quantum representation...,Polymer Quantum Mechanics and its Continuum Limit
2,summarize,A general formulation was developed to repre...,Numerical solution of shock and ramp compressi...
3,summarize,We discuss the results from the combined IRA...,"The Spitzer c2d Survey of Large, Nearby, Inste..."
4,summarize,The pure spinor formulation of the ten-dimen...,Fermionic superstring loop amplitudes in the p...
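
The raw titles and abstracts contain hard line breaks and leading spaces (visible in the first record printed earlier), and these end up in both the model input and the target. A possible cleanup step is sketched below with hypothetical helpers `normalize_ws` and `to_t5_rows`; whether it improves results was not tested here.

```python
def normalize_ws(text):
    """Collapse runs of whitespace, including newlines, into single spaces."""
    return " ".join(text.split())

def to_t5_rows(abstracts, titles, prefix="summarize"):
    """Build rows with the three fields the T5 wrapper expects."""
    return [
        {
            "prefix": prefix,
            "input_text": normalize_ws(abstract),
            "target_text": normalize_ws(title),
        }
        for abstract, title in zip(abstracts, titles)
    ]

rows = to_t5_rows(
    ["  A fully differential calculation in perturbative quantum\n"
     "chromodynamics is presented."],
    ["Calculation of prompt diphoton production cross sections at Tevatron and\n"
     "  LHC energies"],
)
# pd.DataFrame(rows) would give the same three-column layout as df_t5.
```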


In [14]:
df_t5 = df_t5.iloc[:10000]

Split the data into train and test sets and train the model

In [15]:
train, test = train_test_split(df_t5, test_size=0.3)
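
Because this split is unseeded, each run draws a different partition, so an index such as `test.iloc[300]` will not point at the same paper from one run to the next. In scikit-learn, passing `random_state` to `train_test_split` fixes the shuffle. The sketch below illustrates the idea with the standard library only; `seeded_split` is a hypothetical helper, not part of the notebook.

```python
import random

def seeded_split(items, test_size=0.3, seed=42):
    """Shuffle indices with a fixed seed and cut off the last test_size fraction."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    split_at = len(items) - n_test
    train = [items[i] for i in indices[:split_at]]
    test = [items[i] for i in indices[split_at:]]
    return train, test

rows = list(range(10000))
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)  # same seed, same partition
```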

In [10]:
model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 512,
    "train_batch_size": 4,
    "num_train_epochs": 4,
}

model = T5Model("t5", "t5-small", args=model_args, use_cuda=True)

Downloading:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/231M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/773k [00:00<?, ?B/s]

In [16]:
model.train_model(train)

  0%|          | 0/7000 [00:00<?, ?it/s]

Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Running Epoch 0 of 4:   0%|          | 0/1750 [00:00<?, ?it/s]

Running Epoch 1 of 4:   0%|          | 0/1750 [00:00<?, ?it/s]

Running Epoch 2 of 4:   0%|          | 0/1750 [00:00<?, ?it/s]

Running Epoch 3 of 4:   0%|          | 0/1750 [00:00<?, ?it/s]

(7000, 2.1376826686901707)
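
The progress bars are consistent with the settings above: 7000 training rows at a batch size of 4 give 1750 steps per epoch, and 4 epochs give 7000 steps in total. This matches the first element of the returned tuple, assuming that element is the global step count, the second is the average training loss, and no gradient accumulation is in use. A quick check of the arithmetic:

```python
import math

n_train = 7000          # 70% of the 10000 rows kept for the experiment
train_batch_size = 4    # from model_args
num_train_epochs = 4    # from model_args

steps_per_epoch = math.ceil(n_train / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)
```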

In [17]:
res = model.eval_model(test)

  0%|          | 0/3000 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/375 [00:00<?, ?it/s]

In [18]:
res

{'eval_loss': 2.249910375277201}
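
If `eval_loss` is read as a mean per-token cross-entropy in nats (an assumption; the value is averaged over evaluation batches), exponentiating it gives a rough perplexity, which can be easier to interpret than the raw loss:

```python
import math

eval_loss = 2.249910375277201  # value reported by eval_model
perplexity = math.exp(eval_loss)

print(f"approximate perplexity: {perplexity:.2f}")
```

Under that assumption the model is, on average, about as uncertain as if it were choosing among roughly 9 to 10 equally likely tokens at each step of the title.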

Let's see how the model predicts titles:

In [26]:
random_num = 300
actual_title = test.iloc[random_num]['target_text']
actual_abstract = ["summarize: "+test.iloc[random_num]['input_text']]
predicted_title = model.predict(actual_abstract)

print(f'Actual Title: {actual_title}')
print()
print(f'Predicted Title: {predicted_title[0]}')
print()
print(f'Actual Abstract: {actual_abstract[0]}')

Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/1 [00:00<?, ?it/s]

Actual Title: Search for Anomalous Production of Multilepton Events in p-pbar
  Collisions at sqrt(s) = 1.96 TeV

Predicted Title: Search for anomalous production of events with multiple charged leptons in p-pbar

Actual Abstract: summarize:   We report a search for the anomalous production of events with multiple
charged leptons in p-pbar collisions at sqrt{s} = 1.96 TeV using a data sample
corresponding to an integrated luminosity of 346 pb^{-1} collected by the CDF
II detector at the Fermilab Tevatron. The search is divided into three-lepton
and four-or-more-lepton data samples. We observe six events in the three-lepton
sample and zero events in the >=4-lepton sample. Both numbers of events are
consistent with standard model background expectations. Within the framework of
an R-parity violating supergravity model, the results are interpreted as mass
limits on the lightest neutralino and chargino particles. For one particular
choice of model parameters, the limits are M_neutralino > 

In [25]:
random_num = 980
actual_title = test.iloc[random_num]['target_text']
actual_abstract = ["summarize: "+test.iloc[random_num]['input_text']]
predicted_title = model.predict(actual_abstract)

print(f'Actual Title: {actual_title}')
print()
print(f'Predicted Title: {predicted_title[0]}')
print()
print(f'Actual Abstract: {actual_abstract[0]}')

Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]

Decoding outputs:   0%|          | 0/1 [00:00<?, ?it/s]

Actual Title: Stability transitions for axisymmetric relative equilibria of Euclidean
  symmetric Hamiltonian systems

Predicted Title: Stability of relativ equilibria under momentum-preserving perturbations

Actual Abstract: summarize:   In the presence of noncompact symmetry, the stability of relative equilibria
under momentum-preserving perturbations does not generally imply robust
stability under momentum-changing perturbations. For axisymmetric relative
equilibria of Hamiltonian systems with Euclidean symmetry, we investigate
different mechanisms of stability: stability by energy-momentum confinement,
KAM, and Nekhoroshev stability, and we explain the transitions between these.
We apply our results to the Kirchhoff model for the motion of an axisymmetric
underwater vehicle, and we numerically study dissipation induced instability of
KAM stable relative equilibria for this system.
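
Inspecting individual predictions does not scale to all 3000 test rows. One simple option is a unigram-overlap F1 between predicted and actual titles, similar in spirit to ROUGE-1. The sketch below is an illustrative metric with a hypothetical helper name, `unigram_f1`; it is not something the notebook computes.

```python
from collections import Counter

def unigram_f1(predicted, actual):
    """Token-overlap F1 on lowercased, whitespace-split tokens."""
    pred_tokens = predicted.lower().split()
    true_tokens = actual.lower().split()
    if not pred_tokens or not true_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

# Titles copied from the first prediction example.
actual_title = ("Search for Anomalous Production of Multilepton Events in p-pbar "
                "Collisions at sqrt(s) = 1.96 TeV")
predicted_title = ("Search for anomalous production of events with multiple "
                   "charged leptons in p-pbar")
score = unigram_f1(predicted_title, actual_title)
```

Averaging such a score over `model.predict` outputs for the whole test set would give a single number to compare across runs, alongside `eval_loss`.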

