⚠️ This notebook is meant to be run on a GPU ⚠️

In [1]:
# Widen the notebook container so wide DataFrames are easier to read.
from IPython.display import HTML, display
display(HTML("<style>.container { width:95% !important; }</style>"))

In [8]:
!pip install -q torch



In [2]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from transformers import AutoTokenizer, TFAutoModel

In [3]:
articles = pd.read_pickle('../data/to_index_p4.pkl')
articles.dropna(axis=0, inplace=True)
articles.reset_index(inplace=True)
articles

Unnamed: 0,index,id,title,abstract,text
0,0,P0,Abduction,"In the philosophical literature, the term “abd...",\n1. Abduction: The General Idea\n\nYou happen...
1,1,P1,Affirmative Action,“Affirmative action” means positive steps take...,"\n1. In the Beginning\n\n\nIn 1972, affirmativ..."
2,2,P2,Aesthetics of the Everyday,"In the history of Western aesthetics, the subj...",\n1. Recent History\n\nWith the establishment ...
3,3,P3,Wittgenstein’s Aesthetics,Given the extreme importance that Wittgenstein...,\n1. The Critique of Traditional Aesthetics\n\...
4,4,P4,Schopenhauer’s Aesthetics,The focus of this entry is on Schopenhauer’s a...,"\n1. Brief Background\n\n\nBy the 1870s, Arthu..."
...,...,...,...,...,...
7216,6084,W6084,Stanisław Krajewski,Stanisław Krajewski (born 1950) is a Polish ph...,Stanisław Krajewski (born 1950) is a Polish ph...
7217,6085,W6085,Patrick Stokes (philosopher),Patrick Stokes (born 1978) is an Australian ph...,Patrick Stokes (born 1978) is an Australian ph...
7218,6086,W6086,Ernst Mach,Ernst Waldfried Josef Wenzel Mach (; German: [...,Ernst Waldfried Josef Wenzel Mach (; German: [...
7219,6087,W6087,Jessica Pierce,"Jessica Pierce (born October 21, 1965) is an A...","Jessica Pierce (born October 21, 1965) is an A..."


# Universal Sentence Encoder

In [17]:
universal_sentence_encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

### Embedding the abstract of every article

In [26]:
%%time
# Feed the abstracts to the encoder in batches of 512.
dataset = tf.data.Dataset.from_tensor_slices(articles['abstract'].values)
dataset = dataset.batch(512)
embeddings = []
for batch in dataset:
    embedding = universal_sentence_encoder(batch)
    embeddings.append(embedding.numpy())
# Stack the per-batch arrays into one (n_articles, 512) matrix.
abstract_embeddings = np.vstack(embeddings)

CPU times: user 12.2 s, sys: 5.72 s, total: 17.9 s
Wall time: 5.99 s


In [27]:
abstract_embeddings.shape

(7221, 512)

In [28]:
%%time
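# Same procedure as for the abstracts, applied to the full article text.
dataset = tf.data.Dataset.from_tensor_slices(articles['text'].values)
dataset = dataset.batch(512)
embeddings = []
for batch in dataset:
    embedding = universal_sentence_encoder(batch)
    embeddings.append(embedding.numpy())
# Stack the per-batch arrays into one (n_articles, 512) matrix.
article_embeddings = np.vstack(embeddings)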

CPU times: user 3min 37s, sys: 1min 18s, total: 4min 56s
Wall time: 2min 16s


In [29]:
article_embeddings.shape

(7221, 512)
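
The abstract and full-text cells above repeat the same batch, encode, and stack loop. A minimal sketch of how that pattern could be factored into one helper follows. The name `embed_in_batches` and the stand-in `fake_encoder` are hypothetical and not part of the original notebook; the sketch uses plain Python slicing instead of `tf.data` so that it runs without TensorFlow.

```python
import numpy as np

def embed_in_batches(texts, encoder, batch_size=512):
    """Encode `texts` in fixed-size batches and stack the results row-wise."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # `encoder` is any callable that maps a list of strings to a 2-D array.
        chunks.append(np.asarray(encoder(batch)))
    return np.vstack(chunks)

# Hypothetical stand-in encoder: two toy features per string
# (character count and number of spaces). It only illustrates the shapes.
def fake_encoder(batch):
    return np.array([[len(t), t.count(" ")] for t in batch], dtype=np.float32)

demo = embed_in_batches(["a b", "c", "d e f"], fake_encoder, batch_size=2)
print(demo.shape)
```

In the notebook, the encoder argument would presumably be something like `lambda b: universal_sentence_encoder(b).numpy()` and the texts `articles['abstract'].tolist()`; that wiring is an assumption and has not been run here.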

In [30]:
with open('../data/use_abstract_embeddings.npy', 'wb') as f:
    np.save(f, abstract_embeddings)
with open('../data/use_article_embeddings.npy', 'wb') as f:
    np.save(f, article_embeddings)
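
Before relying on the saved files, it can be worth checking that an array survives the save and load round trip unchanged. The sketch below does this with a small random array in a temporary directory. The file name `demo_embeddings.npy` and the array contents are placeholders, not the notebook's real data.

```python
import os
import tempfile
import numpy as np

# Placeholder array with the same width as the USE embeddings (512).
embeddings = np.random.default_rng(0).normal(size=(4, 512)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_embeddings.npy")
    np.save(path, embeddings)   # np.save also accepts a path directly
    restored = np.load(path)    # loaded fully into memory before the directory is removed

round_trip_ok = np.array_equal(embeddings, restored)
print(restored.shape, restored.dtype, round_trip_ok)
```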

# SPECTER
Document-level Representation Learning using Citation-informed Transformers

https://arxiv.org/pdf/2004.07180.pdf

In [9]:
tokenizer = AutoTokenizer.from_pretrained('allenai/specter')
model = TFAutoModel.from_pretrained('allenai/specter', from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['embeddings.position_ids']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [10]:
title_abstract =  (articles['title'] + ' ' + articles['abstract']).tolist()
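
The cell above joins title and abstract with a single space. The usage example published alongside the `allenai/specter` checkpoint instead joins them with the tokenizer's separator token, which may be worth trying as a variant. The sketch below shows that concatenation. `SEP_TOKEN` is an assumption standing in for `tokenizer.sep_token`, the helper name `join_title_abstract` is hypothetical, and the sample strings are placeholders.

```python
# Assumption: "[SEP]" stands in for tokenizer.sep_token of a BERT-style tokenizer.
SEP_TOKEN = "[SEP]"

def join_title_abstract(title, abstract, sep=SEP_TOKEN):
    """Concatenate a title and an abstract around a separator token."""
    return title + sep + abstract

# Placeholder strings, not taken from the dataset.
demo = join_title_abstract("Some title", "Some abstract text.")
print(demo)
```

In the notebook this would amount to replacing the `' '` in the cell above with `tokenizer.sep_token`, which requires loading the tokenizer first. Whether it changes retrieval quality on this corpus is untested.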

In [11]:
inputs = tokenizer(title_abstract, padding=True, truncation=True, return_tensors="tf", max_length=512)

In [14]:
dataset = tf.data.Dataset.from_tensor_slices(inputs).batch(32)

In [15]:
%%time
spectre_embeddings = []
for batch in dataset:
    result = model(**batch)
    # Use the hidden state of the first token ([CLS]) as the document embedding.
    embeddings = result.last_hidden_state[:, 0, :].numpy()
    spectre_embeddings.append(embeddings)

CPU times: user 1min 16s, sys: 29.1 s, total: 1min 46s
Wall time: 1min 38s
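
The slice `last_hidden_state[:, 0, :]` in the loop above keeps one vector per document: the hidden state at token position 0, which for a BERT-style tokenizer is the `[CLS]` token. The sketch below illustrates the shapes involved with a random NumPy array in place of real model output. The batch size and sequence length are arbitrary placeholders; only the hidden size of 768 matches the shape printed in this notebook.

```python
import numpy as np

batch_size, seq_len, hidden_size = 2, 5, 768
rng = np.random.default_rng(0)

# Stand-in for result.last_hidden_state: one vector per token per document.
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# Keep every document, only token position 0, and every hidden dimension.
cls_embeddings = last_hidden_state[:, 0, :]
print(cls_embeddings.shape)
```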


In [18]:
spectre_embeddings = np.vstack(spectre_embeddings)

In [19]:
spectre_embeddings.shape

(7221, 768)

In [20]:
with open('../data/spectre_embeddings.npy', 'wb') as f:
    np.save(f, spectre_embeddings)