## Model inference

In this notebook, we will load a previously trained model, explore the learned topics, and predict topics for a paper on arXiv.

In [38]:
# imports
import sys
sys.path.insert(0, "../")
from utils import scrape_arxiv_abstract
from model import TopicModel
from dataset import ArXivDataset
from gensim.models import LdaModel
from pprint import pprint

### Build topic model

To build a `TopicModel` object, we pass in two paths: the path to the trained model (used to predict topics for new instances) and the path to the dataset that was used to create the model (used to process new instances in the same way as the training data).

In [39]:
# create topic model
model_path = "../models/lda_n15_p5_r929_c38.1"
dataset_path = "./processed_arxiv_data.obj"
model = TopicModel(model_path, dataset_path)

### Investigate topics

Next, let us explore the different topics learned by the model so that we can assign understandable topic names to each cluster.

In [40]:
# print topics
pprint(model.topics)

[(0,
  '0.020*"model" + 0.012*"_" + 0.008*"function" + 0.008*"theory" + '
  '0.007*"case" + 0.007*"line" + 0.006*"new" + 0.006*"result" + 0.005*"study" '
  '+ 0.005*"mass"'),
 (1,
  '0.013*"network" + 0.011*"state" + 0.009*"system" + 0.007*"model" + '
  '0.007*"effect" + 0.007*"quantum" + 0.006*"time" + 0.006*"qubit" + '
  '0.006*"result" + 0.006*"entanglement"'),
 (2,
  '0.012*"planet" + 0.009*"system" + 0.008*"state" + 0.008*"star" + '
  '0.007*"low" + 0.007*"mass" + 0.006*"large" + 0.006*"effect" + '
  '0.005*"galaxy" + 0.005*"atom"'),
 (3,
  '0.014*"system" + 0.010*"function" + 0.008*"orbit" + 0.007*"result" + '
  '0.006*"model" + 0.006*"energy" + 0.005*"solution" + 0.005*"order" + '
  '0.005*"method" + 0.005*"binary"'),
 (4,
  '0.014*"quantum" + 0.013*"state" + 0.009*"field" + 0.007*"system" + '
  '0.006*"problem" + 0.005*"source" + 0.005*"theory" + 0.005*"equation" + '
  '0.005*"non" + 0.005*"time"'),
 (5,
  '0.026*"theory" + 0.018*"field" + 0.010*"gauge" + 0.008*"model" + '
  '0

We can see that some clusters seem to refer to specific topics in machine learning. One of them is topic 7, which seems to relate directly to sequential and time-series data. Another example is topic 10, which seems to be related to reinforcement learning.
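
When scanning many topics, it can help to reduce each topic string to its top few words. The snippet below is a minimal sketch that parses the `weight*"word"` format shown in the output above. The helper names `parse_topic_terms` and `top_words` are hypothetical and are not part of this project's `TopicModel` class.

```python
import re

# Matches one `weight*"word"` term in the printed topic string format.
TERM_PATTERN = re.compile(r'(\d*\.?\d+)\*"([^"]*)"')


def parse_topic_terms(topic_string):
    """Return (word, weight) pairs in the order they appear in the string."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


def top_words(topics, n=5):
    """Map each topic id to its first n words (hypothetical helper)."""
    return {
        topic_id: [word for word, _ in parse_topic_terms(terms)[:n]]
        for topic_id, terms in topics
    }


# Two shortened topic strings copied from the output above.
example_topics = [
    (1, '0.013*"network" + 0.011*"state" + 0.009*"system" + 0.007*"model"'),
    (2, '0.012*"planet" + 0.009*"system" + 0.008*"state" + 0.008*"star"'),
]
print(top_words(example_topics, n=3))
```

Because the terms are printed in descending weight order, taking the first `n` parsed terms gives the `n` highest-weighted words.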

To make it easier to refer to these topic clusters, we will assign (tentative) names to each of them. Note that these names are subject to interpretation and are only assigned to help "summarize" each cluster.
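
Conceptually, naming a topic amounts to swapping its integer id for a label. The following is a minimal sketch of that idea, assuming the topics are a list of `(topic_id, terms)` tuples and that the position of each name in the list matches the topic id. The function `apply_topic_names` is hypothetical and is not necessarily how `set_topic_names()` is implemented.

```python
def apply_topic_names(topics, names):
    """Replace integer topic ids with names, matching each id to a list index."""
    renamed = []
    for topic_id, terms in topics:
        if not 0 <= topic_id < len(names):
            raise ValueError(f"no name provided for topic {topic_id}")
        renamed.append((names[topic_id], terms))
    return renamed


# Illustrative input: two shortened topics and two placeholder names.
example_topics = [
    (0, '0.020*"model" + 0.012*"_"'),
    (1, '0.013*"network" + 0.011*"state"'),
]
example_names = ["First topic", "Second topic"]
print(apply_topic_names(example_topics, example_names))
```

Raising an error for a missing name makes a mismatch between the number of names and the number of topics visible immediately, rather than leaving some topics silently unnamed.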

In [41]:
# set topic names
topic_names = [
    "Natural language processing",
    "Probability + Inference",
    "ML-related terms?",
    "Computer vision",
    "Recommendation",
    "Algorithms + Optimization",
    "Deep learning",
    "Sequences + Time series",
    "ML-related terms?",
    "Estimation + Linear algebra?",
    "Reinforcement learning",
    "Paper-related terms?",
    "Artificial intelligence",
    "Data mining",
    "Robotics"
]

model.set_topic_names(topic_names)
pprint(model.topics)

[('Natural language processing',
  '0.020*"model" + 0.012*"_" + 0.008*"function" + 0.008*"theory" + '
  '0.007*"case" + 0.007*"line" + 0.006*"new" + 0.006*"result" + 0.005*"study" '
  '+ 0.005*"mass"'),
 ('Probability + Inference',
  '0.013*"network" + 0.011*"state" + 0.009*"system" + 0.007*"model" + '
  '0.007*"effect" + 0.007*"quantum" + 0.006*"time" + 0.006*"qubit" + '
  '0.006*"result" + 0.006*"entanglement"'),
 ('ML-related terms?',
  '0.012*"planet" + 0.009*"system" + 0.008*"state" + 0.008*"star" + '
  '0.007*"low" + 0.007*"mass" + 0.006*"large" + 0.006*"effect" + '
  '0.005*"galaxy" + 0.005*"atom"'),
 ('Computer vision',
  '0.014*"system" + 0.010*"function" + 0.008*"orbit" + 0.007*"result" + '
  '0.006*"model" + 0.006*"energy" + 0.005*"solution" + 0.005*"order" + '
  '0.005*"method" + 0.005*"binary"'),
 ('Recommendation',
  '0.014*"quantum" + 0.013*"state" + 0.009*"field" + 0.007*"system" + '
  '0.006*"problem" + 0.005*"source" + 0.005*"theory" + 0.005*"equation" + '
  '0.005*"n

### Predict topics for a paper

Let us see what topics our model predicts for a paper taken directly from arXiv. Using the `scrape_arxiv_abstract()` function, we can extract the title and the abstract of any paper on arXiv given its URL. Once scraped, the title and abstract can be passed to our topic model's `predict()` method.

To illustrate, let us scrape the title and abstract from the seminal paper ["Attention Is All You Need" (2017)](https://arxiv.org/abs/1706.03762) and see what topics the model detects.
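
The implementation of `scrape_arxiv_abstract()` is not shown in this notebook. As a rough illustration of how such a scraper might extract the text, the sketch below parses a title and an abstract from `<meta>` tags in a page's HTML using only the standard library. The tag names `citation_title` and `citation_abstract` are an assumption about the page markup, and the class and function names are hypothetical. The example runs on a small inline HTML string rather than a live page.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect the content of selected <meta name="..." content="..."> tags."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        # Keep only the first occurrence of each wanted tag.
        if name in self.wanted and name not in self.found:
            self.found[name] = (attributes.get("content") or "").strip()


def extract_title_and_abstract(html):
    """Return the title and abstract joined by a blank line (hypothetical helper)."""
    # Assumed tag names; adjust them to the actual page markup.
    parser = MetaTagParser(["citation_title", "citation_abstract"])
    parser.feed(html)
    title = parser.found.get("citation_title", "")
    abstract = parser.found.get("citation_abstract", "")
    return f"{title}\n\n{abstract}"


# Placeholder HTML standing in for a downloaded page.
sample_html = """
<html><head>
<meta name="citation_title" content="Example Title" />
<meta name="citation_abstract" content="Example abstract text." />
</head><body></body></html>
"""
print(extract_title_and_abstract(sample_html))
```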

In [42]:
# paper: "Attention Is All You Need" (Vaswani et al., 2017)
paper_url = "https://arxiv.org/abs/1706.03762"
text = scrape_arxiv_abstract(paper_url)
print(text)

Attention Is All You Need

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. 

In [43]:
# get predictions
model.predict(text)[:3]

[('Reinforcement learning', 0.67122495),
 ('Artificial intelligence', 0.11870766),
 ('Natural language processing', 0.114291765)]
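
The output above lists topics in descending order of probability, so slicing with `[:3]` keeps the three most likely ones. The sketch below shows how such a ranking could be produced from a raw list of `(topic_id, probability)` pairs. The function `top_k_topics`, the placeholder names, and the numbers are all illustrative assumptions and do not come from the model above.

```python
def top_k_topics(distribution, names, k=3):
    """Return the k most probable topics as (name, probability) pairs."""
    # Sort by probability, highest first, then keep the first k entries.
    ranked = sorted(distribution, key=lambda pair: pair[1], reverse=True)
    return [(names[topic_id], probability) for topic_id, probability in ranked[:k]]


# Illustrative values only.
example_names = ["Topic A", "Topic B", "Topic C", "Topic D"]
example_distribution = [(0, 0.10), (1, 0.60), (2, 0.05), (3, 0.25)]
print(top_k_topics(example_distribution, example_names, k=2))
```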