# Project: Semantic Search with Transformers

In this project, you’ll use the sentence_transformers library to perform semantic search on a corpus of machine learning papers. The 
sentence_transformers library enables us to easily generate embeddings for any text using Transformer-based models. Semantic similarity can then be 
modeled as the distance between two embeddings.

To complete this project, you’ll perform the following tasks:

1. Generate embeddings for each paper summary.
                      
2. Create an index for efficient search using Facebook’s Faiss library.
    
3. Test the search engine using custom prompts and summaries from the dataset.



## Task 1: Import the Libraries

Let’s start this project by importing the modules required for completing this project.

To complete this task, import the following libraries:

1. pandas: This module is used for loading and displaying the dataset.
                                                          
2. torch: This module is used to create and manipulate document embedding matrices.
                                                          
3. SentenceTransformer: This is a method in the sentence_transformers library and is used for retrieving the Transformer-based model.
                                                          
4. preprocessing: This is a submodule of the sklearn package and will be used to preprocess the data.

5. faiss: This library is used to create, store, and use the search index.

6. numpy: This package provides numerical computing capabilities and is useful for various calculations.

7. pickle: This module is used to load and store the model and embeddings.



In [2]:
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer
from sklearn import preprocessing
import faiss
import numpy as np
import pickle

  from .autonotebook import tqdm as notebook_tqdm


## Task 2: Load the Data

First, let’s load the dataset as a pandas DataFrame. The dataset is made available as a file named arxivData.json in the ./usercode directory, 
and consists of the metadata of 41,000+ research papers.

Perform the following steps to complete this task:

1. Load the dataset to a pandas DataFrame.
    
2. Drop the author, link, and tag columns of the dataset.
    
3. Display the dataset header using the head() method.
    
4. Print the number of machine learning papers in the dataset.
    

In [3]:
pd.set_option('display.max_colwidth', None)
data = pd.read_json('/usercode/arxivData.json')
df = data.drop(columns=["author", "link", 'tag'])
print("Number of Machine Learning papers: ", df.id.unique().shape[0])
df.head()

Number of Machine Learning papers:  41000


Unnamed: 0,day,id,month,summary,title,year
0,1,1802.00209v1,2,"We propose an architecture for VQA which utilizes recurrent layers to\ngenerate visual and textual attention. The memory characteristic of the\nproposed recurrent attention units offers a rich joint embedding of visual and\ntextual features and enables the model to reason relations between several\nparts of the image and question. Our single model outperforms the first place\nwinner on the VQA 1.0 dataset, performs within margin to the current\nstate-of-the-art ensemble model. We also experiment with replacing attention\nmechanisms in other state-of-the-art models with our implementation and show\nincreased accuracy. In both cases, our recurrent attention mechanism improves\nperformance in tasks requiring sequential or relational reasoning on the VQA\ndataset.",Dual Recurrent Attention Units for Visual Question Answering,2018
1,12,1603.03827v1,3,"Recent approaches based on artificial neural networks (ANNs) have shown\npromising results for short-text classification. However, many short texts\noccur in sequences (e.g., sentences in a document or utterances in a dialog),\nand most existing ANN-based systems do not leverage the preceding short texts\nwhen classifying a subsequent one. In this work, we present a model based on\nrecurrent neural networks and convolutional neural networks that incorporates\nthe preceding short texts. Our model achieves state-of-the-art results on three\ndifferent datasets for dialog act prediction.",Sequential Short-Text Classification with Recurrent and Convolutional\n Neural Networks,2016
2,2,1606.00776v2,6,"We introduce the multiresolution recurrent neural network, which extends the\nsequence-to-sequence framework to model natural language generation as two\nparallel discrete stochastic processes: a sequence of high-level coarse tokens,\nand a sequence of natural language tokens. There are many ways to estimate or\nlearn the high-level coarse tokens, but we argue that a simple extraction\nprocedure is sufficient to capture a wealth of high-level discourse semantics.\nSuch procedure allows training the multiresolution recurrent neural network by\nmaximizing the exact joint log-likelihood over both sequences. In contrast to\nthe standard log- likelihood objective w.r.t. natural language tokens (word\nperplexity), optimizing the joint log-likelihood biases the model towards\nmodeling high-level abstractions. We apply the proposed model to the task of\ndialogue response generation in two challenging domains: the Ubuntu technical\nsupport domain, and Twitter conversations. On Ubuntu, the model outperforms\ncompeting approaches by a substantial margin, achieving state-of-the-art\nresults according to both automatic evaluation metrics and a human evaluation\nstudy. On Twitter, the model appears to generate more relevant and on-topic\nresponses according to automatic evaluation metrics. Finally, our experiments\ndemonstrate that the proposed model is more adept at overcoming the sparsity of\nnatural language and is better able to capture long-term structure.",Multiresolution Recurrent Neural Networks: An Application to Dialogue\n Response Generation,2016
3,23,1705.08142v2,5,"Multi-task learning is motivated by the observation that humans bring to bear\nwhat they know about related problems when solving new ones. Similarly, deep\nneural networks can profit from related tasks by sharing parameters with other\nnetworks. However, humans do not consciously decide to transfer knowledge\nbetween tasks. In Natural Language Processing (NLP), it is hard to predict if\nsharing will lead to improvements, particularly if tasks are only loosely\nrelated. To overcome this, we introduce Sluice Networks, a general framework\nfor multi-task learning where trainable parameters control the amount of\nsharing. Our framework generalizes previous proposals in enabling sharing of\nall combinations of subspaces, layers, and skip connections. We perform\nexperiments on three task pairs, and across seven different domains, using data\nfrom OntoNotes 5.0, and achieve up to 15% average error reductions over common\napproaches to multi-task learning. We show that a) label entropy is predictive\nof gains in sluice networks, confirming findings for hard parameter sharing and\nb) while sluice networks easily fit noise, they are robust across domains in\npractice.",Learning what to share between loosely related tasks,2017
4,7,1709.02349v2,9,"We present MILABOT: a deep reinforcement learning chatbot developed by the\nMontreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prize\ncompetition. MILABOT is capable of conversing with humans on popular small talk\ntopics through both speech and text. The system consists of an ensemble of\nnatural language generation and retrieval models, including template-based\nmodels, bag-of-words models, sequence-to-sequence neural network and latent\nvariable neural network models. By applying reinforcement learning to\ncrowdsourced data and real-world user interactions, the system has been trained\nto select an appropriate response from the models in its ensemble. The system\nhas been evaluated through A/B testing with real-world users, where it\nperformed significantly better than many competing systems. Due to its machine\nlearning architecture, the system is likely to improve with additional data.",A Deep Reinforcement Learning Chatbot,2017


In [4]:
data

Unnamed: 0,author,day,id,link,month,summary,tag,title,year
0,"[{'name': 'Ahmed Osman'}, {'name': 'Wojciech Samek'}]",1,1802.00209v1,"[{'rel': 'alternate', 'href': 'http://arxiv.org/abs/1802.00209v1', 'type': 'text/html'}, {'rel': 'related', 'href': 'http://arxiv.org/pdf/1802.00209v1', 'type': 'application/pdf', 'title': 'pdf'}]",2,"We propose an architecture for VQA which utilizes recurrent layers to\ngenerate visual and textual attention. The memory characteristic of the\nproposed recurrent attention units offers a rich joint embedding of visual and\ntextual features and enables the model to reason relations between several\nparts of the image and question. Our single model outperforms the first place\nwinner on the VQA 1.0 dataset, performs within margin to the current\nstate-of-the-art ensemble model. We also experiment with replacing attention\nmechanisms in other state-of-the-art models with our implementation and show\nincreased accuracy. In both cases, our recurrent attention mechanism improves\nperformance in tasks requiring sequential or relational reasoning on the VQA\ndataset.","[{'term': 'cs.AI', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.CL', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.CV', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.NE', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'stat.ML', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}]",Dual Recurrent Attention Units for Visual Question Answering,2018
1,"[{'name': 'Ji Young Lee'}, {'name': 'Franck Dernoncourt'}]",12,1603.03827v1,"[{'rel': 'alternate', 'href': 'http://arxiv.org/abs/1603.03827v1', 'type': 'text/html'}, {'rel': 'related', 'href': 'http://arxiv.org/pdf/1603.03827v1', 'type': 'application/pdf', 'title': 'pdf'}]",3,"Recent approaches based on artificial neural networks (ANNs) have shown\npromising results for short-text classification. However, many short texts\noccur in sequences (e.g., sentences in a document or utterances in a dialog),\nand most existing ANN-based systems do not leverage the preceding short texts\nwhen classifying a subsequent one. In this work, we present a model based on\nrecurrent neural networks and convolutional neural networks that incorporates\nthe preceding short texts. Our model achieves state-of-the-art results on three\ndifferent datasets for dialog act prediction.","[{'term': 'cs.CL', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.AI', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.LG', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.NE', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'stat.ML', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}]",Sequential Short-Text Classification with Recurrent and Convolutional\n Neural Networks,2016
2,"[{'name': 'Iulian Vlad Serban'}, {'name': 'Tim Klinger'}, {'name': 'Gerald Tesauro'}, {'name': 'Kartik Talamadupula'}, {'name': 'Bowen Zhou'}, {'name': 'Yoshua Bengio'}, {'name': 'Aaron Courville'}]",2,1606.00776v2,"[{'rel': 'alternate', 'href': 'http://arxiv.org/abs/1606.00776v2', 'type': 'text/html'}, {'rel': 'related', 'href': 'http://arxiv.org/pdf/1606.00776v2', 'type': 'application/pdf', 'title': 'pdf'}]",6,"We introduce the multiresolution recurrent neural network, which extends the\nsequence-to-sequence framework to model natural language generation as two\nparallel discrete stochastic processes: a sequence of high-level coarse tokens,\nand a sequence of natural language tokens. There are many ways to estimate or\nlearn the high-level coarse tokens, but we argue that a simple extraction\nprocedure is sufficient to capture a wealth of high-level discourse semantics.\nSuch procedure allows training the multiresolution recurrent neural network by\nmaximizing the exact joint log-likelihood over both sequences. In contrast to\nthe standard log- likelihood objective w.r.t. natural language tokens (word\nperplexity), optimizing the joint log-likelihood biases the model towards\nmodeling high-level abstractions. We apply the proposed model to the task of\ndialogue response generation in two challenging domains: the Ubuntu technical\nsupport domain, and Twitter conversations. On Ubuntu, the model outperforms\ncompeting approaches by a substantial margin, achieving state-of-the-art\nresults according to both automatic evaluation metrics and a human evaluation\nstudy. On Twitter, the model appears to generate more relevant and on-topic\nresponses according to automatic evaluation metrics. Finally, our experiments\ndemonstrate that the proposed model is more adept at overcoming the sparsity of\nnatural language and is better able to capture long-term structure.","[{'term': 'cs.CL', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.AI', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.LG', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.NE', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'stat.ML', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'I.5.1; I.2.7', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}]",Multiresolution Recurrent Neural Networks: An Application to Dialogue\n Response Generation,2016
3,"[{'name': 'Sebastian Ruder'}, {'name': 'Joachim Bingel'}, {'name': 'Isabelle Augenstein'}, {'name': 'Anders Søgaard'}]",23,1705.08142v2,"[{'rel': 'alternate', 'href': 'http://arxiv.org/abs/1705.08142v2', 'type': 'text/html'}, {'rel': 'related', 'href': 'http://arxiv.org/pdf/1705.08142v2', 'type': 'application/pdf', 'title': 'pdf'}]",5,"Multi-task learning is motivated by the observation that humans bring to bear\nwhat they know about related problems when solving new ones. Similarly, deep\nneural networks can profit from related tasks by sharing parameters with other\nnetworks. However, humans do not consciously decide to transfer knowledge\nbetween tasks. In Natural Language Processing (NLP), it is hard to predict if\nsharing will lead to improvements, particularly if tasks are only loosely\nrelated. To overcome this, we introduce Sluice Networks, a general framework\nfor multi-task learning where trainable parameters control the amount of\nsharing. Our framework generalizes previous proposals in enabling sharing of\nall combinations of subspaces, layers, and skip connections. We perform\nexperiments on three task pairs, and across seven different domains, using data\nfrom OntoNotes 5.0, and achieve up to 15% average error reductions over common\napproaches to multi-task learning. We show that a) label entropy is predictive\nof gains in sluice networks, confirming findings for hard parameter sharing and\nb) while sluice networks easily fit noise, they are robust across domains in\npractice.","[{'term': 'stat.ML', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.AI', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.CL', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.LG', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.NE', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}]",Learning what to share between loosely related tasks,2017
4,"[{'name': 'Iulian V. Serban'}, {'name': 'Chinnadhurai Sankar'}, {'name': 'Mathieu Germain'}, {'name': 'Saizheng Zhang'}, {'name': 'Zhouhan Lin'}, {'name': 'Sandeep Subramanian'}, {'name': 'Taesup Kim'}, {'name': 'Michael Pieper'}, {'name': 'Sarath Chandar'}, {'name': 'Nan Rosemary Ke'}, {'name': 'Sai Rajeshwar'}, {'name': 'Alexandre de Brebisson'}, {'name': 'Jose M. R. Sotelo'}, {'name': 'Dendi Suhubdy'}, {'name': 'Vincent Michalski'}, {'name': 'Alexandre Nguyen'}, {'name': 'Joelle Pineau'}, {'name': 'Yoshua Bengio'}]",7,1709.02349v2,"[{'rel': 'alternate', 'href': 'http://arxiv.org/abs/1709.02349v2', 'type': 'text/html'}, {'rel': 'related', 'href': 'http://arxiv.org/pdf/1709.02349v2', 'type': 'application/pdf', 'title': 'pdf'}]",9,"We present MILABOT: a deep reinforcement learning chatbot developed by the\nMontreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prize\ncompetition. MILABOT is capable of conversing with humans on popular small talk\ntopics through both speech and text. The system consists of an ensemble of\nnatural language generation and retrieval models, including template-based\nmodels, bag-of-words models, sequence-to-sequence neural network and latent\nvariable neural network models. By applying reinforcement learning to\ncrowdsourced data and real-world user interactions, the system has been trained\nto select an appropriate response from the models in its ensemble. The system\nhas been evaluated through A/B testing with real-world users, where it\nperformed significantly better than many competing systems. Due to its machine\nlearning architecture, the system is likely to improve with additional data.","[{'term': 'cs.CL', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.AI', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.LG', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.NE', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'stat.ML', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'I.5.1; I.2.7', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}]",A Deep Reinforcement Learning Chatbot,2017
...,...,...,...,...,...,...,...,...,...
40995,"[{'name': 'Vitaly Feldman'}, {'name': 'Pravesh Kothari'}, {'name': 'Jan Vondrák'}]",18,1404.4702v2,"[{'rel': 'alternate', 'href': 'http://arxiv.org/abs/1404.4702v2', 'type': 'text/html'}, {'rel': 'related', 'href': 'http://arxiv.org/pdf/1404.4702v2', 'type': 'application/pdf', 'title': 'pdf'}]",4,"We study the complexity of learning and approximation of self-bounding\nfunctions over the uniform distribution on the Boolean hypercube ${0,1}^n$.\nInformally, a function $f:{0,1}^n \rightarrow \mathbb{R}$ is self-bounding if\nfor every $x \in {0,1}^n$, $f(x)$ upper bounds the sum of all the $n$ marginal\ndecreases in the value of the function at $x$. Self-bounding functions include\nsuch well-known classes of functions as submodular and fractionally-subadditive\n(XOS) functions. They were introduced by Boucheron et al. in the context of\nconcentration of measure inequalities. Our main result is a nearly tight\n$\ell_1$-approximation of self-bounding functions by low-degree juntas.\nSpecifically, all self-bounding functions can be $\epsilon$-approximated in\n$\ell_1$ by a polynomial of degree $\tilde{O}(1/\epsilon)$ over\n$2^{\tilde{O}(1/\epsilon)}$ variables. We show that both the degree and\njunta-size are optimal up to logarithmic terms. Previous techniques considered\nstronger $\ell_2$ approximation and proved nearly tight bounds of\n$\Theta(1/\epsilon^{2})$ on the degree and $2^{\Theta(1/\epsilon^2)}$ on the\nnumber of variables. Our bounds rely on the analysis of noise stability of\nself-bounding functions together with a stronger connection between noise\nstability and $\ell_1$ approximation by low-degree polynomials. This technique\ncan also be used to get tighter bounds on $\ell_1$ approximation by low-degree\npolynomials and faster learning algorithm for halfspaces.\n These results lead to improved and in several cases almost tight bounds for\nPAC and agnostic learning of self-bounding functions relative to the uniform\ndistribution. In particular, assuming hardness of learning juntas, we show that\nPAC and agnostic learning of self-bounding functions have complexity of\n$n^{\tilde{\Theta}(1/\epsilon)}$.","[{'term': 'cs.LG', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.DS', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}]",Nearly Tight Bounds on $\ell_1$ Approximation of Self-Bounding Functions,2014
40996,"[{'name': 'Orly Avner'}, {'name': 'Shie Mannor'}]",22,1404.5421v1,"[{'rel': 'alternate', 'href': 'http://arxiv.org/abs/1404.5421v1', 'type': 'text/html'}, {'rel': 'related', 'href': 'http://arxiv.org/pdf/1404.5421v1', 'type': 'application/pdf', 'title': 'pdf'}]",4,"We consider the problem of multiple users targeting the arms of a single\nmulti-armed stochastic bandit. The motivation for this problem comes from\ncognitive radio networks, where selfish users need to coexist without any side\ncommunication between them, implicit cooperation or common control. Even the\nnumber of users may be unknown and can vary as users join or leave the network.\nWe propose an algorithm that combines an $\epsilon$-greedy learning rule with a\ncollision avoidance mechanism. We analyze its regret with respect to the\nsystem-wide optimum and show that sub-linear regret can be obtained in this\nsetting. Experiments show dramatic improvement compared to other algorithms for\nthis setting.","[{'term': 'cs.LG', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.MA', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}]",Concurrent bandits and cognitive radio networks,2014
40997,"[{'name': 'Ran Zhao'}, {'name': 'Deanna Needell'}, {'name': 'Christopher Johansen'}, {'name': 'Jerry L. Grenard'}]",22,1404.5899v1,"[{'rel': 'alternate', 'href': 'http://arxiv.org/abs/1404.5899v1', 'type': 'text/html'}, {'rel': 'related', 'href': 'http://arxiv.org/pdf/1404.5899v1', 'type': 'application/pdf', 'title': 'pdf'}]",4,"In this paper, we compare and analyze clustering methods with missing data in\nhealth behavior research. In particular, we propose and analyze the use of\ncompressive sensing's matrix completion along with spectral clustering to\ncluster health related data. The empirical tests and real data results show\nthat these methods can outperform standard methods like LPA and FIML, in terms\nof lower misclassification rates in clustering and better matrix completion\nperformance in missing data problems. According to our examination, a possible\nexplanation of these improvements is that spectral clustering takes advantage\nof high data dimension and compressive sensing methods utilize the\nnear-to-low-rank property of health data.","[{'term': 'math.NA', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.LG', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': '62H30, 91C20, 94A08', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}]",A Comparison of Clustering and Missing Data Methods for Health Sciences,2014
40998,"[{'name': 'Zongyan Huang'}, {'name': 'Matthew England'}, {'name': 'David Wilson'}, {'name': 'James H. Davenport'}, {'name': 'Lawrence C. Paulson'}, {'name': 'James Bridge'}]",25,1404.6369v1,"[{'rel': 'related', 'href': 'http://dx.doi.org/10.1007/978-3-319-08434-3_8', 'type': 'text/html', 'title': 'doi'}, {'rel': 'alternate', 'href': 'http://arxiv.org/abs/1404.6369v1', 'type': 'text/html'}, {'rel': 'related', 'href': 'http://arxiv.org/pdf/1404.6369v1', 'type': 'application/pdf', 'title': 'pdf'}]",4,"Cylindrical algebraic decomposition(CAD) is a key tool in computational\nalgebraic geometry, particularly for quantifier elimination over real-closed\nfields. When using CAD, there is often a choice for the ordering placed on the\nvariables. This can be important, with some problems infeasible with one\nvariable ordering but easy with another. Machine learning is the process of\nfitting a computer model to a complex function based on properties learned from\nmeasured data. In this paper we use machine learning (specifically a support\nvector machine) to select between heuristics for choosing a variable ordering,\noutperforming each of the separate heuristics.","[{'term': 'cs.SC', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'cs.LG', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': '68W30, 68T05, O3C10', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'term': 'I.2.6', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}]",Applying machine learning to the problem of choosing a heuristic to\n select the variable ordering for cylindrical algebraic decomposition,2014


## Task 3: Retrieve the Model

In this task, you’ll load the DistilBERT model using the sentence_transformers library. The sentence_transformers library allows us to use 
Transformer-based models from Huggingface, that are fine-tuned to generate semantically meaningful embedding matrices given natural language. 
The DistilBERT model is much smaller than the BERT model while having comparable performance and therefore are more suitable for our use case.

To complete this task, perform the following steps:

1. Use the SentenceTransformer() method from the sentence_transformers library to load a DistilBERT model. This method takes the name of the
Transformer-based model to load as a parameter.
                                                                                                                            
2. Move the model to the GPU if it is available. Print the device on which the model is located.
    

In [3]:
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

if torch.cuda.is_available():
    model = model.to(torch.device("cuda"))

print(model.device)

Downloading .gitattributes:   0%|          | 0.00/399 [00:00<?, ?B/s]Downloading .gitattributes: 100%|██████████| 399/399 [00:00<00:00, 34.9kB/s]
Downloading 1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]Downloading 1_Pooling/config.json: 100%|██████████| 190/190 [00:00<00:00, 18.4kB/s]
Downloading README.md:   0%|          | 0.00/4.05k [00:00<?, ?B/s]Downloading README.md: 100%|██████████| 4.05k/4.05k [00:00<00:00, 807kB/s]
Downloading config.json:   0%|          | 0.00/555 [00:00<?, ?B/s]Downloading config.json: 100%|██████████| 555/555 [00:00<00:00, 121kB/s]
Downloading (…)ce_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]Downloading (…)ce_transformers.json: 100%|██████████| 122/122 [00:00<00:00, 20.8kB/s]
Downloading model.safetensors:   0%|          | 0.00/265M [00:00<?, ?B/s]Downloading model.safetensors:   4%|▍         | 10.5M/265M [00:00<00:03, 71.6MB/s]Downloading model.safetensors:  12%|█▏        | 31.5M/265M [00:00<00:01, 123MB/s] 

cpu


## Task 4: Generate or Load the Embeddings

In this task, you’ll either create or load the embeddings for all the paper summaries in the dataset.

To complete this task, perform the following steps:

1. Do one of the following:

To create the embeddings, use the encode() method to generate embedding vectors for the paper summaries using the DistilBERT model. Remember to 
save the embeddings using pickle.dump().

To load the embeddings that have been provided, open the default_embeddings.pickle file and load it to an embeddings object using the pickle.load()
method.

To load the embeddings that were generated in an earlier session, open the new_embeddings.pickle file and load it to an embeddings object using the
pickle.load() method.
    
2. Create the variable length, which is the number of embeddings in the index.

3. Print the shape of any one embedding.

N.B : The code given in this task for generating embeddings will create embeddings for the first 2000 records of the dataset.


In [4]:
embeddings = model.encode(df.summary.to_list()[:2000], show_progress_bar=True)
with open('/usercode/new_embeddings.pickle', 'wb') as pkl:
  pickle.dump(embeddings, pkl)

Batches:   0%|          | 0/63 [00:00<?, ?it/s]Batches:   2%|▏         | 1/63 [00:06<06:38,  6.43s/it]Batches:   3%|▎         | 2/63 [00:12<06:16,  6.16s/it]Batches:   5%|▍         | 3/63 [00:18<06:11,  6.20s/it]Batches:   6%|▋         | 4/63 [00:27<07:17,  7.41s/it]Batches:   8%|▊         | 5/63 [00:34<06:42,  6.95s/it]Batches:  10%|▉         | 6/63 [00:40<06:17,  6.62s/it]Batches:  11%|█         | 7/63 [00:46<06:00,  6.43s/it]Batches:  13%|█▎        | 8/63 [00:54<06:26,  7.02s/it]Batches:  14%|█▍        | 9/63 [01:01<06:22,  7.09s/it]Batches:  16%|█▌        | 10/63 [01:07<05:58,  6.76s/it]Batches:  17%|█▋        | 11/63 [01:13<05:36,  6.48s/it]Batches:  19%|█▉        | 12/63 [01:19<05:27,  6.43s/it]Batches:  21%|██        | 13/63 [01:28<06:03,  7.27s/it]Batches:  22%|██▏       | 14/63 [01:35<05:40,  6.94s/it]Batches:  24%|██▍       | 15/63 [01:41<05:19,  6.65s/it]Batches:  25%|██▌       | 16/63 [01:47<05:03,  6.46s/it]Batches:  27%|██▋       | 17/63 [01:56<05:30,  7.

In [5]:
with open('/usercode/default_embeddings.pickle', 'rb') as pkl:
  embeddings = pickle.load(pkl)

In [6]:
with open('/usercode/new_embeddings.pickle', 'rb') as pkl:
  embeddings = pickle.load(pkl)

In [7]:
length = len(embeddings)
length


2000

In [9]:
print('Shape of the one embedding: ', embeddings[0].shape)
print(embeddings.shape)

Shape of the one embedding:  (768,)
(2000, 768)


## Task 5: Data Preparation and Helper Methods

In this task, you’ll prepare the dataset by encoding the paper IDs as integers. You’ll then write a helper method for returning a list of the 
required dataset information, given a list of IDs.

To complete this task, perform the following steps:

1. The Faiss library requires integer IDs for the items present in the DataFrame. The dataset, on the other hand, has string IDs for each item.
Use the fit_transform() method of the sklearn.preprocessing.LabelEncoder class to convert the string item IDs to integer values.

2. Write a function named id2info(), which returns a list of column values for papers specified by their IDs. This method accepts the 
following parameters as input:

df: This is the DataFrame in which the data is contained.

I: This is a list of IDs of the papers for which the information is required.

column: This is the column of the DataFrame where the required information is stored.



In [10]:
le = preprocessing.LabelEncoder()
df['id'] = le.fit_transform(df['id'])
df.head(2)

Unnamed: 0,day,id,month,summary,title,year
0,1,36693,2,"We propose an architecture for VQA which utilizes recurrent layers to\ngenerate visual and textual attention. The memory characteristic of the\nproposed recurrent attention units offers a rich joint embedding of visual and\ntextual features and enables the model to reason relations between several\nparts of the image and question. Our single model outperforms the first place\nwinner on the VQA 1.0 dataset, performs within margin to the current\nstate-of-the-art ensemble model. We also experiment with replacing attention\nmechanisms in other state-of-the-art models with our implementation and show\nincreased accuracy. In both cases, our recurrent attention mechanism improves\nperformance in tasks requiring sequential or relational reasoning on the VQA\ndataset.",Dual Recurrent Attention Units for Visual Question Answering,2018
1,12,18198,3,"Recent approaches based on artificial neural networks (ANNs) have shown\npromising results for short-text classification. However, many short texts\noccur in sequences (e.g., sentences in a document or utterances in a dialog),\nand most existing ANN-based systems do not leverage the preceding short texts\nwhen classifying a subsequent one. In this work, we present a model based on\nrecurrent neural networks and convolutional neural networks that incorporates\nthe preceding short texts. Our model achieves state-of-the-art results on three\ndifferent datasets for dialog act prediction.",Sequential Short-Text Classification with Recurrent and Convolutional\n Neural Networks,2016


In [11]:
def id2info(df, I, column):
    return [list(df[df.id == idx][column]) for idx in I]

## Task 6: Set up the Index

In this task, you will set up the search index using the Faiss library.

To complete this task, perform the following steps:

1. Convert the embeddings to NumPy arrays of the float32 data type.

2. Initialize the index using the IndexFlatL2() method from the Faiss library. This method will take the length of the embeddings as input. This 
index will return search results based on the k-nearest-neighbors algorithm through a brute-force search with L2 (Euclidean) distances.
    
3. Use the IndexIDMap() from the Faiss library to create an index map that encapsulates the initialized index and provides a mapping between IDs and 
the embedding vectors when adding and searching. This method will take the index that was created in the previous step as a parameter.

4. Use the add_with_ids() method to add the embeddings and their IDs to the index map. This method will take the following parameters:

embeddings: An array of embedding vectors

xids: The IDs corresponding to the embedding vectors

5. Print the number of embeddings in the index map.



In [12]:
embeddings = np.array(embeddings).astype("float32")
index = faiss.IndexFlatL2(embeddings.shape[1])
index = faiss.IndexIDMap(index)
index.add_with_ids(embeddings, df['id'][:length])

print("Number of embeddings in the Faiss index: ", index.ntotal)

Number of embeddings in the Faiss index:  2000


## Task 7: Search with a Summary

In this section, you’ll search the index with a summary from the dataset. To complete this task, perform the following steps:

1. Print the summary that will be used to search.

2. Get the 10 nearest neighbors by searching the index map. Search the index by using the index.search() method.

This method accepts the following arguments:

vector: These are the embeddings of the summary that will be used to search.

k: This is the number of neighbors the model will return.
    
The method will return the following outputs:

D: A list of L2 (Euclidean) distances of the results.

I: The IDs of the results.
    
3. Display the distances, IDs, titles, and summaries of the returned results as a DataFrame.


In [13]:
df.iloc[1337, [3, 1]]

summary    In this paper we study the application of convolutional neural networks for\njointly detecting objects depicted in still images and estimating their 3D\npose. We identify different feature representations of oriented objects, and\nenergies that lead a network to learn this representations. The choice of the\nrepresentation is crucial since the pose of an object has a natural, continuous\nstructure while its category is a discrete variable. We evaluate the different\napproaches on the joint object detection and pose estimation task of the\nPascal3D+ benchmark using Average Viewpoint Precision. We show that a\nclassification approach on discretized viewpoints achieves state-of-the-art\nperformance for joint object detection and pose estimation, and significantly\noutperforms existing baselines on this benchmark.
id                                                                                                                                                                     

In [14]:
df.iloc[0, [3, 1]]

summary    We propose an architecture for VQA which utilizes recurrent layers to\ngenerate visual and textual attention. The memory characteristic of the\nproposed recurrent attention units offers a rich joint embedding of visual and\ntextual features and enables the model to reason relations between several\nparts of the image and question. Our single model outperforms the first place\nwinner on the VQA 1.0 dataset, performs within margin to the current\nstate-of-the-art ensemble model. We also experiment with replacing attention\nmechanisms in other state-of-the-art models with our implementation and show\nincreased accuracy. In both cases, our recurrent attention mechanism improves\nperformance in tasks requiring sequential or relational reasoning on the VQA\ndataset.
id                                                                                                                                                                                                                        

In [15]:
D, I = index.search(np.array([embeddings[1337]]), k=10)
pd.DataFrame({'L2 distance': D.flatten().tolist(), 'ML paper IDs': I.flatten().tolist(), 'ML paper titles': id2info(df, I.flatten(), 'title'), 'Summaries': id2info(df, I.flatten(), 'summary')}).head(10)

Unnamed: 0,L2 distance,ML paper IDs,ML paper titles,Summaries
0,0.0,12964,[Convolutional Neural Networks for joint object detection and pose\n estimation: A comparative study],"[In this paper we study the application of convolutional neural networks for\njointly detecting objects depicted in still images and estimating their 3D\npose. We identify different feature representations of oriented objects, and\nenergies that lead a network to learn this representations. The choice of the\nrepresentation is crucial since the pose of an object has a natural, continuous\nstructure while its category is a discrete variable. We evaluate the different\napproaches on the joint object detection and pose estimation task of the\nPascal3D+ benchmark using Average Viewpoint Precision. We show that a\nclassification approach on discretized viewpoints achieves state-of-the-art\nperformance for joint object detection and pose estimation, and significantly\noutperforms existing baselines on this benchmark.]"
1,61.530529,11503,[Deep Metric Learning for Practical Person Re-Identification],"[Various hand-crafted features and metric learning methods prevail in the\nfield of person re-identification. Compared to these methods, this paper\nproposes a more general way that can learn a similarity metric from image\npixels directly. By using a ""siamese"" deep neural network, the proposed method\ncan jointly learn the color feature, texture feature and metric in a unified\nframework. The network has a symmetry structure with two sub-networks which are\nconnected by Cosine function. To deal with the big variations of person images,\nbinomial deviance is used to evaluate the cost between similarities and labels,\nwhich is proved to be robust to outliers.\n Compared to existing researches, a more practical setting is studied in the\nexperiments that is training and test on different datasets (cross dataset\nperson re-identification). Both in ""intra dataset"" and ""cross dataset""\nsettings, the superiorities of the proposed method are illustrated on VIPeR and\nPRID.]"
2,65.805099,11377,[Cortical spatio-temporal dimensionality reduction for visual grouping],"[The visual systems of many mammals, including humans, is able to integrate\nthe geometric information of visual stimuli and to perform cognitive tasks\nalready at the first stages of the cortical processing. This is thought to be\nthe result of a combination of mechanisms, which include feature extraction at\nsingle cell level and geometric processing by means of cells connectivity. We\npresent a geometric model of such connectivities in the space of detected\nfeatures associated to spatio-temporal visual stimuli, and show how they can be\nused to obtain low-level object segmentation. The main idea is that of defining\na spectral clustering procedure with anisotropic affinities over datasets\nconsisting of embeddings of the visual stimuli into higher dimensional spaces.\nNeural plausibility of the proposed arguments will be discussed.]"
3,67.032898,11142,[Heterogeneous Multi-task Learning for Human Pose Estimation with Deep\n Convolutional Neural Network],"[We propose an heterogeneous multi-task learning framework for human pose\nestimation from monocular image with deep convolutional neural network. In\nparticular, we simultaneously learn a pose-joint regressor and a sliding-window\nbody-part detector in a deep network architecture. We show that including the\nbody-part detection task helps to regularize the network, directing it to\nconverge to a good solution. We report competitive and state-of-art results on\nseveral data sets. We also empirically show that the learned neurons in the\nmiddle layer of our network are tuned to localized body parts.]"
4,69.937653,30623,[Neural Expectation Maximization],"[Many real world tasks such as reasoning and physical interaction require\nidentification and manipulation of conceptual entities. A first step towards\nsolving these tasks is the automated discovery of distributed symbol-like\nrepresentations. In this paper, we explicitly formalize this problem as\ninference in a spatial mixture model where each component is parametrized by a\nneural network. Based on the Expectation Maximization framework we then derive\na differentiable clustering method that simultaneously learns how to group and\nrepresent individual entities. We evaluate our method on the (sequential)\nperceptual grouping task and find that it is able to accurately recover the\nconstituent objects. We demonstrate that the learned representations are useful\nfor next-step prediction.]"
5,70.530083,13876,[Pixel-wise Deep Learning for Contour Detection],"[We address the problem of contour detection via per-pixel classifications of\nedge point. To facilitate the process, the proposed approach leverages with\nDenseNet, an efficient implementation of multiscale convolutional neural\nnetworks (CNNs), to extract an informative feature vector for each pixel and\nuses an SVM classifier to accomplish contour detection. In the experiment of\ncontour detection, we look into the effectiveness of combining per-pixel\nfeatures from different CNN layers and verify their performance on BSDS500.]"
6,73.332405,18371,[Sparse Activity and Sparse Connectivity in Supervised Learning],"[Sparseness is a useful regularizer for learning in a wide range of\napplications, in particular in neural networks. This paper proposes a model\ntargeted at classification tasks, where sparse activity and sparse connectivity\nare used to enhance classification capabilities. The tool for achieving this is\na sparseness-enforcing projection operator which finds the closest vector with\na pre-defined sparseness for any given vector. In the theoretical part of this\npaper, a comprehensive theory for such a projection is developed. In\nconclusion, it is shown that the projection is differentiable almost everywhere\nand can thus be implemented as a smooth neuronal transfer function. The entire\nmodel can hence be tuned end-to-end using gradient-based methods. Experiments\non the MNIST database of handwritten digits show that classification\nperformance can be boosted by sparse activity or sparse connectivity. With a\ncombination of both, performance can be significantly better compared to\nclassical non-sparse approaches.]"
7,73.55928,10182,[Deeply Coupled Auto-encoder Networks for Cross-view Classification],"[The comparison of heterogeneous samples extensively exists in many\napplications, especially in the task of image classification. In this paper, we\npropose a simple but effective coupled neural network, called Deeply Coupled\nAutoencoder Networks (DCAN), which seeks to build two deep neural networks,\ncoupled with each other in every corresponding layers. In DCAN, each deep\nstructure is developed via stacking multiple discriminative coupled\nauto-encoders, a denoising auto-encoder trained with maximum margin criterion\nconsisting of intra-class compactness and inter-class penalty. This single\nlayer component makes our model simultaneously preserve the local consistency\nand enhance its discriminative capability. With increasing number of layers,\nthe coupled networks can gradually narrow the gap between the two views.\nExtensive experiments on cross-view image classification tasks demonstrate the\nsuperiority of our method over state-of-the-art methods.]"
8,73.715988,21064,[Crafting a multi-task CNN for viewpoint estimation],"[Convolutional Neural Networks (CNNs) were recently shown to provide\nstate-of-the-art results for object category viewpoint estimation. However\ndifferent ways of formulating this problem have been proposed and the competing\napproaches have been explored with very different design choices. This paper\npresents a comparison of these approaches in a unified setting as well as a\ndetailed analysis of the key factors that impact performance. Followingly, we\npresent a new joint training method with the detection task and demonstrate its\nbenefit. We also highlight the superiority of classification approaches over\nregression approaches, quantify the benefits of deeper architectures and\nextended training data, and demonstrate that synthetic data is beneficial even\nwhen using ImageNet training data. By combining all these elements, we\ndemonstrate an improvement of approximately 5% mAVP over previous\nstate-of-the-art results on the Pascal3D+ dataset. In particular for their most\nchallenging 24 view classification task we improve the results from 31.1% to\n36.1% mAVP.]"
9,73.727539,18667,[Deep Aesthetic Quality Assessment with Semantic Information],"[Human beings often assess the aesthetic quality of an image coupled with the\nidentification of the image's semantic content. This paper addresses the\ncorrelation issue between automatic aesthetic quality assessment and semantic\nrecognition. We cast the assessment problem as the main task among a multi-task\ndeep model, and argue that semantic recognition task offers the key to address\nthis problem. Based on convolutional neural networks, we employ a single and\nsimple multi-task framework to efficiently utilize the supervision of aesthetic\nand semantic labels. A correlation item between these two tasks is further\nintroduced to the framework by incorporating the inter-task relationship\nlearning. This item not only provides some useful insight about the correlation\nbut also improves assessment accuracy of the aesthetic task. Particularly, an\neffective strategy is developed to keep a balance between the two tasks, which\nfacilitates to optimize the parameters of the framework. Extensive experiments\non the challenging AVA dataset and Photo.net dataset validate the importance of\nsemantic recognition in aesthetic quality assessment, and demonstrate that\nmulti-task deep models can discover an effective aesthetic representation to\nachieve state-of-the-art results.]"


## Task 8: Search with a Prompt

In this task, you’ll search the dataset using a prompt. To complete this task, perform the following steps:

1. Create the prompt and assign it to a string variable named user_query.
    
2. Encode the prompt using the encode() method of the selected. This method accepts the prompt as a list.
    
3. Search the index using the generated embeddings, and display the distances, IDs, titles, and summaries in the results as a DataFrame.


In [16]:

user_query = "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data"



In [17]:
embed = model.encode(list(user_query))
D, I = index.search(np.array([embed]).squeeze().astype("float32"), k=10)

results = {'L2 distances':D.flatten().tolist(), 'ML paper IDs':I.flatten().tolist(), "Titles": id2info(df, I.flatten(), 'title'), "Summaries": id2info(df, I.flatten(), 'summary')}

pd.DataFrame(results).head(10)

Unnamed: 0,L2 distances,ML paper IDs,Titles,Summaries
0,398.835114,26890,[Abstract Syntax Networks for Code Generation and Semantic Parsing],"[Tasks like code generation and semantic parsing require mapping unstructured\n(or partially structured) inputs to well-formed, executable outputs. We\nintroduce abstract syntax networks, a modeling framework for these problems.\nThe outputs are represented as abstract syntax trees (ASTs) and constructed by\na decoder with a dynamically-determined modular structure paralleling the\nstructure of the output tree. On the benchmark Hearthstone dataset for code\ngeneration, our model obtains 79.2 BLEU and 22.7% exact match accuracy,\ncompared to previous state-of-the-art values of 67.1 and 6.1%. Furthermore, we\nperform competitively on the Atis, Jobs, and Geo semantic parsing datasets with\nno task-specific engineering.]"
1,401.845093,30018,[Dual Rectified Linear Units (DReLUs): A Replacement for Tanh Activation\n Functions in Quasi-Recurrent Neural Networks],"[In this paper, we introduce a novel type of Rectified Linear Unit (ReLU),\ncalled a Dual Rectified Linear Unit (DReLU). A DReLU, which comes with an\nunbounded positive and negative image, can be used as a drop-in replacement for\na tanh activation function in the recurrent step of Quasi-Recurrent Neural\nNetworks (QRNNs) (Bradbury et al. (2017)). Similar to ReLUs, DReLUs are less\nprone to the vanishing gradient problem, they are noise robust, and they induce\nsparse activations.\n We independently reproduce the QRNN experiments of Bradbury et al. (2017) and\ncompare our DReLU-based QRNNs with the original tanh-based QRNNs and Long\nShort-Term Memory networks (LSTMs) on sentiment classification and word-level\nlanguage modeling. Additionally, we evaluate on character-level language\nmodeling, showing that we are able to stack up to eight QRNN layers with\nDReLUs, thus making it possible to improve the current state-of-the-art in\ncharacter-level language modeling over shallow architectures based on LSTMs.]"
2,403.243622,7184,[KSU KDD: Word Sense Induction by Clustering in Topic Space],"[We describe our language-independent unsupervised word sense induction\nsystem. This system only uses topic features to cluster different word senses\nin their global context topic space. Using unlabeled data, this system trains a\nlatent Dirichlet allocation (LDA) topic model then uses it to infer the topics\ndistribution of the test instances. By clustering these topics distributions in\ntheir topic space we cluster them into different senses. Our hypothesis is that\ncloseness in topic space reflects similarity between different word senses.\nThis system participated in SemEval-2 word sense induction and disambiguation\ntask and achieved the second highest V-measure score among all other systems.]"
3,403.410858,28659,[Topic supervised non-negative matrix factorization],"[Topic models have been extensively used to organize and interpret the\ncontents of large, unstructured corpora of text documents. Although topic\nmodels often perform well on traditional training vs. test set evaluations, it\nis often the case that the results of a topic model do not align with human\ninterpretation. This interpretability fallacy is largely due to the\nunsupervised nature of topic models, which prohibits any user guidance on the\nresults of a model. In this paper, we introduce a semi-supervised method called\ntopic supervised non-negative matrix factorization (TS-NMF) that enables the\nuser to provide labeled example documents to promote the discovery of more\nmeaningful semantic structure of a corpus. In this way, the results of TS-NMF\nbetter match the intuition and desired labeling of the user. The core of TS-NMF\nrelies on solving a non-convex optimization problem for which we derive an\niterative algorithm that is shown to be monotonic and convergent to a local\noptimum. We demonstrate the practical utility of TS-NMF on the Reuters and\nPubMed corpora, and find that TS-NMF is especially useful for conceptual or\nbroad topics, where topic key terms are not well understood. Although\nidentifying an optimal latent structure for the data is not a primary objective\nof the proposed approach, we find that TS-NMF achieves higher weighted Jaccard\nsimilarity scores than the contemporary methods, (unsupervised) NMF and latent\nDirichlet allocation, at supervision rates as low as 10% to 20%.]"
4,404.912506,34786,[Don't Just Assume; Look and Answer: Overcoming Priors for Visual\n Question Answering],"[A number of studies have found that today's Visual Question Answering (VQA)\nmodels are heavily driven by superficial correlations in the training data and\nlack sufficient image grounding. To encourage development of models geared\ntowards the latter, we propose a new setting for VQA where for every question\ntype, train and test sets have different prior distributions of answers.\nSpecifically, we present new splits of the VQA v1 and VQA v2 datasets, which we\ncall Visual Question Answering under Changing Priors (VQA-CP v1 and VQA-CP v2\nrespectively). First, we evaluate several existing VQA models under this new\nsetting and show that their performance degrades significantly compared to the\noriginal VQA setting. Second, we propose a novel Grounded Visual Question\nAnswering model (GVQA) that contains inductive biases and restrictions in the\narchitecture specifically designed to prevent the model from 'cheating' by\nprimarily relying on priors in the training data. Specifically, GVQA explicitly\ndisentangles the recognition of visual concepts present in the image from the\nidentification of plausible answer space for a given question, enabling the\nmodel to more robustly generalize across different distributions of answers.\nGVQA is built off an existing VQA model -- Stacked Attention Networks (SAN).\nOur experiments demonstrate that GVQA significantly outperforms SAN on both\nVQA-CP v1 and VQA-CP v2 datasets. Interestingly, it also outperforms more\npowerful VQA models such as Multimodal Compact Bilinear Pooling (MCB) in\nseveral cases. GVQA offers strengths complementary to SAN when trained and\nevaluated on the original VQA v1 and VQA v2 datasets. Finally, GVQA is more\ntransparent and interpretable than existing VQA models.]"
5,406.426178,30468,[Regularizing and Optimizing LSTM Language Models],"[Recurrent neural networks (RNNs), such as long short-term memory networks\n(LSTMs), serve as a fundamental building block for many sequence learning\ntasks, including machine translation, language modeling, and question\nanswering. In this paper, we consider the specific problem of word-level\nlanguage modeling and investigate strategies for regularizing and optimizing\nLSTM-based models. We propose the weight-dropped LSTM which uses DropConnect on\nhidden-to-hidden weights as a form of recurrent regularization. Further, we\nintroduce NT-ASGD, a variant of the averaged stochastic gradient method,\nwherein the averaging trigger is determined using a non-monotonic condition as\nopposed to being tuned by the user. Using these and other regularization\nstrategies, we achieve state-of-the-art word level perplexities on two data\nsets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the\neffectiveness of a neural cache in conjunction with our proposed model, we\nachieve an even lower state-of-the-art perplexity of 52.8 on Penn Treebank and\n52.0 on WikiText-2.]"
6,410.384216,16734,[Learning the Dimensionality of Word Embeddings],"[We describe a method for learning word embeddings with data-dependent\ndimensionality. Our Stochastic Dimensionality Skip-Gram (SD-SG) and Stochastic\nDimensionality Continuous Bag-of-Words (SD-CBOW) are nonparametric analogs of\nMikolov et al.'s (2013) well-known 'word2vec' models. Vector dimensionality is\nmade dynamic by employing techniques used by Cote & Larochelle (2016) to define\nan RBM with an infinite number of hidden units. We show qualitatively and\nquantitatively that SD-SG and SD-CBOW are competitive with their\nfixed-dimension counterparts while providing a distribution over embedding\ndimensionalities, which offers a window into how semantics distribute across\ndimensions.]"
7,410.741394,12050,[HD-CNN: Hierarchical Deep Convolutional Neural Network for Large Scale\n Visual Recognition],"[In image classification, visual separability between different object\ncategories is highly uneven, and some categories are more difficult to\ndistinguish than others. Such difficult categories demand more dedicated\nclassifiers. However, existing deep convolutional neural networks (CNN) are\ntrained as flat N-way classifiers, and few efforts have been made to leverage\nthe hierarchical structure of categories. In this paper, we introduce\nhierarchical deep CNNs (HD-CNNs) by embedding deep CNNs into a category\nhierarchy. An HD-CNN separates easy classes using a coarse category classifier\nwhile distinguishing difficult classes using fine category classifiers. During\nHD-CNN training, component-wise pretraining is followed by global finetuning\nwith a multinomial logistic loss regularized by a coarse category consistency\nterm. In addition, conditional executions of fine category classifiers and\nlayer parameter compression make HD-CNNs scalable for large-scale visual\nrecognition. We achieve state-of-the-art results on both CIFAR100 and\nlarge-scale ImageNet 1000-class benchmark datasets. In our experiments, we\nbuild up three different HD-CNNs and they lower the top-1 error of the standard\nCNNs by 2.65%, 3.1% and 1.1%, respectively.]"
8,414.264343,11912,[Taking into Account the Differences between Actively and Passively\n Acquired Data: The Case of Active Learning with Support Vector Machines for\n Imbalanced Datasets],"[Actively sampled data can have very different characteristics than passively\nsampled data. Therefore, it's promising to investigate using different\ninference procedures during AL than are used during passive learning (PL). This\ngeneral idea is explored in detail for the focused case of AL with\ncost-weighted SVMs for imbalanced data, a situation that arises for many HLT\ntasks. The key idea behind the proposed InitPA method for addressing imbalance\nis to base cost models during AL on an estimate of overall corpus imbalance\ncomputed via a small unbiased sample rather than the imbalance in the labeled\ntraining data, which is the leading method used during PL.]"
9,414.365906,31758,[Self-Guiding Multimodal LSTM - when we do not have a perfect training\n dataset for image captioning],"[In this paper, a self-guiding multimodal LSTM (sg-LSTM) image captioning\nmodel is proposed to handle uncontrolled imbalanced real-world image-sentence\ndataset. We collect FlickrNYC dataset from Flickr as our testbed with 306,165\nimages and the original text descriptions uploaded by the users are utilized as\nthe ground truth for training. Descriptions in FlickrNYC dataset vary\ndramatically ranging from short term-descriptions to long\nparagraph-descriptions and can describe any visual aspects, or even refer to\nobjects that are not depicted. To deal with the imbalanced and noisy situation\nand to fully explore the dataset itself, we propose a novel guiding textual\nfeature extracted utilizing a multimodal LSTM (m-LSTM) model. Training of\nm-LSTM is based on the portion of data in which the image content and the\ncorresponding descriptions are strongly bonded. Afterwards, during the training\nof sg-LSTM on the rest training data, this guiding information serves as\nadditional input to the network along with the image representations and the\nground-truth descriptions. By integrating these input components into a\nmultimodal block, we aim to form a training scheme with the textual information\ntightly coupled with the image content. The experimental results demonstrate\nthat the proposed sg-LSTM model outperforms the traditional state-of-the-art\nmultimodal RNN captioning framework in successfully describing the key\ncomponents of the input images.]"


# End