# Topic similarity

The aim of this notebook is to test a strategy for measuring the similarity between an **expert profile** and a **project description**.

Measuring similarity is expected to help determine how well expert profiles fit a given project description.

The proposed alternatives for measuring similarity include:

* LDA (topic modelling)
* word2vec

## Data

For model training, we will use a dataset built from documents associated with experts. We will then compare other documents against the trained models to measure similarity.

Data is located at 'data/sedici/Filtrar por_ Autor.html'

First, we download the author list from the scholarly search site (http://sedici.unlp.edu.ar/search-filter?field=author&rpp=100000). The `rpp=100000` parameter sets the number of results per page to 100000, so the whole author list (the number of authors is greater than 70000) is displayed in a single web page, which is useful for extracting author ids.
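As a minimal sketch of how that URL is put together, the snippet below builds the query string from its two parameters. The helper name `author_list_url` is hypothetical, and the download call is left commented out because it needs network access; the target path is assumed to match the file parsed in the next cells.

```python
from urllib.parse import urlencode

BASE_URL = "http://sedici.unlp.edu.ar/search-filter"


def author_list_url(results_per_page=100000):
    """Build the author-listing URL (hypothetical helper, not part of the notebook)."""
    # field=author selects the author facet; rpp is the number of results per page
    query = urlencode({"field": "author", "rpp": results_per_page})
    return f"{BASE_URL}?{query}"


url = author_list_url()
print(url)

# Assumed way to save the page locally (requires network access):
# import urllib.request
# urllib.request.urlretrieve(url, "../data/sedici/filtrar_por_autor.html")
```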

In [10]:
from lxml import html
import re

In [66]:
htmltree = html.parse('../data/sedici/filtrar_por_autor.html')

In [88]:
a_sections = htmltree.xpath('//*[@id="aspect_discovery_SearchFacetFilter_div_browse-by-author-results"]//div/table//tr/td/a[contains(@href, "authority")]')

Produce a list of author names

In [130]:
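# Strip the trailing article count in parentheses, then split "Surname, Given names"
authors = []

for n in a_sections:
    # Tolerate whitespace around the count so the match does not depend on exact spacing
    name = re.sub(r'\s*\(\d+\)\s*$', '', n.text)
    name = [x.strip() for x in name.split(',')]
    authors.append(name)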

In [132]:
len(authors)

22380

Produce a list of *node ids* 

In [169]:
node_ids = []

for n in a_sections:
    # The node id is the trailing "F<digits>" token of the link's href
    href = n.get('href')
    node_id = re.search(r'(F\d+)$', href).group(1)
    node_ids.append(node_id)

In [170]:
len(node_ids)

22380

In [172]:
import pandas as pd

In [179]:
authors[:10]

[['Aamir', 'Muhammad N.'],
 ['Abad Santos', 'Natalia'],
 ['Abad', 'Jimena'],
 ['Abad', 'Juan Ernesto'],
 ['Abadi', 'Florencia'],
 ['Abadie', 'Diego Gustavo Edwin'],
 ['Abadíe', 'Mariana'],
 ['Abal', 'Adrián Alejandro'],
 ['Abal', 'Adrián Alejandro'],
 ['Abal', 'Mauricio']]

Save the dataset to a CSV file, indexed by node id

In [181]:
pd.DataFrame(authors, index=node_ids).to_csv('../data/sedici/authors.csv', header=False)

## LDA

Our first approach involves applying LDA to build expert profiles using topic models.

Here, **expert profiles** are defined as a set of topic models, and these models are trained using the documents associated with each expert. For instance, in academia these documents could be research papers of which the expert is an author.

**Project descriptions** are defined by their contributions to the topic models of the expert profiles, that is, by how strongly each topic is present in the description.

The similarity between the project descriptions and the expert profiles can be measured using the Jensen-Shannon Distance (see [ref](https://www.kaggle.com/ktattan/lda-and-document-similarity#Similarity-Queries-and-Unseen-Data))
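To make the measure concrete, here is a minimal pure-Python sketch of the Jensen-Shannon distance between two topic distributions, used to rank experts for one project. The function names and the three-topic distributions are made up for illustration; in practice the distributions would come from the trained topic models, and the linked reference may compute the distance differently (for example with another logarithm base).

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base-2 logs, so the result lies in [0, 1])."""
    # m is the midpoint distribution; m_i > 0 wherever p_i > 0 or q_i > 0
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding errors


# Hypothetical topic distributions over three topics (illustrative numbers only)
project = [0.7, 0.2, 0.1]
experts = {
    "expert_a": [0.6, 0.3, 0.1],
    "expert_b": [0.1, 0.1, 0.8],
}

# Smaller distance means the expert profile is closer to the project description
ranking = sorted(experts, key=lambda name: jensen_shannon_distance(project, experts[name]))
print(ranking)
```

The distance is symmetric and bounded, which makes scores comparable across experts; identical distributions give 0 and distributions with no shared topics give 1.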