# Topic similarity

The aim of this notebook is to test a strategy for measuring the similarity between an **expert profile** and a **project description**.

Measuring similarity is supposed to be useful in determining how well expert profiles fit a given project description.

The proposed alternatives for testing similarity include:

* LDA (topic modelling)
* word2vec

## Data

For model training, we will use a dataset built from documents associated with experts. Next, we will compare other documents against the trained models to measure similarity.

Data is located at 'data/sedici/filtrar_por_autor.html'

First, we download the author list from the SEDICI search page (http://sedici.unlp.edu.ar/search-filter?field=author&rpp=100000). The `rpp=100000` parameter sets the number of results per page to 100000. Since there are more than 70000 authors, this displays the whole author list on a single web page, which is useful for extracting author ids.

In [None]:
from lxml import html
import re

In [None]:
import logging

In [None]:
logging.basicConfig(filename='../logs/console.log', 
                    level=logging.INFO,
                    filemode='w', 
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

In [None]:
htmltree = html.parse('../data/sedici/filtrar_por_autor.html')

In [None]:
a_sections = htmltree.xpath('//*[@id="aspect_discovery_SearchFacetFilter_div_browse-by-author-results"]//div/table//tr/td/a[contains(@href, "authority")]')

Produce a list of author names

In [None]:
# remove the number of articles in parentheses and split the name on commas
authors = []

for n in a_sections:
    name = n.text
    name = re.sub(r'\(\d+\)$', '', name)
    name = [x.strip() for x in name.split(',')]
    authors.append(name)
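
The cleaning step above can be checked on a few made-up link texts. This is only a minimal sketch: the sample strings are hypothetical and merely mimic the `Surname, Name (count)` shape that the loop assumes, and `clean_author` is an illustrative helper that is not part of the notebook.

```python
import re

def clean_author(text):
    """Strip a trailing '(N)' article count and split the name on commas."""
    # Hypothetical helper mirroring the loop above; it also tolerates
    # whitespace around the trailing count.
    name = re.sub(r'\s*\(\d+\)\s*$', '', text)
    return [part.strip() for part in name.split(',')]

# Made-up examples, not taken from the real author list
samples = ['Pérez, Juan (12)', 'García López, María (3)', 'Anónimo (1)']
cleaned = [clean_author(s) for s in samples]
print(cleaned)
```

Names without a comma end up as a single-element list, so the rows of `authors` may have different lengths.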

In [None]:
len(authors)

Produce a list of *node ids* 

In [None]:
a_sections.__class__

In [None]:
a_sections[0].text

In [None]:
def get_trailing_numbers(urls):
    '''
    Returns the trailing number of the href of every element in urls
    
    ------
    param
    urls list of link elements with an href attribute
    ------
    return
    list of trailing numbers, as strings
    '''
    node_ids = []

    for n in urls:
        url = n.get('href')
        url = re.search(r'(\d+)$', url).group(0)
        node_ids.append(url)
                               
    return node_ids
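
The function above raises an `AttributeError` when an href does not end in digits, because `re.search` returns `None` in that case. A possible defensive variant, sketched here on plain strings rather than lxml elements, could look as follows. The name `trailing_number` and the sample hrefs are assumptions made for illustration only.

```python
import re

def trailing_number(url):
    """Return the digits at the end of url, or None when there are none."""
    match = re.search(r'(\d+)$', url)
    return match.group(1) if match else None

# Hypothetical hrefs, loosely shaped like the URLs used in this notebook
hrefs = [
    'http://voc.sedici.unlp.edu.ar/node/53309',
    'http://sedici.unlp.edu.ar/handle/10915/70802',
    'http://sedici.unlp.edu.ar/about',
]
ids = [trailing_number(h) for h in hrefs]
print(ids)
```

Note that the ids are kept as strings, not converted to integers, which matches what `get_trailing_numbers` returns.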

In [None]:
node_ids = get_trailing_numbers(a_sections)

In [None]:
len(node_ids)

In [None]:
import pandas as pd

In [None]:
authors[:10]

Save the data set to a csv file 

In [None]:
pd.DataFrame(authors, index=node_ids).to_csv('../data/sedici/authors.csv', header=False)
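
Since the file is written without a header, reading it back requires knowing the column layout: the node id first, then the name parts. The sketch below loads that layout with the standard `csv` module. The rows are made up, `load_authors` is a hypothetical helper, and it assumes that missing name parts are written as empty fields.

```python
import csv
import io

# Made-up rows in the layout written above: node id, then the name parts
sample_csv = '53309,Pérez,Juan\n60001,Anónimo,\n'

def load_authors(handle):
    """Map each node id to the list of its non-empty name parts."""
    return {
        row[0]: [part for part in row[1:] if part]
        for row in csv.reader(handle)
        if row
    }

authors_by_id = load_authors(io.StringIO(sample_csv))
print(authors_by_id)
```

With the real file, the `io.StringIO` object would be replaced by an open file handle for `../data/sedici/authors.csv`.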

Get the author's documents

Now that we have the author data, we must get the documents associated with each author.

First, we define a function to get article ids

In [None]:
def get_article_ids(author_id):
    '''
    Gets the article ids for a given author id
    
    -------
    param
    author_id id of the author 
    
    -------
    returns
    set of article ids (strings)
    '''

    rpp = 1000
    article_ids = []

    url = 'http://sedici.unlp.edu.ar/discover?filtertype_0=author&filter_relational_operator_0=authority&filter_0=http://voc.sedici.unlp.edu.ar/node/{0}&rpp={1}'.format(author_id, rpp)

    htmltree = html.parse(url)

    a_sections = htmltree.xpath('//*[@id="aspect_discovery_SimpleSearch_div_search-results"]/ul/ul//li/div[2]/div[1]/span/a')

    article_ids = get_trailing_numbers(a_sections)
    
    return set(article_ids)
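
The query URL above embeds another URL as the value of `filter_0`. As a sketch of an alternative, the query string can be assembled with `urllib.parse.urlencode`, which percent-encodes each value. The helper name `build_author_query` is hypothetical, and it is an assumption that the server decodes the encoded value to the same filter.

```python
from urllib.parse import urlencode

BASE_URL = 'http://sedici.unlp.edu.ar/discover'

def build_author_query(author_id, rpp=1000):
    """Build the discover URL for one author, percent-encoding the values."""
    params = {
        'filtertype_0': 'author',
        'filter_relational_operator_0': 'authority',
        'filter_0': 'http://voc.sedici.unlp.edu.ar/node/{}'.format(author_id),
        'rpp': rpp,
    }
    return '{}?{}'.format(BASE_URL, urlencode(params))

url = build_author_query(53309)
print(url)
```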

In [None]:
node_id = 53309

In [None]:
article_ids = get_article_ids(node_id)

In [None]:
len(article_ids)

In [None]:
article_ids

Second, we define a function to get the article title and abstract

In [None]:
def get_article_texts(article_id):
    '''
    Gets the title and abstract for a given article id
    
    ------
    param
    article_id id of the article
    
    ------
    return 
    article_id, title, abstract (title and abstract are left empty when they cannot be parsed)
    '''
    title, resume = '', ''

    try:
        url = 'http://sedici.unlp.edu.ar/handle/10915/{}'.format(article_id)

        htmltree = html.parse(url)

        #article_section = htmltree.xpath('//*[@id="aspect_artifactbrowser_ItemViewer_div_item-view"]/div[1]/h1/text()')        
        article_section = htmltree.xpath('..//h1/text()')        
        title = article_section[0]

        #article_section = htmltree.xpath('//*[@id="aspect_artifactbrowser_ItemViewer_div_item-view"]/div[1]/div[4]/div/p/text()')
        #if len(article_section) == 0:            
        #    article_section = htmltree.xpath('//*[@id="aspect_artifactbrowser_ItemViewer_div_item-view"]/div[1]/div[1]/span[2]/text()')
        #    title = title + '. ' + article_section[0]
            
        #    article_section = htmltree.xpath('//*[@id="aspect_artifactbrowser_ItemViewer_div_item-view"]/div[1]/div[5]/div/p/text()')
        article_section = htmltree.xpath('//div[@class="simple-item-view-description"]//div//p/text()')
        resume = article_section[0]
        
    except Exception as inst:
        logger = logging.getLogger('Articles logger')    
        logger.error('Article parser error: article id {0}, type error {1}'.format(article_id, type(inst)))
    
    return article_id, title, resume

In [None]:
article_id = '70802'

In [None]:
get_article_texts(article_id)

Now, let's put it all together to get the titles and abstracts for every article.

In [None]:
node_id = '53309'


articles = []

article_ids = get_article_ids(node_id)

for a in article_ids:
    articles.append((node_id,) + get_article_texts(a))
        

In [None]:

def get_articles(node_ids, start_idx, end_idx):
    '''
    Gets the articles from node_ids 
    
    -----
    param
    node_ids list of node ids
    
    -----
    param
    start_idx starting index of the node_ids
    
    -----
    param
    end_idx ending index of the node_ids
    '''
    articles = []
    
    assert((start_idx <= end_idx))
    assert(start_idx >= 0)
    assert((end_idx <= len(node_ids)))
    
    logger = logging.getLogger('Articles logger')
    logger.setLevel('INFO')
    
    logger.info("Start getting articles from index {0} to {1}".format(start_idx, end_idx))

    for i in range(start_idx, end_idx):
        node_id = node_ids[i]
        
        logger.info("Getting articles for node id {0}".format(node_id))

        article_ids = get_article_ids(node_id)

        for a in article_ids:
            articles.append((node_id,) + get_article_texts(a))
    
    logger.info("Finish getting articles from index {0} to {1}".format(start_idx, end_idx))
    
    return articles

Finally, we download the articles in batches, so that the results of each batch can be stored in a separate file.

In [None]:
batch = 100
start, end = 0, 1000

for i in range(start, end, batch):
    from_idx = i
    to_idx = i + batch
    
    logger = logging.getLogger('Articles logger')
    logger.info('Starting batch from index {0} to {1}'.format(from_idx, to_idx))
    
    articles = get_articles(node_ids, from_idx, to_idx)
    articles = pd.DataFrame(articles, columns=['author_id', 'article_id', 'title', 'abstract'])

    articles.to_csv('../data/sedici/node_articles_{}-{}.csv'.format(from_idx, to_idx), index=False)
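
The loop above assumes that `node_ids` has at least `end` entries. Otherwise `to_idx` exceeds the list length and the assertion in `get_articles` fails. One possible way to guard against this is to clamp the batch boundaries, as in the sketch below. `batch_ranges` is a hypothetical helper and the list length of 250 is made up.

```python
def batch_ranges(start, end, batch, total):
    """Yield (from_idx, to_idx) pairs covering [start, min(end, total))."""
    stop = min(end, total)
    for from_idx in range(start, stop, batch):
        # The last batch is shortened so it never runs past `stop`
        yield from_idx, min(from_idx + batch, stop)

# 250 stands in for len(node_ids); it is a made-up value
ranges = list(batch_ranges(0, 1000, 100, 250))
print(ranges)
```

Each pair could then be passed to `get_articles(node_ids, from_idx, to_idx)` in place of the hand-computed indices.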
    

## LDA

Our first approach involves applying LDA to build expert profiles using topic models.

Here, **expert profiles** are defined as a set of topic models, and these models are trained using the documents associated with each expert. For instance, in academia these documents could be research papers authored by the expert.

**Project descriptions** are defined as the contributions to the topic models of the expert profiles.

The similarity between the project descriptions and the expert profiles can be measured using the Jensen-Shannon distance (see [ref](https://www.kaggle.com/ktattan/lda-and-document-similarity#Similarity-Queries-and-Unseen-Data)).
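
As a rough illustration of the measure, the Jensen-Shannon distance between two topic distributions can be sketched in plain Python. The three-topic vectors below are made up, and the function names are illustrative. Base-2 logarithms are used here so that the distance lies between 0 and 1.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; zero entries of p add 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence of two distributions."""
    # m is the pointwise average; it is positive wherever p or q is positive
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding

# Made-up topic distributions over three topics
expert = [0.7, 0.2, 0.1]
project_close = [0.6, 0.3, 0.1]
project_far = [0.05, 0.05, 0.9]

d_close = jensen_shannon_distance(expert, project_close)
d_far = jensen_shannon_distance(expert, project_far)
print(d_close, d_far)
```

A smaller distance means the project description is closer to the expert profile, so `project_close` should rank ahead of `project_far` for this expert.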