# Topic similarity

The aim of this notebook is to test a strategy for measuring the similarity between an **expert profile** and a **project description**.

Measuring similarity should help determine how well an expert profile fits a given project description.

The proposed alternatives for measuring similarity are:

* LDA (topic modelling)
* word2vec

## LDA

Our first approach involves applying LDA to build expert profiles using topic models.

Here, **expert profiles** are defined as a set of topic models, and these models are trained on the documents associated with each expert. For instance, in academia these documents could be the research papers the expert has authored.

**Project descriptions** are defined as the contributions to the topic models of the expert profiles.

The similarity between project descriptions and expert profiles can be measured using the Jensen-Shannon distance (see [ref](https://www.kaggle.com/ktattan/lda-and-document-similarity#Similarity-Queries-and-Unseen-Data)).
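As a minimal sketch of that measure, the snippet below computes the Jensen-Shannon distance (the square root of the Jensen-Shannon divergence, using natural logarithms) between topic distributions and ranks experts by it. The function names, the expert names, and the three-topic distributions are hypothetical and only illustrate the idea; in practice the distributions would come from a trained LDA model.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two distributions."""
    # m is the pointwise average of the two distributions.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))


# Hypothetical topic distributions over three topics (illustrative values only).
expert_topics = {
    'expert_a': [0.7, 0.2, 0.1],
    'expert_b': [0.1, 0.1, 0.8],
}
project_topics = [0.6, 0.3, 0.1]

# Rank experts from most to least similar (smaller distance = closer match).
ranking = sorted(
    expert_topics,
    key=lambda name: jensen_shannon_distance(expert_topics[name], project_topics),
)
```

The distance is symmetric and equals 0 for identical distributions, so the first entry of `ranking` is the expert whose topic mix is closest to the project description.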

### Load the data

In [35]:
from os import listdir

import pandas as pd

In [36]:
filenames = [x for x in listdir('../data/sedici') if 'node_articles_ciencias_informaticas' in x]

In [37]:
# Read every matching CSV and stack them into a single DataFrame
frames = [pd.read_csv('../data/sedici/{0}'.format(filename)) for filename in filenames]
df = pd.concat(frames, ignore_index=True)

In [38]:
df.shape

(32425, 4)

In [39]:
df.head()

Unnamed: 0,author_id,article_id,title,abstract
0,55286,24446,Análisis automático del rendimiento de ejecuci...,La programación paralela más tradicional oblig...
1,55286,9582,Development and tuning framework of master/wor...,Parallel/distributed programming is a complex ...
2,55286,20899,Método de Reducción de Incertidumbre basado en...,La problemática existente a raíz de la falta d...
3,55286,9524,Process tracking for dynamic tuning applicatio...,The computational resources need by the scient...
4,55286,23166,Mapas de riesgo de incendios forestales basado...,La valoración del riesgo en los incendios fore...


In [40]:
concat_results = []

author_ids = df['author_id'].unique()

for author_id in author_ids:
    # Concatenate "title. abstract" for every article by this author, one per line
    content = ''

    for idx, row in df[df['author_id'] == author_id].iterrows():
        content = content + '. '.join([str(row['title']), str(row['abstract'])]) + '\n'

    concat_results.append((author_id, content))

# One row per author, with all of their text in a single column
df_content = pd.DataFrame(concat_results, columns=['author_id', 'content'])

In [41]:
df_content.head()

Unnamed: 0,author_id,content
0,55286,Análisis automático del rendimiento de ejecuci...
1,59726,Simplicidade no desenvolvimento ágil de softwa...
2,63510,Adaptabilidad en familia de aplicaciones web. ...
3,54180,Uso de VRPN en la implementación de una BCI p...
4,58677,m-Experiencia de articulación universidad-escu...


In [42]:
len(df_content)

3325

We now have a data set in which each row contains an author id and the concatenated text (titles and abstracts) of that author's scientific publications.
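The same per-author table could also be built with a grouped aggregation instead of an explicit loop. The sketch below is one possible way to do it, shown on a tiny made-up sample that only mirrors the column names of `df`; the names `sample` and `sample_content` are hypothetical.

```python
import pandas as pd

# Tiny hypothetical sample with the same columns as `df` (values are made up).
sample = pd.DataFrame({
    'author_id': [2, 2, 1],
    'title': ['T1', 'T2', 'T3'],
    'abstract': ['A1', 'A2', 'A3'],
})

# Build "title. abstract\n" for each article, as the loop does.
text = sample['title'].astype(str) + '. ' + sample['abstract'].astype(str) + '\n'

# Join each author's articles; sort=False keeps authors in order of first appearance.
sample_content = (
    sample.assign(content=text)
    .groupby('author_id', sort=False)['content']
    .agg(''.join)
    .reset_index()
)
```

This avoids filtering the full frame once per author, which may matter with a few thousand authors.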

Since we will train models for a single language, the next step is to determine the proportion of each language used in this content.

## Language usage

ToDo