Latent Dirichlet Allocation
===

Setup
----

In [1]:
import pandas as pd

scopus = pd.read_csv("https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/scopus-abstracts.csv")

abstract = scopus['Abstract']

Problem description
---

One of the main problems in text mining is extracting the themes, or topics, that a document belongs to. For example, a news story could belong to both religion and economics at once (say, a scandal over the management of Vatican funds). Given a collection of documents, the goal is to extract the underlying topics they deal with.

Scikit-learn provides an implementation of Latent Dirichlet Allocation (LDA), which extracts the topics from a collection of documents. See https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

Use this method to extract the underlying topics in the article abstracts. Keep in mind that:

1. You must decide how to determine the appropriate number of topics.

2. You must remove stop words.

3. T-Lab suggests reducing the vocabulary to nouns, adjectives, verbs, and adverbs only. How could you do this in your code?

4. How could you check whether the number of topics is appropriate in terms of content (the words each topic contains and the themes it covers)?
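
One possible answer to point 3 is to tag each token with its part of speech and keep only the four content-word classes. The sketch below assumes Penn Treebank tags, where nouns start with `NN`, adjectives with `JJ`, verbs with `VB`, and adverbs with `RB`. The helper name `keep_content_words` and the hand-tagged sentence are hypothetical; in practice the tags could come from a tagger such as `nltk.pos_tag`, which needs its tagger data downloaded first.

```python
# Penn Treebank tag prefixes for nouns, adjectives, verbs and adverbs
CONTENT_TAG_PREFIXES = ("NN", "JJ", "VB", "RB")


def keep_content_words(tagged_tokens):
    """Keep the tokens whose tag marks a noun, adjective, verb or adverb."""
    return [
        token
        for token, tag in tagged_tokens
        if tag.startswith(CONTENT_TAG_PREFIXES)
    ]


# Hand-tagged example (hypothetical); a real pipeline would tag the tokens
# automatically before calling the filter.
tagged = [
    ("Mobility", "NN"), ("is", "VBZ"), ("one", "CD"), ("of", "IN"),
    ("the", "DT"), ("fundamental", "JJ"), ("requirements", "NNS"),
]
print(keep_content_words(tagged))
```

Auxiliary verbs such as "is" survive this filter because they are tagged as verbs, so stop-word removal is still needed afterwards.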


In [39]:
# Show the records stored in the Abstract column of the dataset

abstract

0       Mobility is one of the fundamental requirement...
1       The recent rise of the political extremism in ...
2       The power of the press to shape the informatio...
3       Identifying influential nodes in a network is ...
4       To complement traditional dietary surveys, whi...
                              ...                        
1897    In this article, we intend to show how useful ...
1898    In recent geographical information science lit...
1899    The fact that many decisions need a combinatio...
1900    The report from Woolmark Business Intelligence...
1901    Changing consumer lifestyles and increased tim...
Name: Abstract, Length: 1902, dtype: object

In [40]:
# Import the libraries used in this lab

import warnings
# init
warnings.filterwarnings("ignore")
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import numpy as np
#np.random.seed(2018)
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\acer\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [43]:
# Preprocessing function: lemmatize and stem every token, then drop
# stop words and tokens of 3 characters or fewer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocesing(row):
    # Strip periods and commas, then split on single spaces
    row = row.replace('.', '').replace(',', '')
    tokens = row.split(' ')

    # Lemmatize each token as a verb, then reduce it to its Porter stem
    stemmed = [stemmer.stem(lemmatizer.lemmatize(token, pos='v')) for token in tokens]

    # Filter on the stemmed form: drop stop words and short tokens
    return [token for token in stemmed
            if token not in STOPWORDS and len(token) > 3]

In [46]:
# Build a DataFrame with the original abstracts plus a column holding
# the output of the preprocessing function defined above.

new_df=pd.DataFrame()

new_df['Abstract'] = abstract
new_df['Abstract_preprocesed'] = new_df['Abstract'].map(preprocesing)

new_df

      Abstract                                           Abstract_preprocesed
0     Mobility is one of the fundamental requirement...  [mobil, fundament, requir, human, life, signif...
1     The recent rise of the political extremism in ...  [recent, rise, polit, extrem, western, countri...
2     The power of the press to shape the informatio...  [power, press, shape, inform, landscap, popul,...
3     Identifying influential nodes in a network is ...  [identifi, influenti, network, fundament, issu...
4     To complement traditional dietary surveys, whi...  [complement, tradit, dietari, survey, costli, ...
...   ...                                                ...
1897  In this article, we intend to show how useful ...  [articl, intend, exploratori, spatial, data, a...
1898  In recent geographical information science lit...  [recent, geograph, inform, scienc, literatur, ...
1899  The fact that many decisions need a combinatio...  [fact, mani, decis, need, combin, inform, sour...
1900  The report from Woolmark Business Intelligence...  [report, woolmark, busi, intellig, consid, rec...


In [47]:
abstract_preprocesed = new_df['Abstract_preprocesed']

# Build a dictionary that maps each distinct token to an integer id;
# it also records how many documents contain each token

dictionary = gensim.corpora.Dictionary(abstract_preprocesed)

count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break

0 2019
1 abil
2 adapt
3 appli
4 assess
5 author(s)
6 chang
7 climat
8 concept
9 data
10 defin


In [49]:
# Drop tokens that appear in fewer than 15 documents or in more than 50% of them,
# then keep at most the 100000 most frequent of the remaining tokens
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

# Convert each document to a bag of words: a list of (token_id, count) pairs
bow_corpus = [dictionary.doc2bow(doc) for doc in abstract_preprocesed]
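
To make the two steps above concrete, here is a plain-Python sketch of the same idea on a toy corpus. It is an illustration of document-frequency filtering and bag-of-words counting, not gensim's implementation; the helper names `filter_by_doc_freq` and `to_bow` and the toy documents are hypothetical.

```python
from collections import Counter


def filter_by_doc_freq(docs, no_below, no_above):
    """Tokens found in at least no_below docs and at most no_above * len(docs) docs."""
    # Count each token once per document, not once per occurrence
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {t for t, df in doc_freq.items() if no_below <= df <= max_docs}


def to_bow(doc, token2id):
    """Sorted (token_id, count) pairs for the tokens present in token2id."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


docs = [
    ["model", "data", "network"],
    ["model", "data", "market", "market"],
    ["model", "urban"],
    ["model", "market"],
]
kept = filter_by_doc_freq(docs, no_below=2, no_above=0.5)
token2id = {token: i for i, token in enumerate(sorted(kept))}
print(token2id)                    # {'data': 0, 'market': 1}
print(to_bow(docs[1], token2id))   # [(0, 1), (1, 2)]
```

Here "model" is dropped because it occurs in every document, while "network" and "urban" are dropped because they occur in only one.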


In [50]:
from gensim import corpora, models

# Weight each token by TF-IDF, which reflects how common or rare it is across bow_corpus.
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]

from pprint import pprint

for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.059093572623355545),
 (1, 0.0778919649050463),
 (2, 0.073195683314149),
 (3, 0.04423026501499936),
 (4, 0.05879918995997059),
 (5, 0.028534959984836576),
 (6, 0.10880116465250018),
 (7, 0.09726679319184532),
 (8, 0.0638259051392301),
 (9, 0.06589707868189897),
 (10, 0.05954203214876568),
 (11, 0.030386373441651435),
 (12, 0.22636094495787815),
 (13, 0.062424520789007495),
 (14, 0.09726679319184532),
 (15, 0.44571846847757157),
 (16, 0.4574043030158056),
 (17, 0.0662936478109321),
 (18, 0.0829235559884092),
 (19, 0.0877824950569692),
 (20, 0.0579371132887808),
 (21, 0.1750928394428846),
 (22, 0.17772638900405616),
 (23, 0.044713339443339405),
 (24, 0.07139865966285804),
 (25, 0.05724164760701321),
 (26, 0.0829235559884092),
 (27, 0.05939161230771521),
 (28, 0.09265142866266941),
 (29, 0.04047503315619716),
 (30, 0.08537369278140047),
 (31, 0.07090901234010802),
 (32, 0.060683308325971165),
 (33, 0.07957040339168775),
 (34, 0.3509639172486234),
 (35, 0.2694767744748783),
 (36, 0.0
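
The weights above can be read as term frequency times inverse document frequency, scaled so that each document vector has unit length. The sketch below reproduces that computation in plain Python under the assumption that the defaults are in effect (raw counts as term frequency, a base-2 logarithm for the inverse document frequency, and L2 normalization); the function name `tfidf` and the toy corpus are hypothetical, and the code is not gensim's own.

```python
import math


def tfidf(bow_corpus):
    """TF-IDF weights for a corpus given as lists of (token_id, count) pairs."""
    n_docs = len(bow_corpus)

    # Document frequency: in how many documents each token id occurs
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted = []
    for doc in bow_corpus:
        # tf * log2(N / df) for every token of the document
        vec = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in doc]
        # Scale to unit Euclidean length; an all-zero vector becomes empty
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted


toy = [[(0, 1), (1, 2)], [(0, 1), (2, 1)], [(0, 3)]]
for doc in tfidf(toy):
    print(doc)
```

Token 0 occurs in every toy document, so its weight is zero everywhere: a token that appears in all documents carries no information for telling them apart.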

In [36]:
# Fit a Latent Dirichlet Allocation model with 10 topics on the bag-of-words corpus
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [37]:
# Show each topic with its highest-weighted words
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.036*"model" + 0.011*"develop" + 0.009*"algorithm" + 0.009*"paper" + 0.009*"manag" + 0.008*"propos" + 0.007*"process" + 0.007*"result" + 0.006*"present" + 0.006*"studi"
Topic: 1 
Words: 0.014*"network" + 0.012*"propos" + 0.010*"urban" + 0.008*"result" + 0.008*"cluster" + 0.008*"model" + 0.008*"approach" + 0.008*"effect" + 0.007*"method" + 0.007*"studi"
Topic: 2 
Words: 0.018*"model" + 0.010*"product" + 0.009*"qualiti" + 0.008*"inform" + 0.008*"paper" + 0.008*"research" + 0.008*"method" + 0.007*"develop" + 0.007*"differ" + 0.007*"measur"
Topic: 3 
Words: 0.013*"social" + 0.011*"predict" + 0.009*"result" + 0.008*"differ" + 0.008*"approach" + 0.007*"model" + 0.007*"studi" + 0.007*"inform" + 0.007*"method" + 0.007*"propos"
Topic: 4 
Words: 0.012*"analysi" + 0.010*"paper" + 0.009*"perform" + 0.008*"studi" + 0.007*"model" + 0.007*"inform" + 0.007*"market" + 0.007*"approach" + 0.006*"research" + 0.006*"develop"
Topic: 5 
Words: 0.010*"network" + 0.009*"studi" + 0.009*"develo

In [38]:
# Try the model on one arbitrary record (index 315) to see which topics it is assigned to
for index, score in sorted(lda_model[bow_corpus[315]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.684433102607727	 
Topic: 0.017*"method" + 0.016*"algorithm" + 0.011*"propos" + 0.010*"cluster" + 0.010*"model" + 0.009*"studi" + 0.009*"process" + 0.008*"result" + 0.008*"paper" + 0.008*"dataset"

Score: 0.3034411072731018	 
Topic: 0.012*"result" + 0.011*"method" + 0.011*"algorithm" + 0.011*"network" + 0.011*"time" + 0.010*"propos" + 0.008*"paper" + 0.008*"base" + 0.008*"model" + 0.007*"perform"
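
The output shows that record 315 is a mixture: roughly 68% of one topic and 30% of another. A small helper can turn such a list of (topic_id, probability) pairs into a ranked list and pick the dominant topic. This is a sketch; the name `rank_topics` and the threshold are hypothetical, and the topic ids 7 and 2 are placeholders because the output above does not show the ids.

```python
def rank_topics(doc_topics, min_prob=0.0):
    """Sort (topic_id, probability) pairs by descending probability."""
    ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
    # Drop topics whose share of the document is below the threshold
    return [(tid, p) for tid, p in ranked if p >= min_prob]


# Probabilities taken from the output above; the topic ids are hypothetical
example = [(2, 0.3034411072731018), (7, 0.684433102607727)]
ranked = rank_topics(example, min_prob=0.1)
dominant_topic, dominant_prob = ranked[0]
print(dominant_topic, round(dominant_prob, 3))  # 7 0.684
```

In the notebook, the input would be `lda_model[bow_corpus[315]]`, which is the same list the loop above iterates over.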
