# Topic Modeling

## Introduction

Another technique for text analysis is **topic modeling**. Its goal is to find the 'topics' present in a corpus. It is used in search engines, customer-service automation, and similar applications.

Each document in the corpus is made up of at least one topic. In this notebook we do topic modeling with **Latent Dirichlet Allocation (LDA)**.
LDA is an unsupervised learning method that works on a bag-of-words representation of the documents. With it we can, among other things, uncover hidden topics and classify documents by the topics found.

https://es.wikipedia.org/wiki/Latent_Dirichlet_Allocation  
https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2

To run topic modeling we need:
* A document-term matrix (the corpus)
* The number of topics we want the model to find.
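
To make the first input concrete, here is a minimal sketch of a document-term matrix built with only the standard library. The three toy documents and the helper name `build_dtm` are assumptions for illustration; the notebook itself loads a matrix that was produced in an earlier step.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocabulary, rows) where rows[name][j] counts vocabulary[j] in that document."""
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    vocabulary = sorted(set().union(*counts.values()))
    rows = {name: [c[term] for term in vocabulary] for name, c in counts.items()}
    return vocabulary, rows

# Hypothetical toy corpus: one row per document, one column per term.
docs = {
    "doc_a": "dog barks dog",
    "doc_b": "cat sleeps",
    "doc_c": "dog chases cat",
}
vocabulary, dtm = build_dtm(docs)
print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```

Each row is a document and each column a term, which is the same layout as the `dtm_stop.pkl` table loaded further down.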

Once topic modeling has been applied, the results have to be interpreted to see whether they make sense. If they do not, we can change the number of topics, the terms in the document-term matrix, the model parameters, or even try a different model.

## Topic Modeling - Attempt #1 (All Text)

In [1]:
# Install gensim, which provides the LDA modules
#!pip install gensim

In [1]:
# Load the document-term matrix generated previously
import pandas as pd
import pickle
import numpy as np

datos = pd.read_pickle('dtm_stop.pkl')
datos

Unnamed: 0,aaaaah,aaaaahhhhhhh,aaaaauuugghhhhhh,aaaahhhhh,aaah,aah,abc,abcs,ability,abject,...,zee,zen,zeppelin,zero,zillion,zombie,zombies,zoning,zoo,éclair
ali,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
anthony,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
bill,1,0,0,0,0,0,0,1,0,0,...,0,0,0,1,1,1,1,1,0,0
bo,0,1,1,1,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
dave,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
hasan,0,0,0,0,0,0,0,0,0,0,...,2,1,0,1,0,0,0,0,0,0
jim,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
joe,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
john,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
louis,0,0,0,0,0,3,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0


In [2]:
from gensim import matutils, models
import scipy.sparse


In [3]:
# LDA in gensim needs the term-document matrix, i.e. the transposed document-term matrix
tdm = datos.transpose()
tdm.head()

Unnamed: 0,ali,anthony,bill,bo,dave,hasan,jim,joe,john,louis,mike,ricky
aaaaah,0,0,1,0,0,0,0,0,0,0,0,0
aaaaahhhhhhh,0,0,0,1,0,0,0,0,0,0,0,0
aaaaauuugghhhhhh,0,0,0,1,0,0,0,0,0,0,0,0
aaaahhhhh,0,0,0,1,0,0,0,0,0,0,0,0
aaah,0,0,0,0,1,0,0,0,0,0,0,0


In [4]:
# Convert the matrix to gensim format
# Required steps: DataFrame --> sparse matrix --> gensim corpus
matriz_dispersa = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(matriz_dispersa)
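
For intuition about what this conversion produces: a gensim corpus is a sequence of documents, each stored as a list of `(term_id, count)` pairs that holds only the non-zero entries. The sketch below mimics that format in plain Python from a small term-document matrix (terms as rows, documents as columns). The matrix values and the helper `to_bow` are made up for illustration and are not gensim's implementation.

```python
def to_bow(term_document_matrix):
    """Turn a terms-by-documents count matrix into one list of (term_id, count) per document."""
    n_docs = len(term_document_matrix[0])
    corpus = []
    for doc in range(n_docs):
        # Keep only the terms that actually occur in this document.
        bow = [(term_id, row[doc])
               for term_id, row in enumerate(term_document_matrix)
               if row[doc] > 0]
        corpus.append(bow)
    return corpus

# Hypothetical counts: 4 terms (rows) x 3 documents (columns).
tdm_toy = [
    [0, 1, 0],
    [2, 0, 0],
    [0, 0, 3],
    [1, 1, 0],
]
corpus_toy = to_bow(tdm_toy)
print(corpus_toy)
```

Reading the matrix column by column is why the document-term matrix is transposed before the conversion.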

In [5]:
# gensim needs a dictionary that maps each term's position in the corpus to the term itself.
# Load the CountVectorizer saved in script 2
with open("cv_stop.pkl", "rb") as f:
    cv = pickle.load(f)
id2word = dict((v, k) for k, v in cv.vocabulary_.items())

In [6]:
id2word

{3681: 'ladies',
 2751: 'gentlemen',
 7249: 'welcome',
 6272: 'stage',
 157: 'ali',
 7356: 'wong',
 3096: 'hi',
 3075: 'hello',
 6661: 'thank',
 1355: 'coming',
 5910: 'shit',
 1042: 'cause',
 4796: 'pee',
 4202: 'minutes',
 2271: 'everybody',
 6974: 'um',
 2295: 'exciting',
 1707: 'day',
 7422: 'year',
 6924: 'turned',
 7431: 'yes',
 283: 'appreciate',
 6970: 'uh',
 6617: 'tell',
 2765: 'getting',
 4565: 'older',
 2781: 'girl',
 412: 'automatic',
 6695: 'thought',
 2665: 'fuck',
 6375: 'straight',
 3492: 'jealous',
 2597: 'foremost',
 4147: 'metabolism',
 2784: 'girls',
 2102: 'eat',
 6021: 'sixpack',
 6665: 'thatthat',
 551: 'beautiful',
 3365: 'inner',
 6683: 'thigh',
 1245: 'clearance',
 2429: 'feet',
 6677: 'theres',
 3228: 'huge',
 2708: 'gap',
 3816: 'light',
 5032: 'potential',
 5261: 'radiating',
 6712: 'throughand',
 6055: 'sleep',
 3378: 'insomnia',
 195: 'ambien',
 1992: 'download',
 4103: 'meditation',
 4520: 'oasis',
 4959: 'podcast',
 949: 'calm',
 1120: 'chatter',
 5394
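
`CountVectorizer.vocabulary_` maps each term to its column index, whereas gensim wants the reverse direction, index to term. A minimal sketch of that inversion, using three entries copied from the output above (the `_toy` names are hypothetical):

```python
# Term -> column index, in the shape of CountVectorizer.vocabulary_.
vocabulary_toy = {"ladies": 3681, "gentlemen": 2751, "welcome": 7249}

# Invert to column index -> term, the direction used for id2word.
id2word_toy = {index: term for term, index in vocabulary_toy.items()}

print(id2word_toy[3681])
```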

We now have the corpus and the position-to-word dictionary. We need to specify two more parameters:
- The number of topics, and
- The number of training passes.

We start with 2 topics and see whether the result makes sense.

In [7]:
np.random.seed(222)
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

[(0,
  '0.006*"say" + 0.005*"fuck" + 0.005*"theyre" + 0.004*"really" + 0.004*"life" + 0.004*"good" + 0.004*"cause" + 0.004*"going" + 0.004*"fucking" + 0.004*"love"'),
 (1,
  '0.010*"fucking" + 0.008*"shit" + 0.006*"fuck" + 0.006*"went" + 0.005*"hes" + 0.005*"didnt" + 0.005*"going" + 0.005*"theyre" + 0.005*"day" + 0.005*"say"')]

In [8]:
# LDA for num_topics = 3
np.random.seed(222)
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.007*"cause" + 0.006*"really" + 0.005*"says" + 0.005*"thing" + 0.005*"mean" + 0.005*"life" + 0.005*"say" + 0.004*"way" + 0.004*"good" + 0.004*"didnt"'),
 (1,
  '0.007*"shit" + 0.005*"ok" + 0.005*"lot" + 0.004*"gotta" + 0.004*"wanna" + 0.004*"husband" + 0.004*"cause" + 0.003*"day" + 0.003*"women" + 0.003*"pregnant"'),
 (2,
  '0.011*"fucking" + 0.007*"fuck" + 0.007*"shit" + 0.006*"say" + 0.006*"going" + 0.006*"want" + 0.005*"theyre" + 0.005*"didnt" + 0.005*"hes" + 0.005*"went"')]

In [9]:
# LDA for num_topics = 4
np.random.seed(222)
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.006*"cause" + 0.006*"life" + 0.006*"thing" + 0.005*"hes" + 0.005*"really" + 0.005*"little" + 0.005*"old" + 0.004*"say" + 0.004*"way" + 0.004*"good"'),
 (1,
  '0.006*"love" + 0.006*"shit" + 0.005*"ok" + 0.005*"want" + 0.005*"stuff" + 0.005*"bo" + 0.004*"repeat" + 0.004*"fucking" + 0.004*"hes" + 0.004*"lot"'),
 (2,
  '0.010*"fucking" + 0.007*"fuck" + 0.006*"shit" + 0.006*"say" + 0.006*"going" + 0.005*"theyre" + 0.005*"didnt" + 0.005*"want" + 0.005*"went" + 0.005*"hes"'),
 (3,
  '0.001*"fucking" + 0.000*"fuck" + 0.000*"say" + 0.000*"shit" + 0.000*"didnt" + 0.000*"hes" + 0.000*"theyre" + 0.000*"good" + 0.000*"going" + 0.000*"really"')]

What we get for each topic is the probability of each word appearing in that topic.
The results are poor, though. We have tried to improve them by changing the parameters; now let's try changing the terms used.
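
One way to quantify how weak these topics are is to count how many top words two topics share. The sketch below parses the two topic strings printed for `num_topics=2` and computes their Jaccard overlap. The helper `top_words` and the choice of Jaccard similarity are assumptions for illustration, not part of the original analysis.

```python
import re

def top_words(topic_string):
    """Extract the quoted words from a print_topics() string."""
    return set(re.findall(r'"([^"]+)"', topic_string))

# Topic strings copied from the num_topics=2 output above.
topic_0 = ('0.006*"say" + 0.005*"fuck" + 0.005*"theyre" + 0.004*"really" + 0.004*"life" + '
           '0.004*"good" + 0.004*"cause" + 0.004*"going" + 0.004*"fucking" + 0.004*"love"')
topic_1 = ('0.010*"fucking" + 0.008*"shit" + 0.006*"fuck" + 0.006*"went" + 0.005*"hes" + '
           '0.005*"didnt" + 0.005*"going" + 0.005*"theyre" + 0.005*"day" + 0.005*"say"')

shared = top_words(topic_0) & top_words(topic_1)
jaccard = len(shared) / len(top_words(topic_0) | top_words(topic_1))
print(sorted(shared), round(jaccard, 2))
```

Half of each topic's top ten words also appear in the other topic, which is consistent with the two topics being hard to tell apart.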

## Topic Modeling - Attempt #2 (Nouns Only)

A common trick is to use only nouns, only adjectives, and so on.
The Penn Treebank tag list gives the tags needed to filter for nouns: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
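
In the Penn Treebank tag set, the noun tags all start with `NN` (`NN`, `NNS`, `NNP`, `NNPS`), so a prefix check is enough to select them. The sketch below applies that check to a hand-tagged sentence, so it runs without downloading any NLTK data. The sentence, its tags, and the helper `keep_by_tag_prefix` are assumptions written by hand, not output from `pos_tag`.

```python
def keep_by_tag_prefix(tagged_tokens, prefixes=("NN",)):
    """Keep the words whose POS tag starts with one of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefixes)]

# Hand-written (word, tag) pairs in Penn Treebank notation; hypothetical example.
tagged = [
    ("the", "DT"), ("old", "JJ"), ("dogs", "NNS"), ("chased", "VBD"),
    ("a", "DT"), ("cat", "NN"), ("in", "IN"), ("Chicago", "NNP"),
]
nouns = keep_by_tag_prefix(tagged)
print(" ".join(nouns))
```

Passing `prefixes=("NN", "JJ")` would keep adjectives as well, since adjective tags start with `JJ`.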

In [10]:
# Function to extract the nouns from a text
from nltk import word_tokenize, pos_tag

def sustantivos(texto):
    '''Tokenize the given string and return only its nouns.'''
    # Penn Treebank noun tags (NN, NNS, NNP, NNPS) all start with 'NN'.
    es_sustantivo = lambda pos: pos[:2] == 'NN'

    tokenizado = word_tokenize(texto)
    todo_sustantivos = [palabra for (palabra, pos) in pos_tag(tokenizado) if es_sustantivo(pos)]
    return ' '.join(todo_sustantivos)

In [11]:
# Load the cleaned data generated previously
datos_limpios = pd.read_pickle('datos_limpios.pkl')
datos_limpios

Unnamed: 0,transcripcion
ali,ladies and gentlemen please welcome to the sta...
anthony,thank you thank you thank you san francisco th...
bill,all right thank you thank you very much thank...
bo,bo what old macdonald had a farm e i e i o and...
dave,this is dave he tells dirty jokes for a living...
hasan,whats up davis whats up im home i had to bri...
jim,ladies and gentlemen please welcome to the ...
joe,ladies and gentlemen welcome joe rogan wha...
john,all right petunia wish me luck out there you w...
louis,introfade the music out lets roll hold there l...


In [12]:
# Download the POS tagger model that pos_tag needs (word_tokenize also requires the 'punkt' tokenizer data)
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\german\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [15]:
# Extract the nouns
datos_sustantivos = pd.DataFrame(datos_limpios['transcripcion'].apply(sustantivos))
datos_sustantivos

Unnamed: 0,transcripcion
ali,ladies gentlemen stage ali hi thank hello na s...
anthony,thank thank people i em i francisco city world...
bill,thank thank pleasure georgia area oasis i june...
bo,macdonald farm e i o farm pig e i i snort macd...
dave,jokes living stare work profound train thought...
hasan,whats davis whats home i netflix la york i son...
jim,ladies gentlemen stage mr jim jefferies thank ...
joe,ladies gentlemen joe fuck thanks phone fuckfac...
john,petunia thats hello hello chicago thank crowd ...
louis,music lets lights lights thank i i place place...


In [16]:
# Build a new corpus with nouns only
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Remove the stop words again, since we are building a new corpus
add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                  'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said', 
                  'aaaaahhhhhhh', 'aaaaauuugghhhhhh', 'aaaahhhhh', 'aah', 'aaaaah']
# Newer scikit-learn releases expect stop_words as a list rather than a frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# Document-term matrix with nouns only
cvs = CountVectorizer(stop_words=stop_words)
datos_cvs = cvs.fit_transform(datos_sustantivos['transcripcion'])
datos_dtms = pd.DataFrame(datos_cvs.toarray(), columns=cvs.get_feature_names_out())
datos_dtms.index = datos_sustantivos.index
datos_dtms

Unnamed: 0,abc,abcs,ability,abortion,abortions,abuse,acc,accent,accents,acceptance,...,yummy,ze,zealand,zee,zeppelin,zillion,zombie,zombies,zoo,éclair
ali,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
anthony,0,0,0,2,0,0,0,1,0,0,...,0,0,10,0,0,0,0,0,0,0
bill,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,1,1,0,0
bo,0,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
dave,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
hasan,0,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
jim,0,0,0,0,0,0,0,4,0,0,...,0,0,0,0,0,0,0,0,0,0
joe,0,0,0,0,0,1,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
john,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
louis,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
# Build the gensim corpus
corpuss = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(datos_dtms.transpose()))

# Build the vocabulary dictionary
id2words = dict((v, k) for k, v in cvs.vocabulary_.items())

In [18]:
# Start with 2 topics
np.random.seed(222)
ldas = models.LdaModel(corpus=corpuss, num_topics=2, id2word=id2words, passes=10)
ldas.print_topics()

[(0,
  '0.007*"way" + 0.007*"man" + 0.007*"shit" + 0.007*"dad" + 0.007*"day" + 0.007*"fuck" + 0.006*"cause" + 0.006*"life" + 0.006*"house" + 0.006*"thing"'),
 (1,
  '0.011*"thing" + 0.010*"day" + 0.009*"hes" + 0.008*"life" + 0.008*"shit" + 0.007*"man" + 0.006*"cause" + 0.006*"guy" + 0.006*"gon" + 0.006*"women"')]

In [19]:
# topics = 3
np.random.seed(222)
ldas = models.LdaModel(corpus=corpuss, num_topics=3, id2word=id2words, passes=10)
ldas.print_topics()

[(0,
  '0.010*"dad" + 0.007*"day" + 0.007*"life" + 0.006*"joke" + 0.006*"school" + 0.006*"mom" + 0.006*"stuff" + 0.005*"man" + 0.005*"way" + 0.005*"shes"'),
 (1,
  '0.013*"shit" + 0.011*"life" + 0.010*"thing" + 0.009*"hes" + 0.008*"gon" + 0.008*"cause" + 0.008*"guy" + 0.007*"day" + 0.007*"dude" + 0.007*"lot"'),
 (2,
  '0.009*"day" + 0.009*"thing" + 0.008*"man" + 0.008*"cause" + 0.007*"way" + 0.007*"fuck" + 0.006*"things" + 0.006*"hes" + 0.006*"shit" + 0.006*"guy"')]

In [20]:
# topics = 4
np.random.seed(222)
ldas = models.LdaModel(corpus=corpuss, num_topics=4, id2word=id2words, passes=10)
ldas.print_topics()

[(0,
  '0.008*"dad" + 0.008*"life" + 0.008*"fuck" + 0.008*"house" + 0.007*"cause" + 0.007*"man" + 0.007*"way" + 0.007*"shes" + 0.006*"kind" + 0.006*"girl"'),
 (1,
  '0.008*"stuff" + 0.008*"shit" + 0.008*"bo" + 0.007*"lot" + 0.007*"repeat" + 0.007*"man" + 0.007*"day" + 0.006*"eye" + 0.006*"god" + 0.006*"contact"'),
 (2,
  '0.012*"thing" + 0.011*"day" + 0.009*"shit" + 0.008*"hes" + 0.008*"life" + 0.008*"man" + 0.007*"guy" + 0.007*"way" + 0.007*"cause" + 0.006*"fuck"'),
 (3,
  '0.001*"life" + 0.001*"day" + 0.001*"lot" + 0.001*"dad" + 0.001*"man" + 0.001*"hes" + 0.001*"shes" + 0.001*"shit" + 0.000*"cause" + 0.000*"way"')]

In [21]:
# topics = 5
np.random.seed(222)
ldas = models.LdaModel(corpus=corpuss, num_topics=5, id2word=id2words, passes=10)
ldas.print_topics()

[(0,
  '0.014*"fuck" + 0.011*"man" + 0.010*"house" + 0.010*"shit" + 0.009*"kids" + 0.008*"life" + 0.008*"theyre" + 0.007*"gon" + 0.007*"cause" + 0.007*"things"'),
 (1,
  '0.008*"joke" + 0.008*"stuff" + 0.008*"hes" + 0.008*"thing" + 0.007*"day" + 0.007*"bo" + 0.007*"man" + 0.007*"years" + 0.006*"repeat" + 0.006*"id"'),
 (2,
  '0.012*"day" + 0.011*"thing" + 0.009*"cause" + 0.008*"guy" + 0.008*"way" + 0.006*"shit" + 0.006*"gon" + 0.006*"hes" + 0.006*"man" + 0.006*"life"'),
 (3,
  '0.016*"dad" + 0.009*"life" + 0.009*"shes" + 0.008*"mom" + 0.007*"school" + 0.007*"parents" + 0.006*"girl" + 0.006*"home" + 0.006*"hes" + 0.005*"house"'),
 (4,
  '0.015*"shit" + 0.011*"lot" + 0.010*"life" + 0.009*"man" + 0.009*"hes" + 0.009*"thing" + 0.009*"women" + 0.008*"cause" + 0.007*"fuck" + 0.006*"guy"')]

## Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [22]:
# Function to extract nouns and adjectives
def sust_adj(texto):
    '''Tokenize the given text and return only its nouns and adjectives.'''
    es_sust_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenizado = word_tokenize(texto)
    todo_sust_adj = [palabra for (palabra, pos) in pos_tag(tokenizado) if es_sust_adj(pos)] 
    return ' '.join(todo_sust_adj)

In [23]:
# Apply the function to the cleaned data
datos_sust_adj = pd.DataFrame(datos_limpios['transcripcion'].apply(sust_adj))
datos_sust_adj

Unnamed: 0,transcripcion
ali,ladies gentlemen welcome stage ali wong hi wel...
anthony,thank san francisco thank good people surprise...
bill,right thank thank pleasure greater atlanta geo...
bo,old macdonald farm e i i o farm pig e i i snor...
dave,dirty jokes living stare most hard work profou...
hasan,whats davis whats im home i netflix special la...
jim,ladies gentlemen welcome stage mr jim jefferie...
joe,ladies gentlemen joe fuck san francisco thanks...
john,right petunia august thats good right hello he...
louis,music lets lights lights thank much i i i nice...


In [26]:
# Build the new corpus, now with nouns and adjectives only.
# Also drop terms that appear in more than 80% of the documents (max_df=.8)
cvna = CountVectorizer(stop_words=stop_words, max_df=.8)
datos_cvna = cvna.fit_transform(datos_sust_adj['transcripcion'])
datos_dtmna = pd.DataFrame(datos_cvna.toarray(), columns=cvna.get_feature_names_out())
datos_dtmna.index = datos_sust_adj.index
datos_dtmna

Unnamed: 0,abc,abcs,ability,abject,able,ablebodied,abortion,abortions,absolute,abuse,...,ze,zealand,zee,zeppelin,zero,zillion,zombie,zombies,zoo,éclair
ali,1,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
anthony,0,0,0,0,0,0,2,0,0,0,...,0,10,0,0,0,0,0,0,0,0
bill,0,1,0,0,1,0,0,0,1,0,...,1,0,0,0,0,1,1,1,0,0
bo,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
dave,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
hasan,0,0,0,0,1,0,0,0,0,0,...,0,0,2,0,0,0,0,0,0,0
jim,0,0,0,0,1,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
joe,0,0,0,0,2,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
john,0,0,0,0,3,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
louis,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
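
`max_df=.8` tells `CountVectorizer` to ignore terms whose document frequency is above 80%, that is, terms present in more than 80% of the transcripts. The plain-Python sketch below reproduces that rule on five made-up documents. `terms_above_max_df` is a hypothetical helper, not a scikit-learn function.

```python
def terms_above_max_df(docs, max_df=0.8):
    """Return the terms that occur in more than max_df (a fraction) of the documents."""
    doc_freq = {}
    for text in docs:
        # Count each term once per document, however often it repeats.
        for term in set(text.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    limit = max_df * len(docs)
    return sorted(term for term, df in doc_freq.items() if df > limit)

# Hypothetical toy documents.
docs_toy = [
    "man dad joke",
    "man mom gun",
    "man dad school",
    "man dad dog",
    "man dad york",
]
print(terms_above_max_df(docs_toy))
```

Here `man` occurs in all five documents and is dropped, while `dad` occurs in four of five (exactly 80%) and is kept, because the threshold is strict.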


In [27]:
# Build the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(datos_dtmna.transpose()))

# Vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [28]:
# topics = 2
np.random.seed(2)
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.005*"parents" + 0.005*"mom" + 0.003*"bo" + 0.003*"hasan" + 0.003*"jenny" + 0.003*"clinton" + 0.003*"york" + 0.003*"comedy" + 0.003*"friend" + 0.003*"repeat"'),
 (1,
  '0.005*"joke" + 0.003*"ass" + 0.003*"gun" + 0.003*"son" + 0.003*"jokes" + 0.002*"guns" + 0.002*"mad" + 0.002*"mom" + 0.002*"dead" + 0.002*"door"')]

In [29]:
# topics = 3
np.random.seed(222)
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.005*"joke" + 0.005*"mom" + 0.004*"hasan" + 0.004*"mad" + 0.003*"parents" + 0.003*"ahah" + 0.003*"son" + 0.003*"gun" + 0.003*"anthony" + 0.003*"door"'),
 (1,
  '0.004*"joke" + 0.004*"parents" + 0.003*"bo" + 0.003*"jenny" + 0.003*"clinton" + 0.003*"repeat" + 0.003*"eye" + 0.003*"dog" + 0.003*"friend" + 0.003*"mom"'),
 (2,
  '0.006*"ass" + 0.004*"guns" + 0.003*"husband" + 0.003*"ok" + 0.003*"pregnant" + 0.003*"dick" + 0.003*"business" + 0.003*"girlfriend" + 0.003*"class" + 0.003*"mom"')]

In [30]:
# topics = 4
np.random.seed(222)
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.007*"joke" + 0.005*"anthony" + 0.004*"gun" + 0.004*"religion" + 0.004*"stupid" + 0.004*"mom" + 0.004*"mad" + 0.003*"jokes" + 0.003*"dick" + 0.003*"grandma"'),
 (1,
  '0.005*"joke" + 0.004*"bo" + 0.004*"clinton" + 0.004*"jenny" + 0.004*"parents" + 0.004*"repeat" + 0.003*"friend" + 0.003*"mom" + 0.003*"eye" + 0.003*"comedy"'),
 (2,
  '0.005*"ass" + 0.005*"ahah" + 0.005*"son" + 0.005*"guns" + 0.004*"husband" + 0.004*"gay" + 0.003*"asian" + 0.003*"nigga" + 0.003*"business" + 0.003*"ok"'),
 (3,
  '0.007*"hasan" + 0.006*"parents" + 0.005*"mom" + 0.005*"tit" + 0.004*"brown" + 0.004*"york" + 0.004*"date" + 0.003*"ok" + 0.003*"bike" + 0.003*"ha"')]

## Identifying the Topics in Each Document

Of the 10 topic models we fitted, the one that seems to make the most sense (!) is the 4-topic model from the nouns-and-adjectives attempt. We now refine it by running more passes.

In [32]:
# Final LDA model (for now)
np.random.seed(222)
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=180)
ldana.print_topics()

[(0,
  '0.007*"joke" + 0.005*"anthony" + 0.004*"gun" + 0.004*"stupid" + 0.004*"religion" + 0.004*"mad" + 0.004*"mom" + 0.003*"dick" + 0.003*"jokes" + 0.003*"anybody"'),
 (1,
  '0.005*"joke" + 0.004*"bo" + 0.004*"jenny" + 0.004*"clinton" + 0.004*"parents" + 0.004*"repeat" + 0.003*"friend" + 0.003*"eye" + 0.003*"mom" + 0.003*"contact"'),
 (2,
  '0.005*"ass" + 0.005*"ahah" + 0.005*"son" + 0.005*"guns" + 0.004*"husband" + 0.004*"gay" + 0.003*"business" + 0.003*"nigga" + 0.003*"asian" + 0.003*"ok"'),
 (3,
  '0.007*"hasan" + 0.006*"parents" + 0.005*"mom" + 0.005*"tit" + 0.004*"brown" + 0.004*"york" + 0.004*"date" + 0.003*"bike" + 0.003*"ok" + 0.003*"birthday"')]

The next step is to label these topics.
They could be:
* 0: Jokes
* 1: Insults and religion
* 2: Family
* 3: ?

In [29]:
# Check which topic each transcript contains
corpus_transformado = ldana[corpusna]
list(zip([a for [(a,b)] in corpus_transformado], datos_dtmna.index))

[(3, 'ali'),
 (2, 'anthony'),
 (0, 'bill'),
 (3, 'bo'),
 (0, 'dave'),
 (2, 'hasan'),
 (2, 'jim'),
 (1, 'joe'),
 (2, 'john'),
 (3, 'louis'),
 (2, 'mike'),
 (0, 'ricky')]
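
The list comprehension in this cell unpacks exactly one `(topic, probability)` pair per document, so it would raise an error if the model assigned a document to several topics. A more defensive sketch picks the highest-probability topic instead. The per-document distributions below are invented for illustration and are not output from `ldana`.

```python
def dominant_topic(topic_distribution):
    """Return the topic id with the highest probability from [(topic_id, prob), ...]."""
    topic_id, _ = max(topic_distribution, key=lambda pair: pair[1])
    return topic_id

# Hypothetical per-document topic distributions as lists of (topic_id, probability).
transformed_toy = {
    "doc_a": [(3, 0.99)],
    "doc_b": [(0, 0.35), (2, 0.65)],
    "doc_c": [(1, 0.50), (2, 0.20), (3, 0.30)],
}
assignments = [(dominant_topic(dist), name) for name, dist in transformed_toy.items()]
print(assignments)
```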

## Exercises

1. Try changing the parameters to get better results.
2. Build a new topic model that includes terms from a different part of speech (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) and check whether the resulting topics are more representative.