# Per-tweet topic detection in threads

## Importing libraries and defining functions

The required libraries are imported, and functions are defined to tokenize, lemmatize, and prepare the text for LDA.

During tokenization, hashtags and user mentions are removed, along with URLs and whitespace-only tokens.

In [1]:
import os

In [69]:
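import spacy
from spacy.lang.en import English

# A blank English pipeline is enough here, since only the tokenizer is used.
# The original call spacy.load('en') discarded its result, and the 'en'
# shortcut is no longer available in spaCy 3.
parser = English()

def tokenize(text):
    """Return lowercased tokens, skipping whitespace, URLs, hashtags and mentions."""
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            continue  # whitespace-only token
        elif token.like_url:
            continue  # URL
        elif token.orth_.startswith('#'):
            continue  # hashtag
        elif token.orth_.startswith('@'):
            continue  # user mention
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens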

import nltk

nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def get_lemma(word):
    """Return the WordNet base form of word, or word itself if none is found."""
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

from nltk.stem.wordnet import WordNetLemmatizer

def get_lemma2(word):
    """Alternative lemmatizer; not used by prepare_text_for_lda below."""
    return WordNetLemmatizer().lemmatize(word)

nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

def prepare_text_for_lda(text):
    """Tokenize, drop tokens of 3 characters or fewer and stop words, then lemmatize."""
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > 3]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\carlo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\carlo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
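
The filtering rules used by `tokenize` can be tried without installing spaCy. The following is a minimal sketch that assumes plain whitespace splitting instead of spaCy's tokenizer, so its output will not match `tokenize` exactly (punctuation stays attached to words, for example). The name `simple_tokenize` is hypothetical and is not part of the notebook.

```python
def simple_tokenize(text):
    """Simplified stand-in for tokenize(): whitespace split plus the same filters."""
    kept = []
    for raw in text.split():
        # Skip URLs (only http/https links are detected in this sketch)
        if raw.startswith(("http://", "https://")):
            continue
        # Skip hashtags and user mentions
        if raw.startswith("#") or raw.startswith("@"):
            continue
        kept.append(raw.lower())
    return kept


print(simple_tokenize("Great point @user about #Brexit costs https://t.co/abc today"))
```

Because the split is on whitespace only, this sketch is useful for checking the filter logic, not as a replacement for the spaCy-based function.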


## Importing the files

The CSV files are loaded and the tweets are grouped by thread, so that a dictionary of tweets per thread can then be built (thread 1: tweet1, tweet2, ...).

In [2]:
import random
import pandas as pd


In [3]:

csv1 = pd.read_csv('five_ten.csv', encoding='iso-8859-1')
csv1_grouped_by_thread = csv1.groupby(['thread_number'])
threads1 = {}
documentos1 = []

csv2 = pd.read_csv('ten_fifteen.csv', encoding='iso-8859-1')
csv2_grouped_by_thread = csv2.groupby(['thread_number'])
threads2 = {}
documentos2 = []

csv3 = pd.read_csv('fifteen_twenty.csv', encoding='iso-8859-1')
csv3_grouped_by_thread = csv3.groupby(['thread_number'])
threads3 = {}
documentos3 = []

csv4 = pd.read_csv('twenty_twentyfive.csv', encoding='iso-8859-1')
csv4_grouped_by_thread = csv4.groupby(['thread_number'])
threads4 = {}
documentos4 = []

csv5 = pd.read_csv('twentyfive_thirty.csv', encoding='iso-8859-1')
csv5_grouped_by_thread = csv5.groupby(['thread_number'])
threads5 = {}
documentos5 = []
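
The files are read with `encoding='iso-8859-1'`, and the thread listings further down show sequences such as `Â£`, which typically appear when UTF-8 bytes are decoded as ISO-8859-1. The following is a minimal sketch of a repair helper under that assumption; whether the CSV files are really UTF-8 is not confirmed here, and bytes such as `\x92` in later output suggest the encodings may be mixed. The name `fix_mojibake` is hypothetical.

```python
def fix_mojibake(text):
    """Try to undo UTF-8 text that was decoded as ISO-8859-1."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except UnicodeError:
        # Not a clean UTF-8 round trip: keep the text unchanged
        return text


print(fix_mojibake("between Â£17 and Â£20bn"))
```

Text that does not round-trip cleanly is returned as it was, so the helper can be applied to every tweet without raising.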

## Building the per-thread tweet dictionaries

The tweets of each thread are grouped into a dictionary, with one dictionary per file.

In [4]:
for thread, data in dict(list(csv1_grouped_by_thread)).items():
    threads1[thread] = list(data['text'])
    


for thread, data in dict(list(csv2_grouped_by_thread)).items():
    threads2[thread] = list(data['text'])
    


for thread, data in dict(list(csv3_grouped_by_thread)).items():
    threads3[thread] = list(data['text'])
    


for thread, data in dict(list(csv4_grouped_by_thread)).items():
    threads4[thread] = list(data['text'])
    


for thread, data in dict(list(csv5_grouped_by_thread)).items():
    threads5[thread] = list(data['text'])
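
The five loops above all build the same structure: a mapping from thread label to the list of its tweets. A minimal, pandas-free sketch of that grouping follows, assuming each row is a dict with `thread_number` and `text` keys (the column names used above). The helper `group_texts_by_thread` and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_texts_by_thread(rows):
    """Map each thread label to the list of its tweet texts, in input order."""
    threads = defaultdict(list)
    for row in rows:
        threads[row["thread_number"]].append(row["text"])
    return dict(threads)


# Made-up rows standing in for the records of one CSV file
rows = [
    {"thread_number": "Thread 1", "text": "first tweet"},
    {"thread_number": "Thread 2", "text": "another thread"},
    {"thread_number": "Thread 1", "text": "second tweet"},
]
print(group_texts_by_thread(rows))
```

Wrapping the grouping in one function would let the same call be reused for each of the five files instead of repeating the loop.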
    



In [5]:
threads1

{'Thread 1': ['Extraordinary evidence at Treasury committee from Jon Thompson, CEO of HMRC on customs and Brexit today https://t.co/DJhIQhmVwJ',
  "The Brexiter favourite Max Fac - would cost business between Â£17 and Â£20bn a year\r\r\n\r\r\n- that's almost 1% of GDP\r\r\n\r\r\n- jusâ?¦ https://t.co/0MwIcwre4t",
  'How does he arrive at the figure\r\r\n\r\r\n200m export consignments at an average cost of Â£32.50 each = Â£6.5bn (times two beâ?¦ https://t.co/KxnkU2QiVO',
  "Theresa May's New Customs Partnership is much cheaper for business (almost zero cost)  because it seeks to replicatâ?¦ https://t.co/0LcsJHah0H",
  'Mr Thompson said he did not expect the EU to reciprocate over the customs partnership. \r\r\n\r\r\nWhat that means is UK collâ?¦ https://t.co/9c3uhhnZGX',
  'Both would not be ready by 2021. Max Fac needs 3 years. Customs Partnership requires 5, Mr Thompson said.\r\r\n\r\r\nThe bordâ?¦ https://t.co/luLzgUsiR4',
  '"We think we can manage the risk - we think we can" he sai

## LDA for each thread of each CSV

The number of topics to detect is defined, together with the number of words shown when the detected topics are printed.

Topic detection is run on every thread of every CSV file, so each tweet in a thread is treated as one document.

In [74]:
import gensim
from gensim import corpora
NUM_TOPICS = 5
NUM_WORDS = 5
import pickle
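
Before training, each tokenized tweet is turned into a bag of words, that is, pairs of token id and count. This is what `corpora.Dictionary` and `doc2bow` provide in the cells below. The following pure-Python sketch illustrates the idea only; `build_vocabulary` and `to_bag_of_words` are hypothetical names, and the ids gensim assigns may differ from the first-seen order assumed here.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id


def to_bag_of_words(doc, token_to_id):
    """Return sorted (token_id, count) pairs for the known tokens in doc."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())


docs = [["customs", "partnership", "cost"], ["cost", "business", "cost"]]
vocab = build_vocabulary(docs)
print([to_bag_of_words(doc, vocab) for doc in docs])
```

With very short documents such as single tweets, most counts are 1, which helps explain why many words within a topic share the same weight in the output further down.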

### CSV five_ten

In [75]:
THIS_FOLDER = os.getcwd()
threads_leer = threads1
carpeta_guardar = "tpcsv1"
camino = os.path.join(THIS_FOLDER, carpeta_guardar)
os.makedirs(camino, exist_ok=True)  # make sure the output folder exists

# Train one LDA model per thread, treating each tweet as a document
for hilos, documentos in threads_leer.items():
    print(hilos)

    # Populate text_data with the token list of each tweet
    text_data = []
    for line in documentos:
        tokens = prepare_text_for_lda(line)
        # Randomly skips roughly 0.9% of tweets, so results vary between runs
        if random.random() > .009:
            text_data.append(tokens)

    # Output paths for this thread's dictionary, model and corpus
    NDIC = os.path.join(camino, hilos + "_t_dictionary1.gensim")
    NMOD = os.path.join(camino, hilos + "_t_model1.gensim")
    NCOR = os.path.join(camino, hilos + "_t_corpus1.pkl")

    dictionary = corpora.Dictionary(text_data)
    corpus = [dictionary.doc2bow(text) for text in text_data]
    with open(NCOR, 'wb') as f:
        pickle.dump(corpus, f)
    dictionary.save(NDIC)

    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
    ldamodel.save(NMOD)
    topics = ldamodel.print_topics(num_words=NUM_WORDS)
    for topic in topics:
        print(topic)

Thread 1
(0, '0.068*"thompson" + 0.068*"customs" + 0.046*"partnership" + 0.046*"say" + 0.025*"ready"')
(1, '0.056*"cost" + 0.056*"partnership" + 0.056*"customs" + 0.056*"almost" + 0.056*"business"')
(2, '0.059*"cost" + 0.059*"would" + 0.059*"business" + 0.059*"almost" + 0.059*"jusâ"')
(3, '0.113*"think" + 0.062*"say" + 0.062*"sure" + 0.062*"backdoorâ" + 0.062*"sound"')
(4, '0.019*"cost" + 0.019*"say" + 0.019*"partnership" + 0.019*"customs" + 0.019*"would"')
Thread 10
(0, '0.052*"money" + 0.052*"immigration" + 0.052*"things" + 0.052*"expensive" + 0.052*"spend"')
(1, '0.048*"white" + 0.048*"fairâ" + 0.048*"nativist" + 0.048*"fearmongering" + 0.048*"official"')
(2, '0.035*"expand" + 0.035*"worker" + 0.035*"guest" + 0.035*"detection" + 0.035*"fraud"')
(3, '0.040*"real" + 0.040*"currentâ" + 0.040*"daca" + 0.040*"gutting" + 0.040*"much"')
(4, '0.059*"immigration" + 0.059*"create" + 0.032*"model" + 0.032*"give" + 0.032*"world"')
Thread 100
(0, '0.066*"trump" + 0.036*"ways" + 0.036*"billionair

(4, '0.140*"pussy" + 0.140*"note" + 0.140*"things" + 0.140*"endthread" + 0.140*"iâ??ll"')
Thread 39
(0, '0.013*"oscar" + 0.013*"would" + 0.013*"like" + 0.013*"remezcla" + 0.013*"struggle"')
(1, '0.036*"struggle" + 0.036*"exclusive" + 0.036*"journos" + 0.036*"latino" + 0.036*"routinely"')
(2, '0.044*"oscar" + 0.044*"year" + 0.044*"presenceâ" + 0.044*"credentialed" + 0.044*"time"')
(3, '0.050*"latinx" + 0.027*"steppeâ" + 0.027*"establish" + 0.027*"table" + 0.027*"seat"')
(4, '0.053*"diversity" + 0.029*"everyone" + 0.029*"eloquent" + 0.029*"lack" + 0.029*"critic"')
Thread 4
(0, '0.056*"better" + 0.056*"helpless" + 0.056*"working" + 0.056*"problem" + 0.056*"aboutâ"')
(1, '0.133*"ignore" + 0.072*"voter" + 0.072*"trump" + 0.072*"medium" + 0.072*"convince"')
(2, '0.036*"stop" + 0.036*"people" + 0.036*"ballot" + 0.036*"register" + 0.036*"youâ"')
(3, '0.023*"security" + 0.023*"going" + 0.023*"clearance" + 0.023*"laws" + 0.023*"gorsuch"')
(4, '0.023*"voter" + 0.023*"find" + 0.023*"sure" + 0.023*

(3, '0.055*"district" + 0.055*"benavides" + 0.055*"mark" + 0.055*"texas" + 0.055*"recently"')
(4, '0.017*"january" + 0.017*"arrest" + 0.017*"democrat" + 0.017*"benavides" + 0.017*"mark"')
Thread 7
(0, '0.066*"theâ" + 0.066*"expert" + 0.066*"whatever" + 0.066*"belief" + 0.066*"politician"')
(1, '0.054*"kenney" + 0.054*"thread" + 0.054*"vacuum" + 0.054*"misguide" + 0.054*"horribly"')
(2, '0.098*"/thread" + 0.098*"ableg" + 0.016*"know" + 0.016*"theâ" + 0.016*"kenney"')
(3, '0.091*"policy" + 0.050*"know" + 0.050*"shunning" + 0.050*"best" + 0.050*"evidence"')
(4, '0.041*"dedicate" + 0.041*"people" + 0.041*"life" + 0.041*"better" + 0.041*"career"')
Thread 70
(0, '0.067*"qanon" + 0.067*"days" + 0.067*"traffic" + 0.067*"schiffabouttohitthefan" + 0.067*"balazs"')
(1, '0.299*"days" + 0.050*"balazs" + 0.050*"marina" + 0.050*"spiritcooking" + 0.050*"andre"')
(2, '0.086*"standard" + 0.086*"also" + 0.086*"own" + 0.086*"satanicâ" + 0.086*"hotel"')
(3, '0.314*"qanon" + 0.314*"schiffabouttohitthefan" +
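
Each printed topic is a string of `weight*"word"` terms joined by `+`. To work with the topics as data, the string can be split into pairs. The following is a minimal sketch using only the standard library; `parse_topic` and `TERM_PATTERN` are hypothetical names, and the example string is the start of topic 0 of Thread 1 above.

```python
import re

# Matches one weight*"word" term of a topic string
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]*)"')


def parse_topic(topic_string):
    """Convert a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


example = '0.068*"thompson" + 0.068*"customs" + 0.046*"partnership"'
print(parse_topic(example))
```

The pattern assumes that words contain no double quotes, which holds for the output shown in this notebook.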

### CSV ten_fifteen

In [76]:
THIS_FOLDER = os.getcwd()
threads_leer = threads2
carpeta_guardar = "tpcsv2"
camino = os.path.join(THIS_FOLDER, carpeta_guardar)
os.makedirs(camino, exist_ok=True)  # make sure the output folder exists

# Train one LDA model per thread, treating each tweet as a document
for hilos, documentos in threads_leer.items():
    print(hilos)

    # Populate text_data with the token list of each tweet
    text_data = []
    for line in documentos:
        tokens = prepare_text_for_lda(line)
        # Randomly skips roughly 0.9% of tweets, so results vary between runs
        if random.random() > .009:
            text_data.append(tokens)

    # Output paths for this thread's dictionary, model and corpus
    NDIC = os.path.join(camino, hilos + "_t_dictionary1.gensim")
    NMOD = os.path.join(camino, hilos + "_t_model1.gensim")
    NCOR = os.path.join(camino, hilos + "_t_corpus1.pkl")

    dictionary = corpora.Dictionary(text_data)
    corpus = [dictionary.doc2bow(text) for text in text_data]
    with open(NCOR, 'wb') as f:
        pickle.dump(corpus, f)
    dictionary.save(NDIC)

    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
    ldamodel.save(NMOD)
    topics = ldamodel.print_topics(num_words=NUM_WORDS)
    for topic in topics:
        print(topic)

Thread 1
(0, '0.077*"labour" + 0.029*"want" + 0.029*"conditionsâ" + 0.029*"cheap" + 0.029*"agency"')
(1, '0.068*"outside" + 0.068*"corbyn" + 0.046*"single" + 0.025*"future" + 0.025*"trade"')
(2, '0.062*"corbyn" + 0.033*"customs" + 0.033*"union" + 0.033*"single" + 0.033*"ensure"')
(3, '0.011*"labour" + 0.011*"corbyn" + 0.011*"means" + 0.011*"rule" + 0.011*"market"')
(4, '0.065*"corbyn" + 0.045*"rule" + 0.045*"â??we" + 0.045*"accept" + 0.024*"economic"')
Thread 10
(0, '0.071*"trump" + 0.054*"mueller" + 0.037*"probe" + 0.037*"witness" + 0.037*"russianð??·ð??ºbusiness"')
(1, '0.039*"teamð??ºð??¸mueller" + 0.039*"source" + 0.039*"question" + 0.039*"focus" + 0.039*"say"')
(2, '0.052*"trump" + 0.029*"campaign" + 0.029*"prospective" + 0.029*"even" + 0.029*"though"')
(3, '0.049*"miss" + 0.049*"universe" + 0.049*"moscow" + 0.049*"trump" + 0.034*"mueller"')
(4, '0.043*"trump" + 0.024*"close" + 0.024*"pal" + 0.024*"testify" + 0.024*"congress"')
Thread 11
(0, '0.012*"drop" + 0.012*"establishment" +

(1, '0.035*"matrilineal" + 0.035*"typo" + 0.035*"correction" + 0.035*"moore" + 0.035*"2016"')
(2, '0.045*"work" + 0.045*"theory" + 0.045*"value" + 0.045*"production" + 0.025*"relations"')
(3, '0.081*"clan" + 0.081*"land" + 0.050*"mengen" + 0.034*"contradiction" + 0.034*"notion"')
(4, '0.042*"social" + 0.042*"focus" + 0.042*"life" + 0.042*"production" + 0.042*"value"')
Thread 26
(0, '0.041*"merkel" + 0.041*"angela" + 0.041*"enough" + 0.022*"qanon" + 0.022*"thestorm"')
(1, '0.021*"pope" + 0.021*"little" + 0.021*"paul" + 0.021*"moustache" + 0.021*"resemblance"')
(2, '0.027*"berlin" + 0.027*"templin" + 0.027*"raise" + 0.027*"north" + 0.027*"east"')
(3, '0.042*"kasner" + 0.023*"daughter" + 0.023*"dorothea" + 0.023*"horst" + 0.023*"given"')
(4, '0.050*"hitlerâ??s" + 0.027*"hitler" + 0.027*"name" + 0.027*"father" + 0.027*"vatican"')
Thread 27
(0, '0.076*"tonight" + 0.041*"speech" + 0.041*"unroll" + 0.041*"please" + 0.041*"response"')
(1, '0.063*"qanon" + 0.034*"watch" + 0.034*"pullback" + 0.0

(0, '0.057*"party" + 0.039*"third" + 0.039*"corrupt" + 0.039*"congress" + 0.039*"relationship"')
(1, '0.033*"voter" + 0.033*"sovereignty" + 0.033*"aadhaar" + 0.018*"go" + 0.018*"theft"')
(2, '0.010*"court" + 0.010*"ready" + 0.010*"direct" + 0.010*"2019" + 0.010*"polls"')
(3, '0.071*"aadhaar" + 0.037*"election" + 0.037*"analytica" + 0.037*"cambridge" + 0.020*"coloniser"')
(4, '0.029*"selectively" + 0.029*"genuine" + 0.029*"influence" + 0.029*"india" + 0.029*"causing"')
Thread 58
(0, '0.062*"need" + 0.034*"phase" + 0.034*"ridicule" + 0.034*"calm" + 0.034*"well"')
(1, '0.036*"post" + 0.020*"facebookgate" + 0.020*"child" + 0.020*"fall" + 0.020*"better"')
(2, '0.049*"happening" + 0.049*"quiet" + 0.049*"stealth" + 0.049*"winning" + 0.049*"chess"')
(3, '0.054*"ridicule" + 0.054*"demoralize" + 0.054*"fight" + 0.054*"ignore" + 0.054*"first"')
(4, '0.041*"wreckage" + 0.041*"world" + 0.041*"perfect" + 0.041*"forget" + 0.041*"clear"')
Thread 59
(0, '0.067*"facebook" + 0.025*"using" + 0.025*"firefo

(2, '0.058*"operation" + 0.040*"gladio" + 0.022*"arm" + 0.022*"covert" + 0.022*"military"')
(3, '0.026*"shooting" + 0.026*"thatâ" + 0.026*"fierce" + 0.026*"since" + 0.026*"unrelenting"')
(4, '0.036*"deep" + 0.036*"gladio" + 0.035*"viaâ" + 0.035*"barrage" + 0.035*"fast"')
Thread 89
(0, '0.056*"tell" + 0.056*"tonight" + 0.056*"qanon" + 0.030*"meaning" + 0.030*"everything"')
(1, '0.071*"qanon" + 0.043*"tonight" + 0.030*"found" + 0.030*"intercept" + 0.030*"usss"')
(2, '0.074*"qanon" + 0.041*"anon" + 0.041*"reference" + 0.041*"connect" + 0.041*"film"')
(3, '0.087*"iron" + 0.087*"6/14" + 0.060*"protect" + 0.060*"qanon" + 0.033*"anon"')
(4, '0.014*"please" + 0.014*"unroll" + 0.014*"tell" + 0.014*"qanon" + 0.014*"protect"')
Thread 9
(0, '0.046*"mueller" + 0.025*"barrellâ?¼ï¸" + 0.025*"ð???breakingð" + 0.025*"dealings" + 0.025*"cohenâ??s"')
(1, '0.047*"trump" + 0.047*"cohen" + 0.025*"tower" + 0.025*"russian" + 0.025*"bring"')
(2, '0.045*"cohen" + 0.024*"say" + 0.024*"sater" + 0.024*"left" + 0.0

### CSV fifteen_twenty

In [77]:
THIS_FOLDER = os.getcwd()
threads_leer = threads3
carpeta_guardar = "tpcsv3"
camino = os.path.join(THIS_FOLDER, carpeta_guardar)
os.makedirs(camino, exist_ok=True)  # make sure the output folder exists

# Train one LDA model per thread, treating each tweet as a document
for hilos, documentos in threads_leer.items():
    print(hilos)

    # Populate text_data with the token list of each tweet
    text_data = []
    for line in documentos:
        tokens = prepare_text_for_lda(line)
        # Randomly skips roughly 0.9% of tweets, so results vary between runs
        if random.random() > .009:
            text_data.append(tokens)

    # Output paths for this thread's dictionary, model and corpus
    NDIC = os.path.join(camino, hilos + "_t_dictionary1.gensim")
    NMOD = os.path.join(camino, hilos + "_t_model1.gensim")
    NCOR = os.path.join(camino, hilos + "_t_corpus1.pkl")

    dictionary = corpora.Dictionary(text_data)
    corpus = [dictionary.doc2bow(text) for text in text_data]
    with open(NCOR, 'wb') as f:
        pickle.dump(corpus, f)
    dictionary.save(NDIC)

    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
    ldamodel.save(NMOD)
    topics = ldamodel.print_topics(num_words=NUM_WORDS)
    for topic in topics:
        print(topic)

Thread 1
(0, '0.051*"meeting" + 0.051*"...." + 0.051*"news" + 0.028*"whistleblower" + 0.028*"true"')
(1, '0.072*"excite" + 0.027*"june" + 0.027*"publâ" + 0.027*"lie" + 0.027*"printing"')
(2, '0.060*"document" + 0.033*"working" + 0.033*"still" + 0.033*"continue" + 0.033*"verification"')
(3, '0.036*"receive" + 0.036*"inside" + 0.036*"source" + 0.036*"network" + 0.036*"arrive"')
(4, '0.090*"news" + 0.049*"broadcaster" + 0.049*"slant" + 0.049*"distort" + 0.049*"intentionally"')
Thread 10
(0, '0.033*"trump" + 0.018*"enquãªte" + 0.018*"exigeant" + 0.018*"franchi" + 0.018*"rubicon"')
(1, '0.027*"bien" + 0.027*"mois" + 0.027*"justice" + 0.027*"ministã¨re" + 0.027*"avaient"')
(2, '0.048*"trump" + 0.032*"pour" + 0.025*"breaking" + 0.017*"mueller" + 0.017*"badaboum"')
(3, '0.033*"trump" + 0.033*"cette" + 0.018*"thread" + 0.018*"2017" + 0.018*"situation"')
(4, '0.038*"dans" + 0.026*"c\'est" + 0.026*"lation" + 0.026*"trump" + 0.014*"kaboom"')
Thread 11
(0, '0.037*"well" + 0.037*"already" + 0.020*"p

Thread 41
(0, '0.021*"american" + 0.021*"carry" + 0.021*"work" + 0.021*"money" + 0.021*"much"')
(1, '0.035*"also" + 0.019*"political" + 0.019*"inside" + 0.019*"financier" + 0.019*"investor"')
(2, '0.032*"clintonfoundation" + 0.017*"foundation" + 0.017*"use" + 0.017*"right" + 0.017*"examination"')
(3, '0.034*"begin" + 0.034*"russia" + 0.019*"hillary" + 0.019*"examininâ" + 0.019*"explain"')
(4, '0.034*"clinton" + 0.034*"russia" + 0.034*"campaign" + 0.018*"starting" + 0.018*"2016"')
Thread 42
(0, '0.058*"british" + 0.040*"muslim" + 0.040*"hindu" + 0.022*"bring" + 0.022*"take"')
(1, '0.073*"hindu" + 0.038*"muslim" + 0.026*"rule" + 0.026*"education" + 0.026*"begin"')
(2, '0.039*"years" + 0.039*"aurangzeb" + 0.039*"jahaandar" + 0.039*"dara" + 0.021*"first"')
(3, '0.033*"could" + 0.033*"structure" + 0.033*"populaâ" + 0.033*"reason" + 0.033*"stake"')
(4, '0.053*"shah" + 0.037*"nadirshah" + 0.037*"muhammad" + 0.020*"years" + 0.020*"mughal"')
Thread 43
(0, '0.071*"bootpruitt" + 0.054*"pruitt" + 

(4, '0.048*"mccabe" + 0.033*"andrew" + 0.033*"director" + 0.033*"today" + 0.033*"announce"')
Thread 73
(0, '0.045*"obstruction" + 0.025*"congressional" + 0.025*"acts" + 0.025*"away" + 0.025*"oversight"')
(1, '0.046*"american" + 0.025*"witness" + 0.025*"community" + 0.025*"intelligence" + 0.025*"attack"')
(2, '0.083*"trump" + 0.024*"america" + 0.024*"mueller" + 0.024*"destroy" + 0.024*"constitutional"')
(3, '0.040*"behavior" + 0.040*"innocent" + 0.022*"stop" + 0.022*"people" + 0.022*"else"')
(4, '0.034*"investigation" + 0.034*"...." + 0.034*"mccarthy" + 0.034*"least" + 0.019*"trustâ"')
Thread 74
(0, '0.042*"×§×?×" + 0.042*"torah" + 0.029*"make" + 0.029*"translation" + 0.016*"work"')
(1, '0.039*"turn" + 0.039*"essay" + 0.021*"people" + 0.021*"masterwork" + 0.021*"jewish"')
(2, '0.073*"moses" + 0.038*"woman" + 0.021*"clear" + 0.021*"need" + 0.021*"back"')
(3, '0.039*"×§×?×" + 0.039*"state" + 0.039*"stuff" + 0.039*"make" + 0.021*"holiness"')
(4, '0.023*"sinai" + 0.023*"together.â" + 0.023*

### CSV twenty_twentyfive

In [78]:
THIS_FOLDER = os.getcwd()
threads_leer = threads4
carpeta_guardar = "tpcsv4"
camino = os.path.join(THIS_FOLDER, carpeta_guardar)
os.makedirs(camino, exist_ok=True)  # make sure the output folder exists

# Train one LDA model per thread, treating each tweet as a document
for hilos, documentos in threads_leer.items():
    print(hilos)

    # Populate text_data with the token list of each tweet
    text_data = []
    for line in documentos:
        tokens = prepare_text_for_lda(line)
        # Randomly skips roughly 0.9% of tweets, so results vary between runs
        if random.random() > .009:
            text_data.append(tokens)

    # Output paths for this thread's dictionary, model and corpus
    NDIC = os.path.join(camino, hilos + "_t_dictionary1.gensim")
    NMOD = os.path.join(camino, hilos + "_t_model1.gensim")
    NCOR = os.path.join(camino, hilos + "_t_corpus1.pkl")

    dictionary = corpora.Dictionary(text_data)
    corpus = [dictionary.doc2bow(text) for text in text_data]
    with open(NCOR, 'wb') as f:
        pickle.dump(corpus, f)
    dictionary.save(NDIC)

    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
    ldamodel.save(NMOD)
    topics = ldamodel.print_topics(num_words=NUM_WORDS)
    for topic in topics:
        print(topic)

Thread 1
(0, '0.031*"could" + 0.031*"trust" + 0.017*"reason" + 0.017*"fact" + 0.017*"common"')
(1, '0.026*"trade" + 0.026*"policy" + 0.026*"idea" + 0.014*"want" + 0.014*"political"')
(2, '0.016*"reason" + 0.016*"directive" + 0.016*"measure" + 0.016*"make" + 0.016*"mention"')
(3, '0.024*"economic" + 0.024*"integration" + 0.024*"influence" + 0.024*"longer" + 0.024*"know"')
(4, '0.029*"like" + 0.029*"leave" + 0.029*"good" + 0.029*"policy" + 0.029*"idea"')
Thread 10
(0, '0.065*"incumbent" + 0.045*"congressrun2018" + 0.025*"democratic" + 0.024*"davis" + 0.024*"biss&amp;wallace"')
(1, '0.030*"contribute" + 0.030*"300,000" + 0.030*"east" + 0.030*"bloomington" + 0.030*"carbondale"')
(2, '0.051*"congressrun2018" + 0.035*"2018" + 0.035*"illinois" + 0.035*"midterms2018" + 0.035*"election"')
(3, '0.072*"congressrun2018" + 0.044*"incumbent" + 0.031*"daniel" + 0.017*"hultgren" + 0.017*"roldan"')
(4, '0.087*"congressrun2018" + 0.070*"incumbent" + 0.020*"john" + 0.020*"deter" + 0.020*"il18"')
Thread 1

(0, '0.020*"leftist" + 0.020*"laws" + 0.020*"problem" + 0.020*"idiot" + 0.011*"chin"')
(1, '0.031*"surprise" + 0.017*"isis" + 0.017*"point" + 0.017*"people" + 0.017*"nikolas"')
(2, '0.021*"people" + 0.021*"white" + 0.021*"shooting" + 0.021*"nikolas" + 0.021*"cruz"')
(3, '0.027*"always" + 0.015*"every" + 0.015*"numerous" + 0.015*"violence" + 0.015*"trot"')
(4, '0.020*"school" + 0.020*"show" + 0.020*"kill" + 0.020*"shooter" + 0.020*"middle"')
Thread 41
(0, '0.045*"state" + 0.031*"require" + 0.031*"steal" + 0.031*"gun" + 0.031*"fall"')
(1, '0.026*"people" + 0.026*"gun" + 0.026*"conviction" + 0.026*"narcotic" + 0.026*"allow"')
(2, '0.051*"state" + 0.035*"background" + 0.035*"check" + 0.018*"dealer" + 0.018*"federal"')
(3, '0.049*"state" + 0.049*"contribute" + 0.027*"dealer" + 0.027*"steal" + 0.027*"federally"')
(4, '0.043*"nics" + 0.023*"permit" + 0.023*"handgun" + 0.023*"mental" + 0.023*"information"')
Thread 42
(0, '0.035*"also" + 0.035*"feel" + 0.019*"first" + 0.019*"sure" + 0.019*"bett

(0, '0.035*"person" + 0.035*"say" + 0.035*"five" + 0.035*"arrest" + 0.035*"judge"')
(1, '0.077*"bail" + 0.062*"money" + 0.033*"murder" + 0.033*"poor" + 0.033*"happen"')
(2, '0.032*"answer" + 0.032*"murder" + 0.032*"city" + 0.032*"obama" + 0.017*"become"')
(3, '0.047*"know" + 0.018*"free" + 0.018*"criminal" + 0.018*"country" + 0.018*"even"')
(4, '0.076*"repeat" + 0.076*"offender" + 0.032*"system" + 0.032*"hold" + 0.017*"release"')
Thread 73
(0, '0.029*"state" + 0.029*"deep" + 0.029*"group" + 0.016*"start" + 0.016*"evelyn"')
(1, '0.031*"know" + 0.017*"look" + 0.017*"government" + 0.017*"barack" + 0.017*"executive"')
(2, '0.034*"piece" + 0.019*"start" + 0.019*"player" + 0.019*"farkas" + 0.019*"evelyn"')
(3, '0.027*"department" + 0.027*"include" + 0.027*"toscas" + 0.027*"george" + 0.027*"go"')
(4, '0.041*"include" + 0.015*"want" + 0.015*"going" + 0.015*"easily" + 0.015*"searchable"')
Thread 74
(0, '0.034*"ritual" + 0.019*"show" + 0.019*"sacrifice" + 0.019*"death" + 0.019*"experience"')
(1,

### CSV twentyfive_thirty

In [79]:
THIS_FOLDER = os.getcwd()
threads_leer = threads5
carpeta_guardar = "tpcsv5"
camino = os.path.join(THIS_FOLDER, carpeta_guardar)
os.makedirs(camino, exist_ok=True)  # make sure the output folder exists

# Train one LDA model per thread, treating each tweet as a document
for hilos, documentos in threads_leer.items():
    print(hilos)

    # Populate text_data with the token list of each tweet
    text_data = []
    for line in documentos:
        tokens = prepare_text_for_lda(line)
        # Randomly skips roughly 0.9% of tweets, so results vary between runs
        if random.random() > .009:
            text_data.append(tokens)

    # Output paths for this thread's dictionary, model and corpus
    NDIC = os.path.join(camino, hilos + "_t_dictionary1.gensim")
    NMOD = os.path.join(camino, hilos + "_t_model1.gensim")
    NCOR = os.path.join(camino, hilos + "_t_corpus1.pkl")

    dictionary = corpora.Dictionary(text_data)
    corpus = [dictionary.doc2bow(text) for text in text_data]
    with open(NCOR, 'wb') as f:
        pickle.dump(corpus, f)
    dictionary.save(NDIC)

    ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
    ldamodel.save(NMOD)
    topics = ldamodel.print_topics(num_words=NUM_WORDS)
    for topic in topics:
        print(topic)

Thread 1
(0, '0.028*"paraguay" + 0.028*"commando" + 0.028*"time" + 0.015*"brazil" + 0.015*"say"')
(1, '0.020*"attack" + 0.020*"gaza" + 0.020*"2017" + 0.020*"palace" + 0.020*"total"')
(2, '0.019*"force" + 0.019*"weapon" + 0.019*"allies" + 0.019*"kill" + 0.019*"nuclear"')
(3, '0.041*"tunnel" + 0.022*"carter" + 0.022*"jimmy" + 0.022*"north" + 0.012*"escape"')
(4, '0.041*"north" + 0.033*"korean" + 0.025*"rafaat" + 0.017*"toumani" + 0.017*"korea"')
Thread 10
(0, '0.052*"lucky" + 0.036*"naik" + 0.036*"reveal" + 0.036*"conversation" + 0.036*"phone"')
(1, '0.040*"swedish" + 0.040*"lucky" + 0.031*"model" + 0.021*"country" + 0.021*"case"')
(2, '0.034*"atala" + 0.034*"lucky" + 0.023*"druglord" + 0.023*"drug" + 0.023*"yasin"')
(3, '0.045*"indian" + 0.031*"case" + 0.031*"agent" + 0.031*"allege" + 0.031*"lucky"')
(4, '0.051*"israeli" + 0.041*"police" + 0.041*"lucky" + 0.021*"drug" + 0.021*"2010"')
Thread 100
(0, '0.036*"still" + 0.019*"concern" + 0.019*"woman" + 0.019*"voter" + 0.019*"silence"')
(1,

(3, '0.037*"trump" + 0.020*"keep" + 0.020*"others" + 0.020*"election" + 0.020*"president"')
(4, '0.043*"say" + 0.043*"spirit" + 0.030*"even" + 0.030*"next" + 0.030*"chosen"')
Thread 41
(0, '0.032*"love" + 0.017*"look" + 0.017*"first" + 0.017*"good" + 0.017*"harry"')
(1, '0.065*"royalwedding" + 0.053*"harryandmeghan" + 0.040*"power" + 0.028*"look" + 0.015*"husband"')
(2, '0.042*"wedding" + 0.023*"black" + 0.023*"invitation" + 0.023*"getting" + 0.023*"harry\x92s"')
(3, '0.070*"royalwedding" + 0.031*"black" + 0.031*"harryandmeghan" + 0.021*"there\x92s" + 0.021*"prince"')
(4, '0.064*"royalwedding" + 0.054*"harryandmeghan" + 0.033*"castle" + 0.033*"windsor" + 0.023*"look"')
Thread 42
(0, '0.033*"israel" + 0.033*"post" + 0.033*"trump" + 0.018*"flypaper" + 0.018*"strategy"')
(1, '0.019*"fortify" + 0.019*"yemeni" + 0.019*"commando" + 0.019*"operate" + 0.019*"foot"')
(2, '0.038*"israel" + 0.038*"golan" + 0.038*"clearing" + 0.026*"ground" + 0.026*"minefield"')
(3, '0.036*"tank" + 0.036*"turret" 

(2, '0.110*"find" + 0.110*"info" + 0.075*"poll" + 0.041*"thanks" + 0.041*"city"')
(3, '0.048*"location" + 0.048*"poll" + 0.048*"always" + 0.048*"state" + 0.048*"casting"')
(4, '0.142*"poll" + 0.131*"find" + 0.119*"info" + 0.096*"place" + 0.027*"calendar"')
Thread 73
(0, '0.052*"server" + 0.042*"devices" + 0.022*"\x95clinton" + 0.022*"lots" + 0.012*"review"')
(1, '0.028*"agent" + 0.028*"laptop" + 0.019*"clinton" + 0.019*"steal" + 0.019*"computer"')
(2, '0.024*"trump" + 0.024*"foundation" + 0.024*"drop" + 0.024*"laptop" + 0.013*"yeah"')
(3, '0.033*"waiting" + 0.033*"guccifer2" + 0.018*"interest" + 0.018*"congress" + 0.018*"hack"')
(4, '0.018*"2015" + 0.018*"corrupt" + 0.018*"email" + 0.018*"election" + 0.018*"crap"')
Thread 74
(0, '0.015*"make" + 0.015*"windrushgeneration" + 0.015*"another" + 0.015*"repeating" + 0.015*"following"')
(1, '0.058*"mail" + 0.021*"better" + 0.021*"describe" + 0.021*"case" + 0.021*"change"')
(2, '0.028*"tell" + 0.028*"essay" + 0.015*"immigration" + 0.015*"paper

# Per-thread topic detection

Unlike the previous section, topics are now searched for across each whole file, so each thread is treated as one document. To do this, the tweets of a thread are joined into a single text, as paragraphs separated by newlines ("\n").


In [80]:
string = " \n "

for thread, data in dict(list(csv1_grouped_by_thread)).items():
    threads1[thread] = string.join(list(data['text']))    
Tthreads1 = list(threads1.values())

for thread, data in dict(list(csv2_grouped_by_thread)).items():
    threads2[thread] = string.join(list(data['text']))
Tthreads2 = list(threads2.values())

for thread, data in dict(list(csv3_grouped_by_thread)).items():
    threads3[thread] = string.join(list(data['text']))
Tthreads3 = list(threads3.values())

for thread, data in dict(list(csv4_grouped_by_thread)).items():
    threads4[thread] = string.join(list(data['text']))
Tthreads4 = list(threads4.values())

for thread, data in dict(list(csv5_grouped_by_thread)).items():
    threads5[thread] = string.join(list(data['text']))
Tthreads5 = list(threads5.values())
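
Note that the cell above overwrites the values of `threads1` to `threads5`, replacing each list of tweets with one joined string. A minimal sketch of a variant that leaves the original dictionaries untouched follows; `join_threads` and the sample data are hypothetical, and the separator matches the `" \n "` string used above.

```python
def join_threads(threads, separator=" \n "):
    """Return one string per thread, with its tweets joined by separator."""
    return [separator.join(tweets) for tweets in threads.values()]


# Made-up threads in the {label: [tweets]} shape built earlier
sample_threads = {
    "Thread 1": ["first tweet", "second tweet"],
    "Thread 2": ["only tweet"],
}
print(join_threads(sample_threads))
```

Keeping the per-tweet lists intact would allow the per-tweet cells of the previous section to be re-run after this one.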

In [81]:
Tthreads1

['Extraordinary evidence at Treasury committee from Jon Thompson, CEO of HMRC on customs and Brexit today https://t.co/DJhIQhmVwJ \n The Brexiter favourite Max Fac - would cost business between Â£17 and Â£20bn a year\r\r\n\r\r\n- that\'s almost 1% of GDP\r\r\n\r\r\n- jusâ?¦ https://t.co/0MwIcwre4t \n How does he arrive at the figure\r\r\n\r\r\n200m export consignments at an average cost of Â£32.50 each = Â£6.5bn (times two beâ?¦ https://t.co/KxnkU2QiVO \n Theresa May\'s New Customs Partnership is much cheaper for business (almost zero cost)  because it seeks to replicatâ?¦ https://t.co/0LcsJHah0H \n Mr Thompson said he did not expect the EU to reciprocate over the customs partnership. \r\r\n\r\r\nWhat that means is UK collâ?¦ https://t.co/9c3uhhnZGX \n Both would not be ready by 2021. Max Fac needs 3 years. Customs Partnership requires 5, Mr Thompson said.\r\r\n\r\r\nThe bordâ?¦ https://t.co/luLzgUsiR4 \n "We think we can manage the risk - we think we can" he said. He didn\'t sound so 

In [82]:
from gensim import corpora
import gensim
NUM_TOPICS = 20
NUM_WORDS = 10
import pickle

### CSV five_ten

In [83]:
THIS_FOLDER = os.getcwd()
threads_leer = Tthreads1
carpeta_guardar = "Ttpcsv1"
camino = os.path.join(THIS_FOLDER, carpeta_guardar)
os.makedirs(camino, exist_ok=True)  # make sure the output folder exists

# Populate text_data with the token list of each thread
documentos = threads_leer
text_data = []
for line in documentos:
    tokens = prepare_text_for_lda(line)
    # Randomly skips roughly 0.9% of threads, so results vary between runs
    if random.random() > .009:
        text_data.append(tokens)

# Output paths for the dictionary, model and corpus of this file
NDIC = os.path.join(camino, "t_dictionary1.gensim")
NMOD = os.path.join(camino, "t_model1.gensim")
NCOR = os.path.join(camino, "t_corpus1.pkl")

dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]
with open(NCOR, 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save(NDIC)

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save(NMOD)
topics = ldamodel.print_topics(num_words=NUM_WORDS)
for topic in topics:
    print(topic)

(0, '0.013*"investigation" + 0.011*"qanon" + 0.011*"security" + 0.011*"report" + 0.010*"committee" + 0.010*"governmental" + 0.010*"homeland" + 0.010*"staff" + 0.010*"affairs" + 0.010*"unite"')
(1, '0.012*"smart" + 0.010*"samba" + 0.010*"kevin" + 0.010*"reynolds" + 0.010*"also" + 0.007*"trump" + 0.007*"disable" + 0.007*"use" + 0.007*"wharton" + 0.007*"reagan"')
(2, '0.015*"indictment" + 0.015*"lard" + 0.013*"seal" + 0.011*"followthewhiterabbit" + 0.009*"eastern" + 0.009*"loã¯se" + 0.009*"immigration" + 0.007*"qanonâ" + 0.007*"releasethememo" + 0.007*"well"')
(3, '0.011*"woman" + 0.008*"tech" + 0.008*"kobebryant" + 0.008*"mccabe" + 0.008*"like" + 0.006*"know" + 0.006*"many" + 0.006*"work" + 0.005*"make" + 0.005*"mission"')
(4, '0.014*"peer" + 0.012*"trump" + 0.012*"make" + 0.012*"mkultra" + 0.009*"campaign" + 0.009*"node" + 0.009*"full" + 0.007*"russia" + 0.007*"trumpâ??s" + 0.007*"bitcoin"')
(5, '0.015*"north" + 0.015*"dakota" + 0.010*"vote" + 0.010*"northdakota" + 0.008*"candidate" + 0
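
The cells in this notebook keep a document only when `random.random() > .009`, so each run may drop a different small subset and produce slightly different topics. The following is a minimal sketch of making that step reproducible with a seeded generator; `sample_documents`, its `seed` parameter and the sample data are hypothetical and not part of the notebook.

```python
import random


def sample_documents(documents, threshold=0.009, seed=0):
    """Keep each document when a seeded random draw exceeds threshold."""
    rng = random.Random(seed)  # private generator, leaves the global one untouched
    return [doc for doc in documents if rng.random() > threshold]


docs = ["thread %d" % i for i in range(1000)]
first = sample_documents(docs, seed=42)
second = sample_documents(docs, seed=42)
print(first == second, len(first) <= len(docs))
```

This only fixes which documents are kept. LDA training has its own randomness, which this sketch does not address.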

### CSV ten_fifteen

In [84]:
THIS_FOLDER = os.getcwd()
threads_leer = Tthreads2
carpeta_guardar = "Ttpcsv2"
camino = os.path.join(THIS_FOLDER, carpeta_guardar)
os.makedirs(camino, exist_ok=True)  # make sure the output folder exists

# Populate text_data with the token list of each thread
documentos = threads_leer
text_data = []
for line in documentos:
    tokens = prepare_text_for_lda(line)
    # Randomly skips roughly 0.9% of threads, so results vary between runs
    if random.random() > .009:
        text_data.append(tokens)

# Output paths for the dictionary, model and corpus of this file
NDIC = os.path.join(camino, "t_dictionary1.gensim")
NMOD = os.path.join(camino, "t_model1.gensim")
NCOR = os.path.join(camino, "t_corpus1.pkl")

dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]
with open(NCOR, 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save(NDIC)

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save(NMOD)
topics = ldamodel.print_topics(num_words=NUM_WORDS)
for topic in topics:
    print(topic)

(0, '0.021*"grassleymemo" + 0.015*"page" + 0.015*"declassify" + 0.012*"portion" + 0.009*"operation" + 0.009*"steele" + 0.007*"angry" + 0.007*"gladio" + 0.006*"like" + 0.006*"people"')
(1, '0.011*"grassley" + 0.010*"cmte" + 0.008*"hastings" + 0.008*"know" + 0.007*"would" + 0.007*"meet" + 0.007*"â?«ï¸?repub" + 0.007*"trump" + 0.007*"document" + 0.005*"many"')
(2, '0.012*"woman" + 0.009*"northkorea" + 0.008*"memo" + 0.008*"france" + 0.008*"take" + 0.008*"theâ" + 0.006*"know" + 0.006*"hold" + 0.006*"libya" + 0.006*"sarkozy"')
(3, '0.018*"trump" + 0.015*"mueller" + 0.010*"putin" + 0.009*"corbyn" + 0.007*"trumpâ??s" + 0.006*"subpoena" + 0.006*"apollo" + 0.005*"want" + 0.005*"know" + 0.005*"kushco"')
(4, '0.020*"cult" + 0.011*"words" + 0.011*"child" + 0.009*"code" + 0.009*"signal" + 0.009*"family" + 0.006*"system" + 0.006*"every" + 0.006*"grow" + 0.006*"hand"')
(5, '0.008*"memo" + 0.007*"make" + 0.006*"election" + 0.006*"news" + 0.006*"company" + 0.006*"medium" + 0.006*"work" + 0.005*"like" +

### CSV fifteen_twenty

In [85]:
THIS_FOLDER = os.getcwd()
threads_leer = Tthreads3
carpeta_guardar = "Ttpcsv3"

# Build text_data: one token list per thread
camino = os.path.join(THIS_FOLDER, carpeta_guardar)
os.makedirs(camino, exist_ok=True)  # make sure the output folder exists
documentos = threads_leer
text_data = []

for line in documentos:
    tokens = prepare_text_for_lda(line)
    # Random subsampling: roughly 99% of the threads are kept
    if random.random() > .009:
        text_data.append(tokens)

# Output paths for the dictionary, the model and the bag-of-words corpus
NDIC = os.path.join(camino, "t_dictionary1.gensim")
NMOD = os.path.join(camino, "t_model1.gensim")
NCOR = os.path.join(camino, "t_corpus1.pkl")
dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]
with open(NCOR, 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save(NDIC)

# Train the LDA model, save it and print its topics
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save(NMOD)
topics = ldamodel.print_topics(num_words=NUM_WORDS)
for topic in topics:
    print(topic)

(0, '0.012*"iran" + 0.007*"alex" + 0.006*"trump" + 0.005*"zwaan" + 0.005*"like" + 0.005*"tell" + 0.004*"people" + 0.004*"mueller" + 0.004*"nuclear" + 0.004*"manafort"')
(1, '0.010*"trump" + 0.008*"claim" + 0.007*"hindu" + 0.006*"wray" + 0.005*"would" + 0.005*"take" + 0.004*"perkinscoie" + 0.004*"qanon" + 0.004*"iran" + 0.004*"british"')
(2, '0.011*"trump" + 0.010*"like" + 0.008*"qanon" + 0.007*"many" + 0.006*"tesla" + 0.006*"know" + 0.006*"people" + 0.005*"week" + 0.005*"breaking" + 0.004*"good"')
(3, '0.009*"prince" + 0.009*"harry" + 0.007*"britain" + 0.007*"ù?ù?ù?ù" + 0.007*"ù?ø§ù" + 0.007*"oscar" + 0.006*"meghan" + 0.006*"markle" + 0.006*"ù?ù?ù?ù?ù" + 0.006*"ù?ù?ù"')
(4, '0.017*"trump" + 0.012*"obama" + 0.008*"rabe" + 0.008*"page" + 0.006*"leak" + 0.006*"back" + 0.005*"apps" + 0.005*"click" + 0.005*"people" + 0.004*"liberty"')
(5, '0.019*"qanon" + 0.019*"student" + 0.019*"fakenewsawards" + 0.019*"bible" + 0.008*"german" + 0.007*"photo" + 0.007*"military" + 0.007*"world" + 0.004*"mak

### CSV twenty_twentyfive

In [86]:
THIS_FOLDER = os.getcwd()
threads_leer = Tthreads4
carpeta_guardar = "Ttpcsv4"

# Build text_data: one token list per thread
camino = os.path.join(THIS_FOLDER, carpeta_guardar)
os.makedirs(camino, exist_ok=True)  # make sure the output folder exists
documentos = threads_leer
text_data = []

for line in documentos:
    tokens = prepare_text_for_lda(line)
    # Random subsampling: roughly 99% of the threads are kept
    if random.random() > .009:
        text_data.append(tokens)

# Output paths for the dictionary, the model and the bag-of-words corpus
NDIC = os.path.join(camino, "t_dictionary1.gensim")
NMOD = os.path.join(camino, "t_model1.gensim")
NCOR = os.path.join(camino, "t_corpus1.pkl")
dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]
with open(NCOR, 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save(NDIC)

# Train the LDA model, save it and print its topics
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save(NMOD)
topics = ldamodel.print_topics(num_words=NUM_WORDS)
for topic in topics:
    print(topic)

(0, '0.008*"demand" + 0.008*"iran" + 0.006*"kurd" + 0.005*"trump" + 0.005*"take" + 0.004*"attack" + 0.004*"tell" + 0.004*"would" + 0.004*"time" + 0.004*"must"')
(1, '0.039*"democratic" + 0.035*"candidate" + 0.023*"vote" + 0.021*"georgia" + 0.020*"2018" + 0.011*"district" + 0.009*"november" + 0.008*"dossier" + 0.008*"register" + 0.008*"investigation"')
(2, '0.019*"trump" + 0.011*"know" + 0.008*"people" + 0.007*"make" + 0.006*"manafort" + 0.006*"brexit" + 0.006*"fact" + 0.005*"would" + 0.005*"vote" + 0.005*"trade"')
(3, '0.019*"incumbent" + 0.016*"justice" + 0.015*"congressrun2018" + 0.014*"trump" + 0.012*"qanon" + 0.012*"anon" + 0.010*"relate" + 0.010*"militia" + 0.010*"news" + 0.010*"fulldisclosure"')
(4, '0.015*"qanon" + 0.010*"trump" + 0.009*"fusion" + 0.009*"maga" + 0.008*"post" + 0.007*"wwg1wga" + 0.007*"food" + 0.007*"dprk" + 0.006*"people" + 0.006*"shapiro"')
(5, '0.017*"wethepeople" + 0.014*"libertyrising" + 0.014*"noconcon" + 0.012*"trump" + 0.011*"state" + 0.011*"theresistance

### CSV twentyfive_thirty

In [87]:
THIS_FOLDER = os.getcwd()
threads_leer = Tthreads5
carpeta_guardar = "Ttpcsv5"

# Build text_data: one token list per thread
camino = os.path.join(THIS_FOLDER, carpeta_guardar)
os.makedirs(camino, exist_ok=True)  # make sure the output folder exists
documentos = threads_leer
text_data = []

for line in documentos:
    tokens = prepare_text_for_lda(line)
    # Random subsampling: roughly 99% of the threads are kept
    if random.random() > .009:
        text_data.append(tokens)

# Output paths for the dictionary, the model and the bag-of-words corpus
NDIC = os.path.join(camino, "t_dictionary1.gensim")
NMOD = os.path.join(camino, "t_model1.gensim")
NCOR = os.path.join(camino, "t_corpus1.pkl")
dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]
with open(NCOR, 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save(NDIC)

# Train the LDA model, save it and print its topics
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save(NMOD)
topics = ldamodel.print_topics(num_words=NUM_WORDS)
for topic in topics:
    print(topic)

(0, '0.020*"system" + 0.010*"head" + 0.009*"canal" + 0.009*"balance" + 0.007*"vestibular" + 0.007*"back" + 0.007*"around" + 0.005*"movement" + 0.005*"forth" + 0.005*"eyes"')
(1, '0.018*"page" + 0.007*"carter" + 0.007*"stigmergy" + 0.006*"2016" + 0.006*"c\x92est" + 0.005*"idea" + 0.005*"russia" + 0.005*"gowdy" + 0.005*"avec" + 0.005*"group"')
(2, '0.026*"trump" + 0.014*"bombshell" + 0.010*"page" + 0.010*"qanon" + 0.009*"mueller" + 0.008*"russia" + 0.007*"wwg1wga" + 0.006*"know" + 0.006*"maga" + 0.006*"post"')
(3, '0.013*"trump" + 0.006*"flynn" + 0.006*"obama" + 0.005*"russia" + 0.005*"know" + 0.004*"would" + 0.004*"steele" + 0.004*"investigation" + 0.004*"rice" + 0.004*"make"')
(4, '0.022*"path" + 0.021*"openingday" + 0.021*"thingskidshavetaughtme" + 0.020*"qanon" + 0.019*"thestorm" + 0.017*"releasethevideo" + 0.010*"transgender" + 0.007*"activism" + 0.007*"ecodescom" + 0.006*"woman"')
(5, '0.035*"contd" + 0.021*"mulehead" + 0.010*"flynn" + 0.006*"trump" + 0.006*"view" + 0.006*"evidence
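
The per-CSV cells above differ only in the input list (`Tthreads2` to `Tthreads5`) and the output folder. The sketch below shows one way the shared steps could be factored out. The names `output_paths` and `build_text_data` are hypothetical and not part of the notebook, and `str.split` stands in for `prepare_text_for_lda` so the example runs without spaCy or NLTK. The optional `seed` is an added assumption that makes the random subsampling repeatable.

```python
import os
import random


def output_paths(base_folder, subfolder):
    """Return the dictionary, model and corpus paths used by the cells above."""
    camino = os.path.join(base_folder, subfolder)
    return {
        "dictionary": os.path.join(camino, "t_dictionary1.gensim"),
        "model": os.path.join(camino, "t_model1.gensim"),
        "corpus": os.path.join(camino, "t_corpus1.pkl"),
    }


def build_text_data(documentos, prepare, drop_prob=0.009, seed=None):
    """Tokenize each thread and randomly drop about `drop_prob` of them."""
    rng = random.Random(seed)  # a fixed seed makes the subsample repeatable
    text_data = []
    for line in documentos:
        tokens = prepare(line)
        if rng.random() > drop_prob:
            text_data.append(tokens)
    return text_data


# Toy usage: str.split stands in for prepare_text_for_lda (assumption)
sample = ["first toy thread", "second toy thread", "third toy thread"]
kept = build_text_data(sample, str.split, drop_prob=0.0, seed=0)
paths = output_paths("base", "Ttpcsv2")
print(len(kept), paths["model"])
```

In the notebook, `prepare_text_for_lda` would be passed as `prepare`, and the returned paths would replace `NDIC`, `NMOD` and `NCOR`.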

## Megacorpus

As a third analysis alternative, all available threads are merged into a single megacorpus: each thread from every file is treated as one document, and topics are then detected over the resulting set of roughly 500 documents.

In [88]:
megatexto = Tthreads1+Tthreads2+Tthreads3+Tthreads4+Tthreads5

In [89]:
megatexto

['Extraordinary evidence at Treasury committee from Jon Thompson, CEO of HMRC on customs and Brexit today https://t.co/DJhIQhmVwJ \n The Brexiter favourite Max Fac - would cost business between Â£17 and Â£20bn a year\r\r\n\r\r\n- that\'s almost 1% of GDP\r\r\n\r\r\n- jusâ?¦ https://t.co/0MwIcwre4t \n How does he arrive at the figure\r\r\n\r\r\n200m export consignments at an average cost of Â£32.50 each = Â£6.5bn (times two beâ?¦ https://t.co/KxnkU2QiVO \n Theresa May\'s New Customs Partnership is much cheaper for business (almost zero cost)  because it seeks to replicatâ?¦ https://t.co/0LcsJHah0H \n Mr Thompson said he did not expect the EU to reciprocate over the customs partnership. \r\r\n\r\r\nWhat that means is UK collâ?¦ https://t.co/9c3uhhnZGX \n Both would not be ready by 2021. Max Fac needs 3 years. Customs Partnership requires 5, Mr Thompson said.\r\r\n\r\r\nThe bordâ?¦ https://t.co/luLzgUsiR4 \n "We think we can manage the risk - we think we can" he said. He didn\'t sound so 

In [90]:
import pickle

import gensim
from gensim import corpora

NUM_TOPICS = 20
NUM_WORDS = 10

In [96]:
THIS_FOLDER = os.getcwd()
threads_leer = megatexto
carpeta_guardar = "mega"

# Build text_data: one token list per thread
camino = os.path.join(THIS_FOLDER, carpeta_guardar)
os.makedirs(camino, exist_ok=True)  # make sure the output folder exists
documentos = threads_leer
text_data = []

for line in documentos:
    tokens = prepare_text_for_lda(line)
    # Random subsampling: roughly 99% of the threads are kept
    if random.random() > .009:
        text_data.append(tokens)

# Output paths for the dictionary, the model and the bag-of-words corpus
NDIC = os.path.join(camino, "t_dictionary1.gensim")
NMOD = os.path.join(camino, "t_model1.gensim")
NCOR = os.path.join(camino, "t_corpus1.pkl")
dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]
with open(NCOR, 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save(NDIC)

# Train the LDA model, save it and print its topics
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save(NMOD)
topics = ldamodel.print_topics(num_words=NUM_WORDS)
for topic in topics:
    print(topic)




(0, '0.036*"qanon" + 0.020*"post" + 0.007*"anon" + 0.005*"qalert" + 0.005*"say" + 0.005*"drop" + 0.005*"maga" + 0.005*"last" + 0.004*"4/4/18" + 0.004*"would"')
(1, '0.008*"path" + 0.008*"openingday" + 0.008*"thingskidshavetaughtme" + 0.006*"releasethevideo" + 0.006*"simulation" + 0.006*"thestorm" + 0.005*"qanon" + 0.004*"politico" + 0.004*"obama" + 0.004*"many"')
(2, '0.015*"qanon" + 0.007*"investigation" + 0.007*"trump" + 0.006*"maga" + 0.006*"post" + 0.005*"wwg1wga" + 0.005*"gitmo" + 0.005*"obama" + 0.005*"also" + 0.005*"like"')
(3, '0.010*"democratic" + 0.009*"candidate" + 0.008*"incumbent" + 0.006*"vote" + 0.005*"election" + 0.005*"people" + 0.004*"medium" + 0.004*"qanon" + 0.003*"social" + 0.003*"eunoia"')
(4, '0.038*"qanon" + 0.034*"part" + 0.010*"post" + 0.008*"snowden" + 0.007*"bring" + 0.005*"qanon8chan" + 0.004*"say" + 0.003*"obama" + 0.003*"call" + 0.003*"county"')
(5, '0.053*"qanon" + 0.021*"thestorm" + 0.021*"greatawakening" + 0.020*"internetbillofrights" + 0.011*"wethepe
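
Each topic above is printed as a single string of `weight*"word"` terms, which is awkward to compare or filter. As a minimal sketch, the hypothetical helper `parse_topic` below splits such a string into `(word, weight)` pairs using only the standard library. When the model object is still in memory, gensim's `ldamodel.show_topic(topic_id, topn=NUM_WORDS)` returns the same kind of pairs directly.

```python
import re


def parse_topic(topic_string):
    """Split a gensim topic string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]*)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


# First three terms of topic 0 from the megacorpus output
example = '0.036*"qanon" + 0.020*"post" + 0.007*"anon"'
print(parse_topic(example))
```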

### Analysis of the megacorpus results

Once the topics have been detected, the threads of one file are classified according to them.

In [92]:
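for hilo in Tthreads1:
    # Tokenize the thread and map it onto the megacorpus dictionary
    hilito = prepare_text_for_lda(hilo)
    hilito_bow = dictionary.doc2bow(hilito)
    print(hilo)
    # Topic distribution of the thread as (topic_id, probability) pairs
    print(ldamodel.get_document_topics(hilito_bow))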

Extraordinary evidence at Treasury committee from Jon Thompson, CEO of HMRC on customs and Brexit today https://t.co/DJhIQhmVwJ 
 The Brexiter favourite Max Fac - would cost business between Â£17 and Â£20bn a year

- that's almost 1% of GDP

- jusâ?¦ https://t.co/0MwIcwre4t 
 How does he arrive at the figure

200m export consignments at an average cost of Â£32.50 each = Â£6.5bn (times two beâ?¦ https://t.co/KxnkU2QiVO 
 Theresa May's New Customs Partnership is much cheaper for business (almost zero cost)  because it seeks to replicatâ?¦ https://t.co/0LcsJHah0H 
 Mr Thompson said he did not expect the EU to reciprocate over the customs partnership. 

What that means is UK collâ?¦ https://t.co/9c3uhhnZGX 
 Both would not be ready by 2021. Max Fac needs 3 years. Customs Partnership requires 5, Mr Thompson said.

The bordâ?¦ https://t.co/luLzgUsiR4 
 "We think we can manage the risk - we think we can" he said. He didn't sound so sure. 

And "the potential backdoorâ?¦ https://t.co/Ti1nbbjfp

(6) Well, what the futâ?¦ https://t.co/0DqGzc3x8v
[(16, 0.98020834)]
Part 1: March 7-8 thread, #Qanon asks to play "where's #Snowden" in the picture below. In China/HK? Live operationâ?¦ https://t.co/vSCVeUvsNG 
 Part 2: #Qanon paints the battle as vs the CIA

Brings up #Snowden's journey out of US, recapped here:â?¦ https://t.co/F6y9egU4PC 
 Part 3: #Qanon returns to #internetbillofrights, and how Q sees this as the necessary step to prevent censorship byâ?¦ https://t.co/bJQboQQ1cR 
 Part 4: #Qanon again asks if those who #FollowTheWhiteRabbit trust AG Jeff Sessions? Indicates Sessions delivers thâ?¦ https://t.co/KRNcYpSMcF 
 #BOOM. #Qanon called another one?

Jeff Sessions.

https://t.co/NtSCMgoLS5 
 Part 5: #Qanon updates on apparent #Snowden operation. Posts pictures of a building in what looks like many neighboâ?¦ https://t.co/gnSfYNoDgA
[(5, 0.3352982), (14, 0.6154754), (16, 0.034571175)]
Hmmm...Q is on the hunt for @Snowden. This looks like it could be Hong Kong. I lived there f

 Part of #MKUltra that led to its exposure was the death of Frank Olsen who was drugged with LSD by the CIA withoutâ?¦ https://t.co/v1lITW5aMc
[(13, 0.98827165)]
You have heard about our #AI Code and #global summit, but take a look at some of our other recommendations in theâ?¦ https://t.co/7FQAkxkN3o 
 We had a lot to say about #data and #AI, and who should have access to it. https://t.co/72ikDcsUKm 
 We also thought carefully about #diversity and #representation in #artificialintelligence and how we should addressâ?¦ https://t.co/Dl4rX3IEDj 
 The #LordsAIreport also raised the importance of the #UK maintaining #research and #innovation in #AI https://t.co/j2FwdupNBD 
 Our report also focused on #lifelonglearning and the need to adapt to an #AI work environment https://t.co/GftBuGLpQL
[(2, 0.72253096), (9, 0.2501963)]
(1) Anatomy of an MSM segment:
A: Present two extremes of a topic and pander to their differences.
B: Hire "expertsâ?¦ https://t.co/4cjKD3z0ZA 
 (2) When we are beset by

 Ã?videmment, Ã  l'Ã©poque, elle ne pouvait pas. Aux historienâ?¢neâ?¢s d'aujourd'hui de rendre justice Ã  ces femmes violÃ©eâ?¦ https://t.co/dw9PtgH45b
[(9, 0.99188036)]
Her: Because every time I said no he tightened his hold around me 

Him: She wasnâ??t that attractive 

Kobe Bryant iâ?¦ https://t.co/1IlCnwSuMG 
 After initially lying to police #KobeBryant admits the sex only lasted for 5 mins, he grabbed her by the throat theâ?¦ https://t.co/eAgor5SXl9 
 Too many lacerations to count.
NOT consistent with consensual sex.
Youâ??re full of s**t @TheAcademy 
#KobeBryantâ?¦ https://t.co/23OHy5I1jT 
 #KobeBryant said the bruise around her neck was just what happens and suggested asking Michelle (not his wife) â??youâ?¦ https://t.co/i0VhKiJug6 
 Evidence included a T-shirt stained with the 19 year olds blood. Oh and #KobeBryant made her stay and clean up befoâ?¦ https://t.co/4uwM9Bjg1a 
 #KobeBryantâ??s accuser immediately told a bellhop she was raped, she was crying and her clothing was 

 Now is NOT the time to be COMPLACENT!  There are SO many distractions around us, the attack on our #2A, theâ?¦ https://t.co/abHlpqMQeK
[(9, 0.9793478)]
The #NRA is a flat-out, kick-ass grassroots organization and their power derives from their members. 

There memberâ?¦ https://t.co/3n9YuXHCfZ 
 You wonâ??t know that if you watch the #FakeNewsMedia, Drive-By Media and their continual harangue, which is as prediâ?¦ https://t.co/vELZte1UEy 
 How in the world can political differences be put aside when even the marchers and their agenda sound very similarâ?¦ https://t.co/6iI6VT6IEi 
 The long-term goal of the left is to eliminate all guns, to confiscate every gun in this country.Â Of course, I donâ??â?¦ https://t.co/f7pS0xeneL 
 And whenever you start bringing the NRA into this, you politicize it, and anybody who thinks that the student marchâ?¦ https://t.co/ez2Zn9Njox 
 The Democrats set themselves up as above the political fray.Â They triangulate.Â They say, â??We are not these skank D

 Also sign up at Gab.  We need to support alternative social media sites that support free speech.  You can find meâ?¦ https://t.co/AKOi2pPmEh
[(11, 0.9879747)]
THREAD: Today we learned that the FBI is investigating connections between the NRA, Russia, and the Trump campaign.â?¦ https://t.co/ylslcfnWL0 
 In August of 2016, Trumpâ??s rhetoric &amp; behavior caused reliable GOP donors to abandon his campaign, but the NRA *incâ?¦ https://t.co/mB9ljx5BuU 
 In fact, the NRA spent more on Trumpâ??s 2016 campaign than Trumpâ??s own super PACs. It was the most the NRA has everâ?¦ https://t.co/VuoNXTvJEW 
 And the $30 million that the NRA spent on Trump might only be the tip of the iceberg -- reporting from @McClatchyDCâ?¦ https://t.co/q2PqPoX8Rk 
 The money isnâ??t the only peculiarity. Trump tried to hire David Clarke, an NRA spokesperson who had traveled to Rusâ?¦ https://t.co/gCM5sQ5F2e 
 And Jared Kushner failed to disclose a proposed meeting from lifetime NRA member, Alexander Torshin. To

 @RogueCinderella or @roguejasmine would you mind posting this on @officialnmp and maybe even post it in the blog?
[(1, 0.98978496)]
Part 1: Feb 7-8, #Qanon responds to anons on the /qresearch 8ch board. Brings up #Snowden again. Questions rift betâ?¦ https://t.co/5dyqMYPmaq 
 Part 2: #Qanon references old #breadcrumbs. Soros, Rothschild, House of Saud; false flags (nukes?) and "kill box";â?¦ https://t.co/b9uKRpjlW7 
 Part 3: #Qanon posts picture of King Tower: https://t.co/MXZ0yDKzyh

"We see you (live) = national security is watcâ?¦ https://t.co/ZoyX3rzPHy 
 Part 4: #Qanon follows up on part 3 surveillance style pics with new photos. Says the hunters have become the hunteâ?¦ https://t.co/y2Q28IiQAn 
 Part 5: #Qanon says the window picture in Part 4 was confirmation of an arrest--just because it isn't in news doesnâ?¦ https://t.co/zxNBS4xNEO 
 Part 6: #Qanon brings up Hannity's tweet on #UraniumOne (which Q recently asked patriots to keep pushing into the nâ?¦ https://t.co/JC2WXZWOTU 

 https://t.co/Zuqe2tHrrH
[(3, 0.9743243)]
1. Again, so many people misunderstand the brilliance of Trump's negotiation strategy.

This is NOT Trump v KJU.

Iâ?¦ https://t.co/9K8r4RLpii 
 2. Xi expected that the THREAT to take away the Summit, by ordering their proxy KJU to start being hostile and bellâ?¦ https://t.co/7eIa3a4azf 
 3. Trump has already won bigly in this negotiation. He has increased US leverage significantly.

All US hostages arâ?¦ https://t.co/3Qz4QXD9Le 
 4. 30,000 foot view, folks:

This has been a TERRIBLE 15 months for the Chinese. The US is now positioned to pressuâ?¦ https://t.co/cBITVj0fmC 
 5. ALL leverage rests with POTUS Trump. He won't waste this opportunity.

Trump knows that he's outmanouvered Xi byâ?¦ https://t.co/m97xYRqsg2 
 6. Xi will now have to offer even MORE, just to get back to where he was less than 2 days ago.  

You watch - the Nâ?¦ https://t.co/ykiQ61h5FV 
 7. I've always thought Xi to be a total hack. He's predictable, unimaginative and also, 

 Itâ??s also illegal for domestic abusers to possess guns â?? Lautenberg Amendment. Buybacks donâ??t work (mandatory Austrâ?¦ https://t.co/uPpwlXxv8n
[(7, 0.98895353)]
The most common form of propaganda is repetition.

This is why the @BBC repeat over and over negative stories aboutâ?¦ https://t.co/trZiMUMLKg 
 Importantly, this also applies to outright #lies.

Mark Twain said "A Lie Can Travel Halfway Around the World Whileâ?¦ https://t.co/J0nKDgFqYl 
 But in today's world with instant communication it takes but the utterance of a lie a single time by a institutionâ?¦ https://t.co/cCv92WvJ25 
 Given this enormous power, and given their proclivity for #Lying and, their practice of repeating their lies endlesâ?¦ https://t.co/IENT78mynn 
 By this I mean - The @BBC should no longer be allowed to do anything other than entertainment as they simply cannotâ?¦ https://t.co/LKMdzDEp3f 
 Remember, a lie repeated often enough by a powerful organisation such as our Government and/or it's communic

 ALSO: Guy Who Owns Standard Hotel Chain Is Andre Balazs.. Here He is With Marina Abramovic #SpiritCooking #Satanicâ?¦ https://t.co/TdsgLsN9u9
[(19, 0.95476186)]
ð??¬ Do you think #threads can help make Twitter healthier?
Yesterday, @jack answered our question on Periscope ð???â?¦ https://t.co/39BtEPNMpL 
 Â« 280 and Threads have certainly allowed more depth in the conversations and a lot more critical thinking. One of tâ?¦ https://t.co/J4NeA74lXi 
 Â« I do think the more space we give people to think and be critical about what they see and express more critical tâ?¦ https://t.co/DLVL6luCII 
 Â« We do believe it has a potential but we donâ??t know and thatâ??s part of the point. [â?¦] We havenâ??t figured out how tâ?¦ https://t.co/Ur5axL1Ywq 
 Â« This is a step in first admitting that we donâ??t know how to measure that, second like, to do find the measurementâ?¦ https://t.co/DHhgzbxtpw 
 See when @jack answers our question or just watch the replay about Twitter's #health ð???
https://

 "The Inspector General already has 1.2million pages of investigation findings, and adding more by the day. The IG Râ?¦ https://t.co/JzSEnJJrxv
[(0, 0.18065663), (4, 0.40685976), (5, 0.08227106), (7, 0.08558709), (16, 0.03462552), (19, 0.20186043)]
1/ I've had a bait of #RINOS like #Rubio, #Graham and @JeffFlake. Mr. Moron can't get re-elected in AZ because of hâ?¦ https://t.co/ot2GQPnwsn 
 3/ #TermLimits. Yesterday while all the #idiots were on the "Sunday Slam the #POTUS45 Shows" little @marcorubio saiâ?¦ https://t.co/pLh4g8VGyU 
 2/ saying because it had gaps in it, the wall was called a fence in AZ. He needs to take his #putrid self to AZ andâ?¦ https://t.co/sBltzdgd6A 
 these #thugs. These people don't represent our values. They are making a mockery out of the Republican party. I hopâ?¦ https://t.co/mRfD8u4FZl 
 4/ @GenFlynn. So before you make your rounds, #LittleMarco, please prepare yourself or keep your #PieHole shut. Welâ?¦ https://t.co/LGOfcyyIou
[(4, 0.97888887)]
Anons alre

 11/ Can you spot S4 and S1 on $URA(nium) ?ð?¤? https://t.co/XQg2NfksDs
[(5, 0.05088458), (15, 0.93547904)]
1/ My theory on why @rogerkver acts the way he does. I think the root cause is that he doesn't understand #Bitcoinâ?¦ https://t.co/SQmcgwHW0H 
 2/ Any politician can make grand promises of solutions to problems, and expect others to make them happen. But builâ?¦ https://t.co/byypaoDhza 
 3/ If Bitcoin is to become a foundation for human and machine peer-to-peer interaction that nobody can control or câ?¦ https://t.co/GiW03rGsln 
 4/ To make it possible for all users of Bitcoin to run a full node, the resource requirements regarding CPU, memoryâ?¦ https://t.co/MW9fJCqpxW 
 5/ Users that don't run a full node are not "peers" in a peer-to-peer network, but must instead rely on trusted thiâ?¦ https://t.co/tBnksDxcRA 
 6/ Running a full node is the only way to send bitcoins and check that you have been payed, without having to relyâ?¦ https://t.co/XWHu1Gc4xO 
 7/ Increasing resource r

 That was Bannon's genius, the dark triad of microtargeting, emotional psychometrics and fake news. It had been trieâ?¦ https://t.co/FBnEBxLXvy
[(4, 0.9884147)]
Hey #NorthDakota #ND
#PrimaryElection JUNE 12, 2018
You donâ??t need to Register
For Primaries, find Candidates for aâ?¦ https://t.co/PzF28hXY0P 
 VOTER Registration / Qualifications in North Dakota #ND
Voter Registration is not required in #NorthDakota but youâ?¦ https://t.co/KiNoM1tp6a 
 ABSENTEE VOTING in #NorthDakota #ND
Request #Absentee Ballot by June 11, 2018
To Vote in June 12 #PrimaryElection
Aâ?¦ https://t.co/DiSWwGNwu7 
 VOTER ID in North Dakota
#VoterID #NorthDakota #ND
ID is required to vote in North Dakota
Driver License Locationsâ?¦ https://t.co/6CewzKMRXE 
 EARLY VOTING in North Dakota
#EarlyVoting #NorthDakota #ND https://t.co/hCvonitPfp 
 Be a Guardian of Democracy 
Be a Poll Worker in North Dakota
#PollWorker #NorthDakota #ND
To sign up, contact yourâ?¦ https://t.co/b8EjU1jrPW 
 Democratic Candidate for Congr
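
The lists printed after each thread are `(topic_id, probability)` pairs. To assign a single label per thread, one simple option is to keep the pair with the highest probability. The sketch below does that with a hypothetical helper, `dominant_topic`, applied to three distributions copied from the output above. Note that `get_document_topics` omits topics below its `minimum_probability` threshold, so the listed probabilities need not sum to 1.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


# Distributions copied from the classification output above
examples = [
    [(16, 0.98020834)],
    [(5, 0.3352982), (14, 0.6154754), (16, 0.034571175)],
    [(2, 0.72253096), (9, 0.2501963)],
]

for doc_topics in examples:
    topic_id, prob = dominant_topic(doc_topics)
    print(topic_id, round(prob, 3))
```

Threads whose mass is spread over several topics, such as the one reported with six topics above, are probably better described by the full distribution than by a single label.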

### Topic analysis

The relationship between the resulting topics can be explored with the pyLDAvis library, which plots the distance between topics.

In [93]:
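import pyLDAvis.gensim  # newer pyLDAvis releases expose this module as pyLDAvis.gensim_models

# Reload the artifacts saved by the megacorpus cell
dictionary = gensim.corpora.Dictionary.load(NDIC)
with open(NCOR, 'rb') as f:
    corpus = pickle.load(f)
lda = gensim.models.ldamodel.LdaModel.load(NMOD)

# sort_topics=False keeps the model's own topic order in the plot
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)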
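
To make "distance between topics" concrete: as far as I know, pyLDAvis by default derives its layout from the Jensen-Shannon divergence between the topics' word distributions. That is an assumption about the library's defaults and should be checked against its documentation. The sketch below computes that divergence for toy distributions in plain Python. The function names and the example vectors are illustrative, not taken from the notebook.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: symmetric, 0 for identical distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy word distributions over a four-word vocabulary
topic_a = [0.7, 0.2, 0.1, 0.0]
topic_b = [0.6, 0.3, 0.1, 0.0]
topic_c = [0.0, 0.1, 0.2, 0.7]

print(jensen_shannon(topic_a, topic_b))  # similar topics: small value
print(jensen_shannon(topic_a, topic_c))  # dissimilar topics: larger value
```

With natural logarithms the divergence is bounded by `log(2)`, which is reached when two topics share no words at all.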