# Topic Modeling using Latent Dirichlet Allocation on Twitter Accounts

(You can find detailed analysis of all the procedures in the document "cmpe-492-midterm.pdf" in this repository.)

In this notebook, we applied a probabilistic topic modeling algorithm called Latent Dirichlet Allocation to detect which topics a specific user tweets about and what those topics consist of.


## Latent Dirichlet Allocation (LDA)

Consider a paper about using data analysis to determine the number of genes an organism needs to survive. Suppose that, by hand, we highlight the words about data analysis in blue, evolutionary biology in pink and genetics in yellow. We would see that the blue, pink and yellow words appear in different proportions. LDA is a statistical model of document collections that tries to capture this intuition. We define a topic to be a distribution over a fixed vocabulary. For instance, the genetics topic assigns high probability to genetics-related words and low probability to data analysis words.

For each document, the words are generated in a two-stage process (Blei, 2012):
1. Randomly choose a distribution over topics.
2. For each word in the document:
  * Randomly choose a topic from the distribution over topics in step 1.
  * Randomly choose a word from the corresponding distribution over the vocabulary.
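
The two-stage process above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the two topics, their word probabilities, the value of `alpha` and the helper names `sample_dirichlet` and `generate_document` are all made up for the example and are not part of the notebook.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one toy document with the two-stage process."""
    # Stage 1: choose this document's distribution over topics
    theta = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(n_words):
        # Stage 2a: choose a topic from the document's topic distribution
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # Stage 2b: choose a word from that topic's distribution over the vocabulary
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics: each one is a distribution over a tiny vocabulary
toy_topics = [
    {"gene": 0.5, "dna": 0.4, "data": 0.1},    # a "genetics"-like topic
    {"data": 0.5, "model": 0.4, "gene": 0.1},  # a "data analysis"-like topic
]

rng = random.Random(0)
theta, words = generate_document(toy_topics, alpha=0.5, n_words=10, rng=rng)
print(theta)
print(words)
```

Training LDA goes in the opposite direction: only the words are observed, and the topics and topic proportions are inferred from them.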
  
We can describe the generative process of LDA formally by the following joint distribution:

$$ p(\beta_{1:K} , \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i)   \prod_{d=1}^{D} p(\theta_d) (\prod_{n=1}^{N}  p(z_{d,n}|\theta_d) p(w_{d,n} | \beta_{1:K},z_{d,n}))$$

where $\beta_{1:K}$ are the topics, each $\beta_k$ being a distribution over the vocabulary; $\theta_d$ are the topic proportions for document $d$, with $\theta_{d,k}$ the proportion of topic $k$ in document $d$; $z_d$ are the topic assignments for document $d$, with $z_{d,n}$ the topic assignment of the $n$th word in document $d$; and $w_d$ are the observed words of document $d$, with $w_{d,n}$ the $n$th word in document $d$, which is an element of a fixed vocabulary. The joint distribution is thus composed of dependent random variables, and these dependencies define LDA.

Ref: David M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84, April 2012.

## Natural Language Processing (NLP)

The language of Twitter is generally close to everyday language. People share their ideas and emotions at any time of the day. Besides plain text, tweets can include hashtags, emoticons, pictures, videos, GIFs, URLs, etc. Even the plain-text part of a tweet may contain misspelled words. In addition, a single user may tweet in several languages; for example, one tweet may be in Turkish and another in English. So we need to clean up the tweets before using them. The applied steps are:

* Remove Twitter accounts that have fewer than 2000 words in their tweets
* Remove URLs
* Tokenization
* Stop word removal
* Remove non-English words from tweets
* Remove non-English accounts
* Delete accounts with fewer than 200 remaining tokens
* Stemming
* Remove words that appear at most 3 times in the whole corpus

Importing the necessary libraries.

In [130]:
import langid
import logging
import nltk
import numpy as np
import re
import os
import sys
import time
from collections import defaultdict
from string import digits
import pyLDAvis.gensim
import pyLDAvis.sklearn
from gensim import corpora, models, similarities, matutils
import networkx as nx
import string
import math

from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import KMeans

from collections import Counter

#### Read and Remove Twitter Accounts that have fewer than 2000 words in their tweets

We have already collected the tweets of 900 random followers of TRTWorld's Twitter account. You can also find the Twitter API code in this repo.

Here we read each user's tweets from a file and save them into a list (tweetsList) if the file contains more than 2000 words.

In [131]:
tweetsList = []
userList = []

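for file in os.listdir("tweets"):
    # os.path.join keeps the path portable across operating systems
    path = os.path.join("tweets", file)
    with open(path, 'r', encoding='utf-8') as f:
        fread = f.read()
    # Keep only accounts with more than 2000 whitespace-separated words
    if len(fread.split()) > 2000:
        tweetsList.append(fread)
        # Drop the last 4 characters (the file extension) to get the user id
        userList.append(file[:-4])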

print(len(tweetsList))
print(len(userList))
print(userList[15])

825
825
106047757


In [39]:
#print(tweetsList[15])

#### Remove URLs

We removed all URLs starting with "http://" or "https://", which also drops the links to pictures, videos, GIFs, etc. The same pattern removes @mentions.
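
As a small illustration of what this step does to a single tweet, here is a sketch using only the standard `re` module. The sample tweet and the name `URL_OR_MENTION` are made up for the example; the pattern mirrors the one used in the cell below.

```python
import re

# Drop @mentions and http:// or https:// links, up to the next whitespace
URL_OR_MENTION = re.compile(r"(?:@|https?://)\S+")

sample = "Great read https://t.co/abc123 via @someone #datascience"  # made-up tweet
cleaned = URL_OR_MENTION.sub("", sample)
print(cleaned)
```

Hashtags are left in place by this pattern; only their `#` sign disappears later, during tokenization.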

In [132]:
def remove_urls(text):
    # Drop @mentions and http:// or https:// links, up to the next whitespace
    return re.sub(r"(?:@|https?://)\S+", "", text)

def doc_rm_urls():
    return [ remove_urls(tweets) for tweets in tweetsList]

tweetsList = doc_rm_urls()

#print(tweetsList[15])

#### Tokenization
Tokenization is the process of splitting text into words, phrases or other meaningful elements called tokens. We use words as our tokens. To process the text and to create a dictionary and a corpus, we converted all tweets to lower case and tokenized them. We used the nltk library with a regular expression tokenizer.
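
To see what a `\w+` pattern does to a tweet, here is a minimal sketch that uses `re.findall` from the standard library as a stand-in for the nltk tokenizer. It assumes the two behave the same for this pattern, and the sample tweet is made up.

```python
import re

sample = "Loving #MachineLearning with Python 3, can't wait!"  # made-up tweet

# Lower-case the text, then keep every run of word characters as one token
tokens = re.findall(r"\w+", sample.lower())
print(tokens)
# ['loving', 'machinelearning', 'with', 'python', '3', 'can', 't', 'wait']
```

Note that "can't" is split into `can` and `t`. Fragments like `t` are one reason why one- and two-character tokens are treated as stop words in the next step.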

In [133]:
# This returns a list of tokens / single words for each user
def tokenize_tweet():
    '''
        Tokenizes the raw text of each document
    '''
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    return [ tokenizer.tokenize(t.lower()) for t in tweetsList]

tokenized_tweets = tokenize_tweet()

# print(tokenized_tweets[15])

#### Stop words
Stop words usually refer to the most common words in a language. Being so common makes them uninformative and sometimes misleading when making decisions, so they are generally filtered out. We used the nltk library to obtain general English stop words, added some words that we chose ourselves, and also added all one- and two-character tokens found in the tweets.

In [134]:
# Remove stop words
stoplist_tw=['amp','get','got','hey','hmm','hoo','hop','iep','let','ooo','par',
            'pdt','pln','pst','wha','yep','yer','aest','didn','nzdt','via',
            'one','com','new','like','great','make','top','awesome','best',
            'good','wow','yes','say','yay','would','thanks','thank','going',
            'new','use','should','could','best','really','see','want','nice',
            'while','know']

# Despite their names, these lists hold one- and two-character tokens (not n-grams)
unigrams = [ t for tweets in tokenized_tweets for t in tweets if len(t)==1]
bigrams  = [ t for tweets in tokenized_tweets for t in tweets if len(t)==2]

stoplist  = set(nltk.corpus.stopwords.words("english") + stoplist_tw + unigrams + bigrams)

tokenized_tweets = [[token for token in tweets if token not in stoplist]
                for tweets in tokenized_tweets]

#print(tokenized_tweets[15])

#### Remove non-English words from tweets
We used the nltk words corpus to remove non-English words from tweets. Tokens that are not purely alphabetic (such as numbers) are kept.

In [135]:
# remove non-english words

words = set(nltk.corpus.words.words())

tokenized_tweets = [[token for token in tweets if token in words or not token.isalpha()]
                for tweets in tokenized_tweets]

#print(tokenized_tweets[15])

#### Remove non-English accounts
This step extends the removal of non-English words. After removing non-English words from tweets, we removed from our corpus the accounts whose tweets are mostly not in English. We used a library called langid to detect English accounts.

In [136]:
# Delete Accounts whose tweets are not majorly in English
print(len(tokenized_tweets))
tokenized_tweets = [tweets for tweets in tokenized_tweets if langid.classify(' '.join(tweets))[0] == 'en']
print(len(tokenized_tweets))

825
820


#### Delete accounts with fewer than 200 remaining tokens
After all this preprocessing, many words have been removed from the original tweets. Accounts that are mostly not in English but still include some English words were affected the most, yet they remained in the corpus. To eliminate these misleading accounts, we deleted accounts with fewer than 200 remaining tokens.

In [137]:
# Delete Accounts whose length of tokenized tweets are less than 200
print(len(tokenized_tweets))
tokenized_tweets = [tweets for tweets in tokenized_tweets if len(tweets) > 200]
print(len(tokenized_tweets))

820
820


#### Stemming
For grammatical reasons, documents use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. The goal of stemming is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. The nltk library has three main stemmers for English: Lancaster, Porter and Snowball. We chose the Snowball stemmer because it uses a more developed algorithm than the Porter stemmer (Snowball is also known as Porter2) and is less aggressive than Lancaster.

In [138]:
# Porter Stemmer and Snowball Stemmer (Porter2) - We used Snowball Stemmer
# http://stackoverflow.com/questions/10554052/what-are-the-major-differences-and-benefits-of-porter-and-lancaster-stemming-alg

#ps = nltk.stem.PorterStemmer()
#print(ps.stem('I am going'))

sno = nltk.stem.SnowballStemmer('english')

tokenized_tweets = [[sno.stem(token) for token in tweets]
          for tweets in tokenized_tweets]

In [11]:
# Sort words in documents
#for tweets in tokenized_tweets:
#    tweets.sort()

### Dictionary and Corpus

To use the preprocessed Twitter data, we need to put it into a shape that topic modeling algorithms understand. The bag-of-words representation is a good fit for these algorithms. We first created a dictionary that maps an integer id to every word in our preprocessed Twitter data. Then we created our corpus. Each element of the corpus corresponds to one Twitter account and consists of tuples, each holding the dictionary id of a word and the number of occurrences of that word in the account. We used the Python library Gensim to create our dictionary and corpus.
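
The bag-of-words construction can be sketched in plain Python. This is a toy illustration of the idea, not Gensim's implementation; the names `toy_docs`, `token2id` and `to_bow` and the sample tokens are made up for the example.

```python
from collections import Counter

# Two toy "accounts", already tokenized
toy_docs = [["data", "learn", "data"], ["robot", "learn"]]

# Dictionary: assign an integer id to every distinct token, in order of appearance
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(token2id)    # {'data': 0, 'learn': 1, 'robot': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded: each account is reduced to how often each dictionary word occurs in it.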

In [143]:
# Build a dictionary where for each document each word has its own id
dictionary = corpora.Dictionary(tokenized_tweets)
dictionary.compactify()

print(len(dictionary))

# Build the corpus: vectors with occurrences of each word for each document
# convert tokenized documents to vectors
corpus = [dictionary.doc2bow(tweets) for tweets in tokenized_tweets]

print(len(corpus))

print(dictionary)

67380
820
Dictionary(67380 unique tokens: ['24x16', 'toyshop', 'f60', 'proctor', 'foundri']...)


#### Remove words that appear at most 3 times in the whole corpus
This step removes outlier words (such as non-English, meaningless or heavily degenerated words) that passed undetected through the earlier preprocessing steps.

In [144]:
# Removing words that appear at most 3 times in the whole corpus

dictCtr = np.zeros(len(dictionary))

for c in corpus:
    for tuples in c:
        dictCtr[tuples[0]] = dictCtr[tuples[0]] + tuples[1]
        
badids = []
for i in range(len(dictCtr)):
    if dictCtr[i] < 4:
        badids.append(i)
        
        
dictionary.filter_tokens(bad_ids=badids)
dictionary.compactify()

corpus = [dictionary.doc2bow(tweets) for tweets in tokenized_tweets]

print(dictionary)

Dictionary(25035 unique tokens: ['steeper', 'foundri', 'apprais', 'protection', 'invinc']...)


In [145]:
tweetList = []

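for c in corpus:
    # Rebuild each account as a plain string, repeating every token by its count
    # (named `text` to avoid shadowing the built-in `str`)
    text = ''
    for tokens in c:
        text = text + ((dictionary[tokens[0]]+' ') * tokens[1])
    tweetList.append(text)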

#print(tweetList[15])
print(len(tweetList))
# tweetList = [' '.join(tweets) for tweets in tokenized_tweets]

820


## Training LDA

We used the Python library Gensim to train an LDA model on our corpus. LDA has three main parameters that need to be tuned, and finding the right values can be considered an art:

* K, the number of topics
* Alpha, which controls how many topics a document is likely to contain. The lower alpha is, the fewer topics per document.
* Beta, which controls how many words a topic is likely to contain. Similarly to alpha, the lower beta is, the fewer words per topic.

Since we are dealing with tweets, we assumed that each follower would tweet about a limited number of topics and therefore set alpha to a low value of 0.001 in some runs (the default value is 1.0/num\_topics). The cell below, with the alpha setting commented out, uses the default. We left beta at its default setting. We tried several different values for the number of topics. Too few topics result in heterogeneous sets of words, while too many diffuse the information, with the same words shared across many topics.
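
The effect of alpha can be checked with a small simulation. The sketch below draws topic proportions from a symmetric Dirichlet prior and reports the average share of the largest topic. It is a toy experiment independent of Gensim; the helper names, the alpha values and the number of trials are arbitrary choices for illustration.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, num_topics=10, trials=2000, seed=0):
    """Average share of the dominant topic under a symmetric Dirichlet(alpha)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(sample_dirichlet([alpha] * num_topics, rng))
    return total / trials

for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(mean_largest_share(alpha), 3))
```

With a low alpha, a single topic tends to take most of the probability mass of a document; with a high alpha, the mass is spread almost evenly over all topics.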

In [53]:
#lda_params = {'num_topics': 10, 'passes': 20, 'alpha': 0.001}
lda_params = {'num_topics': 10, 'passes': 20}


print("Running LDA with: %s  " % lda_params)
#lda = models.LdaModel(corpus, id2word=dictionary,
#                        num_topics=lda_params['num_topics'],
#                        passes=lda_params['passes'],
#                        alpha = lda_params['alpha'])

lda = models.LdaModel(corpus, id2word=dictionary,
                        num_topics=lda_params['num_topics'],
                        passes=lda_params['passes'])
print()
lda.print_topics()

Running LDA with: {'passes': 20, 'num_topics': 10}  



[(0,
  '0.095*"data" + 0.040*"analyt" + 0.024*"big" + 0.020*"learn" + 0.011*"scienc" + 0.011*"busi" + 0.009*"cloud" + 0.009*"machin" + 0.008*"intellig" + 0.006*"secur"'),
 (1,
  '0.021*"learn" + 0.012*"data" + 0.011*"python" + 0.010*"deep" + 0.007*"work" + 0.007*"machin" + 0.006*"paper" + 0.006*"neural" + 0.005*"time" + 0.005*"code"'),
 (2,
  '0.017*"stem" + 0.014*"learn" + 0.011*"today" + 0.009*"check" + 0.009*"code" + 0.009*"day" + 0.009*"love" + 0.008*"help" + 0.007*"educ" + 0.007*"work"'),
 (3,
  '0.013*"market" + 0.013*"twitter" + 0.011*"tech" + 0.010*"busi" + 0.009*"social" + 0.009*"way" + 0.008*"digit" + 0.008*"data" + 0.008*"need" + 0.007*"join"'),
 (4,
  '0.015*"robot" + 0.007*"world" + 0.007*"news" + 0.006*"industri" + 0.006*"tech" + 0.005*"research" + 0.005*"futur" + 0.005*"analysi" + 0.005*"market" + 0.005*"press"'),
 (5,
  '0.009*"work" + 0.009*"time" + 0.007*"think" + 0.007*"peopl" + 0.006*"day" + 0.005*"need" + 0.005*"love" + 0.005*"look" + 0.004*"well" + 0.004*"much"'),

### Visualization of LDA

The output of the LDA model gives us a lot of useful information: word distributions over topics and topic distributions over users. However, this information is hard to read and interpret directly. Fortunately, we found a library called LDAvis (used here through its Python port, pyLDAvis) to explore and interpret the results of LDA. LDAvis maps topic similarity by calculating a semantic distance between topics (via Jensen-Shannon divergence).
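
For reference, the Jensen-Shannon divergence between two topics (each a distribution over the same vocabulary) can be computed as in the sketch below. This is the textbook definition written in plain Python, not the code pyLDAvis runs internally; the three toy topics are made up for the example.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_divergence(p, q):
    """Symmetric divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy topics over a 4-word vocabulary
topic_a = [0.7, 0.2, 0.1, 0.0]
topic_b = [0.6, 0.3, 0.1, 0.0]  # close to topic_a
topic_c = [0.0, 0.1, 0.2, 0.7]  # very different from topic_a

print(jensen_shannon_divergence(topic_a, topic_b))
print(jensen_shannon_divergence(topic_a, topic_c))
```

Unlike the plain KL divergence, this measure is symmetric and stays finite when one topic gives zero probability to a word; with base-2 logarithms it lies between 0 and 1.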

In this part, you can view all our trials with different parameters and different preprocessing steps. Check the change log below to better understand the differences. The top graphic is the latest, while the bottom one is our first trial.

### Change Log

The numbers after the dictionary, corpus and LDA file names correspond to the change log number.

dictionary = "TRTWORLD_Followers_5-7(10).dict"
corpus = "TRTWORLD_Followers_5-7(10).mm"
lda = "TRTWORLD_Followers_5-7_25-20-0001.lda"

1: Faulty train. (40 Topics)
2: First correct train with full data. (40 Topics)
3: Remove words that appear at most 1 time in the whole corpus. (40 Topics)
4: Remove words that appear at most 2 times in the whole corpus. (40 Topics)
5: Remove words that appear at most 1 time in the whole corpus. (20 Topics)
6: Added a step deleting accounts whose tokenized tweets contain fewer than 200 tokens. (30 Topics)
7: Added Snowball Stemmer (Porter2). (30 Topics)
8: (20 Topics)
9: Remove words that appear at most 2 times in the whole corpus. (30 Topics)
10: Remove words that appear at most 3 times in the whole corpus. (25 Topics)
11: (15 Topics)
12: (20 Topics)

In [54]:
# Save Data
dictionary.save('Burak(2).dict')
corpora.MmCorpus.serialize('Burak(2).mm', corpus)
lda.save("Burak(2).lda")

In [55]:
# Loaded Data
# dictionary.save('Burak(2).dict')
# corpora.MmCorpus.serialize('Burak(2).mm', corpus)
# lda.save("Burak(2).lda")
# lda_params = {'num_topics': 10, 'passes': 20}

followers_data =  pyLDAvis.gensim.prepare(lda,corpus, dictionary)
pyLDAvis.display(followers_data)

In [22]:
# Loaded Data
# dictionary.save('Burak(1).dict')
# corpora.MmCorpus.serialize('Burak(1).mm', corpus)
# lda.save("Burak(1).lda")
# lda_params = {'num_topics': 7, 'passes': 20}

followers_data =  pyLDAvis.gensim.prepare(lda,corpus, dictionary)
pyLDAvis.display(followers_data)

## Alternative: Word2Vec Word Clustering (optional)

In [154]:
wordModel = models.Word2Vec(tokenized_tweets, size=30, window=5, min_count=3, workers=4)

print(wordModel)

Word2Vec(vocab=28932, size=30, alpha=0.025)


In [155]:
#print(len(wordModel.wv.index2word))
vocab = wordModel.wv.index2word
wordvectors = wordModel.wv[vocab]

In [156]:
kmeansList = np.asarray(wordvectors)

kmeans = KMeans(n_clusters=500).fit(kmeansList)

In [157]:
clusters = {}
labels = {}
centers = []
inVocab = {}

for i in range(0,500):
    clusters[i] = []

for i, label in enumerate(kmeans.labels_):
    clusters[label].append(vocab[i])
    labels[vocab[i]] = label
    
# Represent each cluster by the vocabulary word closest to its centroid
for c in kmeans.cluster_centers_:
    centers.append(wordModel.wv.similar_by_vector(c)[0][0])
    
for v in vocab:
    inVocab[v] = 1

In [158]:
# Change words in tweets with their cluster center words
tweets2 = [[centers[labels[r]] for r in row if r in inVocab]
          for row in tokenized_tweets]

In [159]:
# Build a dictionary where for each document each word has its own id
dictionaryVW = corpora.Dictionary(tweets2)
dictionaryVW.compactify()

print(len(dictionaryVW))

# Build the corpus: vectors with occurrences of each word for each document
# convert tokenized documents to vectors
corpusVW = [dictionaryVW.doc2bow(tweets) for tweets in tweets2]

print(len(corpusVW))

print(dictionaryVW)

495
820
Dictionary(495 unique tokens: ['daughter', 'das', 'pretenti', 'uncov', 'facil']...)


In [160]:
# Normalize each word count by dividing it by the number of words in its cluster (rounded up)
corpusVW2 = [[(r[0], int(math.ceil(r[1]/ len(clusters[labels[dictionaryVW[r[0]]]]))) ) for r in row]
          for row in corpusVW]

In [161]:
tweetListVW = []

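for c in corpusVW2:
    # Rebuild each account as a plain string, repeating every token by its count
    # (named `text` to avoid shadowing the built-in `str`)
    text = ''
    for tokens in c:
        text = text + ((dictionaryVW[tokens[0]]+' ') * tokens[1])
    tweetListVW.append(text)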

#print(tweetListVW[15])
print(len(tweetListVW))
# tweetListVW = [' '.join(tweets) for tweets in tokenized_tweets]

820


In [18]:
#lda_params = {'num_topics': 10, 'passes': 20, 'alpha': 0.001}
lda_params = {'num_topics': 10, 'passes': 20}


print("Running LDA with: %s  " % lda_params)
#lda = models.LdaModel(corpusVW2, id2word=dictionaryVW,
#                        num_topics=lda_params['num_topics'],
#                        passes=lda_params['passes'],
#                        alpha = lda_params['alpha'])

lda = models.LdaModel(corpusVW2, id2word=dictionaryVW,
                        num_topics=lda_params['num_topics'],
                        passes=lda_params['passes'])
print()
lda.print_topics()

Running LDA with: {'passes': 20, 'num_topics': 10}  



[(0,
  '0.171*"analyt" + 0.035*"learn" + 0.028*"social" + 0.021*"scientist" + 0.015*"next" + 0.015*"driven" + 0.015*"custom" + 0.015*"media" + 0.014*"technolog" + 0.013*"real"'),
 (1,
  '0.220*"analyt" + 0.046*"intellig" + 0.044*"driven" + 0.039*"learn" + 0.025*"cloud" + 0.025*"daili" + 0.024*"artifici" + 0.019*"technolog" + 0.016*"biz" + 0.015*"custom"'),
 (2,
  '0.053*"learn" + 0.049*"open" + 0.045*"analyt" + 0.037*"spark" + 0.024*"next" + 0.021*"last" + 0.020*"sourc" + 0.020*"payrol" + 0.017*"languag" + 0.015*"anaconda"'),
 (3,
  '0.053*"analyt" + 0.028*"code" + 0.025*"exploratori" + 0.024*"open" + 0.021*"done" + 0.018*"learn" + 0.017*"scientist" + 0.014*"look" + 0.014*"paper" + 0.012*"sourc"'),
 (4,
  '0.306*"drone" + 0.037*"vision" + 0.029*"flight" + 0.022*"comput" + 0.020*"dtk12chat" + 0.015*"daili" + 0.015*"unman" + 0.013*"coverag" + 0.012*"technolog" + 0.011*"inspire1"'),
 (5,
  '0.137*"robot" + 0.018*"engin" + 0.017*"3dprint" + 0.016*"learn" + 0.016*"technolog" + 0.016*"design

In [19]:
# Save Data
# Save the word2vec-based dictionary and normalized corpus this model was trained on
dictionaryVW.save('Burak(3).dict')
corpora.MmCorpus.serialize('Burak(3).mm', corpusVW2)
lda.save("Burak(3).lda")

In [20]:
# Corpus2
# Normalized version

followers_data =  pyLDAvis.gensim.prepare(lda,corpusVW2, dictionaryVW)
pyLDAvis.display(followers_data)

In [132]:
# Normal Corpus
# Not normalized version

followers_data =  pyLDAvis.gensim.prepare(lda,corpusVW, dictionaryVW)
pyLDAvis.display(followers_data)

## sklearn NMF - LDA

In [150]:
n_samples = len(tweetList)
n_features = len(dictionary)
n_topics = 10
n_top_words = 20

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

In [151]:
# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features)
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(tweetList)
print("done in %0.3fs." % (time() - t0))

# Fit the NMF model
print("Fitting the NMF model with tf-idf features, " "n_samples=%d and n_features=%d..." % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

#http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb#topic=0&lambda=1&term=
nmf_vis_data = pyLDAvis.sklearn.prepare(nmf, tfidf, tfidf_vectorizer)
pyLDAvis.display(nmf_vis_data)

Extracting tf-idf features for NMF...
done in 8.397s.
Fitting the NMF model with tf-idf features, n_samples=820 and n_features=25035...
done in 2.937s.

Topics in NMF model:
Topic #0:
trump sure yeah thing someth though actual pretti never write book pleas alway hope mayb might hard bad idea research
Topic #1:
analyt cloud market intellig machin 2017 predict artifici custom secur digit manag enterpris technolog strategi valu innov industri key transform
Topic #2:
deep neural machin paper intellig research artifici generat model convolut reinforc comput imag recurr nips2016 network recognit infer workshop algorithm
Topic #3:
robot industri autonom pepper 3dprint human booth pour humanoid kit artifici intellig weld gripper ces2016 technolog prosthet collabor abb industrie40
Topic #4:
python notebook tutori statist analysi anaconda machin scipy2015 introduct visual scientist regress pip git spark sourc instal analyt cluster guid
Topic #5:
stem educ school classroom teach steam maker compu

In [152]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features)
t0 = time()
tf = tf_vectorizer.fit_transform(tweetList)
print("done in %0.3fs." % (time() - t0))

print("Fitting LDA models with tf features, " "n_samples=%d and n_features=%d..." % (n_samples, n_features))
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=20, learning_method='online', learning_offset=50., random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

lda_vis_data = pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)
pyLDAvis.display(lda_vis_data)

Extracting tf features for LDA...
done in 8.991s.
Fitting LDA models with tf features, n_samples=820 and n_features=25035...
done in 79.714s.

Topics in LDA model:
Topic #0:
entrepreneur techinclusion15 b2bmarket publish talent compani profil industri unlimit hire aas229 scientist valuabl small ia2015 market empower employ astronomi never
Topic #1:
stem educ school teach maker class inspir scratch comput workshop program classroom pleas student creativ summer communiti regist robot club
Topic #2:
analyt market cloud intellig 2017 secur digit technolog innov industri daili social manag custom valu strategi mobil artifici key 2016
Topic #3:
print 3dprint robot raspberri kit maker pleas appl home wait ship weekend board sorri littl old order car light custom
Topic #4:
sure paper trump thing actual someth pretti write though research never yeah alway mayb hard book might point bad idea
Topic #5:
robot control electron system bot shield arm industri kit sensor board vision motion technolog 

### sklearn NMF-LDA with word2vec

In [162]:
n_samples = len(tweetListVW)
n_features = len(dictionaryVW)
n_topics = 10
n_top_words = 20

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

In [163]:
# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features)
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(tweetListVW)
print("done in %0.3fs." % (time() - t0))

# Fit the NMF model
print("Fitting the NMF model with tf-idf features, " "n_samples=%d and n_features=%d..." % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

#http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb#topic=0&lambda=1&term=
nmf_vis_data = pyLDAvis.sklearn.prepare(nmf, tfidf, tfidf_vectorizer)
pyLDAvis.display(nmf_vis_data)

Extracting tf-idf features for NMF...
done in 1.462s.
Fitting the NMF model with tf-idf features, n_samples=820 and n_features=495...
done in 0.360s.

Topics in NMF model:
Topic #0:
deep generat comput machin languag convolut text tutori recognit variat classif sourc gradient word detect physic self review sentiment nips2016
Topic #1:
old night hous ago offic sourc vote music away white review word media earli money deserv servic stay comput area
Topic #2:
robot shield booth 3dprint bot camera servo iste2016 driverless tutori sourc drive virtual comput stand assist wireless sale solar educ
Topic #3:
custom intellig servic sap drive retail booth survey media regist artifici self leadership chief analyst reach guid financi optim offic
Topic #4:
python tutori machin sourc regress scipy2015 languag text instal cheat guid classif git markdown seri math sheet docker stack linear
Topic #5:
apach sourc impala docker booth sap machin regist servic stack earli lake area graph python announc seri

In [164]:
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features)
t0 = time()
tf = tf_vectorizer.fit_transform(tweetListVW)
print("done in %0.3fs." % (time() - t0))

print("Fitting LDA models with tf features, " "n_samples=%d and n_features=%d..." % (n_samples, n_features))
lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=20, learning_method='online', learning_offset=50., random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

lda_vis_data = pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)
pyLDAvis.display(lda_vis_data)

Extracting tf features for LDA...
done in 1.249s.
Fitting LDA models with tf features, n_samples=820 and n_features=495...
done in 16.011s.

Topics in LDA model:
Topic #0:
apach sourc machin booth shield impala tutori hoy splice python docker regist area announc earli east tabl instal offic seri
Topic #1:
school educ math classroom iste2016 summer lab regist earli comput night gift graduat award booth child chi anniversari hous music
Topic #2:
intellig artifici payrol machin realiti virtual fiction self augment servic rift deep drive assist disrupt leadership sap cancer biolog financi
Topic #3:
machin python deep languag tutori comput generat intellig sourc convolut text artifici regress recognit optim detect seri classif apach money
Topic #4:
comput tableau club classroom school math virtual vote feedback tune languag reach guid tabl self educ recognit seri inspire16 custom
Topic #5:
old night ago hous away music white offic review vote sourc comput word media drive deserv earli self 