This notebook applies two topic-modelling algorithms to posts from Data Science Stack Exchange. <br>
The first is LDA (latent Dirichlet allocation) with 15 topics; the second is NMF (non-negative matrix factorization) with 15 topics.

# Index
- Read the Data Science Stack Exchange database
- Run LDA and show the most probable words in each of the 15 topics
- Run NMF and show the highest-weighted words in each of the 15 topics

# Warning!
<b> Some of the files needed to run this notebook are not included in this repository! </b>

In [None]:
#%matplotlib inline

import pylab

import xml.etree.ElementTree as ET
import json
import pandas as pd
import re
import numpy as np
import itertools

# visualization
import matplotlib
from matplotlib import rc
import matplotlib.pyplot as plt
import seaborn as sns

#os stuff
import os
from os import listdir
from os.path import isfile, join

plt.style.use('seaborn-white')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['legend.fontsize'] = 'medium'
matplotlib.rcParams['figure.titlesize'] = 'small'


pd.set_option('display.max_colwidth', None)  # show full cell contents (pandas >= 1.0; older versions used -1)

pylab.ion()

### Some functions to clean the data

In [2]:
def re_link(s):
    return re.sub(r'<http\S+>', ' <link> ', s)

def re_emoji(s):
    return re.sub(r':\S+:', ' <emoji> ', s)

def re_html(s):
    return re.sub(r'<\S+>', ' ', s)

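# Strip HTML tags, including tags that carry attributes (e.g. <a href="...">).
# An earlier variant using r'<[^>]+>' was immediately overridden by this definition.
def re_html_plus(s):
    return re.sub(r'<.+?>', ' ', s)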
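
The difference between `re_html` and `re_html_plus` is easiest to see on a tag that carries an attribute. The snippet below is a minimal sketch that redefines the two helpers so it runs on its own; the sample string is made up for illustration and does not come from the dataset.

```python
import re

def re_html(s):
    # only matches tags that contain no whitespace
    return re.sub(r'<\S+>', ' ', s)

def re_html_plus(s):
    # lazy match up to the next '>' on the same line
    return re.sub(r'<.+?>', ' ', s)

# Hypothetical sample with an attribute inside the <a> tag
sample = '<p>See <a href="x">this</a></p>'

simple = re_html(sample)       # leaves the <a href="x"> tag in place
full = re_html_plus(sample)    # removes every tag
clean = ' '.join(full.split())  # collapse the leftover whitespace

print(repr(simple))
print(repr(full))
print(repr(clean))  # 'See this'
```

Because `re_html_plus` replaces each tag with a space, the cleaned text keeps runs of blanks; collapsing them as in the last step is optional and is not done in the notebook.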

# Read DS SE

In [3]:
df_SE=pd.read_json('DSSE_Posts.json')
data_SE=df_SE.copy() #contains posts titles, bodies, answers and dates, BUT NO TAGS

df_SE_tags=pd.read_json('DSSE_History.json')
data_SE_tags=df_SE_tags.copy() #contains posts titles, dates AND TAGS

In [4]:
#Add tags to df_SE, matching on the date the post is first published
is_tag_row = data_SE_tags.time.isnull()  # rows without a timestamp hold the tags
df_tags = pd.merge(data_SE_tags[is_tag_row], data_SE_tags[~is_tag_row], on='ID')
df_tags = df_tags.drop('time_x', axis=1)
df_tags.rename(columns={'text_x': 'tags', 'text_y': 'text','time_y': 'time'}, inplace=True)
times_to_tags = dict(zip(df_tags['time'].values,df_tags['tags'].values))
data_SE['tags'] = data_SE.time.map(times_to_tags)
data_SE = data_SE.drop('time', axis=1)
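
The tag-matching step can be sketched on a tiny table. The frames below are hypothetical stand-ins for the history and posts files (their real layout is assumed from the column names used above): tag rows have no timestamp, title rows do, and the timestamp is then used as the lookup key.

```python
import pandas as pd

# Hypothetical miniature of the history table: tag rows have no timestamp
history = pd.DataFrame({
    'ID':   [5, 5, 7, 7],
    'text': ['<machine-learning>', 'Title A', '<nlp><python>', 'Title B'],
    'time': [None, '2014-05-13T23:58:30', None, '2014-05-14T00:11:06'],
})

# Pair each tag row with the timestamped row that shares its ID
is_tag_row = history['time'].isnull()
pairs = pd.merge(history[is_tag_row], history[~is_tag_row], on='ID')
pairs = pairs.drop(columns='time_x').rename(
    columns={'text_x': 'tags', 'text_y': 'text', 'time_y': 'time'})

# Lookup table from publication time to tag string
times_to_tags = dict(zip(pairs['time'], pairs['tags']))

# Hypothetical posts table; the last post has no matching timestamp
posts = pd.DataFrame({
    'ID':   [5, 7, 9],
    'time': ['2014-05-13T23:58:30', '2014-05-14T00:11:06', '2014-05-15T10:00:00'],
})
posts['tags'] = posts['time'].map(times_to_tags)
print(posts)
```

Posts whose timestamp is absent from the lookup end up with a missing value in `tags`, which is why the notebook later filters on `tags.notnull()`.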

In [5]:
data_SE.head(1)

Unnamed: 0,ID,text,type,tags
0,5,How can I do simple machine learning without hard-coding behavior?,title,<machine-learning>


In [6]:
#Keep the ID and tags of the rows that carry tags; texts sharing an ID are joined in the next cell
df_ID_tags = data_SE[data_SE.tags.notnull()][['ID','tags']]

In [7]:
data_SE['text'] = data_SE['text'].apply(lambda x : ' '+ x )
s_sum_text = data_SE[['ID','text']].groupby(['ID']).text.sum()
df_sum_text = pd.DataFrame({'ID':s_sum_text.index, 'text':s_sum_text.values})
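
As an illustration of the concatenation step, the sketch below joins all texts that share an ID on a made-up three-row frame. It uses `agg(' '.join)` instead of prepending a space and summing, which is an alternative formulation rather than the notebook's exact code.

```python
import pandas as pd

# Hypothetical rows: a title and a body for post 5, a title for post 7
posts = pd.DataFrame({
    'ID':   [5, 5, 7],
    'text': ['Title', '<p>Body</p>', 'Other title'],
})

# One row per ID, with the texts joined by a single space
joined = posts.groupby('ID')['text'].agg(' '.join).reset_index()
print(joined)
```

Unlike the prepend-and-sum version, the joined string has no leading space; for the tokenization used later this makes no difference.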

In [8]:
df_sum_text.head(1)

Unnamed: 0,ID,text
0,5,"How can I do simple machine learning without hard-coding behavior? <p>I've always been interested in machine learning, but I can't figure out one thing about starting out with a simple ""Hello World"" example - how can I avoid hard-coding behavior?</p>\n\n<p>For example, if I wanted to ""teach"" a bot how to avoid randomly placed obstacles, I couldn't just use relative motion, because the obstacles move around, but I don't want to hard code, say, distance, because that ruins the whole point of machine learning.</p>\n\n<p>Obviously, randomly generating code would be impractical, so how could I do this?</p>\n <p>Not sure if this fits the scope of this SE, but here's a stab at an answer anyway.</p>\n\n<p>With all AI approaches you have to decide what it is you're modelling and what kind of uncertainty there is. Once you pick a framework that allows modelling of your situation, you then see which elements are ""fixed"" and which are flexible. For example, the model may allow you to define your own network structure (or even learn it) with certain constraints. You have to decide whether this flexibility is sufficient for your purposes. Then within a particular network structure, you can learn parameters given a specific training dataset.</p>\n\n<p>You rarely hard-code behavior in AI/ML solutions. It's all about modelling the underlying situation and accommodating different situations by tweaking elements of the model.</p>\n\n<p>In your example, perhaps you might have the robot learn how to detect obstacles (by analyzing elements in the environment), or you might have it keep track of where the obstacles were and which way they were moving.</p>\n"


In [9]:
df_text_tags = pd.merge(df_ID_tags, df_sum_text, on='ID')
df_text_tags['text'] = df_text_tags['text'].apply(re_html_plus)  # strip HTML tags
df_text_tags['length'] = df_text_tags['text'].apply(len)  # number of characters

Look for the most used tags

In [10]:
tags_val =df_text_tags.tags.values
alltags = [tag for stringtags in tags_val for tag in re.findall('<(.*?)>', stringtags) ]
alltags_table = [re.findall('<(.*?)>', _) for _ in tags_val]
n_tags = [len(_) for _ in alltags_table]

In [11]:
def tag_list(tags):
    # split a tag string such as '<nlp><python>' into ['nlp', 'python']
    return re.findall('<(.*?)>', tags)

In [12]:
from collections import Counter

tag_count = Counter(alltags)
dict_tag_count  = dict(tag_count)
tag_count_sorted = sorted(dict_tag_count.items(), key=lambda x: x[1], reverse = True)
sub_tags=dict(tag_count_sorted[:20]).keys()

In [13]:
#The 20 most used tags in DS_SE
sub_tags

[u'clustering',
 u'nlp',
 u'statistics',
 u'feature-selection',
 u'machine-learning',
 u'classification',
 u'keras',
 u'python',
 u'data-mining',
 u'scikit-learn',
 u'bigdata',
 u'dataset',
 u'deep-learning',
 u'r',
 u'neural-network',
 u'text-mining',
 u'time-series',
 u'tensorflow',
 u'predictive-modeling',
 u'regression']
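
For reference, the tag extraction and counting can be sketched on a few made-up tag strings. `Counter.most_common` returns the tags already ordered by frequency, which may be convenient when the ranking itself matters; the tag strings here are hypothetical.

```python
import re
from collections import Counter

# Hypothetical tag strings in the '<tag1><tag2>' format used by the dataset
tag_strings = [
    '<machine-learning><python>',
    '<python><nlp>',
    '<python><machine-learning>',
]

# Flatten all tags into one list, then count them
all_tags = [tag for s in tag_strings for tag in re.findall(r'<(.*?)>', s)]
top_two = Counter(all_tags).most_common(2)
print(top_two)  # [('python', 3), ('machine-learning', 2)]
```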

# LDA for Data Science Stack Exchange (gensim)

In [30]:
texts  = df_text_tags.text.values

In [31]:
from sklearn.model_selection import train_test_split
texts_train, texts_test = train_test_split(texts, test_size=0.2, random_state=42)

In [32]:
import string
import gensim
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim.parsing.preprocessing import preprocess_string,strip_tags,strip_punctuation, remove_stopwords,strip_numeric
import pyLDAvis.gensim
import nltk

pyLDAvis.enable_notebook()

# spacy for lemmatization
import spacy

In [33]:
# stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
# single letters are treated as stop words, except 'k' and 'r', which are kept
stop_words.extend(['q', 'w', 'e', 't', 'y',
                   'u', 'i', 'o', 'p', 'l', 'j', 'h', 'g', 'f', 'd', 's', 'a',
                   'z', 'x', 'c', 'v', 'b', 'n', 'm'])

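# stemmer / lemmatizer (the variable keeps the name `stemmer`, but it holds a lemmatizer)
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
#stemmer = SnowballStemmer("english")
stemmer = WordNetLemmatizer()

# NOTE: this shadows the remove_stopwords imported from gensim.parsing.preprocessing above
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

# NOTE: bigram_mod is not defined in the cells shown here; it must exist before this helper is called
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]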

In [34]:
def parser(text):
    #tokens = remove_special(text)
    # lowercase, strip HTML tags and punctuation, then split into tokens
    CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_punctuation]
    tokens = preprocess_string(text, CUSTOM_FILTERS)
    # drop standalone numbers that are followed by whitespace
    tokens = re.sub(r'\b\d+(?:\.\d+)?\s+', '', ' '.join(tokens)).split()
    # drop stop words
    tokens = [x for x in tokens if x not in stop_words]

    return tokens

In [35]:
#tokenize corpus and create dictionary
tokenized_texts = [parser(_) for _ in texts]
dict_DSSE = corpora.Dictionary(tokenized_texts)
dict_DSSE.filter_extremes(no_below=10, no_above=0.9)
corpus = [dict_DSSE.doc2bow(_) for _ in tokenized_texts]
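
The effect of `filter_extremes` and `doc2bow` can be illustrated without gensim. The sketch below is a plain-Python approximation of the idea (keep tokens that occur in at least `no_below` documents and in at most a fraction `no_above` of them, then count token ids per document); it is not gensim's implementation, and the toy documents and the smaller thresholds are assumptions chosen so the example fits on a few lines.

```python
from collections import Counter

# Hypothetical tokenized documents
docs = [
    ['data', 'model', 'python'],
    ['data', 'model'],
    ['data', 'keras'],
    ['data', 'model', 'spark'],
]
no_below, no_above = 2, 0.9  # the notebook uses no_below=10, no_above=0.9

# Document frequency: in how many documents does each token occur?
doc_freq = Counter(tok for doc in docs for tok in set(doc))

# Keep tokens that are neither too rare nor too common
vocab = sorted(tok for tok, df in doc_freq.items()
               if df >= no_below and df / len(docs) <= no_above)
token2id = {tok: i for i, tok in enumerate(vocab)}

def doc2bow(doc):
    # sparse bag-of-words: sorted list of (token_id, count) pairs
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

corpus = [doc2bow(doc) for doc in docs]
print(vocab)   # ['model']
print(corpus)  # [[(0, 1)], [(0, 1)], [], [(0, 1)]]
```

Here `data` is dropped because it appears in every document and the remaining rare tokens are dropped because they appear only once, so a document can end up empty after filtering.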

In [36]:
#dict_DSSE.save('dict_DSSE_10plus.gensim')
#corpora.Dictionary.load('dict_stemmed_DSSE.gensim') #To reload dictionary
print('Dictionary size: {}'.format(len(dict_DSSE)))

Dictionary size: 7482


In [37]:
lengths = [len(_) for _ in corpus]

In [38]:
lda = models.LdaModel(corpus, num_topics=15, id2word=dict_DSSE, chunksize=200, passes=10)

In [39]:
# topics visualization
p = pyLDAvis.gensim.prepare(lda, corpus, dict_DSSE, sort_topics=False)
pyLDAvis.save_html(p, 'lda_test.html')

In [40]:
def get_lda_topics(model, num_topics):
    # one column per topic, holding its 20 most probable words
    word_dict = {}
    for i in range(num_topics):
        words = model.show_topic(i, topn=20)
        words = [_[0] for _ in words]
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = words
    return pd.DataFrame(word_dict)

Show the 20 most significant words for each topic

In [41]:
get_lda_topics(lda, 15)

Unnamed: 0,Topic # 01,Topic # 02,Topic # 03,Topic # 04,Topic # 05,Topic # 06,Topic # 07,Topic # 08,Topic # 09,Topic # 10,Topic # 11,Topic # 12,Topic # 13,Topic # 14,Topic # 15
0,data,python,https,model,words,state,features,input,data,class,image,function,learning,gt,file
1,time,r,com,train,text,amp,feature,network,would,classification,images,k,data,np,spark
2,values,using,generator,test,word,learning,model,output,one,classes,cnn,sum,machine,import,files
3,series,code,paper,data,vector,probability,regression,tf,use,label,img,gradient,deep,lt,py
4,value,use,http,training,embedding,action,linear,layer,like,cluster,width,frac,science,self,memory
5,variables,tensorflow,org,set,similarity,distribution,tree,batch,problem,labels,feature,mean,interest,print,data
6,variable,graph,html,accuracy,using,reward,decision,keras,could,classifier,object,value,learn,df,lib
7,categorical,data,google,validation,vectors,beta,models,size,need,clustering,features,partial,algorithms,plt,site
8,one,like,github,dataset,sentence,states,using,neural,different,k,red,matrix,course,array,line
9,column,want,www,loss,document,value,use,activation,want,score,size,cost,company,return,hadoop


# Non-negative matrix factorization

In [43]:
texts  = df_text_tags.text.values

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.preprocessing import normalize
from sklearn.decomposition import NMF

In [45]:
def parser(text): #,stem): 
    #tokens = remove_special(text)
    CUSTOM_FILTERS = [lambda x: x.lower(), strip_tags, strip_punctuation]
    tokens = preprocess_string(text, CUSTOM_FILTERS)
    tokens = re.sub(r'\b\d+(?:\.\d+)?\s+', '', ' '.join(tokens)).split()
    tokens = [x for x in tokens if x not in stop_words]
    #stemming
    #tokens = [stemmer.lemmatize(x) for x in tokens]
    #tokens = [re.sub("\'", "",re.sub('\S*@\S*\s?', '', word.lower())) for word in nltk.word_tokenize(text)]
    #tokens = [x for x in tokens if not x in string.punctuation]
    return tokens

In [46]:
texts_NMF = [' '.join(parser( _ )) for _ in texts]

In [47]:
vect = CountVectorizer(analyzer='word', max_df=0.9, min_df=10, max_features=5000)
vect_model = vect.fit(texts_NMF)
x_counts = vect_model.transform(texts_NMF)

In [48]:
transformer = TfidfTransformer(smooth_idf=False)
tfidf_model = transformer.fit(x_counts)
x_tfidf = tfidf_model.transform(x_counts)
x_tfidf_norm = normalize(x_tfidf, norm='l1', axis=1)

In [49]:
#obtain an NMF model.
num_topics=15
model_NMF = NMF(n_components=num_topics, init='nndsvd')
#fit the model
model_NMF.fit(x_tfidf_norm)

NMF(alpha=0.0, beta_loss='frobenius', init='nndsvd', l1_ratio=0.0,
  max_iter=200, n_components=15, random_state=None, shuffle=False,
  solver='cd', tol=0.0001, verbose=0)
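
To make the shapes involved concrete, here is a small sketch of the same count, TF-IDF, L1-normalization and NMF chain on four made-up sentences. The toy corpus, the choice of two components and `max_iter=500` are assumptions for illustration; the factor names `doc_topic` and `topic_word` are not used in the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import normalize
from sklearn.decomposition import NMF

# Hypothetical toy corpus with 10 distinct words
toy_docs = [
    "neural network layer keras",
    "keras layer activation network",
    "cluster distance kmeans points",
    "kmeans cluster points similarity",
]

counts = CountVectorizer().fit_transform(toy_docs)
tfidf = TfidfTransformer(smooth_idf=False).fit_transform(counts)
tfidf_l1 = normalize(tfidf, norm='l1', axis=1)  # each row sums to 1

nmf = NMF(n_components=2, init='nndsvd', max_iter=500)
doc_topic = nmf.fit_transform(tfidf_l1)  # documents x topics
topic_word = nmf.components_             # topics x vocabulary

print(doc_topic.shape, topic_word.shape)  # (4, 2) (2, 10)
```

Both factors are non-negative, so each row of `topic_word` can be read as a set of word weights for one topic, which is what the topic table further down displays.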

In [50]:
def get_nmf_topics(model, n_top_words):
    
    #the word ids obtained need to be reverse-mapped to the words so we can print the topic names.
    #(scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2)
    feat_names = vect.get_feature_names()
    
    word_dict = {}
    for i in range(model.components_.shape[0]):
        
        #for each topic, obtain the n_top_words largest values, and add the words they map to into the dictionary.
        words_ids = model.components_[i].argsort()[:-n_top_words - 1:-1]
        words = [feat_names[key] for key in words_ids]
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = words
    
    return pd.DataFrame(word_dict)

Show the 20 highest-weighted words for each topic

In [51]:
get_nmf_topics(model_NMF, 20)

Unnamed: 0,Topic # 01,Topic # 02,Topic # 03,Topic # 04,Topic # 05,Topic # 06,Topic # 07,Topic # 08,Topic # 09,Topic # 10,Topic # 11,Topic # 12,Topic # 13,Topic # 14,Topic # 15
0,learning,gt,model,test,regression,df,words,features,time,na,clustering,network,data,partial,image
1,machine,lt,tf,train,tree,pandas,word,feature,series,lt,cluster,neural,science,frac,images
2,python,java,keras,validation,variables,dataframe,text,selection,lstm,missing,distance,layer,set,theta,class
3,spark,label,loss,training,variable,pd,word2vec,pca,day,,clusters,output,orange,amp,cnn
4,deep,string,batch,model,decision,np,document,importance,predict,inferred,similarity,input,analysis,function,classes
5,user,list,shape,accuracy,linear,columns,documents,extraction,rnn,id,matrix,layers,big,sigma,dataset
6,algorithm,xml,input,set,categorical,csv,sentence,correlation,prediction,values,means,weights,like,sum,object
7,use,frame,activation,cross,logistic,id,vectors,categorical,days,error,cosine,networks,file,cost,classification
8,would,row,dense,score,model,column,vector,vector,date,frame,points,hidden,new,error,label
9,https,function,size,class,trees,import,topic,importances,sequence,auto,algorithm,convolutional,would,gradient,detection
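
The top-word lookup in `get_nmf_topics` relies on an `argsort` slice that can be hard to read at first. The sketch below shows what it does on a made-up weight vector; the words and weights are hypothetical.

```python
import numpy as np

# Hypothetical weights of four words within one topic
feat_names = np.array(['data', 'model', 'python', 'keras'])
weights = np.array([0.1, 0.7, 0.05, 0.4])

n_top_words = 2
# argsort gives ascending order; the slice walks backwards from the end
# and stops after n_top_words entries, i.e. largest weights first
top_ids = weights.argsort()[:-n_top_words - 1:-1]
top_words = feat_names[top_ids].tolist()

print(top_ids.tolist())  # [1, 3]
print(top_words)         # ['model', 'keras']
```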
