data-science topic modeling - Using data science to explain what data science is.

Note: to view and interact with the results, click here.

Data collection

This project is based on "Data Science Stack Exchange", a website dedicated to questions and answers about data science, and on "Cross Validated", which is more focused on statistics.

To extract the tags from all the posts on these sites, I ran the following query in the Stack Exchange Data Explorer:

SELECT Tags 
FROM Posts
WHERE Tags IS NOT NULL

The query result looks like this:

<machine-learning><neural-network><deep-learning>
<statistics><time-series>
<machine-learning>
<python><keras><convnet><audio-recognition>
<statistics><unbalanced-classes>

Each row represents one post, and each tag is wrapped in angle brackets.
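
To make the row format concrete, here is a minimal sketch that splits the sample rows above into lists of tags. The helper name `parse_tags` is hypothetical and not part of the original notebook; it only illustrates the format.

```python
def parse_tags(row):
    """Split a tag string such as '<a><b>' into ['a', 'b'].

    Hypothetical helper for illustration; returns [] for rows that are
    not wrapped in angle brackets (for example a CSV header).
    """
    row = row.strip()
    if not (row.startswith('<') and row.endswith('>')):
        return []
    # Drop the outer '<' and '>' and split on the '><' between tags
    return row[1:-1].split('><')


# The five sample rows from the query result
rows = [
    '<machine-learning><neural-network><deep-learning>',
    '<statistics><time-series>',
    '<machine-learning>',
    '<python><keras><convnet><audio-recognition>',
    '<statistics><unbalanced-classes>',
]

posts = [parse_tags(r) for r in rows]
for tags in posts:
    print(tags)
```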

Extract, transform, load

Convert the data into a list of lists, one inner list of tags per post, combining both data sources ("Data Science Stack Exchange" and "Cross Validated"):

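import csv

lst = []
# Read both Data Explorer exports into one list of tag lists
for path in ('QueryResults.csv', 'QueryResults2.csv'):
    with open(path, newline='') as f:
        for line in csv.reader(f):
            # Each row has one field such as '<python><keras>';
            # rows that are not tag strings (e.g. a header) are skipped
            if line and line[0].startswith('<'):
                lst.append(line[0][1:-1].split('><'))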

After converting the data into a list of lists, we used gensim to build a dictionary and a bag-of-words corpus, and then trained an LDA model on them:

import gensim

# Map each tag to an integer id, then encode every post as a bag of tags
dictionary = gensim.corpora.Dictionary(lst)
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in lst]
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=8)

The most important parameter here is num_topics, which determines how many topics the model splits the data into: too many topics produce very narrow topics, while too few can produce ambiguous ones.

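Visualization

For visualization we used pyLDAvis, which renders the trained LDA model as an interactive view of the topics and their most relevant tags.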
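
The original notebook code for this step is not shown here, so the following is only a sketch of how a gensim LDA model is usually passed to pyLDAvis. It assumes a recent pyLDAvis release, where the gensim adapter is the module `pyLDAvis.gensim_models` (older releases named it `pyLDAvis.gensim`). The function name and the output file name are hypothetical.

```python
def save_lda_visualization(lda, corpus, dictionary, out_path='lda_vis.html'):
    """Render a trained gensim LDA model as a standalone interactive HTML page.

    Hypothetical helper: `lda`, `corpus` and `dictionary` are the objects
    built in the previous section.
    """
    # Imported inside the function so the rest of the code loads
    # even when pyLDAvis is not installed.
    import pyLDAvis
    import pyLDAvis.gensim_models as gensim_vis  # older releases: pyLDAvis.gensim

    # Compute topic coordinates and term relevance for the visualization
    vis = gensim_vis.prepare(lda, corpus, dictionary)
    pyLDAvis.save_html(vis, out_path)
    return out_path


# Usage with the objects built above:
# save_lda_visualization(lda, corpus, dictionary)
```

Inside a Jupyter notebook, calling `pyLDAvis.enable_notebook()` and then `pyLDAvis.display(vis)` shows the same view inline instead of writing a file.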


eeddaann/data-science-topic-modeling
