<img style="float: right; margin-right: 30px;" src="https://cscw.acm.org/2016/images/logo@257px.png">

The following is a brief example of using [topic modeling](https://en.wikipedia.org/wiki/Topic_model) within a [Jupyter](http://jupyter.org) notebook. Jupyter is a web-based programming and publishing environment that works with over 40 different programming languages.

Since Jonathan and Karen went to [CSCW 2016](https://cscw.acm.org/2016/) a few weeks ago, I thought it might be fun to use topic modeling to characterize the papers that were submitted there.

I downloaded the PDFs for all 142 papers and converted them to text. Since the formatting was fairly structured I was also able to extract the abstracts from the text. You can see both the paper text and the abstracts in the [data](data) directory.

I then wrote a bit of helper code using Python's [Gensim](https://radimrehurek.com/gensim/) topic modeling library to (hopefully) illustrate a little bit of how topic modeling works. The first thing we need to do is to import some of this helper code.

In [280]:
import topicmodel
from importlib import reload
reload(topicmodel)  # pick up edits to topicmodel.py without restarting the kernel

from topicmodel import papers, abstracts, topics

The `papers` and `abstracts` functions are Python generators that yield a list of words for each of the CSCW papers and abstracts. For example, we can see the words in the first abstract by calling the `abstracts` function and then calling `next` on the result:

In [253]:
next(abstracts())

['As',
 'the',
 'world',
 'becomes',
 'more',
 'digitized',
 'and',
 'interconnected',
 'i',
 'ormation',
 'that',
 'was',
 'once',
 'considered',
 'to',
 'be',
 'private',
 'such',
 'as',
 'one',
 's',
 'health',
 'status',
 'is',
 'now',
 'being',
 'shared',
 'publicly',
 'To',
 'understand',
 'this',
 'new',
 'phenomenon',
 'better',
 'it',
 'is',
 'crucial',
 'to',
 'study',
 'what',
 'types',
 'of',
 'health',
 'information',
 'are',
 'being',
 'shared',
 'on',
 'social',
 'media',
 'and',
 'why',
 'as',
 'well',
 'as',
 'by',
 'whom',
 'In',
 'this',
 'paper',
 'we',
 'study',
 'the',
 'traits',
 'of',
 'users',
 'who',
 'share',
 'their',
 'personal',
 'health',
 'and',
 'ﬁtness',
 'related',
 'inform',
 'ion',
 'on',
 'social',
 'media',
 'by',
 'analyzing',
 'ﬁtness',
 'status',
 'updates',
 'that',
 'MyFitnessPal',
 'users',
 'have',
 'shared',
 'via',
 'Twitter',
 'We',
 'investigate',
 'how',
 'certain',
 'features',
 'like',
 'user',
 'proﬁle',
 'ﬁtness',
 'activity',
 'an
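
For anyone curious how a generator like `abstracts` could be put together, here is a minimal sketch. It is not the actual helper code: the names `tokenize` and `word_lists`, the `\w+` pattern, and the one-`.txt`-file-per-document layout are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Assumption: a "word" is any run of letters, digits, or underscores.
WORD = re.compile(r"\w+")


def tokenize(text):
    """Split raw text into a list of words, dropping punctuation."""
    return WORD.findall(text)


def word_lists(directory):
    """Yield one list of words per .txt file, in sorted filename order."""
    for path in sorted(Path(directory).glob("*.txt")):
        yield tokenize(path.read_text(encoding="utf-8"))
```

Because `word_lists` is a generator, nothing is read from disk until `next` is called on it or it is looped over, which keeps memory use low when there are many documents. A pattern like this would also split `one's` into `one` and `s`, similar to what shows up in the real output above.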

As you can see the text isn't perfect, but it's largely good enough for these purposes. We can then generate a topic model using the `topics` function. `topics` will create a [Latent Dirichlet Allocation](https://www.quora.com/What-is-a-good-explanation-of-Latent-Dirichlet-Allocation) (LDA) model and then use the [UMass Topic Coherence](http://ciir-publications.cs.umass.edu/getpdf.php?id=956) algorithm to list the primary topics. I'll be the first to admit that I have little to no idea what that means. It's what the Gensim documentation told me. Perhaps Philip Resnick will have gone over some of this terminology beforehand.
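
To make that a bit less mysterious, here is a rough sketch of what a helper along the lines of `topics` might do with Gensim. This is an assumption about the general shape, not the notebook's real helper: `fit_topics`, `format_topics`, and the `seed` parameter are hypothetical names, and it assumes a reasonably recent Gensim in which `LdaModel.top_topics` accepts `topn` and scores topics with the `u_mass` coherence measure by default.

```python
def fit_topics(documents, ignore=frozenset(), num_topics=5, num_words=5, seed=None):
    """Fit an LDA model and return (words, coherence) pairs, most coherent first.

    `documents` is an iterable of word lists. Passing an integer `seed`
    (hypothetical parameter) makes repeated runs return the same topics.
    """
    # Imported here so the rest of this sketch loads without Gensim installed.
    from gensim import corpora, models

    texts = [[w for w in doc if w not in ignore] for doc in documents]
    dictionary = corpora.Dictionary(texts)            # maps each word to an integer id
    corpus = [dictionary.doc2bow(t) for t in texts]   # per-document (id, count) pairs
    lda = models.LdaModel(corpus, id2word=dictionary,
                          num_topics=num_topics, random_state=seed)
    # Each result is ([(weight, word), ...], coherence_score).
    scored = lda.top_topics(corpus, topn=num_words)
    return [([word for _, word in terms], coherence) for terms, coherence in scored]


def format_topics(scored):
    """Render topics as numbered lines such as '1. data, people, social'."""
    return [f"{i}. {', '.join(words)}" for i, (words, _) in enumerate(scored, 1)]
```

Under those assumptions, something like `print("\n".join(format_topics(fit_topics(abstracts()))))` would print a numbered list in the same style as the output of the real helper.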

In [281]:
topics(abstracts)

1. that, with, this, data, from
2. that, with, social, their, design
3. that, with, their, social, work
4. that, social, with, work, media
5. that, with, their, online, content


The results are not very interesting because we mostly just see the words that occur most commonly in any English text. However, we can create a list (or really a Python set) of words to ignore when doing the modeling.

In [260]:
words = set(["the", "of", "to", "a"])

And then we can call the `topics` function again, but this time passing in our list of words to ignore, which are usually called *stop words*:

In [282]:
topics(abstracts, ignore=words)

1. social, that, with, their, study
2. that, with, their, online, this
3. that, with, from, work, this
4. that, with, data, work, this
5. that, with, more, their, people


But as you can see there are a lot more words that would be good to ignore. Luckily other people have run into this issue before and compiled lists of these extremely common English words, and I've included them here so we can import them.

In [263]:
from stopwords import stopwords

Here's how many words are in the list:

In [265]:
len(stopwords)

319
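
A small sketch of how an ignore set could be applied to a list of words before modeling. The function name `remove_ignored`, the case-insensitive comparison, and the tiny `ignore` set are assumptions for illustration; the `sample` words are taken from the first abstract shown earlier.

```python
def remove_ignored(words, ignore):
    """Return the words that are not in `ignore`, comparing case-insensitively."""
    return [w for w in words if w.lower() not in ignore]


# A handful of words from the first abstract.
sample = ["To", "understand", "this", "new", "phenomenon", "better", "it",
          "is", "crucial", "to", "study", "what", "types", "of", "health",
          "information", "are", "being", "shared"]

# A tiny stand-in for a real stop word list.
ignore = {"to", "this", "it", "is", "what", "of", "are", "being"}

kept = remove_ignored(sample, ignore)
```

Of the 19 sample words, the ten that remain in `kept` are the ones that say something about the subject matter: `understand`, `new`, `phenomenon`, `better`, `crucial`, `study`, `types`, `health`, `information`, `shared`. Using a set for `ignore` keeps each membership check fast no matter how long the list grows.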

So now we can try running `topics` again with this longer list of words to ignore:

In [283]:
topics(abstracts, ignore=stopwords)

1. data, people, social, study, present
2. social, media, online, work, study
3. social, users, study, data, design
4. privacy, social, data, online, community
5. technology, work, social, design, parents


Now finally things are getting a little bit more interesting! 

One important thing to note about LDA is that it is a *generative model* which uses *randomness* as part of the algorithm. So if we run `topics` again with the exact same options it will generate different results. How does this impact the way you might use topic modeling?

In [284]:
topics(abstracts, ignore=stopwords)

1. social, media, work, online, study
2. data, people, content, paper, users
3. social, paper, learning, study, data
4. work, social, collaborative, online, data
5. study, social, people, users, trust


On inspection it looks like there are words that are used a lot in CSCW papers that might be useful to add to our ignore list:

In [298]:
stopwords.update([
    "social", "work", "study", "paper", "data", "online", "design",
    "technology", "users", "media", "people", "results", "content",
    "information",
])

In [302]:
topics(abstracts, ignore=stopwords)

1. support, MOOCs, understanding, collaborative, access
2. strategies, children, provide, present, trust
3. network, networks, community, user, practice
4. support, task, present, communities, learning
5. privacy, different, sharing, research, communication


At this point it might be useful to try to assign labels to some of the topic groups:

1. **learning**: support, MOOCs, understanding, collaborative, access
2. **privacy**: strategies, children, provide, present, trust
3. **communities**: network, networks, community, user, practice
4. **task modeling**: support, task, present, communities, learning
5. **publishing**: privacy, different, sharing, research, communication

As noted above, LDA is a generative statistical technique that uses *randomness* as part of the algorithm, so the labels we just assigned only describe one particular run. If we run our `topics` helper function again with the exact same options we will get different results:

In [303]:
topics(abstracts, ignore=stopwords)

1. mobile, support, present, practices, using
2. trust, food, network, based, monitoring
3. friends, sharing, group, experience, collaborative
4. support, children, life, family, community
5. strategies, privacy, research, communication, different


How does this randomness impact how you can use topic modeling as a tool in different problem domains?
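
One rough way to explore that question is to measure how much the topics from two runs overlap. The sketch below compares the last two five-topic runs above using Jaccard similarity (shared words divided by total distinct words). The helper names `jaccard` and `best_matches` are made up for this example, and comparing only the top five words is a simplification.

```python
# Topics copied from the two most recent five-topic runs above.
run_a = [
    ["support", "MOOCs", "understanding", "collaborative", "access"],
    ["strategies", "children", "provide", "present", "trust"],
    ["network", "networks", "community", "user", "practice"],
    ["support", "task", "present", "communities", "learning"],
    ["privacy", "different", "sharing", "research", "communication"],
]
run_b = [
    ["mobile", "support", "present", "practices", "using"],
    ["trust", "food", "network", "based", "monitoring"],
    ["friends", "sharing", "group", "experience", "collaborative"],
    ["support", "children", "life", "family", "community"],
    ["strategies", "privacy", "research", "communication", "different"],
]


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two word lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def best_matches(first, second):
    """For each topic in `first`, return (index of closest topic in `second`, score)."""
    result = []
    for topic in first:
        scores = [jaccard(topic, other) for other in second]
        best = max(range(len(scores)), key=scores.__getitem__)
        result.append((best, scores[best]))
    return result


matches = best_matches(run_a, run_b)
```

Here only the fifth topic reappears largely intact: it shares four of its five words with the fifth topic of the second run. The other four share at most two words with any topic in the second run, which suggests treating a label from a single run with some caution.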

The `topics` helper function has some additional knobs you can turn to change the output. For example you can change the number of topics you would like to see:

In [301]:
topics(abstracts, ignore=stopwords, num_topics=10)

1. children, parents, family, present, time
2. privacy, decay, surgeon, video, shared
3. food, approach, task, workers, privacy
4. sharing, trust, inﬂuence, user, systems
5. provide, feedback, communities, expert, support
6. support, research, MOOCs, practices, friends
7. mobile, trust, network, task, interaction
8. group, communities, newcomers, interactions, face
9. collaborative, students, time, community, porters
10. self, communication, privacy, SAHDs, consent


And you can change the number of words in each topic:

In [306]:
topics(abstracts, ignore=stopwords, num_words=3)

1. privacy, interaction, mobile
2. trust, collaborative, present
3. support, communities, provide
4. children, strategies, privacy
5. face, collaborative, practice


Remember the `papers` generator we imported at the beginning? It yields the full text of each paper rather than just the abstract. Here's the text of the first paper:

In [311]:
next(papers())

['CSCW',
 'FEBRUARY',
 'MARCH2',
 '2016',
 'FRANCISCO',
 'Persistent',
 'Sharing',
 'Fitness',
 'Status',
 'Twitter',
 'Kunwoo',
 'Park',
 'KAIST',
 'South',
 'Korea',
 'park',
 'kaist',
 'Meeyoung',
 'KAIST',
 'South',
 'Korea',
 'meeyoungcha',
 'kaist',
 'Ingmar',
 'Weber',
 'QCRI',
 'Qatar',
 'iweber',
 'Chul',
 'MyFitnessPal',
 'United',
 'States',
 'clee',
 'myﬁtnesspal',
 'ABSTRACT',
 'world',
 'becomes',
 'more',
 'digitized',
 'interconnected',
 'formation',
 'that',
 'once',
 'considered',
 'private',
 'such',
 'health',
 'status',
 'being',
 'shared',
 'publicly',
 'understand',
 'this',
 'phenomenon',
 'better',
 'crucial',
 'study',
 'what',
 'types',
 'health',
 'information',
 'being',
 'shared',
 'social',
 'media',
 'well',
 'whom',
 'this',
 'paper',
 'study',
 'traits',
 'users',
 'share',
 'their',
 'personal',
 'health',
 'ﬁtness',
 'related',
 'informa',
 'tion',
 'social',
 'media',
 'analyzing',
 'ﬁtness',
 'status',
 'updates',
 'that',
 'MyFitnessPal',
 'users'

Let's run the papers through the LDA topic model:

In [307]:
topics(papers, ignore=stopwords)

1. trust, community, present, children, research
2. MOOCs, support, collaborative, humanitarian, remote
3. mobile, sharing, support, interaction, audience
4. workers, present, task, different, food
5. privacy, systems, time, networks, friends


Does looking at the full text of the papers change the modeling at all? Try playing around with the code if you want by adding stop words, or changing the number of topics or words returned.

That's all folks!