# Topic Modeling demo (English)

In [1]:
import numpy as np
import numpy.matlib
import pandas as pd
import lda

## Loading data
The data we're using for this demo comes from the official Facebook page for the UN Global Goals. Most posts are in English, and they address a variety of topics. A few posts come from the Global Goals account itself, while the majority are contributed by visitors to the page. The first five entries are shown below.

In [11]:
xls = pd.read_excel('Global Goals.xls')
pd.set_option('display.max_colwidth', 1000)
xls[['Author','Contents']].head(5)

Unnamed: 0,Author,Contents
0,Global Goals for Sustainable Development,"Global Goals for Sustainable Development\n\nhttps://www.facebook.com/globalgoalsUN/videos/10153715606741026/\n\nhttps://video.xx.fbcdn.net/hvideo-xfa1/v/t42.1790-2/12978276_1587737764851663_1662405109_n.mp4?efg=eyJ2ZW5jb2RlX3RhZyI6InN2ZV9zZCJ9&rl=406&vabr=226&oh=5235205ecf260dcf96e6948066657b0a&oe=570827AC\n\nWorld Health Day 2016: Halt the rise, take the steps needed to...\n\nTomorrow, 7 April, is World Health Day, and this year the United Nations is bringing attention to #diabetes. \n\nNewly released figures from the World Health Organization (WHO) show that the prevalence of diabetes has grown steadily – nearly quadrupling from 108 million to 422 million adults since 1980. That is 1 in 11 adults around the world. This WHO video illustrates just how important Goal 3 of the #GlobalGoals is. \n\nMore information: http://who.int/campaigns/world-health-day/2016/en/"
1,Suresh Kumar,"Global Goal No.3 Please discuss and support for Global Goals-------Why not ATM for Medicine like Bank ATM ??? Save Environment/Reduce Drug Addiction& Reduce Price of Medicine by 50%•-- lnkd.in/bB9bbsf \nFor the benefit of Billions of people living in Developing and Under-developed countries mostly in Asia and Africa. {28215 views till 05.11.15}\nTweet ------@sureshkito \nThanks and Regards \n( Favored /Retweeted at @sureshkito by :- UN DESA DPAD ,Global Dev Lab of USAID,Musimbi Kanyoro --President and CEO Global fund woman, Jyrki Katainen --Vice President EU Commission, USAID Policy, DIV at USAID , UNDP, SDG2030, Lenni Montiel UNDESA --UN Asstt Secy General,Healthmanagement.org ,Healthcare.gov of USA,UN Social 500, Melissa c lott, Stockade Magazine, Sustainability news, UNICEF Innocenti , World we want 2015, UNDP Asia-Pacific ,UNEP_EU, SEED Awards , World Resource Institute , UN_Expo2015, Zayed energy prize , Irwin Kula , Ulrich J v Vuuren, Alex Dehgan-X , Linda scott @Prof..."
2,Barbara Schneider,"URGENT NEED ( because I have not received a reply from the UNO, although I am sending reminders for many years) , I am sure that there is still the need FOR ""better communication"", meaning all complaints send to the UNO , incl. Human Rights Council should get immediately a registration number and the applicant/ complainant should receive immediately a registration number for further communication. URGENT.......PLEASE SUPPORT THIS MATTER! In addition Human Rights Complaints in mother tongue should possible too........ !"
3,Devesh Kumar,"Good people .. .. .. \nIt's our lives "" Thanks """
4,Suresh Kumar,"Global Goal No.13 Stop polluting cities by big cars, GPRS Capsule electric trolleys is alternative. https://www.facebook.com/Global-Goal-No13-A-solution-for-City-Pollution-The-Smart-City-147941658897870/"


## Preparing for topic modeling
Topic modeling algorithms often struggle to identify the topic of very short pieces of text. We'll get around this by only paying attention to posts with more than 250 characters -- there are 1,120 such posts. Those long posts are then passed to an algorithm called a *vectorizer* that turns the set of posts into a matrix of numbers, because computers generally prefer to work with numbers. This is done by identifying the *vocabulary* of all words that appear at least once, after dropping common English stop words such as "the" and "and" (10,275 words remain), and counting the number of times that each word appears in each post. The result is a 1120x10275 array of numbers (known as a *document-term matrix* or DTM), with one row for each of the 1,120 documents and one column for each of the 10,275 words.

In [36]:
def getlen(x):
    # Non-text entries (e.g. NaN for an empty cell) have no length; count them as 0
    try:
        return len(x)
    except TypeError:
        return 0
xls['strlen'] = xls['Contents'].apply(getlen)
long_posts = xls[xls.strlen > 250].reset_index(drop=True)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words='english') 
vect.fit(long_posts.Contents)
long_dtm = vect.transform(long_posts.Contents)
long_dtm.shape

(1120, 10275)
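
To make the idea of a document-term matrix concrete, here is a minimal sketch in plain Python. The three toy posts are made up for illustration, and the sketch only splits on spaces: it does not remove stop words or apply the tokenization rules that `CountVectorizer` uses, so it shows the idea rather than scikit-learn's actual behavior.

```python
from collections import Counter

# Three toy "posts" (hypothetical, not taken from the dataset)
toy_posts = [
    "clean water for all",
    "water and sanitation",
    "climate action for all",
]

# Vocabulary: every distinct word, sorted alphabetically
toy_vocab = sorted({word for post in toy_posts for word in post.split()})

# One row per post, one column per vocabulary word
toy_dtm = []
for post in toy_posts:
    counts = Counter(post.split())
    toy_dtm.append([counts[word] for word in toy_vocab])

print(toy_vocab)
for row in toy_dtm:
    print(row)
```

Each row has one entry per vocabulary word, so three posts and eight distinct words give a 3x8 matrix, in the same way that the real data gives a 1120x10275 matrix.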

## Fitting the topic model
The algorithm we're using is called LDA (Latent Dirichlet Allocation). The basic idea is that we give it a set of documents (or more precisely a DTM), and ask it to identify a specific number of topics (20 in this case). Each document is given a *probability* of belonging to each topic. For each topic, every word has a *weight* -- the words with the highest weights are the ones that are most important to the topic.

Part of the model-fitting algorithm requires a random-number generator -- this means that the results will differ somewhat from run to run unless the random seed (`random_state`) is fixed. We're going to fit the model twice with two different seeds, because we suspect that topics that are consistent between the two versions of the fitted model will be more reliable.
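
One way to build intuition for LDA is to look at the story it assumes about how documents are written: each topic is a probability distribution over words, each document is a mix of topics, and every word in a document is drawn from one of that document's topics. The sketch below simulates that story with NumPy for a single toy document. The sizes and the Dirichlet parameter of 0.5 are arbitrary assumptions for illustration, and this is not how the `lda` package is implemented. Fitting a model goes in the opposite direction: it starts from the word counts and estimates the topic mixes and word weights.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable

n_topics, n_words, doc_length = 3, 6, 20   # toy sizes, chosen arbitrarily

# Each topic is a probability distribution over the vocabulary (rows sum to 1)
topic_word = rng.dirichlet(np.full(n_words, 0.5), size=n_topics)

# The document gets its own mix of topics (entries sum to 1)
doc_topic = rng.dirichlet(np.full(n_topics, 0.5))

# Generate the document: pick a topic for each word slot, then a word from that topic
topics_for_words = rng.choice(n_topics, size=doc_length, p=doc_topic)
words = np.array([rng.choice(n_words, p=topic_word[t]) for t in topics_for_words])

# Word counts for this document: one row of a document-term matrix
doc_counts = np.bincount(words, minlength=n_words)
print(doc_topic.round(2))
print(doc_counts)
```

Changing the seed passed to `default_rng` produces a different document, which is the same reason two fits with different `random_state` values can produce different topics.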

In [13]:
myLDA1 = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
myLDA1.fit(long_dtm)
myLDA2 = lda.LDA(n_topics=20, n_iter=1500, random_state=2)
myLDA2.fit(long_dtm)

<lda.lda.LDA at 0x227dbf5d7f0>

In [28]:
tw1 = myLDA1.topic_word_
tw2 = myLDA2.topic_word_
ldamat1 = myLDA1.transform(long_dtm)
ldamat2 = myLDA2.transform(long_dtm)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = vect.get_feature_names_out()

## Summarizing topics
We can quickly get a sense of what some of the major topics on the Global Goals Facebook page are by looking at the 8 most heavily-weighted terms in each topic. Here are the results for the first version of the fitted model:

In [38]:
n_top_words = 8

def top_words(topic_dist,n_top_words=10):
    return(np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1])

for i, topic_dist in enumerate(tw1):
    topic_words = top_words(topic_dist, n_top_words)
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: la est ce le et information sa une
Topic 1: https video goals sustainable com www videos net
Topic 2: development health hunger organization awareness food poor information
Topic 3: world change climate global help forests wildlife poverty
Topic 4: love vietnam hope police read family choice leaders
Topic 5: atm reduce medicine global save environment 50 price
Topic 6: goals development global world https facebook united globalgoals
Topic 7: state international house family police criminal murder leaders
Topic 8: water access development goal countries people sanitation weather
Topic 9: world carbon new economic 000 years report growth
Topic 10: photos sustainable development global www goals type facebook
Topic 11: از ای که می در را است نه
Topic 12: com https www facebook 2015 real page vera
Topic 13: president undp global world usaid india policy energy
Topic 14: rights like want don women child people children
Topic 15: development sustainable climate http united nations gl
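
The slicing inside `top_words` is compact, so here is a small worked example of the same indexing trick on made-up data. The five words and their weights are hypothetical and chosen only so the ordering is easy to check by eye.

```python
import numpy as np

toy_vocab = np.array(['water', 'climate', 'health', 'food', 'energy'])
toy_weights = np.array([0.10, 0.40, 0.05, 0.30, 0.15])

# argsort gives the indices that would sort the weights from lowest to highest
order = np.argsort(toy_weights)

# Reorder the words by weight, then walk backwards from the end for 3 steps
top3 = toy_vocab[order][:-(3 + 1):-1]
print(top3)   # ['climate' 'food' 'energy']
```

The slice `[:-(n+1):-1]` starts at the last (heaviest) word and steps backwards, stopping after `n` items, which is why the heaviest word is listed first.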

It appears that topic 0 consists of posts in French, and topic 11 of posts in Persian (written in Arabic script). Topic 2 might be focused on food security, while topic 3 seems to be about climate change. To get a sense of how much we can trust the topics, we can look at the same output for the other version of the model and see which topics look similar.

In [39]:
for i, topic_dist in enumerate(tw2):
    topic_words = top_words(topic_dist, n_top_words)
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

Topic 0: fuel city cars cities goal global solution pollution
Topic 1: photos www sustainable goals development global facebook type
Topic 2: la est le information et en une réalité
Topic 3: com www https facebook real angrybirdshappyplanet angry birds
Topic 4: water people children need work sanitation forests day
Topic 5: love hope read choice level 1st valid important
Topic 6: sustainable development climate global goals http com https
Topic 7: world vera carlos 2015 help global https achieve
Topic 8: https video net videos goals oh global xx
Topic 9: development organization people rural support poor awareness health
Topic 10: president global undp world usaid india policy energy
Topic 11: people rights want women like don children family
Topic 12: development economic sustainable countries new world http social
Topic 13: women world girls gender youth young hunger poverty
Topic 14: atm reduce medicine asia 50 environment bank price
Topic 15: از ای که می را در زمین است
Topic 16: ed

This time, the French posts are in topic 2 and the Persian-language posts (in Arabic script) are in topic 15. Some of the topics look similar to what we saw before, while others are different. Keep in mind that the most consistent topics aren't necessarily the largest or most important ones, just the ones that can be identified most reliably because people use consistent words and phrases in those topics.

## Finding matching topics
We know that only some topics will overlap between the two sets, and that the similar topics may be in a different order. We can easily measure the similarity between two topics by listing the top 100 words in each topic, and counting how many of those top words are in common between the two lists. If two topics have more than 50% of their top 100 words in common, we'll consider them to be a match.

In [40]:
def overlap(td1,td2,n=10):
    w1 = top_words(td1,n)
    w2 = top_words(td2,n)
    return(len(set(w1) & set(w2))/n)

match_pairs = []
n = tw1.shape[0]
for i in range(n):
    for j in range(n):
        o = overlap(tw1[i,:],tw2[j,:],100)
        if o > 0.5:
            match_pairs.append((i,j))
            
print(match_pairs)

[(0, 2), (1, 8), (2, 9), (10, 1), (11, 15), (12, 3), (13, 10), (15, 6), (17, 0), (18, 4)]
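
As a small worked example of the overlap measure, the sketch below applies the same set-intersection idea to the top-8 word lists printed earlier, since the top-100 lists are not shown. The helper name `toy_overlap` is made up for this illustration; the demo itself uses `overlap` with `n=100`.

```python
def toy_overlap(words1, words2):
    """Fraction of words shared by two equally long top-word lists."""
    return len(set(words1) & set(words2)) / len(words1)

# Top-8 lists copied from the printed outputs (model 1 topic 13, model 2 topic 10)
m1_topic13 = "president undp global world usaid india policy energy".split()
m2_topic10 = "president global undp world usaid india policy energy".split()

# Model 1 topic 3, which covers a different theme
m1_topic3 = "world change climate global help forests wildlife poverty".split()

print(toy_overlap(m1_topic13, m2_topic10))  # 1.0: same eight words, different order
print(toy_overlap(m1_topic3, m2_topic10))   # 0.25: only 'world' and 'global' are shared
```

Because the measure uses sets, the order of words within each list does not matter, only whether a word made it into both lists.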


Out of our two sets of 20 topics, we have 10 matches. The numbers above mean that, for example, topic \#0 from the first model matches with \#2 from the second model, and topic \#1 from the first model matches with \#8 from the second.

## Example posts
Each post has a probability -- a number between 0 and 1 -- for each topic, measuring how likely the post is to belong to that topic. To better understand what is in a topic, we can find the post with the highest probability for that topic. The example below is for topic \#15 in the first model, where the first three words were "development", "sustainable", and "climate."

In [41]:
# Show the first n characters of the post with the highest probability for topic i (first model)
def topic_example(i,n=500):
    print(long_posts.Contents[np.argmax(ldamat1[:,i])][:n])

i = 15
topic_example(i)

Global Goals for Sustainable Development

un.org

http://bit.ly/1lKA2Qb

Ban: What I expect from the Paris Climate Conference

Climate change carries no passport. ... Only through the United Nations can we respond collectively to this quintessentially global issue.

New op-ed from United Nations Secretary-General Ban Ki-moon: "What I expect from the UN Climate Change Conference in Paris." 

Read the full piece: http://bit.ly/1lKA2Qb #COP21


Now, on to the Arabic demo!