# Topic Modeling

**NOTE:** This notebook is still unstable, because:

- It is based on a static Qualtrics export.
- The LP survey is still in flux: we don't know what the final qualitative questions will be.
- The "expected themes" have not been decided...
- ... which means we don't know which qualitative answers are important...
- ... which means we haven't yet tagged the actual course content to see how the generated topics will map onto it.

***In this context, the code below is more "proof-of-concept" than anything else.***


However, here's how it generally works:

1. Read in raw data and clean it (remove NaNs, format text)
2. Create "documents" from the qualitative answers. Questions (and their answers) will eventually be grouped by similarity (e.g., three questions that ask about goals or values). *TLL needs to decide on this grouping; once that decision is made, those question names can be hard-coded.*
3. Perform LDA (for more info, see http://scikit-learn.org/0.18/auto_examples/applications/topics_extraction_with_nmf_lda.html)
4. Use the 3 most salient words from each topic as a "tag" (e.g., "education-global-policy").
5. Compute the distance from each document to each topic; the topic a document is closest to defines that document's tag.
6. Add column of tags back to original data.
7. Export data.
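
Step 4 is simple enough to show on toy data. The sketch below is only an illustration: the vocabulary and weights are invented (they do not come from the survey), `top_words` and `make_tag` are hypothetical names, and plain Python sorting stands in for the NumPy `argsort` slice that `get_topics` uses further down.

```python
# Toy illustration of step 4: turn one topic's word weights into a tag.
# The vocabulary and weights below are invented for the example.

def top_words(weights, feature_names, n_top=3):
    """Return the n_top feature names with the largest weights, heaviest first."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [feature_names[i] for i in ranked[:n_top]]

def make_tag(words):
    """Join salient words into a hyphenated tag."""
    return "-".join(words)

vocab = ["education", "global", "policy", "school", "teaching"]
topic_weights = [9.1, 4.2, 3.7, 0.5, 1.3]

tag = make_tag(top_words(topic_weights, vocab))
print(tag)  # education-global-policy
```

In the notebook itself the weights come from `model.components_` and the vocabulary from the fitted `CountVectorizer`.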

**What feeds the recommendation/adaptive engine?**

This all assumes that the final tags (e.g., "education-global-policy") will be what informs VPAL. However, since we do not know what the final questions will be, we can't yet say what the topics will end up being.

In [1]:
import numpy as np
import pandas as pd
import re
from simhash import Simhash
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import time

### Helper functions

In [2]:
# clean docs (lowercase, remove punctuation)
def clean_docs(s):
    s = s.lower()
    s = re.sub(r'[^\w]', ' ', s)
    return s

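# Build one "document" per respondent by joining their answers to the given columns
def make_docs(cols, df):
    docs = []
    rows = df.shape[0]
    small = df[cols]

    # Start at row 2: rows 0 and 1 are Qualtrics header rows, not responses
    for row in range(2, rows):
        # Keep answers longer than 3 characters; this also drops missing values,
        # since str(NaN) is "nan"
        temp = [str(i) for i in small.iloc[row] if len(str(i)) > 3]
        joined = " ".join(temp)
        cleaned = clean_docs(joined)

        docs.append(cleaned)

    return docs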

# this returns a dictionary of topics/themes, each entry is 
# a list of the 20 most salient words for that topic
def get_topics(model, feature_names, no_top_words = 20):
    topics = {}
    for topic_idx, topic in enumerate(model.components_):
        topics["Topic{}".format(topic_idx)] = " ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]])
    return topics


# Do the actual topic modeling, return the topics
def do_lda(docs, topics = 4, no_features = 1000):
    
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
    tf = tf_vectorizer.fit_transform(docs)
    # get_feature_names() was removed in scikit-learn 1.2; use it instead on versions before 1.0
    tf_feature_names = tf_vectorizer.get_feature_names_out()
    
    # n_topics was renamed to n_components in scikit-learn 0.19; use n_topics on 0.18
    lda = LatentDirichletAllocation(n_components=topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0)
    lda.fit(tf)
    
    return get_topics(lda, tf_feature_names)

# How far away is each answer from each topic?
def get_distances(topics,docs):
    final_dist = []
    for text in docs:
        temp_dist = []
        for key in topics.keys():
            temp_dist.append(Simhash(topics[key]).distance(Simhash(text)))
        final_dist.append(np.argmin(temp_dist))
    return final_dist

# Based on that distance, what topic does each answer belong to?
def tag_docs(final_d, topics):
    tagged = []
    keys = list(topics.keys())
    for i in final_d:
        tagged.append("-".join(topics[keys[i]].split()[0:3]))

    return tagged

# join the list of tags to the original data
def tag_on_df(original_df,name,tags):
    filler = ["", ""]  # pad for the two Qualtrics header rows that make_docs skips
    final_tags = filler+tags
    new_df = original_df.copy()
    new_df[name] = final_tags
    
    return new_df
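
`get_distances` relies on the `simhash` package, whose `distance` is the Hamming distance between two fingerprints, i.e. the number of bits in which they differ. The sketch below is a deliberately simplified re-implementation over whitespace tokens, meant only to show the idea; the real package tokenizes text differently, so its numbers will not match. `toy_simhash`, `hamming`, and `closest_topic` are hypothetical names, and the topic strings are invented.

```python
import hashlib

def toy_simhash(text, bits=64):
    """Simplified SimHash fingerprint over whitespace tokens (illustration only)."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes are positive.
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming(a, b):
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")

def closest_topic(doc, topics):
    """Return the key of the topic whose fingerprint is nearest to the doc's."""
    doc_hash = toy_simhash(doc)
    return min(topics, key=lambda k: hamming(toy_simhash(topics[k]), doc_hash))

# Invented topic strings, standing in for the output of do_lda
topics = {
    "Topic0": "education global policy",
    "Topic1": "school teaching students",
}
print(closest_topic("school teaching students", topics))
```

SimHash fingerprints are designed for near-duplicate detection, so the distance between a 20-word topic string and a free-text answer is probably a rough signal at best. One alternative worth considering is the document-topic distribution returned by `LatentDirichletAllocation.transform`.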

        

In [3]:
# do all of the above
def final_analysis(original_df,columns,name):
    start_time = time.time()
    
    docs = make_docs(columns,original_df)
    topics = do_lda(docs)
    distances = get_distances(topics,docs)
    tags = tag_docs(distances,topics)
    final_df = tag_on_df(original_df,name,tags)
    
    elapsed_time = time.time() - start_time
    print("Elapsed time: ", time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))
    
    return final_df

### Example

In [4]:
# EXAMPLE:

data = pd.read_csv("master.csv")
question_cols = ["Q414", "Q240", "Q241"] #these questions have to do with goals and values

new_data = final_analysis(data,question_cols,"GoalsAndValues")
new_data.head()

# new_data.to_csv("output.csv")

Elapsed time:  00:00:05


Unnamed: 0,StartDate,EndDate,Status,IPAddress,Progress,Duration (in seconds),Finished,RecordedDate,ResponseId,RecipientLastName,...,Gender,Email,Program,MailingState,MailingCountry,PermState,PermCountry,HowConfident,Q237_1 - Topics,GoalsAndValues
0,Start Date,End Date,Response Type,IP Address,Progress,Duration (in seconds),Finished,Recorded Date,Response ID,Recipient Last Name,...,Gender,Email,Program,MailingState,MailingCountry,PermState,PermCountry,HowConfident,Q237_1 - Topics,
1,"{""ImportId"":""startDate"",""timeZone"":""America/De...","{""ImportId"":""endDate"",""timeZone"":""America/Denv...","{""ImportId"":""status""}","{""ImportId"":""ipAddress""}","{""ImportId"":""progress""}","{""ImportId"":""duration""}","{""ImportId"":""finished""}","{""ImportId"":""recordedDate"",""timeZone"":""America...","{""ImportId"":""_recordId""}","{""ImportId"":""recipientLastName""}",...,"{""ImportId"":""Gender""}","{""ImportId"":""Email""}","{""ImportId"":""Program""}","{""ImportId"":""MailingState""}","{""ImportId"":""MailingCountry""}","{""ImportId"":""PermState""}","{""ImportId"":""PermCountry""}","{""ImportId"":""HowConfident""}","{""ImportId"":""QID237_1_c7de10f363ca498aa380d2f8...",
2,2017-08-14 20:11:34,2017-08-14 22:04:53,IP Address,108.91.186.96,100,6799,True,2017-08-14 22:04:54,R_31veOCGegzf0DXr,Poikonen,...,Female,,"Mind, Brain & Education",MN,USA,MN,United States,Very confident,Unknown,education-students-teaching
3,2017-08-14 16:00:06,2017-08-15 15:58:22,IP Address,104.61.162.88,100,86295,True,2017-08-15 15:58:24,R_8bNEEDUOMcpxtbH,Woo,...,Female,,"Mind, Brain & Education",CA,USA,CA,United States,Very confident,Unknown,education-school-teaching
4,2017-08-14 12:16:06,2017-08-16 14:19:25,IP Address,65.112.8.131,100,180199,True,2017-08-16 14:19:27,R_0MqvbG3q3AbYh45,Wu,...,Female,,"Technology, Innovation & Educ",11,CHN,11,China,Very confident,Unknown,people-school-try
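
Rows 0 and 1 in the output above are Qualtrics header rows (question text and `ImportId` metadata), which is why `make_docs` starts at row 2 and `tag_on_df` pads the tags with two empty strings. A possible alternative is to drop those rows on load; in pandas, `pd.read_csv("master.csv", skiprows=[1, 2])` should do this, although the helpers above would then need their row offsets removed. The standard-library sketch below shows the same idea on an invented three-column sample (the IDs, program names, and answers are made up).

```python
import csv
import io

# Invented sample mimicking a Qualtrics export: the column names are followed
# by a question-text row and an ImportId row before the real responses start.
raw = io.StringIO(
    'ResponseId,Program,Q414\n'
    'Response ID,Program,What are your goals?\n'
    '"{""ImportId"":""_recordId""}","{""ImportId"":""Program""}","{""ImportId"":""QID414""}"\n'
    'R_001,Program A,I want to improve teaching\n'
    'R_002,Program B,I care about education policy\n'
)

rows = list(csv.reader(raw))
header, responses = rows[0], rows[3:]  # skip the two metadata rows

# One dict per respondent, keyed by column name
records = [dict(zip(header, row)) for row in responses]
print(len(records), records[0]["Q414"])
```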
