## Topic Modeling

A quick and dirty first pass at topic modeling, to see whether anything interesting pops out.
References: [AWS](https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html) and [DataCamp](https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python).
(Interestingly, there's some verbatim code copying across these two articles. Someone stole someone
else's code and didn't cite them! Tut!)

TODO: add evaluation of coherence and test different numbers of topics.
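
For the coherence TODO, one common choice is the UMass measure, which scores a topic by how often its top words co-occur in the same documents. Below is a minimal pure-Python sketch of that idea, not the gensim implementation; `umass_coherence` and the toy documents are hypothetical names and data. On the real corpus, `gensim.models.CoherenceModel` (with `coherence='u_mass'`) would probably be the more natural tool.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass-style coherence for one topic.

    top_words: topic words ordered from most to least probable.
    docs: list of tokenized documents (lists of strings).
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents containing every word in `words`.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    # combinations() keeps the input order, so `higher` always outranks `lower`.
    for higher, lower in combinations(top_words, 2):
        d_higher = doc_freq(higher)
        if d_higher == 0:
            continue  # word never appears; skip rather than divide by zero
        score += math.log((doc_freq(higher, lower) + 1) / d_higher)
    return score

# Toy documents (assumption: stand-ins for doc_clean).
toy_docs = [["job", "school"], ["job", "school"], ["money", "year"]]
coherent = umass_coherence(["job", "school"], toy_docs)
incoherent = umass_coherence(["job", "money"], toy_docs)
print(coherent > incoherent)
```

To compare different numbers of topics, one option is to fit a model per candidate value, average this score over its topics, and look for where the average stops improving.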

In [21]:
import pandas as pd
import numpy as np
import os
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
# None disables column truncation; -1 is deprecated in recent pandas versions
pd.set_option('display.max_colwidth', None)

In [6]:
# Select local path vs kaggle kernel
path = os.getcwd()
if 'data-projects/kaggle_quora/notebooks' in path:
    data_dir = '../data/raw/'
else:
    data_dir = ''

dat = pd.read_csv(data_dir +'train.csv')

In [7]:
def preprocess_data(doc_set):
    """
    Input  : document list
    Purpose: preprocess text (tokenize, removing stopwords, and stemming)
    Output : preprocessed text
    """
    tokenizer = RegexpTokenizer(r'\w+')
    # create English stop words set (set membership checks are faster than list checks)
    en_stop = set(get_stop_words('en'))
    # Create p_stemmer of class PorterStemmer
    p_stemmer = PorterStemmer()
    
    # list for tokenized documents
    texts = []

    # loop through document list
    for question in doc_set:

        # clean and tokenize document string
        raw = question.lower()
        tokens = tokenizer.tokenize(raw)

        # remove stop words from tokens
        stopped_tokens = [i for i in tokens if i not in en_stop]

        # stem tokens
        stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]

        # add tokens to list
        texts.append(stemmed_tokens)
        
    return texts

def prepare_corpus(doc_clean):
    """
    Input  : clean document
    Purpose: create the term dictionary of our corpus and convert the list of documents (corpus) into a document-term matrix
    Output : term dictionary and Document Term Matrix
    """
    # turn our tokenized documents into a id <-> term dictionary
    dictionary = corpora.Dictionary(doc_clean)

    # convert tokenized documents into a document-term matrix
    corpus = [dictionary.doc2bow(text) for text in doc_clean]
    
    return dictionary, corpus

In [8]:
%%time

# preprocess questions
doc_clean = preprocess_data(list(dat.question_text.values))
dat['question_text_processed'] = doc_clean

# Create corpus and dictionary
dictionary, corpus = prepare_corpus(doc_clean)

In [9]:
print(dictionary)
# Dictionary(153239 unique tokens: ['1960', 'nation', 'nationalist', 'provinc', 'quebec']...)

Dictionary(153239 unique tokens: ['1960', 'nation', 'nationalist', 'provinc', 'quebec']...)


I started with an LDA (Latent Dirichlet Allocation) model, as it is thought to generalize to new documents better than the simpler Latent Semantic Analysis ([more details here](https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05)). But it is much slower to fit, and of course I have no interest in generalizing to new documents! 

However, the results revealed some interesting gotchas in the preprocessing that are worth exploring before moving forward.

In [None]:
# %timeit

# # generate LDA model
# ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word = dictionary, passes=20)
# ldamodel.save('lda.model')

# print(ldamodel.print_topics(num_topics=10, num_words=20))

[(0, '0.034*"s" + 0.013*"war" + 0.013*"class" + 0.012*"book" + 0.012*"read" + 0.010*"major" + 0.010*"black" + 0.010*"man" + 0.009*"c" + 0.008*"forc" + 0.008*"movi" + 0.007*"becom" + 0.007*"averag" + 0.007*"u" + 0.007*"tv" + 0.006*"known" + 0.006*"subject" + 0.006*"star" + 0.006*"process" + 0.005*"boy"'),
 (1, '0.016*"state" + 0.015*"student" + 0.013*"will" + 0.013*"s" + 0.013*"indian" + 0.012*"import" + 0.011*"system" + 0.010*"scienc" + 0.010*"govern" + 0.010*"manag" + 0.010*"india" + 0.009*"china" + 0.009*"human" + 0.009*"happen" + 0.008*"thing" + 0.008*"eat" + 0.008*"polit" + 0.008*"term" + 0.008*"histori" + 0.008*"form"'),
 (2, '0.038*"use" + 0.035*"best" + 0.033*"can" + 0.020*"way" + 0.018*"learn" + 0.018*"work" + 0.015*"compani" + 0.013*"start" + 0.011*"develop" + 0.011*"busi" + 0.010*"good" + 0.009*"market" + 0.009*"languag" + 0.009*"creat" + 0.008*"program" + 0.008*"product" + 0.008*"free" + 0.008*"make" + 0.008*"s" + 0.007*"onlin"'),
 (3, '0.022*"engin" + 0.019*"trump" + 0.014*"mean" + 0.011*"prepar" + 0.011*"studi" + 0.011*"univers" + 0.011*"best" + 0.011*"cours" + 0.010*"social" + 0.009*"complet" + 0.009*"exam" + 0.008*"can" + 0.008*"relat" + 0.008*"good" + 0.007*"gener" + 0.007*"s" + 0.007*"data" + 0.007*"math" + 0.007*"jee" + 0.007*"media"'),
 (4, '0.040*"can" + 0.015*"1" + 0.015*"use" + 0.013*"buy" + 0.012*"best" + 0.011*"number" + 0.011*"account" + 0.010*"servic" + 0.009*"get" + 0.009*"test" + 0.009*"car" + 0.008*"bank" + 0.008*"hous" + 0.008*"cost" + 0.007*"food" + 0.007*"build" + 0.007*"offic" + 0.006*"watch" + 0.006*"much" + 0.006*"song"'),
 (5, '0.026*"countri" + 0.018*"live" + 0.017*"american" + 0.015*"peopl" + 0.014*"india" + 0.012*"like" + 0.011*"world" + 0.010*"differ" + 0.010*"us" + 0.010*"consid" + 0.009*"s" + 0.009*"america" + 0.008*"usa" + 0.008*"chines" + 0.008*"muslim" + 0.007*"big" + 0.007*"mani" + 0.007*"citi" + 0.007*"uk" + 0.007*"non"'),
 (6, '0.031*"can" + 0.016*"quora" + 0.014*"question" + 0.014*"get" + 0.013*"ever" + 0.012*"t" + 0.012*"s" + 0.010*"name" + 0.010*"ask" + 0.009*"answer" + 0.009*"anyon" + 0.007*"peopl" + 0.007*"help" + 0.007*"ve" + 0.007*"will" + 0.007*"stop" + 0.007*"right" + 0.006*"problem" + 0.006*"one" + 0.006*"made"'),
 (7, '0.034*"get" + 0.034*"can" + 0.022*"job" + 0.019*"take" + 0.017*"time" + 0.014*"school" + 0.011*"day" + 0.011*"long" + 0.011*"colleg" + 0.011*"go" + 0.011*"caus" + 0.011*"place" + 0.010*"work" + 0.010*"experi" + 0.009*"best" + 0.008*"will" + 0.008*"one" + 0.008*"4" + 0.007*"high" + 0.007*"famili"'),
 (8, '0.035*"like" + 0.034*"t" + 0.028*"peopl" + 0.017*"feel" + 0.016*"can" + 0.016*"s" + 0.015*"know" + 0.015*"don" + 0.014*"life" + 0.014*"think" + 0.013*"say" + 0.013*"girl" + 0.012*"person" + 0.012*"want" + 0.011*"love" + 0.011*"someon" + 0.010*"look" + 0.010*"women" + 0.009*"just" + 0.009*"friend"'),
 (9, '0.032*"year" + 0.027*"can" + 0.019*"get" + 0.015*"2" + 0.015*"will" + 0.013*"money" + 0.012*"old" + 0.011*"3" + 0.010*"much" + 0.010*"interest" + 0.010*"2017" + 0.009*"5" + 0.009*"talk" + 0.009*"10" + 0.008*"date" + 0.008*"age" + 0.008*"first" + 0.008*"game" + 0.008*"make" + 0.008*"month"')]

In [2]:
# corpus_topics = ldamodel.get_document_topics(corpus, per_word_topics=False)
# doc_topics = [doc_topics for doc_topics in corpus_topics]

# q_i = 10
# print(corpus_topics[q_i])
# print(max(corpus_topics[q_i], key=lambda x: x[1]))
# print(dat.question_text[q_i])

(8, 0.52495134)
What can you say about feminism?

Here is topic 8 with 30 words:
(8, '0.035*"like" + 0.034*"t" + 0.028*"peopl" + 0.017*"feel" + 0.016*"can" + 0.016*"s" + 0.015*"know" + 0.015*"don" + 0.014*"life" + 0.014*"think" + 0.013*"say" + 0.013*"girl" + 0.012*"person" + 0.012*"want" + 0.011*"love" + 0.011*"someon" + 0.010*"look" + 0.010*"women" + 0.009*"just" + 0.009*"friend" + 0.008*"make" + 0.008*"one" + 0.008*"guy" + 0.008*"men" + 0.008*"realli" + 0.007*"sex" + 0.007*"see" + 0.007*"tell" + 0.006*"call" + 0.006*"m"')
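
The `max(..., key=lambda x: x[1])` call above picks the most probable topic for one question. A minimal sketch of applying the same idea to every document and counting the winners follows; `dominant_topic` and the sample probabilities are hypothetical, shaped like the per-document lists of `(topic_id, probability)` pairs that `get_document_topics` returns.

```python
from collections import Counter

def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Made-up per-document topic distributions (assumption: illustrative only).
sample_corpus_topics = [
    [(2, 0.31), (8, 0.52), (9, 0.17)],
    [(7, 0.80), (9, 0.20)],
    [(8, 0.61), (5, 0.39)],
]

# How many documents each topic "wins".
topic_counts = Counter(dominant_topic(doc)[0] for doc in sample_corpus_topics)
print(topic_counts.most_common())
```

Storing the dominant topic as a column next to `target` would then allow the same groupby comparison used later in this notebook.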

### First Model Topics

Just to handwave about what these topics might represent:
0. Studies (class, book, read, major)
1. ?
2. Work / starting a business (learn, work, company, start, business, market)
3. Trump + more studies (trump, prepare, exam, university)
4. ?
5. USA vs Other Countries (country, live, america, people, india, china, world, differ)
6. Quora questions (quora, question, ask, answer)
7. Getting a job (get, job, time, school, experience)
8. Relationship advice (like, feel, love, sex, tell, look, friend)
9. Financial advice (year, money, interest, make, month)

Which certainly seems like a valid assortment of topics on Quora! And I suspect that some topics, such as financial advice, are less likely to contain insincere questions.

Though I'm a little surprised that a clearer US politics cluster didn't turn up.

### Need Better Preprocessing!

The topics are plausible, but it's clear to me that the sloppy pre-processing didn't help.

#### Review Stemming

Notice that lots of single letters turn up, such as "s" and "t". These presumably come from contractions that were split into separate tokens, and they are not meaningful.

The letters "u" and "c" also show up. Unclear whether this is due to stemming gone wrong or SMS slang ("c u l8r").
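
To see where the single letters come from, here is a small dependency-free sketch. It uses `re.findall` with the same `\w+` pattern passed to `RegexpTokenizer` above, which should behave the same way for these inputs; `simple_tokenize` is a hypothetical helper, not part of the notebook.

```python
import re

# Same pattern as the RegexpTokenizer in preprocess_data.
TOKEN_PATTERN = re.compile(r"\w+")

def simple_tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return TOKEN_PATTERN.findall(text.lower())

print(simple_tokenize("I can't pass my exam."))
# ['i', 'can', 't', 'pass', 'my', 'exam']
print(simple_tokenize("C++ in the U.S."))
# ['c', 'in', 'the', 'u', 's']
```

Apostrophes, periods and plus signs are not word characters, so "can't" yields a stray "t", "U.S." yields "u" and "s", and "C++" collapses to "c".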

In [29]:
dat[dat['question_text_processed'].apply(lambda x: 'c' in x)].sample(10)

Unnamed: 0,qid,question_text,target,question_text_processed
976763,bf5d6efb8b84b1eb795e,Why is it so hard and frustrating for me to learn C programming? I can't pass my university exam. What am I doing wrong?,0,"[hard, frustrat, learn, c, program, can, t, pass, univers, exam, wrong]"
64536,0ca8cf24123e4ffe653c,"What is relation between mean value ""c"" and x of graph f(x) in interval [0,x]?",0,"[relat, mean, valu, c, x, graph, f, x, interv, 0, x]"
517184,6545fcfcda2015815748,Is there a C to PEP/8 assembler?,0,"[c, pep, 8, assembl]"
175170,2240e4e52f356533ae10,"Is the graviton relatable to a photon as a massless, ""pure energy"" particle with a velocity (speed really, it seems) of C?",0,"[graviton, relat, photon, massless, pure, energi, particl, veloc, speed, realli, seem, c]"
693675,87debb9c653d62fc36c2,What is a symbol for radians? I heard it's 'c' but I can't find a reliable source.,0,"[symbol, radian, heard, s, c, can, t, find, reliabl, sourc]"
364102,475f9dc75ca689c490b6,"Find the centroid of triangle ABC having vertices A (a, b) B (b, c) C (c, a) is the origin then find the value of (a+b+c)?",0,"[find, centroid, triangl, abc, vertic, b, b, b, c, c, c, origin, find, valu, b, c]"
358936,465907b20a8052d16aa2,Why we don't use any special characters without # in c or c++ programming language?,0,"[don, t, use, special, charact, without, c, c, program, languag]"
1251431,f53b974e64d3967ec88b,What is the fundamental difference between class and struct in C++? What encourages the use of one or the other instruction?,0,"[fundament, differ, class, struct, c, encourag, use, one, instruct]"
54414,0aaf0066b9972d22781c,"If I took Pete, and fattened him up till he weighed 750 lbs, and accelerated him to 0.800 C, and fired him into a volcano, what would happen?",1,"[took, pete, fatten, till, weigh, 750, lb, acceler, 0, 800, c, fire, volcano, happen]"
463619,5ac6e9f369f71489486d,Did Sir Arthur C Clarke ever start research/ prototyping on harvesting of tidal energy?,0,"[sir, arthur, c, clark, ever, start, research, prototyp, harvest, tidal, energi]"


OMG mostly references to **C the programming language**!

Other uses:
* Washington D.C.
* the speed of light, c
* educational acronyms 
* temperature
* musical notes
* chartered accountant
* math questions

In [37]:
dat['c'] = dat['question_text_processed'].apply(lambda x: 'c' in x)
dat[['target', 'c']].groupby('c').agg(['mean', 'count'])

c,target mean,target count
False,0.062002,1302410
True,0.015625,3712


In [44]:
dat['c_plus_plus'] = dat['question_text'].apply(lambda x: 'c++' in x.lower())
dat[['target', 'c_plus_plus']].groupby('c_plus_plus').agg(['mean', 'count'])

c_plus_plus,target mean,target count
False,0.061909,1305236
True,0.004515,886


In [53]:
dat[np.logical_and(dat.c_plus_plus, dat.target==1)]

Unnamed: 0,qid,question_text,target,question_text_processed,c,c_plus_plus,python,java
64148,0c9448676d165ef38c0a,"I'm 12 and I know the ins and outs of C++, am I special?",1,"[m, 12, know, in, out, c, special]",True,True,False,False
275723,35f7029ca719af9394b8,Why do Indian programmers (99%) don't know that in C/C++ programming void main() is invalid? Why don't they refer to the standard documentation?,1,"[indian, programm, 99, don, t, know, c, c, program, void, main, invalid, don, t, refer, standard, document]",True,True,False,False
456290,59616452a95fadd3aacb,"If C++, the programming language, was developed by a man, why is it difficult to learn?",1,"[c, program, languag, develop, man, difficult, learn]",True,True,False,False
872811,ab017303d09f026d62f0,Does it make me a racist if I don't want to write code in C++?,1,"[make, racist, don, t, want, write, code, c]",True,True,False,False


In [47]:
dat['java'] = dat['question_text'].apply(lambda x: 'java ' in x.lower())
dat[['target', 'java']].groupby('java').agg(['mean', 'count'])

java,target mean,target count
False,0.06193,1304828
True,0.001546,1294


In [48]:
dat[np.logical_and(dat.java, dat.target==1)]

Unnamed: 0,qid,question_text,target,question_text_processed,c,c_plus_plus,python,java
225452,2c17901de2f050beb2f8,"Is Java for ""idiots""?",1,"[java, idiot]",False,False,False,True
1165296,e458bf77e7ad154abfd3,I'm 12 years old and I can code in Java and Swift. Should I skip classes that I'm not interested in so that I can work on my projects instead?,1,"[m, 12, year, old, can, code, java, swift, skip, class, m, interest, can, work, project, instead]",False,False,False,True


In [50]:
dat['python'] = dat['question_text'].apply(lambda x: 'python' in x.lower())
dat[['target', 'python']].groupby('python').agg(['mean', 'count'])

python,target mean,target count
False,0.061924,1304932
True,0.003361,1190


In [52]:
# Python is a language for beginners huh? Yeah, this kid is punk
dat[np.logical_and(dat.python, dat.target==1)]

Unnamed: 0,qid,question_text,target,question_text_processed,c,c_plus_plus,python,java
93459,124bab56b09739615704,"I am an experienced programmer and in my high school my teacher tried to make me use python so I said, ""No; Trust me, python is just a language for beginners, thereby making it not for me."" I got sent out. Did I do anything wrong?",1,"[experienc, programm, high, school, teacher, tri, make, use, python, said, trust, python, just, languag, beginn, therebi, make, got, sent, anyth, wrong]",False,False,True,False
580604,71c2b76c9eac6519fdd1,"Does nobody think Monty Python is way less funny than Benny Hill, and that Python fans should be sent to death camps immediately?",1,"[nobodi, think, monti, python, way, less, funni, benni, hill, python, fan, sent, death, camp, immedi]",False,False,True,False
695890,884aebdbce2abfc4e94d,I have a toddler. How should she prepare herself for the job market 15 years from now in the world of AI? Should I teach her Python as soon as she is willing to learn?,1,"[toddler, prepar, job, market, 15, year, now, world, ai, teach, python, soon, will, learn]",False,False,True,False
1169922,e543c94d470d873d611e,How do I eat a Python without anyone noticing?,1,"[eat, python, without, anyon, notic]",False,False,True,False


Anyhoo, questions that mention programming languages are rarely insincere.
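
The substring checks above are a little loose: `'java '` misses a question that ends in "Java?", and `'python'` also matches "pythons". A minimal sketch using word-boundary regexes is below. `PATTERNS`, `keyword_stats` and the sample rows are hypothetical, and plain `(text, target)` tuples stand in for the DataFrame to keep the example self-contained.

```python
import re

# Hypothetical patterns; adjust for the languages of interest.
PATTERNS = {
    "c_plus_plus": re.compile(r"c\+\+", re.IGNORECASE),
    "java": re.compile(r"\bjava\b", re.IGNORECASE),  # does not match "javascript"
    "python": re.compile(r"\bpython\b", re.IGNORECASE),
}

def keyword_stats(rows, pattern):
    """Return (mean target, count) over rows whose text matches `pattern`.

    rows: iterable of (question_text, target) pairs.
    """
    targets = [target for text, target in rows if pattern.search(text)]
    if not targets:
        return None, 0
    return sum(targets) / len(targets), len(targets)

# Tiny made-up sample in the same shape as the training data.
sample_rows = [
    ('Is Java for "idiots"?', 1),
    ("Should I learn Java?", 0),
    ("Is JavaScript hard?", 0),
    ("How do I eat a Python without anyone noticing?", 1),
]
print(keyword_stats(sample_rows, PATTERNS["java"]))
```

Note that a word boundary still cannot tell Monty Python from the programming language; that would need context, not just a pattern.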

In [57]:
dat[dat['question_text_processed'].apply(lambda x: 'd' in x)].sample(10)

Unnamed: 0,qid,question_text,target,question_text_processed,c,c_plus_plus,python,java
1079369,d384ef146ac95bb76e50,How can I be a better D.Va main?,0,"[can, better, d, va, main]",False,False,False,False
534957,68c6c74efcbf917af365,Someone has landed US on tourist visa from India 10hours ago. She is missing since then. What could b d probable reason(s)?,0,"[someon, land, us, tourist, visa, india, 10hour, ago, miss, sinc, b, d, probabl, reason, s]",False,False,False,False
1453,0048b4bb125bd235d94d,"Is the term ""mofo"" objectionable as in ""Adam D'Angelo fancies himself a bad mofo""?",1,"[term, mofo, objection, adam, d, angelo, fanci, bad, mofo]",False,False,False,False
929999,b6419b13adffb9ec9665,"When the d.a.files on a person and the warrent is put out like on dec, 27, and the person is arrest on dec 31 is this warrent a fta?",0,"[d, file, person, warrent, put, like, dec, 27, person, arrest, dec, 31, warrent, fta]",False,False,False,False
113641,163cbc41671cbc2e93aa,Can I do my Ph.D straight after my B.Tech regardless of my field?,0,"[can, ph, d, straight, b, tech, regardless, field]",False,False,False,False
291421,391424d0111602d1fdd1,How do I get over not being raised as well as I'd liked?,0,"[get, rais, well, d, like]",False,False,False,False
316887,3e1b6b48d1f5076bf4bb,"Why were electron orbitals named s, p, d, f, in that order?",0,"[electron, orbit, name, s, p, d, f, order]",False,False,False,False
1153583,e2083685f7548f8cd46b,IF parliament has 500 members d 250 are from one party and the other 250 belongs to other and if the pm from first hf dies then pm will be from where?,0,"[parliament, 500, member, d, 250, one, parti, 250, belong, pm, first, hf, die, pm, will]",False,False,False,False
962222,bc85eb3965e9fcc79b37,Would you eat cake with the person who A2A'd you?,0,"[eat, cake, person, a2a, d]",False,False,False,False
2620,0082d1cfa9207a217e3f,How do you solve [math]\frac{d^2 y}{dt^2} + 1000y=10 sin {10t}[/math] using Undetermined Coefficients and using Laplace Transform ?,0,"[solv, math, frac, d, 2, y, dt, 2, 1000i, 10, sin, 10t, math, use, undetermin, coeffici, use, laplac, transform]",False,False,False,False


Notes:
* he'd -> he would; I'd -> I would; Why'd -> why would; you'd -> you would
* D' are mostly names
* Ph.D -> PhD
* D.C -> Washington DC

In [60]:
dat[dat['question_text_processed'].apply(lambda x: 'u' in x)].sample(10)

Unnamed: 0,qid,question_text,target,question_text_processed,c,c_plus_plus,python,java
765384,95f29f7d108480e91923,What blood thinners are currently used in the U.S. To treat deep vein thrombosis?,0,"[blood, thinner, current, use, u, s, treat, deep, vein, thrombosi]",False,False,False,False
284597,37b99cc243cc2ffcf6c3,Can u give the list of b.SC courses from which I can appear the m.SC of AIIMS?,0,"[can, u, give, list, b, sc, cours, can, appear, m, sc, aiim]",False,False,False,False
72388,0e3111b90fab5d3e40ff,When u take off the chip can they still track it?,0,"[u, take, chip, can, still, track]",False,False,False,False
351807,44f3950f317d5654a26d,In which tube player u can download multiformat videos?,0,"[tube, player, u, can, download, multiformat, video]",False,False,False,False
639821,7d520af3b59fe8af3a39,"Does any Arab or Persian see anything in this video about Israeli nuclear weapons development and the combined U.S., Israeli, French, German, and South African military strategies for Israel's nuclear weapons that you know is NOT true?",0,"[arab, persian, see, anyth, video, isra, nuclear, weapon, develop, combin, u, s, isra, french, german, south, african, militari, strategi, israel, s, nuclear, weapon, know, true]",False,False,False,False
1030043,c9d78f44545e946aa04b,What is the value of (A U B U C) ∩ (A ∩ B' ∩ C') ∩ C'?,0,"[valu, u, b, u, c, b, c, c]",True,False,False,False
1087907,d5312129965242fd89ec,"Can u make ""box"" between computer ethernet port and modem, obtain the computer message modify it and forward it to the original destination?",0,"[can, u, make, box, comput, ethernet, port, modem, obtain, comput, messag, modifi, forward, origin, destin]",False,False,False,False
1038451,cb79e4a9fec2bd5a7669,Why aren't the U.S. elections done on Twitter?,0,"[aren, t, u, s, elect, done, twitter]",False,False,False,False
1258497,f69fef95182308e0466c,What would be a good reason for the U.S to lie about Russia meddling in their election?,0,"[good, reason, u, s, lie, russia, meddl, elect]",False,False,False,False
507176,634f073372808a389f99,Should the U.S. continue using drone strikes against terrorists?,0,"[u, s, continu, use, drone, strike, terrorist]",False,False,False,False


* U.S or U.S. -> United States of America or USA (check embedding vocabulary)
* U -> You
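
The notes above could be turned into a normalization pass that runs before tokenization. This is only a sketch under assumptions: `RULES` and `normalize` are hypothetical, the rule list is far from complete, `'d` is always expanded to "would" even though it can also mean "had", and "D.C" is collapsed to "DC" rather than spelled out.

```python
import re

# Hypothetical rules derived from the notes above. Order matters: the
# abbreviation rules must run before the lone "u" rule.
RULES = [
    (re.compile(r"\bU\.S\.A\b\.?", re.IGNORECASE), "USA"),
    (re.compile(r"\bU\.S\.?(?![\w.])", re.IGNORECASE), "USA"),
    (re.compile(r"\bPh\.D\b\.?", re.IGNORECASE), "PhD"),
    (re.compile(r"\bD\.C\b\.?", re.IGNORECASE), "DC"),
    # Only pronouns and "why", following the notes; "A2A'd" is left alone.
    (re.compile(r"\b(he|she|i|you|we|they|why)'d\b", re.IGNORECASE), r"\1 would"),
    # Lowercase only, so the set-union symbol in "A U B" is untouched.
    (re.compile(r"\bu\b"), "you"),
]

def normalize(text):
    """Apply each rewrite rule in order and return the cleaned text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(normalize("Can u study for a Ph.D in the U.S. after B.Tech?"))
print(normalize("Why'd he say I'd like it?"))
```

Whether "USA" or "United States" is the better target probably depends on which form the embedding vocabulary actually contains.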

#### Review Stop Words

In [11]:
' '.join(get_stop_words('en'))

"a about above after again against all am an and any are aren't as at be because been before being below between both but by can't cannot could couldn't did didn't do does doesn't doing don't down during each few for from further had hadn't has hasn't have haven't having he he'd he'll he's her here here's hers herself him himself his how how's i i'd i'll i'm i've if in into is isn't it it's its itself let's me more most mustn't my myself no nor not of off on once only or other ought our ours ourselves out over own same shan't she she'd she'll she's should shouldn't so some such than that that's the their theirs them themselves then there there's these they they'd they'll they're they've this those through to too under until up very was wasn't we we'd we'll we're we've were weren't what what's when when's where where's which while who who's whom why why's with won't would wouldn't you you'd you'll you're you've your yours yourself yourselves"

Some important notes about this stop word set:
1. Questions that lead with "why" or "are" were far more likely to be insincere than questions that lead with "how", "what", "where" or "which". Treating these words as stop words might be fine for topic modeling, but isn't a good idea for, say, TF-IDF/Naive Bayes.
2. It is unclear to me that stopping "can't" and "cannot" but not "can" is a good idea.
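
One way to act on both notes is to build a separate stop word set per task. The sketch below is an assumption-laden illustration: `build_stop_set` and `QUESTION_LEADS` are hypothetical names, and `base_stop` is a small hand-picked subset standing in for `get_stop_words('en')`.

```python
# Illustrative subset of the stop list printed above (assumption: the real
# list would come from get_stop_words('en')).
base_stop = {"a", "are", "can't", "cannot", "how", "is", "the",
             "what", "where", "which", "why"}

# Leading question words that may carry signal for the classifier.
QUESTION_LEADS = {"why", "are", "how", "what", "where", "which"}

def build_stop_set(base, keep=frozenset(), extra=frozenset()):
    """Return base minus the words to keep, plus any extra stop words."""
    return (set(base) - set(keep)) | set(extra)

# For a classifier: keep the question-leading words as features.
stop_for_classifier = build_stop_set(base_stop, keep=QUESTION_LEADS)

# For topic modeling: also stop "can", to be consistent with "cannot".
stop_for_topics = build_stop_set(base_stop, extra={"can"})

print(sorted(stop_for_classifier))
print(sorted(stop_for_topics))
```

One related gotcha: with the `\w+` tokenizer, "can't" is split into "can" and "t" before the stop list is applied, so stop list entries that contain an apostrophe never match any token.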