<img src='http://pm1.narvii.com/6829/9ea1074cb40eb7b2111e9c4d2c662a35648f5796v2_00.jpg'/>

<h1><center> Topic Modeling </center></h1>

In [1]:
import pandas as pd
import pickle

from gensim import matutils, models
import scipy.sparse
import nltk
#nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize, pos_tag

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

Topic modeling examines the words in a text to determine what topics the text is about. Here we will try to find topics using LDA (Latent Dirichlet Allocation), going through several iterations. For now, I will use parts of speech to choose which words from the texts go into each iteration.

### Iteration 1: All Text

In [70]:
# Term-document matrix (tdm): gensim needs the transpose of the document-term matrix (dtm)
personComments_topic1 = pd.read_pickle('personComments_dtm2.pkl').transpose()
personComments_topic1.head()

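| | Danielle Cohn | Emily Ann Shaheen | Madeline & Eric | Madison Beer | Rebecca Black | Richard Gale | Shannon Beveridge | Steven Assanti |
|---|---|---|---|---|---|---|---|---|
| aaa | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| aaaa | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| aaaaaaaaaaaaa | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| aaaaaaakward | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| aaw | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |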


In [71]:
# Putting tdm into gensim format, 
# df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(personComments_topic1)
corpus = matutils.Sparse2Corpus(sparse_counts)
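
To make the target format concrete: gensim treats a corpus as a sequence of documents, each a sparse list of `(term_id, count)` pairs, and `Sparse2Corpus` reads each column of the sparse matrix as one document by default, which is why the matrix is transposed first. The sketch below mimics that conversion in plain Python on a made-up matrix. The terms, counts, and the helper `columns_to_bow` are hypothetical and only illustrate the shape of the result, not gensim's implementation.

```python
# Illustration only: a tiny term-document matrix (rows = terms, columns = documents).
# The terms and counts are made up for this sketch.
terms = ["friday", "song", "voice"]
tdm = [
    [3, 0],  # "friday": 3 times in doc 0, absent from doc 1
    [1, 2],  # "song"
    [0, 4],  # "voice"
]

def columns_to_bow(matrix):
    """Turn each column into a list of (term_id, count) pairs, skipping zeros."""
    n_docs = len(matrix[0])
    return [
        [(term_id, row[doc]) for term_id, row in enumerate(matrix) if row[doc] != 0]
        for doc in range(n_docs)
    ]

toy_corpus = columns_to_bow(tdm)
print(toy_corpus)  # [[(0, 3), (1, 1)], [(1, 2), (2, 4)]]
```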

In [72]:
# Dictionary of all terms and their respective positions in the tdm
with open("cv_stop.pkl", "rb") as f:
    cv = pickle.load(f)
id2word = dict((v, k) for k, v in cv.vocabulary_.items())
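
As a small illustration of the inversion above: `vocabulary_` maps each term to its column index, while gensim needs the opposite direction, from id to term. The vocabulary below is invented for the sketch.

```python
# Hypothetical vocabulary in the shape of CountVectorizer.vocabulary_ (term -> column index).
toy_vocabulary = {"friday": 0, "song": 1, "voice": 2}

# Invert it so a column index can be looked up to recover the term.
toy_id2word = {index: term for term, index in toy_vocabulary.items()}
print(toy_id2word[2])  # voice
```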

**LDA** (Latent Dirichlet Allocation)

LDA is a topic model that learns two sets of distributions: a distribution over topics for each document and a distribution over words for each topic.
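
A rough way to see how the two kinds of distribution fit together: the probability of a word appearing in a document is the sum, over topics, of the document's weight on that topic times the topic's weight on that word. The numbers and topic names below are made up for illustration; a fitted LDA model would estimate them from the data.

```python
# Made-up numbers for illustration; a fitted LDA model would estimate these from data.
topic_given_doc = {"music": 0.75, "school": 0.25}  # topics-per-document
word_given_topic = {                                # words-per-topic
    "music": {"song": 0.5, "voice": 0.5, "class": 0.0},
    "school": {"song": 0.0, "voice": 0.0, "class": 1.0},
}

def word_probability(word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        p_topic * word_given_topic[topic][word]
        for topic, p_topic in topic_given_doc.items()
    )

print(word_probability("song"))   # 0.375
print(word_probability("class"))  # 0.25
```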

In [73]:
# Specifying LDA with the number of topics and the number of passes
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=20)
lda.print_topics()

[(0,
  '0.021*"friday" + 0.011*"guys" + 0.008*"song" + 0.007*"good" + 0.007*"cute" + 0.007*"baby" + 0.005*"girl" + 0.005*"gay" + 0.004*"time" + 0.004*"make"'),
 (1,
  '0.008*"kid" + 0.008*"bully" + 0.007*"shit" + 0.006*"got" + 0.006*"fat" + 0.006*"little" + 0.006*"quot" + 0.006*"ass" + 0.006*"did" + 0.006*"fuck"'),
 (2,
  '0.020*"voice" + 0.010*"sing" + 0.010*"beautiful" + 0.009*"good" + 0.009*"amazing" + 0.007*"hate" + 0.007*"pretty" + 0.006*"song" + 0.006*"better" + 0.005*"talented"'),
 (3,
  '0.030*"arab" + 0.014*"girl" + 0.009*"lol" + 0.007*"gold" + 0.006*"omg" + 0.006*"arabs" + 0.006*"thanks" + 0.006*"thank" + 0.006*"arabic" + 0.005*"ann"')]

### Iteration 2: Nouns

In [74]:
#Function to return nouns from text
def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)
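
The prefix test `pos[:2] == 'NN'` works because the Penn Treebank noun tags (`NN`, `NNS`, `NNP`, `NNPS`) all start with `NN`, and the adjective tags (`JJ`, `JJR`, `JJS`) all start with `JJ`. The sketch below shows the filtering step on hand-tagged tokens so that it runs without NLTK. The tags were written by hand rather than produced by `pos_tag`, and `keep_by_prefix` is a hypothetical helper.

```python
# Hand-tagged tokens standing in for the output of pos_tag; the tags are
# Penn Treebank tags written by hand here, not produced by running NLTK.
tagged = [("my", "PRP$"), ("old", "JJ"), ("girlfriend", "NN"),
          ("had", "VBD"), ("books", "NNS"), ("about", "IN"), ("friday", "NNP")]

def keep_by_prefix(tagged_tokens, prefixes):
    """Keep the words whose tag starts with any of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(tuple(prefixes))]

print(keep_by_prefix(tagged, ["NN"]))        # ['girlfriend', 'books', 'friday']
print(keep_by_prefix(tagged, ["NN", "JJ"]))  # ['old', 'girlfriend', 'books', 'friday']
```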

In [75]:
#Reading Clean data
personComments_corpus = pd.read_pickle('personComments_corpus.pkl')
personComments_corpus

| | comments |
|---|---|
| Danielle Cohn | disgusting so u and you had a first time thi... |
| Emily Ann Shaheen | im arab from iraq but my golden name is in eng... |
| Madeline & Eric | your grocery carts are sooo small ours look g... |
| Madison Beer | be strong that s my fav song of demi lovato i ... |
| Rebecca Black | my old girlfriend also had a book titled knot ... |
| Richard Gale | hes a bitch richard your a n idiot mate he bul... |
| Shannon Beveridge | my ex broke up with me because she wanted to b... |
| Steven Assanti | steven is a sagittarius we be great is my guy... |


In [76]:
# Keep only the nouns from the clean data
personComments_nouns = pd.DataFrame(personComments_corpus.comments.apply(nouns) )
personComments_nouns

| | comments |
|---|---|
| Danielle Cohn | time video mom i i halo mates year y shoes bed... |
| Emily Ann Shaheen | arab iraq name woman day i i adult i sooooo ne... |
| Madeline & Eric | grocery carts ours family couple videooo cutee... |
| Madison Beer | fav song demi lovato i feels yesterday m madis... |
| Rebecca Black | girlfriend book knot t rope ya way friday rebe... |
| Richard Gale | hes bitch idiot mate someone deserve victim po... |
| Shannon Beveridge | ex m years anyone i d date lol babe personalit... |
| Steven Assanti | steven sagittarius guy shit ass m years man ma... |


In [77]:
# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = ['like', 'just', 'don', 'br', 'video', 'know', 'people', 'love', 'really', 'dani', 
 'danielle', 'madison', 'rebecca', 'shannon', 'richard', 'casey', 'emily', 'steven']
# Recent scikit-learn releases expect stop_words as a list rather than a frozenset
stop_words = list(text.ENGLISH_STOP_WORDS.union(add_stop_words))

# Recreating Document-Term matrix with only nouns

cv_noun = CountVectorizer(stop_words=stop_words)
data_cv_noun = cv_noun.fit_transform(personComments_nouns.comments)
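# get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it
personComments_dtm_nouns = pd.DataFrame(data_cv_noun.toarray(), columns=cv_noun.get_feature_names_out())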
personComments_dtm_nouns.index = personComments_nouns.index
personComments_dtm_nouns

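| | aaaa | aaaaaaaaaaaaa | aaaaaaakward | aawww | abbara | abdulla | abidal | ability | abit | abortion | ... | ووهي | يا | ياسمين | يتدلى | يديه | يستوعبها | يشبه | يعارض | ᴛʜᴜᴍʙɴᴀɪʟ | ᴡᴛғ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Danielle Cohn | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Emily Ann Shaheen | 0 | 0 | 0 | 1 | 2 | 1 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| Madeline & Eric | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Madison Beer | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Rebecca Black | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Richard Gale | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Shannon Beveridge | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Steven Assanti | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |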


In [78]:
# Creating the gensim corpus with only nouns
corpus_noun = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(personComments_dtm_nouns.transpose()))

# Create the vocabulary dictionary
id2word_noun = dict((v, k) for k, v in cv_noun.vocabulary_.items())

In [79]:
# Creating LDA with nouns with 4 topics
lda_noun = models.LdaModel(corpus=corpus_noun, num_topics=4, id2word=id2word_noun, passes=20)
lda_noun.print_topics()

[(0,
  '0.013*"voice" + 0.012*"kid" + 0.008*"time" + 0.008*"years" + 0.007*"girl" + 0.007*"life" + 0.007*"way" + 0.007*"quot" + 0.006*"ass" + 0.005*"guys"'),
 (1,
  '0.023*"song" + 0.020*"friday" + 0.009*"youtube" + 0.009*"fun" + 0.007*"music" + 0.007*"guy" + 0.007*"https" + 0.007*"watch" + 0.007*"ass" + 0.006*"www"'),
 (2,
  '0.017*"baby" + 0.017*"girl" + 0.014*"guys" + 0.011*"family" + 0.008*"arab" + 0.008*"parents" + 0.007*"school" + 0.007*"thanks" + 0.007*"lol" + 0.007*"time"'),
 (3,
  '0.000*"girl" + 0.000*"guys" + 0.000*"lol" + 0.000*"life" + 0.000*"ass" + 0.000*"kid" + 0.000*"time" + 0.000*"shit" + 0.000*"guy" + 0.000*"voice"')]

### Iteration 3: Nouns and Adjectives

In [80]:
#Function to return nouns and adjectives from text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    # Use a distinct local name so the function name nouns_adj is not shadowed
    selected = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(selected)

In [81]:
# Apply the nouns_adj function to keep only nouns and adjectives
personComments_nounsAdj = pd.DataFrame(personComments_corpus.comments.apply(nouns_adj))
personComments_nounsAdj

| | comments |
|---|---|
| Danielle Cohn | u first time video mom worst i mate i halo mat... |
| Emily Ann Shaheen | im arab iraq golden name english much i arab w... |
| Madeline & Eric | grocery carts sooo small ours gigantic family ... |
| Madison Beer | strong s fav song demi lovato i feels yesterda... |
| Rebecca Black | old girlfriend book knot t rope ya long way fr... |
| Richard Gale | hes bitch n idiot mate someone t deserve victi... |
| Shannon Beveridge | ex lesbian m years anyone beautiful i d date l... |
| Steven Assanti | steven sagittarius great guy alive s gross shi... |


In [82]:
# Create a new document-term matrix using only nouns and adjectives, also remove common words with max_df
cv_nouns_adj = CountVectorizer(stop_words=stop_words, max_df=.8)
data_cv_nouns_adj = cv_nouns_adj.fit_transform(personComments_nounsAdj.comments)
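# get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it
personComments_dtm_nounsAdj = pd.DataFrame(data_cv_nouns_adj.toarray(), columns=cv_nouns_adj.get_feature_names_out())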
personComments_dtm_nounsAdj.index = personComments_nounsAdj.index
personComments_dtm_nounsAdj

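| | aaa | aaaa | aaaaaaaaaaaaa | aaaaaaakward | aawww | abbara | abdoh | abdulla | abidal | ability | ... | يا | ياسمين | يتدلى | يديه | يستوعبها | يشبه | يعارض | ᴛʜɪs | ᴛʜᴜᴍʙɴᴀɪʟ | ᴡᴛғ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Danielle Cohn | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| Emily Ann Shaheen | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 1 | 1 | 0 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Madeline & Eric | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Madison Beer | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Rebecca Black | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Richard Gale | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Shannon Beveridge | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Steven Assanti | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |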
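
To show what `max_df=.8` does: a term is dropped when the share of documents containing it is strictly greater than the threshold. With the 8 uploaders here, that means terms appearing in 7 or 8 of the documents. The sketch below reproduces the rule in plain Python on an invented mini-corpus. `filter_by_max_df` is a hypothetical helper, not part of scikit-learn.

```python
# Hypothetical mini-corpus: each document is reduced to the set of terms it contains.
docs = [
    {"good", "song", "friday"},
    {"good", "voice"},
    {"good", "kid"},
    {"good", "family", "song"},
    {"good", "arab"},
]

def filter_by_max_df(documents, max_df):
    """Drop terms whose document frequency (share of documents) exceeds max_df."""
    n_docs = len(documents)
    vocabulary = set().union(*documents)
    return sorted(
        term for term in vocabulary
        if sum(term in doc for doc in documents) / n_docs <= max_df
    )

# "good" appears in 5 of 5 documents (1.0 > 0.8), so it is the only term removed.
print(filter_by_max_df(docs, 0.8))
# ['arab', 'family', 'friday', 'kid', 'song', 'voice']
```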


In [83]:
# Create the gensim corpus
corpus_nouns_adj = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(personComments_dtm_nounsAdj.transpose()))

# Create the vocabulary dictionary
id2word_nouns_adj = dict((v, k) for k, v in cv_nouns_adj.vocabulary_.items())

In [67]:
# LDA topic modeling with 4 topics
lda_nouns_adj = models.LdaModel(corpus=corpus_nouns_adj, num_topics=4, id2word=id2word_nouns_adj, passes=80)
lda_nouns_adj.print_topics()

[(0,
  '0.095*"friday" + 0.033*"song" + 0.009*"music" + 0.007*"david" + 0.007*"voice" + 0.006*"great" + 0.006*"weekend" + 0.006*"ta" + 0.006*"black" + 0.005*"partyin"'),
 (1,
  '0.036*"voice" + 0.017*"beautiful" + 0.011*"song" + 0.009*"amazing" + 0.008*"singer" + 0.007*"famous" + 0.006*"haters" + 0.006*"shes" + 0.005*"perfect" + 0.005*"opinion"'),
 (2,
  '0.012*"kid" + 0.009*"gay" + 0.009*"fat" + 0.007*"lesbian" + 0.007*"mikey" + 0.006*"friend" + 0.006*"https" + 0.005*"sex" + 0.005*"victim" + 0.005*"bitch"'),
 (3,
  '0.029*"arab" + 0.015*"cute" + 0.010*"family" + 0.008*"beautiful" + 0.007*"gold" + 0.007*"pregnant" + 0.007*"thanks" + 0.006*"arabs" + 0.006*"girls" + 0.006*"arabic"')]

In [69]:
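# Let's look at the dominant topic in each uploader's comments
corpus_transformed = lda_nouns_adj[corpus_nouns_adj]
# Pick the highest-probability topic; unpacking with [(a, b)] fails if a document gets more than one topic
list(zip([max(doc, key=lambda pair: pair[1])[0] for doc in corpus_transformed], personComments_dtm_nounsAdj.index))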

[(2, 'Danielle Cohn'),
 (3, 'Emily Ann Shaheen'),
 (3, 'Madeline & Eric'),
 (1, 'Madison Beer'),
 (0, 'Rebecca Black'),
 (2, 'Richard Gale'),
 (2, 'Shannon Beveridge'),
 (2, 'Steven Assanti')]
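
The assignment above relies on the shape of what the model returns for each document: a list of `(topic_id, probability)` pairs, which may hold more than one pair when a document mixes topics. A minimal sketch of choosing the dominant topic from such a list follows. The distributions are made up, and `dominant_topic` is a hypothetical helper.

```python
# Made-up topic distributions in the shape gensim returns: a list of
# (topic_id, probability) pairs per document.
toy_distributions = {
    "doc_a": [(0, 0.97)],
    "doc_b": [(1, 0.30), (2, 0.65)],
}

def dominant_topic(distribution):
    """Return the topic_id with the highest probability."""
    topic_id, _ = max(distribution, key=lambda pair: pair[1])
    return topic_id

print({name: dominant_topic(dist) for name, dist in toy_distributions.items()})
# {'doc_a': 0, 'doc_b': 2}
```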

### Topics and People


   
* **Topic 0:** song/music/voice, Friday/party/weekend

  Rebecca Black

* **Topic 1:** beautiful/perfect/amazing song/music/voice

  Madison Beer

* **Topic 2:** gay/lesbian, sex, kid, fat, victim, b**ch

  Danielle Cohn, Shannon Beveridge, Steven Assanti, Richard Gale

* **Topic 3:** arab, gold, girls, family, beautiful, pregnant

  Emily Ann Shaheen, Madeline & Eric


#### Insights

From the topics that come up in each uploader's comments, we can see that **Danielle Cohn, Shannon Beveridge, Steven Assanti, and Richard Gale**, whose comments all fall under Topic 2, are more prone to harsh comments than anyone else in the list.