### FOR LDA ANALYSIS OF MEDIUM.COM CORPUS

A potentially useful feature could be to compare the topic distribution of each sentence in an article with the topic distribution of the article itself. The question is: could sentences that are highlighted contain words that are more or less associated with the topic of the article than sentences that are not highlighted?

This type of analysis requires a topic model, such as Latent Dirichlet Allocation (LDA). Here, I use LDA (from the gensim library) to generate topics for the corpus of articles scraped from Medium.com and calculate a topic vector for each article. Then, when calculating features for the sentences in the dataset, I can apply the same LDA model to generate a topic vector for each sentence and calculate a cosine similarity score between the topic vector of the sentence and that of the article it belongs to.
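As a minimal sketch of the similarity score described above, the snippet below computes cosine similarity between two dense topic vectors using only the standard library. The function name `cosine_similarity` and the 3-topic example vectors are assumptions made for illustration; they are not taken from the notebook or from gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length dense vectors.

    Returns 0.0 if either vector is all zeros, to avoid dividing by zero.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical 3-topic vectors, for illustration only
article_vec = [0.7, 0.2, 0.1]
on_topic_sentence = [0.8, 0.1, 0.1]
off_topic_sentence = [0.0, 0.1, 0.9]

sim_on = cosine_similarity(article_vec, on_topic_sentence)
sim_off = cosine_similarity(article_vec, off_topic_sentence)
print(sim_on, sim_off)
```

A sentence whose topic mix resembles the article's scores close to 1, while a sentence dominated by a different topic scores much lower. Cosine similarity ignores vector length, so it does not matter whether the topic vectors sum exactly to 1.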

In [229]:
import matplotlib.pyplot as plt
import csv
from textblob import TextBlob, Word
import pandas as pd
import sklearn
import pickle
import numpy as np
import scipy
import nltk.data
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import learning_curve, GridSearchCV, StratifiedKFold, cross_val_score, train_test_split 
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
from nltk.tokenize import RegexpTokenizer
word_tokenizer = RegexpTokenizer(r'\s+', gaps=True)
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder
from nltk.stem.porter import PorterStemmer

from stop_words import get_stop_words
stop_en = get_stop_words('en')
p_stemmer = PorterStemmer()
en_words = set(nltk.corpus.words.words())


from gensim import corpora, models
import gensim

import timeit
import re
import string
from string import whitespace, punctuation

from nltk.corpus import stopwords
stopw_en = stopwords.words('english')
print(stopw_en)
print(stop_en)
print(len(stopw_en))
print(len(stop_en))
all_stopw = set(stopw_en) | set(stop_en)
print(len(all_stopw))


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

In [5]:
set_tr = pickle.load(open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/set_tr_2','rb'))
# set_tr includes all unique highlight/text combos, plus text with highlights removed

In [11]:
# print(set_tr)
# print(set_tr['textwohighlight'][0])

https://unsplash.com/?photo=pFqrYbhIAXs 35 Things You Need to Give Up to Be Successful 1. Wasting Five Minutes When you have five minutes of down-time, how do you spend that time? Most people use it as an excuse to rest or laze. By lazing for 5 five minute breaks each day, we waste 25 minutes daily. That’s 9,125 minutes per year (25 X 365). Sadly, my guess is we’re wasting far more time than that. I was once told by my 9th grade English teacher that if I read every time I had a break — even if the break was just for a minute or two — that I’d get a lot more reading done than expected. She was right. Every time I finished my work early, or had a spare moment, I’d pick up a book and read. How we spend our periodic five minute breaks is a determining factor to what we achieve in our lives. Every little bit adds up. Why can we justify wasting so much time? 2. Not Valuing One Dollar I was recently in Wal-Mart with my mother-in-law buying a few groceries. While we were in the check-out line,

### Initial text processing



In [102]:
all_texts_processed = []

# build the punctuation-stripping table once, outside the loop
translator = str.maketrans('', '', string.punctuation)

for n, text in enumerate(set_tr['text']):
    # combine sentences
    txt = ' '.join(text)
    # remove em dashes, then ASCII punctuation
    # (curly quotes are not in string.punctuation, so they survive this step)
    txt2 = re.sub('\u2014', '', txt)
    txt3 = txt2.translate(translator)
    # split text into lowercase words
    tokens = word_tokenizer.tokenize(txt3.lower())
    # remove stop words
    nostop_tokens = [i for i in tokens if i not in all_stopw]
    # stem words
    stemmed = [p_stemmer.stem(i) for i in nostop_tokens]
    # append to processed texts
    all_texts_processed.append(stemmed)
    if n == 0:
#         print(txt)
#         print(tokens)
#         print(nostop_tokens)
        print(stemmed)
#     if n == 5:
#         break

['httpsunsplashcomphotopfqrybhiax', '35', 'thing', 'need', 'give', 'success', '1', 'wast', 'five', 'minut', 'five', 'minut', 'downtim', 'spend', 'time', 'peopl', 'use', 'excus', 'rest', 'laze', 'laze', '5', 'five', 'minut', 'break', 'day', 'wast', '25', 'minut', 'daili', 'that’', '9125', 'minut', 'per', 'year', '25', 'x', '365', 'sadli', 'guess', 'we’r', 'wast', 'far', 'time', 'told', '9th', 'grade', 'english', 'teacher', 'read', 'everi', 'time', 'break', 'even', 'break', 'minut', 'two', 'i’d', 'get', 'lot', 'read', 'done', 'expect', 'right', 'everi', 'time', 'finish', 'work', 'earli', 'spare', 'moment', 'i’d', 'pick', 'book', 'read', 'spend', 'period', 'five', 'minut', 'break', 'determin', 'factor', 'achiev', 'live', 'everi', 'littl', 'bit', 'add', 'justifi', 'wast', 'much', 'time', '2', 'valu', 'one', 'dollar', 'recent', 'walmart', 'motherinlaw', 'buy', 'groceri', 'checkout', 'line', 'point', 'item', 'thought', 'interest', 'honestli', 'can’t', 'rememb', 'anymor', 'stuck', 'said', '“o

In [74]:
# #### testing removal of punctuation

# test = ' '.join(set_tr['text'][0])
# # print(test.lower())
# # tkns = word_tokenizer.tokenize(test.lower().translate(dict.fromkeys(string.punctuation)))
# # print(tkns)
# # test2 = ''.join(e for e in test if e.isalnum())
# # print(test2)
# # test3 = test.translate(None, string.punctuation)
# # print(test3)
# # tkns = nltk.word_tokenize(test.lower().translate(dict.fromkeys(string.punctuation)))
# # print(tkns)
# translator = str.maketrans('', '', string.punctuation)

# test2 = re.sub(u'\u2014','',test)

# print(test2.translate(translator))

httpsunsplashcomphotopFqrYbhIAXs 35 Things You Need to Give Up to Be Successful 1 Wasting Five Minutes When you have five minutes of downtime how do you spend that time Most people use it as an excuse to rest or laze By lazing for 5 five minute breaks each day we waste 25 minutes daily That’s 9125 minutes per year 25 X 365 Sadly my guess is we’re wasting far more time than that I was once told by my 9th grade English teacher that if I read every time I had a break  even if the break was just for a minute or two  that I’d get a lot more reading done than expected She was right Every time I finished my work early or had a spare moment I’d pick up a book and read How we spend our periodic five minute breaks is a determining factor to what we achieve in our lives Every little bit adds up Why can we justify wasting so much time 2 Not Valuing One Dollar I was recently in WalMart with my motherinlaw buying a few groceries While we were in the checkout line I pointed an item out to her I thoug

### Save all_texts_processed


In [103]:
# use a context manager so the file is flushed and closed after writing
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_processedtexts','wb') as flda_processedtexts:
    pickle.dump(all_texts_processed, flda_processedtexts)


In [104]:
# Build the gensim dictionary (token -> integer id)
dictionary = corpora.Dictionary(all_texts_processed)
# Convert each article to a bag-of-words list of (token_id, count) pairs
corpus = [dictionary.doc2bow(text) for text in all_texts_processed]
print(corpus[0])

[(0, 1), (1, 2), (2, 40), (3, 11), (4, 14), (5, 19), (6, 1), (7, 5), (8, 9), (9, 8), (10, 1), (11, 13), (12, 29), (13, 50), (14, 12), (15, 1), (16, 1), (17, 2), (18, 2), (19, 5), (20, 11), (21, 5), (22, 3), (23, 10), (24, 1), (25, 2), (26, 18), (27, 1), (28, 1), (29, 4), (30, 1), (31, 4), (32, 2), (33, 4), (34, 1), (35, 1), (36, 1), (37, 1), (38, 7), (39, 23), (40, 13), (41, 5), (42, 3), (43, 30), (44, 4), (45, 4), (46, 4), (47, 8), (48, 1), (49, 44), (50, 2), (51, 1), (52, 5), (53, 1), (54, 9), (55, 1), (56, 4), (57, 1), (58, 3), (59, 16), (60, 4), (61, 1), (62, 1), (63, 2), (64, 9), (65, 1), (66, 17), (67, 23), (68, 10), (69, 3), (70, 1), (71, 1), (72, 7), (73, 1), (74, 1), (75, 2), (76, 5), (77, 1), (78, 8), (79, 1), (80, 1), (81, 7), (82, 2), (83, 2), (84, 2), (85, 12), (86, 2), (87, 2), (88, 1), (89, 1), (90, 2), (91, 24), (92, 11), (93, 3), (94, 6), (95, 1), (96, 4), (97, 1), (98, 12), (99, 12), (100, 4), (101, 1), (102, 8), (103, 3), (104, 2), (105, 3), (106, 2), (107, 18), (108

In [105]:
# save dictionary and corpus
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_dictionary','wb') as flda_dictionary:
    pickle.dump(dictionary, flda_dictionary)
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_corpus','wb') as flda_corpus:
    pickle.dump(corpus, flda_corpus)


In [106]:
# Run LDA
# choose 10 topics, 1 pass for initial try and time it

tic = timeit.default_timer()
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word = dictionary, passes=1)
toc = timeit.default_timer()
print(str(toc - tic) + ' seconds elapsed')
# current: with new dictionary/corpus excluding more stopwords

61.63819993999641 seconds elapsed


In [81]:
# # Run LDA
# # choose 20 topics, 1 pass to test how much time increases

# tic = timeit.default_timer()
# ldamodel2 = gensim.models.ldamodel.LdaModel(corpus, num_topics=20, id2word = dictionary, passes=1)
# toc = timeit.default_timer()
# print(str(toc - tic) + ' seconds elapsed')
# # current: with old dictionary/corpus without excluding nltk stopwords

52.94888425300451 seconds elapsed


In [82]:
# # Run LDA
# # choose 100 topics, 1 pass to test how much time increases

# tic = timeit.default_timer()
# ldamodel3 = gensim.models.ldamodel.LdaModel(corpus, num_topics=100, id2word = dictionary, passes=1)
# toc = timeit.default_timer()
# print(str(toc - tic) + ' seconds elapsed')
# # current: with old dictionary/corpus without excluding nltk stopwords

124.6579989890015 seconds elapsed


In [83]:
# flda_10topic1pass = open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_10topic1pass','wb')
# pickle.dump(ldamodel,  flda_10topic1pass)
# # current: with new dictionary/corpus excluding more stopwords

# flda_20topic1pass = open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_20topic1pass','wb')
# pickle.dump(ldamodel2, flda_20topic1pass)
# flda_100topic1pass = open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_100topic1pass','wb')
# pickle.dump(ldamodel3, flda_100topic1pass)
# # current: with old dictionary/corpus without excluding nltk stopwords

In [109]:
print(ldamodel.print_topics(num_topics=3, num_words=3))

[(7, '0.006*"like" + 0.006*"peopl" + 0.006*"time"'), (3, '0.006*"one" + 0.006*"like" + 0.006*"work"'), (0, '0.007*"design" + 0.006*"use" + 0.006*"one"')]


In [85]:
# # Run LDA
# # choose 100 topics, 20 passes

# tic = timeit.default_timer()
# ldamodel4 = gensim.models.ldamodel.LdaModel(corpus, num_topics=100, id2word = dictionary, passes=20)
# toc = timeit.default_timer()
# print(str(toc - tic) + ' seconds elapsed')
# # current: with old dictionary/corpus without excluding nltk stopwords

1422.3470460919998 seconds elapsed


In [86]:
# ldamodel4 is the 100-topic, 20-pass model from the previous cell (now commented out)
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_100topic20pass','wb') as flda_100topic20pass:
    pickle.dump(ldamodel4, flda_100topic20pass)


In [90]:
# Compare topic outputs of LDA models 1-4
print(ldamodel.print_topics( num_topics=10, num_words=10))
# print(ldamodel2.print_topics(num_topics=10, num_words=10))
# print(ldamodel3.print_topics(num_topics=10, num_words=10))
# print(ldamodel4.print_topics(num_topics=10, num_words=10))


[(0, '0.009*"design" + 0.008*"it’" + 0.007*"use" + 0.007*"can" + 0.006*"like" + 0.006*"get" + 0.006*"will" + 0.005*"make" + 0.005*"work" + 0.005*"time"'), (1, '0.006*"one" + 0.006*"work" + 0.006*"can" + 0.005*"will" + 0.005*"peopl" + 0.004*"it’" + 0.004*"make" + 0.004*"use" + 0.004*"just" + 0.004*"de"'), (2, '0.016*"que" + 0.014*"de" + 0.011*"fuck" + 0.011*"o" + 0.011*"e" + 0.006*"não" + 0.006*"é" + 0.005*"um" + 0.004*"para" + 0.004*"can"'), (3, '0.007*"product" + 0.007*"peopl" + 0.007*"get" + 0.006*"design" + 0.006*"will" + 0.005*"time" + 0.005*"one" + 0.005*"just" + 0.005*"can" + 0.005*"like"'), (4, '0.007*"peopl" + 0.006*"one" + 0.006*"can" + 0.005*"will" + 0.005*"make" + 0.005*"work" + 0.005*"time" + 0.004*"new" + 0.004*"thing" + 0.004*"like"'), (5, '0.008*"will" + 0.007*"thing" + 0.007*"one" + 0.007*"can" + 0.006*"it’" + 0.006*"peopl" + 0.005*"get" + 0.005*"time" + 0.005*"don’t" + 0.005*"make"'), (6, '0.009*"make" + 0.009*"work" + 0.008*"can" + 0.007*"one" + 0.007*"time" + 0.007*"

### Combine nltk and stop_words lists of stopwords -- moved to top



In [100]:
# from nltk.corpus import stopwords
# stopw_en = stopwords.words('english')
# print(stopw_en)
# print(stop_en)
# print(len(stopw_en))
# print(len(stop_en))
# all_stopw = set(stopw_en) | set(stop_en)
# print(len(all_stopw))



In [110]:
# Run LDA
# choose 10 topics, 20 passes after removing more stopwords

tic = timeit.default_timer()
ldamodel5 = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word = dictionary, passes=20)
toc = timeit.default_timer()
print(str(toc - tic) + ' seconds elapsed')


765.3832683630026 seconds elapsed


In [114]:
print(ldamodel.print_topics( num_topics=10, num_words=5))

[(0, '0.007*"design" + 0.006*"use" + 0.006*"one" + 0.006*"like" + 0.005*"app"'), (1, '0.006*"it’" + 0.006*"work" + 0.006*"peopl" + 0.006*"time" + 0.005*"make"'), (2, '0.008*"like" + 0.007*"time" + 0.007*"it’" + 0.006*"use" + 0.006*"get"'), (3, '0.006*"one" + 0.006*"like" + 0.006*"work" + 0.005*"use" + 0.005*"make"'), (4, '0.010*"design" + 0.007*"like" + 0.005*"it’" + 0.005*"time" + 0.005*"peopl"'), (5, '0.009*"peopl" + 0.008*"time" + 0.007*"it’" + 0.007*"like" + 0.006*"get"'), (6, '0.007*"peopl" + 0.005*"like" + 0.005*"it’" + 0.005*"one" + 0.005*"make"'), (7, '0.006*"like" + 0.006*"peopl" + 0.006*"time" + 0.006*"one" + 0.005*"make"'), (8, '0.005*"one" + 0.004*"it’" + 0.004*"time" + 0.004*"que" + 0.003*"peopl"'), (9, '0.007*"one" + 0.007*"want" + 0.006*"work" + 0.006*"get" + 0.006*"make"')]


### Generate a "common word" list to ignore in LDA

Some words appear in most of the topics above; these should be treated as stopwords and ignored. To do this, create a list of all words that appear in at least 60% of the articles, and exclude them from the LDA input.
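The document-frequency filter described above can be sketched compactly with `collections.Counter`. This is an illustrative alternative rather than the notebook's own implementation: the helper name `split_by_document_frequency` and the toy documents are assumptions, and the 0.6 threshold mirrors the cutoff used in the cell that follows.

```python
from collections import Counter

def split_by_document_frequency(docs, threshold=0.6):
    """Split the vocabulary by the fraction of documents each token appears in.

    Returns (common, rest): tokens present in at least `threshold` of the
    documents, and all remaining tokens. Both lists are sorted.
    """
    doc_freq = Counter()
    for doc in docs:
        # set() counts each token once per document, however often it repeats
        doc_freq.update(set(doc))
    n_docs = len(docs)
    common = sorted(t for t, c in doc_freq.items() if c / n_docs >= threshold)
    rest = sorted(t for t, c in doc_freq.items() if c / n_docs < threshold)
    return common, rest

# Toy corpus of already-tokenized documents, for illustration only
toy_docs = [
    ["design", "like", "time", "like"],
    ["food", "like", "time"],
    ["music", "like"],
    ["code", "time", "design"],
    ["trump", "like"],
]
common, rest = split_by_document_frequency(toy_docs)
print(common)
print(rest)
```

Counting each token once per document makes a single pass over the corpus, instead of scanning every article for every unique word. gensim's `Dictionary.filter_extremes` (with its `no_above` argument) offers a comparable built-in filter, though its other defaults (`no_below`, `keep_n`) also prune the vocabulary and would need to be set deliberately.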


In [151]:
# flatten the list of token lists into one long list of tokens
all_texts_flattened = [item for sublist in all_texts_processed for item in sublist]
print(len(all_texts_flattened))

flatten_uniq = set(all_texts_flattened)
print(len(flatten_uniq))

print(len(all_texts_processed))

# one set per article, so each membership test is a fast set lookup
text_sets = [set(text) for text in all_texts_processed]
n_texts = len(all_texts_processed)

commonwords = []
wordlist = []
tic = timeit.default_timer()
for word in flatten_uniq:
    # number of articles that contain this word at least once
    n = sum(1 for text_set in text_sets if word in text_set)
    frac = n / n_texts
    if frac >= 0.6:
        commonwords.append(word)
    else:
        wordlist.append(word)
#     print(word)
#     print(frac)
#     print(n)

toc = timeit.default_timer()
print(str(toc - tic) + ' seconds elapsed')




2527135
96378
3201


In [227]:
print(len(wordlist))
print(len(commonwords))
print(commonwords)
commonwords_2 = [i.strip('”“’‘') for i in commonwords]
print(commonwords_2)

96342
36
['like', 'tri', 'want', 'need', 'new', 'start', 'say', 'first', 'use', 'time', 'everi', 'see', 'one', 'come', 'look', 'day', 'get', 'it’', 'someth', 'much', 'good', 'also', 'don’t', 'go', 'think', 'even', 'work', 'make', 'take', 'know', 'way', 'right', 'mani', 'year', 'peopl', 'thing']
['like', 'tri', 'want', 'need', 'new', 'start', 'say', 'first', 'use', 'time', 'everi', 'see', 'one', 'come', 'look', 'day', 'get', 'it', 'someth', 'much', 'good', 'also', 'don’t', 'go', 'think', 'even', 'work', 'make', 'take', 'know', 'way', 'right', 'mani', 'year', 'peopl', 'thing']


In [164]:
# all_stopw2 = set(all_stopw_stem) | set(commonwords)
# print(len(all_stopw2))

# all_stopw_stem = [p_stemmer.stem(i) for i in all_stopw]
# print(all_stopw)
# print(all_stopw_stem)

# all_stopw2 = set(all_stopw_stem) | set(commonwords)
# print(len(all_stopw2))


235
{'your', 'didn', "i'd", "she'd", "it's", 'from', "wasn't", 'an', 'as', 'being', 'has', "they're", 'will', "where's", 'of', 'through', 'having', 'further', 'up', 'them', 'they', 'there', 'their', 'no', 'll', 'y', 'she', 'into', "i'm", "hadn't", "he'll", 'you', 'his', 'those', 'with', 't', 'me', "who's", 'should', 'when', 'shouldn', 'he', 'over', "there's", 'so', 'do', 'once', 'before', "weren't", 'what', 'ours', "here's", 'too', 'very', "we'll", "shan't", 's', "don't", 'why', 'yourselves', 'by', 'own', "what's", 'would', 'this', 'been', 'each', 'than', "when's", 'who', "didn't", 'the', 'just', 'mustn', 'my', 'hers', "haven't", 'are', 'himself', 'does', 'doing', 'after', 'weren', "we're", "mustn't", 'we', "he'd", 'if', "wouldn't", 'isn', "aren't", 'aren', 'for', 'between', 'on', "they'll", 'theirs', 'ourselves', "why's", 'have', "she's", "won't", 'themselves', "i've", 'here', 'any', 'such', 'haven', 'only', 'him', 'd', 've', 'its', "you'll", 'couldn', "let's", 'off', 'had', 'is', 'un

In [245]:
# use context managers so each file is flushed and closed before it is read back
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/wordlist','wb') as fwordlist:
    pickle.dump(wordlist, fwordlist)
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/commonwords','wb') as fcommonwords:
    pickle.dump(commonwords, fcommonwords)
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/commonwords2','wb') as fcommonwords2:
    pickle.dump(commonwords_2, fcommonwords2)

# reload the saved word list as a check
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/wordlist','rb') as fwordlist:
    tmp = pickle.load(fwordlist)
print(commonwords_2)

['like', 'tri', 'want', 'need', 'new', 'start', 'say', 'first', 'use', 'time', 'everi', 'see', 'one', 'come', 'look', 'day', 'get', 'it', 'someth', 'much', 'good', 'also', 'don’t', 'go', 'think', 'even', 'work', 'make', 'take', 'know', 'way', 'right', 'mani', 'year', 'peopl', 'thing']


In [160]:
# print(tmp)

In [230]:
# REDO text processing for LDA with common words removed

all_texts_processed_new = []

# build the punctuation-stripping table once, outside the loop
translator = str.maketrans('', '', string.punctuation)

tic = timeit.default_timer()

for n, text in enumerate(set_tr['text']):
    # combine sentences
    txt = ' '.join(text)
    txt2 = re.sub('\u2014', '', txt)  # remove em dashes
    txt3 = re.sub(r'\d+', '', txt2)  # remove digits
    txt4 = txt3.translate(translator)  # remove ASCII punctuation
    # split text into lowercase words
    tokens = word_tokenizer.tokenize(txt4.lower())
    # strip curly single and double quotes from ends of words
    tokens_strip = [i.strip('”“’‘') for i in tokens]
    # keep only english words
    tokens_en = [i for i in tokens_strip if i in en_words]
    # remove nltk/stop_word stop words
    nostop_tokens = [i for i in tokens_en if i not in all_stopw]
    # strip curly quotes again
    nostop_strip = [i.strip('”“’‘') for i in nostop_tokens]
    # stem words
    stemmed = [p_stemmer.stem(i) for i in nostop_strip]
    # strip curly quotes again
    stemmed_strip = [i.strip('”“’‘') for i in stemmed]
    # stem a second time; note that this shortens some stems further
    # (e.g. 'excus' becomes 'excu' in the output below)
    stemmed2 = [p_stemmer.stem(i) for i in stemmed_strip]
    # strip curly quotes again
    stemmed2_strip = [i.strip('”“’‘') for i in stemmed2]
    # remove common words post-stemming
    stemmed_nocommon = [i for i in stemmed2_strip if i not in commonwords_2]
    # append to processed texts
    all_texts_processed_new.append(stemmed_nocommon)
    if n == 0:
#         print(txt)
#         print(tokens)
#         print(nostop_tokens)
        print(stemmed_nocommon)
#     if n == 5:
#         break

toc = timeit.default_timer()
print(str(toc - tic) + ' seconds elapsed')



# #### TRY only using nouns? -- nah, since some of main words in e.g. topic 0 are 'design', 'use'

['give', 'success', 'wast', 'five', 'five', 'spend', 'excu', 'rest', 'laze', 'five', 'minut', 'wast', 'daili', 'per', 'x', 'sadli', 'guess', 'wast', 'far', 'told', 'th', 'grade', 'teacher', 'read', 'break', 'break', 'minut', 'two', 'lot', 'read', 'done', 'finish', 'earli', 'spare', 'moment', 'pick', 'book', 'read', 'spend', 'period', 'five', 'minut', 'factor', 'achiev', 'littl', 'bit', 'justifi', 'wast', 'dollar', 'recent', 'line', 'point', 'item', 'thought', 'interest', 'honestli', 'rememb', 'stuck', 'said', 'dollar', 'lot', 'money', 'short', 'money', 'actual', 'famili', 'trip', 'world', 'whole', 'understand', 'valu', 'dollar', 'appreci', 'valu', 'thoughtlessli', 'spend', 'dollar', 'may', 'seem', 'big', 'deal', 'actual', 'frivol', 'spend', 'long', 'enough', 'million', 'lack', 'care', 'true', 'art', 'valu', 'addit', 'percent', 'rich', 'percent', 'hourli', 'respon', 'minut', 'dollar', 'consequ', 'great', 'major', 'extrem', 'frugal', 'least', 'highli', 'mind', 'money', 'believ', 'success

### Save all_texts_processed_new


In [231]:
# flda_processedtexts_new = open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_processedtexts_new','wb')
# pickle.dump(all_texts_processed_new, flda_processedtexts_new)
# # above: without filtering for english words

with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_processedtexts_new2','wb') as flda_processedtexts_new2:
    pickle.dump(all_texts_processed_new, flda_processedtexts_new2)
# above: with filtering for english words



### Make document-term matrix


In [232]:
# dictionary_new = corpora.Dictionary(all_texts_processed_new)
# # Convert to bag-of-words
# corpus_new = [dictionary_new.doc2bow(text) for text in all_texts_processed_new]
# print(corpus_new[0])
# # above: without filtering for english words


# Make document-term matrix
dictionary_new2 = corpora.Dictionary(all_texts_processed_new)
# Convert to bag-of-words
corpus_new2 = [dictionary_new2.doc2bow(text) for text in all_texts_processed_new]
print(corpus_new2[0])
# above: with filtering for english words


[(0, 13), (1, 19), (2, 5), (3, 9), (4, 11), (5, 1), (6, 1), (7, 1), (8, 4), (9, 3), (10, 2), (11, 3), (12, 4), (13, 1), (14, 2), (15, 4), (16, 2), (17, 1), (18, 1), (19, 7), (20, 3), (21, 5), (22, 4), (23, 4), (24, 1), (25, 2), (26, 1), (27, 7), (28, 1), (29, 7), (30, 1), (31, 1), (32, 3), (33, 4), (34, 1), (35, 2), (36, 6), (37, 3), (38, 2), (39, 6), (40, 1), (41, 7), (42, 1), (43, 1), (44, 2), (45, 2), (46, 12), (47, 26), (48, 2), (49, 11), (50, 6), (51, 1), (52, 12), (53, 12), (54, 1), (55, 13), (56, 3), (57, 2), (58, 3), (59, 1), (60, 7), (61, 2), (62, 1), (63, 5), (64, 10), (65, 9), (66, 1), (67, 4), (68, 5), (69, 2), (70, 1), (71, 2), (72, 3), (73, 1), (74, 5), (75, 3), (76, 3), (77, 1), (78, 6), (79, 1), (80, 1), (81, 2), (82, 4), (83, 20), (84, 13), (85, 6), (86, 1), (87, 1), (88, 3), (89, 6), (90, 9), (91, 4), (92, 2), (93, 2), (94, 5), (95, 3), (96, 2), (97, 2), (98, 1), (99, 3), (100, 4), (101, 6), (102, 1), (103, 6), (104, 7), (105, 10), (106, 1), (107, 6), (108, 3), (109, 

### Save new dictionary and corpus


In [233]:
# flda_dictionary_new = open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_dictionary_new','wb')
# flda_corpus_new = open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_corpus_new','wb')
# pickle.dump(dictionary_new, flda_dictionary_new)
# pickle.dump(corpus_new, flda_corpus_new)
# # above: without filtering for english words

with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_dictionary_new2','wb') as flda_dictionary_new2:
    pickle.dump(dictionary_new2, flda_dictionary_new2)
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/lda_corpus_new2','wb') as flda_corpus_new2:
    pickle.dump(corpus_new2, flda_corpus_new2)
# above: with filtering for english words


### Run LDA


In [234]:
# # choose 10 topics, 20 passes
# tic = timeit.default_timer()
# ldamodel_new = gensim.models.ldamodel.LdaModel(corpus_new, num_topics=10, id2word = dictionary_new, passes=20)
# toc = timeit.default_timer()
# print(str(toc - tic) + ' seconds elapsed')
# # current: with new dictionary/corpus excluding more stopwords and common words
# # # above: without filtering for english words


# choose 10 topics, 20 passes
tic = timeit.default_timer()
ldamodel_new = gensim.models.ldamodel.LdaModel(corpus_new2, num_topics=10, id2word = dictionary_new2, passes=20)
toc = timeit.default_timer()
print(str(toc - tic) + ' seconds elapsed')
# current: with new dictionary/corpus excluding more stopwords and common words
# above: with filtering for english words


535.3515410920081 seconds elapsed


In [235]:
# Save LDA model
# # flda_10topic20pass_new = open('/Users/clarencecheng/Dropbox/~Insight/skimr/lda_10topic20pass_new','wb')
# # pickle.dump(ldamodel_new, flda_10topic20pass_new)

# flda_10topic20pass_new2 = open('/Users/clarencecheng/Dropbox/~Insight/skimr/lda_10topic20pass_new2','wb')
# pickle.dump(ldamodel_new, flda_10topic20pass_new2)
# # # above: without filtering for english words


with open('/Users/clarencecheng/Dropbox/~Insight/skimr/lda_10topic20pass_new2b','wb') as flda_10topic20pass_new2b:
    pickle.dump(ldamodel_new, flda_10topic20pass_new2b)
# above: with filtering for english words


### Inspect topic output of new LDA model


In [240]:
print(ldamodel_new.print_topics( num_topics=10, num_words=5))


[(0, '0.011*"compani" + 0.010*"product" + 0.008*"busi" + 0.007*"build" + 0.007*"team"'), (1, '0.030*"design" + 0.010*"user" + 0.007*"code" + 0.006*"web" + 0.006*"color"'), (2, '0.013*"write" + 0.013*"read" + 0.008*"learn" + 0.008*"love" + 0.007*"life"'), (3, '0.034*"via" + 0.028*"music" + 0.025*"game" + 0.020*"univ" + 0.019*"data"'), (4, '0.007*"us" + 0.006*"learn" + 0.006*"world" + 0.005*"system" + 0.005*"human"'), (5, '0.153*"de" + 0.103*"e" + 0.044*"um" + 0.040*"da" + 0.039*"para"'), (6, '0.055*"white" + 0.033*"black" + 0.016*"photo" + 0.016*"hou" + 0.016*"presid"'), (7, '0.016*"life" + 0.008*"success" + 0.007*"feel" + 0.007*"becom" + 0.006*"learn"'), (8, '0.009*"food" + 0.007*"home" + 0.006*"eat" + 0.006*"hou" + 0.006*"live"'), (9, '0.007*"trump" + 0.007*"us" + 0.007*"said" + 0.005*"men" + 0.005*"never"')]


### Define a function to convert the sparse LDA topic output to a dense numeric vector

In [238]:

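def lda_to_vec(lda_input, num_topics=10):
    """Convert gensim's sparse [(topic_id, probability), ...] output to a dense list.

    Topics missing from lda_input are left at 0.
    """
    vec = [0] * num_topics
    for col, val in lda_input:
        vec[col] = val
    return vec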

### Calculate document vectors


In [239]:


all_lda_vecs = []

n = 0
for i in corpus_new2:
    doc_lda = ldamodel_new[i]
    vec_lda = lda_to_vec(doc_lda)
    all_lda_vecs.append(vec_lda)
    n += 1
    if n <= 20:
#         print(doc_lda)
        print(vec_lda)



[0.10107509328751951, 0, 0.10852728610649451, 0, 0.075517989451679202, 0, 0, 0.67870291423836493, 0, 0.022212525354884893]
[0.12651750305493689, 0, 0.078822498619816675, 0, 0, 0, 0, 0, 0, 0.78631764775959023]
[0.25624254286537357, 0.055555672560972158, 0.52666362787494203, 0, 0, 0, 0, 0.080845266191335349, 0.030650578584385711, 0.047868018732012435]
[0, 0, 0.089334073178282222, 0, 0.28513255654787562, 0, 0.12729748109316336, 0.013654861923539681, 0.13482279146305834, 0.34849217621184131]
[0, 0, 0.07048149573898782, 0, 0.049042266612691489, 0, 0, 0.74698620673638572, 0, 0.12742346224893911]
[0.12483905231383763, 0, 0.67938636218160664, 0.049743677044155082, 0, 0, 0, 0.090519253639670699, 0.053174661362144673, 0]
[0.43435859745966204, 0, 0, 0, 0.5633153219581547, 0, 0, 0, 0, 0]
[0.14330170401379469, 0.011304899531451831, 0, 0.50968739188797629, 0, 0, 0, 0.14212458616559379, 0, 0.18878576465933591]
[0.2582394907418743, 0.67150372166000294, 0, 0, 0, 0, 0.015085966464631617, 0.0448635137379

In [241]:
print(len(all_lda_vecs))
print(sum(all_lda_vecs[1000]))

3201
0.993042167848
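The sum printed above falls slightly short of 1. A likely explanation is that gensim leaves out topics whose probability is below the model's `minimum_probability` setting (0.01 by default) when a document is passed through the model, so the dense vector built from the sparse output loses that small amount of mass. The sketch below illustrates the effect and an optional renormalization; the helper names `to_dense` and `renormalize` and the sparse input are hypothetical, chosen only for illustration.

```python
def to_dense(sparse_topics, num_topics):
    """Expand [(topic_id, probability), ...] into a dense list of length num_topics."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        dense[topic_id] = prob
    return dense

def renormalize(vec):
    """Rescale a non-negative vector so that it sums to 1 (all-zero input is returned as-is)."""
    total = sum(vec)
    if total == 0:
        return list(vec)
    return [x / total for x in vec]

# Hypothetical sparse output for a 10-topic model; low-probability topics are absent
sparse = [(0, 0.126), (2, 0.079), (9, 0.786)]

dense = to_dense(sparse, num_topics=10)
normalized = renormalize(dense)
print(sum(dense))
print(sum(normalized))
```

Renormalizing is optional for the planned cosine similarity feature, since cosine similarity is unaffected by rescaling a vector. It matters only if the vectors are later treated as proper probability distributions.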


In [243]:
# Save all_lda_vecs
with open('/Users/clarencecheng/Dropbox/~Insight/skimr/datasets/all_lda_vecs','wb') as fall_lda_vecs:
    pickle.dump(all_lda_vecs, fall_lda_vecs)
