## LDA, derive document-term matrix

TODO: Use bigrams and tf-idf. Remove investigators' names and stopwords in the preprocessor; otherwise bigrams such as 'the protein' will persist.
Set min_df to a reasonably low value, maybe 0.05 (i.e. 5% of documents), to reduce the number of terms.
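
To make the `min_df` note concrete, here is a minimal pure-Python sketch of a document-frequency filter. It follows the convention scikit-learn documents for `min_df` (an integer is an absolute document count, a float is a proportion of documents), but it is not scikit-learn's implementation; the helper names and the toy corpus are made up for illustration.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def filter_by_min_df(docs, min_df):
    """Keep terms whose document frequency reaches min_df.

    int   -> absolute number of documents
    float -> proportion of the corpus (0.05 means 5% of documents)
    """
    df = document_frequency(docs)
    threshold = min_df if isinstance(min_df, int) else min_df * len(docs)
    return sorted(term for term, n in df.items() if n >= threshold)


# Toy corpus (hypothetical, already tokenized).
docs = [
    ["protein", "fold", "model"],
    ["protein", "sensor"],
    ["sensor", "model", "protein"],
    ["laser", "sensor"],
]

kept_abs = filter_by_min_df(docs, 3)    # terms in at least 3 documents
kept_rel = filter_by_min_df(docs, 0.5)  # terms in at least 50% of documents
print(kept_abs, kept_rel)
```

On a large corpus the two settings behave very differently: `min_df = 3` keeps almost everything that is not a typo, while `min_df = 0.05` scales with the number of documents.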

In [None]:
# custom vectorizer is needed to lemmatize (see utilsvectorizer module)
vecto = uv.CustomTfidfVectorizer(
                            stop_words = 'english',
                            min_df = 3,
                            ngram_range=(1, 2),
                            lowercase = True,
                            strip_accents = 'unicode',
                            token_pattern = r'(?u)\b[a-zA-Z][a-zA-Z]+\b',
                            binary = True,
                            max_features = 2000
                            )

In [None]:
# produce document-term matrix
train_doc_term_matrix = vecto.fit_transform(train_docs.Abstract)
train_doc_term_matrix.shape, type(train_doc_term_matrix)

In [None]:
# apply vectorization on test set
test_doc_term_matrix = vecto.transform(test_docs.Abstract)
test_doc_term_matrix.shape, type(test_doc_term_matrix)

### Most Frequent Terms

In [None]:
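# recover words (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on recent versions)
lemmatized_words = vecto.get_feature_names()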

In [None]:
# sum doc_term_matrix along rows for each column (i.e. word)
word_freq = train_doc_term_matrix.sum(axis=0)

In [None]:
# convert word_freq from matrix to ndarray and flatten it (ravel avoids a copy where possible, flatten always copies)
# then create a pandas Series for display
s_word_freq = pd.Series( np.asarray(word_freq).ravel(), index = lemmatized_words )

In [None]:
# display top words
s_word_freq.sort_values(ascending=False).head(10)

In [None]:
# display least frequent words
s_word_freq.sort_values(ascending=False).tail(10)

### Run Latent Dirichlet Allocation (sklearn)

In [None]:
# number of distinct agencies (missing values are not counted)
num_agency = df.Agency.nunique()

In [None]:
# run LDA with one component per agency (7 in the original run) until perplexity reaches a plateau, for at most max_iter iterations
lda_all = LatentDirichletAllocation(n_components = num_agency,
                                    max_iter = 500,
                                    learning_method = 'batch',
                                    evaluate_every = 10,
                                    random_state = 7,
                                    verbose = 1,
                                    n_jobs = -1)

In [None]:
lda_all.fit(train_doc_term_matrix)

In [None]:
# pickle LDA model
joblib.dump(lda_all, os.path.join(os.pardir,'models', 'lda_tfidf_full.pkl'))

## LDA (gensim), derive document-term matrix

In [None]:
# keep only National Science Foundation awards from 2010 to 2013 (bounds inclusive)
crit = (df.Agency == 'National Science Foundation') \
        & (df['Awards Year'].between(2010, 2013))
df_analysis = df.loc[crit,['Abstract','Agency', 'title']]
print(df_analysis.shape)
df_analysis.head()

In [None]:
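# concatenate abstract and title, separated by a space so the last word of the abstract does not fuse with the first word of the title
df_analysis['text'] = df_analysis.Abstract + ' ' + df_analysis.title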

In [None]:
df_analysis.head()

#### Building a custom `tokenizer` for Lemmatization with `spacy`

In [None]:
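# the 'en' shortcut works on spaCy 2.x; spaCy 3 requires the full package name, e.g. 'en_core_web_sm'
nlp = spacy.load('en')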

In [None]:
def tokenizer(doc):
    # lemmatize every token, dropping punctuation and whitespace tokens
    return [w.lemma_ for w in nlp(doc) if not (w.is_punct or w.is_space)]
# return [w.lemma_ for w in nlp(doc) if not (w.is_punct or w.is_space) and w.pos_ in ['NOUN', 'ADJ', 'ADV']]

In [None]:
# # nlp(doc)
# ex = nlp(df_analysis.text.iloc[0])

In [None]:
# ex.similarity(nlp(df_analysis.text.iloc[10]))

In [20]:
df_analysis.text.shape

(9198,)

In [21]:
vecto = TfidfVectorizer(
                        min_df = 0.01,
                        max_df = 0.8,
                        ngram_range=(1, 2),
                        stop_words = 'english',
                        tokenizer = tokenizer,
                        lowercase = True,
                        strip_accents = 'unicode',
#                         token_pattern = r'(?u)\b[a-zA-Z][a-zA-Z]+\b',
                        binary = False,
                        )

In [22]:
dtm_train = vecto.fit_transform(df_analysis.text)
# produce document-term matrix
# dtm_train = vecto.fit_transform(X_train)
# dtm_test = vecto.transform(X_test)

In [23]:
dtm_train.shape, type(dtm_train)

((9198, 1738), scipy.sparse.csr.csr_matrix)

In [None]:
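# look up the idf weight of a term; vocabulary_.get returns None when the term is not in the vocabulary
ind = vecto.vocabulary_.get('technology')
vecto.idf_[ind] if ind is not None else 'Nope'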

In [None]:
train_corpus = Sparse2Corpus(dtm_train, documents_columns=False)
# test_corpus = Sparse2Corpus(dtm_test, documents_columns=False)

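# map column index -> term (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on recent versions)
id2word = pd.Series(vecto.get_feature_names()).to_dict()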

In [None]:
# len(id2word), id2word

In [None]:
# 20x more documents than topics
num_topics = 3

In [None]:
lda_gensim = LdaModel(corpus=train_corpus,
                          num_topics = num_topics,
                          id2word=id2word)

In [None]:
topics = lda_gensim.print_topics()

In [None]:
topics

In [None]:
# coherence: list of tuples, one element per topic, sorted by decreasing coherence
#            each element is itself a tuple, 1st: list of (probability, term) pairs; 2nd: overall topic coherence
coherence = lda_gensim.top_topics(corpus=train_corpus, coherence='u_mass')
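
As a rough illustration of what a `u_mass` score measures, the sketch below computes the usual UMass co-occurrence coherence for a list of top terms: the sum, over ordered term pairs, of log((D(w_m, w_l) + 1) / D(w_l)), where D counts documents containing the given terms. This is a hand-rolled sketch under that definition, not gensim's code; gensim may differ in smoothing and in how pair scores are aggregated. The function name, the `eps` parameter and the toy documents are assumptions made for the example.

```python
import math


def umass_coherence(top_terms, docs, eps=1.0):
    """UMass-style coherence of a topic's top terms (ordered by probability)."""
    doc_sets = [set(d) for d in docs]

    def doc_count(*terms):
        # number of documents containing all the given terms
        return sum(all(t in d for t in terms) for d in doc_sets)

    score = 0.0
    for m in range(1, len(top_terms)):
        for l in range(m):
            d_l = doc_count(top_terms[l])
            if d_l == 0:
                continue  # skip terms that never occur, to avoid dividing by zero
            d_ml = doc_count(top_terms[m], top_terms[l])
            score += math.log((d_ml + eps) / d_l)
    return score


# Toy tokenized documents (hypothetical).
docs = [
    ["protein", "cell"],
    ["protein", "cell", "laser"],
    ["laser", "optics"],
    ["optics", "laser"],
]

coherent = umass_coherence(["protein", "cell"], docs)   # terms that co-occur
mixed = umass_coherence(["protein", "optics"], docs)    # terms that never co-occur
print(coherent, mixed)
```

Terms that appear together in the same documents score higher, which is why the score is read as a proxy for how interpretable a topic is.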

In [None]:
topic_labels = ['Topic {}'.format(i) for i in range(1, num_topics+1)]

In [None]:
topic_coherence = []
topic_words = pd.DataFrame()

for t in range(len(coherence)):
    # placeholder label (top_topics sorts by coherence, so this is a rank, not the model's topic id)
    label = topic_labels[t]
    # second element is the overall topic coherence
    topic_coherence.append(coherence[t][1])
    
    # first element is a list of (probability, term) pairs
    df_cohe = pd.DataFrame(coherence[t][0], columns=[(label, 'prob'), (label, 'term')])
    df_cohe[(label, 'prob')] = df_cohe[(label, 'prob')].apply(lambda x: '{:.2%}'.format(x))
    
    topic_words = pd.concat([topic_words, df_cohe], axis=1)
                      
topic_words.columns = pd.MultiIndex.from_tuples(topic_words.columns)
# pd.set_option('expand_frame_repr', False)
print(topic_words.head())

# plot overall topic coherence
pd.Series(topic_coherence, index=topic_labels).plot.bar();

In [None]:
# create visualization
prepare_gensim(lda_gensim, train_corpus, id2word)

## Gensim summarization
TextRank works as follows:

1. Pre-process the text: remove stop words and stem the remaining words.
2. Create a graph where vertices (nodes) are sentences.
3. Connect every sentence to every other sentence by an edge. The weight of the edge is how similar the two sentences are*.
4. Run the PageRank algorithm on the graph.
5. Pick the vertices (i.e. the sentences) with the highest PageRank scores.

*Gensim’s TextRank uses the [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25) function to measure how similar two sentences are.
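
The steps above can be sketched in a few lines of pure Python. This is a simplified illustration, not gensim's implementation: it skips stemming, uses a tiny hand-written stop-word list, and swaps BM25 for Jaccard word overlap as the similarity function. All names (`tokenize`, `jaccard`, `textrank`, `summarize_sketch`) and the example sentences are made up for this sketch.

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in", "for", "on"}


def tokenize(sentence):
    """Step 1: lowercase, keep alphabetic words, drop stop words (no stemming)."""
    return {w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS}


def jaccard(a, b):
    """Stand-in similarity (gensim uses BM25): shared words / all words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Steps 2-4: weighted sentence graph, then PageRank by power iteration."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [[jaccard(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(weights[j][i] / out_sum[j] * scores[j]
                            for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize_sketch(sentences, top_n=1):
    """Step 5: keep the top_n highest-scoring sentences, in original order."""
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_n])]


sentences = [
    "The sensor measures protein binding in cells.",
    "Protein binding changes the sensor signal.",
    "A laser heats the sample.",
    "The sensor signal tracks protein levels in cells.",
]
scores = textrank(sentences)
print(summarize_sketch(sentences, top_n=3))
```

The sentence about the laser shares no words with the others, so it receives only the baseline score and is the first to be dropped from the summary.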

In [None]:
# gensim.summarization was removed in gensim 4.0; this import requires gensim < 4
from gensim.summarization import summarize, keywords

In [None]:
all_abstracts = df_analysis.text.str.cat(sep=' ')

In [None]:
len(all_abstracts)

In [None]:
print(summarize(all_abstracts, word_count=30))

In [None]:
print(keywords(all_abstracts, words = 3, lemmatize=True))

### Interactive Visualization

In [None]:
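# save LDA viz to standalone html
# note: vecto must be the vectorizer that produced train_doc_term_matrix (the name is reassigned in the gensim section)
pyLDAvis.save_html(
    prepare(lda_all, train_doc_term_matrix, vecto),
    os.path.join(os.pardir, 'Small_Business_Award.html'))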

In [None]:
# create visualization
prepare(lda_all, train_doc_term_matrix, vecto)

#### Lambda

- **$\lambda$ = 0**: how exclusive is a word to a topic - words are ranked purely on lift, P(word | topic) / P(word)
- **$\lambda$ = 1**: how probable is a word to appear in a topic - words are ranked purely on P(word | topic)

The ranking formula (relevance) is $\lambda \log P(\text{word} \vert \text{topic}) + (1 - \lambda) \log \text{lift}$, where $\text{lift} = P(\text{word} \vert \text{topic}) / P(\text{word})$

The user study reported with LDAvis suggests that $\lambda = 0.6$ is a good starting value for interpreting topics.
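
To see how $\lambda$ changes the ranking, here is a small sketch that scores two terms with the log form of the relevance measure, $\lambda \log P(\text{word} \vert \text{topic}) + (1 - \lambda) \log \big(P(\text{word} \vert \text{topic}) / P(\text{word})\big)$. The probabilities are invented for illustration and the function names are not part of pyLDAvis.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a word for a topic: blend of log-probability and log-lift."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Hypothetical probabilities for two terms within one topic.
terms = {
    "technology": {"p_wt": 0.05, "p_w": 0.04},   # frequent in every topic
    "photonic": {"p_wt": 0.01, "p_w": 0.001},    # rare overall, specific to this topic
}


def rank(lam):
    """Return the terms sorted by decreasing relevance for a given lambda."""
    return sorted(
        terms,
        key=lambda w: relevance(terms[w]["p_wt"], terms[w]["p_w"], lam),
        reverse=True,
    )


for lam in (0.0, 0.6, 1.0):
    print(lam, rank(lam))
```

With $\lambda = 1$ the generic but frequent term comes first; with $\lambda = 0$ the topic-specific term does. Intermediate values trade one off against the other.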