## LDA 
I am using the LDA package and following their main example [here](https://pypi.python.org/pypi/lda) closely.
- The number of topics is set at 20.
- with the 0.05 cutoff for vocab (which we are using elsewhere, and 20 topics), the algorithm gives results in the order of a minute (didn't time it.) 

In [38]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import *
import lda #use pip install lda 
from collections import Counter

In [39]:
data = pd.read_pickle("processed_10k_articles.pkl")
titles = [word for word in data.title]

In [70]:
#from Tristan's code:
#putting the code first 
#first generate the bag of words.  This has no TF-IDF weighting yet.
#Only include words that occur in at least 5% of documents.
vectorizer = CountVectorizer(analyzer = "word",min_df=0.05) #0.05
clean_text = [' '.join( (txt.split())[0: min(500, len(txt.split()))])  for txt in data['process'] ]  #data["process"]
unweighted_words = vectorizer.fit_transform(clean_text)
terms_matrix = unweighted_words.toarray()
vocabulary  = vectorizer.vocabulary_ # the words selected 
vocab = [w for w in vocabulary]

In [71]:
model = lda.LDA(n_topics=5, n_iter=1500, random_state=1)
model.fit(terms_matrix)  # model.fit_transform(X) is also available

<lda.lda.LDA at 0x1136fa4e0>

In [72]:
topic_word = model.topic_word_ 
topic_word.shape # all elements are >= 0

(5, 1177)

In [73]:
doc_topic = model.doc_topic_
doc_topic.shape # all elements are >= 0

(10000, 5)

In [74]:
# looking at the topics that are produced
n_top_words = 5
for i, topic_dist in enumerate(topic_word):
    tmp_topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: top words: {} \n (article with highest weight: {})'.format(\
    i, ' '.join(tmp_topic_words), titles[doc_topic[:,i].argmax()]  ))

Topic 0: top words: territori known begin way collect 
 (article with highest weight: List of Nobel Prize winners by country)
Topic 1: top words: reduc lord pass latin britain 
 (article with highest weight: List of rivers of Portugal)
Topic 2: top words: join hill latin boston insid 
 (article with highest weight: Hypercholesterolemia)
Topic 3: top words: program german troop ii score 
 (article with highest weight: List of awards and nominations received by Bryan Adams)
Topic 4: top words: mexico cut code name european 
 (article with highest weight: Ryan Miller)


In [76]:
# looking at articles
for i in range(10):
#     print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))
     print("{} (top topic: {})".format(titles[i], doc_topic[i].argsort()[-2:]))

Art (top topic: [3 2])
Abbreviation (top topic: [1 2])
Astronomy (top topic: [0 2])
Browser (top topic: [4 2])
Bubonic plague (top topic: [0 2])
Cooking (top topic: [4 2])
Calculus (top topic: [1 2])
Coin (top topic: [4 2])
Earth science (top topic: [1 2])
Everything2 (top topic: [3 2])


In [77]:
# top 40 most informative words in the vocabulary
voc_var = [topic_word[:,i].var() for i in range(topic_word[:,:].shape[1]) ]
tophowmany = 40
print('top %i most informative words:'%tophowmany)
top_informative_words = np.asarray(voc_var).argsort()[::-1][:tophowmany]
for i in top_informative_words:
    print("%10s, highest-weight topic:%3d, lowest-weight topic:%d"%(vocab[i], topic_word[:,i].argmax(),  topic_word[:,i].argmin() ))

top 40 most informative words:
    mexico, highest-weight topic:  4, lowest-weight topic:2
      code, highest-weight topic:  4, lowest-weight topic:2
       cut, highest-weight topic:  4, lowest-weight topic:2
      name, highest-weight topic:  4, lowest-weight topic:2
  european, highest-weight topic:  4, lowest-weight topic:2
     reduc, highest-weight topic:  1, lowest-weight topic:2
      join, highest-weight topic:  2, lowest-weight topic:4
  democrat, highest-weight topic:  4, lowest-weight topic:2
      lord, highest-weight topic:  1, lowest-weight topic:2
   program, highest-weight topic:  3, lowest-weight topic:0
    design, highest-weight topic:  4, lowest-weight topic:2
    friend, highest-weight topic:  4, lowest-weight topic:2
        ii, highest-weight topic:  4, lowest-weight topic:2
       hit, highest-weight topic:  4, lowest-weight topic:2
      valu, highest-weight topic:  4, lowest-weight topic:2
      pass, highest-weight topic:  4, lowest-weight topic:2
     chil

In [131]:
w = pd.read_html('https://en.wikipedia.org/wiki/' + 'Pierre-Simon_Laplace',flavor='bs4')

In [132]:
v = [word for word in w if w is not np.NaN]

In [134]:
for i in range(len(w)):
     print(w[i].shape)
     

(12, 2)
(10, 1)
(1, 2)
(1, 2)
(1, 2)
(2, 3)
(28, 31)
(19, 2)
(2, 1)
(1, 2)


In [130]:
w[6].iloc[0,0]

'v t e French Consulate (10 November 1799 – 18 May 1804)'

In [136]:
import wikipedia

In [164]:
print( wikipedia.summary("bilogy",sentences = 1000) )

Biology is a natural science concerned with the study of life and living organisms, including their structure, function, growth, evolution, distribution, identification and taxonomy. Modern biology is a vast and eclectic field, composed of many branches and subdisciplines. However, despite the broad scope of biology, there are certain general and unifying concepts within it that govern all study and research, consolidating it into single, coherent field. In general, biology recognizes the cell as the basic unit of life, genes as the basic unit of heredity, and evolution as the engine that propels the synthesis and creation of new species. It is also understood today that all the organisms survive by consuming and transforming energy and by regulating their internal environment to maintain a stable and vital condition known as homeostasis.
Sub-disciplines of biology are defined by the scale at which organisms are studied, the kinds of organisms studied, and the methods used to study the