## Topic Modeling with gensim and spacy

LDA stands for Latent Dirichlet Allocation, named after the Dirichlet distribution (https://en.wikipedia.org/wiki/Dirichlet_distribution).
In LDA modeling:

- each document is a mixture of topics
- each topic is a mixture of words or tokens.
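
The two bullet points above can be sketched as a tiny generative process. This is only an illustration, assuming a made-up five-word vocabulary, two topics, and uniform Dirichlet priors; none of these values come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and topic count (assumptions for illustration only)
vocab = ["wife", "daughters", "neighbourhood", "fortune", "visit"]
n_topics = 2

# Each topic is a mixture of words: one probability vector over the vocabulary per topic
topic_word = rng.dirichlet(alpha=np.ones(len(vocab)), size=n_topics)

# Each document is a mixture of topics: one probability vector over the topics
doc_topic = rng.dirichlet(alpha=np.ones(n_topics))

# Generate a 6-token document: draw a topic for each token, then a word from that topic
tokens = []
for _ in range(6):
    z = rng.choice(n_topics, p=doc_topic)
    w = rng.choice(len(vocab), p=topic_word[z])
    tokens.append(vocab[w])

print("topic mixture of the document:", doc_topic)
print("generated tokens:", tokens)
```

Fitting an LDA model runs this process in reverse: given only the tokens, it estimates plausible values for `topic_word` and `doc_topic`.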



## Using scikit-learn to convert a corpus to a matrix of token counts

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [3]:
with open('NLP_Data/pride_chp1.txt','r') as f1:
    pride_corpus = f1.readlines()

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(pride_corpus)
X_train_counts.shape

(120, 328)

In [5]:
print(X_train_counts[0:3])

  (0, 224)	1
  (0, 17)	1
  (0, 226)	1
  (2, 25)	1
  (2, 138)	1
  (2, 37)	1
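
The sparse output above lists `(row, column) count` triples, where the column is an index into the learned vocabulary. A minimal sketch of how to map those indices back to tokens, using a hypothetical three-line `toy_corpus` rather than the notebook's file:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for pride_corpus; the middle "line" is empty on purpose
toy_corpus = [
    "It is a truth universally acknowledged",
    "",
    "a truth so well fixed",
]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_corpus)
print(toy_counts.shape)  # (number of lines, vocabulary size)

# vocabulary_ maps token -> column index; invert it to read the sparse output
index_to_token = {idx: tok for tok, idx in toy_vect.vocabulary_.items()}

for row, col in zip(*toy_counts.nonzero()):
    print(row, index_to_token[col], toy_counts[row, col])
```

Two things are visible here. The default tokenizer lowercases text and drops single-character tokens such as "a", and a line with no tokens produces a row with no entries, which is presumably why row 1 is missing from the output above.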


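## TF and TF-IDF weighting

Transform token counts into term frequencies. `TfidfTransformer` can also apply inverse-document-frequency weighting, but with `use_idf=False` (as in the cell below) it only L2-normalizes each row of counts, so the result is TF rather than full TF-IDF.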

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(120, 328)

In [7]:
print(X_train_tf[0:35:7])  # every 7th row of the first 35; row numbers in the output are positions within the slice

  (0, 17)	0.5773502691896258
  (0, 224)	0.5773502691896258
  (0, 226)	0.5773502691896258
  (2, 54)	0.2773500981126146
  (2, 80)	0.2773500981126146
  (2, 108)	0.2773500981126146
  (2, 135)	0.2773500981126146
  (2, 202)	0.2773500981126146
  (2, 227)	0.2773500981126146
  (2, 235)	0.2773500981126146
  (2, 263)	0.2773500981126146
  (2, 270)	0.2773500981126146
  (2, 271)	0.5547001962252291
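
The values above are consistent with plain L2 normalization of the counts: 0.577 is 1/sqrt(3) (three tokens, each seen once), and 0.5547 is 2/sqrt(13) (one token seen twice among nine seen once). The sketch below checks this on a hypothetical two-line corpus and contrasts it with `use_idf=True`; the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import normalize

# Hypothetical two-line corpus; "the" occurs twice in the first line
toy_corpus = [
    "of the surrounding families that he is considered the rightful property",
    "netherfield park is let at last",
]

vect = CountVectorizer()
counts = vect.fit_transform(toy_corpus)

# use_idf=False: each row of counts is only rescaled to unit L2 length
tf_only = TfidfTransformer(use_idf=False).fit_transform(counts)
manual = normalize(counts.astype(np.float64), norm="l2", axis=1)

# use_idf=True: tokens shared by many documents are down-weighted before normalization
tfidf = TfidfTransformer(use_idf=True).fit_transform(counts)

print("TF equals L2-normalized counts:", np.allclose(tf_only.toarray(), manual.toarray()))
print("TF equals TF-IDF:", np.allclose(tf_only.toarray(), tfidf.toarray()))
print("weight of 'the' in line 0:", tf_only[0, vect.vocabulary_["the"]])
```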


In [8]:
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=1000,
                                stop_words='english')

In [9]:
tf = tf_vectorizer.fit_transform(pride_corpus)
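
The vectorizer settings above prune the vocabulary before LDA sees it: `min_df=2` drops tokens that occur in fewer than two lines, `max_df=0.95` drops tokens that occur in more than 95% of lines, and `stop_words='english'` removes common function words. A small sketch of the effect, using a hypothetical four-line corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, one "document" per line as in the notebook
toy_corpus = [
    "my dear mr bennet",
    "mr bennet replied",
    "my dear you flatter me",
    "mr bingley is single",
]

unfiltered = CountVectorizer().fit(toy_corpus)
filtered = CountVectorizer(max_df=0.95, min_df=2,
                           stop_words="english").fit(toy_corpus)

print(sorted(unfiltered.vocabulary_))
print(sorted(filtered.vocabulary_))
```

In this toy case, "replied" disappears because it occurs in only one line, and "my" disappears because it is on scikit-learn's built-in English stop-word list, even though it occurs in two lines.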

In [10]:
pride_corpus[12:20]

['However little known the feelings or views of such a man may be on his\n',
 'first entering a neighbourhood, this truth is so well fixed in the minds\n',
 'of the surrounding families, that he is considered the rightful property\n',
 'of some one or other of their daughters.\n',
 '\n',
 '"My dear Mr. Bennet," said his lady to him one day, "have you heard that\n',
 'Netherfield Park is let at last?"\n',
 '\n']

In [11]:
n_top_words = 5

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
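
The slice `topic.argsort()[:-n_top_words - 1:-1]` in the function above is compact but easy to misread: `argsort()` returns indices in ascending order of weight, and the slice walks backwards from the end, stopping after `n_top_words` items. A short sketch with made-up weights:

```python
import numpy as np

weights = np.array([0.1, 0.5, 0.3, 0.9])  # toy topic-word weights (assumed values)
n_top = 2

# Same slicing idiom as in print_top_words
via_slice = weights.argsort()[:-n_top - 1:-1]

# Equivalent, more explicit form: reverse the ascending order, then take the first n_top
via_reverse = np.argsort(weights)[::-1][:n_top]

print(via_slice, via_reverse)
```

Both expressions select the indices of the two largest weights, largest first.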

In [12]:
lda = LatentDirichletAllocation(n_components=10, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda.fit(tf)


print("\nTopics in LDA model:")
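# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
tf_feature_names = tf_vectorizer.get_feature_names_out()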
print_top_words(lda, tf_feature_names, n_top_words)


Topics in LDA model:
Topic #0: woman _her_ wife bingley know
Topic #1: replied assure little want bennet
Topic #2: man thinking little beauty daughters
Topic #3: mr dear bennet jane better
Topic #4: girls sure thing lizzy daughters
Topic #5: visit nerves good lady young
Topic #6: wife replied neighbourhood good _her_
Topic #7: neighbourhood single netherfield thousand girls
Topic #8: design years marrying know girls
Topic #9: long possession man mrs know
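
The topic listing above shows only the topic-to-word side of the model. The document-to-topic side is available through `transform`, which returns one topic distribution per document. A minimal sketch on a hypothetical five-line corpus with two topics (the corpus, names, and settings are illustrative and are not the notebook's data):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, one "document" per line
toy_corpus = [
    "my dear mr bennet have you heard netherfield park is let",
    "mr bennet replied that he had not",
    "a single man of large fortune",
    "what a fine thing for our girls",
    "netherfield is taken by a young man of large fortune",
]

toy_tf = CountVectorizer(stop_words="english").fit_transform(toy_corpus)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_tf)

# Each row is one document's mixture over topics and sums to 1
doc_topic = toy_lda.transform(toy_tf)
print(doc_topic.round(2))

# components_ holds unnormalized topic-word weights; normalize rows to get word mixtures
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)
print(topic_word.sum(axis=1))
```

With lines as short as these, and as in the notebook where each line of the chapter is treated as a separate document, the estimated mixtures are noisy, so the topics are best read as a demonstration of the API rather than a stable result.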

