# Topic Modeling - LDA

Welcome to your Topic Modeling Assessment! For this project you will be working with a dataset of over 400,000 Quora questions that have no labeled category, and attempting to find 20 categories to assign these questions to. The .csv file of these text questions can be found in the Topic-Modeling folder.

Remember you can always check the solutions notebook and video lecture for any questions.

#### Task: Import pandas and read in the quora_questions.csv file.

In [5]:
import pandas as pd

In [6]:
quora = pd.read_csv('quora_questions.csv')

In [7]:
quora.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


# Preprocessing

#### Task: Use CountVectorizer Vectorization to create a vectorized document term matrix. You may want to explore the max_df and min_df parameters.

In [8]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
# Instantiate CountVectorizer with required hyperparameters
# max_df=0.9 => ignore words that appear in more than 90% of the documents
# min_df=2   => ignore words that appear in fewer than 2 documents

cv = CountVectorizer(max_df=0.9,min_df=2,stop_words='english')
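
The effect of `max_df` and `min_df` can be illustrated with a small pure-Python sketch. It is a simplified stand-in for what `CountVectorizer` does, not its implementation: the toy documents are invented and already tokenized, and stop-word removal is left out.

```python
from collections import Counter

# Hypothetical, pre-tokenized toy corpus (not taken from the Quora data)
docs = [
    ["apple", "banana"],
    ["apple", "cherry"],
    ["apple", "banana", "date"],
    ["apple", "cherry"],
]

max_df = 0.9  # float: maximum fraction of documents a term may appear in
min_df = 2    # int: minimum number of documents a term must appear in

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))
n_docs = len(docs)

# Keep terms that are neither too common nor too rare
kept = sorted(
    term for term, df in doc_freq.items()
    if min_df <= df <= max_df * n_docs
)
print(kept)  # ['banana', 'cherry']
```

Here `apple` is dropped because it appears in all 4 documents (more than 90%), and `date` is dropped because it appears in only 1 document.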

In [10]:
# Create Document Term Matrix
dtm = cv.fit_transform(quora['Question'])

In [11]:
# Creates a sparse matrix with 404289 rows (questions/documents) and 38669 columns (words/terms)
dtm

<404289x38669 sparse matrix of type '<class 'numpy.int64'>'
	with 2002912 stored elements in Compressed Sparse Row format>

# Latent Dirichlet Allocation

#### TASK: Using Scikit-Learn create an instance of LatentDirichletAllocation with 20 expected components. (Use random_state=42).

In [12]:
from sklearn.decomposition import LatentDirichletAllocation

In [13]:
# Note: the task asks for 20 components; this run uses 7, so the outputs below cover topics 0-6
lda_model = LatentDirichletAllocation(n_components=7, random_state=42)

In [15]:
%%time
lda_model.fit(dtm)

CPU times: user 11min 2s, sys: 1.28 s, total: 11min 3s
Wall time: 11min 15s


LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=7, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)
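
Since the full fit above takes about 11 minutes, it can be useful to check the pipeline on a tiny corpus first. The following is a minimal sketch under that assumption: the four toy questions and the `toy_` names are invented for illustration and are not part of the assessment.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy questions (not from quora_questions.csv)
toy_questions = [
    "how do I learn python programming",
    "best way to learn programming online",
    "how to invest money in stocks",
    "best stocks to invest money",
]

# Same two steps as the notebook, on a much smaller scale
toy_cv = CountVectorizer(stop_words="english")
toy_dtm = toy_cv.fit_transform(toy_questions)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_doc_topics = toy_lda.fit_transform(toy_dtm)

# One row per document, one column per topic; each row sums to 1
print(toy_doc_topics.shape)
print(np.allclose(toy_doc_topics.sum(axis=1), 1.0))

# components_ has one row per topic and one column per vocabulary term
print(toy_lda.components_.shape[0], toy_dtm.shape[1])
```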

#### TASK: Print out the top 15 most common words for each of the 20 topics.

In [18]:
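%%time
# Fetch the vocabulary once (scikit-learn >= 1.0: cv.get_feature_names_out())
feature_names = cv.get_feature_names()
for index,topic in enumerate(lda_model.components_):
    print(f"THE TOP 15 WORDS FOR TOPIC # {index}")
    # The slice keeps the 20 highest-weight words; use [-15:] to match the header
    print([feature_names[i] for i in topic.argsort()[-20:]])
    print('\n\n')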
    

THE TOP 15 WORDS FOR TOPIC # 0
['buy', 'used', 'career', 'examples', 'difference', 'free', 'company', 'using', 'mobile', 'software', 'google', 'app', 'android', 'engineering', 'does', 'good', 'use', 'phone', 'india', 'best']



THE TOP 15 WORDS FOR TOPIC # 1
['earn', 'india', 'indian', 'black', 'rs', 'ways', 'programming', 'stop', 'language', 'improve', '1000', 'notes', 'online', '500', 'english', 'make', 'way', 'learn', 'money', 'best']



THE TOP 15 WORDS FOR TOPIC # 2
['hair', 'work', 'police', 'indian', 'read', 'safe', 'book', 'water', 'did', 'compare', 'travel', 'average', 'energy', 'india', 'books', 'best', 'good', 'time', 'does', 'life']



THE TOP 15 WORDS FOR TOPIC # 3
['answer', 'men', 'don', 'ask', 'day', 'make', 'movies', 'thing', 'does', 'question', 'old', 'movie', 'year', 'things', 'questions', 'best', 'know', 'new', 'people', 'quora']



THE TOP 15 WORDS FOR TOPIC # 4
['bank', 'power', 'gmail', 'different', 'card', 'country', 'email', 'differences', 'rid', 'password', 'c
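
The `argsort()` slicing used to pick the top words can be sketched on a small example. The vocabulary and weights below are made up for illustration; in the notebook they would come from the vectorizer and from one row of `lda_model.components_`.

```python
import numpy as np

# Hypothetical vocabulary and topic weights (illustrative values only)
vocab = np.array(["app", "best", "money", "phone", "learn"])
topic_weights = np.array([0.5, 3.0, 0.1, 2.0, 1.0])

top_n = 3

# argsort() returns indices in ascending order of weight,
# so the last top_n indices belong to the heaviest words
top_idx = topic_weights.argsort()[-top_n:]

top_words_ascending = vocab[top_idx].tolist()
top_words_descending = vocab[top_idx[::-1]].tolist()

print(top_words_ascending)   # ['learn', 'phone', 'best']
print(top_words_descending)  # ['best', 'phone', 'learn']
```

This is why each printed list ends with its most heavily weighted word: the slice preserves ascending order, and reversing it puts the strongest word first.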

#### TASK: Add a new column to the original quora dataframe that labels each question into one of the 20 topic categories.

In [21]:
quora.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [23]:
topic_results = lda_model.transform(dtm)

In [24]:
topic_results.argmax(axis=1)

array([5, 4, 3, ..., 5, 5, 1])

In [25]:
quora['Topic'] = topic_results.argmax(axis=1)

In [26]:
quora.head()

Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,5
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,4
2,How can I increase the speed of my internet co...,3
3,Why am I mentally very lonely? How can I solve...,1
4,"Which one dissolve in water quikly sugar, salt...",1
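
The `argmax(axis=1)` step used for the `Topic` column can be sketched on a small matrix. The values below are invented; in the notebook, each row of `topic_results` holds one question's weights across the topics.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents, 3 topics
doc_topics = np.array([
    [0.10, 0.70, 0.20],
    [0.60, 0.30, 0.10],
    [0.25, 0.25, 0.50],
])

# Index of the largest weight in each row = assigned topic
assigned = doc_topics.argmax(axis=1)

# The largest weight itself gives a rough sense of how dominant that topic is
strength = doc_topics.max(axis=1)

print(assigned.tolist())  # [1, 0, 2]
print(strength.tolist())  # [0.7, 0.6, 0.5]
```

Keeping the maximum weight alongside the label is an optional addition, not something the assessment asks for; it can help spot questions that do not clearly belong to any one topic.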


In [27]:
# Create Topic Dictionary
topic_dict = {0:'topic_0',1:'topic_1',2:'topic_2',3:'topic_3',4:'topic_4',5:'topic_5',6:'topic_6'}

In [28]:
quora['Topic_Label'] = quora['Topic'].map(topic_dict)

In [29]:
quora.head()

Unnamed: 0,Question,Topic,Topic_Label
0,What is the step by step guide to invest in sh...,5,topic_5
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,4,topic_4
2,How can I increase the speed of my internet co...,3,topic_3
3,Why am I mentally very lonely? How can I solve...,1,topic_1
4,"Which one dissolve in water quikly sugar, salt...",1,topic_1
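
As an alternative to typing the label dictionary by hand, it could be built from the number of topics. This is a small sketch; `n_topics` is a hypothetical variable that, in the notebook, could be read from `lda_model.n_components`.

```python
# Assumed number of topics; matches the 7 components used in this run
n_topics = 7

# Build {0: 'topic_0', 1: 'topic_1', ...} without writing each entry out
topic_dict = {i: f"topic_{i}" for i in range(n_topics)}

print(topic_dict[5])  # topic_5
```

This keeps the mapping in sync if the number of components changes. Once the topics have been inspected, the placeholder names could be replaced with descriptive labels.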
