___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Topic Modeling for Quora

#### Task: Import pandas and read in the quora_questions.csv file.

In [7]:
import pandas as pd

In [8]:
qstns = pd.read_csv('quora_questions.csv')

In [9]:
qstns.head()

|   | Question |
|---|---|
| 0 | What is the step by step guide to invest in sh... |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... |
| 2 | How can I increase the speed of my internet co... |
| 3 | Why am I mentally very lonely? How can I solve... |
| 4 | Which one dissolve in water quikly sugar, salt... |


# Preprocessing

#### Task: Use TF-IDF Vectorization to create a vectorized document term matrix. You may want to explore the max_df and min_df parameters.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
tfidf = TfidfVectorizer(max_df=0.95,min_df=2,stop_words='english')

In [12]:
dtm = tfidf.fit_transform(qstns['Question'])

In [13]:
dtm

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>
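To make the effect of `max_df` and `min_df` concrete, here is a minimal sketch on a toy corpus. The four sentences and the variable names (`toy_docs`, `unfiltered`, `filtered`) are made up for illustration and are not taken from quora_questions.csv. The sketch omits `stop_words` so that only the document-frequency filters are at work.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (assumption, not from the Quora data).
toy_docs = [
    "how do I learn python",
    "how do I learn cooking",
    "how do I invest money",
    "how do I save money",
]

# No document-frequency filtering: every token of 2+ characters is kept.
unfiltered = TfidfVectorizer()
unfiltered.fit(toy_docs)

# max_df=0.95 drops terms found in more than 95% of the documents;
# min_df=2 drops terms found in fewer than 2 documents.
filtered = TfidfVectorizer(max_df=0.95, min_df=2)
filtered.fit(toy_docs)

unfiltered_vocab = sorted(unfiltered.vocabulary_)
filtered_vocab = sorted(filtered.vocabulary_)

print(unfiltered_vocab)
print(filtered_vocab)  # ['learn', 'money']
```

Here "how" and "do" appear in every document and are removed by `max_df`, while words that occur only once are removed by `min_df`. On the real data, the same two filters decide how many columns the document term matrix has.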

# Non-negative Matrix Factorization

#### TASK: Using Scikit-Learn, create an instance of NMF with 20 expected components. (Use random_state=42.)

In [14]:
from sklearn.decomposition import NMF

In [15]:
nmf_model = NMF(n_components=20, random_state=42)

In [16]:
nmf_model.fit(dtm)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=20, random_state=42, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)
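As a rough illustration of what fitting does, NMF approximates a non-negative matrix as the product of two smaller non-negative matrices: one row of topic weights per document, and one row of term weights per topic. The sketch below uses a small random matrix as a stand-in for a document term matrix; `toy_dtm`, `toy_nmf`, `W`, and `H` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative matrix: 6 "documents" by 5 "terms" (assumption).
rng = np.random.RandomState(42)
toy_dtm = rng.rand(6, 5)

toy_nmf = NMF(n_components=2, random_state=42, max_iter=500)
W = toy_nmf.fit_transform(toy_dtm)  # document-topic weights, shape (6, 2)
H = toy_nmf.components_             # topic-term weights, shape (2, 5)

print(W.shape, H.shape)
print((W @ H).shape)  # same shape as toy_dtm
```

In the notebook, `nmf_model.components_` plays the role of `H`, with one row per topic and one column per vocabulary term.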

#### TASK: Print out the top 15 most common words for each of the 20 topics.

In [17]:
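# get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names().
# Fetch the vocabulary once instead of rebuilding it for every index lookup.
feature_names = tfidf.get_feature_names_out()
for index,topic in enumerate(nmf_model.components_):
    print(f"THE TOP 15 WORDS FOR TOPIC #{index}")
    print([feature_names[i] for i in topic.argsort()[-15:]])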

THE TOP 15 WORDS FOR TOPIC #0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']
THE TOP 15 WORDS FOR TOPIC #1
['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']
THE TOP 15 WORDS FOR TOPIC #2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']
THE TOP 15 WORDS FOR TOPIC #3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']
THE TOP 15 WORDS FOR TOPIC #4
['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']
THE TOP 15 WORDS FOR TOPIC #5
['reservation', 'engineering', 'minister', 'president', 'company', 'china', 'business', 'country', 'olympics'
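The `argsort()[-15:]` slice used above returns indices in ascending order of weight, so the strongest word for a topic is printed last. A minimal sketch with made-up weights (the names `toy_feature_names` and `toy_topic` are hypothetical) shows the ordering:

```python
import numpy as np

# Hypothetical vocabulary and one topic's term weights (assumption).
toy_feature_names = np.array(["apple", "bank", "cat", "dog", "egg"])
toy_topic = np.array([0.1, 0.9, 0.0, 0.5, 0.3])

# argsort() sorts ascending, so the last 3 indices are the 3 largest weights.
top3 = toy_topic.argsort()[-3:]
top3_words = toy_feature_names[top3].tolist()
print(top3_words)  # ['egg', 'dog', 'bank']

# Reverse the slice to list the strongest word first.
top3_words_desc = toy_feature_names[top3[::-1]].tolist()
print(top3_words_desc)  # ['bank', 'dog', 'egg']
```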

#### TASK: Add a new column to the original quora dataframe that assigns each question to one of the 20 topic categories.

In [20]:
topic_results = nmf_model.transform(dtm)

In [21]:
qstns['Topic']= topic_results.argmax(axis=1)
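To see what `argmax(axis=1)` does here, consider a small hand-written array shaped like the output of `transform`: one row per document and one column per topic. The values below are invented for illustration.

```python
import numpy as np

# Hypothetical topic weights for 3 documents over 4 topics (assumption).
toy_topic_results = np.array([
    [0.02, 0.00, 0.31, 0.05],
    [0.40, 0.10, 0.00, 0.00],
    [0.00, 0.00, 0.01, 0.22],
])

# For each row, return the column index of the largest weight.
toy_topics = toy_topic_results.argmax(axis=1)
print(toy_topics)  # [2 0 3]
```

Each document is assigned only its single highest-weighted topic, so a document that is split fairly evenly between two topics still gets just one label.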

In [22]:
qstns.head()

|   | Question | Topic |
|---|---|---|
| 0 | What is the step by step guide to invest in sh... | 5 |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | 16 |
| 2 | How can I increase the speed of my internet co... | 17 |
| 3 | Why am I mentally very lonely? How can I solve... | 11 |
| 4 | Which one dissolve in water quikly sugar, salt... | 14 |


In [23]:
my_topic_dict = {0:'entertainment',1:'employment',2:'quora posts',3:'technical',4:'world',5:'politics',6:'technical',7:'politics',8:'politics',9:'india',10:'topics',11:'indian economy',12:'resolutions',13:'language skills',14:'health',15:'life',16:'love',17:'web',18:'web',19:'earth'}

In [24]:
qstns['Topic Label']= qstns['Topic'].map(my_topic_dict)
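One thing worth knowing about `Series.map` with a dictionary: any value without a matching key becomes `NaN`, so a gap in the label dictionary shows up as missing labels rather than as an error. A small sketch, using a made-up frame `toy` and a deliberately incomplete dictionary `toy_labels`:

```python
import pandas as pd

# Hypothetical topic assignments (assumption, not the Quora results).
toy = pd.DataFrame({"Topic": [0, 2, 2, 5]})

# Topic 5 is intentionally left out of the dictionary.
toy_labels = {0: "entertainment", 2: "quora posts"}
toy["Topic Label"] = toy["Topic"].map(toy_labels)

missing = int(toy["Topic Label"].isna().sum())
print(toy)
print(missing)  # 1
```

Checking `isna().sum()` on the new column is a quick way to confirm that every topic number received a label.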

In [25]:
qstns.head()

|   | Question | Topic | Topic Label |
|---|---|---|---|
| 0 | What is the step by step guide to invest in sh... | 5 | politics |
| 1 | What is the story of Kohinoor (Koh-i-Noor) Dia... | 16 | love |
| 2 | How can I increase the speed of my internet co... | 17 | web |
| 3 | Why am I mentally very lonely? How can I solve... | 11 | indian economy |
| 4 | Which one dissolve in water quikly sugar, salt... | 14 | health |


## Non-Negative Matrix Factorization Method

In [26]:
import pandas as pd

npr = pd.read_csv('npr.csv')

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [28]:
tfidf = TfidfVectorizer(max_df=0.95,min_df=2,stop_words='english')

In [31]:
dtm = tfidf.fit_transform(npr['Article'])

In [32]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [30]:
from sklearn.decomposition import NMF

In [33]:
nfm_model = NMF(n_components=7,random_state=42)

In [34]:
nfm_model.fit(dtm)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=7, random_state=42, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

In [35]:
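# get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names().
# Fetch the vocabulary once instead of rebuilding it for every index lookup.
feature_names = tfidf.get_feature_names_out()
for index,topic in enumerate(nfm_model.components_):
    print(f"The Top 15 words for Topic # {index}")
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')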

The Top 15 words for Topic # 0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


The Top 15 words for Topic # 1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


The Top 15 words for Topic # 2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


The Top 15 words for Topic # 3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


The Top 15 words for Topic # 4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


The Top 15 words for Topic # 5
['love', 've', 'don

In [37]:
topic_results = nfm_model.transform(dtm)

In [38]:
npr['Topic']= topic_results.argmax(axis=1)

In [39]:
npr.head()

|   | Article | Topic |
|---|---|---|
| 0 | In the Washington of 2016, even when the polic... | 1 |
| 1 | Donald Trump has used Twitter —  his prefe... | 1 |
| 2 | Donald Trump is unabashedly praising Russian... | 1 |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... | 3 |
| 4 | From photography, illustration and video, to d... | 6 |
