Topic Modeling for clients' queries

Here we show how to do topic modeling on a large dataset of more than 400k customer questions. We will
use TF-IDF and Non-negative Matrix Factorization (NMF) for this project. The number of topics is a judgment call;
here we define twenty topics. Note that after extracting the highest-weighted words for each topic, choosing a title
for the topic based on those words is also a judgment call that relies on our domain knowledge.

In [33]:
# Import pandas.
# Important: this project uses a file named quora_questions.csv with a single text column (Question).
# You can apply the same steps to any similar dataset.

import pandas as pd
qq = pd.read_csv('quora_questions.csv')

print(len(qq))
qq.head()

404289


Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [34]:
# First preprocessing step:
# We'll create a document-term matrix using TF-IDF.
# English stop words are removed. Words that are too common or too rare across the documents (questions here)
# are also dropped from the vocabulary using the max_df and min_df parameters.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_df = 0.90, min_df = 2, stop_words = 'english')

dtm = tfidf.fit_transform(qq['Question'])
dtm

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>
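
To make the effect of max_df and min_df concrete, here is a minimal sketch on a hypothetical four-sentence corpus (the sentences and the names toy_docs, unfiltered and filtered are illustrative assumptions, not part of quora_questions.csv). It compares the vocabulary with and without the document-frequency filters used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate the filters.
toy_docs = [
    "how do i learn python fast",
    "how do i learn guitar fast",
    "how do i cook rice",
    "how do i cook pasta",
]

# No document-frequency filtering: every non-stop-word token is kept.
unfiltered = TfidfVectorizer(stop_words="english").fit(toy_docs)

# min_df=2 drops words that appear in fewer than 2 documents;
# max_df=0.90 drops words that appear in more than 90% of documents.
filtered = TfidfVectorizer(max_df=0.90, min_df=2, stop_words="english").fit(toy_docs)

vocab_unfiltered = sorted(unfiltered.vocabulary_)
vocab_filtered = sorted(filtered.vocabulary_)

print("without filters:", vocab_unfiltered)
print("with filters:   ", vocab_filtered)
```

Words that occur in a single sentence, such as "python" or "rice", disappear once min_df = 2 is applied, which is the same mechanism that trims the vocabulary of the full dataset.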

In [35]:
# Import the NMF (non-negative matrix factorization), create an instance of it and set the number of topics to 20

from sklearn.decomposition import NMF

nmf = NMF(n_components=20,random_state=42)
nmf.fit(dtm)



NMF(n_components=20, random_state=42)
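
Since the number of topics is a judgment call, one possible sanity check (an assumption on our part, not a step performed in this project) is to fit NMF with a few candidate values and compare the reconstruction_err_ attribute of each fitted model. The sketch below does this on a hypothetical toy corpus so that it runs in seconds; the candidate values 2, 3 and 4 are arbitrary.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to keep the example fast.
toy_docs = [
    "best books to read this year",
    "best movies to watch this year",
    "how to earn money online",
    "easy ways to make money from home",
    "what is the purpose of life",
    "what is the meaning of life",
    "best laptop to buy for students",
    "how to make money with youtube",
]
toy_dtm = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

# Fit one model per candidate topic count and record its reconstruction error.
errors = {}
for k in (2, 3, 4):
    model = NMF(n_components=k, random_state=42, max_iter=500)
    model.fit(toy_dtm)
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(f"n_components={k}: reconstruction error = {err:.3f}")
```

A lower error only means the factorization reproduces the matrix more closely; whether the resulting topics are meaningful still has to be judged by reading their top words.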

In [36]:
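# Let's examine the 15 highest-weighted words for each of the 20 topics (listed from weakest to strongest).
# Note: get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs version 1.0 or later.

feature_names = tfidf.get_feature_names_out()

for index, topic in enumerate(nmf.components_):
    print(f'List of 15 most commonly used words for the topic No. {index}')
    print([feature_names[i] for i in topic.argsort()[-15:]])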
    print('\n')

List of 15 most commonly used words for the topic No. 0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']


List of 15 most commonly used words for the topic No. 1
['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']


List of 15 most commonly used words for the topic No. 2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']


List of 15 most commonly used words for the topic No. 3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']


List of 15 most commonly used words for the topic No. 4
['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']
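
The word lists above come from sorting each row of nmf.components_ and keeping the last 15 indices, so the strongest word appears last. A small sketch of the same idea as a reusable helper, run on a hypothetical 2-topic, 4-word matrix (top_words, toy_components and toy_vocab are illustrative names), returns the words strongest first instead:

```python
import numpy as np

def top_words(components, feature_names, n_top=15):
    """Return, for each topic row, the n_top highest-weighted words, strongest first."""
    feature_names = np.asarray(feature_names)
    result = []
    for weights in components:
        # argsort is ascending, so reverse it and keep the first n_top indices.
        top_idx = np.argsort(weights)[::-1][:n_top]
        result.append(feature_names[top_idx].tolist())
    return result

# Hypothetical topic-word weights: 2 topics over a 4-word vocabulary.
toy_components = np.array([
    [0.1, 0.9, 0.0, 0.4],
    [0.7, 0.0, 0.8, 0.2],
])
toy_vocab = ["book", "money", "life", "online"]

toy_top = top_words(toy_components, toy_vocab, n_top=2)
print(toy_top)
```

With the fitted objects from this notebook, the equivalent call would take nmf.components_ and the vectorizer's feature names as arguments.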



In [37]:
# Now let's add a new column to the original dataframe that assigns each question to one of the 20 topics.

# compute the weight of every topic for each question (NMF weights are non-negative scores, not probabilities)
qq_topics = nmf.transform(dtm)

# pick the topic with the largest weight for each question
qq['Topic Label'] = qq_topics.argmax(axis = 1)
qq.head()

Unnamed: 0,Question,Topic Label
0,What is the step by step guide to invest in sh...,5
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16
2,How can I increase the speed of my internet co...,17
3,Why am I mentally very lonely? How can I solve...,11
4,"Which one dissolve in water quikly sugar, salt...",14
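
The values returned by nmf.transform are non-negative weights rather than probabilities, and argmax returns index 0 when every weight in a row is zero. The sketch below, on a hypothetical 3-question weight matrix (toy_weights is an illustrative name), shows one way to turn weights into per-question shares and to flag rows that carry no signal, so that they are not silently counted as topic 0.

```python
import numpy as np

# Hypothetical document-topic weights for 3 questions and 3 topics.
toy_weights = np.array([
    [0.02, 0.10, 0.00],
    [0.30, 0.05, 0.15],
    [0.00, 0.00, 0.00],  # e.g. a question made up only of removed words
])

row_sums = toy_weights.sum(axis=1, keepdims=True)

# Divide only where the row sum is positive; all-zero rows stay all-zero.
shares = np.divide(
    toy_weights, row_sums, out=np.zeros_like(toy_weights), where=row_sums > 0
)

labels = toy_weights.argmax(axis=1)   # index of the largest weight per row
has_signal = row_sums.ravel() > 0     # False for rows with no topic weight at all

print(shares)
print(labels)
print(has_signal)
```

In the toy matrix the third row is assigned label 0 even though it has no weight on any topic, which is why checking has_signal before trusting the label can be worthwhile.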


In [38]:
# Finally, we give each topic a human-readable title. We choose the title based on the
# highest-weighted words listed above for every topic. Topic labels start at 0, so the dictionary keys run from 0 to 19.

title_dict = {0:'Movies and books', 1:'Job', 2:'Q&A', 3:'Internet and Web', 4:'Life', 5:'South & East Asia', 
          6:'Programming', 7:'US Election', 8:'Politics', 9:'Women', 10:'General', 11:'Economy', 
          12:'Administration', 13:'English Language', 14:'Health', 15:'Recreation', 16:'Love and Marriage',
          17:'Social Media', 18:'Software Engineering', 19:'Earth and Human'}

qq['Topic Title'] = qq['Topic Label'].map(title_dict)

qq.head()

Unnamed: 0,Question,Topic Label,Topic Title
0,What is the step by step guide to invest in sh...,5,South & East Asia
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16,Love and Marriage
2,How can I increase the speed of my internet co...,17,Social Media
3,Why am I mentally very lonely? How can I solve...,11,Economy
4,"Which one dissolve in water quikly sugar, salt...",14,Health


In [40]:
# Let's see the frequency of each topic in descending order
qq['Topic Title'].value_counts()

# Topic sizes are of the same order of magnitude, ranging from about 9k to about 35k questions

Movies and books        34870
Earth and Human         29735
Software Engineering    29224
Job                     28729
General                 27686
Women                   23334
US Election             22901
Social Media            22694
Administration          22171
Recreation              21020
South & East Asia       20134
Q&A                     17590
Internet and Web        16145
Politics                15191
Economy                 15027
Life                    14719
Health                  13273
Love and Marriage       12140
English Language         8961
Programming              8745
Name: Topic Title, dtype: int64
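
Because the titles rest on our own judgment, it can help to read a few questions from each topic and confirm that the title fits. The following is a minimal sketch of such a spot check on a hypothetical five-row dataframe (toy and samples are illustrative names, and the rows are invented for the example); with the real data the same grouping would be applied to qq.

```python
import pandas as pd

# Hypothetical labelled questions, standing in for the qq dataframe.
toy = pd.DataFrame({
    "Question": [
        "What are the best books to read?",
        "How can I earn money online?",
        "Which movies should I watch this year?",
        "What is the purpose of life?",
        "What are easy ways to make money from home?",
    ],
    "Topic Title": [
        "Movies and books",
        "Internet and Web",
        "Movies and books",
        "Life",
        "Internet and Web",
    ],
})

# Collect up to two example questions per topic title.
samples = {
    title: group["Question"].head(2).tolist()
    for title, group in toy.groupby("Topic Title")
}

for title, questions in samples.items():
    print(title)
    for question in questions:
        print("  -", question)
```

If the sampled questions do not match a title, that is a hint to revisit either the title or the number of topics.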