In [1]:
import pandas as pd
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# from google.colab import files

## Data loading

In [2]:
# Mount Google Drive in Google Colab to load the data files directly from it
# from google.colab import drive
# drive.mount('/content/drive')

In [3]:
# The data can be loaded directly from Google Drive
# Change the path below if the file is not in EPITA_NLP/Course1/ on your Google Drive
quora = pd.read_csv('quora_questions.csv')
print(quora.head(30))
# Keep only the first 10,000 questions to reduce computation time
texts = quora["Question"][0:10000]

                                             Question
0   What is the step by step guide to invest in sh...
1   What is the story of Kohinoor (Koh-i-Noor) Dia...
2   How can I increase the speed of my internet co...
3   Why am I mentally very lonely? How can I solve...
4   Which one dissolve in water quikly sugar, salt...
5   Astrology: I am a Capricorn Sun Cap moon and c...
6                                 Should I buy tiago?
7                      How can I be a good geologist?
8                     When do you use シ instead of し?
9   Motorola (company): Can I hack my Charter Moto...
10  Method to find separation of slits using fresn...
11        How do I read and find my YouTube comments?
12               What can make Physics easy to learn?
13        What was your first sexual experience like?
14  What are the laws to change your status from a...
15  What would a Trump presidency mean for current...
16                       What does manipulation mean?
17  Why do girls want to be 

## Non-negative Matrix Factorization

Pre-processing

Use the TfidfVectorizer class: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [15]:
# Question 1: Uncomment and complete the following lines

tfidf = TfidfVectorizer(stop_words='english', min_df=2, max_df=0.95)
dtm = tfidf.fit_transform(texts)


Use of the NMF algorithm

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html

In [18]:
# Question 2: Uncomment and complete the following lines (choose the number of components you want)

NMF_ = NMF(n_components=7, init='random', random_state=0)
NMF_.fit(dtm)

Have a look at the components


In [19]:
NMF_.components_

array([[6.50669087e-03, 0.00000000e+00, 2.67041555e-02, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [2.58116728e-02, 0.00000000e+00, 1.76912466e-02, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [3.34615170e-03, 3.94290220e-04, 5.45910125e-03, ...,
        0.00000000e+00, 2.78757037e-03, 1.03327876e-02],
       ...,
       [0.00000000e+00, 2.89111494e-04, 1.40061073e-02, ...,
        7.25078981e-05, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 7.78688015e-04, 1.17346239e-02, ...,
        3.77963644e-05, 0.00000000e+00, 1.08311569e-02],
       [7.35029346e-03, 1.28938394e-04, 1.23298363e-02, ...,
        0.00000000e+00, 0.00000000e+00, 4.09595984e-04]])

Have a look at the shape of the component array.

Can you explain the meaning of the shape observed?


In [20]:
NMF_.components_.shape

(7, 5168)

Let's have a look at the most representative words of each topic


In [26]:
# Question 3: Print the 15 most representative words of each topic
# Advice: both the tfidf and NMF objects can be useful here
# Do you think the number of components chosen for the NMF in the previous question was relevant? You can try changing it if you want.
tfidf_features = tfidf.get_feature_names_out()

for index_topic, topic_line in enumerate(NMF_.components_):
    print('topic', index_topic)
    print([tfidf_features[i] for i in topic_line.argsort()[-15:]])
    print('------------------')

topic 0
['start', 'movie', 'weight', 'english', 'learning', 'books', 'book', 'programming', '2016', 'movies', 'language', 'india', 'learn', 'way', 'best']
------------------
topic 1
['rupee', 'day', 'did', 'black', 'friends', 'notes', '500', 'ways', '1000', 'india', 'way', 'earn', 'online', 'money', 'make']
------------------
topic 2
['time', 'new', 'says', 'english', 'compare', 'use', 'cost', 'exist', 'love', 'long', 'feel', 'india', 'work', 'mean', 'does']
------------------
topic 3
['knowledge', 'java', 'machine', 'systems', 'transgender', 'bank', 'computer', 'information', 'main', 'engineering', 'science', 'data', 'job', 'love', 'difference']
------------------
topic 4
['believe', 'needing', 'easily', 'improvement', 'asked', 'google', 'delete', 'answers', 'answer', 'ask', 'think', 'question', 'questions', 'people', 'quora']
------------------
topic 5
['things', 'world', 'thing', 'girl', 'going', 'sex', 'person', 'don', 'girls', 'did', 'feel', 'know', 'work', 'life', 'like']
-------
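Note that `argsort()[-15:]` lists the words in increasing order of weight, so the strongest word of each topic comes last. A possible variant that returns the strongest words first is sketched below. `top_terms` is a hypothetical helper, and the small weight matrix is invented for the example.

```python
import numpy as np

def top_terms(components, feature_names, n_top=15):
    """Return, for each topic row, its n_top highest-weight terms, strongest first."""
    feature_names = np.asarray(feature_names)
    result = []
    for weights in components:
        # argsort is ascending, so reverse it before keeping the first n_top indices
        order = np.argsort(weights)[::-1][:n_top]
        result.append(feature_names[order].tolist())
    return result

# Hypothetical 2-topic, 4-term weight matrix for illustration only
demo_components = np.array([[0.1, 0.9, 0.0, 0.5],
                            [0.7, 0.0, 0.3, 0.2]])
demo_terms = ["earn", "learn", "money", "python"]

for k, terms in enumerate(top_terms(demo_components, demo_terms, n_top=2)):
    print("topic", k, terms)
```

With the objects of this notebook, the call would be `top_terms(NMF_.components_, tfidf.get_feature_names_out())`.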

Explicitly associate each text with a topic

In [30]:
# Question 4: associate each text with a specific topic in a new two-column DataFrame
# with one column for the text and the other for the topic number
# Hint: you may find the NMF.transform function useful

# Document-topic weights: one row per question, one column per topic
NMF_transform = NMF_.transform(dtm)
# Assign each question to the topic with the highest weight
pd.DataFrame({'texts': texts, 'NMF_Cluster': NMF_transform.argmax(axis=1)})

                                                  texts  NMF_Cluster
0     What is the step by step guide to invest in sh...            0
1     What is the story of Kohinoor (Koh-i-Noor) Dia...            0
2     How can I increase the speed of my internet co...            1
3     Why am I mentally very lonely? How can I solve...            5
4     Which one dissolve in water quikly sugar, salt...            2
...                                                 ...          ...
9995  How would you order these four cities (Bangalo...            4
9996                            Stphen william hawking?            0
9997  Mathematical Puzzles: What is () + () + () = 3...            4
9998                         Is IMS noida good for BCA?            6
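One caveat with `argmax`: if every topic weight of a document is zero (which may happen when none of its words survive the vectorizer filters), `argmax` still returns 0, so the document looks as if it belonged to topic 0. The sketch below, built on invented weights, keeps the maximum weight next to the topic number so that such rows can be spotted. The `NMF_Weight` and `unassigned` columns are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights (rows: documents, columns: topics)
doc_topic = np.array([
    [0.0, 0.8, 0.1],
    [0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0],  # a document with no weight on any topic
])
demo_texts = pd.Series(["earn money online", "learn python", "tiago?"])

assignments = pd.DataFrame({
    "texts": demo_texts,
    "NMF_Cluster": doc_topic.argmax(axis=1),  # index of the strongest topic
    "NMF_Weight": doc_topic.max(axis=1),      # weight of that topic
})
# An all-zero row gets topic 0 only because argmax returns the first maximum
assignments["unassigned"] = assignments["NMF_Weight"] == 0

print(assignments)
```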


## Latent Dirichlet Allocation (LDA)

Pre-processing 

Use the CountVectorizer class: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [10]:
# Question 5: Uncomment and complete the following lines

#cv = CountVectorizer(????)
#dtm = cv.fit_transform(????)

Use of the LDA algorithm

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

In [11]:
# Question 6: Uncomment and complete the following lines (choose the number of components you want)

#LDA = LatentDirichletAllocation(????)
#LDA.fit(????)

Have a look at the components



In [12]:
LDA.components_

NameError: name 'LDA' is not defined

Have a look at the shape of the component array.

Can you explain the meaning of the shape observed? 

In [None]:
LDA.components_.shape

Let's have a look at the most representative words of each topic


In [None]:
# Question 7: Print the 15 most representative words of each topic
# Advice: both the cv and LDA objects can be useful here
# Do you think the number of components chosen for the LDA in the previous question was relevant? You can try changing it if you want.

Explicitly associate each text with a topic

In [None]:
# Question 8: associate each text with a specific topic in a new three-column DataFrame
# with one column for the text, the second for the topic number from the NMF classification and the third for the topic number from the LDA classification
# Hint: you may find LDA.transform function useful
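The comparison could be assembled along the following lines. This is a sketch on a toy corpus with hypothetical names (`tfidf_dtm`, `count_dtm`, `comparison`), not the expected solution. The two document-term matrices are kept under separate names because NMF is fitted on TF-IDF weights and LDA on raw counts.

```python
import pandas as pd
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from the Quora data
toy_texts = pd.Series([
    "how do I learn python",
    "how do I learn java",
    "how do I earn money online",
    "what is the best way to earn money",
])

# Separate matrices: TF-IDF weights for NMF, raw counts for LDA
tfidf_dtm = TfidfVectorizer(stop_words="english").fit_transform(toy_texts)
count_dtm = CountVectorizer(stop_words="english").fit_transform(toy_texts)

nmf_model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
lda_model = LatentDirichletAllocation(n_components=2, random_state=0)

# For each document, keep the index of its strongest topic
nmf_topics = nmf_model.fit_transform(tfidf_dtm).argmax(axis=1)
lda_topics = lda_model.fit_transform(count_dtm).argmax(axis=1)

comparison = pd.DataFrame({
    "texts": toy_texts,
    "NMF_Cluster": nmf_topics,
    "LDA_Cluster": lda_topics,
})
print(comparison)
```

Topic numbers are arbitrary labels within each model, so NMF topic 0 and LDA topic 0 need not describe the same theme. A cross-tabulation such as `pd.crosstab(comparison["NMF_Cluster"], comparison["LDA_Cluster"])` is one way to see which topics overlap.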