# AAI 520 - Natural Language Processing

## Assignment 4 - Topic Analysis

by Bryan Carr

3 October 2022

for University of San Diego

Prof. Siamak Aram

In this assignment, we will analyse a dataset of over 400,000 Quora questions that have no labelled category. We will sort them into 20 topic categories.

The dataset was provided by Prof. Mokhtari Jadid, via Slack.

**Task: Import Pandas and read in the quora_questions.csv file.**

In [1]:
import pandas as pd
import numpy as np

# Mount the Google Drive to use the data file
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Read in the data

quora_df = pd.read_csv('/content/drive/My Drive/AAI 520 NLP/quora_questions.csv')

In [3]:
quora_df.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [4]:
# Additional EDA
quora_df.shape

(404289, 1)

## Preprocessing

**Task: Use TF-IDF Vectorization to create a vectorized document term matrix. You may want to explore the max_df and min_df parameters.**

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [6]:
vectorizer = TfidfVectorizer(min_df = 2, max_df = 0.95, stop_words='english')

# I tried several different values here, including max_features = 38669, until the topics came out correctly
# when using the values from the slides
#
# To interpret them: min_df = 2 means a word must occur in at least two docs; rarer words are filtered out
# max_df = 0.95 means a word must occur in no more than 95% of the docs; this drops near-universal words,
# which act like corpus-specific stop words (it does not drop the 5% most frequent words)
# stop_words = 'english' additionally filters out scikit-learn's built-in English stop word list

In [7]:
vectors = None
vectors = vectorizer.fit_transform(quora_df['Question'].tolist())

In [8]:
vectors

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>
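
To make the effect of `min_df` and `max_df` concrete, here is a minimal sketch on a made-up four-document corpus (the corpus and variable names are illustrative assumptions, not part of the Quora data). It compares an unfiltered vocabulary with one built using the same `min_df = 2` and `max_df = 0.95` thresholds; `stop_words` is left out so that only the frequency filters act.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not taken from the Quora data)
docs = [
    "cats chase mice",
    "cats chase dogs",
    "cats sleep",
    "cats purr loudly",
]

# No frequency filtering: every token is kept
full = TfidfVectorizer()
full.fit(docs)

# min_df=2 drops words found in fewer than 2 documents;
# max_df=0.95 drops words found in more than 95% of documents
filtered = TfidfVectorizer(min_df=2, max_df=0.95)
filtered.fit(docs)

full_vocab = sorted(full.vocabulary_)
filtered_vocab = sorted(filtered.vocabulary_)

print(full_vocab)
print(filtered_vocab)
```

In this toy corpus "cats" appears in every document, so `max_df` removes it, while the words that appear only once fall below `min_df`.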

## Non-negative Matrix Factorization

**TASK: Using Scikit-Learn, create an instance of NMF with 20 expected components. Use random_state = 42.**

In [9]:
from sklearn.decomposition import NMF

In [10]:
nmf_model = NMF(n_components=20, random_state=42)
# All other parameters (e.g. beta_loss='frobenius', solver='cd', max_iter=200, tol=0.0001) are left at their defaults

In [11]:
nmf_model.fit(vectors)



NMF(n_components=20, random_state=42)

In [12]:
nmf_components = nmf_model.components_

In [13]:
nmf_components.shape

(20, 38669)
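
The shape above has one row per topic and one column per vocabulary term. As a rough illustration of where that comes from, the sketch below factorizes a small random non-negative matrix (a stand-in assumption for the TF-IDF matrix; `V`, `W`, `H` and `toy_nmf` are illustrative names) into a document-topic matrix `W` and a topic-term matrix `H`, so that `V` is approximately `W @ H`.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative "document-term" matrix: 6 docs x 5 terms
rng = np.random.default_rng(0)
V = rng.random((6, 5))

toy_nmf = NMF(n_components=2, random_state=42, max_iter=500)
W = toy_nmf.fit_transform(V)   # document-topic weights, shape (6, 2)
H = toy_nmf.components_        # topic-term weights, shape (2, 5)

print(W.shape, H.shape)
# Reconstruction error of the rank-2 approximation
print(np.linalg.norm(V - W @ H))
```

Here `H` plays the role of `nmf_model.components_`, and `W` corresponds to what `transform` returns for each document.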

In [14]:
# search for top 15
# components[Q].argsort() would give us an Ascending list of indices for component at index Q
# so look at the last 15 entries, i.e. the highest ones
nmf_components[0].argsort()[-15:]

array([34698, 28463, 26316, 37088, 26326, 26057,  5976, 19847, 22924,
       37520,   482,  5283,  5268, 22925,  4632])
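
The `argsort()[-15:]` idiom can be checked on a tiny example. This sketch uses a made-up weight vector and vocabulary (both are assumptions for illustration) and takes the top 3 instead of the top 15.

```python
import numpy as np

# Hypothetical topic weights, one per vocabulary term
weights = np.array([0.1, 0.7, 0.0, 0.4, 0.9])
vocab = np.array(["apple", "book", "car", "dog", "egg"])

# argsort() returns indices in ascending order of weight,
# so the last 3 indices point at the 3 largest weights
top3_idx = weights.argsort()[-3:]
top3_words = vocab[top3_idx].tolist()

# Reverse the slice to list the strongest term first
top3_desc = vocab[top3_idx[::-1]].tolist()

print(top3_words)  # ['dog', 'book', 'egg']
print(top3_desc)   # ['egg', 'book', 'dog']
```

Note that the slice lists terms from weakest to strongest, which is why "best" appears last in the Topic #0 word list.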

In [15]:
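# Look up the vocabulary once, rather than once per word inside the loop
feature_names = vectorizer.get_feature_names_out()

for i, topic in enumerate(nmf_components):
  print(f'THE TOP 15 WORDS FOR TOPIC #{i}')
  print([feature_names[w] for w in topic.argsort()[-15:]])
  print('\n')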

THE TOP 15 WORDS FOR TOPIC #0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']


THE TOP 15 WORDS FOR TOPIC #1
['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']


THE TOP 15 WORDS FOR TOPIC #2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']


THE TOP 15 WORDS FOR TOPIC #3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']


THE TOP 15 WORDS FOR TOPIC #4
['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']


THE TOP 15 WORDS FOR TOPIC #5
['reservation', 'engineering', 'minister', 'president', 'company', 'china', 'business', 'country', 

In [16]:
# Additional work that I did before finding simpler answers/methods in the slides
# I was building a data frame of the word features, which could then be picked through using .iloc to pull out the word
# this would end up working similar to the get_feature_names_out()[w] method above

"""
# Build a dataframe of the Feature words, from the available Dictionary of Vocabulary

vect_features = None

vect_features = pd.DataFrame.from_dict(vectorizer.vocabulary_, orient='index').reset_index().rename(columns={'index':'word', 0:'index'}).sort_values(by='index').set_index('index')

print(vect_features)

"""


**TASK: Add a new column to the original quora dataframe that labels each question into one of the 20 topic categories**

In [17]:
# Recall: vectors is of shape (Docs, Words), in this case 404289x38669
# Vectors was then used to fit our NMF model

topics = nmf_model.transform(vectors)

In [18]:
topics

array([[2.75937605e-04, 5.91249293e-05, 6.17687040e-06, ...,
        6.97269969e-04, 2.13527728e-04, 0.00000000e+00],
       [1.96418670e-04, 8.85438224e-05, 0.00000000e+00, ...,
        0.00000000e+00, 5.51088847e-05, 1.05527238e-05],
       [1.78019854e-04, 6.47373072e-04, 1.60510763e-03, ...,
        3.02354836e-03, 1.05908512e-03, 1.23878889e-03],
       ...,
       [0.00000000e+00, 1.62431955e-05, 5.23720795e-06, ...,
        0.00000000e+00, 2.76279348e-06, 0.00000000e+00],
       [5.36236094e-04, 1.01567857e-03, 0.00000000e+00, ...,
        1.28720137e-04, 7.76975481e-04, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 1.25187210e-04]])

In [19]:
# Check for correct size: should have 404,289 x 20
topics.shape

(404289, 20)

In [20]:
# Check the topic of the first doc: should be 5 according to assignment handout
topics[0].argmax()

5

In [21]:
# Check all topic argmaxes: should be 5, 16, 17, 11, 14, ...
topics.argmax(axis=1)

array([ 5, 16, 17, ..., 11, 11,  9])

In [22]:
# add a new column to the dataframe
quora_df['Topic'] = topics.argmax(axis=1)

In [23]:
quora_df.head(10)

Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,5
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16
2,How can I increase the speed of my internet co...,17
3,Why am I mentally very lonely? How can I solve...,11
4,"Which one dissolve in water quikly sugar, salt...",14
5,Astrology: I am a Capricorn Sun Cap moon and c...,1
6,Should I buy tiago?,0
7,How can I be a good geologist?,10
8,When do you use シ instead of し?,19
9,Motorola (company): Can I hack my Charter Moto...,17
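
The labelling step reduces to a row-wise `argmax` over the document-topic matrix. The sketch below replays it on a hand-written 3x3 matrix (the values, `toy_df` and `doc_topic` are hypothetical, not output from the model above), then counts how many questions land in each topic as a quick sanity check.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights: 3 docs x 3 topics
doc_topic = np.array([
    [0.02, 0.90, 0.10],
    [0.50, 0.00, 0.30],
    [0.00, 0.05, 0.40],
])

toy_df = pd.DataFrame({"Question": ["q0", "q1", "q2"]})

# axis=1 picks, for each row (document), the column (topic) with the largest weight
toy_df["Topic"] = doc_topic.argmax(axis=1)

# Number of documents assigned to each topic
topic_counts = toy_df["Topic"].value_counts().sort_index()

print(toy_df)
print(topic_counts)
```

On the real data, the same `value_counts()` call on `quora_df['Topic']` would show how evenly the 20 topics are populated.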
