# Topic Modeling with a Single NMF Model
Solution: use K-means model to output
Source: seed_sample_2000_input.csv

## Source


<img src="images/source-input-seed-2000.png" alt="term-document matrix" style="width: 80%"/>

### Import necessary libraries

In [226]:
from flask import Flask, request, redirect, render_template, Response, send_file, send_from_directory
from flask_wtf import FlaskForm
from wtforms import SelectMultipleField
from flask_bootstrap import Bootstrap
from os.path import join, dirname, realpath
import pandas as pd
# from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer  # add CountVectorizer for NMF
from sklearn.cluster import KMeans
from datetime import datetime
import io, os
import numpy as np

### Import library for stopwords

In [227]:
from sklearn import decomposition
import matplotlib.pyplot as plt
import re
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split

In [228]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/shouqiangye/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Set DF from csv file
Source 1: <a href="https://github.com/UNCWellington/AI-tools/blob/main/seed_sample_2000_input.csv?raw=true">seed_sample_2000_input.csv</a> (only 2,000 rows)

Source 2: <a href="https://github.com/UNCWellington/AI-tools/blob/main/Sample-for-SC.csv?raw=true">Sample-for-SC.csv</a> (nearly 10,000 rows)

Note: we should add `?raw=true` at the end of the GitHub link so that `pd.read_csv` receives the raw CSV file rather than the HTML page.
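
As a small illustration of the note above, the helper below appends `raw=true` to a GitHub link using only the standard library. The name `to_raw_url` is hypothetical and not part of this notebook; the sketch assumes the link has the usual `github.com/<owner>/<repo>/blob/...` shape.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

def to_raw_url(url):
    """Return url with raw=true in its query string (hypothetical helper)."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query['raw'] = 'true'  # ask GitHub for the file contents, not the HTML page
    return urlunparse(parts._replace(query=urlencode(query)))

blob_url = 'https://github.com/UNCWellington/AI-tools/blob/main/Sample-for-SC.csv'
raw_url = to_raw_url(blob_url)
print(raw_url)
```

Calling the helper on a link that already ends in `?raw=true` leaves it unchanged.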

In [229]:
# url = 'https://github.com/UNCWellington/AI-tools/blob/main/seed_sample_2000_input.csv?raw=true'
url = 'https://github.com/UNCWellington/AI-tools/blob/main/Sample-for-SC.csv?raw=true'
# df = pd.read_csv(url,index_col=0)
df = pd.read_csv(url, keep_default_na=False)

In [230]:
df.head()

Unnamed: 0,AN,Seed,TAB
0,1783,,"""Anywhere but here"": Querying spatial stigma a..."
1,4672,,"""Are We Safe Analysts?"" Cisgender Countertrans..."
2,4444,,"""As a Trans Person You Don't Live. You Merely ..."
3,22,,"""At Your Service"": Sexual Harassment of Female..."
4,2808,,"""Bareback"" pornography consumption and safe-se..."


## Stop words, stemming, lemmatization

In [231]:
stemmer = nltk.stem.SnowballStemmer('english')
nltk.download('stopwords')
stopwords = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/shouqiangye/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [262]:
stopwords

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

## Data Processing

Next, scikit-learn's `CountVectorizer` extracts all the word counts for us, producing a document-term matrix with one row per document and one column per term.

In [232]:
documents = df['TAB'].values.astype("U")

In [233]:
documents

array(['"Anywhere but here": Querying spatial stigma as a social determinant of health among youth of color accessing LGBTQ services in Chicago\'s Boystown The link between stigma and negative health outcomes is established, yet available research infrequently considers the complex intersection of place, race, and class-based stigma and how this stigma shapes opportunities and health among marginalized groups. Furthermore, scholarship on the relationship between stigma and health often fails to include the voices of the stigmatized themselves. This exclusion renders their lived-experiences hidden and their insight devalued, producing findings with limited validity to promote health equity and social change. In this article, we explore intersecting place, race, and class-based stigmas, or spatial stigma, as a social determinant of health among youth of color (YoC) accessing LGBTQ-specific services in the Chicago\'s White, middle-class gay enclave, Boystown. Qualitative data were collect

In [234]:
# vectorizer = TfidfVectorizer(stop_words=stopwords, ngram_range=(1, 2))
#features = vectorizer.fit_transform(documents)

In [235]:
# pass the stop words as a list; recent scikit-learn versions reject a set
vectorizer = CountVectorizer(stop_words=list(stopwords), ngram_range=(1, 3))

In [236]:
# vectors = vectorizer.fit_transform(documents).todense() # (documents, vocab)
vectors = vectorizer.fit_transform(documents)
vectors.shape #, vectors.nnz / vectors.shape[0], row_means.shape

(10094, 2432537)

In [237]:
print(len(documents), vectors.shape)

10094 (10094, 2432537)


In [238]:
# vocab = np.array(vectorizer.get_feature_names_out())
# we change vocab variable to terms
terms = np.array(vectorizer.get_feature_names_out())

In [239]:
terms.shape

(2432537,)

In [240]:
terms[7000:7020]

array(['02 96', '02 96 05', '02 98', '02 98 fully', '02 activism',
       '02 activism lgbt', '02 addition', '02 addition expected',
       '02 adolescents', '02 adolescents alcohol', '02 affectional',
       '02 affectional expression', '02 age', '02 age start', '02 also',
       '02 also associated', '02 although', '02 although incidence',
       '02 although intended', '02 among'], dtype=object)

## Non-negative Matrix Factorization (NMF)

#### Applications of NMF


- Topic Modeling (our problem!)

<img src="images/nmf_doc.png" alt="NMF on documents" style="width: 80%"/>



### NMF from sklearn

We will use [scikit-learn's implementation of NMF](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html):

In [241]:
m,n=vectors.shape
d=10  # num topics

In [260]:
# max_iter=1 runs a single solver iteration (fast, but not converged); raise it for a better fit
clf = decomposition.NMF(n_components=d, init='nndsvda', max_iter=1, verbose=0, random_state=3425)

In [261]:
W1 = clf.fit_transform(vectors)
H1 = clf.components_
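
The factorization can be pictured on a small matrix. The sketch below factors a made-up 4 x 5 count matrix `V` into `W_toy` (documents x topics) and `H_toy` (topics x terms) so that `V` is approximately `W_toy @ H_toy`, with all entries non-negative. The matrix, the variable names, and `max_iter=500` are illustrative assumptions rather than this notebook's settings.

```python
import numpy as np
from sklearn.decomposition import NMF

# made-up document-term counts: rows 0-1 use terms 0-2, rows 2-3 use terms 3-4
V = np.array([[3, 2, 1, 0, 0],
              [2, 3, 1, 0, 0],
              [0, 0, 0, 4, 2],
              [0, 0, 0, 2, 4]], dtype=float)

toy_nmf = NMF(n_components=2, init='nndsvda', max_iter=500, random_state=0)
W_toy = toy_nmf.fit_transform(V)  # shape (4, 2): topic weights per document
H_toy = toy_nmf.components_       # shape (2, 5): term weights per topic

print(W_toy.shape, H_toy.shape)
print(np.round(W_toy @ H_toy, 1))  # approximate reconstruction of V
```

In the notebook itself, `W1` plays the role of `W_toy` and `H1` the role of `H_toy`, with `d` topics instead of 2.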



In [244]:
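num_top_words = 10

def show_topics(a):
    # for each topic row, take the indices of the num_top_words largest weights, largest first
    top_words = lambda t: [terms[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = [top_words(t) for t in a]
    # join each topic's top terms into one space-separated string
    return [' '.join(t) for t in topic_words]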
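
The slice `[:-num_top_words-1:-1]` can look cryptic. `np.argsort` returns indices in ascending order of weight, and the negative-step slice walks that array backwards from the end, stopping after `num_top_words` entries. A small sketch with made-up weights:

```python
import numpy as np

weights = np.array([0.1, 0.9, 0.3, 0.7])  # made-up term weights for one topic
k = 2                                      # plays the role of num_top_words

ascending = np.argsort(weights)            # indices from smallest to largest weight
top_k = ascending[:-k-1:-1]                # last k indices, largest weight first

print(ascending)  # [0 2 3 1]
print(top_k)      # [1 3]
```

The same result can be written as `ascending[::-1][:k]`, which some readers find easier to follow.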

In [245]:
show_topics(H1)

['sexual behavior sexual behavior orientation sexual orientation risk behaviors minority sexual minority sex',
 'hiv msm testing risk among prep prevention infection hiv testing study',
 'gender transgender identity gender identity patients sex study female male individuals',
 'de la en que el las los se con del',
 'health mental mental health care transgender health care among minority outcomes based',
 'men sex msm sex men men sex men sex men gay among anal gay men',
 '95 ci 95 ci aor among associated prevalence years study participants',
 'use risk alcohol prep substance among substance use drug associated alcohol use',
 'study gay social self attitudes rights participants research record database',
 'women heterosexual study men transgender women lesbian transgender bisexual differences compared']

### Calculate dominant topic with NMF

### Calculate dominant_topic directly, skipping the Topic0-9 columns

In [246]:
# colnames = ["Topic" + str(i) for i in range(clf.n_components)]
# docnames = df['AN']
# df_doc_topic = pd.DataFrame(np.round(W1, 2), columns=colnames, index=docnames)
df_doc_topic = pd.DataFrame()
significant_topic = np.argmax(W1, axis=1)
df_doc_topic['new_dominant_topic'] = significant_topic
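
`np.argmax(W1, axis=1)` picks, for each document (row), the column index of its largest topic weight. A minimal sketch with made-up weights follows. Note that `np.argmax` returns the first maximum, so a row of all zeros is assigned topic 0.

```python
import numpy as np

# made-up document-topic weights: 3 documents, 3 topics
W_demo = np.array([[0.1, 0.8, 0.0],
                   [0.5, 0.2, 0.3],
                   [0.0, 0.0, 0.0]])  # a document with no weight on any topic

dominant = np.argmax(W_demo, axis=1)  # one topic index per document
print(dominant)  # [1 0 0]
```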

In [247]:
df_doc_topic

Unnamed: 0,new_dominant_topic
0,4
1,2
2,2
3,0
4,5
...,...
10089,7
10090,8
10091,1
10092,8


In [248]:
df['Topic'] = significant_topic

In [249]:
df

Unnamed: 0,AN,Seed,TAB,Topic
0,1783,,"""Anywhere but here"": Querying spatial stigma a...",4
1,4672,,"""Are We Safe Analysts?"" Cisgender Countertrans...",2
2,4444,,"""As a Trans Person You Don't Live. You Merely ...",2
3,22,,"""At Your Service"": Sexual Harassment of Female...",0
4,2808,,"""Bareback"" pornography consumption and safe-se...",5
...,...,...,...,...
10089,9737,,‘Our favourite drug’: Prevalence of use an...,7
10090,7231,,‘That’s not the kind of church we are’: ...,8
10091,9145,,‘The own’ and ‘the wise’ as social sup...,1
10092,9104,,‘Totally straight’: Contested sexual ident...,8


### Looking for top terms

In [250]:
# order_centroids = model.cluster_centers_.argsort()[:, ::-1]
order_centroids = H1.argsort()[:, ::-1]
# terms = vectorizer.get_feature_names_out()
terms = np.array(vectorizer.get_feature_names_out())

In [251]:
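rows = []
for i in range(d):
    # number of documents whose dominant topic is i
    num_obs = df[df.Topic == i].shape[0]
    # ten highest-weighted terms for topic i
    top_ten_words = [terms[ind] for ind in order_centroids[i, :10]]
    rows.append({'TopicIndex': i,
                 'NumObs': num_obs,
                 'TopicKeyWords': top_ten_words})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
df2 = pd.DataFrame(rows)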

In [252]:
df2

Unnamed: 0,TopicIndex,NumObs,TopicKeyWords
0,0.0,1170.0,"[sexual, behavior, sexual behavior, orientatio..."
1,1.0,1288.0,"[hiv, msm, testing, risk, among, prep, prevent..."
2,2.0,1543.0,"[gender, transgender, identity, gender identit..."
3,3.0,90.0,"[de, la, en, que, el, las, los, se, con, del]"
4,4.0,840.0,"[health, mental, mental health, care, transgen..."
5,5.0,961.0,"[men, sex, msm, sex men, men sex, men sex men,..."
6,6.0,389.0,"[95, ci, 95 ci, aor, among, associated, preval..."
7,7.0,802.0,"[use, risk, alcohol, prep, substance, among, s..."
8,8.0,2224.0,"[study, gay, social, self, attitudes, rights, ..."
9,9.0,787.0,"[women, heterosexual, study, men, transgender ..."


### Concatenate df and df2 into the final result

In [253]:
result = pd.concat([df, df2], axis=1)

In [254]:
result

Unnamed: 0,AN,Seed,TAB,Topic,TopicIndex,NumObs,TopicKeyWords
0,1783,,"""Anywhere but here"": Querying spatial stigma a...",4,0.0,1170.0,"[sexual, behavior, sexual behavior, orientatio..."
1,4672,,"""Are We Safe Analysts?"" Cisgender Countertrans...",2,1.0,1288.0,"[hiv, msm, testing, risk, among, prep, prevent..."
2,4444,,"""As a Trans Person You Don't Live. You Merely ...",2,2.0,1543.0,"[gender, transgender, identity, gender identit..."
3,22,,"""At Your Service"": Sexual Harassment of Female...",0,3.0,90.0,"[de, la, en, que, el, las, los, se, con, del]"
4,2808,,"""Bareback"" pornography consumption and safe-se...",5,4.0,840.0,"[health, mental, mental health, care, transgen..."
...,...,...,...,...,...,...,...
10089,9737,,‘Our favourite drug’: Prevalence of use an...,7,,,
10090,7231,,‘That’s not the kind of church we are’: ...,8,,,
10091,9145,,‘The own’ and ‘the wise’ as social sup...,1,,,
10092,9104,,‘Totally straight’: Contested sexual ident...,8,,,


In [255]:
result.head(10)

Unnamed: 0,AN,Seed,TAB,Topic,TopicIndex,NumObs,TopicKeyWords
0,1783,,"""Anywhere but here"": Querying spatial stigma a...",4,0.0,1170.0,"[sexual, behavior, sexual behavior, orientatio..."
1,4672,,"""Are We Safe Analysts?"" Cisgender Countertrans...",2,1.0,1288.0,"[hiv, msm, testing, risk, among, prep, prevent..."
2,4444,,"""As a Trans Person You Don't Live. You Merely ...",2,2.0,1543.0,"[gender, transgender, identity, gender identit..."
3,22,,"""At Your Service"": Sexual Harassment of Female...",0,3.0,90.0,"[de, la, en, que, el, las, los, se, con, del]"
4,2808,,"""Bareback"" pornography consumption and safe-se...",5,4.0,840.0,"[health, mental, mental health, care, transgen..."
5,5589,,"""Be a Man"": The Role of Social Pressure in Eli...",5,5.0,961.0,"[men, sex, msm, sex men, men sex, men sex men,..."
6,9994,,"""Because your dysphoria gets in the way of you...",0,6.0,389.0,"[95, ci, 95 ci, aor, among, associated, preval..."
7,3165,,"""But the moment they find out that you are MSM...",1,7.0,802.0,"[use, risk, alcohol, prep, substance, among, s..."
8,3946,,"""Diagnosing"" Gender? Categorizing Gender-Ident...",2,8.0,2224.0,"[study, gay, social, self, attitudes, rights, ..."
9,5460,,"""Feeling Safe, Feeling Seen, Feeling Free"": Co...",4,9.0,787.0,"[women, heterosexual, study, men, transgender ..."
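
Because `pd.concat(..., axis=1)` aligns the two frames by row index, the `TopicIndex`, `NumObs`, and `TopicKeyWords` columns in rows 0 to 9 above summarize topics 0 to 9. They do not describe the document in the same row (row 0 has `Topic` 4 but `TopicIndex` 0.0). If the goal is instead to attach each document's own topic keywords, one option is a left merge on the topic number. The sketch below uses small made-up frames (`docs`, `topics`) in place of `df` and `df2`; it is an alternative layout, not the method used in this notebook.

```python
import pandas as pd

# made-up stand-ins for df (documents) and df2 (per-topic summary)
docs = pd.DataFrame({'AN': [101, 102, 103],
                     'Topic': [1, 0, 1]})
topics = pd.DataFrame({'TopicIndex': [0, 1],
                       'NumObs': [1, 2],
                       'TopicKeyWords': [['gender', 'identity'], ['health', 'care']]})

# a left merge keeps every document and looks up the summary row for its topic
merged = docs.merge(topics, left_on='Topic', right_on='TopicIndex', how='left')
print(merged[['AN', 'Topic', 'NumObs', 'TopicKeyWords']])
```

With this layout every row carries the keywords of its own dominant topic, at the cost of repeating the per-topic summary across documents.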
