# Topic Modeling with Single NMF Model
Solution: fit a single NMF model and output each document's dominant topic together with the top terms of every topic
Source: seed_sample_2000_input.csv

## Source


<img src="images/source-input-seed-2000.png" alt="term-document matrix" style="width: 80%"/>

### Import the necessary libraries

In [349]:
from flask import Flask, request, redirect, render_template, Response, send_file, send_from_directory
from flask_wtf import FlaskForm
from wtforms import SelectMultipleField
from flask_bootstrap import Bootstrap
from os.path import join, dirname, realpath
import pandas as pd
# from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer  # add CountVectorizer for NMF
from sklearn.cluster import KMeans
from datetime import datetime
import io, os
import numpy as np

### Import libraries for stop words and stemming

In [350]:
from sklearn import decomposition
import matplotlib.pyplot as plt
import re
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split

In [351]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/shouqiangye/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Set DF from csv file
Source 1: <a href="https://github.com/UNCWellington/AI-tools/blob/main/seed_sample_2000_input.csv?raw=true">seed_sample_2000_input.csv</a> (only 2000 rows)

Source 2: <a href="https://github.com/UNCWellington/AI-tools/blob/main/Sample-for-SC.csv?raw=true">Sample-for-SC.csv</a> (nearly 10,000 rows)

Note: append `?raw=true` to the end of the GitHub link so that `pd.read_csv` downloads the raw CSV file instead of the HTML page.

In [352]:
url = 'https://github.com/UNCWellington/AI-tools/blob/main/seed_sample_2000_input.csv?raw=true'
# url = 'https://github.com/UNCWellington/AI-tools/blob/main/Sample-for-SC.csv?raw=true'
# df = pd.read_csv(url,index_col=0)
df = pd.read_csv(url, keep_default_na=False)

In [353]:
df.head()

Unnamed: 0,AN,TAB,Seed
0,1000,Urban Growth Dynamics and Changing Land-Use La...,
1,1001,Reduction in exposure to arsenic from drinking...,1.0
2,1002,Optimization of phenol adsorption onto biochar...,
3,1003,Long-term trends in hydrochemistry in the Para...,
4,1004,Application research of crosshole electromagne...,


## Stop words, stemming, lemmatization

In [354]:
stemmer = nltk.stem.SnowballStemmer('english')
nltk.download('stopwords')
stopwords = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/shouqiangye/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Data Processing

Next, we use scikit-learn's `CountVectorizer` to extract the word counts of every document.

In [355]:
documents = df['TAB'].values.astype("U")

In [356]:
documents

array(['Urban Growth Dynamics and Changing Land-Use Land-Cover of Megacity Kolkata and Its Environs Spatio-temporal land-use land-cover changes have a long-term impact on urban environments. The present study is based on land-use land-cover changes and urban expansion of megacity Kolkata and its environs over three decades (1991-2018) using multitemporal Landsat data. The study aims to explore and explain the spatio-temporal land-use land-cover change, areal differentiation, spatio-temporal urban growth trajectory and future land-use land-cover prediction with population projection. The spatio-temporal representation found rapid urbanization, i.e. 19% to 57%, exactly three times as in 1991, resulting in significant loss of other than urban/built-up area. Urban trajectory reveals that the expansion mainly occurred in north-east to south-west direction, the zone of both sides of River Hooghly. Areal differentiation map with highest urbanization (3146 ha or UII = 0.64) was identified in t

In [357]:
# vectorizer = TfidfVectorizer(stop_words=stopwords, ngram_range=(1, 2))
#features = vectorizer.fit_transform(documents)

In [358]:
vectorizer = CountVectorizer(stop_words='english') 

In [359]:
vectors = vectorizer.fit_transform(documents).toarray()  # dense ndarray (documents, vocab); recent scikit-learn rejects the np.matrix returned by .todense()
vectors.shape #, vectors.nnz / vectors.shape[0], row_means.shape

(1999, 26513)

In [360]:
print(len(documents), vectors.shape)

1999 (1999, 26513)


In [361]:
# vocab = np.array(vectorizer.get_feature_names_out())
# we change vocab variable to terms
terms = np.array(vectorizer.get_feature_names_out())

In [363]:
terms.shape

(26513,)

In [364]:
terms[7000:7020]

array(['cvds', 'cvocs', 'cwc', 'cwt', 'cwwtp', 'cyanidation', 'cyanide',
       'cyanides', 'cyanobacteria', 'cyanobacterial', 'cyanohab',
       'cyanohabs', 'cyanophyta', 'cyanoprokaryotic', 'cyanotoxin',
       'cyanotoxins', 'cycle', 'cycles', 'cyclic', 'cyclical'],
      dtype=object)

## Non-negative Matrix Factorization (NMF)

#### Applications of NMF


- Topic Modeling (our problem!)

<img src="images/nmf_doc.png" alt="NMF on documents" style="width: 80%"/>



### NMF from sklearn

We will use [scikit-learn's implementation of NMF](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html):

In [365]:
m,n=vectors.shape
d=10  # num topics

In [366]:
clf = decomposition.NMF(n_components=d, random_state=3425)

In [367]:
W1 = clf.fit_transform(vectors)
H1 = clf.components_
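NMF approximates the non-negative count matrix as a product of two smaller non-negative matrices: `W` (documents × topics) and `H` (topics × terms). The sketch below checks these shapes on a small random matrix; the data, the `init` and `max_iter` settings, and the variable names are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.integers(0, 5, size=(6, 8)).astype(float)  # toy counts: 6 documents, 8 terms

toy_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = toy_nmf.fit_transform(V)  # (6, 2): weight of each topic in each document
H = toy_nmf.components_       # (2, 8): weight of each term in each topic

# Relative Frobenius error of the approximation V ≈ W @ H
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(W.shape, H.shape, round(rel_err, 3))
```

In the notebook, `W1` plays the role of `W` with shape `(1999, 10)` and `H1` the role of `H` with shape `(10, 26513)`.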



In [368]:
num_top_words=10
def show_topics(a):
    top_words = lambda t: [terms[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = ([top_words(t) for t in a])
    return [' '.join(t) for t in topic_words]

In [369]:
show_topics(H1)

['water drinking supply quality irrigation use nutrient management treatment resources',
 'groundwater area quality study samples aquifer irrigation wells concentrations shallow',
 'fluid deposits gold hydrothermal mineralization type ore stage deposit similar',
 'arsenic exposure drinking skin concentrations bangladesh associated urinary levels study',
 'soil soils organic nutrient crop increased plant content carbon nutrients',
 'study model using used data results energy based method gas',
 'health risk children drinking study human exposure 95 factors assessment',
 'mg samples heavy concentration metals concentrations kg adsorption cr pb',
 'river quality pollution basin water china total analysis source watershed',
 'land use area urban change cover forest watershed study areas']
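The slice `[:-num_top_words-1:-1]` in `show_topics` can be hard to read. `np.argsort` sorts in ascending order, so walking backwards from the end yields the indices of the largest weights first. A small sketch with made-up weights shows that it is equivalent to reversing first and then taking the first `k` entries:

```python
import numpy as np

weights = np.array([0.1, 0.7, 0.0, 0.4, 0.9])  # toy topic-term weights
k = 3

top_by_slice = np.argsort(weights)[:-k - 1:-1]   # form used in show_topics
top_by_reverse = np.argsort(weights)[::-1][:k]   # reverse, then keep the first k

print(top_by_slice)    # indices of the k largest weights, largest first
print(top_by_reverse)
```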

### Calculate the dominant topic with NMF

In [370]:
colnames = ["Topic" + str(i) for i in range(clf.n_components)]
# docnames = ["Doc" + str(i) for i in range(len(documents))]
docnames = df['AN']
df_doc_topic = pd.DataFrame(np.round(W1, 2), columns=colnames, index=docnames)
significant_topic = np.argmax(df_doc_topic.values, axis=1)
df_doc_topic['dominant_topic'] = significant_topic

In [371]:
df_doc_topic

AN,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,dominant_topic
1000,0.09,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,2.20,9
1001,0.16,0.08,0.00,0.30,0.00,0.09,0.22,0.00,0.00,0.11,3
1002,0.05,0.00,0.00,0.00,0.03,0.36,0.00,0.18,0.00,0.00,5
1003,0.03,0.00,0.03,0.02,0.03,0.11,0.01,0.14,0.27,0.18,8
1004,0.05,0.02,0.03,0.00,0.00,0.49,0.00,0.00,0.00,0.00,5
...,...,...,...,...,...,...,...,...,...,...,...
2994,0.00,0.00,1.27,0.00,0.00,0.00,0.00,0.02,0.00,0.00,2
2995,0.03,0.00,0.00,0.56,0.02,0.13,0.12,0.02,0.00,0.00,3
2996,0.04,0.00,0.04,0.00,0.04,0.42,0.00,0.02,0.00,0.03,5
2997,0.00,0.00,0.02,0.01,0.01,0.22,0.00,0.13,0.00,0.03,5
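The NMF weights in each row are not probabilities and do not sum to 1, so they are hard to compare across documents. One option, sketched below on made-up numbers, is to divide each row by its sum to obtain topic shares. Note also that `np.argmax` returns the first index on ties, so taking the argmax of rounded weights (as above) can pick a different topic than the unrounded weights would. The array `W_toy` and the variable names are assumptions for illustration.

```python
import numpy as np

W_toy = np.array([
    [0.09, 0.00, 2.20],
    [0.16, 0.30, 0.22],
    [0.00, 0.00, 0.00],  # a document with no weight on any topic
])

dominant = W_toy.argmax(axis=1)  # index of the largest weight per row

# Row-normalise, leaving all-zero rows at zero instead of dividing by zero
row_sums = W_toy.sum(axis=1, keepdims=True)
shares = np.divide(W_toy, row_sums, out=np.zeros_like(W_toy), where=row_sums > 0)

print(dominant)
print(shares.round(2))
```

An all-zero row still receives topic 0 from `argmax`, so such documents may deserve a separate check.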


In [372]:
df['Topic'] = significant_topic

In [373]:
df

Unnamed: 0,AN,TAB,Seed,Topic
0,1000,Urban Growth Dynamics and Changing Land-Use La...,,9
1,1001,Reduction in exposure to arsenic from drinking...,1,3
2,1002,Optimization of phenol adsorption onto biochar...,,5
3,1003,Long-term trends in hydrochemistry in the Para...,,8
4,1004,Application research of crosshole electromagne...,,5
...,...,...,...,...
1994,2994,"Geology, fluid inclusions, and H-O-S-Pb isotop...",,2
1995,2995,Sex-Specific Associations between One-Carbon M...,,3
1996,2996,A Quantitative Process-Based Inventory Study o...,,5
1997,2997,Treatment of Low Biodegradability Leachates in...,,5


### Looking for top terms

In [374]:
# order_centroids = model.cluster_centers_.argsort()[:, ::-1]
order_centroids = H1.argsort()[:, ::-1]
# terms = vectorizer.get_feature_names_out()
terms = np.array(vectorizer.get_feature_names_out())

In [375]:
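# DataFrame.append was removed in pandas 2.0, so collect one row per topic
# and build the summary frame in a single call.
rows = []
for i in range(d):
    num_obs = df[df.Topic == i].shape[0]  # documents whose dominant topic is i
    top_ten_words = [terms[ind] for ind in order_centroids[i, :10]]
    rows.append({'TopicIndex': i,
                 'NumObs': num_obs,
                 'TopicKeyWords': top_ten_words})

df2 = pd.DataFrame(rows)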

In [376]:
df2

Unnamed: 0,TopicIndex,NumObs,TopicKeyWords
0,0.0,361.0,"[water, drinking, supply, quality, irrigation,..."
1,1.0,158.0,"[groundwater, area, quality, study, samples, a..."
2,2.0,104.0,"[fluid, deposits, gold, hydrothermal, minerali..."
3,3.0,138.0,"[arsenic, exposure, drinking, skin, concentrat..."
4,4.0,100.0,"[soil, soils, organic, nutrient, crop, increas..."
5,5.0,485.0,"[study, model, using, used, data, results, ene..."
6,6.0,156.0,"[health, risk, children, drinking, study, huma..."
7,7.0,314.0,"[mg, samples, heavy, concentration, metals, co..."
8,8.0,109.0,"[river, quality, pollution, basin, water, chin..."
9,9.0,74.0,"[land, use, area, urban, change, cover, forest..."
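The `NumObs` column counts how many documents have each topic as their dominant topic. As a cross-check, the same counts can be obtained without a loop using `value_counts`; the sketch below uses a short made-up `Topic` series, and `reindex` is added so that topics with no documents still appear with a count of 0.

```python
import pandas as pd

topic_toy = pd.Series([9, 3, 5, 8, 5, 5, 3], name="Topic")  # toy dominant topics
n_topics = 10

counts = (
    topic_toy.value_counts()
    .reindex(range(n_topics), fill_value=0)  # include topics that never dominate
)
print(counts)
```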


### Concatenate df and df2 into the final result

In [377]:
result = pd.concat([df, df2], axis=1)

In [378]:
result

Unnamed: 0,AN,TAB,Seed,Topic,TopicIndex,NumObs,TopicKeyWords
0,1000,Urban Growth Dynamics and Changing Land-Use La...,,9,0.0,361.0,"[water, drinking, supply, quality, irrigation,..."
1,1001,Reduction in exposure to arsenic from drinking...,1,3,1.0,158.0,"[groundwater, area, quality, study, samples, a..."
2,1002,Optimization of phenol adsorption onto biochar...,,5,2.0,104.0,"[fluid, deposits, gold, hydrothermal, minerali..."
3,1003,Long-term trends in hydrochemistry in the Para...,,8,3.0,138.0,"[arsenic, exposure, drinking, skin, concentrat..."
4,1004,Application research of crosshole electromagne...,,5,4.0,100.0,"[soil, soils, organic, nutrient, crop, increas..."
...,...,...,...,...,...,...,...
1994,2994,"Geology, fluid inclusions, and H-O-S-Pb isotop...",,2,,,
1995,2995,Sex-Specific Associations between One-Carbon M...,,3,,,
1996,2996,A Quantitative Process-Based Inventory Study o...,,5,,,
1997,2997,Treatment of Low Biodegradability Leachates in...,,5,,,


In [379]:
result.head(10)

Unnamed: 0,AN,TAB,Seed,Topic,TopicIndex,NumObs,TopicKeyWords
0,1000,Urban Growth Dynamics and Changing Land-Use La...,,9,0.0,361.0,"[water, drinking, supply, quality, irrigation,..."
1,1001,Reduction in exposure to arsenic from drinking...,1.0,3,1.0,158.0,"[groundwater, area, quality, study, samples, a..."
2,1002,Optimization of phenol adsorption onto biochar...,,5,2.0,104.0,"[fluid, deposits, gold, hydrothermal, minerali..."
3,1003,Long-term trends in hydrochemistry in the Para...,,8,3.0,138.0,"[arsenic, exposure, drinking, skin, concentrat..."
4,1004,Application research of crosshole electromagne...,,5,4.0,100.0,"[soil, soils, organic, nutrient, crop, increas..."
5,1005,An environmentally-friendly integrated seismic...,,5,5.0,485.0,"[study, model, using, used, data, results, ene..."
6,1006,Risk of arsenic-related skin lesions in Bangla...,1.0,3,6.0,156.0,"[health, risk, children, drinking, study, huma..."
7,1007,Proposal of an irrigation water quality index ...,,0,7.0,314.0,"[mg, samples, heavy, concentration, metals, co..."
8,1008,Spatial variation and quantitative screening l...,,6,8.0,109.0,"[river, quality, pollution, basin, water, chin..."
9,1009,"Removal of hazardous dyes, toxic metal ions an...",,0,9.0,74.0,"[land, use, area, urban, change, cover, forest..."
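Note that `pd.concat(..., axis=1)` aligns the two frames on their row index, so the topic summary only fills rows 0 to 9 and is not tied to each document's own topic: document 1000 has `Topic` 9 but sits next to the keywords of `TopicIndex` 0. If the goal is to attach to every document the keywords of its dominant topic, a merge on the topic id is one option. The following is a minimal sketch with toy frames (the data and the names `docs`, `topics`, and `merged` are assumptions), not the notebook's original method.

```python
import pandas as pd

docs = pd.DataFrame({"AN": [1000, 1001, 1002], "Topic": [2, 0, 2]})
topics = pd.DataFrame({
    "TopicIndex": [0, 1, 2],
    "NumObs": [1, 0, 2],
    "TopicKeyWords": [["water", "drinking"], ["soil", "crop"], ["land", "urban"]],
})

# A left join keeps every document and looks up the row of its own topic
merged = docs.merge(topics, left_on="Topic", right_on="TopicIndex", how="left")
print(merged)
```

Applied to the notebook's frames, the equivalent call would be `df.merge(df2, left_on='Topic', right_on='TopicIndex', how='left')`.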
