# Topic Modeling with a Single LDA Model
Solution: fit a single Latent Dirichlet Allocation (LDA) model to extract topics from the documents.  
Source: seed_sample_2000_input.csv

## Source


<img src="images/source-input-seed-2000.png" alt="term-document matrix" style="width: 80%"/>

### Import the necessary libraries

In [4]:
from flask import Flask, request, redirect, render_template, Response, send_file, send_from_directory
from flask_wtf import FlaskForm
from wtforms import SelectMultipleField
from flask_bootstrap import Bootstrap
from os.path import join, dirname, realpath
import pandas as pd
# from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer  # add CountVectorizer for NMF
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from datetime import datetime
import io, os
import numpy as np

### Import libraries for stop words and stemming

In [5]:
from sklearn import decomposition
import matplotlib.pyplot as plt
import re
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split

In [6]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/shouqiangye/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Load the DataFrame from a CSV file
Source 1: <a href="https://github.com/UNCWellington/AI-tools/blob/main/seed_sample_2000_input.csv?raw=true">seed_sample_2000_input.csv</a> (only 2000 rows)  

Source 2: <a href="https://github.com/UNCWellington/AI-tools/blob/main/Sample-for-SC.csv?raw=true">Sample-for-SC.csv</a> (nearly 10,000 rows)  

Note: append `?raw=true` to the end of the GitHub link so that the URL returns the raw CSV file rather than the HTML page.
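
The note about `?raw=true` can be illustrated with a small helper. This is a minimal sketch: the function name `with_raw_flag` is hypothetical and not part of the notebook; it only adds the query flag to a GitHub "blob" link.

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode


def with_raw_flag(url: str) -> str:
    """Return `url` with raw=true set in its query string (hypothetical helper)."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query["raw"] = "true"  # overwrite or add the flag
    return urlunparse(parts._replace(query=urlencode(query)))


blob_url = "https://github.com/UNCWellington/AI-tools/blob/main/seed_sample_2000_input.csv"
raw_url = with_raw_flag(blob_url)
print(raw_url)
```

The resulting string can then be passed to `pd.read_csv` in the same way as the hard-coded URL in the next cell.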

In [7]:
url = 'https://github.com/UNCWellington/AI-tools/blob/main/seed_sample_2000_input.csv?raw=true'
# url = 'https://github.com/UNCWellington/AI-tools/blob/main/Sample-for-SC.csv?raw=true'
# df = pd.read_csv(url,index_col=0)
df = pd.read_csv(url, keep_default_na=False)

In [8]:
df.head()

Unnamed: 0,AN,TAB,Seed
0,1000,Urban Growth Dynamics and Changing Land-Use La...,
1,1001,Reduction in exposure to arsenic from drinking...,1.0
2,1002,Optimization of phenol adsorption onto biochar...,
3,1003,Long-term trends in hydrochemistry in the Para...,
4,1004,Application research of crosshole electromagne...,


## Stop words, stemming, lemmatization

In [9]:
stemmer = nltk.stem.SnowballStemmer('english')
nltk.download('stopwords')
stopwords = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/shouqiangye/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
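
The `stemmer` created above is not used by the later cells. As a rough sketch of how stemming and stop-word removal could be combined, the function below tokenizes a string, drops stop words, and stems what remains. The name `stem_tokens` and the tiny `demo_stopwords` set are assumptions for illustration; the notebook itself uses the full NLTK stop-word list.

```python
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Tiny illustrative stop-word set (assumption); the notebook uses
# nltk.corpus.stopwords.words('english') instead.
demo_stopwords = {"the", "are", "in", "of", "and"}


def stem_tokens(text):
    """Lowercase, split into alphabetic tokens, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in demo_stopwords]


print(stem_tokens("The rivers are running"))  # ['river', 'run']
```

A callable like this could be supplied through the `tokenizer` argument of `TfidfVectorizer` or `CountVectorizer` if stemmed features are wanted.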


## Data Processing

Next, scikit-learn's vectorizers will extract the term weights for us. First, convert the `TAB` column to an array of Unicode strings:

In [10]:
documents = df['TAB'].values.astype("U")

In [356]:
documents

array(['Urban Growth Dynamics and Changing Land-Use Land-Cover of Megacity Kolkata and Its Environs Spatio-temporal land-use land-cover changes have a long-term impact on urban environments. The present study is based on land-use land-cover changes and urban expansion of megacity Kolkata and its environs over three decades (1991-2018) using multitemporal Landsat data. The study aims to explore and explain the spatio-temporal land-use land-cover change, areal differentiation, spatio-temporal urban growth trajectory and future land-use land-cover prediction with population projection. The spatio-temporal representation found rapid urbanization, i.e. 19% to 57%, exactly three times as in 1991, resulting in significant loss of other than urban/built-up area. Urban trajectory reveals that the expansion mainly occurred in north-east to south-west direction, the zone of both sides of River Hooghly. Areal differentiation map with highest urbanization (3146 ha or UII = 0.64) was identified in t

In [357]:
# vectorizer = TfidfVectorizer(stop_words=stopwords, ngram_range=(1, 2))
#features = vectorizer.fit_transform(documents)

### Generate TF-IDF Features
In this step, we generate the TF-IDF matrix for the documents. The vectorizer also handles preprocessing operations such as tokenization and stop-word removal.

In [None]:
# Initialize regex tokenizer
"""
tokenizer = RegexpTokenizer(r'\w+')

# Vectorize document using TF-IDF
tfidf = TfidfVectorizer(lowercase=True,
                        stop_words='english',
                        ngram_range = (1,1),
                        tokenizer = tokenizer.tokenize)

# Fit and Transform the documents
train_data = tfidf.fit_transform(documents_list)  
"""

In [11]:
# Pass the stop words as a list; newer scikit-learn releases reject a set here
vectorizer = TfidfVectorizer(stop_words=list(stopwords), ngram_range=(1, 3))
features = vectorizer.fit_transform(documents)

### Perform LDA
Scikit-learn offers `LatentDirichletAllocation` for performing LDA on any document-term matrix (DTM). See the example below (it takes approximately 25 minutes on a local machine with 8 GB of RAM):

In [12]:
"""
# Define the number of topics or components
num_components=5

# Create LDA object
model=LatentDirichletAllocation(n_components=num_components)

# Fit and transform the LDA model on the data
lda_matrix = model.fit_transform(train_data)

# Get Components 
lda_components=model.components_
"""

In [13]:
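k = 10       # number of topics (n_components)
length = 1   # max_iter: passes over the data; 1 is fast but unlikely to converge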

In [19]:
# model = KMeans(n_clusters=k, init='k-means++', max_iter=int(length), n_init=1, verbose=0, random_state=3425)
# Create LDA object
model=LatentDirichletAllocation(n_components=k, learning_method="online", max_iter=int(length), random_state=3425)

In [22]:
# model.fit(features)
model.fit_transform(features)

array([[0.00448134, 0.27985556, 0.00448136, ..., 0.00448135, 0.68429361,
        0.00448135],
       [0.00562098, 0.00562091, 0.00562099, ..., 0.00562093, 0.50974521,
        0.00562096],
       [0.00552994, 0.00552989, 0.00552993, ..., 0.00552994, 0.45348797,
        0.50227267],
       ...,
       [0.00599173, 0.00599174, 0.00599175, ..., 0.00599173, 0.94607439,
        0.00599173],
       [0.00527423, 0.00527422, 0.00527422, ..., 0.00527423, 0.95253195,
        0.00527423],
       [0.00554379, 0.00554378, 0.00554378, ..., 0.00554378, 0.95010594,
        0.00554379]])
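
Each row of the array returned by `fit_transform` is one document's topic distribution, so its entries sum to 1. A minimal sketch on a made-up corpus (the documents and the choice of two topics are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration only
toy_docs = [
    "arsenic groundwater drinking water",
    "groundwater arsenic contamination",
    "dye adsorption removal adsorbent",
    "adsorbent dye removal water",
]

X = CountVectorizer().fit_transform(toy_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)  # shape: (n_documents, n_topics)

print(doc_topic.shape)
print(np.allclose(doc_topic.sum(axis=1), 1.0))
```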

In [27]:
# Get Components 
lda_components=model.components_
lda_components

array([[0.1712602 , 0.19918397, 0.19255022, ..., 0.17154456, 0.1779624 ,
        0.17189322],
       [0.17576739, 0.17803867, 0.18975948, ..., 0.18817151, 0.17776708,
        0.19732881],
       [0.19616406, 0.18645249, 0.19928438, ..., 0.18297267, 0.18351629,
        0.1884843 ],
       ...,
       [0.22930171, 0.17722299, 0.17513556, ..., 0.19954192, 0.17143725,
        0.18683724],
       [1.16175455, 0.21597088, 0.21564616, ..., 0.20660812, 0.19862432,
        0.20956879],
       [0.19679141, 0.18898786, 0.19937852, ..., 0.20176634, 0.1832806 ,
        0.19183786]])
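
The rows of `components_` are unnormalized topic-term weights. Dividing each row by its sum turns it into a probability distribution over terms, which is easier to compare across topics. The sketch below uses a small hypothetical matrix in place of `model.components_`.

```python
import numpy as np

# Hypothetical 2-topic, 3-term weight matrix standing in for model.components_
components = np.array([[2.0, 1.0, 1.0],
                       [0.5, 0.5, 4.0]])

# Normalize each row so that the weights of a topic sum to 1
topic_word = components / components.sum(axis=1, keepdims=True)
print(topic_word)
```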

### See the dominant topic in each document

In [53]:
colnames = ["Topic" + str(i) for i in range(k)]
docnames = df['AN']
# Create the document-topic matrix; the model is already fitted, so transform() is enough
data_vectorized = model.transform(features)

df_doc_topic = pd.DataFrame(np.round(data_vectorized, 2), columns=colnames, index=docnames)
significant_topic = np.argmax(df_doc_topic.values, axis=1)
df_doc_topic['dominant_topic'] = significant_topic

In [55]:
df_doc_topic

AN,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,dominant_topic
1000,0.00,0.28,0.00,0.00,0.00,0.00,0.00,0.00,0.68,0.00,8
1001,0.01,0.01,0.01,0.01,0.01,0.01,0.45,0.01,0.51,0.01,8
1002,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.45,0.50,9
1003,0.00,0.00,0.41,0.00,0.00,0.00,0.00,0.00,0.55,0.00,8
1004,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.47,0.49,0.00,8
...,...,...,...,...,...,...,...,...,...,...,...
2994,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.96,0.00,8
2995,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.96,0.00,8
2996,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.95,0.01,8
2997,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.95,0.01,8
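
A quick way to check how evenly the documents spread across topics is to count the values in `dominant_topic`. The summary table later in this notebook shows 1814 documents in topic 8; one possible cause is that the model runs for a single iteration, though that is an assumption rather than something the notebook tests. The sketch below uses a small made-up frame in place of `df_doc_topic`.

```python
import pandas as pd

# Made-up stand-in for df_doc_topic with only the dominant_topic column
toy = pd.DataFrame(
    {"dominant_topic": [8, 8, 9, 8, 7]},
    index=pd.Index([1000, 1001, 1002, 1003, 1004], name="AN"),
)

topic_sizes = toy["dominant_topic"].value_counts().sort_index()  # documents per topic
topic_share = topic_sizes / len(toy)                             # fraction per topic

print(topic_sizes)
print(topic_share)
```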


In [41]:
df['Topic'] = significant_topic

In [48]:
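# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
rows = []
for index, component in enumerate(lda_components):
    # Number of documents whose dominant topic is this one
    subset_df = df[df.Topic == index].shape[0]

    # Pair each term with its weight and keep the 7 highest-weighted terms
    zipped = zip(terms, component)
    top_terms_key = sorted(zipped, key=lambda t: t[1], reverse=True)[:7]
    top_terms_list = [term for term, _ in top_terms_key]
    print("Topic "+str(index)+": ",top_terms_list)
    rows.append({'Cluster Number': index,
                 'Number of Items': subset_df,
                 'Top Cluster Terms': top_terms_list})

# DataFrame.append() was removed in pandas 2.0; build the frame from the collected rows
df2 = pd.DataFrame(rows)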



Topic 0:  ['water', 'climate', 'change', 'soil', 'climate change', 'hydrogel', 'stress']
Topic 1:  ['water', 'toch', 'so2', 'mg', 'cover', 'rock', 'water harvesting']
Topic 2:  ['oil', 'water', 'args', 'cnts', 'paleosols', 'reservoir', 'samples']
Topic 3:  ['arsenic', 'groundwater', 'basin', 'nutrient', 'stress drop', 'intercropping', 'fetal']
Topic 4:  ['arsenic', 'snow', 'water', 'mass', 'arsenic poisoning drinking', 'adaptation', 'acetaldehyde']
Topic 5:  ['nitration', 'forest', 'thinning', 'acrylamide', 'water', 'paddy soils', 'tektites']
Topic 6:  ['arsenic', 'water', 'health', 'community', 'drinking', 'soil', 'sanitation']
Topic 7:  ['adsorption', 'removal', 'dye', 'water', 'adsorbents', 'adsorbent', 'fe']
Topic 8:  ['water', 'groundwater', 'arsenic', 'soil', 'study', 'quality', 'drinking']
Topic 9:  ['dgm', 'plumes', 'water', 'cattle loans', 'cattle', 'bi dao', 'dao']
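
The same top-term lookup can be done without sorting every (term, weight) pair in Python. The sketch below uses `np.argsort` on the weight matrix; the function name `top_terms` and the toy inputs are assumptions for illustration.

```python
import numpy as np


def top_terms(components, terms, n=3):
    """Return the n highest-weighted terms for each topic row (hypothetical helper)."""
    terms = np.asarray(terms)
    # argsort is ascending, so reverse each row and keep the first n indices
    order = np.argsort(components, axis=1)[:, ::-1][:, :n]
    return [terms[row].tolist() for row in order]


toy_terms = ["water", "arsenic", "dye", "soil"]
toy_components = np.array([[0.9, 0.7, 0.1, 0.3],
                           [0.2, 0.1, 0.8, 0.4]])

print(top_terms(toy_components, toy_terms, n=2))  # [['water', 'arsenic'], ['dye', 'soil']]
```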


In [49]:
df2

Unnamed: 0,Cluster Number,Number of Items,Top Cluster Terms
0,0.0,50.0,"[water, climate, change, soil, climate change,..."
1,1.0,10.0,"[water, toch, so2, mg, cover, rock, water harv..."
2,2.0,20.0,"[oil, water, args, cnts, paleosols, reservoir,..."
3,3.0,8.0,"[arsenic, groundwater, basin, nutrient, stress..."
4,4.0,9.0,"[arsenic, snow, water, mass, arsenic poisoning..."
5,5.0,9.0,"[nitration, forest, thinning, acrylamide, wate..."
6,6.0,26.0,"[arsenic, water, health, community, drinking, ..."
7,7.0,46.0,"[adsorption, removal, dye, water, adsorbents, ..."
8,8.0,1814.0,"[water, groundwater, arsenic, soil, study, qua..."
9,9.0,7.0,"[dgm, plumes, water, cattle loans, cattle, bi ..."


### Concatenate df and df2 into the final result

In [50]:
result = pd.concat([df, df2], axis=1)

In [51]:
result

Unnamed: 0,AN,TAB,Seed,Topic,Cluster Number,Number of Items,Top Cluster Terms
0,1000,Urban Growth Dynamics and Changing Land-Use La...,,8,0.0,50.0,"[water, climate, change, soil, climate change,..."
1,1001,Reduction in exposure to arsenic from drinking...,1,8,1.0,10.0,"[water, toch, so2, mg, cover, rock, water harv..."
2,1002,Optimization of phenol adsorption onto biochar...,,9,2.0,20.0,"[oil, water, args, cnts, paleosols, reservoir,..."
3,1003,Long-term trends in hydrochemistry in the Para...,,8,3.0,8.0,"[arsenic, groundwater, basin, nutrient, stress..."
4,1004,Application research of crosshole electromagne...,,8,4.0,9.0,"[arsenic, snow, water, mass, arsenic poisoning..."
...,...,...,...,...,...,...,...
1994,2994,"Geology, fluid inclusions, and H-O-S-Pb isotop...",,8,,,
1995,2995,Sex-Specific Associations between One-Carbon M...,,8,,,
1996,2996,A Quantitative Process-Based Inventory Study o...,,8,,,
1997,2997,Treatment of Low Biodegradability Leachates in...,,8,,,


In [52]:
result.head(10)

Unnamed: 0,AN,TAB,Seed,Topic,Cluster Number,Number of Items,Top Cluster Terms
0,1000,Urban Growth Dynamics and Changing Land-Use La...,,8,0.0,50.0,"[water, climate, change, soil, climate change,..."
1,1001,Reduction in exposure to arsenic from drinking...,1.0,8,1.0,10.0,"[water, toch, so2, mg, cover, rock, water harv..."
2,1002,Optimization of phenol adsorption onto biochar...,,9,2.0,20.0,"[oil, water, args, cnts, paleosols, reservoir,..."
3,1003,Long-term trends in hydrochemistry in the Para...,,8,3.0,8.0,"[arsenic, groundwater, basin, nutrient, stress..."
4,1004,Application research of crosshole electromagne...,,8,4.0,9.0,"[arsenic, snow, water, mass, arsenic poisoning..."
5,1005,An environmentally-friendly integrated seismic...,,8,5.0,9.0,"[nitration, forest, thinning, acrylamide, wate..."
6,1006,Risk of arsenic-related skin lesions in Bangla...,1.0,9,6.0,26.0,"[arsenic, water, health, community, drinking, ..."
7,1007,Proposal of an irrigation water quality index ...,,8,7.0,46.0,"[adsorption, removal, dye, water, adsorbents, ..."
8,1008,Spatial variation and quantitative screening l...,,8,8.0,1814.0,"[water, groundwater, arsenic, soil, study, qua..."
9,1009,"Removal of hazardous dyes, toxic metal ions an...",,8,9.0,7.0,"[dgm, plumes, water, cattle loans, cattle, bi ..."
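
Note that `pd.concat(..., axis=1)` aligns the two frames on their row index, so in the output above row 0 (Topic 8) is paired with the summary for cluster 0. If the goal is for every document to carry the summary of its own topic, a key-based merge is one option. The sketch below uses small made-up frames with the same column names; it is an alternative under that assumption, not the notebook's original method.

```python
import pandas as pd

# Made-up stand-ins for df (documents) and df2 (per-topic summary)
docs = pd.DataFrame({"AN": [1000, 1001, 1002], "Topic": [8, 8, 9]})
summary = pd.DataFrame({
    "Cluster Number": [8, 9],
    "Number of Items": [2, 1],
    "Top Cluster Terms": [["water", "groundwater"], ["dgm", "plumes"]],
})

# Attach each document's own topic summary by matching Topic to Cluster Number
merged = docs.merge(summary, left_on="Topic", right_on="Cluster Number", how="left")
print(merged)
```

With `how="left"`, every document row is kept, and rows whose topic has no summary entry would receive missing values.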
