# Topic modeling using Latent Semantic Analysis

### Demonstrating NLP applications - Akash G

137 articles have been mined from - **Times of India, Indian Express, The Economist, The Guardian** - with the central topic of discussion regarding **China**. Our attempt here is to find out the topics that can categorize these documents/articles, with the help of soft-clustering methods like **Latent Semantic Analysis**. This method essentially utilizes the **SVD** process and clusters the documents based on their vectorized positions in a dimensionally reduced space.

&nbsp;



In [1]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
from sklearn.metrics.pairwise import cosine_similarity
import math
import spacy
import re

In [4]:
df = pd.read_csv("/Users/Akashgupta/Desktop/Rnlp/rpd/INDEXP.csv")
df2 = pd.read_csv("/Users/Akashgupta/Desktop/Rnlp/rpd/ECONOMIST.csv")
df3 = pd.read_csv("/Users/Akashgupta/Desktop/Rnlp/rpd/TIMESOFIND.csv")
df4 = pd.read_csv("/Users/Akashgupta/Desktop/Rnlp/rpd/GUARD.csv")

In [5]:
frames = [df,df2,df3,df4]
data = pd.concat(frames)

&nbsp;

**Below is the concatanated Dataframe consisting of articles from all the news sources put together.**

&nbsp;

In [6]:
datanew = df.append([df2,df3,df4], ignore_index = True)
datanew

Unnamed: 0,title,text
0,ie_xi,"Over the last six months, in the shadow of COV..."
1,ie_xi,In a widely noted and strongly criticised spee...
2,ie_xi,The Iranian government’s recent approval of a ...
3,ie_xi,Reports that Iran and China are close to concl...
4,ie_xi,The recent re-emergence of terms like “Malabar...
...,...,...
132,gu_xi,The sight of young people anywhere being bruta...
133,gu_xi,"Hong Kong is not yet cowed. That was, unquesti..."
134,gu_xi,Strong national defence is the consequence of ...
135,gu_xi,When Mike Pompeo spoke at the online Copenhage...


&nbsp;

**Below is a regular expression code block to filter out non-textual characters from the article body**

&nbsp;

In [7]:
datanew['clean_text'] = datanew['text'].str.replace("[^a-zA-Z#]"," ")

&nbsp;

**The following code block is for tokenizing the document into words.**

&nbsp;

In [8]:
stop_words = nltk.corpus.stopwords.words('english')
tokenized_doc = datanew['clean_text'].apply(lambda x: x.split())
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if x not in stop_words])
print(tokenized_doc)

0      [Over, the, last, six, months, in, the, shadow...
1      [In, a, widely, noted, and, strongly, criticis...
2      [The, Iranian, government, s, recent, approval...
3      [Reports, that, Iran, and, China, are, close, ...
4      [The, recent, re, emergence, of, terms, like, ...
                             ...                        
132    [The, sight, of, young, people, anywhere, bein...
133    [Hong, Kong, is, not, yet, cowed, That, was, u...
134    [Strong, national, defence, is, the, consequen...
135    [When, Mike, Pompeo, spoke, at, the, online, C...
136    [The, Chinese, assault, on, Indian, troops, ne...
Name: clean_text, Length: 137, dtype: object


In [9]:
de_tokenized_doc = []
for i in range(len(tokenized_doc)):
    t = ' '.join(tokenized_doc[i])
    de_tokenized_doc.append(t)

&nbsp;

**Here a term-document matrix is created by essentially vectorizing the documents by using TFIDF weighting.**

&nbsp;

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words = stop_words, use_idf = True, ngram_range = (1,3))
X = vectorizer.fit_transform(de_tokenized_doc)
print(X.shape)

(137, 151212)


In [11]:
terms = vectorizer.get_feature_names()

&nbsp;

**SVD is applied and 10 principal components are selected which essentially represent our 10 topics. The top 10 keywords are then listed according to each topic**

&nbsp;

In [15]:
from sklearn.utils.extmath import randomized_svd

U, sigma, VT = randomized_svd(X, n_components = 10, n_iter = 100, random_state = 122)

for i, com in enumerate(VT):
    terms_com = zip(terms, com)
    sorted_terms = sorted(terms_com, key = lambda x:x[1],reverse = True)[:20]
    print("topic"+str(i+1)+": ")
    print("========")
    for t in sorted_terms:
        print(t[0])
    print(" ")

topic1: 
china
india
chinese
us
indian
xi
world
beijing
military
sea
south
mr
would
iran
war
countries
china sea
strategic
opinion
delhi
 
topic2: 
iran
india
afghanistan
iran china
chabahar
tehran
iranian
indian
strategic
deep strategic
deep strategic partnership
strategic partnership
partnership
link
port
delhi
investments
rail
deep
chinese investments
 
topic3: 
iran
human rights
afghanistan
rights
sanctions
iran china
human
tehran
chabahar
different
nuclear
iranian
deep strategic
deep strategic partnership
cold
trump
argument based
arms race
classroom
uighurs
 
topic4: 
india
indian
pla
lac
ladakh
different
india china
war
opinion
border
modi
boundary
cold
military
argument based
arms race
classroom
galwan
movements
kashmir
 
topic5: 
sea
china sea
adiz
south china sea
south china
south
china
us
scs
asean
claims
islands
maritime
waters
fishing
australia
pacific
east
statement
indo pacific
 
topic6: 
hong
hong kong
kong
british
law
britain
beijing
us
uk
lam
mrs
new law
democracy
mai

&nbsp;

We can see all the articles written about **China** can be grouped into 10 topics or categories as follows :- 

- **Topic 1** : These are articles that primarily focus on Indo-China geopolitical situations. These set of articles also have references to the US.

- **Topic 2** : Articles grouped into this topic revolve around discussions involving **Iran, Afghanistan, Tehran, Chabahar**. They largely focus on how India's policy towards China is also influenced by other factors, particularly Iran and Afghanistan.

- **Topic 3** : These articles also tend to focus on geopolitical situations that connect - **India, China, Iran, Afghanistan** but also tend to have a focus on **human rights** aspects as well. These probably refer to China's treatment of the Uiyghur Muslims.

- **Topic 4** : Articles in this category are almost exclusively centered around the **India-China border problems** in **Ladakh**. 

- **Topic 5** : These articles focus on the **South China Sea** situation and how China is increasingly exerting its influence in this region. Prominent international groupings like **ASEAN** also find mention in these articles.

- **Topic 6** : These articles focus on the **Hong Kong NSL** recently enacted by the Chinese government. These articles shed light on geopolitical relations between **Britain and China**.

- **Topic 7** : These articles primarily report military action by China's **PLA** in the **ADIZ South China Sea** region.

- **Topic 8** : Articles in this category have focussed on describing **Xi Jinping** and his connections with the ideologies of his predecessors like **Deng and Mao**.

- **Topic 9** : These articles revolve around China's relations with **US and Britain**. It seems like these articles are associated with words like **combat, offensive and divisions** which give an idea about the negative sentiment associated in these geopolitical relations.

- **Topic 10** : These articles highlight China's inhuman treatment of **Uighurs** in the **Xinjiang** province of China and the detention centres located there. 

&nbsp;

**The aim of this analysis is to get an idea of what the world is writing about in the context of China. With a topic analysis we get to know the broad perceptions prevailing in countries when they discuss China**