# Topic Modelling
> Topic modelling is an unsupervised technique for discovering the themes in a given collection of documents. It extracts sets of keywords that tend to occur together, and each such set represents a topic. For example, the words stocks, market, equity, and mutual funds would together represent the topic 'stock investing'.


## Latent Semantic Analysis (LSA)
> Latent Semantic Analysis (LSA) is a technique for extracting topics from a collection of text documents by uncovering the relationships between terms and documents. LSA can also be used for essay scoring: each essay is converted into a matrix of term weights, which is then compared for similarity against the terms of a reference answer.
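
In matrix terms, LSA can be summarized as a truncated singular value decomposition of the document-term matrix. The notation below ($X$, $U_k$, $\Sigma_k$, $V_k$, $m$, $n$, $k$) is introduced here for illustration and does not come from the original notebook:

$$
X \approx U_k \Sigma_k V_k^{\top}, \qquad
X \in \mathbb{R}^{m \times n},\;
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}
$$

Here $m$ is the number of documents, $n$ the number of terms, and $k$ the number of topics. Each row of $V_k^{\top}$ gives the weight of every term for one topic, and each row of $U_k \Sigma_k$ gives the weight of every topic for one document. In scikit-learn's `TruncatedSVD`, these correspond to `components_` and to the array returned by `fit_transform`, respectively.
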
```
The data used are the YouTube comments on a single video
```
- Data representation: TF-IDF
- Task outputs:
  - The weight of each word for each topic
  - The weight of each topic for each document
- Keyword: Capres 2024 (2024 presidential candidates)
- YouTube link: [Click Here](https://www.youtube.com/watch?v=0c3aXvjx3ck)

## Crawling YouTube Comments

In [180]:
import pandas as pd
from googleapiclient.discovery import build

In [181]:
def video_comments(video_id):
    # list collecting one row per comment or reply
    replies = []

    # create the YouTube API client (api_key is a global defined before the call)
    youtube = build('youtube', 'v3', developerKey=api_key)

    # retrieve the first page of comment threads
    video_response = youtube.commentThreads().list(part='snippet,replies', videoId=video_id).execute()

    # iterate over the pages of results
    while video_response:

        # extract the required fields from each comment thread
        for item in video_response['items']:

            # publish time and author of the top-level comment
            published = item['snippet']['topLevelComment']['snippet']['publishedAt']
            user = item['snippet']['topLevelComment']['snippet']['authorDisplayName']

            # text and like count of the top-level comment
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            likeCount = item['snippet']['topLevelComment']['snippet']['likeCount']

            replies.append([published, user, comment, likeCount])

            # number of replies to this comment
            replycount = item['snippet']['totalReplyCount']

            # if the comment has replies
            if replycount>0:
                # iterate over the replies included in the thread
                for reply in item['replies']['comments']:

                    # extract the reply fields
                    published = reply['snippet']['publishedAt']
                    user = reply['snippet']['authorDisplayName']
                    repl = reply['snippet']['textDisplay']
                    likeCount = reply['snippet']['likeCount']

                    # store the reply in the list
                    replies.append([published, user, repl, likeCount])

        # request the next page if there is one, otherwise stop
        if 'nextPageToken' in video_response:
            video_response = youtube.commentThreads().list(
                    part = 'snippet,replies',
                    pageToken = video_response['nextPageToken'], 
                    videoId = video_id
                ).execute()
        else:
            break
    return replies
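
The row-building part of the function can be tried without an API key. The sketch below is only an illustration: `flatten_thread` and `sample_item` are hypothetical names, and the sample item is hand-written to mimic the fields the function reads, not an actual API response.

```python
def flatten_thread(item):
    """Turn one comment thread into [published, user, text, likes] rows."""
    top = item["snippet"]["topLevelComment"]["snippet"]
    rows = [[top["publishedAt"], top["authorDisplayName"],
             top["textDisplay"], top["likeCount"]]]
    # Treat a missing "replies" key as a thread without replies.
    for reply in item.get("replies", {}).get("comments", []):
        s = reply["snippet"]
        rows.append([s["publishedAt"], s["authorDisplayName"],
                     s["textDisplay"], s["likeCount"]])
    return rows


# Hypothetical item, hand-written to mimic the fields used above.
sample_item = {
    "snippet": {
        "totalReplyCount": 1,
        "topLevelComment": {"snippet": {
            "publishedAt": "2023-04-26T13:13:10Z",
            "authorDisplayName": "Nyoimanis",
            "textDisplay": "Kursi itu apa sih?",
            "likeCount": 0,
        }},
    },
    "replies": {"comments": [{"snippet": {
        "publishedAt": "2023-04-26T13:17:42Z",
        "authorDisplayName": "Ravael Chandra",
        "textDisplay": "Tempat duduk",
        "likeCount": 1,
    }}]},
}

rows = flatten_thread(sample_item)
for row in rows:
    print(row)
```

Using `.get` with defaults avoids a `KeyError` when a thread carries no `replies` entry, which is a slightly more defensive choice than checking `totalReplyCount` first.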

In [182]:
api_key = 'YOUR_API_KEY'  # replace with your own YouTube Data API v3 key
# YouTube link: https://www.youtube.com/watch?v=0c3aXvjx3ck
video_id = "0c3aXvjx3ck"  # video ID
comments = video_comments(video_id)
comments

[['2023-05-08T10:18:22Z', 'Dahlan Werang', 'Anies tdk layak', 0],
 ['2023-05-08T09:28:11Z',
  'Johan Supriyadi',
  'klo masih 3 beginih. udah ketauan siapa yang bakal gabungan. tambah lagi 1 jadi 4 biar seru.',
  0],
 ['2023-05-07T14:18:21Z',
  'Koko',
  'Videonya ga pas sama omongan presenternya',
  0],
 ['2023-05-07T10:01:58Z',
  'Sintia Sindi',
  'Indonesia dlm masa krisis segalla bidang mohon dengan hormat untuk tidak lgi bercanda tawa soal prediden Ri',
  0],
 ['2023-05-06T02:59:28Z',
  'Nama kamu',
  'Prabowo Subianto Djojohadikusumo ❤🎉',
  1],
 ['2023-05-04T01:41:10Z', 'barbosa imran', 'Pak anis cerdas terpercaya', 0],
 ['2023-05-04T00:41:06Z',
  'Ahmad Aguscik',
  'Jawa ganjar  &#39; luar jawar anis pilih',
  0],
 ['2023-05-03T15:34:20Z',
  'Dana Kusuma',
  'Msh yakin pd pilpres hanya ada 2 poros....kunci kemenangan ada pd yg didukung penuh Jokowi...karena energi ri 1 msh ada pd jokowi.',
  0],
 ['2023-05-03T14:05:27Z', 'MARPAUNG MEMORY', 'No ganjar', 0],
 ['2023-05-03T13:41:23

In [183]:
df = pd.DataFrame(comments, columns=['timePosted', 'displayName', 'comment', 'likeCount'])
df

Unnamed: 0,timePosted,displayName,comment,likeCount
0,2023-05-08T10:18:22Z,Dahlan Werang,Anies tdk layak,0
1,2023-05-08T09:28:11Z,Johan Supriyadi,klo masih 3 beginih. udah ketauan siapa yang b...,0
2,2023-05-07T14:18:21Z,Koko,Videonya ga pas sama omongan presenternya,0
3,2023-05-07T10:01:58Z,Sintia Sindi,Indonesia dlm masa krisis segalla bidang mohon...,0
4,2023-05-06T02:59:28Z,Nama kamu,Prabowo Subianto Djojohadikusumo ❤🎉,1
...,...,...,...,...
299,2023-04-26T13:13:10Z,Nyoimanis,Kursi itu apa sih?,0
300,2023-04-26T13:20:15Z,mediraja 77,Kursi yg diproleh saat pemilu 2019 lalu.<br><b...,2
301,2023-04-26T13:19:10Z,Rayhan Syamil,googling mas,0
302,2023-04-26T13:17:42Z,Ravael Chandra,Tempat duduk,1


In [184]:
# Drop the likeCount column
del df['likeCount']

In [185]:
df

Unnamed: 0,timePosted,displayName,comment
0,2023-05-08T10:18:22Z,Dahlan Werang,Anies tdk layak
1,2023-05-08T09:28:11Z,Johan Supriyadi,klo masih 3 beginih. udah ketauan siapa yang b...
2,2023-05-07T14:18:21Z,Koko,Videonya ga pas sama omongan presenternya
3,2023-05-07T10:01:58Z,Sintia Sindi,Indonesia dlm masa krisis segalla bidang mohon...
4,2023-05-06T02:59:28Z,Nama kamu,Prabowo Subianto Djojohadikusumo ❤🎉
...,...,...,...
299,2023-04-26T13:13:10Z,Nyoimanis,Kursi itu apa sih?
300,2023-04-26T13:20:15Z,mediraja 77,Kursi yg diproleh saat pemilu 2019 lalu.<br><b...
301,2023-04-26T13:19:10Z,Rayhan Syamil,googling mas
302,2023-04-26T13:17:42Z,Ravael Chandra,Tempat duduk


In [186]:
df.to_csv("capres2024.csv",index=False)

## Preprocessing 

In [187]:
!pip install nltk
!pip install indoNLP
!pip install pydrive

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [188]:
import pandas as pd  # data frames
import numpy as np  # numerical arrays
import re  # regular expressions
import string  # string constants such as string.punctuation
from nltk.tokenize import word_tokenize  # tokenization
from nltk.corpus import stopwords  # stop word lists
from indoNLP.preprocessing import replace_slang  # slang normalization
from nltk.stem.porter import PorterStemmer  # stemming (Porter algorithm, designed for English)
import nltk
nltk.download('stopwords')
nltk.download('punkt')
import os
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [189]:
class Prepocessing:
  def __init__(self):
    self.listStopword =  set(stopwords.words('indonesian'))
    self.stemmer = PorterStemmer()

  def remove_emoji(self, string): #remove emoji
    emoji_pattern = re.compile("["
      u"\U0001F600-\U0001F64F"  # emoticons
      u"\U0001F300-\U0001F5FF"  # symbols & pictographs
      u"\U0001F680-\U0001F6FF"  # transport & map symbols
      u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
      u"\U00002702-\U000027B0"
      u"\U000024C2-\U0001F251"
      "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r' ', string)

  def remove_unwanted(self, document): #clean text
    # remove user mentions
    document = re.sub("@[A-Za-z0-9_]+"," ", document)
    # remove URLS 
    document = re.sub(r'http\S+', ' ', document)
    # remove hashtags
    document = re.sub("#[A-Za-z0-9_]+","", document)
    # remove emoji's
    document = self.remove_emoji(document)
    # remove punctuation
    document = re.sub("[^0-9A-Za-z ]", "" , document)
    # remove double spaces
    document = document.replace('  '," ")
    return document.strip()
  
  def tokenize(self, text): # tokenize -> lowercase the text and split it into words
    return word_tokenize(text.translate(str.maketrans('', '', string.punctuation)).lower())
  
  def stopWord(self, text): # stop words -> remove common function words
    return [kata for kata in text if kata not in self.listStopword]
  
  def slank_word(self, text): # slang words -> replace informal words with their standard forms
    return [replace_slang(kata) for kata in text]

  def stemming(self, text): # stemming -> reduce each word to its root form
    return " ".join([self.stemmer.stem(kata) for kata in text])

In [190]:
preprocessing = Prepocessing()

## Text Cleaning
> The text cleaning step removes user mentions, URLs, hashtags, and emoji, then deletes every character that is not a letter, digit, or space, collapses double spaces, and trims leading and trailing whitespace. Digits are kept, and case folding to lowercase happens later, during tokenization. HTML tags are not removed as tags: only their angle brackets are deleted, so a `<br>` in a comment leaves the residue `br` in the cleaned text.


In [191]:
df['clean'] = df['comment'].apply(lambda x: preprocessing.remove_unwanted(x))
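
The comments returned by the API contain HTML entities and tags (for example `&#39;` and `<br>`). One possible way to handle them, sketched here with the standard library only, is to decode entities and drop tags before the punctuation is removed. The helper name `strip_html` and the second sample string are assumptions made for illustration; this step is not part of the original notebook.

```python
import html
import re


def strip_html(text):
    """Decode HTML entities and drop tags before any punctuation removal."""
    text = html.unescape(text)             # "&#39;" becomes "'"
    text = re.sub(r"<[^>]+>", " ", text)   # "<br>" becomes a space
    return re.sub(r"\s+", " ", text).strip()


print(strip_html("Jawa ganjar  &#39; luar jawar anis pilih"))
print(strip_html("baris satu<br><br>baris dua"))  # hypothetical comment
```

If adopted, a call such as `strip_html(x)` would run before `remove_unwanted(x)`, so that tag names and entity codes never reach the token list.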

## Tokenization
> Tokenization splits a text, whether a sentence, a paragraph, or a whole document, into smaller units called tokens. In this step every comment is lowercased and split into words, and the resulting word list is stored in a new column of the table.

In [192]:
df['tokenize'] = df['clean'].apply(lambda x: preprocessing.tokenize(x))
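
As a rough illustration of what this step produces, the sketch below uses a plain whitespace split as a stand-in for NLTK's `word_tokenize`. The function name `simple_tokenize` is hypothetical, and the two tokenizers are only assumed to agree on text that has already been stripped of punctuation.

```python
import string


def simple_tokenize(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).lower().split()


print(simple_tokenize("Kursi itu apa sih?"))
# ['kursi', 'itu', 'apa', 'sih']
```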

## Stop Words Removal
> Stop word removal is a filtering step that keeps only the informative tokens, that is, the words that best represent a document. The words filtered out are common function words such as dan (and), adalah (is), yang (which), and di (in/at). The remaining words are stored in a new column of the table.

In [193]:
df['stopword'] = df['tokenize'].apply(lambda x: preprocessing.stopWord(x))
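
The filtering itself is a simple membership test. The sketch below uses a small hand-picked stop list, which is an assumption for illustration; the notebook relies on NLTK's full Indonesian list instead.

```python
# Hypothetical mini stop list; the notebook uses NLTK's full Indonesian list.
STOPWORDS = {"dan", "adalah", "yang", "di", "itu", "apa", "sama", "masih"}


def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stop list."""
    return [tok for tok in tokens if tok not in STOPWORDS]


print(remove_stopwords(["kursi", "itu", "apa", "sih"]))
# ['kursi', 'sih']
```

Storing the stop list in a `set` keeps each lookup constant-time, which matters once the list grows to several hundred words.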

## Slang Words
> The slang word step replaces non-standard or informal words with their standard forms, for example tdk with tidak and yg with yang.

In [194]:
df['slangword'] = df['stopword'].apply(lambda x: preprocessing.slank_word(x))
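
Conceptually this is a dictionary lookup per token. The sketch below uses a tiny hypothetical dictionary assembled from replacements visible in the notebook output; it is a stand-in for, not a copy of, the dictionary used by indoNLP's `replace_slang`.

```python
# Hypothetical mini dictionary built from replacements seen in the output.
SLANG = {"tdk": "tidak", "klo": "kalo", "udah": "sudah",
         "ga": "enggak", "yg": "yang", "dlm": "dalam"}


def normalize_slang(tokens):
    """Replace each known slang token; leave unknown tokens untouched."""
    return [SLANG.get(tok, tok) for tok in tokens]


print(normalize_slang(["anies", "tdk", "layak"]))
# ['anies', 'tidak', 'layak']
```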

## Stemming
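> Stemming reduces each word to its root form by stripping affixes, for example berjalan becomes jalan and berbunyi becomes bunyi. The code in this notebook uses NLTK's `PorterStemmer`, which was designed for English, rather than an Indonesian stemmer such as Sastrawi, so some Indonesian words are cut incorrectly (anies becomes ani, pas becomes pa, krisis becomes krisi).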


In [195]:
df['stem'] = df['slangword'].apply(lambda x: preprocessing.stemming(x))
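
To make the idea of affix stripping concrete, here is a deliberately naive sketch that removes one prefix and one suffix. The function `toy_stem` and its two-entry affix lists are assumptions for illustration only; this is neither the Porter algorithm nor what a real Indonesian stemmer does, since such stemmers typically validate candidates against a root-word dictionary.

```python
PREFIXES = ("ber",)
SUFFIXES = ("nya",)


def toy_stem(word, min_len=3):
    """Strip one known prefix and one known suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


for w in ["berjalan", "berbunyi", "videonya", "layak"]:
    print(w, "->", toy_stem(w))
```

A rule this simple will mis-stem words that merely happen to start with `ber`, which is exactly why dictionary-backed stemmers exist.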

## Merging Data
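> The results of every preprocessing step are collected in one table; the stemmed text is then extracted, de-duplicated, and saved to a CSV file.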

In [196]:
df

Unnamed: 0,timePosted,displayName,comment,clean,tokenize,stopword,slangword,stem
0,2023-05-08T10:18:22Z,Dahlan Werang,Anies tdk layak,Anies tdk layak,"[anies, tdk, layak]","[anies, tdk, layak]","[anies, tidak, layak]",ani tidak layak
1,2023-05-08T09:28:11Z,Johan Supriyadi,klo masih 3 beginih. udah ketauan siapa yang b...,klo masih 3 beginih udah ketauan siapa yang ba...,"[klo, masih, 3, beginih, udah, ketauan, siapa,...","[klo, 3, beginih, udah, ketauan, gabungan, 1, ...","[kalo, 3, beginih, sudah, ketahuan, gabungan, ...",kalo 3 beginih sudah ketahuan gabungan 1 4 bia...
2,2023-05-07T14:18:21Z,Koko,Videonya ga pas sama omongan presenternya,Videonya ga pas sama omongan presenternya,"[videonya, ga, pas, sama, omongan, presenternya]","[videonya, ga, pas, omongan, presenternya]","[videonya, enggak, pas, omongan, presenternya]",videonya enggak pa omongan presenternya
3,2023-05-07T10:01:58Z,Sintia Sindi,Indonesia dlm masa krisis segalla bidang mohon...,Indonesia dlm masa krisis segalla bidang mohon...,"[indonesia, dlm, masa, krisis, segalla, bidang...","[indonesia, dlm, krisis, segalla, bidang, moho...","[indonesia, dalam, krisis, segalla, bidang, mo...",indonesia dalam krisi segalla bidang mohon hor...
4,2023-05-06T02:59:28Z,Nama kamu,Prabowo Subianto Djojohadikusumo ❤🎉,Prabowo Subianto Djojohadikusumo,"[prabowo, subianto, djojohadikusumo]","[prabowo, subianto, djojohadikusumo]","[prabowo, subianto, djojohadikusumo]",prabowo subianto djojohadikusumo
...,...,...,...,...,...,...,...,...
299,2023-04-26T13:13:10Z,Nyoimanis,Kursi itu apa sih?,Kursi itu apa sih,"[kursi, itu, apa, sih]","[kursi, sih]","[kursi, sih]",kursi sih
300,2023-04-26T13:20:15Z,mediraja 77,Kursi yg diproleh saat pemilu 2019 lalu.<br><b...,Kursi yg diproleh saat pemilu 2019 lalubrbrKur...,"[kursi, yg, diproleh, saat, pemilu, 2019, lalu...","[kursi, yg, diproleh, pemilu, 2019, lalubrbrku...","[kursi, yang, diproleh, pemilu, 2019, lalubrbr...",kursi yang diproleh pemilu 2019 lalubrbrkursi ...
301,2023-04-26T13:19:10Z,Rayhan Syamil,googling mas,googling mas,"[googling, mas]","[googling, mas]","[googling, mas]",googl ma
302,2023-04-26T13:17:42Z,Ravael Chandra,Tempat duduk,Tempat duduk,"[tempat, duduk]",[duduk],[duduk],duduk


In [197]:
df[['comment', 'clean', 'tokenize', 'stopword', 'slangword', 'stem']]

Unnamed: 0,comment,clean,tokenize,stopword,slangword,stem
0,Anies tdk layak,Anies tdk layak,"[anies, tdk, layak]","[anies, tdk, layak]","[anies, tidak, layak]",ani tidak layak
1,klo masih 3 beginih. udah ketauan siapa yang b...,klo masih 3 beginih udah ketauan siapa yang ba...,"[klo, masih, 3, beginih, udah, ketauan, siapa,...","[klo, 3, beginih, udah, ketauan, gabungan, 1, ...","[kalo, 3, beginih, sudah, ketahuan, gabungan, ...",kalo 3 beginih sudah ketahuan gabungan 1 4 bia...
2,Videonya ga pas sama omongan presenternya,Videonya ga pas sama omongan presenternya,"[videonya, ga, pas, sama, omongan, presenternya]","[videonya, ga, pas, omongan, presenternya]","[videonya, enggak, pas, omongan, presenternya]",videonya enggak pa omongan presenternya
3,Indonesia dlm masa krisis segalla bidang mohon...,Indonesia dlm masa krisis segalla bidang mohon...,"[indonesia, dlm, masa, krisis, segalla, bidang...","[indonesia, dlm, krisis, segalla, bidang, moho...","[indonesia, dalam, krisis, segalla, bidang, mo...",indonesia dalam krisi segalla bidang mohon hor...
4,Prabowo Subianto Djojohadikusumo ❤🎉,Prabowo Subianto Djojohadikusumo,"[prabowo, subianto, djojohadikusumo]","[prabowo, subianto, djojohadikusumo]","[prabowo, subianto, djojohadikusumo]",prabowo subianto djojohadikusumo
...,...,...,...,...,...,...
299,Kursi itu apa sih?,Kursi itu apa sih,"[kursi, itu, apa, sih]","[kursi, sih]","[kursi, sih]",kursi sih
300,Kursi yg diproleh saat pemilu 2019 lalu.<br><b...,Kursi yg diproleh saat pemilu 2019 lalubrbrKur...,"[kursi, yg, diproleh, saat, pemilu, 2019, lalu...","[kursi, yg, diproleh, pemilu, 2019, lalubrbrku...","[kursi, yang, diproleh, pemilu, 2019, lalubrbr...",kursi yang diproleh pemilu 2019 lalubrbrkursi ...
301,googling mas,googling mas,"[googling, mas]","[googling, mas]","[googling, mas]",googl ma
302,Tempat duduk,Tempat duduk,"[tempat, duduk]",[duduk],[duduk],duduk


In [198]:
df_1 = df[['stem']]
df_1

Unnamed: 0,stem
0,ani tidak layak
1,kalo 3 beginih sudah ketahuan gabungan 1 4 bia...
2,videonya enggak pa omongan presenternya
3,indonesia dalam krisi segalla bidang mohon hor...
4,prabowo subianto djojohadikusumo
...,...
299,kursi sih
300,kursi yang diproleh pemilu 2019 lalubrbrkursi ...
301,googl ma
302,duduk


In [199]:
df_new = df_1.drop_duplicates().reset_index(drop=True)
df_new

Unnamed: 0,stem
0,ani tidak layak
1,kalo 3 beginih sudah ketahuan gabungan 1 4 bia...
2,videonya enggak pa omongan presenternya
3,indonesia dalam krisi segalla bidang mohon hor...
4,prabowo subianto djojohadikusumo
...,...
296,kursi sih
297,kursi yang diproleh pemilu 2019 lalubrbrkursi ...
298,googl ma
299,duduk


In [200]:
df_new.to_csv('Capres2024dataset.csv', index=False)

## Modelling Data

In [201]:
data = pd.read_csv("/content/drive/MyDrive/Prosaindata/tugas/Capres2024dataset.csv")
data

Unnamed: 0,stem
0,ani tidak layak
1,kalo 3 beginih sudah ketahuan gabungan 1 4 bia...
2,videonya enggak pa omongan presenternya
3,indonesia dalam krisi segalla bidang mohon hor...
4,prabowo subianto djojohadikusumo
...,...
296,kursi sih
297,kursi yang diproleh pemilu 2019 lalubrbrkursi ...
298,googl ma
299,duduk


## TF-IDF

In [202]:
# Vectorize document using TF-IDF
tfidf = TfidfVectorizer()

In [203]:
documents_list = data.values.reshape(-1,).tolist()

In [204]:
train_data = tfidf.fit_transform(data['stem'].apply(lambda x: np.str_(x)))
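
To show what the vectorizer computes, the sketch below re-implements TF-IDF in plain Python, assuming scikit-learn's documented defaults of a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. Tokenization is simplified to a whitespace split, whereas `TfidfVectorizer` by default keeps only tokens of two or more alphanumeric characters, so single-digit tokens such as `3` are absent from its vocabulary. The function name and the third sample document are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows (dicts of term -> weight)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


# The first two documents appear in the dataset; the third is made up.
docs = ["kursi sih", "prabowo subianto djojohadikusumo", "kursi ppp"]
for row in tfidf(docs):
    print({t: round(w, 3) for t, w in row.items()})
```

In the first document, `sih` receives a higher weight than `kursi` because `kursi` also occurs in another document and is therefore less distinctive.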

In [205]:
# Define the number of topics or components
num_components=10

# Create SVD object
lsa = TruncatedSVD(n_components=num_components, n_iter=100, random_state=42)

# Fit SVD model on data
lsa.fit_transform(train_data)

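# Get the singular values and the components
# lsa.components_ holds V^T with shape (n_topics, n_terms),
# so V_transpose below actually holds V with shape (n_terms, n_topics)
Weight = lsa.singular_values_ 
V_transpose = lsa.components_.T
Weight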

array([2.64706915, 2.2016034 , 2.09150412, 2.07297462, 1.99347352,
       1.85435295, 1.79308805, 1.72376663, 1.7132633 , 1.6683006 ])

In [206]:
# Print the topics with their terms
terms = tfidf.get_feature_names_out() 
for index, component in enumerate(lsa.components_):
    zipped = zip(terms, component)
    top_terms_key=sorted(zipped, key = lambda t: t[1], reverse=True)[:5]
    top_terms_list=list(dict(top_terms_key).keys())
    print("Topic "+str(index)+": ",top_terms_list)

Topic 0:  ['ganjar', 'ani', 'yang', 'partai', 'pilih']
Topic 1:  ['ganjar', 'mahfud', 'pranowo', 'prabowo', 'no']
Topic 2:  ['ani', 'jawa', 'baswedan', 'formula', 'warga']
Topic 3:  ['kursi', 'ani', 'sih', 'ppp', 'ganjar']
Topic 4:  ['prabowo', 'mahfud', 'md', 'cawapr', 'subianto']
Topic 5:  ['yang', 'pilih', 'capr', 'bokep', 'cawapr']
Topic 6:  ['pilih', 'partai', 'ppp', 'pdip', 'petuga']
Topic 7:  ['mahfud', 'md', 'formula', 'setuju', 'amirudin']
Topic 8:  ['enggak', 'poro', 'pilih', 'saja', 'jawa']
Topic 9:  ['capr', 'ppp', 'kapan', 'mahfud', 'dukung']
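
The loop above ranks terms by their signed weight, so terms with large negative loadings never appear in the lists. Ranking by absolute value is one common alternative; the sketch below contrasts the two on a made-up component. The function `top_terms` and the weights are hypothetical and are not taken from the fitted model.

```python
def top_terms(terms, weights, n=3, by_abs=False):
    """Return the n terms with the largest (optionally absolute) weights."""
    key = (lambda tw: abs(tw[1])) if by_abs else (lambda tw: tw[1])
    ranked = sorted(zip(terms, weights), key=key, reverse=True)
    return [term for term, _ in ranked[:n]]


# Hypothetical component, made up to show the difference.
terms = ["ganjar", "prabowo", "kursi", "partai"]
weights = [0.40, -0.55, 0.10, 0.25]

print(top_terms(terms, weights))               # signed ranking
print(top_terms(terms, weights, by_abs=True))  # absolute ranking
```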


In [207]:
tf_idf_array = train_data.toarray()
print(tf_idf_array)

[[0.        0.        0.        ... 0.        0.        0.       ]
 [0.        0.        0.        ... 0.        0.        0.       ]
 [0.        0.        0.        ... 0.        0.        0.       ]
 ...
 [0.        0.        0.        ... 0.        0.        0.       ]
 [0.        0.        0.        ... 0.        0.        0.       ]
 [0.6506964 0.        0.        ... 0.        0.        0.       ]]


In [208]:
words_set = tfidf.get_feature_names_out()
print(words_set)

['10' '100' '11000' ... 'zaman' 'zimbabw' 'zulkifli']


In [209]:
df_tf_idf = pd.DataFrame(tf_idf_array, columns = words_set)
df_tf_idf

Unnamed: 0,10,100,11000,12,15,150,16,163,19,1928,...,yogya,yogyakarta,yohan,yookarena,youtub,yoyo,yudikatif,zaman,zimbabw,zulkifli
0,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,...,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000
1,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,...,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000
2,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,...,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000
3,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,...,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000
4,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,...,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,...,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000
297,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,...,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000
298,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,...,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000
299,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,...,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000,0.0000000000000000


In [210]:
from sklearn.decomposition import TruncatedSVD
# SVD represent documents and terms in vectors 
lsa_model = TruncatedSVD(n_components=5, algorithm='randomized', n_iter=10, random_state=42)
lsa_top=lsa_model.fit_transform(df_tf_idf)

In [211]:
print(lsa_top)
print(lsa_top.shape)  # (no_of_doc*no_of_topics)

[[ 1.98869854e-01 -1.38376650e-01  2.15698006e-01  1.42076819e-01
   5.12011723e-02]
 [ 4.11416350e-02 -5.04599568e-03 -1.60539834e-02  8.67426340e-03
  -1.71465226e-04]
 [ 6.09287728e-02  2.29833309e-02 -2.29281012e-02  7.73064725e-03
  -2.09899368e-02]
 ...
 [ 4.81134686e-04 -2.11751582e-04 -8.26656575e-04  3.78172300e-04
   5.12241783e-04]
 [-7.32190219e-10  1.30352065e-08  3.27487996e-08  1.19543370e-07
   3.94460902e-07]
 [ 6.21896700e-02 -9.45849116e-02 -2.05469933e-01  3.26010670e-01
  -3.08038212e-03]]
(301, 5)


In [212]:
# Weight of each topic for each document
result = pd.DataFrame(lsa_top, columns=['Topik 1','Topik 2','Topik 3','Topik 4','Topik 5'])
result

Unnamed: 0,Topik 1,Topik 2,Topik 3,Topik 4,Topik 5
0,0.1988698544538768,-0.1383766504681777,0.2156980061826590,0.1420768191615354,0.0512011723109526
1,0.0411416350018769,-0.0050459956757236,-0.0160539833708569,0.0086742633998680,-0.0001714652256947
2,0.0609287728488435,0.0229833309466812,-0.0229281011858051,0.0077306472479068,-0.0209899367775933
3,0.0351334886426639,0.0184269572973214,0.0256174438379070,0.0163037956185003,0.0215071737185803
4,0.0728026908350219,0.0650550038161726,-0.0695916781240870,-0.0493847621655555,0.4442754240119795
...,...,...,...,...,...
296,0.1261384748968871,-0.1620987496243333,-0.3211137371984872,0.5443897725715384,0.0075948590922994
297,0.2225477550124332,-0.1526555693204313,-0.2046540686907619,0.0086829500607749,-0.0227290414901178
298,0.0004811346861321,-0.0002117515822279,-0.0008266565745284,0.0003781722999902,0.0005122417827454
299,-0.0000000007321902,0.0000000130352065,0.0000000327487996,0.0000001195433705,0.0000003944609017
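
One way to read this table is to pick, for each document, the topic with the largest absolute weight. The sketch below does so for rows 0 to 4 of the table above, rounded to four decimals; the helper `dominant_topic` is a hypothetical name, and taking the absolute value is an assumption about how negative weights should be treated.

```python
# Rows 0-4 of the document-topic table above, rounded to four decimals.
doc_topic = [
    [0.1989, -0.1384, 0.2157, 0.1421, 0.0512],
    [0.0411, -0.0050, -0.0161, 0.0087, -0.0002],
    [0.0609, 0.0230, -0.0229, 0.0077, -0.0210],
    [0.0351, 0.0184, 0.0256, 0.0163, 0.0215],
    [0.0728, 0.0651, -0.0696, -0.0494, 0.4443],
]
topic_names = ["Topik 1", "Topik 2", "Topik 3", "Topik 4", "Topik 5"]


def dominant_topic(row):
    """Name of the topic with the largest absolute weight in the row."""
    best = max(range(len(row)), key=lambda j: abs(row[j]))
    return topic_names[best]


for i, row in enumerate(doc_topic):
    print(i, dominant_topic(row))
```

Document 4 ("prabowo subianto djojohadikusumo") is dominated by Topik 5, with a weight of about 0.44.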


In [213]:
VT = lsa_model.components_
print(VT)
print(VT.shape) # (no_of_topics*no_of_words)

[[ 5.77525910e-03  3.87858580e-02  5.05674297e-03 ...  4.26859447e-03
   6.03022397e-04  2.83897293e-03]
 [-1.26982790e-02  1.05247177e-02  1.37352217e-02 ... -2.11389221e-03
   6.61817639e-04 -1.05903405e-04]
 [-3.05632987e-02  4.03710284e-02  1.48735358e-03 ...  7.37144465e-03
   1.26928202e-04 -4.47203144e-03]
 [ 4.93622769e-02  4.87831746e-03  2.89727331e-03 ...  1.57043077e-03
  -6.10606770e-05  3.30266366e-03]
 [-4.97328416e-04 -1.55407887e-02 -5.66115846e-03 ... -2.46282110e-03
  -8.84549666e-04 -1.13281005e-03]]
(5, 1484)


In [214]:
# Weight of each word for each topic
label=[]
for i in range (1, VT.shape[1] + 1):
  hasil = f"kata ke-{i}"  # "kata ke-i" means "word number i"
  label.append(hasil)
VT_tabel = pd.DataFrame(VT,columns=label)
VT_tabel.rename(index={0:"Topik 1",1:"Topik 2",2:"Topik 3",3:"Topik 4",4:"Topik 5"})

Unnamed: 0,kata ke-1,kata ke-2,kata ke-3,kata ke-4,kata ke-5,kata ke-6,kata ke-7,kata ke-8,kata ke-9,kata ke-10,...,kata ke-1475,kata ke-1476,kata ke-1477,kata ke-1478,kata ke-1479,kata ke-1480,kata ke-1481,kata ke-1482,kata ke-1483,kata ke-1484
Topik 1,0.005775259103067,0.0387858579948793,0.0050567429735425,0.0043315534076394,0.0043315534076394,0.0040875900389011,0.0009899129987701,0.0263953375123731,0.019496277994384,0.0008962184657498,...,0.0031463582768725,0.0071259686550394,0.0091654925719895,0.0009899129987701,0.0034194238279916,0.0011833595975098,0.0018238707004299,0.0042685944697495,0.0006030223965054,0.0028389729301589
Topik 2,-0.0126982789781015,0.0105247176791851,0.0137352217071512,0.0019530956629954,0.0019530956629956,-0.0090238497454056,-0.0003000828333914,-0.0384273344688317,-0.0194308241573094,-0.0004407997089898,...,0.0026557394685443,0.0009057138564885,-0.0105532071992686,-0.0003000828333914,0.0006039052470031,-0.0007259656274184,-0.0013764720345655,-0.0021138922066219,0.0006618176392848,-0.0001059034049431
Topik 3,-0.03056329869124,0.0403710283502044,0.0014873535789779,0.0049803817390359,0.004980381739036,-0.021752950507023,-0.000873812403305,0.000716523425964,-0.0707848749750312,0.0020292807975556,...,0.0032470804292369,0.0088845833578435,0.0235650733898725,-0.000873812403305,-0.0040378247103006,-0.001167905136821,-0.0014726604933062,0.0073714446548376,0.0001269282016096,-0.0044720314380712
Topik 4,0.0493622769337174,0.00487831745891,0.002897273309446,0.003837582570974,0.003837582570974,0.0351430185794288,0.0004007652927898,0.0892851705221067,0.0936614269729895,0.000668668315099,...,0.0026796685251815,0.0065145413121955,0.0099053189647294,0.0004007652927898,0.0022070142809375,-0.0007107683247704,-0.0018410143605698,0.0015704307668634,-6.10606769943e-05,0.0033026636636283
Topik 5,-0.0004973284159486,-0.0155407886656289,-0.0056611584644457,-0.0027999997802973,-0.0027999997802974,-0.0003542711934985,-0.0003946422987916,0.0088751770029051,0.001151766827624,-0.0001689518288011,...,-0.0024541511548488,-0.0038224354014546,0.0051293956418934,-0.0003946422987916,-0.0016695695097215,0.0005263081118028,-0.0006871231191901,-0.002462821102974,-0.0008845496659318,-0.0011328100524262
