<a href="https://colab.research.google.com/github/andrybrew/IHT-SEM1302-30Okt/blob/main/practice_material/topic_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Topic Modeling**

##**Importing required libraries**

In [1]:
!pip install pyLDAvis --no-deps
!pip install funcy

Collecting pyLDAvis
  Downloading pyLDAvis-3.4.1-py3-none-any.whl (2.6 MB)
Installing collected packages: pyLDAvis
Successfully installed pyLDAvis-3.4.1


In [32]:
import pandas as pd
import nltk
import requests
import seaborn as sns
import string
import gensim
import spacy
import re

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from gensim import corpora
import pyLDAvis.gensim_models

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

##**Import Dataset from Apify Runs**

In [61]:
# Fetching the dataset from Apify
api_url = "https://api.apify.com/v2/datasets/na5aYoW2Rcrm9h08r/items"
api_token = "YOUR_APIFY_API_TOKEN"  # Placeholder: substitute your own Apify API token
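
Since the token has to be substituted by each reader, one option is to read it from the environment rather than typing it into the cell. The sketch below assumes an environment variable named `APIFY_TOKEN`. That name and the helper `load_api_token` are hypothetical and not part of the original notebook.

```python
import os


def load_api_token(env_var: str = "APIFY_TOKEN") -> str:
    """Return the API token stored in the given environment variable.

    The variable name is an assumption made for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this cell."
        )
    return token
```

With the variable set, `api_token = load_api_token()` could stand in for the literal string above.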




In [62]:
# Making a GET request to the API
params = {"token": api_token, "format": "json"}  # request parameters
response = requests.get(api_url, params=params)

# Checking the response status
if response.status_code == 200:  # success
    # Processing the JSON data into a pandas DataFrame
    data = response.json()
    df_tweet = pd.DataFrame(data)

    # Saving the DataFrame to a CSV file
    df_tweet.to_csv("twitter_data.csv", index=False)
    print("Data downloaded and saved as twitter_data.csv")
else:
    print(f"An error occurred: {response.status_code}. Message: {response.text}")


Data downloaded and saved as twitter_data.csv


##**Data Preprocessing for Topic Modeling**

In [63]:
# Function for removing URL
def remove_url(text):
    return re.sub(r'https?://\S+|www\.\S+|\S+\.\S+/\S+', '', text, flags=re.MULTILINE)

# Remove URLs from every tweet
df_tweet['text'] = df_tweet['text'].apply(remove_url)
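
To see what this pattern does in practice, the sketch below applies the same `remove_url` function to a few made-up strings. The sample sentences and the helper `normalize_spaces` are illustrative assumptions, not data from the Apify dataset. Note that the third alternative in the pattern (`\S+\.\S+/\S+`) is broad: it also removes non-URL tokens that contain a dot followed later by a slash, such as `0.25/bulan`.

```python
import re


def remove_url(text):
    # Same pattern as in the notebook cell above
    return re.sub(r'https?://\S+|www\.\S+|\S+\.\S+/\S+', '', text, flags=re.MULTILINE)


def normalize_spaces(text):
    # Hypothetical helper: collapse the gaps left behind by removed URLs
    return " ".join(text.split())


with_scheme = normalize_spaces(remove_url("BI tahan suku bunga https://t.co/abc123 hari ini"))
short_link = normalize_spaces(remove_url("cek bit.ly/xyz sekarang"))
# Side effect: a token that is not a URL but has the shape "x.y/z" is removed too
not_a_url = normalize_spaces(remove_url("bunga naik 0.25/bulan"))

print(with_scheme)
print(short_link)
print(not_a_url)
```

If figures such as rates per period matter for the analysis, a narrower pattern may be worth considering.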



In [64]:
# Select the tweet text
tweet = df_tweet['text']

# Tokenize: lowercase and split on whitespace
tokenized_text = [d.lower().split() for d in tweet]

# Remove tokens that consist only of punctuation characters
punctuation = set(string.punctuation)
tokenized_text = [[word for word in doc if not all(ch in punctuation for ch in word)] for doc in tokenized_text]

# Lemmatization (WordNet is an English lexicon, so most Indonesian words pass through unchanged)
lemmatizer = WordNetLemmatizer()
tokenized_text = [[lemmatizer.lemmatize(word) for word in doc] for doc in tokenized_text]

# Build the stop word set: NLTK's Indonesian list plus corpus-specific additions
stop_words = set(stopwords.words('indonesian'))
custom_stop_words = ['dgn', 'sdh', 'yg', 'the', 'gak', 'ga', 'a', 'krn', 'thd', 'nya', 'ya', 'n', 'kalo', 'aja', 'deh', 'tuh', 'udah', 'dll.', '2', '25', '20', '1.', '2.', '7.', 'u', '5', 'gua', '•']
stop_words.update(custom_stop_words)

# Remove stopwords
tokenized_text = [[word for word in doc if word not in stop_words] for doc in tokenized_text]

# Create dictionary
dictionary = corpora.Dictionary(tokenized_text)

# Create corpus
corpus = [dictionary.doc2bow(doc) for doc in tokenized_text]
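
The dictionary maps each distinct token to an integer id, and the corpus stores every document as a list of `(token_id, count)` pairs. The pure-Python sketch below illustrates that bag-of-words idea on two made-up documents. The functions `build_vocab` and `to_bow` are hypothetical stand-ins, and the ids that gensim itself assigns may differ from the ones shown here.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Toy documents, invented for illustration only
docs = [["suku", "bunga", "naik"], ["bunga", "bi", "bunga"]]
vocab = build_vocab(docs)
bow_corpus = [to_bow(doc, vocab) for doc in docs]

print(vocab)
print(bow_corpus)
```

Word order is discarded in this representation, which is why LDA sees each tweet only as a set of token counts.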



##**Setting Up LDA Model**

In [65]:
# Train LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=3,
                                           passes=200,
                                           per_word_topics=True)



##**Visualizing Topics**

In [66]:
# Enable Notebook
pyLDAvis.enable_notebook()

# Visualize
pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)



In [72]:
# Extract the top keywords for each topic
top_topics = lda_model.print_topics(num_words=10)  # top 10 keywords per topic

# Create DataFrame
df_topics = pd.DataFrame(top_topics, columns=['Topic', 'Keywords'])

# Set topic as index
df_topics.set_index('Topic', inplace=True)

# Show df_topics
df_topics



| Topic | Keywords |
|---|---|
| 0 | `0.034*"bunga" + 0.034*"suku" + 0.017*"bi" + 0....` |
| 1 | `0.039*"suku" + 0.038*"bunga" + 0.016*"bi" + 0....` |
| 2 | `0.075*"bunga" + 0.074*"suku" + 0.030*"bi" + 0....` |




**Topic 0:**

Focuses on interest rates (suku bunga), Bank Indonesia, interest rate hikes, exchange rates, and stocks. This topic likely relates to Bank Indonesia's interest rate policy and its impact on the stock market and the rupiah exchange rate.

**Topic 1:**

Also focuses on interest rates, Bank Indonesia, and the market, but additionally covers policy, the Federal Reserve (fed), and home mortgage loans (Kredit Pemilikan Rumah, KPR). This topic likely relates to interest rate policy and its effect on financial markets, including mortgages.

**Topic 2:**

The main focus is interest rates and Bank Indonesia, with keywords such as the benchmark interest rate (suku bunga acuan) and percentages. This topic likely centers on Bank Indonesia's interest rate policy and announcements about the benchmark rate.
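
All three topics lead with the same keywords (`bunga`, `suku`, `bi`), so it can help to quantify how much they overlap. The sketch below parses keyword strings in the `weight*"word"` format shown in the table and computes the Jaccard overlap between topics. It uses only the three keywords visible in the truncated output above, so the numbers describe that visible portion only. The names `parse_topic` and `jaccard` are hypothetical helpers.

```python
import re
from itertools import combinations

# Matches terms such as 0.034*"bunga"
TERM_PATTERN = re.compile(r'([0-9]*\.?[0-9]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Return (word, weight) pairs parsed from one topic string."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


def jaccard(a, b):
    """Share of distinct keywords that two keyword lists have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Only the keywords visible in the truncated table; the rest is not reproduced here
visible_topics = [
    (0, '0.034*"bunga" + 0.034*"suku" + 0.017*"bi"'),
    (1, '0.039*"suku" + 0.038*"bunga" + 0.016*"bi"'),
    (2, '0.075*"bunga" + 0.074*"suku" + 0.030*"bi"'),
]

keywords = {tid: [word for word, _ in parse_topic(s)] for tid, s in visible_topics}
overlaps = {
    (i, j): jaccard(keywords[i], keywords[j])
    for i, j in combinations(sorted(keywords), 2)
}

print(overlaps)
```

In the notebook, the full output of `lda_model.print_topics(num_words=10)` could be passed in place of `visible_topics`. A high overlap across all pairs would suggest trying a different number of topics or a longer stop word list.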