<a href="https://colab.research.google.com/github/ezzy4me/youtube_comment_tm/blob/main/step_4_youtube_bertopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 4. BERTopic with YouTube Comments on Living Alone


Using the text data collected in the previous steps, we now run topic modeling with BERTopic.

## Load the CSV file


In [1]:
from google.colab import drive
drive.mount('/content/drive')

# environment setting
import csv
import pandas as pd
import numpy as np
import re

Mounted at /content/drive


In [2]:
filtered_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/eng_youtube/model_input.csv')
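
Before modeling, it can help to confirm that the text column contains only usable strings, since missing or empty entries tend to cause trouble when documents are embedded. The following is a minimal sketch on a made-up `sample_df`; only the column name `title_comments` comes from this notebook (it is the column read further below), and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the CSV loaded above; only the column name
# `title_comments` is taken from this notebook.
sample_df = pd.DataFrame(
    {"title_comments": ["living in a van", None, "   ", "solar setup tour"]}
)

# Drop missing values, strip surrounding whitespace, then drop empty strings.
cleaned = sample_df.dropna(subset=["title_comments"]).copy()
cleaned["title_comments"] = cleaned["title_comments"].astype(str).str.strip()
cleaned = cleaned[cleaned["title_comments"] != ""].reset_index(drop=True)

docs_preview = cleaned["title_comments"].tolist()
print(docs_preview)
```

The same three steps could be applied to `filtered_df` if the input file turns out to contain blank comments.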

# 4. BERTopic

This section explains how to fit a BERTopic model to the text data and how to visualize the resulting topics.

In [3]:
!pip install bertopic

Collecting bertopic
  Downloading bertopic-0.15.0-py2.py3-none-any.whl (143 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap-learn-0.5.4.tar.gz (90 kB)
  Preparing metadata (setup.py) ... done
Collecting sentence-transfor

In [11]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# UMAP model
umap_model = UMAP(n_neighbors=5,
                  n_components=5,
                  min_dist=0.0,
                  metric='cosine',
                  random_state=43)

# CountVectorizer
vectorizer_model = CountVectorizer(ngram_range=(1, 1), stop_words="english")

# BERTopic model
youtube_model = BERTopic(embedding_model="sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens",
                         umap_model=umap_model,
                         vectorizer_model=vectorizer_model,
                         top_n_words=10,
                         min_topic_size=50,
                         calculate_probabilities=True)

In [12]:
from sentence_transformers import SentenceTransformer

# Extract the documents from the DataFrame as a list of strings
docs = filtered_df['title_comments'].tolist()

# Generate sentence embeddings once so they can be reused for visualization
sentence_model = SentenceTransformer("sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens")
embeddings = sentence_model.encode(docs)

# Fit the topic model on the documents, passing the precomputed embeddings
topics, probs = youtube_model.fit_transform(docs, embeddings)
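
`fit_transform` returns one topic id per document, and BERTopic uses the id `-1` for documents it treats as outliers. A quick way to check how the documents are distributed is to count the ids. The sketch below uses a made-up `example_topics` list in place of the real `topics` output, so the numbers are purely illustrative.

```python
from collections import Counter

# Hypothetical topic assignments, one per document; in the notebook this
# would be the `topics` list returned by fit_transform.
example_topics = [0, 1, 0, -1, 2, 0, -1, 1]

topic_counts = Counter(example_topics)
n_outliers = topic_counts.get(-1, 0)
outlier_share = n_outliers / len(example_topics)

for topic_id, count in sorted(topic_counts.items()):
    label = "outliers" if topic_id == -1 else f"topic {topic_id}"
    print(f"{label}: {count} documents")
print(f"share of outliers: {outlier_share:.2f}")
```

On the fitted model itself, `youtube_model.get_topic_info()` gives a similar overview as a DataFrame.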


### Visualization

BERTopic's built-in visualization methods make it easier to inspect and interpret the topics found in the YouTube comments.

#### **youtube_model.visualize_topics():**

This method draws an intertopic distance map for the fitted BERTopic model. Each topic is placed in a 2D plane and drawn as a circle whose size reflects how many documents it contains, and hovering over a circle shows the topic's top words. Topics that sit close together are semantically similar, which makes overlapping or redundant topics easy to spot.

In [13]:
youtube_model.visualize_topics()
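
The visualization methods return Plotly figures, so a figure can be written to a standalone HTML file and kept after the Colab session ends. The helper below is a minimal sketch; the name `save_figure_html` and the output path in the usage comment are hypothetical, and the only library call it relies on is the figure's `write_html` method.

```python
from pathlib import Path


def save_figure_html(fig, path):
    """Write a Plotly figure to a standalone HTML file and return the path."""
    out = Path(path)
    # Create the target folder if it does not exist yet.
    out.parent.mkdir(parents=True, exist_ok=True)
    fig.write_html(str(out))
    return out


# Usage in this notebook (hypothetical output path):
# save_figure_html(youtube_model.visualize_topics(), "figures/topics.html")
```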

#### **youtube_model.visualize_documents(docs, embeddings=embeddings):**

This method plots the individual documents (here, the comments). Pass the documents in the docs parameter and, optionally, their precomputed vectors in the embeddings parameter so that they are not computed again. The documents are projected into a 2D space and colored by the topic they were assigned to, which shows how well separated the topics are and how they relate to one another.

In [14]:
youtube_model.visualize_documents(docs, embeddings=embeddings)

#### **youtube_model.visualize_barchart(top_n_topics=30):**

This method draws one bar chart per topic, showing the topic's top words ranked by their c-TF-IDF scores. The top_n_topics parameter selects how many of the most frequent topics to display. This helps you see which words define each topic and compare topics side by side.

In [15]:
youtube_model.visualize_barchart(top_n_topics=30)
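
The bar lengths in this chart are class-based TF-IDF (c-TF-IDF) scores: a word scores high for a topic when it is frequent within that topic and rare across all topics. The sketch below is a simplified pure-Python version of that weighting, assuming the form tf(t, c) * log(1 + A / f(t)), where A is the average number of tokens per topic and f(t) is the total count of term t. The `topic_tokens` data is made up, and the library's own implementation works on sparse matrices and may differ in details.

```python
import math
from collections import Counter

# Hypothetical toy data: the tokens of all documents in each topic,
# already lowercased and with stop words removed.
topic_tokens = {
    0: ["van", "solar", "van", "money"],
    1: ["beach", "florida", "van", "beach"],
}


def c_tf_idf(topic_tokens):
    """Score each term per topic as tf(t, c) * log(1 + A / f(t))."""
    counts = {c: Counter(tokens) for c, tokens in topic_tokens.items()}
    total = Counter()
    for ctr in counts.values():
        total.update(ctr)
    # A: average number of tokens per topic
    avg_tokens = sum(total.values()) / len(counts)
    scores = {}
    for c, ctr in counts.items():
        n_tokens = sum(ctr.values())
        scores[c] = {
            term: (k / n_tokens) * math.log(1 + avg_tokens / total[term])
            for term, k in ctr.items()
        }
    return scores


scores = c_tf_idf(topic_tokens)
for c, term_scores in scores.items():
    ranked = sorted(term_scores.items(), key=lambda kv: kv[1], reverse=True)
    print(c, [(term, round(score, 3)) for term, score in ranked])
```

In this toy example "van" appears in both topics, so within topic 1 it ranks below "florida" even though both occur once there.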

#### Presenting the topics
The loop below extracts the words associated with each topic and joins them into a single string.
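
For reference, `get_topic` returns a list of `(word, score)` pairs for an existing topic and `False` for a topic id that does not exist. The sketch below shows the join step on a made-up topic; the helper name `topic_to_string` and the scores are hypothetical.

```python
def topic_to_string(topic):
    """Join the words of a topic, or return None if the topic is missing."""
    if not topic:  # get_topic returns False for an unknown topic id
        return None
    return " ".join(word for word, _score in topic)


# Hypothetical topic in the (word, score) shape described above.
example_topic = [("van", 0.12), ("solar", 0.09), ("money", 0.07)]
print(topic_to_string(example_topic))
print(topic_to_string(False))
```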

In [16]:
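# Print the top words of each topic; ids that do not exist are skipped
for i in range(0, 100):
    topic = youtube_model.get_topic(i)
    # get_topic returns False when the topic id does not exist
    if isinstance(topic, bool):
        continue
    # Each entry is a (word, score) pair; keep only the word
    topic_words = [word[0] for word in topic]

    topic_string = ' '.join(topic_words)
    print(f"{i} 's topic : {topic_string}")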

0 's topic : women men man nofap life dont retention bro just truth
1 's topic : accounting public cpa audit career exam big external industry working
2 's topic : car spend living don money live need things van solar
3 's topic : youtuber iftar ramadan eat food abdul recipe daily harun weight
4 's topic : swift jayco motorhome class month new van price buy payment
5 's topic : beer thirty patio fireside home rv sanity session chat good
6 's topic : makeup abdul tutorial youtuber peach look beauty love sweet skin
7 's topic : texas driving state park drive central llano road lake rv
8 's topic : florida beach melbourne place live vero rv mobile sale good
9 's topic : mexico new mountains mountain ruidoso nm driving snow cloudcroft beautiful
10 's topic : dna ethiopia ancestry results youtuber harar african abdul africa dire
11 's topic : office ramadan updates harun daily bedroom universal youtuber home event
12 's topic : cops rest sleeping knocking lesson window story came stop cop
1