### Topic Analysis
- If the full text of a paper is not available as a PDF file, its abstract is used instead.
- All abstract data were exported from Scopus in CSV format.

In [None]:
import sys
from os import getcwd
from os.path import abspath, dirname, join
sys.path.append(f'{getcwd()}/bibliometric')
import warnings
warnings.simplefilter('ignore', category=(UserWarning, FutureWarning, SyntaxWarning))

import pandas as pd

from bibliometric import tools, cleaner

BASE_PATH = abspath(getcwd())
PARENT_PATH = dirname(dirname(BASE_PATH))
RESOURCE_PATH = join(PARENT_PATH, 'resources')
target_file_path = join(RESOURCE_PATH, '20250711_scopus_work2.csv')

#### Load the CSV file of the journal "Environmental History", clean its abstracts, and remove custom stopwords.

In [2]:
df = pd.read_csv(target_file_path, encoding='utf-8', sep=',')
# Skip records for which Scopus provides no abstract.
abstracts = df.loc[df['Abstract'] != '[No abstract available]', 'Abstract'].to_list()
stopwords = []  # custom stopwords; none are defined for this run
texts = cleaner.clean_texts(texts=abstracts, stopwords=stopwords)

Texts are cleaned!
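
The internals of `cleaner.clean_texts` are not shown in this notebook. As a rough illustration only, a cleaning step of this kind could lowercase each abstract, keep alphabetic tokens, and drop stopwords. The function `simple_clean_texts` and the list `DEFAULT_STOPWORDS` below are hypothetical and are not the project's implementation, which may do more (for example lemmatization).

```python
import re

# Hypothetical default stopword list; the project's real list is not shown.
DEFAULT_STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}


def simple_clean_texts(texts, stopwords=()):
    """Lowercase each text, keep alphabetic tokens, and drop stopwords."""
    blocked = DEFAULT_STOPWORDS | {word.lower() for word in stopwords}
    cleaned = []
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        cleaned.append(" ".join(t for t in tokens if t not in blocked))
    return cleaned


sample = ["The history of forests in Europe.", "[No abstract available]"]
# Drop the Scopus placeholder before cleaning, as in the cell above.
usable = [t for t in sample if t != "[No abstract available]"]
print(simple_clean_texts(usable, stopwords=["europe"]))  # ['history forests']
```

Passing extra words through `stopwords` mirrors the empty `stopwords` list used above: corpus-specific terms can be added there without touching the default list.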


#### Fit the topic model on the cleaned abstracts.

In [3]:
topic_model = tools.generate_topic_model()
my_model = tools.fit_topic_model(
    topic_model=topic_model, texts=texts, file_name='20250711_scopus', use_reduce_outlier=False)

2025-08-19 11:14:21,824 - BERTopic - Embedding - Transforming documents to embeddings.


2025-08-19 11:14:47,365 - BERTopic - Embedding - Completed ✓
2025-08-19 11:14:47,367 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-08-19 11:14:56,569 - BERTopic - Dimensionality - Completed ✓
2025-08-19 11:14:56,571 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-08-19 11:14:56,731 - BERTopic - Cluster - Completed ✓
2025-08-19 11:14:56,733 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-08-19 11:14:57,321 - BERTopic - Representation - Completed ✓
2025-08-19 11:14:57,324 - BERTopic - Topic reduction - Reducing number of topics
2025-08-19 11:14:57,332 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-08-19 11:15:12,284 - BERTopic - Representation - Completed ✓
2025-08-19 11:15:12,287 - BERTopic - Topic reduction - Reduced number of topics from 31 to 31
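
The log above mentions c-TF-IDF, the class-based TF-IDF that BERTopic uses to pick representative words per topic. The sketch below is a simplified pure-Python version of that weighting, assuming the form `tf * log(1 + A / f)`, where `tf` is a term's count within a topic, `f` is its count across all topics, and `A` is the average number of words per topic. It uses whitespace tokenization and no normalization; the function names `c_tf_idf` and `top_terms` are hypothetical and this is not the library's own code.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score terms per topic with a simplified class-based TF-IDF.

    docs_per_topic maps a topic id to the list of cleaned documents in it.
    """
    # Treat each topic as one long document and count its terms.
    term_counts = {
        topic: Counter(" ".join(docs).split())
        for topic, docs in docs_per_topic.items()
    }
    # f: frequency of each term across all topics.
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)
    # A: average number of words per topic.
    avg_words = sum(total_counts.values()) / len(term_counts)

    return {
        topic: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for topic, counts in term_counts.items()
    }


def top_terms(scores, topic, n=3):
    """Return the n highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores[topic].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


docs = {
    0: ["speech recognition system", "automatic speech recognition"],
    1: ["machine translation system", "neural machine translation"],
}
scores = c_tf_idf(docs)
print(top_terms(scores, 0, n=2))  # ['recognition', 'speech']
```

In this toy example "system" occurs in both topics and therefore scores lower than "automatic", which occurs equally often in topic 0 but nowhere else.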


Get information about each topic including its ID, frequency, name and representative words.

In [4]:
my_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 162 | -1_corpus_research_technology_language | [corpus, research, technology, language, metho... | [network analysis historical correspondence fr... |
| 1 | 0 | 141 | 0_word recognition_word frequency_lexical deci... | [word recognition, word frequency, lexical dec... | [hebrew semitic language word formed non conca... |
| 2 | 1 | 107 | 1_description logic_logic program_argumentatio... | [description logic, logic program, argumentati... | [paper tackle fundamental question arising loo... |
| 3 | 2 | 62 | 2_speech system_speech recognition_automatic s... | [speech system, speech recognition, automatic ... | [ever increasing volume audio data available o... |
| 4 | 3 | 44 | 3_linguistic_discourse analysis_communicative_... | [linguistic, discourse analysis, communicative... | [study situated field pragmatic fiction audio ... |
| 5 | 4 | 40 | 4_design medium_techno aesthetic_artistic prac... | [design medium, techno aesthetic, artistic pra... | [article discusses three composition autonomou... |
| 6 | 5 | 36 | 5_sentiment analysis_opinion term_term opinion... | [sentiment analysis, opinion term, term opinio... | [past decade artificial intelligence ai techni... |
| 7 | 6 | 36 | 6_authorship_analysis text_authorship attribut... | [authorship, analysis text, authorship attribu... | [kepler book founded 1955 menlo park ca usa ma... |
| 8 | 7 | 26 | 7_preference ordering_preference function_pref... | [preference ordering, preference function, pre... | [existing literature social preference either ... |
| 9 | 8 | 26 | 8_translation system_machine translation_trans... | [translation system, machine translation, tran... | [important extension conventional text neural ... |


Visualize documents and their topics in 2D.

In [6]:
tools.plot_topics_of_documents(texts=texts, topic_model=my_model)

Visualize a hierarchical structure of the topics.

In [None]:
tools.plot_hierarchical_clustering(texts=texts, topic_model=my_model)

