Instead of a bag-of-words model, I'll try a transformer-based approach. In this case, I'll use BERTopic, which embeds each document with a pretrained transformer model and clusters the embeddings into topics.

In [5]:
from bertopic import BERTopic
import pandas as pd

In [2]:
df = pd.read_csv(r"C:\Users\Philippa\Documents\GitHub\crystalline-mining\extracted_pdf_info.csv")

df.head()

Unnamed: 0,pdf_text,metadata
0,['NATURAL CALCIUM CARBONATE FOR BIOMEDICAL APP...,"{'Author': 'ismail - [2010]', 'Creator': 'Micr..."
1,['Proceedings of Machine Learning Research LEA...,"{'Author': '', 'CreationDate': 'D:202404290022..."
2,['Preprint Advances in Chemical Physics Vol. 1...,"{'CreationDate': 'D:20060424095709Z', 'ModDate..."
3,['Enhancing Drug-Drug Interaction Extraction f...,"{'Author': '', 'CreationDate': 'D:201805160029..."
4,['BERTChem-DDI : Improved Drug-Drug Interactio...,"{'Author': '', 'CreationDate': 'D:202012230158..."


In [6]:
docs = df["pdf_text"].values.copy()
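
The `pdf_text` cells shown by `df.head()` start with `['`, which suggests each one holds the string form of a Python list rather than plain text. If so, the brackets and quotes are passed to BERTopic as part of each document. A minimal sketch of one way to flatten such cells, assuming every cell is a valid list literal of strings (the helper name `flatten_pdf_text` is hypothetical and not part of the original notebook):

```python
import ast


def flatten_pdf_text(cell: str) -> str:
    """Turn "['page one', 'page two']" into "page one page two".

    Assumes the cell is the repr of a list of strings; anything that
    does not parse that way is returned unchanged.
    """
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return cell
    if isinstance(parsed, list):
        return " ".join(str(part) for part in parsed)
    return cell


# Toy stand-in for one value of df["pdf_text"].
example = "['First page text.', 'Second page text.']"
print(flatten_pdf_text(example))
```

With the real data this could be applied as `docs = df["pdf_text"].map(flatten_pdf_text).tolist()`, which also hands BERTopic a plain list of strings.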

In [19]:
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

In [8]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,20,-1_the_of_and_in,"[the, of, and, in, et, al, to, drug, for, netw...",[['Multimodal AI predicts clinical outcomes of...
1,0,61,0_the_and_of_in,"[the, and, of, in, to, is, for, drug, we, on]",[['A Deep Learning Approach to the Prediction ...
2,1,18,1_the_of_and_in,"[the, of, and, in, to, for, with, is, as, by]",[['Metal-Organic Frameworks in Semiconductor D...


In [10]:
topic_model.get_topic(1)

[('the', 0.1142819479938496),
 ('of', 0.08935438477025603),
 ('and', 0.07910074559299812),
 ('in', 0.05711684596057733),
 ('to', 0.05378035900483375),
 ('for', 0.03809003288126277),
 ('with', 0.03539240191385621),
 ('is', 0.03397271431433337),
 ('as', 0.024266565281732225),
 ('by', 0.02410571705239028)]

It seems that BERTopic only picked up stop words, even outside of the outlier topic (-1). I'll remove stop words and see whether the topics improve.
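
One possible explanation: as I understand it, BERTopic builds each topic's keywords from per-topic word counts using a class-based TF-IDF, roughly `tf * log(1 + A / f)`, where `tf` is a term's relative frequency within the topic, `f` its count across all topics, and `A` the average number of words per topic. Unlike classic IDF, the log term never reaches zero, so very frequent words can still come out on top. A toy sketch of that weighting with made-up text (the helper `ctfidf` is hypothetical and is not BERTopic's actual implementation):

```python
import math
from collections import Counter

# Made-up "topics", each standing in for the concatenation of its documents.
topic_texts = {
    0: "the drug and the protein bind in the cell",
    1: "the dna origami and the structures in the figure",
}


def ctfidf(texts):
    """Score every word in every topic with tf * log(1 + A / f)."""
    counts = {topic: Counter(text.split()) for topic, text in texts.items()}
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)
    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(counts)
    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores[topic] = {
            word: (k / n_words) * math.log(1 + avg_words / total[word])
            for word, k in topic_counts.items()
        }
    return scores


scores = ctfidf(topic_texts)
for topic, word_scores in scores.items():
    top_words = sorted(word_scores, key=word_scores.get, reverse=True)[:3]
    print(topic, top_words)
```

In this toy input `the` still ranks first in both topics even though it appears in every topic, which matches the behaviour seen above.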

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

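# Drop English stop words before the topic keywords are computed.
vectorizer_model = CountVectorizer(stop_words="english")
# Note: BERTopic ignores n_gram_range when a custom vectorizer_model is passed;
# to get bigrams, set ngram_range=(1, 2) on the CountVectorizer instead.
topic_model_v2 = BERTopic(language="english", n_gram_range=(1, 2), vectorizer_model=vectorizer_model)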

In [15]:
topics, probs = topic_model_v2.fit_transform(docs)

In [16]:
topic_model_v2.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,41,-1_drug_et_al_network,"[drug, et, al, network, drugs, model, networks...",[['Multimodal AI predicts clinical outcomes of...
1,0,28,0_drug_graph_information_model,"[drug, graph, information, model, drugs, predi...",[['Predicting Drug-Drug Interactions using Dee...
2,1,19,1_dna_fig_figure_origami,"[dna, fig, figure, origami, chem, materials, r...","[['DNA origami Swarup Dey,1 Chunhai Fan,2,3 Ku..."
3,2,11,2_prediction_drug_protein_drugtarget,"[prediction, drug, protein, drugtarget, drugs,...",[['A Cross-Field Fusion Strategy for Drug–Targ...


In [18]:
topic_model_v2.get_topic(1)

[('dna', 0.026089953901695086),
 ('fig', 0.01861758362462085),
 ('figure', 0.018251200672335946),
 ('origami', 0.0168542680011874),
 ('chem', 0.015167035772732708),
 ('materials', 0.014003791410801953),
 ('release', 0.013500560717247723),
 ('si', 0.013469906150874095),
 ('structures', 0.01318654662812934),
 ('al', 0.013183829122821816)]

Search for the topics most similar to an input search term (here, "crystalline").

In [20]:
similar_topics, similarity = topic_model_v2.find_topics("crystalline", top_n=5)
topic_model_v2.get_topic(similar_topics[0])

[('dna', 0.026089953901695086),
 ('fig', 0.01861758362462085),
 ('figure', 0.018251200672335946),
 ('origami', 0.0168542680011874),
 ('chem', 0.015167035772732708),
 ('materials', 0.014003791410801953),
 ('release', 0.013500560717247723),
 ('si', 0.013469906150874095),
 ('structures', 0.01318654662812934),
 ('al', 0.013183829122821816)]
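
As far as I understand, `find_topics` embeds the search term with the model's embedding backend and ranks topics by cosine similarity to their topic embeddings. A toy sketch of that ranking step with hand-made 3-dimensional vectors (the vectors and the helper `rank_topics` are illustrative assumptions, not values or functions from this model):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank_topics(query_vec, topic_vecs, top_n=5):
    """Return (topic_ids, similarities), most similar topic first."""
    scored = sorted(
        ((cosine(query_vec, vec), topic) for topic, vec in topic_vecs.items()),
        reverse=True,
    )[:top_n]
    return [topic for _, topic in scored], [sim for sim, _ in scored]


# Invented embeddings: one vector per topic, plus one for the search term.
topic_vecs = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.2], 2: [0.7, 0.7, 0.0]}
query_vec = [0.1, 0.9, 0.1]

similar, sims = rank_topics(query_vec, topic_vecs, top_n=2)
print(similar, [round(s, 3) for s in sims])
```

A ranking like this always returns a best match, even for an unrelated term, so it is worth checking `similarity[0]` before reading much into the fact that "crystalline" maps to the DNA origami topic.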

In [22]:
topic_model_v2.visualize_heatmap()