In [0]:
!pip install bertopic


In [0]:
import spacy

In [0]:
!python -m spacy download en_core_web_lg 

## Load the data and apply only minimal preprocessing, because
### 1. we will use a sentence transformer as the embedding model for BERTopic, and
### 2. unlike conversational text, the language in this dataset is information-dense, so little preprocessing is needed to filter out noise.

In [0]:
import pandas as pd
test_df=pd.read_csv("/dbfs/FileStore/articles_test.csv")
train_df=pd.read_csv("/dbfs/FileStore/articles_train.csv")

In [0]:
import re
def reflection_tokenizer(text):
    '''Expects a string and returns it lower-cased, with runs of non-alphanumeric
        characters (including underscores) replaced by a space and digits removed.'''
    text=re.sub(r'[\W_]+', ' ', text) #replaces runs of non-alphanumeric characters and underscores with a space
    text=re.sub(r'\d+', '', text) #removes digits
    text = text.lower()
    return text
train_df['preprocessed_abstract']=train_df['ABSTRACT'].apply(reflection_tokenizer)
test_df['preprocessed_abstract']=test_df['ABSTRACT'].apply(reflection_tokenizer)
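As a quick check of what this cleaning does, the sketch below applies the same two substitutions to a fragment of one abstract shown in this notebook. The helper name `clean_abstract` is hypothetical, and the final whitespace collapse is an assumption added for readability; `reflection_tokenizer` itself does not perform it.

```python
import re

def clean_abstract(text: str) -> str:
    """Same substitutions as reflection_tokenizer, plus a whitespace collapse."""
    text = re.sub(r"[\W_]+", " ", text)   # non-alphanumeric runs and underscores -> space
    text = re.sub(r"\d+", "", text)       # drop digits
    text = re.sub(r"\s+", " ", text)      # assumption: collapse repeated spaces
    return text.strip().lower()

sample = r"We model large-scale ($\approx$2000km) impacts"
cleaned = clean_abstract(sample)
print(cleaned)
```

Note that LaTeX commands such as `\approx` survive as plain words ("approx"), and units lose their numbers ("2000km" becomes "km").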

In [0]:
import numpy as np
train_df["label"]=np.where(train_df["Computer Science"]==1,'Computer Science',np.where(train_df['Physics']==1,'Physics',np.where(train_df['Mathematics']==1,'Mathematics',np.where(train_df['Statistics']==1,'Statistics',np.where(train_df['Quantitative Biology']==1,'Quantitative Biology','Quantitative Finance')))))
train_df['label_num']=train_df['label'].map({'Computer Science': 0, 'Physics': 1,'Mathematics':2, 'Statistics':3, 'Quantitative Biology':4, 'Quantitative Finance':5})
train_df=train_df.drop(columns=['Computer Science','Physics','Mathematics','Statistics','Quantitative Biology','Quantitative Finance'])
train_df.head(20)

Unnamed: 0,ID,TITLE,ABSTRACT,preprocessed_abstract,label,label_num
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,predictive models allow subject specific infe...,Computer Science,0
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,rotation invariance and translation invarianc...,Computer Science,0
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,we introduce and develop the notion of spheri...,Mathematics,2
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...,the stochastic landau lifshitz gilbert llg eq...,Mathematics,2
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,fourier transform infra red ftir spectra of s...,Computer Science,0
5,6,On maximizing the fundamental frequency of the...,Let $\Omega \subset \mathbb{R}^n$ be a bound...,let omega subset mathbb r n be a bounded doma...,Mathematics,2
6,7,On the rotation period and shape of the hyperb...,We observed the newly discovered hyperbolic ...,we observed the newly discovered hyperbolic m...,Physics,1
7,8,Adverse effects of polymer coating on heat tra...,The ability of metallic nanoparticles to sup...,the ability of metallic nanoparticles to supp...,Physics,1
8,9,SPH calculations of Mars-scale collisions: the...,We model large-scale ($\approx$2000km) impac...,we model large scale approx km impacts on a m...,Physics,1
9,10,$\mathcal{R}_{0}$ fails to predict the outbrea...,Time varying susceptibility of host at indiv...,time varying susceptibility of host at indivi...,Quantitative Biology,4
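The nested np.where call in the preceding cell assigns each paper the first category, in a fixed priority order, whose indicator column equals 1; a row with none of the first five indicators set falls through to 'Quantitative Finance'. A minimal pure-Python sketch of that rule follows. The toy rows and the helper name `collapse_label` are hypothetical and are not taken from the dataset.

```python
# Priority order copied from the nested np.where call in this notebook.
PRIORITY = ["Computer Science", "Physics", "Mathematics",
            "Statistics", "Quantitative Biology"]
DEFAULT = "Quantitative Finance"  # final fallback of the nested np.where

def collapse_label(row: dict) -> str:
    """Return the first category in PRIORITY whose indicator is 1."""
    for name in PRIORITY:
        if row.get(name, 0) == 1:
            return name
    return DEFAULT

# Hypothetical rows: single-label, multi-label, and no indicator set.
toy_rows = [
    {"Physics": 1},
    {"Computer Science": 1, "Statistics": 1},
    {"Mathematics": 1, "Statistics": 1},
    {"Quantitative Finance": 1},
    {},
]
labels = [collapse_label(row) for row in toy_rows]
print(labels)
```

If a paper carries more than one category flag, only the highest-priority one is kept, so the secondary labels are not available to the topic model as supervision.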


## Semi-supervised Topic Modeling:
### Use a semi-supervised BERTopic model to nudge topic creation toward pre-specified topics.
### Semi-supervised modeling lets us steer the dimensionality reduction of the embeddings into a space that closely follows the labels we already have.

For the semi-supervised approach, we pass pre-defined labels to the y parameter when fitting BERTopic. These labels can be pre-defined topics or simply groups of documents that you feel belong together regardless of their content. BERTopic then nudges topic creation toward these categories.

When every document has a label, as in this notebook, the approach becomes supervised topic modeling, and we simply pass all categories:

topic_model = BERTopic(verbose=True).fit(docs, y=categories)
The resulting topic model is much more attuned to the categories defined beforehand.
### However, this does not mean that only topics for these categories will be found. BERTopic is likely to find more specific topics within the categories we have already defined, which lets us discover previously unknown topics.
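When only some documents should guide the model, the remaining ones can be marked as unlabeled. The sketch below assumes the convention from the BERTopic documentation that -1 denotes a document without a label; the helper name `mask_labels` and the toy label list are hypothetical.

```python
def mask_labels(labels, keep):
    """Keep labels listed in `keep`; mark every other document as unlabeled (-1)."""
    return [label if label in keep else -1 for label in labels]

# Toy stand-in for train_df['label_num'] (0 = Computer Science, 1 = Physics, ...).
label_num = [0, 1, 2, 3, 4, 5, 1, 0]

# Guide the model with Physics (1) and Mathematics (2) only.
partial_y = mask_labels(label_num, keep={1, 2})
print(partial_y)

# Assumed usage, mirroring the fit call quoted above:
# topic_model = BERTopic(verbose=True).fit(docs, y=partial_y)
```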

In [0]:
print(train_df['ABSTRACT'].count())
print(test_df['ABSTRACT'].count())

20972
8989


In [0]:
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.representation import PartOfSpeech
from bertopic.vectorizers import ClassTfidfTransformer

# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # alternative model: all-mpnet-base-v2

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(ngram_range=(1, 3))

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

# Step 6 - (Optional) Fine-tune topic representations with KeyBERTInspired,
# a `bertopic.representation` model
representation_model = KeyBERTInspired()

# # Step 6 (alternative) - Fine-tune topic representations with part-of-speech patterns
# pos_patterns = [
#             [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
#             [{'POS': 'NOUN'}], [{'POS': 'ADJ'}],
#             [{'POS': 'ADJ'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}],
#             [{'POS': 'NOUN'}],
#             [{'POS': 'NOUN'},{'POS': 'NOUN'}]

# ]
# representation_model = PartOfSpeech("en_core_web_lg", pos_patterns=pos_patterns)

# All steps together
min_topic_model_processed = BERTopic(
  min_topic_size=100,                       # not applied here: the custom hdbscan_model (min_cluster_size=15) takes precedence
  language="english", 
  top_n_words=20, 
  calculate_probabilities=True,
  n_gram_range=(1, 3),                      # not applied here: the custom vectorizer_model sets ngram_range itself
  verbose=True,
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)

min_topics_processed, min_probs_processed = min_topic_model_processed.fit_transform(train_df['preprocessed_abstract'],y=train_df['label_num'])

Batches:   0%|          | 0/656 [00:00<?, ?it/s]

2023-04-06 00:10:30,283 - BERTopic - Transformed documents to Embeddings
2023-04-06 00:10:40,453 - BERTopic - Reduced dimensionality
2023-04-06 00:10:55,747 - BERTopic - Clustered reduced embeddings
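Step 5 builds the topic keywords with class-based TF-IDF. The sketch below is a simplified pure-Python version, assuming the weighting described in the BERTopic documentation: a term's frequency within a topic, multiplied by log(1 + A / f), where A is the average number of words per topic and f is the term's frequency across all topics. The toy topics and the function name `c_tf_idf` are hypothetical; the real ClassTfidfTransformer works on sparse matrices and has further options.

```python
import math
from collections import Counter

def c_tf_idf(topic_tokens):
    """topic_tokens maps a topic id to the tokens of all its documents joined together."""
    counts = {topic: Counter(tokens) for topic, tokens in topic_tokens.items()}
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    avg_words = sum(total.values()) / len(counts)  # A: average words per topic
    scores = {}
    for topic, counter in counts.items():
        n_words = sum(counter.values())
        scores[topic] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in counter.items()
        }
    return scores

# Hypothetical toy topics, one token list per topic.
toy = {
    0: "magnetic spin quantum phase spin the of".split(),
    1: "theorem proof theorem theory the of".split(),
}
scores = c_tf_idf(toy)
top_word = {topic: max(s, key=s.get) for topic, s in scores.items()}
print(top_word)
```

Words that occur in every topic, such as "the" and "of", receive a lower weight than topic-specific words with the same count, but they are not removed outright.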


In [0]:
freq = min_topic_model_processed.get_topic_info()
freq

Unnamed: 0,Topic,Count,Name
0,-1,4351,-1_learning_models_algorithms_networks
1,0,5485,0_magnetic_spin_quantum_phase
2,1,4404,1_this paper we_theorem_this paper_theory
3,2,380,2_semantics_logics_decidable_automata
4,3,280,3_deep learning_deep neural_deep neural networ...
...,...,...,...
107,106,18,106_knowledge graph embeddings_knowledge graph...
108,107,17,107_quantum channel_ideal quantum channel_quan...
109,108,17,108_aerial vehicles_tracking control_uav_attit...
110,109,16,109_signaling networks_protein complexes_signa...


In [0]:
freq.head(20)

Unnamed: 0,Topic,Count,Name
0,-1,4351,-1_learning_models_algorithms_networks
1,0,5485,0_magnetic_spin_quantum_phase
2,1,4404,1_this paper we_theorem_this paper_theory
3,2,380,2_semantics_logics_decidable_automata
4,3,280,3_deep learning_deep neural_deep neural networ...
5,4,243,4_slam_pose_vision_camera
6,5,217,5_graphs_of graph_graph is_complexity
7,6,172,6_arbitrage_markets_market_liquidity
8,7,167,7_mimo_transmit_antennas_wireless
9,8,156,8_speech recognition_automatic speech_automati...


In [0]:
min_topic_model_processed.get_topic(0)

Out[144]: [('magnetic', 0.30588767),
 ('spin', 0.22316842),
 ('quantum', 0.19854933),
 ('phase', 0.18576126),
 ('electron', 0.1781575),
 ('galaxies', 0.17560966),
 ('dynamics', 0.1731713),
 ('star', 0.14322568),
 ('energy', 0.12963197),
 ('transition', 0.118588366)]

In [0]:
min_topic_model_processed.get_topic(1)

Out[145]: [('this paper we', 0.29324612),
 ('theorem', 0.29166406),
 ('this paper', 0.23478912),
 ('theory', 0.22877887),
 ('in this paper', 0.20950463),
 ('conditions', 0.19376747),
 ('any', 0.18236612),
 ('of', 0.1759616),
 ('general', 0.16993664),
 ('on', 0.16932659)]

In [0]:
min_topic_model_processed.get_topic(2)

Out[146]: [('semantics', 0.45382175),
 ('logics', 0.4317517),
 ('decidable', 0.4185804),
 ('automata', 0.40899402),
 ('abstract', 0.3822445),
 ('logic', 0.35762882),
 ('automaton', 0.33087435),
 ('complexity', 0.30450425),
 ('concurrent', 0.29795772),
 ('formal', 0.28988218)]

In [0]:
min_topic_model_processed.visualize_topics()

In [0]:
min_topic_model_processed.visualize_hierarchy(top_n_topics=30)

In [0]:
min_topic_model_processed.visualize_barchart(top_n_topics=10)

In [0]:
min_topic_model_processed.visualize_heatmap(n_clusters=20, width=1000, height=1000)

In [0]:
topic_info=min_topic_model_processed.get_topic_info()
topic_info['mapping']=tuple(zip(topic_info.Topic, topic_info.Name))
topic_info

Unnamed: 0,Topic,Count,Name,mapping
0,-1,4351,-1_learning_models_algorithms_networks,"(-1, -1_learning_models_algorithms_networks)"
1,0,5485,0_magnetic_spin_quantum_phase,"(0, 0_magnetic_spin_quantum_phase)"
2,1,4404,1_this paper we_theorem_this paper_theory,"(1, 1_this paper we_theorem_this paper_theory)"
3,2,380,2_semantics_logics_decidable_automata,"(2, 2_semantics_logics_decidable_automata)"
4,3,280,3_deep learning_deep neural_deep neural networ...,"(3, 3_deep learning_deep neural_deep neural ne..."
...,...,...,...,...
107,106,18,106_knowledge graph embeddings_knowledge graph...,"(106, 106_knowledge graph embeddings_knowledge..."
108,107,17,107_quantum channel_ideal quantum channel_quan...,"(107, 107_quantum channel_ideal quantum channe..."
109,108,17,108_aerial vehicles_tracking control_uav_attit...,"(108, 108_aerial vehicles_tracking control_uav..."
110,109,16,109_signaling networks_protein complexes_signa...,"(109, 109_signaling networks_protein complexes..."


In [0]:
train_df['prediction']=np.array(min_topics_processed)
new_dict=dict(zip(topic_info['Topic'], topic_info['Name'])) #topic id -> topic name
print(new_dict)
train_df['prediction_name']=train_df['prediction'].map(new_dict)
train_df.head()

{-1: '-1_learning_models_algorithms_networks', 0: '0_magnetic_spin_quantum_phase', 1: '1_this paper we_theorem_this paper_theory', 2: '2_semantics_logics_decidable_automata', 3: '3_deep learning_deep neural_deep neural networks_imagenet', 4: '4_slam_pose_vision_camera', 5: '5_graphs_of graph_graph is_complexity', 6: '6_arbitrage_markets_market_liquidity', 7: '7_mimo_transmit_antennas_wireless', 8: '8_speech recognition_automatic speech_automatic speech recognition_speech enhancement', 9: '9_meta analysis_statistical_causal_the causal', 10: '10_brain networks_neural_of neural_neuron', 11: '11_classifiers_classifier_classification_multi label', 12: '12_word embeddings_sentiment analysis_natural language processing_corpus', 13: '13_stochastic gradient descent_stochastic optimization_stochastic gradient_gradient descent', 14: '14_microgrid_power flow_power systems_renewable energy', 15: '15_tweets_social media_twitter_on social media', 16: '16_linear codes_ldpc codes_decoding algorithm_dec

Unnamed: 0,ID,TITLE,ABSTRACT,preprocessed_abstract,label,label_num,prediction,prediction_name
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,predictive models allow subject specific infe...,Computer Science,0,-1,-1_learning_models_algorithms_networks
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,rotation invariance and translation invarianc...,Computer Science,0,3,3_deep learning_deep neural_deep neural networ...
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,we introduce and develop the notion of spheri...,Mathematics,2,1,1_this paper we_theorem_this paper_theory
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...,the stochastic landau lifshitz gilbert llg eq...,Mathematics,2,1,1_this paper we_theorem_this paper_theory
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,fourier transform infra red ftir spectra of s...,Computer Science,0,-1,-1_learning_models_algorithms_networks


In [0]:
train_df['label'].value_counts()

Out[165]: Computer Science        8594
Physics                 5521
Mathematics             4436
Statistics              1765
Quantitative Biology     447
Quantitative Finance     209
Name: label, dtype: int64

In [0]:
a=train_df[train_df['label']=='Computer Science']['prediction_name'].value_counts() #total 8594
a[a>50]

Out[166]: -1_learning_models_algorithms_networks                                                                3414
2_semantics_logics_decidable_automata                                                                  378
3_deep learning_deep neural_deep neural networks_imagenet                                              277
4_slam_pose_vision_camera                                                                              240
5_graphs_of graph_graph is_complexity                                                                  212
7_mimo_transmit_antennas_wireless                                                                      166
8_speech recognition_automatic speech_automatic speech recognition_speech enhancement                  154
11_classifiers_classifier_classification_multi label                                                   142
12_word embeddings_sentiment analysis_natural language processing_corpus                               142
13_stochastic gradient desc

In [0]:
b=train_df[train_df['label']=='Physics']['prediction_name'].value_counts() #total 5521
b[b>50] 

Out[159]: 0_magnetic_spin_quantum_phase    5458
Name: prediction_name, dtype: int64

In [0]:
c=train_df[train_df['label']=='Quantitative Biology']['prediction_name'].value_counts() #total 447
c[c>10]

Out[158]: 10_brain networks_neural_of neural_neuron                                                  141
-1_learning_models_algorithms_networks                                                      87
35_biased dispersal_of species_dispersal_ecological                                         64
50_intracellular_multicellular_extracellular_cells                                          49
69_folding simulation_molecular_molecule_the molecule                                       33
79_phylogenetic tree_gene trees_phylogenetic_phylogenetic networks                          26
98_epidemic model_epidemics_epidemic_of infection                                           22
109_signaling networks_protein complexes_signaling networks with_complexes and pathways     16
Name: prediction_name, dtype: int64

In [0]:
d=train_df[train_df['label']=='Quantitative Finance']['prediction_name'].value_counts() #total 209
d[d>10]

Out[160]: 6_arbitrage_markets_market_liquidity                               165
87_economic growth_economy_economic_international tax avoidance     24
-1_learning_models_algorithms_networks                              11
Name: prediction_name, dtype: int64

In [0]:
e=train_df[train_df['label']=='Statistics']['prediction_name'].value_counts() #total 1765
e[e>10]

Out[161]: -1_learning_models_algorithms_networks                                                                       780
9_meta analysis_statistical_causal_the causal                                                                146
29_of neural networks_deep neural networks_deep neural_neural networks                                        75
43_deep reinforcement_deep reinforcement learning_learning agents_reinforcement learning                      54
48_to adversarial examples_of adversarial examples_against adversarial_adversarial examples in                50
45_monte carlo mcmc_sequential monte carlo_markov chain monte_chain monte carlo                               50
46_classifiers_optimal classifier_classifier_classification                                                   50
49_matrix completion_low rank matrix_subspace clustering_component analysis                                   49
56_forecasting_deep learning_forecasts_predicting                                     

In [0]:
f=train_df[train_df['label']=='Mathematics']['prediction_name'].value_counts() #total 4436
f[f>10]

Out[164]: 1_this paper we_theorem_this paper_theory    4371
-1_learning_models_algorithms_networks         29
0_magnetic_spin_quantum_phase                  12
Name: prediction_name, dtype: int64
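To compare the labels on a common scale, the counts printed in the cells above can be turned into percentages. The sketch below copies those counts by hand; the variable and function names are hypothetical, and labels whose counts are not shown or are truncated in the outputs are left out.

```python
# Label totals from train_df['label'].value_counts().
label_totals = {
    "Computer Science": 8594, "Physics": 5521, "Mathematics": 4436,
    "Statistics": 1765, "Quantitative Biology": 447, "Quantitative Finance": 209,
}
# Size of the largest non-outlier topic per label, where the output shows it.
largest_topic = {
    "Physics": 5458, "Mathematics": 4371,
    "Quantitative Finance": 165, "Quantitative Biology": 141,
}
# Documents assigned to the outlier topic -1, where the output shows it.
outliers = {
    "Computer Science": 3414, "Statistics": 780, "Quantitative Biology": 87,
    "Mathematics": 29, "Quantitative Finance": 11,
}

def share(count, total):
    """Percentage of `total`, rounded to one decimal place."""
    return round(100 * count / total, 1)

top_share = {k: share(v, label_totals[k]) for k, v in largest_topic.items()}
outlier_share = {k: share(v, label_totals[k]) for k, v in outliers.items()}
print(top_share)
print(outlier_share)
```

On these counts, Physics and Mathematics each land almost entirely in a single topic, while a large share of the Computer Science and Statistics abstracts is assigned to the outlier topic.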