<a href="https://colab.research.google.com/github/cbadenes/notebooks/blob/main/probabilistic_topic_models/stripnet_cord19.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and STriP Net:

Remember to enable GPU (CUDA) support! (`Runtime > Change runtime type > Hardware accelerator > GPU`)

In [2]:
!pip install --no-deps stripnet==0.0.6
!pip install transformers
!pip install sentence-transformers==2.1.0
!pip install bertopic==0.9.4
!pip install pyvis==0.1.9
!pip install dimcli plotly networkx pyvis jsonpickle

Collecting stripnet==0.0.6
  Downloading stripnet-0.0.6-py3-none-any.whl (12 kB)
Installing collected packages: stripnet
Successfully installed stripnet-0.0.6
Collecting transformers
  Downloading transformers-4.16.2-py3-none-any.whl (3.5 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting tokenizers!=0.11.3,>=0.10.1
  Downloading tokenizers-0.11.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.8 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)

# Prepare the data

* STriP Net can run on any `pandas` dataframe column containing text.
* In the current version (0.0.6), it requires a single column that combines the title and abstract of each paper, separated by the `[SEP]` keyword.


In [12]:
from google.colab import data_table
data_table.enable_dataframe_formatter()

# Load some data
import pandas as pd


# small sample
sample_data = "https://delicias.dia.fi.upm.es/nextcloud/index.php/s/gtt2mZGPAjFee8s/download"
# medium size sample
#sample_data = "https://delicias.dia.fi.upm.es/nextcloud/index.php/s/R2FyYtBRrN23T3r/download"
# large size sample
#sample_data = "https://delicias.dia.fi.upm.es/nextcloud/index.php/s/kjPAkyBJLjHBCqZ/download"


data = pd.read_csv(sample_data, sep=',', header=0, on_bad_lines='skip')

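# Keep only the title and abstract columns.
# .copy() gives an independent dataframe, so adding a column below
# does not trigger pandas' SettingWithCopyWarning.
data = data[['title', 'abstract']].copy()

# Concatenate the title and abstract columns, separated by the [SEP] keyword
data['text'] = data['title'] + '[SEP]' + data['abstract']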

data_table.DataTable(data, include_index=False, num_rows_per_page=5)

| | title | abstract | text |
|---|---|---|---|
| 0 | Adapting the UK Biobank brain imaging protocol... | SARS-CoV-2 infection has been shown to damage ... | Adapting the UK Biobank brain imaging protocol... |
| 1 | COVID-19 assessment in family practice-A clini... | The study aimed to evaluate the diagnostic acc... | COVID-19 assessment in family practice-A clini... |
| 2 | Herding and feedback trading in cryptocurrency... | This paper examines the extent to which herdin... | Herding and feedback trading in cryptocurrency... |
| 3 | Blockade of the C5a-C5aR axis alleviates lung ... | The pathogenesis of highly pathogenic Middle E... | Blockade of the C5a-C5aR axis alleviates lung ... |
| 4 | In silico prediction of toxicity and its appli... | Objective and methods This study reviewed the ... | In silico prediction of toxicity and its appli... |
| ... | ... | ... | ... |
| 663 | Hypoxia-induced inflammation: Profiling the fi... | Central nervous system and visual dysfunction ... | Hypoxia-induced inflammation: Profiling the fi... |
| 664 | Selectively caring for the most severe COVID-1... | 1 Abstract SARS-CoV-2, the virus responsible f... | Selectively caring for the most severe COVID-1... |
| 665 | Economic influences on population health in th... | • The United States is in the midst of a 40-ye... | Economic influences on population health in th... |
| 666 | Title: Solid non-lung organs from COVID-19 don... | Tweet: "We think that non-lung grafts from COV... | Title: Solid non-lung organs from COVID-19 don... |


# Generate the network of similar documents

* Create a network based on the topic distributions of the documents. If you are not satisfied with the topics you get, retrain the topic model by tweaking these parameters:

    * `min_topic_size` (int, optional): The minimum number of documents per topic. Increasing this value leads to fewer clusters/topics. Defaults to 10.
    * `n_gram_range` (tuple(min_n, max_n), optional): The n-gram range for the CountVectorizer, i.e. the lower and upper bounds on n for the word or character n-grams to extract. All values of n such that min_n <= n <= max_n are used. For example, an `n_gram_range` of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Keeping both values between 1 and 3 is advised; larger values are likely to cause memory issues. Defaults to (1, 1).
    * `stop_words` (str, list, optional): The sklearn stopwords to use. Defaults to 'english'.
    * `threshold` (float, optional): Minimum cosine similarity required to draw a link on the network. The default value `None` uses an internally calculated threshold.
    * `remove_isolated_nodes` (bool, optional): `True` removes any nodes that have 0 edges. Defaults to False.
    * `max_connections` (int, optional): Maximum number of connections to allow in the network. The actual value used might be lower than this due to internal calculations. Defaults to `None`, which uses an internally generated heuristic for `max_connections`.
    * `verbose` (bool, optional): Defaults to True.
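
To make the `n_gram_range` bounds concrete, here is a plain-Python sketch of which word n-grams a given range selects. It only illustrates the definition above; it is not how `CountVectorizer` is implemented, and the helper name `word_ngrams` is made up for this example.

```python
def word_ngrams(text, n_gram_range=(1, 1)):
    """Return every word n-gram with min_n <= n <= max_n (hypothetical helper)."""
    min_n, max_n = n_gram_range
    tokens = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


sample = "hypoxia induced inflammation"
unigrams = word_ngrams(sample, (1, 1))
uni_and_bigrams = word_ngrams(sample, (1, 2))
bigrams = word_ngrams(sample, (2, 2))

print(unigrams)
print(uni_and_bigrams)
print(bigrams)
```

Note how the vocabulary grows as the upper bound rises, which is why large ranges can become memory-hungry on bigger corpora.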

In [13]:
# Instantiate the StripNet
from stripnet import StripNet
stripnet = StripNet()

# Run the StripNet pipeline
stripnet.fit_transform(data['text'], 
                       min_topic_size=10, 
                       n_gram_range=(1,1), 
                       stop_words='english', 
                       threshold=None, 
                       remove_isolated_nodes=False, 
                       max_connections=None, 
                       verbose=True)

2022-02-23 14:56:52 INFO: Load pretrained SentenceTransformer: allenai-specter
2022-02-23 14:56:55 INFO: Use pytorch device: cuda



2022-02-23 14:57:30 INFO: Initializing the topic model
2022-02-23 14:57:30 INFO: Training the topic model
2022-02-23 14:57:37,262 - BERTopic - Reduced dimensionality with UMAP
2022-02-23 14:57:37,299 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2022-02-23 14:57:37 INFO: Populating Topic Results
2022-02-23 14:57:37 INFO: Cosine similarity
2022-02-23 14:57:37 INFO: Calculating optimal threshold
2022-02-23 14:57:37 INFO: Number of connections: 10128
2022-02-23 14:57:37 INFO: Calculating Network Plot


Internally, StripNet builds a topic model based on [BERTopic](https://github.com/MaartenGr/BERTopic), and the topics can be read directly from that model. Topic `-1` groups all outliers and should typically be ignored.

In [14]:
model = stripnet.bertopic_model
data_table.DataTable(model.get_topic_info(), include_index=False, num_rows_per_page=5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | 0 | 377 | 0_patients_sarscov2_covid19_virus |
| 1 | -1 | 88 | -1_covid19_data_pandemic_study |
| 2 | 1 | 79 | 1_health_students_study_covid19 |
| 3 | 2 | 51 | 2_covid19_model_cases_pandemic |
| 4 | 3 | 39 | 3_data_covid19_preventive_practices |
| 5 | 4 | 34 | 4_policy_health_water_global |


The most relevant words for a given topic can also be retrieved:

In [15]:
model.get_topic(1)[:10]

[('health', 0.028668460097845575),
 ('students', 0.02849113797467941),
 ('study', 0.026836301268405915),
 ('covid19', 0.025622864105712025),
 ('pandemic', 0.023159503089996394),
 ('anxiety', 0.02284128617967548),
 ('participants', 0.020659643061269058),
 ('mental', 0.019721224398812784),
 ('results', 0.019685877820920305),
 ('social', 0.019636983543087672)]

The topic assigned to each document, together with its probability, is also available:

In [16]:
data_table.DataTable(stripnet.topic_data, include_index=False, num_rows_per_page=5)

| | Text | Topic | Probs | Topic_Count | Topic_Name |
|---|---|---|---|---|---|
| 0 | Adapting the UK Biobank brain imaging protocol... | 0 | 1.000000 | 377 | patients, sarscov2, covid19, virus |
| 1 | COVID-19 assessment in family practice-A clini... | 0 | 1.000000 | 377 | patients, sarscov2, covid19, virus |
| 2 | Herding and feedback trading in cryptocurrency... | 4 | 0.728904 | 34 | policy, health, water, global |
| 3 | Blockade of the C5a-C5aR axis alleviates lung ... | 0 | 0.958463 | 377 | patients, sarscov2, covid19, virus |
| 4 | In silico prediction of toxicity and its ap... | -1 | 0.000000 | 88 | covid19, data, pandemic, study |
| ... | ... | ... | ... | ... | ... |
| 663 | Hypoxia-induced inflammation: Profiling the fi... | 0 | 0.930844 | 377 | patients, sarscov2, covid19, virus |
| 664 | Selectively caring for the most severe COVID-1... | 2 | 0.736853 | 51 | covid19, model, cases, pandemic |
| 665 | Economic influences on population health in th... | 4 | 1.000000 | 34 | policy, health, water, global |
| 666 | Title: Solid non-lung organs from COVID-19 don... | 0 | 0.883538 | 377 | patients, sarscov2, covid19, virus |
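
The `Topic_Name` values look like the four highest-scoring words of each topic joined with commas (compare them with the `Name` column of `get_topic_info()`). A minimal sketch, assuming that is how the labels are derived; this is an inference from the outputs shown here, not documented StripNet behavior, and `topic_label` is a hypothetical helper.

```python
# Top words for topic 1, copied from the `model.get_topic(1)` output above
# (scores rounded for brevity).
topic_words = [
    ("health", 0.0287),
    ("students", 0.0285),
    ("study", 0.0268),
    ("covid19", 0.0256),
    ("pandemic", 0.0232),
]


def topic_label(words_with_scores, top_n=4):
    """Join the top_n highest-scoring words into a label (hypothetical helper)."""
    ranked = sorted(words_with_scores, key=lambda pair: pair[1], reverse=True)
    return ", ".join(word for word, _ in ranked[:top_n])


label = topic_label(topic_words)
print(label)
```

The same helper can be applied to `model.get_topic(k)` for any topic id `k` to get a quick human-readable summary.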


The graph is fully interactive! Have fun playing around by hovering over the nodes and moving them around!


In [18]:
from dimcli.utils.networkviz import NetworkViz # custom version of pyvis - colab-compatible
viznet = NetworkViz(notebook=True, width="80%", height="500px")
viznet.toggle_hide_edges_on_drag(True)
viznet.barnes_hut()
viznet.repulsion(300)
viznet.from_nx(stripnet.nx_net)
viznet.show("stripnet.html")

# Find the most important document

* After you fit the model using the steps above, you can plot the most important documents with one line of code.
* The plot is fully interactive too! Hovering over any bar shows the relevant information about the document.

In [19]:
stripnet.most_important_docs()

2022-02-23 14:59:36 INFO: Calculating Network Centrality
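
The log line shows that `most_important_docs()` relies on network centrality. Which centrality measure StripNet uses is not stated here, so the sketch below assumes plain degree centrality (a node's number of neighbours divided by `n - 1`) on a toy edge list. The document names and the `degree_centrality` helper are illustrative only.

```python
from collections import defaultdict


def degree_centrality(nodes, edges):
    """Degree centrality for an undirected graph without self-loops."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    scale = len(nodes) - 1
    return {node: len(neighbours[node]) / scale for node in nodes}


# Toy network: doc_a is linked to every other document.
nodes = ["doc_a", "doc_b", "doc_c", "doc_d"]
edges = [
    ("doc_a", "doc_b"),
    ("doc_a", "doc_c"),
    ("doc_a", "doc_d"),
    ("doc_b", "doc_c"),
]

scores = degree_centrality(nodes, edges)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0], scores[ranking[0]])
```

Under this assumption, the "most important" document is simply the one connected to the largest share of the other documents.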


# Common Issues

### Threshold

If your StripNet graph is just one big ball of moving fireflies, start by checking the value of `threshold` currently used by StripNet:

In [20]:
current_threshold = stripnet.threshold
print(current_threshold)

0.75


Increase the value of `threshold` in steps of 0.05 and try again until you see a good-looking network. Remember that the maximum value of `threshold` is 1!

If your threshold is already 0.95, try increasing it in steps of 0.01 instead.
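
To see why raising `threshold` thins the graph, here is a toy sketch that links two vectors only when their cosine similarity reaches the threshold. The vectors and the `count_links` helper are invented for illustration; StripNet computes similarities on its own document representations.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def count_links(vectors, threshold):
    """Count pairs whose cosine similarity is at least `threshold`."""
    return sum(
        1
        for u, v in combinations(vectors, 2)
        if cosine(u, v) >= threshold
    )


# Four made-up 2-D "document" vectors.
vectors = [(1.0, 0.0), (0.9, 0.1), (0.6, 0.8), (0.0, 1.0)]

link_counts = [(t, count_links(vectors, t)) for t in (0.5, 0.75, 0.95)]
for threshold, n_links in link_counts:
    print(f"threshold={threshold}: {n_links} links")
```

Each step up in `threshold` can only keep or remove links, never add them, so the network gets sparser as the value approaches 1.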

### Num Topics

If your dataset is small (fewer than 500 rows) and too few topics are generated, try lowering `min_topic_size` below its default value of 10 until the topics look reasonable to you.

### Isolated Nodes

After the above two steps, if your graph still looks messy, try removing isolated nodes (nodes that don't have any connections).
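
Conceptually, removing isolated nodes just drops every node that appears in no edge. A minimal pure-Python sketch with made-up node names follows; in StripNet itself you would request this by passing `remove_isolated_nodes=True` to `fit_transform`, as listed in the parameters earlier.

```python
def drop_isolated(nodes, edges):
    """Keep only nodes that take part in at least one edge (hypothetical helper)."""
    connected = {node for edge in edges for node in edge}
    return [node for node in nodes if node in connected]


# doc_d has no links, so it is the only isolated node.
nodes = ["doc_a", "doc_b", "doc_c", "doc_d"]
edges = [("doc_a", "doc_b"), ("doc_b", "doc_c")]

kept = drop_isolated(nodes, edges)
print(kept)
```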


In practice, you might have to tweak all three at the same time!

# Repeat the process varying some of the parameters

How does each change influence the result?