# Materials and Points of Reference
### Similar Projects
- [Stanford Large Network Dataset](https://snap.stanford.edu/data/index.html#communities): a collection of 20+ large networks on topics ranging from academic collaboration (among physicists) to Facebook networks
- [Reddit Temporal Networks](https://cs.stanford.edu/~srijan/pubs/conflict-paper-www18.pdf)

# Methods and Metrics
### Community Detection
- [Girvan-Newman Algorithm](https://memgraph.github.io/networkx-guide/algorithms/community-detection/girvan-newman/): iteratively removes the edge with the highest edge betweenness, i.e. the edge that the largest number of shortest paths between node pairs pass through. As edges are removed one by one, the network breaks into smaller connected pieces, which are taken as the communities.
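- [Louvain Method](https://www.nature.com/articles/s41598-019-41695-z): finds non-overlapping communities by greedily maximizing graph *modularity*, a score that compares the density of edges inside communities with what would be expected if edges were placed at random.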
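
As a minimal sketch of the two methods above, the snippet below runs both on a toy graph. The graph (two triangles joined by a single bridge edge) is an assumption chosen for illustration, not project data, and the calls assume networkx 2.8 or later, where `louvain_communities` is available.

```python
import networkx as nx
from networkx.algorithms import community

# Toy graph (illustrative assumption): two triangles joined by one bridge edge.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])

# Girvan-Newman yields successively finer partitions; the first one is the
# split produced once removing high-betweenness edges disconnects the graph.
gn_first_split = sorted(sorted(c) for c in next(community.girvan_newman(G)))

# Louvain greedily maximizes modularity; the seed makes the run repeatable.
louvain = sorted(sorted(c) for c in community.louvain_communities(G, seed=0))

# Modularity of each partition (higher means denser within-community links).
gn_q = community.modularity(G, [set(c) for c in gn_first_split])
louvain_q = community.modularity(G, [set(c) for c in louvain])

print(gn_first_split, round(gn_q, 3))
print(louvain, round(louvain_q, 3))
```

Here the bridge edge carries every shortest path between the two triangles, so Girvan-Newman removes it first. On a real co-authorship graph one would typically keep iterating the Girvan-Newman generator and pick the split with the best modularity.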

# Goals

### Static Network Analysis
1. Acquire data from selected journals 
2. Preprocess data (string cleaning, matching institutions to known entities)
3. Geolocate institutions
4. Rank nodes by centrality
5. Apply Girvan-Newman and Louvain community detection
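
Steps 3 and 4 could be sketched roughly as follows. Everything here is assumed for illustration: the institution names, coordinates, and edges are hypothetical, `haversine_km` is a standard-library stand-in for the `haversine` package, and degree centrality is just one possible choice since the plan does not name a specific centrality measure.

```python
from collections import defaultdict
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


# Hypothetical institutions with approximate coordinates (illustrative only).
coords = {
    "Inst A": (48.8566, 2.3522),
    "Inst B": (51.5074, -0.1278),
    "Inst C": (40.7128, -74.0060),
    "Inst D": (35.6762, 139.6503),
}
# Hypothetical co-authorship links between those institutions.
edges = [("Inst A", "Inst B"), ("Inst A", "Inst C"), ("Inst A", "Inst D"), ("Inst B", "Inst C")]

# Step 3: once institutions are geolocated, attach a distance to every edge.
edge_km = {(u, v): haversine_km(coords[u], coords[v]) for u, v in edges}

# Step 4: degree centrality = degree / (n - 1), ranked from most to least central.
degree = defaultdict(int)
for u, v in edges:
    degree[u] += 1
    degree[v] += 1
centrality = {n: degree[n] / (len(coords) - 1) for n in coords}
ranking = sorted(centrality, key=centrality.get, reverse=True)

print(ranking)
```

With a real graph the same ranking could come from `nx.degree_centrality` or another networkx centrality function; the hand-rolled version only keeps the sketch free of dependencies.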


# Past Week
1. Bug fixes
2. Brief foray into similar projects
3. Examined new techniques

In [16]:
# file reading/writing 
import storage 
import csv

# analysis 
import networkx as nx
from haversine import haversine

# standard plotting 
import seaborn 
import matplotlib.pyplot as plt

# mapping 
import folium
from IPython.display import display, IFrame
from folium.plugins import HeatMap
from folium.plugins import MarkerCluster
from shapely.geometry import shape, Point


# standard utility
import numpy as np
import random
import json
import pandas as pd

# text processing
import spacy
from bertopic import BERTopic


In [28]:
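# load all stored articles; the second field of each record is used as the document text
articles = storage.retrieve_all_articles()
print(len(articles))
docs = [i[1] for i in articles]

# spaCy pipeline with the listed components excluded (loaded here but not used below)
nlp = spacy.load('en_core_web_sm', exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
print(docs)

# NOTE: BERTopic.load returns a fitted BERTopic model, not an embedding backend,
# so passing it as embedding_model is probably not what was intended
model = BERTopic.load("MaartenGr/BERTopic_ArXiv")
topic_model = BERTopic(embedding_model=model)

# fit on the documents; topics holds one topic id per document
topics, probs = topic_model.fit_transform(docs)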



325
['Diabetes mellitusProgress and opportunities in the evolving epidemic.', '50 years of metabolism research at Cell.', 'Pancancer singlecell dissection reveals phenotypically distinct B cell subtypes.', 'MYBrelated transcription factors control chloroplast biogenesis.', 'Innate immune memory after brain injury drives inflammatory cardiac dysfunction.', 'Presynaptic sensor and silencer of peptidergic transmission reveal neuropeptides as primary transmitters in pontine fear circuit.', 'Integrated cryoEM structure of a spumaretrovirus reveals crosskingdom evolutionary relationships and the molecular basis for assembly and virus entry.', 'Allogeneic CD19targeted CART therapy in patients with severe myositis and systemic sclerosis.', 'The WDR11 complex is a receptor for acidicclustercontaining cargo proteins.', 'Molecular and cellular mechanisms of teneurin signaling in synaptic partner matching.', 'Assembly and activation of EBV latent membrane protein 1.', 'Threedimensional genome arch

In [29]:
print(probs.shape)

(325,)


In [30]:
print(len(topics))
print(topics[2])

325
1
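
A quick way to read the `topics` list is to tally how many documents land in each topic. The sketch below uses a short hypothetical list in place of the real 325 assignments; the only BERTopic-specific fact it relies on is that the label -1 marks documents treated as outliers.

```python
from collections import Counter

# Hypothetical topic ids standing in for the `topics` list from fit_transform.
topics = [1, 0, 1, -1, 2, 0, 1, -1]

counts = Counter(topics)
n_outliers = counts.pop(-1, 0)        # -1 marks documents left unassigned
topic_sizes = counts.most_common()    # [(topic_id, n_docs), ...], largest first
outlier_share = n_outliers / len(topics)

print(topic_sizes, outlier_share)
```

On a fitted model, `topic_model.get_topic_info()` reports comparable per-topic counts along with topic names.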
