# Our Data
In this notebook, I show the data we use for the project. The data is available to us in [Dropbox](https://www.dropbox.com/scl/fo/ujlvwjenrugdnzusxb04p/AJY7nvba6xihgvzd0bjwCRk?rlkey=woyb4gojtiqhvazd6lablbj4g&st=gd1buut9&dl=0) (This was the link we used last summer, we'll likely keep using it, but that's why it's named "2024") and is from [Dimensions AI](https://www.dimensions.ai) as of 2021. In general, all fields are separated following the [ANZSRC standard](https://www.abs.gov.au/statistics/classifications/australian-and-new-zealand-standard-research-classification-anzsrc/latest-release), and we work with them one at a time. The parent folder in Dropbox has three folders in it:
1. **notebooks:** This has a bunch of example notebooks for various things. You can look through these if you want. If you're unfamiliar with networks, I especially suggest looking at the "crash_courses/networks" folder.
2. **msi_demo:** Don't worry about this right now. We'll look at it later.
3. **data:** Where the data is stored. This is what I'll focus on now.

One extra thing I'll add now (since I don't know where else to put it) is that you don't need to download the data. Instead, you can copy a link to the file and use that. For example, a link from dropbox to one of the files is:

https://www.dropbox.com/scl/fi/a1t16rtialcw03n50ffkc/concepts_Zoology_608.csv.gz?rlkey=vjv60sfbhofbgvzfzdkrlurl1&st=eol4x1dm&dl=0

To use it in Python, you change the "dl=0" at the end to "dl=1" which makes it automatically download when you hit the link and use the link as the file path. If it's compressed, you'll need to manually enter the compression information. For example, you can use:
```
pd.read_csv('https://www.dropbox.com/...&dl=1', compression='gzip')
```

## Config

In [15]:
# load some packages
from matplotlib.text import Text
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
import numpy as np

# Disruption Data
The **disruption_data_from_tom** folder has information on the importance of all papers in the dataset. It's big (almost 5gb as a parquet), but contains a lot of info.

Each article has an ID that matches the ID in the other data and various statistics. Statistics each have a letter and number. The letter signifies what is is. For example:

* `b`: Backward citations. The number of articles that paper cites.
* `i`: Forward citations. The number of articles that cite that paper.
* `cd`: The CD index from [Funk and Owen-Smith (2016)](https://pubsonline.informs.org/doi/abs/10.1287/mnsc.2015.2366). A scale from -1 to 1. A value of -1 means a paper is consolidating, meaning it combines ideas from many other papers. A value of 1 means it's disruptive, meaning it introduces ideas that replace other work.
* `cdnok`: "CD no K". The CD index without a k part. Similarly goes from -1 to 1, just is more likely to be closer to an extreme value.

The number is the number of years that is calculated for. For example, `i10` is the number of citations a paper gets after 10 years. 

# Citation Data
The "dimensions_2021_09_01_pull_20250425" has data on article citations to create a citation network. This is new, so I'm not as familiar with it as the other data, but is part of what we'll be working with this summer.

# Collaboration Network Data
Information to make a collaboration network is available in the "dimensions_2021_09_01_pull_1/collaboration_network_data" folder.

Contains data to make a collaboration network for researchers that have worked together. Each row represents an article and author pair: the `article_id` is the article and the `researcher_id` is the researcher.

We haven't used these yet, but may at some point.

In [16]:
pd.read_csv(
        'https://www.dropbox.com/scl/fi/ccpg44dt73ga24p45a0u3/collaborations_Zoology_608.csv.gz?rlkey=8kr35ztwfztz36qww7v1fr926&st=xcxd6s6d&dl=1',
        compression='gzip',
    )

Unnamed: 0,article_id,year,researcher_id
0,pub.1100184428,2018.0,ur.07500112735.28
1,pub.1100184428,2018.0,ur.0775274140.76
2,pub.1014429482,2008.0,ur.012645135765.17
3,pub.1014429482,2008.0,ur.016413706041.84
4,pub.1071171019,1994.0,ur.01216432663.31
...,...,...,...
1134648,pub.1028766873,2008.0,ur.0747722674.67
1134649,pub.1063382447,2016.0,ur.01114402740.30
1134650,pub.1063382447,2016.0,ur.01137537003.00
1134651,pub.1063382447,2016.0,ur.0577704134.35


# Articles Data
Data on articles is available in the "dimensions_2021_09_01_pull_1/articles_data" folder. The "applied_mathematics_piloting/dimensions_2021_09_01_articles_category_for_2l_code_102.*" files also contain the same articles data, just for an older pull on just the applied mathematics field (Field 102). 

This contains a lot of information on each article, including its date of publication, year, journal, title, and such. We haven't done much with this either, but it also contains the full abstract text if we wanted to do any NLP concept parsing stuff with it.

In [3]:
pd.read_csv(
        'https://www.dropbox.com/scl/fi/0bsgnsgb1rr1s143cgyd1/articles_Zoology_608.csv.gz?rlkey=emlv5fy52dz0d2rnodygjy9pp&st=kehlrig7&dl=1',
        compression = 'gzip',
    ).columns

  pd.read_csv(


Index(['article_id', 'year', 'date', 'doi', 'volume', 'issue', 'pages',
       'title_preferred', 'abstract_preferred', 'journal_title',
       'citations_count', 'metrics_times_cited', 'metrics_recent_citations',
       'metrics_field_citation_ratio', 'metrics_relative_citation_ratio',
       'altmetrics_score', 'nauthors', 'npatents_citing',
       'ncategory_for_l1_codes', 'ncategory_for_l2_codes',
       'pg_abstract_preferred_lang_code', 'pg_abstract_preferred_lang_conf'],
      dtype='object')

# Concept Data
The concept data is mostly what we'll be working with. It's available in the "dimensions_2021_09_01_pull_1/concept_network_data" folder for different fields and the "applied_mathematics_piloting/articles_category_for_2l_abstracts_concepts_processed_v1_EX_102.csv.gz" file for just the applied mathematics field.

Each row represents a concept that occurs in a specific paper. It has the following columns:

| Column | Meaning |
| --- | --- |
| `article_id` | Unique identifier for each article. Same as the other files to merge on. |
| `year` | Article's year of publication |
| `concept` | The concept that shows up in the article's abstract |
| `relevance_mean` | Each occurrence of a concept has a relevance score between 0 and 1 for how associated it is with the text around it. This is the mean of that score across all occurrences. |
| `concept_freq_in_abstract` | The number of times that concept shows up in the paper's abstract. |
| `concept_no` | The number of concepts before it in that paper's abstract |
| `dfreq_in_category_for_2l` | The number of times that concept shows up in the field |
| `dfreq_in_category_for_2l_year` | The number of times that concept shows up in the field in that year |

We primarily use the first 4.

In [4]:
# load the data
df = pd.read_csv(
        'https://www.dropbox.com/scl/fi/a1t16rtialcw03n50ffkc/concepts_Zoology_608.csv.gz?rlkey=vjv60sfbhofbgvzfzdkrlurl1&st=ciu77f72&dl=1',
        compression='gzip',
    )

df

Unnamed: 0,article_id,year,concept,relevance_mean,concept_freq_in_abstract,concept_no,dfreq_in_category_for_2l,dfreq_in_category_for_2l_year
0,pub.1050083629,1984,mfvsg,0.0,1,0,2,1
1,pub.1050083629,1984,svsg,0.0,1,1,1,1
2,pub.1091469564,1992,1972 random parasitoid model,0.0,1,0,1,1
3,pub.1117742847,2001,198697 lobster research cruise,0.0,1,0,1,1
4,pub.1043257859,2001,1986–97 lobster research cruise,0.0,1,0,2,2
...,...,...,...,...,...,...,...,...
18064866,pub.1000238039,2015,’s worker population size,0.0,1,40,1,1
18064867,pub.1005449634,2012,’s workforce swarm fraction depart,0.0,1,42,1,1
18064868,pub.1134463678,2021,’s bad crop pest,0.0,1,57,1,1
18064869,pub.1083920773,2017,’s bad invasive,0.0,1,54,1,1


## Filtering
We filter the data for a couple of things. All thresholds can be moved if another one is better, these are just what we're using now.

First, we keep only relevant enough concepts. We have used a relevance threshold of 0.7.

Second, we keep only new enough papers. We start science at 1920, before which there are very few papers.

Third, we want to get rid of concepts that are too rare (likely typos) and too ubiquitous (too vague for use, two things using a "mathematical model," a real concept, doesn't mean there's not a knowledge gap between them). We use concepts that show up in between 0.01% and 0.1% of papers (this is what Adam is working on, so it will definitely change).

In [5]:
# config
MIN_RELEVANCE = 0.7
MIN_YEAR = 1920
MAX_YEAR = 2021  # when the data is from
MIN_CONCEPT_FREQ = 0.0001
MAX_CONCEPT_FREQ = 0.001

# relevance filtering
df = df[df['relevance_mean'] >= MIN_RELEVANCE]

# year filtering
df = df[df['year'] >= MIN_YEAR]

# counts
num_articles = df['article_id'].nunique()
concept_freq = df.groupby('concept').transform('size') / num_articles
df = df[(concept_freq >= MIN_CONCEPT_FREQ) & (concept_freq <= MAX_CONCEPT_FREQ)]

# remove columns we dont care about
df = df[['article_id', 'concept', 'year']]

df

Unnamed: 0,article_id,concept,year
18622,pub.1039451177,akh receptor,2002
18623,pub.1129686695,akh receptor,2020
18624,pub.1007326462,akh receptor,2006
18625,pub.1101015253,akh receptor,2018
49772,pub.1092160858,acromyrmex leaf cut ant,2017
...,...,...,...
18063221,pub.1010904311,γδ t cell,2012
18063223,pub.1039368032,γδ t cell,2014
18063225,pub.1120146611,γδ t cell,2019
18063226,pub.1040144066,γδ t cell,2012


## Network Nodes
Concepts are nodes in the network. Each concept has a weight, which comes from the normalized year the concept initially is published in. We normalize using
$$
    n_i = \frac{y_i - y_{\min}}{y_{\max} - y_{\max}}.
$$

In [10]:
# get the initial publication
concepts = (
        df
            .sort_values('year')  # sort so earliest year is first
            .drop_duplicates(subset='concept', keep='last')  # drop the second+ occurance
            .reset_index(drop=True)
    )

# normalize the year
concepts['norm_year'] = (concepts['year'] - MIN_YEAR) / (MAX_YEAR - MIN_YEAR)

concepts

Unnamed: 0,article_id,concept,year,norm_year
0,pub.1050740824,museum expedition,1935,0.148515
1,pub.1050740824,portuguese east africa,1935,0.148515
2,pub.1008437065,british museum expedition,1935,0.148515
3,pub.1085008089,collection of bird,1957,0.366337
4,pub.1013931187,fuca strait,1961,0.405941
...,...,...,...,...
4141,pub.1140601689,oil collect bee,2021,1.000000
4142,pub.1134785046,cry protein,2021,1.000000
4143,pub.1134891915,natural habitat,2021,1.000000
4144,pub.1135094093,green space,2021,1.000000


## Network Edges
Get the edges of the network by merging the dataframe in on itself. Then, we again keep only the first occurrence and normalize the year.

In [12]:
# get all possible edges
edges = df.merge(df, on=['article_id', 'year'], suffixes=['_source', '_target'])

# remove duplicates
edges = edges[edges['concept_source'] < edges['concept_target']]  # remove self links (u - u) and the second occurance (u - v vs v - u)
edges = edges.sort_values('year').drop_duplicates(subset=['concept_source', 'concept_target']).reset_index(drop=True)

# nromalize the year
edges['norm_year'] = (edges['year'] - MIN_YEAR) / (MAX_YEAR - MIN_YEAR)

edges

Unnamed: 0,article_id,concept_source,year,concept_target,norm_year
0,pub.1014736377,crop production,1923,field crop,0.029703
1,pub.1029501515,british museum expedition,1923,museum expedition,0.029703
2,pub.1054807062,english channel,1925,oceanic water,0.049505
3,pub.1054807182,english channel,1925,inshore water,0.049505
4,pub.1054807182,deep water,1925,inshore water,0.049505
...,...,...,...,...,...
22851,pub.1135822847,aphid species,2021,field experiment,1.000000
22852,pub.1136270513,common ancestor,2021,filter feed larvae,1.000000
22853,pub.1137711773,bt toxin cry1ac,2021,resistance management strategy,1.000000
22854,pub.1138164980,field population,2021,wild population,1.000000


## Knowledge Network
Make the network using the dataframes.

There are other ways to do this (Networkx can make a network from a pandas edgelist or we could make a bipartite article-concept network and wrap it in on itself), but this is fastest in my experience without losing isolate nodes.

In [13]:
# initialize the graph
G = nx.Graph()

# add the nodes
G.add_nodes_from([(c, {'weight': ny}) for c, ny in zip(concepts['concept'], concepts['norm_year'])])

# add the edges
G.add_edges_from([(u, v, {'weight': ny}) for u, v, ny in zip(edges['concept_source'], edges['concept_target'], edges['norm_year'])])

G

<networkx.classes.graph.Graph at 0x312fe2b90>

## Next Steps
Next, you can plug the network into the homology pipeline and calculate persistence. You could also do any other network analysis on it (centrality, communities, etc.).

In [14]:
import oat_python as oat

# adjacency matrix
adj = nx.adjacency_matrix(G)
adj.setdiag([d['weight'] for _, d in G.nodes(data=True)])
adj = adj.sorted_indices()

# oat calculation
factored = oat.rust.FactoredBoundaryMatrixVr(adj, 1)
homology = factored.homology(False, False)

# initialize the plot
fig, ax = plt.subplots()
fig.set_figheight(4)
fig.set_figwidth(4)
infty = 1.05
ax.set_xlabel('Birth')
ax.set_ylabel('Death')
ax.axis('equal')

# lines
ax.axhline(infty, ls='--', c='k', lw=1)
ax.axline([0, 0], [1, 1], ls='--', c='k', lw=1)

# loop, plot homology
for dim in homology['dimension'].unique():
    dim_bc = homology[homology['dimension'] == dim]
    ax.scatter(dim_bc['birth'], dim_bc['death'].replace(np.inf, infty), s=2.5, label=f'$H_{dim}$')

# final formatting
ticks = ax.get_yticklabels()[1:-1]
ticks.append(Text(0, infty, r'$\infty$'))  # add infty label
ax.set_yticks(np.hstack((ax.get_yticks()[1:-1], infty)))
ax.set_yticklabels(ticks)
ax.legend(frameon=False)
fig.tight_layout()
handles = ax.get_legend_handles_labels()

thread '<unnamed>' panicked at /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/oat_rust-0.1.1/src/topology/simplicial/from/graph_weighted.rs:529:17:


Error: Entry (0,0) of the dissimilarity matrix passed to the `ChainComplexVrFiltered` constructor is Some(OrderedFloat(0.1485148514851485)) but the minimum structural nonzero entry in row 0 is Some(OrderedFloat(0.0297029702970297)).  In this case `None` represents a value greater strictly greater than `Some(x)`, for every filtration value `x`.
This message is generated by OAT.


note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


PanicException: 

Error: Entry (0,0) of the dissimilarity matrix passed to the `ChainComplexVrFiltered` constructor is Some(OrderedFloat(0.1485148514851485)) but the minimum structural nonzero entry in row 0 is Some(OrderedFloat(0.0297029702970297)).  In this case `None` represents a value greater strictly greater than `Some(x)`, for every filtration value `x`.
This message is generated by OAT.

