## Typography
We want to understand the role that different homological features play in science. To do that, we first need to figure out the role that each paper plays in the network.

#### Preliminaries

In [1]:
## load some packages
import Gavin.utils.make_network as mn
from matplotlib.lines import Line2D
from matplotlib import animation
from IPython.display import HTML
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd
import numpy as np
import pickle

# config
OPTIMIZED_FILE = 'results/bounding_chains/Pure Mathematics_results.pickle'
ARTICLE_CONCEPT_FILE = 'https://www.dropbox.com/scl/fi/toojc1t8bny5fhtrqm92v/concepts_Pure-Mathematics_101.csv.gz?rlkey=agj5xeecx1boywggwmy7hr70z&st=luvwa5ye&dl=1'  # these can also be file paths
CITATION_FILE = 'https://www.dropbox.com/scl/fi/j38s35iax0kr9royaw1l3/articles_Pure-Mathematics_101.csv.gz?rlkey=khh91jnh8s92vjfrk7kxkt64s&st=9arba867&dl=1'
FIELD = OPTIMIZED_FILE.split('results/bounding_chains/')[1].split('_')[0]
OUTDIR = f'results/typing/{FIELD}.parquet'
MIN_RELEVANCE = 0.7  # these should be the same as what was used in the optimization network
MIN_YEAR = 1920
MIN_ARTICLE_FREQ = 0.0001
MAX_ARTICLE_FREQ = 0.001
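
The field name and output path in the config above are derived from the results file name. As a quick check, the sketch below reproduces that derivation on the same path and compares it with an alternative based on `pathlib`. The variable names `optimized_file`, `field_split`, and `field_stem` are introduced here for illustration only, and `str.removesuffix` assumes Python 3.9 or later.

```python
from pathlib import Path

# Same path as OPTIMIZED_FILE in the config cell
optimized_file = 'results/bounding_chains/Pure Mathematics_results.pickle'

# Approach used in the config: drop the directory prefix, keep text before the first underscore
field_split = optimized_file.split('results/bounding_chains/')[1].split('_')[0]

# Alternative sketch (an assumption, not the notebook's method): file stem minus '_results'
field_stem = Path(optimized_file).stem.removesuffix('_results')

print(field_split)  # Pure Mathematics
print(field_stem)   # Pure Mathematics
print(f'results/typing/{field_split}.parquet')  # the value stored in OUTDIR
```

The two approaches agree here. They would differ for a field whose name itself contains an underscore, since the split-based version keeps only the text before the first underscore.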

#### Read the File
Open the results file and store the relevant results.

In [2]:
# load the file
with open(OPTIMIZED_FILE, 'rb') as file:
    res = pickle.load(file)

# info from res
G = res['graph']
optimized = res['optimized']
concepts = res['concepts']

# save memory
del res

#### Typography
We'll classify nodes according to the following scheme:
- Cycle: All nodes involved in the initial cycle
- Bounding Chain: All nodes involved in the bounding chain that aren't in the initial cycle.

We'll classify edges according to the following scheme:
- Cycle: All edges involved in the initial cycle
- Birth: All edges in the initial cycle that first appear at the moment the cycle is born. These are also classified as cycle edges. If several edges of a cycle appear at the same time, the cycle can have more than one birth edge.
- Bounding Chain: All edges involved in the bounding chain that aren't in the initial cycle.
- Tentpole: All edges in the bounding chain that connect a cycle node on one end to a bounding chain node on the other. These are also bounding chain edges.
- Arch: All edges in the bounding chain that connect two bounding chain nodes (i.e. are completely disconnected from the initial cycle). These are also bounding chain edges.
- Death: All edges in the bounding chain that appear at the moment the hole dies. These are also bounding chain edges. If several edges of a bounding chain appear at the same time, one hole can have more than one death edge.
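
To make the scheme above concrete, here is a minimal sketch on a toy hole: a square `a-b-c-d` that is filled in by six triangles through two interior nodes `x` and `y`. All node names, years, and the `birth`/`death` values are made up for illustration. In the notebook these come from the optimized results, not from this kind of hand-built dictionary.

```python
from itertools import combinations

# Toy initial cycle (edges stored as sorted tuples)
cycle_edges = {('a', 'b'), ('b', 'c'), ('c', 'd'), ('a', 'd')}
cycle_nodes = {n for e in cycle_edges for n in e}

# Toy bounding chain: triangles that fill the square via interior nodes x and y
triangles = [('a', 'b', 'x'), ('b', 'c', 'y'), ('b', 'x', 'y'),
             ('c', 'd', 'y'), ('d', 'x', 'y'), ('a', 'd', 'x')]

# Bounding chain nodes/edges exclude anything already in the initial cycle
chain_nodes = {n for t in triangles for n in t} - cycle_nodes
all_edges = {tuple(sorted(p)) for t in triangles for p in combinations(t, 2)}
chain_edges = all_edges - cycle_edges

# Tentpole: exactly one endpoint on the cycle; arch: both endpoints off the cycle
tentpole_edges = {e for e in chain_edges if (e[0] in cycle_nodes) != (e[1] in cycle_nodes)}
arch_edges = {e for e in chain_edges if e[0] in chain_nodes and e[1] in chain_nodes}

# Hypothetical appearance years; birth and death are assumed, not computed
year = {('a', 'b'): 1990, ('b', 'c'): 1991, ('c', 'd'): 1992, ('a', 'd'): 1995,
        ('a', 'x'): 1996, ('b', 'x'): 1996, ('b', 'y'): 1997, ('c', 'y'): 1997,
        ('d', 'y'): 1998, ('d', 'x'): 1998, ('x', 'y'): 2000}
birth, death = 1995, 2000
birth_edges = {e for e in cycle_edges if year[e] == birth}
death_edges = {e for e in chain_edges if year[e] == death}

print(sorted(tentpole_edges))
print(sorted(arch_edges), sorted(birth_edges), sorted(death_edges))
```

In this toy case the single death edge `('x', 'y')` is also the single arch edge, which shows that the edge categories overlap. Note also that a chord joining two cycle nodes would count as a bounding chain edge but would be neither a tentpole nor an arch under these definitions.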

In [3]:
def cycle_typography(idx, G=G, optimized=optimized, concepts=concepts):
    # hole info
    birth = optimized.loc[idx, 'birth']
    death = optimized.loc[idx, 'death']

    # cycle nodes and edges
    cycle = optimized.loc[idx, 'cycle']
    cycle_nodes = concepts[
            cycle['simplex'].explode()
                .drop_duplicates()
                .to_list()
        ]
    cycle_edges = (
            cycle['simplex'].apply(lambda s: concepts[s])  # convert from numbers to nodes
                .apply(tuple)  # hashable type
                .to_numpy()
        )
    birth_edges = np.array([e for e in cycle_edges if G.edges[e]['norm_year'] == birth])
    
    # bounding chain nodes and edges
    if optimized.loc[idx, 'bounding_chain'] is None:
        return cycle_nodes, cycle_edges, birth_edges, np.array([]), np.array([]), np.array([]), np.array([]), np.array([])
    bounding_chain = optimized.loc[idx, 'bounding_chain']
    bounding_chain_nodes = concepts[
            bounding_chain['simplex'].explode()
                .drop_duplicates()
                .to_list()
        ]
    bounding_chain_nodes = bounding_chain_nodes[~np.isin(bounding_chain_nodes, cycle_nodes)]
    bounding_chain_edges = (
            bounding_chain['simplex'].apply(lambda s: concepts[s])  # convert from numbers to nodes
                .apply(lambda s: [s[[i, j]] for i in range(len(s)) for j in range(i+1, len(s))])
                .explode()
                .apply(tuple)  # hashable type
                .drop_duplicates()
                .to_numpy()  # create a 1d numpy array, otherwise the isin breaks later since it becomes a 2d array
        )
    bounding_chain_edges = bounding_chain_edges[~np.isin(bounding_chain_edges, cycle_edges)]
    tentpole_edges = np.array([e for e in bounding_chain_edges if (e[0] in bounding_chain_nodes and e[1] in cycle_nodes) or (e[0] in cycle_nodes and e[1] in bounding_chain_nodes)])
    arch_edges = np.array([e for e in bounding_chain_edges if e[0] in bounding_chain_nodes and e[1] in bounding_chain_nodes])
    death_edges = np.array([e for e in bounding_chain_edges if G.edges[e]['norm_year'] == death])
    
    return cycle_nodes, cycle_edges, birth_edges, bounding_chain_nodes, bounding_chain_edges, tentpole_edges, arch_edges, death_edges

The resulting typography is a tuple of arrays containing the types in the following order:
1. Cycle Nodes
2. Cycle Edges
3. Birth Edges
4. Bounding Chain Nodes
5. Bounding Chain Edges
6. Tentpole Edges
7. Arch Edges
8. Death Edges
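
Because the result is a plain tuple, callers have to remember the order listed above. One possible convenience, sketched here, is to wrap the tuple in a `namedtuple` so the parts can be accessed by name. The `Typography` type and the stand-in `raw` tuple are hypothetical and are not part of the notebook.

```python
from collections import namedtuple

# Hypothetical container; field order mirrors the return order listed above
Typography = namedtuple('Typography', [
    'cycle_nodes', 'cycle_edges', 'birth_edges',
    'bounding_chain_nodes', 'bounding_chain_edges',
    'tentpole_edges', 'arch_edges', 'death_edges',
])

# Stand-in for the output of cycle_typography(idx) for a hole that never dies
raw = (
    ['a', 'b', 'c', 'd'],
    [('a', 'b'), ('b', 'c'), ('c', 'd'), ('a', 'd')],
    [('a', 'd')],
    [], [], [], [], [],
)

typ = Typography(*raw)
print(typ.birth_edges)                     # access by name instead of position
print(len(typ.bounding_chain_nodes) == 0)  # True when there is no bounding chain
```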

In [4]:
cycle_typography(5000)

(array(['finite time interval', 'finite time stability',
        'integral inequality', 'non trivial solution',
        'hadamard fractional differential equation'], dtype='<U62'),
 array([('finite time interval', 'finite time stability'),
        ('integral inequality', 'non trivial solution'),
        ('finite time interval', 'non trivial solution'),
        ('hadamard fractional differential equation', 'integral inequality'),
        ('hadamard fractional differential equation', 'finite time stability')],
       dtype=object),
 array([['hadamard fractional differential equation',
         'finite time stability']], dtype='<U41'),
 array(['stability property', 'certain differential equation',
        'homogeneous differential equation', 'order equation',
        'sturm – liouville equation', 'comparison theorem',
        'oscillatory solution', 'riemann boundary value problem',
        'complex differential equation', 'general boundary value problem',
        'ise model', 'partial di

If the cycle never dies, then the arrays for bounding chain attributes are empty.

In [5]:
cycle_typography(2000)

(array(['linear topological space', 'power series', 'small parameter',
        'real line'], dtype='<U62'),
 array([('linear topological space', 'power series'),
        ('power series', 'small parameter'),
        ('linear topological space', 'real line'),
        ('real line', 'small parameter')], dtype=object),
 array([['real line', 'small parameter']], dtype='<U15'),
 array([], dtype=float64),
 array([], dtype=float64),
 array([], dtype=float64),
 array([], dtype=float64),
 array([], dtype=float64))

#### Visualization
Using our classifications, we can see how individual holes are created and filled.

In [6]:
idx = 23473

In [7]:
## animation
# get relevant network bits
cycle_nodes, cycle_edges, birth_edges, bounding_chain_nodes, bounding_chain_edges, tentpole_edges, arching_edges, death_edges = cycle_typography(idx)

# cycle info
G_cycle = nx.Graph()
G_cycle.add_nodes_from([(n, dict(G.nodes[n], type='cycle')) for n in cycle_nodes])
G_cycle.add_edges_from([(u, v, dict(G.edges[u, v], type='cycle')) for u, v in cycle_edges])
for e in birth_edges:
    G_cycle.edges[e]['type'] = 'birth'  # these were labeled as cycle
pos = nx.nx_agraph.graphviz_layout(G_cycle, prog='circo')  # position of nodes in the cycle (circle outside the bounding chain)

# bounding chain info
if optimized.loc[idx, 'death'] < np.inf:
    G_cycle.add_nodes_from([(n, dict(G.nodes[n], type='bounding_chain')) for n in bounding_chain_nodes])
    G_cycle.add_edges_from([(u, v, dict(G.edges[u, v], type='bounding_chain')) for u, v in bounding_chain_edges])
    for e in tentpole_edges:
        G_cycle.edges[e]['type'] = 'tentpole'  # these were labeled as bounding_chain
    for e in arching_edges:
        G_cycle.edges[e]['type'] = 'arch'  # these were labeled as bounding_chain
    for e in death_edges:
        G_cycle.edges[e]['type'] = 'death'  # overrides any earlier bounding chain label
    pos = nx.spring_layout(
            G_cycle,
            k=2/(3 * np.sqrt(len(G_cycle.nodes))),  # adjust to keep nodes within cycle
            pos=pos,
            fixed=cycle_nodes
        )

# general info
start = min([G_cycle.nodes[n]['year'] for n in G_cycle.nodes])  # time first node appears
end = max([G_cycle.edges[e]['year'] for e in G_cycle.edges])  # cycle birth if there's no bounding chain, otherwise death
node_color_dict = {'cycle': 'tab:orange', 'bounding_chain': 'tab:blue'}
edge_color_dict = {
        'cycle': 'black', 'birth': 'tab:green',
        'bounding_chain': 'dimgray', 'tentpole': 'gray', 'arch': 'darkgray', 'death': 'tab:red'  # 'tab:red' matches the legend entry
    }
format_label = lambda n: n.title()

# setup plot 
fig, ax = plt.subplots()
fig.set_figwidth(8)
fig.set_figheight(6)
fig.suptitle('<Year>')

# legend
legend_elements = [
        Line2D([0], [0], color='black', lw=2, label='Cycle Edge'),
        Line2D([0], [0], color='dimgray', lw=2, label='Bounding Chain Edge'),
        Line2D([0], [0], color='gray', lw=2, label='Tentpole Edge'),
        Line2D([0], [0], color='darkgray', lw=2, label='Arch Edge'),
        Line2D([0], [0], color='tab:green', lw=2, label='Birth Edge'),
        Line2D([0], [0], color='tab:red', lw=2, label='Death Edge'),
        Line2D([0], [0], marker='o', color='w', label='Cycle Node', markerfacecolor='tab:orange', markersize=10),
        Line2D([0], [0], marker='o', color='w', label='Bounding Chain Node', markerfacecolor='tab:blue', markersize=10),
    ]
fig.legend(handles=legend_elements, loc='lower center', ncols=4, frameon=False)

# network
ax.set_axis_off()
ax.set_aspect('equal')
pos_arr = np.array(list(pos.values()))
xmargin = (pos_arr[:, 0].max() - pos_arr[:, 0].min()) / 3
xlim = ax.set_xlim((pos_arr[:, 0].min()-xmargin, pos_arr[:, 0].max()+xmargin))
ymargin = (pos_arr[:, 1].max() - pos_arr[:, 1].min()) / 10
ylim = ax.set_ylim((pos_arr[:, 1].min()-ymargin, pos_arr[:, 1].max()+ymargin))
fig.subplots_adjust(left=0, right=1, top=0.975, bottom=0.075)

# make each frame
def update(year):
    ax.clear()

    # graph at year
    G = nx.Graph()
    G.add_nodes_from([(n, d) for n, d in G_cycle.nodes(data=True) if d['year'] <= year])
    G.add_edges_from([(u, v, d) for u, v, d in G_cycle.edges(data=True) if d['year'] <= year])
    tris = [t for t in nx.find_cliques(G) if len(t) == 3]  # since we make the network from the bounding chain, the cliques are all size 3 or less
    node_colors = [node_color_dict[d['type']] for _, d in G.nodes(data=True)]
    edge_colors = [edge_color_dict[d['type']] for _, _, d in G.edges(data=True)]
    labels = {n: format_label(n) for n in G.nodes}

    # plot it
    for t in tris:
        coords = [pos[n] for n in t]
        ax.fill(
                [x for x, _ in coords],
                [y for _, y in coords],
                c='lightgray',
                alpha=0.5
            )
    nx.draw(
            G,
            pos,
            with_labels=True,
            labels=labels,
            font_size=10,
            node_size=300,
            node_color=node_colors,
            width=2.5,
            edge_color=edge_colors,
            ax=ax,
        )

    # formatting
    fig.suptitle(year)
    ax.set_xlim((pos_arr[:, 0].min()-xmargin, pos_arr[:, 0].max()+xmargin))
    ax.set_ylim((pos_arr[:, 1].min()-ymargin, pos_arr[:, 1].max()+ymargin))
    # fig.tight_layout(pad=2)

# animate it
plt.close()  # don't show the empty figure
anim = animation.FuncAnimation(fig, update, frames=range(start-1, end+1), interval=150, cache_frame_data=False, repeat=False)
HTML(anim.to_html5_video())
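
Each frame of the animation above keeps only the nodes and edges that exist by the frame's year and shades every complete triangle. The sketch below shows that per-frame logic on toy data without any plotting. It uses `itertools.combinations` to find triangles rather than `nx.find_cliques`, so it is a stand-in for the idea, not the notebook's implementation. The `frame` helper and the year dictionaries are hypothetical.

```python
from itertools import combinations

# Toy appearance years for nodes and edges (edges stored as sorted tuples)
node_year = {'a': 1990, 'b': 1990, 'c': 1992, 'x': 1995}
edge_year = {('a', 'b'): 1991, ('b', 'c'): 1993, ('a', 'c'): 1994,
             ('a', 'x'): 1996, ('b', 'x'): 1997}

def frame(year):
    """Return the nodes, edges, and filled triangles visible at `year`."""
    nodes = {n for n, y in node_year.items() if y <= year}
    edges = {e for e, y in edge_year.items() if y <= year}
    # A triangle is filled once all three of its edges are present
    tris = [t for t in combinations(sorted(nodes), 3)
            if all(p in edges for p in combinations(t, 2))]
    return nodes, edges, tris

for y in (1993, 1994, 1997):
    nodes, edges, tris = frame(y)
    print(y, len(nodes), len(edges), tris)
```

The triangle `a-b-c` only becomes shaded in 1994, when its last edge appears, which is the same mechanism by which the bounding chain gradually fills a hole in the animation.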


#### Classification of All Cycles
We can use a loop to classify all cycles. To keep track of the classifications, we give each node and edge one attribute per role, holding the number of cycles in which it plays that role.

In [8]:
# initialize to 0
nx.set_node_attributes(G, {n: {'cycle': 0, 'bounding_chain': 0} for n in G.nodes})
nx.set_edge_attributes(G, {e: {'cycle': 0, 'birth': 0, 'bounding_chain': 0, 'tentpole': 0, 'arch': 0, 'death': 0} for e in G.edges})

# get types from each cycle
for i in optimized.index:
    # get nodes/edges of each type in the cycle
    cycle_nodes, cycle_edges, birth_edges, bounding_chain_nodes, bounding_chain_edges, tentpole_edges, arch_edges, death_edges = cycle_typography(i)

    # persistent homology basis checks (comment out to make it faster)
    assert birth_edges.size > 0
    assert optimized.loc[i, 'death'] == np.inf or death_edges.size > 0

    # keep track of types
    for n in cycle_nodes:
        G.nodes[n]['cycle'] += 1
    for e in cycle_edges:
        G.edges[e]['cycle'] += 1
    for e in birth_edges:
        G.edges[e]['birth'] += 1
    for n in bounding_chain_nodes:
        G.nodes[n]['bounding_chain'] += 1
    for e in bounding_chain_edges:
        G.edges[e]['bounding_chain'] += 1
    for e in tentpole_edges:
        G.edges[e]['tentpole'] += 1
    for e in arch_edges:
        G.edges[e]['arch'] += 1
    for e in death_edges:
        G.edges[e]['death'] += 1

# store results to dataframe to be used
node_type_df = pd.DataFrame(dict(G.nodes(data=True))).T.reset_index().rename(columns={'index': 'concept'})
node_type_df = node_type_df.rename(columns={'cycle': 'cycle_count', 'bounding_chain': 'bounding_chain_count'})
node_type_df['in_cycle'] = node_type_df['cycle_count'] > 0
node_type_df['in_bounding_chain'] = node_type_df['bounding_chain_count'] > 0
edge_type_df = nx.to_pandas_edgelist(G, source='concept_s', target='concept_t')
edge_type_df = edge_type_df.rename(columns={
        'cycle': 'cycle_count', 'birth': 'birth_count', 'bounding_chain': 'bounding_chain_count',
        'tentpole': 'tentpole_count', 'arch': 'arch_count', 'death': 'death_count'
    })
edge_type_df['in_cycle'] = edge_type_df['cycle_count'] > 0
edge_type_df['in_birth'] = edge_type_df['birth_count'] > 0
edge_type_df['in_bounding_chain'] = edge_type_df['bounding_chain_count'] > 0
edge_type_df['in_tentpole'] = edge_type_df['tentpole_count'] > 0
edge_type_df['in_arch'] = edge_type_df['arch_count'] > 0
edge_type_df['in_death'] = edge_type_df['death_count'] > 0
edge_type_df.loc[edge_type_df['concept_s'] > edge_type_df['concept_t'], ['concept_s', 'concept_t']] = edge_type_df.loc[edge_type_df['concept_s'] > edge_type_df['concept_t'], ['concept_t', 'concept_s']].values
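
The last line of the cell above puts every edge into a canonical order, with `concept_s` alphabetically before `concept_t`, so that later merges on concept pairs line up. A minimal sketch of the same swap on a two-row toy frame follows. The data is made up, and the `swap` mask name is introduced here for readability.

```python
import pandas as pd

# Toy edge list: the first row is in the "wrong" order, the second is already sorted
edges = pd.DataFrame({
    'concept_s': ['wreath product', 'abel equation'],
    'concept_t': ['af algebra', 'zero of solution'],
    'cycle_count': [2, 0],
})

# Rows whose endpoints need swapping
swap = edges['concept_s'] > edges['concept_t']

# .values drops the column labels; without it pandas would realign
# the columns by name and the assignment would change nothing
edges.loc[swap, ['concept_s', 'concept_t']] = edges.loc[swap, ['concept_t', 'concept_s']].values

print(edges)
```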

In [9]:
node_type_df

Unnamed: 0,concept,year,norm_year,article_id,cycle_count,bounding_chain_count,in_cycle,in_bounding_chain
0,ads / cft correspondence,1998,0.772277,pub.1001769852,15,229,True,True
1,af algebra,1980,0.594059,pub.1021979973,9,0,True,False
2,akns hierarchy,1994,0.732673,pub.1022407584,2,8,True,True
3,abel differential equation,1988,0.673267,pub.1053385472,11,0,True,False
4,abel equation,1998,0.772277,pub.1046042236,12,16,True,True
...,...,...,...,...,...,...,...,...
4873,wreath product,1975,0.544554,pub.1049232761,7,0,True,False
4874,zero curvature equation,1987,0.663366,pub.1054031892,12,31,True,True
4875,zero curvature representation,1983,0.623762,pub.1007433019,25,114,True,True
4876,zero of solution,1956,0.356436,pub.1072264081,23,4,True,True


In [10]:
edge_type_df

Unnamed: 0,concept_s,concept_t,death_count,arch_count,norm_year,article_id,cycle_count,bounding_chain_count,year,birth_count,tentpole_count,in_cycle,in_birth,in_bounding_chain,in_tentpole,in_arch,in_death
0,ads / cft correspondence,root of unity,0,25,0.782178,pub.1005463738,3,26,1999,1,1,True,True,True,True,True,False
1,ads / cft correspondence,lattice model,1,5,0.920792,pub.1010393758,0,6,2013,0,1,False,False,True,True,True,True
2,ads / cft correspondence,solvable lattice model,0,0,0.900990,pub.1046517966,0,0,2011,0,0,False,False,False,False,False,False
3,ads / cft correspondence,ads5 × s5,0,0,0.782178,pub.1033484946,1,1,1999,0,1,True,False,True,True,False,False
4,ads / cft correspondence,cft correspondence,0,0,0.772277,pub.1018967595,0,1,1998,0,1,False,False,True,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110201,weight module,weight space,0,0,0.831683,pub.1001920505,0,0,2004,0,0,False,False,False,False,False,False
110202,weight sobolev space,well posedness result,2,5,0.930693,pub.1004122415,0,6,2014,0,1,False,False,True,True,True,True
110203,well posedness result,well posedness theory,1,0,1.000000,pub.1137706171,0,1,2021,0,1,False,False,True,True,False,True
110204,white noise,white noise analysis,0,0,0.851485,pub.1062986967,0,0,2006,0,0,False,False,False,False,False,False


#### Aggregate
We aggregate these by article and combine them with the citation data to create one dataframe with all the independent and dependent variables we want.

At a node level, this means creating a dataframe of all article-concept occurrences. We merge on the concept, not Article ID, so that we can also analyze the second (and later) occurrences of important concepts.

In [11]:
## node level article aggregation
article_concept_file = mn.filter_article_concept_file(
        ARTICLE_CONCEPT_FILE,
        min_relevance=MIN_RELEVANCE,
        min_articles=MIN_ARTICLE_FREQ,
        max_articles=MAX_ARTICLE_FREQ,
        min_year=MIN_YEAR
    )

# add in occurrence count
article_concept_file['in_network'] = True
article_concept_file['past_concept_count'] = article_concept_file.groupby('concept')['year'].rank('min')  # using min means there can be multiple "first" articles
article_concept_file['first_concept_occ'] = article_concept_file['past_concept_count'] == 1
article_concept_file['second_concept_occ'] = article_concept_file['past_concept_count'] == 2

# merge concept role info
article_concept_file = article_concept_file.merge(
        node_type_df[['concept', 'in_cycle', 'cycle_count', 'in_bounding_chain', 'bounding_chain_count']],
        on='concept'
    )
article_concept_file['first_in_cycle'] = article_concept_file['first_concept_occ'] & article_concept_file['in_cycle']
article_concept_file['first_cycle_count'] = article_concept_file['first_concept_occ'] * article_concept_file['cycle_count']
article_concept_file['first_in_bounding_chain'] = article_concept_file['first_concept_occ'] & article_concept_file['in_bounding_chain']
article_concept_file['first_bounding_chain_count'] = article_concept_file['first_concept_occ'] * article_concept_file['bounding_chain_count']
article_concept_file['second_in_cycle'] = article_concept_file['second_concept_occ'] & article_concept_file['in_cycle']
article_concept_file['second_cycle_count'] = article_concept_file['second_concept_occ'] * article_concept_file['cycle_count']
article_concept_file['second_in_bounding_chain'] = article_concept_file['second_concept_occ'] & article_concept_file['in_bounding_chain']
article_concept_file['second_bounding_chain_count'] = article_concept_file['second_concept_occ'] * article_concept_file['bounding_chain_count']

# aggregate at an article level
article_concept_file = article_concept_file.groupby('article_id').agg(
        first_in_network=('first_concept_occ', 'any'),
        first_network_count=('first_concept_occ', 'sum'),
        second_in_network=('second_concept_occ', 'any'),
        second_network_count=('second_concept_occ', 'sum'),
        in_network=('in_network', 'any'),
        network_count=('in_network', 'sum'),
        first_in_cycle=('first_in_cycle', 'any'),
        first_cycle_count=('first_cycle_count', 'sum'),
        first_in_bounding_chain=('first_in_bounding_chain', 'any'),
        first_bounding_chain_count=('first_bounding_chain_count', 'sum'),
        second_in_cycle=('second_in_cycle', 'any'),
        second_cycle_count=('second_cycle_count', 'sum'),
        second_in_bounding_chain=('second_in_bounding_chain', 'any'),
        second_bounding_chain_count=('second_bounding_chain_count', 'sum'),
        in_cycle=('in_cycle', 'any'),
        cycle_count=('cycle_count', 'sum'),
        in_bounding_chain=('in_bounding_chain', 'any'),
        bounding_chain_count=('bounding_chain_count', 'sum'),
    ).reset_index()
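
The first and second occurrence flags in the cell above depend on how `rank` treats ties. The sketch below, on made-up data, shows the behaviour of `method='min'` when two articles share the earliest year for a concept, and compares it with `method='dense'` purely for contrast. Which behaviour is wanted is a modelling decision that this sketch does not settle.

```python
import pandas as pd

# Toy article-concept occurrences; p1 and p2 tie for the first use of 'power series'
df = pd.DataFrame({
    'article_id': ['p1', 'p2', 'p3', 'p4', 'p5'],
    'concept': ['power series', 'power series', 'power series', 'real line', 'real line'],
    'year': [1950, 1950, 1960, 1970, 1980],
})

# Same ranking as the notebook: tied articles share the lowest rank
df['rank_min'] = df.groupby('concept')['year'].rank(method='min')
df['first_concept_occ'] = df['rank_min'] == 1
df['second_concept_occ'] = df['rank_min'] == 2

# For comparison only: dense ranking leaves no gaps after ties
df['rank_dense'] = df.groupby('concept')['year'].rank(method='dense')

print(df)
```

With `method='min'`, the ranks for `power series` are 1, 1, 3, so both tied articles count as first occurrences and no article counts as the second occurrence of that concept.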

At the edge level, we do the same thing but merge on concept pairs instead of single concepts.

In [12]:
## edge level article aggregation
# load the file (same params used to make the network)
article_edge_file = mn.filter_article_concept_file(
        ARTICLE_CONCEPT_FILE,
        min_relevance=MIN_RELEVANCE,
        min_articles=MIN_ARTICLE_FREQ,
        max_articles=MAX_ARTICLE_FREQ,
        min_year=MIN_YEAR
    )

# make edge dataframe
article_edge_file = article_edge_file[['article_id', 'year', 'concept']].merge(
        article_edge_file[['article_id', 'concept']],
        on='article_id',
        how='outer',
        suffixes=['_s', '_t']
    )
article_edge_file = article_edge_file[article_edge_file['concept_s'] < article_edge_file['concept_t']]

# add in occurrence count
article_edge_file['in_network'] = True
article_edge_file['rank'] = article_edge_file.groupby(['concept_s', 'concept_t'])['year'].rank('min')
article_edge_file['first_concept_occ'] = article_edge_file['rank'] == 1
article_edge_file['second_concept_occ'] = article_edge_file['rank'] == 2

# merge concept role info
article_edge_file = article_edge_file.merge(
        edge_type_df[['concept_s', 'concept_t', 'in_cycle', 'cycle_count', 'in_birth', 'birth_count', 'in_bounding_chain', 'bounding_chain_count', 'in_tentpole', 'tentpole_count', 'in_arch', 'arch_count', 'in_death', 'death_count']],
        on=['concept_s', 'concept_t'],
        how='outer'
    )
article_edge_file['first_in_cycle'] = article_edge_file['first_concept_occ'] & article_edge_file['in_cycle']
article_edge_file['first_cycle_count'] = article_edge_file['first_concept_occ'] * article_edge_file['cycle_count']
article_edge_file['first_in_birth'] = article_edge_file['first_concept_occ'] & article_edge_file['in_birth']
article_edge_file['first_birth_count'] = article_edge_file['first_concept_occ'] * article_edge_file['birth_count']
article_edge_file['first_in_bounding_chain'] = article_edge_file['first_concept_occ'] & article_edge_file['in_bounding_chain']
article_edge_file['first_bounding_chain_count'] = article_edge_file['first_concept_occ'] * article_edge_file['bounding_chain_count']
article_edge_file['first_in_death'] = article_edge_file['first_concept_occ'] & article_edge_file['in_death']
article_edge_file['first_death_count'] = article_edge_file['first_concept_occ'] * article_edge_file['death_count']
article_edge_file['first_in_tentpole'] = article_edge_file['first_concept_occ'] & article_edge_file['in_tentpole']
article_edge_file['first_tentpole_count'] = article_edge_file['first_concept_occ'] * article_edge_file['tentpole_count']
article_edge_file['first_in_arch'] = article_edge_file['first_concept_occ'] & article_edge_file['in_arch']
article_edge_file['first_arch_count'] = article_edge_file['first_concept_occ'] * article_edge_file['arch_count']
article_edge_file['second_in_cycle'] = article_edge_file['second_concept_occ'] & article_edge_file['in_cycle']
article_edge_file['second_cycle_count'] = article_edge_file['second_concept_occ'] * article_edge_file['cycle_count']
article_edge_file['second_in_birth'] = article_edge_file['second_concept_occ'] & article_edge_file['in_birth']
article_edge_file['second_birth_count'] = article_edge_file['second_concept_occ'] * article_edge_file['birth_count']
article_edge_file['second_in_bounding_chain'] = article_edge_file['second_concept_occ'] & article_edge_file['in_bounding_chain']
article_edge_file['second_bounding_chain_count'] = article_edge_file['second_concept_occ'] * article_edge_file['bounding_chain_count']
article_edge_file['second_in_tentpole'] = article_edge_file['second_concept_occ'] & article_edge_file['in_tentpole']
article_edge_file['second_tentpole_count'] = article_edge_file['second_concept_occ'] * article_edge_file['tentpole_count']
article_edge_file['second_in_arch'] = article_edge_file['second_concept_occ'] & article_edge_file['in_arch']
article_edge_file['second_arch_count'] = article_edge_file['second_concept_occ'] * article_edge_file['arch_count']
article_edge_file['second_in_death'] = article_edge_file['second_concept_occ'] & article_edge_file['in_death']
article_edge_file['second_death_count'] = article_edge_file['second_concept_occ'] * article_edge_file['death_count']

# aggregate at an article level
article_edge_file = article_edge_file.groupby('article_id').agg(
        first_in_network=('first_concept_occ', 'any'),
        first_network_count=('first_concept_occ', 'sum'),
        second_in_network=('second_concept_occ', 'any'),
        second_network_count=('second_concept_occ', 'sum'),
        in_network=('in_network', 'any'),
        network_count=('in_network', 'sum'),
        first_in_cycle=('first_in_cycle', 'any'),
        first_cycle_count=('first_cycle_count', 'sum'),
        first_in_birth=('first_in_birth', 'any'),
        first_birth_count=('first_birth_count', 'sum'),
        first_in_bounding_chain=('first_in_bounding_chain', 'any'),
        first_bounding_chain_count=('first_bounding_chain_count', 'sum'),
        first_in_tentpole=('first_in_tentpole', 'any'),
        first_tentpole_count=('first_tentpole_count', 'sum'),
        first_in_arch=('first_in_arch', 'any'),
        first_arch_count=('first_arch_count', 'sum'),
        first_in_death=('first_in_death', 'any'),
        first_death_count=('first_death_count', 'sum'),
        second_in_cycle=('second_in_cycle', 'any'),
        second_cycle_count=('second_cycle_count', 'sum'),
        second_in_birth=('second_in_birth', 'any'),
        second_birth_count=('second_birth_count', 'sum'),
        second_in_bounding_chain=('second_in_bounding_chain', 'any'),
        second_bounding_chain_count=('second_bounding_chain_count', 'sum'),
        second_in_tentpole=('second_in_tentpole', 'any'),
        second_tentpole_count=('second_tentpole_count', 'sum'),
        second_in_arch=('second_in_arch', 'any'),
        second_arch_count=('second_arch_count', 'sum'),
        second_in_death=('second_in_death', 'any'),
        second_death_count=('second_death_count', 'sum'),
        in_cycle=('in_cycle', 'any'),
        cycle_count=('cycle_count', 'sum'),
        in_birth=('in_birth', 'any'),
        birth_count=('birth_count', 'sum'),
        in_bounding_chain=('in_bounding_chain', 'any'),
        bounding_chain_count=('bounding_chain_count', 'sum'),
        in_tentpole=('in_tentpole', 'any'),
        tentpole_count=('tentpole_count', 'sum'),
        in_arch=('in_arch', 'any'),
        arch_count=('arch_count', 'sum'),
        in_death=('in_death', 'any'),
        death_count=('death_count', 'sum'),
    ).reset_index()
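
The concept pairs in the cell above come from merging the article-concept table with itself on `article_id` and keeping only rows where `concept_s` sorts before `concept_t`. A small sketch with made-up data illustrates the effect. It uses the default inner join, which gives the same rows as an outer join here because both sides share the same keys.

```python
import pandas as pd

# Toy article-concept table: p1 mentions three concepts, p2 only one
ac = pd.DataFrame({
    'article_id': ['p1', 'p1', 'p1', 'p2'],
    'year': [2000, 2000, 2000, 2005],
    'concept': ['real line', 'power series', 'small parameter', 'real line'],
})

# Self-merge: every ordered pair of concepts within the same article
pairs = ac[['article_id', 'year', 'concept']].merge(
    ac[['article_id', 'concept']],
    on='article_id',
    suffixes=('_s', '_t'),
)

# Keep each unordered pair once and drop self-pairs
pairs = pairs[pairs['concept_s'] < pairs['concept_t']]

print(pairs)
```

The three concepts of `p1` yield three pairs, while `p2` has a single concept and therefore contributes no edge rows at all.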

To create the overall citation dataframe, we merge these article-level aggregates onto the citation data (and any other dependent variables).

In [13]:
## citation data
citation_df = pd.read_csv(
        CITATION_FILE,
        compression='gzip'  # need to specify compression if you're reading from a Dropbox link
    )

# add network role data
citation_df = citation_df.merge(
        article_concept_file,
        on='article_id',
        how='left'
    )
citation_df = citation_df.merge(
        article_edge_file,
        on='article_id',
        how='left',
        suffixes=['_node', '_edge']
    )

# fill nas
bool_cols = [
        'first_in_network_node', 'second_in_network_node', 'in_network_node',
        'first_in_cycle_node', 'second_in_cycle_node', 'in_cycle_node',
        'first_in_bounding_chain_node', 'second_in_bounding_chain_node', 'in_bounding_chain_node',
        'first_in_network_edge', 'second_in_network_edge', 'in_network_edge',
        'first_in_cycle_edge', 'second_in_cycle_edge', 'in_cycle_edge',
        'first_in_birth', 'second_in_birth', 'in_birth',
        'first_in_bounding_chain_edge', 'second_in_bounding_chain_edge', 'in_bounding_chain_edge',
        'first_in_tentpole', 'second_in_tentpole', 'in_tentpole',
        'first_in_arch', 'second_in_arch', 'in_arch',
        'first_in_death', 'second_in_death', 'in_death'
    ]
# cast to the pandas nullable boolean dtype so the NAs can be filled, then back to plain bool
citation_df[bool_cols] = citation_df[bool_cols].astype('boolean').fillna(False).astype(bool)  # these weren't anywhere in the network, so it's False
num_cols = [
        'first_network_count_node', 'second_network_count_node', 'network_count_node',
        'first_cycle_count_node', 'second_cycle_count_node', 'cycle_count_node',
        'first_bounding_chain_count_node', 'second_bounding_chain_count_node', 'bounding_chain_count_node',
        'first_network_count_edge', 'second_network_count_edge', 'network_count_edge',
        'first_cycle_count_edge', 'second_cycle_count_edge', 'cycle_count_edge',
        'first_birth_count', 'second_birth_count', 'birth_count',
        'first_bounding_chain_count_edge', 'second_bounding_chain_count_edge', 'bounding_chain_count_edge',
        'first_tentpole_count', 'second_tentpole_count', 'tentpole_count',
        'first_arch_count', 'second_arch_count', 'arch_count',
        'first_death_count', 'second_death_count', 'death_count'
    ]
citation_df[num_cols] = citation_df[num_cols].astype(float).fillna(0)  # these weren't anywhere in the network, so it's 0

citation_df

Unnamed: 0,article_id,year,date,doi,volume,issue,pages,title_preferred,journal_title,citations_count,...,in_birth,birth_count,in_bounding_chain_edge,bounding_chain_count_edge,in_tentpole,tentpole_count,in_arch,arch_count,in_death,death_count
0,pub.1009401373,1854,1854-07-01,10.1515/crll.1854.48.137,1854,48,137-142,Ueber Producte und Potenzen bestimmter einfach...,Journal für die reine und angewandte Mathemati...,1.0,...,False,0.0,False,0.0,False,0.0,False,0.0,False,0.0
1,pub.1070920919,1985,1985-01-01,10.2748/tmj/1178228721,37,1,33-42,SEMISIMPLE DEGREE OF SYMMETRY AND MAPS OF NON-...,Tohoku Mathematical Journal,,...,False,0.0,False,0.0,False,0.0,False,0.0,False,0.0
2,pub.1043577808,1986,1986-12,10.1007/bf02621935,61,1,617-635,Forme de Blanchfield et cobordisme d’entrelacs...,Commentarii Mathematici Helvetici,15.0,...,False,0.0,False,0.0,False,0.0,False,0.0,False,0.0
3,pub.1042161599,1985,1985-05,10.1112/blms/17.3.295,17,3,295-298,TOTAL MEAN CURVATURE AND SUBMANIFOLDS OF FINIT...,Bulletin of the London Mathematical Society,5.0,...,False,0.0,False,0.0,False,0.0,False,0.0,False,0.0
4,pub.1015932083,2013,2013-12,10.1016/j.topol.2013.07.019,160,18,2233,Preface,Topology and its Applications,,...,False,0.0,False,0.0,False,0.0,False,0.0,False,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1422853,pub.1030009411,1999,1999-01,10.1016/s0893-9659(98)00128-1,12,1,63-70,A note on second-order wave forces on a circul...,Applied Mathematics Letters,4.0,...,False,0.0,False,0.0,False,0.0,False,0.0,False,0.0
1422854,pub.1045628958,2002,2002-11,10.1016/s1063-5203(02)00507-9,13,3,177-200,Wavelets on the sphere: implementation and app...,Applied and Computational Harmonic Analysis,75.0,...,False,0.0,False,0.0,False,0.0,False,0.0,False,0.0
1422855,pub.1092523788,2017,2017-12-13,10.1088/1361-6420/aa9830,34,1,015004,Wavefield reconstruction inversion with a mult...,Inverse Problems,13.0,...,False,0.0,False,0.0,False,0.0,False,0.0,False,0.0
1422856,pub.1135345254,2021,2021-02-01,10.1088/1757-899x/1047/1/012137,1047,1,012137,Hardware implementation of the coding algorith...,IOP Conference Series Materials Science and En...,,...,False,0.0,False,0.0,False,0.0,False,0.0,False,0.0
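
Most rows in the output above show `False` and `0.0` because a left merge leaves missing values for articles that never appear in the network, and those are then filled. The sketch below reproduces that fill step on a three-row toy example. The frames `citations` and `roles` are made up and contain only one boolean and one count column.

```python
import pandas as pd

# Toy citation table and a role table that only covers article p1
citations = pd.DataFrame({
    'article_id': ['p1', 'p2', 'p3'],
    'citations_count': [4.0, None, 15.0],
})
roles = pd.DataFrame({'article_id': ['p1'], 'in_cycle': [True], 'cycle_count': [3]})

# Left merge keeps every article; p2 and p3 get NaN in the role columns
merged = citations.merge(roles, on='article_id', how='left')

# Nullable boolean lets the missing values be filled before casting to plain bool
merged['in_cycle'] = merged['in_cycle'].astype('boolean').fillna(False).astype(bool)
merged['cycle_count'] = merged['cycle_count'].astype(float).fillna(0)

print(merged)
```

Only the role columns are filled. Missing values in the citation data itself, such as the `citations_count` of `p2`, are left as they are.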


#### Save the Result
We save the results to a parquet file for use in later modelling.

In [14]:
citation_df.to_parquet(OUTDIR)