### **Reference Age and Citation**

This notebook computes the age and citation counts of every reference a paper cites, then stores them as per-paper lists.

For each citing paper, we compute:
- `reference_ids`: list of referenced paper IDs
- `reference_ages`: age of each cited paper at the time of citation (`citing_year - cited_year`)
- `reference_c1`: citation count of each cited paper within 1 year of its publication
- `reference_c3`: citation count of each cited paper within 3 years of its publication
- `reference_c5`: citation count of each cited paper within 5 years of its publication
- `reference_c10`: citation count of each cited paper within 10 years of its publication
- `reference_cinf`: total lifetime citation count of each cited paper
- `reference_cfocal`: citation count of each cited paper **at the time of citation** (cumulative citations received up to and including `citing_year`)
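
To make the definitions above concrete, here is a small hand-worked sketch for a single cited paper. The publication year and citation years are made-up toy values, and the window convention (a citation counts toward `cN` when `citing_year - cited_year <= N`) is an assumption, since the time-window counts are loaded precomputed and their exact definition is not shown in this notebook.

```python
# Toy data (hypothetical): one cited paper published in 2000 and the years
# in which it received citations.
cited_year = 2000
citation_years = [2000, 2001, 2001, 2003, 2006, 2012]


def count_within(window):
    """Citations received within `window` years of publication.

    Assumed convention: inclusive of the boundary year.
    """
    return sum(1 for y in citation_years if y - cited_year <= window)


def cfocal(citing_year):
    """Cumulative citations received up to and including `citing_year`."""
    return sum(1 for y in citation_years if y <= citing_year)


# A citing paper published in 2003 references the toy paper.
citing_year = 2003
reference_age = citing_year - cited_year
c1, c3, c5, c10 = (count_within(w) for w in (1, 3, 5, 10))
cinf = len(citation_years)

print(reference_age, c1, c3, c5, c10, cinf, cfocal(citing_year))
```

Note that `c1` through `cinf` depend only on the cited paper, while `reference_age` and `cfocal` also depend on when the citing paper was published.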

In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 50)

In [2]:
# Load citation data and compute reference age
references = pd.read_feather('intermediate/citing_cited_paper_id_year.feather')
references['cited_paper_age'] = references['citing_year'] - references['cited_year']

print(f"Loaded {len(references):,} citation records")
print(f"Unique citing papers: {references['citing_paperid'].nunique():,}")
print(f"Unique cited papers: {references['cited_paperid'].nunique():,}")
references.head()

Loaded 2,082,399,190 citation records
Unique citing papers: 77,926,751
Unique cited papers: 82,122,147


| | citing_paperid | cited_paperid | citing_year | cited_year | cited_paper_age |
|---|---|---|---|---|---|
| 0 | 1047469350 | 1036753132 | 2005 | 1982 | 23 |
| 1 | 1047469350 | 1058332925 | 2005 | 1983 | 22 |
| 2 | 1047469350 | 1064398364 | 2005 | 1972 | 33 |
| 3 | 1047469350 | 1064401023 | 2005 | 1958 | 47 |
| 4 | 1047469350 | 1064408021 | 2005 | 1983 | 22 |
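
Because the age is a plain difference of two year columns, a quick sanity check for missing or negative ages may be worth running before going further. The sketch below uses a toy stand-in for `references` with made-up values; whether the real data contains such rows is not known from this notebook.

```python
import pandas as pd

# Toy stand-in for `references` (hypothetical values, same column names).
toy = pd.DataFrame({
    "citing_paperid": [1, 1, 2, 2],
    "cited_paperid": [10, 11, 10, 12],
    "citing_year": [2005, 2005, 1999, 2010],
    "cited_year": [1982, 2005, 2001, None],  # None becomes NaN
})
toy["cited_paper_age"] = toy["citing_year"] - toy["cited_year"]

# Rows where the age could not be computed (a year is missing).
n_missing = int(toy["cited_paper_age"].isna().sum())
# Rows where the cited paper appears to be newer than the citing paper.
n_negative = int((toy["cited_paper_age"] < 0).sum())

print(f"Missing ages: {n_missing}, negative ages: {n_negative}")
```

How to handle such rows (drop, clip to zero, or keep) is a modelling choice that this sketch leaves open.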


In [3]:
# Compute Cfocal: cumulative citations received up to and including citing_year
print("Computing cumulative citation counts at citing time...")
# Number of citations each cited paper receives in each citing year
annual_citations = references.groupby(['cited_paperid', 'citing_year']).size().reset_index(name='count')
annual_citations = annual_citations.sort_values(['cited_paperid', 'citing_year'])
# Running total per cited paper, in chronological order
annual_citations['cited_paper_cfocal'] = annual_citations.groupby('cited_paperid')['count'].cumsum()

references = references.merge(
    annual_citations[['cited_paperid', 'citing_year', 'cited_paper_cfocal']],
    on=['cited_paperid', 'citing_year'],
    how='left'
)
references['cited_paper_cfocal'] = references['cited_paper_cfocal'].fillna(0).astype(int)
print(f"Done. {len(references):,} records")

Computing cumulative citation counts at citing time...
Done. 2,082,399,190 records
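
The group-then-cumsum step can be checked on a toy citation table. The sketch below follows the same pattern as the cell above with made-up values. It also derives a hypothetical `cfocal_before` column that excludes citations from the citing year itself. That variant is not part of this notebook and is shown only to make the inclusive convention explicit.

```python
import pandas as pd

# Toy citation records (hypothetical): paper 10 is cited twice in 2001,
# once in 2003 and once in 2004; paper 11 is cited once in 2003.
toy = pd.DataFrame({
    "cited_paperid": [10, 10, 10, 10, 11],
    "citing_year": [2001, 2001, 2003, 2004, 2003],
})

# Citations per cited paper per year, in chronological order.
annual = (
    toy.groupby(["cited_paperid", "citing_year"])
    .size()
    .reset_index(name="count")
    .sort_values(["cited_paperid", "citing_year"])
)

# Inclusive running total: counts citations made in the citing year too.
annual["cfocal"] = annual.groupby("cited_paperid")["count"].cumsum()
# Hypothetical strict variant: citations received before the citing year.
annual["cfocal_before"] = annual["cfocal"] - annual["count"]

print(annual)
```

With the inclusive convention, every citing paper counts itself, so `cfocal` is always at least 1.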


In [4]:
# Load and merge time-window citation counts (C1, C3, C5, C10, Cinf)
print("Loading and merging time-window citation counts...")
citation_tw = pd.read_feather('data/citation-counts-within-time-window.feather')
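# Strip the literal 'pub.' prefix; regex=False keeps '.' from being treated as a wildcard
citation_tw['cited_paperid'] = citation_tw['paper_id'].str.replace('pub.', '', regex=False).astype(int)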
citation_tw = citation_tw.rename(columns={
    'citation_1y': 'cited_paper_c1', 'citation_3y': 'cited_paper_c3',
    'citation_5y': 'cited_paper_c5', 'citation_10y': 'cited_paper_c10',
    'citation_inf': 'cited_paper_cinf'
})[['cited_paperid', 'cited_paper_c1', 'cited_paper_c3', 'cited_paper_c5', 'cited_paper_c10', 'cited_paper_cinf']]

references = references.merge(citation_tw, on='cited_paperid', how='left')
# Cited papers missing from the time-window table come back as NaN; treat them as 0 citations
for col in ['cited_paper_c1', 'cited_paper_c3', 'cited_paper_c5', 'cited_paper_c10', 'cited_paper_cinf']:
    references[col] = references[col].fillna(0).astype(int)
print(f"Done. Columns: {references.columns.tolist()}")

Loading and merging time-window citation counts...
Done. Columns: ['citing_paperid', 'cited_paperid', 'citing_year', 'cited_year', 'cited_paper_age', 'cited_paper_cfocal', 'cited_paper_c1', 'cited_paper_c3', 'cited_paper_c5', 'cited_paper_c10', 'cited_paper_cinf']
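
Filling unmatched rows with 0 hides how many cited papers were absent from the time-window table. One way to surface that number, sketched here on toy frames with made-up values, is to pass `indicator=True` to `merge` and count the `left_only` rows before filling. `validate="many_to_one"` additionally raises an error if the count table has duplicate keys.

```python
import pandas as pd

# Toy stand-ins (hypothetical values): paper 12 has no row in `counts`.
refs = pd.DataFrame({
    "citing_paperid": [1, 1, 2],
    "cited_paperid": [10, 11, 12],
})
counts = pd.DataFrame({
    "cited_paperid": [10, 11],
    "cited_paper_cinf": [84, 8],
})

merged = refs.merge(
    counts,
    on="cited_paperid",
    how="left",
    validate="many_to_one",  # error if `counts` repeats a cited_paperid
    indicator=True,          # adds a `_merge` column
)

# Rows whose cited paper was not found in the count table.
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(f"Unmatched references: {n_unmatched} of {len(merged)}")

# Same fill as in the notebook, applied after the check.
merged["cited_paper_cinf"] = merged["cited_paper_cinf"].fillna(0).astype(int)
merged = merged.drop(columns="_merge")
```

On the full dataset the extra `_merge` column costs memory, so it may be preferable to run this check on a sample.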


In [5]:
# Aggregate by citing paper
print("Aggregating reference info by citing paper...")
result = references.groupby('citing_paperid').agg(
    reference_ids=('cited_paperid', list),
    reference_ages=('cited_paper_age', list),
    reference_c1=('cited_paper_c1', list),
    reference_c3=('cited_paper_c3', list),
    reference_c5=('cited_paper_c5', list),
    reference_c10=('cited_paper_c10', list),
    reference_cinf=('cited_paper_cinf', list),
    reference_cfocal=('cited_paper_cfocal', list)
).reset_index()
# Restore the 'pub.' prefix so paper_id matches the format used in the time-window table
result['paper_id'] = 'pub.' + result['citing_paperid'].astype(str)
result = result.drop(columns=['citing_paperid'])

print(f"Aggregated {len(result):,} papers")
result.head()

Aggregating reference info by citing paper...
Aggregated 77,926,751 papers


| | reference_ids | reference_ages | reference_c1 | reference_c3 | reference_c5 | reference_c10 | reference_cinf | reference_cfocal | paper_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | [1002143281, 1002975241, 1005757338, 102267567... | [2, 0, 2, 0, 2, 3, 0, 0, 2, 0, 7, 2, 3, 2] | [9, 4, 5, 8, 2, 4, 8, 5, 0, 9, 11, 6, 9, 5] | [29, 6, 11, 15, 12, 24, 11, 7, 5, 16, 23, 19, ... | [40, 8, 12, 24, 13, 44, 14, 10, 10, 24, 35, 25... | [61, 8, 14, 29, 22, 75, 15, 14, 15, 30, 58, 37... | [84, 8, 20, 32, 28, 104, 22, 18, 22, 36, 67, 3... | [17, 1, 8, 3, 5, 24, 1, 1, 2, 1, 46, 8, 28, 15] | pub.1000000002 |
| 1 | [1015758664, 1037457933, 1051031921, 105577665... | [1, 9, 7, 1, 5, 1, 2] | [1, 1, 1, 14, 20, 15, 0] | [13, 1, 2, 66, 54, 59, 2] | [20, 4, 6, 129, 94, 105, 5] | [33, 19, 24, 266, 190, 165, 8] | [48, 407, 50, 395, 1683, 215, 10] | [1, 15, 15, 14, 94, 15, 1] | pub.1000000006 |
| 2 | [1000357624, 1001067672, 1001349503, 100181882... | [6, 12, 2, 5, 2, 3, 9, 3, 0, 4, 1, 9, 9, 5, 3,... | [1, 21, 26, 18, 6, 8, 11, 16, 3, 33, 12, 7, 13... | [3, 101, 66, 61, 19, 19, 30, 45, 7, 106, 38, 2... | [5, 237, 128, 88, 34, 27, 54, 103, 15, 192, 60... | [9, 852, 283, 174, 70, 42, 122, 242, 29, 464, ... | [14, 3192, 322, 255, 80, 45, 252, 312, 30, 651... | [6, 1168, 47, 88, 15, 19, 109, 45, 2, 146, 12,... | pub.1000000007 |
| 3 | [1000384421, 1000666938, 1003119029, 100579821... | [28, 1, 27, 19, 3, 8, 8, 2, 6, 3, 23, 58, 14, ... | [4, 1, 5, 14, 0, 0, 2, 0, 0, 1, 0, 2, 32, 5, 0... | [8, 1, 22, 40, 3, 0, 10, 1, 3, 2, 0, 10, 103, ... | [8, 1, 32, 66, 6, 0, 22, 1, 4, 3, 1, 12, 212, ... | [10, 3, 71, 132, 8, 4, 58, 1, 5, 3, 3, 30, 352... | [12, 4, 510, 226, 16, 34, 194, 21, 6, 4, 49, 2... | [12, 1, 262, 193, 3, 1, 44, 1, 5, 2, 6, 789, 3... | pub.1000000008 |
| 4 | [1001695924, 1004400545, 1006854670, 100876983... | [19, 6, 26, 3, 0, 6, 17, 2, 6, 12, 4, 8, 13, 7... | [0, 2, 1, 1, 2, 1, 3, 2, 0, 6, 1, 3, 6, 2, 0, ... | [10, 7, 3, 7, 3, 2, 8, 4, 0, 29, 6, 5, 24, 11,... | [14, 20, 6, 8, 6, 4, 12, 9, 0, 42, 14, 7, 34, ... | [33, 40, 24, 8, 22, 11, 19, 13, 6, 72, 24, 11,... | [199, 139, 3047, 25, 74, 34, 69, 34, 12, 213, ... | [66, 23, 453, 7, 2, 6, 27, 4, 2, 92, 9, 10, 99... | pub.1000000009 |
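
All list columns come from a single `groupby`, so position `i` in each list refers to the same reference. The sketch below uses a toy frame with made-up values to show two ways a downstream analysis might consume these columns: a per-paper summary (`mean_reference_age`, a hypothetical name) and a long format with one row per reference. Exploding several columns at once assumes pandas 1.3 or newer.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `result` (hypothetical values, subset of columns).
toy = pd.DataFrame({
    "paper_id": ["pub.1", "pub.2"],
    "reference_ages": [[2, 0, 7], [1, 9]],
    "reference_cfocal": [[17, 1, 46], [1, 15]],
})

# Per-paper summary computed directly from the list column.
toy["mean_reference_age"] = toy["reference_ages"].apply(np.mean)

# The lists in a row should all have the same length (one entry per reference).
aligned = (toy["reference_ages"].str.len() == toy["reference_cfocal"].str.len()).all()

# Long format: one row per (citing paper, reference) pair.
long = toy.explode(["reference_ages", "reference_cfocal"], ignore_index=True)

print(toy[["paper_id", "mean_reference_age"]])
print(long[["paper_id", "reference_ages", "reference_cfocal"]])
```

The exploded columns have `object` dtype, so a cast such as `.astype(int)` is usually needed before numeric work.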


In [6]:
# Save results
result.to_parquet('data/paper_reference_age_citation.parquet')
result.to_feather('data/paper_reference_age_citation.feather')
print(f"Saved {len(result):,} papers with columns: {result.columns.tolist()}")

Saved 77,926,751 papers with columns: ['reference_ids', 'reference_ages', 'reference_c1', 'reference_c3', 'reference_c5', 'reference_c10', 'reference_cinf', 'reference_cfocal', 'paper_id']
