# Dimension-derived Variable Construction (English-only)

This notebook derives analysis-ready variables from the unified Dimensions dataset and writes a single Parquet for downstream analyses. The workflow adheres to academic reporting best practices: each variable is introduced by a short rationale (markdown) followed by a reproducible code cell. No intermediate files are written; only the final dataset is saved.

- Source dataset: `/Users/yann.jy/InvisibleResearch/data/processed/dimension_merged.parquet`
- External reference (for disciplinary core journal matching): `/Users/yann.jy/InvisibleResearch/data/processed/scimagojr_communication_journal_1999_2024.csv`
- Output dataset: `/Users/yann.jy/InvisibleResearch/data/processed/dimension_data_for_analysis.parquet`

Conventions:
- All module outputs include `id` for reliable merges.
- Column blocks are ordered by conceptual variable groups so related columns appear together in the final file.
- Time-window logic for invisibility is intentionally not applied here; analyses may add temporal filters later.


In [1]:
# Setup and paths
from __future__ import annotations

from pathlib import Path
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Paths (absolute as requested)
PROJECT_ROOT = Path('/Users/yann.jy/InvisibleResearch')
PARQUET_INPUT = PROJECT_ROOT / 'data/processed/dimension_merged.parquet'
SJR_CSV = PROJECT_ROOT / 'data/processed/scimagojr_communication_journal_1999_2024.csv'
OUTPUT_PARQUET = PROJECT_ROOT / 'data/processed/dimension_data_for_analysis.parquet'

pd.set_option('display.max_colwidth', 160)
pd.set_option('display.width', 160)

print('Input exists:', PARQUET_INPUT.exists())
print('SJR CSV exists:', SJR_CSV.exists())
print('Output path:', OUTPUT_PARQUET)


Input exists: True
SJR CSV exists: True
Output path: /Users/yann.jy/InvisibleResearch/data/processed/dimension_data_for_analysis.parquet


## Schema preview

We first inspect the Parquet schema and a small sample to confirm column availability and obtain a quick sense of the data. No full-table prints are performed to avoid large outputs.


In [2]:
# Read schema (column names) and show a small preview
pf = pq.ParquetFile(PARQUET_INPUT)
print('Number of columns:', len(pf.schema.names))
print('Columns:')
print(pf.schema.names)

# Lightweight row count and a tiny data preview
num_rows = pf.metadata.num_rows
print('Total rows (metadata):', num_rows)

# Sample a few rows without loading entire file (avoid unsupported filters arg)
base_head = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=pf.schema.names[:12]).head(5)
print('\nHead (subset of columns):')
print(base_head)


Number of columns: 76
Columns:
['abstract', 'acknowledgements', 'altmetric', 'altmetric_id', 'arxiv_id', 'authors', 'authors_count', 'book_doi', 'book_series_title', 'book_title', 'category_bra', 'category_for', 'category_for_2020', 'category_hra', 'category_hrcs_hc', 'category_hrcs_rac', 'category_icrp_cso', 'category_icrp_ct', 'category_rcdc', 'category_sdg', 'category_uoa', 'clinical_trial_ids', 'concepts', 'concepts_scores', 'date', 'date_inserted', 'date_online', 'date_print', 'dimensions_url', 'document_type', 'doi', 'editors', 'field_citation_ratio', 'funder_countries', 'funders', 'funding_section', 'id', 'isbn', 'issn', 'issue', 'journal.id', 'journal.title', 'journal_lists', 'journal_title_raw', 'linkout', 'mesh_terms', 'open_access', 'pages', 'pmcid', 'pmid', 'proceedings_title', 'publisher', 'recent_citations', 'reference_ids', 'referenced_pubs', 'relative_citation_ratio', 'research_org_cities', 'research_org_countries', 'research_org_country_names', 'research_org_names', 'r

## Variable: Invisibility (times_cited only)

Definition: A record is labeled as invisible if it has received zero citations, i.e., `invisibility = 1` when `times_cited == 0`; otherwise `0`. We retain `date` for analytical context but do not use it in the rule here. All outputs include `id` for merging.

Columns used: `id`, `times_cited`, `date`.

Output columns: `id`, `invisibility`, `times_cited`, `date`. 


In [3]:
# Build invisibility (robust to non-numeric times_cited)
base = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=['id','times_cited','date'])
inv = base[['id','times_cited','date']].copy()
# Convert to numeric, coercing invalid tokens (e.g., stray dict fragments) to NaN
_tc = pd.to_numeric(inv['times_cited'], errors='coerce')
# Initialize as NA; then fill where numeric is available
inv['invisibility'] = pd.Series(pd.NA, index=inv.index, dtype='Int8')
mask_num = _tc.notna()
inv.loc[mask_num, 'invisibility'] = (_tc.loc[mask_num] == 0).astype('int8')
inv['invisibility'] = inv['invisibility'].astype('Int8')
# Order output columns as specified
inv = inv[['id','invisibility','times_cited','date']]
print(inv.head())
print(inv['invisibility'].value_counts(dropna=False).rename('invisibility_counts'))


               id  invisibility times_cited        date
0  pub.1186290333             1           0  2000-01-01
1  pub.1186290332             1           0  2000-01-01
2  pub.1186290331             1           0  2000-01-01
3  pub.1186290329             1           0  2000-01-01
4  pub.1186290328             1           0  2000-01-01
invisibility
0       186551
1       171940
<NA>         2
Name: invisibility_counts, dtype: Int64


## Variable: Geographic (institutional location)

We retain institutional location fields to support geographic analyses. No transformation is applied here; the values may contain multiple institutions and countries per record.

Columns used: `id`, `research_org_country_names`, `research_org_names`, `research_org_types`.

Output columns: `id`, `research_org_country_names`, `research_org_names`, `research_org_types`. 


In [4]:
# Keep geographic fields
geo_cols = ['id','research_org_country_names','research_org_names','research_org_types']
existing_geo = [c for c in geo_cols if c in pf.schema.names]
geo = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=existing_geo)
print(geo.head())


               id research_org_country_names research_org_names research_org_types
0  pub.1186290333                       None               None               None
1  pub.1186290332                       None               None               None
2  pub.1186290331                       None               None               None
3  pub.1186290329                       None               None               None
4  pub.1186290328                       None               None               None


## Variable: Topical (concepts)

We retain topic-related fields to enable later construction of mainstream-topic shares or binary mainstream indicators. No transformation is performed here.

Columns used: `id`, `concepts`, `concepts_scores`.

Output columns: `id`, `concepts`, `concepts_scores`. 


In [5]:
# Keep topical fields
top_cols = ['id','concepts','concepts_scores']
existing_top = [c for c in top_cols if c in pf.schema.names]
top = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=existing_top)
print(top.head())


               id  \
0  pub.1186290333   
1  pub.1186290332   
2  pub.1186290331   
3  pub.1186290329   
4  pub.1186290328   

                                                                                                                                                          concepts  \
0  ['mass media', 'news habits', 'mass communication', 'public figures', 'media exposure', 'scholarly studies', 'private life', 'tabloids', 'news', 'U.S. probl...   
1  ['mass media', 'news habits', 'mass communication', 'public figures', 'media exposure', 'tabloids', 'scholarly studies', 'private life', 'U.S. problems', 'n...   
2  ['mass media', 'news habits', 'mass communication', 'public figures', 'media exposure', 'scholarly studies', 'private life', 'U.S. problems', 'tabloids', 'n...   
3  ['mass media', 'tabloid journalism', 'news habits', 'mass communication', 'public figures', 'media exposure', 'scholarly studies', 'tabloids', 'private life...   
4  ['mass media', 'tabloid journalism', 'ne

## Variable: Disciplinary (core journal hit via SJR)

We construct a binary indicator for whether the record's journal identifiers intersect the SJR communication journals ISSN list. The variable equals 1 if any standardized token from `issn` or `isbn` matches any SJR `Issn`; otherwise 0.

Columns used: `id`, `issn`, `isbn` (from Dimensions) and `Issn` (from SJR CSV).

Output columns: `id`, `issn`, `isbn`, `disciplinary`.

Normalization policy:
- Uppercase tokens; drop hyphens/whitespace; retain only digits and `X`.
- Support multi-valued identifiers split by common non-alphanumerics.


In [6]:
# Build disciplinary via SJR ISSN match (robust to multi-valued fields)
import re

# Base identifiers
disc_cols = ['id','issn','isbn']
existing_disc = [c for c in disc_cols if c in pf.schema.names]
disc = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=existing_disc)

# --- Utilities ---
TOK = re.compile(r'[0-9A-Za-z]+')

def issn_tokens(val: object) -> set[str]:
    """Extract standardized ISSN-like tokens (length==8, digits/X) from any string-ish value.
    Handles multi-valued cases separated by commas/semicolons/spaces/brackets/quotes/hyphens.
    """
    if pd.isna(val):
        return set()
    # split into alnum chunks, then strip to [0-9X], uppercase
    raw = TOK.findall(str(val))
    cleaned = [re.sub(r'[^0-9X]', '', t.upper()) for t in raw]
    # keep only 8-char tokens (ISSN normalized without hyphen)
    return {t for t in cleaned if len(t) == 8}

# --- Build SJR ISSN set (SJR side may also contain comma-separated values) ---
sjr = pd.read_csv(SJR_CSV, dtype=str, usecols=['Issn'])
sjr_issn_set: set[str] = set()
for s in sjr['Issn'].astype(str):
    sjr_issn_set.update(issn_tokens(s))

# --- Match against SJR set ---
# Prefer explicit column checks for clarity
has_issn = 'issn' in disc.columns
has_isbn = 'isbn' in disc.columns

issn_hit = pd.Series(False, index=disc.index)
if has_issn:
    issn_hit = disc['issn'].map(issn_tokens).apply(lambda s: any(t in sjr_issn_set for t in s))

isbn_hit = pd.Series(False, index=disc.index)
if has_isbn:
    # ISBN tokens typically not length 8; this filter ensures only ISSN-like tokens can match
    isbn_hit = disc['isbn'].map(issn_tokens).apply(lambda s: any(t in sjr_issn_set for t in s))

disc['disciplinary'] = (issn_hit | isbn_hit).astype('Int8')
print(disc.head())
print(disc['disciplinary'].value_counts(dropna=False).rename('disciplinary_counts'))


               id  issn               isbn  disciplinary
0  pub.1186290333  None  ['9781461643852']             0
1  pub.1186290332  None  ['9781461643852']             0
2  pub.1186290331  None  ['9781461643852']             0
3  pub.1186290329  None  ['9781461643852']             0
4  pub.1186290328  None  ['9781461643852']             0
disciplinary
0    356157
1      2336
Name: disciplinary_counts, dtype: Int64


## Variable: Prestige (institutional ranking category)

We derive a prestige category by fuzzy-matching `research_org_names` against SCImago Institution Rankings (communication), then mapping rank to bins:
- Elite: top 100
- High: 101–500
- Medium: 501–1000
- Low: Unranked or no confident match

Inputs:
- `research_org_names` (may contain multiple institutions per record)
- Rankings CSV: `data/processed/scimagoir_2025_Overall Rank_Communication.csv` (expects columns like `institution` and an overall-rank column)

Output columns: `id`, `prestige`. 


In [7]:
# Build prestige from SCImagoIR 2025 (Communication) by fuzzy-matching (rapidfuzz-accelerated)
import re
import numpy as np

# Ensure rapidfuzz is available in the current kernel
try:
    from rapidfuzz import process as rf_process, fuzz as rf_fuzz
except ModuleNotFoundError:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'rapidfuzz'])
    from rapidfuzz import process as rf_process, fuzz as rf_fuzz

# Tunable fuzzy threshold (validated by sampling)
PRESTIGE_FUZZY_THRESHOLD = 0.92
SCORE_CUTOFF = int(PRESTIGE_FUZZY_THRESHOLD * 100)

rank_csv = PROJECT_ROOT / 'data/processed/scimagoir_2025_Overall Rank_Communication.csv'
# Robust CSV reading with explicit columns and trailing placeholder to avoid misalignment
rank_df_raw = pd.read_csv(
    rank_csv,
    dtype=str,
    sep=';',
    engine='python',
    header=0,
    names=['Global Rank', 'Institution', 'Country', 'Sector', '_extra']
)
rank_df_raw = rank_df_raw.drop(columns=['_extra'], errors='ignore')

# Use fixed columns from SCImagoIR export
name_col = 'Institution'
rank_col = 'Global Rank'

# Normalize institution names and numeric rank

def norm_text(x: object) -> str:
    s = str(x) if pd.notna(x) else ''
    s = s.lower().strip()
    s = re.sub(r"[\-–—'’`\"]", ' ', s)
    s = re.sub(r"[^a-z0-9&\s]", ' ', s)
    s = re.sub(r"\s+", ' ', s).strip()
    return s

rank_df = rank_df_raw[[name_col, rank_col]].copy()
rank_df[name_col] = rank_df[name_col].map(norm_text)
# Drop empty institution names explicitly (treat empty string as missing)
rank_df = rank_df[rank_df[name_col].astype(str).str.strip() != '']
rank_df[rank_col] = pd.to_numeric(rank_df[rank_col], errors='coerce')

# Build lookup: normalized name → (best rank, original Institution)
rank_best = (
    rank_df.join(rank_df_raw[[name_col]], rsuffix='_orig')
            .dropna(subset=[rank_col])
            .sort_values(rank_col)
            .drop_duplicates(subset=[name_col], keep='first')
)
rank_map = rank_best.set_index(name_col)[rank_col].to_dict()
orig_by_norm = rank_best.set_index(name_col)[f'{name_col}_orig'].to_dict()
choices = list(rank_map.keys())

# Read research_org_names from the geographic frame (may contain multi-values)
geo_names = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=['id','research_org_names'])


def split_names(val: object) -> list[str]:
    if pd.isna(val):
        return []
    # Split by commas/semicolons/slashes/pipes and collapse bracket/quote noise
    s = str(val)
    # Replace brackets/quotes with space
    s = re.sub(r"[\[\]\(\)\{\}\'\"]", ' ', s)
    # Then split on common delimiters or 2+ spaces
    tokens = re.split(r"[;,/\|]|\s{2,}", s)
    # Drop empties and very short tokens
    return [t.strip() for t in tokens if t and t.strip()]

# Build unique normalized tokens once
unique_tokens: set[str] = set()
for val in geo_names['research_org_names']:
    if pd.isna(val) or str(val).strip() == '':
        continue
    for t in split_names(val):
        qn = norm_text(t)
        if qn:
            unique_tokens.add(qn)

# Map tokens → meta (rank, original matched institution, score, type, token)
token_to_meta: dict[str, dict] = {}
for tok in unique_tokens:
    if tok in rank_map:
        token_to_meta[tok] = {
            'rank': float(rank_map[tok]),
            'choice': orig_by_norm.get(tok, tok),
            'score': 100.0,
            'match_type': 'exact',
            'token': tok,
        }
        continue
    match = rf_process.extractOne(tok, choices, scorer=rf_fuzz.ratio, score_cutoff=SCORE_CUTOFF)
    if match is None:
        token_to_meta[tok] = {
            'rank': None, 'choice': None, 'score': None, 'match_type': 'none', 'token': tok
        }
    else:
        choice_norm, score, _ = match
        token_to_meta[tok] = {
            'rank': float(rank_map[choice_norm]),
            'choice': orig_by_norm.get(choice_norm, choice_norm),
            'score': float(score),
            'match_type': 'fuzzy',
            'token': tok,
        }

# For each row, compute best candidate by (rank asc, score desc)
rows = []
for rid, names in geo_names[['id','research_org_names']].itertuples(index=False):
    candidates = []
    for n in split_names(names):
        qn = norm_text(n)
        if not qn:
            continue
        m = token_to_meta.get(qn)
        if m is not None and (m['rank'] is not None):
            candidates.append(m)
    if candidates:
        best = sorted(candidates, key=lambda m: (m['rank'], - (m['score'] or 0.0)))[0]
        rows.append((rid, best['rank'], best['choice'], best['token'], best['match_type'], best['score']))
    else:
        rows.append((rid, None, None, None, None, None))

prest_df = pd.DataFrame(rows, columns=['id','best_rank','matched_institution','matched_token','match_type','match_score'])

# Map to bins (Unknown for unmatched/missing)

def to_prestige(r: float | None) -> str:
    if r is None or (isinstance(r, float) and np.isnan(r)):
        return 'Unknown'
    r = float(r)
    if r <= 100:
        return 'Elite'
    if r <= 500:
        return 'High'
    if r <= 1000:
        return 'Medium'
    return 'Low'

prest_df['prestige'] = prest_df['best_rank'].map(to_prestige)
prest = prest_df[['id','best_rank','matched_institution','matched_token','match_type','match_score','prestige']]
print(prest.head())


               id  best_rank matched_institution matched_token match_type  match_score prestige
0  pub.1186290333        NaN                None          None       None          NaN  Unknown
1  pub.1186290332        NaN                None          None       None          NaN  Unknown
2  pub.1186290331        NaN                None          None       None          NaN  Unknown
3  pub.1186290329        NaN                None          None       None          NaN  Unknown
4  pub.1186290328        NaN                None          None       None          NaN  Unknown


## Variable: Open Access status

We retain the open access indicator as-is for subsequent stratified analyses.

Columns used: `id`, `open_access`.

Output columns: `id`, `open_access`. 


In [8]:
# Keep OA field
oa_cols = ['id','open_access']
existing_oa = [c for c in oa_cols if c in pf.schema.names]
oa = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=existing_oa)
print(oa['open_access'].value_counts(dropna=False).head())
print(oa.head())


open_access
['closed']              207616
['oa_all', 'gold']       74537
['oa_all', 'green']      26712
['oa_all', 'hybrid']     26096
['oa_all', 'bronze']     23530
Name: count, dtype: int64
               id open_access
0  pub.1186290333  ['closed']
1  pub.1186290332  ['closed']
2  pub.1186290331  ['closed']
3  pub.1186290329  ['closed']
4  pub.1186290328  ['closed']


## Variable: First author experience

Definition: The difference between this paper's publication year and the first author's earliest publication year observed within this dataset. The goal is to estimate the first author's academic experience at the time of publication.

- Identification of first author: prefer stable IDs; we do not use name-based matching to avoid collisions
  - Priority: researchers[0].id → authors[0].id; if neither is available, the value remains missing
- Year source: prefer `year`; fallback to extracting the year from `date`
- Output columns: `id`, `first_author_experience`
- Debug-only (not merged into final output): `first_author_key`, `first_author_first_year`


In [14]:
# Build first_author_experience (robust to multi-valued id fields)
import json, ast, re

# Base columns
fae_cols = ['id','year','date','researchers','authors']
existing_fae = [c for c in fae_cols if c in pf.schema.names]
fa_raw = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=existing_fae)

# Derive paper_year: prefer `year`, fallback to year parsed from `date`
if 'year' in fa_raw.columns:
    paper_year = pd.to_numeric(fa_raw['year'], errors='coerce')
else:
    paper_year = pd.Series(pd.NA, index=fa_raw.index, dtype='Float64')

if 'date' in fa_raw.columns:
    def year_from_date(v: object):
        if pd.isna(v):
            return pd.NA
        s = str(v)
        m = re.search(r'(\d{4})', s)
        return float(m.group(1)) if m else pd.NA
    paper_year = paper_year.fillna(pd.to_numeric(fa_raw['date'].map(year_from_date), errors='coerce'))

fa_raw['paper_year'] = paper_year

# Helpers
ID_PATTERN = re.compile(r'^[a-z]{2,}\.[0-9]+', re.IGNORECASE)

def to_struct(val: object):
    if isinstance(val, (list, dict)):
        return val
    if pd.isna(val):
        return None
    s = str(val)
    if s.strip().lower() in {'', 'none', 'nan', 'null'}:
        return None
    try:
        return json.loads(s)
    except Exception:
        try:
            return ast.literal_eval(s)
        except Exception:
            return None

def extract_id_from_dict(d: dict) -> str | None:
    # primary
    for k in ('id','researcher_id','author_id'):
        if k in d and d[k] not in (None, '', 'None'):
            sid = str(d[k])
            if ID_PATTERN.match(sid):
                return sid
    # nested ids container
    if 'ids' in d:
        ids_obj = d['ids']
        ids_parsed = to_struct(ids_obj)
        if isinstance(ids_parsed, list):
            for itm in ids_parsed:
                if isinstance(itm, dict):
                    v = itm.get('id') or itm.get('value') or itm.get('dimensions_id')
                    if v is not None and ID_PATTERN.match(str(v)):
                        return str(v)
                elif isinstance(itm, str) and ID_PATTERN.match(itm):
                    return itm
        elif isinstance(ids_parsed, dict):
            for v in ids_parsed.values():
                if v is not None and ID_PATTERN.match(str(v)):
                    return str(v)
    return None

def extract_first_id(val: object) -> str | None:
    obj = to_struct(val)
    if obj is None:
        return None
    if isinstance(obj, list) and obj:
        first = obj[0]
        if isinstance(first, dict):
            got = extract_id_from_dict(first)
            if got:
                return got
            return None
        if isinstance(first, str) and ID_PATTERN.match(first):
            return first
        return None
    if isinstance(obj, dict):
        return extract_id_from_dict(obj)
    return None

# First-author key priority: researchers[0].id -> authors[0].id (no name fallback)
fa_key = pd.Series([None]*len(fa_raw), index=fa_raw.index, dtype='object')
if 'researchers' in fa_raw.columns:
    fa_key = fa_raw['researchers'].map(extract_first_id)
if 'authors' in fa_raw.columns:
    fallback_ids = fa_raw['authors'].map(extract_first_id)
    fa_key = fa_key.fillna(fallback_ids)
fa_raw['first_author_key'] = fa_key

# Earliest first-author year map (within this dataset)
valid_rows = fa_raw.dropna(subset=['first_author_key','paper_year']).copy()
valid_rows['paper_year'] = pd.to_numeric(valid_rows['paper_year'], errors='coerce')
first_year_map = valid_rows.groupby('first_author_key')['paper_year'].min()

# Experience = paper_year - earliest_year, clipped to non-negative
delta = pd.to_numeric(fa_raw['paper_year'], errors='coerce') - fa_raw['first_author_key'].map(first_year_map)
delta = pd.to_numeric(delta, errors='coerce')
delta = delta.where(delta.isna() | (delta >= 0), other=0)
fa_exp = delta.round().astype('Int16')

# Output frames
fae = pd.DataFrame({'id': fa_raw['id'], 'first_author_experience': fa_exp})
fae_debug = pd.DataFrame({
    'id': fa_raw['id'],
    'first_author_key': fa_raw['first_author_key'],
    'first_author_first_year': fa_raw['first_author_key'].map(first_year_map)
})

print(fae.head())
print(fae['first_author_experience'].value_counts(dropna=False).head())
print('first_author_key non-null:', int(fae_debug['first_author_key'].notna().sum()))
print(fae_debug.head())


               id  first_author_experience
0  pub.1186290333                     <NA>
1  pub.1186290332                     <NA>
2  pub.1186290331                     <NA>
3  pub.1186290329                     <NA>
4  pub.1186290328                     <NA>
first_author_experience
<NA>    139555
0       128853
1        13675
2        11632
3         9799
Name: count, dtype: Int64
first_author_key non-null: 218938
               id first_author_key  first_author_first_year
0  pub.1186290333             None                      NaN
1  pub.1186290332             None                      NaN
2  pub.1186290331             None                      NaN
3  pub.1186290329             None                      NaN
4  pub.1186290328             None                      NaN


## Variables: Controls (document and referencing)

We retain common control variables for modeling and descriptive statistics.

Columns used: `id`, `document_type`, `type`, `authors_count`, `reference_ids`, `referenced_pubs`.

Output columns: `id`, `document_type`, `type`, `authors_count`, `reference_ids`, `referenced_pubs`. 


In [10]:
# Keep control variables
ctrl_cols = ['id','document_type','type','authors_count','reference_ids','referenced_pubs']
existing_ctrl = [c for c in ctrl_cols if c in pf.schema.names]
ctrl = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=existing_ctrl)
print(ctrl.head())


               id       document_type     type authors_count reference_ids referenced_pubs
0  pub.1186290333  OTHER_BOOK_CONTENT  chapter             0          None            None
1  pub.1186290332                None  chapter             0          None            None
2  pub.1186290331      REFERENCE_WORK  chapter             0          None            None
3  pub.1186290329                None  chapter             0          None            None
4  pub.1186290328                None  chapter             0          None            None


## Merge and write output (column order by conceptual blocks)

We left-join all module dataframes by `id` and order columns so that variables belonging to the same construct appear together. The final dataset is written as a single Parquet file with Snappy compression.


In [11]:
# Merge by id and order columns
from functools import reduce

# Ensure required frames exist
frames = [inv, geo, top, disc, prest, oa, fae, ctrl]
final = reduce(lambda l, r: l.merge(r, on='id', how='left'), frames)

# Column ordering by conceptual blocks
ordered_cols = (
    ['id'] +
    ['invisibility','times_cited','date','first_author_experience'] +
    ['research_org_country_names','research_org_names','research_org_types'] +
    ['concepts','concepts_scores'] +
    ['issn','isbn','disciplinary'] +
    # Prestige block with matching details
    ['best_rank','matched_institution','matched_token','match_type','match_score','prestige'] +
    ['open_access'] +
    ['document_type','type','authors_count','reference_ids','referenced_pubs']
)
final_cols = [c for c in ordered_cols if c in final.columns]
final = final[final_cols]

print('Final shape:', final.shape)
print('Columns (ordered):', final.columns.tolist())

# Write Parquet (single file)
final.to_parquet(OUTPUT_PARQUET, engine='pyarrow', compression='snappy', index=False)
print('Wrote:', OUTPUT_PARQUET)


Final shape: (358493, 25)
Columns (ordered): ['id', 'invisibility', 'times_cited', 'date', 'first_author_experience', 'research_org_country_names', 'research_org_names', 'research_org_types', 'concepts', 'concepts_scores', 'issn', 'isbn', 'disciplinary', 'best_rank', 'matched_institution', 'matched_token', 'match_type', 'match_score', 'prestige', 'open_access', 'document_type', 'type', 'authors_count', 'reference_ids', 'referenced_pubs']
Wrote: /Users/yann.jy/InvisibleResearch/data/processed/dimension_data_for_analysis.parquet


## Data quality and reliability checks

We perform lightweight but informative reliability checks on the final dataset:
- Missingness summary (counts and ratios) at column level
- Key constraints: `id` uniqueness and row-count parity against the base table
- Domain checks: `invisibility` in {0,1}; non-negativity for `times_cited` and `authors_count`
- Format checks: ISSN/ISBN token validity (normalized tokens of length 8 with digits/X)
- Distribution snapshots: value counts or quantiles for selected variables
- Risk indicators: normalized duplicate DOI (if a DOI column exists upstream), rows missing both `issn` and `isbn`


In [12]:
# Load final and run checks
final_df = pd.read_parquet(OUTPUT_PARQUET, engine='pyarrow')
print('Final shape (reloaded):', final_df.shape)

import numpy as np

# Expanded missingness for variable-related columns ONLY (treat empty string '' as NA)
var_cols = [
    # Invisibility block
    'invisibility','times_cited','date','first_author_experience',
    # Geographic/Institutional (include names/types as requested)
    'research_org_country_names','research_org_names','research_org_types',
    # Topical
    'concepts','concepts_scores',
    # Disciplinary
    'issn','isbn','disciplinary',
    # Prestige
    'prestige',
    # OA
    'open_access',
    # Controls
    'document_type','type','authors_count','reference_ids','referenced_pubs'
]
var_cols = [c for c in var_cols if c in final_df.columns]

def is_blank(s: pd.Series) -> pd.Series:
    return s.isna() | s.astype(str).str.strip().eq('')

var_na_counts = {c: int(is_blank(final_df[c]).sum()) for c in var_cols}
var_na_ratio = {c: round(var_na_counts[c] / len(final_df), 4) for c in var_cols}
var_missing = (
    pd.DataFrame({'na_count': pd.Series(var_na_counts), 'na_ratio': pd.Series(var_na_ratio)})
      .sort_values('na_ratio', ascending=False)
)
print('\nVariable columns missingness (NA + empty strings), sorted by na_ratio (final output columns only):')
print(var_missing)

# Key constraints
print('\nID uniqueness check:')
print('Total ids:', final_df['id'].size, 'Distinct ids:', final_df['id'].nunique())

# Domain checks
if 'invisibility' in final_df.columns:
    print('\nInvisibility value counts:')
    print(final_df['invisibility'].value_counts(dropna=False))

if 'times_cited' in final_df.columns:
    # Coerce to numeric for robust comparison and quantiles
    tc = pd.to_numeric(final_df['times_cited'], errors='coerce')
    print('\nTimes cited quantiles:')
    print(tc.describe(percentiles=[0.5,0.9,0.99]))
    neg_tc = (tc.dropna() < 0).sum()
    print('Negative times_cited count:', int(neg_tc))

if 'authors_count' in final_df.columns:
    ac = pd.to_numeric(final_df['authors_count'], errors='coerce')
    print('\nAuthors count quantiles:')
    print(ac.describe(percentiles=[0.5,0.9,0.99]))
    neg_ac = (ac.dropna() < 0).sum()
    print('Negative authors_count count:', int(neg_ac))

# Format checks for identifiers
import re
TOK = re.compile(r'[0-9A-Za-z]+')

def norm_tokens(val: object) -> list[str]:
    if pd.isna(val):
        return []
    tokens = TOK.findall(str(val))
    tokens = [re.sub(r'[^0-9X]', '', t.upper()) for t in tokens]
    return [t for t in tokens if t]

if ('issn' in final_df.columns) or ('isbn' in final_df.columns):
    def valid_issn_like(x: object) -> bool:
        toks = norm_tokens(x)
        # Accept any token with length 8
        return any(len(t) == 8 for t in toks)
    both_missing = 0
    if ('issn' in final_df.columns) and ('isbn' in final_df.columns):
        both_missing = (is_blank(final_df['issn']) & is_blank(final_df['isbn'])).sum()
    print('\nRows missing both issn and isbn (NA + empty strings):', int(both_missing))
    if 'issn' in final_df.columns:
        print('Share of rows with a token that looks like ISSN:', round(final_df['issn'].map(valid_issn_like).mean(), 4))

# Risk indicators (base-level DOI duplicates if DOI exists upstream)
if 'doi' in pf.schema.names:
    doi_base = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=['id', 'doi'])
    doi_base['doi_norm'] = (
        doi_base['doi'].astype(str)
        .str.lower()
        .str.replace(r'^https?://(dx\.)?doi\.org/', '', regex=True)
        .str.replace(r'\s+', '', regex=True)
    )
    dup = doi_base[
        doi_base['doi_norm'].ne('') & doi_base['doi_norm'].notna()
    ].duplicated('doi_norm', keep=False).sum()
    print('\nDuplicate DOI (normalized) at base level:', int(dup))


Final shape (reloaded): (358493, 25)

Variable columns missingness (NA + empty strings), sorted by na_ratio (final output columns only):
                            na_count  na_ratio
first_author_experience       358493    1.0000
isbn                          274561    0.7659
research_org_types            184569    0.5148
research_org_country_names    176122    0.4913
referenced_pubs               169945    0.4741
reference_ids                 169944    0.4741
research_org_names            164923    0.4600
issn                           78802    0.2198
document_type                  75065    0.2094
concepts_scores                25829    0.0720
concepts                       25829    0.0720
times_cited                        0    0.0000
disciplinary                       0    0.0000
prestige                           0    0.0000
open_access                        0    0.0000
type                               0    0.0000
authors_count                      0    0.0000
date             

In [None]:
# QA extension: first_author_experience checks
print('=== Debug Info from Computation ===')
print('first_author_key extracted:', int(fae_debug['first_author_key'].notna().sum()), '/', len(fae_debug))
print('first_author_first_year mapped:', int(fae_debug['first_author_first_year'].notna().sum()))
print('\nSample with valid keys (first 10):')
print(fae_debug[fae_debug['first_author_key'].notna()].head(10))

print('\n=== Final Output Stats ===')
try:
    _ = final_df  # reuse if available
except NameError:
    final_df = pd.read_parquet(OUTPUT_PARQUET, engine='pyarrow')

if 'first_author_experience' in final_df.columns:
    fae_col = final_df['first_author_experience']
    fae_num = pd.to_numeric(fae_col, errors='coerce')
    
    print('Missing count:', int(fae_num.isna().sum()), '/', len(fae_num))
    print('Valid count:', int(fae_num.notna().sum()))
    
    print('\nValue counts (top 10):')
    print(fae_num.value_counts(dropna=False).head(10))
    
    if fae_num.notna().sum() > 0:
        print('\nQuantiles (valid values only):')
        print(fae_num.describe(percentiles=[0.5, 0.9, 0.99]))
        neg_count = int((fae_num.dropna() < 0).sum())
        print('Negative count:', neg_count)
    else:
        print('\nNo valid first_author_experience values to compute quantiles.')
else:
    print('first_author_experience not found in final output columns')



first_author_experience missing count: 358493
first_author_experience value counts (top 10 incl. NA):
first_author_experience
<NA>    358493
Name: count, dtype: Int64

first_author_experience quantiles:
count     0.0
mean     <NA>
std      <NA>
min      <NA>
50%      <NA>
90%      <NA>
99%      <NA>
max      <NA>
Name: first_author_experience, dtype: Float64
Negative first_author_experience count: 0
