# Dimension-derived Variable Construction (English-only)

This notebook derives analysis-ready variables from the unified Dimensions dataset and writes a single Parquet for downstream analyses. The workflow adheres to academic reporting best practices: each variable is introduced by a short rationale (markdown) followed by a reproducible code cell. No intermediate files are written; only the final dataset is saved.

- Source dataset: `/Users/yann.jy/InvisibleResearch/data/processed/dimension_merged.parquet`
- External reference (for disciplinary core journal matching): `/Users/yann.jy/InvisibleResearch/data/processed/scimagojr_communication_journal_1999_2024.csv`
- Output dataset: `/Users/yann.jy/InvisibleResearch/data/processed/dimension_data_for_analysis.parquet`

Conventions:
- All module outputs include `id` for reliable merges.
- Column blocks are ordered by conceptual variable groups so related columns appear together in the final file.
- Time-window logic for invisibility is intentionally not applied here; analyses may add temporal filters later.


In [1]:
# Setup and paths
from __future__ import annotations

from pathlib import Path
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Paths (absolute as requested)
PROJECT_ROOT = Path('/Users/yann.jy/InvisibleResearch')
PARQUET_INPUT = PROJECT_ROOT / 'data/processed/dimension_merged.parquet'
SJR_CSV = PROJECT_ROOT / 'data/processed/scimagojr_communication_journal_1999_2024.csv'
OUTPUT_PARQUET = PROJECT_ROOT / 'data/processed/dimension_data_for_analysis.parquet'

pd.set_option('display.max_colwidth', 160)
pd.set_option('display.width', 160)

print('Input exists:', PARQUET_INPUT.exists())
print('SJR CSV exists:', SJR_CSV.exists())
print('Output path:', OUTPUT_PARQUET)


Input exists: True
SJR CSV exists: True
Output path: /Users/yann.jy/InvisibleResearch/data/processed/dimension_data_for_analysis.parquet


## Schema preview

We first inspect the Parquet schema and a small sample to confirm column availability and obtain a quick sense of the data. No full-table prints are performed to avoid large outputs.


In [2]:
# Read schema (column names) and show a small preview
pf = pq.ParquetFile(PARQUET_INPUT)
print('Number of columns:', len(pf.schema.names))
print('Columns:')
print(pf.schema.names)

# Lightweight row count and a tiny data preview
num_rows = pf.metadata.num_rows
print('Total rows (metadata):', num_rows)

# Sample a few rows without loading entire file (avoid unsupported filters arg)
base_head = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=pf.schema.names[:12]).head(5)
print('\nHead (subset of columns):')
print(base_head)


Number of columns: 76
Columns:
['abstract', 'acknowledgements', 'altmetric', 'altmetric_id', 'arxiv_id', 'authors', 'authors_count', 'book_doi', 'book_series_title', 'book_title', 'category_bra', 'category_for', 'category_for_2020', 'category_hra', 'category_hrcs_hc', 'category_hrcs_rac', 'category_icrp_cso', 'category_icrp_ct', 'category_rcdc', 'category_sdg', 'category_uoa', 'clinical_trial_ids', 'concepts', 'concepts_scores', 'date', 'date_inserted', 'date_online', 'date_print', 'dimensions_url', 'document_type', 'doi', 'editors', 'field_citation_ratio', 'funder_countries', 'funders', 'funding_section', 'id', 'isbn', 'issn', 'issue', 'journal.id', 'journal.title', 'journal_lists', 'journal_title_raw', 'linkout', 'mesh_terms', 'open_access', 'pages', 'pmcid', 'pmid', 'proceedings_title', 'publisher', 'recent_citations', 'reference_ids', 'referenced_pubs', 'relative_citation_ratio', 'research_org_cities', 'research_org_countries', 'research_org_country_names', 'research_org_names', 'r

## Variable: Invisibility (times_cited only)

Definition: A record is labeled as invisible if it has received zero citations, i.e., `invisibility = 1` when `times_cited == 0`; otherwise `0`. We retain `date` for analytical context but do not use it in the rule here. All outputs include `id` for merging.

Columns used: `id`, `times_cited`, `date`.

Output columns: `id`, `invisibility`, `times_cited`, `date`. 


In [3]:
# Build invisibility (robust to non-numeric times_cited)
base = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=['id','times_cited','date'])
inv = base[['id','times_cited','date']].copy()
# Convert to numeric, coercing invalid tokens (e.g., stray dict fragments) to NaN
_tc = pd.to_numeric(inv['times_cited'], errors='coerce')
# Initialize as NA; then fill where numeric is available
inv['invisibility'] = pd.Series(pd.NA, index=inv.index, dtype='Int8')
mask_num = _tc.notna()
inv.loc[mask_num, 'invisibility'] = (_tc.loc[mask_num] == 0).astype('int8')
inv['invisibility'] = inv['invisibility'].astype('Int8')
# Order output columns as specified
inv = inv[['id','invisibility','times_cited','date']]
print(inv.head())
print(inv['invisibility'].value_counts(dropna=False).rename('invisibility_counts'))


               id  invisibility times_cited        date
0  pub.1186290333             1           0  2000-01-01
1  pub.1186290332             1           0  2000-01-01
2  pub.1186290331             1           0  2000-01-01
3  pub.1186290329             1           0  2000-01-01
4  pub.1186290328             1           0  2000-01-01
invisibility
0       186551
1       171940
<NA>         2
Name: invisibility_counts, dtype: Int64


## Variable: Geographic (institutional location)

We retain institutional location fields to support geographic analyses. No transformation is applied here; the values may contain multiple institutions and countries per record.

Columns used: `id`, `research_org_country_names`, `research_org_names`, `research_org_types`.

Output columns: `id`, `research_org_country_names`, `research_org_names`, `research_org_types`. 


In [4]:
# Keep geographic fields
geo_cols = ['id','research_org_country_names','research_org_names','research_org_types']
existing_geo = [c for c in geo_cols if c in pf.schema.names]
geo = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=existing_geo)
print(geo.head())


               id research_org_country_names research_org_names research_org_types
0  pub.1186290333                       None               None               None
1  pub.1186290332                       None               None               None
2  pub.1186290331                       None               None               None
3  pub.1186290329                       None               None               None
4  pub.1186290328                       None               None               None


## Variable: Topical (concepts)

We retain topic-related fields to enable later construction of mainstream-topic shares or binary mainstream indicators. No transformation is performed here.

Columns used: `id`, `concepts`, `concepts_scores`.

Output columns: `id`, `concepts`, `concepts_scores`. 


In [5]:
# Keep topical fields
top_cols = ['id','concepts','concepts_scores']
existing_top = [c for c in top_cols if c in pf.schema.names]
top = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=existing_top)
print(top.head())


               id  \
0  pub.1186290333   
1  pub.1186290332   
2  pub.1186290331   
3  pub.1186290329   
4  pub.1186290328   

                                                                                                                                                          concepts  \
0  ['mass media', 'news habits', 'mass communication', 'public figures', 'media exposure', 'scholarly studies', 'private life', 'tabloids', 'news', 'U.S. probl...   
1  ['mass media', 'news habits', 'mass communication', 'public figures', 'media exposure', 'tabloids', 'scholarly studies', 'private life', 'U.S. problems', 'n...   
2  ['mass media', 'news habits', 'mass communication', 'public figures', 'media exposure', 'scholarly studies', 'private life', 'U.S. problems', 'tabloids', 'n...   
3  ['mass media', 'tabloid journalism', 'news habits', 'mass communication', 'public figures', 'media exposure', 'scholarly studies', 'tabloids', 'private life...   
4  ['mass media', 'tabloid journalism', 'ne

## Variable: Disciplinary (core journal hit via SJR)

We construct a binary indicator for whether the record's journal identifiers intersect the SJR communication journals ISSN list. The variable equals 1 if any standardized token from `issn` or `isbn` matches any SJR `Issn`; otherwise 0.

Columns used: `id`, `issn`, `isbn` (from Dimensions) and `Issn` (from SJR CSV).

Output columns: `id`, `issn`, `isbn`, `disciplinary`.

Normalization policy:
- Uppercase tokens; drop hyphens/whitespace; retain only digits and `X`.
- Support multi-valued identifiers split by common non-alphanumerics.


In [6]:
# Build disciplinary via SJR ISSN match (robust to multi-valued fields)
import re

# Base identifiers
disc_cols = ['id','issn','isbn']
existing_disc = [c for c in disc_cols if c in pf.schema.names]
disc = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=existing_disc)

# --- Utilities ---
TOK = re.compile(r'[0-9A-Za-z]+')

def issn_tokens(val: object) -> set[str]:
    """Extract standardized ISSN-like tokens (length==8, digits/X) from any string-ish value.
    Handles multi-valued cases separated by commas/semicolons/spaces/brackets/quotes/hyphens.
    """
    if pd.isna(val):
        return set()
    # split into alnum chunks, then strip to [0-9X], uppercase
    raw = TOK.findall(str(val))
    cleaned = [re.sub(r'[^0-9X]', '', t.upper()) for t in raw]
    # keep only 8-char tokens (ISSN normalized without hyphen)
    return {t for t in cleaned if len(t) == 8}

# --- Build SJR ISSN set (SJR side may also contain comma-separated values) ---
sjr = pd.read_csv(SJR_CSV, dtype=str, usecols=['Issn'])
sjr_issn_set: set[str] = set()
for s in sjr['Issn'].astype(str):
    sjr_issn_set.update(issn_tokens(s))

# --- Match against SJR set ---
# Prefer explicit column checks for clarity
has_issn = 'issn' in disc.columns
has_isbn = 'isbn' in disc.columns

issn_hit = pd.Series(False, index=disc.index)
if has_issn:
    issn_hit = disc['issn'].map(issn_tokens).apply(lambda s: any(t in sjr_issn_set for t in s))

isbn_hit = pd.Series(False, index=disc.index)
if has_isbn:
    # ISBN tokens typically not length 8; this filter ensures only ISSN-like tokens can match
    isbn_hit = disc['isbn'].map(issn_tokens).apply(lambda s: any(t in sjr_issn_set for t in s))

disc['disciplinary'] = (issn_hit | isbn_hit).astype('Int8')
print(disc.head())
print(disc['disciplinary'].value_counts(dropna=False).rename('disciplinary_counts'))


               id  issn               isbn  disciplinary
0  pub.1186290333  None  ['9781461643852']             0
1  pub.1186290332  None  ['9781461643852']             0
2  pub.1186290331  None  ['9781461643852']             0
3  pub.1186290329  None  ['9781461643852']             0
4  pub.1186290328  None  ['9781461643852']             0
disciplinary
0    356157
1      2336
Name: disciplinary_counts, dtype: Int64


## Variable: Prestige (institutional ranking category)

We derive a prestige category by fuzzy-matching `research_org_names` against SCImago Institution Rankings (communication), then mapping rank to bins:
- Elite: top 100
- High: 101–500
- Medium: 501–1000
- Low: Unranked or no confident match

Inputs:
- `research_org_names` (may contain multiple institutions per record)
- Rankings CSV: `data/processed/scimagoir_2025_Overall Rank_Communication.csv` (expects columns like `institution` and an overall-rank column)

Output columns: `id`, `prestige`. 


In [None]:
# Build prestige from SCImagoIR 2025 (Communication) by fuzzy-matching
import re
from difflib import SequenceMatcher

rank_csv = PROJECT_ROOT / 'data/processed/scimagoir_2025_Overall Rank_Communication.csv'
# Try robust CSV reading: auto-detect delimiter, handle quoted fields and bad lines
try:
    # First try semicolon (observed in file header)
    rank_df = pd.read_csv(rank_csv, dtype=str, sep=';', engine='python')
except Exception:
    # Fallbacks: auto and common delimiters with skipping bad lines
    for sep in [None, ',', '\t', '|']:
        try:
            rank_df = pd.read_csv(rank_csv, dtype=str, sep=sep, engine='python', on_bad_lines='skip')
            break
        except Exception:
            rank_df = None
    if rank_df is None:
        raise

# Heuristics to detect rank and institution columns (support semicolon CSVs)
# Normalize headers first for robust matching
rank_df.columns = [c.strip() for c in rank_df.columns]

name_col = None
for cand in ['institution','Institution','Global Institution','Global institution','Global  Institution']:
    if cand in rank_df.columns:
        name_col = cand
        break
if name_col is None:
    # Try case-insensitive match
    lower_map = {c.lower(): c for c in rank_df.columns}
    for key in ['institution','global rank institution','global institution','name']:
        if key in lower_map:
            name_col = lower_map[key]
            break
if name_col is None:
    # As last resort, if a column contains many alphabetic strings and not numeric-only, pick it
    texty = [c for c in rank_df.columns if rank_df[c].astype(str).str.contains(r"[A-Za-z]", regex=True).mean() > 0.8]
    if texty:
        name_col = texty[0]
if name_col is None:
    raise RuntimeError(f'Cannot locate institution name column in SCImagoIR CSV. Columns: {list(rank_df.columns)}')

rank_col = None
for cand in ['overall_rank','overall rank','Overall Rank','Rank','rank','Global Rank','Global rank','GlobalRank']:
    if cand in rank_df.columns:
        rank_col = cand
        break
if rank_col is None:
    # Case-insensitive fallback
    lower_map = {c.lower(): c for c in rank_df.columns}
    for key in ['global rank','rank','overall rank']:
        if key in lower_map:
            rank_col = lower_map[key]
            break
if rank_col is None:
    # Numeric-looking dominant column
    numeric_like = []
    for c in rank_df.columns:
        try:
            vals = pd.to_numeric(rank_df[c], errors='coerce')
            if vals.notna().mean() > 0.8:
                numeric_like.append((c, vals))
        except Exception:
            pass
    if numeric_like:
        rank_col = numeric_like[0][0]
if rank_col is None:
    raise RuntimeError(f'Cannot locate rank column in SCImagoIR CSV. Columns: {list(rank_df.columns)}')

# Normalize institution names and numeric rank
def norm_text(x: object) -> str:
    s = str(x) if pd.notna(x) else ''
    s = s.lower().strip()
    s = re.sub(r"[\-–—'’`\"]", ' ', s)
    s = re.sub(r"[^a-z0-9&\s]", ' ', s)
    s = re.sub(r"\s+", ' ', s).strip()
    return s

rank_df = rank_df[[name_col, rank_col]].copy()
rank_df[name_col] = rank_df[name_col].map(norm_text)
rank_df[rank_col] = pd.to_numeric(rank_df[rank_col], errors='coerce')
rank_df = rank_df.dropna(subset=[name_col])

# Build a fast lookup: exact normalized name → best rank
rank_map = (
    rank_df.dropna(subset=[rank_col])
           .sort_values(rank_col)
           .drop_duplicates(subset=[name_col], keep='first')
           .set_index(name_col)[rank_col]
           .to_dict()
)
rank_names = list(rank_map.keys())

# Helper: best fuzzy match per input string using SequenceMatcher ratio
def best_fuzzy_rank(q: str, threshold: float = 0.86) -> float | None:
    qn = norm_text(q)
    if not qn:
        return None
    # Exact hit first
    if qn in rank_map:
        return float(rank_map[qn])
    # Fallback: fuzzy similarity
    best_rank = None
    best_score = 0.0
    for cand in rank_names:
        score = SequenceMatcher(None, qn, cand).ratio()
        if score > best_score:
            best_score = score
            best_rank = float(rank_map[cand])
    return best_rank if best_score >= threshold else None

# Read research_org_names from the geographic frame (may contain multi-values)
geo_names = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=['id','research_org_names'])

def split_names(val: object) -> list[str]:
    if pd.isna(val):
        return []
    # Split by commas/semicolons/slashes/pipes and collapse bracket/quote noise
    s = str(val)
    # Replace brackets/quotes with space
    s = re.sub(r"[\[\]\(\)\{\}\'\"]", ' ', s)
    # Then split on common delimiters or 2+ spaces
    tokens = re.split(r"[;,/\|]|\s{2,}", s)
    # Drop empties and very short tokens
    return [t.strip() for t in tokens if t and t.strip()]

# For each row, compute best (lowest) rank among matched institutions
records = []
for rid, names in geo_names[['id','research_org_names']].itertuples(index=False):
    ranks = []
    for n in split_names(names):
        r = best_fuzzy_rank(n)
        if r is not None:
            ranks.append(r)
    best = min(ranks) if ranks else None
    records.append((rid, best))

prest_df = pd.DataFrame(records, columns=['id','best_rank'])

# Map to bins
def to_prestige(r: float | None) -> str:
    if r is None or np.isnan(r):
        return 'Low'  # Unranked
    r = float(r)
    if r <= 100:
        return 'Elite'
    if r <= 500:
        return 'High'
    if r <= 1000:
        return 'Medium'
    return 'Low'

prest_df['prestige'] = prest_df['best_rank'].map(to_prestige)
prest = prest_df[['id','prestige']]
print(prest.head())


RuntimeError: Cannot locate institution name column in SCImagoIR CSV

## Variable: Open Access status

We retain the open access indicator as-is for subsequent stratified analyses.

Columns used: `id`, `open_access`.

Output columns: `id`, `open_access`. 


In [None]:
# Keep OA field
oa_cols = ['id','open_access']
existing_oa = [c for c in oa_cols if c in pf.schema.names]
oa = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=existing_oa)
print(oa['open_access'].value_counts(dropna=False).head())
print(oa.head())


## Variables: Controls (document and referencing)

We retain common control variables for modeling and descriptive statistics.

Columns used: `id`, `document_type`, `type`, `authors_count`, `reference_ids`, `referenced_pubs`.

Output columns: `id`, `document_type`, `type`, `authors_count`, `reference_ids`, `referenced_pubs`. 


In [None]:
# Keep control variables
ctrl_cols = ['id','document_type','type','authors_count','reference_ids','referenced_pubs']
existing_ctrl = [c for c in ctrl_cols if c in pf.schema.names]
ctrl = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=existing_ctrl)
print(ctrl.head())


## Merge and write output (column order by conceptual blocks)

We left-join all module dataframes by `id` and order columns so that variables belonging to the same construct appear together. The final dataset is written as a single Parquet file with Snappy compression.


In [None]:
# Merge by id and order columns
from functools import reduce

# Ensure required frames exist
frames = [inv, geo, top, disc, prest, oa, ctrl]
final = reduce(lambda l, r: l.merge(r, on='id', how='left'), frames)

# Column ordering by conceptual blocks
ordered_cols = (
    ['id'] +
    ['invisibility','times_cited','date'] +
    ['research_org_country_names','research_org_names','research_org_types'] +
    ['concepts','concepts_scores'] +
    ['issn','isbn','disciplinary'] +
    ['open_access'] +
    ['document_type','type','authors_count','reference_ids','referenced_pubs']
)
final_cols = [c for c in ordered_cols if c in final.columns]
final = final[final_cols]

print('Final shape:', final.shape)
print('Columns (ordered):', final.columns.tolist())

# Write Parquet (single file)
final.to_parquet(OUTPUT_PARQUET, engine='pyarrow', compression='snappy', index=False)
print('Wrote:', OUTPUT_PARQUET)


## Data quality and reliability checks

We perform lightweight but informative reliability checks on the final dataset:
- Missingness summary (counts and ratios) at column level
- Key constraints: `id` uniqueness and row-count parity against the base table
- Domain checks: `invisibility` in {0,1}; non-negativity for `times_cited` and `authors_count`
- Format checks: ISSN/ISBN token validity (normalized tokens of length 8 with digits/X)
- Distribution snapshots: value counts or quantiles for selected variables
- Risk indicators: normalized duplicate DOI (if a DOI column exists upstream), rows missing both `issn` and `isbn`


In [None]:
# Load final and run checks
final_df = pd.read_parquet(OUTPUT_PARQUET, engine='pyarrow')
print('Final shape (reloaded):', final_df.shape)

import numpy as np

# Expanded missingness for variable-related columns ONLY (treat empty string '' as NA)
var_cols = [
    # Invisibility block
    'invisibility','times_cited','date',
    # Geographic/Institutional (include names/types as requested)
    'research_org_country_names','research_org_names','research_org_types',
    # Topical
    'concepts','concepts_scores',
    # Disciplinary
    'issn','isbn','disciplinary',
    # OA
    'open_access',
    # Controls
    'document_type','type','authors_count','reference_ids','referenced_pubs'
]
var_cols = [c for c in var_cols if c in final_df.columns]

def is_blank(s: pd.Series) -> pd.Series:
    return s.isna() | s.astype(str).str.strip().eq('')

var_na_counts = {c: int(is_blank(final_df[c]).sum()) for c in var_cols}
var_na_ratio = {c: round(var_na_counts[c] / len(final_df), 4) for c in var_cols}
var_missing = (
    pd.DataFrame({'na_count': pd.Series(var_na_counts), 'na_ratio': pd.Series(var_na_ratio)})
      .sort_values('na_ratio', ascending=False)
)
print('\nVariable columns missingness (NA + empty strings), sorted by na_ratio (final output columns only):')
print(var_missing)

# Key constraints
print('\nID uniqueness check:')
print('Total ids:', final_df['id'].size, 'Distinct ids:', final_df['id'].nunique())

# Domain checks
if 'invisibility' in final_df.columns:
    print('\nInvisibility value counts:')
    print(final_df['invisibility'].value_counts(dropna=False))

if 'times_cited' in final_df.columns:
    # Coerce to numeric for robust comparison and quantiles
    tc = pd.to_numeric(final_df['times_cited'], errors='coerce')
    print('\nTimes cited quantiles:')
    print(tc.describe(percentiles=[0.5,0.9,0.99]))
    neg_tc = (tc.dropna() < 0).sum()
    print('Negative times_cited count:', int(neg_tc))

if 'authors_count' in final_df.columns:
    ac = pd.to_numeric(final_df['authors_count'], errors='coerce')
    print('\nAuthors count quantiles:')
    print(ac.describe(percentiles=[0.5,0.9,0.99]))
    neg_ac = (ac.dropna() < 0).sum()
    print('Negative authors_count count:', int(neg_ac))

# Format checks for identifiers
import re
TOK = re.compile(r'[0-9A-Za-z]+')

def norm_tokens(val: object) -> list[str]:
    if pd.isna(val):
        return []
    tokens = TOK.findall(str(val))
    tokens = [re.sub(r'[^0-9X]', '', t.upper()) for t in tokens]
    return [t for t in tokens if t]

if ('issn' in final_df.columns) or ('isbn' in final_df.columns):
    def valid_issn_like(x: object) -> bool:
        toks = norm_tokens(x)
        # Accept any token with length 8
        return any(len(t) == 8 for t in toks)
    both_missing = 0
    if ('issn' in final_df.columns) and ('isbn' in final_df.columns):
        both_missing = (is_blank(final_df['issn']) & is_blank(final_df['isbn'])).sum()
    print('\nRows missing both issn and isbn (NA + empty strings):', int(both_missing))
    if 'issn' in final_df.columns:
        print('Share of rows with a token that looks like ISSN:', round(final_df['issn'].map(valid_issn_like).mean(), 4))

# Risk indicators (base-level DOI duplicates if DOI exists upstream)
if 'doi' in pf.schema.names:
    doi_base = pd.read_parquet(PARQUET_INPUT, engine='pyarrow', columns=['id', 'doi'])
    doi_base['doi_norm'] = (
        doi_base['doi'].astype(str)
        .str.lower()
        .str.replace(r'^https?://(dx\.)?doi\.org/', '', regex=True)
        .str.replace(r'\s+', '', regex=True)
    )
    dup = doi_base[
        doi_base['doi_norm'].ne('') & doi_base['doi_norm'].notna()
    ].duplicated('doi_norm', keep=False).sum()
    print('\nDuplicate DOI (normalized) at base level:', int(dup))
