---
title: "Code Contribution and Authorship"
author:
  - name: "Eva Maxfield Brown"
    email: evamxb@uw.edu
    orcid: 0000-0003-2564-0373
    affiliation:
      name: University of Washington Information School
      city: Seattle
      state: Washington
      country: USA
  - name: "Nicholas Weber"
    email: nmweber@uw.edu
    orcid: 0000-0002-6008-3763
    affiliation:
      name: University of Washington Information School
      city: Seattle
      state: Washington
      country: USA

abstract: |
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur eget porta erat. Morbi consectetur est vel gravida pretium. Suspendisse ut dui eu ante cursus gravida non sed sem. Nullam sapien tellus, commodo id velit id, eleifend volutpat quam. Phasellus mauris velit, dapibus finibus elementum vel, pulvinar non tellus. Nunc pellentesque pretium diam, quis maximus dolor faucibus id. Nunc convallis sodales ante, ut ullamcorper est egestas vitae. Nam sit amet enim ultrices, ultrices elit pulvinar, volutpat risus.

## Basics
bibliography: main.bib

## Number sections (required for section cross ref)
number-sections: true

## Citation Style Language
# See https://github.com/citation-style-language/styles for more options
# We default to PNAS (Proceedings of the National Academy of Sciences)
# csl: support/acm-proceedings.csl

## Specific for target format
format:
  html:
    code-tools: true
    code-fold: true
    code-summary: "Show the code"
    standalone: true
    embed-resources: true
    toc: true
    toc-location: left
    reference-location: margin
    citation-location: margin

  pdf:
    toc: false
    execute:
      echo: false
    include-in-header:  
      - text: |
          \usepackage{multirow}

---

# Introduction

In [1]:
from datetime import datetime
from pathlib import Path

import IPython.display
import pandas as pd
import statsmodels.api as sm
from sci_soft_models.dev_author_em.data import load_annotated_dev_author_em_dataset
from sqlalchemy import text
from sqlmodel import create_engine

from rs_graph.db import models as db_models

# Get db engine for production database
db_path = Path("rs-graph-temp.db").resolve().absolute()
db_conn = create_engine(f"sqlite:///{db_path}")

- Contemporary scientific research has become increasingly dependent on specialized software tools and computational methods.
  - define scientific software (scripts, tools, infrastructure)
  - importance in enabling large scale experiments and acting as a direct log of processing and analysis
  - scientific code sharing is on the rise

- Despite increased reliance on computational methodologies, the developers of scientific software have historically not been given traditional academic credit for their work: authorship on research articles.
  - qualitative research which talks about acknowledgements sections instead of authorship
  - lack of authorship can affect career prospects

- While new credit systems aim to be more inclusive towards more contribution types, they still suffer from two key problems.
	- Contributor Roles Taxonomy (CRediT) allows for specific “software” contribution
  - Others have used CRediT to understand the distribution of labor…
	- they are still based around an author list (it’s hard to change existing human practices, especially biased ones)
	- they aren’t verifiable, they are self-reported

- To address these problems, we create a novel predictive model that enables matching scientific article authors and source code developer accounts.
	- a predictive model is the best choice for entity matching because while authors have ORCIDs, developer accounts do not***
	- further, developer account information may differ slightly from publication information (preferred name vs. legal name, usernames, etc.)
	- a fine-tuned transformer model enables us to connect accounts which have similar enough information, hopefully providing us with many more author-code-contributor matches than would be possible on exact name or email address matching alone
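
To make the limits of exact matching concrete, the following is a minimal sketch of a name-only baseline. The helper names (`normalize`, `exact_name_match`) and the example pairs are illustrative assumptions, not part of our pipeline; the sketch only shows that abbreviated names and empty developer profiles defeat exact comparison.

```python
from __future__ import annotations


def normalize(value: str | None) -> str:
    """Lowercase and collapse whitespace; None becomes an empty string."""
    return " ".join((value or "").lower().split())


def exact_name_match(author_name: str, dev_name: str | None) -> bool:
    """Baseline: match only when the normalized names are identical."""
    normalized_dev_name = normalize(dev_name)
    return normalized_dev_name != "" and normalize(author_name) == normalized_dev_name


# Illustrative (author name, developer profile name) pairs
pairs = [
    ("Jonas Geuens", "Jonas Geuens"),  # identical names
    ("J. Geuens", "Jonas Geuens"),  # abbreviated given name
    ("Jane Doe", None),  # hypothetical: developer profile has no name set
]
baseline_results = [exact_name_match(author, dev) for author, dev in pairs]
print(baseline_results)
```

Only the first pair is recovered by this baseline, which is the gap a learned matcher is intended to close.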

- Our predictive model serves two primary purposes: identifying authors who directly contribute to an article’s associated codebase, and revealing developers who were not included on the article’s authorship list.
	- while predictive, it is grounded in the commit logs of source code repositories rather than in self-reported contributions
	- individuals who have been left off can at least for now be identified by their developer account

- Further, by applying our model across a large corpus of paired research articles and source code repositories, we enable objective insight into the software development dynamics of research teams.
	- much like studies of CRediT, we can investigate how many article authors contribute code
	- similarly, we can investigate who contributes code (by author position and external characteristics)
	- again, this is via commit logs and contribution histories, rather than self-reported data

- To summarize, this paper makes the following contributions:
	- we train, evaluate, and make publicly available a predictive model to match article authors with developer accounts
	- we create a large dataset of linked articles and source code repositories with accompanying bibliometric and repository information, and further match article authors with repository developers
	- we demonstrate the value of our predictive model through preliminary analysis of research team software development dynamics and code contributor characteristics

- The rest of this paper is organized as follows:
	- …

# Data and Methods

## Linking Scientific Articles and Associated Source Code Repositories

- Our trained predictive model and our preliminary analyses are based on datasets of linked bibliographic and source code repository information from multiple journals and publication platforms.
	- Each data source (the journals and publication platforms) either requires or recommends the sharing of code repositories related to a piece of work at the time of publication.
	- In turn, this allows us to mine each article for its required or recommended “data or code availability” links.
	- our data sources are:
    - PLOS: research articles
    - JOSS: software articles
    - SoftwareX: software articles
    - Papers with Code / ArXiv: pre-prints

- Using each data source, we process the pairs of scientific articles and associated source code repositories, in order to extract the authorship and source code repository contributor lists as well as other bibliometric and repository information.
	- we use OpenAlex to extract bibliometric information
	- we use the GitHub API to extract repository information
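
As a rough illustration of the two lookups, the sketch below only builds request URLs and performs no network calls. It assumes the public OpenAlex route for fetching a work by DOI and the GitHub REST route for listing repository contributors; the helper names, the DOI, and the repository are placeholders, and the endpoints our pipeline actually calls may differ.

```python
from urllib.parse import quote

OPENALEX_API = "https://api.openalex.org"
GITHUB_API = "https://api.github.com"


def openalex_work_url(doi: str) -> str:
    """URL for a single OpenAlex work record, addressed by DOI."""
    return f"{OPENALEX_API}/works/https://doi.org/{quote(doi, safe='/')}"


def github_contributors_url(owner: str, repo: str) -> str:
    """URL for the contributor listing of a GitHub repository."""
    return f"{GITHUB_API}/repos/{quote(owner)}/{quote(repo)}/contributors"


# Placeholder identifiers, for illustration only
work_url = openalex_work_url("10.1234/example")
contributors_url = github_contributors_url("example-org", "example-repo")
print(work_url)
print(contributors_url)
```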

In [2]:
def read_table(table: str) -> pd.DataFrame:
    return pd.read_sql(text(f"SELECT * FROM {table}"), db_conn)


# Read all data from database
doc_repo_links = read_table(db_models.DocumentRepositoryLink.__tablename__)
researchers = read_table(db_models.Researcher.__tablename__)
devs = read_table(db_models.DeveloperAccount.__tablename__)
documents = read_table(db_models.Document.__tablename__)
document_contributors = read_table(db_models.DocumentContributor.__tablename__)
repositories = read_table(db_models.Repository.__tablename__)
repository_contributors = read_table(db_models.RepositoryContributor.__tablename__)
topics = read_table(db_models.Topic.__tablename__)
document_topics = read_table(db_models.DocumentTopic.__tablename__)
dataset_sources = read_table(db_models.DatasetSource.__tablename__)
researcher_dev_links = read_table(
    db_models.ResearcherDeveloperAccountLink.__tablename__
)

# Drop all "updated_datetime" and "created_datetime" columns
for df in [
    doc_repo_links,
    researchers,
    devs,
    documents,
    document_contributors,
    repositories,
    repository_contributors,
    topics,
    document_topics,
    dataset_sources,
    researcher_dev_links,
]:
    df.drop(columns=["updated_datetime", "created_datetime"], inplace=True)

# Specifically drop doc_repo_links "id" column
# It isn't used and will get in the way later when we do a lot of joins
doc_repo_links.drop(columns=["id"], inplace=True)

# Construct reduced doc_repo_links
original_doc_repo_links_len = len(doc_repo_links)
doc_repo_links = doc_repo_links.drop_duplicates(subset=["document_id"], keep=False)
doc_repo_links = doc_repo_links.drop_duplicates(subset=["repository_id"], keep=False)
print(
    "doc-repo-links that point at mult- docs or repos:",
    original_doc_repo_links_len - len(doc_repo_links),
)
print("these are currently ignored / dropped before analysis")

# Reduce other tables to only documents / repositories in the updated doc_repo_links
documents = documents[documents["id"].isin(doc_repo_links["document_id"])]
repositories = repositories[repositories["id"].isin(doc_repo_links["repository_id"])]
document_contributors = document_contributors[
    document_contributors["document_id"].isin(documents["id"])
]
repository_contributors = repository_contributors[
    repository_contributors["repository_id"].isin(repositories["id"])
]
document_topics = document_topics[document_topics["document_id"].isin(documents["id"])]

# Reduce researchers and devs to only those in the
# updated document_contributors and repository_contributors
researchers = researchers[
    researchers["id"].isin(document_contributors["researcher_id"])
]
devs = devs[devs["id"].isin(repository_contributors["developer_account_id"])]
researcher_dev_links = researcher_dev_links[
    (
        researcher_dev_links["researcher_id"].isin(researchers["id"])
        & researcher_dev_links["developer_account_id"].isin(devs["id"])
    )
]

# Sort document topics and keep first
document_topics = document_topics.sort_values("score", ascending=False)
document_topics = document_topics.drop_duplicates(subset=["document_id"], keep="first")

# Create document, document topic merged table
merged_document_topics = pd.merge(
    document_topics, topics, left_on="topic_id", right_on="id"
)

# Create basic merged tables
merged_document_contributor_doc_repo_links = pd.merge(
    document_contributors, doc_repo_links, left_on="document_id", right_on="document_id"
)
merged_repository_contributor_doc_repo_links = pd.merge(
    repository_contributors,
    doc_repo_links,
    left_on="repository_id",
    right_on="repository_id",
)

# Compute stats for data sources
data_source_stats = []
for _, data_source in dataset_sources.iterrows():
    # Get total article-repo pairs
    data_source_stats.append(
        {
            "data_source": data_source["name"],
            "n_article_repo_pairs": len(
                doc_repo_links[doc_repo_links["dataset_source_id"] == data_source["id"]]
            ),
            "n_authors": merged_document_contributor_doc_repo_links.loc[
                merged_document_contributor_doc_repo_links["dataset_source_id"]
                == data_source["id"]
            ]["researcher_id"].nunique(),
            "n_devs": merged_repository_contributor_doc_repo_links.loc[
                merged_repository_contributor_doc_repo_links["dataset_source_id"]
                == data_source["id"]
            ]["developer_account_id"].nunique(),
        }
    )

# Create topic merged tables
merged_doc_repo_links_topics = pd.merge(
    doc_repo_links, document_topics, left_on="document_id", right_on="document_id"
).merge(topics, left_on="topic_id", right_on="id")
merged_doc_repo_links_topics_document_contributors = pd.merge(
    merged_doc_repo_links_topics,
    document_contributors,
    left_on="document_id",
    right_on="document_id",
)
merged_doc_repo_links_topics_repository_contributors = pd.merge(
    merged_doc_repo_links_topics,
    repository_contributors,
    left_on="repository_id",
    right_on="repository_id",
)

# Compute stats for domains
domain_stats = []
for domain in merged_doc_repo_links_topics.domain_name.unique():
    # Get total article-repo pairs
    domain_stats.append(
        {
            "domain": domain,
            "n_article_repo_pairs": len(
                merged_doc_repo_links_topics[
                    merged_doc_repo_links_topics["domain_name"] == domain
                ]
            ),
            "n_authors": merged_doc_repo_links_topics_document_contributors.loc[
                merged_doc_repo_links_topics_document_contributors["domain_name"]
                == domain
            ]["researcher_id"].nunique(),
            "n_devs": merged_doc_repo_links_topics_repository_contributors.loc[
                merged_doc_repo_links_topics_repository_contributors["domain_name"]
                == domain
            ]["developer_account_id"].nunique(),
        }
    )

# Create document merged tables
merged_doc_repo_links_documents = pd.merge(
    doc_repo_links, documents, left_on="document_id", right_on="id"
)
merged_doc_repo_links_documents_document_contributors = pd.merge(
    merged_doc_repo_links_documents,
    document_contributors,
    left_on="document_id",
    right_on="document_id",
)
merged_doc_repo_links_documents_repository_contributors = pd.merge(
    merged_doc_repo_links_documents,
    repository_contributors,
    left_on="repository_id",
    right_on="repository_id",
)

# Compute stats for document types
# This isn't a standard data pull
# In short:
# - pairs from PLOS are "research articles"
# - pairs from JOSS are "software articles"
# - pairs from SoftwareX are "software articles"
# - pairs from Papers with Code / ArXiv are "pre-prints"
#   UNLESS they have been published in a journal
# All of those should be easy to assert / apply a label to with the exception
# of Papers with Code / ArXiv pre-prints that have been published in a journal
# In that case, we need to look at the existing document type in the database
# If the document type is "preprint" use preprint, otherwise, if it's anything else,
# use "research article"

# Create a "reduced_doc_types" dataframe with document_id and "reduced_doc_type"
# columns
reduced_doc_types_rows = []
# We can use the "reduced_doc_types" dataframe to calculate the stats

# Iter over data sources even though we are looking for doc types
for _, data_source in dataset_sources.iterrows():
    # Get total article-repo pairs
    doc_type = None
    if data_source["name"] in ["plos", "joss", "softwarex"]:
        if data_source["name"] == "plos":
            doc_type = "research article"
        else:
            doc_type = "software article"

        # Add all document_ids to reduced_doc_types_rows
        reduced_doc_types_rows.extend(
            [
                {"document_id": doc_id, "reduced_doc_type": doc_type}
                for doc_id in doc_repo_links[
                    (doc_repo_links["dataset_source_id"] == data_source["id"])
                ]["document_id"]
            ]
        )

    # Handle PwC
    else:
        # Get preprint pairs
        preprint_pairs = merged_doc_repo_links_documents[
            (merged_doc_repo_links_documents["dataset_source_id"] == data_source["id"])
            & (merged_doc_repo_links_documents["document_type"] == "preprint")
        ]

        # Add all document_ids to reduced_doc_types_rows
        reduced_doc_types_rows.extend(
            [
                {"document_id": doc_id, "reduced_doc_type": "preprint"}
                for doc_id in preprint_pairs["document_id"]
            ]
        )

        # Get research article pairs
        # This is the same just inverted to != "preprint"
        research_article_pairs = merged_doc_repo_links_documents[
            (merged_doc_repo_links_documents["dataset_source_id"] == data_source["id"])
            & (merged_doc_repo_links_documents["document_type"] != "preprint")
        ]

        # Add all document_ids to reduced_doc_types_rows
        reduced_doc_types_rows.extend(
            [
                {"document_id": doc_id, "reduced_doc_type": "research article"}
                for doc_id in research_article_pairs["document_id"]
            ]
        )

# Create reduced_doc_types dataframe
reduced_doc_types = pd.DataFrame(reduced_doc_types_rows)

# Now compute stats
doc_type_stats = reduced_doc_types.groupby("reduced_doc_type").apply(
    lambda x: {
        "doc_type": x.name,
        "n_article_repo_pairs": len(x),
        "n_authors": merged_doc_repo_links_documents_document_contributors.loc[
            merged_doc_repo_links_documents_document_contributors["document_id"].isin(
                x["document_id"]
            )
        ]["researcher_id"].nunique(),
        "n_devs": merged_doc_repo_links_documents_repository_contributors.loc[
            merged_doc_repo_links_documents_repository_contributors["document_id"].isin(
                x["document_id"]
            )
        ]["developer_account_id"].nunique(),
    },
    include_groups=False,
)

# Compute stats for access status
access_stats = []
for access_status_int, access_status_name in [
    (0, "Closed"),
    (1, "Open"),
]:
    # Get total article-repo pairs
    access_stats.append(
        {
            "access_status": access_status_name,
            "n_article_repo_pairs": len(
                merged_doc_repo_links_documents[
                    merged_doc_repo_links_documents["is_open_access"]
                    == access_status_int
                ]
            ),
            "n_authors": merged_doc_repo_links_documents_document_contributors.loc[
                merged_doc_repo_links_documents_document_contributors["is_open_access"]
                == access_status_int
            ]["researcher_id"].nunique(),
            "n_devs": merged_doc_repo_links_documents_repository_contributors.loc[
                merged_doc_repo_links_documents_repository_contributors[
                    "is_open_access"
                ]
                == access_status_int
            ]["developer_account_id"].nunique(),
        }
    )

# Compute totals
total_article_repo_pairs = len(doc_repo_links)
total_authors = merged_document_contributor_doc_repo_links["researcher_id"].nunique()
total_devs = merged_repository_contributor_doc_repo_links[
    "developer_account_id"
].nunique()

###############################################################################
# Construct HTML Table

# Construct multi-row span HTML table
# Columns should be: "n_article_repo_pairs", "n_authors", "n_devs"
# Rows should be:
# "By Data Source", "By Domain", "By Document Type", "By Access Status", and "Total"

# HTML templates
stats_piece_inital_row_template = """
<tr>
  <td rowspan="{n_rows}">{row_name}</td>
  <td>{value_name}</td>
  <td>{n_article_repo_pairs}</td>
  <td>{n_authors}</td>
  <td>{n_devs}</td>
</tr>
""".strip()

stats_piece_subsequent_row_template = """
<tr>
  <td>{value_name}</td>
  <td>{n_article_repo_pairs}</td>
  <td>{n_authors}</td>
  <td>{n_devs}</td>
</tr>
""".strip()

# Iter over stats portions (and total)
stats_portions_html = []
for stats_portion, stats_name, value_key in [
    (data_source_stats, "<b>By Data Source</b>", "data_source"),
    (domain_stats, "<b>By Domain</b>", "domain"),
    (doc_type_stats, "<b>By Document Type</b>", "doc_type"),
    (access_stats, "<b>By Access Status</b>", "access_status"),
    (
        [
            {
                "empty": "",
                "n_article_repo_pairs": f"<b>{total_article_repo_pairs}</b>",
                "n_authors": f"<b>{total_authors}</b>",
                "n_devs": f"<b>{total_devs}</b>",
            }
        ],
        "<b>Total</b>",
        "empty",
    ),
]:
    # Order by article-repo pairs
    stats_portion = sorted(
        stats_portion, key=lambda x: x["n_article_repo_pairs"], reverse=True
    )

    stats_portion_html = []
    for i, stats_piece in enumerate(stats_portion):
        if i == 0:
            stats_portion_html.append(
                stats_piece_inital_row_template.format(
                    n_rows=len(stats_portion),
                    row_name=stats_name,
                    value_name=stats_piece[value_key],
                    n_article_repo_pairs=stats_piece["n_article_repo_pairs"],
                    n_authors=stats_piece["n_authors"],
                    n_devs=stats_piece["n_devs"],
                )
            )
        else:
            stats_portion_html.append(
                stats_piece_subsequent_row_template.format(
                    value_name=stats_piece[value_key],
                    n_article_repo_pairs=stats_piece["n_article_repo_pairs"],
                    n_authors=stats_piece["n_authors"],
                    n_devs=stats_piece["n_devs"],
                )
            )

    stats_portions_html.append("\n".join(stats_portion_html))

# Concat and wrap in table
stats_table_html = f"""
<table>
  <tr>
    <th><b>Category</b></th>
    <th><b>Subset</b></th>
    <th><b># Article-Repository Pairs</b></th>
    <th><b># Authors</b></th>
    <th><b># Developers</b></th>
  </tr>
  {" ".join(stats_portions_html)}
</table>
""".strip()

doc-repo-links that point at mult- docs or repos: 21512
these are currently ignored / dropped before analysis
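
The one-to-one filtering above relies on `drop_duplicates(..., keep=False)`, which removes every row sharing a duplicated key rather than keeping one representative. A small sketch on made-up identifiers (the toy `links` frame is an assumption for illustration) shows the effect:

```python
import pandas as pd

# Toy links: document 1 points at two repositories,
# and repository 12 is shared by documents 2 and 3.
links = pd.DataFrame(
    {
        "document_id": [1, 1, 2, 3, 4],
        "repository_id": [10, 11, 12, 12, 13],
    }
)

# keep=False drops all rows involved in a duplicate, not just the repeats
one_to_one = links.drop_duplicates(subset=["document_id"], keep=False)
one_to_one = one_to_one.drop_duplicates(subset=["repository_id"], keep=False)

n_dropped = len(links) - len(one_to_one)
print(one_to_one)
print("dropped:", n_dropped)
```

Only the link between document 4 and repository 13 survives, so every remaining document maps to exactly one repository and vice versa.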


- Our final dataset contains the bibliometric and code repository information for more than one hundred thousand pairs of scientific articles and source code repositories from multiple article types and fields.
  - Specifically, our dataset contains `{python} total_article_repo_pairs` article-repository pairs, `{python} total_authors` distinct authors, and `{python} total_devs` distinct developer accounts.

In [3]:
# | label: tbl-rs-graph-overall-counts
# | tbl-cap: "Counts of Article-Repository Pairs, Authors, and Developers broken out by Data Sources, Domains, Document Types, and Access Status."
# | echo: false

IPython.display.HTML(stats_table_html)

| Category | Subset | # Article-Repository Pairs | # Authors | # Developers |
|---|---|---|---|---|
| By Data Source | pwc | 117212 | 245480 | 125922 |
| By Data Source | plos | 6101 | 30272 | 8811 |
| By Data Source | joss | 2354 | 7157 | 11417 |
| By Domain | Physical Sciences | 105491 | 224368 | 121605 |
| By Domain | Social Sciences | 7993 | 26724 | 12753 |
| By Domain | Life Sciences | 7212 | 29914 | 11519 |
| By Domain | Health Sciences | 4747 | 24395 | 6738 |
| By Document Type | preprint | 64991 | 157838 | 80687 |
| By Document Type | research article | 58321 | 163619 | 74153 |
| By Document Type | software article | 2354 | 7157 | 11417 |


## Manual Matching of Article Authors and Source Code Repository Contributors

- Before we can train and validate a predictive entity matching model, we must first create a large annotated dataset of article authors and source code repository contributor pairs.
	- describe the task (we have info about an author identity and a developer identity, are they the same identity)
	- add figure for more detail

- We had two annotators each label 3000 pairs of article author and source code repository contributor information.
	- we use the subset of our dataset containing JOSS authors and contributors.
	- we use JOSS as we believe a software article sample will provide us with the highest rate of positive identity matches for training (or a somewhat balanced dataset)
	- we create author-developer-account annotation pairs using data from individual paper-repository pairs.
	- that is, an author and a developer account were only paired for annotation if they came from the same paper-repository pair, so we never annotated a pair that combined a developer account with an author from an unrelated paper
	- After each annotator completed labeling all 3000 author-code-contributor pairs, the annotators resolved any differences between their labels.
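
The pairing rule described above can be sketched as a per-pair Cartesian product. The record layout (`authors` and `developers` keys) and the names below are hypothetical; the sketch only illustrates that candidates are never formed across unrelated papers.

```python
from itertools import product

# Hypothetical article-repository pairs with their author and developer lists
article_repo_pairs = [
    {"authors": ["A. Author", "B. Author"], "developers": ["dev-one", "dev-two"]},
    {"authors": ["C. Author"], "developers": ["dev-three"]},
]


def build_annotation_candidates(pairs):
    """Pair each author only with developers from the same article-repository pair."""
    candidates = []
    for pair in pairs:
        candidates.extend(product(pair["authors"], pair["developers"]))
    return candidates


candidates = build_annotation_candidates(article_repo_pairs)
print(len(candidates))
```

Here the first pair contributes four candidates and the second contributes one; a combination such as `("C. Author", "dev-one")` is never generated.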

- Our final annotated dataset used for model training consists of the author names and source code repository contributor information from the 3000 labeled author-code-contributor pairs.
	- basic numbers, number of “positive” and “negative” matches
	- note however that some developer accounts do not have a complete set of information available
	- table of number of developer accounts with each feature and by their match

In [4]:
# Load annotated dataset
annotated_dataset = load_annotated_dev_author_em_dataset()

annotated_dataset

Fetching dev-author-em model data...


| Unnamed: 0 | github_id | semantic_scholar_id | dev_details | author_details | match |
|---|---|---|---|---|---|
| 0 | JonasGe | 48985590 | username: JonasGe;\nname: Jonas Geuens;\nemail... | name: J. Geuens;\nrepos: https://github.com/On... | True |
| 1 | lindonroberts | 2671934 | username: lindonroberts;\nname: Lindon Roberts... | name: Á. Bürmen;\nrepos: https://github.com/jf... | False |
| 2 | retdop | 3278559 | username: retdop;\nname: Gabriel Bastard;\nema... | name: David Eargle;\nrepos: https://github.com... | False |
| 3 | benjaminpope | 2086347474 | username: benjaminpope;\nname: Benjamin Pope;\... | name: Jordan Dennis;\nrepos: https://github.co... | False |
| 4 | zachmayer | 144385402 | username: zachmayer;\nname: Zach Deane-Mayer;\... | name: Yuan Tang;\nrepos: https://github.com/te... | False |
| ... | ... | ... | ... | ... | ... |
| 2994 | WilliamZekaiWang | 153846264 | username: WilliamZekaiWang;\nname: None;\nemai... | name: Mathias S. Renaud;\nrepos: https://githu... | False |
| 2995 | avalentino | 51907604 | username: avalentino;\nname: Antonio Valentino... | name: L. Uieda;\nrepos: https://github.com/fat... | False |
| 2996 | prakharb10 | 71208381 | username: prakharb10;\nname: Prakhar Bhatnagar... | name: Matthew Treinish;\nrepos: https://github... | False |
| 2997 | jmsexton03 | 2108239862 | username: jmsexton03;\nname: Jean M. Sexton;\n... | name: Weiqun Zhang;\nrepos: https://github.com... | False |


# A Predictive Model for Matching Article Authors and Source Code Contributors

- To optimize our predictive model for author-contributor matching, we evaluate a variety of Transformer-based architectures and input features.
	- multiple transformer base models available and there isn’t clear information as to which is “best” for entity matching
	- we have minimal info for authors, just their name, but we have a few features for developer accounts and it isn’t clear which are most important or useful
	- explain potential problems and benefits of certain features
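
One way to picture the developer-side inputs is as a text serialization of a chosen subset of account fields. The sketch below assumes a `field: value;` line format modeled on the `dev_details` column of the annotated dataset; the function name, the feature-set names, and the example account are hypothetical, and the third feature set is an assumption added for illustration.

```python
# Candidate feature sets; only "username" and "username+name" are named in the text,
# "username+name+email" is an assumed extension.
FEATURE_SETS = {
    "username": ("username",),
    "username+name": ("username", "name"),
    "username+name+email": ("username", "name", "email"),
}


def serialize_developer(dev: dict, features: tuple) -> str:
    """Render the selected account fields as 'field: value;' lines."""
    return "\n".join(f"{feature}: {dev.get(feature)};" for feature in features)


# Hypothetical developer account with no public email
example_dev = {"username": "jdoe", "name": "Jane Doe", "email": None}

serialized = {
    set_name: serialize_developer(example_dev, features)
    for set_name, features in FEATURE_SETS.items()
}
print(serialized["username+name+email"])
```

Missing fields are rendered as `None`, which mirrors how incomplete profiles appear in the annotated data and is one reason sparse features may help less than expected.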

- To ensure that our trained model is as accurate as possible, we trained and evaluated multiple combinations of pre-trained Transformer base models and different developer account information feature sets.
	- explain the feature sets a bit more (username only, username + name, etc.)
	- explain the testing strategy (10% of unique authors and developers are used for testing)
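
A hold-out based on unique entities, rather than on rows, can be sketched as follows. This is one possible reading of the strategy and not necessarily our exact procedure: it assumes that any pair touching a held-out author or developer goes to the test set, so the training set never sees a held-out entity. The function name, record keys, and toy data are hypothetical.

```python
import random


def entity_disjoint_split(pairs, test_fraction=0.1, seed=0):
    """Hold out a fraction of unique authors and developers for testing."""
    rng = random.Random(seed)
    authors = sorted({pair["author"] for pair in pairs})
    developers = sorted({pair["developer"] for pair in pairs})

    def hold_out(entities):
        # Always hold out at least one entity
        n_held_out = max(1, round(len(entities) * test_fraction))
        return set(rng.sample(entities, n_held_out))

    held_out_authors = hold_out(authors)
    held_out_developers = hold_out(developers)

    train, test = [], []
    for pair in pairs:
        # Assumption: a pair is a test pair if either side is held out
        is_test = (
            pair["author"] in held_out_authors
            or pair["developer"] in held_out_developers
        )
        (test if is_test else train).append(pair)
    return train, test, held_out_authors, held_out_developers


# Toy annotated pairs: 20 unique authors and 15 unique developers
toy_pairs = [
    {"author": f"author-{i % 20}", "developer": f"developer-{i % 15}"}
    for i in range(60)
]
train, test, held_out_authors, held_out_developers = entity_disjoint_split(toy_pairs)
print(len(train), len(test))
```

Splitting by entity in this way keeps the model from being evaluated on names or usernames it has already seen during fine-tuning.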

- After testing all base-model and feature set combinations, we find that our best performing model is fine-tuned from: Y and uses Z features.
	- specifics of best model
	- table of model configurations and results
	- minor observations about feature sets that perform poorly

- Finally, we make our best performing model publicly available for reuse.
	- We provide a structured Python library for interaction with the model at link
	- Direct access to the model files can be found on Hugging Face.

# Preliminary Analysis of Code Contributor Authorship and Development Dynamics of Research Teams

In [5]:
# Create rolled up counts for each researcher

# Get all documents tied to this researcher
researcher_docs = document_contributors[[
    "researcher_id",
    "document_id",
    "position",
    "is_corresponding",
]].rename(
    columns={"position": "author_position", "is_corresponding": "is_corresponding_author"},
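    # Only relabel the boolean flag column, not ids that happen to equal 0 or 1
).replace({"is_corresponding_author": {0: "is not", 1: "is"}}).merge(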
    documents[[
        "id",
        "is_open_access",
    ]].rename(
        columns={"is_open_access": "is_open_access_document"},
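        # Only relabel the boolean flag column, not ids that happen to equal 0 or 1
    ).replace({"is_open_access_document": {0: "is not", 1: "is"}}),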
    left_on="document_id",
    right_on="id",
).drop(
    columns=["id"],  # remove "id" from document table
).merge(
    reduced_doc_types.rename(
        columns={"reduced_doc_type": "document_type"},
    ),
    left_on="document_id",
    right_on="document_id",
).merge(
    document_topics[[
        "document_id",
        "topic_id",
    ]],
    left_on="document_id",
    right_on="document_id",
).merge(
    topics[[
        "id",
        "domain_name",
    ]],
    left_on="topic_id",
    right_on="id",
).drop(
    columns=["id", "topic_id"],  # remove "id" from topics table
).merge(
    doc_repo_links[[
        "document_id",
        "dataset_source_id",
    ]],
    left_on="document_id",
    right_on="document_id",
).merge(
    dataset_sources[[
        "id",
        "name",
    ]].rename(columns={"name": "dataset"}),
    left_on="dataset_source_id",
    right_on="id",
).drop(
    columns=["id", "dataset_source_id"],  # remove "id" from dataset_sources table
)

researcher_position_counts = researcher_docs.groupby("researcher_id")["author_position"].value_counts().unstack(fill_value=0)
researcher_position_counts["most_frequent_author_position"] = researcher_position_counts.idxmax(axis=1)
researcher_position_counts = researcher_position_counts.rename(
    columns={
        position: f"n_times_{position}_author"
        for position in researcher_docs["author_position"].unique()
    }
)

researcher_is_corresponding_counts = researcher_docs.groupby("researcher_id")["is_corresponding_author"].value_counts().unstack(fill_value=0)
researcher_is_corresponding_counts["most_frequent_corresponding_author_status"] = researcher_is_corresponding_counts.idxmax(axis=1)
researcher_is_corresponding_counts = researcher_is_corresponding_counts.rename(
    columns={
        is_corresponding: f"n_times_{is_corresponding.replace(' ', '_')}_corresponding_author"
        for is_corresponding in researcher_docs["is_corresponding_author"].unique()
    }
)

document_is_open_access_counts = researcher_docs.groupby("researcher_id")["is_open_access_document"].value_counts().unstack(fill_value=0)
document_is_open_access_counts["most_frequent_document_open_access_status"] = document_is_open_access_counts.idxmax(axis=1)
document_is_open_access_counts = document_is_open_access_counts.rename(
    columns={
        is_open_access: f"n_times_document_{is_open_access.replace(' ', '_')}_open_access"
        for is_open_access in researcher_docs["is_open_access_document"].unique()
    }
)

document_type_counts = researcher_docs.groupby("researcher_id")["document_type"].value_counts().unstack(fill_value=0)
document_type_counts["most_frequent_document_type"] = document_type_counts.idxmax(axis=1)
document_type_counts = document_type_counts.rename(
    columns={
        doc_type: f"n_times_document_{doc_type.replace(' ', '_')}"
        for doc_type in researcher_docs["document_type"].unique()
    }
)

domain_counts = researcher_docs.groupby("researcher_id")["domain_name"].value_counts().unstack(fill_value=0)
domain_counts["most_frequent_domain"] = domain_counts.idxmax(axis=1)
domain_counts = domain_counts.rename(
    columns={
        domain: f"n_times_domain_{domain.replace(' ', '_')}"
        for domain in researcher_docs["domain_name"].unique()
    }
)

dataset_counts = researcher_docs.groupby("researcher_id")["dataset"].value_counts().unstack(fill_value=0)
dataset_counts["most_frequent_dataset"] = dataset_counts.idxmax(axis=1)
dataset_counts = dataset_counts.rename(
    columns={
        dataset: f"n_times_dataset_{dataset.replace(' ', '_')}"
        for dataset in researcher_docs["dataset"].unique()
    }
)

# Merge all the counts
researcher_counts = researchers[[
    "id",
    "works_count",
    "cited_by_count",
    "h_index",
    "i10_index",
    "two_year_mean_citedness",
]].rename(columns={"id": "researcher_id"}).copy()
researcher_counts = researcher_counts.merge(
    researcher_position_counts,
    left_on="researcher_id",
    right_on="researcher_id",
).merge(
    researcher_is_corresponding_counts,
    left_on="researcher_id",
    right_on="researcher_id",
).merge(
    document_is_open_access_counts,
    left_on="researcher_id",
    right_on="researcher_id",
).merge(
    document_type_counts,
    left_on="researcher_id",
    right_on="researcher_id",
).merge(
    domain_counts,
    left_on="researcher_id",
    right_on="researcher_id",
).merge(
    dataset_counts,
    left_on="researcher_id",
    right_on="researcher_id",
)

# Get the count of the number of documents we have for each researcher
researcher_docs_count = researcher_docs["researcher_id"].value_counts().to_dict()
researcher_counts["n_documents"] = researcher_counts["researcher_id"].map(researcher_docs_count)

# Get the number of times a researcher has coded on a document
researcher_coded_counts = researcher_dev_links[[
    "researcher_id",
    "developer_account_id",
]].merge(
    document_contributors[[
        "document_id",
        "researcher_id",
    ]],
    left_on="researcher_id",
    right_on="researcher_id",
)

# Get the number of times a researcher has coded
researcher_coded_counts = researcher_coded_counts.groupby("researcher_id")["document_id"].count().to_frame().reset_index().rename(
    columns={"document_id": "n_times_coded"}
)

# Merge the counts
researcher_counts = researcher_counts.merge(
    researcher_coded_counts,
    left_on="researcher_id",
    right_on="researcher_id",
    how="left",
)
researcher_counts["n_times_coded"] = researcher_counts["n_times_coded"].fillna(0)

# Finally, add "most_frequent_coding_status": True when "n_times_coded" is at least 50% of "n_documents"
researcher_counts["most_frequent_coding_status"] = (
    researcher_counts["n_times_coded"] >= (0.5 * researcher_counts["n_documents"])
)

# Convert specific columns to 1 and 0
for col in [
    "most_frequent_corresponding_author_status",
    "most_frequent_document_open_access_status",
]:
    researcher_counts[col] = researcher_counts[col].apply(lambda x: 1 if x == "is" else 0)

# Get dummies
researcher_counts_with_dummies = pd.get_dummies(
    researcher_counts,
    columns=[
        "most_frequent_author_position",
        "most_frequent_document_type",
        "most_frequent_domain",
        "most_frequent_dataset",
    ],
    drop_first=True,
)

# Construct X and y, add constant, convert to float
X = researcher_counts_with_dummies.drop(
    columns=[
        "researcher_id",
        "cited_by_count",
        "h_index",
        "i10_index",
        "two_year_mean_citedness",
    ],
)
X = sm.add_constant(X)
X = X.astype(float)

In [6]:
y = researcher_counts_with_dummies["cited_by_count"]
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
model.summary()

Dep. Variable:,cited_by_count,No. Observations:,276455.0
Model:,GLM,Df Residuals:,276429.0
Model Family:,Poisson,Df Model:,25.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-976420000.0
Date:,"Mon, 04 Nov 2024",Deviance:,1950900000.0
Time:,15:34:47,Pearson chi2:,6250000000.0
No. Iterations:,33,Pseudo R-squ. (CS):,1.0
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,6.9506,0.001,1.11e+04,0.000,6.949,6.952
works_count,0.0002,1.56e-08,1.52e+04,0.000,0.000,0.000
n_times_first_author,-0.0728,4.54e-05,-1604.539,0.000,-0.073,-0.073
n_times_last_author,0.0093,2.23e-05,418.720,0.000,0.009,0.009
n_times_middle_author,0.0485,2.69e-05,1804.206,0.000,0.048,0.049
n_times_is_corresponding_author,-0.0332,4.48e-05,-741.554,0.000,-0.033,-0.033
n_times_is_not_corresponding_author,0.0183,3.98e-05,458.484,0.000,0.018,0.018
most_frequent_corresponding_author_status,0.0926,0.000,536.283,0.000,0.092,0.093
n_times_document_is_open_access,0.1632,3e-05,5432.020,0.000,0.163,0.163
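
Because the model uses a log link, each coefficient is easier to read as a multiplicative effect after exponentiation. A minimal sketch, using two coefficients copied from the summary above; the dictionary and variable names are illustrative and not part of the analysis code:

```python
import math

# Coefficients copied from the cited_by_count summary above (log scale).
coefs = {
    "n_times_document_is_open_access": 0.1632,
    "n_times_first_author": -0.0728,
}

# With a log link, exp(coef) is the multiplicative change in the expected
# outcome for a one-unit increase in the predictor, other terms held fixed.
rate_ratios = {name: math.exp(value) for name, value in coefs.items()}

for name, ratio in rate_ratios.items():
    print(f"{name}: x{ratio:.3f} per additional unit")
```

A ratio above 1 corresponds to a higher expected count and a ratio below 1 to a lower one. This reading assumes the model is otherwise well specified.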


In [7]:
y = researcher_counts_with_dummies["h_index"]
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
model.summary()

Dep. Variable:,h_index,No. Observations:,276455.0
Model:,GLM,Df Residuals:,276429.0
Model Family:,Poisson,Df Model:,25.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-2438400.0
Date:,"Mon, 04 Nov 2024",Deviance:,3888800.0
Time:,15:34:51,Pearson chi2:,4860000.0
No. Iterations:,21,Pseudo R-squ. (CS):,0.9728
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,2.2004,0.009,240.253,0.000,2.182,2.218
works_count,0.0002,3.49e-07,549.516,0.000,0.000,0.000
n_times_first_author,-0.0186,0.001,-31.348,0.000,-0.020,-0.017
n_times_last_author,-0.0126,0.000,-39.418,0.000,-0.013,-0.012
n_times_middle_author,0.0213,0.000,56.251,0.000,0.021,0.022
n_times_is_corresponding_author,-0.0065,0.001,-8.733,0.000,-0.008,-0.005
n_times_is_not_corresponding_author,-0.0034,0.001,-5.249,0.000,-0.005,-0.002
most_frequent_corresponding_author_status,0.0692,0.003,27.360,0.000,0.064,0.074
n_times_document_is_open_access,0.1439,0.001,277.340,0.000,0.143,0.145


In [8]:
y = researcher_counts_with_dummies["i10_index"]
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
model.summary()

Dep. Variable:,i10_index,No. Observations:,276455.0
Model:,GLM,Df Residuals:,276429.0
Model Family:,Poisson,Df Model:,25.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-8631100.0
Date:,"Mon, 04 Nov 2024",Deviance:,16318000.0
Time:,15:35:12,Pearson chi2:,33400000.0
No. Iterations:,100,Pseudo R-squ. (CS):,1.0
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,2.6078,0.006,409.598,0.000,2.595,2.620
works_count,0.0002,1.47e-07,1673.572,0.000,0.000,0.000
n_times_first_author,-0.0453,0.000,-113.232,0.000,-0.046,-0.044
n_times_last_author,-0.0023,0.000,-11.587,0.000,-0.003,-0.002
n_times_middle_author,0.0268,0.000,106.815,0.000,0.026,0.027
n_times_is_corresponding_author,-0.0220,0.000,-48.810,0.000,-0.023,-0.021
n_times_is_not_corresponding_author,0.0013,0.000,3.077,0.002,0.000,0.002
most_frequent_corresponding_author_status,0.0728,0.002,42.890,0.000,0.069,0.076
n_times_document_is_open_access,0.1581,0.000,582.845,0.000,0.158,0.159


In [9]:
y = researcher_counts_with_dummies["two_year_mean_citedness"]
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
model.summary()

Dep. Variable:,two_year_mean_citedness,No. Observations:,276455.0
Model:,GLM,Df Residuals:,276429.0
Model Family:,Poisson,Df Model:,25.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-1429500.0
Date:,"Mon, 04 Nov 2024",Deviance:,2261400.0
Time:,15:35:32,Pearson chi2:,22000000.0
No. Iterations:,100,Pseudo R-squ. (CS):,0.2185
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,0.9291,0.018,51.095,0.000,0.894,0.965
works_count,2.327e-05,1.68e-06,13.889,0.000,2e-05,2.66e-05
n_times_first_author,0.0604,0.001,63.967,0.000,0.059,0.062
n_times_last_author,-0.0448,0.001,-73.431,0.000,-0.046,-0.044
n_times_middle_author,-0.0062,0.001,-8.820,0.000,-0.008,-0.005
n_times_is_corresponding_author,-0.0253,0.002,-13.575,0.000,-0.029,-0.022
n_times_is_not_corresponding_author,0.0348,0.002,21.096,0.000,0.032,0.038
most_frequent_corresponding_author_status,0.0319,0.006,5.786,0.000,0.021,0.043
n_times_document_is_open_access,0.0490,0.001,35.868,0.000,0.046,0.052
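
One quick diagnostic for the four researcher-level fits above is the ratio of the Pearson chi-squared statistic to the residual degrees of freedom. The sketch below only reuses numbers printed in the summaries; the variable names are illustrative.

```python
# Pearson chi2 and residual degrees of freedom copied from the four
# researcher-level summaries above.
df_resid = 276429.0
pearson_chi2 = {
    "cited_by_count": 6250000000.0,
    "h_index": 4860000.0,
    "i10_index": 33400000.0,
    "two_year_mean_citedness": 22000000.0,
}

# For a well-specified Poisson model this ratio should be close to 1;
# values far above 1 suggest overdispersion.
dispersion = {name: chi2 / df_resid for name, chi2 in pearson_chi2.items()}

for name, ratio in dispersion.items():
    print(f"{name}: Pearson chi2 / df_resid = {ratio:,.1f}")
```

If the ratios are far above 1, the nonrobust standard errors reported above are probably too small. Possible remedies to evaluate, not necessarily the right choice for this data, include a negative binomial family (`sm.families.NegativeBinomial`) or a robust covariance estimate via the `cov_type` argument of `fit`. Two of the fits also report 100 iterations, which appears to be the iteration cap, so their convergence is worth checking.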


In [10]:
# Create a rolled-up table for each document

# Start with the number of authors per document
team_members_per_doc = document_contributors.groupby("document_id").size().reset_index().rename(columns={0: "n_authors"})

# Create lookup table for document code contributors
document_repository_contribs_lookup = repository_contributors[[
    "repository_id",
    "developer_account_id",
]].merge(
    doc_repo_links[[
        "repository_id",
        "document_id",
    ]],
    left_on="repository_id",
    right_on="repository_id",
).groupby("document_id").size().reset_index().rename(columns={0: "n_code_contributors"})

team_members_per_doc = team_members_per_doc.merge(
    document_repository_contribs_lookup,
    on="document_id",
    how="left",
).fillna(0)

# Create a lookup table for document code contributing authors
doc_code_contrib_authors = repository_contributors[[
    "repository_id",
    "developer_account_id",
]].merge(
    researcher_dev_links[[
        "researcher_id",
        "developer_account_id",
    ]],
    left_on="developer_account_id",
    right_on="developer_account_id",
    how="right",
).merge(
    doc_repo_links[[
        "repository_id",
        "document_id",
    ]],
    left_on="repository_id",
    right_on="repository_id",
    how="left",
)

team_members_per_doc = team_members_per_doc.merge(
    doc_code_contrib_authors,
    on="document_id",
    how="left",
).groupby("document_id").agg({
    "n_authors": "first",
    "n_code_contributors": "first",
    "developer_account_id": "nunique",
}).reset_index().rename(columns={"developer_account_id": "n_code_contributing_authors"})

team_members_per_doc["n_code_contributing_non_authors"] = (
    team_members_per_doc["n_code_contributors"] - team_members_per_doc["n_code_contributing_authors"]
)

# Merge the document metadata
team_members_per_doc = team_members_per_doc.merge(
    documents[[
        "id",
        "publication_date",
        "cited_by_count",
        "cited_by_percentile_year_min",
        "fwci",
        "is_open_access",
    ]],
    left_on="document_id",
    right_on="id",
).drop(
    columns=["id"],  # drop document id
).merge(
    reduced_doc_types[[
        "document_id",
        "reduced_doc_type",
    ]].rename(columns={"reduced_doc_type": "document_type"}),
).merge(
    document_topics[[
        "document_id",
        "topic_id",
    ]],
    left_on="document_id",
    right_on="document_id",
).merge(
    topics[[
        "id",
        "domain_name",
    ]],
    left_on="topic_id",
    right_on="id",
).drop(
    columns=["topic_id", "id"],  # drop topic ids
).merge(
    doc_repo_links[[
        "document_id",
        "dataset_source_id",
    ]],
    left_on="document_id",
    right_on="document_id",
).merge(
    dataset_sources[[
        "id",
        "name",
    ]].rename(columns={"name": "data_source"}),
    left_on="dataset_source_id",
    right_on="id",
).drop(
    columns=["dataset_source_id", "id"],  # drop dataset source id
)

# Compute duration in years since publication from current datetime
team_members_per_doc["years_since_publication"] = (
    datetime.now() - pd.to_datetime(team_members_per_doc["publication_date"])
).dt.days / 365.25

# Remove publication date
team_members_per_doc = team_members_per_doc.drop(columns=["publication_date"])

# Add total team size column (n_authors + n_code_contributing_non_authors)
team_members_per_doc["total_team_size"] = (
    team_members_per_doc["n_authors"] + team_members_per_doc["n_code_contributing_non_authors"]
)

# Add columns for each of the counts as a percentage of the total team size
team_members_per_doc["percent_authors"] = team_members_per_doc["n_authors"] / team_members_per_doc["total_team_size"]
team_members_per_doc["percent_code_contributors"] = team_members_per_doc["n_code_contributors"] / team_members_per_doc["total_team_size"]
team_members_per_doc["percent_code_contributing_authors"] = team_members_per_doc["n_code_contributing_authors"] / team_members_per_doc["total_team_size"]
team_members_per_doc["percent_code_contributing_non_authors"] = team_members_per_doc["n_code_contributing_non_authors"] / team_members_per_doc["total_team_size"]

# Remove any publications published in the last year
team_members_per_doc = team_members_per_doc[
    team_members_per_doc["years_since_publication"] >= 1
]

# Create dummies for document_type, domain, and data_source
team_members_per_doc_with_dummies = pd.get_dummies(
    team_members_per_doc,
    columns=["document_type", "domain_name", "data_source"],
    drop_first=True,
)

team_members_per_doc_with_dummies

,document_id,n_authors,n_code_contributors,n_code_contributing_authors,n_code_contributing_non_authors,cited_by_count,cited_by_percentile_year_min,fwci,is_open_access,years_since_publication,...,percent_code_contributors,percent_code_contributing_authors,percent_code_contributing_non_authors,document_type_research article,document_type_software article,domain_name_Life Sciences,domain_name_Physical Sciences,domain_name_Social Sciences,data_source_plos,data_source_pwc
0,1,4,1.0,1,0.0,0,0,0.000,1,3.214237,...,0.250000,0.250000,0.00,True,False,False,False,False,False,True
1,2,1,1.0,1,0.0,1,58,,1,3.841205,...,1.000000,1.000000,0.00,False,False,False,True,False,False,True
2,3,5,1.0,1,0.0,34,95,2.795,1,4.843258,...,0.200000,0.200000,0.00,True,False,False,True,False,False,True
3,7,6,1.0,1,0.0,73,98,4.176,1,5.303217,...,0.166667,0.166667,0.00,True,False,False,True,False,False,True
4,8,8,7.0,5,2.0,1,58,,1,3.841205,...,0.700000,0.500000,0.20,False,False,False,True,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125426,142277,4,2.0,2,0.0,2,81,,1,1.842574,...,0.500000,0.500000,0.00,False,False,False,True,False,False,True
125427,142278,4,1.0,1,0.0,13,90,,1,3.841205,...,0.250000,0.250000,0.00,False,False,False,True,False,False,True
125428,142279,4,3.0,3,0.0,1,61,,1,2.841889,...,0.750000,0.750000,0.00,False,False,False,False,False,False,True
125429,142280,2,1.0,1,0.0,0,0,,1,2.841889,...,0.500000,0.500000,0.00,False,False,False,True,False,False,True


In [11]:
# Remove outliers for cited_by_count
data_for_cited_by_count = team_members_per_doc_with_dummies.copy()

data_for_cited_by_count = data_for_cited_by_count[
    data_for_cited_by_count["cited_by_count"] < data_for_cited_by_count["cited_by_count"].quantile(0.95)
]
data_for_cited_by_count = data_for_cited_by_count[
    data_for_cited_by_count["cited_by_count"] > data_for_cited_by_count["cited_by_count"].quantile(0.05)
]

# Create model
X = data_for_cited_by_count.drop(
    columns=[
        "document_id",
        "cited_by_count",
        "fwci",
        "cited_by_percentile_year_min",
        "n_authors",
        "n_code_contributors",
        "n_code_contributing_authors",
        "n_code_contributing_non_authors",
        # "percent_authors",
        # "percent_code_contributors",
        # "percent_code_contributing_authors",
        # "percent_code_contributing_non_authors",
    ],
)

# Add constant
X = sm.add_constant(X)
X = X.astype(float)

# Fit model
y = data_for_cited_by_count["cited_by_count"]
model = sm.GLM(y, X, family=sm.families.Poisson()).fit(maxiter=100)
model.summary()

Dep. Variable:,cited_by_count,No. Observations:,70311.0
Model:,GLM,Df Residuals:,70299.0
Model Family:,Poisson,Df Model:,11.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-758740.0
Date:,"Mon, 04 Nov 2024",Deviance:,1242700.0
Time:,15:35:35,Pearson chi2:,1660000.0
No. Iterations:,100,Pseudo R-squ. (CS):,0.984
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,0.6779,0.005,134.137,0.000,0.668,0.688
is_open_access,0.2744,0.008,36.540,0.000,0.260,0.289
years_since_publication,0.2009,0.000,501.850,0.000,0.200,0.202
total_team_size,0.0115,9.17e-05,125.610,0.000,0.011,0.012
percent_authors,0.3434,0.004,97.836,0.000,0.337,0.350
percent_code_contributors,0.0814,0.002,38.087,0.000,0.077,0.086
percent_code_contributing_authors,-0.2531,0.003,-93.560,0.000,-0.258,-0.248
percent_code_contributing_non_authors,0.3345,0.003,99.053,0.000,0.328,0.341
document_type_research article,0.6448,0.002,302.126,0.000,0.641,0.649
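
The two-step outlier filter in the cell above computes the 5% cutoff on data that has already been trimmed at the 95% cutoff, so the lower bound is not the 5% quantile of the original column. A minimal sketch of the difference on made-up data (the series and variable names are illustrative):

```python
import pandas as pd

# Toy stand-in for the "cited_by_count" column (values are made up).
counts = pd.Series(range(103), name="cited_by_count")

# Two-step filter as in the cell above: the lower cutoff is computed
# on data that has already lost its upper tail.
step_one = counts[counts < counts.quantile(0.95)]
two_step = step_one[step_one > step_one.quantile(0.05)]

# Alternative: compute both cutoffs once on the untrimmed column.
lower, upper = counts.quantile(0.05), counts.quantile(0.95)
one_step = counts[(counts > lower) & (counts < upper)]

print(len(two_step), len(one_step))
```

On real data the difference is likely small, and switching to the single-pass version would change the observation counts reported in the summary, so this is an option to weigh rather than a required fix.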


In [12]:
# Remove outliers for fwci
data_for_fwci = team_members_per_doc_with_dummies.copy()

data_for_fwci = data_for_fwci[
    data_for_fwci["fwci"] < data_for_fwci["fwci"].quantile(0.95)
]
data_for_fwci = data_for_fwci[
    data_for_fwci["fwci"] > data_for_fwci["fwci"].quantile(0.05)
]

# Create model
X = data_for_fwci.drop(
    columns=[
        "document_id",
        "cited_by_count",
        "fwci",
        "cited_by_percentile_year_min",
        "n_authors",
        "n_code_contributors",
        "n_code_contributing_authors",
        "n_code_contributing_non_authors",
        # "percent_authors",
        # "percent_code_contributors",
        # "percent_code_contributing_authors",
        # "percent_code_contributing_non_authors",
    ],
)

# Add constant
X = sm.add_constant(X)
X = X.astype(float)

# Fit model
y = data_for_fwci["fwci"]
model = sm.GLM(y, X, family=sm.families.Poisson()).fit(maxiter=100)
model.summary()

Dep. Variable:,fwci,No. Observations:,42237.0
Model:,GLM,Df Residuals:,42225.0
Model Family:,Poisson,Df Model:,11.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-116110.0
Date:,"Mon, 04 Nov 2024",Deviance:,122240.0
Time:,15:35:36,Pearson chi2:,150000.0
No. Iterations:,100,Pseudo R-squ. (CS):,0.1735
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,0.3544,0.160,2.214,0.027,0.041,0.668
is_open_access,0.1139,0.018,6.450,0.000,0.079,0.149
years_since_publication,0.0815,0.001,58.986,0.000,0.079,0.084
total_team_size,0.0092,0.000,32.794,0.000,0.009,0.010
percent_authors,0.1370,0.096,1.425,0.154,-0.051,0.325
percent_code_contributors,0.0301,0.033,0.924,0.355,-0.034,0.094
percent_code_contributing_authors,-0.1873,0.033,-5.692,0.000,-0.252,-0.123
percent_code_contributing_non_authors,0.2174,0.065,3.366,0.001,0.091,0.344
document_type_research article,0.1222,0.463,0.264,0.792,-0.785,1.029


In [13]:
# Outlier removal for cited_by_percentile_year_min is currently disabled (see the commented-out filter below)
data_for_cbpym = team_members_per_doc_with_dummies.copy()

# data_for_cbpym = data_for_cbpym[
#     data_for_cbpym["cited_by_percentile_year_min"] < data_for_cbpym["cited_by_percentile_year_min"].quantile(0.97)
# ]
# data_for_cbpym = data_for_cbpym[
#     data_for_cbpym["cited_by_percentile_year_min"] > data_for_cbpym["cited_by_percentile_year_min"].quantile(0.03)
# ]

# Create model
X = data_for_cbpym.drop(
    columns=[
        "document_id",
        "cited_by_count",
        "fwci",
        "cited_by_percentile_year_min",
        "n_authors",
        "n_code_contributors",
        "n_code_contributing_authors",
        "n_code_contributing_non_authors",
        # "percent_authors",
        # "percent_code_contributors",
        # "percent_code_contributing_authors",
        # "percent_code_contributing_non_authors",
    ],
)

# Add constant
X = sm.add_constant(X)
X = X.astype(float)

# Fit model
y = data_for_cbpym["cited_by_percentile_year_min"]
model = sm.GLM(y, X, family=sm.families.Poisson()).fit(maxiter=100)
model.summary()

Dep. Variable:,cited_by_percentile_year_min,No. Observations:,96551.0
Model:,GLM,Df Residuals:,96539.0
Model Family:,Poisson,Df Model:,11.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-1617000.0
Date:,"Mon, 04 Nov 2024",Deviance:,2761600.0
Time:,15:35:39,Pearson chi2:,1780000.0
No. Iterations:,100,Pseudo R-squ. (CS):,0.9977
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,1.487e+11,1.62e+10,9.170,0.000,1.17e+11,1.8e+11
is_open_access,0.0825,0.003,27.075,0.000,0.077,0.088
years_since_publication,0.0782,0.000,398.946,0.000,0.078,0.079
total_team_size,0.0057,5.08e-05,113.173,0.000,0.006,0.006
percent_authors,-2.031e+11,2.36e+10,-8.616,0.000,-2.49e+11,-1.57e+11
percent_code_contributors,-6.811e+10,7.95e+09,-8.566,0.000,-8.37e+10,-5.25e+10
percent_code_contributing_authors,6.811e+10,7.95e+09,8.566,0.000,5.25e+10,8.37e+10
percent_code_contributing_non_authors,-1.35e+11,1.56e+10,-8.642,0.000,-1.66e+11,-1.04e+11
document_type_research article,0.5364,0.001,618.255,0.000,0.535,0.538
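
The coefficients of magnitude 1e11 in the summary above are consistent with a rank-deficient design matrix. By construction, `total_team_size = n_authors + n_code_contributing_non_authors`, so `percent_authors + percent_code_contributing_non_authors` always equals 1 (the constant), and `percent_code_contributors` equals the sum of the other two code-contributor shares. A minimal sketch that reproduces the definitions on made-up counts and checks the rank:

```python
import numpy as np
import pandas as pd

# Toy counts (made up) that follow the same definitions as the rollup cell.
toy = pd.DataFrame({
    "n_authors": [4, 1, 5, 6, 8, 3],
    "n_code_contributors": [1, 1, 2, 3, 7, 4],
    "n_code_contributing_authors": [1, 1, 1, 2, 5, 2],
})
toy["n_code_contributing_non_authors"] = (
    toy["n_code_contributors"] - toy["n_code_contributing_authors"]
)
toy["total_team_size"] = toy["n_authors"] + toy["n_code_contributing_non_authors"]

# Constant plus the four share columns used in the models.
design = pd.DataFrame({"const": 1.0}, index=toy.index)
for kind in [
    "authors",
    "code_contributors",
    "code_contributing_authors",
    "code_contributing_non_authors",
]:
    design[f"percent_{kind}"] = toy[f"n_{kind}"] / toy["total_team_size"]

# Five columns, but two exact linear dependencies among them.
rank = np.linalg.matrix_rank(design.to_numpy())
print(f"{design.shape[1]} columns, rank {rank}")
```

One possible remedy, offered as a suggestion rather than a description of the intended analysis, is to drop one column from each dependency (for example `percent_authors` and `percent_code_contributors`) before fitting.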


- To enrich our pre-existing dataset, we apply our trained predictive model across pairs of authors and developer accounts.
	- again, these pairs are all combinations of author and developer account within an individual paper
	- specifics: how many unique author-developer account pairs are we able to find?
	- table of author-developer account pairs by data source / by field
	- we next use this enriched dataset to understand software development dynamics within research teams, and characterize the authors who are and who aren’t code contributors.
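
The within-paper pairing described above can be sketched as a join on the paper identifier. The table and column names below are hypothetical stand-ins, and the scoring step by the trained model is deliberately left out:

```python
import pandas as pd

# Hypothetical inputs: one row per (paper, author) and per (paper, developer account).
paper_authors = pd.DataFrame({
    "document_id": [1, 1, 2],
    "researcher_id": [10, 11, 12],
})
paper_dev_accounts = pd.DataFrame({
    "document_id": [1, 1, 2],
    "developer_account_id": [100, 101, 102],
})

# Joining on document_id alone yields every author x account combination
# within each paper, and never pairs across papers.
candidate_pairs = paper_authors.merge(paper_dev_accounts, on="document_id")

# Each candidate pair would then be scored by the trained matching model
# (not shown here).
print(candidate_pairs)
```

Because the number of candidate pairs grows with the product of authors and accounts per paper, large teams contribute disproportionately many pairs to score.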

## Software Development Dynamics Within Research Teams

- We begin by measuring the distributions of different coding and non-coding contributors across all of the article-code-repository pairs within our dataset.
	- explain more, what are the different types of contributions? (coding contributor, coding-with-authorship contributor, non-coding-author, etc.)
	- what are the basics / what do we see across the board? What are the distributions of each of these contributor types?
	- compare against analysis built on CRediT statements?

- Next we investigate if these distributions change over time, or, by “research team size”.
	- define research team size, in our case this is the total number of author-developers + non-coding authors + non-credited developers
	- plot the medians of the contributor type distributions over time (by publication year)
	- create subplots for different bins of research team size (e.g., <= 3 members, > 3 and <= 5, > 5 and <= 10, > 10) and show the distributions again.
	- results in summary

- We further investigate how these distributions are affected by article type and research domain.
	- refresher on article type (research articles, software articles, and pre-prints)
	- explain research domains
	- subplots of both
	- results in summary
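
The team-size definition and binning outlined above can be sketched as follows. The counts are made up and the column names are illustrative; only the definition of team size and the bin edges come from the outline.

```python
import numpy as np
import pandas as pd

# Made-up counts per article-repository pair; column names are illustrative.
pairs = pd.DataFrame({
    "n_author_developers": [1, 2, 1, 3, 0],
    "n_non_coding_authors": [1, 2, 4, 6, 12],
    "n_non_credited_developers": [0, 0, 1, 2, 3],
})

# Research team size as defined in the outline above: the sum of the three
# contributor types.
contributor_cols = list(pairs.columns)
pairs["team_size"] = pairs[contributor_cols].sum(axis=1)

# Right-inclusive bins: <=3, >3 to <=5, >5 to <=10, >10.
pairs["team_size_bin"] = pd.cut(
    pairs["team_size"],
    bins=[0, 3, 5, 10, np.inf],
    labels=["<=3", "4-5", "6-10", ">10"],
)

# Median count of each contributor type within each team-size bin.
medians = pairs.groupby("team_size_bin", observed=True)[contributor_cols].median()
print(medians)
```

The same grouping could be repeated by publication year, article type, or research domain to produce the subplots listed above.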

## Characteristics of Scientific Code Contributors

- Next we investigate the differences between coding and non-coding article authors.
	- specifics: author position in the authorship list is a commonly used signal in scientometrics
	- similarly, metrics of “scientific impact” such as h-index, i10-index, and two-year mean citedness are also available to us.
	- plot / table of the distributions for coding and non-coding authors
	- ANOVA / chi-squared tests to see if these differences are significant
	- results in summary

- Just as before, we next investigate if these results are affected by article type and research domain.
	- subplot + stats tests for differences by each article type
	- subplot + stats tests for differences by each domain
	- results in summary
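
As an illustration of the chi-squared comparison mentioned above, the sketch below computes the Pearson statistic by hand for a made-up table of coding status against most frequent author position. The data, column names, and category labels are all invented for the example.

```python
import numpy as np
import pandas as pd

# Made-up authors: whether they contributed code and their most frequent position.
authors = pd.DataFrame({
    "is_code_contributor": [True] * 6 + [False] * 6,
    "author_position": [
        "first", "first", "first", "first", "middle", "last",
        "first", "middle", "middle", "middle", "last", "last",
    ],
})

# Observed contingency table: coding status by author position.
observed = pd.crosstab(authors["is_code_contributor"], authors["author_position"])
obs = observed.to_numpy()

# Expected counts under independence: row total * column total / grand total.
row_totals = obs.sum(axis=1)[:, None]
col_totals = obs.sum(axis=0)[None, :]
expected = row_totals * col_totals / obs.sum()

# Pearson chi-squared statistic and its degrees of freedom.
chi2_stat = ((obs - expected) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)
print(f"chi2 = {chi2_stat:.3f}, dof = {dof}")
```

The toy counts here are too small for the test to be reliable; they only show the mechanics. In practice a library routine such as `scipy.stats.chi2_contingency` would also return the p-value.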

# Discussion