---
title: "Code Contribution and Authorship"
author:
  - name: "Eva Maxfield Brown"
    email: evamxb@uw.edu
    orcid: 0000-0003-2564-0373
    affiliation:
      name: University of Washington Information School
      city: Seattle
      state: Washington
      country: USA
  - name: "Nicholas Weber"
    email: nmweber@uw.edu
    orcid: 0000-0002-6008-3763
    affiliation:
      name: University of Washington Information School
      city: Seattle
      state: Washington
      country: USA

abstract: |
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur eget porta erat. Morbi consectetur est vel gravida pretium. Suspendisse ut dui eu ante cursus gravida non sed sem. Nullam sapien tellus, commodo id velit id, eleifend volutpat quam. Phasellus mauris velit, dapibus finibus elementum vel, pulvinar non tellus. Nunc pellentesque pretium diam, quis maximus dolor faucibus id. Nunc convallis sodales ante, ut ullamcorper est egestas vitae. Nam sit amet enim ultrices, ultrices elit pulvinar, volutpat risus.

## Basics
bibliography: main.bib

## Number sections (required for section cross ref)
number-sections: true

## Citation Style Language
# See https://github.com/citation-style-language/styles for more options
# We default to PNAS (Proceedings of the National Academy of Sciences)
# csl: support/acm-proceedings.csl

## Specific for target format
format:
  html:
    code-tools: true
    code-fold: true
    code-summary: "Show the code"
    standalone: true
    embed-resources: true
    toc: true
    toc-location: left
    reference-location: margin
    citation-location: margin

  pdf:
    toc: false
    execute:
      echo: false
    include-in-header:  
      - text: |
          \usepackage{multirow}

---

# Introduction

In [1]:
from pathlib import Path

import IPython.display
import pandas as pd
import statsmodels.api as sm
from sci_soft_models.dev_author_em.data import load_annotated_dev_author_em_dataset
from sqlalchemy import text
from sqlmodel import create_engine
import swifter  # noqa

from rs_graph.db import models as db_models

# Get db engine for production database
db_path = Path("rs-graph-temp.db").resolve().absolute()
db_conn = create_engine(f"sqlite:///{db_path}")

- Contemporary scientific research has become increasingly dependent on specialized software tools and computational methods.
  - define scientific software (scripts, tools, infrastructure)
  - importance in enabling large scale experiments and acting as a direct log of processing and analysis
  - scientific code sharing is on the rise

- Despite increased reliance on computational methodologies, the developers of scientific software have historically not been given traditional academic credit for their work: authorship on research articles.
  - qualitative research which talks about acknowledgements sections instead of authorship
  - lack of authorship can affect career prospects

- While new credit systems aim to be more inclusive towards more contribution types, they still suffer from two key problems.
	- Contributor Roles Taxonomy (CRediT) allows for a specific “software” contribution
	- Others have used CRediT to understand distribution of labor…
	- they are still based around an author list (it’s hard to change existing human practices, especially biased ones)
	- they aren’t verifiable because they are self-reported

- To address these problems, we create a novel predictive model that enables matching scientific article authors and source code developer accounts.
	- a predictive model is the best choice for entity matching because while authors have ORCIDs, developer accounts do not
	- further, developer account information may differ slightly from publication information (preferred name vs. legal name, usernames, etc.)
	- a fine-tuned transformer model enables us to connect accounts which have similar enough information, hopefully providing us with many more author-code-contributor matches than would be possible on exact name or email address matching alone
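
To make the limitation of exact matching concrete, here is a minimal sketch. It is not the fine-tuned transformer model described above; it substitutes a simple character-level similarity from Python's standard library (`difflib.SequenceMatcher`) purely for illustration. The author name, the username, and the helper names (`normalize`, `exact_match`, `similarity`) are hypothetical and are not taken from the dataset.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase a string and keep only letters and digits."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def exact_match(author_name: str, dev_name: str) -> bool:
    """Exact matching after normalization (the baseline that misses variants)."""
    return normalize(author_name) == normalize(dev_name)


def similarity(author_name: str, dev_name: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a learned matcher."""
    return SequenceMatcher(
        None, normalize(author_name), normalize(dev_name)
    ).ratio()


# Hypothetical records (assumed for illustration only)
author_name = "Jane Q. Doe"
dev_username = "janedoe"

is_exact = exact_match(author_name, dev_username)
score = similarity(author_name, dev_username)
print(f"exact match: {is_exact}, similarity: {score:.2f}")
```

Here the middle initial alone is enough to defeat exact matching, while a similarity-based approach still scores the pair highly. A learned model can additionally weigh several developer fields at once (username, display name, email), which a single string comparison cannot.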

- Our predictive model serves two primary purposes: identifying authors who directly contribute to an article’s associated codebase, and revealing developers who were not included on the article’s authorship list.
	- while predictive, it is grounded in the commit logs of source code repositories rather than self-reported contributions
	- individuals who have been left off can at least for now be identified by their developer account

- Further, by applying our model across a large corpus of paired research articles and source code repositories, we enable objective insight into the software development dynamics of research teams.
	- much like studies of CRediT, we can investigate how many article authors contribute code
	- similarly, we can investigate who contributes code (by author position and external characteristics)
	- again, this is via commit logs and contribution histories, rather than self-reported data

- To summarize, this paper makes the following contributions:
	- we train, evaluate, and make publicly available a predictive model to match article authors with developer accounts
	- we create a large dataset of linked articles and source code repositories with accompanying bibliometric and repository information, and further match article authors with repository developers
	- we demonstrate the value of our predictive model through preliminary analysis of research team software development dynamics and code contributor characteristics

- The rest of this paper is organized as follows:
	- …

# Data and Methods

## Linking Scientific Articles and Associated Source Code Repositories

- Our trained predictive model and our preliminary analyses are based on datasets of linked bibliographic and source code repository information from multiple journals and publication platforms.
	- Each data source (the journals and publication platforms) either requires or recommends the sharing of code repositories related to a piece of work at the time of publication.
	- In turn, this allows us to mine each article for its required or recommended “data or code availability” links.
	- our data sources are:
		- PLOS: research articles
		- JOSS: software articles
		- SoftwareX: software articles
		- Papers with Code / arXiv: pre-prints

- Using each data source, we process the pairs of scientific articles and associated source code repositories, in order to extract the authorship and source code repository contributor lists as well as other bibliometric and repository information.
	- we use OpenAlex to extract bibliometric information
	- we use the GitHub API to extract repository information
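
As an illustration of the link-mining step, the sketch below pulls GitHub repository owner and name pairs out of a free-text availability statement. It is a minimal example under stated assumptions, not the pipeline used to build the dataset: the regular expression, the function name `extract_github_repos`, and the example statement are all hypothetical.

```python
import re

# Assumed pattern: matches https://github.com/<owner>/<repo> style links only
GITHUB_REPO_PATTERN = re.compile(
    r"https?://(?:www\.)?github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)",
    re.IGNORECASE,
)


def extract_github_repos(statement: str) -> list[tuple[str, str]]:
    """Return unique (owner, repo) pairs found in an availability statement."""
    repos: list[tuple[str, str]] = []
    for owner, name in GITHUB_REPO_PATTERN.findall(statement):
        # Drop sentence-ending periods captured by the pattern
        name = name.rstrip(".")
        # Drop a trailing ".git" from clone-style URLs
        if name.endswith(".git"):
            name = name[: -len(".git")]
        pair = (owner.lower(), name.lower())
        if pair not in repos:
            repos.append(pair)
    return repos


# Hypothetical availability statement (not from any real article)
statement = (
    "All code is available at https://github.com/example-lab/example-tool. "
    "Data are available from the authors on request."
)
repos = extract_github_repos(statement)
print(repos)
```

Normalizing to lowercase owner and repository names keeps the same repository from being counted twice when articles cite it with different capitalization.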

In [2]:
def read_table(table: str) -> pd.DataFrame:
    return pd.read_sql(text(f"SELECT * FROM {table}"), db_conn)


# Read all data from database
doc_repo_links = read_table(db_models.DocumentRepositoryLink.__tablename__)
researchers = read_table(db_models.Researcher.__tablename__)
devs = read_table(db_models.DeveloperAccount.__tablename__)
documents = read_table(db_models.Document.__tablename__)
document_contributors = read_table(db_models.DocumentContributor.__tablename__)
repositories = read_table(db_models.Repository.__tablename__)
repository_contributors = read_table(db_models.RepositoryContributor.__tablename__)
topics = read_table(db_models.Topic.__tablename__)
document_topics = read_table(db_models.DocumentTopic.__tablename__)
dataset_sources = read_table(db_models.DatasetSource.__tablename__)
researcher_dev_links = read_table(
    db_models.ResearcherDeveloperAccountLink.__tablename__
)

# Drop all "updated_datetime" and "created_datetime" columns
for df in [
    doc_repo_links,
    researchers,
    devs,
    documents,
    document_contributors,
    repositories,
    repository_contributors,
    topics,
    document_topics,
    dataset_sources,
    researcher_dev_links,
]:
    df.drop(columns=["updated_datetime", "created_datetime"], inplace=True)

# Specifically drop doc_repo_links "id" column
# It isn't used and will get in the way later when we do a lot of joins
doc_repo_links.drop(columns=["id"], inplace=True)

# Construct reduced doc_repo_links
original_doc_repo_links_len = len(doc_repo_links)
doc_repo_links = doc_repo_links.drop_duplicates(subset=["document_id"], keep=False)
doc_repo_links = doc_repo_links.drop_duplicates(subset=["repository_id"], keep=False)
print(
    "doc-repo-links that point at mult- docs or repos:",
    original_doc_repo_links_len - len(doc_repo_links),
)
print("these are currently ignored / dropped before analysis")

# Reduce other tables to only documents / repositories in the updated doc_repo_links
documents = documents[documents["id"].isin(doc_repo_links["document_id"])]
repositories = repositories[repositories["id"].isin(doc_repo_links["repository_id"])]
document_contributors = document_contributors[
    document_contributors["document_id"].isin(documents["id"])
]
repository_contributors = repository_contributors[
    repository_contributors["repository_id"].isin(repositories["id"])
]
document_topics = document_topics[document_topics["document_id"].isin(documents["id"])]

# Reduce researchers and devs to only those in the
# updated document_contributors and repository_contributors
researchers = researchers[
    researchers["id"].isin(document_contributors["researcher_id"])
]
devs = devs[devs["id"].isin(repository_contributors["developer_account_id"])]
researcher_dev_links = researcher_dev_links[
    (
        researcher_dev_links["researcher_id"].isin(researchers["id"])
        & researcher_dev_links["developer_account_id"].isin(devs["id"])
    )
]

# Sort document topics and keep first
document_topics = document_topics.sort_values("score", ascending=False)
document_topics = document_topics.drop_duplicates(subset=["document_id"], keep="first")

# Create document, document topic merged table
merged_document_topics = pd.merge(
    document_topics, topics, left_on="topic_id", right_on="id"
)

# Create basic merged tables
merged_document_contributor_doc_repo_links = pd.merge(
    document_contributors, doc_repo_links, left_on="document_id", right_on="document_id"
)
merged_repository_contributor_doc_repo_links = pd.merge(
    repository_contributors,
    doc_repo_links,
    left_on="repository_id",
    right_on="repository_id",
)

# Compute stats for data sources
data_source_stats = []
for _, data_source in dataset_sources.iterrows():
    # Get total article-repo pairs
    data_source_stats.append(
        {
            "data_source": data_source["name"],
            "n_article_repo_pairs": len(
                doc_repo_links[doc_repo_links["dataset_source_id"] == data_source["id"]]
            ),
            "n_authors": merged_document_contributor_doc_repo_links.loc[
                merged_document_contributor_doc_repo_links["dataset_source_id"]
                == data_source["id"]
            ]["researcher_id"].nunique(),
            "n_devs": merged_repository_contributor_doc_repo_links.loc[
                merged_repository_contributor_doc_repo_links["dataset_source_id"]
                == data_source["id"]
            ]["developer_account_id"].nunique(),
        }
    )

# Create topic merged tables
merged_doc_repo_links_topics = pd.merge(
    doc_repo_links, document_topics, left_on="document_id", right_on="document_id"
).merge(topics, left_on="topic_id", right_on="id")
merged_doc_repo_links_topics_document_contributors = pd.merge(
    merged_doc_repo_links_topics,
    document_contributors,
    left_on="document_id",
    right_on="document_id",
)
merged_doc_repo_links_topics_repository_contributors = pd.merge(
    merged_doc_repo_links_topics,
    repository_contributors,
    left_on="repository_id",
    right_on="repository_id",
)

# Compute stats for domains
domain_stats = []
for domain in merged_doc_repo_links_topics.domain_name.unique():
    # Get total article-repo pairs
    domain_stats.append(
        {
            "domain": domain,
            "n_article_repo_pairs": len(
                merged_doc_repo_links_topics[
                    merged_doc_repo_links_topics["domain_name"] == domain
                ]
            ),
            "n_authors": merged_doc_repo_links_topics_document_contributors.loc[
                merged_doc_repo_links_topics_document_contributors["domain_name"]
                == domain
            ]["researcher_id"].nunique(),
            "n_devs": merged_doc_repo_links_topics_repository_contributors.loc[
                merged_doc_repo_links_topics_repository_contributors["domain_name"]
                == domain
            ]["developer_account_id"].nunique(),
        }
    )

# Create document merged tables
merged_doc_repo_links_documents = pd.merge(
    doc_repo_links, documents, left_on="document_id", right_on="id"
)
merged_doc_repo_links_documents_document_contributors = pd.merge(
    merged_doc_repo_links_documents,
    document_contributors,
    left_on="document_id",
    right_on="document_id",
)
merged_doc_repo_links_documents_repository_contributors = pd.merge(
    merged_doc_repo_links_documents,
    repository_contributors,
    left_on="repository_id",
    right_on="repository_id",
)

# Compute stats for document types
# This isn't a standard data pull
# In short:
# - pairs from PLOS are "research articles"
# - pairs from JOSS are "software articles"
# - pairs from SoftwareX are "software articles"
# - pairs from Papers with Code / ArXiv are "pre-prints"
#   UNLESS they have been published in a journal
# All of those should be easy to assert / apply a label to with the exception
# of Papers with Code / ArXiv pre-prints that have been published in a journal
# In that case, we need to look at the existing document type in the database
# If the document type is "preprint" use preprint, otherwise, if it's anything else,
# use "research article"

# Create a "reduced_doc_types" dataframe with document_id and "reduced_doc_type"
# columns
reduced_doc_types_rows = []
# We can use the "reduced_doc_types" dataframe to calculate the stats

# Iter over data sources even though we are looking for doc types
for _, data_source in dataset_sources.iterrows():
    # Get total article-repo pairs
    doc_type = None
    if data_source["name"] in ["plos", "joss", "softwarex"]:
        if data_source["name"] == "plos":
            doc_type = "research article"
        else:
            doc_type = "software article"

        # Add all document_ids to reduced_doc_types_rows
        reduced_doc_types_rows.extend(
            [
                {"document_id": doc_id, "reduced_doc_type": doc_type}
                for doc_id in doc_repo_links[
                    (doc_repo_links["dataset_source_id"] == data_source["id"])
                ]["document_id"]
            ]
        )

    # Handle PwC
    else:
        # Get preprint pairs
        preprint_pairs = merged_doc_repo_links_documents[
            (merged_doc_repo_links_documents["dataset_source_id"] == data_source["id"])
            & (merged_doc_repo_links_documents["document_type"] == "preprint")
        ]

        # Add all document_ids to reduced_doc_types_rows
        reduced_doc_types_rows.extend(
            [
                {"document_id": doc_id, "reduced_doc_type": "preprint"}
                for doc_id in preprint_pairs["document_id"]
            ]
        )

        # Get research article pairs
        # This is the same just inverted to != "preprint"
        research_article_pairs = merged_doc_repo_links_documents[
            (merged_doc_repo_links_documents["dataset_source_id"] == data_source["id"])
            & (merged_doc_repo_links_documents["document_type"] != "preprint")
        ]

        # Add all document_ids to reduced_doc_types_rows
        reduced_doc_types_rows.extend(
            [
                {"document_id": doc_id, "reduced_doc_type": "research article"}
                for doc_id in research_article_pairs["document_id"]
            ]
        )

# Create reduced_doc_types dataframe
reduced_doc_types = pd.DataFrame(reduced_doc_types_rows)

# Now compute stats
doc_type_stats = reduced_doc_types.groupby("reduced_doc_type").apply(
    lambda x: {
        "doc_type": x.name,
        "n_article_repo_pairs": len(x),
        "n_authors": merged_doc_repo_links_documents_document_contributors.loc[
            merged_doc_repo_links_documents_document_contributors["document_id"].isin(
                x["document_id"]
            )
        ]["researcher_id"].nunique(),
        "n_devs": merged_doc_repo_links_documents_repository_contributors.loc[
            merged_doc_repo_links_documents_repository_contributors["document_id"].isin(
                x["document_id"]
            )
        ]["developer_account_id"].nunique(),
    },
    include_groups=False,
)

# Compute stats for access status
access_stats = []
for access_status_int, access_status_name in [
    (0, "Closed"),
    (1, "Open"),
]:
    # Get total article-repo pairs
    access_stats.append(
        {
            "access_status": access_status_name,
            "n_article_repo_pairs": len(
                merged_doc_repo_links_documents[
                    merged_doc_repo_links_documents["is_open_access"]
                    == access_status_int
                ]
            ),
            "n_authors": merged_doc_repo_links_documents_document_contributors.loc[
                merged_doc_repo_links_documents_document_contributors["is_open_access"]
                == access_status_int
            ]["researcher_id"].nunique(),
            "n_devs": merged_doc_repo_links_documents_repository_contributors.loc[
                merged_doc_repo_links_documents_repository_contributors[
                    "is_open_access"
                ]
                == access_status_int
            ]["developer_account_id"].nunique(),
        }
    )

# Compute totals
total_article_repo_pairs = len(doc_repo_links)
total_authors = merged_document_contributor_doc_repo_links["researcher_id"].nunique()
total_devs = merged_repository_contributor_doc_repo_links[
    "developer_account_id"
].nunique()

###############################################################################
# Construct HTML Table

# Construct multi-row span HTML table
# Columns should be: "n_article_repo_pairs", "n_authors", "n_devs"
# Rows should be:
# "By Data Source", "By Domain", "By Document Type", "By Access Status", and "Total"

# HTML templates
stats_piece_inital_row_template = """
<tr>
  <td rowspan="{n_rows}">{row_name}</td>
  <td>{value_name}</td>
  <td>{n_article_repo_pairs}</td>
  <td>{n_authors}</td>
  <td>{n_devs}</td>
</tr>
""".strip()

stats_piece_subsequent_row_template = """
<tr>
  <td>{value_name}</td>
  <td>{n_article_repo_pairs}</td>
  <td>{n_authors}</td>
  <td>{n_devs}</td>
</tr>
""".strip()

# Iter over stats portions (and total)
stats_portions_html = []
for stats_portion, stats_name, value_key in [
    (data_source_stats, "<b>By Data Source</b>", "data_source"),
    (domain_stats, "<b>By Domain</b>", "domain"),
    (doc_type_stats, "<b>By Document Type</b>", "doc_type"),
    (access_stats, "<b>By Access Status</b>", "access_status"),
    (
        [
            {
                "empty": "",
                "n_article_repo_pairs": f"<b>{total_article_repo_pairs}</b>",
                "n_authors": f"<b>{total_authors}</b>",
                "n_devs": f"<b>{total_devs}</b>",
            }
        ],
        "<b>Total</b>",
        "empty",
    ),
]:
    # Order by article-repo pairs
    stats_portion = sorted(
        stats_portion, key=lambda x: x["n_article_repo_pairs"], reverse=True
    )

    stats_portion_html = []
    for i, stats_piece in enumerate(stats_portion):
        if i == 0:
            stats_portion_html.append(
                stats_piece_inital_row_template.format(
                    n_rows=len(stats_portion),
                    row_name=stats_name,
                    value_name=stats_piece[value_key],
                    n_article_repo_pairs=stats_piece["n_article_repo_pairs"],
                    n_authors=stats_piece["n_authors"],
                    n_devs=stats_piece["n_devs"],
                )
            )
        else:
            stats_portion_html.append(
                stats_piece_subsequent_row_template.format(
                    value_name=stats_piece[value_key],
                    n_article_repo_pairs=stats_piece["n_article_repo_pairs"],
                    n_authors=stats_piece["n_authors"],
                    n_devs=stats_piece["n_devs"],
                )
            )

    stats_portions_html.append("\n".join(stats_portion_html))

# Concat and wrap in table
stats_table_html = f"""
<table>
  <tr>
    <th><b>Category</b></th>
    <th><b>Subset</b></th>
    <th><b># Article-Repository Pairs</b></th>
    <th><b># Authors</b></th>
    <th><b># Developers</b></th>
  </tr>
  {" ".join(stats_portions_html)}
</table>
""".strip()

doc-repo-links that point at mult- docs or repos: 21185
these are currently ignored / dropped before analysis


- Our final dataset contains the bibliometric and code repository information for more than one hundred thousand article-repository pairs from multiple article types and fields.
  - Specifically, our dataset contains `{python} total_article_repo_pairs` article-repository pairs, `{python} total_authors` distinct authors, and `{python} total_devs` distinct developer accounts.

In [3]:
# | label: tbl-rs-graph-overall-counts
# | tbl-cap: "Counts of Article-Repository Pairs, Authors, and Developers broken out by Data Sources, Domains, Document Types, and Access Status."
# | echo: false

IPython.display.HTML(stats_table_html)

| Category | Subset | # Article-Repository Pairs | # Authors | # Developers |
|---|---|---|---|---|
| By Data Source | pwc | 115437 | 241870 | 124528 |
| By Data Source | plos | 6101 | 30272 | 8811 |
| By Data Source | joss | 2354 | 7157 | 11417 |
| By Domain | Physical Sciences | 103996 | 221199 | 120370 |
| By Domain | Social Sciences | 7829 | 26132 | 12546 |
| By Domain | Life Sciences | 7155 | 29693 | 11417 |
| By Domain | Health Sciences | 4699 | 24132 | 6628 |
| By Document Type | preprint | 63410 | 154063 | 79195 |
| By Document Type | research article | 58128 | 163017 | 73948 |
| By Document Type | software article | 2354 | 7157 | 11417 |


## Manual Matching of Article Authors and Source Code Repository Contributors

- Before we can train and validate a predictive entity matching model, we must first create a large annotated dataset of article authors and source code repository contributor pairs.
	- describe the task (we have info about an author identity and a developer identity, are they the same identity)
	- add figure for more detail

- We had two annotators each label 3000 pairs of article author and source code repository contributor information.
	- we use the JOSS subset of our dataset (authors and repository contributors).
	- we use JOSS as we believe a software article sample will provide us with the highest rate of positive identity matches for training (or a somewhat balanced dataset)
	- we create author-developer-account annotation pairs using data from single paper-repository pairs.
	- that is, an author and a developer account were only paired for annotation if they came from the same paper-repository pair; we never annotate an author-developer-account pair that combines developer information with an author from an unrelated paper
	- After each annotator completed labeling all 3000 author-code-contributor pairs, the annotators resolved any differences between their labels.
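
The pairing rule above can be sketched as a per-pair cross product: candidates are generated only within a single paper-repository pair, never across pairs. This is a minimal illustration with made-up records; the field names and the helper `build_annotation_pairs` are assumptions rather than the actual annotation tooling.

```python
from itertools import product

# Hypothetical paper-repository pairs (toy data, not from the dataset)
article_repo_pairs = [
    {
        "document_id": 1,
        "authors": ["A. Author", "B. Writer"],
        "developers": ["aauthor", "octo-dev"],
    },
    {
        "document_id": 2,
        "authors": ["C. Coder"],
        "developers": ["ccoder"],
    },
]


def build_annotation_pairs(pairs: list[dict]) -> list[dict]:
    """Create author-developer candidates within each paper-repository pair."""
    rows = []
    for pair in pairs:
        # Cross product is restricted to the authors and developers of ONE pair
        for author, developer in product(pair["authors"], pair["developers"]):
            rows.append(
                {
                    "document_id": pair["document_id"],
                    "author": author,
                    "developer": developer,
                }
            )
    return rows


annotation_pairs = build_annotation_pairs(article_repo_pairs)
print(len(annotation_pairs), "candidate pairs")
```

Restricting candidates this way keeps the negative examples realistic: every non-match is a person who plausibly could have been the developer, because both appear on the same project.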

- Our final annotated dataset used for model training consists of the author names and source code repository contributor information from the 3000 labeled author-code-contributor pairs.
	- basic numbers, number of “positive” and “negative” matches
	- note however that some developer accounts do not have a complete set of information available
	- table of number of developer accounts with each feature and by their match

In [4]:
# Load annotated dataset
annotated_dataset = load_annotated_dev_author_em_dataset()

annotated_dataset

Fetching dev-author-em model data...


| | github_id | semantic_scholar_id | dev_details | author_details | match |
|---|---|---|---|---|---|
| 0 | JonasGe | 48985590 | username: JonasGe;\nname: Jonas Geuens;\nemail... | name: J. Geuens;\nrepos: https://github.com/On... | True |
| 1 | lindonroberts | 2671934 | username: lindonroberts;\nname: Lindon Roberts... | name: Á. Bürmen;\nrepos: https://github.com/jf... | False |
| 2 | retdop | 3278559 | username: retdop;\nname: Gabriel Bastard;\nema... | name: David Eargle;\nrepos: https://github.com... | False |
| 3 | benjaminpope | 2086347474 | username: benjaminpope;\nname: Benjamin Pope;\... | name: Jordan Dennis;\nrepos: https://github.co... | False |
| 4 | zachmayer | 144385402 | username: zachmayer;\nname: Zach Deane-Mayer;\... | name: Yuan Tang;\nrepos: https://github.com/te... | False |
| ... | ... | ... | ... | ... | ... |
| 2994 | WilliamZekaiWang | 153846264 | username: WilliamZekaiWang;\nname: None;\nemai... | name: Mathias S. Renaud;\nrepos: https://githu... | False |
| 2995 | avalentino | 51907604 | username: avalentino;\nname: Antonio Valentino... | name: L. Uieda;\nrepos: https://github.com/fat... | False |
| 2996 | prakharb10 | 71208381 | username: prakharb10;\nname: Prakhar Bhatnagar... | name: Matthew Treinish;\nrepos: https://github... | False |
| 2997 | jmsexton03 | 2108239862 | username: jmsexton03;\nname: Jean M. Sexton;\n... | name: Weiqun Zhang;\nrepos: https://github.com... | False |


# A Predictive Model for Matching Article Authors and Source Code Contributors

- To optimize our predictive model for author-contributor matching, we evaluate a variety of Transformer-based architectures and input features.
	- multiple transformer base models available and there isn’t clear information as to which is “best” for entity matching
	- we have minimal info for authors, just their name, but we have a few features for developer accounts and it isn’t clear which are most important or useful
	- explain potential problems and benefits of certain features

- To ensure that our trained model is as accurate as possible, we trained and evaluated multiple combinations of pre-trained Transformer base models and different developer account information feature sets.
	- explain the feature sets a bit more (username only, username + name, etc.)
	- explain the testing strategy (10% of unique authors and developers are used for testing)
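
One way to read the testing strategy above is as an entity-level holdout: a fraction of unique authors and unique developer accounts is reserved, and every pair that involves a reserved entity goes to the test set, so no held-out identity is seen during training. The sketch below shows that idea on synthetic data. The exact procedure used for the model is not specified in this outline, so the function `split_by_entity`, the field names, and the rounding rule are assumptions.

```python
import random


def split_by_entity(pairs: list[dict], test_fraction: float = 0.1, seed: int = 0):
    """Hold out a fraction of unique authors and developers for testing."""
    rng = random.Random(seed)
    authors = sorted({p["author_id"] for p in pairs})
    devs = sorted({p["dev_id"] for p in pairs})

    # Reserve at least one entity of each kind
    n_test_authors = max(1, round(len(authors) * test_fraction))
    n_test_devs = max(1, round(len(devs) * test_fraction))
    held_out_authors = set(rng.sample(authors, n_test_authors))
    held_out_devs = set(rng.sample(devs, n_test_devs))

    train, test = [], []
    for p in pairs:
        # Any pair touching a held-out entity is a test pair
        if p["author_id"] in held_out_authors or p["dev_id"] in held_out_devs:
            test.append(p)
        else:
            train.append(p)
    return train, test, held_out_authors, held_out_devs


# Synthetic pairs (toy data): 20 authors, 30 developer accounts, 120 pairs
pairs = [
    {"author_id": f"a{i % 20}", "dev_id": f"d{i % 30}", "match": i % 7 == 0}
    for i in range(120)
]
train_pairs, test_pairs, held_out_authors, held_out_devs = split_by_entity(pairs)
print(len(train_pairs), "train pairs,", len(test_pairs), "test pairs")
```

Note that the share of pairs landing in the test set is usually larger than `test_fraction`, because a pair is held out if either its author or its developer account is reserved.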

- After testing all base-model and feature set combinations, we find that our best performing model is fine-tuned from: Y and uses Z features.
	- specifics of best model
	- table of model configurations and results
	- minor observations about feature sets that perform poorly

- Finally, we make our best performing model publicly available for reuse.
	- We provide a structured Python library for interaction with the model at link
	- Direct access to the model files can be found on Hugging Face.

# Preliminary Analysis of Code Contributor Authorship and Development Dynamics of Research Teams

In [5]:
# We want to create a dataset of author h-index and i10-index for each author
# We want additional columns of author position in authorship list and whether they are a coding author
# Further, we want to record the data source, the domain, the doc type, and the access status
# There will be multiple rows for each author (if they have multiple pubs in the stored data)
# So really we should iter over the doc-repo pairs and get all the info we need as we go

# Get JOSS and PLOS subset
joss_plos_subset_doc_repo_links = doc_repo_links.loc[
    doc_repo_links["dataset_source_id"].isin(
        dataset_sources[dataset_sources["name"].isin(["joss", "plos"])].id
    )
]


def _process_doc_repo_link(doc_repo_link: pd.Series) -> list[dict]:
    # Get document
    document = documents.loc[documents["id"] == doc_repo_link["document_id"]].iloc[0]

    # Get document contributors
    document_contributors_subset = document_contributors.loc[
        document_contributors["document_id"] == document["id"]
    ]

    # Get repo contributors
    repository_contributors_subset = repository_contributors.loc[
        repository_contributors["repository_id"] == doc_repo_link["repository_id"]
    ]

    # Iter over each document contributor
    this_doc_contributor_rows = []
    for _, document_contributor in document_contributors_subset.iterrows():
        # Get author
        author = researchers.loc[
            researchers["id"] == document_contributor["researcher_id"]
        ].iloc[0]

        # Try and see if this author contributed code in this paper
        # We can do this by checking if the author has a developer account
        # then if that developer account is tied to the repo linked to this document

        # Start false, only switch if we find a match
        author_was_coding_contributor = False
        if len(repository_contributors_subset) > 0:
            # Find any developer account linked to this author
            author_dev_link = researcher_dev_links.loc[
                researcher_dev_links["researcher_id"] == author["id"]
            ]
            if len(author_dev_link) > 0:
                # Iter over each checking for dev on repo
                for _, dev_link in author_dev_link.iterrows():
                    # Get dev
                    dev = devs.loc[devs["id"] == dev_link["developer_account_id"]].iloc[
                        0
                    ]

                    # Check if dev is in repo contributors
                    if (
                        len(
                            repository_contributors_subset.loc[
                                repository_contributors_subset["developer_account_id"]
                                == dev["id"]
                            ]
                        )
                        > 0
                    ):
                        author_was_coding_contributor = True
                        break

        # Get author position
        author_position = document_contributor["position"]

        # Get author is corresponding author
        author_is_corresponding = document_contributor["is_corresponding"]

        # Get author h-index and i10-index
        author_h_index = author["h_index"]
        author_i10_index = author["i10_index"]

        # Get document type
        document_type = reduced_doc_types.loc[
            reduced_doc_types["document_id"] == document["id"]
        ].iloc[0]["reduced_doc_type"]

        # Get access status
        access_status = document["is_open_access"]

        # Get domain
        domain = merged_document_topics.loc[
            merged_document_topics["document_id"] == document["id"]
        ]
        if len(domain) > 0:
            domain = domain.iloc[0]["domain_name"]
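        else:
            # No topic (and therefore no domain) for this document:
            # skip the whole document, including rows already collected
            return None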

        # Get data source
        data_source = dataset_sources.loc[
            dataset_sources["id"] == doc_repo_link["dataset_source_id"]
        ].iloc[0]["name"]

        # Append row
        this_doc_contributor_rows.append(
            {
                "document_id": document["id"],
                "author_id": author["id"],
                "author_total_works": author["works_count"],
                "author_h_index": author_h_index,
                "author_i10_index": author_i10_index,
                "author_total_citations": author["cited_by_count"],
                "author_two_year_mean_citedness": author["two_year_mean_citedness"],
                "author_position": author_position,
                "author_was_coding_contributor": author_was_coding_contributor,
                "document_type": document_type,
                "is_open_access": access_status,
                "domain": domain,
                "data_source": data_source,
                "author_is_corresponding": author_is_corresponding,
            }
        )

    return this_doc_contributor_rows


# Process each doc-repo link
doc_contributor_lists_of_rows = joss_plos_subset_doc_repo_links.swifter.apply(
    _process_doc_repo_link, axis=1
)

# Create dataframe
doc_contributor_rows = [
    row
    for sublist in doc_contributor_lists_of_rows
    if sublist is not None
    for row in sublist
]
doc_contributor_df = pd.DataFrame(doc_contributor_rows)

# Create dummies for author_position, document_type, domain, and data_source
doc_contributor_df_with_dummies = pd.get_dummies(
    doc_contributor_df,
    columns=["author_position", "document_type", "domain", "data_source"],
)

# Build the design matrix: drop identifiers and the four outcome variables
X = doc_contributor_df_with_dummies.drop(
    columns=[
        "author_id",
        "document_id",
        "author_h_index",
        "author_i10_index",
        "author_total_citations",
        "author_two_year_mean_citedness",
    ],
)

# Add constant
X = sm.add_constant(X)
X = X.astype(int)

# Fit model
y = doc_contributor_df_with_dummies["author_h_index"]
model = sm.GLM(y, X, family=sm.families.Poisson()).fit(maxiter=100)
model.summary()

Dep. Variable:,author_h_index,No. Observations:,42631.0
Model:,GLM,Df Residuals:,42620.0
Model Family:,Poisson,Df Model:,10.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-472430.0
Date:,"Tue, 29 Oct 2024",Deviance:,762270.0
Time:,10:52:26,Pearson chi2:,901000.0
No. Iterations:,10,Pseudo R-squ. (CS):,0.9728
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,0.2641,0.387,0.682,0.495,-0.495,1.023
author_total_works,8.669e-05,3.41e-07,253.991,0.000,8.6e-05,8.74e-05
author_was_coding_contributor,-0.3126,0.003,-97.354,0.000,-0.319,-0.306
is_open_access,2.3792,1.000,2.379,0.017,0.419,4.339
author_is_corresponding,0.0299,0.002,12.330,0.000,0.025,0.035
author_position_first,-0.3784,0.129,-2.932,0.003,-0.631,-0.126
author_position_last,0.5625,0.129,4.359,0.000,0.310,0.815
author_position_middle,0.0801,0.129,0.621,0.535,-0.173,0.333
document_type_research article,0.1411,0.194,0.729,0.466,-0.238,0.520
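
Because the model uses a log link, each coefficient is easier to read once exponentiated into a multiplicative rate ratio. The sketch below copies three coefficients from the h-index summary above and converts them. It is only an aid for reading the table, not a re-fit of the model, and the interpretation holds only as far as the Poisson assumptions do.

```python
import math

# Coefficients copied from the author_h_index summary table (log link).
coefs = {
    "author_was_coding_contributor": -0.3126,
    "author_is_corresponding": 0.0299,
    "author_position_last": 0.5625,
}

# exp(coef) is the multiplicative change in the expected outcome for a
# one-unit increase in the covariate, holding the other covariates fixed.
rate_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in rate_ratios.items():
    print(f"{name}: rate ratio = {ratio:.3f} ({(ratio - 1) * 100:+.1f}%)")
```

Read this way, the coding-contributor coefficient corresponds to an expected h-index roughly 27% lower than that of non-coding authors, conditional on the other terms in the model.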


In [6]:
# Fit model
y = doc_contributor_df_with_dummies["author_i10_index"]
model = sm.GLM(y, X, family=sm.families.Poisson()).fit(maxiter=100)
model.summary()

Dep. Variable:,author_i10_index,No. Observations:,42631.0
Model:,GLM,Df Residuals:,42620.0
Model Family:,Poisson,Df Model:,10.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-1732100.0
Date:,"Tue, 29 Oct 2024",Deviance:,3276700.0
Time:,10:52:26,Pearson chi2:,5380000.0
No. Iterations:,23,Pseudo R-squ. (CS):,1.0
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,-7.2048,4566.740,-0.002,0.999,-8957.851,8943.441
author_total_works,0.0001,9.3e-08,1384.505,0.000,0.000,0.000
author_was_coding_contributor,-0.5537,0.002,-240.867,0.000,-0.558,-0.549
is_open_access,22.4387,1.18e+04,0.002,0.998,-2.31e+04,2.31e+04
author_is_corresponding,0.0369,0.002,22.925,0.000,0.034,0.040
author_position_first,-3.0820,1522.247,-0.002,0.998,-2986.631,2980.467
author_position_last,-1.7769,1522.247,-0.001,0.999,-2985.326,2981.772
author_position_middle,-2.3459,1522.247,-0.002,0.999,-2985.895,2981.203
document_type_research article,-3.5885,2283.370,-0.002,0.999,-4478.912,4471.735


In [7]:
# Fit model
y = doc_contributor_df_with_dummies["author_total_citations"]
model = sm.GLM(y, X, family=sm.families.Poisson()).fit(maxiter=100)
model.summary()

Dep. Variable:,author_total_citations,No. Observations:,42631.0
Model:,GLM,Df Residuals:,42620.0
Model Family:,Poisson,Df Model:,10.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-281320000.0
Date:,"Tue, 29 Oct 2024",Deviance:,562290000.0
Time:,10:52:27,Pearson chi2:,1470000000.0
No. Iterations:,12,Pseudo R-squ. (CS):,1.0
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,0.7060,0.274,2.579,0.010,0.170,1.242
author_total_works,0.0001,1.09e-08,1.09e+04,0.000,0.000,0.000
author_was_coding_contributor,-0.4847,0.000,-2229.151,0.000,-0.485,-0.484
is_open_access,6.6829,0.707,9.451,0.000,5.297,8.069
author_is_corresponding,-0.0052,0.000,-34.339,0.000,-0.006,-0.005
author_position_first,-0.5688,0.091,-6.234,0.000,-0.748,-0.390
author_position_last,0.9501,0.091,10.413,0.000,0.771,1.129
author_position_middle,0.3247,0.091,3.559,0.000,0.146,0.504
document_type_research article,0.3583,0.137,2.618,0.009,0.090,0.627


In [8]:
# Fit model
y = doc_contributor_df_with_dummies["author_two_year_mean_citedness"]
model = sm.GLM(y, X, family=sm.families.Poisson()).fit(maxiter=100)
model.summary()

Dep. Variable:,author_two_year_mean_citedness,No. Observations:,42631.0
Model:,GLM,Df Residuals:,42620.0
Model Family:,Poisson,Df Model:,10.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-221900.0
Date:,"Tue, 29 Oct 2024",Deviance:,331780.0
Time:,10:52:27,Pearson chi2:,1380000.0
No. Iterations:,21,Pseudo R-squ. (CS):,0.08947
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,-7.8084,5530.332,-0.001,0.999,-1.08e+04,1.08e+04
author_total_works,2.783e-05,2.15e-06,12.960,0.000,2.36e-05,3.2e-05
author_was_coding_contributor,0.0682,0.006,10.713,0.000,0.056,0.081
is_open_access,21.6228,1.43e+04,0.002,0.999,-2.8e+04,2.8e+04
author_is_corresponding,0.0062,0.005,1.190,0.234,-0.004,0.016
author_position_first,-2.6871,1843.444,-0.001,0.999,-3615.771,3610.397
author_position_last,-2.6251,1843.444,-0.001,0.999,-3615.709,3610.459
author_position_middle,-2.4962,1843.444,-0.001,0.999,-3615.580,3610.588
document_type_research article,-3.8990,2765.166,-0.001,0.999,-5423.525,5415.727


In [9]:
# Create a paper-level model of paper outcomes.
# That is, we want to predict paper metrics
# (cited_by_count, fwci, cited_by_percentile_midpoint)
# from the data source, domain, document type, and access status,
# plus the number of authors, the number of developers,
# the number of author-developers, and the total team size.
# The "percent of total team size" variants are kept below but commented out;
# the model currently uses raw counts.

from datetime import datetime
from typing import Optional

# Total team size is the number of authors plus the number of non-author developers


def _process_doc_repo_link(doc_repo_link: pd.Series) -> Optional[dict]:
    # Returns None when the document has no domain (see below)
    # Get document
    document = documents.loc[documents["id"] == doc_repo_link["document_id"]].iloc[0]

    # Get document contributors
    document_contributors_subset = document_contributors.loc[
        document_contributors["document_id"] == document["id"]
    ]

    # Get repo contributors
    repository_contributors_subset = repository_contributors.loc[
        repository_contributors["repository_id"] == doc_repo_link["repository_id"]
    ]

    # Get links between researchers and their developer accounts
    researcher_dev_links_subset = researcher_dev_links.loc[
        (
            researcher_dev_links["researcher_id"].isin(
                document_contributors_subset["researcher_id"]
            )
        )
    ]

    # Find non-author developers
    non_author_devs_subset = repository_contributors_subset.loc[
        ~repository_contributors_subset["developer_account_id"].isin(
            researcher_dev_links_subset["developer_account_id"]
        )
    ]

    # Find author-dev subset
    author_devs_subset = repository_contributors_subset.loc[
        repository_contributors_subset["developer_account_id"].isin(
            researcher_dev_links_subset["developer_account_id"]
        )
    ]

    # Get total team size
    total_team_size = len(document_contributors_subset) + len(non_author_devs_subset)

    # Get number of authors
    n_authors = len(document_contributors_subset)

    # Get number of developers
    n_devs = len(repository_contributors_subset)

    # Get number of non-author developers
    n_non_author_devs = len(non_author_devs_subset)

    # Get number of author-developers
    n_author_devs = len(author_devs_subset)

    # Get document type
    document_type = reduced_doc_types.loc[
        reduced_doc_types["document_id"] == document["id"]
    ].iloc[0]["reduced_doc_type"]

    # Get access status
    access_status = document["is_open_access"]

    # Get domain
    domain = merged_document_topics.loc[
        merged_document_topics["document_id"] == document["id"]
    ]
    if len(domain) > 0:
        domain = domain.iloc[0]["domain_name"]
    else:
        return None

    # Get data source
    data_source = dataset_sources.loc[
        dataset_sources["id"] == doc_repo_link["dataset_source_id"]
    ].iloc[0]["name"]

    # Get duration since publication
    duration_since_publication = (
        datetime.now() - datetime.fromisoformat(document["publication_date"])
    ).days / 365

    # Return all metrics
    return {
        "document_id": document["id"],
        # "authors_pct_of_team_size": n_authors / total_team_size,
        # "devs_pct_of_team_size": n_devs / total_team_size,
        # "non_author_devs_pct_of_team_size": n_non_author_devs / total_team_size,
        # "author_devs_pct_of_team_size": n_author_devs / total_team_size,
        "n_authors": n_authors,
        "n_devs": n_devs,
        "n_non_author_devs": n_non_author_devs,
        "n_author_devs": n_author_devs,
        "total_team_size": total_team_size,
        "document_type": document_type,
        "is_open_access": access_status,
        "domain": domain,
        "data_source": data_source,
        "duration_since_publication_years": duration_since_publication,
        "total_citations": document["cited_by_count"],
        "fwci": document["fwci"],
        # "cited_by_percentile_year_midpoint": (
        #     document["cited_by_percentile_year_min"] + document["cited_by_percentile_year_max"]
        # ) / 2,
    }

# Apply on JOSS PLOS subset
doc_repo_link_metrics_rows = joss_plos_subset_doc_repo_links.swifter.apply(
    _process_doc_repo_link, axis=1
)

# Create dataframe, skipping links that returned None (no domain found)
doc_repo_link_metrics_rows = [
    row for row in doc_repo_link_metrics_rows if row is not None
]
doc_repo_link_metrics_df = pd.DataFrame(doc_repo_link_metrics_rows).dropna()
doc_repo_link_metrics_df

,document_id,n_authors,n_devs,n_non_author_devs,n_author_devs,total_team_size,document_type,is_open_access,domain,data_source,duration_since_publication_years,total_citations,fwci
102,5571,5,5,1,4,6,software article,1,Physical Sciences,joss,0.268493,1,0.000
107,5576,3,6,3,3,6,software article,1,Physical Sciences,joss,0.268493,0,0.000
109,5578,5,2,0,2,5,software article,1,Physical Sciences,joss,0.260274,1,0.000
110,5579,1,1,0,1,1,software article,1,Physical Sciences,joss,0.263014,0,0.000
113,5583,4,5,3,2,7,software article,1,Social Sciences,joss,0.268493,0,0.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8439,45641,3,1,0,1,3,research article,1,Life Sciences,plos,1.547945,1,1.092
8440,45642,4,1,0,1,4,research article,1,Physical Sciences,plos,2.295890,8,1.089
8441,45643,11,1,0,1,11,research article,1,Life Sciences,plos,0.687671,0,0.000
8442,45644,6,1,0,1,6,research article,1,Life Sciences,plos,4.172603,35,4.733
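
The team-composition counts in the cell above reduce to set operations on developer account IDs. The sketch below uses invented account names and an invented author count to show how the repository's contributors are partitioned into author-developers and non-author developers, and how team size follows from that partition.

```python
# Hypothetical inputs for a single document-repository link.
repo_dev_accounts = {"dev_a", "dev_b", "dev_c"}  # all repository contributors
author_linked_accounts = {"dev_a", "dev_z"}  # accounts matched to any author
n_authors = 4  # authors listed on the document

# Contributors whose account is linked to an author of this document.
author_devs = repo_dev_accounts & author_linked_accounts

# Contributors with no link to any author of this document.
non_author_devs = repo_dev_accounts - author_linked_accounts

# Team size counts every author once, plus developers who are not authors.
total_team_size = n_authors + len(non_author_devs)

print(len(author_devs), len(non_author_devs), total_team_size)
```

The two developer groups always sum to the number of repository contributors, which matters later when these counts are used together as regressors.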


In [10]:
# Create dataframe
doc_repo_link_metrics_df = pd.DataFrame(
    [row for row in doc_repo_link_metrics_rows if row is not None]
).dropna()

# Keep only documents with more than one citation and more than one team member
doc_repo_link_metrics_df = doc_repo_link_metrics_df.loc[
    (doc_repo_link_metrics_df["total_citations"] > 1)
    & (doc_repo_link_metrics_df["total_team_size"] > 1)
]
doc_repo_link_metrics_df

,document_id,n_authors,n_devs,n_non_author_devs,n_author_devs,total_team_size,document_type,is_open_access,domain,data_source,duration_since_publication_years,total_citations,fwci
171,5647,27,30,13,17,40,software article,1,Health Sciences,joss,0.323288,5,3.247
179,5655,9,5,1,4,10,software article,1,Physical Sciences,joss,0.391781,3,2.176
199,5676,4,5,2,3,6,software article,1,Social Sciences,joss,0.424658,2,5.687
220,5697,5,4,2,2,7,software article,1,Physical Sciences,joss,0.432877,4,0.000
227,5706,4,6,3,3,7,software article,1,Physical Sciences,joss,0.430137,5,2.834
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8436,45638,7,1,0,1,7,research article,1,Physical Sciences,plos,1.800000,7,2.072
8438,45640,2,1,0,1,2,research article,1,Life Sciences,plos,1.191781,2,0.768
8440,45642,4,1,0,1,4,research article,1,Physical Sciences,plos,2.295890,8,1.089
8442,45644,6,1,0,1,6,research article,1,Life Sciences,plos,4.172603,35,4.733


In [12]:
# Create dummies for document_type, domain, and data_source
doc_repo_link_metrics_df_with_dummies = pd.get_dummies(
    doc_repo_link_metrics_df,
    columns=["document_type", "domain", "data_source"],
)

# Build the design matrix: drop the identifier and the outcome variables
X = doc_repo_link_metrics_df_with_dummies.drop(
    columns=[
        "document_id",
        "total_citations",
        "fwci",
        # "cited_by_percentile_year_midpoint",
    ],
)

# Add constant
X = sm.add_constant(X)
X = X.astype(float)

# Fit model
y = doc_repo_link_metrics_df_with_dummies["total_citations"]
model = sm.GLM(y, X, family=sm.families.Poisson()).fit(maxiter=100)
model.summary()

Dep. Variable:,total_citations,No. Observations:,5902.0
Model:,GLM,Df Residuals:,5893.0
Model Family:,Poisson,Df Model:,8.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-209680.0
Date:,"Tue, 29 Oct 2024",Deviance:,393800.0
Time:,10:53:04,Pearson chi2:,1260000.0
No. Iterations:,100,Pseudo R-squ. (CS):,1.0
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
n_authors,-0.0027,0.000,-17.770,0.000,-0.003,-0.002
n_devs,0.0465,0.000,128.453,0.000,0.046,0.047
n_non_author_devs,0.0219,0.000,80.775,0.000,0.021,0.022
n_author_devs,0.0246,0.001,41.929,0.000,0.023,0.026
total_team_size,0.0192,0.000,104.435,0.000,0.019,0.020
is_open_access,0.8180,0.003,279.805,0.000,0.812,0.824
duration_since_publication_years,0.2385,0.001,241.265,0.000,0.237,0.240
document_type_research article,0.4458,0.002,238.279,0.000,0.442,0.450
document_type_software article,0.3722,0.002,150.016,0.000,0.367,0.377


In [13]:
# Fit model
y = doc_repo_link_metrics_df_with_dummies["fwci"]
model = sm.GLM(y, X, family=sm.families.Poisson()).fit(maxiter=100)
model.summary()

Dep. Variable:,fwci,No. Observations:,5902.0
Model:,GLM,Df Residuals:,5893.0
Model Family:,Poisson,Df Model:,8.0
Link Function:,Log,Scale:,1.0
Method:,IRLS,Log-Likelihood:,-29144.0
Date:,"Tue, 29 Oct 2024",Deviance:,43867.0
Time:,10:53:05,Pearson chi2:,129000.0
No. Iterations:,100,Pseudo R-squ. (CS):,0.9621
Covariance Type:,nonrobust,,

,coef,std err,z,P>|z|,[0.025,0.975]
n_authors,0.0007,0.000,1.613,0.107,-0.000,0.001
n_devs,0.0547,0.001,59.256,0.000,0.053,0.057
n_non_author_devs,0.0153,0.001,21.059,0.000,0.014,0.017
n_author_devs,0.0394,0.001,26.477,0.000,0.036,0.042
total_team_size,0.0160,0.000,33.228,0.000,0.015,0.017
is_open_access,0.2390,0.008,30.747,0.000,0.224,0.254
duration_since_publication_years,0.0704,0.003,23.749,0.000,0.065,0.076
document_type_research article,0.2011,0.005,41.705,0.000,0.192,0.211
document_type_software article,0.0380,0.007,5.194,0.000,0.024,0.052
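
Both paper-level fits stop at `No. Iterations: 100`, which is the `maxiter` limit, so it may be worth checking the design matrix for exact linear dependence. By construction, `n_devs` equals `n_non_author_devs + n_author_devs`, and `total_team_size` equals `n_authors + n_non_author_devs`, so these five columns span only three dimensions. The sketch below demonstrates this with a rank check on toy rows (invented values that follow the same construction).

```python
import numpy as np
import pandas as pd

# Toy counts; only the three base columns are chosen freely.
toy = pd.DataFrame(
    {
        "n_authors": [5, 3, 5, 4, 9, 2],
        "n_non_author_devs": [1, 3, 0, 3, 1, 0],
        "n_author_devs": [4, 3, 2, 2, 4, 1],
    }
)

# Derived columns, built the same way as in the processing function.
toy["n_devs"] = toy["n_non_author_devs"] + toy["n_author_devs"]
toy["total_team_size"] = toy["n_authors"] + toy["n_non_author_devs"]

# A rank lower than the column count means some columns are redundant.
rank = np.linalg.matrix_rank(toy.to_numpy(dtype=float))
print(f"{toy.shape[1]} columns, rank {rank}")

# Keeping only the base columns restores full column rank.
reduced = toy[["n_authors", "n_non_author_devs", "n_author_devs"]]
reduced_rank = np.linalg.matrix_rank(reduced.to_numpy(dtype=float))
```

With redundant columns the individual coefficients are not uniquely identified, so one option is to keep a non-redundant subset of the counts. Passing `drop_first=True` to `pd.get_dummies` would handle the analogous redundancy between a full set of dummy columns and the constant.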


- To enrich our pre-existing dataset, we apply our trained predictive model to pairs of authors and developer accounts.
	- again, these pairs are all combinations of authors and developer accounts within an individual paper
	- specifics: how many unique author-developer account pairs are we able to find?
	- table of author-developer account pairs by data source and by field
	- we next use this enriched dataset to understand software development dynamics within research teams and to characterize the authors who are and who aren’t code contributors.
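
As a minimal sketch of the pairing step described above, the candidate pairs for each paper can be formed by joining authors and developer accounts on a shared document key. The frame and column names here (`author_name`, `dev_username`) and all values are hypothetical, and the sketch assumes developer accounts have already been attached to documents through the document-repository link.

```python
import pandas as pd

# Hypothetical authors per document.
authors = pd.DataFrame(
    {
        "document_id": [1, 1, 2],
        "author_name": ["A. Author", "B. Author", "C. Author"],
    }
)

# Hypothetical developer accounts per document (via its linked repository).
dev_accounts = pd.DataFrame(
    {
        "document_id": [1, 1, 2],
        "dev_username": ["aauthor", "octo-dev", "c-author"],
    }
)

# Joining on document_id yields every author x developer-account
# combination within a paper, and never pairs across papers.
candidate_pairs = authors.merge(dev_accounts, on="document_id", how="inner")

# Number of candidate pairs each paper contributes.
pairs_per_document = candidate_pairs.groupby("document_id").size()
print(pairs_per_document)
```

Each row of `candidate_pairs` is then one input to the matching model, which predicts whether the author and the account belong to the same person.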

## Software Development Dynamics Within Research Teams

- We begin by measuring the distributions of coding and non-coding contributors across all of the article-code-repository pairs within our dataset.
	- explain more: what are the different types of contributors? (coding contributor, coding-with-authorship contributor, non-coding author, etc.)
	- what are the basics / what do we see across the board? What are the distributions of each of these contributor types?
	- compare against analysis built on CRediT statements?

- Next, we investigate whether these distributions change over time or by “research team size”.
	- define research team size: in our case, this is the total number of author-developers + non-coding authors + non-credited developers
	- plot the medians of the contributor type distributions over time (by publication year)
	- create subplots for different bins of research team size (e.g., <= 3 members, > 3 and <= 5, > 5 and <= 10, > 10) and show the distributions again
	- results in summary
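
The team-size bins listed above could be produced with `pd.cut`. This is a minimal sketch on invented team sizes, and the bin labels are placeholders.

```python
import pandas as pd

# Invented team sizes covering every bin edge.
team_sizes = pd.Series([1, 3, 4, 5, 6, 10, 11, 40], name="total_team_size")

# Right-closed intervals: (0, 3], (3, 5], (5, 10], (10, inf).
bins = [0, 3, 5, 10, float("inf")]
labels = ["<=3", "4-5", "6-10", ">10"]

team_size_bin = pd.cut(team_sizes, bins=bins, labels=labels, right=True)
print(team_size_bin.value_counts(sort=False))
```

Using right-closed intervals keeps a team of exactly 3, 5, or 10 members in the lower bin, which matches the `<=` boundaries in the outline.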

- We further investigate how these distributions are affected by article type and research domain.
	- refresher on article type (research articles, software articles, and pre-prints)
	- explain research domains
	- subplots of both
	- results in summary

## Characteristics of Scientific Code Contributors

- Next, we investigate the differences between coding and non-coding article authors.
	- specifics: author position in the authorship list is a commonly used signal in scientometrics
	- similarly, metrics of “scientific impact” such as h-index, i10-index, and two-year mean citedness are also available to us
	- plot / table of the distributions for coding and non-coding authors
	- ANOVA / chi-squared tests to see whether these differences are significant
	- results in summary
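
A minimal sketch of the two tests named above, run on invented author-level rows: a chi-squared test of independence between author position and coding status, and a one-way ANOVA comparing h-index between coding and non-coding authors. The column names follow the author-level dataframe built earlier, while the values are made up. The choice of ANOVA over a rank-based alternative is an assumption that should be revisited given how skewed citation metrics tend to be.

```python
import pandas as pd
from scipy import stats

# Toy author-level rows; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "author_position": ["first"] * 4 + ["middle"] * 4 + ["last"] * 4,
        "author_was_coding_contributor": [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1],
        "author_h_index": [4, 6, 3, 9, 12, 15, 7, 20, 35, 41, 28, 30],
    }
)

# Chi-squared test: is coding status independent of author position?
contingency = pd.crosstab(
    toy["author_position"], toy["author_was_coding_contributor"]
)
chi2, chi2_p, dof, _expected = stats.chi2_contingency(contingency)

# One-way ANOVA: does mean h-index differ between the two groups?
coding = toy.loc[toy["author_was_coding_contributor"] == 1, "author_h_index"]
non_coding = toy.loc[toy["author_was_coding_contributor"] == 0, "author_h_index"]
f_stat, anova_p = stats.f_oneway(coding, non_coding)

print(f"chi2={chi2:.3f}, dof={dof}, p={chi2_p:.3f}")
print(f"F={f_stat:.3f}, p={anova_p:.3f}")
```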

- As before, we next investigate whether these results are affected by article type and research domain.
	- subplot + statistical tests for differences by article type
	- subplot + statistical tests for differences by domain
	- results in summary

# Discussion