---
title: "Code Contribution and Authorship"
author:
  - name: "Eva Maxfield Brown"
    email: evamxb@uw.edu
    orcid: 0000-0003-2564-0373
    affiliation:
      name: University of Washington Information School
      city: Seattle
      state: Washington
      country: USA
  - name: "Nicholas Weber"
    email: nmweber@uw.edu
    orcid: 0000-0002-6008-3763
    affiliation:
      name: University of Washington Information School
      city: Seattle
      state: Washington
      country: USA

abstract: |
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur eget porta erat. Morbi consectetur est vel gravida pretium. Suspendisse ut dui eu ante cursus gravida non sed sem. Nullam sapien tellus, commodo id velit id, eleifend volutpat quam. Phasellus mauris velit, dapibus finibus elementum vel, pulvinar non tellus. Nunc pellentesque pretium diam, quis maximus dolor faucibus id. Nunc convallis sodales ante, ut ullamcorper est egestas vitae. Nam sit amet enim ultrices, ultrices elit pulvinar, volutpat risus.

## Basics
bibliography: main.bib

## Number sections (required for section cross ref)
number-sections: true

## Citation Style Language
# See https://github.com/citation-style-language/styles for more options
# We default to PNAS (Proceedings of the National Academy of Sciences)
# csl: support/acm-proceedings.csl

## Specific for target format
format:
  html:
    code-tools: true
    code-fold: true
    code-summary: "Show the code"
    standalone: true
    embed-resources: true
    toc: true
    toc-location: left
    reference-location: margin
    citation-location: margin

  pdf:
    toc: false

---

# Introduction

In [1]:
from pathlib import Path

import IPython.display
from sqlalchemy import text
from sqlmodel import Session, create_engine, select

from rs_graph.db import models as db_models

# Define the base CTE for unique document-repository links
unique_doc_repo_links_sql = """
WITH unique_doc_repo_links AS (
  SELECT drl.id, drl.document_id, drl.repository_id, drl.dataset_source_id
  FROM document_repository_link drl
  INNER JOIN (
    SELECT document_id
    FROM document_repository_link
    GROUP BY document_id
    HAVING COUNT(*) = 1
  ) d ON drl.document_id = d.document_id
  INNER JOIN (
    SELECT repository_id
    FROM document_repository_link
    GROUP BY repository_id
    HAVING COUNT(*) = 1
  ) r ON drl.repository_id = r.repository_id
)
""".strip()


def execute_count_query(
  session: Session,
  query: str,
):
  # Exec and return first
  result = session.exec(text(query.strip()))
  return result.first()[0]


# Get db engine for production database
db_path = Path("publications/qss-code-authors/rs-graph-temp.db").resolve()
db_conn = create_engine(f"sqlite:///{db_path}")

- Contemporary scientific research has become increasingly dependent on specialized software tools and computational methods.
  - define scientific software (scripts, tools, infrastructure)
  - importance in enabling large-scale experiments and acting as a direct log of processing and analysis
  - scientific code sharing is on the rise

- Despite increased reliance on computational methodologies, the developers of scientific software have historically not been given traditional academic credit for their work: authorship on research articles.
  - qualitative research which talks about acknowledgements sections instead of authorship
  - lack of authorship can affect career prospects

- While new credit systems aim to be more inclusive towards more contribution types, they still suffer from two key problems.
	- Contributor Roles Taxonomy (CRediT) allows for a specific “software” contribution
	- Others have used CRediT to understand distribution of labor…
	- they are still based around an author list (it’s hard to change existing human practices, especially biased ones)
	- they aren’t verifiable, they are self-reported

- To address these problems, we create a novel predictive model that enables matching scientific article authors and source code developer accounts.
	- a predictive model is the best choice for entity matching because, while authors have ORCIDs, developer accounts do not
	- further, developer account information may differ slightly from publication information (preferred name vs. legal name, usernames, etc.)
	- a fine-tuned transformer model enables us to connect accounts which have similar enough information, hopefully providing us with many more author-code-contributor matches than would be possible on exact name or email address matching alone

- Our predictive model serves two primary purposes: identifying authors who directly contribute to an article’s associated codebase, and revealing developers who were not included on the article’s authorship list.
	- while predictive, it is grounded in the commit logs of source code repositories rather than self-reported data
	- individuals who have been left off can at least for now be identified by their developer account

- Further, by applying our model across a large corpus of paired research articles and source code repositories, we enable objective insight into the software development dynamics of research teams.
	- much like studies of CRediT, we can investigate both how many article authors contribute code
	- similarly, we can investigate who contributes code (by author position and external characteristics)
	- again, this is via commit logs and contribution histories, rather than self-reported data

- To summarize, this paper makes the following contributions:
	- we train, evaluate, and make publicly available a predictive model to match article authors with developer accounts
	- we create a large dataset of linked articles and source code repositories with accompanying bibliometric and repository information, and further match article authors with repository developers
	- we demonstrate the value of our predictive model through preliminary analysis of research team software development dynamics and code contributor characteristics

- The rest of this paper is organized as follows:
	- …

# Data and Methods

## Linking Scientific Articles and Associated Source Code Repositories

- Our trained predictive model and our preliminary analyses are based on datasets of linked bibliographic and source code repository information from multiple journals and publication platforms.
	- Each data source (the journals and publication platforms) either requires or recommends the sharing of code repositories related to a piece of work at the time of publication.
	- In turn, this allows us to mine each article for its required or recommended “data or code availability” links.
	- our data sources are:
    - PLOS: research articles
    - JOSS: software articles
    - SoftwareX: software articles
    - Papers with Code / ArXiv: pre-prints
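
A minimal sketch of the link-mining step described above, assuming the availability statement is plain text and that only GitHub links are of interest. The pattern, the function name `extract_github_repos`, and the example statement are hypothetical and are not taken from the actual pipeline.

```python
import re

# Assumption: repositories are referenced as github.com/<owner>/<repo> URLs.
GITHUB_REPO_PATTERN = re.compile(
    r"https?://(?:www\.)?github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9_.-]+)"
)


def extract_github_repos(text: str) -> list[tuple[str, str]]:
    """Return unique (owner, repo) tuples in order of first appearance."""
    repos: list[tuple[str, str]] = []
    for owner, repo in GITHUB_REPO_PATTERN.findall(text):
        # Drop sentence-final periods and a trailing ".git" suffix
        repo = repo.rstrip(".")
        if repo.endswith(".git"):
            repo = repo[: -len(".git")]
        if repo and (owner, repo) not in repos:
            repos.append((owner, repo))
    return repos


# Hypothetical availability statement
statement = "Code is available at https://github.com/example-lab/example-tool."
print(extract_github_repos(statement))
```

A real statement may also point to other hosts or to archived snapshots, which this sketch ignores.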

- Using each data source, we process the pairs of scientific articles and associated source code repositories, in order to extract the authorship and source code repository contributor lists as well as other bibliometric and repository information.
	- we use OpenAlex to extract bibliometric information
	- we use the GitHub API to extract repository information
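
As a rough illustration of the two lookups above, the snippet below only builds request URLs and sends nothing over the network. OpenAlex exposes works by DOI and the GitHub REST API exposes a repository contributors endpoint, but whether the pipeline uses exactly these endpoints is an assumption. The helper names and example identifiers are hypothetical.

```python
from urllib.parse import quote

OPENALEX_API = "https://api.openalex.org"
GITHUB_API = "https://api.github.com"


def openalex_work_url(doi: str) -> str:
    """URL for a single OpenAlex work record, looked up by DOI."""
    return f"{OPENALEX_API}/works/doi:{quote(doi, safe='/')}"


def github_contributors_url(owner: str, repo: str) -> str:
    """URL for the contributor list of a GitHub repository."""
    return f"{GITHUB_API}/repos/{quote(owner)}/{quote(repo)}/contributors"


# Hypothetical identifiers, for illustration only
print(openalex_work_url("10.1234/example.5678"))
print(github_contributors_url("example-lab", "example-tool"))
```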

In [2]:
# SQL Statements for Totals
total_doc_repo_pairs_query = f"""
{unique_doc_repo_links_sql}

SELECT COUNT(DISTINCT id)
FROM unique_doc_repo_links
"""

total_authors_query = f"""
{unique_doc_repo_links_sql}

SELECT COUNT(DISTINCT researcher.id)
FROM researcher
JOIN document_contributor ON researcher.id = document_contributor.researcher_id
WHERE document_contributor.document_id IN (
  SELECT document_id
  FROM unique_doc_repo_links
)
"""

total_devs_query = f"""
{unique_doc_repo_links_sql}

SELECT COUNT(DISTINCT developer_account.id)
FROM developer_account
JOIN repository_contributor ON developer_account.id = repository_contributor.developer_account_id
WHERE repository_contributor.repository_id IN (
  SELECT repository_id
  FROM unique_doc_repo_links
)
"""

with Session(db_conn) as session:
  total_article_repo_pairs = execute_count_query(session, total_doc_repo_pairs_query)
  total_authors = execute_count_query(session, total_authors_query)
  total_devs = execute_count_query(session, total_devs_query)

total_article_repo_pairs, total_authors, total_devs


- Our final dataset contains the bibliometric and code repository information for tens of thousands of scientific-article-source-code-repository pairs from multiple article types and fields.
	- specifics about the whole dataset (size, number of unique article-repository pairs, number of unique authors, number of unique developer accounts)
	- table of data descriptive statistics broken out by data source and providing things like:
    - Number of unique article-repository pairs
    - Number of unique authors
    - Number of unique contributors
    - Number of unique article-repository pairs in each domain (top 4)
    - Number of unique article-repository pairs by year

In [3]:
# SQL Statements for Dataset Counts
data_source_doc_repo_pairs_query_template = """
{udrl_sql}

SELECT COUNT(DISTINCT unique_doc_repo_links.id)
FROM unique_doc_repo_links
WHERE unique_doc_repo_links.dataset_source_id = {source_id}
"""

data_source_authors_query_template = """
{udrl_sql}

SELECT COUNT(DISTINCT researcher.id)
FROM researcher
JOIN document_contributor ON researcher.id = document_contributor.researcher_id
JOIN unique_doc_repo_links ON document_contributor.document_id = unique_doc_repo_links.document_id
WHERE unique_doc_repo_links.dataset_source_id = {source_id}
"""

data_source_devs_query_template = """
{udrl_sql}

SELECT COUNT(DISTINCT developer_account.id)
FROM developer_account
JOIN repository_contributor ON developer_account.id = repository_contributor.developer_account_id
JOIN unique_doc_repo_links ON repository_contributor.repository_id = unique_doc_repo_links.repository_id
WHERE unique_doc_repo_links.dataset_source_id = {source_id}
"""

# Get counts for each data source
with Session(db_conn) as session:
  data_sources = session.exec(select(db_models.DatasetSource)).all()

  data_source_stats = []
  for data_source in data_sources:
    source_id = data_source.id
    n_article_repo_pairs = execute_count_query(
      session,
      data_source_doc_repo_pairs_query_template.format(udrl_sql=unique_doc_repo_links_sql, source_id=source_id),
    )
    n_authors = execute_count_query(
      session,
      data_source_authors_query_template.format(udrl_sql=unique_doc_repo_links_sql, source_id=source_id),
    )
    n_devs = execute_count_query(
      session,
      data_source_devs_query_template.format(udrl_sql=unique_doc_repo_links_sql, source_id=source_id),
    )

    data_source_stats.append({
      "data_source": data_source.name,
      "n_article_repo_pairs": n_article_repo_pairs,
      "n_authors": n_authors,
      "n_devs": n_devs,
    })

<!-- ```{python}
# SQL Statements for Field Counts
field_doc_repo_pairs_query_template = """
{udrl_sql}

SELECT COUNT(DISTINCT unique_doc_repo_links.id)
FROM unique_doc_repo_links
JOIN document_topic ON unique_doc_repo_links.document_id = document_topic.document_id
JOIN topic ON document_topic.topic_id = topic.id
WHERE topic.field_name = '{field_name}'
AND document_topic.id = (
  SELECT dt.id
  FROM document_topic dt
  WHERE dt.document_id = unique_doc_repo_links.document_id
  ORDER BY dt.score DESC
  LIMIT 1
)
"""

field_authors_query_template = """
{udrl_sql}

SELECT COUNT(DISTINCT researcher.id)
FROM researcher
JOIN document_contributor ON researcher.id = document_contributor.researcher_id
JOIN unique_doc_repo_links ON document_contributor.document_id = unique_doc_repo_links.document_id
JOIN document_topic ON document_contributor.document_id = document_topic.document_id
JOIN topic ON document_topic.topic_id = topic.id
WHERE topic.field_name = '{field_name}'
AND document_topic.id = (
  SELECT dt.id
  FROM document_topic dt
  WHERE dt.document_id = unique_doc_repo_links.document_id
  ORDER BY dt.score DESC
  LIMIT 1
)
"""

field_devs_query_template = """
{udrl_sql}

SELECT COUNT(DISTINCT developer_account.id)
FROM developer_account
JOIN repository_contributor ON developer_account.id = repository_contributor.developer_account_id
JOIN unique_doc_repo_links ON repository_contributor.repository_id = unique_doc_repo_links.repository_id
JOIN document_topic ON unique_doc_repo_links.document_id = document_topic.document_id
JOIN topic ON document_topic.topic_id = topic.id
WHERE topic.field_name = '{field_name}'
AND document_topic.id = (
  SELECT dt.id
  FROM document_topic dt
  WHERE dt.document_id = unique_doc_repo_links.document_id
  ORDER BY dt.score DESC
  LIMIT 1
)
"""

# Get counts for each field
with Session(db_conn) as session:
  fields = [field[0] for field in session.exec(text("SELECT DISTINCT field_name FROM topic"))]

  field_stats = []
  for field in fields:
    n_article_repo_pairs = execute_count_query(
      session,
      field_doc_repo_pairs_query_template.format(udrl_sql=unique_doc_repo_links_sql, field_name=field),
    )
    n_authors = execute_count_query(
      session,
      field_authors_query_template.format(udrl_sql=unique_doc_repo_links_sql, field_name=field),
    )
    n_devs = execute_count_query(
      session,
      field_devs_query_template.format(udrl_sql=unique_doc_repo_links_sql, field_name=field),
    )

    field_stats.append({
      "field": field,
      "n_article_repo_pairs": n_article_repo_pairs,
      "n_authors": n_authors,
      "n_devs": n_devs,
    })

field_counts_df = pd.DataFrame(field_stats).set_index("field").sort_values("n_article_repo_pairs", ascending=False)
field_counts_df
``` -->

In [4]:
# SQL Statements for Domain Counts
domain_doc_repo_pairs_query_template = """
{udrl_sql}

SELECT COUNT(DISTINCT unique_doc_repo_links.id)
FROM unique_doc_repo_links
JOIN document_topic ON unique_doc_repo_links.document_id = document_topic.document_id
JOIN topic ON document_topic.topic_id = topic.id
WHERE topic.domain_name = '{domain_name}'
AND document_topic.id = (
  SELECT dt.id
  FROM document_topic dt
  WHERE dt.document_id = unique_doc_repo_links.document_id
  ORDER BY dt.score DESC
  LIMIT 1
)
"""

domain_authors_query_template = """
{udrl_sql}

SELECT COUNT(DISTINCT researcher.id)
FROM researcher
JOIN document_contributor ON researcher.id = document_contributor.researcher_id
JOIN unique_doc_repo_links ON document_contributor.document_id = unique_doc_repo_links.document_id
JOIN document_topic ON document_contributor.document_id = document_topic.document_id
JOIN topic ON document_topic.topic_id = topic.id
WHERE topic.domain_name = '{domain_name}'
AND document_topic.id = (
  SELECT dt.id
  FROM document_topic dt
  WHERE dt.document_id = unique_doc_repo_links.document_id
  ORDER BY dt.score DESC
  LIMIT 1
)
"""

domain_devs_query_template = """
{udrl_sql}

SELECT COUNT(DISTINCT developer_account.id)
FROM developer_account
JOIN repository_contributor ON developer_account.id = repository_contributor.developer_account_id
JOIN unique_doc_repo_links ON repository_contributor.repository_id = unique_doc_repo_links.repository_id
JOIN document_topic ON unique_doc_repo_links.document_id = document_topic.document_id
JOIN topic ON document_topic.topic_id = topic.id
WHERE topic.domain_name = '{domain_name}'
AND document_topic.id = (
  SELECT dt.id
  FROM document_topic dt
  WHERE dt.document_id = unique_doc_repo_links.document_id
  ORDER BY dt.score DESC
  LIMIT 1
)
"""

# Get counts for each domain
with Session(db_conn) as session:
  domains = [domain[0] for domain in session.exec(text("SELECT DISTINCT domain_name FROM topic"))]

  domain_stats = []
  for domain in domains:
    n_article_repo_pairs = execute_count_query(
      session,
      domain_doc_repo_pairs_query_template.format(udrl_sql=unique_doc_repo_links_sql, domain_name=domain),
    )
    n_authors = execute_count_query(
      session,
      domain_authors_query_template.format(udrl_sql=unique_doc_repo_links_sql, domain_name=domain),
    )
    n_devs = execute_count_query(
      session,
      domain_devs_query_template.format(udrl_sql=unique_doc_repo_links_sql, domain_name=domain),
    )

    domain_stats.append({
      "domain": domain,
      "n_article_repo_pairs": n_article_repo_pairs,
      "n_authors": n_authors,
      "n_devs": n_devs,
    })

In [5]:
# SQL Statements for Doc Type Counts
doc_type_doc_repo_pairs_query_template = """
{udrl_sql}

SELECT COUNT(DISTINCT unique_doc_repo_links.id)
FROM unique_doc_repo_links
JOIN document ON unique_doc_repo_links.document_id = document.id
WHERE document.document_type = '{doc_type}'
"""

doc_type_authors_query_template = """
{udrl_sql}

SELECT COUNT(DISTINCT researcher.id)
FROM researcher
JOIN document_contributor ON researcher.id = document_contributor.researcher_id
JOIN document ON document_contributor.document_id = document.id
WHERE document.document_type = '{doc_type}'
AND document.id IN (
  SELECT document_id FROM unique_doc_repo_links
)
"""

doc_type_devs_query_template = """
{udrl_sql}

SELECT COUNT(DISTINCT developer_account.id)
FROM developer_account
JOIN repository_contributor ON developer_account.id = repository_contributor.developer_account_id
JOIN unique_doc_repo_links ON repository_contributor.repository_id = unique_doc_repo_links.repository_id
JOIN document ON unique_doc_repo_links.document_id = document.id
WHERE document.document_type = '{doc_type}'
"""

# Get counts for each document type
with Session(db_conn) as session:
  doc_types = [doc_type[0] for doc_type in session.exec(text("SELECT DISTINCT document_type FROM document"))]

  doc_type_stats = []
  for doc_type in doc_types:
    n_article_repo_pairs = execute_count_query(
      session,
      doc_type_doc_repo_pairs_query_template.format(udrl_sql=unique_doc_repo_links_sql, doc_type=doc_type),
    )
    n_authors = execute_count_query(
      session,
      doc_type_authors_query_template.format(udrl_sql=unique_doc_repo_links_sql, doc_type=doc_type),
    )
    n_devs = execute_count_query(
      session,
      doc_type_devs_query_template.format(udrl_sql=unique_doc_repo_links_sql, doc_type=doc_type),
    )

    doc_type_stats.append({
      "doc_type": doc_type,
      "n_article_repo_pairs": n_article_repo_pairs,
      "n_authors": n_authors,
      "n_devs": n_devs,
    })

In [6]:
# SQL Statements for Access Counts
access_doc_repo_pairs_query_template = """
{udrl_sql}

SELECT COUNT(DISTINCT unique_doc_repo_links.id)
FROM unique_doc_repo_links
JOIN document ON unique_doc_repo_links.document_id = document.id
WHERE document.is_open_access = {oa_status_int}
"""

access_authors_query_template = """
{udrl_sql}

SELECT COUNT(DISTINCT researcher.id)
FROM researcher
JOIN document_contributor ON researcher.id = document_contributor.researcher_id
JOIN document ON document_contributor.document_id = document.id
WHERE document.is_open_access = {oa_status_int}
AND document.id IN (
  SELECT document_id FROM unique_doc_repo_links
)
"""

access_devs_query_template = """
{udrl_sql}

SELECT COUNT(DISTINCT developer_account.id)
FROM developer_account
JOIN repository_contributor ON developer_account.id = repository_contributor.developer_account_id
JOIN unique_doc_repo_links ON repository_contributor.repository_id = unique_doc_repo_links.repository_id
JOIN document ON unique_doc_repo_links.document_id = document.id
WHERE document.is_open_access = {oa_status_int}
"""

# Get counts for each access status
with Session(db_conn) as session:
  access_statuses = [
    {"name": "Open Access", "value": 1},
    {"name": "Closed Access", "value": 0},
  ]

  access_stats = []
  for access_status in access_statuses:
    n_article_repo_pairs = execute_count_query(
      session,
      access_doc_repo_pairs_query_template.format(
        udrl_sql=unique_doc_repo_links_sql,
        oa_status_int=access_status["value"],
      ),
    )
    n_authors = execute_count_query(
      session,
      access_authors_query_template.format(udrl_sql=unique_doc_repo_links_sql, oa_status_int=access_status["value"]),
    )
    n_devs = execute_count_query(
      session,
      access_devs_query_template.format(udrl_sql=unique_doc_repo_links_sql, oa_status_int=access_status["value"]),
    )

    access_stats.append({
      "access_status": access_status["name"],
      "n_article_repo_pairs": n_article_repo_pairs,
      "n_authors": n_authors,
      "n_devs": n_devs,
    })

In [7]:
# Construct multi-row span HTML table
# Columns should be: "n_article_repo_pairs", "n_authors", "n_devs"
# Rows should be: "By Data Source", "By Domain", "By Document Type", "By Access Status", and "Total"

stats_piece_inital_row_template = """
<tr>
    <td rowspan="{n_rows}">{row_name}</td>
    <td>{value_name}</td>
    <td>{n_article_repo_pairs}</td>
    <td>{n_authors}</td>
    <td>{n_devs}</td>
</tr>
""".strip()

stats_piece_subsequent_row_template = """
<tr>
    <td>{value_name}</td>
    <td>{n_article_repo_pairs}</td>
    <td>{n_authors}</td>
    <td>{n_devs}</td>
</tr>
""".strip()

stats_portions_html = []
for stats_portion, stats_name, value_key in [
  (data_source_stats, "<b>By Data Source</b>", "data_source"),
  (domain_stats, "<b>By Domain</b>", "domain"),
  (doc_type_stats, "<b>By Document Type</b>", "doc_type"),
  (access_stats, "<b>By Access Status</b>", "access_status"),
  (
    [
      {
        "empty": "",
        "n_article_repo_pairs": f"<b>{total_article_repo_pairs}</b>",
        "n_authors": f"<b>{total_authors}</b>",
        "n_devs": f"<b>{total_devs}</b>",
      }
    ],
    "<b>Total</b>",
    "empty",
  ),
]:
    # Order by article-repo pairs
    stats_portion = sorted(stats_portion, key=lambda x: x["n_article_repo_pairs"], reverse=True)

    stats_portion_html = []
    for i, stats_piece in enumerate(stats_portion):
        if i == 0:
            stats_portion_html.append(stats_piece_inital_row_template.format(
                n_rows=len(stats_portion),
                row_name=stats_name,
                value_name=stats_piece[value_key],
                n_article_repo_pairs=stats_piece["n_article_repo_pairs"],
                n_authors=stats_piece["n_authors"],
                n_devs=stats_piece["n_devs"],
            ))
        else:
            stats_portion_html.append(stats_piece_subsequent_row_template.format(
                value_name=stats_piece[value_key],
                n_article_repo_pairs=stats_piece["n_article_repo_pairs"],
                n_authors=stats_piece["n_authors"],
                n_devs=stats_piece["n_devs"],
            ))

    stats_portions_html.append("\n".join(stats_portion_html))

# Concat and wrap in table
stats_table_html = f"""
<table>
    <tr>
        <th><b>Category</b></th>
        <th><b>Subset</b></th>
        <th><b># Article-Repository Pairs</b></th>
        <th><b># Authors</b></th>
        <th><b># Developers</b></th>
    </tr>
    {" ".join(stats_portions_html)}
</table>
""".strip()

IPython.display.HTML(stats_table_html)

| Category | Subset | # Article-Repository Pairs | # Authors | # Developers |
|---|---|---|---|---|
| By Data Source | pwc | 52276 | 139573 | 70193 |
| By Data Source | plos | 6137 | 30383 | 8873 |
| By Data Source | joss | 2437 | 7499 | 11976 |
| By Domain | Physical Sciences | 48942 | 132564 | 72539 |
| By Domain | Life Sciences | 4897 | 21880 | 8339 |
| By Domain | Social Sciences | 4101 | 14903 | 7423 |
| By Domain | Health Sciences | 2815 | 16468 | 4392 |
| By Document Type | article | 30638 | 106793 | 51813 |
| By Document Type | preprint | 28396 | 84576 | 42266 |
| By Document Type | book-chapter | 1582 | 5884 | 2878 |

## Manual Matching of Article Authors and Source Code Repository Contributors

In [8]:
print("CREATE TABLE OF DATASET STATS")


- Before we can train and validate a predictive entity matching model, we must first create a large annotated dataset of article authors and source code repository contributor pairs.
	- describe the task (we have info about an author identity and a developer identity, are they the same identity)
	- add figure for more detail

- We had two annotators each label 3000 pairs of article author and source code repository contributor information.
	- we use the subset of our dataset of JOSS authors and contributors.
	- we use JOSS as we believe a software article sample will provide us with the highest rate of positive identity matches for training (or a somewhat balanced dataset)
	- we create author-developer-account annotation pairs using data from individual paper-repository pairs.
	- that is, developers and authors were only paired for annotation if they belonged to the same paper-repository pair, meaning that we would never annotate an author-developer-account pair that combined developer information with an author from an unrelated paper
	- After each annotator completed labeling all 3000 author-code-contributor pairs, annotators then resolved any differences between their labels.
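
The pairing rule above can be sketched as a Cartesian product taken within each paper-repository pair and never across pairs. The record layout, field names, and example values below are hypothetical.

```python
from itertools import product


def build_annotation_pairs(paper_repo_pairs: list[dict]) -> list[dict]:
    """Pair every author with every developer of the same paper-repository pair."""
    annotation_pairs = []
    for link in paper_repo_pairs:
        # product() stays inside one link, so no cross-paper pairs are created
        for author, developer in product(link["authors"], link["developers"]):
            annotation_pairs.append(
                {"paper": link["paper"], "author": author, "developer": developer}
            )
    return annotation_pairs


# Hypothetical input: two paper-repository pairs
links = [
    {"paper": "paper-1", "authors": ["A. Author", "B. Author"], "developers": ["dev-a"]},
    {"paper": "paper-2", "authors": ["C. Author"], "developers": ["dev-c", "dev-d"]},
]
pairs = build_annotation_pairs(links)
print(len(pairs))
```

Each candidate pair would then be shown to both annotators and labeled as the same identity or not.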

- Our final annotated dataset used for model training consists of the author names and source code repository contributor information from the 3000 labeled author-code-contributor pairs.
	- basic numbers, number of “positive” and “negative” matches
	- note however that some developer accounts do not have a complete set of information available
	- table of number of developer accounts with each feature and by their match

# A Predictive Model for Matching Article Authors and Source Code Contributors

- To optimize our predictive model for author-contributor matching, we evaluate a variety of Transformer-based architectures and input features.
	- multiple transformer base models available and there isn’t clear information as to which is “best” for entity matching
	- we have minimal info for authors, just their name, but we have a few features for developer accounts and it isn’t clear which are most important or useful
	- explain potential problems and benefits of certain features
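
To make the notion of a feature set concrete, here is a minimal sketch of how an author name and a chosen subset of developer account fields might be serialized into one input string for a Transformer model. The template, the separator, and the function name `format_pair` are assumptions, as is the exact list of available developer fields.

```python
def format_pair(
    author_name: str,
    developer: dict,
    features: tuple[str, ...] = ("username", "name", "email"),
) -> str:
    """Serialize an author-developer pair, skipping missing developer fields."""
    parts = [f"author name: {author_name}"]
    for feature in features:
        value = developer.get(feature)
        if value:  # developer accounts often lack a name or a public email
            parts.append(f"developer {feature}: {value}")
    return " | ".join(parts)


# Hypothetical developer account with no display name
dev = {"username": "jdoe", "name": None, "email": "jdoe@example.org"}
print(format_pair("Jane Doe", dev, features=("username", "name")))
print(format_pair("Jane Doe", dev))
```

Changing the `features` argument is one way to compare feature sets such as username only versus username plus name.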

- To ensure that our trained model is as accurate as possible, we trained and evaluated multiple combinations of pre-trained Transformer base models and different developer account information feature sets.
	- explain the feature sets a bit more (username only, username + name, etc.)
	- explain the testing strategy (10% of unique authors and developers are used for testing)
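
One possible reading of the testing strategy above is an entity-level split: hold out 10% of unique authors and 10% of unique developers, and send every pair that touches a held-out entity to the test set. This is a sketch under that assumption. The function name, the seed, and the rule for mixed pairs are hypothetical.

```python
import random


def split_by_entity(pairs: list[dict], test_fraction: float = 0.1, seed: int = 0):
    """Split pairs so that held-out authors and developers never appear in train."""
    rng = random.Random(seed)
    authors = sorted({p["author"] for p in pairs})
    developers = sorted({p["developer"] for p in pairs})

    # Hold out at least one entity of each kind
    test_authors = set(rng.sample(authors, max(1, round(len(authors) * test_fraction))))
    test_devs = set(rng.sample(developers, max(1, round(len(developers) * test_fraction))))

    train, test = [], []
    for pair in pairs:
        # Assumption: a pair with any held-out entity goes to test
        if pair["author"] in test_authors or pair["developer"] in test_devs:
            test.append(pair)
        else:
            train.append(pair)
    return train, test, test_authors, test_devs


# Hypothetical data: 20 authors crossed with 20 developers
example_pairs = [
    {"author": f"author-{i}", "developer": f"dev-{j}"}
    for i in range(20)
    for j in range(20)
]
train, test, held_out_authors, held_out_devs = split_by_entity(example_pairs)
print(len(train), len(test))
```

Splitting by entity rather than by pair keeps the model from seeing a test author or developer during training.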

- After testing all base-model and feature set combinations, we find that our best performing model is fine-tuned from: Y and uses Z features.
	- specifics of best model
	- table of model configurations and results
	- minor observations about feature sets that perform poorly

- Finally, we make our best performing model publicly available for reuse.
	- We provide a structured Python library for interaction with the model at link
	- Direct access to the model files can be found on Hugging Face.

# Preliminary Analysis of Code Contributor Authorship and Development Dynamics of Research Teams

- To enrich our pre-existing dataset, we apply our trained predictive model across pairs of authors and developer accounts.
	- again, these pairs are all combinations of author and developer account within an individual paper
	- specifics, how many unique author-developer account pairs are we able to find
	- table of author-developer account pairs by data source / by field
	- we next use this enriched dataset to understand software development dynamics within research teams, and characterize the authors who are and who aren’t code contributors.

## Software Development Dynamics Within Research Teams

- We begin by measuring the distributions of different coding and non-coding contributors across all of the article-code-repository pairs within our dataset.
	- explain more, what are the different types of contributions? (coding contributor, coding-with-authorship contributor, non-coding-author, etc.)
	- what are the basics / what do we see across the board? What are the distributions of each of these contributor types
	- compare against analysis built on CRediT statements?
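
The contributor types can be derived from the predicted matches for one article-repository pair. The sketch below assumes three categories (author-developers, non-coding authors, and non-credited developers) and a list of `(author, developer)` match tuples. The function name and data are hypothetical.

```python
def classify_contributors(
    authors: list[str],
    developers: list[str],
    matches: list[tuple[str, str]],
) -> dict[str, list[str]]:
    """Group the members of one article-repository pair by contributor type."""
    matched_authors = {author for author, _ in matches}
    matched_devs = {developer for _, developer in matches}
    return {
        # Authors matched to at least one developer account
        "author_developers": sorted(a for a in authors if a in matched_authors),
        # Authors with no matched developer account
        "non_coding_authors": sorted(a for a in authors if a not in matched_authors),
        # Developer accounts with no matched author
        "non_credited_developers": sorted(d for d in developers if d not in matched_devs),
    }


# Hypothetical article-repository pair
groups = classify_contributors(
    authors=["A", "B", "C"],
    developers=["dev1", "dev2"],
    matches=[("A", "dev1")],
)
print(groups)
```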

- Next we investigate if these distributions change over time, or, by “research team size”.
	- define research team size, in our case this is the total number of author-developers + non-coding authors + non-credited developers
	- plot the medians of the contributor type distributions over time (by publication year)
	- create subplots of different bins of research team size (i.e. <= 3 members, >3 <= 5, >5 <= 10, >10) and show distributions again.
	- results in summary
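
The team size definition and the bins listed above translate directly into two small helpers. The function names and bin labels are hypothetical; the bin edges follow the outline.

```python
def team_size(
    n_author_developers: int,
    n_non_coding_authors: int,
    n_non_credited_developers: int,
) -> int:
    """Research team size: everyone who authored the article or committed code."""
    return n_author_developers + n_non_coding_authors + n_non_credited_developers


def team_size_bin(size: int) -> str:
    """Assign a team size to one of the four bins used for the subplots."""
    if size <= 3:
        return "<=3"
    if size <= 5:
        return ">3, <=5"
    if size <= 10:
        return ">5, <=10"
    return ">10"


# Hypothetical team: 2 author-developers, 3 non-coding authors, 1 non-credited developer
size = team_size(2, 3, 1)
print(size, team_size_bin(size))
```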

- We further investigate how these distributions are affected by article type and research domain.
	- refresher on article type (research articles, software articles, and pre-prints)
	- explain research domains
	- subplots of both
	- results in summary

## Characteristics of Scientific Code Contributors

- Next we investigate the differences between coding and non-coding article authors.
	- specifics, author position in authorship list is a commonly used tool in scientometrics
	- similarly, metrics of “scientific impact” such as h-index, i10-index, and two-year mean citedness are also available to us.
	- plot / table of the distributions between coding and non-coding authors
	- ANOVA / chi-squared tests to see if these differences are significant
	- results in summary
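
As an illustration of the categorical comparison, the sketch below computes Pearson's chi-squared statistic (without continuity correction) for a contingency table of coding status by author position. The counts are invented for illustration and are not results. In practice a library routine such as `scipy.stats.chi2_contingency` would also return the p-value.

```python
def chi_squared_statistic(table: list[list[int]]) -> tuple[float, int]:
    """Return Pearson's chi-squared statistic and degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows are coding / non-coding authors,
# columns are first / middle / last author position
example_table = [
    [60, 30, 10],
    [40, 120, 90],
]
statistic, dof = chi_squared_statistic(example_table)
print(f"chi2 = {statistic:.2f}, dof = {dof}")
```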

- Just as before, we next investigate if these results are affected by article type and research domain.
	- subplot + stats tests for differences by each article type
	- subplot + stats tests for differences by each domain
	- results in summary

# Discussion