# Funding Data Analysis and Pairwise Comparison Generator

This notebook processes OSO funding data to generate pairwise project comparisons for training.

## Overview
The notebook performs the following steps:
1. Fetches funding data from BigQuery
2. Generates pairwise project comparisons within funding rounds
3. Deduplicates and averages comparison weights
4. Exports training data for further use

## Prerequisites
- OSO Production BigQuery dataset access (see docs [here](https://docs.opensource.observer/docs/get-started/bigquery))
- Google Cloud credentials file (oso_gcp_credentials.json)
- Input graph file (unweighted_graph.json)

In [1]:
from google.cloud import bigquery
from collections import defaultdict
from itertools import combinations
import json
import networkx as nx
import numpy as np
import os
import pandas as pd

## Step 1: Data Preparation

First, we'll load our dependency graph data which contains the repository URLs we want to analyze.
The graph is stored in a JSON format and contains node-link data that we'll convert to a NetworkX graph.
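
For orientation, node-link JSON as written by NetworkX's `node_link_data` keeps a `nodes` list (each entry has an `id`) and, by default, a `links` list (each entry has a `source` and a `target`). The snippet below is a minimal sketch that parses a made-up two-node sample with the standard library only. The sample URLs and the single edge are illustrative assumptions, not contents of `unweighted_graph.json`.

```python
import json

# Hypothetical sample mimicking the node-link layout; not the real dataset.
sample_json = """
{
  "directed": true,
  "multigraph": false,
  "graph": {},
  "nodes": [
    {"id": "https://github.com/example-org/repo-a"},
    {"id": "https://github.com/example-org/repo-b"}
  ],
  "links": [
    {"source": "https://github.com/example-org/repo-a",
     "target": "https://github.com/example-org/repo-b"}
  ]
}
"""

sample_graph = json.loads(sample_json)

# Node IDs are the repository URLs that later feed the funding query.
sample_repo_urls = [node["id"] for node in sample_graph["nodes"]]
sample_edges = [(link["source"], link["target"]) for link in sample_graph["links"]]

print(len(sample_repo_urls), len(sample_edges))
```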

In [2]:
# Location of the dependency graph dataset
GRAPH_JSON_PATH = '../graph/unweighted_graph.json'

# Local directories for exporting the competition datasets (private)
PRIVATE_DATA_DIR = '../datasets/competition'
# makedirs also creates missing parent directories and is a no-op if the path exists
os.makedirs(PRIVATE_DATA_DIR, exist_ok=True)

FUNDING_DATA_CSV_PATH = os.path.join(PRIVATE_DATA_DIR, 'funding-data.csv')
PREAGG_TRAINING_DATA_CSV_PATH = os.path.join(PRIVATE_DATA_DIR, 'training-data-preagg.csv')
TRAINING_DATA_CSV_PATH = os.path.join(PRIVATE_DATA_DIR, 'training-data.csv')

In [3]:
with open(GRAPH_JSON_PATH, 'r') as f:
    graph_data = json.load(f)

G = nx.node_link_graph(graph_data)
repo_urls = list(G.nodes)

In [4]:
print(len(repo_urls))

4303


## Step 2: BigQuery Data Fetch

We'll fetch OSO funding data from BigQuery using the following query structure:
- Aggregates funding by quarter, funder, grant pool, and project
- Joins with repository data to get primary GitHub URLs
- For funded projects with multiple repos, uses the most popular one (by stars)
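
The "most popular repo" rule in the last bullet can be illustrated outside SQL. The sketch below mirrors, in plain Python, what `MAX_BY(artifact_url, star_count)` with `GROUP BY project_id` does in the query. The rows and project IDs are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (project_id, artifact_url, star_count) rows; values are made up.
repo_rows = [
    ("proj-1", "https://github.com/example/core", 1200),
    ("proj-1", "https://github.com/example/docs", 40),
    ("proj-2", "https://github.com/sample/lib", 310),
]

# Group candidate repositories by project.
rows_by_project = defaultdict(list)
for project_id, artifact_url, star_count in repo_rows:
    rows_by_project[project_id].append((artifact_url, star_count))

# Keep the URL with the highest star count for each project.
primary_repo = {
    project_id: max(candidates, key=lambda row: row[1])[0]
    for project_id, candidates in rows_by_project.items()
}

print(primary_repo)
```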

Prerequisites:
- Subscribe to the OSO Production dataset on BigQuery (see docs [here](https://docs.opensource.observer/docs/get-started/bigquery))
- Ensure your credentials file is properly configured

In [5]:
def get_funding_query(list_of_urls):
    
    # replace with your path to credentials
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '../oso_gcp_credentials.json'

    # replace with your project name
    client = bigquery.Client(project='opensource-observer')

    url_string = "'" + "','".join(list_of_urls) + "'"
    
    query = f"""
    WITH funding AS (
        SELECT
            date_trunc(time, QUARTER) AS quarter,        
            from_project_name as funder,
            grant_pool_name,
            to_project_name as project,
            to_project_id,
            CAST(sum(amount) AS INT) AS total_funding_usd
        FROM `oso_production.oss_funding_v0`
        GROUP BY 1, 2, 3, 4, 5
    ),
    repos as (
        SELECT
            project_id,
            MAX_BY(artifact_url, star_count) AS project_repo
        FROM `oso_production.repositories_v0`
        WHERE artifact_url IN ({url_string})
        GROUP BY project_id    
    )
    SELECT
        repos.project_repo,
        funding.* EXCEPT (to_project_id),
    FROM funding
    JOIN repos
        ON funding.to_project_id = repos.project_id
    """

    results = client.query(query)
    return results.to_dataframe()

df_funding = get_funding_query(repo_urls)
df_funding.tail(5)

Unnamed: 0,project_repo,quarter,funder,grant_pool_name,project,total_funding_usd
1537,https://github.com/ethereum/go-ethereum,2024-01-01 00:00:00+00:00,optimism,retropgf3,go-ethereum,1739137
1538,https://github.com/marak/colors.js,2021-04-01 00:00:00+00:00,opencollective,contributions,marak,2601
1539,https://github.com/ethers-io/ethers.js,2021-04-01 00:00:00+00:00,gitcoin,GG-09,ethers-io,5650
1540,https://github.com/marak/colors.js,2021-01-01 00:00:00+00:00,opencollective,contributions,marak,5201
1541,https://github.com/ipfs/js-ipfs,2024-01-01 00:00:00+00:00,optimism,retropgf3,ipfs,790515


In [6]:
# `index` expects a boolean; True writes the row index as the first (unnamed) column
df_funding.to_csv(FUNDING_DATA_CSV_PATH, index=True)

## Step 3: Pairwise Comparison Generation

This section creates and processes pairwise comparisons between projects based on their funding amounts.

### Process Overview
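- Identify the projects funded in each round (one funder in one quarter)
- Generate all possible pairs of projects within each round that has at least two projects
- Calculate each project's relative weight as its share of the pair's combined funding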
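
As a rough sketch of the first step, treating a round as one funder in one quarter, the rows can be grouped and only rounds with at least two distinct projects kept. The funders, quarters, and project names below are placeholders, and the standard library stands in for the pandas `groupby` used in the notebook.

```python
from collections import defaultdict

# Hypothetical (funder, quarter, project) rows; values are made up.
funding_rows = [
    ("gitcoin", "2021-04", "alpha"),
    ("gitcoin", "2021-04", "beta"),
    ("gitcoin", "2021-04", "alpha"),   # same project funded from a second grant pool
    ("optimism", "2024-01", "alpha"),
]

# Collect the distinct projects seen in each (funder, quarter) round.
projects_by_round = defaultdict(set)
for funder, quarter, project in funding_rows:
    projects_by_round[(funder, quarter)].add(project)

# A round needs at least two distinct projects to yield a comparison.
comparable_rounds = {
    round_key: sorted(projects)
    for round_key, projects in projects_by_round.items()
    if len(projects) > 1
}

print(comparable_rounds)
```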


### Example
If Project A received 75 and Project B received 25 in a round:
- weight_a = 0.75 (75/100)
- weight_b = 0.25 (25/100)
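
The same arithmetic extends to every pair in a round. The following is a minimal sketch, assuming a round with three projects and made-up amounts; skipping pairs whose combined funding is zero is an assumption added here to avoid a division by zero, not something the notebook does.

```python
from itertools import combinations

# Hypothetical funding totals for one round (one funder, one quarter).
round_funding = {"project_a": 75, "project_b": 25, "project_c": 100}

pairs = []
for a, b in combinations(round_funding, 2):
    total = round_funding[a] + round_funding[b]
    if total == 0:
        continue  # assumption: skip pairs with no funding to avoid dividing by zero
    weight_a = round_funding[a] / total
    pairs.append((a, b, weight_a, 1 - weight_a))

for pair in pairs:
    print(pair)
```

Each weight depends only on the two projects in the pair, so adding `project_c` to the round leaves the A/B weights at 0.75 and 0.25.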

### Output Format
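The DataFrame built in this step (`df_preagg`) has one row per project pair per funding round; Step 4 later averages these rows into the final training data. It contains:
- project_a: GitHub repository URL
- project_b: GitHub repository URL
- weight_a: project_a's share of the pair's combined funding in that round
- weight_b: project_b's share of the pair's combined funding in that round (1 - weight_a)
- total_amount_usd: Combined funding received by the two projects in that round (in USD)
- funder: Name of the funding platform (e.g., Gitcoin, Optimism, Open Collective)
- quarter: First month of the funding quarter, formatted as YYYY-MM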
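
To make the `quarter` label concrete: the query truncates each timestamp to the start of its quarter, and the notebook formats that date with `strftime('%Y-%m')`, so the label always names the quarter's first month. The helper below, `quarter_label`, is a hypothetical stand-in that reproduces this mapping with the standard library.

```python
from datetime import date

def quarter_label(day: date) -> str:
    """Return the YYYY-MM label of the first month of the quarter containing `day`."""
    first_month = 3 * ((day.month - 1) // 3) + 1
    return f"{day.year}-{first_month:02d}"

print(quarter_label(date(2024, 11, 15)))  # a November date falls in the quarter starting in October
print(quarter_label(date(2021, 5, 2)))    # a May date falls in the quarter starting in April
```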

In [7]:
# Filter funders with more than one unique project per quarter
funding_rounds = (
    df_funding.groupby(['funder', 'quarter'])['project']
    .nunique()
    .loc[lambda x: x > 1]
)

preagg_data = []

# Process each funder and quarter combination
for (funder, quarter), _ in funding_rounds.items():
    # Filter data for the specific funder and quarter
    dff = df_funding.query("funder == @funder and quarter == @quarter")
    
    # Get unique projects and create all combinations
    projects = dff['project'].unique()
    comparisons = combinations(projects, 2)
    
    # Process each project pair
    for project_a, project_b in comparisons:
        
        # Extract repositories and funding amounts
        repo_a = dff.loc[dff['project'] == project_a, 'project_repo'].iloc[0]
        repo_b = dff.loc[dff['project'] == project_b, 'project_repo'].iloc[0]
        amount_a = dff.loc[dff['project'] == project_a, 'total_funding_usd'].sum()
        amount_b = dff.loc[dff['project'] == project_b, 'total_funding_usd'].sum()
        
        # Compute weights and total amount
        amount_total = amount_a + amount_b
        weight_a = amount_a / amount_total
        weight_b = 1 - weight_a
        
        # Append the results to preagg_data
        preagg_data.append({
            'project_a': repo_a,
            'project_b': repo_b,
            'weight_a': weight_a,            
            'weight_b': weight_b,
            'total_amount_usd': amount_total,
            'funder': funder,
            'quarter': quarter.strftime('%Y-%m')
        })

df_preagg = pd.DataFrame(preagg_data)
df_preagg.tail()

Unnamed: 0,project_a,project_b,weight_a,weight_b,total_amount_usd,funder,quarter
25214,https://github.com/chainsafe/lodestar,https://github.com/bluealloy/revm,0.43708,0.56292,398133,optimism,2024-10
25215,https://github.com/chainsafe/lodestar,https://github.com/ethereum/solc-js,0.593528,0.406472,293189,optimism,2024-10
25216,https://github.com/libp2p/go-libp2p,https://github.com/bluealloy/revm,0.744457,0.255543,877024,optimism,2024-10
25217,https://github.com/libp2p/go-libp2p,https://github.com/ethereum/solc-js,0.845647,0.154353,772080,optimism,2024-10
25218,https://github.com/bluealloy/revm,https://github.com/ethereum/solc-js,0.65285,0.34715,343290,optimism,2024-10


In [8]:
df_preagg.to_csv(PREAGG_TRAINING_DATA_CSV_PATH)

## Step 4: Average for final training data

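Finally, for each pair of projects, we take a simple average of the pair's weights across all funding rounds in which both projects appeared. Each pair is first put into a canonical order so that (A, B) and (B, A) are counted as the same pair.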
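
A compact sketch of this deduplicate-and-average step, using the standard library and two made-up rows that describe the same pair in opposite orders. The repository names and weights are illustrative only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical pre-aggregation rows: (project_a, project_b, weight_a).
preagg_rows = [
    ("repo/x", "repo/y", 0.75),  # round 1
    ("repo/y", "repo/x", 0.5),   # round 2, listed in the opposite order
]

weights_by_pair = defaultdict(list)
for a, b, weight_a in preagg_rows:
    # Canonical order: the lexicographically larger name goes first,
    # flipping the weight whenever the pair has to be swapped.
    if a > b:
        weights_by_pair[(a, b)].append(weight_a)
    else:
        weights_by_pair[(b, a)].append(1 - weight_a)

# One averaged weight per unordered pair.
averaged = {pair: mean(weights) for pair, weights in weights_by_pair.items()}

print(averaged)
```

Keying on the ordered tuple means both rows land in the same bucket, so the average is taken over two observations rather than producing two separate one-row pairs.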

In [9]:
averager = {}

def add_pair(a, b, weight):
    if a not in averager:
        averager[a] = {}
    if b not in averager[a]:
        averager[a][b] = []
    averager[a][b].append(weight)

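# Store each pair in a canonical order (lexicographically larger URL first)
# so that (A, B) and (B, A) accumulate in the same bucket.
for _, row in df_preagg.iterrows():
    a_is_first = row["project_a"] > row["project_b"]
    if a_is_first:
        add_pair(row["project_a"], row["project_b"], row["weight_a"])
    else:
        # Swap the pair and record the weight of the project now listed first
        add_pair(row["project_b"], row["project_a"], row["weight_b"])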

In [10]:
deduped = []

for a, inner in averager.items():
    for b, weights in inner.items():
        weight_a = float(np.average(weights))
        weight_b = 1 - weight_a
        deduped.append([a, b, weight_a, weight_b])

df_deduped = pd.DataFrame(deduped, columns=["project_a", "project_b", "weight_a", "weight_b"])

In [11]:
df_deduped.to_csv(TRAINING_DATA_CSV_PATH)