# Funding Data Analysis and Pairwise Comparison Generator

This notebook processes OSO funding data to generate pairwise project comparisons for training.

## Overview
The notebook performs the following steps:
1. Fetches funding data from BigQuery
2. Generates pairwise project comparisons within funding rounds
3. Deduplicates and averages comparison weights
4. Exports training data for further use

## Prerequisites
- OSO Production BigQuery dataset access (see docs [here](https://docs.opensource.observer/docs/get-started/bigquery))
- Google Cloud credentials file (oso_gcp_credentials.json)
- Input graph file (unweighted_graph.json)

In [1]:
from google.cloud import bigquery
from collections import defaultdict
from itertools import combinations
import json
import networkx as nx
import numpy as np
import os
import pandas as pd

## Step 1: Data Preparation

First, we'll load our dependency graph data which contains the repository URLs we want to analyze.
The graph is stored in a JSON format and contains node-link data that we'll convert to a NetworkX graph.

In [2]:
GRAPH_JSON_PATH = '../graph/unweighted_graph.json'
TRAINING_DATA_CSV_PATH = '../graph/training-data.csv'

In [3]:
with open(GRAPH_JSON_PATH, 'r') as f:
    graph_data = json.load(f)

G = nx.node_link_graph(graph_data)
repo_urls = [x for x in G.nodes]

## Step 2: BigQuery Data Fetch

We'll fetch OSO funding data from BigQuery using the following query structure:
- Aggregates funding by quarter, funder, and project
- Joins with repository data to get primary GitHub URLs
- For funded projects with multiple repos, uses the most popular one (by stars)

Prerequisites:
- Subscribe to the OSO Production dataset on BigQuery (see docs [here](https://docs.opensource.observer/docs/get-started/bigquery))
- Ensure your credentials file is properly configured

In [4]:
def get_funding_query(list_of_urls):
    
    # replace with your path to credentials
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '../oso_gcp_credentials.json'

    # replace with your project name
    client = bigquery.Client(project='opensource-observer')

    url_string = "'" + "','".join(list_of_urls) + "'"
    
    query = f"""
    WITH funding AS (
        SELECT
            date_trunc(time, QUARTER) AS quarter,        
            from_project_name as funder,
            grant_pool_name,
            to_project_name as project,
            to_project_id,
            CAST(sum(amount) AS INT) AS total_funding_usd
        FROM `oso_production.oss_funding_v0`
        GROUP BY 1, 2, 3, 4, 5
    ),
    repos as (
        SELECT
            project_id,
            MAX_BY(artifact_url, star_count) AS top_git_repo
        FROM `oso_production.repositories_v0`
        WHERE artifact_url IN ({url_string})
        GROUP BY project_id    
    )
    SELECT
        funding.* EXCEPT (to_project_id),
        repos.top_git_repo
    FROM funding
    JOIN repos
        ON funding.to_project_id = repos.project_id
    """

    results = client.query(query)
    return results.to_dataframe()

df = get_funding_query(repo_urls)
df.tail(5)

Unnamed: 0,quarter,funder,grant_pool_name,project,total_funding_usd,top_git_repo
1537,2021-01-01 00:00:00+00:00,opencollective,contributions,chzyer,26,https://github.com/chzyer/readline
1538,2022-10-01 00:00:00+00:00,opencollective,contributions,getsentry,32052,https://github.com/getsentry/sentry-javascript
1539,2021-07-01 00:00:00+00:00,gitcoin,CGrants - Direct,teku-consensys,3,https://github.com/consensys/teku
1540,2019-04-01 00:00:00+00:00,opencollective,contributions,dcodeio,9003,https://github.com/dcodeio/long.js
1541,2023-04-01 00:00:00+00:00,optimism,retropgf2,ethers-io,313712,https://github.com/ethers-io/ethers.js


## Step 3: Pairwise Comparison Generation and Deduplication

This section creates and processes pairwise comparisons between projects based on their funding amounts.

### Process Overview
1. For each funding round (defined by funder + quarter):
   - Identify all projects that received funding
   - Generate all possible pairs of projects
   - Calculate relative weights based on funding amounts

2. Data Consistency Rules:
   - Project pairs are always stored with alphabetically larger URL first
   - When the same pair appears in multiple rounds, weights are averaged
   - Weights for each pair sum to 1.0

### Example
If Project A received 75 and Project B received 25 in a round:
- weight_a = 0.75 (75/100)
- weight_b = 0.25 (25/100)

### Output Format
The final DataFrame contains:
- project_a: GitHub repository URL (alphabetically larger)
- project_b: GitHub repository URL (alphabetically smaller)
- weight_a: Average relative funding weight for project_a
- weight_b: Average relative funding weight for project_b

In [5]:
averager = {}

def add_pair(a, b, weight):
    if a not in averager:
        averager[a] = {}
    if b not in averager[a]:
        averager[a][b] = []
    averager[a][b].append(weight)

funding_rounds = df.groupby(['funder', 'quarter'])['project'].nunique()
funding_rounds = funding_rounds[funding_rounds > 1]

for funder, quarter in funding_rounds.keys():
    dff = df[(df['funder'] == funder) & (df['quarter'] == quarter)]
    projects = list(dff['project'].unique())
    comparisons = combinations(projects, 2)
    
    for (project_a, project_b) in comparisons:
        amount_a = df[df['project'] == project_a]['total_funding_usd'].sum()
        amount_b = df[df['project'] == project_b]['total_funding_usd'].sum()
        repo_a = df[df['project'] == project_a]['top_git_repo'].unique()[0]
        repo_b = df[df['project'] == project_b]['top_git_repo'].unique()[0]
        amount_total = amount_a + amount_b
        weight = amount_a / amount_total
        
        if repo_a > repo_b:
            add_pair(repo_a, repo_b, weight)
        else:
            add_pair(repo_b, repo_a, 1 - weight)

deduped = [[a, b, float(np.average(weights)), 1-float(np.average(weights))] 
           for a, inner in averager.items() 
           for b, weights in inner.items()]

deduped_df = pd.DataFrame(deduped, columns=["project_a", "project_b", "weight_a", "weight_b"])
deduped_df.tail()

Unnamed: 0,project_a,project_b,weight_a,weight_b
3405,https://github.com/ethereum/solc-js,https://github.com/erigontech/erigon,0.334443,0.665557
3406,https://github.com/ethereum/solc-js,https://github.com/chainsafe/lodestar,0.091744,0.908256
3407,https://github.com/ethereum/solc-js,https://github.com/bluealloy/revm,0.114692,0.885308
3408,https://github.com/erigontech/erigon,https://github.com/chainsafe/lodestar,0.167372,0.832628
3409,https://github.com/erigontech/erigon,https://github.com/bluealloy/revm,0.204968,0.795032


## Step 4: Export Results

Save the deduplicated pairwise comparisons to a CSV file as training data.

In [6]:
deduped_df.to_csv(TRAINING_DATA_CSV_PATH)