# Project Velocity Discussion Notebook


Project Velocity is defined by CHAOSS here: https://chaoss.community/kb/metric-project-velocity/

Discussed in the context of CNCF: https://www.cncf.io/blog/2017/06/05/30-highest-velocity-open-source-projects/

In the general case, the graph axes are defined as: <br>
- X-Axis: Logarithmic scale for Code Changes
- Y-Axis: Logarithmic scale of Sum of Number of Issues and Number of Reviews
- Dot-size: Committers (we are using distinct contributors)
- Dots: Projects

Hover values from CNCF viz: <br>
- Project Name 
- Num of Commits 
- Project Commit Authors
- Size (square root of authors)
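
To make the axis and dot-size definitions above concrete, here is a minimal sketch that maps one project's counts to plot coordinates. The counts are made up, `velocity_point` is a hypothetical helper (not part of the CNCF tooling or 8Knot), and base-10 logs are an assumption; the choice of base only rescales the axes.

```python
import math

def velocity_point(commits, issues, reviews, authors):
    """Map raw counts to (x, y, size) for one dot on the velocity chart."""
    x = math.log10(commits)           # code changes on a log scale
    y = math.log10(issues + reviews)  # issues + reviews on a log scale
    size = math.sqrt(authors)         # the CNCF hover text reports sqrt(authors)
    return x, y, size

# Hypothetical project with made-up counts
x, y, size = velocity_point(commits=1000, issues=60, reviews=40, authors=25)
print(x, y, size)
```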

This notebook does all of the preprocessing down to the visualization so we can try different strategies before putting it into 8Knot.

In [1]:
import psycopg2
import pandas as pd 
import sqlalchemy as salc
import json
import plotly.express as px
import datetime as dt
import plotly
import math
import os

paths = ["../../comm_cage.json", "comm_cage.json", "../../config.json", "../config.json", "config.json", "../../copy_cage-padres.json"]

for path in paths:
    if os.path.exists(path):
        with open(path) as config_file:
            config = json.load(config_file)
        break
else:
    raise FileNotFoundError(f"None of the config files found: {paths}")

In [2]:
database_connection_string = 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(config['user'], config['password'], config['host'], config['port'], config['database'])

dbschema='augur_data'
engine = salc.create_engine(
    database_connection_string,
    connect_args={'options': '-csearch_path={}'.format(dbschema)})

CNCF incubating projects to demo these visualizations on

In [3]:
cncf_incubating = [
    'https://github.com/cloud-custodian/cloud-custodian',
    'https://github.com/kubeedge/kubeedge',
    'https://github.com/dragonflyoss/Dragonfly2',
    'https://github.com/cert-manager/cert-manager',
    'https://github.com/falcosecurity/falco',
    'https://github.com/in-toto/in-toto',
    'https://github.com/kyverno/kyverno',
    'https://github.com/notaryproject/notary',
    'https://github.com/keycloak/keycloak',
    'https://github.com/cubeFS/cubefs',
    'https://github.com/longhorn/longhorn',
    'https://github.com/cri-o/cri-o',
    'https://github.com/cilium/cilium',
    'https://github.com/containernetworking/cni',
    'https://github.com/crossplane/crossplane',
    'https://github.com/volcano-sh/volcano',
    'https://github.com/grpc/grpc',
    'https://github.com/projectcontour/contour',
    'https://github.com/emissary-ingress/emissary',
    'https://github.com/istio/istio',
    'https://github.com/cloudevents/spec',
    'https://github.com/nats-io/nats-server',
    'https://github.com/backstage/backstage',
    'https://github.com/buildpacks/pack',
    'https://github.com/kubevela/kubevela',
    'https://github.com/kubevirt/kubevirt',
    'https://github.com/operator-framework/operator-sdk',
    'https://github.com/keptn/keptn',
    'https://github.com/openkruise/kruise',
    'https://github.com/dapr/dapr',
    'https://github.com/kedacore/keda',
    'https://github.com/knative/community',
    'https://github.com/cortexproject/cortex',
    'https://github.com/OpenObservability/OpenMetrics',
    'https://github.com/thanos-io/thanos',
    'https://github.com/open-telemetry/community',
    'https://github.com/chaos-mesh/chaos-mesh',
    'https://github.com/litmuschaos/litmus'
] 

Get repo_ids for augur data access 

In [4]:
url_query = str(cncf_incubating)
url_query = url_query[1:-1]

repo_query = salc.sql.text(f"""
        SET SCHEMA 'augur_data';
        SELECT DISTINCT
            r.repo_id,
            r.repo_name
        FROM
            repo r
        JOIN repo_groups rg 
        ON r.repo_group_id = rg.repo_group_id
        WHERE
            r.repo_git in({url_query})
        """)



with engine.connect() as conn:
    # fetch the rows while the connection is still open
    results = conn.execute(repo_query).all()

# first column is repo_id, second is repo_name
repo_ids = [row[0] for row in results]
repo_names = [row[1] for row in results]
print(repo_ids)
print(repo_names)

[150726, 165501, 191141, 191143, 191146, 191148, 191149, 191151, 191152, 191153, 191154, 191155, 191157, 191159, 191161, 191163, 191166, 191170, 191171, 191172, 191173, 191174, 191175, 215010]
['volcano', 'istio', 'cilium', 'keda', 'keycloak', 'community', 'community', 'kruise', 'cortex', 'keptn', 'backstage', 'kyverno', 'kubeedge', 'falco', 'chaos-mesh', 'longhorn', 'litmus', 'grpc', 'spec', 'cloud-custodian', 'kubevela', 'dapr', 'in-toto', 'kubevirt']
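
The query above builds its `IN (...)` list by slicing the string form of a Python list. One alternative is to bind each URL as a parameter so the driver handles quoting. The sketch below shows the idea with the standard library's `sqlite3` and an in-memory table as a stand-in for the Augur database; the toy `repo` rows are made up, and the placeholder syntax is different for psycopg2 and SQLAlchemy.

```python
import sqlite3

# In-memory stand-in for the repo table (made-up rows)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repo (repo_id INTEGER, repo_name TEXT, repo_git TEXT)")
conn.executemany(
    "INSERT INTO repo VALUES (?, ?, ?)",
    [
        (1, "istio", "https://github.com/istio/istio"),
        (2, "grpc", "https://github.com/grpc/grpc"),
        (3, "other", "https://example.com/other/other"),
    ],
)

urls = ["https://github.com/istio/istio", "https://github.com/grpc/grpc"]

# One "?" per URL; the values travel separately from the SQL text
placeholders = ", ".join("?" for _ in urls)
query = (
    "SELECT DISTINCT repo_id, repo_name FROM repo "
    f"WHERE repo_git IN ({placeholders}) ORDER BY repo_id"
)
rows = conn.execute(query, urls).fetchall()
conn.close()

repo_ids = [row[0] for row in rows]
repo_names = [row[1] for row in rows]
print(repo_ids, repo_names)
```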


Query for contributions with related contributor information. This query gets the following contributor actions: 
- Commits 
- Issues: open, close, comment 
- Pull Requests: open, close, merge, review, comment


In [5]:
repo_statement = str(repo_ids)
repo_statement = repo_statement[1:-1]
print(repo_statement)

contrib_query = salc.sql.text(f"""
                SELECT
                        repo_id as id,
                        repo_name,
                        cntrb_id,
                        created_at,
                        login,
                        action,
                        rank
                    FROM
                        augur_data.explorer_contributor_actions
                    WHERE
                        repo_id in({repo_statement})
                """)
df = pd.read_sql(contrib_query, con=engine)

# reset to a clean 0..n-1 index without keeping the old one as a column
df = df.reset_index(drop=True)

150726, 165501, 191141, 191143, 191146, 191148, 191149, 191151, 191152, 191153, 191154, 191155, 191157, 191159, 191161, 191163, 191166, 191170, 191171, 191172, 191173, 191174, 191175, 215010


In [6]:
df

Unnamed: 0,id,repo_name,cntrb_id,created_at,login,action,rank
0,191170,grpc,01003fcc-8400-0000-0000-000000000000,2020-09-11 16:17:42-05:00,yashykt,pull_request_open,3662
1,165501,istio,01000983-5d00-0000-0000-000000000000,2025-06-02 20:02:53-05:00,howardjohn,pull_request_comment,33
2,165501,istio,0100175f-0e00-0000-0000-000000000000,2025-06-02 20:00:34-05:00,keithmattix,pull_request_comment,17
3,165501,istio,01000983-5d00-0000-0000-000000000000,2025-06-02 19:57:43-05:00,howardjohn,pull_request_comment,34
4,191159,falco,010a01df-6200-0000-0000-000000000000,2025-06-02 19:57:03-05:00,supertylerc,issue_comment,8
...,...,...,...,...,...,...,...
2271177,191146,keycloak,010012ad-1d00-0000-0000-000000000000,2014-05-22 07:33:14-05:00,mposolda,pull_request_open,9453
2271178,191146,keycloak,010012ad-1d00-0000-0000-000000000000,2014-04-29 09:14:06-05:00,mposolda,pull_request_open,9458
2271179,191146,keycloak,010012ad-1d00-0000-0000-000000000000,2014-03-27 15:30:46-05:00,mposolda,pull_request_open,9467
2271180,191146,keycloak,010012ad-1d00-0000-0000-000000000000,2014-03-10 09:34:40-05:00,mposolda,pull_request_open,9474


df_cntrbs holds values for number of unique contributors in each repo

In [7]:
df_cntrbs = pd.DataFrame(df.groupby('repo_name')['cntrb_id'].nunique()).rename(columns={"cntrb_id": "num_unique_contributors"})
df_cntrbs.head()

repo_name,num_unique_contributors
backstage,4076
chaos-mesh,788
cilium,4192
cloud-custodian,1618
community,1245
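
As a plain-Python illustration of what the `groupby(...).nunique()` step counts, the sketch below collects contributor ids into one set per repo, so repeated actions by the same contributor count once. The rows are made up and the ids are shortened for readability.

```python
from collections import defaultdict

# Made-up (repo_name, cntrb_id) pairs; one pair per contributor action
actions = [
    ("istio", "a"),
    ("istio", "a"),  # same contributor acting twice
    ("istio", "b"),
    ("grpc", "c"),
]

unique = defaultdict(set)
for repo, cntrb in actions:
    unique[repo].add(cntrb)

num_unique_contributors = {repo: len(ids) for repo, ids in unique.items()}
print(num_unique_contributors)
```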


df_actions holds a transformed version of the df to have the actions as columns, repo_name as the index, and the row values as the corresponding counts 

In [8]:
# group actions and repos to get the counts of the actions by repo 
df_actions = pd.DataFrame(df.groupby('repo_name')['action'].value_counts())
df_actions = df_actions.rename(columns={"action": "count"}).reset_index()
# pivot df to reformat the actions to be columns and repo_name to be rows
df_actions = df_actions.pivot(index="repo_name", columns="action", values="count")

In [9]:
df_actions.head()

repo_name,commit,issue_closed,issue_comment,issue_opened,pull_request_closed,pull_request_comment,pull_request_merged,pull_request_open,pull_request_review_APPROVED,pull_request_review_CHANGES_REQUESTED,pull_request_review_COMMENTED,pull_request_review_DISMISSED
backstage,,6891.0,31704.0,6976.0,3708.0,52677.0,18554.0,22119.0,21409.0,1711.0,26860.0,155.0
chaos-mesh,39.0,1305.0,3788.0,1668.0,352.0,14241.0,2394.0,2774.0,4207.0,176.0,4711.0,348.0
cilium,37655.0,2587.0,37719.0,10855.0,1551.0,122710.0,6169.0,28991.0,40384.0,6100.0,43970.0,451.0
cloud-custodian,116.0,2968.0,6568.0,4397.0,1020.0,8034.0,4358.0,5561.0,3476.0,160.0,8191.0,55.0
community,281.0,1953.0,9264.0,2194.0,168.0,7082.0,2108.0,2400.0,2601.0,50.0,3035.0,7.0
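
The pivot leaves a missing value wherever a repo has no rows for an action, as in the `commit` column for backstage above. The sketch below reproduces that shape in plain Python on made-up rows, using `None` where pandas would show NaN.

```python
from collections import Counter

# Made-up (repo_name, action) rows
rows = [
    ("istio", "commit"),
    ("istio", "commit"),
    ("istio", "issue_opened"),
    ("grpc", "issue_opened"),
]

counts = Counter(rows)
repos = sorted({repo for repo, _ in rows})
actions = sorted({action for _, action in rows})

# One row per repo, one column per action; None marks a combination never seen
table = {
    repo: {action: counts.get((repo, action)) for action in actions}
    for repo in repos
}
print(table)
```

Whether a missing combination should stay missing or be treated as 0 is a choice that affects every sum and log taken later, since a missing value propagates through addition.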


df_consolidated combines the action counts and the unique contributor counts; the columns used by the visualization are then added to it.

In [10]:
df_consolidated = pd.concat([df_actions, df_cntrbs], axis=1).reset_index()

In [11]:
df_consolidated.head()

Unnamed: 0,repo_name,commit,issue_closed,issue_comment,issue_opened,pull_request_closed,pull_request_comment,pull_request_merged,pull_request_open,pull_request_review_APPROVED,pull_request_review_CHANGES_REQUESTED,pull_request_review_COMMENTED,pull_request_review_DISMISSED,num_unique_contributors
0,backstage,,6891.0,31704.0,6976.0,3708.0,52677.0,18554.0,22119.0,21409.0,1711.0,26860.0,155.0,4076
1,chaos-mesh,39.0,1305.0,3788.0,1668.0,352.0,14241.0,2394.0,2774.0,4207.0,176.0,4711.0,348.0,788
2,cilium,37655.0,2587.0,37719.0,10855.0,1551.0,122710.0,6169.0,28991.0,40384.0,6100.0,43970.0,451.0,4192
3,cloud-custodian,116.0,2968.0,6568.0,4397.0,1020.0,8034.0,4358.0,5561.0,3476.0,160.0,8191.0,55.0,1618
4,community,281.0,1953.0,9264.0,2194.0,168.0,7082.0,2108.0,2400.0,2601.0,50.0,3035.0,7.0,1245


## Part where discussion is needed 

The x and y axes are the attributes that we need to decide on. CHAOSS defines them as follows:
- X-Axis: Logarithmic scale for Code Changes
- Y-Axis: Logarithmic scale of Sum of Number of Issues and Number of Reviews

With the data we have, let's figure out how best to apply this definition to get the most informative visualization.

In [12]:
df_consolidated["prs_opened+issues_closed"] = df_consolidated["issue_closed"] + df_consolidated["pull_request_open"]
df_consolidated["log_prs_o+issues_c"] = df_consolidated["prs_opened+issues_closed"].apply(math.log)
df_consolidated["log_num_commits"] = df_consolidated["commit"].apply(math.log)
df_consolidated["log_num_contrib"] = df_consolidated["num_unique_contributors"].apply(math.log)

In [13]:
df_consolidated.head()

Unnamed: 0,repo_name,commit,issue_closed,issue_comment,issue_opened,pull_request_closed,pull_request_comment,pull_request_merged,pull_request_open,pull_request_review_APPROVED,pull_request_review_CHANGES_REQUESTED,pull_request_review_COMMENTED,pull_request_review_DISMISSED,num_unique_contributors,prs_opened+issues_closed,log_prs_o+issues_c,log_num_commits,log_num_contrib
0,backstage,,6891.0,31704.0,6976.0,3708.0,52677.0,18554.0,22119.0,21409.0,1711.0,26860.0,155.0,4076,29010.0,10.275396,,8.312871
1,chaos-mesh,39.0,1305.0,3788.0,1668.0,352.0,14241.0,2394.0,2774.0,4207.0,176.0,4711.0,348.0,788,4079.0,8.313607,3.663562,6.669498
2,cilium,37655.0,2587.0,37719.0,10855.0,1551.0,122710.0,6169.0,28991.0,40384.0,6100.0,43970.0,451.0,4192,31578.0,10.360216,10.536221,8.340933
3,cloud-custodian,116.0,2968.0,6568.0,4397.0,1020.0,8034.0,4358.0,5561.0,3476.0,160.0,8191.0,55.0,1618,8529.0,9.051227,4.75359,7.388946
4,community,281.0,1953.0,9264.0,2194.0,168.0,7082.0,2108.0,2400.0,2601.0,50.0,3035.0,7.0,1245,4353.0,8.378621,5.638355,7.126891
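
`math.log` passes NaN through unchanged, which is why backstage shows an empty `log_num_commits` above, but it raises `ValueError` for 0. If missing counts were ever filled with 0, the `apply(math.log)` cells would stop with an error. A minimal sketch of a guard follows; `safe_log` is a hypothetical helper, not something the notebook defines.

```python
import math

def safe_log(value):
    """Natural log that returns NaN for missing, zero, or negative input."""
    if value is None or math.isnan(value) or value <= 0:
        return math.nan
    return math.log(value)

print(safe_log(37655.0))       # cilium's commit count from the table above
print(safe_log(0))             # NaN instead of ValueError
print(safe_log(float("nan")))  # NaN stays NaN
```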


Visualization; more specific styling will be applied in 8Knot once the axes are chosen.

In [14]:
fig = px.scatter(df_consolidated, 
                 x="log_num_commits", 
                 y="log_prs_o+issues_c", 
                 color="repo_name",
                 size='log_num_contrib', 
                 hover_data=['repo_name'],
                 title="Project Velocity")

In [15]:
fig.show()

## Application of different strategies

Note: the axes will be refined when this is implemented in Dash, which will also add a date range slider.

### 1: Y axis Total PR and Issues opened 

In [16]:
df_consolidated["prs_open_plus_issues_open"] = df_consolidated["issue_opened"] + df_consolidated["pull_request_open"]
df_consolidated["log_prs_open+issues_open"] = df_consolidated["prs_open_plus_issues_open"].apply(math.log)

Visualization with logs

In [17]:
fig = px.scatter(df_consolidated, 
                 x="log_num_commits", 
                 y="log_prs_open+issues_open", 
                 color="repo_name",
                 size='log_num_contrib', 
                 hover_data=['repo_name'],
                 title="Project Velocity")

In [18]:
fig.show()

Visualization of raw values for y axis

In [19]:
fig = px.scatter(df_consolidated, 
                 x="log_num_commits", 
                 y="prs_open_plus_issues_open", 
                 color="repo_name",
                 size='log_num_contrib', 
                 hover_data=['repo_name'],
                 title="Project Velocity")

In [20]:
fig.show()

### 2: PR and issue actions

PR: open, closed, merged <br>
Issues: open, closed

In [21]:
df_consolidated["prs_issues_actions"] = (df_consolidated["issue_opened"] + 
df_consolidated["issue_closed"] + df_consolidated["pull_request_open"] + df_consolidated["pull_request_merged"] + 
df_consolidated["pull_request_closed"])
df_consolidated["log_prs_issues_actions"] = df_consolidated["prs_issues_actions"].apply(math.log)

In [22]:
fig = px.scatter(df_consolidated, 
                 x="log_num_commits", 
                 y="log_prs_issues_actions", 
                 color="repo_name",
                 size='log_num_contrib', 
                 hover_data=['repo_name'],
                 title="Project Velocity")

In [23]:
fig.show()

Visualization of raw values for y axis

In [24]:
fig = px.scatter(df_consolidated, 
                 x="log_num_commits", 
                 y="prs_issues_actions", 
                 color="repo_name",
                 size='log_num_contrib', 
                 hover_data=['repo_name'],
                 title="Project Velocity")

In [25]:
fig.show()

### 3: PR and Issues actions with weights 
PR: open, closed, merged <br>
Issues: open, closed

Note: these weights are to be workshopped if this strategy is chosen; what matters most is their values relative to each other.

In [26]:
i_o_weight = .3
i_c_weight = .4
pr_o_weight = .5
pr_m_weight = .7
pr_c_weight = .2

In [27]:
df_consolidated["prs_issues_actions_weighted"] = (df_consolidated["issue_opened"]*i_o_weight + 
df_consolidated["issue_closed"]*i_c_weight + df_consolidated["pull_request_open"]*pr_o_weight
+ df_consolidated["pull_request_merged"]*pr_m_weight + df_consolidated["pull_request_closed"]*pr_c_weight)
df_consolidated["log_prs_issues_actions_weighted"] = df_consolidated["prs_issues_actions_weighted"].apply(math.log)
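
One way to see why only the relative weights matter on the log axis: multiplying every weight by the same constant adds the same offset to every project's log score, so the dots keep their positions relative to each other. The sketch below checks this with the chaos-mesh counts from the df_consolidated output above; `weighted_score` and its `scale` parameter are hypothetical names used for illustration.

```python
import math

weights = {
    "issue_opened": 0.3,
    "issue_closed": 0.4,
    "pull_request_open": 0.5,
    "pull_request_merged": 0.7,
    "pull_request_closed": 0.2,
}

# chaos-mesh counts taken from the df_consolidated output above
counts = {
    "issue_opened": 1668,
    "issue_closed": 1305,
    "pull_request_open": 2774,
    "pull_request_merged": 2394,
    "pull_request_closed": 352,
}

def weighted_score(counts, weights, scale=1.0):
    """Weighted sum of action counts; scale multiplies every weight."""
    return sum(counts[action] * weight * scale for action, weight in weights.items())

score = weighted_score(counts, weights)
scaled = weighted_score(counts, weights, scale=10)

# The log score shifts by log(10) no matter which project is scored
shift = math.log(scaled) - math.log(score)
print(score, shift)
```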

Visualization of log values for y axis 

In [28]:
fig = px.scatter(df_consolidated, 
                 x="log_num_commits", 
                 y="log_prs_issues_actions_weighted", 
                 color="repo_name",
                 size='log_num_contrib', 
                 hover_data=['repo_name'],
                 title="Project Velocity")

In [29]:
fig.show()

Visualization of raw values

In [30]:
fig = px.scatter(df_consolidated, 
                 x="log_num_commits", 
                 y="prs_issues_actions_weighted", 
                 color="repo_name",
                 size='log_num_contrib', 
                 hover_data=['repo_name'],
                 title="Project Velocity")

In [31]:
fig.show()

## Conclusion 

We will continue this strategy of developing more complex visualizations in a notebook for discussion. The visualization in 8Knot is strategy #3, with the ability to toggle between log and raw values, user-supplied weights, and a date range selector.
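
As a rough idea of what a date range selector implies for the preprocessing, the sketch below filters action rows to a time window before any counting happens. It is an assumption about the approach, not 8Knot's implementation; the rows are made up, `filter_by_date` is a hypothetical helper, and naive datetimes are used for simplicity even though `created_at` in the data carries a time zone.

```python
from datetime import datetime

# Made-up (repo_name, action, created_at) rows
actions = [
    ("istio", "issue_opened", datetime(2024, 1, 15)),
    ("istio", "pull_request_merged", datetime(2024, 6, 1)),
    ("grpc", "issue_closed", datetime(2023, 11, 30)),
]

def filter_by_date(rows, start, end):
    """Keep rows whose timestamp falls in the half-open range [start, end)."""
    return [row for row in rows if start <= row[2] < end]

selected = filter_by_date(actions, datetime(2024, 1, 1), datetime(2025, 1, 1))
print(len(selected))
```

The per-repo action counts and weighted scores would then be computed from `selected` rather than from the full history.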