# Project Velocity Discussion Notebook


Project Velocity is defined by CHAOSS here: https://chaoss.community/kb/metric-project-velocity/ 

Discussed in the context of CNCF: https://www.cncf.io/blog/2017/06/05/30-highest-velocity-open-source-projects/

In the general case the graph axes are defined as: <br>
- X-Axis: Logarithmic scale for Code Changes
- Y-Axis: Logarithmic scale of Sum of Number of Issues and Number of Reviews
- Dot-size: Committers (we are using distinct contributors)
- Dots are projects

Hover values from CNCF viz: <br>
- Project Name 
- Num of Commits 
- Project Commit Authors
- Size (square root of authors)
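
As a rough illustration of how a single dot could be placed under the definitions above, the sketch below maps raw counts to plot coordinates. The function name `velocity_point` and its parameters are hypothetical, the base-10 logarithm is an assumption (the definitions only say "logarithmic"), and the square-root sizing follows the CNCF hover value listed above.

```python
import math

def velocity_point(code_changes, issues, reviews, authors):
    """Map raw project counts to (x, y, size) for a velocity scatter plot.

    Hypothetical helper: base-10 logs are an assumption, and all counts
    are assumed to be positive.
    """
    x = math.log10(code_changes)      # X-axis: log of code changes
    y = math.log10(issues + reviews)  # Y-axis: log of issues + reviews
    size = math.sqrt(authors)         # dot size: square root of authors
    return x, y, size

# Made-up counts for one project
x, y, size = velocity_point(code_changes=1000, issues=60, reviews=40, authors=25)
print(x, y, size)
```

The cells below use natural logs and the log of distinct contributors for dot size instead, which changes the scale of the plot but not the ordering of projects.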

This notebook does all of the preprocessing, down to the visualization, so we can try different strategies before putting it into 8Knot.

In [None]:
import psycopg2
import pandas as pd 
import sqlalchemy as salc
import json
import plotly.express as px
import datetime as dt
import plotly
import math
import os

paths = ["../../comm_cage.json", "comm_cage.json", "../../config.json", "../config.json", "config.json", "../../copy_cage-padres.json"]

for path in paths:
    if os.path.exists(path):
        with open(path) as config_file:
            config = json.load(config_file)
        break
else:
    raise FileNotFoundError(f"None of the config files found: {paths}")

In [None]:
database_connection_string = 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(config['user'], config['password'], config['host'], config['port'], config['database'])

dbschema='augur_data'
engine = salc.create_engine(
    database_connection_string,
    connect_args={'options': '-csearch_path={}'.format(dbschema)})
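
One caveat with formatting the connection string directly: a password containing characters such as `@` or `/` would produce a URL that does not parse as intended. A minimal sketch of one way around this, assuming the same `config` keys as above, is to percent-encode the credentials first. The helper name `build_connection_string` and the example credentials are made up for illustration.

```python
from urllib.parse import quote_plus

def build_connection_string(config):
    """Build a postgresql+psycopg2 URL, escaping the user and password."""
    return "postgresql+psycopg2://{}:{}@{}:{}/{}".format(
        quote_plus(str(config["user"])),
        quote_plus(str(config["password"])),
        config["host"],
        config["port"],
        config["database"],
    )

# Made-up credentials purely for illustration
example_config = {
    "user": "augur",
    "password": "p@ss/word",
    "host": "localhost",
    "port": 5432,
    "database": "augur",
}
print(build_connection_string(example_config))
```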

CNCF incubating projects to demo these visualizations on

In [None]:
cncf_incubating = [
    'https://github.com/cloud-custodian/cloud-custodian',
    'https://github.com/kubeedge/kubeedge',
    'https://github.com/dragonflyoss/Dragonfly2',
    'https://github.com/cert-manager/cert-manager',
    'https://github.com/falcosecurity/falco',
    'https://github.com/in-toto/in-toto',
    'https://github.com/kyverno/kyverno',
    'https://github.com/notaryproject/notary',
    'https://github.com/keycloak/keycloak',
    'https://github.com/cubeFS/cubefs',
    'https://github.com/longhorn/longhorn',
    'https://github.com/cri-o/cri-o',
    'https://github.com/cilium/cilium',
    'https://github.com/containernetworking/cni',
    'https://github.com/crossplane/crossplane',
    'https://github.com/volcano-sh/volcano',
    'https://github.com/grpc/grpc',
    'https://github.com/projectcontour/contour',
    'https://github.com/emissary-ingress/emissary',
    'https://github.com/istio/istio',
    'https://github.com/cloudevents/spec',
    'https://github.com/nats-io/nats-server',
    'https://github.com/backstage/backstage',
    'https://github.com/buildpacks/pack',
    'https://github.com/kubevela/kubevela',
    'https://github.com/kubevirt/kubevirt',
    'https://github.com/operator-framework/operator-sdk',
    'https://github.com/keptn/keptn',
    'https://github.com/openkruise/kruise',
    'https://github.com/dapr/dapr',
    'https://github.com/kedacore/keda',
    'https://github.com/knative/community',
    'https://github.com/cortexproject/cortex',
    'https://github.com/OpenObservability/OpenMetrics',
    'https://github.com/thanos-io/thanos',
    'https://github.com/open-telemetry/community',
    'https://github.com/chaos-mesh/chaos-mesh',
    'https://github.com/litmuschaos/litmus'
] 

Get the repo_ids needed to access the Augur data

In [None]:
url_query = str(cncf_incubating)
url_query = url_query[1:-1]

repo_query = salc.sql.text(f"""
        SET SCHEMA 'augur_data';
        SELECT DISTINCT
            r.repo_id,
            r.repo_name
        FROM
            repo r
        JOIN repo_groups rg 
        ON r.repo_group_id = rg.repo_group_id
        WHERE
            r.repo_git in({url_query})
        """)



# Reuse the engine created above and fetch the rows while the
# connection is still open; the result cannot be read after it closes.
with engine.connect() as conn:
    results = conn.execute(repo_query).all()

# Each row is (repo_id, repo_name)
repo_ids = [row[0] for row in results]
repo_names = [row[1] for row in results]
print(repo_ids)
print(repo_names)
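
The query above builds its `IN (...)` list by slicing the string form of a Python list. That works for these hard-coded URLs, but bound parameters are a safer pattern if the list ever comes from user input. The sketch below shows one way to generate named placeholders; the helper `in_clause_params` is hypothetical, and the final comment assumes a SQLAlchemy connection like the one used in this notebook.

```python
def in_clause_params(values, prefix="url"):
    """Return ('(:url0, :url1, ...)', {'url0': v0, ...}) for an IN clause.

    Note: an empty list yields '()', which is not valid SQL, so callers
    should check for that case first.
    """
    params = {f"{prefix}{i}": value for i, value in enumerate(values)}
    placeholders = ", ".join(f":{name}" for name in params)
    return f"({placeholders})", params

urls = [
    "https://github.com/kubeedge/kubeedge",
    "https://github.com/istio/istio",
]
clause, params = in_clause_params(urls)
sql = f"SELECT repo_id, repo_name FROM repo WHERE repo_git IN {clause}"
print(sql)
print(params)
# With SQLAlchemy this could then be run as:
#   conn.execute(salc.sql.text(sql), params)
```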

Query for contributions with related contributor information. This query gets the following contributor actions: 
- Commits 
- Issues: open, close, comment 
- Pull Requests: open, close, merge, review, comment


In [None]:
repo_statement = str(repo_ids)
repo_statement = repo_statement[1:-1]
print(repo_statement)

contrib_query = salc.sql.text(f"""
                SELECT
                        repo_id as id,
                        repo_name,
                        cntrb_id,
                        created_at,
                        login,
                        action,
                        rank
                    FROM
                        augur_data.explorer_contributor_actions
                    WHERE
                        repo_id in({repo_statement})
                """)
df = pd.read_sql(contrib_query, con=engine)

# reset to a clean 0..n-1 index without keeping the old one as a column
df = df.reset_index(drop=True)

In [None]:
df

df_cntrbs holds the number of unique contributors in each repo

In [None]:
df_cntrbs = pd.DataFrame(df.groupby('repo_name')['cntrb_id'].nunique()).rename(columns={"cntrb_id": "num_unique_contributors"})
df_cntrbs.head()

df_actions holds a reshaped version of df, with repo_name as the index, one column per action, and the corresponding counts as the values

In [None]:
# group actions and repos to get the counts of the actions by repo 
df_actions = pd.DataFrame(df.groupby('repo_name')['action'].value_counts())
df_actions = df_actions.rename(columns={"action": "count"}).reset_index()
# pivot df to reformat the actions to be columns and repo_name to be rows 
df_actions = df_actions.pivot(index="repo_name", columns="action", values="count")

In [None]:
df_actions.head()
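
To make the reshaping concrete, here is a small plain-Python sketch of the same count-then-pivot idea on made-up rows. It is only an illustration of the resulting shape, not the pandas code used above, and `count_actions` is a hypothetical name.

```python
from collections import Counter, defaultdict

def count_actions(rows):
    """Count actions per repo: {repo_name: {action: count}}."""
    counts = defaultdict(Counter)
    for repo_name, action in rows:
        counts[repo_name][action] += 1
    return {repo: dict(actions) for repo, actions in counts.items()}

# Made-up (repo_name, action) rows
rows = [
    ("falco", "commit"),
    ("falco", "commit"),
    ("falco", "issue_opened"),
    ("keda", "commit"),
    ("keda", "pull_request_open"),
]
table = count_actions(rows)
print(table)
```

In this sketch a repo simply lacks a key for an action it never had, whereas the pandas pivot fills that cell with NaN. That matters later, because NaN propagates through the column sums.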

df_consolidated combines the action counts and unique contributor counts; columns used specifically for the visualizations are added to it later

In [None]:
df_consolidated = pd.concat([df_actions, df_cntrbs], axis=1).reset_index()

In [None]:
df_consolidated.head()

## Part where discussion is needed 

The x and y axes are the attributes that we need to decide on. CHAOSS defines them as the following:
- X-Axis: Logarithmic scale for Code Changes
- Y-Axis: Logarithmic scale of Sum of Number of Issues and Number of Reviews

With the data we have, let's figure out which application gives the most informative visualization 

In [None]:
df_consolidated["prs_opened+issues_closed"] = df_consolidated["issue_closed"] + df_consolidated["pull_request_open"]
df_consolidated["log_prs_o+issues_c"] = df_consolidated["prs_opened+issues_closed"].apply(math.log)
df_consolidated["log_num_commits"] = df_consolidated["commit"].apply(math.log)
df_consolidated["log_num_contrib"] = df_consolidated["num_unique_contributors"].apply(math.log)

In [None]:
df_consolidated.head()
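
One thing to keep in mind with `math.log`: it passes NaN through but raises `ValueError` for 0, so a repo whose summed count is exactly zero would stop the cell. A minimal sketch of a more forgiving transform, assuming that missing or non-positive values should simply be left off the plot, could look like this (`safe_log` is a hypothetical helper, not part of the notebook):

```python
import math

def safe_log(value):
    """Natural log that returns NaN for missing or non-positive values."""
    if value is None or math.isnan(value) or value <= 0:
        return float("nan")
    return math.log(value)

print(safe_log(100))
print(safe_log(0))
print(safe_log(float("nan")))
```

It could be used in place of `math.log` in the `.apply(...)` calls above, for example `df_consolidated["commit"].apply(safe_log)`.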

Visualization; more specific styling will be applied in 8Knot once the axes are chosen 

In [None]:
fig = px.scatter(df_consolidated, 
                 x="log_num_commits", 
                 y="log_prs_o+issues_c", 
                 color="repo_name",
                 size='log_num_contrib', 
                 hover_data=['repo_name'],
                 title="Project Velocity")

In [None]:
fig.show()

## Application of different strategies

Note: the axes will be adjusted when this is implemented in Dash, and a date range slider will be added 

### 1: Y-axis as total PRs and issues opened 

In [None]:
df_consolidated["prs_open_plus_issues_open"] = df_consolidated["issue_opened"] + df_consolidated["pull_request_open"]
df_consolidated["log_prs_open+issues_open"] = df_consolidated["prs_open_plus_issues_open"].apply(math.log)

Visualization with logs

In [None]:
fig = px.scatter(df_consolidated, 
                 x="log_num_commits", 
                 y="log_prs_open+issues_open", 
                 color="repo_name",
                 size='log_num_contrib', 
                 hover_data=['repo_name'],
                 title="Project Velocity")

In [None]:
fig.show()

Visualization of raw values for y axis

In [None]:
fig = px.scatter(df_consolidated, 
                 x="log_num_commits", 
                 y="prs_open_plus_issues_open", 
                 color="repo_name",
                 size='log_num_contrib', 
                 hover_data=['repo_name'],
                 title="Project Velocity")

In [None]:
fig.show()

### 2: PR and issue actions

PR: open, closed, merged <br>
Issues: open, closed

In [None]:
df_consolidated["prs_issues_actions"] = (df_consolidated["issue_opened"] + 
df_consolidated["issue_closed"] + df_consolidated["pull_request_open"] + df_consolidated["pull_request_merged"] + 
df_consolidated["pull_request_closed"])
df_consolidated["log_prs_issues_actions"] = df_consolidated["prs_issues_actions"].apply(math.log)

In [None]:
fig = px.scatter(df_consolidated, 
                 x="log_num_commits", 
                 y="log_prs_issues_actions", 
                 color="repo_name",
                 size='log_num_contrib', 
                 hover_data=['repo_name'],
                 title="Project Velocity")

In [None]:
fig.show()

Visualization of raw values for y axis

In [None]:
fig = px.scatter(df_consolidated, 
                 x="log_num_commits", 
                 y="prs_issues_actions", 
                 color="repo_name",
                 size='log_num_contrib', 
                 hover_data=['repo_name'],
                 title="Project Velocity")

In [None]:
fig.show()

### 3: PR and Issues actions with weights 
PR: open, closed, merged <br>
Issues: open, closed

Note: these weights will be workshopped if this strategy is chosen; what matters most is the weights relative to each other 

In [None]:
i_o_weight = .3
i_c_weight = .4
pr_o_weight = .5
pr_m_weight = .7
pr_c_weight = .2

In [None]:
df_consolidated["prs_issues_actions_weighted"] = (df_consolidated["issue_opened"]*i_o_weight + 
df_consolidated["issue_closed"]*i_c_weight + df_consolidated["pull_request_open"]*pr_o_weight
+ df_consolidated["pull_request_merged"]*pr_m_weight + df_consolidated["pull_request_closed"]*pr_c_weight)
df_consolidated["log_prs_issues_actions_weighted"] = df_consolidated["prs_issues_actions_weighted"].apply(math.log)
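
The point about relative weights can be checked with a small example: multiplying every weight by the same constant shifts all log values by the same amount, so the distance between any two repos on the log axis stays the same. The sketch below uses the weights from the cell above with made-up counts. `weighted_activity` is a hypothetical helper, and treating a missing action as 0 is an assumption that differs from the DataFrame code, where a NaN in any column makes the whole sum NaN.

```python
import math

# Weights copied from the cell above
WEIGHTS = {
    "issue_opened": 0.3,
    "issue_closed": 0.4,
    "pull_request_open": 0.5,
    "pull_request_merged": 0.7,
    "pull_request_closed": 0.2,
}

def weighted_activity(counts, weights):
    """Weighted sum of action counts; actions absent from counts count as 0."""
    return sum(weight * counts.get(action, 0) for action, weight in weights.items())

# Made-up counts for two repos
repo_a = {"issue_opened": 100, "issue_closed": 80, "pull_request_open": 50,
          "pull_request_merged": 40, "pull_request_closed": 10}
repo_b = {"issue_opened": 10, "issue_closed": 5, "pull_request_open": 20,
          "pull_request_merged": 15, "pull_request_closed": 5}

# Scale every weight by the same factor
doubled = {action: 2 * weight for action, weight in WEIGHTS.items()}

gap = (math.log(weighted_activity(repo_a, WEIGHTS))
       - math.log(weighted_activity(repo_b, WEIGHTS)))
gap_doubled = (math.log(weighted_activity(repo_a, doubled))
               - math.log(weighted_activity(repo_b, doubled)))
print(gap, gap_doubled)
```

On the raw (non-log) axis the same scaling stretches the values proportionally, so the ratios between repos are preserved there as well.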

Visualization of log values for y axis 

In [None]:
fig = px.scatter(df_consolidated, 
                 x="log_num_commits", 
                 y="log_prs_issues_actions_weighted", 
                 color="repo_name",
                 size='log_num_contrib', 
                 hover_data=['repo_name'],
                 title="Project Velocity")

In [None]:
fig.show()

Visualization of raw values

In [None]:
fig = px.scatter(df_consolidated, 
                 x="log_num_commits", 
                 y="prs_issues_actions_weighted", 
                 color="repo_name",
                 size='log_num_contrib', 
                 hover_data=['repo_name'],
                 title="Project Velocity")

In [None]:
fig.show()

## Conclusion 

Developing more complex visualizations in a notebook for discussion is a practice we will continue going forward. The visualization in 8Knot uses strategy #3, with the ability to toggle between log and raw values, user-supplied weights, and a date range selector. 