# Most Vulnerable Projects - Data analysis

This notebook aims to find the most vulnerable projects based on control flow graph (CFG) metrics extracted from code changes. It processes a dataset of code changes and their DOT-formatted CFGs, and computes various metrics to identify projects with potentially risky code changes.

This notebook answers these questions:

1. From the selected projects, what are the top 10 that have the most commits on fixing bugs, issues, and vulnerabilities?

    1. What are their common characteristics? For example, are they all related to machine learning, to mobile development, etc.?

    1. Is there anything that all the top projects don't have in common?

    1. Do any of those projects have incivil commits? By incivil commits, we mean commit messages containing offensive, rude, or hostile language (e.g., as detected by a toxicity classifier or manual review).

    1. Regarding the CFGs of those projects, is the depth of the CFG correlated in any way to the existence of bugs, vulnerabilities, and issues?


In [28]:
# preamble
import pandas as pd
import os
import re
from IPython.display import display, SVG
import warnings
warnings.filterwarnings('ignore', category=pd.errors.ParserWarning)

In [29]:
%load_ext autoreload
%autoreload 2

import sys
import os
# Ensure the `utils` module is on the Python path
sys.path.insert(0, os.path.abspath('utils'))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [30]:
data_dir = "../assets/data-samples/"
csv_files = [f for f in os.listdir(data_dir) if f.endswith('.csv')]
print(f"Found {len(csv_files)} CSV files:")
for f in sorted(csv_files):
    size_mb = os.path.getsize(os.path.join(data_dir, f)) / (1024**2)
    print(f"  - {f} ({size_mb:.1f} MB)")

Found 3 CSV files:
  - job-113755.csv (1101.0 MB)
  - job-113756.csv (95.6 MB)
  - job-113757.csv (1.2 MB)


In [31]:
# Load a single CSV file into a DataFrame. Note: os.listdir() returns files in arbitrary
# order, so csv_files[-1] is not guaranteed to be the same file on every machine.
if len(csv_files) == 0:
    raise FileNotFoundError(f"No CSV files found in directory: {data_dir}")
sample_file = os.path.join(data_dir, csv_files[-1])
# Explicitly set the text columns to string dtype so pandas does not infer
# mixed types across chunks (which would otherwise trigger a DtypeWarning)
df = pd.read_csv(sample_file, dtype={
  'project_name': 'string',
  'project_description': 'string',
  'project_url': 'string',
  'project_creation_date': 'string',
  'project_database': 'string',
  'project_languages': 'string',
  'project_oss': 'string',
  'project_topics': 'string',
  'commit_url': 'string',
  'file_path': 'string',
  'method_name': 'string',
  'cfg_dot': 'string',
  'cfg_state': 'string'
})

initial_len = len(df)
print(f"Loaded {initial_len} CFG entries from {os.path.basename(sample_file)}")
print(df.info())
print(df.head())

Loaded 602500 CFG entries from job-113755.csv
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 602500 entries, 0 to 602499
Data columns (total 16 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   project_name           602500 non-null  string 
 1   project_description    356329 non-null  string 
 2   project_url            602500 non-null  string 
 3   project_creation_date  602500 non-null  string 
 4   project_database       0 non-null       string 
 5   project_interfaces     0 non-null       float64
 6   project_oss            0 non-null       string 
 7   project_languages      602500 non-null  string 
 8   project_topics         0 non-null       string 
 9   commit_url             602500 non-null  string 
 10  files_changed_count    602500 non-null  int64  
 11  commit_message         602500 non-null  object 
 12  file_path              602500 non-null  string 
 13  method_name            602500 non-null  str

In [32]:
from utils.is_column_empty import is_effectively_empty

cols_to_drop = [col for col in df.columns if is_effectively_empty(df[col])]

if cols_to_drop:
    print(f"Removing effectively empty columns: {cols_to_drop}")
    df = df.drop(columns=cols_to_drop)
    
print(f"DataFrame now has {df.shape[1]} columns after removing empty ones.")
# Print the data types of the remaining columns
print(df.dtypes)

Removing effectively empty columns: ['project_database', 'project_interfaces', 'project_oss', 'project_topics']
DataFrame now has 12 columns after removing empty ones.
project_name             string[python]
project_description      string[python]
project_url              string[python]
project_creation_date    string[python]
project_languages        string[python]
commit_url               string[python]
files_changed_count               int64
commit_message                   object
file_path                string[python]
method_name              string[python]
cfg_dot                  string[python]
cfg_state                string[python]
dtype: object


In [33]:
# --- make the project language column more usable ---
from utils.parse_language_series import split_semicolon_series
df['project_languages'] = split_semicolon_series(df['project_languages'])

print(f"Unique programming languages found: {df['project_languages'].explode().nunique()}")
print(df.info())

Unique programming languages found: 20
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 602500 entries, 0 to 602499
Data columns (total 12 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   project_name           602500 non-null  string
 1   project_description    356329 non-null  string
 2   project_url            602500 non-null  string
 3   project_creation_date  602500 non-null  string
 4   project_languages      602500 non-null  object
 5   commit_url             602500 non-null  string
 6   files_changed_count    602500 non-null  int64 
 7   commit_message         602500 non-null  object
 8   file_path              602500 non-null  string
 9   method_name            602500 non-null  string
 10  cfg_dot                584612 non-null  string
 11  cfg_state              602500 non-null  string
dtypes: int64(1), object(2), string(9)
memory usage: 55.2+ MB
None


## Top 10 projects with most commits fixing bugs, issues, and vulnerabilities

**Question:** From the selected projects, what are the top 10 that have the most commits on fixing bugs, issues, and vulnerabilities?

To answer this question, we will take this approach:

- Remove the rows with `PRE` in the `cfg_state` column, as they do not represent actual code changes.
- Group the DataFrame by `project_name` and count the number of unique commits (distinct `commit_url` values) for each project.
- Sort the projects by the number of commits in descending order and select the top 10.

In [34]:
from IPython.display import display

# Remove "PRE" rows (non-change states) and count unique commits per project
df_changes = df[~df['cfg_state'].str.upper().eq('PRE')].copy()
removed_count = len(df) - len(df_changes)
print(f"Removed {removed_count} PRE rows -> {len(df_changes)} rows remaining (of {len(df)})")

# Unique commit count per project (a commit may touch multiple files/methods; count unique commit_url)
commit_counts = df_changes.groupby('project_name')['commit_url'].nunique().reset_index(name='num_commits')

# Additional useful aggregations for context:
agg = df_changes.groupby('project_name').agg(
  project_description=('project_description', lambda s: s.dropna().iloc[0] if s.notna().any() else pd.NA),
  project_languages=('project_languages', 'first'),
  project_url=('project_url', 'first'),
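  # NOTE: files_changed_count appears to be a commit-level value repeated on every method row,
  # so this sum is inflated for commits that touch many methods
  total_files_changed=('files_changed_count', 'sum'),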
  unique_files_changed=('file_path', 'nunique'),
  unique_methods_changed=('method_name', 'nunique')
).reset_index()

# Merge commit counts with aggregated metadata and sort
project_stats = commit_counts.merge(agg, on='project_name')
project_stats = project_stats.sort_values('num_commits', ascending=False)

# Top 10 projects by number of unique commits fixing issues/bugs/vulns
top_n = 10
top_projects = project_stats.head(top_n).reset_index(drop=True)

# Add percent of total unique commits to give context
total_unique_commits = commit_counts['num_commits'].sum()
top_projects['pct_of_total_commits'] = (top_projects['num_commits'] / total_unique_commits * 100).round(2)

display(top_projects)

Removed 76017 PRE rows -> 526483 rows remaining (of 602500)


| | project_name | num_commits | project_description | project_languages | project_url | total_files_changed | unique_files_changed | unique_methods_changed | pct_of_total_commits |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ceylon/ceylon-compiler | 2126 | Ceylon compiler (ceylonc: Java backend), Ceylo... | [Perl, Ruby, Shell, C, Java, CSS, C++, JavaScr... | https://github.com/ceylon/ceylon-compiler | 67186108 | 2222 | 10292 | 36.75 |
| 1 | aokpx-private/platform_packages_apps_Calendar | 1040 | | [Java] | https://github.com/aokpx-private/platform_pack... | 8629626 | 141 | 1116 | 17.98 |
| 2 | dana-i2cat/opennaas-routing-nfv | 679 | | [Java, CSS, JavaScript, Shell] | https://github.com/dana-i2cat/opennaas-routing... | 63677127 | 5939 | 5540 | 11.74 |
| 3 | rfkrocktk/red5-server | 630 | RTMP, VOD, SharedObjects, live video broadcast... | [Java, Shell, JavaScript] | https://github.com/rfkrocktk/red5-server | 746845 | 556 | 2425 | 10.89 |
| 4 | eclipse/webtools.jsf | 595 | Eclipse Web Tools Platform Project project rep... | [Java, CSS] | https://github.com/eclipse/webtools.jsf | 4414942 | 2481 | 7629 | 10.29 |
| 5 | ebayopensource/turmeric-runtime | 66 | Turmeric SOA - Runtime framework. | [Java, Shell, Perl] | https://github.com/ebayopensource/turmeric-run... | 364959 | 291 | 1760 | 1.14 |
| 6 | mibto/mez | 58 | Zeiterfassung Metzler | [Java] | https://github.com/mibto/mez | 10702 | 80 | 397 | 1.0 |
| 7 | Ourobor/petulant-batman | 49 | Group Project for an SE Class | [Java] | https://github.com/Ourobor/petulant-batman | 1429 | 31 | 132 | 0.85 |
| 8 | venukumar/bartsy-venue-android | 47 | | [Java] | https://github.com/venukumar/bartsy-venue-android | 18323 | 22 | 288 | 0.81 |
| 9 | ovitas/compass2 | 30 | Compass 2 for Sesam 4 | [Java, JavaScript] | https://github.com/ovitas/compass2 | 434 | 27 | 142 | 0.52 |


> [!WARNING]
> **Potential Improvements:**
> The `pct_of_total_commits` column needs more context to be meaningful. In the table above, [ceylon/ceylon-compiler](https://github.com/ceylon/ceylon-compiler) accounts for **36.75%** of the unique commits fixing bugs, issues, and vulnerabilities in this dataset. However, without the total number of commits in each project, this percentage says little about a project's overall activity or about how large a share of its development effort goes into fixes.
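
One way to add that context is to divide the fix commits by each project's total commit count. The sketch below only illustrates the arithmetic: the project names, the counts, and the `total_commits` column are hypothetical, since the dataset used here does not include total commit counts.

```python
import pandas as pd

# HYPOTHETICAL numbers for illustration only; the dataset has no total commit counts
activity = pd.DataFrame({
    "project_name": ["project_a", "project_b"],
    "num_fix_commits": [200, 50],
    "total_commits": [4000, 100],
})

# Share of each project's own history that consists of fix commits
activity["fix_commit_ratio"] = activity["num_fix_commits"] / activity["total_commits"]
print(activity.sort_values("fix_commit_ratio", ascending=False))
```

In this made-up example, `project_a` has four times as many fix commits as `project_b`, yet fixes make up a much smaller share of its history, which is the distinction the raw percentage hides.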

## Characteristics of Top 10 Projects

**Question:** What are their common characteristics? For example, are they all related to machine learning, to mobile development, etc.?

**Question:** Is there anything that all the top projects don't have in common?

To analyze the common characteristics of the top 10 projects with the most commits fixing bugs, issues, and vulnerabilities, we can examine various attributes such as programming languages used, project domains (from the `project_description`) and other metadata available in the dataset.
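
As a starting point for the language attribute, the following sketch counts how many projects use each language and which languages every project shares. The helper name `language_overlap` is an assumption made for illustration; the sample rows are three entries copied from the top-10 table above whose language lists are not truncated, and the function expects `project_languages` to hold Python lists, as it does after the parsing step earlier in the notebook.

```python
import pandas as pd

def language_overlap(projects: pd.DataFrame) -> tuple[pd.Series, set]:
    """Return (number of projects per language, languages shared by every project)."""
    # One row per (project, language) pair; drop_duplicates guards against repeated entries
    exploded = projects.explode("project_languages").drop_duplicates()
    per_language = exploded["project_languages"].value_counts()
    # Intersection of the per-project language sets (requires at least one project)
    shared = set.intersection(*(set(langs) for langs in projects["project_languages"]))
    return per_language, shared

# Three rows taken from the top-10 table (language lists shown in full there)
sample = pd.DataFrame({
    "project_name": ["rfkrocktk/red5-server", "mibto/mez", "ovitas/compass2"],
    "project_languages": [["Java", "Shell", "JavaScript"], ["Java"], ["Java", "JavaScript"]],
})

per_language, shared = language_overlap(sample)
print(per_language)
print(f"Languages used by every project: {shared}")
```

Applied to the full `top_projects` frame, the same function would show which languages all ten projects share and which appear in only one of them, addressing both questions for the language dimension. Project domains would still need a separate look at `project_description`.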

## Incivil Commits in Top Projects

**Question:** Do any of those projects have incivil commits? By incivil commits, we mean commit messages containing offensive, rude, or hostile language (e.g., as detected by a toxicity classifier or manual review).

To determine whether any of the top projects have incivil commits, we analyze the commit message of each unique commit in the dataset. I used a pre-trained toxicity classifier, the `unitary/toxic-bert` model from Hugging Face Transformers, to identify potentially incivil language in commit messages.

First, we load the toxicity classifier and define a function to classify commit messages.

The settings that I used for the toxicity classifier are as follows:

- **Model:** `unitary/toxic-bert`
- **Threshold:** 0.7 (A commit message is considered incivil if the toxicity score exceeds this threshold)
- **Device:** MPS (Apple Silicon) if available, otherwise CPU

In [35]:
from transformers import pipeline
import torch

# Pre-check MPS availability (Apple Silicon); fall back to CPU (-1) otherwise
device = 0 if torch.backends.mps.is_available() else -1
print(f"Using device: {'MPS' if device == 0 else 'CPU'}")

classifier = pipeline(
    "text-classification",
    model="unitary/toxic-bert",
    device=device,
    top_k=None,
    truncation=True,          # ← critical: avoids errors on long commit messages
    batch_size=8,             # ← only takes effect when a list of texts is passed; apply() below sends one text per call
)

def is_toxic(text, threshold=0.7):
    try:
        if not isinstance(text, str) or not text.strip():
            return False
        # classifier returns list of list of dicts: [[{label, score}, ...], ...]
        preds = classifier(text)[0]  # list of label-score dicts for one input
        toxic_score = next((r['score'] for r in preds if r['label'] == 'toxic'), 0.0)
        return toxic_score > threshold
    except Exception as e:
        # Optional: log problematic entries (e.g., encoding issues)
        # print(f"Skipped text (len={len(text)}): {e}")
        return False  # conservative: treat failures as non-toxic
    
top_project_names = set(top_projects['project_name'])
df_top_projects = df[df['project_name'].isin(top_project_names)].copy()
df_top10_unique_commits = df_top_projects.drop_duplicates(subset=['commit_url']).copy()

print(f"Filtered to {len(df_top_projects)} rows ({len(df_top10_unique_commits)} unique commits) from top {len(top_project_names)} projects")

# Now apply toxicity classifier just once per unique commit (more efficient + avoids redundancy)
from tqdm.auto import tqdm
tqdm.pandas()
df_top10_unique_commits['is_toxic'] = df_top10_unique_commits['commit_message'].progress_apply(is_toxic)

# print top five toxic commits for inspection
display(df_top10_unique_commits[df_top10_unique_commits['is_toxic']].head(5)[['project_name', 'commit_url', 'commit_message']])

toxic_counts = df_top10_unique_commits.groupby('project_name')['is_toxic'].agg(
    toxic_commit_count='sum',
    total_commits='size'  # same as nunique(commit_url) here, since deduped
).reset_index()

top_projects_enhanced = top_projects.merge(toxic_counts, on='project_name')
top_projects_enhanced['toxic_ratio'] = (
    top_projects_enhanced['toxic_commit_count'] / top_projects_enhanced['num_commits']
).round(4)


display(top_projects_enhanced[['project_name', 'num_commits', 'toxic_commit_count', 'toxic_ratio']])

Using device: MPS


Device set to use mps:0


Filtered to 570691 rows (5320 unique commits) from top 10 projects


| | project_name | commit_url | commit_message |
|---|---|---|---|
| 85286 | ceylon/ceylon-compiler | https://github.com/ceylon/ceylon-compiler/comm... | fixed stupid typos |


| | project_name | num_commits | toxic_commit_count | toxic_ratio |
|---|---|---|---|---|
| 0 | ceylon/ceylon-compiler | 2126 | 1 | 0.0005 |
| 1 | aokpx-private/platform_packages_apps_Calendar | 1040 | 0 | 0.0 |
| 2 | dana-i2cat/opennaas-routing-nfv | 679 | 0 | 0.0 |
| 3 | rfkrocktk/red5-server | 630 | 0 | 0.0 |
| 4 | eclipse/webtools.jsf | 595 | 0 | 0.0 |
| 5 | ebayopensource/turmeric-runtime | 66 | 0 | 0.0 |
| 6 | mibto/mez | 58 | 0 | 0.0 |
| 7 | Ourobor/petulant-batman | 49 | 0 | 0.0 |
| 8 | venukumar/bartsy-venue-android | 47 | 0 | 0.0 |
| 9 | ovitas/compass2 | 30 | 0 | 0.0 |


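> [!NOTE]
> At a threshold of 0.7, the `unitary/toxic-bert` model flagged only one of the 5,320 unique commits in the top 10 projects: "fixed stupid typos" in ceylon/ceylon-compiler. The flagged wording targets typos rather than a person, so the analysis found essentially no incivil commits in these projects.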

> [!TIP]
> Potential Improvements:
> - Consider using multiple toxicity detection models to cross-validate results and improve accuracy.
> - Manually review a sample of commit messages flagged as incivil to ensure the model's accuracy and relevance to the context of software development. 
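
The first suggestion could be implemented as a simple vote across models. The sketch below is one possible rule, not something this notebook runs: the function name `flagged_by_agreement`, the `min_votes` parameter, and the second model `model_b` with its score are all hypothetical placeholders.

```python
def flagged_by_agreement(scores_by_model: dict[str, float],
                         threshold: float = 0.7,
                         min_votes: int = 2) -> bool:
    """Flag a message only if at least `min_votes` models score it above `threshold`."""
    votes = sum(score > threshold for score in scores_by_model.values())
    return votes >= min_votes

# Hypothetical scores for one commit message; "model_b" stands in for any second classifier
scores = {"unitary/toxic-bert": 0.82, "model_b": 0.35}
print(flagged_by_agreement(scores))               # False: only one model exceeds 0.7
print(flagged_by_agreement(scores, min_votes=1))  # True: a single vote is enough
```

Requiring agreement lowers the number of false positives at the cost of missing messages that only one model catches, so the messages on which the models disagree are good candidates for the manual review mentioned in the second suggestion.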

## Correlation Between CFG Depth and Bugs/Vulnerabilities

**Question:** Regarding the CFGs of those projects, is the depth of the CFG correlated in any way to the existence of bugs, vulnerabilities, and issues?
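
Answering this requires a depth value for each CFG first. The notebook does not define "depth", so the sketch below assumes one possible definition: the largest number of edges on a shortest path from an entry node (a node with no incoming edges) to any reachable node. It also assumes that edges in `cfg_dot` are written as simple `a -> b` statements; chained edges such as `a -> b -> c` and other DOT features are not handled. The function name `cfg_depth` and the example graph are illustrative.

```python
import re
from collections import defaultdict, deque

# Assumption: each edge is written as `src -> dst`, optionally with quoted node names
EDGE_RE = re.compile(r'"?([\w.]+)"?\s*->\s*"?([\w.]+)"?')

def cfg_depth(dot_source: str) -> int:
    """Largest shortest-path distance (in edges) from the entry node(s) of a DOT graph."""
    successors = defaultdict(set)
    nodes, has_incoming = set(), set()
    for src, dst in EDGE_RE.findall(dot_source):
        successors[src].add(dst)
        nodes.update((src, dst))
        has_incoming.add(dst)
    if not nodes:
        return 0  # no edges found (empty or single-node graph)

    # Entry nodes have no incoming edge; if every node has one, pick a deterministic fallback
    entries = (nodes - has_incoming) or {min(nodes)}

    # Breadth-first search; the visited check makes loops in the CFG safe
    depth = {node: 0 for node in entries}
    queue = deque(entries)
    while queue:
        node = queue.popleft()
        for nxt in successors[node]:
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return max(depth.values())

example = "digraph cfg { n0 -> n1; n1 -> n2; n1 -> n3; n3 -> n1; n2 -> n4; }"
print(cfg_depth(example))  # 3 (n0 -> n1 -> n2 -> n4)
```

With such a function, depth could be computed per row via `df['cfg_dot'].dropna().map(cfg_depth)` and then aggregated per project for comparison with `num_commits`. Because the dataset only contains bug-fixing commits, any correlation found this way describes fix frequency rather than the presence of bugs in general.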

