This notebook is used to find the best model for the dataset using auto_sklearn.

Tested in a clean environment with Python 3.9.16

These are the packages we used:

In [80]:
# To perform data manipulation
%pip install pandas==1.5.2

# To monitor progress in long tasks
%pip install tqdm==4.64.1

# Pyarrow is used by pandas to save/load parquet files
%pip install pyarrow==10.0.1

# To process natural text
%pip install nltk==3.8.1

# Machine learning algorithms
%pip install scikit-learn==1.2.0
%pip install xgboost==1.7.3
%pip install lightgbm==3.3.4
%pip install catboost==1.1.1

# To visualize confusion matrix and plot other data
%pip install seaborn==0.12.2
%pip install matplotlib==3.6.2


Now let's import the necessary modules and configure a random_state for reproducibility

In [81]:
from time import time # To measure training times

import pandas as pd # For data processing
import numpy as np # For data processing

import nltk # For text processing
nltk.download('stopwords') # Stop words
nltk.download('punkt') # Tokenizer
english_stop_words = set(nltk.corpus.stopwords.words('english')) # Keep the stop words in a set for fast lookups

from tqdm import tqdm # To monitor progress in large tasks
tqdm.pandas() # Configure tqdm to work with pandas

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer # To make sklearn work with text features
from sklearn.model_selection import train_test_split # To split train/test data
from sklearn.utils import resample # To balance the dataset

from sklearn.metrics import classification_report # To determine the quality of our models
from sklearn.metrics import confusion_matrix  # To determine the quality of our models
import seaborn as sns # To see the confusion matrix
import matplotlib.pyplot as plt # To see the confusion matrix

# We will try all the following models
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier,
                              AdaBoostClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import PassiveAggressiveClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

random_state_value=42

[nltk_data] Downloading package stopwords to /home/daniel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/daniel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Let's load our dataset, making sure there are no duplicated rows

In [82]:
df_commits_annotated = pd.read_parquet('commits_annotated.parquet.gzip')
print(f'Number of rows: {df_commits_annotated.shape[0]}')
print(f'Number of duplicated: {df_commits_annotated.revision_hash.duplicated().sum()}')
df_commits_annotated.head()

Number of rows: 452284
Number of duplicated: 0


Unnamed: 0,index,_id,revision_hash,message,isBugfix
0,0,5bef24dc83ad2f56ca1ca7cd,64185d21b51d1ed32c419f443f804ae2e09ee5f8,The the patrol.,False
1,1,5bef24dd83ad2f56ca1ca7d3,37ab026c576c8841f378cc2376ca02c478567e84,Cleanup in commons-rdf-rdf4j to close PMD and ...,False
2,2,5bef24dd83ad2f56cb1ca7cd,092c465f996f3a4a6acf3f65aeb76f768e702289,COMMONSRDF-49: Make AbstractRDFParser serializ...,False
3,3,5bef24dd83ad2f56cc1ca7cd,6123c7308ed533b870370c6a234ce140368ccc4e,ensure site build works with newest checkstyle...,False
4,4,5bef24dd83ad2f56cc1ca7d4,25af8121afb778efa20d73e1f10c073ad1917e59,[maven-release-plugin] prepare for next develo...,False


We only need the 'message' and 'isBugfix' columns, so let's drop the identifier columns '_id' and 'revision_hash'

In [83]:
df_commits_annotated.drop(columns=['_id','revision_hash'], inplace=True)
df_commits_annotated.head()

Unnamed: 0,index,message,isBugfix
0,0,The the patrol.,False
1,1,Cleanup in commons-rdf-rdf4j to close PMD and ...,False
2,2,COMMONSRDF-49: Make AbstractRDFParser serializ...,False
3,3,ensure site build works with newest checkstyle...,False
4,4,[maven-release-plugin] prepare for next develo...,False


Now let's balance the dataset. As the class counts below show, it is highly imbalanced.

In [84]:
df_commits_count = df_commits_annotated['isBugfix'].value_counts()
df_commits_count

False    449379
True       2905
Name: isBugfix, dtype: int64

Let's downsample the majority class to the size of the minority class. This is acceptable because we still have almost 3K "True" values.

In [85]:
df_majority = df_commits_annotated[df_commits_annotated.isBugfix==False]
df_minority = df_commits_annotated[df_commits_annotated.isBugfix==True]

count_minority = len(df_minority) # Number of elements in minority class

df_majority_downsampled = resample(df_majority,
                                    replace=False,
                                    n_samples=count_minority,
                                    random_state=random_state_value)

df_commits_annotated = pd.concat([df_majority_downsampled, df_minority])

df_commits_annotated['isBugfix'].value_counts()

False    2905
True     2905
Name: isBugfix, dtype: int64

Now let's preprocess our text. First, we will lowercase and remove stopwords. This will be our 'input_feature' column

In [86]:
def lower_case_and_remove_stopwords(text):
    """Lowercase the text, tokenize it and drop English stop words."""
    tokens = nltk.tokenize.word_tokenize(text.lower())
    return ' '.join(word for word in tokens if word not in english_stop_words)

df_commits_annotated['input_feature'] = df_commits_annotated['message']\
    .str.lower().progress_apply(lower_case_and_remove_stopwords)
df_commits_annotated.head()


100%|██████████| 5810/5810 [00:02<00:00, 2692.78it/s]


Unnamed: 0,index,message,isBugfix,input_feature
151859,151859,Update installation for impatient\n,False,update installation impatient
332572,332572,Added example for using the reverse mapping to...,False,added example using reverse mapping tool git-s...
196656,196656,PDFBOX-2883: remove COSDocument constructors u...,False,pdfbox-2883 : remove cosdocument constructors ...
395182,395182,Added #else formatting test.\nPR:\nObtained fr...,False,added # else formatting test . pr : obtained :...
141552,141552,KAFKA-3740: Part I: expose StreamConfig proper...,False,kafka-3740 : part : expose streamconfig proper...
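
To make the preprocessing step concrete, here is a minimal, dependency-free sketch of the same idea. It is a simplification, not the code used above: `TOY_STOP_WORDS` is a tiny hand-picked set standing in for NLTK's English list, and `str.split` stands in for `word_tokenize`, so punctuation is not split into separate tokens the way it is in the real output.

```python
# Hypothetical stand-in for nltk.corpus.stopwords.words('english')
TOY_STOP_WORDS = {"for", "the", "to", "a", "of"}


def simple_preprocess(text):
    """Lowercase, split on whitespace and drop stop words."""
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)


print(simple_preprocess("Update installation for impatient\n"))
# update installation impatient
```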


Now our 'isBugfix' column needs to be converted from boolean to int: True=1, False=0. This will be our 'target' column

In [87]:
df_commits_annotated['target'] = df_commits_annotated['isBugfix']\
    .progress_apply(lambda x: 1 if x else 0)
df_commits_annotated.head()

100%|██████████| 5810/5810 [00:00<00:00, 459461.26it/s]


Unnamed: 0,index,message,isBugfix,input_feature,target
151859,151859,Update installation for impatient\n,False,update installation impatient,0
332572,332572,Added example for using the reverse mapping to...,False,added example using reverse mapping tool git-s...,0
196656,196656,PDFBOX-2883: remove COSDocument constructors u...,False,pdfbox-2883 : remove cosdocument constructors ...,0
395182,395182,Added #else formatting test.\nPR:\nObtained fr...,False,added # else formatting test . pr : obtained :...,0
141552,141552,KAFKA-3740: Part I: expose StreamConfig proper...,False,kafka-3740 : part : expose streamconfig proper...,0


Now let's split the dataset into train, validation and test sets, transforming the text column into vectors along the way.

We will try two vectorizers: CountVectorizer with its default settings, and TfidfVectorizer with parameters copied from the results of [another notebook](find_vectorizer_parameters.ipynb)
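
To show how the two representations differ, here is a small pure-Python sketch. It assumes scikit-learn's default smoothed idf, `ln((1 + n) / (1 + df)) + 1`, the `1 + ln(tf)` scaling that applies when `sublinear_tf=True`, and L2 row normalization. The three toy documents are made up, and `max_df`, `min_df`, `max_features` and n-grams are left out for brevity.

```python
import math
from collections import Counter

# Hypothetical, already preprocessed commit messages
docs = ["fix null pointer", "add test", "fix test"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# Count features: one column per vocabulary term, raw occurrence counts
counts = [[Counter(doc)[t] for t in vocab] for doc in tokenized]

# Smoothed inverse document frequency (assumed scikit-learn default)
n_docs = len(tokenized)
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}


def tfidf_row(doc):
    """Sublinear tf times idf, followed by L2 normalization of the row."""
    tf = Counter(doc)
    weights = [(1 + math.log(tf[t])) * idf[t] if tf[t] else 0.0 for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


print(vocab)
print(counts[0])
print([round(w, 3) for w in tfidf_row(tokenized[0])])
```

In the first document, "null" and "pointer" get a higher TF-IDF weight than "fix" because "fix" also appears in another document, whereas the count features treat all three terms equally.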


In [88]:
y = df_commits_annotated['target'] # target
test_valid_percentage = 0.15 # This means 15% test, 15% validation, 70% training

# Count Vectorizer
vect_count = CountVectorizer()
X_count = vect_count.fit_transform(df_commits_annotated['input_feature'])
X_train_count, X_test_count, y_train_count, y_test_count = train_test_split(X_count, y, test_size=test_valid_percentage, random_state=random_state_value)
X_train_sub_count, X_valid_count, y_train_sub_count, y_valid_count = train_test_split(X_train_count, y_train_count, test_size=test_valid_percentage/(1-test_valid_percentage), random_state=random_state_value)

print('X_train_sub_count shape:', X_train_sub_count.shape)
print('X_valid_count shape:', X_valid_count.shape)
print('X_test_count shape:', X_test_count.shape)

# TF-IDF Vectorizer
vect_tfidf = TfidfVectorizer(encoding='latin-1', max_df=0.23622995216793718,
                max_features=920, min_df=2, ngram_range=(1, 3),
                sublinear_tf=True, token_pattern='\\w{1,}')

X_tfidf = vect_tfidf.fit_transform(df_commits_annotated['input_feature'])

X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(X_tfidf, y, test_size=test_valid_percentage, random_state=random_state_value)
X_train_sub_tfidf, X_valid_tfidf, y_train_sub_tfidf, y_valid_tfidf = train_test_split(X_train_tfidf, y_train_tfidf, test_size=test_valid_percentage/(1-test_valid_percentage), random_state=random_state_value)

print('X_train_sub_tfidf shape:', X_train_sub_tfidf.shape)
print('X_valid_tfidf shape:', X_valid_tfidf.shape)
print('X_test_tfidf shape:', X_test_tfidf.shape)

X_train_sub_count shape: (4066, 16830)
X_valid_count shape: (872, 16830)
X_test_count shape: (872, 16830)
X_train_sub_tfidf shape: (4066, 920)
X_valid_tfidf shape: (872, 920)
X_test_tfidf shape: (872, 920)
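
The shapes above follow from the two-step split. As a quick check, the arithmetic can be reproduced without scikit-learn. This sketch assumes that `train_test_split` rounds a fractional test size up and gives the remaining rows to the training side.

```python
import math

n_rows = 5810                  # rows after balancing (2905 per class)
test_valid_percentage = 0.15

# First split: hold out the test set
n_test = math.ceil(test_valid_percentage * n_rows)
n_train = n_rows - n_test

# Second split: the fraction is rescaled because it is taken from the remaining 85%
valid_fraction = test_valid_percentage / (1 - test_valid_percentage)
n_valid = math.ceil(valid_fraction * n_train)
n_train_sub = n_train - n_valid

print(n_train_sub, n_valid, n_test)
# 4066 872 872
```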


Now let's configure the classifiers. We start with the one found by the AutoML search in [another notebook](find_best_model_with_automl.ipynb) and add other well-known ones, as suggested by [this article](https://towardsdatascience.com/boosting-showdown-scikit-learn-vs-xgboost-vs-lightgbm-vs-catboost-in-sentiment-classification-f7c7f46fd956).

In [89]:
dt = DecisionTreeClassifier(max_depth=12, random_state=random_state_value)
rf = RandomForestClassifier(n_estimators=500,
                            max_features=0.06,
                            n_jobs=6,
                            random_state=random_state_value)
base_estim = DecisionTreeClassifier(max_depth=1, max_features=0.06)
ab = AdaBoostClassifier(estimator=base_estim,  # 'base_estimator' was renamed to 'estimator' in scikit-learn 1.2
                        n_estimators=500,
                        learning_rate=0.5,
                        random_state=random_state_value)
gb = GradientBoostingClassifier(n_estimators=2000,
                                 subsample=0.67,
                                 max_features=0.06,
                                 validation_fraction=0.1,
                                 n_iter_no_change=15,
                                 verbose=0,
                                 random_state=random_state_value)
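xgb = XGBClassifier(n_estimators=2000,
                    tree_method='hist',
                    subsample=0.67,
                    colsample_level=0.06,  # Not recognized by XGBoost and ignored; the per-level column sampling parameter is colsample_bylevel
                    verbose=0,  # Also ignored; XGBoost's logging parameter is verbosity
                    n_jobs=6,
                    random_state=random_state_value)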
cb = CatBoostClassifier(n_estimators=2000,
                        colsample_bylevel=0.06,
                        max_leaves=31,
                        subsample=0.67,
                        verbose=0,
                        thread_count=6,
                        random_state=random_state_value)
svc = LinearSVC()


Let's create a list to store all models and their names, to allow an easy comparison. We will also record which ones take early-stopping arguments in the `fit()` method

In [90]:
# Each entry pairs a classifier with a display name and an early-stopping flag
models = [
    {
        'clf': dt,
        'name': 'DecisionTree',
        'earlyStop': False,
    },
    {
        'clf': rf,
        'name': 'RandomForest',
        'earlyStop': False,
    },
    {
        'clf': ab,
        'name': 'AdaBoost',
        'earlyStop': False,
    },
    {
        'clf': gb,
        'name': 'GradientBoosting',
        'earlyStop': False,
    },
    {
        'clf': xgb,
        'name': 'XGB',
        'earlyStop': True,
    },
    {
        'clf': cb,
        'name': 'CatBoost',
        'earlyStop': True,
    },
    {
        'clf': svc,
        'name': 'LinearSVC',
        'earlyStop': False,
    },
]
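
The `earlyStop` flag marks models that monitor a validation set while training and stop once the validation score stops improving. The idea can be sketched in plain Python. `best_iteration` and the loss values below are hypothetical and only illustrate the patience rule; the libraries' own implementations differ in detail.

```python
def best_iteration(valid_losses, patience=15):
    """Return the index of the best validation loss, giving up once
    `patience` rounds have passed without an improvement."""
    best_idx = 0
    for i, loss in enumerate(valid_losses):
        if loss < valid_losses[best_idx]:
            best_idx = i          # new best round
        elif i - best_idx >= patience:
            break                 # no improvement for `patience` rounds
    return best_idx


# Hypothetical validation losses: improve for five rounds, then plateau
losses = [1.0, 0.8, 0.6, 0.5, 0.45] + [0.46] * 40
print(best_iteration(losses))
# 4
```

With a patience of 15, only 20 of the 45 rounds are evaluated here, which is why early stopping can shorten training when the number of estimators is set high.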

Now we can train each model. We loop through all of them, recording the training time, the accuracy and the classification report, and then collect the results in a DataFrame. We do this for both vectorized datasets: first CountVectorizer, then TfidfVectorizer

On my machine (an Intel Core i7-3770 CPU @ 3.40GHz with no GPU and 12 GB of RAM) this takes around 2 minutes.

In [92]:
results = []

# One entry per vectorizer: (name, full train set, train subset, validation set, test set)
datasets = [
    ('CountVectorizer',
     (X_train_count, y_train_count),
     (X_train_sub_count, y_train_sub_count),
     (X_valid_count, y_valid_count),
     (X_test_count, y_test_count)),
    ('TfIdfVectorizer',
     (X_train_tfidf, y_train_tfidf),
     (X_train_sub_tfidf, y_train_sub_tfidf),
     (X_valid_tfidf, y_valid_tfidf),
     (X_test_tfidf, y_test_tfidf)),
]

# First with CountVectorizer, then with TfidfVectorizer
for vect_name, (X_tr, y_tr), (X_sub, y_sub), (X_val, y_val), (X_te, y_te) in datasets:
    for m in models:
        clf = m['clf']
        start_time = time()
        if m['earlyStop']:
            # Early stopping needs a held-out validation set,
            # so these models are fit on the training subset only
            clf.fit(X_sub,
                    y_sub,
                    eval_set=[(X_val, y_val)],
                    early_stopping_rounds=15,
                    verbose=0)
        else:
            clf.fit(X_tr, y_tr)
        run_time = time() - start_time

        # Predict once and reuse the predictions for both metrics
        y_pred = clf.predict(X_te)
        accuracy = np.mean(y_pred == y_te)
        report = classification_report(y_te, y_pred)

        results.append({
            'model': m['name'],
            'vectorizer': vect_name,
            'accuracy': accuracy,
            'runtime': run_time,  # Elapsed training time in seconds
            'report': report
        })

dfResults = pd.DataFrame(results, columns=['model','vectorizer','accuracy','runtime','report'])



Parameters: { "colsample_level", "verbose" } are not used.

Let's print detailed classification scores
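
As a reminder of what the report columns mean, here is a minimal sketch that derives precision, recall, F1 and accuracy for the positive class from confusion-matrix counts. The helper `binary_report` and the counts passed to it are hypothetical and are not taken from the results of this notebook.

```python
def binary_report(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion counts."""
    precision = tp / (tp + fp)            # share of predicted positives that are correct
    recall = tp / (tp + fn)               # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Hypothetical confusion counts
precision, recall, f1, accuracy = binary_report(tp=90, fp=10, fn=5, tn=95)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.3f}")
# precision=0.90 recall=0.95 f1=0.92 accuracy=0.925
```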

In [99]:
dfResults = dfResults.sort_values(by=['accuracy'], ascending=False)
for i,r in dfResults.iterrows():
    print('=================================')
    print(r['model'],' ',r['vectorizer'])
    print('=================================')

    print(r['report'])


XGB   CountVectorizer
              precision    recall  f1-score   support

           0       0.99      0.95      0.97       455
           1       0.95      0.99      0.97       417

    accuracy                           0.97       872
   macro avg       0.97      0.97      0.97       872
weighted avg       0.97      0.97      0.97       872

LinearSVC   CountVectorizer
              precision    recall  f1-score   support

           0       0.98      0.96      0.97       455
           1       0.96      0.98      0.97       417

    accuracy                           0.97       872
   macro avg       0.97      0.97      0.97       872
weighted avg       0.97      0.97      0.97       872

GradientBoosting   CountVectorizer
              precision    recall  f1-score   support

           0       0.98      0.94      0.96       455
           1       0.94      0.98      0.96       417

    accuracy                           0.96       872
   macro avg       0.96      0.96      0.96