# Another (Stricter) Pipeline

We've already built the most basic pipeline: direct prediction.

It gave some interesting results that effectively invert those of previous work: high recall but terrible precision. So we now know what we need to do.

We need to improve precision by reducing false positives (FPs), which means discriminating more.

To do so, we're going to create a pipeline that starts by generating its own inclusion criteria (inclusion only, for now). We'll then only accept a citation for inclusion if all criteria are met.
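The all-criteria rule can be sketched in plain Python, independent of the model. This is only a minimal illustration, assuming the per-criterion verdicts arrive as a dict mapping criterion name to 0 or 1; `include_if_all_met` is a hypothetical helper name, not something from the notebook's local modules.

```python
def include_if_all_met(satisfied: dict[str, int]) -> bool:
    """Return True only when every criterion is marked as satisfied.

    An empty dict is treated as "not included": with no criteria there is
    nothing to validate against (an assumption made for this sketch).
    """
    return bool(satisfied) and all(bool(v) for v in satisfied.values())


# A single unmet criterion is enough to reject the citation.
print(include_if_all_met({"Population": 1, "Outcome": 1}))  # True
print(include_if_all_met({"Population": 1, "Outcome": 0}))  # False
```

The strictness is the point: the rule trades recall for precision, which is exactly the direction we want to move in.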

Let's see what happens!

In [1]:
# third-party imports
import dspy

# local imports
from datasets import *
from signatures import Relevance
from metrics import *



In [2]:
# configuring our local gemma3 model
lm = dspy.LM('ollama_chat/gemma3', api_base='http://localhost:11434', api_key='')
dspy.configure(lm=lm)
# testing out the LM
lm("Say 'Hello world!'", temperature=0.7) 

['Hello world!\n']

In [3]:
# define a new signature for inclusion criteria generation
class InclusionCriteria(dspy.Signature):
    """Output a set of inclusion criteria for the Systematic Review."""

    systematic_review_title: str = dspy.InputField()
    inclusion_criteria: dict[str, str] = dspy.OutputField(desc="Inclusion criteria and their descriptions.")

In [4]:
# getting our SYNERGY data that will constitute our devset
non_biomeds_df = get_devset()

In [5]:
# the simplest pipelines
just_predict = dspy.Predict(InclusionCriteria)
just_cof = dspy.ChainOfThought(InclusionCriteria)

In [6]:
test_title = non_biomeds_df['SR_title'][0]
test_title

'A Systematic Literature Review on Fault Prediction Performance in Software Engineering.'

In [7]:
just_predict(systematic_review_title=test_title)

Prediction(
    inclusion_criteria={'Study Design': 'The study must be a systematic review or meta-analysis.', 'Population': 'Studies must focus on software development teams or individuals involved in software fault prediction.', 'Intervention': 'The intervention is the fault prediction method or technique being evaluated.', 'Outcome': 'Studies must report on quantitative measures of fault prediction performance, such as precision, recall, F1-score, or accuracy.', 'Publication Type': 'The review must include peer-reviewed journal articles or conference proceedings.'}
)

In [8]:
just_cof(systematic_review_title=test_title)

Prediction(
    reasoning="This systematic review aims to comprehensively assess the performance of various fault prediction methods within the software engineering domain. It's crucial to evaluate the effectiveness of these methods across different programming languages, development methodologies, and datasets to provide actionable insights for practitioners and researchers. The inclusion criteria will focus on studies that provide quantifiable metrics for evaluating fault prediction performance, allowing for a robust comparison of different approaches.",
    inclusion_criteria={'Programming Language': 'Studies must evaluate fault prediction models applied to at least one programming language (e.g., Java, Python, C++, JavaScript).', 'Dataset Type': 'Studies must utilize a defined dataset for training and testing the fault prediction models. Datasets should be publicly available whenever possible.', 'Evaluation Metrics': 'Studies must report quantitative evaluation metrics for assessin

In [45]:
class CheckCriteriaMatch(dspy.Signature):
    """Verify which criteria are satisfied by the title and abstract of a candidate citation"""

    inclusion_criteria: dict[str, str] = dspy.InputField()
    citation_title: str = dspy.InputField()
    citation_abstract: str = dspy.InputField()
    satisfied: dict[str, int] = dspy.OutputField(desc="Criteria names and whether they're satisfied.")

In [46]:
# pass the generated criteria dict, not the whole Prediction object
test_ic = just_cof(systematic_review_title=test_title).inclusion_criteria
criteria_match = dspy.ChainOfThought(CheckCriteriaMatch)
criteria_match(inclusion_criteria=test_ic,
               citation_title=non_biomeds_df['title'][0],
               citation_abstract=non_biomeds_df['abstract'][0])

Prediction(
    reasoning="The systematic review aims to evaluate fault prediction methods across various software engineering contexts, focusing on quantifiable metrics and diverse model types. This citation investigates an FPGA-based Computer Vision System for offset error computation in web printing machines. While the abstract describes a system designed to improve printing quality by addressing registration errors, it doesn't directly align with the systematic review's criteria of evaluating fault prediction *models* across programming languages, datasets, and methodologies. The focus is on a specific application (web printing) and a particular hardware implementation (FPGA), rather than a general fault prediction study. Therefore, none of the inclusion criteria are satisfied.",
    satisfied={'Programming Language': 0, 'Dataset Type': 0, 'Evaluation Metrics': 0, 'Model Types': 0, 'Development Methodology': 0, 'Study Design': 0}
)

In [19]:
# custom module for pipeline
class ClassifyByInclusion(dspy.Module):
    def __init__(self):
        super().__init__()
        # step 1: generate inclusion criteria from the SR title
        self.generate_inclusion_criteria = dspy.ChainOfThought(InclusionCriteria)
        # step 2: check a candidate citation against those criteria
        self.evaluate_criteria = dspy.ChainOfThought(CheckCriteriaMatch)
    def forward(self, sr_title: str, citation_title: str, citation_abstract: str):
        inclusion_criteria = self.generate_inclusion_criteria(
            systematic_review_title=sr_title
        ).inclusion_criteria
        return self.evaluate_criteria(inclusion_criteria=inclusion_criteria,
                                      citation_title=citation_title,
                                      citation_abstract=citation_abstract)

In [12]:
test = ClassifyByInclusion()

In [26]:
pred = test(test_title, non_biomeds_df['title'][0], non_biomeds_df['abstract'][0])

## Evaluation

We have our slightly more sophisticated pipeline in the form of a module that generates a set of inclusion criteria before trying to evaluate a prospective citation against them. 

Now it's time to evaluate it!

We just have to first recreate our devset!

In [21]:
# initialise our random under sampler
rus = RandomUnderSampler(random_state=42, sampling_strategy={0: 67, 1: 8})

In [22]:
# create a development set of inputs and labels by randomly undersampling
# each systematic review so that it contains 8 positive labels and 67 negative
# ones
Xs, Ys = [], [] 
for _, group in non_biomeds_df.groupby(by='SR_id'):
    grp_Xs, grp_Ys = rus_dataset(group, ['SR_title', 'title', 'abstract'], 'relevant', rus)
    Xs.append(grp_Xs)
    Ys.append(grp_Ys)
Xs = np.concatenate(Xs)
Ys = np.concatenate(Ys)

In [23]:
# create our development set of DSPy Example objects
devset = [dspy.Example(sr_title=x[0], citation_title=x[1], citation_abstract=x[2], relevant=y)\
          .with_inputs('sr_title', 'citation_title', 'citation_abstract')
          for x,y in zip(Xs, Ys)]

In [37]:
# std lib imports
from functools import reduce

In [51]:
# a metric that will only return True if all criteria are met
def validate_criteria_based_answer(example, 
                                   pred,
                                   trace=None) -> str | bool:
    # all criteria must be met
    pred.relevant = reduce(lambda x, y: x + y, pred.satisfied.values()) == len(pred.satisfied)
    return confusion_validate(example, pred, trace=trace)
    
# a more lenient metric that returns True if a majority of criteria are met
def validate_criteria_based_answer2(example, 
                                    pred,
                                    trace=None) -> str | bool:
    # more than half of the criteria must be met
    pred.relevant = reduce(lambda x, y: x + y, pred.satisfied.values()) > len(pred.satisfied) // 2 
    return confusion_validate(example, pred, trace=trace)

In [43]:
f1_evaluate(ClassifyByInclusion(), devset, validate_criteria_based_answer)

F1 Score: 0.194 100%|████████████████████████████████████████████████████████████| 300/300 [12:23<00:00,  2.48s/it]

Confusion Matrix: Counter({'TN': 216, 'FP': 52, 'FN': 23, 'TP': 9})
Precision: 0.148
Recall: 0.281
F1: 0.194
MCC: 0.067
Specificity: 0.827
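As a sanity check, the reported numbers can be recomputed from the confusion matrix using the textbook definitions. The sketch below is not the code in the local `metrics` module; `summarise` is a hypothetical helper written for this check. Precision, recall and F1 agree with the output above, but the textbook specificity TN / (TN + FP) comes out at about 0.806 for this matrix rather than the printed 0.827. The printed value equals 1 - FP / N (248 / 300), so the `metrics` helper seems to use a different definition, which is worth checking before comparing specificity against other work.

```python
from collections import Counter


def summarise(cm: Counter) -> dict[str, float]:
    """Textbook precision, recall, F1 and specificity from confusion counts."""
    # Counter returns 0 for missing keys, so sparse matrices are fine.
    tp, fp, tn, fn = cm["TP"], cm["FP"], cm["TN"], cm["FN"]
    nan = float("nan")
    return {
        "precision": tp / (tp + fp) if tp + fp else nan,
        "recall": tp / (tp + fn) if tp + fn else nan,
        # reported as nan when there are no true positives, as in the notebook output
        "f1": 2 * tp / (2 * tp + fp + fn) if tp else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
    }


strict = Counter({"TN": 216, "FP": 52, "FN": 23, "TP": 9})
for name, value in summarise(strict).items():
    print(f"{name}: {value:.3f}")
```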

In [52]:
f1_evaluate(ClassifyByInclusion(), devset, validate_criteria_based_answer2)

F1 Score: 0.267 100%|████████████████████████████████████████████████████████████| 300/300 [11:52<00:00,  2.38s/it]

Confusion Matrix: Counter({'FP': 169, 'TN': 99, 'TP': 31, 'FN': 1})
Precision: 0.155
Recall: 0.969
F1: 0.267
MCC: 0.221
Specificity: 0.437

## Pipeline rewrite

Turns out Delgado-Chaves had both inclusion and exclusion criteria in their set of criteria, not just one kind.

So we're going to take the opportunity to rewrite this pipeline and change the eval at the same time, so that we evaluate one SR at a time and get separate F1 scores.

In [177]:
# define a new signature for inclusion/exclusion criteria generation
class InclusionExclusionCriteria(dspy.Signature):
    """Output a set of inclusion/exclusion criteria a prospective citation must meet to be included in the systematic review."""

    systematic_review_title: str = dspy.InputField()
    criteria: dict[str, str] = dspy.OutputField(desc="Inclusion/exclusion criteria and their descriptions.")

In [178]:
class CheckCriteriaMatch(dspy.Signature):
    """Verify which criteria are satisfied by the title and abstract of a candidate citation."""

    criteria: dict[str, str] = dspy.InputField()
    citation_title: str = dspy.InputField()
    citation_abstract: str = dspy.InputField()
    satisfied: dict[str, int] = dspy.OutputField(desc="Criteria names and whether they're satisfied.")

In [183]:
# custom module for pipeline, bound to a single systematic review
class ClassifyByInclusionExclusion(dspy.Module):
    def __init__(self, sr_title: str):
        super().__init__()
        self.systematic_review_title = sr_title
        self.generate_criteria = dspy.ChainOfThought(InclusionExclusionCriteria)
        self.evaluate_criteria = dspy.ChainOfThought(CheckCriteriaMatch)
    def forward(self, citation_title: str, citation_abstract: str):
        criteria = self.generate_criteria(
            systematic_review_title=self.systematic_review_title
        ).criteria
        # debug: show the criteria generated for this call
        print(criteria)
        return self.evaluate_criteria(criteria=criteria,
                                      citation_title=citation_title,
                                      citation_abstract=citation_abstract)
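Note that `forward` asks for criteria on every citation, even though they depend only on the review title. One option is to generate them once per review and reuse them. The sketch below shows the idea with a stand-in generator; `CriteriaCache` and `fake_generate` are hypothetical names, and in the notebook the callable would wrap `self.generate_criteria`. DSPy may already cache identical LM calls, so the actual time saved is not guaranteed.

```python
from typing import Callable


class CriteriaCache:
    """Generate criteria once per systematic review title and reuse them."""

    def __init__(self, generate: Callable[[str], dict[str, str]]):
        self._generate = generate
        self._cache: dict[str, dict[str, str]] = {}

    def get(self, sr_title: str) -> dict[str, str]:
        # Only call the (expensive) generator the first time a title is seen.
        if sr_title not in self._cache:
            self._cache[sr_title] = self._generate(sr_title)
        return self._cache[sr_title]


calls: list[str] = []


def fake_generate(sr_title: str) -> dict[str, str]:
    # Stand-in for the LM call; records how often it is invoked.
    calls.append(sr_title)
    return {"topic": f"Must address: {sr_title}"}


cache = CriteriaCache(fake_generate)
for _ in range(3):
    criteria = cache.get("Fault prediction in software engineering")
print(len(calls))  # 1
```

Beyond speed, a fixed set of criteria per review also guarantees that every citation in that review is judged against the same rules.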

In [180]:
# a metric that will only return True if all criteria are met
def validate_criteria_based_answer(example, 
                                   pred,
                                   trace=None) -> str | bool:
    # all criteria must be met
    pred.relevant = all(pred.satisfied.values())
    return confusion_validate(example, pred, trace=trace)

In [184]:
devset = {}
for sr_title, group in non_biomeds_df.groupby(by='SR_title'):
    sr_id = group['SR_id'].iloc[0]
    
    Xs, Ys  = rus_dataset(group, ['title', 'abstract'], 'relevant', rus)

    devset[sr_id] = [sr_title]
    devset[sr_id].append([dspy.Example(citation_title=x[0], citation_abstract=x[1], relevant=y)\
                          .with_inputs('citation_title', 'citation_abstract')
                          for x, y in zip(Xs, Ys)])

In [185]:
for sr_id, (sr_title, examples) in devset.items():
    print(f"Systematic Review: {sr_id}")
    f1_evaluate(ClassifyByInclusionExclusion(sr_title), examples, validate_criteria_based_answer)

Systematic Review: Hall_2012


F1 Score: nan   0%|                                                                         | 0/75 [00:00<?, ?it/s]

{'study_design': 'The study must be a formal, peer-reviewed systematic review, meta-analysis, or large-scale empirical study evaluating fault prediction performance.', 'data_type': 'The study must utilize software defect data, bug reports, or code metrics as input for the fault prediction models.', 'prediction_method': 'The study must employ a quantifiable fault prediction method, including but not limited to: machine learning algorithms (e.g., regression, classification, neural networks), statistical models, or rule-based systems.', 'performance_metrics': 'The study must report relevant performance metrics for evaluating the fault prediction models, such as precision, recall, F1-score, AUC, or accuracy.', 'dataset_size': 'The study should ideally utilize datasets of sufficient size to allow for meaningful model training and evaluation.', 'reporting_quality': 'The study must provide a detailed description of the methodology, including data preprocessing steps, model parameters, evaluat

F1 Score: nan   8%|█████▏                                                           | 6/75 [00:10<02:03,  1.79s/it]


KeyboardInterrupt: 

In [96]:
gen_criteria = dspy.ChainOfThought(InclusionExclusionCriteria)

In [104]:
gen_criteria(systematic_review_title=test_title).criteria

{'study_design': 'The study must be a systematic review, meta-analysis, or a rigorous literature review that explicitly describes the methods used to identify, select, and analyze relevant studies.',
 'fault_prediction_focus': 'The study must focus on predicting software faults, defects, or bugs.',
 'software_domain': 'The study must investigate fault prediction in any software development domain (e.g., embedded systems, web applications, mobile apps, etc.).',
 'data_sources': 'The study must utilize a quantifiable dataset for training and testing the fault prediction models. The dataset should be publicly available or clearly described.',
 'model_types': 'The study must evaluate at least one fault prediction model, including but not limited to: machine learning models (e.g., logistic regression, support vector machines, neural networks), statistical models, or rule-based systems.',
 'performance_metrics': 'The study must report relevant performance metrics for evaluating the fault pre

## Separating Inclusion/Exclusion Generation

In [119]:
# define a new signature for inclusion criteria generation
class InclusionCriteria(dspy.Signature):
    """Output a set of inclusion criteria a prospective citation must meet to be included in the systematic review."""

    systematic_review_title: str = dspy.InputField()
    criteria: dict[str, str] = dspy.OutputField(desc="Inclusion criteria and their descriptions.")
    
# define a new signature for exclusion criteria generation
class ExclusionCriteria(dspy.Signature):
    """Output a set of exclusion criteria, any of which disqualifies a prospective citation from the systematic review."""

    systematic_review_title: str = dspy.InputField()
    criteria: dict[str, str] = dspy.OutputField(desc="Exclusion criteria and their descriptions.")

class CriteriaBasedRelevance(dspy.Signature):
    """Classify a prospective citation's relevance to a systematic review based on inclusion/exclusion criteria satisfiability."""

    inclusion_satisfiability: dict[str, bool] = dspy.InputField()
    exclusion_satisfiability: dict[str, bool] = dspy.InputField()

    relevant: bool = dspy.OutputField()
    confidence: float = dspy.OutputField()

In [141]:
# custom module for pipeline
class ClassifyByInclusionExclusion(dspy.Module):
    def __init__(self, sr_title: str):
        super().__init__()
        self.systematic_review_title = sr_title
        self.generate_inclusion_criteria = dspy.ChainOfThought(InclusionCriteria)
        self.generate_exclusion_criteria = dspy.ChainOfThought(ExclusionCriteria)
        self.evaluate_criteria = dspy.ChainOfThought(CheckCriteriaMatch)
    def forward(self, citation_title: str, citation_abstract: str):
        inclusion_criteria = self.generate_inclusion_criteria(
            systematic_review_title=self.systematic_review_title
        ).criteria
        exclusion_criteria = self.generate_exclusion_criteria(
            systematic_review_title=self.systematic_review_title
        ).criteria
        
        # check the citation against each set of criteria separately
        pred = dspy.Prediction()
        pred.inclusion_satisfiability = self.evaluate_criteria(criteria=inclusion_criteria,
                                                               citation_title=citation_title,
                                                               citation_abstract=citation_abstract)
        
        pred.exclusion_satisfiability = self.evaluate_criteria(criteria=exclusion_criteria,
                                                               citation_title=citation_title,
                                                               citation_abstract=citation_abstract)
        return pred

In [121]:
for sr_id, (sr_title, examples) in devset.items():
    print(f"Systematic Review: {sr_id}")
    f1_evaluate(ClassifyByInclusionExclusion(sr_title), examples, confusion_validate)

Systematic Review: Hall_2012


F1 Score: 0.200 100%|██████████████████████████████████████████████████████████████| 75/75 [12:01<00:00,  9.62s/it]


Confusion Matrix: Counter({'FP': 55, 'TN': 12, 'TP': 7, 'FN': 1})
Precision: 0.113
Recall: 0.875
F1: 0.200
MCC: 0.044
Specificity: 0.267
Systematic Review: Smid_2020


F1 Score: 0.192 100%|██████████████████████████████████████████████████████████████| 75/75 [11:06<00:00,  8.89s/it]


Confusion Matrix: Counter({'FP': 58, 'TN': 9, 'TP': 7, 'FN': 1})
Precision: 0.108
Recall: 0.875
F1: 0.192
MCC: 0.008
Specificity: 0.227
Systematic Review: Radjenovic_2013


F1 Score: 0.235 100%|██████████████████████████████████████████████████████████████| 75/75 [11:31<00:00,  9.22s/it]


Confusion Matrix: Counter({'FP': 52, 'TN': 15, 'TP': 8})
Precision: 0.133
Recall: 1.000
F1: 0.235
MCC: 0.173
Specificity: 0.307
Systematic Review: Sep_2021


F1 Score: 0.193 100%|██████████████████████████████████████████████████████████████| 75/75 [11:24<00:00,  9.12s/it]

Confusion Matrix: Counter({'FP': 67, 'TP': 8})
Precision: 0.107
Recall: 1.000
F1: 0.193
MCC: nan
Specificity: 0.107
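The `nan` MCC for Sep_2021 follows from the definition: every citation was predicted relevant, so TN + FN = 0 and the denominator vanishes. A small sketch using the standard MCC formula reproduces both that and the 0.173 reported for Radjenovic_2013. The `mcc` helper here is hypothetical and may differ from whatever the local `metrics` module does.

```python
import math
from collections import Counter


def mcc(cm: Counter) -> float:
    """Matthews correlation coefficient; nan when the denominator is zero."""
    tp, fp, tn, fn = cm["TP"], cm["FP"], cm["TN"], cm["FN"]
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Happens when a whole row or column of the confusion matrix is empty,
        # e.g. when nothing was predicted negative.
        return float("nan")
    return (tp * tn - fp * fn) / denom


print(f"{mcc(Counter({'FP': 52, 'TN': 15, 'TP': 8})):.3f}")  # 0.173
print(mcc(Counter({'FP': 67, 'TP': 8})))  # nan
```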

In [165]:
# a metric that marks a citation relevant when a majority of inclusion
# criteria are satisfied and no exclusion criterion is satisfied
def validate_dual_criteria_based_answer(example, 
                                        pred, 
                                        trace=None) -> str | bool: 
    satisfied_inclusions = pred.inclusion_satisfiability.satisfied.values()
    satisfied_exclusions = pred.exclusion_satisfiability.satisfied.values()

    # strict majority of inclusion criteria, and no exclusion criterion met
    n_satisfied_inclusions = sum(1 for v in satisfied_inclusions if v)
    pred.relevant = (n_satisfied_inclusions > len(satisfied_inclusions) // 2) and not any(satisfied_exclusions)

    return confusion_validate(example, pred, trace=trace)
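Stripped of the DSPy objects, the decision rule in this metric is: a strict majority of inclusion criteria satisfied, and no exclusion criterion triggered. Here is a minimal sketch of just that rule, assuming the verdicts are plain dicts of 0/1 values; `is_relevant` is a hypothetical name used only for illustration.

```python
def is_relevant(inclusions: dict[str, int], exclusions: dict[str, int]) -> bool:
    """Majority of inclusion criteria met and no exclusion criterion met."""
    n_met = sum(1 for v in inclusions.values() if v)
    majority_included = n_met > len(inclusions) // 2
    return majority_included and not any(exclusions.values())


print(is_relevant({"a": 1, "b": 1, "c": 0}, {"x": 0}))  # True
print(is_relevant({"a": 1, "b": 1, "c": 0}, {"x": 1}))  # False: an exclusion fired
print(is_relevant({"a": 1, "b": 0}, {"x": 0}))          # False: 1 is not > 2 // 2
```

A single triggered exclusion criterion vetoes the citation no matter how many inclusion criteria it meets, which makes this rule much stricter than the majority-only metric used earlier.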

In [166]:
for sr_id, data in devset.items():
    sr_title, data = data[0], data[1:][0]
    print(f"Systematic Review: {sr_id}")
    f1_evaluate(ClassifyByInclusionExclusion(sr_title), data, validate_dual_criteria_based_answer)

Systematic Review: Hall_2012


F1 Score: 0.125 100%|█████████████████████████████████████████████████████████████| 75/75 [00:00<00:00, 187.82it/s]


Confusion Matrix: Counter({'TN': 60, 'FP': 7, 'FN': 7, 'TP': 1})
Precision: 0.125
Recall: 0.125
F1: 0.125
MCC: 0.021
Specificity: 0.907
Systematic Review: Smid_2020


F1 Score: nan 100%|███████████████████████████████████████████████████████████████| 75/75 [00:00<00:00, 245.47it/s]


Confusion Matrix: Counter({'TN': 65, 'FN': 8, 'FP': 2})
Precision: 0.000
Recall: 0.000
F1: nan
MCC: -0.057
Specificity: 0.973
Systematic Review: Radjenovic_2013


F1 Score: nan 100%|███████████████████████████████████████████████████████████████| 75/75 [00:00<00:00, 278.10it/s]


Confusion Matrix: Counter({'TN': 58, 'FP': 9, 'FN': 8})
Precision: 0.000
Recall: 0.000
F1: nan
MCC: -0.128
Specificity: 0.880
Systematic Review: Sep_2021


F1 Score: nan 100%|███████████████████████████████████████████████████████████████| 75/75 [00:00<00:00, 332.87it/s]

Confusion Matrix: Counter({'TN': 67, 'FN': 8})
Precision: nan
Recall: 0.000
F1: nan
MCC: nan
Specificity: 1.000

## Not great...
What if we generate both sets in one go, with inclusion and exclusion criteria as separate outputs of a single signature, and then check satisfiability for each?

In [134]:
# define a new signature for inclusion/exclusion criteria generation
class InclusionExclusionCriteria(dspy.Signature):
    """Output a set of inclusion/exclusion criteria a prospective citation must meet to be included in the systematic review."""

    systematic_review_title: str = dspy.InputField()
    inclusion_criteria: dict[str, str] = dspy.OutputField(desc="Inclusion criteria and their descriptions.")
    exclusion_criteria: dict[str, str] = dspy.OutputField(desc="Exclusion criteria and their descriptions.")

# custom module for pipeline
class ClassifyByInclusionExclusion(dspy.Module):
    def __init__(self, sr_title: str):
        super().__init__()
        self.systematic_review_title = sr_title
        self.generate_criteria = dspy.ChainOfThought(InclusionExclusionCriteria)
        self.evaluate_criteria = dspy.ChainOfThought(CheckCriteriaMatch)
    def forward(self, citation_title: str, citation_abstract: str):
        # the signature now has two output fields rather than a single `criteria`
        generated = self.generate_criteria(
            systematic_review_title=self.systematic_review_title
        )
        # check each set separately, as in the two-signature module
        pred = dspy.Prediction()
        pred.inclusion_satisfiability = self.evaluate_criteria(criteria=generated.inclusion_criteria,
                                                               citation_title=citation_title,
                                                               citation_abstract=citation_abstract)
        pred.exclusion_satisfiability = self.evaluate_criteria(criteria=generated.exclusion_criteria,
                                                               citation_title=citation_title,
                                                               citation_abstract=citation_abstract)
        return pred

In [135]:
pred = dspy.ChainOfThought(InclusionExclusionCriteria)(systematic_review_title=test_title)

In [137]:
pred.inclusion_criteria

{'Dataset Size': 'Studies utilizing datasets of sufficient size to allow for meaningful statistical analysis (generally > 1000 software units, but this will be assessed on a case-by-case basis).',
 'Fault Prediction Technique': 'Studies evaluating any established or emerging fault prediction technique, including but not limited to static analysis, machine learning, and data mining approaches.',
 'Performance Metrics': 'Studies reporting quantifiable performance metrics such as precision, recall, F1-score, AUC, or similar measures of accuracy and effectiveness.',
 'Evaluation Methodology': 'Studies employing a controlled evaluation methodology, including baseline comparisons and/or cross-validation techniques.',
 'Software Domain': 'Studies focusing on a variety of software domains (e.g., web applications, mobile applications, embedded systems) to ensure broad applicability.',
 'Publication Type': 'Studies published in peer-reviewed journals or conference proceedings.'}

In [138]:
pred.exclusion_criteria

{'Lack of Quantitative Data': 'Studies lacking sufficient quantitative data to assess performance (e.g., purely qualitative case studies).',
 'Unclear Methodology': 'Studies with poorly defined or unclear methodologies, making it difficult to replicate or interpret the results.',
 'Small Dataset Size': 'Studies utilizing datasets too small to allow for meaningful statistical analysis (< 100 software units).',
 'Non-Reproducible Results': 'Studies where the methodology is not clearly described, preventing replication of the findings.',
 'Grey Literature Only': 'Studies found only in grey literature (e.g., internal reports, blog posts) without a published peer-reviewed version.'}