## Irrelevant QA Pairs Flagging Script

This script is designed to automatically flag irrelevant question-answer (QA) pairs in a dataset based on predefined criteria. It specifically identifies QA pairs that reference non-essential document components such as appendices, tables, and sections, which are often considered irrelevant in the context of generating useful QA data for language models.

*Key Components and Functionality:*

### Criteria for Irrelevance:

The script uses a list of IRRELEVANT_KEYWORDS containing terms that typically indicate irrelevant content, such as "table," "appendix," "section," and other document-specific references.
An additional list, EXCLUSION_KEYWORDS, holds terms such as "purpose," "role," "function," "use," and "document" that usually signal a relevant question. These terms are consulted only after the QA pair has been checked against IRRELEVANT_KEYWORDS.
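
As an illustration of how such keyword lists might be prepared, the sketch below lowercases and de-duplicates a list and separates single words from multi-word phrases. The helper name normalize_keywords and the sample list are hypothetical and not part of the script.

```python
def normalize_keywords(keywords):
    """Lowercase, de-duplicate, and split keywords into words and phrases.

    Hypothetical helper for illustration; not part of the original script.
    """
    seen = set()
    words, phrases = [], []
    for keyword in keywords:
        # Collapse case and repeated whitespace before comparing entries
        normalized = " ".join(keyword.lower().split())
        if not normalized or normalized in seen:
            continue
        seen.add(normalized)
        # A keyword containing a space is a phrase, not a single token
        (phrases if " " in normalized else words).append(normalized)
    return words, phrases


sample = ["table", "Table", "appendix", "page number", "Section", "section"]
words, phrases = normalize_keywords(sample)
print(words)    # ['table', 'appendix', 'section']
print(phrases)  # ['page number']
```

Keeping words and phrases apart makes it explicit which entries can be matched token by token and which need a phrase-level search.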

### Loading QA Pairs:

The function load_qa_pairs() loads QA pairs from a JSON file. Each pair includes a "prompt" (question) and a "response" (answer).
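
The expected input shape can be made explicit with a small validation step. The sketch below assumes the file holds a JSON list of objects with "prompt" and "response" keys, as described above; the helper name validate_qa_pairs and the sample records are made up for illustration.

```python
import json

REQUIRED_KEYS = ("prompt", "response")


def validate_qa_pairs(qa_pairs):
    """Return the indices of records that lack a string prompt or response.

    Hypothetical helper; assumes the file holds a list of JSON objects.
    """
    bad_indices = []
    for index, qa in enumerate(qa_pairs):
        is_valid = isinstance(qa, dict) and all(
            isinstance(qa.get(key), str) for key in REQUIRED_KEYS
        )
        if not is_valid:
            bad_indices.append(index)
    return bad_indices


# Made-up sample data standing in for the contents of the input file
raw = """[
    {"prompt": "What is the purpose of the licence?", "response": "It authorizes operation."},
    {"prompt": "What does Table 3 show?"}
]"""
qa_pairs = json.loads(raw)
print(validate_qa_pairs(qa_pairs))  # [1]
```

Running a check like this before flagging would surface malformed records early instead of failing with a KeyError partway through the dataset.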

### Irrelevance Check:

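The function check_irrelevance() examines both the question and answer for the presence of any keywords from the IRRELEVANT_KEYWORDS list.
It lowercases the combined text, splits it on whitespace, and looks for exact token matches, so a keyword matches only when it appears as a whole word. If any of the IRRELEVANT_KEYWORDS is found, the QA pair is flagged as irrelevant. Because matching is per token, multi-word entries such as "page number" and words with attached punctuation such as "table," are not detected.
The script consults EXCLUSION_KEYWORDS only if no IRRELEVANT_KEYWORDS are detected. In the code as written, that branch returns False, the same as the default, so the exclusion list does not change the result.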
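
One possible way to also catch multi-word phrases and words next to punctuation is to match on word boundaries instead of whitespace tokens. The following is a minimal sketch of that alternative, not the script's own method; the function name contains_keyword and the shortened keyword list are assumptions, and switching to it would change which pairs get flagged.

```python
import re

# Shortened, illustrative keyword list (not the script's full list)
KEYWORDS = ["table", "appendix", "section", "page number", "this document"]

# One alternation with word boundaries; re.escape guards special characters
KEYWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def contains_keyword(question, answer):
    """Return True if any keyword occurs as a whole word or phrase."""
    return KEYWORD_PATTERN.search(question + " " + answer) is not None


# "table," still matches because the comma forms a word boundary
print(contains_keyword("What does the table, on page 4, list?", "Dose limits."))  # True
# Multi-word phrases are matched as a unit
print(contains_keyword("Which page number covers shielding?", "Page 12."))  # True
# "portable" and "suitable" contain "table" but not as a whole word
print(contains_keyword("Is the unit portable?", "Yes, it is suitable for field use."))  # False
```

Word-boundary matching avoids substring hits inside longer words, which is the same goal the exact-token check serves, while remaining tolerant of punctuation.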

### Flagging Irrelevant QA Pairs:

The function flag_irrelevant_qa_pairs() processes the loaded QA pairs, applying the irrelevance check to each. It appends an is_irrelevant flag to each pair, indicating whether it should be considered irrelevant.
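
If the input records carried fields beyond "prompt" and "response", a variant that copies each record before adding the flag would preserve them. The sketch below uses a stand-in predicate named is_irrelevant_pair; in the script that role is played by check_irrelevance(). The function name flag_pairs_keep_fields and the "source" field are made up for the example.

```python
def is_irrelevant_pair(question, answer):
    """Stand-in predicate for illustration; the script uses check_irrelevance()."""
    tokens = set((question + " " + answer).lower().split())
    return "appendix" in tokens


def flag_pairs_keep_fields(qa_pairs, predicate):
    """Return copies of the records with an added is_irrelevant flag."""
    return [
        {**qa, "is_irrelevant": predicate(qa["prompt"], qa["response"])}
        for qa in qa_pairs
    ]


# Made-up records with an extra "source" field
pairs = [
    {"prompt": "What is in appendix B?", "response": "Forms.", "source": "doc-1"},
    {"prompt": "Who issues licences?", "response": "The regulator.", "source": "doc-2"},
]
flagged = flag_pairs_keep_fields(pairs, is_irrelevant_pair)
print([qa["is_irrelevant"] for qa in flagged])  # [True, False]
print(flagged[0]["source"])                     # doc-1
```

Because each output record is a new dictionary, the input list is left unmodified.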

### Saving Results:

The flagged QA pairs, along with their irrelevance status, are saved to a new JSON file using the save_flagged_qa_pairs() function. This file can be reviewed to ensure that the flagging criteria are working as intended.
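
For the review step, it can help to separate the saved records into flagged and kept groups. This is a minimal sketch, assuming the output file has the structure described above; the temporary file and the sample records are made up for the demonstration.

```python
import json
import os
import tempfile

# Made-up records in the same shape as the script's output
records = [
    {"prompt": "What does section 2 cover?", "response": "Scope.", "is_irrelevant": True},
    {"prompt": "Who issues licences?", "response": "The regulator.", "is_irrelevant": False},
]

# Write to a temporary file so the example does not touch real data
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "flagged_sample.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=4)
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

# Split the reloaded records by flag for manual inspection
flagged = [qa for qa in loaded if qa["is_irrelevant"]]
kept = [qa for qa in loaded if not qa["is_irrelevant"]]
print(len(flagged), len(kept))  # 1 1
```

Reading a handful of records from each group is a quick way to spot both false positives and missed cases.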

### Statistics:

After processing, the script outputs the total number of QA pairs and the number of flagged pairs, providing an overview of the dataset's content relevance.
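
The same counts can also be expressed as a flag rate. The sketch below uses a hypothetical summarize helper on made-up records; it is not part of the script.

```python
def summarize(flagged_qa_pairs):
    """Return total, flagged count, and flagged share as a percentage."""
    total = len(flagged_qa_pairs)
    flagged = sum(1 for qa in flagged_qa_pairs if qa["is_irrelevant"])
    # Guard against division by zero on an empty dataset
    share = 100.0 * flagged / total if total else 0.0
    return total, flagged, share


# Made-up records: one flagged pair out of four
sample = [{"is_irrelevant": True}] + [{"is_irrelevant": False}] * 3
total, flagged, share = summarize(sample)
print(f"Total QA pairs: {total}")                     # Total QA pairs: 4
print(f"Flagged QA pairs: {flagged} ({share:.1f}%)")  # Flagged QA pairs: 1 (25.0%)
```

For the run recorded in this notebook (3501 of 36744 pairs flagged), the share works out to roughly 9.5%.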

In [1]:
import json
from tqdm import tqdm

# Criteria for flagging irrelevant QA pairs.
# Note: check_irrelevance() lowercases the text and matches whole
# whitespace-separated tokens, so the multi-word phrases below never match
# as written; they are kept to document the intended criteria.
# Capitalized and duplicate entries were dropped (they had no effect).
IRRELEVANT_KEYWORDS = [
    "table", "tables", "annex", "annexes", "text", "texts", "formula", "formulas",
    "section", "sections", "subsection", "subsections", "appendix", "appendices",
    "equation", "equations", "figure", "figures", "page number", "page numbers",
    "provide information on", "publishing history", "change history",
    "the document", "this document", "mentioned in the"
]

# Additional contextual exclusions to avoid flagging relevant QA pairs
EXCLUSION_KEYWORDS = [
    "purpose", "role", "function", "use", "document", "suitable", "portable", "information"
]

def load_qa_pairs(input_file):
    """Load QA pairs from a JSON file."""
    with open(input_file, 'r', encoding='utf-8') as f:
        return json.load(f)

def check_irrelevance(question, answer):
    """Check if the QA pair is irrelevant based on specific keywords and phrases."""
    combined_text = (question + " " + answer).lower()
    # Whitespace tokenization: punctuation stays attached ("table," != "table")
    combined_words = set(combined_text.split())

    # Flag the pair if any keyword appears as an exact token
    if any(keyword in combined_words for keyword in IRRELEVANT_KEYWORDS):
        return True

    # Exclusion keywords are checked only after the irrelevance check;
    # both paths below return False, so this branch does not change the result
    if any(exclusion_keyword in combined_words for exclusion_keyword in EXCLUSION_KEYWORDS):
        return False

    return False

def flag_irrelevant_qa_pairs(qa_pairs):
    """Flag irrelevant QA pairs based on the criteria."""
    flagged_qa_pairs = []
    for qa in tqdm(qa_pairs, desc="Flagging QA pairs"):
        question = qa["prompt"]
        answer = qa["response"]
        is_irrelevant = check_irrelevance(question, answer)
        flagged_qa_pairs.append({
            "prompt": question,
            "response": answer,
            "is_irrelevant": is_irrelevant
        })
    return flagged_qa_pairs

def save_flagged_qa_pairs(flagged_qa_pairs, output_file):
    """Save flagged QA pairs to a JSON file."""
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(flagged_qa_pairs, f, ensure_ascii=False, indent=4)
    print(f"Saved {len(flagged_qa_pairs)} flagged QA pairs to {output_file}")

# Main execution
if __name__ == "__main__":
    input_file = "/Users/zarinadossayeva/Desktop/WIL_LLM/CNSC_QA_pairs_JSON/CNSC_QA_pairs.json"
    output_file = "flagged_CNSC_QA_pairs_4Aug24_reduced_keywords.json"
    
    # Load the QA pairs from the input file
    qa_pairs = load_qa_pairs(input_file)   
    
    # Flag irrelevant QA pairs
    flagged_qa_pairs = flag_irrelevant_qa_pairs(qa_pairs)
    
    # Save the flagged QA pairs to the output file
    save_flagged_qa_pairs(flagged_qa_pairs, output_file)
    
    # Print statistics
    total_pairs = len(qa_pairs)
    flagged_pairs = sum(1 for qa in flagged_qa_pairs if qa["is_irrelevant"])
    print(f"Total QA pairs: {total_pairs}")
    print(f"Flagged QA pairs: {flagged_pairs}")

Flagging QA pairs: 100%|██████████████| 36744/36744 [00:00<00:00, 230538.93it/s]


Saved 36744 flagged QA pairs to flagged_CNSC_QA_pairs_4Aug24_reduced_keywords.json
Total QA pairs: 36744
Flagged QA pairs: 3501
