# Input Check
First, user input undergoes a content safety check. If the input text contains inappropriate content such as violence, pornography, or discriminatory language, the filter intercepts it before the model receives it. We use a pre-trained harmful content classifier for automatic detection: the [toxic-BERT model](https://huggingface.co/unitary/toxic-bert#:~:text=Trained%20models%20%26%20code%20to,comments%2C%20Multilingual%20toxic%20comment%20classification), published by Unitary on the Hugging Face Hub (based on BERT and trained on the Jigsaw Toxic Comment Classification Challenge dataset). The model can identify abusive, threatening, and hateful statements in the input.

We use Hugging Face's `pipeline` to load the pre-trained text classification model and wrap it in a function `filter_input(text)`, which returns a boolean indicating whether the input is safe. If the score of the model's top label exceeds a threshold (0.2 here), the function returns `False` to mark the input as unsafe; otherwise, it returns `True`.

In [41]:
from transformers import pipeline

# Load pre-trained toxic detector (toxic-bert)
toxic_classifier = pipeline("text-classification", model="unitary/toxic-bert")

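def filter_input(text: str, threshold: float = 0.2) -> bool:
    """Return True if the input is safe, False if it looks toxic."""
    result = toxic_classifier(text)[0]
    score = result.get('score', 0)

    # Decide on the score alone: the reported label can be "toxic" even
    # when its score is close to zero, so the label by itself says little.
    return score <= threshold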
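
The interception step described above can be sketched as a small wrapper that calls the model only when the filter accepts the input. This is a minimal sketch, not the original pipeline: `guarded_generate`, `REFUSAL_MESSAGE`, `fake_filter`, and `fake_model` are hypothetical names, and the two stand-in functions exist only so the example runs without downloading a model. In practice `filter_input` would be passed as `is_safe`.

```python
from typing import Callable

REFUSAL_MESSAGE = "Sorry, I can't help with that."  # hypothetical wording


def guarded_generate(
    text: str,
    is_safe: Callable[[str], bool],
    generate: Callable[[str], str],
) -> str:
    """Call `generate` only when `is_safe` accepts the input."""
    if not is_safe(text):
        return REFUSAL_MESSAGE
    return generate(text)


# Stand-ins so the sketch runs without a model (assumptions, not real filters).
def fake_filter(text: str) -> bool:
    return "kill" not in text.lower()


def fake_model(text: str) -> str:
    return f"Story about: {text}"


print(guarded_generate("a friendly cat", fake_filter, fake_model))
print(guarded_generate("kill the dog", fake_filter, fake_model))
```

Passing the filter and the generator as arguments keeps the gate independent of any particular classifier or language model.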


# Output Check
## Readability Score

Readability measures how well a text suits the reading level of its target age group. We use the [Flesch-Kincaid Grade Level Index](https://readable.com/readability/flesch-reading-ease-flesch-kincaid-grade-level/) to evaluate the readability of English text. This index estimates the corresponding US school grade from the average sentence length (words per sentence) and the average number of syllables per word. For example, a score of 8.0 indicates that an eighth-grade student (approximately 13-14 years old) in the US can understand the text.

In [None]:
import re

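def readability_score(text: str):
    """
    Calculate the Flesch-Kincaid grade level as a readability score.
    Syllables are estimated by counting vowel groups, so the result is approximate.
    Returns None if the text contains no sentences or no words.
    """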
    sentences = re.split(r'[.!?]', text)
    sentences = [s for s in sentences if s.strip()]
    words = re.findall(r'\w+', text)
    if len(sentences) == 0 or len(words) == 0:
        return None
    word_count = len(words)
    sentence_count = len(sentences)
    vowels = "aeiouy"
    syllables = 0
    for word in words:
        w = word.lower()
        prev_vowel = False
        syll_count = 0
        for ch in w:
            if ch in vowels:
                if not prev_vowel:
                    syll_count += 1
                prev_vowel = True
            else:
                prev_vowel = False
        if w.endswith("e"):
            syll_count = max(1, syll_count - 1)
        if syll_count == 0:
            syll_count = 1
        syllables += syll_count
    # Flesch-Kincaid grade index
    grade = 0.39 * (word_count / sentence_count) + 11.8 * (syllables / word_count) - 15.59
    return grade
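
As a worked check of the formula implemented above, the grade level is

$$
\text{grade} = 0.39 \cdot \frac{\text{words}}{\text{sentences}} + 11.8 \cdot \frac{\text{syllables}}{\text{words}} - 15.59
$$

For the sentence "The sun is shining brightly today." the code counts 6 words and 1 sentence, and its vowel-group heuristic gives 1 + 1 + 1 + 2 + 2 + 2 = 9 syllables, so

$$
0.39 \cdot \frac{6}{1} + 11.8 \cdot \frac{9}{6} - 15.59 = 2.34 + 17.70 - 15.59 = 4.45
$$

Because the syllable counts come from a heuristic rather than a pronunciation dictionary, other Flesch-Kincaid tools may report slightly different grades for the same text.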

## Prohibited Topics Detection

Before performing other checks, we first screen the output text for prohibited topics such as sexual content, violence, drugs, self-harm, and hate speech. This is critical for child safety and should be the first line of defense in output filtering.

We use a [zero-shot classification model](https://huggingface.co/facebook/bart-large-mnli#:~:text=,shot%20classification%20pipeline), which lets us extend the list of topics without retraining, to detect whether the text relates to predefined prohibited topics. The model assigns a confidence score to each topic, and topics whose score meets or exceeds a threshold (0.6 by default) are flagged as detected.

In [None]:
from transformers import pipeline

# Initialize zero-shot classification pipeline
topic_classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Define prohibited topics
PROHIBITED_TOPICS = [
    "sexual content",
    "violence",
    "drugs",
    "self-harm",
    "hate speech"
]

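def detect_prohibited_topics(text: str, threshold: float = 0.6):
    """
    Parameters:
        text: the text to screen
        threshold: minimum score for a topic to count as detected
    Returns:
        detected_labels: topics whose score is at or above the threshold
        raw_scores: dict mapping every topic to its score
    """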
    result = topic_classifier(text, candidate_labels=PROHIBITED_TOPICS, multi_label=True)
    labels = result["labels"]
    scores = result["scores"]
    
    detected_labels = [label for label, score in zip(labels, scores) if score >= threshold]
    raw_scores = dict(zip(labels, scores))
    
    return detected_labels, raw_scores
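
To make the thresholding step concrete without loading the model, the sketch below applies the same filter to a hand-written result. It assumes the zero-shot pipeline returns a dictionary with `"sequence"`, `"labels"`, and `"scores"` keys; the scores in `mock_result` are made up for illustration, and `labels_over_threshold` is a hypothetical helper name.

```python
# Hypothetical result shaped like the zero-shot pipeline's output dict.
# The scores are invented for illustration, not produced by the model.
mock_result = {
    "sequence": "example text",
    "labels": ["violence", "hate speech", "drugs", "self-harm", "sexual content"],
    "scores": [0.91, 0.64, 0.05, 0.03, 0.01],
}


def labels_over_threshold(result: dict, threshold: float = 0.6) -> list:
    """Keep every label whose score is at or above the threshold."""
    return [
        label
        for label, score in zip(result["labels"], result["scores"])
        if score >= threshold
    ]


print(labels_over_threshold(mock_result))        # ['violence', 'hate speech']
print(labels_over_threshold(mock_result, 0.95))  # []
```

With `multi_label=True` each topic is scored independently, so the scores need not sum to 1 and several topics can pass the threshold at once, as in the first call.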


## Toxicity Detection

Model-generated responses also undergo toxicity detection. This layer works like input filtering: we reuse the `toxic_classifier` loaded earlier to check whether the output contains harmful elements (abuse, hate, pornography, etc.). If the output is deemed toxic, the response should be intercepted, filtered, or regenerated. The `filter_output` function below runs this check together with the other output checks. It calls `check_vocabulary`, which is defined in the Vocabulary Screening section, so that cell must be run before `filter_output` is called.

In [39]:
def filter_output(text: str, toxic_threshold: float = 0.2) -> dict:
    """
    Perform multi-layer checks on the model output text.
    Order: 1) prohibited topics 2) toxicity 3) readability 4) vocabulary difficulty
    All four checks always run; the returned dictionary contains every result.
    """
    result = {}
    
    # 1. prohibited topics
    detected_topics, topic_scores = detect_prohibited_topics(text)
    result['prohibited_topics'] = detected_topics
    result['prohibited_topics_scores'] = topic_scores
    result['has_prohibited_topics'] = len(detected_topics) > 0
    
    # 2. toxic
    toxic_res = toxic_classifier(text)[0]
    result['toxic_score'] = toxic_res.get('score', 0)
    # Only flag as toxic if confidence is above threshold
    result['is_toxic'] = result['toxic_score'] > toxic_threshold
    
    # 3. readability
    result['readability'] = readability_score(text)
    
    # 4. Vocabulary difficulty
    result['difficult_words'] = list(check_vocabulary(text))
    
    return result

## Vocabulary Screening

To ensure that the output content is easy for children to understand, we introduce a filtering mechanism based on the [Dale-Chall word list](https://help.readable.com/en/article/dale-chall-words-list-w877fe/). The Dale-Chall readability formula uses a list of approximately 3,000 common words that 80% or more of fourth-grade students are familiar with. Words not in this list are considered "difficult words" and may need simplification or explanation.

The following code loads the Dale-Chall word list from a text file and implements the `check_vocabulary(text)` function to screen difficult words. The word list is stored in `dale_chall_words.txt` and is loaded once when the code runs. The function splits the input text into words and returns the set of words that are not in the Dale-Chall vocabulary, indicating they may be too difficult for the target age group. If the file is missing, the loader returns an empty set, so every word is reported as difficult.

In [None]:
def load_dale_chall_words(filepath='dale_chall_words.txt'):
    """
    Load the Dale-Chall easy words list from a text file
    Returns a set of words for efficient lookup
    """
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            words = {line.strip().lower() for line in f if line.strip()}
        return words
    except FileNotFoundError:
        print(f"{filepath} not found")
        return set()

child_vocab = load_dale_chall_words()

def check_vocabulary(text: str):
    """
    Check if any words are not in the Dale-Chall easy words list
    """
    words = re.findall(r'\w+', text)
    difficult = [w for w in words if w.lower() not in child_vocab]
    return set(difficult)
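
The sketch below shows how the lookup above behaves on inflected and possessive forms. It uses a tiny stand-in vocabulary, not the real Dale-Chall list, and `check_vocabulary_demo` is a hypothetical variant that takes the vocabulary as an argument so the example is self-contained.

```python
import re

# Tiny stand-in vocabulary for illustration only; NOT the real Dale-Chall list.
demo_vocab = {"a", "bad", "boy", "his", "neighbor", "dog", "love"}


def check_vocabulary_demo(text: str, vocab: set) -> set:
    """Same logic as check_vocabulary, with the vocabulary passed in."""
    words = re.findall(r"\w+", text)
    return {w for w in words if w.lower() not in vocab}


flagged = check_vocabulary_demo("A bad boy loves his neighbor's dog", demo_vocab)
print(sorted(flagged))  # ['loves', 's']
```

Two effects are visible: `\w+` splits "neighbor's" into "neighbor" and "s", and the exact-match test flags "loves" even though "love" is in the vocabulary. Whether to normalize such forms before the lookup is a design choice the original code leaves open.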

# Comprehensive Filtering Process

The `filter_output` function performs checks in the following priority order:

1. **Prohibited Topics Detection** (Highest Priority): Screens for sensitive topics like sexual content, violence, drugs, self-harm, and hate speech using zero-shot classification. If any prohibited topics are detected with confidence above 0.6, the response should be immediately blocked.

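2. **Toxicity Detection** (Score Threshold 0.2): Checks for toxic language, abuse, and hateful content using the toxic-BERT model. To reduce false positives, text is flagged only when the score exceeds **0.2**. Although 0.2 looks low in absolute terms, the benign sentences in our tests scored between 0.001 and 0.061, so the threshold sits well above typical benign scores.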

3. **Readability Score**: Evaluates text complexity using the Flesch-Kincaid Grade Level Index. Lower scores indicate easier reading levels suitable for younger children.

4. **Vocabulary Screening**: Identifies words not in the Dale-Chall easy word list (3,000 common words familiar to 4th graders).
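
The priority order above can be turned into a single decision, sketched below under stated assumptions. `decide`, `max_grade`, and `max_difficult` are hypothetical and not part of the original pipeline; the input dictionaries use the keys returned by `filter_output`, with placeholder words in `difficult_words`.

```python
def decide(result: dict, max_grade: float = 5.0, max_difficult: int = 3) -> str:
    """Map a filter_output-style dict to 'block', 'regenerate', or 'allow'.

    max_grade and max_difficult are hypothetical limits chosen for this sketch.
    """
    # Safety checks come first and block the response outright.
    if result["has_prohibited_topics"] or result["is_toxic"]:
        return "block"
    # Suitability checks only ask for a simpler rewrite.
    grade = result["readability"]
    if grade is not None and grade > max_grade:
        return "regenerate"
    if len(result["difficult_words"]) > max_difficult:
        return "regenerate"
    return "allow"


unsafe = {"has_prohibited_topics": True, "is_toxic": True,
          "readability": 1.87, "difficult_words": ["w1", "w2"]}
simple = {"has_prohibited_topics": False, "is_toxic": False,
          "readability": 4.45, "difficult_words": ["w1"]}
too_hard = {"has_prohibited_topics": False, "is_toxic": False,
            "readability": 9.2, "difficult_words": []}

print(decide(unsafe), decide(simple), decide(too_hard))  # block allow regenerate
```

Separating hard blocks from rewrite requests reflects the ordering in the list: the first two checks concern safety, the last two concern age-appropriateness.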

In [42]:
# Test the functionality of the pipeline with English examples
if __name__ == "__main__":
    test_cases = [
        "A bad boy kill his neighbor's dog and steal his bike.",
        "He is a good boy who loves to help others.",
        "The sun is shining brightly today.",
        "The quick brown fox jumps over the lazy dog.",
        "The cat hates the dog but pretend to be friendly.",
        "The cat want to kill the dog but pretend to be friendly.",
        "The little girl was afraid of the dark forest.",
        "The boy and his dog went on an adventure in the mountains."
    ]
    
    for i, text in enumerate(test_cases, 1):
        print(f"Test Case {i}:")
        print("Text:", text)
        print("Is Input Safe:", filter_input(text))
        
        # Use filter_output for comprehensive checks
        output_results = filter_output(text)
        print("Prohibited Topics Detected:", output_results['prohibited_topics'])
        print("Has Prohibited Topics:", output_results['has_prohibited_topics'])
        print(f"Toxic Score: {output_results['toxic_score']:.3f}")
        print(f"Is Toxic (threshold > 0.2): {output_results['is_toxic']}")
        print("Readability Score:", output_results['readability'])
        print("Difficult Words Count:", len(output_results['difficult_words']))
        print("-" * 50)

Test Case 1:
Text: A bad boy kill his neighbor's dog and steal his bike.
Is Input Safe: False
Prohibited Topics Detected: ['violence']
Has Prohibited Topics: True
Toxic Score: 0.841
Is Toxic (threshold > 0.2): True
Readability Score: 1.8733333333333313
Difficult Words Count: 2
--------------------------------------------------
Test Case 2:
Text: He is a good boy who loves to help others.
Is Input Safe: True
Prohibited Topics Detected: []
Has Prohibited Topics: False
Toxic Score: 0.001
Is Toxic (threshold > 0.2): False
Readability Score: 2.4700000000000024
Difficult Words Count: 2
--------------------------------------------------
Test Case 3:
Text: The sun is shining brightly today.
Is Input Safe: True
Prohibited Topics Detected: []
Has Prohibited Topics: False
Toxic Score: 0.001
Is Toxic (threshold > 0.2): False
Readability Score: 4.450000000000003
Difficult Words Count: 1
--------------------------------------------------
Test Case 4:
Text: The quick brown fox jumps over the lazy dog

# Analysis Summary

This section provides a comprehensive analysis of the test results from our multi-layer content filtering pipeline. The pipeline performs four key checks: prohibited topics detection, toxicity detection, readability scoring, and vocabulary screening.

## Key Insight: Score-Based Toxicity Detection

Our testing showed that the toxic-BERT model's **scores are more informative than its labels**. The pipeline often reports a "toxic" label for benign content, but with a score close to zero (for example 0.001), while clearly harmful text scores far higher. This is consistent with the pipeline returning its highest-scoring label regardless of how low that score is. Therefore, we use **threshold-based scoring** (0.2 for both input and output; lowering the threshold makes filtering stricter) rather than relying on labels.
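
A minimal sketch of why the label alone is uninformative: if the classifier scores several labels and reports only the top one, a benign sentence still receives a label, just with a very small score. The scores below are made up, and `top_prediction` and `is_safe` are hypothetical helpers. Requesting all labels from the pipeline (for example with `top_k=None`) is an assumption worth checking against the installed transformers version.

```python
# Hypothetical per-label scores for a benign sentence, shaped like the
# list-of-dicts output of a text-classification pipeline. Values are invented.
benign_scores = [
    {"label": "toxic", "score": 0.001},
    {"label": "insult", "score": 0.0004},
    {"label": "obscene", "score": 0.0003},
]


def top_prediction(scores: list) -> dict:
    """Return the highest-scoring entry, which is what a top-1 result reports."""
    return max(scores, key=lambda item: item["score"])


def is_safe(scores: list, threshold: float = 0.2) -> bool:
    """Safe when even the highest score stays at or below the threshold."""
    return top_prediction(scores)["score"] <= threshold


top = top_prediction(benign_scores)
print(top["label"], is_safe(benign_scores))  # toxic True
```

The top label is "toxic" even though the sentence is treated as safe, which is why the decision is made on the score.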

## Test Results Summary

| # | Text | Input Safe | Prohibited Topics | Has Prohibited Topics | Toxic Score | Is Toxic (>0.2) | Readability | Difficult Words |
|---|------|------------|-------------------|-----------------------|-------------|-----------------|-------------|-----------------|
| 1 | A bad boy kill his neighbor's dog and steal his bike. | False | violence | Yes | 0.841 | Yes | 1.87 | 2 |
| 2 | He is a good boy who loves to help others. | True | None | No | 0.001 | No | 2.47 | 2 |
| 3 | The sun is shining brightly today. | True | None | No | 0.001 | No | 4.45 | 1 |
| 4 | The quick brown fox jumps over the lazy dog. | True | None | No | 0.061 | No | 2.34 | 1 |
| 5 | The cat hates the dog but pretend to be friendly. | True | None | No | 0.048 | No | 3.65 | 2 |
| 6 | The cat want to kill the dog but pretend to be friendly. | False | violence, hate speech | Yes | 0.417 | Yes | 2.86 | 1 |
| 7 | The little girl was afraid of the dark forest. | True | None | No | 0.008 | No | 2.34 | 0 |
| 8 | The boy and his dog went on an adventure in the mountains. | True | None | No | 0.001 | No | 3.84 | 1 |

## Key Findings

### 1. Prohibited Topics Detection (Zero-Shot Classification)
- **Test Case 1**: Flagged for **violence** (killing, stealing) - Correctly identified violent content, with a score at or above the 0.6 threshold.
- **Test Case 6**: Flagged for **violence and hate speech** ("want to kill") - Detected hostile intent even though no violent act takes place in the sentence.
- **Cases 2-5, 7-8**: Clean - No prohibited topics detected.
- **Effectiveness**: On this small sample of eight sentences, the zero-shot classifier separated benign from harmful content without errors, including Case 5, where "hates" appears but no topic was flagged.

### 2. Input Safety (Toxicity with Threshold 0.2)
- **Toxic-BERT Score Distribution**:
  - **High risk (>0.2)**: Cases 1 (0.841), 6 (0.417) - Correctly identified harmful content
  - **Very low (<0.01)**: Cases 2, 3, 7, 8 (0.001-0.008) - Clean, child-friendly content
  - **Low (0.048-0.061)**: Cases 4, 5 - Contain words with potential negative connotations but are not harmful

- **Threshold Effectiveness**: In these tests, the 0.2 threshold separates harmful content from benign text: Case 5, which contains "hates" in a non-threatening context, scores 0.048 and passes, while Case 6 ("want to kill") scores 0.417 and is flagged.

### 3. Readability Scores (Flesch-Kincaid Grade Level)
- **Score Range**: 1.87 to 4.45
- **Grade 1-2 (Easiest)**: Case 1 (1.87) - Very simple sentence structure, suitable for early readers
- **Grade 2-3**: Cases 2, 4, 6, 7 (2.34-2.86) - Appropriate for primary school students
- **Grade 3-4**: Cases 5, 8 (3.65-3.84) - Slightly longer words or sentences, still within primary school level
- **Grade 4-5**: Case 3 (4.45) - Slightly more complex but still accessible to upper elementary
- **Analysis**: All test cases fall within an appropriate range for children (K-5th grade), making them suitable for the target age group with minimal adaptation needed.

### 4. Vocabulary Screening (Dale-Chall 3000 Word List)
- **Perfect Match**: Case 7 (0 difficult words) - All vocabulary within the Dale-Chall easy word list
- **Minimal Challenge**: Cases 3, 4, 6, 8 (1 word) - Nearly all words are familiar to 4th graders
- **Moderate Challenge**: Cases 1, 2, 5 (2 words) - A few words may need explanation or context
- **Common Difficult Words**: "neighbor's", "loves", "others", "shining", "brightly", "adventure", "mountains", "hates", "pretend"
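- **Observation**: Most "difficult" words are common in children's literature and conversations, suggesting the Dale-Chall list may be conservative for modern elementary education. Several are also inflected forms (for example "loves", "hates", "mountains"), which an exact-match lookup flags whenever only the base form is in the list.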