# JSON Post-Processing for Clean Conversations

This notebook performs a second cleaning step on the JSON files generated earlier.  
The goal is to:
1. Detect redundant or irrelevant chatbot messages.  
2. Remove them while keeping the meaningful conversation intact.  
3. Save the cleaned JSON files in a new folder.  

### 1. Import required libraries
We use `os` and `json` for file handling, and `AzureOpenAI` for making API calls.

In [1]:
import os
import json
from openai import AzureOpenAI

### 2. Define input and output directories
We read the raw JSON files from `./Json Files` and store the cleaned versions in the `cleanprocessed_json` subfolder, which is created inside the input folder if it does not already exist.


In [2]:
folderpath = r"./Json Files"
output_dir = os.path.join(folderpath, "cleanprocessed_json")
os.makedirs(output_dir, exist_ok=True)

### 3. Initialize Azure OpenAI client
We configure the client with API key, version, and endpoint to access the model for message filtering.

In [3]:
client = AzureOpenAI(
    api_key="YOUR-API-KEY",
    api_version="2024-02-15-preview",
    azure_endpoint="https://fallbackmodel.openai.azure.com/"
)
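
A key pasted into a notebook cell travels with the file whenever it is shared. One alternative is to read the settings from the environment. The following is a minimal sketch: `load_azure_settings` is a hypothetical helper, and the variable names `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`, and `OPENAI_API_VERSION` are an assumed convention, not something this notebook defines.

```python
import os


def load_azure_settings(env=None):
    """Collect AzureOpenAI keyword arguments from environment variables.

    The variable names used here are an assumed convention for this sketch.
    """
    env = os.environ if env is None else env
    required = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "api_key": env["AZURE_OPENAI_API_KEY"],
        "azure_endpoint": env["AZURE_OPENAI_ENDPOINT"],
        # Fall back to the API version used in the cell above
        "api_version": env.get("OPENAI_API_VERSION", "2024-02-15-preview"),
    }
```

With the variables set, the client could then be built as `client = AzureOpenAI(**load_azure_settings())`.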

### 4. Define message filter
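The function `clean_and_deidentify_text` sends a single **bot message** to the model and asks whether it is redundant or irrelevant to the mental health context. Despite its name, it does not alter or de-identify the text; it only classifies it.  
It returns `True` if the message should be removed and `False` otherwise. If the API call fails, it also returns `False`, so the message is kept.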

In [4]:
def clean_and_deidentify_text(text):
    prompt = f"""
    You are a strict content filter for a mental health chatbot.
    Your task is to analyze a single chatbot message and decide if it is REDUNDANT or IRRELEVANT to a conversation focused on the user's emotional well-being.

    Return ONLY:
    - "true" → if the message:
        - Is a **generic greeting** (e.g., "Hello", "Hi", "Hey", "Hey there!", "Hi there!", "Good to see you", "Hope you're doing well", "How are you today?", "Nice to see you again", "Welcome back", "How can I help?", etc.).
        - Contains any **opening pleasantries**, even if part of a longer sentence (e.g., "I am Healo, your personal AI therapist and companion for emotional health. (:sparkles) Just saying this is a safe and confidential space for you to share anything. Remember I am always here for you. You've got this! (:sparkles)", "So tell me how's it going? Anything on your mind lately?", "So, how have you been? Anything interesting on your mind these days?", "Hey! How are you doing today? I’m here to help").
        - Refers to **reminders**, **rendering GIFs**, **Pomodoro timers**, **tests**, **interlink cards**, **personality reports**, **traffic alerts**, or **any automated functionality unrelated to emotional well-being**.
        - Mentions or refers to **images**, **generic external links**, or **non-emotional function cards**.
        - Is **automatically generated** or a **repetitive template message** that doesn’t offer emotional insight.
        - Contains meta-chat like: "Before we proceed, could you please share what you have in mind or what you'd like to do?"

    - "false" → if the message provides emotional support, asks about the user's feelings, encourages self-reflection, offers coping tools or genuine mental health insight.

    Only analyze **bot messages**. Never return true for user messages.

    Only return **true** or **false** — no explanations.

    Original Text:
    {text}
    """
    try:
        response = client.chat.completions.create(
            model="gpt-4o-fallback",
            messages=[
                {"role": "system", "content": "You are a helpful assistant for data cleaning."},
                {"role": "user", "content": prompt}
            ],
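            temperature=0,  # deterministic output suits a true/false classification
            max_tokens=5    # the expected reply is a single word
        )

        # message.content may be None, so fall back to an empty string before normalizing
        answer = (response.choices[0].message.content or "").strip().lower()
        return answer == "true"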

    except Exception as e:
        # Fail safe: if the API call fails, keep the message instead of dropping it
        print(f"Error calling the API: {e}")
        return False
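
The exact comparison `answer == "true"` treats replies such as `True.` or `"true"` (with quotes) as `False`, which silently keeps the message. A more tolerant parser is one way to handle that. This is a minimal sketch: `parse_filter_reply` is a hypothetical helper, not part of the notebook, and treating any unrecognized reply as "keep" is an assumption that mirrors the `return False` in the `except` branch.

```python
def parse_filter_reply(reply):
    """Map a model reply to True (remove) or False (keep).

    Hypothetical helper: strips whitespace, surrounding quotes, Markdown
    emphasis, and trailing punctuation before comparing. Anything that is
    not clearly "true" is treated as "keep".
    """
    if not reply:
        return False
    normalized = reply.strip().lower().strip("\"'`*.!? \n\t")
    return normalized == "true"
```

Inside `clean_and_deidentify_text`, this could replace the final comparison, for example `return parse_filter_reply(response.choices[0].message.content)`.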

### 5. Clean a single JSON file
The function `clean_json_file`:
1. Loads the JSON file.  
2. Iterates over its messages.  
3. Removes bot messages marked as redundant.  
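4. Returns the cleaned conversation, keeping only the metadata fields `conversation_id`, `language`, `country`, `timestamp`, `insight_score`, and `isPaid`.  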


In [5]:
def clean_json_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        data = json.load(f)

    cleaned_messages = []

    for message in data.get("messages", []):
        # Drop bot messages the filter flags; keep everything else, including all user messages
        if message["role"] == "bot" and clean_and_deidentify_text(message["content"]):
            continue
        cleaned_messages.append(message)

    cleaned_data = {
        "conversation_id": data.get("conversation_id"),
        "messages": cleaned_messages,
        "language": data.get("language"),
        "country": data.get("country"),
        "timestamp": data.get("timestamp"),
        "insight_score": data.get("insight_score"),
        "isPaid": data.get("isPaid")
    }

    return cleaned_data
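
To see the filtering logic without calling the API, the same loop can be exercised on an in-memory conversation with a stand-in classifier. This is only a sketch: `filter_messages`, `looks_like_greeting`, and the sample conversation are hypothetical, while the `role` and `content` keys are the ones used by the function above.

```python
def filter_messages(messages, should_remove):
    """Keep user messages and any bot message the predicate does not flag."""
    return [
        m for m in messages
        if not (m["role"] == "bot" and should_remove(m["content"]))
    ]


def looks_like_greeting(text):
    # Stand-in for the API-backed filter; a crude prefix check for illustration only
    return text.lower().startswith(("hello", "hi", "hey"))


sample = [
    {"role": "bot", "content": "Hey there!"},
    {"role": "user", "content": "Hi, I have been feeling anxious lately."},
    {"role": "bot", "content": "That sounds hard. What tends to trigger the anxiety?"},
]

kept = filter_messages(sample, looks_like_greeting)
print([m["role"] for m in kept])  # ['user', 'bot']
```

The user message survives even though it starts with "Hi", because only messages with the `bot` role are ever passed to the predicate.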

### 6. Clean all JSON files in a folder
The function `clean_all_jsons`:
- Loops through all JSON files in the input folder in lexicographic filename order (so `conversation10.json` comes right after `conversation1.json`).  
- Calls `clean_json_file` for each one.  
- Saves the cleaned version into the output directory. 

In [6]:
def clean_all_jsons(folderpath, output_dir):
    os.makedirs(output_dir, exist_ok=True)

    json_files = [f for f in os.listdir(folderpath) if f.endswith(".json")]
    json_files.sort()

    for idx, filename in enumerate(json_files):
        input_path = os.path.join(folderpath, filename)
        output_path = os.path.join(output_dir, filename)

        print(f"Processing {filename} ({idx}...")
        cleaned = clean_json_file(input_path)

        with open(output_path, 'w', encoding='utf-8') as f_out:
            json.dump(cleaned, f_out, indent=2, ensure_ascii=False)
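
Plain `sort()` compares names character by character, which is why `conversation10.json` is processed before `conversation2.json` in the run below. If numeric order matters, a natural-sort key is one option. A minimal sketch follows; `natural_key` is a hypothetical helper, not part of the notebook.

```python
import re


def natural_key(name):
    """Split a file name into text and integer chunks so digits compare numerically."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


files = ["conversation1.json", "conversation10.json", "conversation2.json"]
print(sorted(files, key=natural_key))
# ['conversation1.json', 'conversation2.json', 'conversation10.json']
```

In `clean_all_jsons`, this would mean calling `json_files.sort(key=natural_key)` instead of `json_files.sort()`.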


### 7. Run the pipeline
We now run the cleaning process for all JSON files in the input folder.  
Each progress line shows the file name followed by its zero-based position in the sorted list. The cleaned files are saved in `cleanprocessed_json`.

In [8]:
clean_all_jsons(folderpath, output_dir)

Processing conversation1.json (0...
Processing conversation10.json (1...
Processing conversation2.json (2...
Processing conversation3.json (3...
Processing conversation4.json (4...
Processing conversation5.json (5...
Processing conversation6.json (6...
Processing conversation7.json (7...
Processing conversation8.json (8...
Processing conversation9.json (9...
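
A quick sanity check after the run is to compare message counts between each input file and its cleaned counterpart. The sketch below assumes both folders hold files with the same names and a top-level `messages` list, as in the functions above; `count_messages` and `removal_report` are hypothetical helpers.

```python
import json
import os


def count_messages(path):
    """Return the number of entries under the "messages" key of a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return len(json.load(f).get("messages", []))


def removal_report(input_dir, cleaned_dir):
    """Map each file name present in both folders to (before, after) message counts."""
    report = {}
    for name in sorted(os.listdir(cleaned_dir)):
        source = os.path.join(input_dir, name)
        if name.endswith(".json") and os.path.isfile(source):
            cleaned = os.path.join(cleaned_dir, name)
            report[name] = (count_messages(source), count_messages(cleaned))
    return report
```

Calling `removal_report(folderpath, output_dir)` returns a dictionary such as `{"conversation1.json": (before, after), ...}`; a file whose counts are equal had nothing removed, which is also what happens when every API call for that file failed.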
