# Filter, clean and summarize debates

This notebook is a preprocessing helper: it merges the raw debates into a single file so that later file operations are quicker, since some laptops struggle to open ~20000 separate files. Debates that are split into multiple child debates are also removed. For reference, my workstation loops over and loads the ~20000 JSON files in about 5 seconds.

In [1]:
import os
import re
import json

Functions to clean debate texts:

In [2]:
def clean_html(text):
    return re.sub(r"<.*?>", "", text)

def normalize_whitespace(text):
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def preprocess_text(text):
    text = clean_html(text)
    text = normalize_whitespace(text)
    return text
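
As a quick check of what these helpers do, here is a minimal sketch that runs them on a made-up snippet (the sample string is an assumption for illustration, not taken from the dataset). Note that `.` does not match newlines by default, so a tag that spans several lines would not be removed by this pattern.

```python
import re

def clean_html(text):
    # Remove anything that looks like an HTML tag (non-greedy match)
    return re.sub(r"<.*?>", "", text)

def normalize_whitespace(text):
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", text).strip()

def preprocess_text(text):
    return normalize_whitespace(clean_html(text))

# Hypothetical input, invented for this example
sample = "<p>I thank the   hon. Member\n for <b>raising</b> this.</p>"
cleaned = preprocess_text(sample)
print(cleaned)  # I thank the hon. Member for raising this.
```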

Function for cleaning and joining all JSON files:

In [3]:
def preprocess_debate_files(path):
    all_debates_list = []
    
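    for file in os.listdir(path):
        full_path = os.path.join(path, file)

        # ignore directories (test the joined path; the bare file name
        # would be resolved against the current working directory)
        if os.path.isdir(full_path):
            continue
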
        with open(full_path, 'r', encoding="utf-8") as f:
            data = json.load(f)
    
        debate_data = {}
    
        if "Overview" not in data:
            continue
    
        # Collecting overall data
        debate_data["Id"]       = data["Overview"]["Id"]
        debate_data["ExtId"]    = data["Overview"]["ExtId"]
        debate_data["Title"]    = data["Overview"]["Title"]
        debate_data["Date"]     = data["Overview"]["Date"]
        debate_data["Location"] = data["Overview"]["Location"]

        # Ignoring debates with no content
        if ("Items" not in data):
            continue
        
        # Collecting all contributions
        debate_data["Interactions"] = []
        
        for interaction in data["Items"]:
            if (interaction["MemberId"]) and (interaction["ItemType"] == "Contribution"):
                # Clean the contribution text
                value = preprocess_text(interaction["Value"])
                
                interaction_data = {
                    "ItemId"          : interaction["ItemId"],
                    "ExternalId"      : interaction["ExternalId"],
                    "MemberId"        : interaction["MemberId"],
                    "Value"           : value,
                    "OrderInSection"  : interaction["OrderInSection"],
                    "AttributedTo"    : interaction["AttributedTo"]
                }
                debate_data["Interactions"].append(interaction_data)     

        # only gather debates with more than 1 contribution
        if (len(debate_data["Interactions"]) > 1):
            all_debates_list.append(debate_data)

    return all_debates_list
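
To make the filtering rule in the function above concrete, the sketch below applies the same two conditions to a single in-memory record. The field names come from the function; the record itself and all of its values are invented for illustration, and the real files may carry more fields.

```python
# Hypothetical record: structure mirrors what the function reads,
# values are made up for this example.
record = {
    "Overview": {"Id": 1, "ExtId": "abc", "Title": "Example debate",
                 "Date": "2020-01-01", "Location": "Example location"},
    "Items": [
        {"ItemId": 10, "MemberId": 100, "ItemType": "Contribution",
         "Value": "<p>First speech.</p>"},
        # No MemberId and not a contribution: skipped
        {"ItemId": 11, "MemberId": None, "ItemType": "Timestamp",
         "Value": "12.30 pm"},
        {"ItemId": 12, "MemberId": 101, "ItemType": "Contribution",
         "Value": "<p>Reply.</p>"},
    ],
}

# Keep only items attributed to a member and typed as a contribution
kept = [
    item for item in record["Items"]
    if item["MemberId"] and item["ItemType"] == "Contribution"
]

# A debate is retained only if it has more than one such contribution
keep_debate = len(kept) > 1

print([item["ItemId"] for item in kept])  # [10, 12]
print(keep_debate)                        # True
```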

In [4]:
all_debates_list = preprocess_debate_files("data")

Number of debates left to analyse after filtering for interaction validity:

In [5]:
len(all_debates_list)

14719

Saving the data

In [6]:
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(all_debates_list, f, ensure_ascii=False, indent=4)
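
A minimal sketch of how the saved file could be sanity-checked by reading it back. To stay self-contained it writes a tiny stand-in list to a temporary directory instead of touching `data.json`; the stand-in record is an assumption for illustration. It also shows that `ensure_ascii=False` keeps non-ASCII characters readable in the file instead of escaping them.

```python
import json
import os
import tempfile

# Hypothetical stand-in for all_debates_list
debates = [{"Id": 1, "Title": "Café funding", "Interactions": []}]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "data.json")

    # Same dump settings as the cell above
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(debates, f, ensure_ascii=False, indent=4)

    # Read the raw text back and parse it
    with open(out_path, "r", encoding="utf-8") as f:
        raw = f.read()
    reloaded = json.loads(raw)

print(reloaded == debates)  # True: the data round-trips unchanged
print("é" in raw)           # True: the character is stored as-is, not escaped
```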