<a href="https://colab.research.google.com/github/blue-create/langlens/blob/main/format_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Purpose

This notebook documents the steps we took to process the raw data, which was delivered as zipped XML files.

## Connect with Google drive to access data

To access the data, first add a shortcut to the shared data folder to your own Google Drive. If you have been granted editing rights, you can also change the folder's contents: add, move, and delete data, create and rename folders, and so on.

In [None]:
# connect with google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# redirect the working directory of this script to the data folder
%cd /content/drive/MyDrive/data/

/content/drive/MyDrive/data


In [None]:
# check number of files 
import os 

num_files = len(os.listdir("."))
print("Number of files in the folder: ", num_files)

Number of files in the folder:  2


## Unzip the files 

The first step is to unzip the files in the "Raw" folder and extract the XML files into a separate folder called "unzipped". Next, we check whether any zip file in "Raw" is missing a corresponding XML file in "unzipped".

In [None]:
# unzip all files 

import zipfile
import shutil

# Path to the raw folder
raw_folder = "./Raw"

# Path to the unzipped folder
unzipped_folder = "./unzipped"

# Loop through all the files in the raw folder
for filename in os.listdir(raw_folder):
    if filename.endswith(".zip"):
        # Check if the file is a valid zip file
        if zipfile.is_zipfile(os.path.join(raw_folder, filename)):
            # Create a ZipFile object for the current zip file
            with zipfile.ZipFile(os.path.join(raw_folder, filename), "r") as zip_ref:
                # Extract all the contents of the zip file to the unzipped folder
                zip_ref.extractall(unzipped_folder)
        else:
            print(f"File {filename} is not a valid zip file and will not be extracted.")

In [None]:
# check whether a zip file is missing an XML file

# Loop through all the files in the Raw folder
for filename in os.listdir(raw_folder):
    if filename.endswith(".zip"):
        # Get the first part of the filename before the .zip extension
        name = filename.split(".")[0]

        # Loop through all the files in the unzipped folder
        for xml_filename in os.listdir(unzipped_folder):
            if xml_filename.startswith(name) and xml_filename.endswith(".xml"):
                break
        else:
            # A corresponding XML file doesn't exist
            print(f"No corresponding XML file exists for {filename}")

No corresponding XML file exists for NBPC.zip
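
One way to investigate a message like the one above is to list what the archive actually contains, since the check only matches XML files whose names start with the zip file's name. The following is a minimal sketch; `list_zip_members` is a hypothetical helper, and the demo archive is built in memory with an invented file name rather than taken from the data folder.

```python
import io
import zipfile


def list_zip_members(zip_source):
    """Return the names of all members stored in a zip archive.

    zip_source may be a path or a file-like object.
    """
    with zipfile.ZipFile(zip_source, "r") as zip_ref:
        return zip_ref.namelist()


# Self-contained demo: an in-memory archive with a hypothetical member name
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("other_name_1_100.xml", "<root/>")
buffer.seek(0)

members = list_zip_members(buffer)
print(members)
```

In this notebook the helper could be called as `list_zip_members(os.path.join(raw_folder, "NBPC.zip"))` to see whether the archive is empty or simply contains files with a different name prefix.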


## Convert xml files to json files 

For further analysis, we decided to convert the XML files to JSON rather than CSV, for the following reasons:

- We are dealing with large datasets.
- CSV files are slow to query and difficult to store efficiently.
- JSON supports more complex data structures, such as a list of paragraphs per article.
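
To make the last point concrete, here is a minimal sketch of how a single article could look as a JSON record. The field names follow the conversion code in this notebook; the values are invented placeholders, not real data.

```python
import json

# Hypothetical article record; all values are placeholders
article = {
    "artikel_id": "A0001",
    "name": "Example Newspaper",
    "jahrgang": "2020",
    "datum": "2020-01-01",
    "ressort": None,  # optional fields are stored as null
    "titel": "Example title",
    "untertitel": None,
    "text": ["First paragraph.", "Second paragraph."],  # nested list of paragraphs
}

# Round-trip through JSON: the nested list and the None values survive unchanged
encoded = json.dumps(article, ensure_ascii=False)
decoded = json.loads(encoded)
print(decoded["text"])
```

A flat CSV row would need an extra convention (for example a delimiter inside one cell) to hold the list of paragraphs.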

In [79]:
# load packages and set directories 

import json  # used below to read and write the output files
import os
import xml.etree.ElementTree as ET 
from tqdm import tqdm # for progress bar 

# Set up paths for input and output directories 
input_dir = "unzipped"
output_dir = "json"


In [None]:
# Iterate over all the xml files in the directory
for xml_file in tqdm(os.listdir(input_dir)): 
  # Check if the file is an XML file
  if xml_file.endswith(".xml"):
    # Parse the xml file
    tree = ET.parse(os.path.join(input_dir, xml_file))
    # Get the root 
    root = tree.getroot()
    # Create list for converted output 
    json_file = []
    # Loop through each article, get the data and append it to the output file json_file
    for artikel in root.findall('artikel'):

      # Access static data by their xpath
      artikel_id = artikel.find('metadaten/artikel-id').text
      name = artikel.find('metadaten/quelle/name').text
      jahrgang = artikel.find('metadaten/quelle/jahrgang').text
      datum = artikel.find('metadaten/quelle/datum').text

      # Access variable data by their xpath 
      ressort_elem = artikel.find('inhalt/titel-liste/ressort')
      # Store data unless data is not available and None, then store as None 
      ressort = ressort_elem.text if ressort_elem is not None else None 

      titel_elem = artikel.find('inhalt/titel-liste/titel')
      titel = titel_elem.text if titel_elem is not None else None 

      untertitel_elem = artikel.find('inhalt/titel-liste/untertitel')
      untertitel = untertitel_elem.text if untertitel_elem is not None else None

      # Create list for text inputs 
      text = []
      # Find the 'text' element
      text_elem = artikel.find('inhalt/text')
      # Extract the paragraphs only if the article has a 'text' element
      if text_elem is not None:
          # Loop over the 'p' elements and extract their text content
          for p_elem in text_elem.findall('p'):
              p_text = p_elem.text
              # Only add text if text is not empty 
              if p_text is not None:
                  text.append(p_text)

      # Create temporary dict to store all information 
      temp_dict = {}
      temp_dict['artikel_id'] = artikel_id
      temp_dict['name'] = name
      temp_dict['jahrgang'] = jahrgang
      temp_dict['datum'] = datum
      temp_dict['ressort'] = ressort
      temp_dict['titel'] = titel
      temp_dict['untertitel'] = untertitel
      temp_dict['text'] = text

      # Add the article dict to the output list 
      json_file.append(temp_dict)

    # Extract the prefix of the file name
    prefix = xml_file.split("_")[0]
  
    # Create the output file name
    output_file = os.path.join(output_dir, f"{prefix}.json")
        
    # Check if the output file already exists
    if os.path.exists(output_file):
        # Read the existing JSON data from the output file
        with open(output_file, "r") as f:
            json_data = json.load(f)
    else:
        # Create a new empty JSON data dictionary
        json_data = {"data": []}
        
    # Add the articles of this XML file to the JSON data
    # (extend keeps "data" a flat list of articles instead of a list of lists)
    json_data["data"].extend(json_file)
        
    # Write the updated JSON data to the output file
    with open(output_file, "w") as f:
        json.dump(json_data, f)
  


  1%|          | 30/5934 [07:16<36:36:34, 22.32s/it]
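
The progress bar above projects a runtime of more than 36 hours. One possible contributor (an assumption, not something measured here) is that every iteration reloads and rewrites the complete JSON file for its prefix, so the cost per file grows as the output grows. The sketch below shows a different technique from the cell above: JSON Lines, where each article is appended as one line and nothing has to be re-read. `append_jsonl` and `read_jsonl` are hypothetical helpers, and the records are placeholders.

```python
import json
import os
import tempfile


def append_jsonl(path, records):
    """Append each record as one JSON object per line (JSON Lines)."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo with placeholder records written to a temporary directory
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "OSZ.jsonl")
append_jsonl(demo_path, [{"artikel_id": "1", "text": ["a"]}])
append_jsonl(demo_path, [{"artikel_id": "2", "text": ["ü"]}])

records = read_jsonl(demo_path)
print(len(records))
```

The trade-off is that the output is no longer a single JSON document, so downstream code has to read it line by line.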

In [None]:
import xml.etree.ElementTree as ET
import os

# Set up the paths for the input and output files
input_file_path = "./unzipped/OSZ_1_10000.xml"
output_file_path = "./json/OSZ_1_10000.json"

# Parse the XML file
tree = ET.parse(input_file_path)
root = tree.getroot()

In [None]:
json_tab = []
for artikel in root.findall('artikel'):

    artikel_id = artikel.find('metadaten/artikel-id').text
    name = artikel.find('metadaten/quelle/name').text
    jahrgang = artikel.find('metadaten/quelle/jahrgang').text
    datum = artikel.find('metadaten/quelle/datum').text

    ressort_elem = artikel.find('inhalt/titel-liste/ressort')
    ressort = ressort_elem.text if ressort_elem is not None else None 

    titel_elem = artikel.find('inhalt/titel-liste/titel')
    titel = titel_elem.text if titel_elem is not None else None 

    untertitel_elem = artikel.find('inhalt/titel-liste/untertitel')
    untertitel = untertitel_elem.text if untertitel_elem is not None else None

    text = []
    # Find the 'text' element
    text_elem = artikel.find('inhalt/text')
    # Extract the paragraphs only if the article has a 'text' element
    if text_elem is not None:
        # Loop over the 'p' elements and extract their text content
        for p_elem in text_elem.findall('p'):
            p_text = p_elem.text
            # only add text if text is not empty 
            if p_text is not None:
                text.append(p_text)

    temp_dict = {}
    temp_dict['artikel_id'] = artikel_id
    temp_dict['name'] = name
    temp_dict['jahrgang'] = jahrgang
    temp_dict['datum'] = datum
    temp_dict['ressort'] = ressort
    temp_dict['titel'] = titel
    temp_dict['untertitel'] = untertitel
    temp_dict['text'] = text

    json_tab.append(temp_dict)

    

In [None]:
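import json

# serialize the list of article dictionaries to JSON
json_data = json.dumps(json_tab, ensure_ascii=False)

# write the JSON data to a file (UTF-8, because ensure_ascii=False keeps umlauts as-is)
with open(output_file_path, "w", encoding="utf-8") as outfile:
    outfile.write(json_data)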
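
The per-article extraction appears twice in this notebook, so it may be worth factoring it into a function. The following is a minimal sketch under stated assumptions: `text_or_none` and `artikel_to_dict` are hypothetical helpers, the sample XML (including the root tag `artikel-liste`) is invented to match the paths used above, and unlike the cells above a missing metadata element yields `None` instead of raising an error.

```python
import xml.etree.ElementTree as ET


def text_or_none(parent, path):
    """Return the text of the element at path, or None if it does not exist."""
    elem = parent.find(path)
    return elem.text if elem is not None else None


def artikel_to_dict(artikel):
    """Convert one 'artikel' element into a plain dictionary."""
    text_elem = artikel.find("inhalt/text")
    paragraphs = []
    if text_elem is not None:
        # Keep only paragraphs that actually contain text
        paragraphs = [p.text for p in text_elem.findall("p") if p.text is not None]
    return {
        "artikel_id": text_or_none(artikel, "metadaten/artikel-id"),
        "name": text_or_none(artikel, "metadaten/quelle/name"),
        "jahrgang": text_or_none(artikel, "metadaten/quelle/jahrgang"),
        "datum": text_or_none(artikel, "metadaten/quelle/datum"),
        "ressort": text_or_none(artikel, "inhalt/titel-liste/ressort"),
        "titel": text_or_none(artikel, "inhalt/titel-liste/titel"),
        "untertitel": text_or_none(artikel, "inhalt/titel-liste/untertitel"),
        "text": paragraphs,
    }


# Invented sample document; the root tag name is an assumption
sample = (
    "<artikel-liste><artikel>"
    "<metadaten><artikel-id>X1</artikel-id>"
    "<quelle><name>Example Paper</name><jahrgang>2020</jahrgang>"
    "<datum>2020-01-01</datum></quelle></metadaten>"
    "<inhalt><titel-liste><titel>Example title</titel></titel-liste>"
    "<text><p>First.</p><p/><p>Second.</p></text></inhalt>"
    "</artikel></artikel-liste>"
)

root_demo = ET.fromstring(sample)
result = [artikel_to_dict(a) for a in root_demo.findall("artikel")]
print(result[0]["text"])
```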

In [None]:
import os
import glob
import xmltodict
import json

# Set up the paths to the input and output directories
input_dir = "unzipped"
output_dir = "json"

# Get a list of all the XML files in the input directory
xml_files = glob.glob(os.path.join(input_dir, "*.xml"))

xml_dict_by_prefix = {}
# Loop over the XML files
for xml_file in tqdm(xml_files):
    # Parse the XML file to a Python dictionary using xmltodict
    with open(xml_file, "r") as f:
        xml_str = f.read()
        xml_dict = xmltodict.parse(xml_str)

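    # Get the newspaper prefix: the part of the filename before the first underscore
    # (e.g. "OSZ" for "OSZ_1_10000.xml"; a fixed [:4] slice would give "OSZ_")
    file_prefix = os.path.basename(xml_file).split("_")[0]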

    # Append the dictionary to the list corresponding to this file prefix
    if file_prefix in xml_dict_by_prefix:
        xml_dict_by_prefix[file_prefix].append(xml_dict)
    else:
        xml_dict_by_prefix[file_prefix] = [xml_dict]

# Save each group of dictionaries as a separate JSON file in the output directory
for file_prefix, xml_dict_list in xml_dict_by_prefix.items():
    output_filename = os.path.join(output_dir, file_prefix + ".json")
    with open(output_filename, "w") as f:
        json.dump(xml_dict_list, f)


## Old code

In [None]:
# old 
import pandas as pd

# Collect one row per XML file and build the DataFrame once at the end
# (DataFrame.append was removed in pandas 2.0)
rows = []

for entry in sorted(os.listdir('.')):
    entry_path = os.path.join('.', entry)
    if os.path.isdir(entry_path):
        # Entry is a folder, add XML files in the folder to the dataset
        for file in sorted(os.listdir(entry_path)):
            if file.endswith('.xml'):
                file_path = os.path.join(entry_path, file)
                with open(file_path, 'r') as f:
                    content = f.read()
                rows.append({'Filename': file, 'Type': 'XML', 'Content': content})
    elif os.path.isfile(entry_path) and entry.endswith('.xml'):
        # Entry is an XML file, add it to the dataset
        with open(entry_path, 'r') as f:
            content = f.read()
        rows.append({'Filename': entry, 'Type': 'XML', 'Content': content})

dataset = pd.DataFrame(rows, columns=['Filename', 'Type', 'Content'])

print(dataset.head())


Empty DataFrame
Columns: [Filename, Type, Content]
Index: []


Things to do:
- convert the XML files to JSON files
- do some descriptive analysis: number of articles per newspaper, number of newspapers, number of topics per newspaper, etc.
- do filtering: German newspapers only, DA-related topics only
- topic analysis: run the filtered dataset through a generic topic model
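
As a starting point for the descriptive analysis, the number of articles per newspaper could be counted roughly as follows. This is a minimal sketch that assumes the articles are available as a flat list of dictionaries with a `name` field, as produced by the conversion code above; `articles_per_newspaper` is a hypothetical helper and the demo records are placeholders.

```python
from collections import Counter


def articles_per_newspaper(articles):
    """Count how many articles each newspaper (field 'name') contributes."""
    return Counter(article["name"] for article in articles)


# Placeholder records standing in for the converted JSON data
demo_articles = [
    {"artikel_id": "1", "name": "Paper A"},
    {"artikel_id": "2", "name": "Paper B"},
    {"artikel_id": "3", "name": "Paper A"},
]

counts = articles_per_newspaper(demo_articles)
print(counts.most_common())  # newspapers sorted by number of articles
print(len(counts))           # number of distinct newspapers
```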