# Purpose

This notebook documents the steps we took to process the raw data, which arrive as zipped XML files.

## Connect with Google drive to access data

To access the data, first add a shortcut to the data folder to your own Google Drive. If you have been granted editing rights, you can edit the content of the folder, i.e. add, move and delete data, create and rename folders, etc.

In [None]:
# connect with google drive
from google.colab import drive
drive.mount('/content/drive')

In [2]:
# redirect the working directory of this script to the data folder
#%cd /content/drive/MyDrive/data/
%cd /content/drive/MyDrive/Work/Frontline/data

/content/drive/.shortcut-targets-by-id/1WfnZsqpG1r110J63sMbfS5TpsDOkveiV/data


In [3]:
import os
# check number of files 
num_files = len(os.listdir("Raw"))
print("Number of files in the folder: ", num_files)

Number of files in the folder:  253


## Unzip the files 

The first step is to unzip the files and extract the XML files into a separate folder called "unzipped". Next, we check whether any zip file in the "Raw" folder is missing a corresponding XML file in the "unzipped" folder.

In [None]:
# unzip all files 

import zipfile
import shutil

# Path to the raw folder
raw_folder = "./Raw"

# Path to the unzipped folder
unzipped_folder = "./unzipped"

# Loop through all the files in the raw folder
for filename in os.listdir(raw_folder):
    if filename.endswith(".zip"):
        # Check if the file is a valid zip file
        if zipfile.is_zipfile(os.path.join(raw_folder, filename)):
            # Create a ZipFile object for the current zip file
            with zipfile.ZipFile(os.path.join(raw_folder, filename), "r") as zip_ref:
                # Extract all the contents of the zip file to the unzipped folder
                zip_ref.extractall(unzipped_folder)
        else:
            print(f"File {filename} is not a valid zip file and will not be extracted.")

In [None]:
# check whether a zip file is missing an xml file

# Loop through all the files in the Raw folder
for filename in os.listdir(raw_folder):
    if filename.endswith(".zip"):
        # Get the first part of the filename before the .zip extension
        name = filename.split(".")[0]

        # Loop through all the files in the unzipped folder
        for xml_filename in os.listdir(unzipped_folder):
            if xml_filename.startswith(name) and xml_filename.endswith(".xml"):
                break
        else:
            # A corresponding XML file doesn't exist
            print(f"No corresponding XML file exists for {filename}")
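
The check above matches on `startswith(name)`, so a short prefix (for example a hypothetical `AA`) would also be satisfied by files that belong to `AAN`. A minimal sketch of a stricter, set-based variant follows. It assumes that XML files are named `<prefix>_<start>_<end>.xml` (as in `MIB_250001_260000.xml`) and that prefixes contain no underscore; `find_zips_without_xml` is a hypothetical helper name, not part of the original notebook.

```python
def find_zips_without_xml(zip_names, xml_names):
    """Return the zip file names that have no matching XML file.

    Assumption: XML files are named <prefix>_<start>_<end>.xml and
    prefixes contain no underscore.
    """
    # Collect the prefix (text before the first underscore) of every XML file
    xml_prefixes = {
        name.split("_")[0] for name in xml_names if name.endswith(".xml")
    }
    missing = []
    for zip_name in zip_names:
        if not zip_name.endswith(".zip"):
            continue
        # Strip the ".zip" extension to get the prefix
        prefix = zip_name[: -len(".zip")]
        if prefix not in xml_prefixes:
            missing.append(zip_name)
    return sorted(missing)


# Example with made-up file names
zips = ["AA.zip", "AAN.zip", "MIB.zip"]
xmls = ["AAN_1_10000.xml", "MIB_250001_260000.xml"]
print(find_zips_without_xml(zips, xmls))
```

With the real folders, the two arguments would be `os.listdir(raw_folder)` and `os.listdir(unzipped_folder)`.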

## Convert xml files to json files 

For further analysis, we decided to convert the XML files to JSON files for the following reasons:

- we are dealing with big datasets 
- CSVs are slow to query and difficult to store efficiently 
- JSON supports more complex data structures
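
To illustrate the last point: each article carries a list of paragraphs, which a flat CSV row cannot hold without extra encoding. The sketch below uses a made-up record (all values are placeholders; the keys mirror those produced by the parser in this notebook) and shows that the nested list and the missing value survive a round trip through JSON.

```python
import json

# Made-up article record; the values are placeholders
article = {
    "artikel_id": "123",
    "titel": "Example title",
    "untertitel": None,
    "text": ["First paragraph.", "Second paragraph."],
}

# Serialize a list of records and restore it again
encoded = json.dumps([article], ensure_ascii=False)
restored = json.loads(encoded)

print(restored[0]["text"])
print(restored[0]["untertitel"])
```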

#### Methods
For this step, the following methods are defined:
* parsing xml files
* testing if exported files are complete


In [15]:
def parse_xml_file(xml_file):
  """ function to parse a single xml file and return its articles as a list of dicts
  Parameters:
    - xml_file (str): name of the xml file inside input_dir
  Returns:
    - list of dict: one dict per article, ready to be dumped as json
  """
  tree = ET.parse(os.path.join(input_dir, xml_file))
  # Get the root 
  root = tree.getroot()
  # Create list for converted output 
  json_file = []
  # Loop through each article, get the data and append it to the output file json_file
  for artikel in root.findall('artikel'):
    # Access static data by their xpath
    # Store data unless data is not available and None, then store as None 
    artikel_id = artikel.find('metadaten/artikel-id')
    artikel_id = artikel_id.text if artikel_id is not None else None 

    name = artikel.find('metadaten/quelle/name')
    name = name.text if name is not None else None

    jahrgang = artikel.find('metadaten/quelle/jahrgang')
    jahrgang = jahrgang.text if jahrgang is not None else None

    datum = artikel.find('metadaten/quelle/datum')
    datum = datum.text if datum is not None else None


    # Access variable data by their xpath 
    ressort_elem = artikel.find('inhalt/titel-liste/ressort')
    # Store data unless data is not available and None, then store as None 
    ressort = ressort_elem.text if ressort_elem is not None else None 

    titel_elem = artikel.find('inhalt/titel-liste/titel')
    titel = titel_elem.text if titel_elem is not None else None 

    untertitel_elem = artikel.find('inhalt/titel-liste/untertitel')
    untertitel = untertitel_elem.text if untertitel_elem is not None else None

    # Create list for text inputs 
    text = []
    # Find the 'text' element
    text_elem = artikel.find('inhalt/text')
    # If no 'text' element exists, the list stays empty
    if text_elem is not None:
        # Loop over the 'p' elements inside the 'text' element
        for p_elem in text_elem.findall('p'):
            p_text = p_elem.text
            # Only add text if text is not empty 
            if p_text is not None:
                text.append(p_text)

    # Create temporary dict to store all information 
    temp_dict = {}
    temp_dict['artikel_id'] = str(artikel_id)
    temp_dict['name'] = name
    temp_dict['jahrgang'] = jahrgang
    temp_dict['datum'] = datum
    temp_dict['ressort'] = ressort
    temp_dict['titel'] = titel
    temp_dict['untertitel'] = untertitel
    temp_dict['text'] = text

    # Add the article dict to the output list 
    json_file.append(temp_dict)
  return json_file
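
The `find` / `is not None` pattern used above can be tried on a small in-memory document. The sketch below assumes the same element layout as the parser (`artikel`, `metadaten/quelle/name`, `inhalt/text/p`); the sample XML and the helper `text_or_none` are made up for illustration and are not part of the original notebook.

```python
import xml.etree.ElementTree as ET

# Made-up sample that follows the element layout used by the parser
SAMPLE_XML = """
<root>
  <artikel>
    <metadaten>
      <artikel-id>A1</artikel-id>
      <quelle><name>Example Paper</name></quelle>
    </metadaten>
    <inhalt>
      <titel-liste><titel>Headline</titel></titel-liste>
      <text><p>First paragraph.</p><p/><p>Second paragraph.</p></text>
    </inhalt>
  </artikel>
</root>
"""


def text_or_none(parent, path):
    """Return the text of the first element at `path`, or None if it is missing."""
    elem = parent.find(path)
    return elem.text if elem is not None else None


root = ET.fromstring(SAMPLE_XML)
records = []
for artikel in root.findall("artikel"):
    records.append({
        "artikel_id": text_or_none(artikel, "metadaten/artikel-id"),
        "name": text_or_none(artikel, "metadaten/quelle/name"),
        # 'datum' is absent in the sample, so this becomes None
        "datum": text_or_none(artikel, "metadaten/quelle/datum"),
        "titel": text_or_none(artikel, "inhalt/titel-liste/titel"),
        # Keep only paragraphs that actually contain text
        "text": [p.text for p in artikel.findall("inhalt/text/p") if p.text is not None],
    })

print(records)
```

Wrapping the lookup in one helper keeps the handling of missing elements in a single place, so no field can be forgotten.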


In [5]:
def check_if_complete(prefix):
  """ function to check whether the json file contains all available articles
  Parameters:
    - prefix (str): prefix of the journal that is checked for completeness
  Returns:
    - tuple (boolean, DataFrame)
      - boolean indicates whether or not the json file is complete
      - DataFrame holds the json data if it is complete and is None if incomplete
  """
  try:
    df=pd.read_json(os.path.join("json",prefix+".json"))
    # compare size of dataframe to number of articles
    if len(df)==art_per_src[prefix]:
      return (True,df)
    else:
      print(f"Number of articles in {prefix} json should be {art_per_src[prefix]} but is {len(df)}.")
      return (False, None)
  # a missing or unreadable json file (or an unknown prefix) counts as incomplete
  except Exception:
    return (False, None)
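
The same completeness idea can be sketched without pandas, using only the standard library. `count_json_records` and `is_complete` are hypothetical helpers, not part of the original notebook. The expected count would come from a dictionary such as `art_per_src`, and the sketch assumes each JSON file holds a single list of article records.

```python
import json
import os
import tempfile


def count_json_records(path):
    """Return the number of records in a JSON list file, or None if unreadable."""
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, ValueError):
        # Missing file, unreadable file, or invalid JSON
        return None
    return len(data) if isinstance(data, list) else None


def is_complete(path, expected):
    """True if the file parses and holds exactly `expected` records."""
    count = count_json_records(path)
    return count is not None and count == expected


# Demonstration with a temporary file and made-up records
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "DEMO.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump([{"artikel_id": "1"}, {"artikel_id": "2"}], f)

    complete = is_complete(demo_path, 2)
    incomplete = is_complete(demo_path, 3)
    missing = is_complete(os.path.join(tmp, "NOPE.json"), 2)

print(complete, incomplete, missing)
```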

### Exporting Json Files
In this step, we loop through all prefixes. Before parsing the XML files, we check whether a JSON file for that prefix already exists and whether that file contains all articles. If the JSON file was previously exported and contains all articles, the prefix is skipped.

In [6]:
# load packages and set directories 
import os 
import pandas as pd
import xml.etree.ElementTree as ET 
from tqdm import tqdm # for progress bar 
import json
import re

# Set up paths for input and output directories 
input_dir = "unzipped"
output_dir = "json"


#### Testing

In [7]:
# create a dictionary saving the number of articles per prefix using the xml names, eg. MIB_250001_260000.xml
# this dict is used in the testing method check_if_complete, so this cell needs to be executed before testing

art_per_src={}
# list of all prefixes
prefixes= sorted([i.split(".")[0] for i in os.listdir("Raw")])
# list of all xml files
xmls=os.listdir("unzipped")
for prefix in prefixes:
  # list of the upper article numbers of the prefix, taken from the xml file names
  n_art=sorted([int(re.split(r"_|\.",xml)[-2]) for xml in xmls if xml.startswith(prefix)])
  # save the largest number (total number of articles of that prefix) or 0 if no xml of that prefix is present
  n_art = n_art[-1] if len(n_art)>0 else 0
  art_per_src[prefix]=n_art
# eg. prefix AAN has 352'684 articles
art_per_src["AAN"]

352684
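
How the upper bound is read from a file name can be seen on the example name from the comment above. This is a small sketch; `upper_bound` is a hypothetical helper, the file names other than `MIB_250001_260000.xml` are made up, and it assumes names of the form `<prefix>_<start>_<end>.xml`.

```python
import re


def upper_bound(xml_name):
    """Return the last article number encoded in an XML file name."""
    # Splitting on underscores and dots turns "MIB_250001_260000.xml" into
    # ["MIB", "250001", "260000", "xml"]; the second-to-last part is the end
    return int(re.split(r"_|\.", xml_name)[-2])


names = ["MIB_1_10000.xml", "MIB_250001_260000.xml", "MIB_10001_20000.xml"]
total = max(upper_bound(n) for n in names)
print(total)
```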

In [12]:
# list with all prefixes
prefixes= sorted([i.split(".")[0] for i in os.listdir("Raw")])

In [16]:
# looping through all prefixes
for prefix in tqdm(prefixes):
  # if a json file already exists and is complete, the prefix is skipped
  if not check_if_complete(prefix)[0]:
    # list all xmls of a prefix
    xmls=[i for i in os.listdir(input_dir) if i.startswith(prefix)]
    # create an empty list for the parsed articles
    json_temp=[]
    # loop through all xml files
    for xml in xmls:
      # parse each xml and append its articles
      json_temp.extend(parse_xml_file(xml))
    #create output name  
    output_path=os.path.join(output_dir, f"{prefix}.json")
    with open(output_path, "w") as f:
      # save json 
      json.dump(json_temp, f)
  else:
    # if json file already exists and is complete, prefix is skipped
    print(f"{prefix} already exported")

 33%|███▎      | 1/3 [00:21<00:43, 21.88s/it]

AAN already exported


 67%|██████▋   | 2/3 [00:26<00:11, 11.87s/it]

AARB already exported


100%|██████████| 3/3 [00:41<00:00, 13.90s/it]

AAZ already exported





## To-do list


*  **convert xml to json file**

* **do filtering: German newspapers only, DA related topics only**

* do descriptive analysis: number of articles per newspaper, number of newspapers, number of topics per newspaper, etc.

* do a collocation analysis: see GitHub repo "newspaper" under scripts

* topic analysis: run the filtered dataset through a generic topic model
