# Plaintext to XML conversion

Once we have chosen a suitable corpus, we need to convert its plaintext files to XML.
This particular XML format is used by tagging tools in NLP for the Latvian language to create annotated corpora in the .vert format, which NoSketch Engine can compile.

## XML format 

The format is simple: each document starts with an opening `<doc>` tag and ends with a closing `</doc>` tag.
The `<doc>` tag can also carry attributes such as creator, title, date, etc.
For example: `<doc id="1" title="Paul Clifford" year="1830" creator="Edward Bulwer-Lytton">It was a dark and stormy night</doc>`
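
As a minimal sketch of this format, the snippet below builds one `<doc>` element and checks that it parses as XML. The helper name `make_doc` is hypothetical and not part of this notebook, and the sample text deliberately contains an ampersand to show why escaping matters. Only the standard library is used.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET


def make_doc(attrs: dict[str, str], text: str) -> str:
    """Wrap text in a <doc> element with attrs as attributes (hypothetical helper)."""
    # quoteattr escapes the value and adds the surrounding quotes
    attr_str = " ".join(f"{key}={quoteattr(value)}" for key, value in attrs.items())
    # escape turns &, < and > in the text into entities
    return f"<doc {attr_str}>{escape(text)}</doc>"


doc = make_doc(
    {"id": "1", "title": "Paul Clifford", "year": "1830", "creator": "Edward Bulwer-Lytton"},
    "It was a dark & stormy night",  # variation of the example above, with an ampersand
)
print(doc)

# parsing succeeds only if the element is well-formed XML
parsed = ET.fromstring(doc)
print(parsed.attrib["title"], "|", parsed.text)
```

Without the escaping step, the bare `&` would make the element malformed and `ET.fromstring` would raise `ParseError`.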

The conversion is a simple process and can be done in the following steps:
1. Obtain the list of plaintext files (possibly inside one or more zip files) from the source location
2. Read each individual plaintext file and convert it to the XML format
3. Concatenate all converted documents into a single XML file
4. Save the final XML file to the destination location

## Particulars of plaintext file

Our plaintext files contain metadata at the beginning of the file, followed by the actual text.
The metadata is separated from the actual text by three or more empty lines.
Each metadata entry has the format `key: value`, one entry per line.
In addition, some metadata is provided by the filename itself: for example, a file named `author_title_year.txt` yields the author, title and year.
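
To make this layout concrete, here is a small sketch that works on an invented in-memory sample rather than on the real files. The function names `split_header` and `meta_from_name`, the sample content and the sample filename are all assumptions made for illustration. The sketch assumes the header and the text are separated by at least three empty lines and that the filename has exactly three parts separated by underscores.

```python
from pathlib import Path

# invented sample that follows the layout described above
sample = (
    "title: Example article\n"
    "author: J. Doe\n"
    "uri: http://example.org/obj/1\n"
    "\n\n\n"
    "First line of the text.\n"
    "Second line of the text.\n"
)


def split_header(raw: str) -> tuple[dict[str, str], str]:
    """Split raw file content into (metadata, text) at the first run of three empty lines."""
    raw = raw.replace("\r\n", "\n")  # normalise Windows line endings
    # three empty lines after a metadata line are four consecutive newline characters
    header, _, body = raw.partition("\n\n\n\n")
    metadata = {}
    for line in header.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:  # ignore lines without a colon
            metadata[key.strip()] = value.strip()
    return metadata, body.lstrip("\n")


def meta_from_name(path: str) -> dict[str, str]:
    """Read author, title and year from a name shaped like author_title_year.txt."""
    author, title, year = Path(path).stem.split("_")
    return {"author": author, "title": title, "year": year}


meta, text = split_header(sample)
print(meta)
print(meta_from_name("bulwer-lytton_paul-clifford_1830.txt"))
```

Splitting on the first colon only matters for values such as URLs, which contain colons themselves.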

In [1]:
# import standard libraries
from pathlib import Path
import zipfile
from datetime import datetime
import re
# we could use XML libraries but for this simple task we can use string manipulation
import sys
# print Python version
print(f"Python version: {sys.version}")
# datetime to see when the script was run last
print(f"Current date and time: {datetime.now()}")
# import third-party libraries
from tqdm import tqdm

Python version: 3.12.6 (tags/v3.12.6:a4a2d2b, Sep  6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)]
Current date and time: 2024-11-08 11:35:35.719556


## Sample Plaintext files from sample zip file - case of "Raksti un Māksla" from 1940

We will explore the plaintext files in the collection of "Raksti un Māksla" (English translation: Articles and Art), a short-lived publication from 1940 with only 6 issues.
We already have the segmented plaintext files in a zip archive, which we will use for the conversion to XML.

In [2]:
zip_file = Path(r"I:\zips\articles\raksti_un_maksla_articles.zip") # note we use r to indicate raw string, easier to work with Windows paths
# assert that the file exists
assert zip_file.exists(), "File not found"
# how many files are in the zip file
with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    files = zip_ref.namelist()
    print(f"Number of files in the zip file: {len(files)}")
# how many of those are text files
text_files = [f for f in files if f.endswith(".txt")]
print(f"Number of text files in the zip file: {len(text_files)}")

Number of files in the zip file: 221
Number of text files in the zip file: 220


In [11]:
# let's print the first lines of the first file (the loop below stops after 22 lines, i = 0..21)
with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    print(f"Opening file: {text_files[0]}")
    print("-"*20)
    with zip_ref.open(text_files[0]) as file:
        for i, line in enumerate(file):
            print(line.decode("utf-8"), end="") # lines have newline character at the end, so we don't need to add it
            if i > 20:
                break

# the output should start with the metadata block,
# followed by some empty lines,
# and then the start of the text

Opening file: raksti_un_maksla_articles/rama1940n01_005_plaintext_s01.txt
--------------------
title: SATURA RĀDĪTĀJS
subheadline: 
author: 
section: 
uri: http://dom.lndb.lv/data/obj/125563



I
i. VISPĀRĒJIE RAKSTI LPP.
Baumanis, A. Pieci gadi . . . . . .. . 119
— Gaumi un paliekošas vērtības . . . . . . . . . . . . 213
Druva, J. Strādāt un vērtēt . . . . . . . . . . . . . . 5
ļurevičs, P. Tradiciju jēga 7
Kalve, V. Svētīts ir darbs! ... . 121
Tichovskis, H. Jaunā Latgale 4°5
Ulmanis, K. Draudzīgais aicinājums . . . . 117
2. RAKSTNIECĪBA UN ŽURNĀLISTIKA
Aistars, E. Dzīvais cilvēks . . . .. . . . . 319
Baumanis, A, Žurnālista piezīmes . . 413
Bičolis, J. Par literātūras kritiku . . . ... . 420
Cedriņš, V. Edvarta Virzas piemiņai • 222


In [3]:
# get file name without extension and without parent folder
def get_file_name(file_path):
    return file_path.stem
# test on first file
print(get_file_name(Path(text_files[0])))

# the file name carries metadata: short publication name, year of issue, issue number and page number
# for example, rama1940n01_005_plaintext_s01 yields the following dictionary (all values are strings)
# {
# "shortPublicationName": "rama",
# "dateIssued": "1940",
# "issueNumber": "01",
# "page": "005"
# }
# let's write a function to extract these values
def get_meta_from_filename(src: Path|str) -> dict[str, str]:
    src = Path(src)
    # regular expression to extract metadata from the filename
    # re.match anchors only at the start, so the suffix after "_plaintext" (such as _s01) is ignored
    pattern = re.compile(r"(?P<shortPublicationName>[a-z]+)(?P<dateIssued>\d{4})n(?P<issueNumber>\d{2})_(?P<page>\d{3})_plaintext")
    match = pattern.match(get_file_name(src))
    if match:
        return match.groupdict()
    # no match: return an empty dictionary
    return {}
    
# test on first file
print(get_meta_from_filename(text_files[0]))

rama1940n01_005_plaintext_s01
{'shortPublicationName': 'rama', 'dateIssued': '1940', 'issueNumber': '01', 'page': '005'}


In [4]:
# next we want to combine metadata from the file name with metadata from the file itself
# let's write a function that takes a path inside the zip file and returns the metadata and the text
# metadata is at the beginning of the file, one key: value pair per line
# once three empty lines have been seen, everything that follows is text
def get_meta_txt_from_file(zip_ref, src: str, newline_count=3) -> tuple[dict[str, str], str]:
    metadata = {}
    empty_lines = 0
    txt_lines = []

    with zip_ref.open(src) as file:
        for line in file:
            line = line.decode("utf-8")
            # line endings depend on how the file was written: \r\n in our files, \n in files created on Linux or macOS
            if line in ("\r\n", "\n"):
                # NOTE empty lines are counted and dropped, including those inside the text
                empty_lines += 1
                continue
            if empty_lines >= newline_count:
                txt_lines.append(line)
                continue
            if ":" not in line:
                continue # not a key: value pair, skip it
            # we want to split by first colon ONLY (values such as URLs contain colons)
            key, value = line.split(":", 1)
            metadata[key.strip()] = value.strip()
    txt = "".join(txt_lines)
    return metadata, txt

# test on first file
with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    meta, txt = get_meta_txt_from_file(zip_ref, text_files[0])
print(meta)
print(txt[:1000])

{'title': 'SATURA RĀDĪTĀJS', 'subheadline': '', 'author': '', 'section': '', 'uri': 'http://dom.lndb.lv/data/obj/125563'}
I
i. VISPĀRĒJIE RAKSTI LPP.
Baumanis, A. Pieci gadi . . . . . .. . 119
— Gaumi un paliekošas vērtības . . . . . . . . . . . . 213
Druva, J. Strādāt un vērtēt . . . . . . . . . . . . . . 5
ļurevičs, P. Tradiciju jēga 7
Kalve, V. Svētīts ir darbs! ... . 121
Tichovskis, H. Jaunā Latgale 4°5
Ulmanis, K. Draudzīgais aicinājums . . . . 117
2. RAKSTNIECĪBA UN ŽURNĀLISTIKA
Aistars, E. Dzīvais cilvēks . . . .. . . . . 319
Baumanis, A, Žurnālista piezīmes . . 413
Bičolis, J. Par literātūras kritiku . . . ... . 420
Cedriņš, V. Edvarta Virzas piemiņai • 222
Grīns, A. Mūžīgās dzīvības dzejnieks 136
Kadilis, /. Galvenās problēmas mūsu literatūra 512
Kārkliņš, /. Par Kārli Skalbi . • • - 28
Korsaks, K. Svarīgākās problēmas lietuvju literātūrā 531
Nonācs.. O. Latviešu prese laiku mijās 14
Rabācs, K. Ziņotāja darbs laikrakstā 229
Skuja, V. Latgales rakstnieku attīstības gaitas Brīva

In [5]:
# now let's create a function that, given an open zip file and the path of a file inside it,
# returns the combined metadata (from the file content and from the file name) and the text
def get_all_meta_txt(zip_ref, src: str) -> tuple[dict[str, str], str]:
    meta_file, txt = get_meta_txt_from_file(zip_ref, src)
    meta_name = get_meta_from_filename(src)
    meta = {**meta_file, **meta_name}
    return meta, txt

# test on first file
with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    meta, txt = get_all_meta_txt(zip_ref, text_files[0])
print(*meta.items(), sep="\n")
print(txt[:1000])

('title', 'SATURA RĀDĪTĀJS')
('subheadline', '')
('author', '')
('section', '')
('uri', 'http://dom.lndb.lv/data/obj/125563')
('shortPublicationName', 'rama')
('dateIssued', '1940')
('issueNumber', '01')
('page', '005')
I
i. VISPĀRĒJIE RAKSTI LPP.
Baumanis, A. Pieci gadi . . . . . .. . 119
— Gaumi un paliekošas vērtības . . . . . . . . . . . . 213
Druva, J. Strādāt un vērtēt . . . . . . . . . . . . . . 5
ļurevičs, P. Tradiciju jēga 7
Kalve, V. Svētīts ir darbs! ... . 121
Tichovskis, H. Jaunā Latgale 4°5
Ulmanis, K. Draudzīgais aicinājums . . . . 117
2. RAKSTNIECĪBA UN ŽURNĀLISTIKA
Aistars, E. Dzīvais cilvēks . . . .. . . . . 319
Baumanis, A, Žurnālista piezīmes . . 413
Bičolis, J. Par literātūras kritiku . . . ... . 420
Cedriņš, V. Edvarta Virzas piemiņai • 222
Grīns, A. Mūžīgās dzīvības dzejnieks 136
Kadilis, /. Galvenās problēmas mūsu literatūra 512
Kārkliņš, /. Par Kārli Skalbi . • • - 28
Korsaks, K. Svarīgākās problēmas lietuvju literātūrā 531
Nonācs.. O. Latviešu prese laiku mijās

In [6]:
# ok how about the last file
with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    meta, txt = get_all_meta_txt(zip_ref, text_files[-1])
print(*meta.items(), sep="\n")
print(txt[:1000])

('title', 'SATURS')
('subheadline', '')
('author', '')
('section', '')
('uri', 'http://dom.lndb.lv/data/obj/237615')
('shortPublicationName', 'rama')
('dateIssued', '1940')
('issueNumber', '06')
('page', '086')
Lpp.
Kārlis Straubergs. Latviešu rakstniecība tagadējā laikmetā 501
Paulīne Bārda. Asara. Dzejolis . . 511
Jānis Kadilis. Galvenās problēmas mūsu literātūrā . . 512
Kostas Korsaks. Svarīgākās problēmas lietuvju literātūrā 531
Frideberts Tuglass. Igaunijas šīsdienas rakstniecība . 537
Viktors Skuja. Latgales rakstnieku attīstības gaitas Brīvajā Latvijā 546
APSKATS
Rakstu un mākslas kameras darbs 554
Kritika 563
C hronika 571
Bibliogrāfija 575
llūstrāciju pielikumi uz atsevišķām lapām: Ed. Kalniņš. lela Rēzekni. Fr. Varslavāns.
Rēzeknes nomale.
Nošu pielikums: J. Zālīts. Mazurka. (Klavierēm.)


In [7]:
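# now let's write a function that, given metadata and text, creates a doc element with the metadata as attributes and the text as content
from xml.sax.saxutils import escape, quoteattr

def create_xml(meta: dict[str, str], txt: str, strip_symbols=True) -> str:
    if strip_symbols:
        # strip < and > so that stray angle brackets in the text cannot be mistaken for tags
        txt = txt.replace("<", "").replace(">", "")
    # escape & (and any remaining < or >), otherwise the result is not well-formed XML
    txt = escape(txt)
    # quoteattr escapes each attribute value and wraps it in quotes
    attrs = " ".join(f"{k}={quoteattr(v)}" for k, v in meta.items())
    xml = f"<doc {attrs}>{txt}</doc>"
    return xml

# test on the file loaded in the previous cell (the last file)
xml = create_xml(meta, txt)
print(xml[:1000])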

<doc title="SATURS" subheadline="" author="" section="" uri="http://dom.lndb.lv/data/obj/237615" shortPublicationName="rama" dateIssued="1940" issueNumber="06" page="086">Lpp.
Kārlis Straubergs. Latviešu rakstniecība tagadējā laikmetā 501
Paulīne Bārda. Asara. Dzejolis . . 511
Jānis Kadilis. Galvenās problēmas mūsu literātūrā . . 512
Kostas Korsaks. Svarīgākās problēmas lietuvju literātūrā 531
Frideberts Tuglass. Igaunijas šīsdienas rakstniecība . 537
Viktors Skuja. Latgales rakstnieku attīstības gaitas Brīvajā Latvijā 546
APSKATS
Rakstu un mākslas kameras darbs 554
Kritika 563
C hronika 571
Bibliogrāfija 575
llūstrāciju pielikumi uz atsevišķām lapām: Ed. Kalniņš. lela Rēzekni. Fr. Varslavāns.
Rēzeknes nomale.
Nošu pielikums: J. Zālīts. Mazurka. (Klavierēm.)</doc>


## Creating the combined XML file

In [8]:
# now let's write a function that will take zip file and output file and will create xml file with all the text files
def create_xml_file(zip_file, output_file):
    # get list of text files in the zip file
    with zipfile.ZipFile(zip_file, 'r') as zip_ref:
        text_files = [f for f in zip_ref.namelist() if f.endswith(".txt")]

        with open(output_file, "w", encoding="utf-8") as f:
            for src in tqdm(text_files):
                meta, txt = get_all_meta_txt(zip_ref, src) # get metadata and text
                xml = create_xml(meta, txt)
                f.write(xml)
                f.write("\n")

# test zip_file
output_file = Path(r"I:\xml\raksti_un_maksla_articles.xml")

create_xml_file(zip_file, output_file)




100%|██████████| 220/220 [00:00<00:00, 426.32it/s]
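
The combined file is a sequence of `<doc>` elements without a single root element, so a standard XML parser will not accept it as is. One possible sanity check, sketched below, wraps the content in a temporary root element and counts the documents. The function name `count_docs`, the `<corpus>` wrapper and the demo file are assumptions made for illustration. The sketch reads the whole file into memory, which should be fine for a small corpus like this one.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def count_docs(xml_path: Path) -> int:
    """Parse a file of concatenated <doc> elements and return how many it holds."""
    content = xml_path.read_text(encoding="utf-8")
    # the file has no single root element, so add a temporary one for parsing only
    root = ET.fromstring(f"<corpus>{content}</corpus>")
    return len(root.findall("doc"))


# demonstration on a small invented file rather than the real corpus
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "demo.xml"
    demo_file.write_text(
        '<doc title="A" page="001">first text</doc>\n'
        '<doc title="B" page="002">second &amp; last text</doc>\n',
        encoding="utf-8",
    )
    n_docs = count_docs(demo_file)
print(f"Number of <doc> elements: {n_docs}")
```

If any document is malformed, for example because of an unescaped `&`, `ET.fromstring` raises `ParseError` with the position of the problem. For our corpus the count should equal the number of text files in the zip file.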
