# Annotating Spanish Text for the Corpus Workbench

This demonstration shows how to prepare and annotate raw text data for import into the IMS Open Corpus Workbench (CWB). It assumes you know some basics of Python and NLP.

Since working with raw (meaning unstructured) text data can be tricky due to variations in text formatting and encoding, and to random encounters with messy data, let us first set some parameters for the input and output.

## The Input Data

For this example project we assume somebody, or something, provided Spanish text data in the following format: a single file contains several articles of varying length. Each article starts with a headline on its own line, and a blank line separates one article from the next.

Example:

```
Cuba
Cuba, oficialmente la República de Cuba, es un país soberano insular del Caribe, asentado en un archipiélago del Mar Caribe. El territorio está organizado en quince provincias y un municipio especial con La Habana como capital y ciudad más poblada.

Español cubano 
El español cubano es la variedad del idioma español empleado en Cuba. Es un subdialecto del español caribeño con pequeñas diferencias regionales, principalmente de entonación y léxico, entre el occidente y el oriente de la isla.
```

Our task is to extract each article and encode it for CWB.

## The Output Data

To import corpus data into the IMS Open Corpus Workbench (CWB) we need to encode it using the CWB format. The standard CWB input format is one-word-per-line text, with the surface form in the first column and token-level annotations specified as additional TAB-separated columns. See: http://cwb.sourceforge.net/documentation.php

Example:

```
<s>
This    DT this
is      VB be
an      DT a
example NN example
</s>
```

Here we see the surface form in the first column, followed by a POS tag and a lemma for each token. We also include the XML tag `<s>` to mark sentence boundaries.
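
To make the layout concrete, here is a minimal sketch that renders one sentence in this format. The function name `to_vertical` and the `(word, pos, lemma)` tuple layout are assumptions made for illustration only; they are not part of CWB.

```python
def to_vertical(tokens, tag='s'):
    """
    Renders (word, pos, lemma) tuples as one-token-per-line text,
    wrapped in an opening and a closing XML tag.
    """
    lines = ['<{}>'.format(tag)]
    for word, pos, lemma in tokens:
        # Columns are separated by a single TAB character
        lines.append('\t'.join([word, pos, lemma]))
    lines.append('</{}>'.format(tag))
    return '\n'.join(lines)


example = [('This', 'DT', 'this'), ('is', 'VB', 'be'),
           ('an', 'DT', 'a'), ('example', 'NN', 'example')]
print(to_vertical(example))
```

Returning a string instead of printing directly keeps the formatting step easy to test.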

## Program Architecture

The final program will be a command-line tool that takes one input file and writes the encoded data to *stdout*.

This way, we can write leaner code and use simple Linux/Unix commands to scale our tool for more data. Python and spaCy (http://spacy.io/) will do the heavy lifting, while the shell decides what goes in.

## Prerequisites

To run this code you need: 

 * Python 3.6
 * A Python virtual environment
 * spaCy and a Spanish Language Model installed
 
Howto:
```
python3 -m venv .venv
source .venv/bin/activate
pip3 install spacy
python3 -m spacy download es_core_news_md
```

Enough formalities, let's write some code.

In [None]:
import spacy
import sys
import collections

In [None]:
# Loading spacy's Spanish Language Model globally
NLP = spacy.load('es_core_news_md')

Let us now set up some helper functions to structure our script. Always remember: Functions are your friends.

Since every document is a tuple of headline and body, we will represent each document as such in a data structure.

We will read the input file and pass all its lines into this little function.

In [None]:
def create_document(lines_of_document):
    """
    Returns tuple for headline and body.
    """

    Document = collections.namedtuple('Document', ['header', 'body'])
    document = Document(header=lines_of_document[0],
                        body=' '.join(lines_of_document[1:]))

    return document

In [None]:
# Example call
create_document(['This is the headline', 'I am a line from the body!', 'Me too.'])

Since we can have multiple articles in a single file, we need to extract each article and pass it into our new function.

This function will do that given all lines of a file. We will assume that articles are separated by a blank line and create a list of Document tuples.

In [None]:
def create_document_list(lines_of_file):
    """
    Creates a list of Document tuples from all the lines of a file.
    """
    
    document = []
    documents = []

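    for line in lines_of_file:
        stripped = line.rstrip()

        if stripped:
            document.append(stripped)
        elif document:
            # A blank line closes the current article
            documents.append(create_document(document))
            document = []

    # The last article may not be followed by a blank line
    if document:
        documents.append(create_document(document))

    return documents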
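
The splitting step can also be expressed with `itertools.groupby` from the standard library, which groups consecutive blank and non-blank lines. The following is only a sketch of that alternative; the name `split_articles` is made up for this example, and the `Document` tuple is redefined here so that the snippet runs on its own.

```python
import collections
import itertools

Document = collections.namedtuple('Document', ['header', 'body'])


def split_articles(lines_of_file):
    """
    Groups consecutive non-blank lines into Document tuples.
    """
    stripped = (line.strip() for line in lines_of_file)
    documents = []
    # groupby yields alternating runs of blank and non-blank lines
    for has_text, group in itertools.groupby(stripped, key=bool):
        if has_text:
            block = list(group)
            documents.append(Document(header=block[0],
                                      body=' '.join(block[1:])))
    return documents


sample = ['Cuba\n', 'Cuba es un país.\n', '\n',
          'Español cubano \n', 'El español cubano es una variedad.\n']
for doc in split_articles(sample):
    print(doc.header, '|', doc.body)
```

Several blank lines in a row, or a missing blank line at the end of the file, do not produce empty or lost documents with this approach.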

Now let us read in the actual input file. Along with this Notebook, an example file (2017_11_03_example-text.txt) is provided.

In a perfect world, working with text data would be straightforward. Everything is properly encoded using Unicode, right? Wrong!

Data provided by third parties in particular can be messy. Different operating systems, editors and so on can make life hard. Let's write a more robust import to mitigate this:

In [None]:
filepath = './2017_11_03_example-text.txt'
filename = filepath.split('/')[-1]

lines = None

# Try UTF-8 first, then fall back to the Windows encoding
for encoding in ('utf-8', 'cp1252'):
    try:
        with open(filepath, encoding=encoding) as input_doc:
            lines = input_doc.readlines()
        break
    except UnicodeDecodeError:
        continue

if lines is None:
    # Since we want to pipe stdout, we need to change the target
    print('Could not convert file: {}'.format(filename), file=sys.stderr)
    sys.exit(1)

documents = create_document_list(lines)
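
The same fallback idea can be wrapped in a reusable helper. The sketch below is one possible shape for it; the function name `read_lines` and its `encodings` parameter are assumptions made for this example. It is demonstrated on a temporary file written in the Windows encoding, so it does not depend on the example file.

```python
import os
import tempfile


def read_lines(filepath, encodings=('utf-8', 'cp1252')):
    """
    Returns (lines, encoding) for the first encoding that decodes the
    whole file, or raises ValueError if none of them does.
    """
    for encoding in encodings:
        try:
            with open(filepath, encoding=encoding) as handle:
                return handle.readlines(), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('Could not decode file: {}'.format(filepath))


# Demonstration with a temporary file written in the Windows encoding
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.txt')
    with open(path, 'w', encoding='cp1252') as handle:
        handle.write('República de Cuba\n')
    lines, used = read_lines(path)
    print(used, lines)
```

The order of the encodings matters: cp1252 accepts almost any byte sequence, so the stricter UTF-8 should be tried first.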

In [None]:
documents

This list of Document tuples can now be easily processed. But first we want to get some metadata for our corpus.

Since in this scenario we have nothing but the filename to extract metadata from, we will do just that. The CWB encoding allows XML-like tags to structure the document. Thus, we will create a tag pair such as `<text author="Arthur"></text>` to wrap each document and use XML attributes to describe the metadata.

Again we will write some functions to do that, since functions are our friends, remember? This function will take the filename, extract whatever metadata we can get and output an XML tag.

In [None]:
def print_header(filename):
    """
    Prints document head with its attributes
    Input should be like: 2017_11_03_example-text.txt
    """

    date_list = filename[0:10].split('_')
    # Hint: CWB Metadata cannot contain dashes -
    name = 'id="{}"'.format(filename[0:-4].replace('-', '_'))
    date = 'date="{}"'.format('_'.join(date_list))
    year = 'year="{}"'.format(date_list[0])
    month = 'month="{}"'.format(date_list[1])
    day = 'day="{}"'.format(date_list[2])

    header = '<text {} {} {} {} {}>'.format(name, date, year, month, day)

    print(header)

def print_footer():
    """
    Prints end of document. Just for symmetry.
    """
    print('</text>')

In [None]:
print_header('2017_11_03_example-text.txt')
print_footer()
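
The slicing in `print_header` relies on the filename starting with a date and ending in a three-letter extension. As a hedged sketch, the variant below checks the date prefix with `datetime.strptime` and strips the extension with `os.path.splitext`. The name `build_header` is an assumption for this example, and the function returns the tag instead of printing it.

```python
import datetime
import os


def build_header(filename):
    """
    Builds the opening <text> tag from a name like
    2017_11_03_example-text.txt and validates the date prefix.
    """
    stem, _extension = os.path.splitext(filename)
    # Raises ValueError if the first ten characters are not a real date
    date = datetime.datetime.strptime(stem[:10], '%Y_%m_%d')

    attributes = [
        ('id', stem.replace('-', '_')),
        ('date', date.strftime('%Y_%m_%d')),
        ('year', date.strftime('%Y')),
        ('month', date.strftime('%m')),
        ('day', date.strftime('%d')),
    ]
    pairs = ' '.join('{}="{}"'.format(key, value)
                     for key, value in attributes)
    return '<text {}>'.format(pairs)


print(build_header('2017_11_03_example-text.txt'))
```

A file without a valid date prefix then fails early with a `ValueError` instead of producing a header with meaningless attributes.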

This will enable us to import each document with its metadata into the CWB. Hint: CWB metadata should not contain dashes, especially the ID (which is used for indexing the documents).

Now we can start encoding the body. We will use spaCy to annotate our data and mark each sentence with XML tags.

This function will be able to encode both the body and the headline, using the tag parameter. 

In [None]:
def print_cwb(document, tag='<s>'):
    """
    Annotates and prints the sentences in CQP format
    """

    doc = NLP(document)
    for sentence in doc.sents:
        print(tag)

        # The sentence span already holds the annotated tokens,
        # so there is no need to run the pipeline a second time
        for token in sentence:
            print('{word}\t{pos}\t{lemma}'.format(
                word=token.text,
                pos=token.pos_,
                lemma=token.lemma_))

        print(tag.replace('<', '</'))

Note that lemmatization is slightly off at the start of a sentence: the lemma keeps the capital letter of the surface form, although it should be in lower case. However, we will not fix that here.
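
If the capitalized lemmas turn out to be a problem, one possible post-processing step is to lowercase the lemma of the first token unless it is tagged as a proper noun (`PROPN` in the tag set returned by `token.pos_`). The sketch below works on plain `(word, pos, lemma)` tuples so that it runs without spaCy; the function name `normalize_first_lemma` is made up for this example, and the heuristic is only as good as the tagger's proper noun decisions.

```python
def normalize_first_lemma(rows):
    """
    Lowercases the lemma of the first token of a sentence,
    unless that token is tagged as a proper noun.
    rows is a list of (word, pos, lemma) tuples for one sentence.
    """
    if not rows:
        return rows
    word, pos, lemma = rows[0]
    if pos != 'PROPN':
        lemma = lemma.lower()
    return [(word, pos, lemma)] + list(rows[1:])


sentence = [('El', 'DET', 'El'), ('territorio', 'NOUN', 'territorio')]
print(normalize_first_lemma(sentence))
```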

Finally, putting all the parts together, we now have a small clean main loop to generate the output.

In [None]:
for doc in documents:
    print_header(filename)
    print_cwb(doc.header, tag='<h1>')
    print_cwb(doc.body)
    print_footer()

As mentioned at the beginning, we want this to be a command-line tool. Therefore, we should add the possibility of passing the input file as a parameter.

Since this is a Notebook, we cannot demonstrate this properly here. However, this would be the snippet:

In [None]:
from argparse import ArgumentParser

parser = ArgumentParser(description='Convert TXT document into CWB format')

parser.add_argument(
    '--input',
    help='Input TXT file to convert',
    dest='input',
    required=True)
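
To read the parameter, the parser still has to be asked to parse something. A minimal sketch, assuming we wrap the parser in a function (the name `parse_arguments` is hypothetical): passing an explicit list to `parse_args` lets us try it inside a Notebook, while `None` makes it read the real command line.

```python
from argparse import ArgumentParser


def parse_arguments(argv=None):
    """
    Parses the command-line arguments. Passing a list makes the
    function usable in a notebook; None falls back to sys.argv.
    """
    parser = ArgumentParser(
        description='Convert TXT document into CWB format')
    parser.add_argument(
        '--input',
        help='Input TXT file to convert',
        dest='input',
        required=True)
    return parser.parse_args(argv)


args = parse_arguments(['--input', './2017_11_03_example-text.txt'])
filepath = args.input
print(filepath)
```

In the script itself, calling `parse_arguments()` without a list would pick up whatever was passed on the command line.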

To scale our tool, we can let the shell feed it large amounts of data, one file at a time. This way, the tool itself doesn't have to worry about handling many files.

Example:
```
#!/usr/bin/env sh

for f in myfiles/*.txt; do
    python3 converter.py --input "$f" > "${f}.cwb"
done
```

The full script and Shell wrapper are provided.