# Part-of-speech tagging with *stanza*



This is a Python notebook for part-of-speech (POS) tagging that is meant to be used in Google Colab.

The following code will use the Python library *stanza* to tokenize and annotate text.

Further information on *stanza*:

*   How to use *stanza*: https://stanfordnlp.github.io/stanza/index.html
*   The paper on *stanza*: https://aclanthology.org/2020.acl-demos.14/

This code is written specifically to ignore XML tags: they are printed unchanged, and only the raw text between them is tokenized and annotated. This is ideal for use with the IMS Open Corpus Workbench, where XML tags provide metadata on each text and should not be annotated.
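
The separation of tags and raw text can be illustrated without *stanza*. The following is a minimal sketch that uses the same regular expression as the tagging cells further down; the input string is made up for illustration.

```python
import re

# Hypothetical input: one XML tag pair wrapped around raw text
snippet = '<text id="demo">Hello world.</text>'

# group 1 captures XML tags, group 2 captures the text between tags
parts = re.findall(r'(<[^>]+>)|([^<]+)', snippet)

for tag, text in parts:
    if tag:
        print("TAG: ", tag)   # would be written unchanged
    else:
        print("TEXT:", text)  # would be passed to the tagger
```

Each match fills exactly one of the two groups, so the other one is an empty string. This is why the tagging cells can simply test ```if tag:``` to tell the two cases apart.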

In theory, you can also run this notebook locally on your computer, provided the necessary dependencies (e.g. Jupyter Notebook) are installed. In that case, you cannot use any commands specific to Google Colab, and you will likely not have a suitable GPU to make your code run faster.
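
If you do want to run the notebook in both environments, one possible approach is to check whether the ```google.colab``` package can be imported. This is only a sketch: ```running_in_colab``` and ```IN_COLAB``` are hypothetical names that do not appear elsewhere in this notebook.

```python
def running_in_colab() -> bool:
    """Return True if the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (normally only available inside Colab)
    except ImportError:
        return False
    return True


IN_COLAB = running_in_colab()
print("Running in Google Colab:", IN_COLAB)
```

Colab-only steps such as mounting Google Drive or downloading the result could then be wrapped in ```if IN_COLAB:```, with a local file path used otherwise.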

## Use the GPU
Be sure to set the runtime in Google Colab to a GPU, so the code will run faster. This makes a huge difference with large files!

Google Colab will not select a GPU by default. To change to a GPU in Google Colab, click on **Runtime** above, select **Change runtime type** and select a GPU. Your notebook will disconnect and reconnect, so you may have to rerun cells.

Google Colab may limit the number of times you can use a GPU per day, but this should not be a problem, as the code works on the default settings (CPU) as well.

## Install and import *stanza*


...and other necessary modules.

In [None]:
!pip install stanza

In [2]:
import stanza
import re
import os

## Download the language model for English

In [None]:
stanza.download("en")
print("Language model downloaded")
nlp = stanza.Pipeline("en", processors="tokenize, lemma, pos", verbose=False, use_gpu=True)

For other languages and their corresponding codes (lcode), please see this list: https://stanfordnlp.github.io/stanza/performance.html

For example, for German, replace "en" with "de"; for Dutch, use "nl". Make the replacement in both ```stanza.download(...)``` and ```stanza.Pipeline(...)```.

## Verify that you're using a GPU

The output of this cell should be ```True``` if the GPU is active and ```False``` if it is not.

In [None]:
import torch

torch.cuda.is_available()
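
The check above returns a bare boolean. If you prefer a readable message, a small helper along these lines could be used. It is a sketch: ```gpu_available``` is a hypothetical name, and the ```ImportError``` fallback only matters on machines where PyTorch is not installed.

```python
def gpu_available() -> bool:
    """Return True only if PyTorch is installed and sees a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


device = "GPU" if gpu_available() else "CPU"
print(f"A CUDA device is {'available' if device == 'GPU' else 'not available'}: expect the {device}.")
```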

## Test the tagger with shorter texts

You may find this useful to gain an impression of how well the tagger performs with your type of text, for instance, a song by Dr. Dre and Snoop Dogg.

In [5]:
# Run this cell to set the sample text

sample = """

And even when I was close to defeat, I rose to my feet
My life's like a soundtrack I wrote to the beat
Treat rap like Cali weed: I smoke 'til I'm 'sleep
Wake up in the A.M., compose a beat
I bring the fire 'til you're soakin' in your seat

"""

For each token, the next cell prints four values on one line: the **token** itself, its **Penn Treebank tag** (xpos), the corresponding **Universal Dependencies tag** (pos), and its **lemma** (lemma), converted to lowercase.

See here for further information on the tagsets:

*   **Universal Dependencies Tagset**: https://universaldependencies.org/u/pos
*   **Penn Treebank Tagset**: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html


The output will be a tab-separated list, with each token on a separate line.
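
To make the row format concrete, here is a minimal sketch that builds such rows by hand. ```format_row``` is a hypothetical helper, and the tag values are written out manually for illustration; they are not actual tagger output.

```python
def format_row(text, xpos, upos, lemma):
    """Join the four annotation fields with tabs, lemma in lowercase."""
    return "\t".join([text, xpos, upos, lemma.lower()])


# Hand-written example values, not produced by the tagger
rows = [
    format_row("I", "PRP", "PRON", "I"),
    format_row("rose", "VBD", "VERB", "rise"),
]
print("\n".join(rows))
```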


In [None]:
# Run this cell to tag the sample text

# pattern to split XML tags and non-tags
# group 1: XML tags, group 2: non-tag text
parts = re.findall(r'(<[^>]+>)|([^<]+)', sample)


for tag, text in parts:
    if tag:
        # it's an XML tag, so print it unchanged
        # (print() already ends the line, so no extra "\n" is needed)
        print(tag)
    elif text.strip():
        # it's text content (not just whitespace), so tokenize, annotate and print
        doc = nlp(text)

        for sentence in doc.sentences:
            for word in sentence.words:
                print(f"{word.text}\t{word.xpos}\t{word.pos}\t{word.lemma.lower()}")


## Use the tagger with a text file



### Connect this notebook to your Google Drive account

If you run this cell, a window should open, asking for permission. If you grant it, another window will have you select the Google Drive account and specify which permissions to grant. You need to allow the notebook to open, read, edit and delete files. This will likely be only one checkbox.

If everything worked, the output of the cell should say: ```Mounted at /content/drive```.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

If your file is saved in your Google Drive, the notebook can access it with the following cells.

**The file itself will not be changed!** It will only be opened in read mode ("r"), meaning its contents will be read but not modified. The notebook will produce a new file with the tagged text.

When inserting the name of your file in the cell below, please include the file extension as well. For a file named "test.txt", the extension would be ".txt".
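
The extension matters because the name of the output file is derived from it. A minimal sketch of that step, using a hypothetical helper called ```vrt_name``` and made-up file names:

```python
import os


def vrt_name(filename: str) -> str:
    """Replace the extension of `filename` with .vrt."""
    name, _ext = os.path.splitext(filename.strip())
    return name + ".vrt"


print(vrt_name("test.txt"))
print(vrt_name("my.corpus.xml"))
```

Only the last extension is replaced, so a file name that contains additional dots keeps them.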


In [8]:
# Insert the name of your file inside the quotation marks
FILE = "test.txt"

In [9]:
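FILE = FILE.strip()

# derive the output name: same base name, extension ".vrt"
name, ext = os.path.splitext(FILE)
output_file = name + ".vrt"

# read the input file (read-only; assumes the file is UTF-8 encoded)
with open(f"/content/drive/My Drive/{FILE}", "r", encoding="utf-8") as input_file:
    content = input_file.read()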

With Google Colab set to use the GPU, this code needs approximately 6 minutes to tokenize and annotate a file with over 100,000 tokens.

Be aware that the following code will **automatically download** the created file to your computer, unless the indicated line is commented out. Downloading the file immediately is recommended, because all files stored in the Colab runtime are deleted as soon as the runtime disconnects (e.g. if the notebook has been inactive for a while).

The downloaded file will have the name of the original input file with the extension changed to ".vrt" (VeRTicalized file). This is just a normal text file that can be opened with any editor. Find out more here: https://fedora.clarin-d.uni-saarland.de/teaching/Comparing_Corpora_Tutorials/Tutorial_VRT.html
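
Because a .vrt file is plain text, it can also be read back with a few lines of Python. The following is a sketch under the assumption that tag lines start with "<" and all other non-empty lines are tab-separated token rows; ```parse_vrt``` is a hypothetical helper and the example content is written by hand.

```python
def parse_vrt(vrt_text):
    """Split VRT content into XML tag lines and token rows."""
    tags, rows = [], []
    for line in vrt_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        if line.startswith("<"):
            tags.append(line)
        else:
            rows.append(line.split("\t"))
    return tags, rows


# Hand-written example, not actual tagger output
example = '<text id="demo">\nHello\tUH\tINTJ\thello\n.\t.\tPUNCT\t.\n</text>\n'
tags, rows = parse_vrt(example)
print(tags)
print(rows)
```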

Apart from two short status messages, this cell produces no output.


In [None]:
# Run this cell to tag the file

from google.colab import files

# pattern to split XML tags and non-tags
# group 1: XML tags, group 2: non-tag text
parts = re.findall(r'(<[^>]+>)|([^<]+)', content)

with open(output_file, "w", encoding="utf-8") as out:
    for tag, text in parts:
        if tag:
            # it's an XML tag, so write it unchanged
            out.write(tag + "\n")
        elif text.strip():
            # it's text content (not just whitespace), so tokenize, annotate and write
            doc = nlp(text)

            for sentence in doc.sentences:
                for word in sentence.words:
                    out.write(
                        f"{word.text}\t{word.xpos}\t{word.pos}\t{word.lemma.lower()}\n")

print(f"Tagged text written to {output_file}.")

# download the output file to your computer
files.download(output_file)
print(f"{output_file} downloaded.")