# Working with Kielipankki data: loading and operating on VRT

Welcome to this Notebook! It's divided into cells, which are kind of like extremely advanced Excel cells. They contain either formatted text (like this one), or code in the Python programming language which can be run with the results displayed inline.

This is a text cell, but next we'll have a code cell:

In [1]:
print("I'm a code cell!") # And I'm a comment, so I will not show up in the output

I'm a code cell!


If you want to run the code in a code cell, you can hit `Control` + `enter` in the cell. You can also run cells from the menu at the top, under "Cell". There you can also change cell type. You can insert new cells under "Insert", or with keyboard shortcuts you can learn about there.

In this Notebook, run each code block in sequence with Control + enter on each one, and wait for them to finish. When a code block is running, an asterisk (`*`) shows on the left side, and when it's finished, a number appears indicating the order the code blocks have been executed in. If you want to change something in a code block, just hit enter and edit, or double-click on the code. The results will change once you run the code with your changes.

Now for some actual programming.

In [3]:
# These import statements bring in additional modules that we're going to use
import Vrt
import datetime

# Some helpers for the course environment
from course_utils import data_dir

If the earlier data downloading exercise has been completed successfully, you should have one or more `.vrt` files available for reading. Using some prepared code, we can read the VRT data into a nice Python data structure. `Vrt` is not a real software library, just some data ingestion code we wrote for this Notebook.
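To give an idea of what such ingestion code has to deal with: a VRT file has XML-like tags for structures such as texts and sentences, and between them one token per line, with the token's fields separated by tabs. The following is only a toy sketch of that idea, not the code inside `Vrt`. The sample data, the column order and the `parse_vrt` function are all made up for illustration.

```python
import re

# A tiny, made-up VRT snippet: tags for structure, one tab-separated token per line
sample_vrt = (
    '<text id="1" datefrom="20190101">\n'
    '<sentence id="1">\n'
    'Kissa\t1\tkissa\tN\n'
    'nukkuu\t2\tnukkua\tV\n'
    '.\t3\t.\tPunct\n'
    '</sentence>\n'
    '</text>\n'
)
field_names = ["word", "ref", "lemma", "pos"]  # assumed column order

def parse_vrt(vrt_string, field_names):
    """Return a list of texts, each a dict with its attributes and sentences."""
    texts = []
    for line in vrt_string.splitlines():
        if line.startswith("<text"):
            # Collect the name="value" pairs of the opening tag
            attributes = dict(re.findall(r'(\w+)="([^"]*)"', line))
            texts.append({"attributes": attributes, "sentences": []})
        elif line.startswith("<sentence"):
            texts[-1]["sentences"].append([])
        elif line.startswith("<") or not line.strip():
            continue  # closing tags, other structures, blank lines
        else:
            # A token line: pair each tab-separated value with its field name
            values = line.split("\t")
            texts[-1]["sentences"][-1].append(dict(zip(field_names, values)))
    return texts

parsed = parse_vrt(sample_vrt, field_names)
print(parsed[0]["attributes"])
print(parsed[0]["sentences"][0][1]["lemma"])
```

Real VRT files have more fields and more kinds of structure than this, which is why we use prepared code instead.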

In [4]:
# If you have a different file, change the name here!
vrt_filename = "M.vrt"
vrt = Vrt.VrtReader(data_dir + vrt_filename)
# Now the variable "vrt" is of the type "VrtReader" and we can use some fields and functions special to it.
# For example, there's an "info" function that gives a helpful textual summary of its features.
print(vrt.info())


Document contains
    2,324 texts
    74,108 sentences
    847,135 tokens
Text attributes are
    timeto
    datetime_json_modified
    publisher
    datetime_published
    dateto
    timefrom
    datetime_content_modified
    departments
    datefrom
    main_department
    id
    url
Sentence attributes are
    paragraph_type
    type
    id
Token fields are
    word
    ref
    lemma
    lemmacomp
    pos
    msd
    dephead
    deprel
    lex/



`VrtReader` has a field called `texts`, which is a list of texts, and each text has a `date`. We can define a list of dates, and find the minimum (earliest) and maximum (latest) dates like this:

In [4]:
dates = [text.date for text in vrt.texts]
min_date = min(dates)
max_date = max(dates)
# We can convert dates to text and print them
print("The earliest text is from " + str(min_date))
print("The latest text is from " + str(max_date))

The earliest text is from 2018-12-31
The latest text is from 2019-01-13
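Dates can do more than be printed. The sketch below assumes that each `text.date` is a standard `datetime.date` object, which the printed format suggests but which we haven't verified here. It uses a few made-up dates instead of the real data, so that it runs on its own.

```python
import datetime
from collections import Counter

# Made-up dates standing in for [text.date for text in vrt.texts]
example_dates = [
    datetime.date(2018, 12, 31),
    datetime.date(2019, 1, 5),
    datetime.date(2019, 1, 5),
    datetime.date(2019, 1, 13),
]

# Subtracting two dates gives a datetime.timedelta
span = max(example_dates) - min(example_dates)
print("The texts span " + str(span.days) + " days")

# Counter tallies how many times each date occurs
texts_per_day = Counter(example_dates)
for day, count in sorted(texts_per_day.items()):
    print(str(day) + ": " + str(count))
```

If the assumption holds, replacing `example_dates` with the `dates` list from the cell above should give the same kind of summary for the real file.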


Even though `Vrt` is not a "real" library, it can still give some helpful information about what it can do:

In [5]:
help(Vrt)

Help on module Vrt:

NAME
    Vrt

CLASSES
    builtins.object
        VrtReader
    
    class VrtReader(builtins.object)
     |  VrtReader(vrt_file)
     |  
     |  Read and provide an interface to a VRT file.
     |  
     |  Methods defined here:
     |  
     |  __init__(self, vrt_file)
     |      vrt_file should be either a file object or a file name.
     |  
     |  info(self)
     |      Return a multi-line string with some summary information.
     |  
     |  map_tokens_to_field(self, tokens, field)
     |      Given a list of tokens and a field name, return a list of values in that field
     |  
     |  token_field_is_value(self, token, field, value)
     |      True iff given token has given value in given field
     |  
     |  token_values(self, field)
     |      Return a list of values seen in a named field
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |  

Let's see what the functions `map_tokens_to_field()`, `token_field_is_value()` and `token_values()` can do.
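Before trying them on real data, here is a toy sketch of what helpers with docstrings like the ones above might do. This is a guess for illustration, not the actual `Vrt` code: it assumes tokens are plain Python dictionaries, and the `toy_` functions and the example tokens are made up.

```python
# Made-up tokens; the real VrtReader may represent tokens differently
toy_tokens = [
    {"word": "Kissa", "lemma": "kissa", "pos": "N"},
    {"word": "nukkuu", "lemma": "nukkua", "pos": "V"},
    {"word": "ja", "lemma": "ja", "pos": "C"},
    {"word": "kehrää", "lemma": "kehrätä", "pos": "V"},
]

def toy_token_field_is_value(token, field, value):
    """True if the token has the given value in the given field."""
    return token.get(field) == value

def toy_map_tokens_to_field(tokens, field):
    """Return the value of one field for each token, in order."""
    return [token[field] for token in tokens]

def toy_token_values(tokens, field):
    """Return the distinct values seen in one field, sorted."""
    return sorted(set(token[field] for token in tokens))

# Filter first, then pick out a field from the tokens that remain
toy_verbs = [t for t in toy_tokens if toy_token_field_is_value(t, "pos", "V")]
print(toy_map_tokens_to_field(toy_verbs, "lemma"))
print(toy_token_values(toy_tokens, "pos"))
```

The pattern to notice is "filter, then map": first keep only the tokens you care about, then extract the one field you want from each of them.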

First, `token_values()` lists what values a token field has, i.e. what annotations have been used in, say, the `pos` field.

In [9]:
vrt.token_values('pos')

`map_tokens_to_field()` and `token_field_is_value()` can be used to filter and select tokens and fields. For now, let's do that for a single text, using the `tokens()` method of the first text in the file, i.e. `vrt.texts[0]`:

In [11]:
# This will be a list of those tokens that have "V" as their "pos"
verbs_only = [
    token
    for token in vrt.texts[0].tokens()
    if vrt.token_field_is_value(token, "pos", "V")
]
# This will be a list of the lemmas of those tokens
verb_lemmas = vrt.map_tokens_to_field(verbs_only, "lemma")
print(verb_lemmas)

['kääntyä', 'katsoa', 'huomata', 'olla', 'hohtaa', 'olla', 'muistaa', 'särkyä', 'karata', 'näyttää', 'olla', 'jatkua', 'ajatella', 'kokeilla', 'vaikuttaa', 'kiihtyä', 'seurata', 'muistuttaa', 'olla', 'ostaa', 'olla', 'ettei', 'nähdä', 'keittää', 'olla', 'lähteä', 'ellei', 'olla', 'sattua', 'laittaa', 'alkaa', 'syöttää', 'alkaa', 'kehkeytyä', 'olla', 'tietää', 'ettei', 'sijaita', 'ei', 'pystyä', 'kurkistaa', 'olla', 'peljätä', 'joutua', 'soittaa', 'ei', 'näyttää', 'kiihtyä', 'ei', 'haistaa', 'olla', 'syttyä', 'palaa', 'olla', 'terästää', 'tulla', 'tutkia', 'olla', 'onkia', 'suoristaa', 'tehdä', 'alkaa', 'nukuttaa', 'estää', 'nukahtaa', 'kontata', 'koettaa', 'työntää', 'hohtaa', 'olla', 'pulpahtaa', 'olla', 'tulla', 'riippua', 'kulkea', 'kadota', 'olla', 'hyytyä', 'olla', 'leijua', 'olla', 'upota', 'ilmestyä', 'miettiä', 'olla', 'koskea', 'alkaa', 'imeytyä', 'olla', 'kiristyä', 'alkaa', 'kiihtyä', 'upota', 'olla', 'alkaa', 'suoltua', 'kadota', 'olla', 'olla', 'olla', 'olla', 'tehdä', 'ki
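A long list like this is hard to read by eye. One simple way to summarize it is to count how often each lemma occurs, using `Counter` from Python's standard library. The sketch below uses just the first ten lemmas of the output above, copied by hand, so that it runs on its own. The variable names are made up for this example.

```python
from collections import Counter

# The first ten lemmas from the output above, copied by hand
some_verb_lemmas = ['kääntyä', 'katsoa', 'huomata', 'olla', 'hohtaa',
                    'olla', 'muistaa', 'särkyä', 'karata', 'näyttää']

# Counter maps each distinct lemma to the number of times it occurs
lemma_counts = Counter(some_verb_lemmas)

# most_common(n) returns the n most frequent items with their counts
print(lemma_counts.most_common(3))
```

To count the verbs of the whole first text, pass `verb_lemmas` to `Counter` instead of the hand-copied list.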

Okay, that's enough for this Notebook. Feel free to play around with the data; you won't break anything. See you in the next Notebook, where we'll do some analysis!