# Working with Kielipankki data: loading and operating on VRT

Welcome to this Notebook! It's divided into cells, which are a bit like very advanced spreadsheet cells. Each cell contains either formatted text (like this one) or code in the Python programming language, which can be run with the results displayed inline.

This is a text cell, but next we'll have a code cell:

In [None]:
print("I'm a code cell!") # And I'm a comment, so I will not show up in the output

If you want to run the code in a code cell, you can hit `Control` + `enter` in the cell. You can also run cells from the menu at the top, under "Cell". There you can also change cell type. You can insert new cells under "Insert", or with keyboard shortcuts you can learn about there.

In this Notebook, run each code block in sequence with Control + enter on each one, and wait for them to finish. When a code block is running, an asterisk (`*`) shows on the left side, and when it's finished, a number appears indicating the order the code blocks have been executed in. If you want to change something in a code block, just hit enter and edit, or double-click on the code. The results will change once you run the code with your changes.

Now for some programming magic...

In [None]:
# These import statements bring in additional modules that we're going to use
import datetime

# Some helpers for the course environment
from course_utils import VrtReader, data_dir

If the earlier data downloading exercise has been completed successfully, you should have a `.vrt` file, or several, available for reading. Using some prepared code we can now read the VRT data into a nice Python data structure. `course_utils` is not a real software library, but just some data ingestion code we wrote to accomplish this for this Notebook.

In [None]:
# If you have a different file, change the name here!
vrt_filename = "eduskunta-v1.5-vrt/eduskunta.vrt"
vrt = VrtReader(data_dir + vrt_filename, max_texts=500)
# Now the variable "vrt" is of the type "VrtReader" and we can use some fields and functions special to it.
# For example, there's an "info" function that gives a helpful textual summary of its features.
print(vrt.info())
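
To give a rough idea of what reading VRT involves, here is a minimal sketch. It is not the code in `course_utils`. It only assumes the general shape of VRT: XML-like structural tags on their own lines, and one token per line with tab-separated fields. The sample data, the `date` attribute, the column order in `FIELD_NAMES` and the function `parse_vrt` are all made up for illustration.

```python
import re

# Hypothetical miniature VRT sample: structural tags plus one token per line,
# with tab-separated fields (assumed here to be word, lemma, pos).
SAMPLE_VRT = (
    '<text date="2015-03-10">\n'
    '<sentence>\n'
    'Kissat\tkissa\tN\n'
    'nukkuvat\tnukkua\tV\n'
    '</sentence>\n'
    '</text>\n'
)

FIELD_NAMES = ["word", "lemma", "pos"]  # assumed column order, not from a real corpus
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_vrt(content):
    """Return a list of texts, each a dict with 'attrs' and 'tokens'."""
    texts = []
    current = None
    for line in content.splitlines():
        if line.startswith("<text"):
            # Start a new text and remember the attributes of its opening tag
            current = {"attrs": dict(ATTR_RE.findall(line)), "tokens": []}
        elif line.startswith("</text"):
            texts.append(current)
            current = None
        elif line.startswith("<") or not line.strip():
            continue  # other structural tags and blank lines are skipped in this sketch
        elif current is not None:
            # A token line: pair each tab-separated value with its field name
            current["tokens"].append(dict(zip(FIELD_NAMES, line.split("\t"))))
    return texts


texts = parse_vrt(SAMPLE_VRT)
print(texts[0]["attrs"]["date"], [t["lemma"] for t in texts[0]["tokens"]])
```

A real reader has to deal with more than this, such as nested structures and escaped characters, which is why we use the prepared helper instead.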

`VrtReader` has a field called `texts`, which is a list of texts, and each text has a `date`. We can define a list of dates, and find the minimum (earliest) and maximum (latest) dates like this:

In [None]:
dates = [text.date for text in vrt.texts]
min_date = min(dates)
max_date = max(dates)
# We can convert dates to text and print them
print("The earliest text is from " + str(min_date))
print("The latest text is from " + str(max_date))
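
`min()` and `max()` work on any values that can be compared with each other. As a small sketch, assuming the dates behave like `datetime.date` objects from Python's standard library (we have not checked that this is what `VrtReader` returns), you could also compute how many days the texts span. The dates below are made-up stand-ins.

```python
import datetime

# Hypothetical stand-in dates; in the Notebook these would come from vrt.texts
sample_dates = [
    datetime.date(2015, 3, 10),
    datetime.date(2014, 11, 2),
    datetime.date(2015, 6, 24),
]

earliest = min(sample_dates)
latest = max(sample_dates)
span = latest - earliest  # subtracting two dates gives a datetime.timedelta

print("Earliest:", earliest.isoformat())
print("Latest:", latest.isoformat())
print("Days covered:", span.days)
```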

Even though `VrtReader` is not a "real" library, it can still give some helpful information about what it can do:

In [None]:
help(VrtReader)

Let's see what the functions `field_values()`, `map_tokens_to_field()` and `token_field_is_value()` can do.

First, `field_values()` lists the values that a token field takes, i.e. which annotations have been used in, say, the `pos` field.

In [None]:
vrt.field_values('pos')
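
Conceptually, listing the values of a field amounts to collecting the distinct values seen across all tokens. Here is a rough sketch of that idea, with tokens represented as plain dictionaries. The token representation and the function `distinct_values` are assumptions for illustration; `VrtReader` may well store and return things differently.

```python
# Hypothetical tokens represented as dicts; VrtReader's own token type may differ
sample_tokens = [
    {"word": "Kissat", "lemma": "kissa", "pos": "N"},
    {"word": "nukkuvat", "lemma": "nukkua", "pos": "V"},
    {"word": "koirat", "lemma": "koira", "pos": "N"},
]


def distinct_values(tokens, field):
    """Collect the distinct values of one field, sorted for stable display."""
    return sorted({token[field] for token in tokens})


pos_values = distinct_values(sample_tokens, "pos")
print(pos_values)
```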

`map_tokens_to_field()` and `token_field_is_value()` can be used to filter and select tokens and fields. For now, let's do that for a single text, using the `tokens()` of the first text in the file, i.e. `vrt.texts[0]`:

In [None]:
# This will be a list of those tokens that have "V" as their "pos"
verbs_only = [token for token in vrt.texts[0].tokens() if vrt.token_field_is_value(token, "pos", "V")]
# This will be a list of the lemmas of those tokens
verb_lemmas = vrt.map_tokens_to_field(verbs_only, "lemma")
print(verb_lemmas)
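
Once you have a list of lemmas, a natural next step is counting how often each one occurs. Here is a small sketch using `collections.Counter` from Python's standard library. It assumes the lemmas are plain strings, and the list below is a made-up stand-in for `verb_lemmas`.

```python
from collections import Counter

# Hypothetical stand-in for the verb_lemmas list produced above
sample_lemmas = ["olla", "sanoa", "olla", "tehdä", "olla", "sanoa"]

lemma_counts = Counter(sample_lemmas)
top_two = lemma_counts.most_common(2)  # list of (lemma, count) pairs, most frequent first
for lemma, count in top_two:
    print(lemma, count)
```

If the assumption holds, replacing `sample_lemmas` with `verb_lemmas` should give the most frequent verbs of the first text.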

Okay, that's enough for this Notebook. Feel free to play around with the data; you won't break anything. See you in the next Notebook, where we'll do some analysis!