# Kielipankki's introduction to data analysis

This notebook follows up on the previous one about parsing VRT, so we will go through the parsing steps without comment.

In [None]:
# These import statements bring in modules that we're going to use
import datetime
import matplotlib.pyplot as plt
import numpy as np

# This is a special directive to make plots show up in the Notebook
%matplotlib notebook

# Some helpers for the course environment
from course_utils import VrtReader, data_dir, month_range

In [None]:
# If you have a different file, change the name here!
vrt_filename = "eduskunta-v1.5-vrt/eduskunta.vrt"
vrt = VrtReader(data_dir + vrt_filename, max_texts = 500)

Let's try to make a basic plot showing the number of texts per time period. To do that, we need a list of time segments to serve as our X-axis.

In [None]:
dates = [text.date for text in vrt.texts]
min_date = min(dates)
max_date = max(dates)
# This will be the X-axis of our graph, a list of dates of the first of every month.
# month_range is a small custom helper; for real work you may want a more serious framework such as pandas
daterange = month_range(min_date, max_date)
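
The course's `month_range` helper isn't shown here, so the following is only a minimal sketch of what such a function might look like. It assumes the helper returns the first day of every month from the month of the start date through the month of the end date. The name `first_of_months` is hypothetical and not part of `course_utils`.

```python
import datetime

def first_of_months(start, end):
    """Return the first day of each month from start's month to end's month, inclusive."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(datetime.date(year, month, 1))
        # Step to the next month, rolling over to January after December
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months

# Made-up start and end dates, only for illustration
example_range = first_of_months(datetime.date(2019, 11, 20), datetime.date(2020, 2, 3))
print(example_range)
```

With these example dates the result has four entries, from November 2019 to February 2020.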

Then we assign each text in `vrt.texts` to a bin and increment that bin's count. The bin index is the number of months between the text's date and the first element of `daterange`.

In [None]:
# A little helper function: the number of months from year 0 up to the month of a date
def month_number(date):
    return 12 * date.year + date.month

# Subtract this from every text's month number to get its bin index
min_months = month_number(daterange[0])

# First we have zero texts in every bin
counts = [0 for date in daterange]

for text in vrt.texts:
    # This text belongs in this bin
    bin_number = month_number(text.date) - min_months
    # Increment the counter for that bin
    counts[bin_number] += 1
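
To see the bin arithmetic on its own, here is a small self-contained sketch with made-up dates (they are not from the corpus). It assumes, as above, that the bins start at the earliest month and that there is one bin per month.

```python
import datetime

def month_number(date):
    return 12 * date.year + date.month

# Made-up example dates, only for illustration
sample_dates = [
    datetime.date(2019, 11, 5),
    datetime.date(2019, 11, 28),
    datetime.date(2020, 1, 15),
]

first_month = month_number(min(sample_dates))
last_month = month_number(max(sample_dates))

# One bin per month from the first to the last month, inclusive
sample_counts = [0] * (last_month - first_month + 1)
for date in sample_dates:
    sample_counts[month_number(date) - first_month] += 1

print(sample_counts)  # [2, 0, 1]
```

The two November dates land in bin 0, December stays empty, and the January date lands in bin 2, even though it falls in the following year.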
    

Then we will use a library called `pyplot` to draw a bar chart. By convention, we have imported it as `plt`.

To label the X-axis nicely, we tell `plt.xticks()` where to put the labels and what they should say. Then we build the chart with `plt.bar()` and finally show it inline with `plt.show()`.

In [None]:
# The list of x-coordinates for the bars. We want as many bars as there are entries in daterange.
x_ticks = range(len(daterange))

# For each x-axis label, show just the year and month
x_labels = [f"{date.year}-{date.month}" for date in daterange]

# Give x_ticks and x_labels to pyplot. rotation = 45 makes the labels tilted so they don't
# end up obscuring each other.
plt.xticks(ticks = x_ticks, labels = x_labels, rotation = 45)

plt.bar(x_ticks, counts) # This is the command to make the bar chart
plt.tight_layout() # This adjusts whitespace around the plot
plt.show()

You should be able to investigate the above graph interactively with the mouse. If you choose the pan tool, you can hold down the left mouse button to pan around the view, or hold down the right mouse button while moving the mouse to zoom in and out. You can also resize the graph as desired.

In the zoomed-out view, the x-axis labels are hard to read. One way to reduce clutter is to omit some of the labels. For example, you could try changing the `x_labels` variable like this:

`x_labels = [f"{date.year}-{date.month}" if date.month in (1,7) else "" for date in daterange]`

You can read this as "If the month number is 1 or 7, make a label like `2020-1`. Otherwise, make an empty label (`""`)." If you make that change and re-run the code cell, you will see the x-labels for the other month numbers disappear.
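
The effect of the conditional expression can be checked without drawing anything. The sketch below uses made-up first-of-month dates for a single year. It also shows a zero-padded variant using the `:02d` format specifier, in case you prefer labels like `2020-01`. The names `sample_months`, `sample_labels` and `padded_labels` are only illustrative.

```python
import datetime

# Made-up first-of-month dates for one year, only for illustration
sample_months = [datetime.date(2020, month, 1) for month in range(1, 13)]

# Label only January and July, leave the other months blank
sample_labels = [
    f"{date.year}-{date.month}" if date.month in (1, 7) else ""
    for date in sample_months
]
print(sample_labels)

# The same thing with zero-padded month numbers
padded_labels = [
    f"{date.year}-{date.month:02d}" if date.month in (1, 7) else ""
    for date in sample_months
]
print(padded_labels)
```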

If you're using the Eduskunta data, perhaps you can already make an observation about this histogram?

Next, we will try out a way to get some summary numeric data about our texts.

If you recall our Korp session, we were wondering about "noun disease": the tendency of language users to turn verbs into nouns and to use very generic verbs with them. One coarse way of seeing how much common verbs dominate is to calculate the inequality of the distribution of verb counts. For this purpose, we will define a function that takes a list of lemmas, like `["tehdä", "olla", "tehdä", "toteuttaa"]`, and returns the Gini coefficient of their counts. In that example list there are three different lemmas: one occurs twice, and the other two occur once.

You don't need to understand the function; just consider its input and output, and try it out!

In [None]:
def gini(instances):
    # http://www.statsdirect.com/help/default.htm#nonparametric_methods/gini.htm
    # First we tally up the count of each item in instances
    counts = {}
    for instance in instances:
        counts[instance] = counts.get(instance, 0) + 1
    # The list needs to be sorted, so we do that here, and make it into a numpy array
    array = np.array(sorted(counts.values()))
    n = array.shape[0]
    index = np.arange(1, n + 1)
    # The formula itself
    return np.sum((2 * index - n - 1) * array) / (n * np.sum(array))

Let's try the list I mentioned:

In [None]:
print(gini(["tehdä", "olla", "tehdä", "toteuttaa"]))
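
To make the formula less opaque, the same calculation can be done by hand for this list. The sketch below uses plain Python instead of numpy, and the variable names are only illustrative.

```python
from collections import Counter

lemmas = ["tehdä", "olla", "tehdä", "toteuttaa"]

# Tally and sort the counts: "olla" and "toteuttaa" once, "tehdä" twice
sorted_counts = sorted(Counter(lemmas).values())  # [1, 1, 2]
n = len(sorted_counts)

# Each count is weighted by (2 * rank - n - 1), with ranks starting from 1
weights = [2 * rank - n - 1 for rank in range(1, n + 1)]  # [-2, 0, 2]
numerator = sum(w * c for w, c in zip(weights, sorted_counts))  # -2 + 0 + 4 = 2
denominator = n * sum(sorted_counts)  # 3 * 4 = 12

hand_gini = numerator / denominator
print(hand_gini)  # 2 / 12, about 0.167
```

The weights are symmetric around zero, so a list in which every lemma occurs equally often gives exactly 0.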

Feel free to try altering the list and seeing how the inequality value changes. Now let's apply this function to the first text in the corpus.

In [None]:
# The first text is vrt.texts[0], because zero is the first index in a Python list
verb_tokens = [token for token in vrt.texts[0].tokens() if vrt.token_field_is_value(token, "pos", "V")]
verb_lemmas = vrt.map_tokens_to_field(verb_tokens, "lemma")
print(gini(verb_lemmas))

We can wrap this process in another function that takes the reader and a VrtText and returns the Gini index of the text's verbs.

In [None]:
def text_to_verb_gini_index(vrt, text):
    verb_tokens = [token for token in text.tokens() if vrt.token_field_is_value(token, "pos", "V")]
    verb_lemmas = vrt.map_tokens_to_field(verb_tokens, "lemma")
    return gini(verb_lemmas)

Now we can get the inequality index of each of the first 10 texts:

In [None]:
for text in vrt.texts[:10]:
    print(text_to_verb_gini_index(vrt, text))

Now we have a number that we can plot, calculate correlations with, or use as a filter to find interesting texts.
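
As one possible next step, the per-text values could be averaged per month so that they can be plotted like the text counts earlier, or filtered against a threshold. This is only a sketch. The `(date, value)` pairs below are made-up numbers, not results from the corpus, and the threshold of 0.4 is arbitrary. In the notebook you would build the pairs from `text.date` and `text_to_verb_gini_index(vrt, text)`.

```python
import datetime
from collections import defaultdict

# Made-up (date, Gini value) pairs, only for illustration
sample_values = [
    (datetime.date(2020, 1, 10), 0.5),
    (datetime.date(2020, 1, 24), 0.25),
    (datetime.date(2020, 2, 3), 0.75),
]

# Collect the values of each month under a (year, month) key
values_by_month = defaultdict(list)
for date, value in sample_values:
    values_by_month[(date.year, date.month)].append(value)

# Average the values within each month
monthly_means = {
    month: sum(values) / len(values)
    for month, values in sorted(values_by_month.items())
}
print(monthly_means)

# Filtering: keep only the dates whose value exceeds an arbitrary threshold
high_dates = [date for date, value in sample_values if value > 0.4]
print(high_dates)
```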