# Semantic features

## Hypotheses

We might be able to use a rule-based word2vec model to:

1. distinguish entrance and exit,

2. distinguish entrance/exit vs. everything else. 

## 1 &emsp; Preparing the data

### 1.1 &ensp; Vector model selection

We will be using *CBOW* as it focuses more on the context than on the word itself, i.e. verbs for entrance and exit may be different, yet their contexts might be more or less the same.

Models are taken from [https://rusvectores.org/ru/models/](rusvectores.org).

In [1]:
import gensim
import warnings

In [2]:
warnings.simplefilter("ignore")

**Option A: tagged CBOW**

* Trained on: Russian National Corpus

* Vocabulary: 270 000 000 tokens, 189 183 unique

* Tagset: Universal Tags (see UD tagset)

* min Frequency: 5

* Vector size: 300

In [3]:
import wget
import zipfile

In [4]:
model_path = "./data/models/180.zip"
with zipfile.ZipFile(model_path, "r") as archive:
    stream = archive.open("model.bin")
    model_continious = gensim.models.KeyedVectors.load_word2vec_format(stream, binary=True)

In [5]:
model_continious.most_similar("входить_VERB")

[('войти_VERB', 0.760601818561554),
 ('воходить_VERB', 0.5944163799285889),
 ('вбежать_VERB', 0.5676171183586121),
 ('выходить_VERB', 0.5533232688903809),
 ('войти_NOUN', 0.5268169045448303),
 ('вплыть_VERB', 0.520382285118103),
 ('вбегать_VERB', 0.5187638401985168),
 ('вводить_VERB', 0.5162067413330078),
 ('вваливаться_VERB', 0.5069583654403687),
 ('включать_VERB', 0.5045877695083618)]

**Option 2: FastText untagged CBOW**

* Trained on: Araneum Russian

* Vocabulary: approx 10 billion, 195 782 unique

* Tagset: none

* min Frequency: 5

* Vector size: 400

In [6]:
path_to_fasttext = "./data/models/araneum_none_fasttextcbow_300_5_2018/araneum_none_fasttextcbow_300_5_2018.model"
model_fasttext = gensim.models.KeyedVectors.load(path_to_fasttext)

In [7]:
model_fasttext.most_similar("входить")

[('привходить', 0.868297815322876),
 ('включать', 0.7789416313171387),
 ('включаться', 0.7182040810585022),
 ('дреходить', 0.7058184146881104),
 ('произходить', 0.6974779963493347),
 ('бреходить', 0.6909902691841125),
 ('вляться', 0.6862760782241821),
 ('включаеться', 0.6861984729766846),
 ('влючать', 0.6828867197036743),
 ('всовываться', 0.679784893989563)]

**Result:** we'll use Option A (it looks as if tagging was a crucial feature).

### 1.2 &ensp; Dividing point

When a verb starts to be similar to another one? Let's take a look at the most similar verbs (once again) and calculate an average of their similarity.

**Entrance** 

In [8]:
model_continious.most_similar("входить_VERB")

[('войти_VERB', 0.760601818561554),
 ('воходить_VERB', 0.5944163799285889),
 ('вбежать_VERB', 0.5676171183586121),
 ('выходить_VERB', 0.5533232688903809),
 ('войти_NOUN', 0.5268169045448303),
 ('вплыть_VERB', 0.520382285118103),
 ('вбегать_VERB', 0.5187638401985168),
 ('вводить_VERB', 0.5162067413330078),
 ('вваливаться_VERB', 0.5069583654403687),
 ('включать_VERB', 0.5045877695083618)]

In [9]:
from statistics import mean

In [10]:
mean([sim_t[1] for sim_t in model_continious.most_similar("входить_VERB")])

0.5569674491882324

**Exit**

In [11]:
model_continious.most_similar("уходить_VERB")

[('уйти_VERB', 0.6411738991737366),
 ('убегать_VERB', 0.6407353281974792),
 ('уводить_VERB', 0.6357207298278809),
 ('пойти_VERB', 0.6315332651138306),
 ('убежать_VERB', 0.6295132637023926),
 ('отходить_VERB', 0.6092807054519653),
 ('увести_VERB', 0.5756492018699646),
 ('убраться_VERB', 0.5727350115776062),
 ('сбегать_VERB', 0.5644490718841553),
 ('уезжать_VERB', 0.5639676451683044)]

In [12]:
mean([sim_t[1] for sim_t in model_continious.most_similar("уходить_VERB")])

0.6064758121967315

**Conclusion:** using average similarity doesn't make any sense.

### 1.3 &ensp; Loading directions

So far — using 2018 dataset again.

In [13]:
import pandas as pd
import numpy as np

In [14]:
df_path = "./data/csv/2018_annotated_dirs.csv"
df = pd.read_csv(df_path, sep=";")
df.head()

Unnamed: 0,Text,TEI type,ADJ,ADVB,INTJ,NOUN,PREP,VERB
0,"«да куда ж он делся-то, господи?»",setting,0.0,0.0,0.0,0.181818,0.0,0.090909
1,"«дома, что ли-ча, лазарь?»",setting,0.0,0.0,0.0,0.2,0.0,0.0
2,"автор опять испуганно высовывается, но быстро ...",mixed,0.0,0.166667,0.0,0.111111,0.055556,0.166667
3,автор хочет соединить руки коломбины и пьеро. ...,setting,0.142857,0.081633,0.0,0.285714,0.081633,0.163265
4,аграфена кондратьевна и липочка (разряженная,modifier,0.0,0.0,0.0,0.5,0.0,0.0


## 2 &emsp; Developing the general algorithm


Now, let's assign TEI types based on semantic similarity:

1. Check whether the list of verbs contains a verb from top-10 most similar lists for entrance and exit respectively,

2. Otherwise — calculate an overall vector for the verbs,

3. Compare it with the vector for *entrance* and *exit*: the one that is closer "wins" and the direction is assigned that type.

### 2.1 &emsp; Data preparation

Top-10 `entrance` similarities with their POS tag removed:

In [15]:
most_similar_entrance = []
for entrance_tuple in  model_continious.most_similar("входить_VERB"):
    verb = entrance_tuple[0].split("_")[0]
    new_tuple = (verb, entrance_tuple[1])
    most_similar_entrance.append(new_tuple)

In [16]:
most_similar_entrance

[('войти', 0.760601818561554),
 ('воходить', 0.5944163799285889),
 ('вбежать', 0.5676171183586121),
 ('выходить', 0.5533232688903809),
 ('войти', 0.5268169045448303),
 ('вплыть', 0.520382285118103),
 ('вбегать', 0.5187638401985168),
 ('вводить', 0.5162067413330078),
 ('вваливаться', 0.5069583654403687),
 ('включать', 0.5045877695083618)]

Top-10 `exit` similarities, processed the same way:

In [17]:
most_similar_exit = []
for exit_tuple in  model_continious.most_similar("уходить_VERB"):
    verb = exit_tuple[0].split("_")[0]
    new_tuple = (verb, exit_tuple[1])
    most_similar_exit.append(new_tuple)

In [18]:
most_similar_exit

[('уйти', 0.6411738991737366),
 ('убегать', 0.6407353281974792),
 ('уводить', 0.6357207298278809),
 ('пойти', 0.6315332651138306),
 ('убежать', 0.6295132637023926),
 ('отходить', 0.6092807054519653),
 ('увести', 0.5756492018699646),
 ('убраться', 0.5727350115776062),
 ('сбегать', 0.5644490718841553),
 ('уезжать', 0.5639676451683044)]

### 2.2 &emsp; Developing the algorithm

**Similarity measure:** calculating cosine similarity between two random vectors:

In [19]:
from gensim import matutils

In [20]:
def vec_similarity(v1, v2):
    v1_norm = matutils.unitvec(np.array(v1).astype(float))
    v2_norm = matutils.unitvec(np.array(v2).astype(float))
    return np.dot(v1_norm, v2_norm)

Retrieving a **total vector** for all the verbs from the directions in general:

In [21]:
def get_w2v_vectors(text):
    total_counter = 0
    total_vector = np.zeros(300)
    for word in text:
        try:
            vector = np.array(model_similarity.wv[word])
            total_vector += vector
            total_counter += 1
        except:
            continue
    res_vector = total_vector / total_counter
    return res_vector

Finally, a **TEI type decision**.

The idea visualisation: 

![semantic algorithm — a scheme](data/figures/semantic_algorithm.png)

In [22]:
def assign_tei_type(verbs_list, most_similar_entrance, most_similar_exit):
    tei_type = ""
    total_counter = 0
    total_vector = np.zeros(300)
    for verb in verbs_list:
        # case 1 -- verb in most_common for a type
        # in this case, we return the type immediately
        if verb in most_similar_entrance:
            tei_type = "entrance"
            return tei_type
        elif verb in most_similar_exit:
            tei_type = "exit"
            return tei_type
        # case 2 -- verb is unknown
        else:
            try:
                verb_vec = np.array(model_continious.wv[verb+"_VERB"])
                total_vector += verb_vec
                total_counter += 1
            except:
                 # verb not in model vocabulary
                pass
    # case 2 -- calculating the total vector and comparing
    direction_vector = total_vector/total_counter
    similarity_entrance = vec_similarity(direction_vector, model_continious.wv["войти_VERB"])
    similarity_exit = vec_similarity(direction_vector, model_continious.wv["выйти_VERB"])
    if similarity_entrance > similarity_exit:
        tei_type = "entrance"
    elif similarity_entrance < similarity_exit:
        tei_type = "exit"
    else:
        tei_type = "unknown"
    return tei_type

## 3 &emsp; Testing the hypotheses

### 3.1 &emsp; Distinguishing `entrance` and `exit`

**Hypothesis:** we might be able to use a rule-based word2vec model to distinguish `entrance` and `exit`. 

#### 3.1.1 &emsp; Calculating and development

**Step 1:** Leaving out only `entrance` and `exit`:

In [23]:
entr_exit_types = ["entrance", "exit"]
df_entrance_exit = df.loc[df["TEI type"].isin(entr_exit_types)].reset_index(drop=True)
df_entrance_exit.head()

Unnamed: 0,Text,TEI type,ADJ,ADVB,INTJ,NOUN,PREP,VERB
0,аграфена кондратьевна уходит,exit,0.0,0.0,0.0,0.666667,0.0,0.333333
1,аграфена кондратьевна уходит с олимпиадой самс...,exit,0.0,0.0,0.0,0.666667,0.166667,0.166667
2,анна андреевна и марья антоновна вбегают на сцену,entrance,0.0,0.0,0.0,0.625,0.125,0.125
3,быстро уходит; телегин идет за ним,exit,0.0,0.142857,0.0,0.142857,0.142857,0.285714
4,в сильном волнении уходит,exit,0.25,0.0,0.0,0.25,0.25,0.25


**Step 2:** the only part of speech that matters for this task is the verb. Basically, the direction might be as simple as *входит*  or *уходит*, so let's extract all the verbs.

In [24]:
from pymystem3 import Mystem

mystem = Mystem()

In [25]:
def leave_out_verbs(direction):
    direction_verbs = []
    word_analyses = mystem.analyze(str(direction))
    for parse in word_analyses:
        if parse.get("analysis"):
            pos = parse["analysis"][0]["gr"].split(",")[0]
            if pos == "V":
                direction_verbs.append(parse["analysis"][0]["lex"])
    return direction_verbs

In [26]:
verbs = [leave_out_verbs(direction) for direction in df_entrance_exit["Text"]]
df_entrance_exit["Verbs"] = verbs
df_entrance_exit.head()

Unnamed: 0,Text,TEI type,ADJ,ADVB,INTJ,NOUN,PREP,VERB,Verbs
0,аграфена кондратьевна уходит,exit,0.0,0.0,0.0,0.666667,0.0,0.333333,[уходить]
1,аграфена кондратьевна уходит с олимпиадой самс...,exit,0.0,0.0,0.0,0.666667,0.166667,0.166667,[уходить]
2,анна андреевна и марья антоновна вбегают на сцену,entrance,0.0,0.0,0.0,0.625,0.125,0.125,[вбегать]
3,быстро уходит; телегин идет за ним,exit,0.0,0.142857,0.0,0.142857,0.142857,0.285714,"[уходить, идти]"
4,в сильном волнении уходит,exit,0.25,0.0,0.0,0.25,0.25,0.25,[уходить]


**Step 3:** applying the algorithm.

In [27]:
def get_rule_based_types(df_verb_series):
    rule_based_types = []
    for direction_verbs in df_verb_series:
        str_verbs = [str(verb) for verb in direction_verbs]
        rule_type = assign_tei_type(str_verbs, most_similar_entrance, most_similar_exit)
        rule_based_types.append(rule_type)
    return rule_based_types

In [28]:
df_entrance_exit["Rule-based"] = get_rule_based_types(df_entrance_exit["Verbs"])
df_entrance_exit.head()

Unnamed: 0,Text,TEI type,ADJ,ADVB,INTJ,NOUN,PREP,VERB,Verbs,Rule-based
0,аграфена кондратьевна уходит,exit,0.0,0.0,0.0,0.666667,0.0,0.333333,[уходить],exit
1,аграфена кондратьевна уходит с олимпиадой самс...,exit,0.0,0.0,0.0,0.666667,0.166667,0.166667,[уходить],exit
2,анна андреевна и марья антоновна вбегают на сцену,entrance,0.0,0.0,0.0,0.625,0.125,0.125,[вбегать],entrance
3,быстро уходит; телегин идет за ним,exit,0.0,0.142857,0.0,0.142857,0.142857,0.285714,"[уходить, идти]",exit
4,в сильном волнении уходит,exit,0.25,0.0,0.0,0.25,0.25,0.25,[уходить],exit


**Finally:** checking the hypothesis. If it's true, we'll get high scores on the metrics of precision and recall.

In [29]:
from sklearn.metrics import classification_report

In [30]:
print(classification_report(df_entrance_exit["TEI type"], df_entrance_exit["Rule-based"]))

              precision    recall  f1-score   support

    entrance       0.97      0.84      0.90        37
        exit       0.88      0.95      0.91        39
     unknown       0.00      0.00      0.00         0

   micro avg       0.89      0.89      0.89        76
   macro avg       0.62      0.60      0.60        76
weighted avg       0.92      0.89      0.91        76



Looks like that's true!

Let's check what's marked as unknown:

In [31]:
df_entrance_exit[df_entrance_exit["Rule-based"] == "unknown"]

Unnamed: 0,Text,TEI type,ADJ,ADVB,INTJ,NOUN,PREP,VERB,Verbs,Rule-based
7,вошед,entrance,0.0,0.0,0.0,1.0,0.0,0.0,[],unknown
56,слуга убирает и уносит тарелки вместе с осипом,exit,0.0,0.125,0.0,0.375,0.125,0.25,[],unknown


These two are special in their own way: no.7 (*вошед* = "having entered") is a no longer used gerund, no.56 mixes business and exit (*убирает и уносит* = "cleans and takes away" [apparently, exiting the scene]).

### 3.1 &emsp; Hypothesis 1 result

We **certainly can** use rule- and word2vec-based algorithm to distinguish between `entrance` and `exit` types.

### 3.2 &emsp; `entrance`/`exit` vs. everything else

Now, let's check whether the algorithm still works well on all the directions. In this case, it should distinguish entrance/exit from everything else (these directions should be marked as *unknown*).

In [32]:
known_types = set(["entrance", "exit"])
def rename_type(tei_type):
    if tei_type not in known_types:
        tei_type = "unknown"
    return tei_type

In [33]:
unknown_types = []
for dir_type in df["TEI type"]:
    unknown_types.append(rename_type(dir_type))

In [34]:
from copy import copy

In [35]:
df_unknown = copy(df)
df_unknown["Manual type"] = unknown_types
verbs = [leave_out_verbs(direction) for direction in df_unknown["Text"]]
df_unknown["Verbs"] = verbs
df_unknown.head()

Unnamed: 0,Text,TEI type,ADJ,ADVB,INTJ,NOUN,PREP,VERB,Manual type,Verbs
0,"«да куда ж он делся-то, господи?»",setting,0.0,0.0,0.0,0.181818,0.0,0.090909,unknown,[деваться]
1,"«дома, что ли-ча, лазарь?»",setting,0.0,0.0,0.0,0.2,0.0,0.0,unknown,[]
2,"автор опять испуганно высовывается, но быстро ...",mixed,0.0,0.166667,0.0,0.111111,0.055556,0.166667,unknown,"[высовываться, исчезать, оттягивать]"
3,автор хочет соединить руки коломбины и пьеро. ...,setting,0.142857,0.081633,0.0,0.285714,0.081633,0.163265,unknown,"[хотеть, соединять, взвиваться, улетать, разбе..."
4,аграфена кондратьевна и липочка (разряженная,modifier,0.0,0.0,0.0,0.5,0.0,0.0,unknown,[]


In [36]:
rule_types = []
for direction_verbs in df_unknown["Verbs"]:
    str_verbs = [str(verb) for verb in direction_verbs]
    rule_type = assign_tei_type(str_verbs, most_similar_entrance, most_similar_exit)
    rule_types.append(rule_type)

In [37]:
df_unknown["Rule-based"] = get_rule_based_types(df_unknown["Verbs"])
df_unknown.head()

Unnamed: 0,Text,TEI type,ADJ,ADVB,INTJ,NOUN,PREP,VERB,Manual type,Verbs,Rule-based
0,"«да куда ж он делся-то, господи?»",setting,0.0,0.0,0.0,0.181818,0.0,0.090909,unknown,[деваться],exit
1,"«дома, что ли-ча, лазарь?»",setting,0.0,0.0,0.0,0.2,0.0,0.0,unknown,[],unknown
2,"автор опять испуганно высовывается, но быстро ...",mixed,0.0,0.166667,0.0,0.111111,0.055556,0.166667,unknown,"[высовываться, исчезать, оттягивать]",exit
3,автор хочет соединить руки коломбины и пьеро. ...,setting,0.142857,0.081633,0.0,0.285714,0.081633,0.163265,unknown,"[хотеть, соединять, взвиваться, улетать, разбе...",exit
4,аграфена кондратьевна и липочка (разряженная,modifier,0.0,0.0,0.0,0.5,0.0,0.0,unknown,[],unknown


**Finally:** checking the hypothesis.

In [38]:
print(classification_report(df_unknown["Rule-based"], df_unknown["Manual type"]))

              precision    recall  f1-score   support

    entrance       0.84      0.09      0.17       336
        exit       0.95      0.14      0.24       268
     unknown       0.32      0.99      0.49       254

   micro avg       0.37      0.37      0.37       858
   macro avg       0.70      0.41      0.30       858
weighted avg       0.72      0.37      0.28       858



We have large recall yet small precision on the `entrance` and `exit`. Given that
$$Recall = \frac{True\ positives}{True\ positives + False\ negatives},$$
we have correctly assigined the majority of these directions.

On the other hand, let's remember that
$$Precision = \frac{True\ positives}{True\ positives + False\ positives},$$
and apart from that, we have a very unbalanced data set where the other TEI types altogether are much more frequent than `entrance`+`exit`. Basically, our result means that we get a lot of false positives just because of the dataset size.