# Classifying the directions

## Introduction

**Data:** stage directions from all the plays

* quantity: 22k+, unique: 13 955,
    - annotated: 1 150, unique: 858

**Goal:** classify each direction into one of the 9 categories defined in the TEI P5 standard.

**Algorithms:**

* kNN,
* Decision Tree,
* Random Forest

In [1]:
import os

In [2]:
annotated_path = os.path.join(".", "csv", "annotated_dirs.csv")
shares_path = os.path.join(".", "csv", "shares_dirs.csv")

## Working with the data

### Loading

In [3]:
import pandas as pd
import numpy as np

First, the **preannotated directions** from five plays:

* “Svoi ljudi — sochtiomsja” (_It’s a Family Affair — We’ll Settle It Ourselves_) by Alexander Ostrovsky,
* “Khorev” by Alexander Sumarokov,
* “Balaganchik” by Alexander Blok,
* “Revizor” (_The Government Inspector_) by Nikolai Gogol,
* “Djadja Vanja” (_Uncle Vanya_) by Anton Chekhov.

In [4]:
annotated_dirs = pd.read_csv(annotated_path, sep=";", encoding="utf-8")
annotated_dirs.drop_duplicates(inplace=True)
annotated_dirs.set_index("Text", inplace=True)
annotated_dirs.head()

| Text | TEI type |
| --- | --- |
| гостиная в доме большова | setting |
| сидит у окна с книгой | business |
| с жаром | delivery |
| вздыхает | delivery |
| молчание | delivery |


Second, **the dataset of part-of-speech shares produced in the [means-merged-features notebook](./means-merged-features.ipynb)**.

In [5]:
shares = pd.read_csv(shares_path, sep=";", encoding="utf-8")
shares.drop_duplicates(inplace=True)
shares.set_index("Text", inplace=True)
shares.head()

| Text | ADJ | ADVB | INTJ | NOUN | PREP | VERB |
| --- | --- | --- | --- | --- | --- | --- |
| входит брат бертольд | 0.0 | 0.0 | 0.0 | 0.666667 | 0.0 | 0.333333 |
| бертольд и франц | 0.0 | 0.0 | 0.0 | 0.666667 | 0.0 | 0.0 |
| входит мартын | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.5 |
| расходятся в разные стороны | 0.25 | 0.0 | 0.0 | 0.25 | 0.25 | 0.25 |
| почесывается | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
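
Each column appears to hold the fraction of a direction's tokens that carry the given part-of-speech tag. A minimal sketch of that computation, assuming the tags are already available as a list; the function `pos_shares` and the hand-assigned tags are illustrative and not taken from the means-merged-features notebook:

```python
from collections import Counter

POS_TAGS = ["ADJ", "ADVB", "INTJ", "NOUN", "PREP", "VERB"]

def pos_shares(tags):
    """Return the fraction of tokens carrying each tag listed in POS_TAGS."""
    counts = Counter(tags)
    total = len(tags)
    return {pos: (counts[pos] / total if total else 0.0) for pos in POS_TAGS}

# Hand-assigned tags for "входит брат бертольд": one verb followed by two nouns.
shares_example = pos_shares(["VERB", "NOUN", "NOUN"])
print(shares_example)
```

With these tags the sketch reproduces the NOUN and VERB shares of the first row above.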


### Merging

The two datasets are joined on the `Text` index so that the labels and the features sit in one frame.

In [6]:
annotated_dirs = annotated_dirs.join(shares)
annotated_dirs.head()

| Text | TEI type | ADJ | ADVB | INTJ | NOUN | PREP | VERB |
| --- | --- | --- | --- | --- | --- | --- | --- |
| «да куда ж он делся-то, господи?» | setting | 0.0 | 0.0 | 0.0 | 0.181818 | 0.0 | 0.090909 |
| «дома, что ли-ча, лазарь?» | setting | 0.0 | 0.0 | 0.0 | 0.2 | 0.0 | 0.0 |
| автор опять испуганно высовывается, но быстро исчезает, как будто его оттянул кто-то за фалды | mixed | 0.0 | 0.166667 | 0.0 | 0.111111 | 0.055556 | 0.166667 |
| автор хочет соединить руки коломбины и пьеро. но внезапно все декорации взвиваются и улетают вверх. маски разбегаются. автор оказывается склоненным над одним только пьеро, который беспомощно лежит на пустой сцене в белом балахоне своем с красными пуговицами. заметив свое положение, автор убегает стремительно | setting | 0.142857 | 0.081633 | 0.0 | 0.285714 | 0.081633 | 0.163265 |
| аграфена кондратьевна и липочка (разряженная | modifier | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 |
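
`DataFrame.join` performs a left join on the index by default, so every annotated direction is kept, and a direction that is missing from `shares` would end up with `NaN` features. A toy sketch with made-up frames (the names `left` and `right` and their values are illustrative only):

```python
import pandas as pd

left = pd.DataFrame({"TEI type": ["exit", "delivery"]},
                    index=pd.Index(["уходит", "с жаром"], name="Text"))
right = pd.DataFrame({"VERB": [1.0]},
                     index=pd.Index(["уходит"], name="Text"))

joined = left.join(right)           # left join on the shared "Text" index
n_unmatched = int(joined["VERB"].isna().sum())

print(joined)
print(n_unmatched)                  # rows of `left` with no match in `right`
```

Checking the count of missing values after the real join is a cheap way to confirm that all annotated directions were found in the features table.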


## Preprocessing

### Normalization

All the directions will be normalized: each word is reduced to its normal form (e.g. _играл_ -> _играть_, _стулья_ -> _стул_). Stop words (such as interjections) will not be removed, because they might be important for identifying the direction type.

It is also common practice to lowercase everything; the directions are already lowercase.

In [7]:
from pymorphy2 import MorphAnalyzer
from nltk.tokenize import wordpunct_tokenize
import string

morph = MorphAnalyzer()
punct = string.punctuation + "«»"

In [8]:
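def normalize(text):
    """Lowercase a direction, lemmatize every token and drop punctuation."""
    text = text.lower()

    tokens = wordpunct_tokenize(text)

    # Take the first parse of each token and keep its normal form.
    lemmas_raw = [morph.parse(token)[0].normal_form for token in tokens]
    # `in` on a string is a substring test, so the two-character token "?»"
    # is not caught by `punct` and needs its own check.
    lemmas = [lemma for lemma in lemmas_raw
              if lemma not in punct
              and lemma != "?»"]

    return " ".join(lemmas)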
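
The extra `lemma != "?»"` check is needed because `in` on a string tests for a substring, so a multi-character punctuation token is filtered out only if it happens to occur contiguously in `punct`. A character-wise test is one possible alternative; the helper `is_punct` below is a hypothetical name and not part of the notebook:

```python
import string

punct = string.punctuation + "«»"

def is_punct(token):
    """True if the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in punct for ch in token)

print("?" in punct)      # a single character is found
print("?»" in punct)     # two characters are not a contiguous substring of punct
print(is_punct("?»"))    # the character-wise check catches the token
print(is_punct("то"))    # an ordinary word is kept
```

Since `TfidfVectorizer` ignores punctuation with its default token pattern, a stray punctuation token mostly affects the readability of the `Normalized text` column.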

In [9]:
annotated_dirs["Normalized text"] = annotated_dirs.index.map(normalize)
shares["Normalized text"] = shares.index.map(normalize)

In [10]:
annotated_dirs.head()

| Text | TEI type | ADJ | ADVB | INTJ | NOUN | PREP | VERB | Normalized text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| «да куда ж он делся-то, господи?» | setting | 0.0 | 0.0 | 0.0 | 0.181818 | 0.0 | 0.090909 | да куда ж он деться то господь |
| «дома, что ли-ча, лазарь?» | setting | 0.0 | 0.0 | 0.0 | 0.2 | 0.0 | 0.0 | дом что ли ча лазарь |
| автор опять испуганно высовывается, но быстро исчезает, как будто его оттянул кто-то за фалды | mixed | 0.0 | 0.166667 | 0.0 | 0.111111 | 0.055556 | 0.166667 | автор опять испуганно высовываться но быстро и... |
| автор хочет соединить руки коломбины и пьеро. но внезапно все декорации взвиваются и улетают вверх. маски разбегаются. автор оказывается склоненным над одним только пьеро, который беспомощно лежит на пустой сцене в белом балахоне своем с красными пуговицами. заметив свое положение, автор убегает стремительно | setting | 0.142857 | 0.081633 | 0.0 | 0.285714 | 0.081633 | 0.163265 | автор хотеть соединить рука коломбина и пьеро ... |
| аграфена кондратьевна и липочка (разряженная | modifier | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | аграфена кондратий и липочка разрядить |


### TF-IDF vectorization

At this stage, the directions are vectorized, as this is a straightforward way to turn texts into numbers. The weighting scheme is TF-IDF, which is common in NLP tasks.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [12]:
tfidf = TfidfVectorizer()
tfidf.fit(shares["Normalized text"].values)
X_tf_sparse = tfidf.transform(annotated_dirs["Normalized text"].values)
X_tf = X_tf_sparse.toarray()
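
The vocabulary is fitted on all directions in `shares`, while only the annotated subset is transformed, so both live in the same column space. A toy sketch of that fit/transform split, with made-up documents standing in for the real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["входить брат бертольд", "входить мартын", "уходить в сторона"]
subset = ["входить мартын"]

vec = TfidfVectorizer()
vec.fit(corpus)                   # vocabulary and IDF weights come from the full corpus
X_subset = vec.transform(subset)  # the subset is projected onto that vocabulary

print(sorted(vec.vocabulary_))    # one-character tokens such as "в" are dropped by default
print(X_subset.shape)             # one row, one column per vocabulary term
```

Note that the default token pattern discards one-character words, so short prepositions and particles do not contribute to the TF-IDF features.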

### Creating feature arrays

We will create separate arrays for three series of experiments:

1. POS shares,
2. TF-IDF vectors of normalized directions,
3. Combination of the two mentioned above: POS shares _and_ TF-IDF vectors.

In [13]:
pos_cols = ["ADJ", "ADVB", "INTJ", "NOUN", "PREP", "VERB"]
X_pos = annotated_dirs[pos_cols].values  # `as_matrix()` was removed in pandas 1.0

Concatenating the **dense matrices**:

In [14]:
X_total = np.concatenate((X_tf, X_pos), axis=1)

Concatenating the **sparse matrix of TF-IDF vectors with the dense matrix of POS shares**:

In [15]:
from scipy.sparse import hstack

In [16]:
X_total_sparse = hstack((X_pos, X_tf_sparse))
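
Both variants hold the same numbers; the sparse one avoids materialising the mostly-zero TF-IDF matrix. Note that the column order differs: `X_total` puts the TF-IDF columns first, `X_total_sparse` puts the POS columns first. A small sketch with hand-written stand-in matrices (the names and values are illustrative only):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

dense_part = np.array([[0.5, 0.5],
                       [1.0, 0.0]])                     # stand-in for POS shares
sparse_part = csr_matrix(np.array([[0.0, 0.3, 0.0],
                                   [0.7, 0.0, 0.0]]))   # stand-in for TF-IDF vectors

# Requesting CSR output keeps the result row-indexable.
combined = hstack((dense_part, sparse_part), format="csr")
reference = np.concatenate((dense_part, sparse_part.toarray()), axis=1)

print(combined.shape)
print(np.allclose(combined.toarray(), reference))
```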

## Target variable

The classifier has to predict direction types, so `TEI type` is our target variable.

In [17]:
y = annotated_dirs["TEI type"].values  # `as_matrix()` was removed in pandas 1.0

### Encoding categories

The direction types are stored as strings, so they will be encoded as integers. `LabelEncoder` is used here because it is designed for encoding target labels such as our types.

In [18]:
from sklearn.preprocessing import LabelEncoder

In [19]:
le = LabelEncoder()
y_encoded = le.fit_transform(y)  # keep the integer codes; `y` still holds the string labels
y_encoded

array([7, 7, 5, 7, 6, 6, 7, 3, 3, 6, 2, 7, 0, 5, 0, 1, 7, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 6, 0, 1, 0,
       0, 1, 0, 3, 1, 0, 0, 0, 7, 1, 1, 0, 1, 4, 1, 7, 0, 1, 0, 1, 1, 7,
       1, 1, 3, 0, 0, 1, 1, 1, 5, 0, 2, 1, 0, 0, 0, 1, 1, 1, 0, 0, 3, 1,
       0, 0, 5, 0, 7, 1, 1, 1, 2, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 3, 1, 5, 0, 2, 2, 2, 3, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 5, 2, 2, 2, 2, 5, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 3, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 5, 7, 1, 1, 1, 0, 0, 0, 0, 7, 7, 6, 7, 6, 0, 0, 7, 7, 0, 1, 1,
       0, 0, 7, 7, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       5, 1, 0, 4, 7, 0, 1, 1, 0, 1, 4, 4, 0, 5, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 7, 1, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0,
       0, 3, 1, 4, 0, 1, 0, 5, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 7, 0, 3,
       6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0,

This is how categories were encoded:

In [20]:
for label, category in enumerate(le.classes_):
    print("{} - {}".format(category, label))

business - 0
delivery - 1
entrance - 2
exit - 3
location - 4
mixed - 5
modifier - 6
setting - 7
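
`LabelEncoder` sorts the class names and numbers them from zero, and `inverse_transform` maps the codes back to names. A short sketch on a made-up label list:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["setting", "business", "delivery", "business"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)
decoded = encoder.inverse_transform(codes)

print(list(encoder.classes_))  # class names in sorted order
print(list(codes))             # one integer code per label
print(list(decoded))           # back to the original strings
```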


#### Label binarization

While testing metrics, it turned out that label binarization is required for calculating the ROC-AUC score.

In [21]:
from sklearn.preprocessing import label_binarize

In [22]:
# `classes` must list every label; `y` holds the string labels known to the fitted encoder.
y_bin = label_binarize(y, classes=le.classes_)
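
`label_binarize` turns each label into a one-hot row with one column per class, which is a layout `roc_auc_score` accepts for multi-class data. A sketch with toy labels and hypothetical class probabilities (all numbers are made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

classes = ["business", "delivery", "setting"]
y_true = ["business", "setting", "delivery", "business"]

y_onehot = label_binarize(y_true, classes=classes)
print(y_onehot)

# Hypothetical predicted probabilities, one column per class.
y_score = np.array([[0.7, 0.2, 0.1],
                    [0.2, 0.2, 0.6],
                    [0.3, 0.6, 0.1],
                    [0.6, 0.3, 0.1]])

# Per-class AUC values are averaged with equal weight ("macro").
auc = roc_auc_score(y_onehot, y_score, average="macro")
print(auc)
```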

## Machine learning

We'll run three series of experiments:

1. POS shares,
2. TF-IDF vectors of normalized directions,
3. Combination of the two mentioned above: POS shares _and_ TF-IDF vectors.

(These are the same feature sets that were built in the _Creating feature arrays_ part.)

All the models will undergo **5-fold cross-validation**.
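
For classifiers, an integer `cv` in scikit-learn's `GridSearchCV` uses stratified folds, so each fold roughly preserves the class proportions. A minimal sketch of such a split on made-up labels (`X_toy` and `y_toy` are illustrative stand-ins):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_toy = np.array(["business"] * 10 + ["delivery"] * 5)
X_toy = np.zeros((len(y_toy), 1))   # the features do not influence the split itself

skf = StratifiedKFold(n_splits=5)
fold_sizes = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    fold_sizes.append(len(test_idx))
    # Class counts inside the held-out fold.
    print(np.unique(y_toy[test_idx], return_counts=True))
```

With rare categories this matters: a class with fewer than five examples cannot appear in every fold.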

In [23]:
experiments_data = [("TF-IDF", X_tf_sparse),
                    ("POS", X_pos),
                    ("TF-IDF + POS", X_total_sparse)]

### Grid search

To avoid copy-pasting code, let's wrap the grid search and the evaluation into functions.

In [24]:
from sklearn.model_selection import GridSearchCV

In [25]:
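def train_model(model, train_data, params):
    """Grid-search `params` for `model` with 5-fold CV on the global labels `y`."""
    grid = GridSearchCV(
        model,
        param_grid=params,
        cv=5,
        n_jobs=-1  # use all available CPU cores
    )
    grid.fit(train_data, y)
    return grid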

In [26]:
def compute_best(algorithm, params):
    for label, train_set in experiments_data:
        print("Start training: {}".format(label))
        ready_model = train_model(algorithm, 
                                 train_set,
                                 params)
        print("Finish training: {}".format(label))
        print("\t- best parameters: {}\n\t- best score: {}".format(ready_model.best_params_, 
                                                             ready_model.best_score_))
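
Besides `best_params_` and `best_score_`, a fitted `GridSearchCV` keeps the mean score of every candidate in `cv_results_`, which helps to see how flat the score curve is around the optimum. A self-contained sketch on synthetic data (the data set and the parameter grid are illustrative, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the feature matrix and the labels.
X_demo, y_demo = make_classification(n_samples=100, n_features=6,
                                     n_informative=4, random_state=0)

grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [1, 3, 5]},
                    cv=5)
grid.fit(X_demo, y_demo)

# One entry per candidate: its parameters and its mean accuracy over the folds.
for params, score in zip(grid.cv_results_["params"],
                         grid.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
print(grid.best_params_, grid.best_score_)
```

By default the score is plain accuracy, which is what the `best score` values printed by `compute_best` report.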

### K Nearest Neighbors

In [27]:
from sklearn.neighbors import KNeighborsClassifier

In [28]:
knn_params = {"n_neighbors": np.arange(1, 102, 2)}
compute_best(KNeighborsClassifier(), knn_params)

Start training: TF-IDF
Finish training: TF-IDF
	- best parameters: {'n_neighbors': 23}
	- best score: 0.627039627039627
Start training: POS
Finish training: POS
	- best parameters: {'n_neighbors': 23}
	- best score: 0.6433566433566433
Start training: TF-IDF + POS
Finish training: TF-IDF + POS
	- best parameters: {'n_neighbors': 23}
	- best score: 0.7051282051282052


### Decision Tree

In [29]:
from sklearn.tree import DecisionTreeClassifier

In [30]:
tree_params = {"max_depth": np.arange(1, 101)}
compute_best(DecisionTreeClassifier(), tree_params)

Start training: TF-IDF
Finish training: TF-IDF
	- best parameters: {'max_depth': 66}
	- best score: 0.6445221445221445
Start training: POS
Finish training: POS
	- best parameters: {'max_depth': 10}
	- best score: 0.6421911421911422
Start training: TF-IDF + POS
Finish training: TF-IDF + POS
	- best parameters: {'max_depth': 18}
	- best score: 0.7191142191142191


### Random Forest

In [31]:
from sklearn.ensemble import RandomForestClassifier

In [32]:
forest_params = {"n_estimators": np.arange(1, 101)}
compute_best(RandomForestClassifier(random_state=1968), forest_params)

Start training: TF-IDF
Finish training: TF-IDF
	- best parameters: {'n_estimators': 65}
	- best score: 0.6666666666666666
Start training: POS
Finish training: POS
	- best parameters: {'n_estimators': 20}
	- best score: 0.6573426573426573
Start training: TF-IDF + POS
Finish training: TF-IDF + POS
	- best parameters: {'n_estimators': 50}
	- best score: 0.7400932400932401
