# Basic clustering

In this notebook we:

* normalize the directions,
* represent them as feature vectors (via TF-IDF),
* visualise the vectors,
* run a simple K-Means clustering algorithm to see where it gets us.

## Before we start: globals and paths

In [None]:
import os

In [None]:
directions_path = ".." + os.sep + "directions"
csv_path = "." + os.sep + "csv"
corpus_path = ".." + os.sep + "RusDraCor"

## Loading data

In [None]:
import pandas as pd
import numpy as np

### Plays

In [None]:
play_df = pd.read_csv(csv_path + os.sep + "joint_data.csv", sep=";", 
                 encoding="utf-8", index_col=0)
play_df.head()
# Ostrovsky, Gogol, Sumarokov

### Directions

In [None]:
all_directions_path = directions_path + os.sep + "all_directions.txt"
with open(all_directions_path, "r", encoding="utf-8") as alldirs_file:
    alldirs = [line.strip("\n") for line in alldirs_file.readlines() if line.strip("\r\n")]

Let's see how many directions we've got now...

In [None]:
len(alldirs)

...and take a look at an example direction, for instance the last one:

In [None]:
alldirs[-1]

## Preprocessing

### Text normalization

To make clustering easier, I'll normalize all the directions as follows:

* the words will be turned into their normal form (i.e. _играл -> играть_, _стулья -> стул_, etc.),
* stop words (such as interjections) will be removed.

The directions are all lowercase already.

In [None]:
from pymorphy2 import MorphAnalyzer
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
import string

morph = MorphAnalyzer()
stops = stopwords.words("russian")

In [None]:
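def normalize(text):
    """Lowercase, tokenize and lemmatize; drop stop words and punctuation."""
    text = text.lower()

    tokens = wordpunct_tokenize(text)

    # Use the normal form of the most probable parse of each token.
    lemmas_raw = [morph.parse(token)[0].normal_form for token in tokens]
    # Drop stop words and tokens consisting only of punctuation (e.g. "...").
    lemmas = [lemma for lemma in lemmas_raw
              if lemma not in stops
              and not all(char in string.punctuation for char in lemma)]

    return " ".join(lemmas)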

In [None]:
# Normalize each direction once and keep only the non-empty results.
alldirs_norm = [norm for norm in map(normalize, alldirs) if norm]

Let's check whether this changed the number of directions:

In [None]:
len(alldirs) != len(alldirs_norm)

In [None]:
len(alldirs) - len(alldirs_norm)
# what's thrown out?

Let's also take a look at the last direction after normalization:

In [None]:
alldirs_norm[-1]
# list of characters -- compare to get proper names

### Vectorizing

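I'm vectorizing the directions because it's the easiest way to turn texts into numbers. The algorithm is **TF-IDF**, a term-weighting scheme that is widely used in NLP: a word gets a high weight in a direction if it is frequent there but rare across the other directions.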

More information:
* [scikit-learn page](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) with documentation on functions and parameters,
* _I'll probably add an article on that, but I have to find it first! :)_
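To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF. It assumes the smoothed IDF variant and L2 row normalization that scikit-learn documents as its defaults, and it splits on whitespace, which is a simplification of the real tokenizer. The function name and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (assumed here): ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical, already normalized toy "directions".
toy_docs = ["enter", "leave", "enter sit"]
toy_vocab, toy_rows = tfidf_sketch(toy_docs)
```

In the third toy document, `sit` ends up with a larger weight than `enter`, because `enter` also occurs in another document and is therefore less distinctive.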

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf = TfidfVectorizer()
tfidf.fit(alldirs_norm)
X = tfidf.transform(alldirs_norm)

## Plotting the results

Now, let's plot what we have to see whether there are any visible clusters. The TF-IDF result is a sparse matrix with one dimension per vocabulary term, so it can't be plotted directly. We'll run **LSA (latent semantic analysis)**, implemented in scikit-learn as `TruncatedSVD`, to reduce the number of dimensions to 2.
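As a rough illustration of what the reduction does, here is a small dense NumPy sketch on a made-up matrix. It is not how `TruncatedSVD` is implemented (that class works on sparse input and computes only the leading components); it only shows the underlying idea of projecting documents onto the strongest singular directions.

```python
import numpy as np

# Hypothetical toy "document x term" matrix: 4 documents, 3 terms.
toy_X = np.array([[1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.0, 1.0, 0.2],
                  [0.0, 0.8, 0.3]])

# Thin SVD: toy_X = U @ diag(S) @ Vt, singular values in descending order.
U, S, Vt = np.linalg.svd(toy_X, full_matrices=False)

# Keep the two strongest directions and project every document onto them.
components = Vt[:2]             # shape (2, n_terms)
toy_2d = toy_X @ components.T   # shape (n_documents, 2)
```

Each row of `toy_2d` is one document expressed in two coordinates, which is the shape needed for a scatter plot.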

In [None]:
from sklearn.decomposition import TruncatedSVD
# maths behind?

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
svd = TruncatedSVD(n_components=2)
svd.fit(X)
X_2d = svd.transform(X)

In [None]:
plt.figure(figsize=(12,8))
plt.scatter(X_2d[:,0], X_2d[:,1])

## Yay, machine learning!

I'll use the **K-Means** clustering algorithm because we have a moderate number of directions and know the number of clusters in advance: 8 (see the [readme](./README.md); these are all the classes from the TEI classification except `mixed`). The classes are:

1. setting,
2. entrance,
3. exit,
4. business,
5. novelistic,
6. delivery,
7. modifier,
8. location.
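For intuition about what K-Means does with a fixed number of clusters, here is a toy sketch of the classic assign-then-update loop (Lloyd's algorithm) in NumPy. It is an illustration under simplifying assumptions (random initial centers, a fixed iteration cap), not the scikit-learn implementation; the function name and the toy points are made up.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Toy Lloyd's algorithm: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated toy blobs of three points each.
toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_labels, toy_centers = kmeans_sketch(toy_points, k=2)
```

The cluster numbers themselves are arbitrary: K-Means only groups the points, so mapping cluster indices to the eight classes above still has to be done by inspection.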

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

In [None]:
from sklearn.cluster import KMeans

In [None]:
k_means = KMeans(n_clusters=8)
k_means.fit(X)

In [None]:
y_means = k_means.predict(X)
plt.figure(figsize=(12,8))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_means, s=50, cmap='viridis')

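# Cluster centers live in the full TF-IDF space, so project them with the
# same fitted SVD before drawing them on the 2-D scatter plot.
centers_2d = svd.transform(k_means.cluster_centers_)
plt.scatter(centers_2d[:, 0], centers_2d[:, 1], c='black', s=200, alpha=0.5)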
# cluster meanings?
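One possible way to get at the cluster meanings is to list the highest-weighted terms of every cluster center. The sketch below uses made-up toy data; in this notebook the inputs would presumably be `k_means.cluster_centers_` and `tfidf.get_feature_names_out()` (in older scikit-learn versions, `get_feature_names()`). The helper name is hypothetical.

```python
import numpy as np

def top_terms_per_cluster(centers, feature_names, n_terms=3):
    """For each cluster center, list the terms with the largest weights."""
    feature_names = np.asarray(feature_names)
    # argsort is ascending, so reverse each row and keep the first n_terms.
    order = np.argsort(centers, axis=1)[:, ::-1][:, :n_terms]
    return [feature_names[row].tolist() for row in order]

# Hypothetical toy data standing in for the fitted model's centers
# and the vectorizer's vocabulary.
example_centers = np.array([[0.9, 0.1, 0.0, 0.3],
                            [0.0, 0.2, 0.8, 0.1]])
example_terms = ["enter", "leave", "sit", "door"]
example_top = top_terms_per_cluster(example_centers, example_terms, n_terms=2)
```

Here the first toy cluster is dominated by `enter` and `door` and the second by `sit` and `leave`; on the real data, such term lists could be compared against the eight TEI classes.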