# Stage directions: basic features

This notebook contains:

* directions extraction
* basic values calculation

## Imports
<span style="font-size:larger;color:blue;">Run this before everything else!</span>

In [1]:
# imports
import os
from lxml import etree
import re
from pymystem3 import Mystem
from bs4 import BeautifulSoup
from statistics import mean
import pandas as pd


# this namespace mapping is needed to search TEI-encoded files correctly
tei_ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

# one Mystem for everything, so that we won't have to initialize it every time
mystem = Mystem()
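
To illustrate why the `tei_ns` mapping matters, here is a minimal sketch on a made-up TEI fragment (not taken from the corpus). It uses the standard library's `xml.etree.ElementTree` instead of `lxml` only to stay self-contained; its `find` accepts the same kind of namespace mapping.

```python
import xml.etree.ElementTree as ET

# Same mapping as in the imports cell
tei_ns = {"tei": "http://www.tei-c.org/ns/1.0"}

# Made-up, minimal TEI fragment (an assumption for illustration only)
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><stage>(Входит.)</stage></body></text>
</TEI>"""

root = ET.fromstring(sample)

# Without the prefix nothing is found: the element's real name is namespaced
without_ns = root.find(".//stage")
# With the prefix and the mapping the search succeeds
with_ns = root.find(".//tei:stage", tei_ns)

print(without_ns)    # None
print(with_ns.text)  # (Входит.)
```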

## Preparation

### Data
I used [RusDraCor](https://github.com/dracor-org/rusdracor) as of 8th February, 2018. More precise data on the corpus is below.

In [2]:
corpus_path = ".." + os.sep + "RusDraCor"

### Play class
I want to create a table with information on the plays we research, so it will be much easier to have a separate ```Play``` class with all the data.


```python
class Play:
    def __init__(self, path, author, title, year_written, year_published, amount_of_directions, amount_of_acts):
        self.path = path
        self.author = author
        self.title = title
        self.year_written = year_written
        self.year_published = year_published
        self.amount_of_directions = amount_of_directions
        self.amount_of_acts = amount_of_acts
```

Update: we don't really need that one.

## Extraction

### Play information
Most of this information can be extracted from the ```<teiHeader>``` element of the XML file itself. Each act is wrapped in ```<div type="act">```, so counting the acts is also straightforward.

In [3]:
def get_play_data(play_path):
    # raw string, so that "\s" is not read as an (invalid) string escape
    reg_space = re.compile(r"\s{2,}")
    root = etree.parse(play_path)
    title = root.find(".//tei:titleStmt/tei:title", tei_ns).text
    title = reg_space.sub(" ", title)
    author = root.find(".//tei:titleStmt/tei:author", tei_ns).text
    # a missing <date>, a missing "when" attribute or a non-numeric value all give -1
    try:
        written_str = root.find(".//tei:sourceDesc/tei:bibl/tei:bibl/tei:date[@type=\"written\"]", tei_ns).attrib["when"]
        year_written = int(written_str)
    except (AttributeError, KeyError, ValueError):
        year_written = -1
    try:
        published_str = root.find(".//tei:sourceDesc/tei:bibl/tei:bibl/tei:date[@type=\"print\"]", tei_ns).attrib["when"]
        year_published = int(published_str)
    except (AttributeError, KeyError, ValueError):
        year_published = -1
    acts = len(root.findall(".//tei:text/tei:body/tei:div[@type=\"act\"]", tei_ns))
    return author, title, year_written, year_published, acts
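
The date fallback and the act count can be tried out on a miniature document. The sketch below is an illustration under assumptions: the XML is invented, `year_of` is a hypothetical helper, and the standard library's `xml.etree.ElementTree` stands in for `lxml` so that the example runs on its own.

```python
import xml.etree.ElementTree as ET

tei_ns = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented miniature play: a "written" date, no "print" date, two acts
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>
    <titleStmt><title>Sample play</title><author>Author, Name</author></titleStmt>
    <sourceDesc><bibl><bibl>
      <date type="written" when="1835"/>
    </bibl></bibl></sourceDesc>
  </fileDesc></teiHeader>
  <text><body><div type="act"/><div type="act"/></body></text>
</TEI>"""


def year_of(root, date_type):
    """Return the year in <date type=...>, or -1 if it is missing or not a number."""
    path = './/tei:sourceDesc/tei:bibl/tei:bibl/tei:date[@type="{}"]'.format(date_type)
    node = root.find(path, tei_ns)
    try:
        return int(node.attrib["when"])
    except (AttributeError, KeyError, ValueError):
        return -1


root = ET.fromstring(sample)
written = year_of(root, "written")
published = year_of(root, "print")
acts = len(root.findall('.//tei:text/tei:body/tei:div[@type="act"]', tei_ns))

print(written, published, acts)  # 1835 -1 2
```

A play like this one would later be dropped, because its publication year comes back as -1.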

### Amount of words in a play
Word count is required to count some averages later on.

In [4]:
def word_count(play_path, mystem):
    with open(play_path, "r", encoding="utf-8") as play:
        soup = BeautifulSoup(play, "xml")
        text = soup.find("text").text
    lemmas_play = set(mystem.lemmatize(text))
    words_play = [item for item in mystem.analyze(text) if "analysis" in item]
    return len(lemmas_play), len(words_play)
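
The word count relies on the filter `"analysis" in item`. The sketch below shows that filter on a hand-written list that only imitates the shape of the analyzer's output; the tokens are an assumption for illustration, real output carries more fields, and Mystem itself is not called here.

```python
# Hand-written stand-in for analyzer output (simplified, not real Mystem data)
mock_tokens = [
    {"text": "Входит", "analysis": [{"lex": "входить"}]},
    {"text": " "},
    {"text": "Анна", "analysis": [{"lex": "анна"}]},
    {"text": "."},
    {"text": "\n"},
]

# Same filter as in word_count: keep only items that carry an analysis
words = [item for item in mock_tokens if "analysis" in item]

print(len(words))  # 2
```

Spaces and punctuation have no `"analysis"` key in this mock-up, so they are not counted as words.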

### Directions
All the directions are wrapped in the ```<stage></stage>``` tag, so we can collect them with a single search for that tag. The only problem with this approach is that a direction is sometimes spread over several lines, and many spaces and tabs may occur between two words; this is fixed with a regex.

In [5]:
def get_directions(play_path):
    # raw string, so that "\s" is not read as an (invalid) string escape
    reg_space = re.compile(r"\s{2,}")
    root = etree.parse(play_path)
    directions_tags = root.findall(".//tei:stage", tei_ns)
    directions_text = []
    for direction in directions_tags:
        text = direction.text
        # skip <stage> elements that have no text of their own
        if text is None:
            continue
        text = reg_space.sub(" ", text)
        text = text.replace("\xa0", " ")
        # trim brackets, spaces and full stops at both ends
        text = text.strip("() .")
        text = text.lower()
        directions_text.append(text)
    return directions_text
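
The clean-up steps can be checked in isolation. This is a small sketch of the same sequence (collapse whitespace runs, replace non-breaking spaces, trim brackets and full stops, lowercase); `normalize_direction` is a hypothetical name and the sample strings are invented.

```python
import re

REG_SPACE = re.compile(r"\s{2,}")


def normalize_direction(text):
    """Apply the clean-up steps to one raw direction string."""
    text = REG_SPACE.sub(" ", text)   # runs of 2+ whitespace characters -> one space
    text = text.replace("\xa0", " ")  # remaining single non-breaking spaces
    text = text.strip("() .")         # brackets, spaces and full stops at the ends
    return text.lower()


multiline = normalize_direction("(Входит  \n\t Анна.)")
nbsp = normalize_direction("Тихо\xa0говорит")

print(multiline)  # входит анна
print(nbsp)       # тихо говорит
```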

In addition, let's count the average length of a direction in a play!

In [6]:
def average_direction(directions, mystem):
    dir_length = []
    dir_lemmas = []
    for direction in directions:
        length = len([item for item in mystem.analyze(direction) if "analysis" in item])
        dir_length.append(length)
        lemmas = len(set(mystem.lemmatize(direction)))
        dir_lemmas.append(lemmas)
    return mean(dir_length), mean(dir_lemmas)
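
One caveat: `statistics.mean` raises `StatisticsError` on an empty sequence, so a play without any directions would stop the processing. Whether such plays occur in the corpus is not shown here; if they do, a guard like the hypothetical `safe_mean` below is one possible way around it.

```python
from statistics import mean


def safe_mean(values, default=0.0):
    """Mean of values, or `default` when there is nothing to average."""
    return mean(values) if values else default


print(safe_mean([3, 5]))  # 4
print(safe_mean([]))      # 0.0
```

The choice of `0.0` as the default is arbitrary; a missing value such as `float("nan")` may suit later statistics better.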

## Putting it all together

### Saving the data
The extracted directions are saved so that we don't have to extract them from the plays every time we need them.

In [7]:
# some basic stuff we need in order to save everything properly
directions_path = ".." + os.sep + "data"
if not os.path.exists(directions_path):
    os.makedirs(directions_path)

all_directions_file = directions_path + os.sep + "all_directions.txt"
if not os.path.exists(all_directions_file):
    f = open(all_directions_file, "w", encoding="utf-8")
    f.close()

In [8]:
def save_directions(play_name, directions, option):
    if option == "sep":
        directions_file = directions_path + os.sep + play_name.replace("xml", "txt")
        with open(directions_file, "w", encoding="utf-8") as directions_save:
            directions_save.write("\n".join(directions))
    elif option == "all":
        with open(all_directions_file, "a", encoding="utf-8") as all_directions_save:
            all_directions_save.write("\n".join(directions) + "\n")
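
The difference between the two modes (overwrite a per-play file, append to the combined file) can be tried safely in a temporary folder. The sketch below is only an illustration: `save_lines` and the file names are hypothetical, and it uses `os.makedirs(..., exist_ok=True)` and `os.path.join` rather than the notebook's own helpers.

```python
import os
import tempfile


def save_lines(folder, file_name, lines, mode="w"):
    """Write one direction per line; mode "a" appends to an existing file."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, file_name)
    with open(path, mode, encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "data")
    # per-play file: rewritten from scratch on every call
    save_lines(folder, "sample-play.txt", ["входит", "уходит"])
    # combined file: grows with every call
    combined = save_lines(folder, "all_directions.txt", ["входит", "уходит"], mode="a")
    combined = save_lines(folder, "all_directions.txt", ["садится"], mode="a")
    with open(combined, encoding="utf-8") as check:
        saved = check.read().splitlines()

print(saved)  # ['входит', 'уходит', 'садится']
```

Because the combined file is opened in append mode, running the extraction twice would store every direction twice unless the file is emptied first.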

### Working with the corpus

Now, we traverse all the plays in ```corpus_path``` and do the following:
* extract all the directions and save them: 
    * in a ```directions_list``` variable,
    * in a separate file,
    * in a file that contains all the directions
* extract information about a play and save it into ```play_list``` (to convert it into a Pandas dataframe later)

In [9]:
# keep only the XML files: macOS sometimes adds system files like .DS_Store to a folder;
# on other systems this filter changes nothing
play_files = [item for item in os.listdir(corpus_path) if item.endswith(".xml")]

In [10]:
directions_list = []
play_list = []
path_list = []

dropped_count = 1
for play in play_files:
    play_path = corpus_path + os.sep + play
    
    st_dirs = get_directions(play_path)
    author, title, year_written, year_published, acts = get_play_data(play_path)
    lemmas, words = word_count(play_path, mystem)
    words_dir, lemmas_dir = average_direction(st_dirs, mystem)
    
    play_info = {"Path": play_path, 
            "Author": author, 
            "Title": title, 
            "Written": year_written, 
            "Published": year_published, 
            "Amount of directions": len(st_dirs), 
            "Amount of acts": acts if acts!=0 else 1,
            "Lemmas": lemmas,
            "Words": words,
            "Words per direction": words_dir,
            "Lemmas per direction": lemmas_dir}
    
    if (play_info["Written"] != -1) and (play_info["Published"] != -1):
        # directions
        directions_list.append(st_dirs)
        save_directions(play, st_dirs, option="sep")
        save_directions(play, st_dirs, option="all")
        
        # add information about a play
        play_list.append(play_info)
        path_list.append(play_path)
    else:
        print("Dropped a play: {}, total dropped: {}".format(play_path, dropped_count))
        dropped_count += 1

Dropped a play: ../RusDraCor/sukhovo-kobylin-smert-tarelkina.xml, total dropped: 1
Dropped a play: ../RusDraCor/shakhovskoy-ne-lubo-ne-slushai.xml, total dropped: 2
Dropped a play: ../RusDraCor/shakhovskoy-pustodomy.xml, total dropped: 3
Dropped a play: ../RusDraCor/lomonosov-tamira-i-selim.xml, total dropped: 4
Dropped a play: ../RusDraCor/naydyonov-deti-vanjushina.xml, total dropped: 5
Dropped a play: ../RusDraCor/plavilshchikov-sgovor-kutejkina.xml, total dropped: 6
Dropped a play: ../RusDraCor/shakhovskoy-svoya-semya.xml, total dropped: 7
Dropped a play: ../RusDraCor/turgenev-razgovor-na-bolshoj-doroge.xml, total dropped: 8
Dropped a play: ../RusDraCor/prutkov-spor-drevnih-grecheskih-filosofov-ob-izjaschnom.xml, total dropped: 9
Dropped a play: ../RusDraCor/shakhovskoy-urok-koketkam.xml, total dropped: 10
Dropped a play: ../RusDraCor/krylov-urok-dochkam.xml, total dropped: 11
Dropped a play: ../RusDraCor/krylov-amerikantsy.xml, total dropped: 12


#### Corrections
Note that the first ```if``` statement in the ```for``` loop removes all the plays where __it is not clear when the play was written and/or published__.

Furthermore, if there were no ```<div type="act">```, it means that __there is only one act in the play (not zero)__. I corrected it while creating ```play_info```.
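
Both corrections can be sketched on made-up records shaped like ```play_info``` (the titles and numbers below are placeholders, not corpus data):

```python
# Made-up records that imitate play_info before the corrections
candidates = [
    {"Title": "Play A", "Written": 1835, "Published": 1837, "Amount of acts": 0},
    {"Title": "Play B", "Written": -1, "Published": 1842, "Amount of acts": 3},
    {"Title": "Play C", "Written": 1798, "Published": 1905, "Amount of acts": 3},
]

kept = []
for info in candidates:
    # drop plays with an unknown writing or publication year
    if info["Written"] == -1 or info["Published"] == -1:
        continue
    fixed = dict(info)
    # no <div type="act"> found means one act, not zero
    if fixed["Amount of acts"] == 0:
        fixed["Amount of acts"] = 1
    kept.append(fixed)

print([play["Title"] for play in kept])  # ['Play A', 'Play C']
```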

### Data frame

Now, we create a Pandas data frame with all the information about the plays. We'll use paths to the plays to refer to rows.

In [11]:
play_df = pd.DataFrame(play_list).set_index("Path")
play_df.head()

| Path | Amount of acts | Amount of directions | Author | Lemmas | Lemmas per direction | Published | Title | Words | Words per direction | Written |
|---|---|---|---|---|---|---|---|---|---|---|
| ../RusDraCor/pushkin-stseny-iz-rytsarskih-vremen.xml | 1 | 38 | Пушкин, Александр Сергеевич | 997 | 4.973684 | 1837 | Сцены из рыцарских времен | 3397 | 3.131579 | 1835 |
| ../RusDraCor/turgenev-holostjak.xml | 1 | 687 | Тургенев, Иван Сергеевич | 2788 | 6.391557 | 1849 | Холостяк | 21501 | 4.50655 | 1849 |
| ../RusDraCor/gogol-zhenitba.xml | 2 | 254 | Гоголь, Николай Васильевич | 2188 | 5.952756 | 1842 | Женитьба | 13094 | 3.925197 | 1835 |
| ../RusDraCor/krylov-sonnyj-poroshok-ili-pohischennaja-krestjanka.xml | 3 | 88 | Крылов, Иван Андреевич | 1550 | 6.522727 | 1905 | Сонный порошок или похищенная крестьянка | 7605 | 4.215909 | 1798 |
| ../RusDraCor/blok-neznakomka.xml | 1 | 80 | Блок, Александр Александрович | 1342 | 18.025 | 1907 | Незнакомка | 4222 | 16.4625 | 1906 |


## Simple numbers
We will calculate:

* word and lemma count,
* the number of words for each of the following parts of speech:
    - nouns,
    - adjectives,
    - verbs,
    - adverbs,
    - interjections
* amount of directions per act

### Word, lemma count
Word and lemma counts were calculated previously (see the "Extraction" section). Here, I also calculate the number of directions, words and lemmas per act.

In [12]:
# directions per act
play_df["Directions per act"] = play_df["Amount of directions"]/play_df["Amount of acts"]
# words per act
play_df["Words per act"] = play_df["Words"]/play_df["Amount of acts"]
# lemmas per act
play_df["Lemmas per act"] = play_df["Lemmas"]/play_df["Amount of acts"]
play_df.head()

| Path | Amount of acts | Amount of directions | Author | Lemmas | Lemmas per direction | Published | Title | Words | Words per direction | Written | Directions per act | Words per act | Lemmas per act |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ../RusDraCor/pushkin-stseny-iz-rytsarskih-vremen.xml | 1 | 38 | Пушкин, Александр Сергеевич | 997 | 4.973684 | 1837 | Сцены из рыцарских времен | 3397 | 3.131579 | 1835 | 38.0 | 3397.0 | 997.0 |
| ../RusDraCor/turgenev-holostjak.xml | 1 | 687 | Тургенев, Иван Сергеевич | 2788 | 6.391557 | 1849 | Холостяк | 21501 | 4.50655 | 1849 | 687.0 | 21501.0 | 2788.0 |
| ../RusDraCor/gogol-zhenitba.xml | 2 | 254 | Гоголь, Николай Васильевич | 2188 | 5.952756 | 1842 | Женитьба | 13094 | 3.925197 | 1835 | 127.0 | 6547.0 | 1094.0 |
| ../RusDraCor/krylov-sonnyj-poroshok-ili-pohischennaja-krestjanka.xml | 3 | 88 | Крылов, Иван Андреевич | 1550 | 6.522727 | 1905 | Сонный порошок или похищенная крестьянка | 7605 | 4.215909 | 1798 | 29.333333 | 2535.0 | 516.666667 |
| ../RusDraCor/blok-neznakomka.xml | 1 | 80 | Блок, Александр Александрович | 1342 | 18.025 | 1907 | Незнакомка | 4222 | 16.4625 | 1906 | 80.0 | 4222.0 | 1342.0 |
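
As a sanity check, the per-act columns can be recomputed by hand for two of the rows shown above. The sketch uses plain Python instead of Pandas; the short dictionary keys are my own labels, and the numbers are copied from the table.

```python
# Values copied from the table above (keys are ad hoc labels)
rows = {
    "gogol-zhenitba": {"acts": 2, "directions": 254, "words": 13094, "lemmas": 2188},
    "krylov-sonnyj-poroshok": {"acts": 3, "directions": 88, "words": 7605, "lemmas": 1550},
}

results = {}
for name, row in rows.items():
    # each per-act value is the total divided by the number of acts
    results[name] = {
        key: round(row[key] / row["acts"], 6)
        for key in ("directions", "words", "lemmas")
    }

print(results["gogol-zhenitba"])
# {'directions': 127.0, 'words': 6547.0, 'lemmas': 1094.0}
print(results["krylov-sonnyj-poroshok"])
# {'directions': 29.333333, 'words': 2535.0, 'lemmas': 516.666667}
```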
