# Stage directions: basic features

This notebook contains:

* directions extraction
* basic values calculation

## Imports
<span style="font-size:larger;color:blue;">Run this before everything else!</span>

In [1]:
# imports
import os
from lxml import etree
import re
import pandas as pd
from pymystem3 import Mystem


# we need this namespaces parameter to search in TEI-encoded files correctly
tei_ns = {'tei': 'http://www.tei-c.org/ns/1.0'}
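
To see why the mapping matters, here is a minimal sketch on a made-up TEI fragment. It uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without extra packages; both accept the same `namespaces` argument in `find`. The sample XML is an assumption, not a file from the corpus.

```python
import xml.etree.ElementTree as ET

tei_ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

# made-up fragment: the default namespace puts every element in the TEI namespace
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample play</title></titleStmt></fileDesc></teiHeader>
</TEI>"""

root = ET.fromstring(sample)

# a bare tag name does not match a namespaced element
without_ns = root.find(".//title")
# the prefixed name plus the mapping does
with_ns = root.find(".//tei:title", tei_ns)

print(without_ns)
print(with_ns.text)
```

Without the prefix the search returns `None`, which is why every query in this notebook passes `tei_ns`.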

## Preparation

### Data
I used [RusDraCor](https://github.com/dracor-org/rusdracor) as of 8th February, 2018. More precise data on the corpus is below.

In [2]:
corpus_path = ".." + os.sep + "RusDraCor"

### Play class
I want to create a table with information on the plays we research, so it will be much easier to have a separate ```Play``` class with all the data.

In [3]:
class Play:
    def __init__(self, path, author, title, year_written, year_published, amount_of_directions, amount_of_acts):
        self.path = path
        self.author = author
        self.title = title
        self.year_written = year_written
        self.year_published = year_published
        self.amount_of_directions = amount_of_directions
        self.amount_of_acts = amount_of_acts
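
As a possible alternative, the same fields could be declared as a dataclass, which turns an instance into a table row with `asdict`. This is only a sketch: `PlayRecord` is a hypothetical name and the values are invented for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass
class PlayRecord:  # hypothetical name; same fields as Play above
    path: str
    author: str
    title: str
    year_written: int
    year_published: int
    amount_of_directions: int
    amount_of_acts: int


# invented values, not taken from the corpus
example = PlayRecord("sample.xml", "Sample Author", "Sample play", 1836, 1842, 120, 5)

# a plain dict like this can be collected into a list and passed to pd.DataFrame
row = asdict(example)
print(row["title"], row["amount_of_acts"])
```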

## Extraction

### Play information
Most of this information can be extracted from the ```<teiHeader>``` of the XML file itself. Each act is wrapped in ```<div type="act">```, so counting the acts isn't much of a problem either.

In [4]:
def get_play_data(play_path):
    reg_space = re.compile(r"\s{2,}")
    root = etree.parse(play_path)
    title = root.find(".//tei:titleStmt/tei:title", tei_ns).text
    title = reg_space.sub(" ", title)
    author = root.find(".//tei:titleStmt/tei:author", tei_ns).text
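    # a missing <date>, a missing "when" attribute or a non-numeric value all mean "unknown"
    try:
        written_str = root.find(".//tei:sourceDesc/tei:bibl/tei:bibl/tei:date[@type=\"written\"]", tei_ns).attrib["when"]
        year_written = int(written_str)
    except (AttributeError, KeyError, ValueError):
        year_written = -1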
    try:
        published_str = root.find(".//tei:sourceDesc/tei:bibl/tei:bibl/tei:date[@type=\"print\"]", tei_ns).attrib["when"]
        year_published = int(published_str)
    except (AttributeError, KeyError, ValueError):
        year_published = -1
    acts = len(root.findall(".//tei:text/tei:body/tei:div[@type=\"act\"]", tei_ns))
    return author, title, year_written, year_published, acts    
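
The fallback to `-1` is easier to follow on a small example. The sketch below runs the same XPath queries on a made-up TEI header that has a "written" date but no "print" date. It uses the standard library's `xml.etree.ElementTree` in place of `lxml` so that it runs anywhere; `find_year` and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

tei_ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

# made-up document: a "written" date, no "print" date, two acts
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample play</title><author>Sample Author</author></titleStmt>
      <sourceDesc>
        <bibl>
          <bibl>
            <date type="written" when="1836"/>
          </bibl>
        </bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><div type="act"/><div type="act"/></body></text>
</TEI>"""


def find_year(root, date_type):
    """Return the year stored in <date type=...>, or -1 if it cannot be read."""
    path = ".//tei:sourceDesc/tei:bibl/tei:bibl/tei:date[@type='{}']".format(date_type)
    date = root.find(path, tei_ns)
    if date is None:
        return -1
    try:
        return int(date.get("when"))
    except (TypeError, ValueError):  # attribute missing or not a plain year
        return -1


root = ET.fromstring(sample)
year_written = find_year(root, "written")
year_published = find_year(root, "print")
acts = len(root.findall(".//tei:text/tei:body/tei:div[@type='act']", tei_ns))
print(year_written, year_published, acts)
```

A play like this one would later be dropped, because its publication year is unknown.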

### Directions
All the directions are wrapped in the ```<stage></stage>``` tag, so we can collect them with a single ```findall```. The only problem with this approach is that a direction is sometimes split across several lines, with runs of spaces and tabs between two words; this is fixed with a regex.

In [5]:
def get_directions(play_path):
    reg_space = re.compile(r"\s{2,}")
    root = etree.parse(play_path)
    directions_tags = root.findall(".//tei:stage", tei_ns)
    directions_text = []
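    for direction in directions_tags:
        text = direction.text
        # a <stage> element that is empty or starts with a child element has no leading text
        if text is None:
            continue
        text = reg_space.sub(" ", text)
        text = text.replace("\xa0", " ")
        # drop surrounding brackets, spaces and full stops
        text = text.strip("() .")
        text = text.lower()
        directions_text.append(text)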
    return directions_text
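
The cleaning steps can be tried on their own. The sketch below applies them to two invented strings; `clean_direction` is a hypothetical helper that repeats the body of the loop above.

```python
import re

reg_space = re.compile(r"\s{2,}")


def clean_direction(text):
    """Collapse whitespace runs, then strip brackets, spaces and full stops."""
    text = reg_space.sub(" ", text)       # runs of 2+ whitespace characters become one space
    text = text.replace("\xa0", " ")      # single non-breaking spaces become plain spaces
    text = text.strip("() .")             # characters to strip from both ends
    return text.lower()


# invented examples: a direction broken across lines, and one padded with non-breaking spaces
multiline = clean_direction("(Goes   into\n        the garden.)")
padded = clean_direction("\xa0Aside.\xa0")
print(multiline)
print(padded)
```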

## Putting it all together

### Saving the data
We save the extracted directions so that they don't have to be extracted from the plays every time we need them.

In [6]:
# some basic stuff we need in order to save everything properly
directions_path = ".." + os.sep + "data"
if not os.path.exists(directions_path):
    os.makedirs(directions_path)

all_directions_file = directions_path + os.sep + "all_directions.txt"
if not os.path.exists(all_directions_file):
    f = open(all_directions_file, "w", encoding="utf-8")
    f.close()

In [7]:
def save_directions(play_name, directions, option):
    if option == "sep":
        directions_file = directions_path + os.sep + play_name.replace("xml", "txt")
        with open(directions_file, "w", encoding="utf-8") as directions_save:
            directions_save.write("\n".join(directions))
    elif option == "all":
        with open(all_directions_file, "a", encoding="utf-8") as all_directions_save:
            all_directions_save.write("\n".join(directions) + "\n")
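
The two options differ only in the file mode: `"w"` overwrites the per-play file, `"a"` keeps adding to the combined one. A minimal sketch of that difference, using a temporary folder and invented directions (`save_lines` is a hypothetical helper, not part of the notebook):

```python
import os
import tempfile


def save_lines(path, lines, mode):
    """Write one direction per line using the given file mode."""
    with open(path, mode, encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    per_play = os.path.join(tmp_dir, "play.txt")
    combined = os.path.join(tmp_dir, "all_directions.txt")

    # two invented batches of directions written to the same pair of files
    for directions in (["enters", "sits down"], ["exits"]):
        save_lines(per_play, directions, "w")   # overwritten on every call
        save_lines(combined, directions, "a")   # grows on every call

    with open(per_play, encoding="utf-8") as handle:
        per_play_lines = handle.read().splitlines()
    with open(combined, encoding="utf-8") as handle:
        combined_lines = handle.read().splitlines()

print(per_play_lines)
print(combined_lines)
```

Because the combined file is opened in append mode, running the extraction loop twice without deleting ```all_directions.txt``` stores every direction twice.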

### Working with the corpus

Now, we traverse all the plays in ```corpus_path``` and do the following:
* extract all the directions and save them: 
    * in a ```directions_list``` variable,
    * in a separate file,
    * in a file that contains all the directions
* extract information about a play and save it into ```play_list``` (to convert it into a Pandas dataframe later)

In [8]:
# keep only the XML files: macOS sometimes adds system files like .DS_Store to the folder;
# on other systems this filter changes nothing
play_files = [item for item in os.listdir(corpus_path) if item.endswith(".xml")]

In [9]:
directions_list = []
play_list = []
id_list = []

i = 0
dropped_list = []
for play in play_files:
    play_path = corpus_path + os.sep + play
    
    st_dirs = get_directions(play_path)
    author, title, year_written, year_published, acts = get_play_data(play_path)
    play_info = {"Path": play_path, 
            "Author": author, 
            "Title": title, 
            "Written": year_written, 
            "Published": year_published, 
            "Amount of directions": len(st_dirs), 
            "Amount of acts": acts if acts!=0 else 1}
    
    if (play_info["Written"] != -1) and (play_info["Published"] != -1):
        # directions
        directions_list.append(st_dirs)
        save_directions(play, st_dirs, option="sep")
        save_directions(play, st_dirs, option="all")
        
        # add information about a play
        play_list.append(play_info)
        id_list.append(i)
        i += 1
    else:
        dropped_list.append(play)

print("{} plays were dropped; they are:\n- {}".format(len(dropped_list), "\n- ".join(dropped_list)))

12 plays were dropped; they are:
- sukhovo-kobylin-smert-tarelkina.xml
- shakhovskoy-ne-lubo-ne-slushai.xml
- shakhovskoy-pustodomy.xml
- lomonosov-tamira-i-selim.xml
- naydyonov-deti-vanjushina.xml
- plavilshchikov-sgovor-kutejkina.xml
- shakhovskoy-svoya-semya.xml
- turgenev-razgovor-na-bolshoj-doroge.xml
- prutkov-spor-drevnih-grecheskih-filosofov-ob-izjaschnom.xml
- shakhovskoy-urok-koketkam.xml
- krylov-urok-dochkam.xml
- krylov-amerikantsy.xml


#### Corrections
Note that the first ```if``` statement in the ```for``` loop removes all the plays where __it is not clear when the play was written and/or published__.

Furthermore, if a play contains no ```<div type="act">``` elements, it means that __there is only one act in the play (not zero)__. I corrected this while creating ```play_info```.

### Data frame

Now, we create a Pandas data frame with all the information about the plays. We'll use a simple integer index (```id_list```) to refer to rows.

In [None]:
play_df = pd.DataFrame(play_list, index=id_list)
play_df.head()

## Simple numbers
We will calculate:

* word count,
* counts of the following parts of speech:
    - nouns,
    - adjectives,
    - verbs,
    - adverbs,
    - interjections

In [None]:
m = Mystem()
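
A rough sketch of how the counts listed above could be computed. It does not call Mystem: `sample_analysis` is hand-written to imitate the list of token dictionaries that `m.analyze(text)` is assumed to return, and the tag names (`S`, `A`, `V`, `ADV`, `INTJ`) are likewise an assumption about Mystem's tagset. `count_pos` is a hypothetical helper.

```python
import re
from collections import Counter

# assumed Mystem part-of-speech tags mapped to readable labels
POS_LABELS = {"S": "nouns", "A": "adjectives", "V": "verbs",
              "ADV": "adverbs", "INTJ": "interjections"}


def count_pos(analysis):
    """Return (word count, Counter of parts of speech) for one analysed direction."""
    words = 0
    counts = Counter()
    for token in analysis:
        variants = token.get("analysis")
        if variants is None:   # assumed: spaces and punctuation carry no "analysis" key
            continue
        words += 1
        if not variants:       # a word the analyser could not parse
            continue
        # the part of speech is the first item of the grammar string, before "," or "="
        pos = re.split(r"[,=]", variants[0]["gr"])[0]
        if pos in POS_LABELS:
            counts[POS_LABELS[pos]] += 1
    return words, counts


# hand-written imitation of analyser output for the direction "тихо входит слуга"
sample_analysis = [
    {"text": "тихо", "analysis": [{"lex": "тихо", "gr": "ADV="}]},
    {"text": " "},
    {"text": "входит", "analysis": [{"lex": "входить", "gr": "V,несов,нп=непрош,ед,изъяв,3-л"}]},
    {"text": " "},
    {"text": "слуга", "analysis": [{"lex": "слуга", "gr": "S,муж,од=им,ед"}]},
    {"text": "\n"},
]

word_count, pos_counts = count_pos(sample_analysis)
print(word_count, dict(pos_counts))
```

Only the first analysis variant of each word is used here, which is a simplification: ambiguous words may have several.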
