# Stage directions: basic features

This notebook covers:

* extraction of stage directions
* calculation of basic statistics

## Preparation

### Corpus
I used [RusDraCor](https://github.com/dracor-org/rusdracor) as of <font color="#900C3F">20th October, 2018</font>. 

In [1]:
import os

In [2]:
corpus_path = ".." + os.sep + "RusDraCor"

### Folders and files

|       Folder      |                Path              |         What's inside            |
|:-----------------:|:--------------------------------:|:--------------------------------:|
|directions_folder  |`../directions/`                  |Extracted directions, play-by-play|
|csv_folder         |`./csv/`                          |Datasets in comma-separated format|
|all_directions_file|`../directions/all_directions.txt`|All directions in one file        |

In [3]:
directions_folder = ".." + os.sep + "directions"
if not os.path.exists(directions_folder):
    os.mkdir(directions_folder)

In [4]:
csv_folder = "." + os.sep + "csv"
if not os.path.exists(csv_folder):
    os.mkdir(csv_folder)

In [5]:
all_directions_file = directions_folder + os.sep + "all_directions.txt"
# create (or empty) the file, because directions are appended to it later
with open(all_directions_file, "w", encoding="utf-8"):
    pass

## Extraction

### Dates issue

The date of a play is not straightforward: the corpus provides up to three different dates, which are:
* `print`,
* `premiere`,
* and `written`.

Not all of them are always present. To get a single date for a play, we do the following:

1. If both `print` and `premiere` are available, we take the earlier of the two; if only one of them is available, we take that one.
2. If `written` is more than 10 years earlier than the date chosen in step 1, we take `written` instead. This covers plays that were censored, banned, or discovered only after the author's death: a play that was first printed or staged more than 10 years after it was written is grouped with the plays of the time when it was written.
3. If neither `print` nor `premiere` is available, we fall back to `written`.

In [6]:
def single_date(date_print, date_premiere, date_written):
    # step 1: the earliest available date of print and premiere
    if date_print and date_premiere:
        date_definite = min(date_print, date_premiere)
    elif date_premiere:
        date_definite = date_premiere
    else:
        date_definite = date_print

    if date_written and date_definite:
        # step 2: written long before it became public
        if date_definite - date_written > 10:
            date_definite = date_written
    elif date_written and not date_definite:
        # step 3: only the date of writing is known
        date_definite = date_written
    return date_definite
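
A few worked cases make the rules concrete. The sketch below restates them in a compact helper, `resolve_year`, which is a hypothetical name used only so that the snippet runs on its own. The years are made up for illustration and are not taken from the corpus.

```python
def resolve_year(year_print, year_premiere, year_written):
    """Hypothetical restatement of the date rules described above."""
    # Step 1: earliest of the print and premiere years that are present.
    public = [y for y in (year_print, year_premiere) if y]
    year = min(public) if public else None
    # Step 2: a play written more than 10 years earlier is dated by `written`.
    if year and year_written and year - year_written > 10:
        year = year_written
    # Step 3: with no print or premiere year, fall back to `written`.
    if not year:
        year = year_written
    return year


# Illustrative (made-up) years, not corpus data.
print(resolve_year(1860, 1855, 1850))  # 1855: earlier of print and premiere
print(resolve_year(1860, None, 1830))  # 1830: written 30 years before print
print(resolve_year(None, None, 1825))  # 1825: only `written` is known
print(resolve_year(None, None, None))  # None: no date at all
```

A play for which the result is `None` has no usable date and is dropped later in the notebook.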

### Play information
Most of the metadata can be extracted from the ```<teiHeader></teiHeader>``` element of the XML file itself. Each act is wrapped in ```<div type="act">```, so counting the acts is also straightforward.

In [7]:
import re
from lxml import etree

# we need this parameter to search in TEI-encoded files correctly
tei_ns = {'tei': 'http://www.tei-c.org/ns/1.0'}

In [8]:
def get_year(root, date_type):
    # year from <date type="..." when="...">; None if missing or not a plain year
    xpath = './/tei:sourceDesc/tei:bibl/tei:bibl/tei:date[@type="{}"]'.format(date_type)
    try:
        return int(root.find(xpath, tei_ns).attrib["when"])
    except (AttributeError, KeyError, ValueError):
        return None


def get_play_data(play_path):
    reg_space = re.compile(r"\s{2,}")
    root = etree.parse(play_path)

    title = root.find(".//tei:titleStmt/tei:title", tei_ns).text
    title = reg_space.sub(" ", title)

    author = root.find(".//tei:titleStmt/tei:author", tei_ns).text

    year_written = get_year(root, "written")
    year_published = get_year(root, "print")
    year_premiere = get_year(root, "premiere")
    year_definite = single_date(year_published, year_premiere, year_written)

    # number of <div type="act"> elements; 0 means the play is not split into acts
    acts = len(root.findall('.//tei:text/tei:body/tei:div[@type="act"]', tei_ns))
    return author, title, year_definite, acts
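
The namespace-aware lookups can be tried on a tiny hand-written TEI fragment. This is a minimal sketch: the fragment is invented for illustration (it is not a RusDraCor file), and it uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without extra packages. The `find` and `findall` calls take the same path and namespace arguments.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample document, shaped like the structure described above.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample play</title>
        <author>Sample Author</author>
      </titleStmt>
      <sourceDesc>
        <bibl>
          <bibl>
            <date type="print" when="1842"/>
            <date type="premiere" when="1840"/>
          </bibl>
        </bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="act"/>
      <div type="act"/>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(SAMPLE)

title = root.find(".//tei:titleStmt/tei:title", TEI_NS).text
premiere = root.find('.//tei:sourceDesc/tei:bibl/tei:bibl/tei:date[@type="premiere"]', TEI_NS)
written = root.find('.//tei:sourceDesc/tei:bibl/tei:bibl/tei:date[@type="written"]', TEI_NS)
acts = root.findall('.//tei:text/tei:body/tei:div[@type="act"]', TEI_NS)

print(title)                    # Sample play
print(premiere.attrib["when"])  # 1840
print(written)                  # None: the sample has no "written" date
print(len(acts))                # 2
```

Without the `tei:` prefix and the namespace mapping, none of these paths would match, because every element of a TEI file lives in the TEI namespace.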

### Number of words in a play
The word count is needed to compute some averages later on.

In [9]:
from bs4 import BeautifulSoup
from pymystem3 import Mystem

# one Mystem for everything, so that we won't have to initialize it every time
mystem = Mystem()

In [10]:
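def word_count(play_path, mystem):
    with open(play_path, "r", encoding="utf-8") as play:
        soup = BeautifulSoup(play, "xml")
        text = soup.find("text").text
    # distinct lemmas; note that lemmatize() also returns non-word tokens
    # (spaces, punctuation), so they are included in this set
    lemmas_play = set(mystem.lemmatize(text))
    # word tokens only: Mystem adds an "analysis" key to tokens it treats as words
    words_play = [item for item in mystem.analyze(text) if "analysis" in item]
    return len(lemmas_play), len(words_play)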

### Directions
All the directions are wrapped in the ```<stage></stage>``` tag, so we can simply collect those elements. The only problem is that a direction is sometimes split across several lines, with runs of spaces and tabs between two words; a regular expression fixes this.

In [11]:
def get_directions(play_path):
    reg_space = re.compile(r"\s{2,}")
    root = etree.parse(play_path)
    directions_tags = root.findall(".//tei:stage", tei_ns)
    directions_text = []
    for direction in directions_tags:
        text = direction.text
        # skip <stage> elements that have no leading text
        if text is None:
            continue
        text = reg_space.sub(" ", text)     # collapse runs of whitespace
        text = text.replace("\xa0", " ")    # remaining non-breaking spaces
        text = text.strip("() .")           # surrounding brackets, spaces and dots
        text = text.lower()
        directions_text.append(text)
    return directions_text
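
To see what the clean-up steps do, here is a small standalone sketch applied to an invented direction string (not taken from the corpus). `clean_direction` is a hypothetical helper that repeats the string operations of `get_directions` without the XML parsing.

```python
import re

REG_SPACE = re.compile(r"\s{2,}")


def clean_direction(text):
    """Hypothetical helper: the same string clean-up, without the XML part."""
    text = REG_SPACE.sub(" ", text)   # runs of 2+ whitespace characters -> one space
    text = text.replace("\xa0", " ")  # single non-breaking spaces -> plain spaces
    text = text.strip("() .")         # drop surrounding brackets, spaces and dots
    return text.lower()


raw = "(Enters,\n        looks\t\taround.)"
print(clean_direction(raw))  # enters, looks around
```

Note that `strip` takes a set of characters rather than a pattern, so it removes any mix of brackets, spaces and dots from both ends of the string and leaves the middle untouched.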

In addition, let's compute the average length of a direction in a play.

In [12]:
from statistics import mean

In [13]:
def average_direction(directions, mystem):
    dir_length = []
    dir_lemmas = []
    for direction in directions:
        length = len([item for item in mystem.analyze(direction) if "analysis" in item])
        dir_length.append(length)
        lemmas = len(set(mystem.lemmatize(direction)))
        dir_lemmas.append(lemmas)
    return mean(dir_length), mean(dir_lemmas)

## Putting it all together

### Saving the data
We save the extracted directions to disk so that they don't have to be extracted from the plays again every time we need them.

In [14]:
def save_directions(play_name, directions, option):
    if option == "sep":
        # one file per play: swap the .xml extension for .txt
        directions_file = directions_folder + os.sep + os.path.splitext(play_name)[0] + ".txt"
        with open(directions_file, "w", encoding="utf-8") as directions_save:
            directions_save.write("\n".join(directions))
    elif option == "all":
        # append to the file that collects the directions of all plays
        with open(all_directions_file, "a", encoding="utf-8") as all_directions_save:
            all_directions_save.write("\n".join(directions) + "\n")

### Working with the corpus

Now, we traverse all the plays in ```corpus_path``` and do the following:
* extract all the directions and save them: 
    * in a ```directions_list``` variable,
    * in a separate file,
    * in a file that contains all the directions
* extract information about a play and save it into ```play_list``` (to convert it into a Pandas dataframe later)

In [15]:
# keep only the XML files: macOS sometimes adds system files such as .DS_Store to the folder
play_files = [item for item in os.listdir(corpus_path) if item.endswith(".xml")]

In [16]:
directions_list = []
play_list = []
dropped_count = 0
parsed_count = 0
total_plays = len(play_files)

In [17]:
for play in play_files:
    play_path = corpus_path + os.sep + play
    
    st_dirs = get_directions(play_path)
    author, title, year, acts = get_play_data(play_path)
    lemmas, words = word_count(play_path, mystem)
    words_dir, lemmas_dir = average_direction(st_dirs, mystem)
    
    play_info = {"Path": play, 
            "Author": author, 
            "Title": title, 
            "Year": year,
            "Directions": len(st_dirs), 
            "Acts": acts if acts!=0 else 1,
            "Lemmas": lemmas,
            "Lemmas, per direction": lemmas_dir,
            "Words": words,
            "Words, per direction": words_dir}
    
    if play_info["Year"]:
        # directions
        directions_list.append(st_dirs)
        save_directions(play, st_dirs, option="sep")
        save_directions(play, st_dirs, option="all")
        
        # add information about a play
        play_list.append(play_info)
        
        # "logging" print
        parsed_count += 1
        print("Successfully parsed a play: {}, total parsed: {}/{} ({:.2f}%)".format(play_path, 
            parsed_count, total_plays, parsed_count/total_plays*100))
    else:
        dropped_count += 1
        print("Dropped a play: {}".format(play_path))

print("All in all, dropped {} plays.".format(dropped_count))

Successfully parsed a play: ../RusDraCor/pushkin-stseny-iz-rytsarskih-vremen.xml, total parsed: 1/102 (0.98%)
Successfully parsed a play: ../RusDraCor/turgenev-holostjak.xml, total parsed: 2/102 (1.96%)
Successfully parsed a play: ../RusDraCor/gogol-zhenitba.xml, total parsed: 3/102 (2.94%)
Successfully parsed a play: ../RusDraCor/blok-neznakomka.xml, total parsed: 4/102 (3.92%)
Successfully parsed a play: ../RusDraCor/ostrovsky-bednaja-nevesta.xml, total parsed: 5/102 (4.90%)
Successfully parsed a play: ../RusDraCor/gogol-lakeiskaja.xml, total parsed: 6/102 (5.88%)
Successfully parsed a play: ../RusDraCor/prutkov-fantaziya.xml, total parsed: 7/102 (6.86%)
Successfully parsed a play: ../RusDraCor/bulgakov-dni-turbinyh.xml, total parsed: 8/102 (7.84%)
Successfully parsed a play: ../RusDraCor/pushkin-pir-vo-vremja-chumy.xml, total parsed: 9/102 (8.82%)
Successfully parsed a play: ../RusDraCor/sukhovo-kobylin-smert-tarelkina.xml, total parsed: 10/102 (9.80%)
Successfully parsed a play: ..

Successfully parsed a play: ../RusDraCor/chekhov-ivanov.xml, total parsed: 84/102 (82.35%)
Successfully parsed a play: ../RusDraCor/blok-balaganchik.xml, total parsed: 85/102 (83.33%)
Successfully parsed a play: ../RusDraCor/kheraskov-venecianskaya-monahinya.xml, total parsed: 86/102 (84.31%)
Successfully parsed a play: ../RusDraCor/chekhov-na-bolshoi-doroge.xml, total parsed: 87/102 (85.29%)
Successfully parsed a play: ../RusDraCor/pushkin-skupoj-rytsar.xml, total parsed: 88/102 (86.27%)
Successfully parsed a play: ../RusDraCor/gogol-igroki.xml, total parsed: 89/102 (87.25%)
Successfully parsed a play: ../RusDraCor/ozerov-dmitrij-donskoj.xml, total parsed: 90/102 (88.24%)
Successfully parsed a play: ../RusDraCor/gogol-teatralnyi-razezd.xml, total parsed: 91/102 (89.22%)
Successfully parsed a play: ../RusDraCor/gogol-utro-delovogo-cheloveka.xml, total parsed: 92/102 (90.20%)
Successfully parsed a play: ../RusDraCor/ostrovsky-snegurochka.xml, total parsed: 93/102 (91.18%)
Successfully p

#### Corrections
Note that the ```if play_info["Year"]``` check in the ```for``` loop drops every play for which __no year could be determined__. Luckily, we don't have any :)

Furthermore, if a play contains no ```<div type="act">``` elements, it has __one act, not zero__. I corrected this while creating ```play_info```.

### Data frame

Now, we create a Pandas data frame with all the information about the plays. The ```Path``` column identifies each play.

In [18]:
import pandas as pd

In [19]:
play_df = pd.DataFrame(play_list)
play_df.head()

|   |Acts|Author|Directions|Lemmas|Lemmas, per direction|Path|Title|Words|Words, per direction|Year|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|0|1|Пушкин, Александр Сергеевич|38|996|4.973684|pushkin-stseny-iz-rytsarskih-vremen.xml|Сцены из рыцарских времен|3399|3.131579|1837|
|1|1|Тургенев, Иван Сергеевич|687|2785|6.391557|turgenev-holostjak.xml|Холостяк|21501|4.50655|1849|
|2|2|Гоголь, Николай Васильевич|254|2187|5.952756|gogol-zhenitba.xml|Женитьба|13094|3.925197|1842|
|3|1|Блок, Александр Александрович|132|1373|12.462121|blok-neznakomka.xml|Незнакомка|4314|10.856061|1907|
|4|5|Островский, Александр Николаевич|442|2372|5.504525|ostrovsky-bednaja-nevesta.xml|Бедная невеста|22554|3.68552|1852|


## Simple numbers
We will calculate:

* number of directions
    * total in play,
    * per act
* number of words
    * total in play,
    * per act,
    * per direction
* number of lemmas
    * total in play,
    * per act,
    * per direction

### Word, lemma count
Word and lemma counts were calculated earlier (see the "Extraction" section). Here, I also calculate the number of directions, words, and lemmas per act.

In [20]:
# directions per act
play_df["Directions, per act"] = play_df["Directions"]/play_df["Acts"]
# words per act
play_df["Words, per act"] = play_df["Words"]/play_df["Acts"]
# lemmas per act
play_df["Lemmas, per act"] = play_df["Lemmas"]/play_df["Acts"]
play_df.head()

|   |Acts|Author|Directions|Lemmas|Lemmas, per direction|Path|Title|Words|Words, per direction|Year|Directions, per act|Words, per act|Lemmas, per act|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|0|1|Пушкин, Александр Сергеевич|38|996|4.973684|pushkin-stseny-iz-rytsarskih-vremen.xml|Сцены из рыцарских времен|3399|3.131579|1837|38.0|3399.0|996.0|
|1|1|Тургенев, Иван Сергеевич|687|2785|6.391557|turgenev-holostjak.xml|Холостяк|21501|4.50655|1849|687.0|21501.0|2785.0|
|2|2|Гоголь, Николай Васильевич|254|2187|5.952756|gogol-zhenitba.xml|Женитьба|13094|3.925197|1842|127.0|6547.0|1093.5|
|3|1|Блок, Александр Александрович|132|1373|12.462121|blok-neznakomka.xml|Незнакомка|4314|10.856061|1907|132.0|4314.0|1373.0|
|4|5|Островский, Александр Николаевич|442|2372|5.504525|ostrovsky-bednaja-nevesta.xml|Бедная невеста|22554|3.68552|1852|88.4|4510.8|474.4|
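
As a quick sanity check, the per-act values of one row can be recomputed by hand. The sketch below uses plain Python instead of pandas, with the figures shown for `gogol-zhenitba.xml` in the output above.

```python
# Figures for gogol-zhenitba.xml, copied from the output above.
play = {"Acts": 2, "Directions": 254, "Words": 13094, "Lemmas": 2187}

per_act = {
    name + ", per act": play[name] / play["Acts"]
    for name in ("Directions", "Words", "Lemmas")
}
print(per_act)
# {'Directions, per act': 127.0, 'Words, per act': 6547.0, 'Lemmas, per act': 1093.5}
```

Keep in mind that "Lemmas, per act" is the play-wide number of distinct lemmas divided by the number of acts, not the number of distinct lemmas found inside each act.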


### Saving the dataframe

In [21]:
play_df.to_csv(csv_folder + os.sep + "general_information.csv", index=False, sep=";", encoding="utf-8")
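
Because the file is written with `sep=";"`, it has to be read back with the same delimiter. The sketch below shows the idea with the standard library's `csv` module on an in-memory buffer, rather than with pandas and the real file; the single row reuses values from the output above.

```python
import csv
import io

row = {"Author": "Пушкин, Александр Сергеевич", "Year": "1837"}

# Write one row with ";" as the delimiter, as in the to_csv call.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row), delimiter=";")
writer.writeheader()
writer.writerow(row)

# Read it back: the delimiter must match, otherwise the columns are split wrongly.
buffer.seek(0)
restored = next(csv.DictReader(buffer, delimiter=";"))
print(restored["Author"])  # Пушкин, Александр Сергеевич
print(restored["Year"])    # 1837
```

With pandas, the equivalent is to pass `sep=";"` to `pd.read_csv` when loading `general_information.csv`.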