# Mean POS amounts and creating a bigger dataset

**What we do in this notebook:**

* calculate mean amounts of various parts-of-speech in plays,
* create a dataset with these mean values,
* add this information to the dataset with general information on the plays.

## Preparations

### Paths and folders

In [1]:
import os

In [2]:
directions_path = ".." + os.sep + "directions"
csv_path = "." + os.sep + "csv"
corpus_path = ".." + os.sep + "RusDraCor"

The extension filters below are needed because I'm using a Mac, which sometimes creates hidden system files such as `.DS_Store`; on any other system they are simply harmless.

In [3]:
directions_files = [item for item in os.listdir(directions_path) 
                    if (item.endswith(".txt") and not item.startswith("all_directions"))]
play_files = [item for item in os.listdir(corpus_path) if item.endswith(".xml")]
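The filtering step above can be tried out in isolation. This is only a sketch: the helper name `list_files` and the temporary folder are made up for illustration, and only `os.listdir` with the two string checks mirrors the cell above.

```python
import os
import tempfile


def list_files(folder, suffix, exclude_prefix=None):
    """Return sorted file names in `folder` that end with `suffix`.

    Names starting with `exclude_prefix` (if given) are skipped.
    `list_files` is a hypothetical helper, not part of the notebook.
    """
    names = []
    for item in os.listdir(folder):
        if not item.endswith(suffix):
            continue  # skips .DS_Store and other stray files
        if exclude_prefix and item.startswith(exclude_prefix):
            continue
        names.append(item)
    return sorted(names)


# Build a throwaway folder that imitates the directions folder.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["chekhov-chaika.txt", "all_directions.txt", ".DS_Store"]:
        open(os.path.join(demo_dir, name), "w", encoding="utf-8").close()
    demo_files = list_files(demo_dir, ".txt", exclude_prefix="all_directions")

print(demo_files)
```

Only the per-play file survives: the hidden system file fails the extension check and the joint file is excluded by its prefix.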

## Parts-of-speech

### Counting POS
We will count the following parts-of-speech:

* nouns,
* adjectives,
* verbs,
* adverbs,
* interjections,
* prepositions.

The values will be represented as a _fraction_: the number of words of a given part of speech divided by the total number of words in the direction.

$$ \text{POS count} = \frac{\text{amount of POS in a direction}}{\text{total amount of words}} $$
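As a worked example of this formula, take the direction `входит брат бертольд` from the directions dataset shown further below. The counts here are entered by hand (two nouns and one verb out of three words) rather than produced by a tagger.

```python
# Counts for the direction "входит брат бертольд", entered by hand:
# two nouns and one verb out of three words.
counts = {"NOUN": 2, "VERB": 1}
total_words = 3

# Apply the formula: POS count divided by the total number of words.
fractions = {pos: count / total_words for pos, count in counts.items()}
rounded = {pos: round(value, 6) for pos, value in fractions.items()}
print(rounded)  # {'NOUN': 0.666667, 'VERB': 0.333333}
```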

Tokenization will be done with the help of the [NLTK](https://www.nltk.org/) `wordpunct_tokenize` function.
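To see what this kind of tokenization does without installing NLTK, here is a standard-library approximation. It assumes that the NLTK tokenizer is built on the regular expression `\w+|[^\w\s]+`; the function name `simple_wordpunct` is invented for this sketch.

```python
import re

# Runs of word characters, or runs of non-space punctuation.
# Assumption: this is the pattern behind NLTK's wordpunct tokenizer.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_wordpunct(text):
    """Split text into word tokens and punctuation tokens."""
    return WORDPUNCT_PATTERN.findall(text)


tokens = simple_wordpunct("Входит Мартын (тихо).")
print(tokens)  # ['Входит', 'Мартын', '(', 'тихо', ').']
```

Adjacent punctuation marks stay together in one token, so `).` comes out as a single item.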

Part-of-speech tagging will be performed with `pymorphy2`, [an open-source Python library](http://pymorphy2.readthedocs.io/en/latest/) for morphological annotation of Russian texts, based on the [OpenCorpora](http://opencorpora.org/) project.

In [4]:
import re
from pymorphy2 import MorphAnalyzer
from nltk.tokenize import wordpunct_tokenize

morph = MorphAnalyzer()

### Extracting the information from a _single direction_

In [5]:
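def count_pos(direction):
    """Count the tracked parts of speech and the words in one direction."""
    pos_dict = {"ADJ": 0, "ADVB": 0, "INTJ": 0, "NOUN": 0, "PREP": 0, "VERB": 0,
                "Words": 0}
    for token in wordpunct_tokenize(direction):
        # take the most probable analysis of the token
        analysis = morph.parse(token)[0]
        # pymorphy2 marks punctuation with the PNCT grammeme and leaves its POS empty,
        # so punctuation has to be detected through the tag itself
        if "PNCT" in analysis.tag:
            continue
        pos_dict["Words"] += 1
        pos = str(analysis.tag.POS)
        if pos in ("ADJF", "ADJS", "COMP"):    # full, short and comparative adjectives
            pos_dict["ADJ"] += 1
        elif pos in ("VERB", "INFN"):          # finite verbs and infinitives
            pos_dict["VERB"] += 1
        elif pos in pos_dict:                  # ADVB, INTJ, NOUN, PREP
            pos_dict[pos] += 1
    return pos_dict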
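The grouping logic of `count_pos` can be followed without pymorphy2 by swapping the analyzer for a hand-written lookup. Everything named `TOY_TAGS`, `GROUPS` and `count_pos_toy` below is hypothetical, and the tags in the lookup are assigned by hand rather than taken from the library.

```python
# Hypothetical token -> tag lookup standing in for the morphological analysis.
TOY_TAGS = {
    "расходятся": "VERB",
    "в": "PREP",
    "разные": "ADJF",
    "стороны": "NOUN",
}

# Fine-grained tags collapsed into the six coarse categories.
GROUPS = {
    "ADJF": "ADJ", "ADJS": "ADJ", "COMP": "ADJ",
    "VERB": "VERB", "INFN": "VERB",
    "ADVB": "ADVB", "INTJ": "INTJ", "NOUN": "NOUN", "PREP": "PREP",
}


def count_pos_toy(tokens):
    """Count coarse categories for a list of already tokenized words."""
    counts = {"ADJ": 0, "ADVB": 0, "INTJ": 0, "NOUN": 0, "PREP": 0,
              "VERB": 0, "Words": 0}
    for token in tokens:
        counts["Words"] += 1  # every token counts as a word in this toy version
        category = GROUPS.get(TOY_TAGS.get(token))
        if category is not None:
            counts[category] += 1
    return counts


toy_counts = count_pos_toy("расходятся в разные стороны".split())
print(toy_counts)
```

With four words and one hit in each of four categories, every non-zero share for this direction is 0.25, which matches the row for `расходятся в разные стороны` in the directions dataset further below.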

### Parsing the corpus

#### Applying this function to the whole play

In [6]:
def pos_play(directions_file):
    play_total = {"ADJ": [], "ADVB": [], "INTJ": [], "NOUN": [], "PREP": [], "VERB": [],
                  "Words": []}
    
    full_path = str(directions_path) + os.sep + directions_file
    with open(full_path, "r", encoding="utf-8") as directions_f:
        directions = [line.strip("\n") for line in directions_f.readlines() if line != "\n"]
    
    for st_dir in directions:
        pos_direction = count_pos(st_dir)
        for part_of_speech in pos_direction.keys():
            play_total[part_of_speech].append(pos_direction[part_of_speech])
    return play_total
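What `pos_play` returns, and how it is averaged in the next step, can be sketched with invented numbers. The three per-direction dictionaries below are made up and only two categories are kept for brevity.

```python
from statistics import mean

# Hand-written counts for three imaginary directions of one play.
per_direction = [
    {"NOUN": 2, "VERB": 1},
    {"NOUN": 1, "VERB": 1},
    {"NOUN": 0, "VERB": 2},
]

# Turn the list of dicts into a dict of lists, as pos_play does.
play_total = {"NOUN": [], "VERB": []}
for counts in per_direction:
    for pos, value in counts.items():
        play_total[pos].append(value)

# One mean per part of speech then describes the whole play.
play_means = {pos: mean(values) for pos, values in play_total.items()}
print(play_total)
print(play_means)
```

The means are averages of raw counts per direction, not fractions, which is why values in the dataset of plays can exceed 1.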

#### Crawling the files

Now let's crawl the files and create a dataset.

In [7]:
from statistics import mean

In [8]:
plays_info = []
directions_parsed = 0
# progress is measured against the files we actually iterate over
total_plays = len(directions_files)

for directions_file in directions_files:
    stats = {}
    play = pos_play(directions_file)
    for key in play.keys():
        if key != "Words":
            # mean number of words of this part of speech per direction
            stats[key] = mean(play[key])
    stats["Path"] = directions_file
    plays_info.append(stats)
    
    # logging print
    directions_parsed += 1
    print("Successfully parsed directions: {}, total parsed: {}/{} ({:.2f}%)".format(directions_file, 
        directions_parsed, total_plays, directions_parsed/total_plays*100))

Successfully parsed directions: sumarokov-horev.txt, total parsed: 1/102 (0.98%)
Successfully parsed directions: prutkov-srodstvo-mirovyh-sil.txt, total parsed: 2/102 (1.96%)
Successfully parsed directions: gorky-egor-bulychov-i-drugie.txt, total parsed: 3/102 (2.94%)
Successfully parsed directions: turgenev-vecher-v-sorrente.txt, total parsed: 4/102 (3.92%)
Successfully parsed directions: gumilyov-gondla.txt, total parsed: 5/102 (4.90%)
Successfully parsed directions: chekhov-leshii.txt, total parsed: 6/102 (5.88%)
Successfully parsed directions: prutkov-oprometchivyj-turka.txt, total parsed: 7/102 (6.86%)
Successfully parsed directions: sukhovo-kobylin-svadba-krechinskogo.txt, total parsed: 8/102 (7.84%)
Successfully parsed directions: chekhov-chaika.txt, total parsed: 9/102 (8.82%)
Successfully parsed directions: lomonosov-tamira-i-selim.txt, total parsed: 10/102 (9.80%)
Successfully parsed directions: pushkin-kamenniy-gost.txt, total parsed: 11/102 (10.78%)
...

Successfully parsed directions: sukhovo-kobylin-delo.txt, total parsed: 94/102 (92.16%)
Successfully parsed directions: saltykov-shchedrin-smert-pazuhina.txt, total parsed: 95/102 (93.14%)
Successfully parsed directions: chekhov-predlozhenie.txt, total parsed: 96/102 (94.12%)
Successfully parsed directions: turgenev-razgovor-na-bolshoj-doroge.txt, total parsed: 97/102 (95.10%)
Successfully parsed directions: tolstoy-tsar-fedor-ioannovich.txt, total parsed: 98/102 (96.08%)
Successfully parsed directions: ostrovsky-volki-i-ovtsy.txt, total parsed: 99/102 (97.06%)
Successfully parsed directions: chekhov-tragik-ponevole.txt, total parsed: 100/102 (98.04%)
Successfully parsed directions: pushkin-mocart-i-saleri.txt, total parsed: 101/102 (99.02%)
Successfully parsed directions: ostrovsky-svoi-ljudi.txt, total parsed: 102/102 (100.00%)


### POS shares/parts in the directions

We have a file with all the directions, called `all_directions.txt`. It stores all the directions together, without any distinction by play. What I want to get here is the _share_ of each part of speech from the list above in every single direction.

In [9]:
all_directions_path = directions_path + os.sep + "all_directions.txt"
with open(all_directions_path, "r", encoding="utf-8") as alldir_file:
    all_directions = [line.strip("\n") for line in alldir_file.readlines() if line.strip("\n")]

This is how many directions we have:

In [10]:
len(all_directions)

24058

#### Counting shares

The share of each listed part of speech in a direction is its count divided by the number of words in that direction.

In [11]:
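def pos_share(pos_dict):
    """Turn the POS counts of one direction into shares of its word total."""
    total_words = pos_dict["Words"]
    share_dict = {}
    for pos in ["ADJ", "ADVB", "INTJ", "NOUN", "PREP", "VERB"]:
        # a direction without words gets zero shares instead of a ZeroDivisionError
        share_dict[pos] = pos_dict[pos] / total_words if total_words else 0.0
    return share_dict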
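One property of these shares is worth checking by hand: they do not have to sum to 1, because words outside the six tracked categories still count towards the total. The sketch below uses a made-up helper `shares` and hand-entered counts for `бертольд и франц`, where the conjunction `и` belongs to none of the categories.

```python
def shares(counts):
    """Divide each POS count by the word total; 0.0 if there are no words."""
    total = counts["Words"]
    return {pos: (value / total if total else 0.0)
            for pos, value in counts.items() if pos != "Words"}


# Two nouns plus a conjunction that is not tracked: three words in total.
example = shares({"NOUN": 2, "VERB": 0, "Words": 3})
share_sum = round(sum(example.values()), 6)
print(example)
print(share_sum)  # 0.666667
```

The remaining third of the direction is the untracked conjunction.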

Now, let us parse the directions.

In [12]:
directions_share_list = []
ready_dirs = 0
total_dirs = len(all_directions)

for direction in all_directions:
    # calculate things
    pos_dict = count_pos(direction)
    share_dict = pos_share(pos_dict)
    share_dict["Text"] = direction
    
    # add them to the list
    directions_share_list.append(share_dict)
    
    # logging print
    ready_dirs += 1
    ratio = ready_dirs/total_dirs*100
    if ready_dirs % 5000 == 0:
        print("Total parsed: {}/{}, or {:.2f}%".format(ready_dirs, total_dirs, ratio))
print("Done!")

Total parsed: 5000/24058, or 20.78%
Total parsed: 10000/24058, or 41.57%
Total parsed: 15000/24058, or 62.35%
Total parsed: 20000/24058, or 83.13%
Done!


### Datasets

Now, turning everything into pretty tables and datasets!

In [13]:
import pandas as pd

# show long direction texts in full instead of truncating them
pd.set_option("display.max_colwidth", 1000)

This is the dataset of plays; each value is the mean number of words of that part of speech per direction:

In [14]:
df_mean = pd.DataFrame(plays_info)
df_mean = df_mean.set_index("Path")
df_mean.head()

| Path | ADJ | ADVB | INTJ | NOUN | PREP | VERB |
| --- | --- | --- | --- | --- | --- | --- |
| sumarokov-horev.txt | 0.545455 | 0.090909 | 0.0 | 1.666667 | 0.242424 | 0.121212 |
| prutkov-srodstvo-mirovyh-sil.txt | 1.73913 | 1.043478 | 0.0 | 4.021739 | 2.065217 | 2.347826 |
| gorky-egor-bulychov-i-drugie.txt | 0.457286 | 0.276382 | 0.0 | 1.839196 | 0.738693 | 1.095477 |
| turgenev-vecher-v-sorrente.txt | 0.354167 | 0.3125 | 0.0 | 1.548611 | 0.8125 | 0.923611 |
| gumilyov-gondla.txt | 0.370968 | 0.145161 | 0.112903 | 1.806452 | 0.548387 | 0.435484 |


And this is the dataset of directions:

In [15]:
share_df = pd.DataFrame(directions_share_list)
share_df = share_df.set_index("Text")
share_df.head()

| Text | ADJ | ADVB | INTJ | NOUN | PREP | VERB |
| --- | --- | --- | --- | --- | --- | --- |
| входит брат бертольд | 0.0 | 0.0 | 0.0 | 0.666667 | 0.0 | 0.333333 |
| бертольд и франц | 0.0 | 0.0 | 0.0 | 0.666667 | 0.0 | 0.0 |
| входит мартын | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.5 |
| расходятся в разные стороны | 0.25 | 0.0 | 0.0 | 0.25 | 0.25 | 0.25 |
| почесывается | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


## Merging the information
Let's add this information to the dataset we created in another notebook, [directions-basic](./directions-basic.ipynb).

### Loading data with general information about the plays

First, we read the CSV file with general information that was saved in that notebook.

In [16]:
df_info = pd.read_csv(csv_path + os.sep + "general_information.csv", sep=";", index_col=False)

Unfortunately, we also have to do some transformations in order to merge everything into a single dataframe:

* all the paths in the `Path` column end in `.xml`, though they should end in `.txt` to be consistent with the dataset of mean values,
* we'll also use the `Path` column as the index.

In [17]:
# we need this to rename strings
def xml_to_txt(file_name):
    new_name = file_name[:-3] + "txt"
    return new_name
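The slicing in `xml_to_txt` relies on the extension being exactly three characters long. A variant that does not depend on the extension length can be sketched with `os.path.splitext`; the name `swap_extension` is invented here and is not used elsewhere in the notebook.

```python
import os


def swap_extension(file_name, new_ext=".txt"):
    """Replace the extension of `file_name` (hypothetical helper)."""
    stem, _old_ext = os.path.splitext(file_name)
    return stem + new_ext


renamed = swap_extension("chekhov-chaika.xml")
print(renamed)  # chekhov-chaika.txt
```

It could be passed to `.apply` in the same way as `xml_to_txt`.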

In [18]:
df_info["Path"] = df_info["Path"].apply(xml_to_txt)
df_info = df_info.set_index("Path")

After the transformations, the dataset looks like this:

In [19]:
df_info.head()

| Path | Acts | Author | Directions | Lemmas | Lemmas, per direction | Title | Words | Words, per direction | Year | Directions, per act | Words, per act | Lemmas, per act |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| pushkin-stseny-iz-rytsarskih-vremen.txt | 1 | Пушкин, Александр Сергеевич | 38 | 996 | 4.973684 | Сцены из рыцарских времен | 3399 | 3.131579 | 1837 | 38.0 | 3399.0 | 996.0 |
| turgenev-holostjak.txt | 1 | Тургенев, Иван Сергеевич | 687 | 2785 | 6.391557 | Холостяк | 21501 | 4.50655 | 1849 | 687.0 | 21501.0 | 2785.0 |
| gogol-zhenitba.txt | 2 | Гоголь, Николай Васильевич | 254 | 2187 | 5.952756 | Женитьба | 13094 | 3.925197 | 1842 | 127.0 | 6547.0 | 1093.5 |
| blok-neznakomka.txt | 1 | Блок, Александр Александрович | 132 | 1373 | 12.462121 | Незнакомка | 4314 | 10.856061 | 1907 | 132.0 | 4314.0 | 1373.0 |
| ostrovsky-bednaja-nevesta.txt | 5 | Островский, Александр Николаевич | 442 | 2372 | 5.504525 | Бедная невеста | 22554 | 3.68552 | 1852 | 88.4 | 4510.8 | 474.4 |


### Merging datasets

Both dataframes are now indexed by `Path`, so we can join them column-wise.

In [20]:
# sort=True keeps the alphabetical row order and silences the pandas FutureWarning
df = pd.concat([df_info, df_mean], axis=1, sort=True)
df.head()

|  | Acts | Author | Directions | Lemmas | Lemmas, per direction | Title | Words | Words, per direction | Year | Directions, per act | Words, per act | Lemmas, per act | ADJ | ADVB | INTJ | NOUN | PREP | VERB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| blok-balaganchik.txt | 1 | Блок, Александр Александрович | 38 | 910 | 22.263158 | Балаганчик | 2240 | 22.736842 | 1906 | 38.0 | 2240.0 | 910.0 | 3.289474 | 1.315789 | 0.0 | 7.815789 | 3.315789 | 3.368421 |
| blok-korol-na-ploschadi.txt | 3 | Блок, Александр Александрович | 133 | 1475 | 14.631579 | Король на площади | 5535 | 12.466165 | 1907 | 44.333333 | 1845.0 | 491.666667 | 1.421053 | 0.87218 | 0.045113 | 4.428571 | 1.684211 | 2.090226 |
| blok-neznakomka.txt | 1 | Блок, Александр Александрович | 132 | 1373 | 12.462121 | Незнакомка | 4314 | 10.856061 | 1907 | 132.0 | 4314.0 | 1373.0 | 1.575758 | 0.742424 | 0.0 | 3.492424 | 1.393939 | 1.954545 |
| bulgakov-dni-turbinyh.txt | 4 | Булгаков, Михаил Афанасьевич | 372 | 2901 | 5.634409 | Дни Турбиных | 16426 | 3.739247 | 1926 | 93.0 | 4106.5 | 725.25 | 0.416667 | 0.13172 | 0.018817 | 1.478495 | 0.540323 | 0.833333 |
| bulgakov-ivan-vasilevich.txt | 3 | Булгаков, Михаил Афанасьевич | 319 | 2195 | 6.507837 | Иван Васильевич | 10303 | 4.721003 | 1936 | 106.333333 | 3434.333333 | 731.666667 | 0.39185 | 0.15674 | 0.00627 | 2.015674 | 0.689655 | 1.068966 |
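Conceptually, concatenating along `axis=1` aligns the rows of both dataframes by their index labels and keeps the union of those labels. The pandas-free sketch below mimics that outer alignment with plain dictionaries. The two toy tables are invented, their values are shortened or made up, and `None` plays the role of a missing value.

```python
# Two toy "tables" keyed by file name (values shortened or made up).
info = {
    "blok-neznakomka.txt": {"Year": 1907},
    "gogol-zhenitba.txt": {"Year": 1842},
}
means = {
    "blok-neznakomka.txt": {"NOUN": 3.49},
    "chekhov-chaika.txt": {"NOUN": 1.5},  # made-up value
}

columns = ["Year", "NOUN"]
merged = {}
# Outer alignment: take the sorted union of keys from both tables.
for key in sorted(set(info) | set(means)):
    row = {**info.get(key, {}), **means.get(key, {})}
    # Cells missing from one of the tables become None.
    merged[key] = {col: row.get(col) for col in columns}

for key, row in merged.items():
    print(key, row)
```

A row that exists in only one table is kept and simply has gaps in the columns coming from the other table, which is a quick way to spot plays whose file names did not match.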


## Saving datasets

Let's save the resulting dataset for further use.

Shares:

In [21]:
share_df.to_csv(csv_path + os.sep + "shares_dirs.csv", index=True, sep=";", encoding="utf-8")

Play data:

In [22]:
df.to_csv(csv_path + os.sep + "joint_data.csv", index=False, sep=";", encoding="utf-8")
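A semicolon is a convenient separator for this data because the `Author` values contain commas. The standard-library sketch below writes one such row to an in-memory buffer and reads it back. It uses the `csv` module rather than pandas and only illustrates the delimiter choice.

```python
import csv
import io

rows = [
    ["Author", "Title", "Year"],
    ["Пушкин, Александр Сергеевич", "Сцены из рыцарских времен", "1837"],
]

# Write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
csv.writer(buffer, delimiter=";", lineterminator="\n").writerows(rows)
text = buffer.getvalue()
print(text)

# Reading back with the same delimiter restores the original cells.
restored = list(csv.reader(io.StringIO(text), delimiter=";"))
```

Because the comma is not the delimiter, the author field is written without quotes and still comes back as a single cell.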