# Mean POS amounts and creating a bigger dataset

**What do we do in this notebook:**

* calculate mean amounts of various parts-of-speech in plays,
* create a dataset with these mean values,
* add this information to the dataset with general information on the plays.

## Imports and globals

Run the following before everything else:

In [1]:
import os
from pymystem3 import Mystem
from statistics import mean
import pandas as pd

In [2]:
mystem = Mystem()

In [3]:
directions_path = ".." + os.sep + "directions"
csv_path = "." + os.sep + "csv"
corpus_path = ".." + os.sep + "RusDraCor"

## Preparations

This is required because I'm using a Mac, which sometimes creates system folders like `.DS_Store`; in any other case — never mind.

In [4]:
directions_files = [item for item in os.listdir(directions_path) if (item.endswith(".txt") and not item.startswith("all_directions"))]
play_files = [item for item in os.listdir(corpus_path) if item.endswith(".xml")]

## Parts-of-speech
We will count the following parts-of-speech:

* nouns,
* adjectives,
* verbs,
* adverbs,
* interjections.

The values will be respresented as a _fraction_, a result of division of the POS to the total amount of words in the direction.

$$ \text{POS count} = \frac{\text{amount of POS in a direction}}{\text{total amount of words}} $$

Tokenization and word count will be performed via ```pymystem3```, a Python wrapper for [Mystem](https://tech.yandex.ru/mystem/).

In [5]:
def count_pos(direction):
    pos_dict = {"S": 0, "A": 0, "V": 0, "ADV": 0, "INTJ": 0}
    analyses = mystem.analyze(direction)
    
    for analysis in analyses:
        # if there's no "analysis" in keys, it's not a word and we shouldn't count it
        if "analysis" in analysis:
            try:
                pos = analysis["analysis"][0]["gr"].split(",")[0]
                
                # increment part-of-speech we need
                if "=" in pos:
                        pos = pos[:pos.index("=")]
                        if pos in pos_dict.keys():
                            pos_dict[pos] += 1
                        # peculiarity of Mystem: it distinguishes ADV ("быстро") and ADVPRO ("как")
                        elif pos == "ADVPRO":
                            pos_dict["ADV"] += 1
            except:
                pass
    return pos_dict

In [6]:
def get_pos_amounts_words(directions_file):
    pos_list = {"S": [], "A": [], "V": [], "ADV": [], "INTJ": []}
    
    full_path = str(directions_path) + os.sep + directions_file
    with open(full_path, "r", encoding="utf-8") as directions_f:
        directions = [line.strip("\n") for line in directions_f.readlines() if line != "\n"]
    
    for st_dir in directions:
        pos_this = count_pos(st_dir)
        
        for part_of_speech in pos_this:
            pos_list[part_of_speech].append(pos_this[part_of_speech])
    
    return pos_list

Now let's crawl the files and create a dataset.

In [7]:
plays_info = []
dummy = {"path": "", "S": 0, "A": 0, "V": 0, "ADV": 0, "INTJ": 0}
for directions_file in directions_files:
    stats = get_pos_amounts_words(directions_file)
    for key in stats:
        stats[key] = mean(stats[key])
    stats["path"] = directions_file
    plays_info.append(stats)

In [19]:
df_mean = pd.DataFrame(plays_info)

In [23]:
df_mean.set_index("path")
df_mean.head()

Unnamed: 0,A,ADV,INTJ,S,V,path
0,0.151515,0.30303,0.0,0,0.030303,sumarokov-horev.txt
1,0.371859,0.296482,0.0,0,0.135678,gorky-egor-bulychov-i-drugie.txt
2,0.180556,0.277778,0.020833,0,0.125,turgenev-vecher-v-sorrente.txt
3,0.145161,0.209677,0.112903,0,0.096774,gumilyov-gondla.txt
4,0.126506,0.108434,0.012048,0,0.060241,chekhov-leshii.txt


## Merging the information
Now, let's add this information to the dataset we created in another notebook, [directions-basic](./directions-basic.ipynb).

### Preparation: mean values
First, let's do some cleanup in our _dataset with mean values_:
* rename columns so that they're all capitalized,
* rename `path` column to `File` so that we know which play was the source to the information in the row.
* set `File` column as index in order to:
    1. avoid creation of columns we don't really need,
    2. use this column as a key when merging all the data together.

In [10]:
df_mean.rename(columns={"path":"File", "words":"Words"}, inplace=True)
df_mean = df_mean.set_index("File")

Now, let's take a look.

In [11]:
df_mean.head()

Unnamed: 0_level_0,A,ADV,INTJ,S,V
File,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
sumarokov-horev.txt,0.151515,0.30303,0.0,0,0.030303
gorky-egor-bulychov-i-drugie.txt,0.371859,0.296482,0.0,0,0.135678
turgenev-vecher-v-sorrente.txt,0.180556,0.277778,0.020833,0,0.125
gumilyov-gondla.txt,0.145161,0.209677,0.112903,0,0.096774
chekhov-leshii.txt,0.126506,0.108434,0.012048,0,0.060241


### Loading data with general information about the plays

Now, we'll load information from [directions-basic](./directions-basic.ipynb).

In [12]:
df_info = pd.read_csv(csv_path + os.sep + "general_information.csv", sep=";")

Unfortunately, we also have to do some transformations in order to merge everything into a single dataframe:

* all the paths in the `File` column end in `.xml`, though it should be `.txt` in order to be consistent,
* we'll also use `File` column as index.

In [13]:
# we need this to rename strings
def xml_to_txt(file_name):
    new_name = file_name[:-3] + "txt"
    return new_name

In [14]:
df_info["File"] = df_info["File"].apply(xml_to_txt)
df_info = df_info.set_index("File")

Now the dataset looks like this:

In [15]:
df_info.head()

Unnamed: 0_level_0,Amount of acts,Amount of directions,Author,Lemmas,Lemmas per direction,Title,Words,Words per direction,Year,Directions per act,Words per act,Lemmas per act
File,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
pushkin-stseny-iz-rytsarskih-vremen.txt,1,38,"Пушкин, Александр Сергеевич",996,4.973684,Сцены из рыцарских времен,3399,3.131579,1837,38.0,3399.0,996.0
turgenev-holostjak.txt,1,687,"Тургенев, Иван Сергеевич",2785,6.391557,Холостяк,21501,4.50655,1849,687.0,21501.0,2785.0
gogol-zhenitba.txt,2,254,"Гоголь, Николай Васильевич",2187,5.952756,Женитьба,13094,3.925197,1842,127.0,6547.0,1093.5
krylov-sonnyj-poroshok-ili-pohischennaja-krestjanka.txt,3,88,"Крылов, Иван Андреевич",1550,6.522727,Сонный порошок или похищенная крестьянка,7605,4.215909,1800,29.333333,2535.0,516.666667
blok-neznakomka.txt,1,80,"Блок, Александр Александрович",1342,18.025,Незнакомка,4222,16.4625,1907,80.0,4222.0,1342.0


## Merging datasets

Now, we have to merge the datasets.

In [16]:
df = pd.concat([df_info, df_mean], axis=1)
df.head()

Unnamed: 0,Amount of acts,Amount of directions,Author,Lemmas,Lemmas per direction,Title,Words,Words per direction,Year,Directions per act,Words per act,Lemmas per act,A,ADV,INTJ,S,V
blok-balaganchik.txt,1,38,"Блок, Александр Александрович",910,22.263158,Балаганчик,2240,22.736842,1906,38.0,2240.0,910.0,2.684211,1.210526,0.0,0,0.342105
blok-korol-na-ploschadi.txt,3,133,"Блок, Александр Александрович",1475,14.631579,Король на площади,5535,12.466165,1907,44.333333,1845.0,491.666667,1.090226,0.977444,0.0,0,0.218045
blok-neznakomka.txt,1,80,"Блок, Александр Александрович",1342,18.025,Незнакомка,4222,16.4625,1907,80.0,4222.0,1342.0,1.9,1.05,0.0,0,0.35
bulgakov-dni-turbinyh.txt,4,372,"Булгаков, Михаил Афанасьевич",2901,5.634409,Дни Турбиных,16426,3.739247,1926,93.0,4106.5,725.25,0.314516,0.134409,0.005376,0,0.099462
bulgakov-ivan-vasilevich.txt,3,319,"Булгаков, Михаил Афанасьевич",2195,6.507837,Иван Васильевич,10303,4.721003,1936,106.333333,3434.333333,731.666667,0.210031,0.163009,0.0,0,0.100313


## Saving dataset

Now, let's save the resulting dataset for further use.

In [17]:
df.to_csv(csv_path + os.sep + "joint_data.csv", sep=";", encoding="utf-8")