# Most frequent lemmas by part of speech

As an aside, we'll look at the most frequent lemmas and the most frequent directions.

## Data

First, we load the directions from `all_directions.txt` into the `directions` list. We also prepare the `pos_dict` dictionary, which will collect the lemmas found for several parts of speech.

In [1]:
import os

In [2]:
alldirs_path = os.path.join("..", "directions", "all_directions.txt")
with open(alldirs_path, "r", encoding="utf-8") as f:
    # one direction per line; drop only the trailing newline
    directions = [line.strip("\n") for line in f]

In [3]:
pos_dict = {"ADJ": [], "ADVB": [], "INTJ": [], "NOUN": [], "PREP": [], "VERB": []}

## Extracting lemmas

In [4]:
from nltk import wordpunct_tokenize
from pymorphy2 import MorphAnalyzer

morph = MorphAnalyzer()

In [5]:
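def count_pos(direction, pos_dict):
    """Append the lemma of each token in `direction` to the list for its part of speech."""
    for token in wordpunct_tokenize(direction):
        analysis = morph.parse(token)[0]  # first (highest-scored) parse
        lemma = analysis.normal_form
        pos = analysis.tag.POS  # None for punctuation and other non-word tokens
        if pos in {"ADJF", "ADJS", "COMP"}:
            # full adjectives, short adjectives and comparatives are grouped as ADJ
            pos_dict["ADJ"].append(lemma)
        elif pos in {"VERB", "INFN"}:
            # finite verbs and infinitives are grouped as VERB
            pos_dict["VERB"].append(lemma)
        elif pos in pos_dict:
            # remaining tracked tags (ADVB, INTJ, NOUN, PREP) map to themselves
            pos_dict[pos].append(lemma)
    return pos_dict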
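
The branching in `count_pos` can also be read as a lookup table from pymorphy2 tags to the coarser labels used as keys of `pos_dict`. The following is a minimal sketch of that idea, not the notebook's own code: the names `COARSE_POS`, `TRACKED` and `coarse_pos` are hypothetical, and tags are passed in as plain strings so that the snippet runs without pymorphy2.

```python
# Hypothetical lookup table: pymorphy2 POS tag -> coarse label used in pos_dict.
COARSE_POS = {
    "ADJF": "ADJ",   # full adjective
    "ADJS": "ADJ",   # short adjective
    "COMP": "ADJ",   # comparative
    "VERB": "VERB",  # finite verb
    "INFN": "VERB",  # infinitive
}

# The keys of pos_dict in the notebook.
TRACKED = ("ADJ", "ADVB", "INTJ", "NOUN", "PREP", "VERB")


def coarse_pos(pos, tracked=TRACKED):
    """Return the coarse label for a POS tag, or None if the tag is not tracked."""
    label = COARSE_POS.get(pos, pos)
    return label if label in tracked else None


print(coarse_pos("INFN"))  # VERB
print(coarse_pos("NOUN"))  # NOUN
print(coarse_pos("NPRO"))  # None
print(coarse_pos(None))    # None
```

With such a table, adding or regrouping a tag means changing one dictionary entry rather than adding another `elif` branch.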

In [6]:
for direction in directions:
    pos_dict = count_pos(direction, pos_dict)

## Counters

In [7]:
from collections import Counter

In [8]:
for pos, lemmas in pos_dict.items():
    if pos in ("INTJ", "PREP"):
        continue  # interjections and prepositions are not reported
    print("\n10 most common {}".format(pos))
    for lemma, count in Counter(lemmas).most_common(10):
        print("{}: {}".format(count, lemma))


10 most common ADJ
567: тот
558: весь
473: один
233: который
232: другой
189: свой
163: передний
140: слышный
115: гостиный
112: муромский

10 most common ADVB
308: тихо
269: потом
165: несколько
162: быстро
137: громко
132: опять
91: вдруг
85: вслед
83: тоже
81: вполголоса

10 most common NOUN
1219: рука
1015: дверь
636: сторона
487: стол
395: пауза
344: иван
332: голова
321: голос
319: комната
262: окно

10 most common VERB
1393: уходить
1147: входить
498: садиться
460: идти
405: подходить
394: брать
294: выходить
294: смотреть
260: вставать
259: целовать
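
The same counting step can be wrapped in a small function that returns the rankings instead of printing them, which makes the result easier to reuse. This is a sketch on toy data: `top_lemmas` and `toy_pos_dict` are hypothetical names, and the lemmas below are made up for illustration rather than taken from the corpus.

```python
from collections import Counter


def top_lemmas(pos_dict, n=10, skip=("INTJ", "PREP")):
    """Map each part of speech (except those in `skip`) to its n most common lemmas."""
    return {
        pos: Counter(lemmas).most_common(n)
        for pos, lemmas in pos_dict.items()
        if pos not in skip
    }


# Toy input, not real corpus data.
toy_pos_dict = {
    "NOUN": ["door", "hand", "door", "table", "door", "hand"],
    "VERB": ["exit", "enter", "exit"],
    "PREP": ["in", "to"],
}

for pos, ranking in top_lemmas(toy_pos_dict, n=2).items():
    print(pos, ranking)
# NOUN [('door', 3), ('hand', 2)]
# VERB [('exit', 2), ('enter', 1)]
```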


## Directions

Let's also find the 10 most frequent directions:

In [9]:
# Counter is already imported above; start from an empty counter
c = Counter()

In [10]:
# `directions` was already loaded from all_directions.txt in cell 2
c.update(directions)

In [11]:
for direction, count in c.most_common(10):
    print("{}: {}".format(count, direction))

463: уходит
346: в сторону
330: пауза
136: смеется
120: про себя
102: помолчав
98: молчание
93: встает
91: кричит
86: входит
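
These counts treat each line of `all_directions.txt` as an exact string, so variants that differ only in case or surrounding punctuation would be counted separately. Whether the file actually contains such variants is not shown here; if it does, one option is to normalize each direction before counting. A minimal sketch, where `normalize` and the sample list are hypothetical and the sample is made up rather than taken from the file:

```python
from collections import Counter
import string


def normalize(direction):
    """Lowercase a direction and strip surrounding whitespace and ASCII punctuation."""
    return direction.strip().strip(string.punctuation).strip().lower()


# Made-up sample, not taken from all_directions.txt.
sample = ["Уходит.", "уходит", "(пауза)", "Пауза", "уходит "]

counts = Counter(normalize(d) for d in sample)
print(counts.most_common(2))
# [('уходит', 3), ('пауза', 2)]
```

Normalization of this kind merges spelling variants but also discards information, so it is worth checking the raw counts first.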
