# Most frequent parts of speech

As an aside, let's look at the most frequent lemmas for each part of speech and the most frequent directions.

## Data

First, we load the directions. They are stored in `all_directions.txt` and read into the `directions` variable. We'll also collect lemmas by part of speech in the `pos_dict` dictionary.

In [1]:
import os

In [2]:
alldirs_path = os.path.join("..", "directions", "all_directions.txt")
with open(alldirs_path, "r", encoding="utf-8") as f:
    directions = [line.strip("\n") for line in f]

In [3]:
pos_dict = {"ADJ": [], "ADVB": [], "INTJ": [], "NOUN": [], "PREP": [], "VERB": []}

## Extracting lemmas

In [4]:
from nltk import wordpunct_tokenize
from pymorphy2 import MorphAnalyzer

morph = MorphAnalyzer()

In [5]:
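def count_pos(direction, pos_dict):
    """Append the lemma of each token in `direction` to its bucket in `pos_dict`."""
    for token in wordpunct_tokenize(direction):
        parses = morph.parse(token)
        if not parses:
            continue
        # Take the first (highest-scored) analysis of the token.
        analysis = parses[0]
        lemma = analysis.normal_form
        # POS is None for punctuation and other tokens without a part of speech.
        pos = analysis.tag.POS
        if pos in ("ADJF", "ADJS", "COMP"):
            # Full adjectives, short adjectives, and comparatives.
            pos_dict["ADJ"].append(lemma)
        elif pos in ("VERB", "INFN"):
            # Finite verbs and infinitives.
            pos_dict["VERB"].append(lemma)
        elif pos in pos_dict:
            pos_dict[pos].append(lemma)
    return pos_dict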

In [6]:
for direction in directions:
    pos_dict = count_pos(direction, pos_dict)
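
The grouping inside `count_pos` collapses several fine-grained tags into the coarse keys of `pos_dict`. As a minimal sketch, the same mapping can be written as a lookup table. The names `COARSE_POS` and `coarse_pos` are hypothetical and not part of the notebook; the snippet only illustrates the mapping and does not call pymorphy2.

```python
# Hypothetical helper (not part of the notebook): maps a fine-grained
# POS tag to the coarse category used as a key in pos_dict.
COARSE_POS = {
    "ADJF": "ADJ",   # full adjective
    "ADJS": "ADJ",   # short adjective
    "COMP": "ADJ",   # comparative
    "VERB": "VERB",  # finite verb
    "INFN": "VERB",  # infinitive
    "ADVB": "ADVB",
    "INTJ": "INTJ",
    "NOUN": "NOUN",
    "PREP": "PREP",
}


def coarse_pos(pos):
    """Return the coarse category for a tag, or None if the tag is not tracked."""
    return COARSE_POS.get(pos)


print(coarse_pos("ADJS"))
print(coarse_pos("INFN"))
print(coarse_pos("NPRO"))
```

A table like this keeps the tag grouping in one place, so adding another category would not require a new `elif` branch.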

## Counters

In [7]:
from collections import Counter

In [8]:
for pos in pos_dict:
    if pos in ("INTJ", "PREP"):
        continue  # skip interjections and prepositions
    print("\n10 most common {}".format(pos))
    for lemma, count in Counter(pos_dict[pos]).most_common(10):
        print("{}: {}".format(count, lemma))


10 most common ADJ
510: тот
499: весь
414: один
215: который
203: другой
171: свой
157: передний
133: слышный
112: муромский
109: гостиный

10 most common ADVB
289: тихо
240: потом
156: быстро
134: несколько
120: опять
111: громко
83: вдруг
81: вслед
74: немного
71: вполголоса

10 most common NOUN
1111: рука
920: дверь
561: сторона
441: стол
389: пауза
306: голова
289: голос
281: комната
281: иван
244: окно

10 most common VERB
1258: уходить
1024: входить
448: садиться
428: идти
365: подходить
356: брать
274: выходить
266: смотреть
227: вставать
227: целовать
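
Raw counts are hard to compare across parts of speech, because each bucket has a different total. One possible extension, sketched here with a hypothetical helper `top_with_share` and toy data rather than the real corpus, is to report each lemma's share of its bucket alongside the count.

```python
from collections import Counter


def top_with_share(items, n=10):
    """Return (item, count, share) for the n most common items."""
    counts = Counter(items)
    total = sum(counts.values())
    return [(item, count, count / total) for item, count in counts.most_common(n)]


# Toy data for illustration only; not taken from the corpus.
sample = ["рука", "дверь", "рука", "стол", "рука", "дверь"]
for lemma, count, share in top_with_share(sample, n=2):
    print("{}: {} ({:.1%})".format(count, lemma, share))
```

On the real data this could be called as `top_with_share(pos_dict["NOUN"])`, assuming `pos_dict` has been filled as above.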


## Directions

Let's also find the 10 most frequent directions:

In [9]:
from collections import Counter

c = Counter()

In [10]:
# `directions` was already loaded in cell [2], so we reuse it here.
c.update(directions)

In [11]:
for direction, count in c.most_common(10):
    print("{}: {}".format(count, direction))

414: уходит
326: пауза
313: в сторону
130: смеется
102: помолчав
90: кричит
82: поет
80: входит
79: встает
75: молчание
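
The counts above treat each line as an exact string, so two directions that differ only in case, spacing, or surrounding punctuation would be counted separately. Whether such variants actually occur in `all_directions.txt` is not checked here. The following is a minimal sketch of normalizing before counting, using a hypothetical `normalize` helper and toy lines.

```python
from collections import Counter


def normalize(direction):
    """Collapse whitespace, strip edge punctuation, and lowercase."""
    collapsed = " ".join(direction.split())
    return collapsed.strip(" .,;:!?()").lower()


# Toy lines for illustration only; not taken from all_directions.txt.
sample = ["Уходит.", "уходит", "  в   сторону ", "(В сторону)", ""]
normalized = Counter(normalize(line) for line in sample if line.strip())
print(normalized.most_common())
```

Empty lines are skipped before counting so that they do not show up as a direction of their own.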
