# Building a dictionary for SymSpell

We build the dictionary from the HH.RU occupation reference lists and the Russian Classification of Occupations [ОК 010-2014 (МСКЗ-08)](https://data.mos.ru/classifier/7710168515-obshcherossiyskiy-klassifikator-zanyatiy?pageNumber=58&versionNumber=1&releaseNumber=1), which is the Russian-language version of [ISCO-08](https://esco.ec.europa.eu/en/classification/occupation_main).

For every word, we add all of its word forms to the dictionary.

## Loading libraries

In [20]:
import pandas as pd
import re

In [21]:
from symspellpy import SymSpell, Verbosity
import pymorphy2

In [22]:
from nltk.tokenize import word_tokenize
from nltk import download
from nltk.corpus import stopwords
download('punkt')
download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/dkharitonov/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/dkharitonov/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [23]:
import sys
sys.path.append('../')
from src import drop_stopwords, tokenize_drop_punkt

In [24]:
MA = pymorphy2.MorphAnalyzer()

## Loading data

In [25]:
isco08 = pd.read_csv('../datasets/external/ok-010-2014_ISCO-08_ru.csv', encoding = 'cp1251', sep=';')
roles = pd.read_csv('../datasets/external/hh_prof_roles.csv')
specialities = pd.read_csv('../datasets/external/hh_prof_specializations.csv')

## Functions

In [26]:
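def morh_analyse(word: str) -> list:
    """Return all word forms of `word`; Latin-script tokens are returned unchanged."""
    # Use the first parse returned by the analyzer.
    phrase = MA.parse(word)[0]
    if 'LATN' in phrase.tag:
        return [word]
    # Expand the word into its full lexeme (all inflected forms).
    return [p.word for p in phrase.lexeme]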

## Building the dictionary

In [31]:
docs = []
# Collect every occupation name and category from the three sources.
text = (isco08.NAME.to_list()
        + roles.category_name.to_list()
        + roles.prof_name.to_list()
        + specialities.category_name.to_list()
        + specialities.prof_name.to_list())

# Tokenize each name, drop stop words, and expand every token into its word forms.
for sentence in text:
    tokens = drop_stopwords(tokenize_drop_punkt(sentence))
    for token in tokens:
        docs += morh_analyse(token)

# Count how often each word form occurs.
dictionary = pd.Series(docs, name='term').value_counts().reset_index(level=0)
dictionary.columns = ['term', 'count']
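The counting step above can be illustrated without pymorphy2. The following is only a sketch: `TOY_LEXEMES` and `toy_word_forms` are hypothetical stand-ins for `morh_analyse`, and the token lists are made up. It shows that every word form receives the same count as the number of times its source token appears.

```python
from collections import Counter

# Hypothetical stand-in for morh_analyse: a tiny hand-written lexeme table.
TOY_LEXEMES = {
    "инженер": ["инженер", "инженера", "инженеру"],
    "главный": ["главный", "главного", "главному"],
}


def toy_word_forms(token: str) -> list:
    # Tokens missing from the table are kept unchanged,
    # much like Latin-script tokens in morh_analyse.
    return TOY_LEXEMES.get(token, [token])


# Made-up, already tokenized occupation names.
titles = [["главный", "инженер"], ["инженер", "python"]]

counts = Counter()
for tokens in titles:
    for token in tokens:
        # Each occurrence of a token adds one to every one of its forms.
        counts.update(toy_word_forms(token))

for term, count in counts.most_common():
    print(term, count)
```

Because all forms of a word share one count, rare inflections are ranked as high as the base form when suggestions are sorted by frequency.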

Next, we merge the occupation dictionary with the SymSpell frequency dictionary.

In [32]:
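# Assumes the frequency file has no header row; with header=0 its first
# (most frequent) entry would be consumed as column names and lost.
symdict = pd.read_csv('../models/symspell/ru-100k.txt', sep=' ', header=None, names=['term', 'count'])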

In [67]:
merged_dict = symdict.merge(dictionary, how='outer', on='term', suffixes=('sym', 'prof'))


In [68]:
# Terms missing from one of the sources get a zero count there.
merged_dict[['countsym', 'countprof']] = merged_dict[['countsym', 'countprof']].fillna(0)
# Boost occupation-term counts relative to the general dictionary.
merged_dict['countprof'] *= 45000
merged_dict['count'] = (merged_dict['countsym'] + merged_dict['countprof']).astype('int')
merged_dict = merged_dict.drop(['countsym', 'countprof'], axis=1)
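To make the weighting concrete, here is a minimal pure-Python sketch of the same outer merge. The helper `merge_counts` and all counts in it are invented for illustration; only the multiplier 45000 comes from the cell above.

```python
PROF_WEIGHT = 45000  # same multiplier as in the notebook cell


def merge_counts(general: dict, prof: dict, weight: int = PROF_WEIGHT) -> dict:
    """Outer-merge two term->count maps, boosting the occupation counts."""
    merged = {}
    for term in general.keys() | prof.keys():
        # A term absent from one source contributes zero from that source.
        merged[term] = general.get(term, 0) + prof.get(term, 0) * weight
    return merged


# Made-up counts, for illustration only.
general = {"и": 1_000_000, "инженер": 1200, "инжир": 300}
prof = {"инженер": 25, "тестировщик": 3}

merged = merge_counts(general, prof)
for term, count in sorted(merged.items(), key=lambda kv: -kv[1]):
    print(term, count)
```

With these toy numbers, an occupation term seen only a few times ends up far above a general-vocabulary word at a similar edit distance, which is what makes the spell checker prefer occupation names.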

In [69]:
merged_dict.sort_values(by='count', ascending=False).to_csv('../models/symspell/professions.txt', sep=' ', header=False, index=False)

## Verifying the load

In [2]:
sym_spell = SymSpell(max_dictionary_edit_distance=3, prefix_length=7)
dictionary_path = '../models/symspell/professions.txt'
sym_spell.load_dictionary(dictionary_path, 0, 1)

True
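The arguments `0` and `1` passed to `load_dictionary` are the column indices of the term and of its count in the space-separated file. The sketch below mimics that layout in plain Python; `parse_frequency_lines` is a hypothetical helper, not symspellpy's implementation, and the sample counts are made up.

```python
import io


def parse_frequency_lines(lines, term_index: int = 0, count_index: int = 1) -> dict:
    """Read 'term count' lines into a dict, skipping malformed lines."""
    entries = {}
    for line in lines:
        parts = line.split()
        if len(parts) <= max(term_index, count_index):
            continue  # blank or incomplete line
        entries[parts[term_index]] = int(parts[count_index])
    return entries


# Made-up sample in the same layout as professions.txt: one term and one count per line.
sample = io.StringIO("инженер 1126200\nтестировщик 135000\n\n")
entries = parse_frequency_lines(sample)
print(entries)
```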

In [7]:
input_term = 'инжинер'
suggestions = sym_spell.lookup(input_term, Verbosity.CLOSEST, include_unknown=True)

In [8]:
for suggestion in suggestions:
    print(suggestion)

инженер, 1, 25
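Each suggestion prints as term, edit distance, count, so the line above reports the term "инженер" at edit distance 1. As a sanity check, the distance can be recomputed by hand. The function below is a textbook dynamic-programming sketch of the optimal-string-alignment variant of Damerau-Levenshtein distance, written here for illustration; it is not taken from symspellpy.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with insertions, deletions, substitutions and adjacent transpositions."""
    # dp[i][j] holds the distance between a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)
    return dp[len(a)][len(b)]


print(osa_distance("инжинер", "инженер"))
```

The misspelling differs from the correct word by one substituted letter, which matches the distance reported in the suggestion.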


## Conclusion

We built and saved a dictionary for correcting typos in occupation names.