### Assignment description

1. (1 point) Prepare a mini-corpus (4-5 texts, or up to 10 thousand tokens) annotated with keywords.
The idea is to find a source of texts in which keywords are **already marked**.
State the source of the corpus and describe how the keywords were represented there. **yes**

2. (2 points) Annotate keywords yourself. Measure the overlap with the existing annotation.
Build a gold standard (for example, the intersection or the union of your annotation and the original one). **yes**

3. (2 points) Apply 3 keyword extraction methods of your choice to this corpus: RAKE, TextRank, TF-IDF, Okapi BM25.
**yes**

4. (2 points) Write morphological/syntactic templates for keywords and key phrases, and extract the matching substrings from the corpus (for example, Adj+Noun noun phrases).
Apply these filters to the keyword lists. **yes**

5. (2 points) Evaluate the precision, recall and F-measure of the chosen methods against the gold standard:
with and without the morphosyntactic templates. **yes**

6. (1 point) Describe the errors of automatic keyword extraction (what is extracted needlessly, what is missed);
propose your own ways of solving these problems. **yes**

1. I found a [list of datasets for keyword extraction](https://github.com/LIAAD/KeywordExtractor-Datasets) and took the SemEval2017 dataset from it, because I like the idea of automatically generating keywords for scientific articles. In the dataset, every text file comes with a keyword file that lists one keyphrase per line.

First I build a dataframe from the dataset by joining each text file with the keyword list assigned to it.

In [3]:
import os
import pandas as pd

In [4]:
def find_files(data_folder):
    text_files_paths = []
    kw_files_paths = []
    for root, dirs, files in os.walk(data_folder):
        for name in files:
            path = os.path.join(root, name)
            if name.endswith('.txt'):
                text_files_paths.append(path)
            if name.endswith('.key'):
                kw_files_paths.append(path)
            else:
                continue
    return text_files_paths, kw_files_paths

In [5]:
text_files_paths, kw_files_paths = find_files('SemEval2017')

In [6]:
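def create_sorted_tuple(file_paths_list):
    # Pair every path with its file name and sort by file name, so that
    # texts and keyword files line up in the same order.
    sorted_by_filename = []
    for path in file_paths_list:
        # os.path.basename works for any folder depth and path separator
        file = os.path.basename(path)
        sorted_by_filename.append((path, file))
    sorted_by_filename.sort(key=lambda x: x[1])
    return sorted_by_filename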

In [7]:
def create_df(sorted_texts_paths, sorted_keyword_paths):
    df = pd.DataFrame(columns = ['text', 'keywords'])
    for text_path, keyword_path in zip(sorted_texts_paths, sorted_keyword_paths):
        with open(text_path[0], 'r', encoding='utf-8') as text_file:
            text = text_file.read()
        with open(keyword_path[0], 'r', encoding='utf-8') as keyword_file:
            keywords = keyword_file.read()
            keywords = ",".join(keywords.split('\n'))
        single_text_df = pd.DataFrame({'text': [text], 'keywords': [keywords]})
        df = pd.concat([df, single_text_df], axis=0)
    return df

In [8]:
a = create_sorted_tuple(text_files_paths)
b = create_sorted_tuple(kw_files_paths)

In [9]:
df = create_df(a, b)
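
The pairing above relies on the text list and the keyword list sorting into the same order. A minimal sketch of a more explicit alternative pairs files by their stem (the file name without its extension). `pair_by_stem` and the example paths are hypothetical and are not part of the dataset tooling.

```python
from pathlib import PurePosixPath


def pair_by_stem(text_paths, key_paths):
    """Pair each .txt path with the .key path that has the same file stem."""
    # Index keyword files by stem, e.g. 'S001' for 'corpus/keys/S001.key'
    keys_by_stem = {PurePosixPath(p).stem: p for p in key_paths}
    pairs = []
    for text_path in sorted(text_paths):
        stem = PurePosixPath(text_path).stem
        # Skip texts without a keyword file instead of silently misaligning
        if stem in keys_by_stem:
            pairs.append((text_path, keys_by_stem[stem]))
    return pairs


# Hypothetical paths, for illustration only
texts = ["corpus/docs/S002.txt", "corpus/docs/S001.txt", "corpus/docs/S003.txt"]
keys = ["corpus/keys/S001.key", "corpus/keys/S002.key"]
pairs = pair_by_stem(texts, keys)
```

A text without a keyword file (`S003` here) is dropped rather than shifted onto the wrong keyword list.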

In [10]:
df.head(5)

Unnamed: 0,text,keywords
0,"Complex Langevin (CL) dynamics [1,2] provides...","CL,complexified configuration space,Complex La..."
0,Nuclear theory devoted major efforts since 4 d...,"C60,combining quantum features,field of cluste..."
0,The next important step might be the derivatio...,"continuum space-time,Dirac equation,future res..."
0,This work shows how our approach based on the ...,"class virial expansions,field partition functi..."
0,A fluctuating vacuum is a general feature of q...,a collection of fermionic fields describing co...


I save the dataframe to a separate file so that I can reuse it later.

In [11]:
df.to_csv('keword_extraction_semeval.csv', index=False, header=True)

In [12]:
df = pd.read_csv('keword_extraction_semeval.csv')

I keep 5 random texts and annotate keywords for them by hand.

In [13]:
test_df = df.sample(5, random_state=42)

In [14]:
test_df

Unnamed: 0,text,keywords
454,This paper presents general results on the Jav...,"compare just the constant static trees,compari..."
73,This section is devoted to the discretization ...,derivation of the semi-discrete advection–diff...
310,"In this Letter, we present results of a relati...","Bethe–Salpeter equation,calculate the average ..."
175,Max-linear programs have been used to describe...,adapt the existing methods for finding real so...
332,We propose an equilibrium model that allows to...,analyze different approaches to recover networ...


Text1

In [15]:
test_df['text'].tolist()[0]

'This paper presents general results on the Java source code snippet detection problem. We propose the tool which uses graph and subgraph isomorphism detection. A number of solutions for all of these tasks have been proposed in the literature. However, although that all these solutions are really fast, they compare just the constant static trees. Our solution offers to enter an input sample dynamically with the Scripthon language while preserving an acceptable speed. We used several optimizations to achieve very low number of comparisons during the matching algorithm.\n'

In [16]:
k_w1 = [
    'Java',
    'source code snippet detection problem',
    'isomorphism detection',
    'Scripthon'
]

Text2

In [17]:
test_df['text'].tolist()[1]

'This section is devoted to the discretization of the advection–diffusion equation and to the analysis of dispersion and diffusion eigencurves for different polynomial orders. The spectral/hp continuous Galerkin method considered closely resembles the formulation presented in [7]. Sec. 2.1 describes in detail the derivation of the semi-discrete advection–diffusion problem as applied to wave-like solutions, from which the relevant eigencurves can be obtained. The inviscid case (linear advection) is then addressed in Sec. 2.2, where the role of primary and secondary eigencurves is discussed from the perspective introduced in [9]. The viscous case is subsequently considered in Sec. 2.3, where eigencurves are shown to feature irregular oscillations for problems strongly dominated by either convection or diffusion.\n'

In [18]:
k_w2 = [
    'advection–diffusion equation',
    'dispersion and diffusion eigencurves',
    'Galerkin method',
    'eigencurves'
]

Text3

In [19]:
test_df['text'].tolist()[2]

'In this Letter, we present results of a relativistic calculation of decay constants in the framework of full Salpeter equation. The full Salpeter equation is a relativistic equation describing a bound state. Since this method has a very solid basis in quantum field theory, it is very good in describing a bound state which is a relativistic system. In a previous paper [16], we solved the instantaneous Bethe–Salpeter equation [17], which is also called full Salpeter equation [18]. After we solved the full Salpeter equation, we obtained the relativistic wave function of the bound state. We used this wave function to calculate the average kinetic energy of the heavy quark inside a heavy meson in 0− state, and obtained values which agree very well with recent experiments. We also found there that the relativistic corrections are quite large and cannot be ignored [16]. In this Letter we use this method to predict the values of decay constants of heavy mesons in 0− state.\n'

In [20]:
k_w3 = [
    'relativistic calculation of decay constants',
    'full Salpeter equation',
    'quantum field theory',
    'decay constants'
]

Text4

In [21]:
test_df['text'].tolist()[3]

'Max-linear programs have been used to describe optimisation problems for multiprocessor interactive systems. In some instances the variables used in this model are required to be integer; however, no method seems to exist for finding integer solutions to max-linear programs.For a generic class of matrices, we show that integer solutions to two-sided max-linear systems and programs can be found in polynomial time. For general matrices, we adapt the existing methods for finding real solutions to obtain algorithms for finding integer solutions.\n'

In [22]:
k_w4 = [
    'Max-linear programs',
    'matrices',
    'multiprocessor interactive systems',
    'optimisation'
]

Text5

In [23]:
test_df['text'].tolist()[4]

'We propose an equilibrium model that allows to analyze the long-run impact of the electricity market design on transmission line expansion by the regulator and investment in generation capacity by private firms in liberalized electricity markets. The model incorporates investment decisions of the transmission system operator and private firms in expectation of an energy-only market and cost-based redispatch. In different specifications we consider the cases of one vs. multiple price zones (market splitting) and analyze different approaches to recover network cost—in particular lump sum, generation capacity based, and energy based fees. In order to compare the outcomes of our multilevel market model with a first best benchmark, we also solve the corresponding integrated planner problem. Using two test networks we illustrate that energy-only markets can lead to suboptimal locational decisions for generation capacity and thus imply excessive network expansion. Market splitting heals thes

In [24]:
k_w5 = [
    'equilibrium model',
    'electricity market',
    'transmission line',
    'investment',
    'market splitting',
    'integrated planner problem',
    'energy-only markets'
]

In [25]:
manually_extracted_keywords = [k_w1, k_w2, k_w3, k_w4, k_w5]

The keywords are annotated; now I need a way to compare keyword lists with each other and compute precision, recall and F-measure.

The plan: for each item of the gold keyword list, check whether it is a substring of some item of the compared keyword list, or vice versa.

I want ```equilibrium model``` and ```equilibrium models``` to count as a match. 

 ```True Positive```: an item of one list is a substring of an item of the other list.
 
 ```False Negative```: an item of the gold list has no item in the compared list that it stands in a substring relation with. 
 
 ```False Positive```: an item of the compared list has no item in the gold list that it stands in a substring relation with. 

Note that the function below tests ```str.find(...) == 0```, so in practice a pair matches only when one string starts with the other (a prefix match), which is stricter than the substring relation described here.


In [153]:
def tp_fn_fp(gold_kw, annotated_kw):
    tp = 0
    annotated_kw_matched = []
    gold_kw_matched = []
    for gold_keyword in gold_kw:
        for annotated_keyword in annotated_kw:
            # find(...) == 0 means "starts with": one keyword is a prefix of the other
            if gold_keyword.find(annotated_keyword) == 0 or annotated_keyword.find(gold_keyword) == 0:
                # every annotated keyword is counted as a true positive at most once
                if annotated_keyword not in annotated_kw_matched:
                    tp += 1
                    annotated_kw_matched.append(annotated_keyword)
                    gold_kw_matched.append(gold_keyword)
    fn = len(gold_kw) - len(gold_kw_matched)
    fp = len(annotated_kw) - len(annotated_kw_matched)
    return tp, fn, fp
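
For comparison, here is a minimal sketch of the two-way substring relation as the text describes it, using the `in` operator and counting each keyword once. It is not the function used for the numbers in this notebook, `substring_match_counts` is a hypothetical name, and swapping it in would change the reported metrics.

```python
def substring_match_counts(gold_kw, predicted_kw):
    """Count matched and unmatched keywords under a two-way substring relation."""
    def related(a, b):
        return a in b or b in a

    gold, predicted = set(gold_kw), set(predicted_kw)
    # A gold keyword is matched if it is related to at least one prediction
    matched_gold = {g for g in gold if any(related(g, p) for p in predicted)}
    # A prediction is matched if it is related to at least one gold keyword
    matched_pred = {p for p in predicted if any(related(g, p) for g in gold)}
    return {
        "matched_gold": len(matched_gold),
        "matched_pred": len(matched_pred),
        "fn": len(gold) - len(matched_gold),
        "fp": len(predicted) - len(matched_pred),
    }


gold = [
    "graph and subgraph isomorphism detection",
    "Scripthon language",
    "matching algorithm",
]
predicted = ["isomorphism detection", "Scripthon", "Java"]
counts = substring_match_counts(gold, predicted)
```

With these counts, recall would be `matched_gold / len(gold)` and precision `matched_pred / len(predicted)`. Note that `isomorphism detection` is a substring of the first gold phrase but not a prefix of it, so a prefix test would miss that pair.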

Let me see how well my annotation agrees with the dataset's annotation. I compute precision, recall and F-measure for each text separately and then take their arithmetic mean as the metric for the whole sample.

In [154]:
def precision_recall_f_measure(tp, fn, fp):
    try:
        precision = tp / (tp + fp)
    except ZeroDivisionError:
        precision = 0
    try:
        recall = tp / (tp + fn)
    except ZeroDivisionError:
        recall = 0
    try:
        f_measure = 2 * ((precision * recall)/(precision + recall))
    except ZeroDivisionError:
        f_measure = 0
    return precision, recall, f_measure

In [155]:
gold_kw_lists = []
for text_keywords in test_df.keywords.tolist():
    kw_list = text_keywords.split(',')
    kw_list.pop()
    gold_kw_lists.append(kw_list)
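
The `pop()` call drops the last item, which is presumably the empty string left by a trailing comma (a keyword file that ends with a newline produces one when its lines are joined). A slightly more defensive sketch filters out empty items instead, so it also works when there is no trailing comma. `split_keywords` is a hypothetical helper.

```python
def split_keywords(keyword_string):
    """Split a comma-joined keyword string and drop empty items."""
    return [kw.strip() for kw in keyword_string.split(",") if kw.strip()]


with_trailing = split_keywords("Galerkin method,inviscid case,")
without_trailing = split_keywords("Galerkin method,inviscid case")
```

Either way, a keyphrase that itself contains a comma would still be split in two, because the comma is used as the separator.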

In [156]:
texts_precision = []
texts_recall = []
texts_f_measure = []

for gold_kw, annotated_kw in zip(gold_kw_lists, manually_extracted_keywords):
    tp, fn, fp = tp_fn_fp(gold_kw, annotated_kw)
    precision, recall, f_measure = precision_recall_f_measure(tp, fn, fp)
    texts_precision.append(precision)
    texts_recall.append(recall)
    texts_f_measure.append(f_measure)

In [157]:
import statistics 

In [158]:
print(f'Average PRECISION on 5 texts: {statistics.mean(texts_precision)}')
print(f'Average RECALL on 5 texts: {statistics.mean(texts_recall)}')
print(f'Average F-MEASURE on 5 texts: {statistics.mean(texts_f_measure)}')

Average PRECISION on 5 texts: 0.5857142857142857
Average RECALL on 5 texts: 0.2204184704184704
Average F-MEASURE on 5 texts: 0.3133479853479853
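
The numbers above are macro-averages: each text contributes one precision and one recall value, regardless of how many keywords it has. A micro-average pools the counts over all texts first. The sketch below contrasts the two on made-up counts; the `(tp, fn, fp)` triples are illustrative and are not taken from the corpus.

```python
from statistics import mean


def macro_micro_precision_recall(counts):
    """counts: list of (tp, fn, fp) triples, one per text."""
    # Macro: compute the metric per text, then average the per-text values
    macro_p = mean([tp / (tp + fp) if tp + fp else 0.0 for tp, fn, fp in counts])
    macro_r = mean([tp / (tp + fn) if tp + fn else 0.0 for tp, fn, fp in counts])
    # Micro: sum the counts over all texts, then compute the metric once
    total_tp = sum(tp for tp, fn, fp in counts)
    total_fn = sum(fn for tp, fn, fp in counts)
    total_fp = sum(fp for tp, fn, fp in counts)
    micro_p = total_tp / (total_tp + total_fp) if total_tp + total_fp else 0.0
    micro_r = total_tp / (total_tp + total_fn) if total_tp + total_fn else 0.0
    return {"macro": (macro_p, macro_r), "micro": (micro_p, micro_r)}


toy_counts = [(2, 7, 2), (1, 1, 0)]
result = macro_micro_precision_recall(toy_counts)
```

On the toy counts the short second text pulls the macro precision up, while the micro precision is dominated by the text with more keywords.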


The metrics show that my annotation is far less complete than the dataset's: I marked only 22% of the keywords chosen by the annotators.

In the dataset, the keywords were assigned by people who had read the articles. I only read the abstracts and am not familiar with the subject areas, so I will treat the dataset annotation as the gold standard and compute all further metrics against it alone.

## Keyword extraction

My work with the keyword extraction methods rests on the following assumptions:

1) Case does not matter, so the whole text can be lowercased with lower()

2) All stopwords should be removed

3) All non-letter characters should be removed (the regular expression below keeps hyphens)

4) The text should be lemmatized

In [32]:
import re

In [33]:
with open('extended_stopwords_towards_nlp.txt', 'r', encoding='utf-8') as file:
    raw = file.read()
stopwords = raw.split()

In [34]:
import pattern
from pattern.en import lemma, lexeme

In [35]:
def preprocess_text(text):
    text = text.lower()
    text = re.sub('[^-a-z]', ' ', text)
    text = " ".join(text.split())
    lemmatized_text = [lemma(wd) for wd in text.split() if lemma(wd) not in stopwords]
    text = " ".join(lemmatized_text)
    return text

### RAKE

In [36]:
import RAKE

In [37]:
def extract_keywords_with_RAKE(texts):
    kws_of_texts_collection = []
    rake = RAKE.Rake(stopwords)
    for text in texts:
        text_kw_list = []
        text_kws = rake.run(text, maxWords=3, minFrequency=2)
        for kw in text_kws:
            text_kw_list.append(kw[0])
        text_kw_list = ",".join(text_kw_list)
        kws_of_texts_collection.append(text_kw_list)
    return kws_of_texts_collection

In [38]:
%%time
df['RAKE_keywords'] = extract_keywords_with_RAKE(df['text'].tolist())

CPU times: user 2.9 s, sys: 19.5 ms, total: 2.92 s
Wall time: 2.96 s


In [39]:
df.head(5)

Unnamed: 0,text,keywords,RAKE_keywords
0,"Complex Langevin (CL) dynamics [1,2] provides...","CL,complexified configuration space,Complex La...","cl,10]"
1,Nuclear theory devoted major efforts since 4 d...,"C60,combining quantum features,field of cluste...","field,10]"
2,The next important step might be the derivatio...,"continuum space-time,Dirac equation,future res...","continuum space-time,continuum limit,objects h..."
3,This work shows how our approach based on the ...,"class virial expansions,field partition functi...","solution,form"
4,A fluctuating vacuum is a general feature of q...,a collection of fermionic fields describing co...,"nuclear physics,nucleon,surface"


It turned out that RAKE works much worse on these texts after preprocessing, so I skipped preprocessing for RAKE. I will see how it affects the other methods.
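
A likely explanation: RAKE uses stopwords and punctuation as phrase boundaries, so deleting them beforehand glues the text into over-long candidates, which `maxWords=3` then discards. The sketch below is a simplified re-implementation of RAKE's candidate splitting and degree/frequency scoring, not the code of the `RAKE` package; the tiny stopword set is an assumption made for the example.

```python
import re
from collections import defaultdict


def rake_candidates(text, stopwords):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\[\]]", text.lower()):
        current = []
        for word in re.findall(r"[a-z][-a-z]*", fragment):
            if word in stopwords:
                # A stopword closes the phrase collected so far
                if current:
                    phrases.append(current)
                    current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_scores(phrases):
    """Score each phrase as the sum of degree/frequency over its words."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    word_score = {w: degree[w] / freq[w] for w in freq}
    return {" ".join(p): sum(word_score[w] for w in p) for p in phrases}


toy_stopwords = {"we", "the", "which", "uses", "and"}  # assumption: toy list
raw = "We propose the tool which uses graph and subgraph isomorphism detection."
preprocessed = "propose tool graph subgraph isomorphism detection"

raw_scores = rake_scores(rake_candidates(raw, toy_stopwords))
glued = rake_candidates(preprocessed, toy_stopwords)
```

On the raw sentence the stopwords cut out `subgraph isomorphism detection` as its own candidate; on the preprocessed version there are no boundaries left, so the only candidate is a single six-word phrase.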

### TextRank

In [40]:
# gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4
from gensim.summarization import keywords

In [41]:
def extract_keywords_with_TextRank(texts):
    kws_of_texts_collection = []
    for text in texts:
        text = preprocess_text(text)
        text_kw_list = []
        text_kws = keywords(text, pos_filter=[], scores=True)
        for kw in text_kws:
            text_kw_list.append(kw[0])
        text_kw_list = ",".join(text_kw_list)
        kws_of_texts_collection.append(text_kw_list)
    return kws_of_texts_collection

In [42]:
%%time
df['TextRank_keywords'] = extract_keywords_with_TextRank(df['text'].tolist())

CPU times: user 11.7 s, sys: 184 ms, total: 11.9 s
Wall time: 11.9 s


In [43]:
df.head(5)

Unnamed: 0,text,keywords,RAKE_keywords,TextRank_keywords
0,"Complex Langevin (CL) dynamics [1,2] provides...","CL,complexified configuration space,Complex La...","cl,10]","result,thi,field theory,distribution,rely samp..."
1,Nuclear theory devoted major efforts since 4 d...,"C60,combining quantum features,field of cluste...","field,10]","quantum,approach,dynamic,dynamical,development..."
2,The next important step might be the derivatio...,"continuum space-time,Dirac equation,future res...","continuum space-time,continuum limit,objects h...","continuum,treatment,description,lattice partic..."
3,This work shows how our approach based on the ...,"class virial expansions,field partition functi...","solution,form","approach,solution,partition,equation,class,for..."
4,A fluctuating vacuum is a general feature of q...,a collection of fermionic fields describing co...,"nuclear physics,nucleon,surface","field,bag,nucleon,scale,vacuum,force,surface,c..."


I like that TextRank picked quite different keywords. This method works better with preprocessing.
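
For intuition about what TextRank ranks, here is a minimal sketch of the idea: build a graph of words that occur next to each other and run PageRank on it. It is a simplification and may differ from gensim's implementation in details; linking only adjacent words, the damping factor 0.85 and the fixed number of iterations are assumptions of this sketch.

```python
from collections import defaultdict


def textrank_words(tokens, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected adjacency graph."""
    # Link every pair of neighbouring tokens
    neighbours = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            neighbours[left].add(right)
            neighbours[right].add(left)
    nodes = list(neighbours)
    if not nodes:
        return []
    scores = {node: 1 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbour's score
        scores = {
            node: (1 - damping) / len(nodes)
            + damping * sum(scores[n] / len(neighbours[n]) for n in neighbours[node])
            for node in nodes
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


tokens = (
    "salpeter equation bound state salpeter equation "
    "wave function salpeter equation"
).split()
ranking = textrank_words(tokens)
```

Words with many distinct neighbours (`salpeter` and `equation` in the toy input) end up at the top, which is why frequent but generic words can also rank high.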

### Yet Another Keyword Extractor (Yake)

In [None]:
# !pip install git+https://github.com/LIAAD/yake

In [45]:
import yake

In [193]:
kw_extractor = yake.KeywordExtractor(dedupLim=0.9, dedupFunc='seqm')

In [194]:
def extract_keywords_with_Yake(texts):
    kws_of_texts_collection = []
    for text in texts:
        text = preprocess_text(text)
        text_kw_list = []
        text_kws = kw_extractor.extract_keywords(text)
        for kw in text_kws:
            text_kw_list.append(kw[0])
        text_kw_list = ",".join(text_kw_list)
        kws_of_texts_collection.append(text_kw_list)
    return kws_of_texts_collection

In [195]:
%%time
df['Yake_keywords'] = extract_keywords_with_Yake(df['text'].tolist())

CPU times: user 2min 14s, sys: 1.06 s, total: 2min 15s
Wall time: 2min 18s


In [203]:
df.head(5)

Unnamed: 0,text,keywords,RAKE_keywords,TextRank_keywords,Yake_keywords,RAKE_keywords_filtered,TextRank_keywords_filtered,Yake_keywords_filtered
0,"Complex Langevin (CL) dynamics [1,2] provides...","CL,complexified configuration space,Complex La...","cl,10]","result,thi,field theory,distribution,rely samp...","dynamic provide approach,provide approach circ...",,sign problem,"provide approach,approach circumvent,year numb..."
1,Nuclear theory devoted major efforts since 4 d...,"C60,combining quantum features,field of cluste...","field,10]","quantum,approach,dynamic,dynamical,development...","treat thi semi,theory devote major,devote majo...",,development laser,"theory devote,major effort,major effort,effort..."
2,The next important step might be the derivatio...,"continuum space-time,Dirac equation,future res...","continuum space-time,continuum limit,objects h...","continuum,treatment,description,lattice partic...","object hop lattice,knowledge object hop,step d...",,,"knowledge object,object hop,step derivation,in..."
3,This work shows how our approach based on the ...,"class virial expansions,field partition functi...","solution,form","approach,solution,partition,equation,class,for...","van der waal,describe van der,reproduce van de...",,phase transition,"virial expansion,base combination,mechanic non..."
4,A fluctuating vacuum is a general feature of q...,a collection of fermionic fields describing co...,"nuclear physics,nucleon,surface","field,bag,nucleon,scale,vacuum,force,surface,c...","fermionic field describe,fluctuate vacuum gene...",,,"fermionic field,field describe,boundary condit..."


Different keywords once again. This method is slower than the others (about 2 minutes versus a few seconds).

## Templates

I will use the following templates:
 
1) VERB NOUN - Measure complexity
2) ADJ NOUN - Complex systems
3) NOUN NOUN - Knowledge object
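
Conceptually, each template is a sequence of part-of-speech tags that is slid over the tagged tokens. Here is a minimal sketch without spaCy, assuming the tokens are already tagged; the tags below are assigned by hand for the example, whereas in the notebook they come from the spaCy model.

```python
# The three two-token templates listed above
PATTERNS = {("VERB", "NOUN"), ("ADJ", "NOUN"), ("NOUN", "NOUN")}


def match_bigrams(tagged_tokens):
    """Return every pair of adjacent tokens whose POS tags match a template."""
    matches = []
    for (word1, pos1), (word2, pos2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (pos1, pos2) in PATTERNS:
            matches.append(f"{word1} {word2}")
    return matches


# Hand-tagged example (assumption: these tags are not produced by a tagger)
tagged = [
    ("measure", "VERB"),
    ("complexity", "NOUN"),
    ("of", "ADP"),
    ("complex", "ADJ"),
    ("systems", "NOUN"),
]
found = match_bigrams(tagged)
```

Because every template has exactly two tokens, a single-word keyword can never produce a match.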

In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher



I use code from a [Stack Overflow answer](https://stackoverflow.com/questions/55393087/pos-pattern-mining-with-spacy)

In [60]:
patterns = [
    [{'POS': 'VERB'}, {'POS': 'NOUN'}],
    [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
    [{'POS': 'NOUN'}, {'POS': 'NOUN'}]
    ]
matcher = Matcher(nlp.vocab)
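# spaCy v2 signature: add(name, on_match_callback, *patterns);
# in spaCy v3 the call becomes matcher.add("verb-noun", [patterns[0]])
matcher.add("verb-noun", None, patterns[0])
matcher.add("adj-noun", None, patterns[1])
matcher.add("noun-noun", None, patterns[2])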

In [73]:
def filter_kw(keyword_string):
    filtered_kws = []
    keywords = keyword_string.split(',')
    for keyword in keywords:
        keyword = nlp(keyword)
        matches = matcher(keyword)
        for match_id, start, end in matches:
            string_id = nlp.vocab.strings[match_id]
            span = keyword[start:end]
            filtered_kws.append(span.text)
    filtered_kws = ",".join(filtered_kws)
    return filtered_kws

In [74]:
%%time
df['RAKE_keywords_filtered'] = df['RAKE_keywords'].apply(lambda x: str(filter_kw(x)))

CPU times: user 15.6 s, sys: 197 ms, total: 15.8 s
Wall time: 17.2 s


In [75]:
%%time
df['TextRank_keywords_filtered'] = df['TextRank_keywords'].apply(lambda x: str(filter_kw(x)))

CPU times: user 34.1 s, sys: 355 ms, total: 34.4 s
Wall time: 35.4 s


In [204]:
%%time
df['Yake_keywords_filtered'] = df['Yake_keywords'].apply(lambda x: str(filter_kw(x)))

CPU times: user 1min 9s, sys: 545 ms, total: 1min 10s
Wall time: 1min 11s


In [78]:
df.head(5)

Unnamed: 0,text,keywords,RAKE_keywords,TextRank_keywords,Yake_keywords,RAKE_keywords_filtered,TextRank_keywords_filtered,Yake_keywords_filtered
0,"Complex Langevin (CL) dynamics [1,2] provides...","CL,complexified configuration space,Complex La...","cl,10]","result,thi,field theory,distribution,rely samp...","dynamic provide approach,provide approach circ...",,sign problem,"provide approach,approach circumvent,year numb..."
1,Nuclear theory devoted major efforts since 4 d...,"C60,combining quantum features,field of cluste...","field,10]","quantum,approach,dynamic,dynamical,development...","treat thi semi,theory devote major,devote majo...",,development laser,"theory devote,major effort,major effort,effort..."
2,The next important step might be the derivatio...,"continuum space-time,Dirac equation,future res...","continuum space-time,continuum limit,objects h...","continuum,treatment,description,lattice partic...","object hop lattice,knowledge object hop,step d...",,,"knowledge object,object hop,step derivation,in..."
3,This work shows how our approach based on the ...,"class virial expansions,field partition functi...","solution,form","approach,solution,partition,equation,class,for...","van der waal,describe van der,reproduce van de...",,phase transition,"virial expansion,base combination,mechanic non..."
4,A fluctuating vacuum is a general feature of q...,a collection of fermionic fields describing co...,"nuclear physics,nucleon,surface","field,bag,nucleon,scale,vacuum,force,surface,c...","fermionic field describe,fluctuate vacuum gene...",,,"fermionic field,field describe,boundary condit..."


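The templates cut off almost everything from RAKE and TextRank, while a lot of the Yake output survived, which is interesting. All three templates are two-token patterns, so single-word keywords can never pass the filter.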

## Evaluating the methods with and without filtering

In [180]:
def method_metrics(df, column):
    gold_kw_lists = []
    annotated_keyword_lists = []
    for text_kw_string in df.keywords.tolist():
        text_gold_kw = text_kw_string.lower().split(',')
        text_gold_kw.pop()
        gold_kw_lists.append(text_gold_kw)
        
    for text_kw_string in df[column].tolist():
        text_anno_kw = text_kw_string.split(',')
        if text_anno_kw == ['']:
            text_anno_kw = None
        annotated_keyword_lists.append(text_anno_kw)
        
    texts_precision = []
    texts_recall = []
    texts_f_measure = []
    for gold_kw, annotated_kw in zip(gold_kw_lists, annotated_keyword_lists):
        if annotated_kw is not None:
            tp, fn, fp = tp_fn_fp(gold_kw, annotated_kw)
            precision, recall, f_measure = precision_recall_f_measure(tp, fn, fp)
        else:
            precision = 0
            recall = 0
            f_measure = 0
        texts_precision.append(precision)
        texts_recall.append(recall)
        texts_f_measure.append(f_measure)
        
    print(column)
    print(f'Average PRECISION: {statistics.mean(texts_precision)}')
    print(f'Average RECALL: {statistics.mean(texts_recall)}')
    print(f'Average F-MEASURE: {statistics.mean(texts_f_measure)}')

In [181]:
method_metrics(df, 'RAKE_keywords')

RAKE_keywords
Average PRECISION: 0.4508016862820938
Average RECALL: 0.11869674436361802
Average F-MEASURE: 0.16550152192298564


In [182]:
method_metrics(df, 'RAKE_keywords_filtered')

RAKE_keywords_filtered
Average PRECISION: 0.2260070965505878
Average RECALL: 0.027900741556889604
Average F-MEASURE: 0.044852879305750874


In [183]:
method_metrics(df, 'TextRank_keywords')

TextRank_keywords
Average PRECISION: 0.34181311569153117
Average RECALL: 0.18882842079479797
Average F-MEASURE: 0.2350402862371047


In [184]:
method_metrics(df, 'TextRank_keywords_filtered')

TextRank_keywords_filtered
Average PRECISION: 0.17799188640973632
Average RECALL: 0.017052562891098202
Average F-MEASURE: 0.03009388238890513


In [205]:
method_metrics(df, 'Yake_keywords')

Yake_keywords
Average PRECISION: 0.11987829614604463
Average RECALL: 0.14884229891891954
Average F-MEASURE: 0.12746080280606156


In [206]:
method_metrics(df, 'Yake_keywords_filtered')

Yake_keywords_filtered
Average PRECISION: 0.11559734666565397
Average RECALL: 0.11288814468625952
Average F-MEASURE: 0.10685791642215416


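According to the metrics, TextRank_keywords has the best F-measure on this dataset (0.24). The template filter lowers the F-measure of every method.

RAKE makes fewer mistakes (the highest precision, 0.45) but also predicts less overall (the lowest unfiltered recall, 0.12).

YAKE makes more mistakes (the lowest unfiltered precision, 0.12) but also predicts more overall.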


## Error description

I will go through the errors using 3 of the 5 texts for which I annotated keywords by hand.

In [169]:
%%time 
test_df['RAKE_keywords'] = extract_keywords_with_RAKE(test_df['text'].tolist())
test_df['TextRank_keywords'] = extract_keywords_with_TextRank(test_df['text'].tolist())
test_df['Yake_keywords'] = extract_keywords_with_Yake(test_df['text'].tolist())
test_df['RAKE_keywords_filtered'] = test_df['RAKE_keywords'].apply(lambda x: str(filter_kw(x)))
test_df['TextRank_keywords_filtered'] = test_df['TextRank_keywords'].apply(lambda x: str(filter_kw(x)))
test_df['Yake_keywords_filtered'] = test_df['Yake_keywords'].apply(lambda x: str(filter_kw(x)))

CPU times: user 2.62 s, sys: 42 ms, total: 2.66 s
Wall time: 2.95 s


In [170]:
test_df

Unnamed: 0,text,keywords,RAKE_keywords,TextRank_keywords,Yake_keywords,RAKE_keywords_filtered,TextRank_keywords_filtered,Yake_keywords_filtered
454,This paper presents general results on the Jav...,"compare just the constant static trees,compari...",solutions,"number solution,detection,propose,match,paper ...","thi paper general,dure match algorithm,paper g...",,number solution,"constant static,static tree,enter input,enter ..."
73,This section is devoted to the discretization ...,derivation of the semi-discrete advection–diff...,2,"eigencurve,discretization advection diffusion,...","thi devote discretization,equation analysi dis...",,,"polynomial order,order spectral,resemble formu..."
310,"In this Letter, we present results of a relati...","Bethe–Salpeter equation,calculate the average ...","full salpeter equation,decay constants,bound s...","heavy,state,thi,relativistic calculation,calcu...","full salpeter equation,describe bound state,he...","full salpeter,salpeter equation,bound state",relativistic calculation,"full salpeter,salpeter equation,bound state,bo..."
175,Max-linear programs have been used to describe...,adapt the existing methods for finding real so...,"max-linear programs,finding integer solutions","solution,generic,general,integer,matrice","describe optimisation problem,optimisation pro...","finding integer,integer solutions",,"describe optimisation,optimisation problem,opt..."
332,We propose an equilibrium model that allows to...,analyze different approaches to recover networ...,"generation capacity,private firms,energy-,anal...","market,network,model,base,investment generatio...","differ acros regime,generation capacity privat...","generation capacity,private firms","investment generation,generation capacity","acros regime,generation capacity,private firm,..."


#### Text1

In [174]:
for keyword in test_df['keywords'].tolist()[0].split(','):
    print(keyword)


compare just the constant static trees
comparisons
enter an input sample dynamically
graph and subgraph isomorphism detection
Java source code snippet detection
matching algorithm
preserving an acceptable speed
Scripthon language
several optimizations



In [175]:
for keyword in test_df['RAKE_keywords'].tolist()[0].split(','):
    print(keyword)

solutions


In [176]:
for keyword in test_df['TextRank_keywords'].tolist()[0].split(','):
    print(keyword)

number solution
detection
propose
match
paper general result


In [178]:
for keyword in test_df['TextRank_keywords_filtered'].tolist()[0].split(','):
    print(keyword)

number solution


In [177]:
for keyword in test_df['Yake_keywords'].tolist()[0].split(','):
    print(keyword)

thi paper general
dure match algorithm
paper general result
general result java
result java source
java source code
source code snippet
tool graph subgraph
graph subgraph isomorphism
fast compare constant
compare constant static
constant static tree
offer enter input
enter input sample
input sample dynamically
sample dynamically scripthon
dynamically scripthon language
scripthon language preserve
language preserve acceptable
preserve acceptable speed


In [179]:
for keyword in test_df['Yake_keywords_filtered'].tolist()[0].split(','):
    print(keyword)

constant static
static tree
enter input
enter input
input sample
scripthon language
language preserve
acceptable speed


Errors in **Text1**

The proper names ```Java``` and ```Scripthon``` were not extracted as keywords of their own; they only show up lowercased inside longer Yake n-grams.

Phrases longer than 3 tokens were not extracted: ```compare just the constant static trees```, ```graph and subgraph isomorphism detection```.
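
One cheap way to recover such names is to run a capitalization check on the raw text before lowercasing. A minimal sketch, assuming that any capitalized token that does not open a sentence is a proper-name candidate; `capitalized_mid_sentence` is a hypothetical helper, and the naive sentence split would stumble on abbreviations such as `Sec.`.

```python
import re


def capitalized_mid_sentence(text):
    """Collect capitalized tokens that are not the first word of a sentence."""
    found = []
    # Naive sentence split: whitespace that follows '.', '!' or '?'
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = sentence.split()
        for token in tokens[1:]:  # skip the sentence-initial word
            word = token.strip(".,;:()[]")
            if word[:1].isupper() and word not in found:
                found.append(word)
    return found


excerpt = (
    "This paper presents general results on the Java source code snippet "
    "detection problem. Our solution offers to enter an input sample "
    "dynamically with the Scripthon language while preserving an acceptable speed."
)
names = capitalized_mid_sentence(excerpt)
```

The candidates could then be added to the extractor output or used to boost phrases that contain them.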

#### Text2

In [197]:
for keyword in test_df['keywords'].tolist()[1].split(','):
    print(keyword)



derivation of the semi-discrete advection–diffusion problem
discretization of the advection–diffusion equation
dispersion and diffusion eigencurves
Galerkin method
inviscid case
linear advection
viscous case
wave-like solutions



In [199]:
for keyword in test_df['RAKE_keywords'].tolist()[1].split(','):
    print(keyword)

2


In [200]:
for keyword in test_df['TextRank_keywords'].tolist()[1].split(','):
    print(keyword)

eigencurve
discretization advection diffusion
consider
case
problem
galerkin


In [202]:
for keyword in test_df['Yake_keywords'].tolist()[1].split(','):
    print(keyword)

thi devote discretization
equation analysi dispersion
polynomial order spectral
continuou galerkin method
closely resemble formulation
resemble formulation describe
formulation describe detail
describe detail derivation
detail derivation semi-discrete
apply wave-like solution
wave-like solution relevant
address role primary
role primary secondary
discuss perspective introduce
perspective introduce viscou
feature irregular oscillation
problem apply wave-like
inviscid case linear
introduce viscou case
viscou case subsequently


Errors in **Text2**

Words that carry little meaning on their own were extracted: ```case, consider, problem, 2```
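
One way to down-weight such words is inverse document frequency: a word that occurs in almost every abstract gets a weight near zero. A minimal sketch on three made-up one-line documents (toy data, not from the corpus); the function names and the threshold of 0.5 are assumptions of the example.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """Map each word to log(N / number of documents containing it)."""
    n_docs = len(documents)
    doc_freq = Counter(
        word for doc in documents for word in set(doc.lower().split())
    )
    return {word: math.log(n_docs / count) for word, count in doc_freq.items()}


def drop_uninformative(candidates, idf, threshold=0.5):
    """Keep candidates whose IDF exceeds the threshold (unknown words are dropped)."""
    return [c for c in candidates if idf.get(c, 0.0) > threshold]


toy_docs = [
    "we consider the galerkin method for this problem",
    "we consider max-linear programs in this problem",
    "we consider an equilibrium model of the market",
]
idf = inverse_document_frequency(toy_docs)
kept = drop_uninformative(["consider", "galerkin", "problem"], idf)
```

In the toy collection `consider` appears in every document and gets a weight of zero, while `galerkin` appears in only one and survives the filter.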

#### Text3

In [209]:
for keyword in test_df['keywords'].tolist()[2].split(','):
    print(keyword)

Bethe–Salpeter equation
calculate the average kinetic energy
describing a bound state
framework of full Salpeter equation
full Salpeter equation
heavy meson
heavy mesons
heavy quark
obtained values
predict the values of decay constants
quantum field theory
relativistic calculation of decay constants
relativistic corrections are quite large
relativistic equation
relativistic system
relativistic wave function of the bound state
Salpeter equation
solved the full Salpeter equation
solved the instantaneous Bethe–Salpeter equation
use this method
wave function



In [211]:
for keyword in test_df['RAKE_keywords'].tolist()[2].split(','):
    print(keyword)

full salpeter equation
decay constants
bound state
0− state
letter
method
solved
values


In [212]:
for keyword in test_df['TextRank_keywords'].tolist()[2].split(','):
    print(keyword)

heavy
state
thi
relativistic calculation
calculate
solve
constant


In [213]:
for keyword in test_df['Yake_keywords'].tolist()[2].split(','):
    print(keyword)

full salpeter equation
describe bound state
heavy meson state
bound state thi
salpeter equation relativistic
instantaneou bethe salpeter
framework full salpeter
call full salpeter
bethe salpeter equation
salpeter equation call
equation full salpeter
good describe bound
meson state agree
solve full salpeter
wave function bound
function bound state
salpeter equation solve
solid basi quantum
basi quantum field
quantum field theory


Errors in **Text3**

It looks like Yake always returns trigrams here, and I do not see how to make it go through all n-gram lengths.
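
One way to cover all n-gram lengths is to enumerate them directly from the token list and pool the candidates before scoring. A minimal sketch; `enumerate_ngrams` is a hypothetical helper and is independent of Yake. Yake's `KeywordExtractor` also accepts an `n` argument for the maximum n-gram size, which may be worth varying.

```python
def enumerate_ngrams(tokens, max_n=3):
    """Return all contiguous n-grams of length 1..max_n, shortest first."""
    ngrams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


candidates = enumerate_ngrams("full salpeter equation".split())
```

Pooling the unigram, bigram and trigram candidates this way lets a shorter gold phrase such as `salpeter equation` be proposed alongside the trigram.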

## Overall conclusions on the problems and proposed solutions

1) Check tokens and token sequences for being a named entity or a proper name: for example, every token that is capitalized but does not start a sentence can be treated as a proper name.

2) Vary the extractor parameters: look for unigrams first, then bigrams, then trigrams, and include their combinations in the final selection.

3) Use information about token "importance", as in TF-IDF, to avoid extracting words like ```consider``` as keywords.