# WSD for "break". Notebook result: 3/10


This notebook tries the following approaches:
1. varying the size of the context window
2. using only the definitions in Lesk
3. using only the examples
4. using both the examples and the definitions
5. testing all of the above with and without stop words
6. stemming
7. POS tagging of the word "break"
8. passing the whole sentence to the Lesk algorithm instead of a context window (this gave a better result than the windows did).

I deliberately avoided picking a "more convenient" sample of sentences containing break, so all of these experiments are honest. The experiments run into two main problems. First, WordNet is far from covering every sense (especially senses of "break" tied to various sports, and phrasal verbs). Second, there is almost no overlap between the vocabulary of my sample and the WordNet definitions and examples. POS tagging helped a little, but Lesk alone is not enough for a corpus like this.


Data preparation starts here.

In [15]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.tokenize import sent_tokenize
from nltk.wsd import lesk
from string import punctuation

punct = punctuation+'«»—…“”*№–'
stops = set(stopwords.words('english'))


def tokenize(text):
    
    words = [word.strip(punct) for word in text.lower().split() if word and word not in stops]
    words = [word for word in words if word]

    return words
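One detail of `tokenize` is worth illustrating: the stop-word check runs on the raw whitespace token, before punctuation is stripped, so a token such as `it,` passes the filter and only then becomes `it`. The sketch below contrasts the two orders. `TOY_STOPS` and both function names are hypothetical, and the tiny stop list is only a stand-in for the NLTK list used in the notebook.

```python
from string import punctuation

# Hypothetical stand-in for stopwords.words('english'), so the sketch runs without NLTK data.
TOY_STOPS = {"the", "a", "to", "of", "it"}

def tokenize_filter_first(text):
    """Same order as the notebook: check the stop list, then strip punctuation."""
    words = [w.strip(punctuation) for w in text.lower().split() if w not in TOY_STOPS]
    return [w for w in words if w]

def tokenize_strip_first(text):
    """Alternative order: strip punctuation, then check the stop list."""
    words = [w.strip(punctuation) for w in text.lower().split()]
    return [w for w in words if w and w not in TOY_STOPS]

sample = "They tried to break it, the siege held."
print(tokenize_filter_first(sample))  # 'it' survives because 'it,' is not in the stop list
print(tokenize_strip_first(sample))   # 'it' is removed
```

Whether this matters for the results below is not measured here; it only shows that some stop words can leak into the "without stop words" runs.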

#### 1. Split the corpus into sentences with nltk.sent_tokenize()

In [16]:
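# Read the corpus line by line; the context manager closes the file afterwards.
with open('corpus_eng.txt') as corpus_file:
    initial_corpus = list(corpus_file)

# Flatten every text into its sentences.
sent_corpus = []
for text in initial_corpus:
    sent_corpus.extend(sent_tokenize(text))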

In [17]:
len(sent_corpus)

198699

#### 2. Build a corpus of the sentences that contain the word 'break'

In [18]:
break_corpus = [sent for sent in sent_corpus if 'break' in tokenize(sent)]
len(break_corpus)

448

In [19]:
break_corpus[:10]

['They break the rules in order to convince the rule-makers that they need to change the rules, which is itself a kind of state-approved process.',
 'In the second innings, he got locked up in his position so much, intent on trying to play an offspinner against the turn, that he was stunned by a big-turning off break that ricocheted off his pads and fell on the stumps.',
 'She wasn’t very easy to break in.',
 'Their five-day schedule break is Feb. 10-15 and they have 13 of their final 20 games at home starting March 2.',
 '"Back in my poor college days when I worked at Walmart we had a fight break out over a bike.',
 'Since then, the party has said it respects the will of the people, but many Labour lawmakers are hoping to steer the talks with the European Union away from what some fear will be a clean break with the bloc\'s lucrative single market - the so-called "hard Brexit."',
 'or is it to break the record of first president to get indicted and cuffed while being sworn in?',
 '“Ye

#### 3. Pick 10 random sentences from this corpus

In [20]:
import random
random.seed(5)

corpus_10 = random.sample(break_corpus, 10)

#### 4. Check whether the Lesk algorithm identifies the senses of "break" correctly

In [21]:
def get_words_in_context(words, window=3):
    words2context = []
    for i in range(len(words)):
        left = words[max(0,i-window):i] 
        right = words[i+1:i+window+1]
        target = words[i]
        words2context.append((target, left+right)) 
    return words2context


def lesk(word: str, sentence: list):
    # Simplified Lesk over definitions; this name shadows the lesk imported from nltk.wsd above.
    bestsense = 0
    maxoverlap = 0 
    for i, synset in enumerate(wn.synsets(word)):
        definition = tokenize(synset.definition())
        definition = set(definition)
        sentence = set(sentence)
        overlap = len(definition & sentence)          
        if overlap > maxoverlap:
            maxoverlap = overlap
            bestsense = i  
    return bestsense
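The scoring in `lesk` can be tried on a toy sense list without loading WordNet. The sketch below is a minimal stand-in, not the notebook's function: `simplified_lesk` and `toy_glosses` are hypothetical names, and the three glosses are copied from the WordNet output shown in this notebook (their indices here are local, 0 to 2).

```python
def simplified_lesk(context, glosses):
    """Return the index of the gloss that shares the most words with the context.

    With zero overlap everywhere, the function falls back to index 0,
    exactly like the initial bestsense = 0 in the notebook's version.
    """
    context = set(context)
    best_sense, max_overlap = 0, 0
    for i, gloss in enumerate(glosses):
        gloss_words = {w.strip("().,;") for w in gloss.lower().split()}
        overlap = len(gloss_words & context)
        if overlap > max_overlap:
            best_sense, max_overlap = i, overlap
    return best_sense

toy_glosses = [
    "some abrupt occurrence that interrupts an ongoing activity",
    "an unexpected piece of good luck",
    "a pause from doing something (as work)",
]

print(simplified_lesk(["short", "pause", "work"], toy_glosses))  # overlaps the third gloss
print(simplified_lesk(["tea", "cricket", "pitch"], toy_glosses))  # no overlap at all
```

The second call shows the fallback: when no gloss shares a word with the context, the first sense is returned. That is probably why sense 0 shows up so often in the outputs below.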

First, let's look at the possible senses:

In [22]:
for i, synset in enumerate(wn.synsets("break")):
    print(i, synset.definition(), synset.examples())

0 some abrupt occurrence that interrupts an ongoing activity ['the telephone is an annoying interruption', 'there was a break in the action when a player was hurt']
1 an unexpected piece of good luck ['he finally got his big break']
2 (geology) a crack in the earth's crust resulting from the displacement of one side with respect to the other ['they built it right over a geological fault', "he studied the faulting of the earth's crust"]
3 a personal or social separation (as between opposing factions) ['they hoped to avoid a break in relations']
4 a pause from doing something (as work) ['we took a 10-minute break', 'he took time out to recuperate']
5 the act of breaking something ['the breakage was unavoidable']
6 a time interval during which there is a temporary cessation of something []
7 breaking of hard tissue such as bone ['it was a nasty fracture', 'the break seems to have been caused by a fall']
8 the occurrence of breaking ['the break in the dam threatened the valley']
9 an abrup

And at the sentences:

In [23]:
for i, sent in enumerate(corpus_10):
    print(i, sent)

0 Dog's pitch invasion forces early tea break
1 Let's talk about what changes Trump might make regarding tax cuts, the Iran nuclear agreement, trade and what the consequences might be if Trump starts a trade war with China after we take a short break.
2 The insurgents had seized a couple of strategic areas in western Aleppo after launching an offensive on Oct. 28 in an attempt to break the siege imposed in July on rebel-held eastern Aleppo, which has also been targeted by waves of Syrian and Russian airstrikes.
3 But Iranian international Azmoun levelled for Rostov just before the break after turning Jerome Boateng to coolly slot past Ulreich.
4 They represent a grand break from the past.
5 Woodland is making only his third start on the 2016-17 PGA Tour, having put down his golf clubs for a while to get married and have an extended break from the game.
6 Cohen, a native of Quebec, was already a celebrated poet and novelist when he moved to New York in 1966 at age 31 to break into the m

Just in case, let's check the wider context of several sentences:

In [24]:
for text in initial_corpus:
    if "Dog's pitch invasion forces early tea break" in text:
        print(text)
    elif "But Iranian international Azmoun levelled for Rostov" in text:
        print(text)
    elif "They represent a grand break from the past" in text:
        print(text)
    elif "This is usually sufficient to keep the sinker from moving" in text:
        print(text)
    elif "Home town boy Jadeja" in text:
        print(text)

But Iranian international Azmoun levelled for Rostov just before the break after turning Jerome Boateng to coolly slot past Ulreich.

Home town boy Jadeja had earlier given India the first break at 47, successfully earning a shout for leg before wicket against England skipper Cook. 

Share Stuart Broad suffers foot injury in second Test - Yahoo7 Stuart Broad has a foot problem but England do not yet know how serious it is, according to the bowler's opening partner James Anderson. Dog's pitch invasion forces early tea break UP NEXT VIDEO Dog's pitch invasion forces early tea break. Source: FoxSports. Dog's pitch invasion forces early tea break 

You can peg the slip sinker to prevent it from sliding on the line. Jam one end of a toothpick in the head of a sinker as far as it will go, then break or clip it off. This is usually sufficient to keep the sinker from moving, but at times you may want to also jam the other end of the toothpick into the back of the cone and break it off to keep 

Let's establish the correspondences:
* sentence 1: a tea break during a cricket match (I had to google the article to figure this out); senses 4, 6, 62 fit
* sentence 2: senses 4, 6, 62
* sentence 3: "to break the siege"; senses 16, 20, 23, 51 could fit figuratively, but WordNet has no sense that really matches
* sentence 4: the break in a football match; senses 4, 6, 62 fit
* sentence 5: a "major turning point" in India; sense 49
* sentence 6: senses 4, 6, 62
* sentence 7: figurative sense 26; sense 42 may also fit
* sentence 8: senses 16, 20, 23, 51
* sentence 9: "to break a toothpick"; senses 7, 20
* sentence 10: it is hard to tell which sense this is; it relates to cricket. Let's assume it is a delivery in cricket. The closest WordNet senses are 11 and 14. They relate to sport but may be wrong.
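Since several senses are accepted per sentence, scoring a run amounts to checking whether each predicted index falls into the acceptable set. A minimal sketch follows. The `gold` sets are copied from the list above, while `count_hits` and `example_predictions` are hypothetical and do not reproduce any run in this notebook.

```python
# Acceptable sense indices for sentences 1..10, copied from the list above.
gold = [
    {4, 6, 62}, {4, 6, 62}, {16, 20, 23, 51}, {4, 6, 62}, {49},
    {4, 6, 62}, {26, 42}, {16, 20, 23, 51}, {7, 20}, {11, 14},
]

def count_hits(predictions, gold_sets):
    """Count predictions that fall inside the acceptable set for their sentence."""
    return sum(pred in ok for pred, ok in zip(predictions, gold_sets))

# Hypothetical predictions, for illustration only (not an actual result).
example_predictions = [0, 43, 0, 0, 0, 4, 26, 20, 3, 0]
print(f"{count_hits(example_predictions, gold)}/{len(gold)}")
```

In the notebook itself the comparison is done by eye against the printed definitions.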


Let's see what we get with the Lesk algorithm we used in class, with a window of 3:

In [25]:
def get_meanings(corpus_list, corpus_str):
    
    for i, sent in enumerate(corpus_list):
        contexts = get_words_in_context(sent)
        for word, context in contexts:
            if word == "break":
                meaning_number = lesk(word, context)
                print(i+1, "DEFINITION: " + str(meaning_number) + '\t'+ wn.synsets(word)[meaning_number].definition())
                print(i+1, "SENTENCE: ", corpus_str[i])
    return 0
    
corpus_10_tokenized = [tokenize(sent) for sent in corpus_10]
get_meanings(corpus_10_tokenized, corpus_10)

1 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
1 SENTENCE:  Dog's pitch invasion forces early tea break
2 DEFINITION: 43	happen or take place
2 SENTENCE:  Let's talk about what changes Trump might make regarding tax cuts, the Iran nuclear agreement, trade and what the consequences might be if Trump starts a trade war with China after we take a short break.
3 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
3 SENTENCE:  The insurgents had seized a couple of strategic areas in western Aleppo after launching an offensive on Oct. 28 in an attempt to break the siege imposed in July on rebel-held eastern Aleppo, which has also been targeted by waves of Syrian and Russian airstrikes.
4 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
4 SENTENCE:  But Iranian international Azmoun levelled for Rostov just before the break after turning Jerome Boateng to coolly slot past Ulreich.
5 DEFINITION: 0	some abrupt occurrence th

0

So far not a single sense matches.

# Experiments

### 1. Try a different window size

A window of 2 is no better.

In [26]:
def get_words_in_context(words, window=2):
    words2context = []
    for i in range(len(words)):
        left = words[max(0,i-window):i] 
        right = words[i+1:i+window+1]
        target = words[i]
        words2context.append((target, left+right)) 
    return words2context

get_meanings(corpus_10_tokenized, corpus_10)

1 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
1 SENTENCE:  Dog's pitch invasion forces early tea break
2 DEFINITION: 43	happen or take place
2 SENTENCE:  Let's talk about what changes Trump might make regarding tax cuts, the Iran nuclear agreement, trade and what the consequences might be if Trump starts a trade war with China after we take a short break.
3 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
3 SENTENCE:  The insurgents had seized a couple of strategic areas in western Aleppo after launching an offensive on Oct. 28 in an attempt to break the siege imposed in July on rebel-held eastern Aleppo, which has also been targeted by waves of Syrian and Russian airstrikes.
4 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
4 SENTENCE:  But Iranian international Azmoun levelled for Rostov just before the break after turning Jerome Boateng to coolly slot past Ulreich.
5 DEFINITION: 0	some abrupt occurrence th

0

Other window sizes make no difference either.
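To see how little material a window gives Lesk to work with, the sketch below prints the context of "break" in sentence 1 for a few window sizes. `context_window` is a hypothetical helper that mirrors the slicing in `get_words_in_context` for a single target word. The sentence is lowercased and split on whitespace rather than run through the notebook's `tokenize`.

```python
def context_window(words, target, window):
    """Return the words within `window` positions of the first occurrence of target."""
    i = words.index(target)
    return words[max(0, i - window):i] + words[i + 1:i + window + 1]

# Sentence 1 from the sample.
tokens = "dog's pitch invasion forces early tea break".split()

for window in (2, 3, 5):
    print(window, context_window(tokens, "break", window))
```

Because "break" is the last word here, only the left side contributes, so even a window of 5 yields five words to compare against each gloss.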

### 2. Try keeping the stop words.
Sentence 8 was correctly matched with sense 20.
#### Result: 1/10

In [27]:
def tokenize(text):
    
    words = [word.strip(punct) for word in text.lower().split() if word]
    words = [word for word in words if word]

    return words


def get_words_in_context(words, window=3):
    words2context = []
    for i in range(len(words)):
        left = words[max(0,i-window):i] 
        right = words[i+1:i+window+1]
        target = words[i]
        words2context.append((target, left+right)) 
    return words2context

corpus_10_tokenized = [tokenize(sent) for sent in corpus_10]
get_meanings(corpus_10_tokenized, corpus_10)

1 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
1 SENTENCE:  Dog's pitch invasion forces early tea break
2 DEFINITION: 2	(geology) a crack in the earth's crust resulting from the displacement of one side with respect to the other
2 SENTENCE:  Let's talk about what changes Trump might make regarding tax cuts, the Iran nuclear agreement, trade and what the consequences might be if Trump starts a trade war with China after we take a short break.
3 DEFINITION: 9	an abrupt change in the tone or register of the voice (as at puberty or due to emotion)
3 SENTENCE:  The insurgents had seized a couple of strategic areas in western Aleppo after launching an offensive on Oct. 28 in an attempt to break the siege imposed in July on rebel-held eastern Aleppo, which has also been targeted by waves of Syrian and Russian airstrikes.
4 DEFINITION: 2	(geology) a crack in the earth's crust resulting from the displacement of one side with respect to the other
4 SENTENCE:  But Iran

0

### 3. Try using only the WordNet examples instead of the sense definitions in the Lesk algorithm (without stop words).

In [28]:
def tokenize(text):
    
    words = [word.strip(punct) for word in text.lower().split() if word and word not in stops]
    words = [word for word in words if word]

    return words


def lesk(word: str, sentence: list):
    bestsense = 0
    maxoverlap = 0
    
    for i, synset in enumerate(wn.synsets(word)):
        examples = synset.examples()
        examples_list = []
        for j in examples:
            j = tokenize(j)
            for k in j:
                examples_list.append(k)
        examples_final = set(examples_list)
        
        sentence = set(sentence)
        
        overlap = len(examples_final & sentence)          
        if overlap > maxoverlap:
            maxoverlap = overlap
            bestsense = i  
    return bestsense

In [29]:
corpus_10_tokenized = [tokenize(sent) for sent in corpus_10]
get_meanings(corpus_10_tokenized, corpus_10)

1 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
1 SENTENCE:  Dog's pitch invasion forces early tea break
2 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
2 SENTENCE:  Let's talk about what changes Trump might make regarding tax cuts, the Iran nuclear agreement, trade and what the consequences might be if Trump starts a trade war with China after we take a short break.
3 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
3 SENTENCE:  The insurgents had seized a couple of strategic areas in western Aleppo after launching an offensive on Oct. 28 in an attempt to break the siege imposed in July on rebel-held eastern Aleppo, which has also been targeted by waves of Syrian and Russian airstrikes.
4 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
4 SENTENCE:  But Iranian international Azmoun levelled for Rostov just before the break after turning Jerome Boateng to coolly slot past Ulreich.
5 DE

0

##### Still 0/10

### 4. Try combining definitions and examples in the Lesk algorithm (with stop words)

In [30]:
def tokenize(text):
    
    words = [word.strip(punct) for word in text.lower().split() if word]
    words = [word for word in words if word]

    return words


def lesk(word: str, sentence: list):
    bestsense = 0
    maxoverlap = 0
    
    for i, synset in enumerate(wn.synsets(word)):
        
        examples = synset.examples()
        examples_list = []
        for j in examples:
            j = tokenize(j)
            for k in j:
                examples_list.append(k)
        
        definition = tokenize(synset.definition())
    
        ex_plus_def = set(examples_list + definition)

        sentence = set(sentence)
        
        overlap = len(ex_plus_def & sentence)          
        if overlap > maxoverlap:
            maxoverlap = overlap
            bestsense = i 
            
    return bestsense

In [31]:
corpus_10_tokenized = [tokenize(sent) for sent in corpus_10]
get_meanings(corpus_10_tokenized, corpus_10)

1 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
1 SENTENCE:  Dog's pitch invasion forces early tea break
2 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
2 SENTENCE:  Let's talk about what changes Trump might make regarding tax cuts, the Iran nuclear agreement, trade and what the consequences might be if Trump starts a trade war with China after we take a short break.
3 DEFINITION: 9	an abrupt change in the tone or register of the voice (as at puberty or due to emotion)
3 SENTENCE:  The insurgents had seized a couple of strategic areas in western Aleppo after launching an offensive on Oct. 28 in an attempt to break the siege imposed in July on rebel-held eastern Aleppo, which has also been targeted by waves of Syrian and Russian airstrikes.
4 DEFINITION: 23	scatter or part
4 SENTENCE:  But Iranian international Azmoun levelled for Rostov just before the break after turning Jerome Boateng to coolly slot past Ulreich.
5 DEFINITION: 2	(

0

No improvement.

### 4. Try dropping the stop words
- again nothing

In [32]:
def tokenize(text):
    
    words = [word.strip(punct) for word in text.lower().split() if word and word not in stops]
    words = [word for word in words if word]

    return words

corpus_10_tokenized = [tokenize(sent) for sent in corpus_10]
get_meanings(corpus_10_tokenized, corpus_10)

1 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
1 SENTENCE:  Dog's pitch invasion forces early tea break
2 DEFINITION: 43	happen or take place
2 SENTENCE:  Let's talk about what changes Trump might make regarding tax cuts, the Iran nuclear agreement, trade and what the consequences might be if Trump starts a trade war with China after we take a short break.
3 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
3 SENTENCE:  The insurgents had seized a couple of strategic areas in western Aleppo after launching an offensive on Oct. 28 in an attempt to break the siege imposed in July on rebel-held eastern Aleppo, which has also been targeted by waves of Syrian and Russian airstrikes.
4 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
4 SENTENCE:  But Iranian international Azmoun levelled for Rostov just before the break after turning Jerome Boateng to coolly slot past Ulreich.
5 DEFINITION: 43	happen or take place
5 S

0

### 5. Try stemming both the sentences and the definitions with examples. With stop words

In [33]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")


def tokenize(text):
    words = [word.strip(punct) for word in text.lower().split() if word]
    words = [word for word in words if word]
    return words


def stem_(words):
    words = [stemmer.stem(i) for i in words]
    return words


def lesk(word: str, sentence: list):
    bestsense = 0
    maxoverlap = 0
    
    sentence = set(stem_(sentence))
    
    for i, synset in enumerate(wn.synsets(word)):   
        examples = synset.examples()
        examples_list = []
        for j in examples:
            j = stem_(tokenize(j))
            for word in j:
                examples_list.append(word)
        
        definition = stem_(tokenize(synset.definition()))
        ex_plus_def = set(examples_list + definition)
        overlap = len(ex_plus_def & sentence)          
        if overlap > maxoverlap:
            maxoverlap = overlap
            bestsense = i 
            
    return bestsense

In [34]:
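# corpus_10_tokenized still comes from the previous cell (stop words removed);
# it is not rebuilt with the tokenize() defined above.
get_meanings(corpus_10_tokenized, corpus_10)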

1 DEFINITION: 20	destroy the integrity of; usually by force; cause to separate into pieces or fragments
1 SENTENCE:  Dog's pitch invasion forces early tea break
2 DEFINITION: 43	happen or take place
2 SENTENCE:  Let's talk about what changes Trump might make regarding tax cuts, the Iran nuclear agreement, trade and what the consequences might be if Trump starts a trade war with China after we take a short break.
3 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
3 SENTENCE:  The insurgents had seized a couple of strategic areas in western Aleppo after launching an offensive on Oct. 28 in an attempt to break the siege imposed in July on rebel-held eastern Aleppo, which has also been targeted by waves of Syrian and Russian airstrikes.
4 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
4 SENTENCE:  But Iranian international Azmoun levelled for Rostov just before the break after turning Jerome Boateng to coolly slot past Ulreich.
5 DEFINITION

0

Hooray! Sentence 8 received sense 23 of the verb "break", which we accepted as correct. But the sense in sentence 9 was not recognized.
#### Again the result is 1/10, but with a different sentence.
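Stemming raises overlap by collapsing inflected forms, but the extra matches are not always meaningful. The sketch below uses a crude suffix stripper, `crude_stem`, as a hypothetical stand-in for `SnowballStemmer` (it is not equivalent to it) so that it runs without NLTK. The context is the 3-word window of sentence 1, and the gloss is the definition of sense 20 from the output above with punctuation removed.

```python
def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer, for illustration only."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def overlap(a, b):
    """Words shared by two token sequences."""
    return set(a) & set(b)

context = ["forces", "early", "tea"]
gloss = ("destroy the integrity of usually by force "
         "cause to separate into pieces or fragments").split()

print(overlap(context, gloss))                                    # nothing shared
print(overlap(map(crude_stem, context), map(crude_stem, gloss)))  # 'forces' and 'force' now collide
```

A collision of this kind between the verb "forces" and the noun "force" may explain why sentence 1 was assigned sense 20 in the output above, although that was not checked directly.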

### 6. Try the same thing without stop words
- nothing changed

In [35]:
def tokenize(text):
    words = [word.strip(punct) for word in text.lower().split() if word and word not in stops]
    words = [word for word in words if word]
    return words

In [36]:
get_meanings(corpus_10_tokenized, corpus_10)

1 DEFINITION: 20	destroy the integrity of; usually by force; cause to separate into pieces or fragments
1 SENTENCE:  Dog's pitch invasion forces early tea break
2 DEFINITION: 43	happen or take place
2 SENTENCE:  Let's talk about what changes Trump might make regarding tax cuts, the Iran nuclear agreement, trade and what the consequences might be if Trump starts a trade war with China after we take a short break.
3 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
3 SENTENCE:  The insurgents had seized a couple of strategic areas in western Aleppo after launching an offensive on Oct. 28 in an attempt to break the siege imposed in July on rebel-held eastern Aleppo, which has also been targeted by waves of Syrian and Russian airstrikes.
4 DEFINITION: 0	some abrupt occurrence that interrupts an ongoing activity
4 SENTENCE:  But Iranian international Azmoun levelled for Rostov just before the break after turning Jerome Boateng to coolly slot past Ulreich.
5 DEFINITION

0

### 7. POS-tag the word "break"

In [37]:
import nltk

def get_tags_for_break(sent: str):
    
    sent_postagged = []
    
    if type(sent) == str: 
        sent = word_tokenize(sent)
        for i in nltk.pos_tag(sent):
            word, tag = i
            if word in punct:
                continue
            if word.lower() == "break":
                sent_postagged.append(word.lower() + '_' + tag)
            else:
                sent_postagged.append(word.lower())
                
    return sent_postagged

In [38]:
corpus_10_tagged = [get_tags_for_break(sent) for sent in corpus_10]

In [39]:
corpus_10_tagged

[['dog', "'s", 'pitch', 'invasion', 'forces', 'early', 'tea', 'break_NN'],
 ['let',
  "'s",
  'talk',
  'about',
  'what',
  'changes',
  'trump',
  'might',
  'make',
  'regarding',
  'tax',
  'cuts',
  'the',
  'iran',
  'nuclear',
  'agreement',
  'trade',
  'and',
  'what',
  'the',
  'consequences',
  'might',
  'be',
  'if',
  'trump',
  'starts',
  'a',
  'trade',
  'war',
  'with',
  'china',
  'after',
  'we',
  'take',
  'a',
  'short',
  'break_NN'],
 ['the',
  'insurgents',
  'had',
  'seized',
  'a',
  'couple',
  'of',
  'strategic',
  'areas',
  'in',
  'western',
  'aleppo',
  'after',
  'launching',
  'an',
  'offensive',
  'on',
  'oct.',
  '28',
  'in',
  'an',
  'attempt',
  'to',
  'break_VB',
  'the',
  'siege',
  'imposed',
  'in',
  'july',
  'on',
  'rebel-held',
  'eastern',
  'aleppo',
  'which',
  'has',
  'also',
  'been',
  'targeted',
  'by',
  'waves',
  'of',
  'syrian',
  'and',
  'russian',
  'airstrikes'],
 ['but',
  'iranian',
  'international',
  '

In [40]:
def get_break_meaning_from_sent(corpus):
    breaks = []
    for sent in corpus:
        sent = word_tokenize(sent)
        for i in nltk.pos_tag(sent):
            word, tag = i
            if word.lower() == "break":
                breaks.append(word.lower() + '_' + tag)
    return breaks

In [41]:
breaks = get_break_meaning_from_sent(corpus_10)
breaks

['break_NN',
 'break_NN',
 'break_VB',
 'break_NN',
 'break_NN',
 'break_NN',
 'break_VB',
 'break_VB',
 'break_VB',
 'break_NN']

In [42]:
def lesk(word: str, context: list):
    bestsense = 0
    maxoverlap = 0
    
    context = set(context)
    
    for i, synset in enumerate(wn.synsets(word)):   
        examples = synset.examples()
        examples_list = []
        for j in examples:
            j = get_tags_for_break(j)
            for word in j:
                examples_list.append(word)
                
        definition = get_tags_for_break(synset.definition())
        
        ex_plus_def = set(examples_list + definition)
        overlap = len(ex_plus_def & context)          
        if overlap > maxoverlap:
            maxoverlap = overlap
            bestsense = i 
            
    return bestsense

In [43]:
def get_meanings(corpus_list, corpus_str):
    
    for i, sent in enumerate(corpus_list):
        contexts = get_words_in_context(sent)
        for word, context in contexts:
            if word == "break":
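                # Note: += with a str extends the list character by character;
                # context.append(breaks[i]) would add the tag as a single token.
                context += breaks[i]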
                meaning_number = lesk(word, context)
                print(i+1, "DEFINITION:" + str(meaning_number) + '\t'+ wn.synsets(word)[meaning_number].definition())
                print(i+1, "SENTENCE: ", corpus_str[i])
    return 0

In [44]:
get_meanings(corpus_10_tokenized, corpus_10)

1 DEFINITION:0	some abrupt occurrence that interrupts an ongoing activity
1 SENTENCE:  Dog's pitch invasion forces early tea break
2 DEFINITION:0	some abrupt occurrence that interrupts an ongoing activity
2 SENTENCE:  Let's talk about what changes Trump might make regarding tax cuts, the Iran nuclear agreement, trade and what the consequences might be if Trump starts a trade war with China after we take a short break.
3 DEFINITION:0	some abrupt occurrence that interrupts an ongoing activity
3 SENTENCE:  The insurgents had seized a couple of strategic areas in western Aleppo after launching an offensive on Oct. 28 in an attempt to break the siege imposed in July on rebel-held eastern Aleppo, which has also been targeted by waves of Syrian and Russian airstrikes.
4 DEFINITION:0	some abrupt occurrence that interrupts an ongoing activity
4 SENTENCE:  But Iranian international Azmoun levelled for Rostov just before the break after turning Jerome Boateng to coolly slot past Ulreich.
5 DEFINI

0

### No improvement
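Another way to use the POS tag, different from appending it to the context as a token, is to restrict the candidate senses to the tagged part of speech before scoring. The sketch below shows the idea on a toy inventory. `PENN_TO_WN`, `penn_to_wordnet`, `candidate_senses` and `toy_senses` are hypothetical names. The definitions are copied from the outputs in this notebook, and the POS letters attached to them are an assumption rather than something read from WordNet here. With NLTK itself, `wn.synsets("break", pos=wn.VERB)` performs the same kind of restriction.

```python
# Map the first letter of a Penn Treebank tag to a WordNet part-of-speech letter.
PENN_TO_WN = {"N": "n", "V": "v", "J": "a", "R": "r"}

def penn_to_wordnet(tag):
    """Return the WordNet POS letter for a Penn tag, or None if there is no mapping."""
    return PENN_TO_WN.get(tag[:1])

# Toy inventory: (index, assumed POS, definition copied from the notebook output).
toy_senses = [
    (0, "n", "some abrupt occurrence that interrupts an ongoing activity"),
    (4, "n", "a pause from doing something (as work)"),
    (16, "v", "terminate"),
    (20, "v", "destroy the integrity of; usually by force; "
              "cause to separate into pieces or fragments"),
]

def candidate_senses(tag, senses):
    """Keep only the sense indices whose POS matches the tag; keep all if the tag is unmapped."""
    pos = penn_to_wordnet(tag)
    return [idx for idx, sense_pos, _ in senses if pos is None or sense_pos == pos]

print(candidate_senses("NN", toy_senses))  # noun senses only
print(candidate_senses("VB", toy_senses))  # verb senses only
```

Filtering like this would also stop the zero-overlap fallback from returning a noun sense for a verb occurrence, provided the fallback picks the first remaining candidate.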

### 8. Try a combo: use all the words in the sentence instead of a context window + stem everything + keep the POS tag on the word break + remove stop words

In [45]:
def get_tags_for_break(sent):
    sent_tagged = [] 
    sent = word_tokenize(sent)
    for i in nltk.pos_tag(sent):
        word, tag = i
        if word in punct:
            continue
        if word.lower() == "break":
            sent_tagged.append(word.lower() + '_' + tag)
        else:
            sent_tagged.append(stemmer.stem(word.lower()))
    return sent_tagged

In [46]:
def lesk(word: str, context: list):
    bestsense = 0
    maxoverlap = 0
    
    context = set([i for i in context if i not in stops])
    
    for i, synset in enumerate(wn.synsets(word)):  
        examples = synset.examples()
        examples_list = []
        for j in examples:
            j = get_tags_for_break(j)
            for word in j:
                examples_list.append(word)
                
        definition = get_tags_for_break(synset.definition())
        
        ex_plus_def = examples_list + definition
        ex_plus_def = set([w for w in ex_plus_def if w not in stops])
        overlap = len(ex_plus_def & context)          
        if overlap > maxoverlap:
            maxoverlap = overlap
            bestsense = i 
            
    return bestsense

In [47]:
def get_meanings(corpus_list, corpus_str):
    for i, sent in enumerate(corpus_list):
        meaning_number = lesk("break", sent)
        print(i+1, "DEFINITION:" + str(meaning_number) + '\t'+ wn.synsets("break")[meaning_number].definition())
        print(i+1, "SENTENCE: ", corpus_str[i])
    return 0

In [48]:
corpus_tagged = [get_tags_for_break(sent) for sent in corpus_10]
corpus_tagged

[['dog', "'s", 'pitch', 'invas', 'forc', 'earli', 'tea', 'break_NN'],
 ['let',
  "'s",
  'talk',
  'about',
  'what',
  'chang',
  'trump',
  'might',
  'make',
  'regard',
  'tax',
  'cut',
  'the',
  'iran',
  'nuclear',
  'agreement',
  'trade',
  'and',
  'what',
  'the',
  'consequ',
  'might',
  'be',
  'if',
  'trump',
  'start',
  'a',
  'trade',
  'war',
  'with',
  'china',
  'after',
  'we',
  'take',
  'a',
  'short',
  'break_NN'],
 ['the',
  'insurg',
  'had',
  'seiz',
  'a',
  'coupl',
  'of',
  'strateg',
  'area',
  'in',
  'western',
  'aleppo',
  'after',
  'launch',
  'an',
  'offens',
  'on',
  'oct.',
  '28',
  'in',
  'an',
  'attempt',
  'to',
  'break_VB',
  'the',
  'sieg',
  'impos',
  'in',
  'juli',
  'on',
  'rebel-held',
  'eastern',
  'aleppo',
  'which',
  'has',
  'also',
  'been',
  'target',
  'by',
  'wave',
  'of',
  'syrian',
  'and',
  'russian',
  'airstrik'],
 ['but',
  'iranian',
  'intern',
  'azmoun',
  'level',
  'for',
  'rostov',
  'just

In [49]:
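# Note: this passes corpus_10_tagged (tagged but unstemmed, built in an earlier cell),
# not the stemmed corpus_tagged built above.
get_meanings(corpus_10_tagged, corpus_10)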

1 DEFINITION:62	cease an action temporarily
1 SENTENCE:  Dog's pitch invasion forces early tea break
2 DEFINITION:62	cease an action temporarily
2 SENTENCE:  Let's talk about what changes Trump might make regarding tax cuts, the Iran nuclear agreement, trade and what the consequences might be if Trump starts a trade war with China after we take a short break.
3 DEFINITION:16	terminate
3 SENTENCE:  The insurgents had seized a couple of strategic areas in western Aleppo after launching an offensive on Oct. 28 in an attempt to break the siege imposed in July on rebel-held eastern Aleppo, which has also been targeted by waves of Syrian and Russian airstrikes.
4 DEFINITION:0	some abrupt occurrence that interrupts an ongoing activity
4 SENTENCE:  But Iranian international Azmoun levelled for Rostov just before the break after turning Jerome Boateng to coolly slot past Ulreich.
5 DEFINITION:0	some abrupt occurrence that interrupts an ongoing activity
5 SENTENCE:  They represent a grand break 

0

The first 3 sentences were identified more or less correctly.
#### 3/10

P.S. As an experiment, let's test NLTK's lesk. The result: not a single matching sense.

In [50]:
from nltk.wsd import lesk
corpus_10_tokenized = [word_tokenize(sent) for sent in corpus_10]

for i, sent in enumerate(corpus_10_tokenized):
    definition = lesk(sent, 'break').definition()
    print("SENT:" + corpus_10[i], "DEF: " + definition, sep='\n')
    print('\n')

SENT:Dog's pitch invasion forces early tea break
DEF: become fractured; break or crack on the surface only


SENT:Let's talk about what changes Trump might make regarding tax cuts, the Iran nuclear agreement, trade and what the consequences might be if Trump starts a trade war with China after we take a short break.
DEF: make known to the public information that was previously known only to a few people or that was meant to be kept a secret


SENT:The insurgents had seized a couple of strategic areas in western Aleppo after launching an offensive on Oct. 28 in an attempt to break the siege imposed in July on rebel-held eastern Aleppo, which has also been targeted by waves of Syrian and Russian airstrikes.
DEF: (geology) a crack in the earth's crust resulting from the displacement of one side with respect to the other


SENT:But Iranian international Azmoun levelled for Rostov just before the break after turning Jerome Boateng to coolly slot past Ulreich.
DEF: make known to the public i