# Task description

Word Sense Induction (WSI) is the task of automatically identifying the senses
of a word from the contexts in which it occurs.

The goal of this task is to use methods of distributional semantics and word embeddings to solve word sense induction. 
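As a rough illustration of the idea (not the pipeline used later in this notebook), word sense induction can be viewed as clustering the context vectors of one ambiguous word. The sketch below uses made-up 2-D vectors in place of real embeddings; the variable names and values are purely illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Made-up 2-D "context embeddings" for four occurrences of one ambiguous word.
# The first two contexts point one way, the last two another way.
context_vectors = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.8],
])

# Each cluster label plays the role of an induced sense id.
sense_ids = AgglomerativeClustering(n_clusters=2).fit_predict(context_vectors)
print(sense_ids)
```

The label values themselves are arbitrary; only the grouping of contexts matters.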

Additional information is available on the competition web site, https://russe.nlpub.org/2018/wsi/, and on CodaLab, https://competitions.codalab.org/competitions/27331#learn_the_details .

This notebook presents the solution obtained on the wiki-wiki dataset.

# Installation

Install the requirements and clone the GitHub repository into the Colab environment.

In [None]:
!pip install pymorphy2
!pip install tensorflow-hub==0.7.0
!pip install tensorflow==1.15.2
!pip install deeppavlov
!git clone https://github.com/nlpub/russe-wsi-kit.git

fatal: destination path 'russe-wsi-kit' already exists and is not an empty directory.


Restart the runtime after installing the libraries.

# Import libraries

In [None]:
from deeppavlov import build_model, configs
from deeppavlov.core.common.file import read_json

# Load pretrained model

**Pretrained models tried:**

*   An ELMo model pretrained on Wikipedia gives the highest score on the wiki-wiki dataset, with an ARI score of 0.65;
*   An ELMo model pretrained on news gives a very low score;
*   Bert_sentence_embedder gives an ARI score of 0.6169;
*   The baseline AdaGram model gives an ARI score of 0.6278.

The final choice is the ELMo model pretrained on Wikipedia, taken from the DeepPavlov library.

https://github.com/deepmipt/DeepPavlov
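The models above are compared by the Adjusted Rand Index (ARI), which measures how well a predicted clustering agrees with the gold sense labels regardless of how the clusters are named. A minimal sketch with hypothetical labels, using scikit-learn's `adjusted_rand_score`:

```python
from sklearn.metrics import adjusted_rand_score

gold = [1, 1, 2, 2]              # hypothetical gold sense ids
same_partition = [7, 7, 3, 3]    # same grouping, different label names
mixed = [0, 1, 0, 1]             # every predicted cluster mixes both senses

# Identical partitions score 1.0 even though the label names differ.
perfect = adjusted_rand_score(gold, same_partition)
# A clustering that cuts across the gold senses scores much lower.
poor = adjusted_rand_score(gold, mixed)
print(perfect, poor)
```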

In [None]:
# build pretrained ELMO model
faq = build_model(configs.embedder.elmo_ru_wiki, download = True)

2020-11-21 07:37:53.255 INFO in 'deeppavlov.core.data.utils'['utils'] at line 94: Downloading from http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-wiki_600k_steps.tar.gz to /root/.deeppavlov/downloads/embeddings/elmo_ru-wiki_600k_steps.tar.gz
100%|██████████| 694M/694M [02:00<00:00, 5.78MB/s]
2020-11-21 07:39:53.630 INFO in 'deeppavlov.core.data.utils'['utils'] at line 268: Extracting /root/.deeppavlov/downloads/embeddings/elmo_ru-wiki_600k_steps.tar.gz archive into /root/.deeppavlov/downloads/embeddings/elmo_ru_wiki
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package perluniprops to /root/nltk_data...
[nltk_data]   Unzipping misc/perluniprops.zip.
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping corpora/nonbreaking_prefixes




INFO:tensorflow:Saver not created because there are no variables in the graph to restore



# Data loading

Evaluate the model on the wiki-wiki dataset.

In [None]:
data_path = 'russe-wsi-kit/data/main/wiki-wiki/test.csv'

In [None]:
import pandas as pd
df = pd.read_csv(data_path, sep='\t')

In [None]:
df

Unnamed: 0,context_id,word,gold_sense_id,predict_sense_id,positions,context
0,2074,давление,,,0-9,Давление пара создается движением поршня в цил...
1,2075,давление,,,13-22,"«У тебя что, давление поднялось?» Я сказал, чт..."
2,2076,давление,,,56-65,Я жалуюсь Никоновичу наконец на головокружение...
3,2077,давление,,,0-9,Давление в котле не менялось
4,2078,давление,,,25-34,Он каждые два часа мерил давление и сахар в крови
...,...,...,...,...,...,...
3724,5798,зуд,,,43-47,Многих американцев одолевает романтический зуд...
3725,5799,зуд,,,23-27,Если на нее не находил зуд рассказывания истор...
3726,5800,зуд,,,27-33,"С раздражающей завистью, с зудом неудовлетворе..."
3727,5801,зуд,,,12-16,Нестерпимый зуд любопытства


# **Preprocessing**

**Preprocessing steps applied to each context:**

*   split the text into words;
*   lemmatize;
*   lowercase;
*   remove stopwords;
*   remove punctuation;
*   remove foreign words;
*   remove numbers;
*   remove prepositions, pronouns and other function words.

Using only a subset of these steps did not improve the result.
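The steps above can be sketched without any external dependencies. This is a simplified illustration, not the notebook's actual cell: lemmatization and the part-of-speech filter are omitted because they need pymorphy2, and `TOY_STOP_WORDS` and `simple_preprocess` are hypothetical names introduced here.

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's Russian list instead.
TOY_STOP_WORDS = {"он", "и", "в"}

def simple_preprocess(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, keep only Cyrillic word tokens, drop stop words."""
    # The regex keeps runs of Cyrillic letters, which also removes
    # punctuation, digits and Latin-script (foreign) words.
    tokens = re.findall(r"[а-яё]+", text.lower())
    return [t for t in tokens if t not in stop_words]

tokens = simple_preprocess("Он каждые 2 часа мерил давление и сахар в крови")
print(tokens)
```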



In [None]:
import pymorphy2
import re
import nltk
from nltk.corpus import stopwords
from tqdm import tqdm
import string
import itertools

In [None]:
def pos(word, morph=pymorphy2.MorphAnalyzer()):
    """Return a likely part of speech for the *word*."""
    return morph.parse(word)[0].tag.POS

In [None]:
nltk.download('stopwords')
morph = pymorphy2.MorphAnalyzer()
stop_words = set(stopwords.words('russian'))
functors_pos = {'INTJ', 'NPRO', 'PRCL', 'CONJ', 'PREP'}  # function words

words = []  # filtered tokens of each context
embs = []   # ELMo token embeddings of each context

for j in tqdm(range(df.shape[0])):
  # Keep Cyrillic tokens only (note: [А-я] does not match ё/Ё), then lemmatize and lowercase.
  lemmas = [morph.parse(word)[0].normal_form.lower() for word in re.findall(r'[А-я]+', df['context'][j])]
  token = [i for i in lemmas if i not in stop_words]
  new_token = [word for word in token if pos(word) not in functors_pos]
  words.append(new_token)
  # Note: the embeddings are computed from `token`, i.e. before the function-word filter.
  embs.append(faq([' '.join(token)])[0])


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
100%|██████████| 3729/3729 [20:24<00:00,  3.05it/s]


# Build text embeddings

Build sentence-level embeddings from the pretrained contextual word embeddings.

**Approaches tried:**

*   Find the occurrences of the target word in the sentence and average their vectors.
*   Average the vectors of all words in the sentence.
*   Average the vectors of the target word and of its neighbouring words.
*   Average the vectors of the k words nearest to the target word, with k = 5 and k = 3.

The results are as follows:


\begin{array}{lc}
\text{method} & \text{ARI score}\\
avg\_target\_words & 0.652\\
avg\_all\_words & 0.652\\
avg\_neighbours & 0.616\\
3-nearest\_neighbours & 0.59\\
5-nearest\_neighbours & 0.616
\end{array}




The final choice is to average the vectors of all words in the sentence.
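To make the difference between the first two approaches concrete, here is a minimal sketch with made-up token embeddings. The array values and `target_positions` are hypothetical; in the notebook the per-token vectors come from the ELMo model.

```python
import numpy as np

# Made-up contextual embeddings: 4 tokens, 3 dimensions each.
token_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [1.0, 2.0, 3.0],
])
target_positions = [1, 3]  # hypothetical positions of the target word

# Approach "avg_all_words": mean over every token in the context.
avg_all_words = token_embs.mean(axis=0)
# Approach "avg_target_words": mean over the target-word tokens only.
avg_target_words = token_embs[target_positions].mean(axis=0)
print(avg_all_words, avg_target_words)
```

Both variants produce one fixed-size vector per context, which is what the clustering step needs.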


In [None]:
from sklearn.neighbors import NearestNeighbors

def add_indexes_of_neighbours(indexes_of_word, num_words_context):
  """Extend the target-word indexes with two neighbouring positions each.

  Neighbours get the same weight as the target words when averaging.
  Contexts shorter than three words can yield negative or out-of-range indexes.
  """
  new_indexes = []
  for i in indexes_of_word:
    if i == 0 and num_words_context > 2:    # target word is first
      new_indexes.append([1, 2])            # take both neighbours from the right
    elif i == 1 and num_words_context > 2:
      new_indexes.append([0, 2])
    elif i == num_words_context - 1:        # target word is last
      new_indexes.append([i - 1, i - 2])    # take both neighbours from the left
    else:
      new_indexes.append([i - 1, i + 1])    # one neighbour on each side

  final_list = indexes_of_word + list(itertools.chain(*new_indexes))
  return list(set(final_list))

def k_nearest_neighbours(word, embs, k=5):
  """Return the indexes of the k token vectors in `embs` nearest to the embedding of `word`."""
  neigh = NearestNeighbors(n_neighbors=k)
  neigh.fit(embs)
  # Embed the lemmatized target word on its own and query its nearest tokens.
  target_emb = faq([morph.parse(word)[0].normal_form.lower()])[0]
  closest_words_idx = neigh.kneighbors(target_emb)[1]
  return list(itertools.chain(*closest_words_idx.tolist()))
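For readers unfamiliar with `NearestNeighbors`, the following toy example shows the query pattern used for the k-nearest approach. The 2-D vectors are made up and stand in for ELMo token embeddings.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up token vectors for one context.
token_embs = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [5.0, 5.0],
    [0.1, 0.0],
])
target_emb = np.array([[0.0, 0.0]])  # query must be 2-D: (n_queries, dim)

neigh = NearestNeighbors(n_neighbors=2).fit(token_embs)
# kneighbors returns (distances, indexes), each of shape (n_queries, k).
distances, idx = neigh.kneighbors(target_emb)
nearest = sorted(idx[0].tolist())
print(nearest)
```

Note that `n_neighbors` must not exceed the number of fitted vectors, so short contexts need a smaller k.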


In [None]:
import numpy as np

context_embs = []
for z in tqdm(range(df.shape[0])):
  # Alternative (disabled): average only the k nearest neighbours of the target word
  # if len(embs[z])>15:
  #   indexes_of_word = k_nearest_neighbours(df.word[z], embs[z], k=10)
  #   context_embs.append(list(embs[z][indexes_of_word,:].mean(axis=0))) #average only target words in context
  # elif len(embs[z])>6:
  #   indexes_of_word = k_nearest_neighbours(df.word[z], embs[z], k=5)
  #   context_embs.append(list(embs[z][indexes_of_word,:].mean(axis=0))) #average only target words in context
  # else:
  #   indexes_of_word = k_nearest_neighbours(df.word[z], embs[z], k=1)
  #   context_embs.append(list(embs[z][indexes_of_word,:].mean(axis=0)))
  context_embs.append(list(embs[z].mean(axis=0)))    # average words in full sentence

100%|██████████| 3729/3729 [09:02<00:00,  6.87it/s]


In [None]:
# Rows for each target word are contiguous, so the first occurrence of each
# unique word marks the start of its block; the final entry closes the last block.
unique_words = df.word.unique()
word_list = list(df.word)
indexes = [word_list.index(w) for w in unique_words] + [df.shape[0]]
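An equivalent way to find the block boundaries, sketched on a toy frame, is to mark every row whose word differs from the previous row. Like the cell above, it assumes that all rows for a given word are contiguous; the `toy` frame is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"word": ["a", "a", "b", "b", "b", "c"]})

# A row starts a new block when its word differs from the previous row's word.
is_block_start = toy["word"] != toy["word"].shift()
boundaries = toy.index[is_block_start.to_numpy()].tolist() + [len(toy)]
print(boundaries)
```

Consecutive pairs in `boundaries` then give the slice of rows that belongs to each word.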

# Number of clusters

The number of clusters is chosen between 2 and 5 using silhouette scores.

In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def define_num_clusters(X_array):
  """Return the number of clusters (2 to 5) with the highest silhouette score."""
  candidate_ks = list(range(2, 6))
  silhouette_scores = [
      silhouette_score(X_array, AgglomerativeClustering(n_clusters=k).fit_predict(X_array))
      for k in candidate_ks
  ]
  return candidate_ks[np.argmax(silhouette_scores)]
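A quick sanity check of the silhouette-based selection can be run on synthetic data. The sketch below is self-contained and uses three made-up, well-separated groups of points, for which the selection is expected to return 3; the centers, spread and seed are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
# 20 points scattered tightly around each of three well-separated centers.
X = np.vstack([c + 0.5 * rng.standard_normal((20, 2)) for c in centers])

candidate_ks = list(range(2, 6))
scores = [
    silhouette_score(X, AgglomerativeClustering(n_clusters=k).fit_predict(X))
    for k in candidate_ks
]
best_k = candidate_ks[int(np.argmax(scores))]
print(best_k)
```

Real context embeddings are far less cleanly separated, so the selected number of clusters should be treated as a heuristic.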

# Clusterization approaches

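**Clustering approaches tested:**

1.  Agglomerative clustering gives the highest score;
2.  Affinity propagation gives a very low score;
3.  KMeans gives the same score as agglomerative clustering.

The final choice is agglomerative clustering with the number of clusters selected by silhouette score, cosine affinity and average linkage. Note that the cell below passes only `n_clusters` to `AgglomerativeClustering`, which leaves scikit-learn's defaults (Ward linkage on Euclidean distances) in place; cosine affinity and average linkage have to be set explicitly.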
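For reference, this is roughly how cosine distance and average linkage can be requested explicitly in scikit-learn. It is a sketch on made-up vectors, not the configuration that produced the scores above. The keyword for the distance depends on the installed version (`metric` in recent releases, `affinity` in older ones), which the `try`/`except` handles.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two made-up directions: rows 0-1 lie near the x axis, rows 2-3 near the y axis.
X = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0]])

try:
    # Recent scikit-learn releases name the distance parameter `metric`.
    model = AgglomerativeClustering(n_clusters=2, metric="cosine", linkage="average")
except TypeError:
    # Older releases use `affinity` for the same setting.
    model = AgglomerativeClustering(n_clusters=2, affinity="cosine", linkage="average")

labels = model.fit_predict(X)
print(labels)
```

Cosine distance ignores vector length, so rows 0 and 1 are grouped together despite their different norms.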



In [None]:
from sklearn.cluster import AgglomerativeClustering

X_all = np.array(context_embs)  # convert once instead of on every iteration
clusters = []
for i in tqdm(range(unique_words.shape[0])):
  X_word = X_all[indexes[i]:indexes[i+1]]  # contexts of the i-th target word
  if X_word.shape[0] < 7:
    num_clusters = 2  # fall back to 2 clusters for words with few contexts
  else:
    num_clusters = define_num_clusters(X_word)
  print(num_clusters)
  clusters.append(list(AgglomerativeClustering(n_clusters=num_clusters).fit(X_word).labels_))

100%|██████████| 168/168 [06:39<00:00,  2.38s/it]

Number of clusters printed for each of the 168 target words, in order:

2 3 3 2 4 2 2 5 5 5
2 2 3 2 3 2 2 2 2 2
2 3 5 2 2 3 3 2 4 2
4 4 2 3 2 2 5 5 2 2
2 2 3 3 2 4 2 3 2 2
2 2 2 3 2 2 2 4 3 2
2 2 2 2 2 2 2 3 2 3
2 3 3 2 3 5 2 2 2 2
3 3 4 2 2 2 2 2 2 2
2 2 4 4 3 3 5 3 3 4
2 2 4 3 3 4 3 2 3 2
3 2 5 2 2 3 2 2 3 2
3 2 2 2 3 3 2 2 2 2
2 3 2 2 2 3 2 2 2 2
4 2 2 2 2 4 2 2 2 4
2 2 2 2 3 3 5 2 3 2
2 2 3 3 2 2 4 2

In [None]:
# from sklearn.cluster import AffinityPropagation
# clusters = []
# for i in range(unique_words.shape[0]):
#   clusters.append(list(AffinityPropagation().fit(np.array(context_embs)[indexes[i]:(indexes[i+1])]).labels_))

# from sklearn.cluster import KMeans
# clusters = []
# for i in range(unique_words.shape[0]):
#   clusters.append(list(KMeans(n_clusters=2).fit(np.array(context_embs)[indexes[i]:(indexes[i+1])]).labels_))

# Final results

In [None]:
predict_list = list(itertools.chain(*clusters))
df['predict_sense_id'] = predict_list

# Save

In [None]:
save_path = 'russe-wsi-kit/data/main/wiki-wiki/elmo.csv'
df.to_csv(save_path, sep='\t', index=None)

In [None]:
cd russe-wsi-kit/data/main/wiki-wiki/

/content/russe-wsi-kit/data/main/active-dict


In [None]:
!zip wiki-wiki.zip elmo.csv

  adding: f_elmo.csv (deflated 69%)


In [None]:
cd /content/

/content


# Test score

In [None]:
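from sklearn.metrics import adjusted_rand_score

# Per-word ARI; this requires gold_sense_id to be filled in
# (in the test.csv preview shown above that column is empty).
gold = list(df['gold_sense_id'])
score = []
for j in range(unique_words.shape[0]):
  score.append(adjusted_rand_score(gold[indexes[j]:indexes[j+1]], clusters[j]))
print(score)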
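The per-word scores can be combined into a single number. The sketch below weights each word's ARI by its number of contexts; this weighting is an assumption about how the competition aggregates scores and should be checked against the evaluation script shipped with russe-wsi-kit. The `toy` frame and its values are made up.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

toy = pd.DataFrame({
    "word": ["a"] * 4 + ["b"] * 4,
    "gold_sense_id": [1, 1, 2, 2, 1, 1, 2, 2],
    "predict_sense_id": [0, 0, 1, 1, 0, 1, 0, 1],
})

# ARI and number of contexts for each target word.
per_word = []
for word, group in toy.groupby("word", sort=False):
    ari = adjusted_rand_score(group["gold_sense_id"], group["predict_sense_id"])
    per_word.append((word, ari, len(group)))

# Assumed aggregation: average of per-word ARI weighted by context count.
total = sum(count for _, _, count in per_word)
weighted_ari = sum(ari * count for _, ari, count in per_word) / total
print(per_word, weighted_ari)
```

Here word "a" is clustered perfectly and word "b" is clustered against the gold senses, so the combined score falls well below 1.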