# Task description

Word Sense Induction (WSI) is the task of automatically identifying the senses of a word.

The goal of this task is to solve word sense induction using methods of distributional semantics and word embeddings.

Additional information is available on the competition websites: https://russe.nlpub.org/2018/wsi/ and https://competitions.codalab.org/competitions/27331#learn_the_details .

This notebook presents the solution obtained on the bts-rnc dataset.

# Installation

Install the requirements and clone the GitHub repository with the dataset.

In [None]:
!pip install pymorphy2
!pip install tensorflow-hub==0.7.0
!pip install tensorflow==1.15.2
!pip install deeppavlov
!git clone https://github.com/nlpub/russe-wsi-kit.git

Collecting pymorphy2
  Downloading https://files.pythonhosted.org/packages/07/57/b2ff2fae3376d4f3c697b9886b64a54b476e1a332c67eee9f88e7f1ae8c9/pymorphy2-0.9.1-py3-none-any.whl (55kB)
Collecting pymorphy2-dicts-ru<3.0,>=2.4
  Downloading https://files.pythonhosted.org/packages/3a/79/bea0021eeb7eeefde22ef9e96badf174068a2dd20264b9a378f2be1cdd9e/pymorphy2_dicts_ru-2.4.417127.4579844-py2.py3-none-any.whl (8.2MB)
Collecting dawg-python>=0.7.1
  Downloading https://files.pythonhosted.org/packages/6a/84/ff1ce2071d4c650ec85745766c0047ccc3b5036f1d03559fd46bb38b5eeb/DAWG_Python-0.7.2-py2.py3-none-any.whl
Installing collected packages: pymorphy2-dicts-ru, dawg-python, pymorphy2
Successfully installed dawg-python-0.7.2 pymorphy2-0.9.1 pymorphy2-dicts-ru-2.4.417127.4579844
Collecting tensorflow-hub==0.7.0
  Downloading https://files.pythonhosted.org/packages

Restart the runtime after installing the libraries.

# Import libraries

In [None]:
from deeppavlov import build_model
from deeppavlov import configs

# Load pretrained model

**Try pretrained models:**

Use the ELMo model pretrained on the Russian Wikipedia dataset, from the DeepPavlov library.

https://github.com/deepmipt/DeepPavlov

In [None]:
# build the pretrained ELMo model (the weights are downloaded if missing)
faq = build_model(configs.embedder.elmo_ru_wiki, download=True)

2020-11-20 20:13:19.930 INFO in 'deeppavlov.core.data.utils'['utils'] at line 94: Downloading from http://files.deeppavlov.ai/deeppavlov_data/elmo_ru-wiki_600k_steps.tar.gz to /root/.deeppavlov/downloads/embeddings/elmo_ru-wiki_600k_steps.tar.gz
100%|██████████| 694M/694M [03:18<00:00, 3.49MB/s]
2020-11-20 20:16:39.237 INFO in 'deeppavlov.core.data.utils'['utils'] at line 268: Extracting /root/.deeppavlov/downloads/embeddings/elmo_ru-wiki_600k_steps.tar.gz archive into /root/.deeppavlov/downloads/embeddings/elmo_ru_wiki
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package perluniprops to /root/nltk_data...
[nltk_data]   Unzipping misc/perluniprops.zip.
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping corpora/nonbreaking_prefixes




INFO:tensorflow:Saver not created because there are no variables in the graph to restore



# Data loading

Test our model on the bts-rnc dataset.

In [None]:
data_path = 'russe-wsi-kit/data/main/bts-rnc/test.csv'

In [None]:
import pandas as pd
df = pd.read_csv(data_path, sep='\t')

In [None]:
df

Unnamed: 0,context_id,word,gold_sense_id,predict_sense_id,positions,context
0,2074,давление,,,0-9,Давление пара создается движением поршня в цил...
1,2075,давление,,,13-22,"«У тебя что, давление поднялось?» Я сказал, чт..."
2,2076,давление,,,56-65,Я жалуюсь Никоновичу наконец на головокружение...
3,2077,давление,,,0-9,Давление в котле не менялось
4,2078,давление,,,25-34,Он каждые два часа мерил давление и сахар в крови
...,...,...,...,...,...,...
3724,5798,зуд,,,43-47,Многих американцев одолевает романтический зуд...
3725,5799,зуд,,,23-27,Если на нее не находил зуд рассказывания истор...
3726,5800,зуд,,,27-33,"С раздражающей завистью, с зудом неудовлетворе..."
3727,5801,зуд,,,12-16,Нестерпимый зуд любопытства
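The `positions` column marks where the target word occurs in the context. The sketch below shows one way to read it. It assumes that `positions` holds one or more comma-separated `start-end` character ranges and that the slice may include a trailing separator; `target_spans` is a hypothetical helper that is not part of the notebook or of russe-wsi-kit.

```python
def target_spans(context, positions):
    """Return the target-word occurrences marked by a `positions` string.

    Assumption: `positions` is a comma-separated list of "start-end"
    character ranges; surrounding spaces and punctuation are stripped.
    """
    spans = []
    for chunk in positions.split(","):
        start, end = (int(x) for x in chunk.split("-"))
        spans.append(context[start:end].strip(" ,.!?«»"))
    return spans


# Rows taken from the table above
print(target_spans("Давление в котле не менялось", "0-9"))
print(target_spans("Нестерпимый зуд любопытства", "12-16"))
```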


# Preprocessing

**Preprocess the context of each word:**

*   split the text into words;
*   lemmatize;
*   lowercase;
*   remove stopwords;
*   remove punctuation;
*   remove foreign words;
*   remove numbers;
*   remove prepositions, pronouns and other function words.

Using only a subset of these preprocessing steps did not improve the result.
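The steps above can be sketched without any external dependency. This is a simplified illustration, not the notebook's pipeline: lemmatization and part-of-speech filtering are left out because they need a morphological analyzer, and `TOY_STOPWORDS` is a hypothetical, tiny stand-in for NLTK's Russian stopword list.

```python
import re

# Hypothetical mini stopword list, for illustration only.
TOY_STOPWORDS = {"и", "в", "на", "не", "он", "я", "что", "с"}


def preprocess(context, stopwords=TOY_STOPWORDS):
    """Lowercase, keep only Cyrillic word tokens, and drop stopwords.

    Keeping only runs of Cyrillic letters removes punctuation, numbers
    and foreign (Latin-script) words in a single pass.
    """
    tokens = re.findall(r"[а-яё]+", context.lower())
    return [t for t in tokens if t not in stopwords]


print(preprocess("Он каждые 2 часа мерил давление и сахар (sugar) в крови."))
# ['каждые', 'часа', 'мерил', 'давление', 'сахар', 'крови']
```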



In [None]:
import pymorphy2
import re
import nltk
from nltk.corpus import stopwords
from tqdm import tqdm
import string
import itertools

In [None]:
def pos(word, morph=pymorphy2.MorphAnalyzer()):
    """Return a likely part of speech for the *word*."""
    # the default analyzer is created once, when the function is defined
    return morph.parse(word)[0].tag.POS

In [None]:
nltk.download('stopwords')
morph = pymorphy2.MorphAnalyzer()
stop_words = set(stopwords.words('russian'))
functors_pos = {'INTJ', 'NPRO', 'PRCL', 'CONJ', 'PREP'}  # function words

words = []  # filtered tokens of each context
embs = []   # ELMo embeddings of each context

for j in tqdm(range(df.shape[0])):
  # Lemmatize and lowercase every Cyrillic word; the regex also drops punctuation,
  # numbers and foreign words (note: [А-я] does not match the letter ё).
  lemmas = [morph.parse(word)[0].normal_form.lower() for word in re.findall(r'[А-я]+', df['context'][j])]
  token = [i for i in lemmas if i not in stop_words]
  new_token = [word for word in token if pos(word) not in functors_pos]
  words.append(new_token)
  # The embeddings are computed on `token`, i.e. before the function-word filter.
  embs.append(faq([' '.join(token)])[0])


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
 22%|██▏       | 811/3729 [04:09<14:58,  3.25it/s]


KeyboardInterrupt: ignored

# Build text embeddings

Build an embedding for each context (sentence) from the pretrained contextual word embeddings.

**Try different approaches:**

*   Average the vectors of all words in the sentence.
*   Average the k word vectors nearest to the target word vector.

The final choice is the k-nearest-neighbours approach, where k depends on the length of the context: 10 neighbours if the context has more than 15 tokens, 5 neighbours if it has more than 6, and only 1 neighbour otherwise.
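The length-dependent rule can be illustrated with a small dependency-free sketch. It uses Euclidean distance (the default metric of scikit-learn's `NearestNeighbors`) and toy 2-D vectors in place of ELMo embeddings; `choose_k` and `context_vector` are hypothetical names introduced only for this example.

```python
import math


def choose_k(n_tokens):
    """Number of neighbours for a context with `n_tokens` token vectors."""
    if n_tokens > 15:
        return 10
    if n_tokens > 6:
        return 5
    return 1


def context_vector(target, token_vectors):
    """Average the k token vectors closest (Euclidean) to `target`."""
    k = choose_k(len(token_vectors))
    nearest = sorted(token_vectors, key=lambda v: math.dist(v, target))[:k]
    return [sum(v[d] for v in nearest) / len(nearest) for d in range(len(target))]


# Seven toy token vectors, so k = 5: the two distant vectors are ignored.
vectors = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 2], [10, 10], [9, 9]]
print(context_vector([0, 0], vectors))  # [0.8, 0.8]
```

With short contexts (six token vectors or fewer) the context vector is simply the single token vector closest to the target word embedding.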


In [None]:
from sklearn.neighbors import NearestNeighbors

def k_nearest_neighbours(word, embs, k=5):
  """Return the indexes of the k token vectors in `embs` closest to the embedding of `word`."""
  neigh = NearestNeighbors(n_neighbors=k)
  neigh.fit(embs)
  # embed the lemmatized target word on its own and query its nearest context tokens
  target_emb = faq([morph.parse(word)[0].normal_form.lower()])[0]
  closest_words_idx = neigh.kneighbors(target_emb)[1]
  return list(itertools.chain(*closest_words_idx.tolist()))


In [None]:
import numpy as np

context_embs = []
for z in tqdm(range(df.shape[0])):
  # choose k from the number of token vectors in the context
  if len(embs[z]) > 15:
    k = 10
  elif len(embs[z]) > 6:
    k = 5
  else:
    k = 1
  indexes_of_word = k_nearest_neighbours(df.word[z], embs[z], k=k)
  # average only the token vectors nearest to the target word
  context_embs.append(list(embs[z][indexes_of_word, :].mean(axis=0)))
  # context_embs.append(list(embs[z].mean(axis=0)))    # alternative: average all words in the sentence

100%|██████████| 6556/6556 [16:05<00:00,  6.79it/s]


In [None]:
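# unique target words, in order of first appearance
unique_words = df.word.unique()
# index of the first row of each target word, plus the total row count as the end boundary
# (assumes that the rows of each target word are contiguous)
indexes = [list(df.word).index(unique_words[i]) for i in range(unique_words.shape[0])] + [df.shape[0]]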
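To make the role of `indexes` concrete, here is a minimal sketch of how such boundaries slice the rows into one group per target word. `group_boundaries` is a hypothetical helper; like the cell above, it assumes that all rows of a target word are contiguous.

```python
def group_boundaries(words):
    """Start index of each run of identical words, plus the total length."""
    starts = [i for i, w in enumerate(words) if i == 0 or w != words[i - 1]]
    return starts + [len(words)]


words = ["давление", "давление", "давление", "зуд", "зуд"]
bounds = group_boundaries(words)
print(bounds)  # [0, 3, 5]

# Consecutive boundaries delimit the rows of one target word.
groups = [words[a:b] for a, b in zip(bounds, bounds[1:])]
print(groups)
```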

# Number of clusters

For each target word, choose the number of clusters from 2 to 5 that gives the highest silhouette score.

In [None]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def define_num_clusters(X_array):
  """Return the number of clusters in 2..5 with the highest silhouette score."""
  k = [2, 3, 4, 5]
  algs = [AgglomerativeClustering(n_clusters=i) for i in k]
  # silhouette score of the clustering produced by each candidate model
  silhouette_scores = [silhouette_score(X_array, alg.fit_predict(X_array)) for alg in algs]
  return k[np.argmax(silhouette_scores)]
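As a worked illustration of the selection criterion, the sketch below computes the mean silhouette coefficient by hand for two candidate labelings of four 1-D points and keeps the better one. The labelings are written out manually and stand in for the output of a clustering algorithm; the `silhouette` function is a simplified re-implementation for illustration (Euclidean distance, at least two clusters required), not scikit-learn's code.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other points of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


points = [(0,), (1,), (10,), (11,)]
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
best_k = max(candidates, key=lambda k: silhouette(points, candidates[k]))
print(best_k)  # 2
```

Splitting the tight pair (10, 11) into two singleton clusters lowers the score, so the two-cluster labeling wins.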

# Clustering approaches

**Test different clustering approaches**

1. Agglomerative clustering with the number of clusters chosen by silhouette score, cosine affinity and average linkage (the best parameters for the wiki dataset). This approach gives poor results.

2. Agglomerative clustering with the number of clusters chosen by silhouette score and default parameters. This gives a substantial score improvement.

The results are shown in the following tables.

Agglomerative clustering with the parameters from approach 1:

\begin{array}{cc}
\text{method}&\text{ARI score}\\
avg\_mean& 0.06\\
5-nearest\_neighbours& 0.02
\end{array}

Agglomerative clustering with default parameters:

\begin{array}{cc}
\text{method}&\text{ARI score}\\
avg\_mean& 0.176\\
5-nearest\_neighbours& 0.18\\
k-nearest\_neighbours (*)& 0.241\\
obtained\_baseline& 0.225\\
Authors\_baseline& 0.261
\end{array}

(*) The k-nearest-neighbours approach where k depends on the length of the context: 10 neighbours if the context has more than 15 tokens, 5 neighbours if it has more than 6, and only 1 neighbour otherwise.





In [None]:
from sklearn.cluster import AgglomerativeClustering

clusters = []
for i in tqdm(range(unique_words.shape[0])):
  # embeddings of all contexts of the i-th target word
  X = np.array(context_embs)[indexes[i]:indexes[i+1]]
  num_clusters = define_num_clusters(X)
  clusters.append(list(AgglomerativeClustering(n_clusters=num_clusters).fit(X).labels_))

100%|██████████| 51/51 [02:28<00:00,  2.92s/it]


## Final results 

In [None]:
predict_list = list(itertools.chain(*clusters))
df['predict_sense_id'] = predict_list

# Save

In [None]:
save_path = 'russe-wsi-kit/data/main/bts-rnc/elmo.csv'
df.to_csv(save_path, sep='\t', index=None)

In [None]:
cd russe-wsi-kit/data/main/bts-rnc/

/content/russe-wsi-kit/data/main/bts-rnc


In [None]:
!zip bts_rnc.zip elmo.csv

  adding: gg_elmo.csv (deflated 68%)


In [None]:
cd /content/

/content


# Test score

In [None]:
from sklearn.metrics import adjusted_rand_score

# ARI between gold senses and predicted clusters, computed separately for each target word
score = []
for j in range(unique_words.shape[0]):
  gold = np.array(list(df['gold_sense_id'])[indexes[j]:indexes[j+1]])
  score.append(adjusted_rand_score(gold, clusters[j]))
print(score)
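For reference, the adjusted Rand index used above can be written out from its contingency-table definition. The function below is a minimal re-implementation for illustration, assuming at least two samples; it is not scikit-learn's code, and `adjusted_rand_index` is a name chosen for this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of two labelings of the same samples."""
    n = len(labels_true)
    # pairs of samples that share a cluster in both labelings
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # pairs of samples that share a cluster in each labeling separately
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings form a single cluster
    return (sum_cells - expected) / (max_index - expected)


gold = [1, 1, 1, 2, 2, 2]
print(adjusted_rand_index(gold, [0, 0, 0, 1, 1, 1]))  # 1.0: same partition, different label names
print(adjusted_rand_index(gold, [0, 0, 0, 0, 0, 0]))  # 0.0: one big cluster carries no information
```

The score depends only on the partition, not on the label values, which is why the arbitrary cluster ids can be compared with gold sense ids directly. If a single number is needed, one option is a mean of the per-word scores weighted by the number of contexts of each word; whether that matches the official evaluation script is an assumption to verify.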