<a href="https://colab.research.google.com/github/ValeriyaKuznetsova/collocation_extraction/blob/main/German_English_collocation_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install multivec

https://github.com/alex-berard/multivec

In [None]:
!git clone https://github.com/eske/multivec.git
!mkdir multivec/build
%cd multivec/build
!cmake ..
!make
%cd ..
!mkdir models
!mkdir data

Cloning into 'multivec'...
remote: Enumerating objects: 1167, done.
remote: Total 1167 (delta 0), reused 0 (delta 0), pack-reused 1167
Receiving objects: 100% (1167/1167), 726.90 KiB | 2.86 MiB/s, done.
Resolving deltas: 100% (777/777), done.
/content/multivec/build
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- 

## Get the corpus

Corpus: http://www.statmt.org/wmt14/quality-estimation-task.html

This corpus consists of parallel sentences in different languages. We need German and English.

First, let's download the corpus.

In [None]:
!wget http://www.statmt.org/wmt14/training-parallel-nc-v9.tgz -P data
!tar xzf data/training-parallel-nc-v9.tgz -C data

--2021-05-26 06:38:02--  http://www.statmt.org/wmt14/training-parallel-nc-v9.tgz
Resolving www.statmt.org (www.statmt.org)... 129.215.197.184
Connecting to www.statmt.org (www.statmt.org)|129.215.197.184|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 80418416 (77M) [application/x-gzip]
Saving to: ‘data/training-parallel-nc-v9.tgz’


2021-05-26 06:38:51 (1.57 MB/s) - ‘data/training-parallel-nc-v9.tgz’ saved [80418416/80418416]



In [None]:
with open('data/training/news-commentary-v9.de-en.de', 'r') as f:
    german_corpus = f.read()

In [None]:
with open('data/training/news-commentary-v9.de-en.en', 'r') as f:
    english_corpus = f.read()

As we can see, the sentences are separated by `\n`.

In [None]:
german_corpus[:100]

'Steigt Gold auf 10.000 Dollar?\nSAN FRANCISCO – Es war noch nie leicht, ein rationales Gespräch über '

In [None]:
english_corpus[:100]

'$10,000 Gold?\nSAN FRANCISCO – It has never been easy to have a rational conversation about the value'

# Preprocess the corpus

In this part, sentences are tokenized and lemmatized, and punctuation marks and digits are removed.

For this we'll use the Python library `spacy`.

All tags (POS and DEP): https://github.com/explosion/spaCy/blob/master/spacy/glossary.py

Dependency tags explained: https://universaldependencies.org/u/dep/

## Install libraries

First, we need to install and import all necessary libraries.

In [None]:
!pip install spacy==3.0.0

In [None]:
import spacy

As we work with English and German, we need to download the `spacy` models for these two languages.

In [None]:
!python3 -m spacy download en_core_web_sm

In [None]:
!python3 -m spacy download de_core_news_sm

## Visualize dependency trees

Let's see how `spacy` works. We'll draw English and German dependency trees with lemmas and tags.

In [None]:
from nltk import Tree

def tok_format(tok):
    return "_".join([tok.lemma_, tok.pos_])


def to_nltk_tree(node):
    if node.n_lefts + node.n_rights > 0:
        return Tree(tok_format(node), [to_nltk_tree(child) for child in node.children])
    else:
        return tok_format(node)

In [None]:
en_nlp = spacy.load("en_core_web_sm")
doc =  en_nlp("So why is it so hard to pay attention")
[to_nltk_tree(sent.root).pretty_print() for sent in doc.sents]

                be_AUX                                        
   _______________|______________                              
  |       |       |           hard_ADJ                        
  |       |       |       _______|________                     
  |       |       |      |             pay_VERB               
  |       |       |      |        ________|___________         
so_ADV why_ADV it_PRON so_ADV to_PART           attention_NOUN



[None]

In [None]:
de_nlp = spacy.load("de_core_news_sm")
doc = de_nlp("Warum ist es also so schwer aufmerksam zu sein")
[to_nltk_tree(sent.root).pretty_print() for sent in doc.sents]

                   sein_AUX                                
     _________________|________________                     
    |        |        |             mein_AUX               
    |        |        |         _______|___________         
    |        |        |        |             aufmerksam_ADV
    |        |        |        |                   |        
    |        |        |        |               schwer_ADV  
    |        |        |        |                   |        
warum_ADV ich_PRON also_ADV zu_PART              so_ADV    



[None]

The tag glossary is linked at the beginning of this section. A single tag can also be looked up with `spacy.explain`.

In [None]:
spacy.explain('PART')

'particle'

## Tokenize and lemmatize the corpus

Now we can tokenize the German and English sentences.

In [None]:
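# tqdm.tqdm_notebook is deprecated; import the notebook progress bar under the same name
from tqdm.notebook import tqdm as tqdm_notebook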

In [None]:
def tokenize_corpus(corpus: str, language: str = 'english') -> tuple:
    """
    Accepted languages are 'english' and 'german'.
    It returns a tuple with tokens and a tuple with tokenized sentences.
    """

    if language == 'german':
        nlp = spacy.load("de_core_news_sm")
    elif language == 'english':
        nlp = spacy.load("en_core_web_sm")
    else:
        raise ValueError(f"Unsupported language: {language!r}")

    tokens = []
    sentences = []
    for sentence in tqdm_notebook(corpus.split('\n')):
        doc = nlp(sentence)
        # Keep lemmas, skipping punctuation and tokens made up only of digits
        tokenized_sentence = [token.lemma_ for token in doc if token.pos_ != 'PUNCT' and not token.text.isdigit()]
        if tokenized_sentence:
            tokens.extend(tokenized_sentence)
            sentences.append(tuple(tokenized_sentence))

    return tuple(tokens), tuple(sentences)

In [None]:
german_tokens, german_sentences = tokenize_corpus(german_corpus, 'german')



HBox(children=(FloatProgress(value=0.0, max=201855.0), HTML(value='')))




In [None]:
print(german_tokens[:10])

('Steigt', 'Gold', 'auf', '10.000', 'Dollar', 'SAN', 'FRANCISCO', 'ich', 'sein', 'noch')


In [None]:
print(german_sentences[:2])

(('Steigt', 'Gold', 'auf', '10.000', 'Dollar'), ('SAN', 'FRANCISCO', 'ich', 'sein', 'noch', 'nie', 'leicht', 'einen', 'rational', 'Gespräch', 'über', 'der', 'Wert', 'von', 'Gold', 'zu', 'fahren'))
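The token `'10.000'` survives the filter because `str.isdigit()` is false for any string that contains a separator. If such tokens should be dropped as well, a stricter check could look like the sketch below. `looks_numeric` is a hypothetical helper that is not part of the notebook, and the regular expression is only one possible definition of "numeric".

```python
import re

# Hypothetical pattern: digits, optionally followed by groups or decimals
# introduced by '.' or ',' (covers German '10.000' and English '10,000').
_NUMERIC = re.compile(r"\d+(?:[.,]\d+)*")


def looks_numeric(text: str) -> bool:
    """Return True if the whole token consists of digits and separators."""
    return _NUMERIC.fullmatch(text) is not None


print("10.000".isdigit())       # False
print(looks_numeric("10.000"))  # True
print(looks_numeric("Gold"))    # False
```

Replacing `not token.text.isdigit()` with `not looks_numeric(token.text)` would change the token counts reported in this notebook.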


In [None]:
english_tokens, english_sentences = tokenize_corpus(english_corpus, 'english')



HBox(children=(FloatProgress(value=0.0, max=201996.0), HTML(value='')))




The two corpora end up with different numbers of sentences, as the counts below show. We assume that the extra English sentences are at the end and that all other sentences are parallel.

In [None]:
print('NUMBER OF WORDS (GERMAN):', len(german_tokens))
print('NUMBER OF SENTENCES (GERMAN):', len(german_sentences))

NUMBER OF WORDS (GERMAN): 4456395
NUMBER OF SENTENCES (GERMAN): 201570


In [None]:
print('NUMBER OF WORDS (ENGLISH):', len(english_tokens))
print('NUMBER OF SENTENCES (ENGLISH):', len(english_sentences))

NUMBER OF WORDS (ENGLISH): 4462014
NUMBER OF SENTENCES (ENGLISH): 201584
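The assumption that only trailing sentences differ may not hold exactly, because `tokenize_corpus` drops empty sentences for each language independently. One way to keep positions aligned, sketched here under the assumption that the two raw inputs are line-aligned, is to filter both sides together. `filter_parallel` is a hypothetical helper and the data is made up.

```python
def filter_parallel(src_sentences, trg_sentences):
    """Drop a sentence pair when either side is empty, keeping indices aligned."""
    kept_src, kept_trg = [], []
    # zip stops at the shorter input, so unmatched trailing sentences are ignored.
    for src, trg in zip(src_sentences, trg_sentences):
        if src and trg:
            kept_src.append(src)
            kept_trg.append(trg)
    return kept_src, kept_trg


# Toy data: the second German sentence became empty after filtering.
german = [("Gold", "steigen"), (), ("gut",)]
english = [("gold", "rise"), ("yes",), ("good",), ("extra",)]

kept_german, kept_english = filter_parallel(german, english)
print(kept_german)
print(kept_english)
```

After this step both lists have the same length, and the sentence at index `i` in one list corresponds to index `i` in the other.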


In [None]:
cd ..

/content


We save the data so that we do not need to recompute it each time.

In [None]:
with open("data/german_corpus", "w") as fp:
    fp.write('\n'.join([' '.join(sentence) for sentence in german_sentences])) 

In [None]:
with open("data/english_corpus", "w") as fp:
    fp.write('\n'.join([' '.join(sentence) for sentence in english_sentences])) 

In [None]:
import json

In [None]:
with open("german_tokens.json", "w") as fp:
    json.dump(german_tokens, fp) 

In [None]:
with open("german_sentences.json", "w") as fp:
    json.dump(german_sentences, fp) 

In [None]:
with open("english_tokens.json", "w") as fp:
    json.dump(english_tokens, fp) 

In [None]:
with open("english_sentences.json", "w") as fp:
    json.dump(english_sentences, fp) 

This way we can load it again later.

In [None]:
with open('german_tokens.json') as json_file:
    german_tokens = json.load(json_file)

In [None]:
with open('german_sentences.json') as json_file:
    german_sentences = json.load(json_file)

In [None]:
with open('english_tokens.json') as json_file:
    english_tokens = json.load(json_file)

In [None]:
with open('english_sentences.json') as json_file:
    english_sentences = json.load(json_file)
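Note that JSON has no tuple type, so data saved as tuples comes back as lists. The short sketch below, using made-up data, shows the effect and one way to restore tuples if hashable sentences are needed later.

```python
import json

sentences = (("Steigt", "Gold"), ("SAN", "FRANCISCO"))

# Round trip through JSON: tuples are written as arrays and read back as lists.
restored = json.loads(json.dumps(sentences, ensure_ascii=False))
print(type(restored), type(restored[0]))

# Convert back if tuples are needed, e.g. as dictionary keys.
as_tuples = tuple(tuple(sentence) for sentence in restored)
print(as_tuples == sentences)
```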

# Apply the statistical method

In this section, association measures are applied to extract possible collocations.

## Get collocation candidates with association measures

We will use `nltk.collocations` to extract collocation candidates and rank them with association measures.

Overview: https://www.nltk.org/howto/collocations.html

https://www.nltk.org/_modules/nltk/collocations.html

https://www.nltk.org/_modules/nltk/metrics/association.html

In [None]:
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
bigram_measures = BigramAssocMeasures()

We can find all bigram collocations in the English and German corpora.

In [None]:
english_finder = BigramCollocationFinder.from_words(english_tokens)

In [None]:
german_finder = BigramCollocationFinder.from_words(german_tokens)

Let's see how many possible bigram collocations there are.

In [None]:
english_bigrams = english_finder.nbest(BigramAssocMeasures.mi_like, 10000000)
german_bigrams = german_finder.nbest(BigramAssocMeasures.mi_like, 10000000)
len(english_bigrams), len(german_bigrams)

(1004833, 1326789)

Let's look at the top 10 English and German bigram collocations.

In [None]:
english_bigrams[:10]

[('United', 'States'),
 ('Middle', 'East'),
 ('of', 'the'),
 ('Prime', 'Minister'),
 ('in', 'the'),
 ('Saudi', 'Arabia'),
 ('Federal', 'Reserve'),
 ('Hong', 'Kong'),
 ('$', 'billion'),
 ('do', 'not')]

In [None]:
german_bigrams[:10]

[('in', 'der'),
 ('Nahe', 'Osten'),
 ('&', '#'),
 ('Vereinigte', 'Staat'),
 ('NEW', 'YORK'),
 ('Darüber', 'hinaus'),
 ('Vereinte', 'Nation'),
 ('George', 'W.'),
 ('New', 'York'),
 ('und', 'der')]
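In NLTK, `mi_like` scores a bigram as its count raised to a power (3 by default) divided by the product of the two unigram counts, without taking a logarithm. The following pure-Python sketch reproduces that formula on a toy token list. `mi_like_scores` is a hypothetical helper, and the counting is simplified compared with `BigramCollocationFinder`.

```python
from collections import Counter


def mi_like_scores(tokens, power=3):
    """Score adjacent word pairs as count(w1, w2)**power / (count(w1) * count(w2))."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        bigram: count ** power / (unigram_counts[bigram[0]] * unigram_counts[bigram[1]])
        for bigram, count in bigram_counts.items()
    }


scores = mi_like_scores("a b a b c".split())
for bigram, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(bigram, score)
# ('a', 'b') 2.0
# ('b', 'c') 0.5
# ('b', 'a') 0.25
```

Because the bigram count is cubed, very frequent pairs can score highly even when both words are common, which is consistent with function-word pairs such as `('of', 'the')` appearing in the top 10.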

## Check frequency of a candidate

We can check the frequency of any collocation candidate.

Later we will use this to test whether a candidate occurs at least a given number of times in the corpus.

In [None]:
def check_frequency(candidate: tuple, threshold: int = 10,
                    language: str = 'english', verbose: bool = False) -> bool:
    """Return True if the candidate occurs at least `threshold` times in either word order."""
    if language == 'english':
        finder = english_finder
    elif language == 'german':
        finder = german_finder
    else:
        raise ValueError(f"Unsupported language: {language!r}")
    frequency = finder.ngram_fd[candidate] + finder.ngram_fd[candidate[::-1]]
    if verbose:
        print('Occurred', frequency, 'times')
    return frequency >= threshold

In [None]:
check_frequency(('interest', 'rate'), 3, 'english', verbose=True)

Occurred 1767 times


True

In [None]:
check_frequency(('Ansicht', 'nach'), 3, 'german', verbose=True)

Occurred 112 times


True
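The frequency is summed over both word orders because the syntax-based extractor used later returns pairs as (base, collocate), for example `('work', 'hard')`, while the corpus contains the surface order "hard work". A minimal sketch of this order-insensitive lookup on toy data follows. `symmetric_frequency` is a hypothetical helper, and a plain `Counter` stands in for the finder's `ngram_fd`.

```python
from collections import Counter


def symmetric_frequency(bigram_counts, candidate):
    """Count a word pair in both orders, e.g. ('work', 'hard') and ('hard', 'work')."""
    return bigram_counts[candidate] + bigram_counts[candidate[::-1]]


tokens = "it be hard work and hard work pay".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(bigram_counts[("work", "hard")])                       # 0
print(symmetric_frequency(bigram_counts, ("work", "hard")))  # 2
```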

# Apply the syntax-based method

In this section `spacy` is used to extract candidates of particular patterns, for example, verb + noun.

For the syntactic analysis, a non-lemmatized corpus is needed; otherwise, the dependency tree would be incorrect.

## Get collocations using the dependency tree

The first step is to get English and German collocation candidates.

In [None]:
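def extract_dependency_tree(sentence: str, language: str = 'english'):
    """
    Accepted languages are 'english' and 'german'.
    Returns the parsed spaCy document for the sentence.
    """
    # Note: the model is loaded on every call
    if language == 'english':
        syntax_parser = spacy.load("en_core_web_sm")
    elif language == 'german':
        syntax_parser = spacy.load("de_core_news_sm")
    else:
        raise ValueError(f"Unsupported language: {language!r}")
    return syntax_parser(sentence)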

In [None]:
def find_candidates(sentence: str, base_pos: str, collocate_pos: str,
                    language : str = 'english'):
    """
    Accepted languages are 'english' and 'german'.
    Returns lemmatized and tokenized candidates.
    The first word is the base, the second – the collocate.
    """
    tree = extract_dependency_tree(sentence, language)
    candidates = []
    for token in tree:
        if token.pos_ == base_pos:
            relevant_children = [child for child in token.children if child.pos_ == collocate_pos]
            if relevant_children:
                candidates.extend([(token.lemma_, child.lemma_) for child in relevant_children])
    return candidates

Let's look at some examples.

In [None]:
find_candidates('because it is really hard work', 'NOUN', 'ADJ', 'english')

[('work', 'hard')]

In [None]:
find_candidates('weil es wirklich hartes Arbeit ist', 'NOUN', 'ADJ', 'german')

[('Arbeit', 'hart')]
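The matching logic itself does not depend on `spacy`: a pair is reported whenever a token with the collocate's part of speech is attached to a head with the base's part of speech. The sketch below illustrates this on a hand-annotated toy parse. `ToyToken` and `find_pairs` are hypothetical names, and the parse is an assumption, not parser output.

```python
from dataclasses import dataclass


@dataclass
class ToyToken:
    lemma: str
    pos: str
    head: int  # index of the syntactic head; the root points to itself


def find_pairs(tokens, base_pos, collocate_pos):
    """Return (base lemma, collocate lemma) for every child attached to a matching head."""
    pairs = []
    for child in tokens:
        head = tokens[child.head]
        if head is not child and head.pos == base_pos and child.pos == collocate_pos:
            pairs.append((head.lemma, child.lemma))
    return pairs


# Hand-annotated parse of "really hard work".
parse = [
    ToyToken("really", "ADV", 1),
    ToyToken("hard", "ADJ", 2),
    ToyToken("work", "NOUN", 2),
]
print(find_pairs(parse, "NOUN", "ADJ"))  # [('work', 'hard')]
```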

# Train bilingual embeddings

Bilingual embeddings are trained using multivec.

The embeddings are trained on the non-lemmatized corpus, which gave better results empirically.

In [None]:
!scripts/prepare-data.py data/training/news-commentary-v9.de-en data/news-commentary de en --tokenize --normalize-punk

In [None]:
!bin/multivec-bi --train-src data/news-commentary.de --train-trg data/news-commentary.en --save models/news-commentary.de-en.bin --threads 16

MultiVec-bi
dimension:   100
window size: 5
min count:   5
alpha:       0.05
iterations:  5
threads:     16
subsampling: 0.001
skip-gram:   false
HS:          false
negative:    5
sent vector: false
beta:        1
Training files: data/news-commentary.de, data/news-commentary.en
tcmalloc: large alloc 1073741824 bytes == 0x55c310bc0000 @  0x7f457790f887 0x55c2cd794768 0x55c2cd7a2454 0x55c2cd78f835 0x55c2cd7897cf 0x7f45769acbf7 0x55c2cd789ada
Training time: 180.485


In [None]:
# Before building, change python2 to python3 in the cython makefile.

In [None]:
%cd cython

/content/multivec/cython


In [None]:
!make

python3 setup.py build
Compiling multivec.pyx because it changed.
[1/1] Cythonizing multivec.pyx
  tree = Parsing.p_module(s, pxd, full_module_name)
running build
running build_ext
building 'multivec' extension
creating build
creating build/temp.linux-x86_64-3.7
creating build/multivec
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fdebug-prefix-map=/build/python3.7-OGiuun/python3.7-3.7.10=. -fstack-protector-strong -Wformat -Werror=format-security -g -fdebug-prefix-map=/build/python3.7-OGiuun/python3.7-3.7.10=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -UNDEBUG -I/usr/local/lib/python3.7/dist-packages/numpy/core/include -I/usr/include/python3.7m -c multivec.cpp -o build/temp.linux-x86_64-3.7/multivec.o --std=c++11 -w -I../multivec -O3
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fdebug-prefix-map=/build/python3.7-OGiuun/python3.7

In [None]:
from multivec import MonolingualModel, BilingualModel

In [None]:
news_model = BilingualModel(b'../models/news-commentary.de-en.bin')

In [None]:
news_model

<multivec.BilingualModel at 0x7fb7830794b0>

The nearest neighbours look promising.

Results with the lemmatized corpus were worse, which is why the tokenized version is used for training.

In [None]:
news_model.trg_closest(b'Beispiel')

[(b'example', 0.6098706126213074),
 (b'instance', 0.5744057297706604),
 (b'Take', 0.43994414806365967),
 (b'Dalit', 0.4334424138069153),
 (b'instructive', 0.3968197703361511),
 (b'Iceland', 0.3956241011619568),
 (b'Consider', 0.38580814003944397),
 (b'Straw', 0.38547778129577637),
 (b'Witness', 0.3854774534702301),
 (b'Ghanaian', 0.37877997756004333)]

In [None]:
news_model.trg_closest(b'gut')

[(b'good', 0.6450780034065247),
 (b'reasonably', 0.5584521293640137),
 (b'advised', 0.5345669984817505),
 (b'poorly', 0.5315302014350891),
 (b'performing', 0.5220630764961243),
 (b'well-functioning', 0.5138837099075317),
 (b'equipped', 0.5013648271560669),
 (b'sufficiently', 0.4940258264541626),
 (b'suited', 0.49237918853759766),
 (b'well-informed', 0.4908822178840637)]
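The scores returned by `trg_closest` look like similarities between a German word vector and English word vectors. The notebook does not show how multivec computes them; the sketch below assumes cosine similarity, which is the usual choice for word2vec-style models, and uses made-up two-dimensional vectors. `cosine` and `closest` are hypothetical helpers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest(query_vector, vocabulary, n=10):
    """Return the n words whose vectors are most similar to the query vector."""
    scored = [(word, cosine(query_vector, vector)) for word, vector in vocabulary.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up vectors, for illustration only.
english_vectors = {"good": [0.9, 0.1], "poorly": [0.6, 0.5], "Iceland": [-0.2, 1.0]}
german_gut = [1.0, 0.0]

top = closest(german_gut, english_vectors, n=2)
print([word for word, _ in top])  # ['good', 'poorly']
```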

# Combine all the methods

After extracting some candidates with the syntax-based tool, we need to check their frequency.

In [None]:
def find_freq_candidates(sentence: str, base_pos: str,
                         collocate_pos: str, language : str = 'english'):
    candidates = find_candidates(sentence, base_pos, collocate_pos, language)
    checked_candidates = []
    for candidate in candidates:
        if check_frequency(candidate, 10, language):
            checked_candidates.append(candidate)
    
    return checked_candidates

Let's look at some examples.

In [None]:
find_freq_candidates('because it is really hard work', 'NOUN', 'ADJ', 'english')

[('work', 'hard')]

Then, we need to align candidates (in other words, find equivalent English and German candidates).

In [None]:
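def align_candidates(english_candidates: list, german_candidates: list):
    """
    Pair English and German (base, collocate) candidates whose base and
    collocate are among the five nearest cross-lingual neighbours.
    """
    aligned_candidates = []
    for candidate in english_candidates:
        # Source-language (German) words closest to the English base
        neighbours = news_model.src_closest(bytes(candidate[0], 'utf-8'))
        target_translations = [word[0].decode('UTF-8') for word in neighbours][:5]
        for de_candidate in german_candidates:
            if de_candidate[0] in target_translations:
                # Target-language (English) words closest to the German collocate
                neighbours = news_model.trg_closest(bytes(de_candidate[1], 'utf-8'))
                collocate_translations = [word[0].decode('UTF-8') for word in neighbours][:5]
                if candidate[1] in collocate_translations:
                    aligned_candidates.append((candidate, de_candidate))
    return aligned_candidates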

In [None]:
english_candidates = find_freq_candidates('because it is really hard work', 'NOUN', 'ADJ', 'english')
german_candidates = find_freq_candidates('weil es wirklich hartes Arbeit ist', 'NOUN', 'ADJ', 'german')
align_candidates(english_candidates, german_candidates)
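The alignment rule can be tried out without a trained model by replacing the embedding lookups with plain dictionaries of neighbour lists. In the sketch below, `align`, `to_german`, and `to_english` are hypothetical names and the neighbour lists are made up. An English and a German candidate are paired when the German base is among the top-k neighbours of the English base and the English collocate is among the top-k neighbours of the German collocate.

```python
def align(english_candidates, german_candidates, to_german, to_english, k=5):
    """Pair candidates whose bases and collocates are top-k cross-lingual neighbours."""
    aligned = []
    for en_base, en_collocate in english_candidates:
        german_bases = to_german.get(en_base, [])[:k]
        for de_base, de_collocate in german_candidates:
            if de_base not in german_bases:
                continue
            english_collocates = to_english.get(de_collocate, [])[:k]
            if en_collocate in english_collocates:
                aligned.append(((en_base, en_collocate), (de_base, de_collocate)))
    return aligned


# Made-up neighbour lists standing in for the embedding lookups.
to_german = {"work": ["Arbeit", "arbeiten"]}
to_english = {"hart": ["hard", "tough"]}

result = align([("work", "hard")], [("Arbeit", "hart"), ("Zeit", "gut")],
               to_german, to_english)
print(result)  # [(('work', 'hard'), ('Arbeit', 'hart'))]
```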

Here is the final pipeline that combines all previous functions.

In [None]:
def find_bilingual_collocations(english_sentence: str, german_sentence: str,
                                base_pos: str, collocate_pos: str):
    english_candidates = find_freq_candidates(english_sentence, base_pos, collocate_pos, 'english')
    german_candidates = find_freq_candidates(german_sentence, base_pos, collocate_pos, 'german')
    return align_candidates(english_candidates, german_candidates)

In [None]:
english_sentence = "i think it is the best time of the year and it is really hard work"
german_sentence = 'ich denke, es ist die beste Zeit des Jahres und es ist wirklich hartes Arbeit'

In [None]:
find_bilingual_collocations(english_sentence, german_sentence, 'NOUN', 'ADJ')

[(('time', 'good'), ('Zeit', 'gut')), (('work', 'hard'), ('Arbeit', 'hart'))]

In [None]:
find_bilingual_collocations('I read books', 'Ich lese Bücher', 'VERB', 'NOUN')

[(('read', 'book'), ('lesen', 'Buch'))]

# Extract collocations for evaluation

We will evaluate verb + noun and noun + adjective collocations.

It takes 50 seconds to analyse 10 sentences.
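Both `tokenize_corpus` and `extract_dependency_tree` call `spacy.load` on every invocation, which may account for part of this running time. One possible remedy, sketched below, is to cache the loaded model per language. `make_cached_loader` and `fake_load` are hypothetical names; the stand-in loader lets the sketch run without `spacy`, and in the notebook one would pass `spacy.load` instead.

```python
from functools import lru_cache

MODEL_NAMES = {"english": "en_core_web_sm", "german": "de_core_news_sm"}


def make_cached_loader(load):
    """Wrap a loader such as spacy.load so each language is loaded only once."""
    @lru_cache(maxsize=None)
    def get_parser(language: str):
        if language not in MODEL_NAMES:
            raise ValueError(f"Unsupported language: {language!r}")
        return load(MODEL_NAMES[language])
    return get_parser


# Stand-in loader that records how often it is called.
calls = []


def fake_load(name):
    calls.append(name)
    return f"<model {name}>"


get_parser = make_cached_loader(fake_load)
get_parser("german")
get_parser("german")   # served from the cache, fake_load is not called again
get_parser("english")
print(calls)  # ['de_core_news_sm', 'en_core_web_sm']
```

With `get_parser = make_cached_loader(spacy.load)`, the body of `extract_dependency_tree` could reduce to `return get_parser(language)(sentence)`.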

In [None]:
from tqdm.notebook import tqdm

feedback = []

en_sents = english_corpus.split('\n')[7000:7100]
de_sents = german_corpus.split('\n')[7000:7100]

k = 0  # number of sentence pairs with at least one aligned collocation
for index, sentence in tqdm(enumerate(en_sents)):
    try:
        # Use 'NOUN', 'ADJ' here to extract noun + adjective collocations instead.
        collocations = find_bilingual_collocations(sentence, de_sents[index],
                                                   'VERB', 'NOUN')
    except Exception:
        # Skip sentence pairs that cannot be processed
        continue
    if collocations:
        k += 1
        print(k)
        print(collocations)
        feedback.append(sentence)
        feedback.append(de_sents[index])
        for pair in collocations:
            feedback.append(" ".join([' '.join(part) for part in pair]))
        feedback.append('\n')

with open('feedback_7100', 'w') as f:
    f.write('\n'.join(feedback))



HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

1
[(('show', 'poll'), ('zeigen', 'Meinungsumfragen'))]
2
[(('have', 'power'), ('haben', 'Macht'))]
3
[(('play', 'role'), ('spielen', 'Rolle'))]
4
[(('start', 'crisis'), ('beginnen', 'Krise'))]
5
[(('implement', 'reform'), ('umsetzen', 'Reform'))]

