<a href="https://colab.research.google.com/github/acheronw/mondoreg/blob/master/similar_words_in_hungarian.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setup

Import Libraries

In [1]:
!pip install --upgrade gensim

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting gensim
  Downloading gensim-4.2.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
     |████████████████████████████████| 24.1 MB 4.4 MB/s 
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.2.0


In [2]:
import numpy as np
import gzip
import gensim

print(gensim.__version__)

import logging
import multiprocessing

4.2.0


Some basic setup steps:

In [3]:
# np.random.seed(7)
logging.basicConfig()

logger = logging.getLogger('swih')
logger.setLevel(logging.INFO)
logger.info("test")

no_cores = multiprocessing.cpu_count()
logger.info("Using {0} processors".format (no_cores))



INFO:swih:test
INFO:swih:Using 2 processors


## Hyperparameters

In [4]:
vector_size = 150
training_epochs = 5
window_size = 5
word_freq_cutoff = 1

model_name = "plain_corpus_w2v.model"
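
To make the role of `word_freq_cutoff` concrete: it is passed to Word2Vec as `min_count`, which drops words whose total frequency in the corpus is below the threshold. The snippet below is a standalone sketch of that kind of filtering on a made-up toy corpus; `filter_vocab` and the toy sentences are hypothetical and not part of gensim.

```python
from collections import Counter


def filter_vocab(sentences, min_count):
    """Return the set of words occurring at least `min_count` times (hypothetical helper)."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


# Made-up toy corpus, for illustration only
toy_corpus = [["a", "könyv", "jó"], ["a", "könyv", "régi"], ["a", "ház"]]

vocab_all = filter_vocab(toy_corpus, min_count=1)     # a cutoff of 1 keeps every word
vocab_common = filter_vocab(toy_corpus, min_count=2)  # words seen only once are dropped
print(sorted(vocab_common))
```

With a cutoff of 1, as set above, even words that occur a single time receive a vector, which may be poorly estimated from so little evidence.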

## Import Corpus

We use the Hungarian Webcorpus 2.0 (Magyar Webkorpusz 2.0):

https://hlt.bme.hu/hu/resources/webcorpus2

For the first iteration, we will be using the plain-text version of the corpus.

Let's make do with a single gzip file for now:

In [5]:
!wget https://nessie.ilab.sztaki.hu/~ndavid/Webcorpus2_text/2017_2018_0001.txt.gz

--2022-11-27 07:19:13--  https://nessie.ilab.sztaki.hu/~ndavid/Webcorpus2_text/2017_2018_0001.txt.gz
Resolving nessie.ilab.sztaki.hu (nessie.ilab.sztaki.hu)... 195.111.1.193
Connecting to nessie.ilab.sztaki.hu (nessie.ilab.sztaki.hu)|195.111.1.193|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2769774 (2.6M) [application/x-gzip]
Saving to: ‘2017_2018_0001.txt.gz.1’


2022-11-27 07:19:14 (3.88 MB/s) - ‘2017_2018_0001.txt.gz.1’ saved [2769774/2769774]



Let's peek into its contents:

In [6]:
data_file = "2017_2018_0001.txt.gz"

with gzip.open (data_file, 'rb') as f:
    for i,line in enumerate (f):
        print(line)
        break


b'\xc3\x81jurv\xc3\xa9da \xe2\x80\x93 A hossz\xc3\xba \xc3\xa9s boldog \xc3\xa9let titka\n'


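We have checked the first line. It is printed as a bytes literal because the file was opened in binary mode (`'rb'`), so the non-ASCII characters show up as escaped UTF-8 bytes. This is not expected to cause any problems.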
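
The escaped bytes in the output above are ordinary UTF-8. As a small sketch, decoding them (or opening the gzip file in text mode) gives readable Hungarian text. The in-memory buffer below stands in for the downloaded file and is only an assumption made to keep the example self-contained.

```python
import gzip
import io

# The first line exactly as printed above
raw = b'\xc3\x81jurv\xc3\xa9da \xe2\x80\x93 A hossz\xc3\xba \xc3\xa9s boldog \xc3\xa9let titka\n'

# Option 1: decode the bytes explicitly
decoded = raw.decode("utf-8").rstrip("\n")
print(decoded)

# Option 2: open the gzip stream in text mode so every line arrives as str.
# An in-memory buffer is used here instead of the real corpus file.
buffer = io.BytesIO(gzip.compress(raw))
with gzip.open(buffer, "rt", encoding="utf-8") as f:
    first_line = f.readline().rstrip("\n")
```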

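The tutorial suggests using gensim's `simple_preprocess`, which discards tokens that are too short or too long and lowercases everything.
Preserving case might help us with the task, so we would rather avoid that. Note, however, that `gensim.parsing.preprocess_string`, which is used below with its default filters, also lowercases the text. In addition it strips punctuation and digits, removes (English) stopwords, drops short tokens, and applies the Porter stemmer, so the sample output below is lowercased as well.

Gensim also provides `read_file` and `read_files` in `gensim.parsing.preprocessing`; these are worth checking out.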
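
If case is to be preserved, one option is a tokenizer that does no lowercasing at all. The following is a minimal sketch using only the standard library; `tokenize_keep_case` is a hypothetical helper, not a gensim function, and it is not what the cell below uses.

```python
import re

# \w+ matches runs of Unicode letters, digits and underscores,
# so accented Hungarian letters stay inside their tokens.
TOKEN_PATTERN = re.compile(r"\w+")


def tokenize_keep_case(text):
    """Split text into word tokens without changing their case (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize_keep_case("Ájurvéda – A hosszú és boldog élet titka")
print(tokens)
```

A generator like `read_input` could yield `tokenize_keep_case(line.decode("utf-8"))` instead of the gensim call, at the cost of losing gensim's other filters.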

In [7]:
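def read_input(input_file):
    """Generator that reads a gzip-compressed text file and yields one list of tokens per line."""
    # Use the module's own logger; the root logger would swallow INFO messages here
    logger.info("reading file {0}...this may take a while".format(input_file))
    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate(f):
            # Report progress every 10,000 lines
            if i % 10000 == 0:
                logger.info("read {0} lines".format(i))
            # preprocess_string decodes the bytes and applies gensim's default filters
            yield gensim.parsing.preprocess_string(line)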

# Read the corpus into a list
documents = list(read_input (data_file))
logger.info("Done reading data file")

print(documents[3])

INFO:swih:read 0 lines
INFO:swih:read 10000 lines
INFO:swih:read 20000 lines
INFO:swih:read 30000 lines
INFO:swih:read 40000 lines
INFO:swih:read 50000 lines
INFO:swih:read 60000 lines
INFO:swih:read 70000 lines
INFO:swih:read 80000 lines
INFO:swih:Done reading data file


['ájurvéda', 'alapelv', 'ismerd', 'meg', 'önmagad', 'válj', 'önmagad', 'gyógyítójává']


## Training the vectors

Initializing the model:

In [8]:
# documents = [['first', 'sentence'], ['second', 'sentence']]

In [9]:
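# Create an untrained model; the vocabulary is built and training is run in the next cells.
# (Passing `documents` to the constructor would already build the vocabulary and train once.)
model = gensim.models.Word2Vec(vector_size=vector_size, window=window_size, min_count=word_freq_cutoff, workers=no_cores)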

In [10]:
model.build_vocab(documents)



In [11]:
model.train(documents, total_examples=len(documents), epochs=training_epochs)
model.save(model_name)



In [12]:
model.wv['könyv']

array([ 0.6452424 ,  0.00774061, -0.05378435,  0.16109864, -0.5652805 ,
       -0.20422785, -0.33523738,  0.10901169, -0.1486586 ,  0.07550424,
        0.0486619 , -0.24080387,  0.07243193,  0.19740862,  0.02868237,
       -0.16969346, -0.11952788, -0.00348568,  0.37977925,  0.51710653,
        0.03576323,  0.06393986,  0.03835683,  0.27300382,  0.20126796,
       -0.18871824, -0.46391612, -0.05353734,  0.09570438, -0.08357211,
       -0.22457963, -0.10164903, -0.20384295, -0.09874244, -0.28652927,
       -0.14548737,  0.07028065, -0.12599069, -0.09836089, -0.46259758,
       -0.20326053, -0.17200723, -0.32518145, -0.21403748, -0.14741762,
        0.16457254,  0.1783335 , -0.01016353, -0.24974939,  0.03501152,
       -0.18650782,  0.38612494, -0.29372042,  0.0189934 , -0.1772119 ,
       -0.0660537 ,  0.03066284, -0.16756846, -0.3507783 , -0.33846372,
        0.04048376,  0.06850696, -0.56933594,  0.14382774,  0.17993955,
       -0.27806306, -0.02857106, -0.33935022, -0.26541725, -0.00

In [15]:
model.wv.most_similar("könyv")

[('tőled', 0.9765174388885498),
 ('szeretni', 0.975473940372467),
 ('képzelni', 0.970724880695343),
 ('foglalkozni', 0.9704634547233582),
 ('hova', 0.9704198837280273),
 ('csinálunk', 0.9690806865692139),
 ('abba', 0.9675807952880859),
 ('szerettem', 0.9674118161201477),
 ('gondolj', 0.9659615755081177),
 ('egyet', 0.9657123684883118)]
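
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough illustration of what that number measures, here is a standalone sketch in plain Python; `cosine_similarity` and the toy vectors are hypothetical and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, for illustration only
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # opposite directions
print(same_direction, orthogonal, opposite)
```

A value near 1 means the two vectors point in almost the same direction; the measure ignores vector length.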

## References

* Magyar Webkorpusz 2.0: *Nemeskey, Dávid Márk (2020). “Natural Language Processing methods for Language Modeling”. PhD thesis. Eötvös Loránd University.*

* Tutorial for implementing word2vec: https://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/