# Getting Started

First, install the library with the `train` extras, which are needed to train models:
```
pip install -e "git+https://github.com/bennokr/minimel.git#egg=minimel[train]"
```

In [1]:
wiki = 'iawiki-latest' # use Interlingua language Wikipedia version to test
root = 'wiki/' + wiki
!mkdir -p $root
!wikimapper download $wiki --dir $root
outdb = f'{root}/index_{wiki}.db'
!wikimapper create $wiki --dumpdir $root --target $outdb

2024-05-22 23:31:53,738 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/iawiki/latest/iawiki-latest-page.sql.gz] to [wiki/iawiki-latest/iawiki-latest-page.sql.gz]
2024-05-22 23:32:02,031 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/iawiki/latest/iawiki-latest-page_props.sql.gz] to [wiki/iawiki-latest/iawiki-latest-page_props.sql.gz]
2024-05-22 23:32:04,885 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/iawiki/latest/iawiki-latest-redirect.sql.gz] to [wiki/iawiki-latest/iawiki-latest-redirect.sql.gz]
2024-05-22 23:32:06,819 - wikimapper.processor - INFO - Creating index for [iawiki-latest] in [wiki/iawiki-latest/index_iawiki-latest.db]
2024-05-22 23:32:06,822 - wikimapper.processor - INFO - Parsing pages dump
2024-05-22 23:32:07,209 - wikimapper.processor - INFO - Creating database index on 'wikipedia_title'
2024-05-22 23:32:07,237 - wikimapper.processor - INFO - Parsing page properties dump
2024-05-22 23:32:07,
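
The index written by `wikimapper create` is a SQLite database, so it can be inspected with Python's standard library. The sketch below assumes the file contains a `mapping` table with `wikipedia_title` and `wikidata_id` columns. That schema is an assumption (only the `wikipedia_title` name appears in the log above), and `wikidata_id_for_title` is a hypothetical helper, not part of wikimapper or minimel.

```python
import sqlite3


def wikidata_id_for_title(db_path, title):
    """Return the Wikidata ID stored for a Wikipedia title, or None if absent.

    Assumed schema (not confirmed by this notebook):
    table `mapping` with columns `wikipedia_title` and `wikidata_id`.
    """
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT wikidata_id FROM mapping WHERE wikipedia_title = ?",
            (title,),
        ).fetchone()
    finally:
        con.close()
    return row[0] if row else None
```

With the variables from the cell above, a call would look like `wikidata_id_for_title(outdb, 'Interlingua')`. Titles in such dumps are typically stored with underscores instead of spaces.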

In [2]:
!minimel -v index $outdb

Loading mapping...: 100%|█████████████| 34570/34570 [00:00<00:00, 329200.66it/s]
INFO:root:Building IntDAWG trie...
INFO:root:Saving to wiki/iawiki-latest/index_iawiki-latest.dawg...


In [3]:
wikiname = wiki.split('-')[0]
!wget -P $root https://dumps.wikimedia.org/$wikiname/latest/$wiki-pages-articles.xml.bz2
!bunzip2 $root/$wiki-pages-articles.xml.bz2

--2024-05-22 23:35:13--  https://dumps.wikimedia.org/iawiki/latest/iawiki-latest-pages-articles.xml.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.71
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.71|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10654986 (10M) [application/octet-stream]
Saving to: ‘wiki/iawiki-latest/iawiki-latest-pages-articles.xml.bz2’


2024-05-22 23:39:26 (34,7 KB/s) - Connection closed at byte 8781489. Retrying.

--2024-05-22 23:39:27--  (try: 2)  https://dumps.wikimedia.org/iawiki/latest/iawiki-latest-pages-articles.xml.bz2
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.71|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 10654986 (10M), 1873497 (1,8M) remaining [application/octet-stream]
Saving to: ‘wiki/iawiki-latest/iawiki-latest-pages-articles.xml.bz2’

iawiki-latest-pages 100%[++++++++++++++++===>]  10,16M   404KB/s    in 4,5s    
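
Where `bunzip2` is not available, the decompression step can be approximated in Python. This is a minimal sketch using only the standard library; `decompress_bz2` is a hypothetical helper name, not something minimel provides.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file to dst_path without loading it all into memory."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path
```

With the notebook's variables, `decompress_bz2(f'{root}/{wiki}-pages-articles.xml.bz2', f'{root}/{wiki}-pages-articles.xml')` would produce the XML file that the next cell uses. Unlike `bunzip2`, this sketch leaves the compressed file in place.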



In [4]:
dump = f'{root}/{wiki}-pages-articles.xml'
dawg = f'{root}/index_{wiki}.dawg'
!minimel -v get-paragraphs -n 100 $dump $dawg 

INFO:root:Finished in 30s
INFO:root:Wrote 100 partitions


In [7]:
lang = wiki.split('wiki')[0]
disambigpages = f'{root}/ents-disambig.txt'
!minimel -v query-pages $lang -o $disambigpages

INFO:root:Writing to wiki/iawiki-latest/ents-disambig.txt
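
The cells so far derive several names from the single `wiki` string by splitting and formatting it. The sketch below gathers those derivations in one place to make the pattern explicit. `derive_names` is a hypothetical helper; the dictionary keys mirror the notebook's variable names.

```python
def derive_names(wiki, base="wiki/"):
    """Collect the names and paths the notebook derives from a dump name."""
    root = base + wiki
    return {
        "wikiname": wiki.split("-")[0],   # e.g. 'iawiki', used in the dump URL
        "lang": wiki.split("wiki")[0],    # e.g. 'ia', passed to query-pages
        "root": root,
        "outdb": f"{root}/index_{wiki}.db",
        "dawg": f"{root}/index_{wiki}.dawg",
        "dump": f"{root}/{wiki}-pages-articles.xml",
        "disambigpages": f"{root}/ents-disambig.txt",
    }


names = derive_names("iawiki-latest")
print(names["wikiname"], names["lang"])  # iawiki ia
```

Switching to another Wikipedia edition should then only require changing the `wiki` string, assuming the dump follows the same naming scheme.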


In [8]:
!minimel -v get-disambig -n 100 $dump $dawg $disambigpages

INFO:root:Extracting disambiguation links...
INFO:root:Finished in 2s
INFO:root:Writing to wiki/iawiki-latest/disambig.json


In [9]:
paragraphlinks = f'{root}/{wiki}-paragraph-links/'
!minimel -v count $paragraphlinks

INFO:root:Counting links...
INFO:root:Finished in 6s
INFO:root:Got 32602 counts.
INFO:root:Aggregating...
INFO:root:Finished in 10s
INFO:root:Writing to wiki/iawiki-latest/count.min2.json
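
To give an intuition for what a link-counting step might produce, here is a toy sketch that tallies how often each name links to each target and drops rare pairs. It rests on two assumptions: that the counts are keyed by (name, target) pairs, and that the `min2` in the output file name reflects a minimum count of 2. The function `count_links` and the data are made up for illustration and are not minimel's implementation.

```python
from collections import Counter, defaultdict


def count_links(links, min_count=2):
    """Tally (name, target) pairs and keep those seen at least min_count times."""
    pair_counts = Counter(links)
    by_name = defaultdict(dict)
    for (name, target), n in pair_counts.items():
        if n >= min_count:
            by_name[name][target] = n
    return dict(by_name)


# Made-up anchor texts and entity identifiers
toy_links = [
    ("Paris", "ENT_A"), ("Paris", "ENT_A"), ("Paris", "ENT_B"),
    ("Roma", "ENT_C"), ("Roma", "ENT_C"),
]
print(count_links(toy_links))
```

In this toy data, the single ("Paris", "ENT_B") link falls below the threshold and is dropped.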


In [10]:
# Get Wikidata IDs for disambiguation and list articles
badent = f'{root}/badent.txt'
!minimel query-pages $lang -q -o $badent

In [11]:
disambigfile = f'{root}/disambig.json'
countfile = f'{root}/count.min2.json'
!minimel -v clean -b $badent $outdb $disambigfile $countfile

Counting entities...: 100%|███████████| 11560/11560 [00:00<00:00, 178917.02it/s]
INFO:root:Removing 133 bad entities
Loading labels...: 100%|███████████████| 34570/34570 [00:00<00:00, 97792.93it/s]
Filtering names...: 100%|██████████████| 11498/11498 [00:00<00:00, 17444.66it/s]
INFO:root:Filtering out 1 bad names
INFO:root:Keeping 11497 good names
INFO:root:Writing to wiki/iawiki-latest/clean.json


In [12]:
cleanfile = f'{root}/clean.json'
!minimel -v vectorize $paragraphlinks $cleanfile

INFO:root:Vectorizing training examples for 286 ambiguous names
INFO:root:Writing to wiki/iawiki-latest/vec.clean.dat.parts
INFO:root:Finished in 3s
INFO:root:Wrote 34 partitions
INFO:root:Concatenating to wiki/iawiki-latest/vec.clean.dat
Concatenating: 100%|██████████████████████████| 34/34 [00:00<00:00, 3840.94it/s]


In [13]:
vecfile = f'{root}/vec.clean.dat'
!minimel -v train $vecfile

INFO:root:Writing to wiki/iawiki-latest/model.20b.vw
creating quadratic features for pairs: ls sf
final_regressor = wiki/iawiki-latest/model.20b.vw
creating cache_file = wiki/iawiki-latest/vec.clean.dat.cache
Reading datafile = wiki/iawiki-latest/vec.clean.dat
num sources = 1
Num weight bits = 20
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
Enabled learners: gd, scorer-identity, csoaa_ldf-prob, shared_feature_merger
Input label = CS
Output pred = SCALARS
average  since         example        example        current        current  current
loss     last          counter         weight          label        predict features
0.000000 0.000000            1            1.0        unknown              0     1414
0.000000 0.000000            2            2.0        unknown              0       24
0.000000 0.000000            4            4.0        unknown              0      348
0.000000 0.000000            8            8.0        unknown              0       12
0.12
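
The `Num weight bits = 20` line in the training log, and the `20b` in the model file name, refer to the size of the hashed weight table: features are hashed into one of 2^20 slots. The sketch below illustrates that idea only. It uses `zlib.crc32` as a stand-in hash, which is not the hash function Vowpal Wabbit uses, and `feature_index` is a hypothetical name.

```python
import zlib


def feature_index(feature, bits=20):
    """Map a feature string to a slot in a table of 2**bits weights."""
    mask = (1 << bits) - 1  # keep only the lowest `bits` bits of the hash
    return zlib.crc32(feature.encode("utf-8")) & mask


print(1 << 20)  # 1048576 slots for 20 weight bits
idx = feature_index("some_feature")
print(0 <= idx < (1 << 20))  # True
```

More bits mean fewer hash collisions between features, at the cost of a larger model file.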

In [16]:
modelfile = f'{root}/model.20b.vw'
!minimel -v run --evaluate -o /dev/null $dawg $cleanfile $modelfile $paragraphlinks/*

Predicting: 100%|███████████████████████| 59765/59765 [00:17<00:00, 3471.64it/s]
INFO:root:,,0
micro,precision,0.909326061550448
micro,recall,0.909326061550448
micro,fscore,0.909326061550448
macro,precision,0.9236526246023489
macro,recall,0.9062367026135526
macro,fscore,0.9121998587060755
,support,192525.0
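
The three micro-averaged values above are identical. That is the expected pattern when every evaluated mention has exactly one gold label and one predicted label: each wrong prediction counts once as a false positive and once as a false negative, so micro precision, recall and F-score all reduce to accuracy. Whether minimel computes its scores exactly this way is an assumption; the sketch below only illustrates the standard definitions on made-up labels, with hypothetical helper names.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_macro(gold, pred):
    """Micro- and macro-averaged (precision, recall, F1) for single-label data."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the wrongly predicted label gains a false positive
            fn[g] += 1  # the missed gold label gains a false negative
    labels = sorted(set(gold) | set(pred))
    # Micro: pool the counts over all labels, then compute the scores once
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute the scores per label, then take the unweighted mean
    per_label = [prf(tp[l], fp[l], fn[l]) for l in labels]
    macro = tuple(sum(vals) / len(labels) for vals in zip(*per_label))
    return micro, macro


gold = ["a", "a", "a", "b"]
pred = ["a", "a", "b", "b"]
micro, macro = micro_macro(gold, pred)
print(micro)  # all three values equal the accuracy, 3/4
print(macro)
```

Macro averaging weights every label equally, so rare labels influence it more than they influence the micro average, which is why the macro values in the output above differ from one another and from the micro values.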

