# nlg: a Python package for analogy
This notebook gives usage examples for the `nlg` package.  
`nlg` is a Python 3 package containing modules and functions related to analogy.  
Its main use is to extract analogies from a given text.

___
## Installation
Please follow the instructions in the __README__ file.
- You need to first install the Python package: `fast_distance`.
- Depending on your environment, you may need to install __Cython__.

After installing the package successfully, we import it.

In [53]:
import nlg

___
## Analogy

### Analogy class
The `nlg.Analogy` module contains the `Analogy` class, which represents an analogy between four objects *A*, *B*, *C* and *D*.  
The analogy is written as follows.  
> *A* : *B* :: *C* : *D*  

It is read as follows.  

> *A* is to *B* as *C* is to *D*


In [54]:
from nlg.Analogy import Analogy
term_A = "makan"
term_B = "dimakan"
term_C = "minum"
term_D = "diminum"
analogy = Analogy.fromTerms(term_A, term_B, term_C, term_D)
print(analogy)

makan : dimakan :: minum : diminum


### Solving analogies
We can use the `solve_analogy` function in `nlg.Analogy` to solve an analogical equation:  
> given three terms *A*, *B* and *C*, coin the fourth term *D*  

This function wraps the `solvenlg` function provided by the C module.
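
To make the idea of an analogical equation concrete, here is a deliberately naive sketch that only handles a pure prefix change or a pure suffix change between *A* and *B*. It is not the algorithm behind `solvenlg`, and the names `common_prefix_len` and `naive_solve` are hypothetical helpers that do not exist in `nlg`.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def naive_solve(a, b, c):
    """Toy solver for A : B :: C : x, limited to prefix or suffix changes."""
    candidates = []

    # Suffix change: A = p + sa and B = p + sb; C must end with sa.
    p = common_prefix_len(a, b)
    sa, sb = a[p:], b[p:]
    if c.endswith(sa):
        candidates.append(c[:len(c) - len(sa)] + sb)

    # Prefix change: A = pa + s and B = pb + s; C must start with pa.
    s = common_prefix_len(a[::-1], b[::-1])
    pa, pb = a[:len(a) - s], b[:len(b) - s]
    if c.startswith(pa):
        d = pb + c[len(pa):]
        if d not in candidates:
            candidates.append(d)

    return candidates


print(naive_solve("makan", "dimakan", "minum"))   # prefix change: di-
print(naive_solve("makan", "makanan", "minum"))   # suffix change: -an
```

A real solver also has to cope with infixes, reduplication and several simultaneous changes, which this sketch ignores.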

In [55]:
from nlg.Analogy import solve_analogy
term_A = "makan"
term_B = "dimakan"
term_C = "minum"
term_D_candidates = solve_analogy(term_A, term_B, term_C)
print(term_D_candidates)

['diminum']


___
## Representing strings as vectors
To get the vector representation of a given string, you can use the Python script shipped with the package: __*Strings2Vectors.py*__  
Let us produce vector representations for a set of words contained in a file.

In [56]:
!cat toy_data/id.test.words

air
anto
beli
bola
cilok
dan
di
dia
dibeli
dimakan
diminum
enak
es
itu
juga
main
mainan
makan
makanan
melihat
memakan
memang
meminum
minum
minuman
nasi
olahraga
pasar
selesai
senang
setelah
suka

In [57]:
! python scripts/Strings2Vectors.py <toy_data/id.test.words -V

# Reading file...
# Number of lines read: 32	
# Building vector with feature...
#	- char : True
#	- token: False
#	- morph: False
#	- lemma: False
air	1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0
anto	1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0
beli	0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0
bola	1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0
cilok	0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0
dan	1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
di	0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
dia	1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
dibeli	0 1 0 1 1 0 0 2 0 0 1 0 0 0 0 0 0 0 0
dimakan	2 0 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0
diminum	0 0 0 1 0 0 0 2 0 0 0 2 1 0 0 0 0 0 1
enak	1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0
es	0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
itu	0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1
juga	1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1
main	1 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0
mainan	2 0 0 0 0 0 0 1 0 0 0 1 2 0 0 0 0 0 0
makan	2 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0
makanan	3 0 0 0 0 0 0 0 0 1 0 1 2 0 0 0 0 0 0
melihat	1 0 0 0 1 0 1 1 0 0 1

The output format is `term<TAB>vector`, where the values inside the vector are separated by a `<SPACE>`.  
Now, we will do the same using the `Vectors` class in the `nlg.Vector` module.

In [58]:
from nlg.Vector import Vectors

In [59]:
filename = "toy_data/id.test.words"
set_of_words = [line.strip() for line in open(filename)]

print(f"# Number of lines: {len(set_of_words)}")
for i, elem in enumerate(set_of_words):
    print(f"\t{i+1:2}. {elem}")

# Number of lines: 32
	 1. air
	 2. anto
	 3. beli
	 4. bola
	 5. cilok
	 6. dan
	 7. di
	 8. dia
	 9. dibeli
	10. dimakan
	11. diminum
	12. enak
	13. es
	14. itu
	15. juga
	16. main
	17. mainan
	18. makan
	19. makanan
	20. melihat
	21. memakan
	22. memang
	23. meminum
	24. minum
	25. minuman
	26. nasi
	27. olahraga
	28. pasar
	29. selesai
	30. senang
	31. setelah
	32. suka


### Feature vector: characters
In this example, we use the number of occurrences of each character of the alphabet as the features of the vectors.
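
As a rough, package-independent illustration of this feature, the sketch below counts characters over the sorted set of characters observed in the word list. That choice of alphabet is an assumption; it is consistent with the 19 columns printed for the toy data, but the actual `Vectors` implementation may build its dimensions differently. The function name `char_count_vectors` is hypothetical.

```python
from collections import Counter


def char_count_vectors(words):
    """Map each word to its character counts over the sorted observed alphabet."""
    alphabet = sorted(set("".join(words)))
    table = {}
    for word in words:
        counts = Counter(word)
        # A Counter returns 0 for characters that do not occur in the word.
        table[word] = [counts[ch] for ch in alphabet]
    return alphabet, table


alphabet, table = char_count_vectors(["makan", "dimakan", "minum", "diminum"])
print(" ".join(alphabet))
for word, vec in table.items():
    # Same layout as the scripts: term<TAB>space-separated values.
    print(word + "\t" + " ".join(map(str, vec)))
```

With only four words the alphabet has seven characters, so the vectors are shorter than those produced from the full toy file.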

In [60]:
vectors = Vectors.fromFile(lines=set_of_words, char_feature=True)
print(vectors)

air	1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0
anto	1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0
beli	0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0
bola	1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0
cilok	0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0
dan	1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
di	0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
dia	1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
dibeli	0 1 0 1 1 0 0 2 0 0 1 0 0 0 0 0 0 0 0
dimakan	2 0 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0
diminum	0 0 0 1 0 0 0 2 0 0 0 2 1 0 0 0 0 0 1
enak	1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0
es	0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
itu	0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1
juga	1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1
main	1 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0
mainan	2 0 0 0 0 0 0 1 0 0 0 1 2 0 0 0 0 0 0
makan	2 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0
makanan	3 0 0 0 0 0 0 0 0 1 0 1 2 0 0 0 0 0 0
melihat	1 0 0 0 1 0 1 1 0 0 1 1 0 0 0 0 0 1 0
memakan	2 0 0 0 1 0 0 0 0 1 0 2 1 0 0 0 0 0 0
memang	1 0 0 0 1 1 0 0 0 0 0 2 1 0 0 0 0 0 0
meminum	0 0 0 0 1 0 0 1 0 0 0 3 1 0 0 0

Alternatively, we can use the `words2vectors` function provided in the `nlg.pipeline` module.

In [61]:
from nlg import pipeline
vectors = pipeline.words2vectors(open(filename), verbose=True)
print(vectors)


# Building vector with feature...
#	- char : True
#	- token: False
#	- morph: False
#	- lemma: False


air	1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0
anto	1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0
beli	0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0
bola	1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0
cilok	0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0
dan	1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
di	0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
dia	1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
dibeli	0 1 0 1 1 0 0 2 0 0 1 0 0 0 0 0 0 0 0
dimakan	2 0 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0
diminum	0 0 0 1 0 0 0 2 0 0 0 2 1 0 0 0 0 0 1
enak	1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0
es	0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
itu	0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1
juga	1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1
main	1 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0
mainan	2 0 0 0 0 0 0 1 0 0 0 1 2 0 0 0 0 0 0
makan	2 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0
makanan	3 0 0 0 0 0 0 0 0 1 0 1 2 0 0 0 0 0 0
melihat	1 0 0 0 1 0 1 1 0 0 1 1 0 0 0 0 0 1 0
memakan	2 0 0 0 1 0 0 0 0 1 0 2 1 0 0 0 0 0 0
memang	1 0 0 0 1 1 0 0 0 0 0 2 1 0 0 0 0 0 0
meminum	0 0 0 0 1 0 0 1 0 0 0 3 1 0 0 0

### Feature vector: morphosyntactic description
We can also use the morphosyntactic description of a word as features. For example:

> the Indonesian word *makanan* has the part-of-speech tag __noun__.

To demonstrate this, let us use the SIGMORPHON data for English (unfortunately, it does not include Indonesian).  
It is formatted as follows.

> __LEMMA__ `<TAB>` __INFLECTED_FORM__ `<TAB>` __MORPHOSYNTACTIC_DESCRIPTION__
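
A minimal sketch of how one such line can be split into its three fields, assuming the tags of the morphosyntactic description are separated by `;` as in the data shown in this notebook. The function `parse_sigmorphon_line` is a hypothetical helper, not part of `nlg`; returning `None` for lines without exactly three fields is likewise an assumption.

```python
def parse_sigmorphon_line(line):
    """Split 'lemma<TAB>form<TAB>tags' into (lemma, form, [tags]).

    Returns None when the line does not have exactly three tab-separated fields.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return None
    lemma, form, msd = fields
    return lemma, form, msd.split(";")


print(parse_sigmorphon_line("stodge\tstodges\tV;3;SG;PRS"))
print(parse_sigmorphon_line("imprison\timprisoned\tV;PST"))
```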

In [62]:
sigmorphon_filename = 'toy_data/english-sigmorphon'
sigmorphon_lines = [line.strip() for line in open(sigmorphon_filename)]

print(f"# Number of lines: {len(sigmorphon_lines)}")
for i, elem in enumerate(sigmorphon_lines):
    print(f"\t{i+1:2}. {elem}")

# Number of lines: 100
	 1. dreep	dreep	V;NFIN
	 2. charcoal	charcoal	V;NFIN
	 3. stodge	stodges	V;3;SG;PRS
	 4. biotransform	biotransform	V;NFIN
	 5. disallow	disallowing	V;V.PTCP;PRS
	 6. precut	precut	V;V.PTCP;PST
	 7. outmanœuvre	outmanœuvred	V;PST
	 8. unsnib	unsnibbing	V;V.PTCP;PRS
	 9. Afghanize	Afghanized	V;PST
	10. redescribe	redescribes	V;3;SG;PRS
	11. overspeculate	overspeculates	V;3;SG;PRS
	12. reënter	reënters	V;3;SG;PRS
	13. waller	wallering	V;V.PTCP;PRS
	14. carboxylate	carboxylating	V;V.PTCP;PRS
	15. imprison	imprisoned	V;PST
	16. helicopt	helicopted	V;PST
	17. tut	tutted	V;V.PTCP;PST
	18. misdoom	misdooms	V;3;SG;PRS
	19. mush	mush	V;NFIN
	20. billhook	billhook	V;NFIN
	21. ingrave	ingraved	V;PST
	22. estheticize	estheticize	V;NFIN
	23. off-split	off-split	V;PST
	24. excecate	excecating	V;V.PTCP;PRS
	25. hegemonise	hegemonised	V;V.PTCP;PST
	26. overregularize	overregularized	V;PST
	27. innoculate	innoculates	V;3;SG;PRS
	28. mopy	mopying	V;V.PTCP;PRS
	29. unhyphenate	unhy

We can now represent them as vectors using the `fromSigmorphonFile` constructor of the `Vectors` class.  
Notice that we can control which kinds of features are embedded into the vectors.

In [63]:
char_feature = True
morph_feature = True
lemma_feature = True
lemma_dim = True

sigmorphon_vectors = Vectors.fromSigmorphonFile(lines=sigmorphon_lines,
							char_feature=char_feature,
							morph_feature=morph_feature,
							lemma_feature=lemma_feature,
							lemma_dim=lemma_dim)
print(sigmorphon_vectors)

motorcycled	0 0 0 0 2 1 1 0 0 0 0 0 0 1 1 0 2 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
gang-rapes	1 0 2 0 0 0 1 0 2 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
yabby	0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
quicksaving	0 0 1 0 1 0 0 0 1 0 2 0 1 0 0 1 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

### Feature vector: affixes
Instead of characters, we may also use affixes as features in the vector. For example:  

> the Indonesian word *makanan* is derived from
> - the stem *makan*, which is a verb, with
> - the suffix *-an*, which transforms a verb into a noun

To demonstrate this, we will use the MALINDO-Morph dataset for Indonesian affixes.  
The data has already been preprocessed to follow the SIGMORPHON format, so we can use the previous function to create the vector representation.

In [64]:
malindo_filename = 'toy_data/malindo-500.txt'
malindo_lines = [line.strip() for line in open(malindo_filename)]

print(f"# Number of lines: {len(malindo_lines)}")
for i, elem in enumerate(malindo_lines):
    print(f"\t{i+1:2}. {elem}")

# Number of lines: 500
	 1. abad	abadnya	-nya
	 2. ada	mengada-adakan	meN-;-kan;R-penuh
	 3. adeni	adeni
	 4. adik	adik-adik	R-penuh
	 5. afgan	afgan
	 6. akhlak	akhlaknya	-nya
	 7. aki	akinya	-nya
	 8. alang	teralang	ter-
	 9. alat	memperalat	meN-;per-
	10. alih	dialihkah	di-;-kah
	11. amin	mengaminkannya	meN-;-kan;-nya
	12. ampuh	mengampuhkan	meN-;-kan
	13. ampun	terampuninya	ter-;-i;-nya
	14. ancam	ancam-mengancam	meN-;R-penuh
	15. andil	berandil	ber-
	16. anggap	dianggapnya	di-;-nya
	17. anggar	teranggar-anggar	ter-;R-penuh
	18. angguk	dianggukkannya	di-;-kan;-nya
	19. angkat	diangkatkan	di-;-kan
	20. antar	pengantar	peN-
	21. antarklub	antarklub
	22. anti-rezim	anti-rezim
	23. antigen	antigen
	24. apit	mengapit	meN-
	25. apresiasi	terapresiasinya	ter-;-nya
	26. arak	diaraknya	di-;-nya
	27. archi	archi
	28. asah	diasah	di-
	29. atrium	atrium
	30. australia-indonesia	australia-indonesia
	31. babak	membabak	meN-
	32. baca	pembacaan	peN--an
	33. badari	badari
	34. baik	pembaik	peN-
	3

In [65]:
char_feature = True
morph_feature = True
lemma_feature = False
lemma_dim = False

malindo_vectors = Vectors.fromSigmorphonFile(lines=malindo_lines,
							char_feature=char_feature,
							morph_feature=morph_feature,
							lemma_feature=lemma_feature,
							lemma_dim=lemma_dim)
print(malindo_vectors)

## Unparsed lines: 82


posisi	0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 1 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
saksi	0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
curang	0 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
murka	0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
singgung	0 0 0 0 0 0 0 3 0 1 0 0 0 0 2 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
tapak	0 2 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
susur	0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 0 2 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
iring	0 0 0 0 0 0 0 1 0 2 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
suasana	0 3 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

### Feature vector: combination of features
Remember that we can always combine all of the features mentioned above to obtain a rich representation of the words.  
By using __different kinds of feature vectors__, we will also extract __different kinds of analogical clusters and grids__.

___
## Extracting analogical clusters
There are several Python scripts that can be used to extract the analogical clusters:
- __*Words2Clusters.py*__ (from a set of words)
- __*Vectors2Clusters.py*__ (from vectors)

### *Words2Clusters.py*
This Python script receives a file containing one word per line and outputs a list of analogical clusters.
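
To give an intuition of what a cluster is, the sketch below groups word pairs that share the same difference of character counts. This is only an approximation under stated assumptions: an equal count difference is treated here as a necessary condition for an analogy, and the scripts apply further steps (indistinguishables, distance constraints) that are not reproduced. The helper names and the choice to put the shorter word first in each pair are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_difference(a, b):
    """Character-count difference (b minus a) as a hashable signature."""
    diff = Counter(b)
    diff.subtract(Counter(a))
    return tuple(sorted((ch, n) for ch, n in diff.items() if n != 0))


def group_pairs_by_difference(words):
    """Group word pairs by count difference; keep groups of two or more pairs."""
    groups = defaultdict(list)
    for pair in combinations(sorted(words), 2):
        # Orient every pair the same way: shorter word first, then alphabetical.
        a, b = sorted(pair, key=lambda w: (len(w), w))
        groups[count_difference(a, b)].append((a, b))
    return {sig: pairs for sig, pairs in groups.items() if len(pairs) >= 2}


words = ["makan", "dimakan", "minum", "diminum", "beli", "dibeli"]
for pairs in group_pairs_by_difference(words).values():
    print(" :: ".join(f"{a} : {b}" for a, b in pairs))
```

On these six words, one of the groups gathers the pairs `beli : dibeli`, `makan : dimakan` and `minum : diminum`, which correspond to the *di-* cluster in the script output.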

In [66]:
!python scripts/Words2Clusters.py <toy_data/id.test.words -V

# Reading words and computing feature vectors (features=characters)...

# Building vector with feature...
#	- char : True
#	- token: False
#	- morph: False
#	- lemma: False
# Clustering the words according to their feature vectors...
#	- min cluster size: 2
#	- max cluster size: None
# Adding the indistinguishables...
# Checking distance constraints...
minum : makan :: meminum : memakan :: diminum : dimakan :: minuman : makanan
minum : minuman :: main : mainan :: makan : makanan
minum : diminum :: beli : dibeli :: makan : dimakan
beli : makan :: dibeli : dimakan
minum : main :: minuman : mainan
main : makan :: mainan : makanan
meminum : minuman :: memakan : makanan
diminum : minuman :: dimakan : makanan
minum : beli :: diminum : dibeli
meminum : diminum :: memakan : dimakan
minum : meminum :: makan : memakan
# Words2Clusters.py - Processing time: 0:00:00.637924


### *Vectors2Clusters.py*
This Python script receives a file containing a word and its vector representation on each line and outputs a list of analogical clusters.  
In this example, we use __*Strings2Vectors.py*__ to produce the vector representation.

Notice that we can chain the programs in a *pipeline*, so that they communicate with each other through `stdin` and `stdout`.
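
A downstream stage of such a pipeline has to read `term<TAB>vector` lines from its input. The sketch below shows one way to do that in plain Python. It is not the reader used by the scripts; the function name `read_vectors` is hypothetical, and skipping lines that start with `#` is a defensive assumption about where the verbose messages end up.

```python
import io


def read_vectors(stream):
    """Parse 'term<TAB>v1 v2 ...' lines from a text stream into a dict."""
    parsed = {}
    for line in stream:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment-style log lines
        term, _, values = line.partition("\t")
        parsed[term] = [int(v) for v in values.split()]
    return parsed


# Two lines copied from the output shown earlier, preceded by a log line.
sample = io.StringIO(
    "# Reading file...\n"
    "di\t0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0\n"
    "dia\t1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0\n"
)
parsed = read_vectors(sample)
print(parsed["dia"])
```

In a real pipeline stage, `sys.stdin` would take the place of the `io.StringIO` object.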

In [67]:
!python scripts/Strings2Vectors.py <toy_data/id.test.words -V | python scripts/Vectors2Clusters.py -V

# Reading file...
# Number of lines read: 32	
# Building vector with feature...
#	- char : True
#	- token: False
#	- morph: False
#	- lemma: False
# Strings2Vectors.py - Processing time: 0:00:00.050594
# Reading words and their vector representations...
# Clustering the words according to their feature vectors...
#	- min cluster size: 2
#	- max cluster size: None
# Adding the indistinguishables...
# Checking distance constraints...
minum : makan :: meminum : memakan :: diminum : dimakan :: minuman : makanan
minum : minuman :: main : mainan :: makan : makanan
minum : diminum :: beli : dibeli :: makan : dimakan
beli : makan :: dibeli : dimakan
minum : main :: minuman : mainan
main : makan :: mainan : makanan
meminum : minuman :: memakan : makanan
diminum : minuman :: dimakan : makanan
minum : beli :: diminum : dibeli
meminum : diminum :: memakan : dimakan
minum : meminum :: makan : memakan
# Vectors2Clusters.py - Processing time: 0:00:00.004579


Let us now extract the analogical clusters from the vectors created earlier, using the modules available in the package.

In [68]:
from nlg.Cluster import ListOfClusters
from nlg.nlgCluster.StrCluster import ListOfStrClusters

In [69]:
min_clu_size=2
max_clu_size=None

distinguishable_vectors = vectors.get_distinguishables()
list_of_clusters = ListOfClusters.fromVectors(distinguishable_vectors,
			minimal_size=min_clu_size,
			maximal_size=max_clu_size)
list_of_clusters.set_indistinguishables(vectors.indistinguishables)

We then verify the distance constraints.
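
The notebook does not spell out which distance is checked. As an illustration only, the sketch below assumes a distance based on the longest common subsequence (insertions and deletions only) and the two equalities d(A, B) = d(C, D) and d(A, C) = d(B, D). Both the distance and the function names are assumptions; the check performed by `ListOfStrClusters` may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(a, b):
    """Number of insertions and deletions needed to turn a into b."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def satisfies_distance_constraints(a, b, c, d):
    """Check d(A, B) == d(C, D) and d(A, C) == d(B, D)."""
    return (lcs_distance(a, b) == lcs_distance(c, d)
            and lcs_distance(a, c) == lcs_distance(b, d))


print(satisfies_distance_constraints("makan", "dimakan", "minum", "diminum"))
print(satisfies_distance_constraints("makan", "makanan", "minum", "diminum"))
```

The first call compares a genuine analogy; the second mixes a suffix change with a prefix change and fails the second equality.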

In [70]:
list_of_strclusters = ListOfStrClusters.fromListOfClusters(clusters=list_of_clusters,
			minimal_size=min_clu_size,
			maximal_size=max_clu_size)
print(list_of_strclusters)

minum : makan :: meminum : memakan :: diminum : dimakan :: minuman : makanan
minum : minuman :: main : mainan :: makan : makanan
minum : diminum :: beli : dibeli :: makan : dimakan
beli : makan :: dibeli : dimakan
minum : main :: minuman : mainan
main : makan :: mainan : makanan
meminum : minuman :: memakan : makanan
diminum : minuman :: dimakan : makanan
minum : beli :: diminum : dibeli
meminum : diminum :: memakan : dimakan
minum : meminum :: makan : memakan


Now, let us try the `vectors2clusters` function provided by `nlg.pipeline`, which does the same thing.

In [71]:
list_of_strclusters = pipeline.vectors2clusters(vectors,
		min_cluster_size=min_clu_size,
		max_cluster_size=max_clu_size,
		verbose=True)
print(list_of_strclusters)

minum : makan :: meminum : memakan :: diminum : dimakan :: minuman : makanan
minum : minuman :: main : mainan :: makan : makanan
minum : diminum :: beli : dibeli :: makan : dimakan
beli : makan :: dibeli : dimakan
minum : main :: minuman : mainan
main : makan :: mainan : makanan
meminum : minuman :: memakan : makanan
diminum : minuman :: dimakan : makanan
minum : beli :: diminum : dibeli
meminum : diminum :: memakan : dimakan
minum : meminum :: makan : memakan


# Clustering the words according to their feature vectors...
#	- min cluster size: 2
#	- max cluster size: None
# Adding the indistinguishables...
# Checking distance constraints...


___
## Extracting analogical grids
There are several Python scripts that can be used to extract the analogical grids:
- __*Words2Grids.py*__ (from a set of words)
- __*Vectors2Grids.py*__ (from vectors)
- __*Clusters2Grids.py*__ (from analogical clusters)

### *Words2Grids.py*
This Python script receives a file containing one word per line and outputs a list of analogical grids.

In [72]:
!python scripts/Words2Grids.py <toy_data/id.test.words -V

# Reading words and computing feature vectors (features=characters)...

# Building vector with feature...
#	- char : True
#	- token: False
#	- morph: False
#	- lemma: False
# Clustering the words according to their feature vectors...
#	- min cluster size: 2
#	- max cluster size: None
# Adding the indistinguishables...
# Checking distance constraints...
# Building grids...
#	- saturation ≥ 0.000
#	- cluster size ≥ 2
minum : meminum : diminum : minuman :: makan : memakan : dimakan : makanan :: main : None : None : mainan :: beli : None : dibeli : None

# Words2Grids.py - Processing time: 0:00:01.259341


### *Vectors2Grids.py*
This Python script receives a file containing a word and its vector representation on each line and outputs a list of analogical grids.  
In this example, again, we use __*Strings2Vectors.py*__ to produce the vector representation.

In [73]:
!python scripts/Strings2Vectors.py <toy_data/id.test.words -V | python scripts/Vectors2Grids.py -V

# Reading file...
# Number of lines read: 32	
# Building vector with feature...
#	- char : True
#	- token: False
#	- morph: False
#	- lemma: False
# Strings2Vectors.py - Processing time: 0:00:00.051008
# Reading words and their vector representations...
# Clustering the words according to their feature vectors...
#	- min cluster size: 2
#	- max cluster size: None
# Adding the indistinguishables...
# Checking distance constraints...
# Building grids...
#	- saturation ≥ 0.000
#	- cluster size ≥ 2
minum : meminum : diminum : minuman :: makan : memakan : dimakan : makanan :: main : None : None : mainan :: beli : None : dibeli : None

# Vectors2Grids.py - Processing time: 0:00:00.641477


### *Clusters2Grids.py*
This Python script receives a file containing one analogical cluster per line and outputs a list of analogical grids.  
In this example, we use the previous Python script __*Vectors2Clusters.py*__ to produce the clusters.

In [74]:
!python scripts/Strings2Vectors.py <toy_data/id.test.words -V | python scripts/Vectors2Clusters.py -V | python scripts/Clusters2Grids.py -V

# Reading file...
# Number of lines read: 32	
# Building vector with feature...
#	- char : True
#	- token: False
#	- morph: False
#	- lemma: False
# Strings2Vectors.py - Processing time: 0:00:00.064292
# Reading words and their vector representations...
# Clustering the words according to their feature vectors...
#	- min cluster size: 2
#	- max cluster size: None
# Reading clusters...
# Adding the indistinguishables...
# Checking distance constraints...
# Vectors2Clusters.py - Processing time: 0:00:00.004773
# Building grids...
#	- saturation ≥ 0.000
#	- cluster size ≥ 2

# Clusters2Grids.py - Processing time: 0:00:00.098057


Let us now extract the analogical grids from the list of analogical clusters using the module.

In [75]:
from nlg.Grid import ListOfGrids

In [76]:
min_saturation = 0.0 
list_of_grids = ListOfGrids.fromClusters(list_of_strclusters, min_saturation)
print(list_of_grids.pretty_print())

# Grid no.: 1 - {'length': 4, 'width': 4, 'size': 16, 'filled': 12, 'saturation': 0.75}
minum : meminum : diminum : minuman
makan : memakan : dimakan : makanan
beli  :         : dibeli  :
main  :         :         : mainan




Now, let us use the `clusters2grids` function from `nlg.pipeline`.

In [77]:
list_of_grids = pipeline.clusters2grids(list_of_strclusters, saturation=min_saturation, verbose=True)
print(list_of_grids.pretty_print())

# Building grids...
#	- saturation ≥ 0.000
#	- cluster size ≥ 2


# Grid no.: 1 - {'length': 4, 'width': 4, 'size': 16, 'filled': 12, 'saturation': 0.75}
minum : meminum : diminum : minuman
makan : memakan : dimakan : makanan
beli  :         : dibeli  :
main  :         :         : mainan




### Properties of analogical grids
There are two properties of analogical grids: *size* and *saturation*.
- *Size* is simply the total number of cells inside the grid.
- *Saturation* is the ratio of non-empty cells to the total number of cells.
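
The two definitions can be checked by hand on the grid printed above. The sketch below represents a grid as a list of rows with `None` for empty cells; this representation and the function `grid_properties` are illustrative assumptions, not the internal structure of `nlg.Grid`. Which dimension the package calls *length* and which *width* is also an assumption, and it makes no difference for this square grid.

```python
def grid_properties(grid):
    """Compute size and saturation of a rectangular grid with None for empty cells."""
    length = len(grid)                      # number of rows (assumed naming)
    width = len(grid[0]) if grid else 0     # number of columns (assumed naming)
    size = length * width
    filled = sum(cell is not None for row in grid for cell in row)
    saturation = filled / size if size else 0.0
    return {"length": length, "width": width, "size": size,
            "filled": filled, "saturation": saturation}


grid = [
    ["minum", "meminum", "diminum", "minuman"],
    ["makan", "memakan", "dimakan", "makanan"],
    ["beli",  None,      "dibeli",  None],
    ["main",  None,      None,      "mainan"],
]
print(grid_properties(grid))
# {'length': 4, 'width': 4, 'size': 16, 'filled': 12, 'saturation': 0.75}
```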

In [78]:
for i, grid in enumerate(list_of_grids):
    print(f'Grid no {i+1}: {grid.attributes}')
    print(f'  - size = {grid.attributes["size"]}')
    print(f'  - saturation = {grid.attributes["saturation"]}')

Grid no 1: {'length': 4, 'width': 4, 'size': 16, 'filled': 12, 'saturation': 0.75}
  - size = 16
  - saturation = 0.75


# Notes
In this notebook, we showed only the basic use of the scripts and modules.  
Both the Python scripts and the modules expose many more parameters.  
Please look inside the scripts to run more interesting experiments.  
The functions in `nlg.pipeline` are a relatively easy starting point for understanding how to use the `nlg` package.