# Clustering, Most Informative Binary Cluster

In this notebook, we look for the feature selection that is most informative for sorting words into one of two clusters. This builds on the initial experiments notebook, where we saw that selecting different basis elements (e.g. subject, object, adjunct) resulted in different kinds of clusters.

In this notebook, we test individual bases to the exclusion of all others. The idea is that collocability with different basis elements operates on different levels. To find the information most useful for verb classification, we cluster subject-only, object-only, and complement-only verbal spaces into two clusters each and compare the relative sizes of the two clusters. The goal is to find the cooccurring feature that reveals the most basic and polarizing division between the verb classes.

As a thought experiment, consider living versus non-living entities. Hypothetically, a subject-only verbal space could show a strong division between verbs that require living subjects (e.g. קרא or הלך) and those that do not. As we saw in the experiments notebook, however, verbs such as הלך show remarkable flexibility and extensibility in this area, so it remains to be seen whether the hypothesis will stand. This notebook tests and develops that hypothesis.
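The size comparison described above can be sketched with a small helper. This is only an illustration: `cluster_shares` and the toy label array are hypothetical and are not part of `project_code`.

```python
import numpy as np

def cluster_shares(labels, n_clusters=2):
    """Return (counts, shares) per cluster id for an array of cluster labels."""
    labels = np.asarray(labels)
    # Count how many items carry each label; minlength keeps empty clusters visible.
    counts = np.bincount(labels, minlength=n_clusters)
    shares = counts / labels.size
    return counts, shares

# Hypothetical labels for ten verbs, as a two-cluster model might assign them.
toy_labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
counts, shares = cluster_shares(toy_labels)
print(counts, shares)
```

Under this reading, shares close to 0.5 each would point to a strong binary division, while shares close to 0 and 1 would suggest that the basis does little to split the verbs.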

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from tf.fabric import Fabric
from tf.extra.bhsa import Bhsa
from project_code.experiments import VerbSubjOnly, VerbObjOnly, VerbCmplOnly
from project_code.semspace import SemSpace, get_lex
from project_code.kmedoids.kmedoids import kMedoids

In [2]:
bhsa_data_paths=['~/github/etcbc/bhsa/tf/c',
                 '~/github/semantics/project_code/lingo/heads/tf/c']
TF = Fabric(bhsa_data_paths)
tf_api = TF.load('''
                function lex vs language
                pdp freq_lex gloss domain ls
                heads prep_obj mother rela
                typ sp st
              ''', silent=True)

tf_api.makeAvailableIn(globals())
B = Bhsa(api=tf_api, name='phase2_initial_experiments', version='c')

This is Text-Fabric 3.4.6
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

116 features found and 0 ignored


**Documentation:** [BHSA](https://etcbc.github.io/bhsa) | [Feature docs](https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html) | [BHSA API](https://github.com/Dans-labs/text-fabric/wiki/Bhsa) | [Text-Fabric API](https://github.com/Dans-labs/text-fabric/wiki/api) | [Search Reference](https://github.com/Dans-labs/text-fabric/wiki/api#search-template-introduction)

## Initial Go

Set up the spaces and cluster each one with KMeans, using `n_clusters=2`.

In [4]:
subj_only = VerbSubjOnly(tf_api=tf_api)
obj_only = VerbObjOnly(tf_api=tf_api)
cmpl_only = VerbCmplOnly(tf_api=tf_api)
for name, exp in (('subj', subj_only), ('obj', obj_only), ('cmpl', cmpl_only)):
    print(f'{name} experiment is ready with a size of {exp.data.shape}')
    
print('\nbuilding subject-only space...')
s_space = SemSpace(subj_only, info=200000)
print('\nbuilding object-only space...')
o_space = SemSpace(obj_only, info=300000)
print('\nbuilding complement-only space...')
c_space = SemSpace(cmpl_only, info=200000)

subj experiment is ready with a size of (1053, 513)
obj experiment is ready with a size of (2033, 465)
cmpl experiment is ready with a size of (3462, 601)

building subject-only space...
  0.00s Beginning all calculations...
  0.00s beginning Loglikelihood calculations...
   |     3.17s at iteration 200000
   |     5.83s at iteration 400000
  7.36s FINISHED loglikelihood at iteration 540189
  7.36s beginning PMI calculations...
  9.48s FINISHED PMI...
  9.99s data gathering complete!

building object-only space...
  0.00s Beginning all calculations...
  0.00s beginning Loglikelihood calculations...
   |     3.78s at iteration 300000
   |     7.77s at iteration 600000
   |       11s at iteration 900000
    12s FINISHED loglikelihood at iteration 945345
    12s beginning PMI calculations...
    14s FINISHED PMI...
    15s data gathering complete!

building complement-only space...
  0.00s Beginning all calculations...
  0.01s beginning Loglikelihood calculations...
   |     3.24s at iter

In [10]:
for name, space in (('subj', s_space), ('obj', o_space), ('cmpl', c_space)):

    # Replace NaN with 0 (the default fill value) on a copy of the matrix.
    # Note: the second positional argument of np.nan_to_num is `copy`, not the fill value.
    pmi = np.nan_to_num(space.pairwise_pmi)
    kmeans = KMeans(n_clusters=2, random_state=0).fit(pmi)
    cluster_1_count = kmeans.labels_[kmeans.labels_ == 0].shape[0]
    cluster_2_count = kmeans.labels_[kmeans.labels_ == 1].shape[0]

    # Report the absolute size and the share of each cluster.
    print(f'{name} space sizes:')
    print(f'\tcluster_1 size: {cluster_1_count} ({round(cluster_1_count / kmeans.labels_.shape[0], 3)})')
    print(f'\tcluster_2 size: {cluster_2_count} ({round(cluster_2_count / kmeans.labels_.shape[0], 3)})\n')

subj space sizes:
	cluster_1 size: 20 (0.039)
	cluster_2 size: 493 (0.961)

obj space sizes:
	cluster_1 size: 447 (0.961)
	cluster_2 size: 18 (0.039)

cmpl space sizes:
	cluster_1 size: 586 (0.975)
	cluster_2 size: 15 (0.025)



With PMI scores, KMeans places about 96 to 98 percent of the verbs in a single cluster in each of the subject-only, object-only, and complement-only verb spaces, so there is hardly any meaningful separation amongst the verb groups. Below we try the same with Jaccard distance.
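For readers unfamiliar with the measure, here is a minimal sketch of the textbook Jaccard distance between two cooccurrence vectors, reduced to presence or absence. How `SemSpace` actually computes `pairwise_jaccard` is not shown in this notebook, so the function and the toy counts below are assumptions made for illustration only.

```python
import numpy as np

def jaccard_distance(a, b):
    """Jaccard distance between two count vectors, using presence/absence only."""
    a = np.asarray(a) > 0
    b = np.asarray(b) > 0
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Convention chosen for this sketch: two empty vectors count as identical.
        return 0.0
    intersection = np.logical_and(a, b).sum()
    return 1.0 - intersection / union

# Hypothetical counts of three verbs against four subject lexemes.
verb_a = [3, 0, 1, 0]
verb_b = [1, 0, 0, 2]
verb_c = [0, 5, 0, 0]

print(jaccard_distance(verb_a, verb_b))  # one shared lexeme out of three attested
print(jaccard_distance(verb_a, verb_c))  # no shared lexemes
```

Because only the presence of a cooccurrence matters here, a verb attested once with a lexeme is treated the same as a verb attested many times with it, which is one way this measure differs from PMI.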

In [18]:
for name, space in (('subj', s_space), ('obj', o_space), ('cmpl', c_space)):

    # Replace NaN with 0 (the default fill value) on a copy of the matrix.
    # Note: the second positional argument of np.nan_to_num is `copy`, not the fill value.
    jaccard = np.nan_to_num(space.pairwise_jaccard)
    kmeans = KMeans(n_clusters=2, random_state=0).fit(jaccard)
    cluster_1_count = kmeans.labels_[kmeans.labels_ == 0].shape[0]
    cluster_2_count = kmeans.labels_[kmeans.labels_ == 1].shape[0]

    # Report the absolute size and the share of each cluster.
    print(f'{name} space sizes:')
    print(f'\tcluster_1 size: {cluster_1_count} ({round(cluster_1_count / kmeans.labels_.shape[0], 3)})')
    print(f'\tcluster_2 size: {cluster_2_count} ({round(cluster_2_count / kmeans.labels_.shape[0], 3)})\n')

subj space sizes:
	cluster_1 size: 37 (0.072)
	cluster_2 size: 476 (0.928)

obj space sizes:
	cluster_1 size: 411 (0.884)
	cluster_2 size: 54 (0.116)

cmpl space sizes:
	cluster_1 size: 30 (0.05)
	cluster_2 size: 571 (0.95)



The model creates quite a bit more separation with Jaccard distance. The object-only space has 88% and 12% of its verbs in the respective clusters. The subject-only space performs second best (7% and 93%), and the complement-only space performs worst (5% and 95%).
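One way to make this ranking explicit is to compare the share of verbs that land in the smaller cluster of each space. The sketch below uses the cluster sizes printed in the Jaccard output above; `minority_share` is a hypothetical helper, and treating a larger minority share as "better" is an assumption of this sketch rather than an established criterion.

```python
# Cluster sizes copied from the Jaccard-based KMeans output above.
jaccard_sizes = {
    'subj': (37, 476),
    'obj': (411, 54),
    'cmpl': (30, 571),
}

def minority_share(sizes):
    """Share of all items that fall in the smaller cluster."""
    return min(sizes) / sum(sizes)

# Rank the spaces from most balanced to least balanced.
ranking = sorted(
    jaccard_sizes,
    key=lambda name: minority_share(jaccard_sizes[name]),
    reverse=True,
)

for name in ranking:
    print(f'{name}: {minority_share(jaccard_sizes[name]):.3f}')
```

A perfectly even split would give a minority share of 0.5, so all three spaces remain far from a balanced binary division.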