# Clustering, Most Informative Binary Cluster

In this notebook, we seek the most informative feature selection for grouping words into one of two clusters. This builds on the initial experiments notebook, where we saw that selecting different basis elements (e.g. subject, object, adjunct) resulted in different kinds of clusters.

In this notebook, we test individual bases to the exclusion of all others. The idea is that collocability with different basis elements operates on different levels. To find the information most useful for verbal class, we cluster subject-only, object-only, and complement-only verbal spaces into two clusters each and compare the relative sizes of the two clusters. The goal is to find the co-occurring feature that reveals the most basic and polarizing division between the verb classes.

As a thought experiment, an example might be living versus non-living entities. Hypothetically, subject-only verbal space could show a strong division between verbs that require living subjects (e.g. קרא or הלך) versus those that do not. Though, as we have seen in the experiment notebook, verbs such as הלך show remarkable flexibility and extensibility in this area. Thus, it remains to be seen whether the hypothesis will stand. This notebook intends to test and develop that thesis.
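
The procedure described above (cluster a space into two groups, then compare how many items land in each) can be sketched on synthetic data. This is a minimal illustration rather than the notebook's own code: the helper `cluster_shares` and the toy matrix are hypothetical, and only `KMeans` from scikit-learn matches what the notebook itself uses.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_shares(matrix, n_clusters=2, random_state=0):
    """Return the share of rows assigned to each cluster, smallest first."""
    # replace NaN with 0 so that KMeans accepts the matrix
    data = np.nan_to_num(np.asarray(matrix, dtype=float))
    model = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=10)
    labels = model.fit_predict(data)
    counts = np.bincount(labels, minlength=n_clusters)
    return np.sort(counts / counts.sum())


# toy data (hypothetical): 5 outlying rows and 95 rows near the origin
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(10, 0.1, size=(5, 4)),
    rng.normal(0, 0.1, size=(95, 4)),
])
shares = cluster_shares(toy)
print(shares)
```

On this toy matrix the split is 5% versus 95%. A feature that divides the verbs evenly would presumably push both shares toward 0.5, whereas a lopsided split like this one suggests that a few outliers are being separated from everything else.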

In [23]:
import numpy as np
import pandas as pd
import collections
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from tf.fabric import Fabric
from tf.extra.bhsa import Bhsa
from project_code.experiments import VerbSubjOnly, VerbObjOnly, VerbCmplOnly, VerbSubjOnlyMinLex, VerbObjOnlyMinLex, VerbCmplOnlyMinLex
from project_code.semspace import SemSpace, get_lex
from project_code.kmedoids.kmedoids import kMedoids

In [4]:
bhsa_data_paths=['~/github/etcbc/bhsa/tf/c',
                 '~/github/semantics/project_code/lingo/heads/tf/c']
TF = Fabric(bhsa_data_paths)
tf_api = TF.load('''
                function lex vs language
                pdp freq_lex gloss domain ls
                heads prep_obj mother rela
                typ sp st
              ''', silent=True)

tf_api.makeAvailableIn(globals())
B = Bhsa(api=tf_api, name='phase2_initial_experiments', version='c')

This is Text-Fabric 3.4.6
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

116 features found and 0 ignored


**Documentation:** [BHSA](https://etcbc.github.io/bhsa), [Feature docs](https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html), [BHSA API](https://github.com/Dans-labs/text-fabric/wiki/Bhsa), [Text-Fabric API](https://github.com/Dans-labs/text-fabric/wiki/api), [Search Reference](https://github.com/Dans-labs/text-fabric/wiki/api#search-template-introduction)

## Initial Go

Set up the spaces and cluster with KMeans, where `n_clusters=2`.

In [6]:
subj_only = VerbSubjOnly(tf_api=tf_api)
obj_only = VerbObjOnly(tf_api=tf_api)
cmpl_only = VerbCmplOnly(tf_api=tf_api)
for name, exp in (('subj', subj_only), ('obj', obj_only), ('cmpl', cmpl_only)):
    print(f'{name} experiment is ready with a size of {exp.data.shape}')
    
print('\nbuilding subject-only space...')
s_space = SemSpace(subj_only, info=200000)
print('\nbuilding object-only space...')
o_space = SemSpace(obj_only, info=300000)
print('\nbuilding complement-only space...')
c_space = SemSpace(cmpl_only, info=400000)

subj experiment is ready with a size of (1052, 513)
obj experiment is ready with a size of (2031, 449)
cmpl experiment is ready with a size of (3421, 551)

building subject-only space...
  0.00s Beginning all calculations...
  0.00s beginning Loglikelihood calculations...
   |     3.03s at iteration 200000
   |     5.82s at iteration 400000
  7.43s FINISHED loglikelihood at iteration 539676
  7.43s beginning PMI calculations...
  9.79s FINISHED PMI...
    13s data gathering complete!

building object-only space...
  0.00s Beginning all calculations...
  0.01s beginning Loglikelihood calculations...
   |     3.73s at iteration 300000
   |     7.53s at iteration 600000
   |       11s at iteration 900000
    11s FINISHED loglikelihood at iteration 911919
    11s beginning PMI calculations...
    14s FINISHED PMI...
    15s data gathering complete!

building complement-only space...
  0.00s Beginning all calculations...
  0.01s beginning Loglikelihood calculations...
   |     5.88s at iter

In [7]:
for name, space in (('subj', s_space), ('obj', o_space), ('cmpl', c_space)):
    
    kmeans = KMeans(n_clusters=2, random_state=0).fit(np.nan_to_num(space.pairwise_pmi))
    cluster_1_count = kmeans.labels_[kmeans.labels_ == 0].shape[0]
    cluster_2_count = kmeans.labels_[kmeans.labels_ == 1].shape[0]
    
    print(f'{name} space sizes:')
    print(f'\tcluster_1 size: {cluster_1_count} ({round(cluster_1_count / kmeans.labels_.shape[0], 3)})')
    print(f'\tcluster_2 size: {cluster_2_count} ({round(cluster_2_count / kmeans.labels_.shape[0], 3)})\n')

subj space sizes:
	cluster_1 size: 20 (0.039)
	cluster_2 size: 493 (0.961)

obj space sizes:
	cluster_1 size: 436 (0.971)
	cluster_2 size: 13 (0.029)

cmpl space sizes:
	cluster_1 size: 14 (0.025)
	cluster_2 size: 537 (0.975)



Based on these basic categories, there is hardly any separation among the verbs in the subject-only, object-only, and complement-only spaces: in each case one cluster absorbs more than 96% of the verbs. The subject-only space has a slightly larger secondary cluster (about 1 percentage point more than the others). Below we try the same with Jaccard distance.

In [16]:
for name, space in (('subj', s_space), ('obj', o_space), ('cmpl', c_space)):
    
    kmeans = KMeans(n_clusters=2, random_state=0).fit(np.nan_to_num(space.pairwise_jaccard))
    cluster_1_count = kmeans.labels_[kmeans.labels_ == 0].shape[0]
    cluster_2_count = kmeans.labels_[kmeans.labels_ == 1].shape[0]
    
    print(f'{name} space sizes:')
    print(f'\tcluster_1 size: {cluster_1_count} ({round(cluster_1_count / kmeans.labels_.shape[0], 3)})')
    print(f'\tcluster_2 size: {cluster_2_count} ({round(cluster_2_count / kmeans.labels_.shape[0], 3)})\n')

subj space sizes:
	cluster_1 size: 37 (0.072)
	cluster_2 size: 476 (0.928)

obj space sizes:
	cluster_1 size: 437 (0.973)
	cluster_2 size: 12 (0.027)

cmpl space sizes:
	cluster_1 size: 8 (0.015)
	cluster_2 size: 543 (0.985)



With Jaccard distance the model creates a bit more separation, but only in the subject-only space, whose secondary cluster grows to 7% of all terms. The object-only space comes second with a secondary cluster of about 3%, and the complement-only space has the smallest secondary cluster at 1.5% of all terms. It is perhaps noteworthy that this order runs inversely to the number of rows in each experiment's data as reported above (1,052 for subject-only, 2,031 for object-only, 3,421 for complement-only). So the size of the space may be influencing the numbers.
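
For reference, Jaccard distance compares two items by the overlap of the features they occur with. How `pairwise_jaccard` is computed inside `SemSpace` is not shown in this notebook, so the sketch below assumes the plain set-based definition (one minus intersection over union); the function name and the two toy count vectors are hypothetical.

```python
import numpy as np


def jaccard_distance(a, b):
    """Set-based Jaccard distance between two co-occurrence count vectors."""
    # treat any non-zero count as "co-occurs with this basis element"
    a = np.asarray(a) > 0
    b = np.asarray(b) > 0
    union = np.logical_or(a, b).sum()
    if union == 0:
        # two empty vectors are treated as identical
        return 0.0
    return 1.0 - np.logical_and(a, b).sum() / union


# toy counts (hypothetical) for two verbs over four basis elements
verb_a = [3, 0, 1, 0]
verb_b = [1, 2, 0, 0]
distance = jaccard_distance(verb_a, verb_b)
print(round(distance, 3))
```

The two toy verbs share one of the three basis elements that either of them occurs with, so the distance is 2/3. Because only presence or absence counts, a verb attested with very few arguments can look distant from almost everything else.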

Below we investigate the terms contained in the subject-only space's secondary cluster; we also look at their most common arguments.

In [17]:
kmeans = KMeans(n_clusters=2, random_state=0).fit(np.nan_to_num(s_space.pairwise_jaccard))

clusters_glossed = pd.DataFrame(kmeans.labels_, 
                                index=[f'{subj_only.target2gloss[w]}.{F.vs.v(subj_only.target2node[w])}' for w in s_space.pmi.columns], 
                                columns=['cluster']).fillna(0)
clusters = pd.DataFrame(kmeans.labels_, 
                        index=s_space.pmi.columns, 
                        columns=['cluster']).fillna(0)

clusters_glossed[clusters_glossed.cluster == 0]

Unnamed: 0,cluster
be angry.hit,0
entreat.nif,0
"warn, to witness.hif",0
be awake.hif,0
restrain.qal,0
be angry.hit,0
"moisten, confound.qal",0
look.qal,0
forget.piel,0
destroy.hif,0


In [76]:
clust_2 = clusters[clusters.cluster == 0]
clust_2_args = collections.Counter()

# tally the subject arguments of the verbs in the secondary cluster (label 0)
for lex in clust_2.index:
    for arg in s_space.raw[lex][s_space.raw[lex] > 0].index:
        clust_2_args[arg] += s_space.raw[lex][arg]
        
clust_2_args.most_common(10)

[('Pred.Subj.JHWH/', 50.0),
 ('Pred.Subj.>LHJM/', 8.0),
 ('Pred.Subj.RWX/', 7.0),
 ('Pred.Subj.MGPH/', 5.0),
 ('PreO.Subj.JHWH/', 4.0),
 ('Pred.Subj.HW>', 3.0),
 ('Pred.Subj.LB/', 3.0),
 ('Pred.Subj.ML>K/', 2.0),
 ('Pred.Subj.<M/', 2.0),
 ('Pred.Subj.XJL/', 2.0)]

What is common amongst this cluster is an overwhelming preference for יהוה as the subject. 

**This shows that the cluster is more informative about which character is acting than about verb class! No significance can therefore be drawn from the larger secondary cluster of the subject-only space.**
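
To put a rough number on this preference, the ten counts printed above can be pooled by subject lexeme. The sketch only re-uses that printed output. Grouping by the final dot-separated component of each label is an assumption about how the labels are structured, and the resulting share is relative to these ten entries only, not to every argument in the cluster.

```python
import collections

# the ten most common arguments, copied from the output above
top_args = [
    ('Pred.Subj.JHWH/', 50.0),
    ('Pred.Subj.>LHJM/', 8.0),
    ('Pred.Subj.RWX/', 7.0),
    ('Pred.Subj.MGPH/', 5.0),
    ('PreO.Subj.JHWH/', 4.0),
    ('Pred.Subj.HW>', 3.0),
    ('Pred.Subj.LB/', 3.0),
    ('Pred.Subj.ML>K/', 2.0),
    ('Pred.Subj.<M/', 2.0),
    ('Pred.Subj.XJL/', 2.0),
]

# pool counts by lexeme (assumed to be the last dot-separated component)
by_lexeme = collections.Counter()
for label, count in top_args:
    by_lexeme[label.split('.')[-1]] += count

total = sum(by_lexeme.values())
jhwh_share = by_lexeme['JHWH/'] / total
print(f"JHWH/: {by_lexeme['JHWH/']} of {total} ({jhwh_share:.3f})")
```

Among these ten entries, JHWH/ accounts for 54 of 86 occurrences, or roughly 63%.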

### Analysis

The problem we are witnessing above is described in the [Wikipedia article](https://en.wikipedia.org/wiki/Clustering_high-dimensional_data) on clustering high-dimensional data:

> A cluster is intended to group objects that are related, based on observations of their attribute's values. However, given a large number of attributes some of the attributes will usually not be meaningful for a given cluster. For example, in newborn screening a cluster of samples might identify newborns that share similar blood values, which might lead to insights about the relevance of certain blood values for a disease. But for different diseases, different blood values might form a cluster, and other values might be uncorrelated. This is known as the local feature relevance problem: different clusters might be found in different subspaces, so a global filtering of attributes is not sufficient...**Recent research indicates that the discrimination problems only occur when there is a high number of irrelevant dimensions, and that shared-nearest-neighbor approaches can improve results.** (Wikipedia, Clustering High Dimensional Data, emphasis added)
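
The shared-nearest-neighbor idea mentioned in the quotation can be illustrated with a small sketch. It is one possible reading of that approach, not something this notebook implements: the function `shared_neighbor_counts`, the choice of `k`, and the toy points are all hypothetical. Two items count as similar when their lists of k nearest neighbours overlap.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def shared_neighbor_counts(data, k=3):
    """Return an (n, n) matrix counting the k-nearest neighbours each pair of rows shares."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    # calling kneighbors() without a query uses the fitted points themselves
    # and leaves each point out of its own neighbour list
    neighbors = NearestNeighbors(n_neighbors=k).fit(data)
    idx = neighbors.kneighbors(return_distance=False)
    # membership[i, j] == 1 when j is among the k nearest neighbours of i
    membership = np.zeros((n, n), dtype=int)
    membership[np.repeat(np.arange(n), k), idx.ravel()] = 1
    return membership @ membership.T


# toy data (hypothetical): two well-separated groups of four points each
group_a = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
group_b = group_a + 10
snn = shared_neighbor_counts(np.vstack([group_a, group_b]), k=3)
print(snn[0, 1], snn[0, 4])
```

Points in the same toy group share two neighbours, while points from different groups share none. Such a count matrix could stand in for raw pairwise distances when clustering, though whether it helps with the verb spaces here is untested.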



## With Muted Lexemes

In [4]:
subj_min = VerbSubjOnlyMinLex(tf_api=tf_api)
obj_min = VerbObjOnlyMinLex(tf_api=tf_api)
cmpl_min = VerbCmplOnlyMinLex(tf_api=tf_api)
for name, exp in (('subj', subj_min), ('obj', obj_min), ('cmpl', cmpl_min)):
    print(f'{name} experiment is ready with a size of {exp.data.shape}')
    
print('\nbuilding subject-only space...')
minsspace = SemSpace(subj_min, info=200000)
print('\nbuilding object-only space...')
minospace = SemSpace(obj_min, info=300000)
print('\nbuilding complement-only space...')
mincspace = SemSpace(cmpl_min, info=200000)

subj experiment is ready with a size of (19, 513)
obj experiment is ready with a size of (57, 465)
cmpl experiment is ready with a size of (347, 601)

building subject-only space...
  0.00s Beginning all calculations...
  0.00s beginning Loglikelihood calculations...
  0.45s FINISHED loglikelihood at iteration 9747
  0.46s beginning PMI calculations...
  0.78s FINISHED PMI...
  1.03s data gathering complete!

building object-only space...
  0.00s Beginning all calculations...
  0.00s beginning Loglikelihood calculations...
  0.70s FINISHED loglikelihood at iteration 26505
  0.70s beginning PMI calculations...
  1.05s FINISHED PMI...
  1.40s data gathering complete!

building complement-only space...
  0.00s Beginning all calculations...
  0.00s beginning Loglikelihood calculations...
   |     3.32s at iteration 200000
  3.44s FINISHED loglikelihood at iteration 208547
  3.44s beginning PMI calculations...
  4.53s FINISHED PMI...
  4.96s data gathering complete!


In [7]:
for name, space in (('subj', minsspace), ('obj', minospace), ('cmpl', mincspace)):
    
    kmeans = KMeans(n_clusters=2, random_state=0).fit(np.nan_to_num(space.pairwise_pmi))
    cluster_1_count = kmeans.labels_[kmeans.labels_ == 0].shape[0]
    cluster_2_count = kmeans.labels_[kmeans.labels_ == 1].shape[0]
    
    print(f'{name} space sizes:')
    print(f'\tcluster_1 size: {cluster_1_count} ({round(cluster_1_count / kmeans.labels_.shape[0], 3)})')
    print(f'\tcluster_2 size: {cluster_2_count} ({round(cluster_2_count / kmeans.labels_.shape[0], 3)})\n')

subj space sizes:
	cluster_1 size: 278 (0.542)
	cluster_2 size: 235 (0.458)

obj space sizes:
	cluster_1 size: 310 (0.667)
	cluster_2 size: 155 (0.333)

cmpl space sizes:
	cluster_1 size: 54 (0.09)
	cluster_2 size: 547 (0.91)



In [9]:
kmeans = KMeans(n_clusters=2, random_state=0).fit(np.nan_to_num(minospace.pairwise_jaccard))

# use the muted-lexeme experiment's own lookups
# (assumes obj_min exposes target2gloss/target2node like obj_only)
cglossedmin = pd.DataFrame(kmeans.labels_, 
                                index=[f'{obj_min.target2gloss[w]}.{F.vs.v(obj_min.target2node[w])}' for w in minospace.pmi.columns], 
                                columns=['cluster']).fillna(0)

# the index must come from the same space that was clustered
clustmin = pd.DataFrame(kmeans.labels_, 
                        index=minospace.pmi.columns, 
                        columns=['cluster']).fillna(0)

cglossedmin[cglossedmin.cluster == 0]

Unnamed: 0,cluster
"work, serve.qal",0
pass.hif,0
make.qal,0
ascend.hif,0
stand.qal,0
fine.qal,0
be lowly.hit,0
answer.qal,0
bind.qal,0
root up.piel,0


In [11]:
minsspace.pmi.index

Index(['PreO.Subj.nmpr', 'PreO.Subj.prps', 'PreO.Subj.subs',
       'Pred.Subj.<D_subs', 'Pred.Subj.>T', 'Pred.Subj.>T_nmpr',
       'Pred.Subj.>T_nmpr|>T_nmpr',
       'Pred.Subj.>T_nmpr|>T_nmpr|>T_nmpr|>T_nmpr', 'Pred.Subj.>T_prde',
       'Pred.Subj.>T_subs', 'Pred.Subj.K_subs', 'Pred.Subj.MN_subs',
       'Pred.Subj.MN_subs|MN_subs', 'Pred.Subj.nmpr', 'Pred.Subj.prde',
       'Pred.Subj.prin', 'Pred.Subj.prps', 'Pred.Subj.subs', 'PtcO.Subj.subs'],
      dtype='object')