[Lucy and Bamman 2021](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00383/101877/Characterizing-English-Variation-across-Social) uses k-means clustering over BERT representations to induce word senses, and then uses those senses to characterize how words are used distinctively within online communities. In this notebook, we'll explore inferring distinct senses of a word by clustering the contextual BERT representations of its occurrences.

In [1]:
from transformers import BertModel, BertTokenizer
import numpy as np
from sklearn.cluster import KMeans
from collections import Counter

In [2]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
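import torch

def get_bert_for_token(string, term):
    """Return the final-layer BERT vector (as a numpy array) for the first occurrence of `term` in `string`."""

    # tokenize
    inputs = tokenizer(string, return_tensors="pt")

    # convert input ids back to wordpiece tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    # find the first location of the query term among those tokens (so we know which BERT rep to use);
    # this assumes the term is a single lowercase wordpiece, and raises ValueError if it is not found
    term_idx = tokens.index(term)

    # run BERT without tracking gradients, since we only need the representations
    with torch.no_grad():
        outputs = model(**inputs)

    # return the BERT rep for that token index, converted from a PyTorch tensor
    # to a numpy array so that it works with numpy and scikit-learn functions
    return outputs.last_hidden_state[0][term_idx].numpy()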

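The lookup `tokens.index(term)` only works when the query term is a single wordpiece. As a minimal sketch of one way to handle terms that get split into several pieces, the hypothetical helper below (`find_wordpiece_span` is not part of the original notebook) finds the first contiguous run of pieces in a token list. The token lists here are written by hand for illustration; in the notebook they would come from `tokenizer.convert_ids_to_tokens(...)` and `tokenizer.tokenize(term)`.

```python
def find_wordpiece_span(tokens, term_pieces):
    """Return (start, end) of the first occurrence of term_pieces in tokens, or None."""
    n = len(term_pieces)
    if n == 0:
        return None
    for start in range(len(tokens) - n + 1):
        # compare a window of the same length as the term against the term's pieces
        if tokens[start:start + n] == term_pieces:
            return start, start + n
    return None

# Hand-written wordpieces for illustration only (an assumed split, not actual
# tokenizer output for any particular word).
tokens = ["[CLS]", "she", "opened", "the", "cab", "##inet", ".", "[SEP]"]
span = find_wordpiece_span(tokens, ["cab", "##inet"])
print(span)  # (4, 6)
```

With a span in hand, one option is to average the corresponding rows of `outputs.last_hidden_state[0]` to get a single vector for the term; whether averaging is the best pooling choice is a judgment call.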
In [4]:
def read_data(filename):
    data=[]
    with open(filename, encoding="utf-8") as file:
        for line in file:
            data.append(line.rstrip())
    return data

First, let's examine uses of the word "cabinet" from several contemporary novels.

In [5]:
data=read_data("../data/cabinet.txt")
reps=[]
for sentence in data:
    reps.append(get_bert_for_token(sentence, "cabinet"))
    
# reps now holds one 768-dimensional BERT vector per sentence, each for the token "cabinet"

In [6]:
kmeans = KMeans(n_clusters=2, random_state=0).fit(reps)

In [8]:
for idx in np.argsort(kmeans.labels_):
    print("%s\t%s" % (kmeans.labels_[idx], data[idx]))

0	He studiously ignored it as his fierce blue eyes swept around the cabinet room.
0	The Englishman walked over to Murray’s cabinet and found a bottle of whiskey—a Christmas present, still unopened on New Year’s Eve.
0	She opened the mirrored door to the medicine cabinet.
0	Roscoe nodded and the doctor got up and walked around to a file cabinet.
0	The doctor stepped back to the cabinet and put the file away.
0	I heard her open this cabinet.
0	In the china cabinet in the dining room, there was some china like the china Mrs. Hartley’s mother had owned.
0	Except for the background noise of a tactical channel coming from a scanner on a file cabinet in the back, the place could have passed for a real estate office.
0	Four empty grass mats, a cabinet of supplies, all just sitting in the middle of this tunnel.
0	So she goes in and while she’s running the water she looks through the cabinet below the sink, probably to see if there is anything worth lifting.
1	Ryan had sailed through confirmatio

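Reading every sentence in a cluster gets slow as the data grows. As a rough sketch, one way to summarize a cluster is to list the sentences whose vectors lie closest to its centroid. The example below uses synthetic vectors and placeholder sentences (`fake_reps`, `fake_data`, and `closest_to_centroids` are all assumptions made for illustration, not part of the original notebook); with the real data, `reps` and `data` would take their place.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for `reps`: two well-separated blobs in 8 dimensions.
fake_reps = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(10, 8)),
    rng.normal(loc=5.0, scale=0.1, size=(10, 8)),
])
# Placeholder sentences standing in for `data`.
fake_data = ["sentence %d" % i for i in range(len(fake_reps))]

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(fake_reps)

def closest_to_centroids(kmeans, reps, top_n=3):
    """Map each cluster id to the indices of its top_n members nearest the centroid."""
    # transform() returns each point's distance to every cluster centroid
    distances = kmeans.transform(reps)
    exemplars = {}
    for cluster_id in range(kmeans.n_clusters):
        members = np.where(kmeans.labels_ == cluster_id)[0]
        # sort this cluster's members by distance to their own centroid
        ranked = members[np.argsort(distances[members, cluster_id])]
        exemplars[cluster_id] = ranked[:top_n].tolist()
    return exemplars

exemplars = closest_to_centroids(kmeans, fake_reps)
for cluster_id, indices in exemplars.items():
    for idx in indices:
        print("%s\t%s" % (cluster_id, fake_data[idx]))
```

Sentences near the centroid tend to be the most typical members of a cluster, so they are a reasonable first place to look when deciding what sense a cluster captures.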
Now let's examine a word with more polysemy: *right*. Explore clustering with different numbers of clusters; how many clusters do you need before they line up with what you would consider the distinct senses of the word?

In [16]:
data=read_data("../data/right200.txt")
reps=[]
for sentence in data:
    reps.append(get_bert_for_token(sentence, "right"))

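When trying different numbers of clusters, a quantitative score can complement reading the sentences by hand. The sketch below uses the silhouette score from scikit-learn, which is an assumption on my part rather than a method used in this notebook or claimed of the cited paper. It runs on synthetic vectors (`fake_reps` is a made-up stand-in with three blobs); with the real data you would pass `np.array(reps)` instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for `reps`: three tight blobs in 8 dimensions.
centers = [0.0, 5.0, 10.0]
fake_reps = np.vstack([
    rng.normal(loc=c, scale=0.2, size=(20, 8)) for c in centers
])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(fake_reps)
    # silhouette is in [-1, 1]; higher means tighter, better-separated clusters
    scores[k] = silhouette_score(fake_reps, labels)
    print("k=%d\tsilhouette=%.3f" % (k, scores[k]))

best_k = max(scores, key=scores.get)
print("highest silhouette at k=%d" % best_k)
```

On real BERT vectors the scores are unlikely to be as clear-cut as on this toy data, so the highest-scoring `k` is best treated as a starting point for the qualitative inspection, not as the answer.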
In [23]:
kmeans = KMeans(n_clusters=2, random_state=0).fit(reps)

In [25]:
max_per_class=5 # print at most this many example sentences per cluster
cluster_counts=Counter()
last_lab=None
for idx in np.argsort(kmeans.labels_): # visit sentence indices grouped by cluster label (all 0s, then all 1s, ...)
    clusterID=kmeans.labels_[idx] # cluster label assigned to this sentence
    if cluster_counts[clusterID] < max_per_class: # skip once this cluster has printed max_per_class sentences
        cluster_counts[clusterID]+=1 # count this sentence toward its cluster's quota
        if clusterID != last_lab and last_lab is not None:
            print() # blank line between clusters
        last_lab=clusterID # remember the cluster of the most recently printed sentence
        print("%s\t%s" % (clusterID, data[idx]))

0	" Offers me , " he went on , tapping his foot upon the floor , " the little inheritance she is certain of so soon -- just as little and as much as I have wasted -- and begs and prays me to take it , set myself right with it , and remain in the service . "
0	' Spend it , sir , ' says I. ' But I shall be taken in , ' he says , ' they wo n't give me the right change , I shall lose it , it 's no use to me . '
0	He has more leisure for musing in Staple Inn and in the Rolls Yard during the long vacation than at other seasons , and he says to the two ' prentices , what a thing it is in such hot weather to think that you live in an island with the sea a-rolling and a-bowling right round you .
0	I must n't go into court and say , ' My Lord , I beg to know this from you -- is this right or wrong ?
0	She went one way , and Jenny went another ; one went right to Lunnun , and t ' other went right from it .

1	One might have supposed that the course was straight on -- over everything , neither to 