# The Semantics of "Language Models"

There is considerable debate, in the field and in industry, about whether large language models "understand language" (CITE) and whether they are "capable of reference" (CITE). Some, like Geoffrey Hinton, would take a victory lap: decades later, the first connectionists have been proven right. Champions of industry speak of AGI as if it were right around the corner. Others, like Emily Bender, warn that the "AI hype" is unfounded: on this view, language models can never "understand" language without a way to ground the symbols they manipulate. At its most sinister, AI hype (and its close counterpart AI doom, often entertained by the exact same parties) misdirects public attention away from the real and present harms of machine learning and NLP applications in today's society, from amplifying systemic racism to overhauling labor and expectations of productivity.

In some sense, the arguments of both camps require that there be no animacy within the language model, at least not yet. (The AI dehypers decouple animacy/agency from harm.)

This all speaks to our overt metapragmatics of language models: the debate in what we might call the philosophy of language models is an exercise of power. It determines who has the authority to declare a speaker a legitimate subject.

There's an idea that the legitimacy of the subject inheres in it somehow, in its properties, and not in interaction or in how others perceive and understand it.

One philosopher with much to say on the subject of subjecthood was Hegel. In Hegel's phenomenology, the subject cannot exist without also being an object. The self exists as a self inasmuch as it relates to an other and another relates to it. His view of subjecthood relies on mutual recognition.

Does what we say about language models match up with how we talk about them? I suspect that over the last decade, models have become (linguistically, grammatically) more like subjects than objects. 

I hypothesize that over time we will see

1. higher number of collocations with more agentive verbs, esp. verbs of cognition (e.g. acquiring concepts)
2. increased use in subject-y positions
3. higher association with subject-like semantic features
4. (potentially) compared to COCA (controlling for the fashion-model sense), language models becoming more animate



A point about anthropology

This is a test of overt/covert language ideology. The concept of ideology has been turned all about: the idea of ideology in the Marxist sense, as an unconscious bias away from the truth, has been criticized, along with the idea of a 'gap' between the overt political/linguistic ideologies that people profess and the ideologies they enact. This is a tricky problem and we want to sidestep it. But we also want to say, look: these people say language models display 'human-like' abilities, and these people say they don't. But they both talk about them in the same way.

Where does negation come into this? As in somebody stating in a paper that 'language models *don't* display human-like abilities'.

I also suspect we'll see a shift from verbal to nominal uses, and that the nominal uses will have new senses. 



In [1]:
import pandas as pd
import numpy as np
import pickle
import spacy



import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

from tqdm import tqdm



pd.set_option('display.max_colwidth', 500)



## Loading and building the dataset

Let's look at some uses of the word 'model'

In [2]:
word = 'model'

tokens = pd.read_csv('./collected_tokens/acl/{}.csv'.format(word))

In [3]:
tokens.head(5)

Unnamed: 0.1,Unnamed: 0,corpus_id,sentence,start_idx,end_idx
0,0,18022704,"Since the similarity measure based on the vector space model is a rough estimation, minor errors made at the stage of context vector extraction are acceptable.",9,10
1,1,18022704,e) Words with similar contexts might not be synonyms: A disadvantage does exist when the context vector model is used.,19,20
2,2,18022704,"Therefore, the vector space model should incorporate the taxonomy approach to solve this phenomenon.",5,6
3,3,18022704,"Conclusions In this paper, we have adopted the context vector model to measure word similarity.",11,12
4,4,16703040,"Based on a review of our misclassified instances, we are surprised that our classifier did not learn a better model based on style features (F 1 =.60).",20,21


In [4]:
# number of tokens of 'model'
len(tokens)

868208

In [5]:
# join the paper metadata from our ACL corpus to the tokens 

parquet_file = "/Volumes/data_gabriella_chronis/corpora/acl-publication-info.74k.parquet"

df = pd.read_parquet(parquet_file, engine='pyarrow')
data = tokens.join(df.set_index("corpus_paper_id"), on="corpus_id")
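Later cells assume that the join keeps exactly one row per token. A minimal sketch of how that assumption could be checked, using `DataFrame.merge` with its `validate` and `indicator` options; the toy frames and their values are made up for illustration, and only the column names `corpus_id` and `corpus_paper_id` come from the cell above.

```python
import pandas as pd

# Toy stand-ins for `tokens` and the paper metadata (values are made up)
toy_tokens = pd.DataFrame({"corpus_id": [1, 1, 2, 3], "sentence": ["a", "b", "c", "d"]})
toy_meta = pd.DataFrame({"corpus_paper_id": [1, 2], "year": ["2008", "2019"]})

checked = toy_tokens.merge(
    toy_meta,
    left_on="corpus_id",
    right_on="corpus_paper_id",
    how="left",
    validate="many_to_one",  # raises MergeError if a paper id is duplicated in the metadata
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Tokens whose paper id has no metadata row (they would end up without a year)
n_unmatched = int((checked["_merge"] == "left_only").sum())
print(len(checked) == len(toy_tokens), n_unmatched)
```

If the row count changes or `n_unmatched` is non-zero on the real data, the year conversion and the later column-wise concatenation would need a second look.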

In [6]:
# cast year from string to int so we can order it
data["year"] = data["year"].astype(int)

# create time bins: decades up to 2000, then finer bins (the column is still called "decade")
bins = [1950, 1960, 1970, 1980, 1990, 2000, 2005, 2010, 2012, 2014, 2016, 2018, 2020]
data["decade"] = pd.cut(data['year'], bins)
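It may help to see how `pd.cut` treats the bin edges. By default the intervals are closed on the right, so a year equal to an edge falls into the bin that ends there, and years outside the outermost edges get no bin at all. The example years below are chosen only to illustrate this.

```python
import pandas as pd

bins = [1950, 1960, 1970, 1980, 1990, 2000, 2005, 2010, 2012, 2014, 2016, 2018, 2020]

# Example years picked to hit bin edges and out-of-range values
years = pd.Series([1950, 1960, 1961, 2005, 2020, 2021])
binned = pd.cut(years, bins)

for year, interval in zip(years, binned):
    print(year, interval)
```

Here 1960 lands in `(1950, 1960]`, while 1950 and 2021 come back as `NaN`, so any such rows would drop out of a later `groupby('decade')`.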

In [7]:
# for dev purposes let's work with a smaller section of the data for now

#data = data.iloc[:5000]

### Look at five sentences from each decade

In [8]:
#df.style.set_properties(subset=['sentence'], **{'width': '300px'})
pd.set_option('display.max_rows', 1000)


#data.groupby('decade').sample(frac=.1, replace=True) [['decade', 'sentence' ]]
data.groupby('decade').sample(5) [['decade', 'sentence' ]]

Unnamed: 0,decade,sentence
606708,"(1950, 1960]",We did write this paper about model English entirely in model English to show how much familiarity can be combined with complete regularity in a model language.
612857,"(1950, 1960]","From here on, however, -that is, in the MT from the pivot language into any of the model output languages -we would in every case have a mechanical correlation between two regularized languages."
606703,"(1950, 1960]",Rules such as these can make English as free of inflections as Chinese or as the most model artificial language.
612855,"(1950, 1960]","His paper caused a very lively discussion as a result of which I can say that ""model TL-s,"" especially his ""model target English"" will constitute an important item in the mechanization of the translation process."
612862,"(1950, 1960]","A model language, as defined by Dr. Dodd, means any language in which the rules of syntax have been regularized, and in which familiarity of words is a governing criterion."
833619,"(1960, 1970]","In the framework of Tarski's formulation, the method can be stated as follows: Given a ""model"" of the language L, consisting of an individual domain D and a semantic interpretation function ~b for L over P, the extension of any well-formed expression E in L is defined as the set of values for E of all semantic interpretation functions #' over P which differ from # at most on their assignments to the free variables of E. Now if one considers the domain of possible models for L, the intension ..."
613146,"(1960, 1970]","The model draws on ii) iii) iv) and v) of the technological devices mentioned above, i) As is standarg practise now on Information Retrieval, the model uses a Thesaurus."
618061,"(1960, 1970]","In this model we distinguisL five structural levels and tzo binary relations, ex:~ansion and coordination."
833965,"(1960, 1970]","From this (and other corroborative forms) we postulate that an earlier stage of Russian had one form for this verb stem, namely /mog/, and that before the vowel /i/ /g/ later became /~/. The proposal in this paper is to reverse the bottom-to-top model of the comparative method and that of internal reconstruction into a top-to-bottom generative model where the input forms are reconstructed leKical items and the rules are the set of postulated sound changes for the language."
833476,"(1960, 1970]","Accordingly, I propose initially cert@in extensions and modifications of the theory to make it in some sense 1-2 a model of performance."


How many tokens of 'model' do we have in each time bin?

In [9]:
data.groupby('decade').size()

decade
(1950, 1960]        58
(1960, 1970]       263
(1970, 1980]      1223
(1980, 1990]      7771
(1990, 2000]     26080
(2000, 2005]     32413
(2005, 2010]     80479
(2010, 2012]     47085
(2012, 2014]     55540
(2014, 2016]     72666
(2016, 2018]    115437
(2018, 2020]    218505
dtype: int64

## Collocations

Let's build all the collocations, for one year


In [10]:
year_data = data[data['year'] ==2008]
sentences = year_data['sentence'].str.cat(sep=' ')
toks = nltk.word_tokenize(sentences)

In [11]:
toks[:5]

['The', 'low', 'resources', 'we', 'use']

In [12]:
# Keep only n-grams with 'model' as a member
# (NLTK's ngram filter removes n-grams for which the function returns True)
model_filter = lambda *w: 'model' not in w

## Bigrams
finder = BigramCollocationFinder.from_words(toks)
# only bigrams that appear 3+ times
finder.apply_freq_filter(3)
# only bigrams that contain 'model'
finder.apply_ngram_filter(model_filter)
# print the 40 bigrams with the highest likelihood-ratio score
print(finder.nbest(bigram_measures.likelihood_ratio, 40))


## Trigrams
finder = TrigramCollocationFinder.from_words(toks)
# only trigrams that appear 3+ times
finder.apply_freq_filter(3)

# (not run yet) only trigrams that contain 'model',
# then print the 25 trigrams with the highest likelihood-ratio score
# finder.apply_ngram_filter(model_filter)
# print(finder.nbest(trigram_measures.likelihood_ratio, 25))

[('language', 'model'), ('translation', 'model'), ('model', 'is'), ('model', '.'), ('our', 'model'), ('model', ','), (',', 'model'), ('reordering', 'model'), ('.', 'model'), ('log-linear', 'model'), ('model', 'the'), ('CRF', 'model'), ('(', 'model'), ('model', 'for'), ('alignment', 'model'), ('model', 'a'), ('probabilistic', 'model'), ('model', 'was'), ('joint', 'model'), ('distortion', 'model'), ('model', 'can'), ('this', 'model'), ('generative', 'model'), ('of', 'model'), ('acoustic', 'model'), ('twin-candidate', 'model'), ('and', 'model'), ('transliteration', 'model'), ('model', 'with'), ('model', 'trained'), ('IBM', 'model'), ('model', 'has'), ('regression', 'model'), ('graphical', 'model'), ('single-candidate', 'model'), ('model', '('), ('in', 'model'), ('model', "'s"), ('entropy', 'model'), ('model', 'and')]
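The cell above ranks bigrams by likelihood ratio. Another association score sometimes used for collocations is pointwise mutual information (PMI), log2 of p(x, y) / (p(x) p(y)). The following is a minimal standard-library sketch of PMI for bigrams containing a target word; the function name `model_bigram_pmi` and the toy sentence are hypothetical, and the normalisation is the plain textbook one, which may differ from NLTK's.

```python
import math
from collections import Counter

def model_bigram_pmi(tokens, target="model", min_freq=1):
    """Return (bigram, PMI) pairs for bigrams containing `target`, best first."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), freq in bigrams.items():
        if target not in (w1, w2) or freq < min_freq:
            continue
        p_xy = freq / n_bi            # probability of the bigram
        p_x = unigrams[w1] / n_uni    # probability of the first word
        p_y = unigrams[w2] / n_uni    # probability of the second word
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy input; in the notebook this would be the tokens of one year or time bin
toy_tokens = "the language model is a language model".split()
pmi_scores = model_bigram_pmi(toy_tokens)
print(pmi_scores)
```

PMI tends to favour rare pairs, which is one reason to keep a frequency threshold like the `apply_freq_filter(3)` used above.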


## Dependency Parsing

Let's do a dependency parse of the word 'model' in these sentences.

In [13]:
nlp = spacy.load("en_core_web_sm")
 

Let's look at the spacy attributes for one sentence from the dataset.

In [14]:
doc = nlp("In our experiment, the model trained on a largest data i.e., English model, has the lowest WER, whereas, the model trained on the smallest data i.e., Amharic, has the highest WER.")
print("{:25} {:10} {:10} {:20} {:10} {:10} {:50} {:50}".format('Token','Pos-Tag','Dep','Head','Pos-Head','NER', 'Lefts', 'Rights'))
print("_" * 150)
for token in doc:
    print("{:25} {:10} {:10} {:20} {:10} {:10} {:50} {:50}".format(token.text, token.pos_ , token.dep_, token.head.text, token.head.pos_, token.ent_type_,  "-".join([w.text for w in token.lefts]), "-".join([w.text for w in token.rights])))

Token                     Pos-Tag    Dep        Head                 Pos-Head   NER        Lefts                                              Rights                                            
______________________________________________________________________________________________________________________________________________________
In                        ADP        prep       has                  VERB                                                                     experiment                                        
our                       PRON       poss       experiment           NOUN                                                                                                                       
experiment                NOUN       pobj       In                   ADP                   our                                                                                                  
,                         PUNCT      punct      has                  VERB    

We want to extract the dependency information for 'model' from every sentence in the dataset that contains it.

In [15]:
spacy_docs = []
for idx, row in tqdm(data.iterrows()):
    doc = nlp(row.sentence)
    spacy_docs.append(doc)

868208it [2:55:06, 82.63it/s] 
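Parsing one sentence at a time took close to three hours here. spaCy also offers `nlp.pipe`, which processes texts in batches and yields the resulting docs in input order; whether it is faster on this machine is untested, so treat the sketch below as an assumption to try out. The helper name `parse_in_batches` and the batch size are hypothetical.

```python
def parse_in_batches(nlp, sentences, batch_size=1000):
    """Parse an iterable of sentences with nlp.pipe and return the docs in order."""
    texts = (str(s) for s in sentences)  # make sure every item is a string
    return list(nlp.pipe(texts, batch_size=batch_size))

# Hypothetical usage with the objects defined in the cells above:
# spacy_docs = parse_in_batches(nlp, tqdm(data["sentence"], total=len(data)))
```

Because the docs come back in the same order as the sentences, the result can still be lined up row by row with `data`.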


In [16]:
# # Save the parsed docs so we don't have to re-run the three-hour parse
# with open('model_tokens_parsed.pkl', 'wb') as file:
#     pickle.dump(spacy_docs, file)

# # Load the parsed docs back in
# with open('model_tokens_parsed.pkl', 'rb') as file:
#     spacy_docs = pickle.load(file)

# # Alternatively:
# spacy_docs = pd.read_pickle(r'model_tokens_parsed.pkl')

Now we want to extract some of the token attributes into a dataframe of potentially useful information about each usage of the target word 'model'

In [17]:
# let's get out the token info for just the 'model' tokens
model_tokens = []
for spacy_doc in spacy_docs:
    model_toks = [tok for tok in spacy_doc if tok.text == "model"]
    if not model_toks:
        # no exact match after spaCy tokenization: keep a placeholder row
        # so that the records stay aligned with the rows of `data`
        model_tokens.append({"token_text": None})
        continue
    token = model_toks[0]  # use the first occurrence of 'model' in the sentence
    token_attributes = {
        "token_text": token.text,
        "token_pos": token.pos_,
        "token_dep": token.dep_,
        "token_head_text": token.head.text,
        "token_head_pos": token.head.pos_,
        "token_ent_type": token.ent_type_,
        # direct syntactic children to the left / right, joined with '-'
        "token_lefts": "-".join([w.text for w in token.lefts]),
        "token_rights": "-".join([w.text for w in token.rights])
    }

    model_tokens.append(token_attributes)

Confirm that the spaCy token data looks the way we want, in a nice cute dictionary.

In [18]:
model_tokens[0]

{'token_text': 'model',
 'token_pos': 'NOUN',
 'token_dep': 'pobj',
 'token_head_text': 'on',
 'token_head_pos': 'ADP',
 'token_ent_type': '',
 'token_lefts': 'the-vector-space',
 'token_rights': ''}

In [19]:
# convert to a dataframe

tokens_df = pd.DataFrame.from_records(model_tokens)
tokens_df.iloc[0]

token_text                    model
token_pos                      NOUN
token_dep                      pobj
token_head_text                  on
token_head_pos                  ADP
token_ent_type                     
token_lefts        the-vector-space
token_rights                       
Name: 0, dtype: object

In [22]:
tokens_df.to_csv('model_spacy_tokens.csv')

In [20]:
# join the dataframe to our data (column-wise, aligned on the row index)


tokens_data = pd.concat([data, tokens_df], axis=1)
tokens_data.head(1)

Unnamed: 0.1,Unnamed: 0,corpus_id,sentence,start_idx,end_idx,acl_id,abstract,full_text,pdf_hash,numcitedby,...,isbn,decade,token_text,token_pos,token_dep,token_head_text,token_head_pos,token_ent_type,token_lefts,token_rights
0,0,18022704,"Since the similarity measure based on the vector space model is a rough estimation, minor errors made at the stage of context vector extraction are acceptable.",9,10,O02-2002,"There is a need to measure word similarity when processing natural languages, especially when using generalization, classification, or example -based approaches. Usually, measures of similarity between two words are defined according to the distance between their semantic classes in a semantic taxonomy . The taxonomy approaches are more or less semantic -based that do not consider syntactic similarit ies. However, in real applications, both semantic and syntactic similarities are required an...","There is a need to measure word similarity when processing natural languages, especially when using generalization, classification, or example -based approaches. Usually, measures of similarity between two words are defined according to the distance between their semantic classes in a semantic taxonomy . The taxonomy approaches are more or less semantic -based that do not consider syntactic similarit ies. However, in real applications, both semantic and syntactic similarities are required an...",0b09178ac8d17a92f16140365363d8df88c757d0,14,...,,"(2000, 2005]",model,NOUN,pobj,on,ADP,,the-vector-space,


What are the different types of dependency relations 'model' shows up in?

In [23]:
tokens_data.token_dep.unique()

array(['pobj', 'nsubjpass', 'nsubj', 'dobj', 'attr', 'conj', 'ROOT',
       'compound', 'appos', 'npadvmod', 'xcomp', 'poss', 'nmod', 'advcl',
       'oprd', 'ccomp', 'acl', 'amod', 'punct', 'dative', 'relcl', 'dep',
       'acomp', 'meta', 'pcomp', 'parataxis', 'advmod', 'csubj',
       'csubjpass', 'agent', 'intj'], dtype=object)
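Dependency labels say where 'model' sits in the sentence; the first hypothesis is about which verbs it is the subject of. A minimal sketch of how the governing verbs of subject uses could be tallied per time bin, using the `token_dep`, `token_head_text`, `token_head_pos`, and `decade` columns built above. The helper name `top_subject_verbs`, the choice to count `AUX` heads alongside `VERB`, and the toy rows are assumptions for illustration; note that it counts surface forms, not lemmas.

```python
import pandas as pd

def top_subject_verbs(df, n=5):
    """Most frequent governing verbs of 'model' when it is a subject, per time bin."""
    is_subject = df["token_dep"] == "nsubj"
    head_is_verb = df["token_head_pos"].isin(["VERB", "AUX"])  # assumption: include auxiliaries
    subj = df[is_subject & head_is_verb]
    counts = (
        subj.assign(verb=subj["token_head_text"].str.lower())
        .groupby(["decade", "verb"], observed=True)
        .size()
        .reset_index(name="count")
        .sort_values(["decade", "count"], ascending=[True, False])
    )
    # keep the n most frequent verbs within each time bin
    return counts.groupby("decade", observed=True).head(n).reset_index(drop=True)

# Toy rows (made up) standing in for `tokens_data`
toy = pd.DataFrame({
    "decade": ["a", "a", "a", "a", "b"],
    "token_dep": ["nsubj", "nsubj", "nsubj", "dobj", "nsubj"],
    "token_head_text": ["learns", "Learns", "predicts", "train", "is"],
    "token_head_pos": ["VERB", "VERB", "VERB", "VERB", "AUX"],
})
top = top_subject_verbs(toy, n=1)
print(top)
```

On the real data, `top_subject_verbs(tokens_data, n=10)` would give a first look at whether verbs of cognition become more common as heads over time.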

### Analyzing the 'Subjectivity' of 'Model'

Now that we finally have data in the right shape, we are set up to ask, for instance, the following:

For a given decade, what is the proportion of usages of 'model' used as a noun or a verb?




In [24]:
# Step 1: Group by time bin and part of speech, then count occurrences
count_df = tokens_data.groupby(['decade', 'token_pos']).size().reset_index(name='count')

# Step 2: Calculate total counts for each time bin
total_counts = tokens_data.groupby('decade').size().reset_index(name='total')

# Step 3: Merge the counts with the total counts
merged_df = pd.merge(count_df, total_counts, on='decade')

# Step 4: Calculate proportions, rounded to two decimal places
merged_df['proportion'] = merged_df['count'] / merged_df['total']
merged_df['proportion'] = merged_df['proportion'].round(2)

# Step 5: Pivot so that each row is a time bin and each column a part of speech

pivot = pd.pivot_table(merged_df, values='proportion', index=['decade'],
                       columns=['token_pos'], aggfunc="sum")
pivot

token_pos,ADJ,ADV,NOUN,PROPN,VERB
decade,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"(1950, 1960]",0.02,0.0,0.97,0.0,0.02
"(1960, 1970]",0.0,0.0,0.97,0.0,0.03
"(1970, 1980]",0.0,0.0,0.96,0.0,0.03
"(1980, 1990]",0.01,0.0,0.95,0.0,0.04
"(1990, 2000]",0.0,0.0,0.96,0.0,0.04
"(2000, 2005]",0.0,0.0,0.97,0.0,0.03
"(2005, 2010]",0.0,0.0,0.96,0.0,0.03
"(2010, 2012]",0.0,0.0,0.97,0.0,0.03
"(2012, 2014]",0.0,0.0,0.96,0.0,0.04
"(2014, 2016]",0.0,0.0,0.96,0.0,0.04


For a given decade, what proportion of the usages of 'model' appear as a subject, as an object, or in some other syntactic position?

In [25]:
# Step 1: Group by time bin and dependency relation, then count occurrences
count_df = tokens_data.groupby(['decade', 'token_dep']).size().reset_index(name='count')

# Step 2: Calculate total counts for each time bin
total_counts = tokens_data.groupby('decade').size().reset_index(name='total')

# Step 3: Merge the counts with the total counts
merged_df = pd.merge(count_df, total_counts, on='decade')

# Step 4: Calculate proportions, rounded to two decimal places
merged_df['proportion'] = merged_df['count'] / merged_df['total']
merged_df['proportion'] = merged_df['proportion'].round(2)

# Step 5: Pivot so that each row is a time bin and each column a dependency relation

pivot = pd.pivot_table(merged_df, values='proportion', index=['decade'],
                       columns=['token_dep'], aggfunc="sum")
pivot

token_dep,ROOT,acl,acomp,advcl,advmod,agent,amod,appos,attr,ccomp,...,nsubj,nsubjpass,oprd,parataxis,pcomp,pobj,poss,punct,relcl,xcomp
decade,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"(1950, 1960]",0.0,0.02,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,...,0.0,0.05,0.0,0.0,0.0,0.17,0.0,0.0,0.0,0.0
"(1960, 1970]",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.03,0.02,...,0.22,0.06,0.0,0.0,0.0,0.44,0.0,0.0,0.01,0.0
"(1970, 1980]",0.04,0.01,0.0,0.0,0.0,0.0,0.01,0.05,0.04,0.0,...,0.14,0.07,0.0,0.0,0.0,0.34,0.01,0.01,0.0,0.02
"(1980, 1990]",0.01,0.0,0.0,0.01,0.0,0.0,0.02,0.01,0.02,0.0,...,0.16,0.08,0.0,0.0,0.0,0.38,0.0,0.0,0.0,0.02
"(1990, 2000]",0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.02,0.0,...,0.18,0.07,0.0,0.0,0.0,0.36,0.01,0.0,0.0,0.02
"(2000, 2005]",0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.02,0.0,...,0.19,0.07,0.0,0.0,0.0,0.33,0.01,0.0,0.0,0.01
"(2005, 2010]",0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.02,0.0,...,0.19,0.07,0.0,0.0,0.0,0.32,0.01,0.0,0.0,0.01
"(2010, 2012]",0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.02,0.0,...,0.18,0.06,0.0,0.0,0.0,0.31,0.01,0.0,0.0,0.01
"(2012, 2014]",0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.02,0.0,...,0.19,0.06,0.0,0.0,0.0,0.3,0.01,0.0,0.0,0.01
"(2014, 2016]",0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.02,0.0,...,0.21,0.06,0.0,0.0,0.0,0.28,0.01,0.0,0.0,0.01


This analysis looks at all uses of 'model', including verbal uses. What happens when we focus only on nominal uses?

In [None]:
# Step 0: Keep only nominal uses (common and proper nouns)
noun_data = tokens_data[tokens_data.token_pos.isin(["NOUN", "PROPN"])]

# Step 1: Group by time bin and dependency relation, then count occurrences
count_df = noun_data.groupby(['decade', 'token_dep']).size().reset_index(name='count')

# Step 2: Calculate total counts for each time bin
total_counts = noun_data.groupby('decade').size().reset_index(name='total')

# Step 3: Merge the counts with the total counts
merged_df = pd.merge(count_df, total_counts, on='decade')

# Step 4: Calculate proportions, rounded to two decimal places
merged_df['proportion'] = merged_df['count'] / merged_df['total']
merged_df['proportion'] = merged_df['proportion'].round(2)

# Step 5: Pivot so that each row is a time bin and each column a dependency relation

pivot = pd.pivot_table(merged_df, values='proportion', index=['decade'],
                       columns=['token_dep'], aggfunc="sum")
pivot
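The count, total, merge, and pivot steps used in these proportion cells amount to a row-normalised cross-tabulation. As a possible simplification, `pd.crosstab` with `normalize="index"` divides each row by its row total in one call. The helper name `proportion_table` and the toy rows are hypothetical; on the real data the arguments would be columns such as `decade` and `token_dep`.

```python
import pandas as pd

def proportion_table(df, row_col, prop_col):
    """Share of each `prop_col` value within each `row_col` group, rounded to 2 places."""
    return pd.crosstab(df[row_col], df[prop_col], normalize="index").round(2)

# Toy rows (made up) standing in for `tokens_data`
toy = pd.DataFrame({
    "decade": ["a", "a", "b", "b", "b", "b"],
    "token_pos": ["NOUN", "VERB", "NOUN", "NOUN", "NOUN", "VERB"],
})
table = proportion_table(toy, "decade", "token_pos")
print(table)
```

One design difference to keep in mind: this rounds once after normalising, so each row should sum to roughly 1, and bins with no tokens simply do not appear in the result.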