This notebook explores identifying multiword expressions using the part-of-speech filtering technique of Justeson and Katz (1995), "[Technical terminology: some linguistic properties and an algorithm for identification in text](https://brenocon.com/JustesonKatz1995.pdf)".

In [1]:
import spacy, re
from collections import Counter

In [2]:
nlp = spacy.load('en', disable=['ner,parser'])
nlp.remove_pipe('ner')
nlp.remove_pipe('parser')

# workaround for those getting an error loading the sapcy 'en' module:
# nlp = spacy.load('en_core_web_sm', disable=['ner,parser'])

('parser', <spacy.pipeline.DependencyParser at 0x1156d6bf8>)

In [3]:
def getTokens(filename):
    
    """ Read the first 1000 lines of an input file """
    tokens=[]
    with open(filename) as file:
        for idx,line in enumerate(file):
            tokens.extend(nlp(line))
            if idx > 1000:
                break
    return tokens

In [4]:
tokens=getTokens("../data/wiki.10K.txt")
print(len(tokens))

465226


In [5]:
words=[x.text for x in tokens]

Let's simplify the POS tags to make the regex easier to understand.

In [6]:
adjectives=set(["JJ", "JJR", "JJS"])
nouns=set(["NN", "NNS", "NNP", "NNPS"])

taglist=[]
for x in tokens:
    if x.tag_ in adjectives:
        taglist.append("ADJ")
    elif x.tag_ in nouns:
        taglist.append("NOUN")
    elif x.tag == "IN":
        taglist.append("PREP")
    else:
        taglist.append("O")
                
tags=' '.join(taglist)        

In [7]:
def getChar2TokenMap(tags):
    
    """  We'll search over the postag sequence, so we need to get the token ID for any
    character to be able to match the word token. """
    
    ws=re.compile(" ")
    char2token={}

    lastStart=0
    for idx, m in enumerate(ws.finditer(tags)):
        char2token[lastStart]=idx
        lastStart=m.start()+1
        
    return char2token

def getToken(tokenId, char2token):
    
    """ Find the token ID for given character in the POS sequence """
    while(tokenId > 0):
        if tokenId in char2token:
            return char2token[tokenId]
        tokenId-=1
    return None

In [8]:
char2token=getChar2TokenMap(tags)

Now let's find all sequences of POS tags that match the Justeson and Katz pattern of `(((ADJ|NOUN) )+|((ADJ|NOUN) )*(NOUN PREP )((ADJ|NOUN) )*)NOUN`

"In words, a candidate term is a multi-word noun phrase; and it either is a string of nouns and/or adjectives, ending in a noun, or it consists of two such strings, separated by a single preposition." (JK 17)

In [9]:
p = re.compile("(((ADJ|NOUN) )+|((ADJ|NOUN) )*(NOUN PREP )((ADJ|NOUN) )*)NOUN")

mweCount=Counter()

for m in p.finditer(tags):
    startToken=getToken(m.start(),char2token)
    endToken=getToken(m.end(),char2token)
    mwe=' '.join(words[startToken:endToken+1])
    mweCount[mwe]+=1

for k,v in mweCount.most_common(100):
    print(k,v)

United States 153
Justice Sung 124
Siu Chui 65
police corruption 65
New York 64
Lai Sam 55
Silver Oak 48
first time 46
Tit Tau 44
World War II 41
Puppet Master 41
Early life 37
United Kingdom 34
Cho Kau 32
Si Fu 31
Los Angeles 30
Kung Chan Yeung 29
New Zealand 28
police officers 27
same year 25
John Redcorn 25
New York City 23
New York Times 23
general election 23
Cairn India 23
Florida College 23
North America 22
police force 22
Personal life 22
European Union 21
same time 20
Ling Lung 20
Summer Olympics 20
Broad Institute 19
South Korea 18
median income 18
Donald Duck 18
Barnett Shale 18
Iskandar Muda 18
following year 17
Western Australia 17
San Francisco 16
Hong Kong 15
Soviet Union 15
Transparency International 15
music video 15
gold medal 15
Sri Lanka 15
Fort Worth Basin 15
Brady Bunch 15
British Columbia 14
sea level 14
same name 14
next day 14
National Register 14
Police corruption 14
Full Moon 14
I. haynei 14
South Africa 13
Second World War 13
20th century 13
Gimmie Dat 13
Mi

We'll define our MWE dictionary to be the 1000 most frequent sequences matched.

In [10]:
my_mwe=[k for (k,v) in mweCount.most_common(1000)]

Now let's transform each MWE into a single token (e.g., replace `New York City` with `New_York_City`)

In [11]:
def replaceMWE(text, mweList):
    
    """ Replace all instances of MWEs in text with single token 
    
    MWEs are ranked from longest to shortest so that longest replacements are made first (e.g.,
    "New York City" is matched first before "New York")
    
    """
    
    sorted_by_length = sorted(mweList, key=len, reverse=True)
    for mwe in sorted_by_length:
        text=re.sub(re.escape(mwe), re.sub(" ", "_", mwe), text)
    return text

In [12]:
processedText=replaceMWE("The New York Times is located in New York City", my_mwe)

In [13]:
print(processedText)

The New_York_Times is located in New_York_City


Q1. Plug in your own data into `getTokens` above and identify the MWE it contains.  How do you think MWE would perform for your classification task?