This notebook explores identifying multiword expressions (MWEs) using the part-of-speech filtering technique of Justeson and Katz (1995), "[Technical terminology: some linguistic properties and an algorithm for identification in text](https://brenocon.com/JustesonKatz1995.pdf)".

In [None]:
import spacy, re
from collections import Counter

In [None]:
# Only the tagger is needed below, so skip the NER and parser components.
# Note that `disable` takes a list of separate component names.
nlp = spacy.load('en', disable=['ner', 'parser'])

# workaround for those getting an error loading the spaCy 'en' model:
# nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

In [None]:
def getTokens(filename):
    
    """ Read and tokenize the first 1000 lines of an input file """
    tokens=[]
    with open(filename) as file:
        for idx,line in enumerate(file):
            # stop once 1000 lines (idx 0 through 999) have been processed
            if idx >= 1000:
                break
            tokens.extend(nlp(line))
    return tokens

In [None]:
tokens=getTokens("../data/wiki.10K.txt")
print(len(tokens))

In [None]:
words=[x.text for x in tokens]

Let's simplify the POS tags to make the regex easier to understand.

In [None]:
adjectives=set(["JJ", "JJR", "JJS"])
nouns=set(["NN", "NNS", "NNP", "NNPS"])

taglist=[]
for x in tokens:
    if x.tag_ in adjectives:
        taglist.append("ADJ")
    elif x.tag_ in nouns:
        taglist.append("NOUN")
    elif x.tag_ == "IN":
        taglist.append("PREP")
    else:
        taglist.append("O")
                
tags=' '.join(taglist)        
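
To see what the simplified tag string looks like without running spaCy, here is a minimal sketch on a hand-tagged sentence. The `(word, tag)` pairs are written by hand for illustration (they are not tagger output), and `simplify` is a hypothetical helper name.

```python
# Hand-written (word, Penn Treebank tag) pairs; an assumption for illustration,
# not actual spaCy output.
toy = [("The", "DT"), ("degrees", "NNS"), ("of", "IN"), ("freedom", "NN"),
       ("are", "VBP"), ("large", "JJ")]

ADJECTIVES = {"JJ", "JJR", "JJS"}
NOUNS = {"NN", "NNS", "NNP", "NNPS"}

def simplify(tag):
    """Collapse a fine-grained tag into ADJ, NOUN, PREP, or O (other)."""
    if tag in ADJECTIVES:
        return "ADJ"
    if tag in NOUNS:
        return "NOUN"
    if tag == "IN":
        return "PREP"
    return "O"

toy_tags = " ".join(simplify(tag) for _, tag in toy)
print(toy_tags)  # O NOUN PREP NOUN O ADJ
```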

In [None]:
def getChar2TokenMap(tags):
    
    """  We'll search over the POS tag sequence, so we need to map the character offset
    where each tag starts to the index of the corresponding word token. """
    
    ws=re.compile(" ")
    char2token={}

    lastStart=0
    idx=-1
    for idx, m in enumerate(ws.finditer(tags)):
        char2token[lastStart]=idx
        lastStart=m.start()+1
    # the final tag has no trailing space, so add it explicitly
    char2token[lastStart]=idx+1
        
    return char2token

def getToken(charOffset, char2token):
    
    """ Find the token index for a given character offset in the POS tag sequence """
    # walk left to the start of the tag containing this offset (offset 0 included)
    while(charOffset >= 0):
        if charOffset in char2token:
            return char2token[charOffset]
        charOffset-=1
    return None

In [None]:
char2token=getChar2TokenMap(tags)
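
As a quick check of the idea behind the map, the sketch below builds the character-offset-to-token-index table for a short, hand-written tag string. It uses `str.split` and a running offset rather than a regex scan, and the names `demo_tags` and `starts` are illustrative only.

```python
demo_tags = "ADJ NOUN O NOUN PREP NOUN"

# Map the character offset where each tag starts to that tag's token index.
starts = {}
offset = 0
for idx, tag in enumerate(demo_tags.split(" ")):
    starts[offset] = idx
    offset += len(tag) + 1  # +1 for the separating space

print(starts)  # {0: 0, 4: 1, 9: 2, 11: 3, 16: 4, 21: 5}
```

A regex match that starts at character 11 of this string therefore begins at token 3.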

Now let's find all sequences of POS tags that match the Justeson and Katz pattern of `(((ADJ|NOUN) )+|((ADJ|NOUN) )*(NOUN PREP )((ADJ|NOUN) )*)NOUN`

"In words, a candidate term is a multi-word noun phrase; and it either is a string of nouns and/or adjectives, ending in a noun, or it consists of two such strings, separated by a single preposition." (JK 17)

In [None]:
p = re.compile("(((ADJ|NOUN) )+|((ADJ|NOUN) )*(NOUN PREP )((ADJ|NOUN) )*)NOUN")

mweCount=Counter()

for m in p.finditer(tags):
    startToken=getToken(m.start(),char2token)
    endToken=getToken(m.end(),char2token)
    mwe=' '.join(words[startToken:endToken+1])
    mweCount[mwe]+=1

for k,v in mweCount.most_common(100):
    print(k,v)
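
To make the pattern concrete, here is a small sketch that applies it to a hand-written tag sequence and recovers the matched words. The sentence, its tags, and the offset-to-index conversion (counting the spaces before a match boundary) are assumptions made for illustration.

```python
import re

pattern = re.compile(
    "(((ADJ|NOUN) )+|((ADJ|NOUN) )*(NOUN PREP )((ADJ|NOUN) )*)NOUN"
)

# Hand-written words and simplified tags (not tagger output).
demo_words = "the degrees of freedom are a random variable".split()
demo_tags = "O NOUN PREP NOUN O O ADJ NOUN"

matches = []
for m in pattern.finditer(demo_tags):
    # The number of spaces before a character offset equals the token index.
    start = demo_tags.count(" ", 0, m.start())
    end = demo_tags.count(" ", 0, m.end())
    matches.append(" ".join(demo_words[start:end + 1]))

print(matches)  # ['degrees of freedom', 'random variable']
```

A lone noun is never returned: the first branch needs at least one adjective or noun before the final `NOUN`, and the second needs a full `NOUN PREP ... NOUN` sequence.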

We'll define our MWE dictionary to be the 1000 most frequent sequences matched.

In [None]:
my_mwe=[k for (k,v) in mweCount.most_common(1000)]

Now let's transform each MWE into a single token (e.g., replace `New York City` with `New_York_City`).

In [None]:
def replaceMWE(text, mweList):
    
    """ Replace all instances of MWEs in text with single token 
    
    MWEs are ranked from longest to shortest so that longest replacements are made first (e.g.,
    "New York City" is matched first before "New York")
    
    """
    
    sorted_by_length = sorted(mweList, key=len, reverse=True)
    for mwe in sorted_by_length:
        # plain string replacement avoids regex escape handling in the replacement text
        text=text.replace(mwe, mwe.replace(" ", "_"))
    return text

In [None]:
processedText=replaceMWE("The New York Times is located in New York City", my_mwe)

In [None]:
print(processedText)
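
The effect of the longest-first ordering can be seen with a hand-picked MWE list. The list `demo_mwes` and the helper `join_mwes` are hypothetical and chosen only for illustration; the corpus-derived `my_mwe` may contain different entries.

```python
def join_mwes(text, mwes, longest_first=True):
    """Replace each MWE in text with an underscore-joined token."""
    ordered = sorted(mwes, key=len, reverse=True) if longest_first else list(mwes)
    for mwe in ordered:
        text = text.replace(mwe, mwe.replace(" ", "_"))
    return text

demo_mwes = ["New York", "New York City", "New York Times"]
sentence = "The New York Times is located in New York City"

print(join_mwes(sentence, demo_mwes))
# The New_York_Times is located in New_York_City

print(join_mwes(sentence, demo_mwes, longest_first=False))
# The New_York Times is located in New_York City
```

When the shorter `New York` is replaced first, the longer expressions no longer appear in the text with spaces, so they are never joined.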

Q1. Plug your own data into `getTokens` above and identify the MWEs it contains.  How do you think MWEs would perform as features for your classification task?