# Terminology N-Gram Identification

The basic idea here is to find groups of two or three words (bigrams or trigrams) that together represent a single technical term. Ideally, each such group would be treated as one token in other NLP operations. Examples include "inflation expectations" and "unemployment rate." I plan on adding these n-grams to the documents.

I learned about this idea from Hansen, McMahon, and Prat's 2020 paper "Transparency and Deliberation within the FOMC: a Computational Linguistics Approach." These authors in turn cite Justeson and Katz's 1995 paper "Technical terminology: some linguistic properties and an algorithm for identification in text."

In [262]:
import pandas as pd
import os

from nltk import word_tokenize, pos_tag, bigrams, trigrams

In [244]:
sdf = pd.read_csv('transcripts/speeches.csv')
tdf = pd.read_csv('transcripts/transcripts.csv')

In [249]:
sdf['tokens'] = sdf['text'].apply(lambda x : word_tokenize(str(x)))

In [250]:
tdf['tokens'] = tdf['content'].apply(lambda x : word_tokenize(str(x)))

*Important Observation*: Part-of-speech tags are sensitive to capitalization. I previously tried to get all bigrams, lowercase them, and then get the parts of speech. The POS tagger tags "I" as a personal pronoun, as expected, but tags "i" as a noun, which is then captured by the collocation patterns used here. To combat this, I modified my code to lowercase n-grams and store them in a dictionary only after recognizing that they fit the POS forms common to terminology.

In [246]:
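def keep_bigram_by_pos(bigram):
    # Tag the bigram with its original capitalization; the tagger is case-sensitive.
    tokens = word_tokenize(bigram)
    if len(tokens) < 2:
        # Re-tokenizing the joined string can change the token count.
        return False
    pos_tags = [tag for word, tag in pos_tag(tokens)]
    # Keep adjective-noun and noun-noun pairs (tag prefixes JJ and NN).
    return pos_tags[0][0:2] in ('JJ', 'NN') and pos_tags[1][0:2] == 'NN'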

In [247]:
def get_terminology_bigrams(documents):
    bigram_dict = {}
    ignore_list = set()
    for doc in documents:
        bigram_list = [' '.join([a, b]) for (a, b) in bigrams(doc)]
        for bigram in bigram_list:
            if bigram in ignore_list or not keep_bigram_by_pos(bigram):
                ignore_list.add(bigram)
                continue
            # Look up and store under the lowercased key so that capitalized
            # variants add to the count instead of resetting it to 1.
            key = bigram.lower()
            bigram_dict[key] = bigram_dict.get(key, 0) + 1
    return bigram_dict

In [237]:
def keep_trigram_by_pos(trigram):
    tokens = word_tokenize(trigram)
    if len(tokens) < 3:
        # Re-tokenizing the joined string can change the token count.
        return False
    # Compare two-letter tag prefixes: JJ adjective, NN noun, IN preposition.
    prefixes = tuple(tag[0:2] for word, tag in pos_tag(tokens)[:3])
    return prefixes in {
        ('JJ', 'JJ', 'NN'),
        ('JJ', 'NN', 'NN'),
        ('NN', 'JJ', 'NN'),
        ('NN', 'NN', 'NN'),
        ('NN', 'IN', 'NN'),
    }

In [254]:
def get_terminology_trigrams(documents):
    trigram_dict = {}
    ignore_list = set()
    for doc in documents:
        trigram_list = [' '.join([a, b, c]) for (a, b, c) in trigrams(doc)]
        for trigram in trigram_list:
            if trigram in ignore_list or not keep_trigram_by_pos(trigram):
                ignore_list.add(trigram)
                continue
            # Look up and store under the lowercased key so that capitalized
            # variants add to the count instead of resetting it to 1.
            key = trigram.lower()
            trigram_dict[key] = trigram_dict.get(key, 0) + 1
    return trigram_dict
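
The bigram and trigram functions above share one structure: check the tag prefixes on the original-case words, then lowercase the n-gram for counting. The following is a minimal sketch of that shared logic, not the notebook's method. It assumes each document has already been tagged once as a list of `(word, tag)` pairs, whereas the cells above tag every n-gram in isolation, so the two approaches may not produce identical counts. The names `count_terminology_ngrams`, `BIGRAM_PATTERNS`, `TRIGRAM_PATTERNS`, and the hand-tagged `toy_doc` are all hypothetical.

```python
from collections import Counter

# Tag-prefix patterns copied from the filter functions above
# (JJ = adjective, NN = noun, IN = preposition).
BIGRAM_PATTERNS = {("JJ", "NN"), ("NN", "NN")}
TRIGRAM_PATTERNS = {
    ("JJ", "JJ", "NN"), ("JJ", "NN", "NN"), ("NN", "JJ", "NN"),
    ("NN", "NN", "NN"), ("NN", "IN", "NN"),
}

def count_terminology_ngrams(tagged_docs, n, patterns):
    """Count n-grams whose two-letter tag prefixes match one of `patterns`.

    `tagged_docs` is an iterable of [(word, tag), ...] lists tagged on the
    original-case text. Keys are lowercased only after the pattern check.
    """
    counts = Counter()
    for tagged in tagged_docs:
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            prefixes = tuple(tag[:2] for _, tag in window)
            if prefixes in patterns:
                counts[" ".join(word.lower() for word, _ in window)] += 1
    return counts

# Hand-tagged toy sentence standing in for real tagger output (an assumption).
toy_doc = [
    ("I", "PRP"), ("expect", "VBP"), ("Inflation", "NN"),
    ("expectations", "NNS"), ("and", "CC"), ("the", "DT"),
    ("unemployment", "NN"), ("rate", "NN"), ("of", "IN"),
    ("change", "NN"),
]
toy_bigrams = count_terminology_ngrams([toy_doc], 2, BIGRAM_PATTERNS)
toy_trigrams = count_terminology_ngrams([toy_doc], 3, TRIGRAM_PATTERNS)
```

Because "I" keeps its pronoun tag here, no n-gram containing it passes the filter, while "Inflation expectations" is stored under the lowercase key `inflation expectations`.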

In [251]:
documents = tdf['tokens'].tolist() + sdf['tokens'].tolist()
bgdict = get_terminology_bigrams(documents)

In [255]:
tgdict = get_terminology_trigrams(documents)

In [None]:
relevant_bigrams = {bg: freq for bg,freq in bgdict.items() if freq > 100}
relevant_bigrams

In [None]:
relevant_trigrams = {tg: freq for tg,freq in tgdict.items() if freq > 50}
relevant_trigrams

In [261]:
os.makedirs('ngrams', exist_ok=True)  # make sure the output directory exists
with open(os.path.join('ngrams', 'bigrams'), 'w') as bg_file:
    for bg in relevant_bigrams:
        bg_file.write(bg + '\n')

In [260]:
os.makedirs('ngrams', exist_ok=True)  # make sure the output directory exists
with open(os.path.join('ngrams', 'trigrams'), 'w') as tg_file:
    for tg in relevant_trigrams:
        tg_file.write(tg + '\n')
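
The plan stated at the top, adding these n-grams to the documents as single tokens, could be sketched roughly as follows. This is one possible approach under stated assumptions, not something the notebook implements: `merge_ngrams` is a hypothetical helper, the `_` separator is an arbitrary choice, and the `terms` set is made up for illustration (in practice it could be built from the saved files, one space-separated term per line).

```python
def merge_ngrams(tokens, terms, separator="_"):
    """Greedily replace known terms in `tokens` with single joined tokens.

    `terms` is a set of lowercase tuples, e.g. {("unemployment", "rate")}.
    Longer terms are tried first, so a trigram wins over an overlapping bigram.
    Merged tokens are lowercased; all other tokens are left unchanged.
    """
    lengths = sorted({len(term) for term in terms}, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for n in lengths:
            window = tuple(tok.lower() for tok in tokens[i:i + n])
            if len(window) == n and window in terms:
                merged.append(separator.join(window))
                i += n
                break
        else:
            # No term starts at this position; keep the token as-is.
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical terms for illustration only.
terms = {("unemployment", "rate"), ("rate", "of", "change")}
example = merge_ngrams(["The", "unemployment", "rate", "fell", "."], terms)
```

NLTK also ships a `MWETokenizer` class in `nltk.tokenize` that performs a similar multi-word merge, which may be preferable to a hand-written loop.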