codelab link: https://colab.research.google.com/github/abzzall/unsupervised_annotator1/blob/main/main.ipynb
github link: https://github.com/abzzall/unsupervised_annotator1.git

1. Extract multiword expression candidates. 
	1. Using the part-of-speech tags we extract multiword expression candidates, consisting of sequences of zero or more adjectives (ADJ followed by nouns (NOUN) or proper nouns (PROPNs) sequences.
	2. To generate training data for sequence tagging use sentence encoder like 
		a. EmbedRank (Bennani- Smires et al., 2018
		b. Key2Vec (Mahata et al., 2018)
    3. We implemented our Unsupervised Annotator using the POS tagger of SpaCy (Honnibal et al., 2020).

In [1]:
import re
!! pip install spacy 



In [2]:
!! python -m spacy download en_core_web_sm
!!pip install sentence-transformers




In [3]:
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex
# Load the language model
nlp = spacy.load('en_core_web_sm')

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
from sentence_transformers import SentenceTransformer, util
from typing import List

In [4]:

# Define your text
with open('abstracts.txt') as file:
    text=file.read()


In [5]:
# text = "I love New York and San Francisco. Los Angeles is another great city."


In [6]:
from collections import namedtuple
CandidateMWE = namedtuple('CandidateMWE',['text','head', 'sentence','self_encode', 'sent_encode'])
CandidateW=namedtuple('CandidateW',['text','lemma', 'self_encode' ])
Term=namedtuple('Term',['text','detected' ])

In [7]:
# Process the text with spaCy
doc = nlp(text)

# Extract MWEs (noun phrases) from the text
mwe_list = []
single_noun_list=[]
candidate_list=[]
for sent in doc.sents:
    sent_encode=model.encode(sent.text)
    for chunk in doc.noun_chunks:
        # mwe_list.append(chunk.text)
        
        if len(chunk.text.split()) == 1:
            
            single_noun_list.append(CandidateW(chunk.text, chunk.lemma_, model.encode(chunk.text)))
        if len(chunk.text.split()) > 1:
            noun_appeared=False
            is_candidate=True
            cleared_candidate=''
            word_count=0
            for word in chunk:
                #IGNORING
                if word.pos_ in ['PUNCT', 'DET']:
                    continue
                elif word.pos_ not in ['ADJ', 'PROPN', 'NOUN']:
                    is_candidate=False
                    print(f'{chunk.text} is not candidate 1 {word.text} -- {word.pos_}')
                    break
                elif word.pos_ in ['PROPN', 'NOUN']:
                    noun_appeared=True
                elif not(not noun_appeared and word.pos_=='ADJ'):
                    is_candidate=False
                    print(f'{chunk.text} is not candidate 2 {word.text} -- {word.pos_}, {noun_appeared}')
                    break
                cleared_candidate+=' '+word.text
            if is_candidate and word_count>1:
                candidate_list.append(CandidateMWE(cleared_candidate, chunk.root.text, sent.text, model.encode(cleared_candidate), sent_encode))
            else:
                single_noun_list +=[CandidateW(word.text, word.lemma_, model.encode(word.text)) for word in chunk if word.pos_ in ['NOUN', 'PROPN']]
                    
mwe_list=candidate_list+single_noun_list
# print(mwe_list)


Supervised learning models is not candidate 1 Supervised -- VERB
i.e. tasks is not candidate 1 i.e. -- X
little labeled data is not candidate 1 labeled -- VERB
One prominent approach is not candidate 1 One -- NUM
unsupervised pre-trained neural models is not candidate 1 trained -- VERB
these two approaches is not candidate 1 two -- NUM
two novel methods is not candidate 1 two -- NUM
a more complex context-aware method is not candidate 1 more -- ADV
two English text classification datasets is not candidate 1 two -- NUM
our investigation is not candidate 1 our -- PRON
train-set sizes is not candidate 1 set -- VERB
only a few dozen training instances is not candidate 1 only -- ADV
more powerful dialogue modeling capabilities is not candidate 1 more -- ADV
a pre-trained model is not candidate 1 trained -- VERB
three types is not candidate 1 three -- NUM
two dialogue summarization datasets is not candidate 1 two -- NUM
-trained and non pre-trained models is not candidate 1 trained -- VERB
o

In [8]:
def dist(wi_encode, wj_encode)->float:
    return util.pytorch_cos_sim(
        wi_encode,
        wj_encode
    )


def calculate_topic_score(expression_embedding, sentence_embedding)->float:
    """
    Calculate the topic score between a multiword expression and a sentence.

    Args:
        multiword_expression (str): The multiword expression.
        sentence (str): The sentence containing the expression.

    Returns:
        float: The topic score (cosine similarity) between the two embeddings.
    """
    # Load the distilbert-base-nli-mean-tokens model

    # Encode the multiword expression and sentence into embeddings
    # expression_embedding = model.encode(multiword_expression, convert_to_tensor=True)
    # sentence_embedding = model.encode(sentence, convert_to_tensor=True)

    # Calculate cosine similarity between the two embeddings
    similarity_score = util.pytorch_cos_sim(expression_embedding, sentence_embedding)

    # Extract the cosine similarity value from the tensor
    topic_score = similarity_score[0].item()

    return topic_score

def calculate_specificity_score(mw:CandidateMWE, w:List[CandidateW|CandidateMWE])->float:
    """
    Calculate the specificity score (SP) between a multiword expression (mw) and a list of words/multiword expressions (w).

    Args:
        mw (str): The multiword expression.
        w (list of str): The list of words/multiword expressions in the context.

    Returns:
        float: The specificity score (SP).
    """
    # Load the distilbert-base-nli-mean-tokens model
    # Calculate distances between mw and each word/phrase in w
    distances = [dist(mw.self_encode, wi.self_encode) for wi in w if wi != mw]

    # Calculate the mean of the distances
    specificity_score = sum(distances) / len(w)

    return specificity_score


In [9]:
TSP = 0.05
Ttopic = 0.1

In [15]:
candidate_list

[]

In [10]:
term_mws=[]

In [11]:
for candidate in candidate_list:
    topic_score=calculate_topic_score(candidate.self_encode, candidate.sent_encode)
    sp_score=calculate_specificity_score(candidate, mwe_list)
    if topic_score>Ttopic and sp_score>TSP:
        term_mws.append(Term(candidate.text, 'by_score'))
        print(candidate.text, topic_score, sp_score)
    

3. Upgrade single nouns according to morphological features.
	1. At this stage, we could have nouns that are not part of any multiword expressions, but still relevant.
	2. Check if the lemma of the noun is the same as any of the heads of the multiword expressions.
		a. Yes: we upgrade the noun to term 
		b. No: segment the word using a subword-unit segmentation and a vocabulary trained over a large general purpose corpus.
we use the vocabulary of the BERT-base model from HuggingFace (Wolf et al., 2020) and the corresponding tokenizer.

In [12]:
from transformers import BertTokenizer

# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [13]:
term_nouns=[]
subtoken_threshold=4
for candidate in single_noun_list:
    #Check if the lemma of the noun is the same as any of the heads of the multiword expressions.
    is_term=False
    lemma_is_head=False
    for term_mw in term_mws:
        if term_mw.head==candidate.lemma:
            is_term=True
            break
    if is_term:
        term_nouns.append(Term(candidate.text, 'by_lemma'))
        continue
    #segment the word using a subword-unit segmentation and a vocabulary trained over a large general purpose corpus.
    subtokens=tokenizer.tokenize(candidate.text)
    if len(subtokens)>subtoken_threshold:
        term_nouns.append(Term(candidate.text, 'by_subtokens'))

In [14]:
terms=term_mws+term_nouns
with open('out.txt', 'w') as out_file:
    for term in terms:
        out_file.write(f'"{term.text}" appended {term.detected}\n')
        print(f'"{term.text}" appended {term.detected}\n')
        

"$224\times320" appended by_subtokens

"Disambiguation" appended by_subtokens

"disambiguation" appended by_subtokens

"Unsupervised" appended by_subtokens

"Unsupervised" appended by_subtokens

"Unsupervised" appended by_subtokens

"http://mech.ctb.pku.edu.cn/MetaTISA/." appended by_subtokens

"$224\times320" appended by_subtokens

"Disambiguation" appended by_subtokens

"disambiguation" appended by_subtokens

"Unsupervised" appended by_subtokens

"Unsupervised" appended by_subtokens

"Unsupervised" appended by_subtokens

"http://mech.ctb.pku.edu.cn/MetaTISA/." appended by_subtokens

"$224\times320" appended by_subtokens

"Disambiguation" appended by_subtokens

"disambiguation" appended by_subtokens

"Unsupervised" appended by_subtokens

"Unsupervised" appended by_subtokens

"Unsupervised" appended by_subtokens

"http://mech.ctb.pku.edu.cn/MetaTISA/." appended by_subtokens

"$224\times320" appended by_subtokens

"Disambiguation" appended by_subtokens

"disambiguation" appended by_subt