This notebook explains the steps I took to solve the Alert Term Extraction task proposed by Thomas Moser and Michael Heil. The working program, which wraps everything into classes and writes a JSON file to disk for testing, is included in the same folder; here I lay out the reasoning behind the final solution alongside the code cells.

In [54]:
import requests
import pandas as pd
import json
from cryptography.fernet import Fernet
from spacy.matcher import PhraseMatcher
import spacy
from unidecode import unidecode

The first step is to call the API to get the data to work with.

The API access key has been symmetrically encrypted so that it does not sit in plain sight in the repository.

The first cell decrypts the access key with the provided key file (key.key) so that the API calls can be made.

In [55]:
# load the Fernet key used for the symmetric encryption
with open("key.key", "rb") as file:
    enckey = file.read()
f = Fernet(enckey)
with open("secretkey.key", "rb") as file:
    # read the encrypted data
    encrypted_data = file.read()
# decrypt data
decrypted_data = f.decrypt(encrypted_data).decode()
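
The two files read above could have been produced along the following lines. This is a sketch under assumptions: the notebook does not show how key.key and secretkey.key were created, and the helper name `encrypt_secret` and the placeholder secret are hypothetical.

```python
from cryptography.fernet import Fernet


def encrypt_secret(secret: str) -> tuple:
    """Return (fernet_key, encrypted_token) for a plain-text secret."""
    key = Fernet.generate_key()  # random, URL-safe base64-encoded key
    token = Fernet(key).encrypt(secret.encode())
    return key, token


# "my-api-key" is a placeholder, not a real credential.
key, token = encrypt_secret("my-api-key")

# Under this assumption, key would be written to key.key and token to secretkey.key.
# Decrypting with the same key gives the original secret back.
print(Fernet(key).decrypt(token).decode())
```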

This cell populates the parameters for the API calls (in our case only the authentication key).

In [56]:
payload = {'key': decrypted_data}

Here the calls are made and each response is stored as a JSON string, which is later parsed into a pandas DataFrame for easier data analysis and manipulation.

In [57]:
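try:
    # the timeout value is an arbitrary choice; without one the Timeout handler below is rarely reached
    r = requests.get('https://services.prewave.ai/adminInterface/api/testQueryTerm', params=payload, timeout=30)
    # raise HTTPError on 4xx/5xx responses so that the first handler can catch it
    r.raise_for_status()
    queryterms = json.dumps(r.json(), indent = 4)
    print('Query Terms Downloaded!')

except requests.exceptions.HTTPError as errh:
    print(errh)
except requests.exceptions.ConnectionError as errc:
    print(errc)
except requests.exceptions.Timeout as errt:
    print(errt)
except requests.exceptions.RequestException as err:
    print(err)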

Query Terms Downloaded!


In [58]:
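try:
    # the timeout value is an arbitrary choice; without one the Timeout handler below is rarely reached
    r = requests.get('https://services.prewave.ai/adminInterface/api/testAlerts', params=payload, timeout=30)
    # raise HTTPError on 4xx/5xx responses so that the first handler can catch it
    r.raise_for_status()
    alerts = json.dumps(r.json(), indent = 4)
    print('Alerts Downloaded!')

except requests.exceptions.HTTPError as errh:
    print(errh)
except requests.exceptions.ConnectionError as errc:
    print(errc)
except requests.exceptions.Timeout as errt:
    print(errt)
except requests.exceptions.RequestException as err:
    print(err)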

Alerts Downloaded!


In [59]:
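from io import StringIO
# recent pandas versions deprecate passing a literal JSON string to read_json
dfquery = pd.read_json(StringIO(queryterms))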
display(dfquery)

Unnamed: 0,id,target,text,language,keepOrder
0,101,1,IG Metall,de,True
1,102,1,IG Metall,en,True
2,103,1,Industriegewerkschaft Metall,de,False
3,201,2,Arbeitsplatz,de,True
4,202,2,Arbeitsplätze,de,True
5,203,2,job,en,True
6,204,2,jobs,en,True
7,301,3,pollution,en,True
8,302,3,inquinante,it,True
9,401,4,lithium,en,True


The data above shows the query terms. Besides a unique id, each term has a target: terms with the same target number refer to the same entity and are to be treated as semantically equivalent.

Apart from the text itself, the terms vary along two dimensions: language and order. Compounds (terms composed of more than one word) with the keepOrder flag set to True must appear with their words next to each other in the given order, whereas with the flag set to False the words may be shifted in position or have other words in between (e.g. 'minimal involvement' could appear as 'the involvement is minimal' and would still need to be flagged).
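
To make the two keepOrder cases concrete, here is a minimal pure-Python sketch of one possible reading of the flag. The function name `term_matches` and the whitespace tokenization are assumptions for illustration only; this is not the matching logic used later in this notebook, which relies on spaCy.

```python
def term_matches(text: str, term: str, keep_order: bool) -> bool:
    """Check whether a query term occurs in a text (illustrative only)."""
    tokens = text.lower().split()
    term_tokens = term.lower().split()
    if keep_order:
        # keepOrder == True: the term tokens must appear side by side, in order.
        n = len(term_tokens)
        return any(tokens[i:i + n] == term_tokens
                   for i in range(len(tokens) - n + 1))
    # keepOrder == False: every term token must appear somewhere, in any order.
    return all(tok in tokens for tok in term_tokens)


sentence = "the involvement is minimal"
print(term_matches(sentence, "minimal involvement", keep_order=True))
print(term_matches(sentence, "minimal involvement", keep_order=False))
```

With the strict reading the example sentence is rejected, while the relaxed reading accepts it, which is the behaviour the keepOrder flag is meant to distinguish.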

In [60]:
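from io import StringIO
# recent pandas versions deprecate passing a literal JSON string to read_json
dfalerts = pd.read_json(StringIO(alerts))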
with pd.option_context('display.max_colwidth', 200):
    display(dfalerts)

Unnamed: 0,id,contents,date,inputType
0,ui75uz5z4hreg,"[{'text': 'Primo que honda los giles qué opinan de política pero todavía viven con los padres? Hasta que no vivan solos y paguen sus impuestos cierren los panes y no sequen el huevo', 'type': 'tex...",2022-04-18 19:52:15.776000+00:00,tweet
1,u5rgtgz54bz4rr,"[{'text': 'Tata Motors’ JLR extends COVID-19 production pause in UK https://t.co/a91RuzqsbB https://t.co/hkGjpJhtH7', 'type': 'text', 'language': 'en'}]",2022-04-18 19:48:54.907000+00:00,tweet
2,i76u5zvferee,[{'text': '@ryanlangdon_ I don’t have a monologue but I also can’t visualize. So when I close my eyes all I see is black. It’s called aphantasia. Check out Nicola Tesla. He could create worlds in...,2022-04-18 19:45:25.112000+00:00,tweet
3,tz65hg4g5zht4gr,"[{'text': 'the Rai’s a different breed bro', 'type': 'text', 'language': 'en'}]",2022-04-18 19:28:12.735000+00:00,tweet
4,nuzbtju4654r,"[{'text': 'Tesla self-driving traffic light and stop-sign interaction explained in leaked manual https://t.co/8gqNvkl4bS via @FredericLambert', 'type': 'text', 'language': 'en'}]",2022-04-18 19:35:21.476000+00:00,tweet


Each alert has an ID and a nested JSON field (contents) that holds the texts together with all the attributes required for searching and matching terms: the text itself, the language it is written in and the type of text. The date and inputType fields can be ignored for this specific task.

To get the data into a shape that is easier to process, I am going to flatten the nested contents field so that its attributes become columns. The contents field can hold more than one text, so each text needs to be extracted into its own row while sharing the same alert ID.

In [96]:
alertsflattened = []
for i,row in dfalerts.iterrows():
    # flatten the nested contents list: one row per text
    flattenedcontentdf = pd.json_normalize(row['contents'])
    flattenedcontentdf['id'] = row['id']
    # turn the alert row (a Series) into a one-row dataframe for merging
    rowdf = pd.DataFrame(row).transpose()
    mergeddf = pd.merge(rowdf,flattenedcontentdf, how="outer", on="id")
    mergeddf = mergeddf.drop(columns='contents')
    alertsflattened.append(mergeddf)
dfalertsflattened = pd.concat(alertsflattened).reset_index()
# drop alerts without any text
dfalertsflattened = dfalertsflattened[dfalertsflattened['text'].notna()]
# normalize miswritten language codes (e.g. ên -> en)
dfalertsflattened['language'] = dfalertsflattened['language'].apply(unidecode)
display(dfalertsflattened)


Unnamed: 0,index,id,date,inputType,text,type,language
0,0,ui75uz5z4hreg,2022-04-18 19:52:15.776000+00:00,tweet,Primo que honda los giles qué opinan de política pero todavía viven con los padres? Hasta que no vivan solos y paguen sus impuestos cierren los panes y no sequen el huevo,text,es
1,0,u5rgtgz54bz4rr,2022-04-18 19:48:54.907000+00:00,tweet,Tata Motors’ JLR extends COVID-19 production pause in UK https://t.co/a91RuzqsbB https://t.co/hkGjpJhtH7,text,en
2,0,i76u5zvferee,2022-04-18 19:45:25.112000+00:00,tweet,"@ryanlangdon_ I don’t have a monologue but I also can’t visualize. So when I close my eyes all I see is black. It’s called aphantasia. Check out Nicola Tesla. He could create worlds in this mind that looked real. Sights, sounds, smells. He could see every working part of his inventions.",text,en
3,0,tz65hg4g5zht4gr,2022-04-18 19:28:12.735000+00:00,tweet,the Rai’s a different breed bro,text,en
4,0,nuzbtju4654r,2022-04-18 19:35:21.476000+00:00,tweet,Tesla self-driving traffic light and stop-sign interaction explained in leaked manual https://t.co/8gqNvkl4bS via @FredericLambert,text,en


In the code above I, in sequence: initialized a list to hold the flattened dataframes; iterated through the rows to extract the JSON inside the contents column; flattened it into a dataframe; transposed the original row so that it could be merged with the flattened contents; dropped the contents column; and appended the result to the list.
At the end of the iteration the dataframes are concatenated into a single dataframe with all the data needed for processing.
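
The same flattening can be sketched without pandas, which may make the target shape easier to see. The sample records below are made up for illustration (the second one carries two texts); only the field names `id`, `inputType`, `contents`, `text`, `type` and `language` come from the API response shown above, and `flatten_alerts` is a hypothetical helper.

```python
# Hypothetical sample alerts; only the field names mirror the real API response.
sample_alerts = [
    {"id": "a1", "inputType": "tweet",
     "contents": [{"text": "Tesla opens a plant", "type": "text", "language": "en"}]},
    {"id": "a2", "inputType": "tweet",
     "contents": [{"text": "Neue Arbeitsplätze", "type": "text", "language": "de"},
                  {"text": "IG Metall begrüßt den Plan", "type": "text", "language": "de"}]},
]


def flatten_alerts(alerts: list) -> list:
    """Return one flat dict per nested content entry, repeating the alert fields."""
    rows = []
    for alert in alerts:
        # Copy every top-level field except the nested list itself.
        shared = {k: v for k, v in alert.items() if k != "contents"}
        for content in alert["contents"]:
            rows.append({**shared, **content})
    return rows


flat_rows = flatten_alerts(sample_alerts)
for row in flat_rows:
    print(row["id"], row["language"], row["text"])
```

Each nested text ends up in its own row and the alert id is repeated, which is the layout the pandas version produces.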

The language code is sometimes miswritten (during data exploration I noticed that en sometimes appears as ên), so it has to be normalized to give a consistent value for comparisons and for selecting the right language pipeline. That is what the line just before the dataframe is displayed does.
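
For readers without unidecode installed, a rough standard-library equivalent for this specific case is sketched below. It is an assumption-laden stand-in, not what the notebook runs: it only strips combining accents (it does not transliterate non-Latin scripts as unidecode does), and the extra lowercasing and trimming are additions of this sketch. The name `normalize_language_code` is hypothetical.

```python
import unicodedata


def normalize_language_code(code: str) -> str:
    """Strip diacritics, surrounding whitespace and case from a language code."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", code)
    # Drop the combining marks and keep only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.strip().lower()


print(normalize_language_code("ên"))
print(normalize_language_code("de"))
```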

The matching will be carried out with the spaCy library. I am going to initialize the standard small pretrained language models available through the library in order to take advantage of their tokenizers; each text will be fed to the tokenizer that corresponds to the language attribute extracted from the dataset. Because the query terms cover only 4 high-resource languages, I assume this approach is viable; in a broader multilingual context, where such models may not be available, it would be better to fall back on a rule-based method.



First of all, I extract the tuples of query terms to be fed to the matcher: spaCy provides a PhraseMatcher object that matches query phrases against the content of a Doc (the library structure created to hold texts). I am going to initialize one PhraseMatcher per language with the query terms that carry that language attribute. Each PhraseMatcher entry holds the query term ID (used as the label, for tracing a match back to its query term in the extraction phase) and the term itself; the keepOrder attribute is used to add a reversed entry when it is False (e.g. for "minimal involvement" the phrase "involvement minimal" is added as a separate entry under the same label, so that the other word order is matched too).

In [62]:
def zipqueryterms(language:str):
    # select the query terms of one language and return (id, text, keepOrder) tuples
    langterms = dfquery[dfquery['language'] == language]
    return list(zip(langterms['id'], langterms['text'], langterms['keepOrder']))

def createphrasematcher(languagemodel:spacy.Language,querytermslist:list):
    # match on the lowercased form of each token
    matcherobject = PhraseMatcher(languagemodel.vocab, attr="LOWER")
    for termid, term, keeporder in querytermslist:
        matcherobject.add(str(termid), [languagemodel.make_doc(term.lower())])
        if not keeporder:
            # keepOrder is False: also register the term with its words reversed
            matcherobject.add(str(termid), [languagemodel.make_doc(" ".join(term.lower().split(" ")[::-1]))])
    return matcherobject

In [63]:
eng = spacy.load('en_core_web_sm')
spa = spacy.load('es_core_news_sm')
deu = spacy.load('de_core_news_sm')
ita = spacy.load('it_core_news_sm')

engquery = zipqueryterms('en')
spaquery = zipqueryterms('es')
deuquery = zipqueryterms('de')
itaquery = zipqueryterms('it')

engmatcher = createphrasematcher(eng,engquery)
spamatcher = createphrasematcher(spa,spaquery)
deumatcher = createphrasematcher(deu,deuquery)
itamatcher = createphrasematcher(ita,itaquery)

Now that the matcher objects are ready I am going to write the main algorithm. The advantage of using spaCy language objects is that tokenizing and lowercasing the texts and removing stop words take only a few lines of code.
The matchers are set to match on the lowercased form of the tokens through the attribute "LOWER" that was set at initialization.
Each text is extracted with its alert id and, according to its language attribute, turned into a list of lowercased tokens with stop words and punctuation removed; the tokens are then joined back into a document that is fed to the PhraseMatcher. The PhraseMatcher returns the matched text spans along with the query id of the matched phrase, and this is combined with the alert id to form a unique tuple of keys. Finally, the matches are returned as a set to remove possible duplicates.
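
The flow described above can be traced end to end with a small pure-Python stand-in. This is only a sketch under assumptions: the stop list and punctuation set are tiny and hand-picked (the spaCy models ship much larger ones), tokenization is a plain whitespace split, and `clean_tokens` and `find_matches` are hypothetical names. The query ids 801 and 601 are the ones used for "minimal involvement" and "Tesla" in the query terms.

```python
# Tiny hand-picked stop list and punctuation set: assumptions for this sketch only.
STOP_WORDS = {"the", "is", "a", "for", "of", "in"}
PUNCTUATION = ".,;:!?"


def clean_tokens(text: str) -> list:
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = [tok.strip(PUNCTUATION) for tok in text.lower().split()]
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


def find_matches(alert_id: str, text: str, patterns: dict) -> set:
    """Return a set of (alert id, query id, matched text) tuples.

    `patterns` maps a query id to a list of phrases, each a list of tokens.
    """
    tokens = clean_tokens(text)
    found = set()
    for query_id, phrases in patterns.items():
        for phrase in phrases:
            n = len(phrase)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == phrase:
                    found.add((alert_id, query_id, " ".join(phrase)))
    return found


# Query 801 has keepOrder == False, so its reversed form is registered too.
patterns = {
    "801": [["minimal", "involvement"], ["involvement", "minimal"]],
    "601": [["tesla"]],
}
result = find_matches("demo-1", "The involvement is minimal for Tesla. Tesla again.", patterns)
print(sorted(result))
```

Removing "is" brings "involvement" and "minimal" next to each other, so the reversed entry matches, and the second occurrence of "tesla" collapses into the first because the result is a set.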

Note that for every language other than the 4 provided, English is assumed for tokenization and matching. I considered transliterating or translating the sentences to be out of scope for this task, since reaching production quality would require good neural models trained on the specific register and domain of the target text. This fallback makes it possible to get marginally better results when entities appear in a text in an unknown language: if Latin characters are used, a correct match is still obtained (e.g. the phrase "dp world").

In [64]:
def tokenizetext(text:str, languagemodel:spacy.Language):
    # lowercased tokens with stop words and punctuation removed
    return [t.text.lower() for t in languagemodel(text) if (not t.is_stop and not t.is_punct)]

def matchandreturn(tokenizedtext:list, phrasematcher:spacy.matcher.PhraseMatcher, tokenizedtextid:str, languagemodel:spacy.Language):
    # rebuild a document from the cleaned tokens and run the matcher on it
    matches = phrasematcher(languagemodel(" ".join(tokenizedtext)), as_spans=True)
    # a set removes duplicate (alert id, query id, matched text) tuples
    return {(tokenizedtextid, match.label_, match.text) for match in matches}

This function and the subsequent loop contain the business logic of the model: they take every row and run the algorithm described above, building the final match list. Empty sets contribute nothing, and sets with more than one match are unnested so that the final dataset is easy to build.

In [65]:
def tokenizematchandappend(row:pd.Series, phrasematcher:spacy.matcher.PhraseMatcher, languagemodel:spacy.Language, finaldataset:list):
    tokensentence = tokenizetext(row.text,languagemodel)
    matchesset = matchandreturn(tokensentence,phrasematcher,row.id,languagemodel)
    for matchset in matchesset:
        finaldataset.append(matchset)


In [66]:
matchesdataset = []
for i,row in dfalertsflattened.iterrows():
    if row.language == 'es':
        tokenizematchandappend(row, spamatcher, spa, matchesdataset)
    elif row.language == 'de':
        tokenizematchandappend(row, deumatcher, deu, matchesdataset)
    elif row.language == 'it':
        tokenizematchandappend(row, itamatcher, ita, matchesdataset)
    else:
        tokenizematchandappend(row, engmatcher, eng, matchesdataset)
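
The branching above amounts to a lookup with English as the default. A minimal sketch of that idea follows; placeholder strings stand in for the loaded spaCy matchers and pipelines so that it runs on its own, and the names `RESOURCES` and `pick_resources` are hypothetical.

```python
# Placeholder strings stand in for the (matcher, pipeline) objects built earlier.
RESOURCES = {
    "en": ("engmatcher", "eng"),
    "es": ("spamatcher", "spa"),
    "de": ("deumatcher", "deu"),
    "it": ("itamatcher", "ita"),
}


def pick_resources(language: str) -> tuple:
    """Return the (matcher, pipeline) pair for a language, defaulting to English."""
    return RESOURCES.get(language, RESOURCES["en"])


print(pick_resources("de"))
print(pick_resources("fr"))  # not one of the 4 languages: falls back to English
```

A table like this keeps the fallback rule in one place, which may be convenient if more languages are added later.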

The last step is to put the data in a dataframe and dump it as JSON, a universal format that is readable and easy to manipulate as an output.
The format is {index:{alertid:value, queryid:value, text:value}}.

In [67]:
matchesdf = pd.DataFrame(matchesdataset,columns=['alertid','queryid','text'])
matchesdf.to_json(orient='index',force_ascii=False)

'{"0":{"alertid":"u5rgtgz54bz4rr","queryid":"903","text":"covid-19"},"1":{"alertid":"i76u5zvferee","queryid":"501","text":"close"},"2":{"alertid":"i76u5zvferee","queryid":"601","text":"tesla"},"3":{"alertid":"nuzbtju4654r","queryid":"601","text":"tesla"}}'
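
As an illustration of how a consumer might read this format back, the sketch below loads the string printed above with the standard json module and groups the matched query ids by alert. The variable names are assumptions of this sketch.

```python
import json

# Output string copied from the cell above.
raw = '{"0":{"alertid":"u5rgtgz54bz4rr","queryid":"903","text":"covid-19"},"1":{"alertid":"i76u5zvferee","queryid":"501","text":"close"},"2":{"alertid":"i76u5zvferee","queryid":"601","text":"tesla"},"3":{"alertid":"nuzbtju4654r","queryid":"601","text":"tesla"}}'

records = json.loads(raw)

# Group the matched query ids by alert id.
by_alert = {}
for entry in records.values():
    by_alert.setdefault(entry["alertid"], []).append(entry["queryid"])

for alert_id, query_ids in by_alert.items():
    print(alert_id, query_ids)
```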

I leave this test code cell here: I used it to check that stop word removal works and that the mechanism for keepOrder equal to False works correctly.

In [27]:
teststring = 'the involvement is minimal looking for a big deal.'
print(engquery)
# matching on the raw text finds nothing: 'is' sits between the two words
secmatches = engmatcher(eng(teststring), as_spans=True)
print(secmatches)
# after stop word removal the reversed entry 'involvement minimal' matches
secmatches2 = engmatcher(eng(" ".join(tokenizetext(teststring,eng))), as_spans=True)
print(secmatches2[0], secmatches2[0].label_)

[(102, 'IG Metall', True), (203, 'job', True), (204, 'jobs', True), (301, 'pollution', True), (401, 'lithium', True), (501, 'close', True), (502, 'closure', True), (503, 'closing', True), (601, 'Tesla', True), (701, 'Yuasa', True), (801, 'minimal involvement', False), (901, 'coronavirus', True), (903, 'covid-19', True), (1101, 'fake', True), (1102, 'faking', True), (1201, 'dp world', True), (1204, 'Dubai Ports World', True)]
[]
involvement minimal 801


This is a text output of the query and alerts datasets and of the matches found, with ids and full text, for testing purposes.

In [95]:
pd.options.display.max_colwidth = 0
print("QUERY TERMS")
print(dfquery[["id","text","language"]])
print("------------------")
print("ALERTS")
print(dfalertsflattened[["id", "text","language"]])
print("------------------")
print("MATCHES")
print("------------------")
for i,row in matchesdf.iterrows():
    print("ALERT ID")
    print(row.alertid)
    print("ALERT TEXT")
    print(dfalertsflattened[(dfalertsflattened['id']==row.alertid)].text)
    print("QUERY TERM ID")
    print(row.queryid)
    print("QUERY TERM MATCHED")
    print(row.text)
    print("------------------")

QUERY TERMS
      id                          text language
0   101   IG Metall                     de     
1   102   IG Metall                     en     
2   103   Industriegewerkschaft Metall  de     
3   201   Arbeitsplatz                  de     
4   202   Arbeitsplätze                 de     
5   203   job                           en     
6   204   jobs                          en     
7   301   pollution                     en     
8   302   inquinante                    it     
9   401   lithium                       en     
10  501   close                         en     
11  502   closure                       en     
12  503   closing                       en     
13  601   Tesla                         en     
14  602   Tesla                         de     
15  603   Tesla                         it     
16  604   Tesla                         es     
17  701   Yuasa                         en     
18  801   minimal involvement           en     
19  901   coronavirus       