# Sentence Finding

Our task is to find a sentence that illustrates how our method improves on our five baselines, in support of our summary statistics. The method will naturally do better on some sentences and worse on others, so we want a representative sentence that shows the improvement and, where possible, also exposes any weaknesses of our metric to the reader.

In [4]:
from os import listdir
from os.path import isfile, join
base_path = "opus_base/best_results"
big_path = "opus_big/best_results"
base_files = [join(base_path, f) for f in listdir(base_path) if isfile(join(base_path, f))] 
big_files = [join(big_path, f) for f in listdir(big_path) if isfile(join(big_path, f))]

In [28]:
# Reorder to: Baseline, FT, Wang, Wu, Choi, Best_WCE (indices depend on the listdir order)
base_files = [base_files[0], base_files[2], base_files[3], base_files[5], base_files[1], base_files[4]]

In [31]:
# Reorder to: Baseline, FT, Wang, Wu, Choi, Best_WCE (indices depend on the listdir order)
big_files = [big_files[0], big_files[2], big_files[3], big_files[4], big_files[1], big_files[5]]

In [29]:
base_files

['opus_base/best_results\\opus-mt-en-fr_onto_output.txt',
 'opus_base/best_results\\opus_wmt_finetuned_enfr_hpc_onto_output.txt',
 'opus_base/best_results\\opus_wmt_finetuned_enfr_wang_2022_onto_output.txt',
 'opus_base/best_results\\opus_wmt_finetuned_enfr_wu_2022_onto_output.txt',
 'opus_base/best_results\\opus_wmt_finetuned_enfr_choi_2022_onto_output.txt',
 'opus_base/best_results\\opus_wmt_finetuned_enfr_wce_onto_output.txt']

In [32]:
big_files

['opus_big/best_results\\opus-mt-tc-big-en-fr_onto_output.txt',
 'opus_big/best_results\\opus_big_enfr_FT_onto_output.txt',
 'opus_big/best_results\\opus_big_enfr_FT_wang_2022_onto_output.txt',
 'opus_big/best_results\\opus_big_enfr_FT_wu_2022_onto_output.txt',
 'opus_big/best_results\\opus_big_enfr_FT_choi_2022_onto_output.txt',
 'opus_big/best_results\\opus_big_fine_freq_wce_unsampled_onto_output.txt']
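
The positional reordering above depends on the order in which `listdir` returns the files, which Python does not guarantee. A minimal sketch of a keyword-based alternative follows. `order_by_method` and the keyword list are hypothetical and not part of the original notebook; the file names are copied from the `opus_base` listing, and the mapping of `hpc` to the fine-tuned slot is an assumption based on the resulting order.

```python
def order_by_method(paths, keywords):
    """Return one path per keyword, in keyword order (hypothetical helper)."""
    ordered = []
    for keyword in keywords:
        matches = [p for p in paths if keyword in p]
        if len(matches) != 1:
            raise ValueError(
                f"expected exactly one file for {keyword!r}, found {len(matches)}"
            )
        ordered.append(matches[0])
    return ordered

# File names copied from the opus_base listing, in alphabetical order.
listing = [
    "opus-mt-en-fr_onto_output.txt",
    "opus_wmt_finetuned_enfr_choi_2022_onto_output.txt",
    "opus_wmt_finetuned_enfr_hpc_onto_output.txt",
    "opus_wmt_finetuned_enfr_wang_2022_onto_output.txt",
    "opus_wmt_finetuned_enfr_wce_onto_output.txt",
    "opus_wmt_finetuned_enfr_wu_2022_onto_output.txt",
]
# Assumed keywords for: Baseline, FT, Wang, Wu, Choi, Best_WCE.
keywords = ["opus-mt-en-fr", "_hpc_", "_wang_", "_wu_", "_choi_", "_wce_"]
ordered = order_by_method(listing, keywords)
```

The `opus_big` names would need a different keyword list, because `FT` appears in several of them and the helper deliberately raises on ambiguous matches.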

In [33]:
# Sum the true positives per sentence for every opus-base system
import pandas as pd
from os.path import basename
base_dataframes = []
big_dataframes = []
for filename in base_files:
    # basename() drops the directory whichever separator join() produced
    name = basename(filename).replace("_onto_output.txt", "")
    df = pd.read_csv(filename, sep="\t")
    df = df[["sent_ID", "true_positives"]]
    df_grouped_tp = df.groupby(["sent_ID"]).sum()
    base_dataframes.append((df_grouped_tp, name))

In [34]:
base_dataframes[0]

(         true_positives
 sent_ID                
 0                     1
 1                     0
 2                     1
 3                     1
 4                     2
 ...                 ...
 582                   3
 584                   3
 585                   1
 586                   3
 587                   3
 
 [559 rows x 1 columns],
 'opus-mt-en-fr')
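
The grouping step can be illustrated on toy data. The rows below are hypothetical, and the assumption that a results file holds several rows per sentence is inferred from the `groupby`/`sum` call rather than stated in the notebook. The sketch also shows how a table can have fewer rows than its largest `sent_ID` suggests, as in the output above (559 rows, IDs up to 587): a sentence with no rows is absent from the grouped index instead of being listed with a count of 0.

```python
import pandas as pd

# Hypothetical toy rows: several rows per sentence, none for sentence 2.
toy = pd.DataFrame({
    "sent_ID": [0, 0, 1, 3, 3, 3],
    "true_positives": [1, 0, 0, 1, 1, 1],
})

# Same aggregation as in the notebook: one summed count per sentence.
per_sentence = toy.groupby(["sent_ID"]).sum()

# IDs in the expected range that never appear in the grouped index.
missing = sorted(set(range(4)) - set(per_sentence.index))

print(per_sentence)
print("missing sentence IDs:", missing)
```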

In [35]:
# Collate the per-sentence counts into one table, one column per system
first_df = base_dataframes[0][0].rename(columns={"true_positives": "tp_" + base_dataframes[0][1]})
for tup in base_dataframes[1:]:
    # Assignment aligns on sent_ID; sentences missing from a system become NaN
    first_df["tp_" + tup[1]] = tup[0]["true_positives"].astype(int)

In [37]:
first_df.to_csv("opus_base_collated_tp.txt", sep = "\t")
#Sentence ID 324 seems promising.

In [38]:
# Sum the true positives per sentence for every opus-big system
from os.path import basename
for filename in big_files:
    # basename() drops the directory whichever separator join() produced
    name = basename(filename).replace("_onto_output.txt", "")
    df = pd.read_csv(filename, sep="\t")
    df = df[["sent_ID", "true_positives"]]
    df_grouped_tp = df.groupby(["sent_ID"]).sum()
    big_dataframes.append((df_grouped_tp, name))

In [39]:
# Collate the per-sentence counts into one table, one column per system
first_df = big_dataframes[0][0].rename(columns={"true_positives": "tp_" + big_dataframes[0][1]})
for tup in big_dataframes[1:]:
    # Assignment aligns on sent_ID; sentences missing from a system become NaN
    first_df["tp_" + tup[1]] = tup[0]["true_positives"].astype(int)

In [40]:
first_df.to_csv("opus_big_collated_tp.txt", sep = "\t")
#TO CHECK LATER
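
The collated tables are written to disk, presumably for manual inspection. A rough way to shortlist candidate sentences automatically is to rank them by the margin between one system's column and the best of the other columns. This is only a sketch: `rank_candidates`, the toy values and the column names are hypothetical, and a large margin is a starting point for reading the sentence, not evidence that it is representative.

```python
import pandas as pd

def rank_candidates(collated, ours):
    """Rank sentences by how far column `ours` exceeds every other column."""
    others = collated.drop(columns=[ours])
    # Margin over the strongest competing system for each sentence.
    margin = collated[ours] - others.max(axis=1)
    # Keep only sentences where `ours` is strictly best, largest margin first.
    return margin[margin > 0].sort_values(ascending=False)

# Hypothetical collated table indexed by sent_ID.
toy = pd.DataFrame(
    {"tp_baseline": [1, 2, 3], "tp_ft": [1, 3, 3], "tp_wce": [3, 3, 4]},
    index=pd.Index([10, 11, 12], name="sent_ID"),
)
ranked = rank_candidates(toy, "tp_wce")
print(ranked)
```

Sentences where the chosen system only ties with another one are dropped, since they would not illustrate a difference.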

Now that we've gathered our true positive data (so we can find the best sentences to display), let's set up our annotator.

In [41]:
import urllib.request, urllib.error, urllib.parse
import json
REST_URL = "https://services.bioportal.lirmm.fr"
API_KEY = "97be3b10-804a-4c98-9407-05caf1629ebb"
ONTO_SELECTED = "&ontologies=CISP-2,SNOMED35VF,CIF,WHO-ARTFRE,STY,ATCFRE,CIM-11,MEDLINEPLUS,MTHMSTFRE,MSHFRE,MDRFRE" #All French UMLS and SNOMED 3.5 ontologies
OPTIONS_1 = "&longest_only=true&exclude_numbers=false&whole_word_only=true&exclude_synonyms=false&expand_mappings=false&fast_context=false&certainty=false" #Match longest only because we want the whole concept, not the parts
OPTIONS_2 = "&temporality=false&experiencer=false&negation=false&lemmatize=false&score_threshold=0&confidence_threshold=0&display_links=false&display_context=false"
#We wanted to use fast_context, but this ran into several problems parsing punctuation. We'll leave this for future work.
PREFERENCE_STRING = ONTO_SELECTED + OPTIONS_1 + OPTIONS_2

def get_json(url):
    opener = urllib.request.build_opener()
    opener.addheaders = [('Authorization', 'apikey token=' + API_KEY)]
    return json.loads(opener.open(url).read())

In [54]:
def annotate_sentence(sentence):
    """Print each annotated concept once per character span."""
    sentence = sentence.replace("%", " pour cent ")
    annotations = get_json(REST_URL + "/annotator?text=" + urllib.parse.quote(sentence) + PREFERENCE_STRING)
    found = set()  # (from, to) spans already printed
    for result in annotations:
        first = result["annotations"][0]
        tag_loc = (first["from"], first["to"])
        # The same span can be returned more than once; print only the first hit
        if tag_loc not in found:
            print(result["annotatedClass"]['prefLabel'])
            print(result['annotations'])
            found.add(tag_loc)
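
The de-duplication in `annotate_sentence` keeps only the first concept reported for each character span. The sketch below isolates that logic so it can be checked without calling the service. `unique_spans` is a hypothetical helper, and the sample response only imitates the fields the notebook reads (`annotatedClass.prefLabel` and `annotations[0]`); its second entry is a made-up duplicate.

```python
def unique_spans(annotations):
    """Return (prefLabel, (from, to)) for the first hit on each span."""
    seen = set()
    kept = []
    for result in annotations:
        first = result["annotations"][0]
        span = (first["from"], first["to"])
        if span not in seen:
            seen.add(span)
            kept.append((result["annotatedClass"]["prefLabel"], span))
    return kept

# Imitation of a parsed annotator response; the second entry is invented
# to show a repeated span being dropped.
sample = [
    {"annotatedClass": {"prefLabel": "Patients"},
     "annotations": [{"from": 10, "to": 17, "matchType": "PREF", "text": "PATIENTS"}]},
    {"annotatedClass": {"prefLabel": "Patient"},
     "annotations": [{"from": 10, "to": 17, "matchType": "SYN", "text": "PATIENTS"}]},
    {"annotatedClass": {"prefLabel": "Tomodensitométrie"},
     "annotations": [{"from": 44, "to": 60, "matchType": "PREF", "text": "TOMODENSITOMÉTRIE"}]},
]
spans = unique_spans(sample)
print(spans)
```

Returning the pairs instead of printing them would also make it easier to compare the concepts found in the reference with those found in each system output.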

Finally, let's gather the identified sentences for annotation.

In [70]:
base_pred_path = "opus_base/best_preds"
big_pred_path = "opus_big/best_preds"
base_pred_files = [join(base_pred_path, f) for f in listdir(base_pred_path) if isfile(join(base_pred_path, f))] 
big_pred_files = [join(big_pred_path, f) for f in listdir(big_pred_path) if isfile(join(big_pred_path, f))]

In [71]:
# Reorder to: Baseline, FT, Wang, Wu, Choi, Best_WCE (indices depend on the listdir order)
base_pred_files = [base_pred_files[0], base_pred_files[2], base_pred_files[3], base_pred_files[5], base_pred_files[1], base_pred_files[4]]
big_pred_files = [big_pred_files[0], big_pred_files[2], big_pred_files[3], big_pred_files[4], big_pred_files[1], big_pred_files[5]]

In [72]:
#Order: Source, Ref, Baseline, FT, Wang, Wu, Choi, Best_WCE
base_sentence_files = ["wmt22test.txt", "wmt22gold.txt"] + base_pred_files
big_sentence_files = ["wmt22test.txt", "wmt22gold.txt"] + big_pred_files

In [90]:
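from os.path import basename

def get_sentences(filenames, sentNum):
    """Print sentence number sentNum from each file and annotate all but the first."""
    for idx, file in enumerate(filenames):
        # basename() strips the directory for both the base and big prediction files
        name = basename(file).replace("_pred.txt", "")
        with open(file, "r", encoding="utf8") as f:
            chosen = f.readlines()[sentNum]
        print(name + " : " + chosen)
        # Index 0 is the English source, which is printed but not annotated
        if idx != 0:
            annotate_sentence(chosen)
        print("\n")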

In [98]:
get_sentences(base_sentence_files, 324) #Final output for opus-base - focus on "time period" vs entre-et ("between") and issues with tomodensitometrie vs tomographie

wmt22test.txt : All patients who underwent TAVI in the time period 2007-2017 with preoperative computed tomography were included.



wmt22gold.txt : Tous les patients ayant subi une IVAC avec tomodensitométrie préopératoire au cours de la période 2007-2017 ont été inclus.

Patients
[{'from': 10, 'to': 17, 'matchType': 'PREF', 'text': 'PATIENTS'}]
Remplacement valvulaire aortique par cathéter
[{'from': 34, 'to': 37, 'matchType': 'SYN', 'text': 'IVAC'}]
Tomodensitométrie
[{'from': 44, 'to': 60, 'matchType': 'PREF', 'text': 'TOMODENSITOMÉTRIE'}]
Cours
[{'from': 79, 'to': 83, 'matchType': 'PREF', 'text': 'COURS'}]
Période
[{'from': 91, 'to': 97, 'matchType': 'PREF', 'text': 'PÉRIODE'}]
inclus
[{'from': 117, 'to': 122, 'matchType': 'PREF', 'text': 'INCLUS'}]


opus-mt-en-fr : Tous les patients ayant subi un TAVI au cours de la période 2007-2017 avec tomographie préopératoire ont été inclus.

Patients
[{'from': 10, 'to': 17, 'matchType': 'PREF', 'text': 'PATIENTS'}]
TAVI
[{'from': 33, 'to': 

In [102]:
get_sentences(big_sentence_files, 284) #Final output for opus-big - focus on mometasone furoate (sys error) and corticosurrenale issue. 

wmt22test.txt : The objective of this study was to determine if topical florfenicol/terbinafine/mometasone furoate causes adrenocortical suppression in healthy, small-breed dogs with bilateral OE at D28 postapplication.



wmt22gold.txt : L'objectif de cette étude était de déterminer si le florfénicol/terbinafine/furoate de mométasone topique provoque une suppression corticosurrénale chez des chiens sains de petite race présentant une OE bilatérale à J28 après l'application.

D01AE15 - terbinafine
[{'from': 65, 'to': 75, 'matchType': 'SYN', 'text': 'TERBINAFINE'}]
Furoate de mométasone
[{'from': 77, 'to': 97, 'matchType': 'PREF', 'text': 'FUROATE DE MOMÉTASONE'}]
topique
[{'from': 99, 'to': 105, 'matchType': 'PREF', 'text': 'TOPIQUE'}]
Cortex surrénal
[{'from': 132, 'to': 147, 'matchType': 'SYN', 'text': 'CORTICOSURRÉNALE'}]
Chiens
[{'from': 158, 'to': 163, 'matchType': 'PREF', 'text': 'CHIENS'}]
après
[{'from': 221, 'to': 225, 'matchType': 'PREF', 'text': 'APRÈS'}]
Attention
[{'from':