This notebook contains an adaptation of the filtering process so that you can work with it and examine how it behaves in more detail.

Some files need to be placed in the appropriate folders before running it, for example all the frequency files.
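
Before running the cells, it can help to confirm that the files are where the code expects them. The following is a minimal sketch, assuming the layout that the later cells read from (`freq/<lang>/<lang>-1gram.txt`, `freq/<lang>/<lang>-2gram.txt` and `extract/<lang>/occupational_therapy/terms.txt`); the helpers `expected_files` and `missing_files` are hypothetical and not part of the notebook.

```python
from pathlib import Path

LANGS = ("eng", "fre", "ger", "ita", "spa")


def expected_files(root="."):
    """List the files the notebook reads, relative to `root` (assumed layout)."""
    root = Path(root)
    paths = []
    for lang in LANGS:
        paths.append(root / "freq" / lang / f"{lang}-1gram.txt")
        paths.append(root / "freq" / lang / f"{lang}-2gram.txt")
        paths.append(root / "extract" / lang / "occupational_therapy" / "terms.txt")
    return paths


def missing_files(root="."):
    """Return the expected files that do not exist under `root`."""
    return [p for p in expected_files(root) if not p.is_file()]


if __name__ == "__main__":
    # Report anything that still has to be copied or unpacked
    for path in missing_files():
        print("Missing:", path)
```

An empty result means every frequency list and term file was found.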


In [1]:
!pip install spacy lingua-language-detector




In [2]:
!tar -xvf freq.tar.xz
!tar -xvf extract.tar.xz

x freq/fre/
x freq/fre/fre-1gram.txt
x freq/spa/spa-1gram.txt
x freq/ger/
x freq/spa/
x freq/ita/ita-2gram.txt
x freq/ita/ita-1gram.txt
x freq/ger/ger-2gram.txt
x freq/ger/ger-1gram.txt
x freq/eng/
x freq/
x freq/eng/eng-2gram.txt
x freq/ita/
x freq/eng/eng-1gram.txt
x freq/fre/fre-2gram.txt
x freq/spa/spa-2gram.txt
x extract/eng/occupational_therapy/
x extract/spa/occupational_therapy/
x extract/fre/occupational_therapy/
x extract/spa/
x extract/spa/occupational_therapy/terms.txt
x extract/fre/occupational_therapy/terms.txt
x extract/ger/occupational_therapy/terms.txt
x extract/eng/
x extract/ger/occupational_therapy/
x extract/ita/
x extract/eng/occupational_therapy/terms.txt
x extract/ita/occupational_therapy/terms.txt
x extract/fre/
x extract/ita/occupational_therapy/
x extract/ger/
x extract/


In [3]:
!spacy download en_core_web_sm
!spacy download de_core_news_sm
!spacy download fr_core_news_sm
!spacy download it_core_news_sm
!spacy download es_core_news_sm

Collecting en-core-web-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl (12.8 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting de-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.7.0/de_core_news_sm-3.7.0-py3-none-any.whl (14.6 MB)
[+] Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')
Collecting fr-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.7.0/fr_core_news_sm-3.7.0-py3-none-any.whl (16.3 MB)
[+] Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')
Collecting it-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/it_core_

In [4]:
import time
import traceback
import uuid
import os
import spacy
import json

import lingua

# spaCy pipeline names per language; null means no pipeline is configured

spacy_models = json.loads('''{
	"bul": {"sm": null, "lg": null},
	"hrv": {"sm": "hr_core_news_sm", "lg": "hr_core_news_lg"},
	"cze": {"sm": null, "lg": null},
	"dan": {"sm": "da_core_news_sm", "lg": "da_core_news_trf"},
	"dut": {"sm": "nl_core_news_sm", "lg": "nl_core_news_lg"},
	"eng": {"sm": "en_core_web_sm", "lg": "en_core_web_trf"},
	"est": {"sm": null, "lg": null},
	"fin": {"sm": "fi_core_news_sm", "lg": "fi_core_news_lg"},
	"fre": {"sm": "fr_core_news_sm", "lg": "fr_dep_news_trf"},
	"ger": {"sm": "de_core_news_sm", "lg": "de_dep_news_trf"},
	"gre": {"sm": "el_core_news_sm", "lg": "el_core_news_lg"},
	"hun": {"sm": null, "lg": null},
	"gle": {"sm": null, "lg": null},
	"ita": {"sm": "it_core_news_sm", "lg": "it_core_news_lg"},
	"lav": {"sm": null, "lg": null},
	"lit": {"sm": "lt_core_news_sm", "lg": "lt_core_news_lg"},
	"mlt": {"sm": null, "lg": null},
	"pol": {"sm": "pl_core_news_sm", "lg": "pl_core_news_lg"},
	"por": {"sm": "pt_core_news_sm", "lg": "pt_core_news_lg"},
	"rum": {"sm": "ro_core_news_sm", "lg": "ro_core_news_lg"},
	"rus": {"sm": "ru_core_news_sm", "lg": "ru_core_news_lg"},
	"slo": {"sm": null, "lg": null},
	"slv": {"sm": null, "lg": null},
	"spa": {"sm": "es_core_news_sm", "lg": "es_dep_news_trf"},
	"swe": {"sm": null, "lg": null}
}''')



# Load the 1-gram and 2-gram frequency lists for each supported language.
# Each file is read as one entry per line.
freq_list = {}

for lang_code in ("spa", "ger", "fre", "eng", "ita"):
    freq_list[lang_code] = {}
    for n in (1, 2):
        path = f"./freq/{lang_code}/{lang_code}-{n}gram.txt"
        with open(path, encoding="utf8") as f:
            freq_list[lang_code][f"{n}-gram"] = f.read().split("\n")

lingua_langs = {
		"eng": lingua.Language.ENGLISH,
		"spa": lingua.Language.SPANISH,
		"ita": lingua.Language.ITALIAN,
		"ger": lingua.Language.GERMAN,
		"fre": lingua.Language.FRENCH
	}

langs_used = list(lingua_langs.values())

lang_detector = lingua.LanguageDetectorBuilder.from_languages(*langs_used).build()

In [70]:
# Gold standard: the reviewed reference glossary for each language
gold_data_eng = json.loads(open("gold_occupational_therapy_eng.json").read())
gold_data_fre = json.loads(open("gold_occupational_therapy_fre.json").read())
gold_data_spa = json.loads(open("gold_occupational_therapy_spa.json").read())
gold_data_ger = json.loads(open("gold_occupational_therapy_ger.json").read())

# Keep the text of the first form listed for each gold entry
gold_terms_eng = [term['forms']['eng'][0]['text'] for term in gold_data_eng]
gold_terms_fre = [term['forms']['fre'][0]['text'] for term in gold_data_fre]
gold_terms_spa = [term['forms']['spa'][0]['text'] for term in gold_data_spa]
gold_terms_ger = [term['forms']['ger'][0]['text'] for term in gold_data_ger]

def MatrizConfusion(terminos_raw, idioma, paso_actual):
    import numpy as np
    array_raw = np.array(terminos_raw)
    if idioma == "eng":
        array_gold = np.array(gold_terms_eng)
    elif idioma == "fre":
        array_gold = np.array(gold_terms_fre)        
    elif idioma == "spa":    
        array_gold = np.array(gold_terms_spa)        
    elif idioma == "ger":          
        array_gold = np.array(gold_terms_ger)
    else:
        print("ERROR: Idioma no reconocido")
        
    # Initialise the TP, FP, TN and FN counters to 0
    TP = 0  # In both arrays
    TN = 0  # Not computed here: there is no list of rejected non-terms to count
    FP = 0  # In raw but not in gold
    FN = 0  # In gold but not in raw

    # Compare the values in the two arrays
    for extraido in array_raw:
        #print("extraido: ")
        #print(extraido)
        for valido in array_gold:
            #print(valido)
            if extraido == valido:
                TP += 1

    FP = len(array_raw) - TP

    FN = len(array_gold) - TP

    # Build the confusion matrix
    matriz_confusion = [[TP, FP], [TN, FN]]
    
    print("Estamos en el paso: " + paso_actual)
    print("Idioma: " + idioma)
    print("Matriz de Confusión:")
    print("|TP|FP|")
    print("|TN|FN|\n")

    for fila in matriz_confusion:
        print(fila)

    # Precision: percentage of extracted terms that are in the gold list
    P=TP/(FP+TP)*100

    # Recall (sensitivity): percentage of gold terms that were extracted
    S=TP/(TP+FN)*100

    print("\nPrecisión (%):")
    print(P)

    print("\nSensibilidad (%):")
    print(S)
    
    print("\n----------------------------------------------------------------------------\n")
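
The counts above can also be derived with set operations, which avoids the nested loop and makes the definitions explicit: TP is the intersection, FP is what was extracted but is not in the gold list, and FN is what is in the gold list but was not extracted. The following is a minimal sketch, not the notebook's implementation. `precision_recall` is a hypothetical helper and the sample terms are invented. Unlike the loop above, it ignores duplicate entries and returns zeros instead of dividing by zero when a list is empty.

```python
def precision_recall(extracted, gold):
    """Return (TP, FP, FN, precision %, recall %) for two term collections."""
    extracted_set = set(extracted)
    gold_set = set(gold)

    tp = len(extracted_set & gold_set)   # in both lists
    fp = len(extracted_set - gold_set)   # extracted but not in gold
    fn = len(gold_set - extracted_set)   # in gold but not extracted

    precision = 100 * tp / (tp + fp) if (tp + fp) else 0.0
    recall = 100 * tp / (tp + fn) if (tp + fn) else 0.0
    return tp, fp, fn, precision, recall


# Toy data, invented for illustration only
extracted = ["hand splint", "motor skills", "patient", "therapy session"]
gold = ["hand splint", "motor skills", "sensory integration"]

tp, fp, fn, precision, recall = precision_recall(extracted, gold)
print(tp, fp, fn)                    # 2 2 1
print(precision, round(recall, 2))   # 50.0 66.67
```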

In [68]:
def filter_terms(lines, lang):

	terms = {}

	filter_deep_1g = 50000
	filter_deep_2g = 1000000

	dict_1g = {}
	dict_2g = {}
    

	if (lang in freq_list) and ("1-gram" in freq_list[lang]):

		# filter_deep_1g caps how many entries of the frequency list are used:
		# only the first 50,000 lines count as "too frequent"
		lower_list = [t.lower() for t in freq_list[lang]["1-gram"][:filter_deep_1g]]

		dict_1g = dict(zip(lower_list, range(len(lower_list))))
        
	if (lang in freq_list) and ("2-gram" in freq_list[lang]):

		lower_list = [t.lower() for t in freq_list[lang]["2-gram"][:filter_deep_2g]]

		dict_2g = dict(zip(lower_list, range(len(lower_list))))
        
	for term in lines:

		freq, term = term.replace("\n", "").split("\t") # Each line is "frequency<TAB>term": strip the newline and split the two fields

		term = term.replace("-", " ").replace("  ", " ") # Replace hyphens with spaces and collapse double spaces

		if (lang in freq_list) and ("1-gram" in freq_list[lang]) and (term.lower() in dict_1g):

			#print("Excluding", term, "(too freq 1-gram)")
			pass

		elif (lang in freq_list) and ("2-gram" in freq_list[lang]) and (term.lower() in dict_2g):

			#print("Excluding", term, "(too freq 2-gram)")
			pass
			# Frequency-based filtering (step)


		elif any(len(word) < 4 for word in term.split(" ")):

			#print("Excluding", term, "(too short)")
			pass

		elif not term.replace(" ", "").replace("'", "").replace("-","").isalpha() or term.replace(" ", "").startswith("-") or term.replace(" ", "").endswith("-"):

			#print("Excluding", term, "(strange symbols)")
			pass
			# Last checks: terms with short words, and terms that contain non-alphabetic characters (step above)

		else:

			#print("Adding", term)
			terms[term] = {"f": freq}

	# STEP 1 ------------------------------------------------------------
	# Compute the confusion matrix (Abraham)
	terms_key = list(terms.keys())

	MatrizConfusion(terms_key, lang, "Paso de filtrado 1")
	#-------------------------------------------------------------------            

	# Capitalisation differences are resolved by keeping the more frequent variant

	for term, obj in terms.copy().items():

		if term.lower() != term and term.lower() in terms:

			if int(terms[term.lower()]["f"]) > int(terms[term]["f"]):

				terms.pop(term)

				#print("Excluding", term, "(duplicated and less frequent capitalization)")

			else:

				terms.pop(term.lower())

				#print("Excluding", term.lower(), "(duplicated and less frequent capitalization)")
				# Terms that appear both capitalised and in lowercase (another step)

	# STEP 2 ------------------------------------------------------------
	# Compute the confusion matrix (Abraham)
	terms_key = list(terms.keys())

	MatrizConfusion(terms_key, lang, "Paso de filtrado 2")
	#-------------------------------------------------------------------    

	# Entity labels that are allowed to remain; any other label removes the term
	valid_NE = ["EVENT", "FAC", "ORG", "WORK_OF_ART"]

	pipe = spacy.load(spacy_models[lang]["sm"])

	for term, obj in terms.copy().items():

		doc = pipe(term)

		for token in doc.ents:

			#print("Found NE: ", token.text, token.label_)

			if not (token.label_ in valid_NE) and term in terms:

				terms.pop(term)
				# Drop terms that contain named entities outside the allowed labels (step)
	# STEP 3 ------------------------------------------------------------
	# Compute the confusion matrix (Abraham)
	terms_key = list(terms.keys())

	MatrizConfusion(terms_key, lang, "Paso de filtrado 3")
	#-------------------------------------------------------------------    

	# Drop terms whose detected language is not the expected one
	for term, obj in terms.copy().items():

		detected = lang_detector.detect_language_of(term)

		#print(term, detected)

		if lang in lingua_langs and detected != lingua_langs[lang]:

			terms.pop(term)

	# STEP 4 ------------------------------------------------------------
	# Compute the confusion matrix (Abraham)
	terms_key = list(terms.keys())

	MatrizConfusion(terms_key, lang, "Paso de filtrado 4")
	#-------------------------------------------------------------------    

	return terms
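
The first filtering step can be read as a chain of independent tests on each candidate. The following is a simplified sketch of that idea, not a replacement for `filter_terms`. The frequency lists are tiny invented samples, and `rejection_reason` is a hypothetical helper. Sets are used for the lookups where the notebook builds dictionaries, since membership tests behave the same way.

```python
def rejection_reason(term, common_1g, common_2g, min_word_len=4):
    """Return why a candidate is dropped, or None if it is kept."""
    term = term.replace("-", " ").replace("  ", " ")  # hyphens become spaces
    lowered = term.lower()

    if lowered in common_1g:
        return "too frequent 1-gram"
    if lowered in common_2g:
        return "too frequent 2-gram"
    if any(len(word) < min_word_len for word in term.split(" ")):
        return "too short"
    if not term.replace(" ", "").replace("'", "").isalpha():
        return "strange symbols"
    return None


# Invented samples standing in for the top of the real frequency lists
common_1g = {"school", "children", "results"}
common_2g = {"high school"}

for candidate in ["School", "high-school", "DCD", "p<0.01", "executive functions"]:
    print(candidate, "->", rejection_reason(candidate, common_1g, common_2g))
```

Returning the reason rather than a boolean makes it easy to log why each candidate was excluded, much like the commented-out `print` calls in the cell above.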

In [77]:
def lemmatize_terms(terms, lang):

	lemmatized_terms = {}

	pipe = spacy.load(spacy_models[lang]["sm"])

	term_list_old = list(terms.keys())

	for term in term_list_old:

		doc = pipe(term)

		full_token = []

		for token in doc:

			full_token.append(token.lemma_)

		lemma = " ".join(full_token)

		if term in terms:

			old_f = terms[term]

			if lemma in lemmatized_terms:

				current_f = lemmatized_terms[lemma]

				new_f = int(current_f["f"]) + int(old_f["f"]) # Several word forms converge on one lemma; the frequencies are read as strings, so convert before adding

				lemmatized_terms[lemma] = {"f": new_f}

			else:

				lemmatized_terms[lemma] = {"f": old_f["f"]}

	# The frequency filter is applied again after lemmatisation; this would be
	# better done another way, since the code is duplicated

	filter_deep_1g = 25000
	filter_deep_2g = 1000000

	dict_1g = {}
	dict_2g = {}

	if (lang in freq_list) and ("1-gram" in freq_list[lang]):

		lower_list = [t.lower() for t in freq_list[lang]["1-gram"][:filter_deep_1g]]

		dict_1g = dict(zip(lower_list, range(len(lower_list))))

	if (lang in freq_list) and ("2-gram" in freq_list[lang]):

		lower_list = [t.lower() for t in freq_list[lang]["2-gram"][:filter_deep_2g]]

		dict_2g = dict(zip(lower_list, range(len(lower_list))))


	for term in lemmatized_terms.copy().keys():

		# Only applied to "eng", because lemmatisation in the other languages also
		# changes other inflections, which is to be avoided.

		if lang == "eng":

			if (term.lower() in dict_1g):

				#print("Excluding", term, "(too freq 1-gram) lemma")

				lemmatized_terms.pop(term)

			elif (term.lower() in dict_2g):

				#print("Excluding", term, "(too freq 2-gram) lemma")

				lemmatized_terms.pop(term)

	# STEP 5 ------------------------------------------------------------
	# Compute the confusion matrix (Abraham)
	terms_key = list(lemmatized_terms.keys())

	MatrizConfusion(terms_key, lang, "Paso de filtrado 5 (POST-LEMATIZACIÓN)")
#-------------------------------------------------------------------    

	return lemmatized_terms
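
When several surface forms collapse to the same lemma, their frequencies should be added as numbers. Because the frequencies are read from a text file they arrive as strings, and `+` on strings concatenates (`"12" + "3"` gives `"123"`), so they need converting first. The following is a minimal sketch of the merge, assuming a toy lemma table in place of the spaCy pipeline. `TOY_LEMMAS` and `merge_by_lemma` are hypothetical names and the sample data is invented.

```python
from collections import Counter

# Hypothetical stand-in for the lemmatiser: surface form -> lemma
TOY_LEMMAS = {
    "motor skills": "motor skill",
    "motor skill": "motor skill",
    "splints": "splint",
}


def merge_by_lemma(terms, lemmas):
    """Sum the frequencies of all terms that share a lemma."""
    merged = Counter()
    for term, info in terms.items():
        lemma = lemmas.get(term, term)   # fall back to the term itself
        merged[lemma] += int(info["f"])  # frequencies may arrive as strings
    return {lemma: {"f": total} for lemma, total in merged.items()}


terms = {
    "motor skills": {"f": "12"},
    "motor skill": {"f": "3"},
    "splints": {"f": "5"},
}
print(merge_by_lemma(terms, TOY_LEMMAS))
# {'motor skill': {'f': 15}, 'splint': {'f': 5}}
```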



In [78]:
# English
example = open("./extract/eng/occupational_therapy/terms.txt", encoding="utf8").readlines()

terms = filter_terms(example, "eng")

#print(terms)

terms = lemmatize_terms(terms, "eng")

#print(terms)

Estamos en el paso: Paso de filtrado 1
Idioma: eng
Matriz de Confusión:
|TP|FP|
|TN|FN|

[401, 1295]
[0, 348]

Precisión (%):
23.6438679245283

Sensibilidad (%):
53.538050734312414

----------------------------------------------------------------------------

Estamos en el paso: Paso de filtrado 2
Idioma: eng
Matriz de Confusión:
|TP|FP|
|TN|FN|

[393, 1241]
[0, 356]

Precisión (%):
24.05140758873929

Sensibilidad (%):
52.46995994659546

----------------------------------------------------------------------------

Estamos en el paso: Paso de filtrado 3
Idioma: eng
Matriz de Confusión:
|TP|FP|
|TN|FN|

[383, 1122]
[0, 366]

Precisión (%):
25.448504983388702

Sensibilidad (%):
51.134846461949266

----------------------------------------------------------------------------

Estamos en el paso: Paso de filtrado 4
Idioma: eng
Matriz de Confusión:
|TP|FP|
|TN|FN|

[344, 919]
[0, 405]

Precisión (%):
27.23673792557403

Sensibilidad (%):
45.92790387182911

-------------------------------------
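
Across the four steps, precision rises while recall falls, so a single combined figure helps judge whether a step pays off. One common choice is the F1 score, the harmonic mean of precision and recall, which can be written as `2*TP / (2*TP + FP + FN)`. The sketch below applies it to the English counts printed above. F1 is not computed by the notebook itself, and `f1_score` is a hypothetical helper.

```python
def f1_score(tp, fp, fn):
    """F1 as a percentage, computed directly from the counts."""
    denominator = 2 * tp + fp + fn
    return 100 * 2 * tp / denominator if denominator else 0.0


# (TP, FP, FN) for English, copied from the output above
english_steps = {
    "Step 1": (401, 1295, 348),
    "Step 2": (393, 1241, 356),
    "Step 3": (383, 1122, 366),
    "Step 4": (344, 919, 405),
}

for step, (tp, fp, fn) in english_steps.items():
    print(step, round(f1_score(tp, fp, fn), 2))
```

On these counts F1 rises only slightly between step 1 and step 4, which suggests the gain in precision is largely offset by the loss in recall.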

In [79]:
# French
example = open("./extract/fre/occupational_therapy/terms.txt", encoding="utf8").readlines()

terms = filter_terms(example, "fre")

#print(terms)

terms = lemmatize_terms(terms, "fre")

#print(terms)

Estamos en el paso: Paso de filtrado 1
Idioma: fre
Matriz de Confusión:
|TP|FP|
|TN|FN|

[92, 1577]
[0, 51]

Precisión (%):
5.512282804074296

Sensibilidad (%):
64.33566433566433

----------------------------------------------------------------------------

Estamos en el paso: Paso de filtrado 2
Idioma: fre
Matriz de Confusión:
|TP|FP|
|TN|FN|

[91, 1529]
[0, 52]

Precisión (%):
5.617283950617284

Sensibilidad (%):
63.63636363636363

----------------------------------------------------------------------------

Estamos en el paso: Paso de filtrado 3
Idioma: fre
Matriz de Confusión:
|TP|FP|
|TN|FN|

[91, 1164]
[0, 52]

Precisión (%):
7.250996015936255

Sensibilidad (%):
63.63636363636363

----------------------------------------------------------------------------

Estamos en el paso: Paso de filtrado 4
Idioma: fre
Matriz de Confusión:
|TP|FP|
|TN|FN|

[90, 925]
[0, 53]

Precisión (%):
8.866995073891626

Sensibilidad (%):
62.93706293706294

-----------------------------------------------

In [80]:
# German
example = open("./extract/ger/occupational_therapy/terms.txt", encoding="utf8").readlines()

terms = filter_terms(example, "ger")

#print(terms)

terms = lemmatize_terms(terms, "ger")

#print(terms)

Estamos en el paso: Paso de filtrado 1
Idioma: ger
Matriz de Confusión:
|TP|FP|
|TN|FN|

[134, 1898]
[0, 9]

Precisión (%):
6.594488188976378

Sensibilidad (%):
93.7062937062937

----------------------------------------------------------------------------

Estamos en el paso: Paso de filtrado 2
Idioma: ger
Matriz de Confusión:
|TP|FP|
|TN|FN|

[134, 1893]
[0, 9]

Precisión (%):
6.610754810064135

Sensibilidad (%):
93.7062937062937

----------------------------------------------------------------------------

Estamos en el paso: Paso de filtrado 3
Idioma: ger
Matriz de Confusión:
|TP|FP|
|TN|FN|

[134, 1208]
[0, 9]

Precisión (%):
9.985096870342772

Sensibilidad (%):
93.7062937062937

----------------------------------------------------------------------------

Estamos en el paso: Paso de filtrado 4
Idioma: ger
Matriz de Confusión:
|TP|FP|
|TN|FN|

[134, 1076]
[0, 9]

Precisión (%):
11.074380165289256

Sensibilidad (%):
93.7062937062937

-------------------------------------------------

In [81]:
# Spanish
example = open("./extract/spa/occupational_therapy/terms.txt", encoding="utf8").readlines()

terms = filter_terms(example, "spa")

#print(terms)

terms = lemmatize_terms(terms, "spa")

#print(terms)

Estamos en el paso: Paso de filtrado 1
Idioma: spa
Matriz de Confusión:
|TP|FP|
|TN|FN|

[88, 2765]
[0, 29]

Precisión (%):
3.0844724851034

Sensibilidad (%):
75.21367521367522

----------------------------------------------------------------------------

Estamos en el paso: Paso de filtrado 2
Idioma: spa
Matriz de Confusión:
|TP|FP|
|TN|FN|

[82, 2674]
[0, 35]

Precisión (%):
2.9753265602322205

Sensibilidad (%):
70.08547008547008

----------------------------------------------------------------------------

Estamos en el paso: Paso de filtrado 3
Idioma: spa
Matriz de Confusión:
|TP|FP|
|TN|FN|

[82, 1469]
[0, 35]

Precisión (%):
5.286911669890394

Sensibilidad (%):
70.08547008547008

----------------------------------------------------------------------------

Estamos en el paso: Paso de filtrado 4
Idioma: spa
Matriz de Confusión:
|TP|FP|
|TN|FN|

[81, 1144]
[0, 36]

Precisión (%):
6.612244897959184

Sensibilidad (%):
69.23076923076923

-----------------------------------------------

In [82]:
# Italian
example = open("./extract/ita/occupational_therapy/terms.txt", encoding="utf8").readlines()

terms = filter_terms(example, "ita")

#print(terms)

terms = lemmatize_terms(terms, "ita")

#print(terms)

ERROR: Idioma no reconocido


UnboundLocalError: local variable 'array_gold' referenced before assignment
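
The error above occurs because there is no gold glossary for Italian: `MatrizConfusion` prints its message in the `else` branch but then carries on and reads `array_gold`, which was never assigned. The following is a minimal sketch of one way to guard against this, assuming the gold lists are kept in a dictionary keyed by language code. `GOLD_TERMS`, `get_gold_terms` and `evaluate` are hypothetical names, and the sample terms are invented.

```python
# Hypothetical registry; in the notebook these would be the gold_terms_* lists
GOLD_TERMS = {
    "eng": ["occupational therapy"],
    "spa": ["terapia ocupacional"],
}


def get_gold_terms(lang):
    """Return the gold list for `lang`, or None when no glossary exists."""
    return GOLD_TERMS.get(lang)


def evaluate(extracted, lang):
    """Count matches against the gold list, skipping unsupported languages."""
    gold = get_gold_terms(lang)
    if gold is None:
        print(f"No gold glossary for '{lang}': evaluation skipped")
        return None
    return len(set(extracted) & set(gold))


print(evaluate(["occupational therapy", "patient"], "eng"))  # 1
print(evaluate(["motricità"], "ita"))  # prints the notice, then None
```

Returning early keeps the filtering itself usable for languages that have frequency lists but no reviewed glossary yet.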

In [16]:
print(filter_terms(example, "ita"))

Excluding classe (too freq 1-gram)
Excluding bambini (too freq 1-gram)
Excluding $ (too short)
Excluding risultati (too freq 1-gram)
Excluding attività (too freq 1-gram)
Excluding DCD (too short)
Excluding sviluppo (too freq 1-gram)
Excluding funzioni (too freq 1-gram)
Excluding evoluzione (too freq 1-gram)
Excluding allievi (too freq 1-gram)
Excluding classi (too freq 1-gram)
Adding motricità
Excluding difficoltà (too freq 1-gram)
Adding funzioni esecutive
Excluding scrittura (too freq 1-gram)
Excluding categorie (too freq 1-gram)
Excluding età (too short)
Excluding controllo (too freq 1-gram)
Excluding p<0.01 (strange symbols)
Excluding abilità (too freq 1-gram)
Excluding attenzione (too freq 1-gram)
Excluding disturbi (too freq 1-gram)
Excluding valutazione (too freq 1-gram)
Excluding interno (too freq 1-gram)
Excluding M. (too short)
Excluding scuola (too freq 1-gram)
Excluding J. (too short)
Excluding problemi (too freq 1-gram)
Excluding motorie (too freq 1-gram)
Adding Lietta San

In [17]:
print(lemmatize_terms(example, "ita"))

AttributeError: 'list' object has no attribute 'keys'
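
This second error has a different cause: `lemmatize_terms` calls `terms.keys()`, so it expects the dictionary returned by `filter_terms`, not the raw list of lines read from `terms.txt`. The sketch below shows the two shapes side by side. `lines_to_terms` is a hypothetical helper that only parses the `frequency<TAB>term` lines and applies none of the filters, and the sample lines are invented.

```python
def lines_to_terms(lines):
    """Parse 'frequency<TAB>term' lines into {term: {"f": frequency}}."""
    terms = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        freq, term = line.split("\t")
        terms[term] = {"f": int(freq)}
    return terms


raw_lines = ["7\tmotricità\n", "3\tfunzioni esecutive\n"]

print(hasattr(raw_lines, "keys"))   # False: a list cannot be passed directly
terms = lines_to_terms(raw_lines)
print(list(terms.keys()))           # ['motricità', 'funzioni esecutive']
```

In the notebook itself, passing the result of `filter_terms` to `lemmatize_terms`, as the earlier cells do, gives the function the shape it expects.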