**Universidad Internacional de La Rioja (UNIR) - Master's Degree in Artificial Intelligence - Natural Language Processing** 

***
Student (Carles Serra Vendrell):

Date:
***

<span style="font-size: 20pt; font-weight: bold; color: #0098cd;">Assignment: Part-of-speech tagging</span>

**Objectives** 

The goal of this activity is for the student to apply a method based on hidden Markov models (HMM) to perform the part-of-speech tagging of a sentence.

**Description**

In this activity you must implement in Python a part-of-speech tagger based on hidden Markov models (HMM), together with the Viterbi algorithm, and use them to tag the sentence:

> *Habla con el enfermo grave de trasplantes.*

<span style="font-size: 14pt; font-weight: bold; color: #0098cd;">Part 1: Build the part-of-speech tagger</span>

In this first part of the activity you have to implement in Python a part-of-speech tagger based on a bigram HMM, trained on a tagged corpus.

***
To do so, you must use the corpus mia07_t3_tra_Corpus-tagged, which is available in the virtual classroom.
***

The corpus consists of Spanish sentences annotated with part-of-speech information (grammatical categories or POS tags). These tagged sentences were extracted from some of the documents in Wikicorpus, a trilingual corpus (Spanish, Catalan, and English) of more than 750 million words. Wikicorpus was created by researchers at the Universitat Politècnica de Catalunya from Wikipedia documents that were annotated with the open-source library FreeLing.

Table 1 shows, as untagged plain text, some example sentences from the corpus, together with the identifier of the document each sentence was extracted from.

The annotated version is the corpus provided for this activity. The text file uses the same format as Wikicorpus: each Wikipedia document is delimited by the XML tag `<doc>`, which carries the document identifier (`id`).

In addition, the sentences in a document are separated by a blank line, and each word of a sentence occupies its own line in the file. Each of those lines gives the token for the word itself, followed by its lemma, its grammatical tag (POS tag), and its sense.

Figure 1 shows an excerpt of the annotated corpus with the sentence "Tristana es una película del director español nacionalizado mexicano Luis Buñuel." It belongs to the Wikicorpus document with identifier 27315, titled Tristana.

Looking at the annotations for the word "es", its lemma is "ser", its grammatical category is the one identified by the tag `VSIP3S0`, and its sense is the one identified by the code `01775973175`.

The word "del" is represented on two lines and annotated as two tokens, first "de" and then "el", because "del" is the contraction of the preposition "de" and the article "el". In contrast, the proper noun "Luis Buñuel", which consists of two words (the first name "Luis" and the surname "Buñuel"), is annotated as the single token `Luis_Buñuel`. The full stop at the end of the sentence is also annotated as a token, `.`.

Although the annotated corpus provides more information (see figure 1), this activity only needs the token and the grammatical tag (POS tag) of each word, that is, the first and third strings of each line that represents a word in the annotated corpus.
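
As an illustration of the column layout just described, the following minimal sketch extracts the token and the POS tag from one corpus line. The helper name `parse_corpus_line` is hypothetical, and the sketch assumes that the columns are separated by whitespace, in the order token, lemma, POS tag, sense.

```python
def parse_corpus_line(line: str):
    """Return (token, pos_tag) for one annotated corpus line, or None.

    Assumption: whitespace-separated columns in the order
    token, lemma, POS tag, sense.
    """
    fields = line.split()
    # Blank lines separate sentences; '<doc ...>' and '</doc>' delimit documents.
    if len(fields) < 3 or fields[0] in ("<doc", "</doc>"):
        return None
    # First column is the token, third column is the POS tag.
    return fields[0], fields[2]


# Annotation of the word "es" as described in the text above.
print(parse_corpus_line("es ser VSIP3S0 01775973175"))  # ('es', 'VSIP3S0')
```

Lines that do not describe a word (blank lines and document delimiters) yield `None`, so the caller can use them as sentence boundaries.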

The grammatical tags (POS tags) used to annotate the morphosyntactic information in the corpus are those defined in FreeLing. They are based on EAGLES, a recommendation for annotating most European languages. The definition of the tag set that FreeLing uses to tag a Spanish corpus can be consulted on the web.

***
Access the resource through the virtual classroom or at the following address:
https://freeling-user-manual.readthedocs.io/en/v4.1/tagsets/tagset-es/
***

The EAGLES tags used by FreeLing have variable length, and each character corresponds to one morphosyntactic feature. The first character of a tag is always the grammatical category or part of speech. That category determines the length of the tag and how each of the remaining characters is interpreted.

Table 2 shows the definition of the tag for the grammatical category "verb". The tag `VSIP3S0`, assigned to the word "es" in the sentence presented earlier, is therefore read as follows: a verb (V) of semiauxiliary type (S), in the indicative mood (I) and present tense (P), third person (3), singular number (S). The final character `0` indicates that this verb form has no gender.
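
The positional reading of a verb tag can be sketched in a few lines. The field names below are taken from the interpretation of `VSIP3S0` given above. The tuple `VERB_FIELDS` and the helper `split_verb_tag` are hypothetical names used only for illustration, and the sketch covers verb tags only.

```python
# Positional features of a verb tag, in the order described in the text.
VERB_FIELDS = ("category", "type", "mood", "tense", "person", "number", "gender")


def split_verb_tag(tag: str) -> dict:
    """Map each character of a 7-character verb tag to its feature name."""
    if len(tag) != len(VERB_FIELDS) or not tag.startswith("V"):
        raise ValueError(f"not a 7-character verb tag: {tag!r}")
    return dict(zip(VERB_FIELDS, tag))


print(split_verb_tag("VSIP3S0"))
# {'category': 'V', 'type': 'S', 'mood': 'I', 'tense': 'P',
#  'person': '3', 'number': 'S', 'gender': '0'}
```

Other categories have different lengths and feature orders, so each would need its own field tuple.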

The activity must use the tags with which the corpus is annotated, in EAGLES format, for example `VSIP3S0`.

**Important:** If other tags are used, the activity will be considered incorrect and graded with zero points.

To build the part-of-speech tagger from the tagged training corpus, follow these steps:

* Load the corpus and extract the first and third columns of each record.
* Compute the probabilities that govern the bigram HMM, that is:
    - Compute the emission probabilities of the HMM from the tagged corpus.
    - Compute the transition probabilities of the HMM from the tagged corpus.
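
The steps above can be sketched on a toy corpus before tackling the real one. This is a minimal illustration, not the required implementation: `toy_corpus` is made up (its tokens and tags do appear in the course corpus), the names `q0` and `qF` for the initial and final states are assumed, and probabilities are plain relative frequencies without smoothing.

```python
from collections import Counter

# Toy corpus (hypothetical): each sentence is a list of (token, tag) pairs.
toy_corpus = [
    [("el", "DA0MS0"), ("programa", "NCMS000"), (".", "Fp")],
    [("el", "DA0MS0"), ("hospital", "NCMS000"), (".", "Fp")],
]

tag_count = Counter()         # C(t)
transition_count = Counter()  # C(t_prev, t)
emission_count = Counter()    # C(t, w)

for sentence in toy_corpus:
    previous = "q0"           # every sentence starts in the initial state
    tag_count["q0"] += 1
    for token, tag in sentence:
        tag_count[tag] += 1
        transition_count[(previous, tag)] += 1
        emission_count[(tag, token)] += 1
        previous = tag
    transition_count[(previous, "qF")] += 1  # last tag -> final state


def p_transition(prev_tag: str, tag: str) -> float:
    """Relative-frequency estimate C(prev_tag, tag) / C(prev_tag)."""
    return transition_count[(prev_tag, tag)] / tag_count[prev_tag]


def p_emission(tag: str, token: str) -> float:
    """Relative-frequency estimate C(tag, token) / C(tag)."""
    return emission_count[(tag, token)] / tag_count[tag]


print(p_transition("DA0MS0", "NCMS000"))  # 1.0
print(p_emission("NCMS000", "programa"))  # 0.5
```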

**Note:** Include in your submission the table (saved as a Microsoft Excel spreadsheet (.xlsx) or equivalent) with the emission and transition probabilities, computed for all the tags and tokens (words) that appear in the corpus.



### Load the corpus and extract the first and third columns of each record

First, the corpus is loaded by reading the file and retrieving the _first_ and _third_ columns of each record, which contain the _token_ of the word and its _tag_, respectively.

These values are stored in objects of the class `Palabra`.

The class gives easy access to the `Token()` and the `Tag()` of each record.

In [1]:
class Palabra:
    '''
    Stores the token and the tag of one word of a corpus
    '''

    def __init__(self, token: str, tag: str):
        '''
        Class constructor

        token : str
            Token of the word

        tag : str
            Tag of the word
        '''
        self._token = token
        self._tag = tag

    def Token(self):
        '''
        Returns the token of the word
        '''
        return self._token

    def Tag(self):
        '''
        Returns the tag of the word
        '''
        return self._tag


The corpus is stored as a list of lists of `Palabra` objects, where each inner list holds one sentence.

In [2]:
corpus = list()
oracion_actual = list()

# encoding='utf-8' is an assumption about the corpus file; adjust it if needed
with open('mia07_t3_tra_Corpus-tagged.txt', 'r', encoding='utf-8') as archivo:
    for entrada in archivo:
        entrada = entrada.split()
        if len(entrada) == 0 or entrada[0] in ('<doc', '</doc>'):
            # A blank line or a document boundary closes the sentence in progress
            if len(oracion_actual) > 0:
                corpus.append(oracion_actual)
            oracion_actual = list()
            continue

        # First column: token. Third column: POS tag.
        oracion_actual.append(Palabra(token=entrada[0], tag=entrada[2]))

# Keep the last sentence if the file does not end with a blank line
if len(oracion_actual) > 0:
    corpus.append(oracion_actual)

In [3]:
corpus

 <__main__.Palabra at 0x1074bfbb0>,
  <__main__.Palabra at 0x1074bfc10>,
  <__main__.Palabra at 0x1074bfc70>,
  <__main__.Palabra at 0x1074bfcd0>,
  <__main__.Palabra at 0x1074bfd30>,
  <__main__.Palabra at 0x1074bfd90>,
  <__main__.Palabra at 0x1074bfdf0>,
  <__main__.Palabra at 0x1074bfe50>,
  <__main__.Palabra at 0x1074bfeb0>,
  <__main__.Palabra at 0x1074bff10>,
  <__main__.Palabra at 0x1074bff70>,
  <__main__.Palabra at 0x1074bffd0>,
  <__main__.Palabra at 0x1074c2070>,
  <__main__.Palabra at 0x1074c20d0>,
  <__main__.Palabra at 0x1074c2130>,
  <__main__.Palabra at 0x1074c2190>,
  <__main__.Palabra at 0x1074c21f0>,
  <__main__.Palabra at 0x1074c2250>],
 [<__main__.Palabra at 0x1074c22b0>,
  <__main__.Palabra at 0x1074c2310>,
  <__main__.Palabra at 0x1074c2370>,
  <__main__.Palabra at 0x1074c23d0>,
  <__main__.Palabra at 0x1074c2430>,
  <__main__.Palabra at 0x1074c2490>,
  <__main__.Palabra at 0x1074c24f0>,
  <__main__.Palabra at 0x1074c2550>,
  <__main__.Palabra at 0x1074c25b0>,
 

The following code prints the corpus:

In [4]:
for oracion in corpus:
    for palabra in oracion:
        print(palabra.Token(), palabra.Tag())

0
; Fx
en SPS00
este DD0MS0
tiempo NCMS000
Denny NP00000
le PP3CSD00
propone VMIP3S0
matrimonio NCMS000
a SPS00
Izzie NP00000
. Fp
a SPS00
el DA0MS0
saber VMN0000
el DA0MS0
consejo NCMS000
de SPS00
el DA0MS0
programa NCMS000
de SPS00
trasplantes NCMP000
que PR0CN000
Denny NP00000
tuvo VMIS3S0
" Fe
un DI0MS0
ataque NCMS000
" Fe
( Fpa
eso PD0NS000
es VSIP3S0
lo DA0NS0
que PR0CN000
dice VMIP3S0
Izzie NP00000
) Fpt
le PP3CSD00
dan VMIP3P0
el DA0MS0
corazón NCMS000
a SPS00
Denny NP00000
y CC
lo PP3CNA00
mandan VMIP3P0
con SPS00
Burke NP00000
pero CC
llegando VMG0000
a SPS00
el DA0MS0
hospital NCMS000
hay VMIP3S0
un DI0MS0
tiroteo NCMS000
y CC
Burke NP00000
resulta VMIP3S0
herido VMP00SM
. Fp
Por SPS00
diversas AQ0FP0
razones NCFP000
Cristina NP00000
, Fc
Meredith NP00000
y CC
George NP00000
se P0000000
únen VMIP3P0
a SPS00
Izzie NP00000
para SPS00
cuidar VMN0000
a SPS00
Denny NP00000
; Fx
pero CC
Bailey NP00000
los PP3MPA00
descubre VMIP3S0
y CC
se P0000000
pone VMIP3S0
furiosa AQ0FS0
pero 

### Compute the probabilities that govern the bigram HMM

Once the `corpus` is correctly loaded, an object `hmmbigrama` of the class `HMMBigrama` is created.

`hmmbigrama` computes the tables of transition and emission probabilities.
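
For reference, both tables are relative-frequency estimates over the tagged corpus. Written with $C(\cdot)$ for corpus counts, $t_i$ for the tag at position $i$, and $w_i$ for the token at that position (notation chosen here to match the comments in the code that follows):

$$
P(t_i \mid t_{i-1}) = \frac{C(t_{i-1},\, t_i)}{C(t_{i-1})}
\qquad\qquad
P(w_i \mid t_i) = \frac{C(t_i,\, w_i)}{C(t_i)}
$$

For the initial state, $C(q_0)$ is the number of sentences in the corpus, since every sentence starts there.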

In [16]:
# pandas is used to build the tables
import pandas as pd

class HMMBigrama:
    '''
    Builds the probability matrices of a bigram HMM from a corpus
    '''

    def __init__(self, corpus: [[Palabra]]):
        '''
        Class constructor for the bigram hidden Markov model
        '''
        self._corpus = corpus
        self._estados = dict()
        self._tokens = dict()
        self._q0 = 'q0'
        self._qF = 'qF'

        self._prob_trans = pd.DataFrame()
        self._prob_obs = pd.DataFrame()

    def Corpus(self):
        return self._corpus.copy()

    def EstadoInicial(self):
        return self._q0

    def EstadoFinal(self):
        return self._qF

    def _ProcesarCorpus(self):
        '''
        Counts the number of occurrences of states and tokens
        '''
        for oracion in self._corpus:
            for palabra in oracion:

                # Count token occurrences
                if palabra.Token() not in self._tokens:
                    self._tokens[palabra.Token()] = 0
                self._tokens[palabra.Token()] += 1

                # Count state (tag) occurrences
                if palabra.Tag() not in self._estados:
                    self._estados[palabra.Tag()] = 0
                self._estados[palabra.Tag()] += 1


    def Estados(self, incluir_inicial: bool = False, incluir_final: bool = False):
        '''
        Returns the states of the bigram model for the corpus given to the constructor

        incluir_inicial : bool (False)
            Whether to include the initial state

        incluir_final : bool (False)
            Whether to include the final state

        return
            Dictionary of states with the number of occurrences of each state in the corpus
        '''

        if len(self._estados) == 0:
            self._ProcesarCorpus()

        copia_estados = dict()
        if incluir_inicial:
            # The initial state occurs once per sentence in the corpus
            copia_estados[self._q0] = len(self._corpus)

        copia_estados.update(self._estados)

        if incluir_final:
            # The final state occurs once per sentence in the corpus
            copia_estados[self._qF] = len(self._corpus)

        return copia_estados

    def Tokens(self):
        '''
        Returns the tokens of the bigram model for the corpus given to the constructor

        return
            Dictionary of tokens with the number of occurrences of each token in the corpus
        '''

        if len(self._tokens) == 0:
            self._ProcesarCorpus()

        return self._tokens.copy()

    
    def ProbabilidadesDeTransicion(self):
        '''
        Computes the bigram transition probabilities
        from the corpus given to the class
        '''

        # Return the cached table if it has already been computed
        if len(self._prob_trans) != 0:
            return self._prob_trans.copy()

        # Count the bigram transitions t-1 -> t in 'contador_transiciones'.
        # Keys are origin states; each value is a dictionary that maps a
        # destination state to the number of times that transition occurs.
        q0 = self._q0
        qF = self._qF
        contador_transiciones = {q0: dict()}

        # Initialize one entry per state
        contador_transiciones.update({t_1: dict() for t_1 in self.Estados().keys()})

        for oracion in self._corpus:
            # The first word of each sentence is reached from the initial state
            t_1 = q0
            for palabra in oracion:
                t = palabra.Tag()
                contador_transiciones[t_1][t] = contador_transiciones[t_1].get(t, 0) + 1
                # Update the previous state
                t_1 = t

            # Last word of the sentence -> final state
            contador_transiciones[t_1][qF] = contador_transiciones[t_1].get(qF, 0) + 1

        # Transition probability table, following the relation
        # P(T|T-1) = C(T-1, T) / C(T-1).
        # 'contador_transiciones' holds the counts C(T-1, T) and
        # 'estados_totales' holds C(T-1), so the table can be
        # computed from these two elements.
        tags_estados_iniciales = list(
            self.Estados(incluir_inicial=True).keys())
        tags_estados_finales = list(self.Estados(incluir_final=True).keys())
        estados_totales = self.Estados(
            incluir_inicial=True, incluir_final=True)

        prob_trans = {qt_1: {qt: 0 for qt in tags_estados_finales}
                      for qt_1 in tags_estados_iniciales}

        for t_1 in contador_transiciones.keys():
            for t in contador_transiciones[t_1].keys():
                prob_trans[t_1][t] = contador_transiciones[t_1][t] / estados_totales[t_1]

        self._prob_trans = pd.DataFrame.from_dict(prob_trans, orient='index')

        return self._prob_trans.copy()

    def ProbabilidadesDeEmision(self):
        '''
        Computes the emission probabilities
        from the corpus given to the class
        '''

        # Return the cached table if it has already been computed
        if len(self._prob_obs) != 0:
            return self._prob_obs.copy()

        # Count the occurrences of word Wi with tag Ti
        estados = self.Estados()
        contador_observaciones = {key: dict() for key in estados.keys()}

        # Filled in by mirroring ProbabilidadesDeTransicion: accumulate C(Ti, Wi)
        for oracion in self._corpus:
            for palabra in oracion:
                Ti, Wi = palabra.Tag(), palabra.Token()
                contador_observaciones[Ti][Wi] = contador_observaciones[Ti].get(Wi, 0) + 1

        # Emission probability table, following the relation
        # P(Wi|Ti) = C(Ti, Wi) / C(Ti).
        # 'contador_observaciones' holds the counts C(Ti, Wi) and
        # 'estados' holds C(Ti), so the table can be computed
        # from these two elements.
        tokens = self.Tokens()
        prob_obs = {Ti: {Wi: 0 for Wi in tokens} for Ti in estados}

        for Ti in contador_observaciones:
            for Wi in contador_observaciones[Ti]:
                prob_obs[Ti][Wi] = contador_observaciones[Ti][Wi] / estados[Ti]

        self._prob_obs = pd.DataFrame.from_dict(prob_obs, orient='index')

        return self._prob_obs.copy()


The following code creates the bigram HMM and retrieves some relevant information:

In [17]:
hmmbigrama = HMMBigrama(corpus)

In [18]:
hmmbigrama.Tokens()

{'Tristana': 5,
 'es': 28,
 'una': 39,
 'película': 27,
 'de': 236,
 'el': 147,
 'director': 3,
 'español': 1,
 'nacionalizado': 1,
 'mexicano': 1,
 'Luis_Buñuel': 2,
 '.': 178,
 'Está': 1,
 'basada': 2,
 'en': 128,
 'la': 137,
 'novela': 2,
 'mismo': 2,
 'nombre': 3,
 'Benito_Pérez_Galdós': 2,
 'Fue': 2,
 'nominada': 1,
 'a': 119,
 'Oscar': 3,
 'mejor': 5,
 'habla': 7,
 'no': 18,
 'inglesa': 1,
 '1970': 1,
 'y': 112,
 'Nazarín': 1,
 'son': 3,
 'las': 15,
 'dos': 9,
 'novelas': 1,
 'que': 133,
 'Buñuel': 2,
 'adaptó': 1,
 'cine': 4,
 'La': 9,
 'pasó': 2,
 'ser': 12,
 'uno': 1,
 'esos': 1,
 'proyectos': 1,
 'largamente': 1,
 'acariciados': 1,
 'por': 37,
 'constantemente': 1,
 'aplazados': 1,
 'Hubo': 1,
 'otras': 1,
 'tentativas': 1,
 'realizar': 2,
 ':': 17,
 'México': 1,
 '1952': 1,
 ',': 206,
 'con': 37,
 'Ernesto_Alonso': 1,
 'Silvia_Pinal': 1,
 'a_el_frente_de': 1,
 'reparto': 2,
 'otra': 4,
 '1962': 2,
 'hubiera': 1,
 'estado': 2,
 'protagonizada': 2,
 'Rocío_Durcal': 1,
 'o': 6,

In [19]:
len(hmmbigrama.Tokens())

1501

In [20]:
hmmbigrama.Estados()

{'NP00000': 320,
 'VSIP3S0': 31,
 'DI0FS0': 46,
 'NCFS000': 271,
 'SPS00': 683,
 'DA0MS0': 157,
 'NCMS000': 273,
 'AQ0MS0': 41,
 'VMP00SM': 76,
 'Fp': 178,
 'VAIP3S0': 15,
 'VMP00SF': 31,
 'DA0FS0': 142,
 'VSIS3S0': 21,
 'AQ0CS0': 73,
 'RN': 18,
 'AQ0FS0': 35,
 'Z': 55,
 'CC': 152,
 'VSIP3P0': 3,
 'DA0FP0': 18,
 'NCFP000': 49,
 'CS': 99,
 'VMIS3S0': 69,
 'VSN0000': 12,
 'PI0MS000': 3,
 'DD0MP0': 5,
 'NCMP000': 128,
 'RG': 127,
 'VMP00PM': 13,
 'DI0FP0': 8,
 'VMN0000': 88,
 'PP3FSA00': 4,
 'Fd': 17,
 'Fc': 206,
 'PI0FS000': 2,
 'VASI1S0': 1,
 'VAP00SM': 2,
 'AO0FS0': 12,
 'DP3CP0': 18,
 'DI0MS0': 65,
 'PR0FS000': 1,
 'DP3CS0': 61,
 'PD0MS000': 4,
 'VMIP3S0': 134,
 'VMG0000': 32,
 'PP3CNA00': 15,
 'VSIF3S0': 1,
 'DD0MS0': 13,
 'P0000000': 69,
 'W': 11,
 'NCMN000': 13,
 'DA0MP0': 44,
 'NCCP000': 14,
 'VMIP3P0': 41,
 'AQ0FP0': 11,
 'PR0CN000': 78,
 'PP3MPA00': 4,
 'DA0NS0': 15,
 'AQ0CP0': 10,
 'AQ0MP0': 15,
 'PR0CS000': 17,
 'DI0MP0': 18,
 'Fe': 54,
 'VMSP3P0': 4,
 'PP3CN000': 8,
 'Fpa': 3

In [21]:
len(hmmbigrama.Estados())

134

The method `ProbabilidadesDeTransicion()` of the class `HMMBigrama` returns the table of transition probabilities.

In [22]:
def non_zero_green(val):
    '''
    Highlights in green the probabilities that are not 0
    '''
    return 'background-color: Aquamarine' if val > 0 else ''

In [23]:
prob_transicion = hmmbigrama.ProbabilidadesDeTransicion()
prob_transicion.style.applymap(non_zero_green)

Unnamed: 0,NP00000,VSIP3S0,DI0FS0,NCFS000,SPS00,DA0MS0,NCMS000,AQ0MS0,VMP00SM,Fp,VAIP3S0,VMP00SF,DA0FS0,VSIS3S0,AQ0CS0,RN,AQ0FS0,Z,CC,VSIP3P0,DA0FP0,NCFP000,CS,VMIS3S0,VSN0000,PI0MS000,DD0MP0,NCMP000,RG,VMP00PM,DI0FP0,VMN0000,PP3FSA00,Fd,Fc,PI0FS000,VASI1S0,VAP00SM,AO0FS0,DP3CP0,DI0MS0,PR0FS000,DP3CS0,PD0MS000,VMIP3S0,VMG0000,PP3CNA00,VSIF3S0,DD0MS0,P0000000,W,NCMN000,DA0MP0,NCCP000,VMIP3P0,AQ0FP0,PR0CN000,PP3MPA00,DA0NS0,AQ0CP0,AQ0MP0,PR0CS000,DI0MP0,Fe,VMSP3P0,PP3CN000,Fpa,Fpt,VMIS3P0,VMM03P0,PR000000,VMII3P0,Fx,PD0MP000,VMIF3P0,AO0MS0,VSSP1S0,PI0CS000,NCCS000,PR0CP000,PP3CSD00,VMIC1S0,VAIC1S0,VSP00SM,VMIF3S0,VMP00PF,VMSP1S0,PD0NS000,DI0CS0,DD0FS0,VMIS2S0,VAIP3P0,PP3FS000,VAN0000,DD0CS0,PP3CPD00,VMSI3P0,VASP1S0,PR0MS000,VAII3P0,VMIC3P0,VMIS1S0,PP1CSN00,Fs,VMM02S0,VAIS3S0,VMSP2S0,Zd,NC00000,NCMS00D,VSIS3P0,PP3MS000,PI0MP000,VMII1S0,VAII1S0,PD0FS000,AQ0CN0,I,VSII1S0,PP3NS000,VSIC1S0,VMIP1S0,Zu,PT0CS000,Fg,AO0MP0,PT0CN000,VMSI1S0,DD0CP0,DD0FP0,PP3FP000,NCFN000,PT0CP000,Fz,qF
q0,0.196532,0.00578,0.017341,0.0,0.277457,0.057803,0.00578,0.0,0.00578,0.0,0.00578,0.00578,0.040462,0.011561,0.0,0.0,0.0,0.023121,0.00578,0.0,0.0,0.0,0.017341,0.00578,0.0,0.0,0.00578,0.00578,0.138728,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00578,0.00578,0.00578,0.0,0.017341,0.00578,0.028902,0.0,0.0,0.0,0.011561,0.023121,0.0,0.0,0.011561,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00578,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011561,0.0,0.00578,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00578,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00578,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00578,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011561,0.0
NP00000,0.0,0.021875,0.0,0.0,0.078125,0.003125,0.009375,0.0,0.003125,0.14375,0.003125,0.003125,0.00625,0.01875,0.00625,0.00625,0.003125,0.0,0.08125,0.003125,0.0,0.0,0.015625,0.040625,0.0,0.0,0.0,0.0,0.009375,0.0,0.0,0.0,0.003125,0.040625,0.159375,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003125,0.0,0.059375,0.003125,0.0,0.003125,0.0,0.0125,0.0,0.0,0.0,0.0,0.00625,0.0,0.00625,0.003125,0.0,0.0,0.00625,0.0,0.0,0.046875,0.0,0.0,0.053125,0.0375,0.0,0.0,0.003125,0.0,0.0375,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009375,0.00625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003125,0.0,0.0,0.00625,0.0,0.0,0.0,0.0,0.009375,0.015625,0.0,0.003125,0.0,0.003125,0.0,0.0,0.0,0.0,0.0,0.003125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
VSIP3S0,0.0,0.0,0.16129,0.0,0.064516,0.064516,0.0,0.0,0.129032,0.0,0.0,0.129032,0.0,0.0,0.096774,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.096774,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.096774,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DI0FS0,0.0,0.0,0.0,0.73913,0.065217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.065217,0.0,0.065217,0.0,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
NCFS000,0.02583,0.01107,0.0,0.0,0.361624,0.0,0.00369,0.00369,0.00369,0.095941,0.0,0.02214,0.0,0.01476,0.066421,0.00738,0.062731,0.0,0.04059,0.0,0.0,0.0,0.01107,0.01476,0.0,0.0,0.0,0.0,0.01845,0.0,0.0,0.0,0.0,0.00738,0.066421,0.0,0.0,0.0,0.0,0.0,0.00369,0.0,0.00369,0.0,0.03321,0.0,0.00369,0.0,0.0,0.02214,0.0,0.0,0.00369,0.0,0.00369,0.0,0.04059,0.0,0.00369,0.0,0.0,0.0,0.0,0.01107,0.0,0.0,0.00738,0.00738,0.0,0.0,0.0,0.0,0.00369,0.0,0.0,0.0,0.00369,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00369,0.0,0.00369,0.0,0.00369,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SPS00,0.149341,0.0,0.019034,0.058565,0.001464,0.149341,0.045388,0.001464,0.0,0.0,0.0,0.0,0.136164,0.0,0.001464,0.0,0.0,0.038067,0.0,0.0,0.011713,0.013177,0.021962,0.001464,0.007321,0.0,0.005857,0.032211,0.008785,0.0,0.002928,0.064422,0.0,0.0,0.0,0.0,0.0,0.0,0.001464,0.011713,0.039531,0.0,0.04246,0.0,0.0,0.0,0.001464,0.0,0.013177,0.0,0.0,0.0,0.038067,0.002928,0.0,0.004392,0.010249,0.0,0.011713,0.001464,0.002928,0.002928,0.010249,0.002928,0.0,0.0,0.0,0.0,0.0,0.0,0.001464,0.0,0.0,0.0,0.0,0.001464,0.0,0.0,0.0,0.001464,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004392,0.001464,0.002928,0.0,0.0,0.001464,0.002928,0.002928,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001464,0.0,0.0,0.0,0.004392,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001464,0.0,0.0,0.0,0.0,0.0,0.0,0.001464,0.0,0.0,0.0,0.001464,0.0,0.001464,0.0,0.0
DA0MS0,0.019108,0.0,0.0,0.019108,0.012739,0.0,0.694268,0.019108,0.0,0.0,0.0,0.0,0.0,0.0,0.019108,0.0,0.0,0.006369,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.044586,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057325,0.044586,0.0,0.0,0.0,0.0,0.019108,0.0,0.0,0.0,0.0,0.012739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006369,0.0,0.0,0.012739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
NCMS000,0.040293,0.007326,0.0,0.0,0.322344,0.007326,0.010989,0.040293,0.018315,0.07326,0.007326,0.003663,0.0,0.010989,0.054945,0.003663,0.0,0.0,0.080586,0.0,0.0,0.0,0.010989,0.0,0.0,0.0,0.0,0.0,0.010989,0.0,0.0,0.0,0.0,0.0,0.113553,0.0,0.0,0.0,0.0,0.0,0.0,0.003663,0.0,0.0,0.021978,0.0,0.007326,0.0,0.0,0.010989,0.0,0.0,0.0,0.0,0.0,0.003663,0.051282,0.0,0.0,0.0,0.0,0.0,0.0,0.018315,0.0,0.0,0.003663,0.03663,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003663,0.0,0.003663,0.0,0.003663,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003663,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010989,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AQ0MS0,0.04878,0.0,0.0,0.0,0.341463,0.0,0.219512,0.0,0.02439,0.02439,0.0,0.0,0.0,0.02439,0.0,0.0,0.0,0.0,0.02439,0.0,0.0,0.0,0.097561,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02439,0.0,0.073171,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02439,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02439,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02439,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02439,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
VMP00SM,0.0,0.0,0.0,0.0,0.421053,0.026316,0.039474,0.026316,0.0,0.039474,0.0,0.0,0.0,0.0,0.013158,0.0,0.0,0.013158,0.065789,0.0,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.078947,0.0,0.0,0.026316,0.0,0.0,0.092105,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.013158,0.0,0.0,0.0,0.013158,0.026316,0.0,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
prob_transicion.to_excel('mia07_t3_tra_resultados_trans.xlsx', sheet_name='prob_trans')
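
A quick sanity check on a transition table is that every row sums to 1, because each occurrence of an origin state is followed either by another tag or by the final state. The sketch below shows the idea on a hypothetical table stored as nested dictionaries; the names `toy_transitions` and `rows_not_summing_to_one` are illustrative. For the pandas table itself, `prob_transicion.sum(axis=1)` gives the row sums directly.

```python
import math

# Hypothetical transition table: row = origin state, column = destination state.
toy_transitions = {
    "q0":      {"DA0MS0": 1.0, "NCMS000": 0.0, "Fp": 0.0, "qF": 0.0},
    "DA0MS0":  {"DA0MS0": 0.0, "NCMS000": 1.0, "Fp": 0.0, "qF": 0.0},
    "NCMS000": {"DA0MS0": 0.0, "NCMS000": 0.0, "Fp": 1.0, "qF": 0.0},
    "Fp":      {"DA0MS0": 0.0, "NCMS000": 0.0, "Fp": 0.0, "qF": 1.0},
}


def rows_not_summing_to_one(table: dict, tolerance: float = 1e-9) -> list:
    """Return the origin states whose outgoing probabilities do not sum to 1."""
    return [
        origin
        for origin, row in table.items()
        if not math.isclose(sum(row.values()), 1.0, abs_tol=tolerance)
    ]


print(rows_not_summing_to_one(toy_transitions))  # []
```

An empty list means that every row is a valid probability distribution, up to the chosen tolerance.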

The method `ProbabilidadesDeEmision()` of the class `HMMBigrama` returns the table of emission probabilities.

In [None]:
prob_emision = hmmbigrama.ProbabilidadesDeEmision()
prob_emision.style.applymap(non_zero_green)

In [None]:
prob_emision.to_excel('mia07_t3_tra_resultados_emision.xlsx', sheet_name='prob_emision')

<span style="font-size: 14pt; font-weight: bold; color: #0098cd;">Part 2: Tag the parts of speech of a sentence</span>

In this second part of the activity you have to implement in Python a program that computes the best sequence of tags for a sentence, in other words, that performs the part-of-speech tagging of the sentence "Habla con el enfermo grave de trasplantes."

To do so, you must use the tagger built in part 1 of this activity, that is, the probability tables you computed, and apply the Viterbi algorithm.

To apply the Viterbi algorithm, follow these steps: 

* Compute the Viterbi path probability matrix (the matrix of Viterbi values), clearly showing the observations and the states of the finite-state machine. Compute the Viterbi value for each cell of the matrix and clearly indicate the values obtained.
Note: To simplify, you may drop every state associated with a tag that does not appear in any possible analysis of the sentence and keep only the relevant states. You must also take into account the transition to the final state, represented by the full stop at the end of the sentence to analyze. 

* Obtain the path with maximum probability, that is, trace the path backwards to obtain the best sequence of tags. 

* Show the tagged sentence. You must clearly indicate the result of the part-of-speech tagging of the sentence under study.

**Note:** Include in your submission the table (saved as a Microsoft Excel spreadsheet (.xlsx) or equivalent) with the Viterbi path probability matrix for the part-of-speech tagging of the sentence "Habla con el enfermo grave de trasplantes."
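
The three steps can be sketched on a toy model as follows. This is a minimal illustration under stated assumptions, not the `Viterbi` class developed in this activity: the function `viterbi`, the dictionaries `toy_trans` and `toy_emit`, and their probabilities are hypothetical, and the final state `qF` is assumed to be reached after the last token.

```python
def viterbi(tokens, states, p_trans, p_emit, start="q0", end="qF"):
    """Return (best_probability, best_tag_sequence) for the given tokens.

    p_trans maps (origin, destination) to a probability and p_emit maps
    (state, token) to a probability; missing entries count as 0.
    """
    # v[t][q]: probability of the best path that ends in state q at position t.
    v = [{} for _ in tokens]
    # back[t][q]: previous state on that best path.
    back = [{} for _ in tokens]

    # Initialization: leave the start state and emit the first token.
    for q in states:
        v[0][q] = p_trans.get((start, q), 0.0) * p_emit.get((q, tokens[0]), 0.0)
        back[0][q] = start

    # Recursion: extend the best path into each state.
    for t in range(1, len(tokens)):
        for q in states:
            best_prev = max(
                states, key=lambda r: v[t - 1][r] * p_trans.get((r, q), 0.0)
            )
            v[t][q] = (
                v[t - 1][best_prev]
                * p_trans.get((best_prev, q), 0.0)
                * p_emit.get((q, tokens[t]), 0.0)
            )
            back[t][q] = best_prev

    # Termination: include the transition into the final state.
    last = max(states, key=lambda q: v[-1][q] * p_trans.get((q, end), 0.0))
    best_prob = v[-1][last] * p_trans.get((last, end), 0.0)

    # Backtrace: follow the stored pointers from the last state.
    path = [last]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best_prob, path


# Hypothetical toy model.
toy_states = ["DA0MS0", "NCMS000", "Fp"]
toy_trans = {
    ("q0", "DA0MS0"): 1.0,
    ("DA0MS0", "NCMS000"): 1.0,
    ("NCMS000", "Fp"): 1.0,
    ("Fp", "qF"): 1.0,
}
toy_emit = {
    ("DA0MS0", "el"): 1.0,
    ("NCMS000", "programa"): 0.5,
    ("NCMS000", "hospital"): 0.5,
    ("Fp", "."): 1.0,
}

print(viterbi(["el", "programa", "."], toy_states, toy_trans, toy_emit))
# (0.5, ['DA0MS0', 'NCMS000', 'Fp'])
```

On long sentences the product of many small probabilities can underflow, so a practical variant would work with sums of log-probabilities instead.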



### Compute the Viterbi path probability matrix

The class `Viterbi` computes the Viterbi path probability matrix and then decodes the optimal tag sequence for the sentence to analyze.

The part-of-speech tagger built in Part 1, that is, the object `hmmbigrama` of the class `HMMBigrama`, is passed to the object `viterbi` of the class `Viterbi` so that the Viterbi algorithm can be applied.

The Viterbi values are computed in the method `Probabilidades()` of the class `Viterbi`.
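
In the usual notation (assumed here, not taken from the course material), let $a_{ij}$ be the transition probability from state $i$ to state $j$, $b_j(o_t)$ the emission probability of observation $o_t$ in state $j$, and $v_t(j)$ the Viterbi value of state $j$ at position $t$. Each cell of the matrix is then filled as:

$$
v_1(j) = a_{q_0 j}\, b_j(o_1),
\qquad
v_t(j) = \max_{i}\; v_{t-1}(i)\, a_{ij}\, b_j(o_t),
\qquad
P^{*} = \max_{i}\; v_T(i)\, a_{i q_F}
$$

where $T$ is the number of observations and $P^{*}$ is the probability of the best complete path, including the transition into the final state $q_F$.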



### Obtain the maximum-probability path

The `DecodificacionSecuenciaOptima()` method of the `Viterbi` class returns the most probable tag sequence for the sentence to be analyzed.
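
The backward trace can be sketched on its own with a toy back-pointer table. The table below is invented for illustration; it only assumes the same dict-of-dicts shape as `_estado_max_anterior`, read as `back[tag][token]` = best previous tag, which is one plausible way to fill that structure rather than a confirmed detail of the class.

```python
# Toy back-pointer table (hypothetical values, not derived from the corpus):
# back[tag][token] = previous tag on the best path reaching `tag` at `token`
back = {
    "NOUN": {"enfermo": "DET", "habla": "NOUN"},
    "VERB": {"enfermo": "DET", "habla": "NOUN"},
}
tokens = ["el", "enfermo", "habla"]
best_last_tag = "VERB"  # assumed argmax of the last Viterbi column

tags = [best_last_tag]
# Walk from the last token back to the second one, following the pointers
for token in reversed(tokens[1:]):
    tags.append(back[tags[-1]][token])
tags.reverse()

tagged = list(zip(tokens, tags))
print(tagged)  # [('el', 'DET'), ('enfermo', 'NOUN'), ('habla', 'VERB')]
```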

In [None]:
class Viterbi:
    '''
    Viterbi algorithm for finding the best
    tags for the words of a sentence
    '''

    def __init__(self, hmmbigrama: HMMBigrama, oracion: str):
        self._hmmbigrama = hmmbigrama
        self._oracion = oracion

        self._estados_relevantes = None
        self._prob_viterbi = pd.DataFrame()
        self._estado_max_anterior = None

    def _CalculoEstadosRelevantes(self):
        self._estados_relevantes = set()
        for palabra_analizar in [x.lower() for x in self._oracion.split()]:
            # Collect every tag the corpus assigns to this word
            for oracion in self._hmmbigrama.Corpus():
                for palabra_corpus in oracion:
                    if palabra_corpus.Token() == palabra_analizar:
                        self._estados_relevantes.add(palabra_corpus.Tag())

    def Probabilidades(self):
        if len(self._prob_viterbi) != 0:
            return self._prob_viterbi.copy()

        if not self._estados_relevantes:
            self._CalculoEstadosRelevantes()

        estados_relevantes = self._estados_relevantes

        # Matrix that stores the Viterbi values
        matriz_viterbi = {q: dict() for q in estados_relevantes}

        # Companion matrix that stores, for each Viterbi value,
        # the origin state that maximizes the probability
        self._estado_max_anterior = {q: dict() for q in estados_relevantes}

        q0 = self._hmmbigrama.EstadoInicial()
        prob_trans = self._hmmbigrama.ProbabilidadesDeTransicion()
        prob_obs = self._hmmbigrama.ProbabilidadesDeEmision()

        token_anterior = None
        for token in [x.lower() for x in self._oracion.split()]:
            for qDestino in estados_relevantes:

                prob_max = 0
                if not token_anterior:
                    # Initial state q0
                    prob_max = prob_trans[qDestino][q0]
                else:
                    # Remaining states
                    for qOrigen in estados_relevantes:
                        # One possible completion of the exercise (standard Viterbi
                        # recursion), assuming prob_trans is indexed
                        # [destination][origin] as in the q0 case above.
                        prob_qOrigen = (matriz_viterbi[qOrigen][token_anterior]
                                        * prob_trans[qDestino][qOrigen])

                        if prob_qOrigen > prob_max:
                            prob_max = prob_qOrigen
                            # Remember the origin state that maximizes the probability
                            self._estado_max_anterior[qDestino][token] = qOrigen

                matriz_viterbi[qDestino][token] = prob_max * prob_obs[token][qDestino]

            token_anterior = token

        self._prob_viterbi = pd.DataFrame.from_dict(matriz_viterbi, orient='index')

        return self._prob_viterbi.copy()

    def DecodificacionSecuenciaOptima(self):
        # Decode the optimal sequence
        oracion_invertida = [x.lower() for x in self._oracion.split()]
        oracion_invertida.reverse()

        prob_viterbi = self.Probabilidades()

        oracion_etiquetada = []
        # Find the maximum Viterbi probability associated with the last word of the sentence
        palabra = oracion_invertida[0]
        etiqueta = prob_viterbi[palabra].idxmax()
        oracion_etiquetada.append({'token': palabra, 'tag': etiqueta, 'prob': prob_viterbi[palabra].max()})

        # Now use the auxiliary Viterbi table that holds
        # the origin state that maximizes each Viterbi probability
        palabra_anterior = palabra
        for palabra in oracion_invertida[1:]:
            
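            # One possible completion of the exercise: follow the back-pointer
            # stored for the tag just chosen, assuming
            # _estado_max_anterior[tag][token] holds the best previous tag.
            etiqueta = self._estado_max_anterior[etiqueta][palabra_anterior]
            oracion_etiquetada.append({'token': palabra, 'tag': etiqueta,
                                       'prob': prob_viterbi.loc[etiqueta, palabra]})
            palabra_anterior = palabra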

        # Restore the original word order, now with every word tagged
        oracion_etiquetada.reverse()

        return oracion_etiquetada

The following code lets you analyze the sentence: "Habla con el enfermo grave de trasplantes."

In [None]:
viterbi = Viterbi(hmmbigrama=hmmbigrama, oracion='Habla con el enfermo grave de trasplantes .')

The following code displays the Viterbi path probability matrix (only tags with a non-zero value for at least one word of the analyzed sentence are shown).

In [None]:
matriz_prob_viterbi = viterbi.Probabilidades()
matriz_prob_viterbi.style.applymap(non_zero_green)

In [None]:
matriz_prob_viterbi.to_excel('mia07_t3_tra_resultados_viterbi.xlsx', sheet_name='viterbi')

The following code shows the maximum-probability Viterbi path.

In [None]:
oracion_etiquetada = viterbi.DecodificacionSecuenciaOptima()

In [None]:
oracion_etiquetada

### Show the tagged sentence

The following code prints the tagged sentence.

In [None]:
for palabra in oracion_etiquetada:
    print('{} / {}'.format(palabra['token'], palabra['tag']))

<span style="font-size: 14pt; font-weight: bold; color: #0098cd;">Part 3: Analyze the part-of-speech tagger</span>

Once you have built the part-of-speech tagger and used it to tag the sentence «Habla con el enfermo grave de trasplantes.», reflect on the results, interpret them, and analyze the performance of the tagger and its limitations. To do so, give a reasoned answer to each of the following questions:

* Is the part-of-speech tagging you obtained correct? Explain why.

########## Write your answer here ##########

* Give the result of tagging the sentence «El enfermo grave habla de trasplantes.» with the part-of-speech tagger. Is the tagging you obtained correct? Explain why.

########## Write your answer here ##########

* What are the limitations of the part-of-speech tagger you have built?

########## Write your answer here ##########

* What improvements could be applied to raise the performance of the part-of-speech tagger you have built?

########## Write your answer here ##########