<a href="https://colab.research.google.com/github/cris-her/AI/blob/master/nlp_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tagging with NLTK

## Basic pipeline for English

In [None]:
#@title Prerequisites
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
#@title One-line tagging ...
text = word_tokenize("And now here I am enjoying today")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('here', 'RB'),
 ('I', 'PRP'),
 ('am', 'VBP'),
 ('enjoying', 'VBG'),
 ('today', 'NN')]

In [None]:
#@title Grammatical category of each tag
nltk.download('tagsets')
for tag in ['CC', 'RB', 'PRP', 'VBP', 'VBG', 'NN']:
  # upenn_tagset prints the description itself and returns None,
  # so it is not wrapped in print()
  nltk.help.upenn_tagset(tag)

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
    appear tend stray glisten obtain comprise detest tease attract
    emphasize mold postpone sever return wag ...
VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' allegin

In [None]:
#@title Homonymous words
text = word_tokenize("They do not permit other people to get residence permit")
nltk.pos_tag(text)

[('They', 'PRP'),
 ('do', 'VBP'),
 ('not', 'RB'),
 ('permit', 'VB'),
 ('other', 'JJ'),
 ('people', 'NNS'),
 ('to', 'TO'),
 ('get', 'VB'),
 ('residence', 'NN'),
 ('permit', 'NN')]

## Tagging in Spanish

For English, NLTK ships with a pre-trained tokenizer and tagger by default. For other languages, however, a tagger has to be trained first.

* we use the `cess_esp` corpus: https://mailman.uib.no/public/corpora/2007-October/005448.html

* which follows the grammatical tag convention defined by the EAGLES group: https://www.cs.upc.edu/~nlp/tools/parole-sp.html

In [None]:
nltk.download('cess_esp')
from nltk.corpus import cess_esp as cess
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt

[nltk_data] Downloading package cess_esp to /root/nltk_data...
[nltk_data]   Unzipping corpora/cess_esp.zip.


In [None]:
#@title Training the unigram tagger
cess_sents = cess.tagged_sents()
# use the first 90% of the sentences for training
fraction = int(len(cess_sents)*90/100)
uni_tagger = ut(cess_sents[:fraction])
# evaluate on the held-out sentences
uni_tagger.evaluate(cess_sents[fraction+1:])

0.8069484240687679

In [None]:
uni_tagger.tag("Yo soy una persona muy amable".split(" "))

[('Yo', 'pp1csn00'),
 ('soy', 'vsip1s0'),
 ('una', 'di0fs0'),
 ('persona', 'ncfs000'),
 ('muy', 'rg'),
 ('amable', None)]

In [None]:
#@title Training the bigram tagger
fraction = int(len(cess_sents)*90/100)
bi_tagger = bt(cess_sents[:fraction])
bi_tagger.evaluate(cess_sents[fraction+1:])

0.1095272206303725

In [None]:
bi_tagger.tag("Yo soy una persona muy amable".split(" "))

[('Yo', 'pp1csn00'),
 ('soy', 'vsip1s0'),
 ('una', None),
 ('persona', None),
 ('muy', None),
 ('amable', None)]
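
The bigram tagger scores far lower than the unigram tagger (0.11 versus 0.81) because it has no fallback: once it meets a context it did not see in training it assigns `None`, and every later token then has an unseen context as well, as the output above shows. A common remedy is to chain taggers so that one backs off to the next. The sketch below illustrates the idea in plain Python on a toy corpus; the class `BackoffTagger`, the toy sentences, and the default tag `UNK` are illustrative assumptions, not NLTK's implementation.

```python
from collections import Counter, defaultdict

def most_frequent(counter_by_key):
    """Map each key to the tag seen most often with it."""
    return {key: counts.most_common(1)[0][0] for key, counts in counter_by_key.items()}

class BackoffTagger:
    """Toy bigram tagger that backs off to a unigram table, then to a default tag."""

    def __init__(self, tagged_sents, default_tag="UNK"):
        uni = defaultdict(Counter)   # word -> tag counts
        bi = defaultdict(Counter)    # (previous tag, word) -> tag counts
        for sent in tagged_sents:
            prev_tag = None
            for word, tag in sent:
                uni[word][tag] += 1
                bi[(prev_tag, word)][tag] += 1
                prev_tag = tag
        self.uni = most_frequent(uni)
        self.bi = most_frequent(bi)
        self.default_tag = default_tag

    def tag(self, words):
        tagged, prev_tag = [], None
        for word in words:
            # try the bigram context first, then the unigram table, then the default
            tag = self.bi.get((prev_tag, word))
            if tag is None:
                tag = self.uni.get(word, self.default_tag)
            tagged.append((word, tag))
            prev_tag = tag
        return tagged

# hypothetical toy corpus, not taken from cess_esp
toy_sents = [
    [("yo", "PRON"), ("soy", "AUX"), ("una", "DET"), ("persona", "NOUN")],
    [("una", "DET"), ("persona", "NOUN"), ("muy", "ADV"), ("amable", "ADJ")],
]
toy_tagger = BackoffTagger(toy_sents)
result = toy_tagger.tag("yo soy muy amable".split())
print(result)
```

In NLTK itself, the n-gram taggers accept a `backoff` argument, so the same chaining can be written as `bt(train, backoff=ut(train))`.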

# Improved tagging with Stanza (StanfordNLP)

**What is Stanza?**

* The Stanford NLP research group had a suite of libraries for various NLP tasks; this suite was unified into a single Java-based service called **CoreNLP**: https://stanfordnlp.github.io/CoreNLP/index.html

* For Python there is **StanfordNLP**: https://stanfordnlp.github.io/stanfordnlp/index.html

* However, **StanfordNLP** has been deprecated, and new versions of the NLP suite are maintained under the name **Stanza**: https://stanfordnlp.github.io/stanza/

In [None]:
!pip install stanza

Collecting stanza
  Downloading https://files.pythonhosted.org/packages/e7/8b/3a9e7a8d8cb14ad6afffc3983b7a7322a3a24d94ebc978a70746fcffc085/stanza-1.1.1-py3-none-any.whl (227kB)

In [None]:
# this step can take a while ...
import stanza
stanza.download('es')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.1.0.json: 122kB [00:00, 10.4MB/s]                    
2020-10-22 11:45:46 INFO: Downloading default packages for language: es (Spanish)...
Downloading http://nlp.stanford.edu/software/stanza/1.1.0/es/default.zip: 100%|██████████| 583M/583M [01:18<00:00, 7.44MB/s]
2020-10-22 11:47:15 INFO: Finished downloading models and saved to /root/stanza_resources.


In [None]:
# Stanza works through pipelines (tasks chained one after another)
nlp = stanza.Pipeline('es', processors='tokenize,pos')
# 'pos' adds the part-of-speech tags
doc = nlp('yo soy una persona muy amable')
# nlp is a Pipeline instance

2020-10-22 11:48:31 INFO: Loading these models for language: es (Spanish):
| Processor | Package |
-----------------------
| tokenize  | ancora  |
| pos       | ancora  |

2020-10-22 11:48:31 INFO: Use device: cpu
2020-10-22 11:48:31 INFO: Loading: tokenize
2020-10-22 11:48:31 INFO: Loading: pos
2020-10-22 11:48:33 INFO: Done loading processors!


In [None]:
for sentence in doc.sentences:
  for word in sentence.words:
    print(word.text, word.pos)

yo PRON
soy AUX
una DET
persona NOUN
muy ADV
amable ADJ


# Additional references:

* POS tagging with Stanza: https://stanfordnlp.github.io/stanza/pos.html#accessing-pos-and-morphological-feature-for-word

* Stanza | Github: https://github.com/stanfordnlp/stanza

* ArXiv paper: https://arxiv.org/pdf/2003.07082.pdf

# Training a Hidden Markov Model (HMM)

## Spanish corpus: 

* AnCora | Github: https://github.com/UniversalDependencies/UD_Spanish-AnCora

* we use the conllu parser to read the corpus: https://pypi.org/project/conllu/

* Universal POS tags (documentation): https://universaldependencies.org/u/pos/

In [None]:
#@title Prerequisites
!pip install conllu
!git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora.git

ERROR: Operation cancelled by user
fatal: destination path 'UD_Spanish-AnCora' already exists and is not an empty directory.


In [None]:
#@title Reading the AnCora corpus
from conllu import parse_incr
# parse_incr yields one TokenList per sentence; the with-block closes the file afterwards
with open("UD_Spanish-AnCora/es_ancora-ud-dev.conllu", "r", encoding="utf-8") as data_file:
    for tokenlist in parse_incr(data_file):
        print(tokenlist.serialize())

In [None]:
#@title Structure of the tagged tokens in the corpus
tokenlist[1]

{'deprel': 'nsubj',
 'deps': None,
 'feats': {'Gender': 'Masc', 'Number': 'Sing'},
 'form': 'cierto',
 'head': 3,
 'id': 2,
 'lemma': 'cierto',
 'misc': None,
 'upos': 'ADJ',
 'xpos': 'ADJ'}

In [None]:
tokenlist[1]['form']+'|'+tokenlist[1]['upos']

# we build this 'word|tag' string because it is the key used to count each conditioned event

'cierto|ADJ'

## Model training - computing counts:

* tags `tagCountDict`: $C(tag)$
* emissions (word|tag) `emissionDict`: $C(word|tag)$
* transitions (tag|prevtag) `transitionDict`: $C(tag|prevtag)$

In [None]:
tagCountDict = {}
emissionDict = {}
transitionDict = {}

# UPOS: the universal POS tag convention
tagtype = 'upos'

# Computing counts (pre-probabilities)
with open("UD_Spanish-AnCora/es_ancora-ud-dev.conllu", "r", encoding="utf-8") as data_file:
  for tokenlist in parse_incr(data_file):
    # the first token of a sentence has no previous tag
    prevtag = None
    for token in tokenlist:

      # C(tag)
      tag = token[tagtype]
      tagCountDict[tag] = tagCountDict.get(tag, 0) + 1

      # C(word|tag) -> emission counts
      wordtag = token['form'].lower()+'|'+tag # (word|tag)
      emissionDict[wordtag] = emissionDict.get(wordtag, 0) + 1

      # C(tag|prevtag) -> transition counts
      if prevtag is not None:
        transitiontags = tag+'|'+prevtag
        transitionDict[transitiontags] = transitionDict.get(transitiontags, 0) + 1
      prevtag = tag

#transitionDict
#emissionDict
#tagCountDict

## Model training - computing probabilities
* transition probabilities:
$$P(tag|prevtag) = \frac{C(tag|prevtag)}{C(prevtag)}$$

* emission probabilities:
 $$P(word|tag) = \frac{C(word|tag)}{C(tag)}$$
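
To make the two formulas concrete, here is a minimal sketch that applies them to a three-sentence toy corpus. The corpus and the names `toy_corpus`, `tag_counts`, `emission_counts`, `transition_counts`, `emission_prob`, and `transition_prob` are illustrative assumptions; the key format (`word|tag` and `tag|prevtag`) follows the convention used in this notebook.

```python
from collections import Counter

# hypothetical toy corpus: each sentence is a list of (word, tag) pairs
toy_corpus = [
    [("el", "DET"), ("mundo", "NOUN"), ("es", "AUX"), ("pequeño", "ADJ")],
    [("el", "DET"), ("gato", "NOUN"), ("es", "AUX"), ("pequeño", "ADJ")],
    [("el", "DET"), ("mundo", "NOUN")],
]

tag_counts = Counter()         # C(tag)
emission_counts = Counter()    # C(word|tag)
transition_counts = Counter()  # C(tag|prevtag)

for sentence in toy_corpus:
    prev_tag = None
    for word, tag in sentence:
        tag_counts[tag] += 1
        emission_counts[f"{word}|{tag}"] += 1
        if prev_tag is not None:
            transition_counts[f"{tag}|{prev_tag}"] += 1
        prev_tag = tag

# P(word|tag) = C(word|tag) / C(tag): the denominator is the count of the tag
emission_prob = {
    key: count / tag_counts[key.split("|")[1]]
    for key, count in emission_counts.items()
}
# P(tag|prevtag) = C(tag|prevtag) / C(prevtag): the denominator is the count of the previous tag
transition_prob = {
    key: count / tag_counts[key.split("|")[1]]
    for key, count in transition_counts.items()
}

print(emission_prob["mundo|NOUN"])   # 2 of the 3 NOUN tokens are "mundo"
print(transition_prob["NOUN|DET"])   # every DET is followed by a NOUN
print(transition_prob["AUX|NOUN"])   # 2 of the 3 NOUN tokens are followed by AUX
```

With this counting scheme the transition probabilities out of a tag can sum to less than 1, because sentence-final occurrences of the tag are included in the denominator but have no successor.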

In [None]:
transitionProbDict = {} # matrix A
emissionProbDict = {} # matrix B

# transition Probabilities 
for key in transitionDict.keys():
  tag, prevtag = key.split('|')
  if tagCountDict[prevtag]>0:
    transitionProbDict[key] = transitionDict[key]/(tagCountDict[prevtag])
  else:
    print(key)

# emission Probabilities 
for key in emissionDict.keys():
  word, tag = key.split('|')
  if emissionDict[key]>0:
    emissionProbDict[key] = emissionDict[key]/tagCountDict[tag]
  else:
    print(key)

transitionProbDict['ADJ|ADJ']
#emissionProbDict

0.030225988700564973

In [None]:
emissionProbDict

{'el|DET': 0.2411214953271028,
 'gobernante|NOUN': 0.00020835503698301907,
 ',|PUNCT': 0.45316979929913986,
 'con|ADP': 0.05196480938416422,
 'ganada|ADJ': 0.0002824858757062147,
 'fama|NOUN': 0.00010417751849150954,
 'desde|ADP': 0.008797653958944282,
 'que|SCONJ': 0.6382042253521126,
 'llegó|VERB': 0.0022411474675033617,
 'hace|VERB': 0.009188704616763783,
 '16|NUM': 0.011428571428571429,
 'meses|NOUN': 0.0028127929992707574,
 'al|ADP': 0.04105571847507331,
 'poder|NOUN': 0.0011459527034066049,
 'de|ADP': 0.37478005865102637,
 'explotar|VERB': 0.00044822949350067237,
 'máximo|NOUN': 0.00020835503698301907,
 'su|DET': 0.0503235082674335,
 'oratoria|NOUN': 0.00010417751849150954,
 'y|CCONJ': 0.7771664374140302,
 'acusado|ADJ': 0.000847457627118644,
 'por|ADP': 0.05970674486803519,
 'sus|DET': 0.019985621854780734,
 'detractores|NOUN': 0.0003125325554745286,
 'incontinencia|NOUN': 0.00010417751849150954,
 'verbal|ADJ': 0.0005649717514124294,
 'enmudeció|VERB': 0.00022411474675033618,
 '

## Saving the model parameters

In [None]:
import numpy as np
np.save('transitionHMM.npy', transitionProbDict)
np.save('emissionHMM.npy', emissionProbDict)
# reload the saved dictionary and check the same entry on the loaded copy
transitionProbdict = np.load('transitionHMM.npy', allow_pickle=True).item()
transitionProbdict['ADJ|ADJ']

0.030225988700564973

In [None]:
# install prerequisites
!pip install conllu
!git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora.git

fatal: destination path 'UD_Spanish-AnCora' already exists and is not an empty directory.


# Loading the previously trained HMM

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
%cd 'drive/My Drive/Colab Notebooks/Curso Algoritmos de Clasificación de Texto'

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
[Errno 2] No such file or directory: 'drive/My Drive/Colab Notebooks/Curso Algoritmos de Clasificación de Texto'
/content/drive/My Drive/Colab Notebooks/Curso Algoritmos de Clasificación de Texto


In [None]:
# load the probabilities of the HMM
import numpy as np
transitionProbdict = np.load('transitionHMM.npy', allow_pickle=True).item()
emissionProbdict = np.load('emissionHMM.npy', allow_pickle=True).item()

In [None]:
# collect the unique 'upos' grammatical categories found in the corpus
stateSet = {w.split('|')[1] for w in emissionProbdict}
stateSet

{'ADJ',
 'ADP',
 'ADV',
 'AUX',
 'CCONJ',
 'DET',
 'INTJ',
 'NOUN',
 'NUM',
 'PART',
 'PRON',
 'PROPN',
 'PUNCT',
 'SCONJ',
 'SYM',
 'VERB',
 '_'}

In [None]:
# number the categories so that each one maps to
# a row of the Viterbi matrix
tagStateDict = {}
for i, state in enumerate(stateSet):
  tagStateDict[state] = i
tagStateDict

{'ADJ': 15,
 'ADP': 2,
 'ADV': 5,
 'AUX': 6,
 'CCONJ': 10,
 'DET': 0,
 'INTJ': 16,
 'NOUN': 14,
 'NUM': 9,
 'PART': 7,
 'PRON': 4,
 'PROPN': 8,
 'PUNCT': 1,
 'SCONJ': 3,
 'SYM': 13,
 'VERB': 11,
 '_': 12}

# Initial distribution of hidden states

In [None]:
# Compute the initial state distribution (the tag of the first word of each sentence in the corpus)
from conllu import parse_incr

# count the grammatical category found at the start of every sentence
initTagStateProb = {} # \rho_i^{(0)}
count = 0 # number of sentences in the corpus
with open("UD_Spanish-AnCora/es_ancora-ud-dev.conllu", "r", encoding="utf-8") as data_file:
  for tokenlist in parse_incr(data_file):
    count += 1
    tag = tokenlist[0]['upos']
    initTagStateProb[tag] = initTagStateProb.get(tag, 0) + 1

# divide each count by the number of sentences to turn it into a probability
for key in initTagStateProb.keys():
  initTagStateProb[key] /= count

# probability of each tag in the initial state
initTagStateProb

{'ADJ': 0.010882708585247884,
 'ADP': 0.16384522370012092,
 'ADV': 0.06287787182587666,
 'AUX': 0.022370012091898428,
 'CCONJ': 0.03325272067714631,
 'DET': 0.3633615477629988,
 'INTJ': 0.0006045949214026602,
 'NOUN': 0.02720677146311971,
 'NUM': 0.01995163240628779,
 'PART': 0.0018137847642079807,
 'PRON': 0.034461910519951636,
 'PROPN': 0.1124546553808948,
 'PUNCT': 0.07799274486094317,
 'SCONJ': 0.02418379685610641,
 'SYM': 0.0006045949214026602,
 'VERB': 0.04353083434099154,
 '_': 0.0006045949214026602}

In [None]:
# check that the probabilities sum to 1 (100%)
np.array([initTagStateProb[k] for k in initTagStateProb.keys()]).sum()

1.0

In [None]:
sum(initTagStateProb.values())

1.0

# Building the Viterbi algorithm






Given a sequence of words $\{p_1, p_2, \dots, p_n \}$ and a set of grammatical categories given by the `upos` convention, the Viterbi probability matrix is defined as follows:

$$
\begin{array}{c c}
\begin{array}{c c c c}
\text{ADJ} \\
\text{ADV}\\
\text{PRON} \\
\vdots \\
{}
\end{array} 
&
\left[
\begin{array}{c c c c}
\nu_1(\text{ADJ}) & \nu_2(\text{ADJ}) & \dots  & \nu_n(\text{ADJ})\\
\nu_1(\text{ADV}) & \nu_2(\text{ADV}) & \dots  & \nu_n(\text{ADV})\\ 
\nu_1(\text{PRON}) & \nu_2(\text{PRON}) & \dots  & \nu_n(\text{PRON})\\
\vdots & \vdots & \dots & \vdots \\ \hdashline
p_1 & p_2 & \dots & p_n 
\end{array}
\right] 
\end{array}
$$

where the probabilities in the first column (for a category $i$) are given by: 

$$
\nu_1(i) = \underbrace{\rho_i^{(0)}}_{\text{initial probability}} \times \underbrace{P(p_1 \vert i)}_{\text{emission}}
$$

then, for the second column (given a category $j$), they are: 

$$
\nu_2(j) = \max_i \{ \nu_1(i) \times \underbrace{P(j \vert i)}_{\text{transition}} \times \underbrace{P(p_2 \vert j)}_{\text{emission}} \}
$$

so, in general, the probabilities for column $t$ are given by: 

$$
\nu_{t}(j) = \max_i \{ \overbrace{\nu_{t-1}(i)}^{\text{previous state}} \times \underbrace{P(j \vert i)}_{\text{transition}} \times \underbrace{P(p_t \vert j)}_{\text{emission}} \}
$$
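
The recurrence above can be sketched in a few lines of plain Python. The version below also stores a backpointer for every cell, so that the best tag sequence is recovered by walking back from the most probable final state. The function `viterbi`, the toy probabilities, and the example sentence are illustrative assumptions (they are not estimated from AnCora); the dictionary keys follow the `tag|prevtag` and `word|tag` convention used in this notebook.

```python
def viterbi(words, states, init_prob, trans_prob, emit_prob):
    """Return the most probable (word, tag) sequence under a toy HMM.

    trans_prob uses keys 'tag|prevtag' and emit_prob uses keys 'word|tag';
    missing keys are treated as probability 0.
    """
    # nu[t][s]: probability of the best path that ends in state s at position t
    nu = [{s: init_prob.get(s, 0.0) * emit_prob.get(f"{words[0]}|{s}", 0.0)
           for s in states}]
    back = [{}]  # back[t][s]: best previous state for state s at position t

    for t in range(1, len(words)):
        nu.append({})
        back.append({})
        for j in states:
            emit = emit_prob.get(f"{words[t]}|{j}", 0.0)
            # max over previous states i of nu_{t-1}(i) * P(j|i)
            best_i = max(states,
                         key=lambda i: nu[t - 1][i] * trans_prob.get(f"{j}|{i}", 0.0))
            nu[t][j] = nu[t - 1][best_i] * trans_prob.get(f"{j}|{best_i}", 0.0) * emit
            back[t][j] = best_i

    # backtrack from the most probable final state
    path = [max(states, key=lambda s: nu[-1][s])]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return list(zip(words, path))

# hypothetical toy model: "canto" can be emitted by NOUN or VERB
states = ["DET", "NOUN", "VERB", "AUX", "ADJ"]
init_prob = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans_prob = {"NOUN|DET": 0.9, "ADJ|DET": 0.1, "AUX|NOUN": 0.6,
              "VERB|NOUN": 0.4, "ADJ|AUX": 0.7, "NOUN|VERB": 0.5, "DET|VERB": 0.5}
emit_prob = {"el|DET": 0.5, "canto|NOUN": 0.1, "canto|VERB": 0.2,
             "es|AUX": 0.6, "bello|ADJ": 0.3}

tagged = viterbi("el canto es bello".split(), states, init_prob, trans_prob, emit_prob)
print(tagged)
```

In this toy model `canto` has a higher emission probability as `VERB` than as `NOUN`, but no transition leads from `DET` to `VERB`, so the best path tags it as `NOUN`.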

In [None]:
import nltk
nltk.download('punkt')
from nltk import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
def ViterbiMatrix(secuencia, transitionProbdict=transitionProbdict, 
                  emissionProbdict=emissionProbdict, tagStateDict=tagStateDict, 
                  initTagStateProb=initTagStateProb):
  
  seq = word_tokenize(secuencia)
  # create the initial matrix: one row per tag, one column per word
  viterbiProb = np.zeros((len(tagStateDict), len(seq)))

  # initialize the first column
  for key in tagStateDict.keys():
    tag_row = tagStateDict[key]
    word_tag = seq[0].lower()+'|'+key
    if word_tag in emissionProbdict.keys():
      viterbiProb[tag_row, 0] = initTagStateProb.get(key, 0.0)*emissionProbdict[word_tag]

  # compute the remaining columns
  for col in range(1, len(seq)):
    for key in tagStateDict.keys():
      tag_row = tagStateDict[key]
      word_tag = seq[col].lower()+'|'+key
      if word_tag in emissionProbdict.keys():
        # look at the states of the previous column
        possible_probs = []
        for key2 in tagStateDict.keys(): 
          tag_row2 = tagStateDict[key2]
          tag_prevtag = key+'|'+key2
      
          if tag_prevtag in transitionProbdict.keys():
            if viterbiProb[tag_row2, col-1]>0:
              possible_probs.append(
                  viterbiProb[tag_row2, col-1]*transitionProbdict[tag_prevtag]*emissionProbdict[word_tag])
        
        # default=0.0 avoids a ValueError when no previous state can reach this tag
        viterbiProb[tag_row, col] = max(possible_probs, default=0.0)
  
  return viterbiProb

matrix = ViterbiMatrix('el mundo es pequeño')
matrix

array([[8.76142797e-02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 4.97926792e-07, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 2.00411724e-05, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 5.02871314e-09, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e

In [None]:
def ViterbiTags(secuencia, transitionProbdict=transitionProbdict,
                emissionProbdict=emissionProbdict, tagStateDict=tagStateDict, 
                initTagStateProb=initTagStateProb):
  
  seq = word_tokenize(secuencia)
  # one row per tag, one column per word
  viterbiProb = np.zeros((len(tagStateDict), len(seq)))

  # initialize the first column
  for key in tagStateDict.keys():
    tag_row = tagStateDict[key]
    word_tag = seq[0].lower()+'|'+key
    if word_tag in emissionProbdict.keys():
      viterbiProb[tag_row, 0] = initTagStateProb.get(key, 0.0)*emissionProbdict[word_tag]

  # compute the remaining columns
  for col in range(1, len(seq)):
    for key in tagStateDict.keys():
      tag_row = tagStateDict[key]
      word_tag = seq[col].lower()+'|'+key
      if word_tag in emissionProbdict.keys():
        # look at the states of the previous column
        possible_probs = []
        for key2 in tagStateDict.keys(): 
          tag_row2 = tagStateDict[key2]
          tag_prevtag = key+'|'+key2
          if tag_prevtag in transitionProbdict.keys():
            if viterbiProb[tag_row2, col-1]>0:
              possible_probs.append(
                  viterbiProb[tag_row2, col-1]*transitionProbdict[tag_prevtag]*emissionProbdict[word_tag])
        # default=0.0 avoids a ValueError when no previous state can reach this tag
        viterbiProb[tag_row, col] = max(possible_probs, default=0.0)

  # build the tag sequence once the whole matrix is filled:
  # each word gets the tag with the highest probability in its column
  rowToTag = {row: tag for tag, row in tagStateDict.items()}
  res = [(p, rowToTag[int(np.argmax(viterbiProb[:, i]))]) for i, p in enumerate(seq)]
      
  return res

ViterbiTags('el mundo es muy pequeño')

[('el', 'DET'),
 ('mundo', 'NOUN'),
 ('es', 'AUX'),
 ('muy', 'ADV'),
 ('pequeño', 'ADJ')]

In [None]:
ViterbiTags('estos instrumentos han de rasgar')

[('estos', 'DET'),
 ('instrumentos', 'NOUN'),
 ('han', 'AUX'),
 ('de', 'ADP'),
 ('rasgar', 'VERB')]

# Training an HMM directly with NLTK

* HMM class in Python (NLTK): https://www.nltk.org/_modules/nltk/tag/hmm.html

In [None]:
#@title Example with the English Treebank corpus
import nltk
nltk.download('treebank')
from nltk.corpus import treebank
train_data = treebank.tagged_sents()[:3900]

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


In [None]:
#@title Structure of the training data
train_data

[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')], [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')], ...]

In [None]:
#@title Pre-built HMM in NLTK
from nltk.tag import hmm
tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train_data)
tagger

<HiddenMarkovModelTagger 46 states and 12385 output symbols>

In [None]:
tagger.tag("Pierre Vinken will get old".split())

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 ('will', 'MD'),
 ('get', 'VB'),
 ('old', 'JJ')]

In [None]:
#title training accuracy
tagger.evaluate(treebank.tagged_sents()[:3900])

0.9815403947224078

## Practice exercise

**Goal:** Train an HMM using the `hmm.HiddenMarkovModelTrainer()` class on the `UD_Spanish_AnCora` dataset.

1. **Preprocessing:** The previous example used the English `treebank` dataset, whose structure differs from that of `AnCora`. In this part, write code that transforms the `AnCora` structure so that it matches the `treebank` structure we used:

$$\left[ \left[ (\text{'El'}, \text{'DET'}), (\dots), \dots\right], \left[\dots \right] \right]$$

In [None]:
# Install dependencies
!pip install conllu
!git clone https://github.com/UniversalDependencies/UD_Spanish-AnCora.git

fatal: destination path 'UD_Spanish-AnCora' already exists and is not an empty directory.


In [None]:
from conllu import parse_incr 

# read every sentence of the corpus
with open("UD_Spanish-AnCora/es_ancora-ud-dev.conllu", "r", encoding="utf-8") as data_file:
  corpusphrases = list(parse_incr(data_file))

# convert each sentence into a list of (word, tag) tuples, the structure used by treebank
tokenlist = [[(token['form'], token['upos']) for token in sentence] for sentence in corpusphrases]

2. **Training:** Once the dataset has the right structure, use the `hmm.HiddenMarkovModelTrainer()` class to train with $80 \%$ of the dataset as the `training` set and $20 \%$ as the `test` set.

**Hint:** To split the data into training and test sets, you can use the Scikit Learn function: 

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

At this point, the Machine Learning with Scikit Learn course is a good complement for getting to know Scikit Learn's features better: https://platzi.com/cursos/scikitlearn-ml/ 

In [None]:
from sklearn.model_selection import train_test_split

# Split into train_data and test_data
train_data, test_data = train_test_split(tokenlist, test_size=0.20)

In [None]:
#@title Pre-built HMM in NLTK
from nltk.tag import hmm
tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train_data)
tagger

<HiddenMarkovModelTagger 17 states and 9065 output symbols>

In [None]:
tagger.tag("El gobernante, con ganada fama desde que llegó hace 16 meses al poder".split())

[('El', 'DET'),
 ('gobernante,', 'DET'),
 ('con', 'DET'),
 ('ganada', 'DET'),
 ('fama', 'DET'),
 ('desde', 'DET'),
 ('que', 'DET'),
 ('llegó', 'DET'),
 ('hace', 'DET'),
 ('16', 'DET'),
 ('meses', 'DET'),
 ('al', 'DET'),
 ('poder', 'DET')]

3. **Model validation:** Once the `tagger` is trained, compute the model's performance (using `tagger.evaluate()`) on the `training` and `test` sets.



In [None]:
# Model evaluation
print(f'Accuracy on test_data {tagger.evaluate(test_data)}')
print(f'Accuracy on train_data {tagger.evaluate(train_data)}')

Accuracy on test_data 0.30348027842227376
Accuracy on train_data 0.9838748322790876
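
The gap between the training accuracy (0.98) and the test accuracy (0.30), together with the all-`DET` output above, is consistent with the tagger assigning zero probability to tokens it never saw in training, such as `gobernante,` with the comma left attached by `.split()`. One common remedy is to smooth the emission estimates. The sketch below illustrates add-gamma (Lidstone) smoothing in plain Python; the function `lidstone_emission`, the toy counts, and the `<UNK>` placeholder are illustrative assumptions, not part of this notebook or of NLTK.

```python
from collections import Counter

def lidstone_emission(emission_counts, tag_counts, vocab_size, gamma=0.1):
    """Return a function computing a smoothed P(word|tag).

    Every word, seen or unseen, receives an extra pseudo-count of gamma,
    so no emission probability is exactly zero.
    """
    def prob(word, tag):
        count = emission_counts.get(f"{word}|{tag}", 0)
        return (count + gamma) / (tag_counts[tag] + gamma * vocab_size)
    return prob

# hypothetical toy counts with keys in the 'word|tag' format
emission_counts = Counter({"el|DET": 8, "la|DET": 2, "mundo|NOUN": 3, "poder|NOUN": 1})
tag_counts = Counter({"DET": 10, "NOUN": 4})

# vocabulary = known words plus one placeholder that stands for any unseen word
vocab = {key.split("|")[0] for key in emission_counts} | {"<UNK>"}
p = lidstone_emission(emission_counts, tag_counts, vocab_size=len(vocab))

seen = p("el", "DET")
unseen = p("gobernante", "DET")
print(seen, unseen)
```

In NLTK, `train_supervised` accepts an `estimator` argument that can be used to plug in a smoothed distribution such as `nltk.probability.LidstoneProbDist`. Whether that closes the gap on AnCora would have to be checked by re-running the evaluation.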


# Word classification (gender of a name)

In [None]:
import nltk, random
nltk.download('names')
from nltk.corpus import names 

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!


**Basic feature extraction function**

In [None]:
# definition of the relevant features
def atributos(palabra):
    return {'ultima_letra': palabra[-1]}

In [None]:
# Build the labeled set from the text files provided by the names corpus
tagset = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

In [None]:
tagset[:10]

[('Aamir', 'male'),
 ('Aaron', 'male'),
 ('Abbey', 'male'),
 ('Abbie', 'male'),
 ('Abbot', 'male'),
 ('Abbott', 'male'),
 ('Abby', 'male'),
 ('Abdel', 'male'),
 ('Abdul', 'male'),
 ('Abdulkarim', 'male')]

In [None]:
random.shuffle(tagset)
tagset[:10]

[('Bruce', 'male'),
 ('Godwin', 'male'),
 ('Lenore', 'female'),
 ('Ellene', 'female'),
 ('Lyndon', 'male'),
 ('Fiona', 'female'),
 ('Nadean', 'female'),
 ('Douglis', 'male'),
 ('Olia', 'female'),
 ('Kalila', 'female')]

In [None]:
fset = [(atributos(n), g) for (n, g) in tagset]
train, test = fset[500:], fset[:500]

**Naive Bayes classification model**

In [None]:
# train the Naive Bayes model
classifier = nltk.NaiveBayesClassifier.train(train)
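
To show what a Naive Bayes classifier computes with a single last-letter feature, here is a minimal sketch in plain Python. It scores each label as the log prior plus the log likelihood of every feature value, with add-one smoothing. The functions `train_nb` and `classify_nb`, the toy names, and the smoothing choice are illustrative assumptions; NLTK's `NaiveBayesClassifier` uses its own probability estimator, so its numbers will differ.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_features):
    """Count labels and, for each (label, feature name), the observed values."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    for features, label in labeled_features:
        label_counts[label] += 1
        for name, value in features.items():
            feature_counts[(label, name)][value] += 1
    return label_counts, feature_counts

def classify_nb(features, label_counts, feature_counts, alpha=1.0, n_values=26):
    """Return the label with the highest log P(label) + sum of log P(value|label)."""
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)
        for name, value in features.items():
            count = feature_counts[(label, name)][value]
            # add-alpha smoothing over n_values possible letters (an assumption)
            score += math.log((count + alpha) / (n + alpha * n_values))
        scores[label] = score
    return max(scores, key=scores.get)

def last_letter(name):
    return {"ultima_letra": name[-1].lower()}

# hypothetical toy training set
toy_names = [("Amanda", "female"), ("Maria", "female"), ("Fiona", "female"),
             ("Peter", "male"), ("Bruce", "male"), ("Aaron", "male"), ("Lyndon", "male")]
toy_train = [(last_letter(n), g) for n, g in toy_names]

label_counts, feature_counts = train_nb(toy_train)
print(classify_nb(last_letter("Olivia"), label_counts, feature_counts))
print(classify_nb(last_letter("Simon"), label_counts, feature_counts))
```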

**Checking some predictions**

In [None]:
classifier.classify(atributos('amanda'))

'female'

In [None]:
classifier.classify(atributos('peter'))

'male'

**Model performance**

In [None]:
print(nltk.classify.accuracy(classifier, test))

0.764


In [None]:
print(nltk.classify.accuracy(classifier, train))

0.7627619559376679


**Better features**

In [None]:
def mas_atributos(nombre):
    atrib = {}
    # features 1 and 2: first and last letter
    atrib["primera_letra"] = nombre[0].lower()
    atrib["ultima_letra"] = nombre[-1].lower()
    for letra in 'abcdefghijklmnopqrstuvwxyz':
        # feature 3: number of times the letter appears
        atrib["count({})".format(letra)] = nombre.lower().count(letra)
        # feature 4: whether the name contains the letter
        atrib["has({})".format(letra)] = (letra in nombre.lower())
    return atrib

In [None]:
mas_atributos('jhon')

{'count(a)': 0,
 'count(b)': 0,
 'count(c)': 0,
 'count(d)': 0,
 'count(e)': 0,
 'count(f)': 0,
 'count(g)': 0,
 'count(h)': 1,
 'count(i)': 0,
 'count(j)': 1,
 'count(k)': 0,
 'count(l)': 0,
 'count(m)': 0,
 'count(n)': 1,
 'count(o)': 1,
 'count(p)': 0,
 'count(q)': 0,
 'count(r)': 0,
 'count(s)': 0,
 'count(t)': 0,
 'count(u)': 0,
 'count(v)': 0,
 'count(w)': 0,
 'count(x)': 0,
 'count(y)': 0,
 'count(z)': 0,
 'has(a)': False,
 'has(b)': False,
 'has(c)': False,
 'has(d)': False,
 'has(e)': False,
 'has(f)': False,
 'has(g)': False,
 'has(h)': True,
 'has(i)': False,
 'has(j)': True,
 'has(k)': False,
 'has(l)': False,
 'has(m)': False,
 'has(n)': True,
 'has(o)': True,
 'has(p)': False,
 'has(q)': False,
 'has(r)': False,
 'has(s)': False,
 'has(t)': False,
 'has(u)': False,
 'has(v)': False,
 'has(w)': False,
 'has(x)': False,
 'has(y)': False,
 'has(z)': False,
 'primera_letra': 'j',
 'ultima_letra': 'n'}

In [None]:
fset = [(mas_atributos(n), g) for (n, g) in tagset]
train, test = fset[500:], fset[:500]
classifier2 = nltk.NaiveBayesClassifier.train(train)

In [None]:
print(nltk.classify.accuracy(classifier2, test))

### Practice exercise

**Goal:** Build a classifier for Spanish names using the following dataset: 
https://github.com/jvalhondo/spanish-names-surnames

1. **Data preparation**: with a `git clone` you can bring the dataset into your Colab directory; then make sure you give the data and its features the right format, so that it has the same structure as the previous example with the English `names` dataset. 

* **Think and analyze**: do the features used for English apply in the same way to Spanish names?
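
As one possible starting point for that question, the sketch below extracts features that account for two traits of Spanish names: accented letters and compound names. The helpers `strip_accents` and `atributos_es` are hypothetical, and the sketch assumes that compound names are separated by spaces and that the first component is the more informative one; both assumptions should be checked against the dataset.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks, e.g. 'José' -> 'Jose'.

    Note that this also turns 'ñ' into 'n', which may or may not be desirable.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def atributos_es(nombre):
    """Features taken from the first component of a (possibly compound) name."""
    limpio = strip_accents(nombre.strip().lower())
    if not limpio:
        return {}
    primer = limpio.split()[0]
    return {
        "ultima_letra": primer[-1],
        "ultimas_dos_letras": primer[-2:],
        "es_compuesto": " " in limpio,
    }

print(atributos_es("María José"))
print(atributos_es("José María"))
```

With a plain last-letter feature, `María José` and `José María` would be described by `é` and `a` respectively, which points away from the usual reading of both names; using the first component avoids that in this example.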

In [None]:
# write your code here
!git clone https://github.com/jvalhondo/spanish-names-surnames

fatal: destination path 'spanish-names-surnames' already exists and is not an empty directory.


In [None]:
import random
import pandas as pd
from sklearn.model_selection import train_test_split

male_names = pd.read_csv('/content/spanish-names-surnames/male_names.csv')
female_names =  pd.read_csv('/content/spanish-names-surnames/female_names.csv')
female_names.dropna(axis=0, inplace=True)

tagnames = [(name.lower(), 'male') for name in male_names['name']] + [(name.lower(), 'female') for name in female_names['name']] 
# shuffle the Spanish names built above
random.shuffle(tagnames)

In [None]:
# definition of the relevant features
def atributos(palabra):
    return {'ultima_letra': palabra[-1]}

In [None]:
# Build the dataset with features from the Spanish names in `tagnames`
fset = [(atributos(n), g) for (n, g) in tagnames]

# Split into train_data and test_data
train_data, test_data = train_test_split(fset, test_size=0.20)

2. **Model training and performance**: using NLTK's Naive Bayes classifier, train a simple model with the same last-letter feature, try some predictions, and compute the model's performance. 

In [None]:
# write your code here
classifier = nltk.NaiveBayesClassifier.train(train_data)

print(f'test_data accuracy: {nltk.classify.accuracy(classifier, test_data)}')
print(f'train_data accuracy {nltk.classify.accuracy(classifier, train_data)}')

test_data accuracy: 0.7715544367526747
train_data accuracy 0.7594020456333596


3. **Better features:** Define a function such as `atributos2()` that extracts better features with which to train an improved version of the classifier. Train a second time and check how your model's performance improves. Can you think of better ways to define features for this particular task?

In [None]:
def atributos2(nombre):
    atrib = {}
    atrib["nombre"] = nombre.lower()
    atrib["primera_letra"] = nombre[0].lower()
    atrib["ultima_letra"] = nombre[-1].lower()
    atrib["primeras_dos_letras"] = nombre[:2].lower()
    atrib["ultimas_dos_letras"] = nombre[-2:].lower()

    for letra in 'abcdefghijklmnopqrstuvwxyz':
        # feature 3: number of times the letter appears
        atrib["count({})".format(letra)] = nombre.lower().count(letra)
        # feature 4: whether the name contains the letter
        atrib["has({})".format(letra)] = (letra in nombre.lower())
    return atrib

In [None]:
# use the features defined in atributos2() on the Spanish names in `tagnames`
fset2 = [(atributos2(n), g) for (n, g) in tagnames]

# Split into train_data and test_data
train_data2, test_data2 = train_test_split(fset2, test_size=0.20)

In [None]:
classifier2 = nltk.NaiveBayesClassifier.train(train_data2)
print(f'train_data accuracy: {nltk.classify.accuracy(classifier2, train_data2)}')
print(f'test_data accuracy {nltk.classify.accuracy(classifier2, test_data2)}')

train_data accuracy: 0.7801730920535012
test_data accuracy 0.7734424166142227


# Document classification (spam vs. non-spam email)

In [None]:
!git clone https://github.com/pachocamacho1990/datasets

Cloning into 'datasets'...
remote: Enumerating objects: 39, done.
remote: Total 39 (delta 0), reused 0 (delta 0), pack-reused 39
Unpacking objects: 100% (39/39), done.


In [None]:
import pandas as pd
import numpy as np
import nltk
import random
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
df = pd.read_csv('datasets/email/csv/spam-apache.csv', names = ['clase','contenido'])
df['tokens'] = df['contenido'].apply(lambda x: word_tokenize(x))
df.head()

Unnamed: 0,clase,contenido,tokens
0,-1,"<!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.0 Tr...","[<, !, DOCTYPE, HTML, PUBLIC, ``, -//W3C//DTD,..."
1,1,> Russell Turpin:\n> > That depends on how the...,"[>, Russell, Turpin, :, >, >, That, depends, o..."
2,-1,Help wanted. We are a 14 year old fortune 500...,"[Help, wanted, ., We, are, a, 14, year, old, f..."
3,-1,Request A Free No Obligation Consultation!\nAc...,"[Request, A, Free, No, Obligation, Consultatio..."
4,1,Is there a way to look for a particular file o...,"[Is, there, a, way, to, look, for, a, particul..."


In [None]:
df['tokens'].values[0]

In [None]:
# Get the list of most frequent words to use as a first approximation

all_words = nltk.FreqDist([w for tokenlist in df['tokens'].values for w in tokenlist])
top_words = all_words.most_common(200)

def document_features(document):
    document_words = set(document)
    features = {}
    # Note: most_common() yields (word, count) pairs, so `word` is a tuple here
    # and the membership test never matches; unpack with `for word, _ in top_words`.
    for word in top_words:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [None]:
document_features(df['tokens'].values[0])

{'contains(("\'", 88))': False,
 'contains(("\'\'", 438))': False,
 'contains(("\'m", 51))': False,
 'contains(("\'re", 41))': False,
 'contains(("\'s", 263))': False,
 'contains(("\'ve", 39))': False,
 'contains(("n\'t", 175))': False,
 "contains(('!', 698))": False,
 "contains(('#', 521))": False,
 "contains(('$', 413))": False,
 "contains(('%', 677))": False,
 "contains(('&', 181))": False,
 "contains(('(', 380))": False,
 "contains((')', 463))": False,
 "contains(('*', 43))": False,
 "contains((',', 2173))": False,
 "contains(('-', 283))": False,
 "contains(('--', 1611))": False,
 "contains(('.', 2200))": False,
 "contains(('...', 327))": False,
 "contains(('//www.adclick.ws/p.cfm', 40))": False,
 "contains(('1', 123))": False,
 "contains(('2', 94))": False,
 "contains(('2002', 67))": False,
 "contains(('3', 72))": False,
 "contains(('30', 220))": False,
 "contains(('31', 255))": False,
 "contains(('4', 61))": False,
 "contains(('5', 116))": False,
 "contains((':', 1220))": False,


In [None]:
fset = [(document_features(texto), clase) 
            for texto, clase in zip(df['tokens'].values, df['clase'].values)]
            # zip() walks both columns in parallel
random.shuffle(fset)
train, test = fset[:200], fset[200:]

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train)

In [None]:
print(nltk.classify.accuracy(classifier, test))

0.46


In [None]:
classifier.show_most_informative_features(10)

Most Informative Features
  contains(('been', 87)) = False               1 : -1     =      1.0 : 1.0
    contains(('-', 283)) = False               1 : -1     =      1.0 : 1.0
contains(('Please', 47)) = False               1 : -1     =      1.0 : 1.0
 contains(('your', 359)) = False               1 : -1     =      1.0 : 1.0
 contains(('their', 76)) = False               1 : -1     =      1.0 : 1.0
  contains(('THE', 108)) = False               1 : -1     =      1.0 : 1.0
  contains(('get', 107)) = False               1 : -1     =      1.0 : 1.0
    contains(('OF', 67)) = False               1 : -1     =      1.0 : 1.0
contains(('within', 47)) = False               1 : -1     =      1.0 : 1.0
    contains(('so', 99)) = False               1 : -1     =      1.0 : 1.0
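
The feature names printed above, such as `contains(('been', 87))`, suggest why the accuracy is close to chance and why every likelihood ratio is 1.0 : 1.0. `most_common()` returns `(word, count)` pairs, so the loop in `document_features` asks whether a tuple is among the document's tokens, which is never true, and every feature ends up `False`. A minimal sketch of a corrected extractor follows. The `docs` corpus is a hypothetical toy example, and `collections.Counter` stands in for `nltk.FreqDist` (which subclasses it) so the snippet runs without NLTK.

```python
from collections import Counter

# Hypothetical toy corpus: each document is already a list of tokens.
docs = [
    ["free", "money", "now"],
    ["meeting", "at", "noon"],
    ["free", "offer", "now"],
]

# Counter stands in for nltk.FreqDist here; both expose most_common().
all_words = Counter(w for tokens in docs for w in tokens)

# Keep only the words, dropping the counts that most_common() pairs them with.
top_words = [word for word, _count in all_words.most_common(3)]

def document_features(document):
    """Map a token list to boolean 'contains(word)' features."""
    document_words = set(document)
    return {f"contains({word})": (word in document_words) for word in top_words}

features = document_features(docs[0])
print(features)
```

With real words as keys, the features can differ between documents, which is what the classifier needs in order to learn anything from them.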


In [None]:
df[df['clase']==-1]['contenido']

0      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...
2      Help wanted.  We are a 14 year old fortune 500...
3      Request A Free No Obligation Consultation!\nAc...
10     >\n>“µ×è¹µÑÇ ¡ÑºâÅ¡¸ØÃ¡Ô¨º¹ÍÔ¹àµÍÃìà¹çµ” \n>àµ...
                             ...                        
243    ##############################################...
244    Wanna see sexually curious teens playing with ...
246    REQUEST FOR URGENT BUSINESS ASSISTANCE\n------...
248    Email marketing works!  There's no way around ...
249    Email marketing works!  There's no way around ...
Name: contenido, Length: 125, dtype: object

## Practice exercise


How could you build a better document classifier?

0. **Larger dataset:** The dataset we used was very small; consider using the corpus files located in `datasets/email/plaintext/`.

1. **Cleaning:** as you may have noticed, we did not clean the email text at all. Consider using regular expressions, filters by part-of-speech category, and so on.

---

Based on this, build a larger dataset with a more polished tokenization.
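
One possible cleaning step can be sketched with the standard `re` module alone. The function name `clean_tokens` and the specific rules (drop anything that looks like an HTML tag, lowercase, keep runs of at least two ASCII letters) are assumptions for illustration, not a reference solution for the exercise.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")      # anything that looks like an HTML tag
WORD_RE = re.compile(r"[a-z]{2,}")   # runs of two or more ASCII letters

def clean_tokens(text):
    """Return lowercase alphabetic tokens from a raw email body."""
    without_tags = TAG_RE.sub(" ", text)
    return WORD_RE.findall(without_tags.lower())

# Hypothetical example string, not taken from the dataset.
example = "<html><b>Request</b> A FREE consultation!!! Call 555-1234 now</html>"
tokens = clean_tokens(example)
print(tokens)
```

Rules like these discard digits, punctuation, and non-ASCII text, which may or may not be desirable for spam detection, so it is worth comparing accuracy with and without each rule.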

In [None]:
# write your code here
!git clone https://github.com/jvalhondo/spanish-names-surnames

Cloning into 'spanish-names-surnames'...
remote: Enumerating objects: 36, done.
remote: Total 36 (delta 0), reused 0 (delta 0), pack-reused 36
Unpacking objects: 100% (36/36), done.


In [None]:
import zipfile
import random

import pandas as pd
import numpy as np
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
unzip_files = zipfile.ZipFile('/content/datasets/email/plaintext/corpus1.zip')
unzip_files.extractall('/content/datasets/email/plaintext')
unzip_files.close()

In [None]:
from os import listdir

In [None]:
path_ham = "/content/datasets/email/plaintext/corpus1/ham/"
filepaths_ham = [path_ham+f for f in listdir(path_ham) if f.endswith('.txt')]

path_spam = "/content/datasets/email/plaintext/corpus1/spam/"
filepaths_spam = [path_spam+f for f in listdir(path_spam) if f.endswith('.txt')]

In [None]:
# Define a function that reads a file and tokenizes its contents

def abrir(texto):
  with open(texto, 'r', errors='ignore') as f2:
    data = f2.read()
    data = word_tokenize(data)
  return data

# Tokenized list of ham emails
list_ham = list(map(abrir, filepaths_ham))
# Tokenized list of spam emails
list_spam = list(map(abrir, filepaths_spam))

nltk.download('stopwords')

# Most frequent words across both classes
all_words = nltk.FreqDist([w for tokenlist in list_ham+list_spam for w in tokenlist])
top_words = all_words.most_common(250)

# Add bigrams
bigram_text = nltk.Text([w for token in list_ham+list_spam for w in token])
bigrams = list(nltk.bigrams(bigram_text))
top_bigrams = (nltk.FreqDist(bigrams)).most_common(250)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


2. **Validating the previous model:**

---

Once you have the new, larger, and cleaner dataset, repeat the same training with the same kind of features as in the previous example. Does the accuracy of the resulting model improve?

In [None]:
def document_features(document):
    document_words = set(document)
    bigram = set(list(nltk.bigrams(nltk.Text([token for token in document]))))
    features = {}
    for word, j in top_words:
        features['contains({})'.format(word)] = (word in document_words)

    for bigrams, i in top_bigrams:
        features['contains_bigram({})'.format(bigrams)] = (bigrams in bigram)
  
    return features

# Build the labeled feature sets (0 = ham, 1 = spam) from the most frequent words and bigrams
import random
fset_ham = [(document_features(texto), 0) for texto in list_ham]
fset_spam = [(document_features(texto), 1) for texto in list_spam]
fset = fset_spam + fset_ham[:1500]
random.shuffle(fset)



In [None]:
# Split the feature set into train and test
from sklearn.model_selection import train_test_split
fset_train, fset_test = train_test_split(fset, test_size=0.20, random_state=45)


In [None]:

# Train the classifier
classifier = nltk.NaiveBayesClassifier.train(fset_train)

# Classify one example and measure accuracy on the test set
classifier.classify(document_features(list_ham[34]))
print(nltk.classify.accuracy(classifier, fset_test))


0.8683333333333333


3. **Build better features**: Sometimes what matters is not only the most frequent words but also the context, and context cannot be captured by looking at tokens one at a time. What if we consider bigrams, trigrams, and so on? Could word sequences work as better features for the model? To find out, we can extract n-grams from our corpus and obtain their frequencies with `FreqDist()`. Develop your own way of doing this and train a model with the new features; don't forget to share your results in the comments section.
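
As a starting point, the idea can be sketched in plain Python. The helper names `ngrams` and `bigram_features` and the toy `docs` corpus are hypothetical, and `collections.Counter` stands in for `FreqDist` (a `Counter` subclass) so the snippet has no dependencies. In the notebook, `nltk.ngrams` and `nltk.FreqDist` could play the same roles.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical toy corpus of tokenized documents.
docs = [
    ["click", "here", "to", "win"],
    ["see", "you", "at", "noon"],
    ["click", "here", "for", "details"],
]

# Count bigrams document by document, so no bigram spans two documents.
bigram_counts = Counter(bg for tokens in docs for bg in ngrams(tokens, 2))
top_bigrams = [bg for bg, _count in bigram_counts.most_common(2)]

def bigram_features(tokens):
    """Boolean features: does the document contain each frequent bigram?"""
    present = set(ngrams(tokens, 2))
    return {f"contains_bigram({' '.join(bg)})": (bg in present) for bg in top_bigrams}

print(bigram_features(docs[0]))
```

The same `ngrams` helper works for trigrams by passing `n=3`; the resulting dictionaries can be merged with the single-word features before training.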

In [None]:
# write your code here:


In [None]:
import math
import os

## Preparing the email corpus

In [None]:
!git clone https://github.com/pachocamacho1990/datasets

fatal: destination path 'datasets' already exists and is not an empty directory.


In [None]:
! unzip datasets/email/plaintext/corpus1.zip

In [None]:
os.listdir('corpus1/spam')

In [None]:
data = []
clases = []
# read the spam data
for file in os.listdir('corpus1/spam'):
  with open('corpus1/spam/'+file, encoding='latin-1') as f:
    data.append(f.read())
    clases.append('spam')
# read the ham data
for file in os.listdir('corpus1/ham'):
  with open('corpus1/ham/'+file, encoding='latin-1') as f:
    data.append(f.read())
    clases.append('ham')
len(data)

5172

In [None]:
len(data), len(clases)

(5172, 5172)

## Building the Naive Bayes model

### spaCy tokenizer

* Documentation: https://spacy.io/api/tokenizer
* How does the tokenizer work? https://spacy.io/usage/linguistic-features#how-tokenizer-works

In [None]:
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()
tokenizer = Tokenizer(nlp.vocab)

In [None]:
print([t.text for t in tokenizer(data[0])])

['Subject:', 'drug', 'turns', 'a', 'normal', 'guys', 'into', 'studs', '!', '\n', 'cialls', 'delivered', 'discreetly', 'to', 'your', 'door', 'in', 'an', 'unmarked', 'envelope', '\n', 'lasts', '8', 'x', 'times', 'longer', 'than', 'vlagra', ',', 'take', 'one', 'friday', ',', 'still', 'working', 'on', 'sunday', '!', '\n', 'no', 'one', 'needs', 'to', 'know', '!', '\n', 'buy', 'generic', '(', 'exactly', 'the', 'same', 'drug', ',', 'but', 'not', 'with', 'the', 'big', 'company', 'prices', '.', '.', ')', 'to', 'save', '70', '%', '!', '!', '\n', 'check', 'it', 'out', 'here', '!', '\n', 'fireball', 'abcdzhongguo', 'johnson', 'daddy', 'front', '242', 'gretchen', 'memory', '\n', 'josie', 'buttonshorizon', 'buffy', 'preston', '\n', 'josh', 'informix', 'diane', '\n']


### Main class for the algorithm

Recall that the most probable class is given by (working in log space):


$$\hat{c} = {\arg \max}_{c}\left[\log{P(c)}
 +\sum_{i=1}^n
\log{ P(f_i \vert c)}\right]
$$

where, to avoid degenerate cases such as zero counts, we use Laplace smoothing:

$$
P(f_i \vert c) = \frac{C(f_i, c)+1}{C(c) + \vert V \vert}
$$

with $\vert V \vert$ the size of the vocabulary of our training set.
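
To make the two formulas concrete, here is a small worked sketch with made-up counts. All numbers and names are hypothetical, and $C(c)$ is read as the number of training documents of class $c$, an assumption that matches `classCount` in the class that follows.

```python
import math

# Hypothetical toy counts, in the notation of the formulas above.
class_count = {"spam": 2, "ham": 3}          # C(c)
word_count = {                               # C(f_i, c)
    "spam": {"free": 3, "money": 2},
    "ham": {"meeting": 2, "free": 1},
}
vocab = {"free", "money", "meeting"}         # V
n_docs = sum(class_count.values())

def log_likelihood(word, c):
    """log P(f_i | c) with add-one (Laplace) smoothing."""
    return math.log((word_count[c].get(word, 0) + 1) / (class_count[c] + len(vocab)))

def score(words, c):
    """log P(c) plus the sum of smoothed log-likelihoods."""
    return math.log(class_count[c] / n_docs) + sum(log_likelihood(w, c) for w in words)

doc = ["free", "money"]
scores = {c: score(doc, c) for c in class_count}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Without the `+ 1`, the word `money` would have a count of zero under `ham`, and `math.log(0)` would raise an error. Smoothing keeps every term finite.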

In [None]:
import numpy as np

class NaiveBayesClassifier():
  nlp = English()
  tokenizer = Tokenizer(nlp.vocab)
  
  def tokenize(self, doc):
    return [t.text.lower() for t in self.tokenizer(doc)]

  def word_counts(self, words):
    '''Count how many times each word occurs'''
    wordCount = {}
    for w in words: 
      if w in wordCount.keys():
        wordCount[w] += 1
      else:
        wordCount[w] = 1
    return wordCount

  def fit(self, data, clases):
    '''Compute the class counts, log priors, and per-class word counts'''
    n = len(data)
    self.unique_clases = set(clases)
    self.vocab = set()
    self.classCount = {} #C(c)
    self.log_classPriorProb = {} #log P(c)
    self.wordConditionalCounts = {} #C(w, c)
    # class counts
    for c in clases:
      if c in self.classCount.keys():
        self.classCount[c] += 1
      else:
        self.classCount[c] = 1
    # compute log P(c)
    for c in self.classCount.keys():
      self.log_classPriorProb[c] = math.log(self.classCount[c]/n)
      self.wordConditionalCounts[c] = {}
    # compute C(w, c)
    for text, c in zip(data, clases):
      counts = self.word_counts(self.tokenize(text))
      for word, count in counts.items():
        # add the word to the vocabulary
        if word not in self.vocab:
          self.vocab.add(word)
        if word not in self.wordConditionalCounts[c]:
          self.wordConditionalCounts[c][word] = 0.0
        self.wordConditionalCounts[c][word] += count

  def predict(self, data):
    results = []
    for text in data:
      words = set(self.tokenize(text))
      # start every class at its log prior, so a document with no known
      # words still gets a prediction instead of an empty score table
      scoreProb = dict(self.log_classPriorProb)
      for word in words:
        if word not in self.vocab: continue # skip words unseen in training
        # Laplace smoothing for P(w|c)
        for c in self.unique_clases:
          log_wordClassProb = math.log(
              (self.wordConditionalCounts[c].get(word, 0.0)+1)/(self.classCount[c]+len(self.vocab)))
          scoreProb[c] += log_wordClassProb
      # pick the class with the highest score
      arg_maxprob = np.argmax(np.array(list(scoreProb.values())))
      results.append(list(scoreProb.keys())[arg_maxprob])
    return results


### Scikit-learn utilities
* `train_test_split`: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

* `accuracy_score`: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

* `precision_score`: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

* `recall_score`: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
data_train, data_test, clases_train, clases_test = train_test_split(data, clases, test_size=0.10, random_state=42)

In [None]:
classifier = NaiveBayesClassifier()
classifier.fit(data_train, clases_train)

In [None]:
clases_predict = classifier.predict(data_test)

In [None]:
accuracy_score(clases_test, clases_predict)

0.8397683397683398

In [None]:
precision_score(clases_test, clases_predict, average=None, zero_division=1)

array([0.81390135, 1.        ])

In [None]:
# of everything predicted as ham, only 81% is actually ham | everything predicted as spam was spam

In [None]:
recall_score(clases_test, clases_predict, average=None, zero_division=1)

array([1.        , 0.46451613])

In [None]:
# Of everything in the dataset that really is ham, the model captured 100% |
# of everything in the dataset that really is spam, it captured only 46%
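
The readings above follow directly from how the two metrics are defined. The sketch below computes them by hand for a single class. The labels are hypothetical, not the notebook's data, and `precision_recall` is an illustrative helper rather than a scikit-learn function.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    # Return 1.0 when a denominator is zero, mirroring zero_division=1.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

# Hypothetical labels, not the notebook's data.
y_true = ["ham", "ham", "spam", "spam", "spam"]
y_pred = ["ham", "ham", "ham", "spam", "ham"]

for label in ("ham", "spam"):
    print(label, precision_recall(y_true, y_pred, label))
```

In this toy case the classifier over-predicts `ham`, which gives the same pattern as the notebook's results: perfect recall but imperfect precision for `ham`, and perfect precision but low recall for `spam`.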