<a href="https://colab.research.google.com/github/danieldrako/Algoritmos-Clasificacion-de-Texto/blob/main/04NaiveBayes_clasificador.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [7]:
import math
import os


## Preparing the email corpus

In [4]:
!git clone https://github.com/pachocamacho1990/datasets

Cloning into 'datasets'...
remote: Enumerating objects: 39, done.
remote: Total 39 (delta 0), reused 0 (delta 0), pack-reused 39
Unpacking objects: 100% (39/39), done.


In [None]:
! unzip datasets/email/plaintext/corpus1.zip

In [None]:
os.listdir('corpus1/spam')

In [9]:
data = []
clases = []
# read the spam messages
for file in os.listdir('corpus1/spam'):
  with open('corpus1/spam/'+file, encoding='latin-1') as f:
    data.append(f.read())
    clases.append('spam')
# read the ham messages
for file in os.listdir('corpus1/ham'):
  with open('corpus1/ham/'+file, encoding='latin-1') as f:
    data.append(f.read())
    clases.append('ham')
len(data)

5172

## Building the Naive Bayes model

### spaCy tokenizer

* Documentation: https://spacy.io/api/tokenizer
* How does the tokenizer work? https://spacy.io/usage/linguistic-features#how-tokenizer-works

In [10]:
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()
tokenizer = Tokenizer(nlp.vocab)

In [11]:
print([t.text for t in tokenizer(data[0])])

['Subject:', 'more', 'money', 'now', '\n', 'stop', 'making', 'other', 'people', 'rich', '\n', 'run', 'your', 'own', 'business', 'and', 'make', 'your', 'own', 'rules', '\n', 'you', 'can', 'even', 'work', 'from', 'home', '-', 'no', 'prior', 'knowledge', 'needed', '\n', 'the', 'cas?no', 'industry', 'has', 'has', '10', 'billion', 'dollars', '/', 'year', 'for', 'the', 'taking', '\n', 'want', 'in', 'on', 'this', '?', '\n', 'call', '1', '-', '877', '-', '467', '-', '2636', 'ext', ':', '213', 'for', 'more', 'info', '.', '\n', 'bilharziasis', 'biracial', 'coy', 'addition', 'emotional', 'circus', 'rpm', '\n', 'verdict', 'pedestal', 'appanage', 'cranford', 'cedar', 'deterred', 'hoop', '\n', 'dolan', 'golf', 'regis', 'burette', 'honey', 'blood', 'manage', '\n', 'sanskrit', 'puccini', 'spitfire', 'megohm', 'distinguish', 'deadwood', 'syrinx', '\n', 'encroach', 'now', 'advise', 'calcify', 'nutritive', 'mouthful', 'scoop', '\n', 'your', 'tomorrow', 'dandelion', 'interfere', 'misanthrope', 'centerline

### Main class for the algorithm

Recall that the most probable class is given by (working in log space):


$$\hat{c} = \underset{c}{\arg\max} \left[ \log{P(c)}
 +\sum_{i=1}^n
\log{ P(f_i \vert c)} \right]
$$

where, to avoid zero probabilities for features that never appear with a class in training, we use Laplace smoothing:

$$
P(f_i \vert c) = \frac{C(f_i, c)+1}{C(c) + \vert V \vert}
$$

with $\vert V \vert$ being the size of the vocabulary of our training set.
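
To make the two formulas concrete, here is a minimal sketch on a toy corpus. The training sentences and the helper names (`log_likelihood`, `score`, `classify`) are hypothetical and exist only for illustration. The sketch also assumes that $C(c)$ is the total number of token occurrences in class $c$, the usual multinomial convention; that reading is an assumption about the formula above.

```python
import math
from collections import Counter

# Toy training set (hypothetical, for illustration only).
train = [
    ("win money now", "spam"),
    ("money money prize", "spam"),
    ("meeting schedule now", "ham"),
]

token_counts = {"spam": Counter(), "ham": Counter()}  # C(f, c)
doc_counts = Counter()                                # documents per class
for text, label in train:
    token_counts[label].update(text.split())
    doc_counts[label] += 1

vocab = set().union(*token_counts.values())           # V


def log_likelihood(token, label):
    # Laplace smoothing: (C(f, c) + 1) / (C(c) + |V|),
    # with C(c) assumed to be the total token count of the class.
    total = sum(token_counts[label].values())
    return math.log((token_counts[label][token] + 1) / (total + len(vocab)))


def score(text, label):
    # log P(c) + sum of log P(f_i | c); tokens outside the vocabulary are skipped.
    log_prior = math.log(doc_counts[label] / sum(doc_counts.values()))
    return log_prior + sum(
        log_likelihood(t, label) for t in text.split() if t in vocab
    )


def classify(text):
    # arg max over classes of the log-space score.
    return max(doc_counts, key=lambda label: score(text, label))


print(classify("free money"))
print(classify("meeting schedule"))
```

With this choice of $C(c)$, the smoothed probabilities of all vocabulary words sum to 1 within each class, and a word never seen with a class still gets a small non-zero probability instead of sending the log to minus infinity.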

In [12]:
import numpy as np

class NaiveBayesClassifier():
  nlp = English()
  tokenizer = Tokenizer(nlp.vocab)

  def tokenize(self, doc):
    # use the class-level tokenizer instead of relying on a notebook global
    return [t.text.lower() for t in self.tokenizer(doc)]

  def word_counts(self, words):
    wordCount = {}
    for w in words:
      wordCount[w] = wordCount.get(w, 0) + 1
    return wordCount

  def fit(self, data, clases):
    n = len(data)
    self.unique_clases = set(clases)
    self.vocab = set()
    self.classCount = {} # C(c): number of training documents per class
    self.log_classPriorProb = {} # log P(c)
    self.wordConditionalCounts = {} # C(w, c)
    # class counts
    for c in clases:
      self.classCount[c] = self.classCount.get(c, 0) + 1
    # compute log P(c)
    for c in self.classCount.keys():
      self.log_classPriorProb[c] = math.log(self.classCount[c]/n)
      self.wordConditionalCounts[c] = {}
    # compute C(w, c)
    for text, c in zip(data, clases):
      counts = self.word_counts(self.tokenize(text))
      for word, count in counts.items():
        self.vocab.add(word)
        if word not in self.wordConditionalCounts[c]:
          self.wordConditionalCounts[c][word] = 0.0
        self.wordConditionalCounts[c][word] += count

  def predict(self, data):
    results = []
    for text in data:
      words = set(self.tokenize(text)) # each distinct word is scored once
      # start every class at its log prior, so a text with no known words still gets a label
      scoreProb = dict(self.log_classPriorProb)
      for word in words:
        if word not in self.vocab: continue # skip words unseen in training
        # Laplace smoothing for P(w|c); the denominator uses the per-class document count
        for c in self.unique_clases:
          log_wordClassProb = math.log(
              (self.wordConditionalCounts[c].get(word, 0.0)+1)/(self.classCount[c]+len(self.vocab)))
          scoreProb[c] += log_wordClassProb
      arg_maxprob = np.argmax(np.array(list(scoreProb.values())))
      results.append(list(scoreProb.keys())[arg_maxprob])
    return results
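
The hand-written counting loops in `word_counts` and `fit` can also be expressed with `collections.Counter` from the standard library. The sketch below is one possible alternative, not the notebook's implementation; `accumulate_counts` is a hypothetical helper name, and plain `str.split` stands in for the spaCy tokenizer to keep the example self-contained.

```python
from collections import Counter


def word_counts(words):
    # Counter counts hashable items; dict() gives the same shape as the manual loop.
    return dict(Counter(words))


def accumulate_counts(texts, labels):
    # Hypothetical helper: builds C(w, c) and the per-class document counts.
    # str.split is a stand-in for the tokenizer used in the notebook.
    word_class_counts = {}
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        counter = word_class_counts.setdefault(label, Counter())
        counter.update(text.lower().split())
    return word_class_counts, class_counts


counts, classes = accumulate_counts(
    ["Win money now", "money prize", "meeting now"],
    ["spam", "spam", "ham"],
)
print(counts["spam"]["money"], classes["spam"])
```

A `Counter` returns 0 for missing keys, which removes the need for the explicit `if word not in ...` checks.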


### scikit-learn utilities
* `train_test_split`: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

* `accuracy_score`: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

* `precision_score`: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

* `recall_score`: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
data_train, data_test, clases_train, clases_test = train_test_split(data, clases, test_size=0.10, random_state=42)

In [14]:
classifier = NaiveBayesClassifier()
classifier.fit(data_train, clases_train)

In [15]:
clases_predict = classifier.predict(data_test)

In [16]:
accuracy_score(clases_test, clases_predict)

0.8359073359073359

In [17]:
precision_score(clases_test, clases_predict, average=None, zero_division=1)
# of everything predicted as legitimate mail (ham), 81% really is ham; of everything predicted as spam, 100% really is spam

array([0.81026786, 1.        ])

In [18]:
recall_score(clases_test, clases_predict, average=None, zero_division=1)
# of the messages that really are legitimate (ham), 100% were captured; only about 45% of the actual spam was caught

array([1.       , 0.4516129])
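
To see where the per-class numbers come from, the following sketch computes precision and recall by hand. The function `per_class_scores` is a hypothetical helper, not part of scikit-learn. The label counts are not taken from the notebook: they are counts that are arithmetically consistent with the scores reported above, assuming a test split of 518 messages (10% of 5172, rounded up) and the label order `['ham', 'spam']`.

```python
from collections import Counter


def per_class_scores(y_true, y_pred, labels):
    # Count each (true label, predicted label) pair once.
    pairs = Counter(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        hits = pairs[(label, label)]
        predicted = sum(n for (t, p), n in pairs.items() if p == label)
        actual = sum(n for (t, p), n in pairs.items() if t == label)
        scores[label] = {
            # Fall back to 1.0 on an empty denominator, mirroring zero_division=1.
            "precision": hits / predicted if predicted else 1.0,
            "recall": hits / actual if actual else 1.0,
        }
    return scores


# Assumed counts, inferred from the reported scores:
# 363 ham all predicted ham; of 155 spam, 85 predicted ham and 70 predicted spam.
y_true = ["ham"] * 363 + ["spam"] * 155
y_pred = ["ham"] * 363 + ["ham"] * 85 + ["spam"] * 70

scores = per_class_scores(y_true, y_pred, ["ham", "spam"])
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(scores)
print(accuracy)
```

Under these assumed counts every error is a spam message labeled as ham, which is why ham recall and spam precision are both 1 while spam recall is below one half.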