# Naive Bayes

## Supervised text classification model

- [Documentation used](https://medium.com/analytics-vidhya/naive-bayes-classifier-for-text-classification-556fabaf252b#:~:text=The%20Naive%20Bayes%20classifier%20is,time%20and%20less%20training%20data)

- [Reference](https://www.campusvirtual.frba.utn.edu.ar/especialidad/pluginfile.php/300766/mod_resource/content/1/NLP%20-%20UTN%20-%20Clase%203.pdf)

---

## Dataset preparation

- Load the dataset from a `CSV` file
- Drop the columns that are not needed
- Clean the data
- Split the dataset into Train/Test
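
The four steps above can be sketched with the standard library alone. This is only an illustration: the column names (`id`, `text`, `topic`, `author`), the sample rows, and the helpers `load_rows`, `clean`, and `split` are hypothetical, since the notebook does not show the real CSV layout or the internals of its own helpers.

```python
import csv
import io
import random

# Hypothetical CSV content; the real dataset's columns are not shown in the notebook.
RAW_CSV = """id,text,topic,author
1,The long tail of niche products,La larga cola,ana
2,,Wikinomics,luis
3,Mass collaboration changes everything,Wikinomics,eva
4,  Emergent systems self organize  ,Sistemas emergentes,juan
5,Niche markets add up,La larga cola,sol
"""

def load_rows(csv_text):
    """Read the CSV and keep only the columns needed for classification."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["text"], row["topic"]) for row in reader]

def clean(rows):
    """Lowercase, collapse whitespace, and drop rows whose text is empty."""
    cleaned = []
    for text, topic in rows:
        text = " ".join(text.lower().split())
        if text:
            cleaned.append((text, topic))
    return cleaned

def split(rows, test_size, seed=0):
    """Shuffle the rows and return (train, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, round(len(rows) * test_size))
    return rows[n_test:], rows[:n_test]

rows = clean(load_rows(RAW_CSV))
train, test = split(rows, test_size=0.2)
print(len(train), "training rows,", len(test), "test rows")
```

Fixing the shuffle seed keeps the split reproducible between runs, which makes accuracy figures comparable.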

## Bayes' theorem

![](https://miro.medium.com/max/358/1*8vBP06EtIIf-420o_q1u6g.png)


The goal is to find which topic has the highest probability for a given text.

Is `P(c1 | unTexto)` greater than `P(c2 | unTexto)`?

According to Bayes' theorem, this can be computed as follows:

`P(c | unTexto) = (P(unTexto | c) * P(c)) / P(unTexto)`

Since the denominator is the same for both classes, it can be dropped when comparing them, which leaves a score that is proportional to the posterior:

`P(c | unTexto) ∝ P(unTexto | c) * P(c)`
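
A small numeric check can make this step concrete. The priors and likelihoods below are made-up values for two classes; the point is only that dividing every score by the same `P(unTexto)` does not change which class wins.

```python
# Hypothetical values for two classes and a single text.
prior = {"c1": 0.7, "c2": 0.3}          # P(c)
likelihood = {"c1": 0.02, "c2": 0.10}   # P(unTexto | c)

# Unnormalized scores: P(unTexto | c) * P(c)
score = {c: likelihood[c] * prior[c] for c in prior}

# The denominator P(unTexto) is the sum of the scores over all classes.
evidence = sum(score.values())
posterior = {c: score[c] / evidence for c in score}

best_by_score = max(score, key=score.get)
best_by_posterior = max(posterior, key=posterior.get)
print(best_by_score, best_by_posterior)
```

Both selections agree, so the classifier can skip computing the denominator altogether.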

Finally, both factors can be estimated by counting:

`P(c) = count(textos, c) / count(textos, dataset)`

`P(unTexto | c) = count(unTexto, c) / count(textos, c)`
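
As a rough illustration of these counts, the sketch below estimates the class prior and the whole-text likelihood `P(unTexto | c)` on a tiny, invented training set. The data and the helper names `prior` and `whole_text_likelihood` are hypothetical and are not the notebook's own functions.

```python
from collections import Counter

# Hypothetical training set of (text, class) pairs.
training = [
    ("cheap niche products", "c1"),
    ("niche markets grow", "c1"),
    ("mass collaboration online", "c2"),
]

class_counts = Counter(label for _, label in training)

def prior(c):
    """P(c) = count(texts in c) / count(texts in dataset)."""
    return class_counts[c] / len(training)

def whole_text_likelihood(text, c):
    """P(text | c), estimated by counting exact matches of the full text."""
    matches = sum(1 for t, label in training if label == c and t == text)
    return matches / class_counts[c]

print(prior("c1"), prior("c2"))
print(whole_text_likelihood("cheap niche products", "c1"))  # text seen in training
print(whole_text_likelihood("cheap niche markets", "c1"))   # text never seen
```

Counting whole texts only works for texts that appear verbatim in the training data; any new text gets a likelihood of zero.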

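The texts to classify do not necessarily appear in the dataset, so estimating `P(unTexto | c)` by counting whole texts would give a probability of zero. To avoid this, all words are assumed to be independent of each other given the class. This is the "naive" conditional independence assumption that gives Naive Bayes its name. It is related in spirit to, but not the same as, the [Markov assumption](https://es.wikipedia.org/wiki/Proceso_de_M%C3%A1rkov), which limits dependence to the preceding state.

With that assumption in place: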

`P(unTexto | c) = P(w1 | c) * P(w2 | c) * ... * P(wn | c)`

where

`P(unaPalabra | c) = count(unaPalabra, c) / count(palabras, c)`
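
The per-word estimate and the product over words can be tried on a toy example. The texts and the helpers `word_likelihood` and `text_likelihood` below are invented for illustration and assume whitespace tokenization, which may differ from what the notebook's vectorizer does.

```python
from collections import Counter
from math import prod

# Hypothetical training texts that all belong to a single class c.
texts_in_c = ["cheap niche products", "niche markets grow"]

word_counts = Counter(word for text in texts_in_c for word in text.split())
total_words = sum(word_counts.values())

def word_likelihood(word):
    """P(word | c) = count(word, c) / count(words, c)."""
    return word_counts[word] / total_words

def text_likelihood(text):
    """P(text | c) as the product of the per-word likelihoods."""
    return prod(word_likelihood(word) for word in text.split())

print(word_likelihood("niche"))
print(text_likelihood("cheap niche markets"))
```

The text "cheap niche markets" never appears as a whole in the training data, yet it now gets a non-zero likelihood because each of its words was seen in the class. A word that was never seen in the class would still drive the whole product to zero.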

In [1]:
# Standard library
import math
import random
from multiprocessing import Pool

# Third-party
import pandas as pd
from sklearn.model_selection import train_test_split

# Project modules (the star import stays last, as in the original order)
from repository.csv_tools import get_documents
from util.count_vectorizer import MyCountVectorizer
from models.naive_bayes_utils import *

## Dataset preparation

In [2]:
documents = get_documents('../data/dataset.csv')
preprocessed_docs = [(document.lemmatized_string, document.topic) for document in documents]

In [3]:
texts_train, classes_train, text_test, classes_test = split_dataset(preprocessed_docs, 0.2)

## Training

In [4]:
classes = set(classes_train)
frequencies = {}
probabilities = {}
features_by_class = {}

def process(topic):
    """Compute word frequencies, word probabilities and feature count for one topic."""
    documents = get_documents_from_class(texts_train, classes_train, topic)

    # The texts are already lemmatized, so the vectorizer is told not to preprocess them.
    vectorizer = MyCountVectorizer(preprocess=False)
    term_document_matrix = vectorizer.fit_transform(documents)
    word_list = get_words(vectorizer)
    count_list = get_words_count(term_document_matrix)

    topic_frequencies = get_frecuencies(word_list, count_list)
    topic_probabilities = get_probabilities(word_list, count_list)
    topic_features = get_features_count(count_list)

    # The topic is returned last so the caller can key the results by it.
    return (topic_frequencies, topic_probabilities, topic_features, topic)
    

with Pool(10) as pool:
    results = pool.map(process, classes)

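# Each result is a (frequencies, probabilities, features, topic) tuple.
for topic_frequencies, topic_probabilities, topic_features, topic in results:
    frequencies[topic] = topic_frequencies
    probabilities[topic] = topic_probabilities
    features_by_class[topic] = topic_features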


In [5]:
total_features = get_total_features(frequencies, classes)
print('Total features: {}'.format(total_features))

Total features: 4304
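
Quantities such as `total_features` and `features_by_class` are what add-one (Laplace) smoothing typically needs, so that a word unseen in a class does not zero out the whole product. Whether the notebook's `NaiveBayes` class actually smooths this way is not shown, so the following is only a sketch under that assumption, with invented counts and a hypothetical helper `smoothed_likelihood`. It also assumes that `total_features` means the vocabulary size.

```python
from collections import Counter

# Hypothetical per-class word counts; the real ones come from the training cell.
frequencies = {
    "c1": Counter({"niche": 2, "cheap": 1, "markets": 1}),
    "c2": Counter({"collaboration": 2, "online": 1}),
}

# Word occurrences per class, and the size of the vocabulary shared by all classes.
features_by_class = {c: sum(counts.values()) for c, counts in frequencies.items()}
vocabulary = set().union(*frequencies.values())
total_features = len(vocabulary)

def smoothed_likelihood(word, c):
    """Add-one estimate: (count(word, c) + 1) / (count(words, c) + |V|)."""
    return (frequencies[c][word] + 1) / (features_by_class[c] + total_features)

print(smoothed_likelihood("niche", "c1"))
print(smoothed_likelihood("online", "c1"))  # unseen in c1, but no longer zero
```

With this estimate the likelihoods of all vocabulary words within a class still sum to 1, so it remains a valid probability distribution.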


In [6]:
class_probabilities = get_class_probabilities(classes_train)

for topic in class_probabilities.keys():
    print('Class: {} has a P = {}'.format(topic, class_probabilities[topic]))

Class: La larga cola has a P = 0.15510204081632653
Class: Economia de experiencia has a P = 0.22040816326530613
Class: Adopcion y difusion has a P = 0.1836734693877551
Class: Nueva economia has a P = 0.024489795918367346
Class: Wikinomics has a P = 0.09795918367346938
Class: La sociedad de costo marginal cero has a P = 0.12244897959183673
Class: Sistemas emergentes has a P = 0.08979591836734693
Class: El dominio de la informacion has a P = 0.00816326530612245
Class: Marketing 4.0 has a P = 0.04897959183673469
Class: E-commerce has a P = 0.024489795918367346
Class: Machine - Platform - Crowd has a P = 0.012244897959183673
Class: Realidad virtual has a P = 0.004081632653061225
Class: Plataformas y modelos de negocio has a P = 0.004081632653061225
Class: Domotica has a P = 0.004081632653061225


## Testing

In [7]:
from models.naive_bayes_utils import NaiveBayes
classifier = NaiveBayes()
classifier.init(frequencies, classes, features_by_class, class_probabilities)
classifier.save_model(path = '../data/')

In [8]:
successfuly = 0
failed = 0 
total = len(text_test)

for ix, document in enumerate(text_test):
    # The returned score is negative in the recorded output, which suggests a log-probability.
    topic_predicted, probability_predicted = classifier.topic(document)
    if topic_predicted == classes_test[ix]:
        successfuly += 1
    else:
        failed += 1

    print('For document #{}: Predicted topic: {} with P = {}. Real topic: {}'.format(ix, topic_predicted, probability_predicted, classes_test[ix]))

For document #0: Predicted topic: La larga cola with P = -1728.399955793279. Real topic: La larga cola
For document #1: Predicted topic: La sociedad de costo marginal cero with P = -1699.7607884193883. Real topic: La sociedad de costo marginal cero
For document #2: Predicted topic: La larga cola with P = -2267.2486363063326. Real topic: La larga cola
For document #3: Predicted topic: El dominio de la informacion with P = -5697.110132994869. Real topic: El dominio de la informacion
For document #4: Predicted topic: Machine - Platform - Crowd with P = -1707.8647781047532. Real topic: Machine - Platform - Crowd
For document #5: Predicted topic: La larga cola with P = -2129.4351930576954. Real topic: La larga cola
For document #6: Predicted topic: Sistemas emergentes with P = -1111.0794155079873. Real topic: Sistemas emergentes
For document #7: Predicted topic: Sistemas emergentes with P = -1607.3719337785915. Real topic: Sistemas emergentes
For document #8: Predicted topic: La sociedad de
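
The scores printed above are large negative numbers rather than values between 0 and 1, which is what one would expect if the classifier sums logarithms of probabilities instead of multiplying the probabilities directly. Whether `NaiveBayes.topic` does exactly this is not shown in the notebook, so the following is only a sketch of the general technique, with made-up likelihood values:

```python
import math

# Hypothetical per-word likelihoods for a long document (2000 words).
word_likelihoods = [1e-4] * 2000
prior = 0.15

# Multiplying directly underflows to 0.0 in double precision.
direct = prior
for p in word_likelihoods:
    direct *= p

# Summing logarithms keeps the score finite and preserves the ranking of classes.
log_score = math.log(prior) + sum(math.log(p) for p in word_likelihoods)

print(direct)
print(log_score)
```

Because the logarithm is monotonic, the class with the highest log score is also the class with the highest probability, so the prediction is unchanged.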

In [9]:
fiability = {
    'accuracy': (successfuly * 100) / total,
    'failed': (failed * 100) / total
}

print('The model was correct {}% of the time'.format(fiability['accuracy']))
print('The model failed {}% of the time'.format(fiability['failed']))

The model was correct 95.08196721311475% of the time
The model failed 4.918032786885246% of the time
