# Naive Bayes

## Supervised text classification model

- [Reference used](https://medium.com/analytics-vidhya/naive-bayes-classifier-for-text-classification-556fabaf252b#:~:text=The%20Naive%20Bayes%20classifier%20is,time%20and%20less%20training%20data)

---

## Dataset preparation

- Load the dataset from a `CSV` file
- Drop the columns we do not need
- Clean the data
- Split the dataset into Train/Test

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from repository.csv_tools import get_documents
import random
from util.count_vectorizer import MyCountVectorizer

In [2]:
documents = get_documents('../data.csv')
preprocessed_docs = [(document.stemmed_string, document.topic) for document in documents]

In [3]:
def split_dataset(docs, test_size):
    # Shuffle a copy so the caller's list is not reordered in place.
    docs = list(docs)
    random.shuffle(docs)

    # The first `test` documents form the test set; the rest are for training.
    test = int(len(docs) * test_size)
    texts_train = [doc[0] for doc in docs[test:]]
    topic_train = [doc[1] for doc in docs[test:]]
    texts_test  = [doc[0] for doc in docs[:test]]
    topic_test  = [doc[1] for doc in docs[:test]]

    return texts_train, topic_train, texts_test, topic_test


In [4]:
texts_train, classes_train, text_test, classes_test = split_dataset(preprocessed_docs, 0.2)

In [12]:
classes = set(classes_train)
frequencies = {}        # topic -> {word: raw count}
probabilities = {}      # topic -> {word: P(word | topic)}
features_by_class = {}  # topic -> total number of word occurrences

def get_documents_from_class(texts, classes, topic):
    # Use the `classes` argument rather than the global `classes_train`.
    return [texts[index] for index, cl in enumerate(classes) if cl == topic]

def get_words(vectorizer):
    return vectorizer.get_feature_names()

def get_words_count(term_document_matrix):
    # Sum each column to get the total count of every word in the class.
    return term_document_matrix.toarray().sum(axis=0)

def get_frequencies(word_list, count_list):
    return dict(zip(word_list, count_list))

def get_probabilities(word_list, count_list):
    # P(word | c) = count(word, c) / count(words, c): divide by the total
    # number of word occurrences in the class, not by the vocabulary size.
    total = count_list.sum()
    prob = [(count / total) for count in count_list]
    return dict(zip(word_list, prob))

def get_features_count(count_list):
    return count_list.sum(axis=0)

for topic in classes:
    class_documents = get_documents_from_class(texts_train, classes_train, topic)

    vectorizer           = MyCountVectorizer(preprocess = False)
    term_document_matrix = vectorizer.fit_transform(class_documents)
    word_list            = get_words(vectorizer)
    count_list           = get_words_count(term_document_matrix)

    frequencies[topic] = get_frequencies(word_list, count_list)

    probabilities[topic] = get_probabilities(word_list, count_list)

    features_by_class[topic] = get_features_count(count_list)


In [11]:
total_features_in_training_set = {}
vectorizer = MyCountVectorizer(preprocess = False)
tdm = vectorizer.fit_transform(texts_train)
total_features = len(vectorizer.get_feature_names())
total_features


1504

## Bayes' theorem

![](https://miro.medium.com/max/358/1*8vBP06EtIIf-420o_q1u6g.png)


We need to determine which topic is most probable for a given text.

Is `P(c1 | unTexto)` greater than `P(c2 | unTexto)`?

By Bayes' theorem, this can be computed as follows:

`P(c | unTexto) = (P(unTexto | c) * P(c)) / P(unTexto)`

Since the denominator `P(unTexto)` is the same for every class, we can drop it when comparing classes, which leaves a score proportional to the posterior:

`P(c | unTexto) ∝ P(unTexto | c) * P(c)`
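
As a quick check of why the denominator can be dropped, the sketch below compares the unnormalized scores with the normalized posteriors for two classes. The likelihood and prior values are made up for illustration and do not come from the dataset.

```python
# Hypothetical values for one fixed text and two classes (not from the dataset).
likelihood = {"c1": 0.02, "c2": 0.005}  # P(unTexto | c)
prior = {"c1": 0.3, "c2": 0.7}          # P(c)

# Unnormalized scores: P(unTexto | c) * P(c)
scores = {c: likelihood[c] * prior[c] for c in prior}

# The evidence P(unTexto) is the sum of the scores over all classes.
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

# Dividing every score by the same positive constant keeps the ranking.
best_by_score = max(scores, key=scores.get)
best_by_posterior = max(posteriors, key=posteriors.get)
```

Both `best_by_score` and `best_by_posterior` pick the same class, so the normalization is only needed when calibrated probabilities are required, not for choosing the topic.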

Finally, both terms are estimated from counts:

`P(c) = count(textos, c) / count(textos, dataset)` 

`P(unTexto | c) = count(unTexto, c) / count(textos, c)`
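
The two count-based estimates can be sketched on a toy dataset. The documents, topics, and helper names (`prior`, `whole_text_likelihood`) below are hypothetical and are not part of the notebook's code.

```python
# Toy labeled dataset (hypothetical): (text, topic) pairs.
docs = [
    ("goal match team", "sports"),
    ("team wins match", "sports"),
    ("election vote", "politics"),
    ("goal match team", "sports"),
]

def prior(docs, topic):
    # P(c) = number of texts in class c / number of texts in the dataset
    in_class = sum(1 for _, t in docs if t == topic)
    return in_class / len(docs)

def whole_text_likelihood(docs, text, topic):
    # Share of the texts in class c that are exactly equal to `text`
    in_class = [d for d, t in docs if t == topic]
    return in_class.count(text) / len(in_class)

p_sports = prior(docs, "sports")
seen = whole_text_likelihood(docs, "goal match team", "sports")
unseen = whole_text_likelihood(docs, "team scores goal", "sports")
```

Here `p_sports` is 0.75 and `seen` is 2/3, while `unseen` is 0 because that exact text never occurs in the class.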

The texts to classify do not necessarily appear in the dataset, in which case their count-based probability is zero. To avoid this, all words are assumed to be conditionally independent given the class. This is the "naive" assumption that gives Naive Bayes its name.

With that in mind:

`P(unTexto | c) = P(w1 | c) * P(w2 | c) * ... * P(wn | c)`

where

`P(unaPalabra | c) = count(unaPalabra, c) / count(palabras, c)`
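
Putting the formulas together, a minimal sketch of the classifier could look like the following. It uses `collections.Counter` on whitespace-split toy texts as a stand-in for the notebook's `MyCountVectorizer`; the function names and the data are assumptions made for illustration.

```python
from collections import Counter

# Toy training set (hypothetical): (text, topic) pairs.
train = [
    ("goal match team", "sports"),
    ("team wins match", "sports"),
    ("election vote party", "politics"),
]

def train_naive_bayes(docs):
    # Count documents per topic and word occurrences per topic.
    doc_counts = Counter(topic for _, topic in docs)
    word_counts = {topic: Counter() for topic in doc_counts}
    for text, topic in docs:
        word_counts[topic].update(text.split())
    priors = {t: n / len(docs) for t, n in doc_counts.items()}
    return priors, word_counts

def score(text, topic, priors, word_counts):
    # P(c) * product over words of count(word, c) / count(words, c)
    total = sum(word_counts[topic].values())
    result = priors[topic]
    for word in text.split():
        result *= word_counts[topic][word] / total
    return result

def classify(text, priors, word_counts):
    # Pick the topic with the highest unnormalized score.
    return max(priors, key=lambda t: score(text, t, priors, word_counts))

priors, word_counts = train_naive_bayes(train)
predicted = classify("team match", priors, word_counts)
```

With this data, `predicted` is `"sports"`. Note that under these exact formulas a single word never seen in a class drives that class's score to zero.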














