# Naive Bayes

## Modelo de clasificación de textos de manera supervisada

- [Documentación utilizada](https://medium.com/analytics-vidhya/naive-bayes-classifier-for-text-classification-556fabaf252b#:~:text=The%20Naive%20Bayes%20classifier%20is,time%20and%20less%20training%20data)

---

## Preparación del dataset

- Obtención del dataset de un `CSV`
- Ignoramos las columnas innecesarias
- Limpieza de datos
- Separación de dataset en Train/Test

## Teorema de Bayes

![](https://miro.medium.com/max/358/1*8vBP06EtIIf-420o_q1u6g.png)


Se debe calcular qué tópico tiene mayor probabilidad para una texto determinado

¿`P(c1 | unTexto)` es mayor que `P(c2 | unTexto)`?

Según el Teorema de Bayes, esto se puede calcular de la siguiente manera:

`P(c | unTexto) = (P(unTexto | c) * P(c)) / P(unTexto)`

Como para ambas clases el denominador es el mismo, podemos ignorarlo y nos queda:

`P(c | unTexto) = P(unTexto | c) * P(c)`

Finalmente

`P(c) = count(textos, c) / count(textos, dataset)` 

`P(unTexto) = count(unTexto, c) / count(textos, c)`

Dado que los textos a evaluar no necesariamente aparecen en el dataset, y por consiguiente su probabilidad es cero. Entonces se asumen todas las palabras independientes. Ésto se lo conoce como [Markov Assumption](https://es.wikipedia.org/wiki/Proceso_de_M%C3%A1rkov)

Entonces teniendo en cuenta lo mencionado:

`P(unTexto | c) = P(w1 | c) * P(w2 | c) * ... * P(wn | c)`

Siendo

`P(unaPalabra | c) = count(unaPalabra, c) / count(palabras, c)`
















In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from repository.csv_tools import get_documents
import random
from util.count_vectorizer import MyCountVectorizer
import math
from multiprocessing import Pool

## Preparación del dataset

In [14]:
documents = get_documents('../data.csv')
preprocessed_docs = [(document.lemmatized_string, document.topic) for document in documents]

In [15]:
def split_dataset(docs, test_size):
    random.shuffle(docs)

    test = int(len(docs) * test_size)
    texts_train = [doc[0] for doc in docs[test:]]
    topic_train = [doc[1] for doc in docs[test:]]
    texts_test  = [doc[0] for doc in docs[:test]]
    topic_test  = [doc[1] for doc in docs[:test]]

    return texts_train, topic_train, texts_test, topic_test


In [16]:
texts_train, classes_train, text_test, classes_test = split_dataset(preprocessed_docs, 0.2)

## Entrenamiento

In [32]:
classes = set(classes_train)

def get_documents_from_class(texts, classes, topic):
    return [texts[index] for index, cl in enumerate(classes_train) if cl == topic]

def get_words(vectorizer):
    return vectorizer.get_feature_names(); 

def get_words_count(term_document_matrix):
    return term_document_matrix.toarray().sum(axis=0) 

def get_frecuencies(word_list, count_list):
    return dict(zip(word_list, count_list))

def get_probabilities(word_list, count_list):
    prob = [(count / len(word_list)) for count in count_list]
    return dict(zip(word_list, prob))

def get_features_count(count_list):
    return count_list.sum(axis=0)

def process(topic):
    documents = get_documents_from_class(texts_train, classes_train, topic)

    vectorizer           = MyCountVectorizer(preprocess = False)
    term_document_matrix = vectorizer.fit_transform(documents)
    word_list            = get_words(vectorizer)
    count_list           = get_words_count(term_document_matrix)

    __frequencies = get_frecuencies(word_list, count_list)
    __probabilities = get_probabilities(word_list, count_list)
    __features_by_class = get_features_count(count_list)
   
    return (__frequencies, __probabilities, __features_by_class, topic)
    

with Pool(10) as pool:
    results = pool.map(process, classes)

for result in results:
    topic = result[3]
    frequencies[topic] = result[0]
    probabilities[topic] = result[1]
    features_by_class[topic] = result[2]


In [47]:
features = [key for key in frequencies[topic].keys() for topic in classes]

total_features = len(set(features))
print('Total features: {}'.format(total_features))


Total features: 1115


In [48]:
class_probabilities = {}

for topic in set(classes_train):
    class_probabilities[topic] = classes_train.count(topic) / len(classes_train)


## Testing

In [85]:
def get_word_probability(word, topic, frequencies):
    freq = frequencies[topic]
    count = freq[word] if word in freq.keys() else 0
    return (count + 1) / (features_by_class[topic] + total_features)

def get_words_probabilities(document, topic, frequencies):
    prob = []
    for word in document:
        probability = get_word_probability(word, topic, frequencies)
        prob.append(probability)
    return dict(zip(document, prob))

def get_topic_predicted(predicted_probabilities):

    max_probability = None
    topic = None

    for key in predicted_probabilities.keys():
        p = predicted_probabilities[key] 
        if not max_probability or p > max_probability:
            max_probability = p
            topic = key
    
    return topic, max_probability


for ix, document in enumerate(text_test[:]):

    predicted_probabilities = {}

    for topic in classes:
        words_probability = get_words_probabilities(document, topic, frequencies)
        P = math.log(class_probabilities[topic], 2)
        for key in words_probability.keys():
            p = words_probability[key]
            P += math.log(p, 2)
        predicted_probabilities[topic] = P
    
    topic_predicted, probability_predicted = get_topic_predicted(predicted_probabilities)
    print('For document #{}: Predicted topic: {} with P = {}. Real topic: {}'.format(ix, topic_predicted, probability_predicted, classes_test[ix]))


For document #0: Predicted topic: Sistemas emergentes with P = -1000.8557035744055. Real topic: Sistemas emergentes
For document #1: Predicted topic: Wikinomics with P = -2483.940418548769. Real topic: Wikinomics
For document #2: Predicted topic: Test with P = -2547.886458839754. Real topic: La larga cola
For document #3: Predicted topic: La larga cola with P = -2220.704247597434. Real topic: La larga cola
For document #4: Predicted topic: La larga cola with P = -2704.5176637854624. Real topic: La larga cola
For document #5: Predicted topic: Adopcion y difusion with P = -1635.249766293545. Real topic: Adopcion y difusion
For document #6: Predicted topic: Sistemas emergentes with P = -896.2390487986557. Real topic: Sistemas emergentes
For document #7: Predicted topic: Python with P = -4987.274582167257. Real topic: Marketing 4.0
For document #8: Predicted topic: Sistemas emergentes with P = -1135.4272116586253. Real topic: Sistemas emergentes
For document #9: Predicted topic: La larga c