# Naïve Bayes
In this proposed exercise, the topic of study is Naïve Bayes.

To implement its predictions - and simplify the usage of the created codebase - a class `Bayes` is implemented on the same folder as this notebook. The file containing the Naïve Bayes class is the [bayes.py](./bayes.py) file. It contains a class with the following exposed methods: train and predict.

In [105]:
from bayes import Bayes
bayes = Bayes()
bayes.train(['Eu gosto de batata'.lower().split(' '), 'Eu não gosto de batata'.lower().split(' ')], ["Positivo", 'Negativo'])

Now with the trained model, we can make predictions based on the possible values given in the training model. Here's a prediction example:

In [106]:
print(f"O texto 'Eu não gosto' é: {bayes.predict('Eu não gosto'.lower().split(' '))}")
print(f"O texto 'Eu gosto' é: {bayes.predict('Eu  gosto'.lower().split(' '))}")
print(f"O texto 'batata' é: {bayes.predict('Batata'.lower().split(' '))}")
print(f"O texto 'não' é: {bayes.predict('não'.lower().split(' '))}")

O texto 'Eu não gosto' é: Negativo
O texto 'Eu gosto' é: Positivo
O texto 'batata' é: Positivo
O texto 'não' é: Negativo


Looking at this small sample we can already predict that the influence of the negative word "não" is already pulling the probability to the negative class. The interesting part here is that the probability of both is the same whenever the negative word is not present. Let's try with the whole positive sentence

In [108]:
bayes.predict('Eu gosto de batata'.lower().split(' '), print_probabilities=True)
bayes.predict('de batata eu gosto'.lower().split(' '), print_probabilities=True)

['Likelyhood Positivo: -6.016309587105097', 'Likelyhood Negativo: -6.437751649736401']
['Likelyhood Positivo: -6.016309587105097', 'Likelyhood Negativo: -6.437751649736401']


'Positivo'

As we can see, we could predict that the text that has the word 'não' is negative, without the entire sentence.

This now can be used in the case of an already treated dataset that gives the variables and the classes. In the case of the exercise, we still need to process the incoming text for the Bayes class to be able to train based uppon it.

For the base to be trained, I will use the database of brazilian poems in this website: 
http://www.blocosonline.com.br/literatura/poesia/pn/pn000000.htm

The objective is to classify the poem in an author.

First to extract the poems, we need to crawl on all letters for the athors' names.



In [180]:
def tokenize(page):
    text = " ".join([unescape(line) for line in re.findall(r'(?:<p>|<br>|<P>|<BR>)(.*?)(?:\n)', page) if not line.endswith('>') and not line.endswith('&nbsp;')])
    return extract_words(text)

def extract_words(text):
    text = re.sub(r'[!\.,\-?\\/0-9:|()]', ' ', text)
    text = re.sub(r'\x97', '', text)
    text = re.sub(r'\x93', '', text)
    text = re.sub(r'\x94', '', text)
    text = re.sub(r'\xa0', '', text)
    text = re.sub(r'<br>', '', text)
    text = re.sub(r'[dD]o [Ll]ivro.*', '', text)
    text = re.sub(r' +', ' ', text)
    return text.strip().lower().split(' ')

tokenize("""
html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="GENERATOR" content="Mozilla/4.51 [pt] (Win98; I) [Netscape]">
   <meta name="Author" content="Blocos">
   <meta name="KeyWords" content="Blocos, MPB, A., Alexandre, Ramôa, Ramoa, poesias, cultura, literatura, poetas, brasileiros, brasileiras, autores, nacionais, Brasil,">
   <title>&Uacute;ltimos versos - A. Ram&ocirc;a</title>
<link href="../../../estilos.css" rel="stylesheet" type="text/css"><SCRIPT LANGUAGE="JavaScript" src="../../../menu.js"></script>
</head>
<body onload="goSetHeight()"  class="texto-conteudo" bgcolor="#FFFFFF" link="#000066" vlink="#551A8B" alink="#FF0000"  nosave>

<ul><b>&Uacute;ltimos Versos</b>
<p>Te deixei voar t&atilde;o alto
<br>T&atilde;o longe
<br>T&atilde;o bela e t&atilde;o livre
<br>Que te tornastes o horizonte
<br>Inalcan&ccedil;&aacute;vel
<p>Hoje, sigo sem asas
<br>Em busca de um limite
<br>(Inexistente)
<p>Te deixo
<br>Te perco
<br>Me esque&ccedil;o
<br>Pra sempre
<p>Enfim, teu esp&iacute;rito voa livre
<br>Por sobre os campos de sonhos
<br>Que deixei
<br>E que um dia semeei pra mim...
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<b>&nbsp;
A. Ram&ocirc;a</b>
<br><a class="navegacao4"  href="../pn03/pn001608.htm"></a></ul>

<p align="right"><a href="../../poesia_origens.php" class="navegacao2" onFocus="if(this.blur)this.blur()">&laquo; 
  Voltar</a></p>

</BODY>
</html>


""")

['te',
 'deixei',
 'voar',
 'tão',
 'alto',
 'tão',
 'longe',
 'tão',
 'bela',
 'e',
 'tão',
 'livre',
 'que',
 'te',
 'tornastes',
 'o',
 'horizonte',
 'inalcançável',
 'hoje',
 'sigo',
 'sem',
 'asas',
 'em',
 'busca',
 'de',
 'um',
 'limite',
 'inexistente',
 'te',
 'deixo',
 'te',
 'perco',
 'me',
 'esqueço',
 'pra',
 'sempre',
 'enfim',
 'teu',
 'espírito',
 'voa',
 'livre',
 'por',
 'sobre',
 'os',
 'campos',
 'de',
 'sonhos',
 'que',
 'deixei',
 'e',
 'que',
 'um',
 'dia',
 'semeei',
 'pra',
 'mim']

In [164]:
from html import unescape
import requests
import re
from datetime import datetime
from pathlib import Path

Path('poems').mkdir(parents=True, exist_ok=True)

now = datetime.now()
authors = {}
authors_list = []

def get_poem(poem_ref, url_base):
    if poem_ref[0] == '..':
        url_base_temp = url_base[:-1] + [poem_ref[1]]
        poem_ref = [poem_ref[-1]]
    else:
        url_base_temp = url_base
    print('/'.join(url_base_temp + poem_ref))
    poem_request = requests.get('/'.join(url_base_temp + poem_ref), headers={"User-Agent": "XY"})
    return tokenize(poem_request.text) if poem_request.status_code == 200 else None

for letter in 'ghijklmnopqrstuvwxyz':
    url_base = 'http://www.blocosonline.com.br/literatura/poesia/pn'.split('/')
    letter_path = f'pn00000{letter}.htm'
    authors_request = requests.get("/".join(url_base + [letter_path]), headers={"User-Agent": "XY"})
    for author in [author for author in re.findall(r'(?:href=\"|HREF=\")(.*?)(?:\")', authors_request.text)[:-1] if author.endswith('.htm') and author != "../../../servic/sermails.htm"]:
        author = author.split('/')
        if author[0] == '..':
            url_base = url_base[:-1] + [author[1]]
            author = [author[-1]]
        print('/'.join(url_base + author))
        poems_request = requests.get('/'.join(url_base + author), headers={"User-Agent": "XY"})
        author_name = unescape(re.findall(r'(?:<title>|<TITLE>)(.*?)(?:</title>|</TITLE>)', poems_request.text)[0])
        
        Path(f'poems/{author_name}').mkdir(parents=True, exist_ok=True)
        for i, poem in enumerate(re.findall(r'(?:href=\"|HREF=\")(.*?)(?:\")', poems_request.text)):
            poem = poem.split('/')
            if poem[-1].endswith('htm') and poem[-1] != letter_path:
                with open(f'poems/{author_name}/{poem[-1]}.txt', 'w') as file_stream:
                    dataset = get_poem(poem, url_base)
                    if dataset:
                        file_stream.write(" ".join(dataset))
print(datetime.now() - now)

5.htm
http://www.blocosonline.com.br/literatura/poesia/pf/pf000003a.htm
http://www.blocosonline.com.br/literatura/poesia/pf/pn000837.htm
http://www.blocosonline.com.br/literatura/poesia/pn03/pn001594.htm
http://www.blocosonline.com.br/literatura/poesia/p03/p030511.htm
http://www.blocosonline.com.br/literatura/poesia/p03/p030512.htm
http://www.blocosonline.com.br/literatura/poesia/p03/p030513.htm
http://www.blocosonline.com.br/literatura/poesia/p03/p030514.htm
http://www.blocosonline.com.br/literatura/poesia/p03/p030515.htm
http://www.blocosonline.com.br/literatura/poesia/p03/p030516.htm
http://www.blocosonline.com.br/literatura/poesia/p03/p030517.htm
http://www.blocosonline.com.br/literatura/poesia/pn03/pn000403.htm
http://www.blocosonline.com.br/literatura/poesia/pn03/pn000348.htm
http://www.blocosonline.com.br/literatura/poesia/pn03/pn001319.htm
http://www.blocosonline.com.br/literatura/poesia/pn03/pn001126.htm
http://www.blocosonline.com.br/literatura/poesia/pn03/pn000553.htm
http:/

Now, with all the extracted poems in their respective author's folders, we can load this dataset whenever necessary to train this model.

To assess how well the model can verify if an author wrote a poem, let's train the bayes model with 4 authors and check the probabilities of a poem from each of them.

In [248]:
from os import listdir

def get_training_data(limit=None):
    author_folders = [item for item in listdir('poems') if not item.startswith('.')]
    dataset = []
    output = []
        
    for i in range(limit) if limit else range(len(author_folders)):
        for poem in [item for item in listdir(f'poems/{author_folders[i]}') if not item.startswith('.')]:
            with open(f'poems/{author_folders[i]}/{poem}', 'r') as file_reader:
                dataset.append(extract_words(file_reader.read()))
                output.append(author_folders[i])
    return (dataset, output)

In [224]:
def train_bayes(dataset, output):
    now = datetime.now()
    bayes = Bayes()
    bayes.train(dataset, output)
    print(f"took: {datetime.now() - now}")
    return bayes

In [225]:
def get_results(bayes_model, dataset, output):
    results = []
    for poem, author in zip(dataset, output):
        result = bayes_model.predict(poem) == author
        results.append(result)
    print(f"We've got {(len([result for result in results if result])/len(results))*100}% right")
dataset, output = get_training_data(limit=4)
get_results(train_bayes(dataset, output), dataset, output)

took: 0:00:00.085408
We've got 100.0% right


As we can see, testing this model for 4 authors, it has got 100% of the predictions right. To assess if this is going to hold true for other authors, I am going to use the same data and functions for other numbers of authors to watch for the performance of the model

In [226]:
for i in range(10, 30):
    print(f"For {i} authors;")
    dataset, output = get_training_data(limit=i)
    get_results(train_bayes(dataset, output), dataset, output)

For 10 authors;
took: 0:00:01.430694
We've got 91.66666666666666% right
For 11 authors;
took: 0:00:01.710053
We've got 92.3076923076923% right
For 12 authors;
took: 0:00:02.237094
We've got 93.05555555555556% right
For 13 authors;
took: 0:00:02.911196
We've got 93.5064935064935% right
For 14 authors;
took: 0:00:03.201320
We've got 93.58974358974359% right
For 15 authors;
took: 0:00:04.535990
We've got 92.94117647058823% right
For 16 authors;
took: 0:00:05.034465
We've got 93.10344827586206% right
For 17 authors;
took: 0:00:05.776650
We've got 93.61702127659575% right
For 18 authors;
took: 0:00:06.257362
We've got 93.81443298969072% right
For 19 authors;
took: 0:00:07.053881
We've got 94.0% right
For 20 authors;
took: 0:00:07.839202
We've got 94.39252336448598% right
For 21 authors;
took: 0:00:08.462771
We've got 94.4954128440367% right
For 22 authors;
took: 0:00:08.719867
We've got 93.63636363636364% right
For 23 authors;
took: 0:00:09.561570
We've got 91.96428571428571% right
For 24 a

This is already a quite impressive model, it should be good enough to identify most of the poems' authors.
Now, it wouldn't be fun if we couldn't test it with one of my own poems to see which author I identify myself with:

In [228]:
dataset, output = get_training_data(limit=40)
bayes = train_bayes(dataset, output)

took: 0:00:27.157213


'MPB - Paolo Lim'

In [230]:
bayes.predict(extract_words("De minha volição, palavras abstém; De meu coração, pensamentos advém; Consciência e espírito, vivência e experiência; Efêmera ilusão, sentimentos contém"), print_probabilities=True)

['Likelyhood MPB - R. B. Sotero: -80.59115874923836', 'Likelyhood MPB - Yara Daher: -83.51623866682769', 'Likelyhood MPB - Iauaretê: -89.10895543337449', 'Likelyhood MPB - Vera Vilela: -77.45125210969985', 'Likelyhood MPB - Zila Mamede: -85.65699952501105', 'Likelyhood Iaiá Castro: -78.77421122406761', 'Likelyhood MPB - Fábio Mor: -86.9733228047176', 'Likelyhood MPB - Dorcila Garcia: -85.03076297113476', 'Likelyhood MPB - Enzo Lenine: -81.60231403514001', 'Likelyhood MPB - Rui Barbosa: -83.3884780008886', 'Likelyhood MBP - Urayoan Noel: -86.67985991417927', 'Likelyhood MPB - Laura Esteves: -87.94106665866437', 'Likelyhood MBP - Cacilda Barboza: -86.7285967755151', 'Likelyhood MBP - Fabrício Carpinejar: -85.06286322225857', 'Likelyhood MPB - Carlos de Hollanda: -83.91715799385209', 'Likelyhood MPB - Neide Ferreira Mendes da Silva: -76.801924918645', 'Likelyhood MPB - Narcisa Amália: -82.51586806753333', 'Likelyhood MPB - Yacy Maia Saraiva: -87.22193862361732', 'Likelyhood MPB - Lucas Te

'MPB - Paolo Lim'

Looking at the results I found that Paolo Lim is the poet that this poem of mine would identify as. Out of curiosity, and reading his poems, I found that the `;` symbol was not removed from the datasets. That is probably the reason my likelyhood got pulled to his, as he uses the semi-colon a lot in his poems.
Let's improve the treatment of the dataset and check again how it influences the accuracy of the model;

In [231]:
def extract_words(text):
    text = re.sub(r'[!\.,\-?\\/0-9:|();]', ' ', text)
    text = re.sub(r'\x97', '', text)
    text = re.sub(r'\x93', '', text)
    text = re.sub(r'\x94', '', text)
    text = re.sub(r'\xa0', '', text)
    text = re.sub(r'<br>', '', text)
    text = re.sub(r'[dD]o [Ll]ivro.*', '', text)
    text = re.sub(r' +', ' ', text)
    return text.strip().lower().split(' ')

for i in range(10, 15):
    print(f"For {i} authors;")
    dataset, output = get_training_data(limit=i)
    get_results(train_bayes(dataset, output), dataset, output)

For 10 authors;
took: 0:00:01.370974
We've got 91.66666666666666% right
For 11 authors;
took: 0:00:01.692889
We've got 92.3076923076923% right
For 12 authors;
took: 0:00:02.170121
We've got 93.05555555555556% right
For 13 authors;
took: 0:00:02.948848
We've got 93.5064935064935% right
For 14 authors;
took: 0:00:03.154648
We've got 93.58974358974359% right


It doesn't look like it influences the model at all, quite interesting! Let's try and see if it changes the likelyhood of my poem

In [232]:
dataset, output = get_training_data(limit=40)
bayes = train_bayes(dataset, output)
bayes.predict(extract_words("De minha volição, palavras abstém; De meu coração, pensamentos advém; Consciência e espírito, vivência e experiência; Efêmera ilusão, sentimentos contém"), print_probabilities=True)

took: 0:00:27.140702
['Likelyhood MPB - R. B. Sotero: -80.51962272923525', 'Likelyhood MPB - Yara Daher: -83.44055697109172', 'Likelyhood MPB - Iauaretê: -89.0322177999193', 'Likelyhood MPB - Vera Vilela: -77.38248535532553', 'Likelyhood MPB - Zila Mamede: -85.58117737962995', 'Likelyhood Iaiá Castro: -78.70404415592154', 'Likelyhood MPB - Fábio Mor: -86.89701513477998', 'Likelyhood MPB - Dorcila Garcia: -84.9573540744606', 'Likelyhood MPB - Enzo Lenine: -81.53308117527256', 'Likelyhood MPB - Rui Barbosa: -83.31660038902088', 'Likelyhood MBP - Urayoan Noel: -86.60379578130241', 'Likelyhood MPB - Laura Esteves: -87.86584279896485', 'Likelyhood MBP - Cacilda Barboza: -86.65364889647572', 'Likelyhood MBP - Fabrício Carpinejar: -84.98994097661452', 'Likelyhood MPB - Carlos de Hollanda: -83.84378676236065', 'Likelyhood MPB - Neide Ferreira Mendes da Silva: -76.73342170323967', 'Likelyhood MPB - Narcisa Amália: -82.44253446259324', 'Likelyhood MPB - Yacy Maia Saraiva: -87.14653627195023', 'L

'MPB - Paolo Lim'

Indeed, everything looks the same. It's definitely quite an accurate model to predict the author as the class.

Next, on the purpose of trying to improve the recognition, I will remove the stopwords from the datasets as well as the ponctuation using the python `nltk` library. 

In [237]:
import nltk
nltk.download('stopwords')
def extract_words(text):
    text = re.sub(r'[!\.,\-?\\/0-9:|();]', ' ', text)
    text = re.sub(r'\x97', '', text)
    text = re.sub(r'\x93', '', text)
    text = re.sub(r'\x94', '', text)
    text = re.sub(r'\xa0', '', text)
    text = re.sub(r'<br>', '', text)
    text = re.sub(r'[dD]o [Ll]ivro.*', '', text)
    text = re.sub(r' +', ' ', text)
    text = text.strip().lower().split(' ')
    stopwords = nltk.corpus.stopwords.words('portuguese')
    return [word for word in text if word not in stopwords]

extract_words('Eu gosto de batata')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/gcarraro/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['gosto', 'batata']

In [240]:
for i in range(10, 15):
    print(f"For {i} authors;")
    dataset, output = get_training_data(limit=i)
    get_results(train_bayes(dataset, output), dataset, output)

For 10 authors;
took: 0:00:01.302891
We've got 91.66666666666666% right
For 11 authors;
took: 0:00:01.547199
We've got 92.3076923076923% right
For 12 authors;
took: 0:00:01.998293
We've got 93.05555555555556% right
For 13 authors;
took: 0:00:02.703310
We've got 93.5064935064935% right
For 14 authors;
took: 0:00:02.996653
We've got 93.58974358974359% right


And looking at the results, it's clear to see the stopwords didn't influence at all the given dataset. Let's see if new data is introduced, if that would change.

In [241]:
dataset, output = get_training_data(limit=40)
bayes = train_bayes(dataset, output)
bayes.predict(extract_words("De minha volição, palavras abstém; De meu coração, pensamentos advém; Consciência e espírito, vivência e experiência; Efêmera ilusão, sentimentos contém"), print_probabilities=True)

took: 0:00:24.787848
['Likelyhood MPB - R. B. Sotero: -41.12617662050834', 'Likelyhood MPB - Yara Daher: -40.26970517520092', 'Likelyhood MPB - Iauaretê: -40.91559039697373', 'Likelyhood MPB - Vera Vilela: -40.57689577226427', 'Likelyhood MPB - Zila Mamede: -40.275236149216255', 'Likelyhood Iaiá Castro: -40.47436852061002', 'Likelyhood MPB - Fábio Mor: -40.94205654039516', 'Likelyhood MPB - Dorcila Garcia: -40.365624806316035', 'Likelyhood MPB - Enzo Lenine: -39.83406199035904', 'Likelyhood MPB - Rui Barbosa: -41.10739473633596', 'Likelyhood MBP - Urayoan Noel: -40.94205654039516', 'Likelyhood MPB - Laura Esteves: -40.98631685707168', 'Likelyhood MBP - Cacilda Barboza: -40.9739081922168', 'Likelyhood MBP - Fabrício Carpinejar: -41.05877198687598', 'Likelyhood MPB - Carlos de Hollanda: -41.05334015581488', 'Likelyhood MPB - Neide Ferreira Mendes da Silva: -40.50349995894903', 'Likelyhood MPB - Narcisa Amália: -41.05605680896511', 'Likelyhood MPB - Yacy Maia Saraiva: -40.96976511781871',

'MPB - Paolo Lim'

One thing to note is: the likelyhood, although yielding smaller numbers, seems to be proportional with the previous results. But the great thing we can take back from this analysis is: removing stop words reduces the time in training the model considerably! We could see an improvement of approximately 3 seconds. This is a great improvement when trying to train for bigger datasets. It still too long to compute all authors, but it would take less time.