# Poe x Frost 
* This project builds a classifier that distinguishes Edgar Allan Poe's poems from Robert Frost's. We train two separate Markov models, one fitted on each poet's poems.
* Then, given the $p(\text{text}|\text{author})$ returned by each model, we apply Bayes' theorem to compute $p(\text{author}|\text{text})$ and take the more probable author as the final prediction.
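
The Bayes step described above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the function name `posterior_from_log_likelihoods`, the log-likelihood values, and the equal priors are all assumptions made for the example. It works in log space because the probability of a whole strophe is usually too small to represent directly.

```python
import math

def posterior_from_log_likelihoods(log_likelihoods, priors):
    """Turn log p(text | author) and p(author) into p(author | text)."""
    # Unnormalized log-posterior: log p(text | author) + log p(author).
    log_joint = {
        author: log_likelihoods[author] + math.log(priors[author])
        for author in log_likelihoods
    }
    # Subtract the maximum before exponentiating to avoid underflow.
    max_log = max(log_joint.values())
    unnormalized = {a: math.exp(v - max_log) for a, v in log_joint.items()}
    total = sum(unnormalized.values())
    return {a: v / total for a, v in unnormalized.items()}

# Hypothetical numbers, for illustration only.
example = posterior_from_log_likelihoods(
    {'poe': -120.0, 'frost': -123.0},
    {'poe': 0.5, 'frost': 0.5},
)
prediction = max(example, key=example.get)
```

With equal priors the prediction simply follows the larger likelihood; unequal priors (for instance, the share of strophes each poet contributes) shift the decision toward the more frequent author.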

## Loading the Documents

In [1]:
from re import sub
from typing import List

def load_text(filename:str)->List[str]:
    '''
        Reads the .txt file.
        
        Parameter
        ---------
        `filename`: str
            The name of the poems file.
            
        Returns
        -------
        A list containing each strophe's content.
    '''
    with open(f'/kaggle/input/poe-vs-frost/{filename}', 'r') as f:
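        strophe_delim = '\n\n'
        # Some separator lines hold a thin space (U+2009); normalize them so
        # every strophe break becomes a plain blank line before splitting.
        return sub('\n\u2009\n', strophe_delim, f.read()).split(strophe_delim)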
    
txt_frost = load_text('05_robert_frost.txt')
txt_poe = load_text('05_edgar_allan_poe.txt')

In [2]:
# Note that each variable is a list of strophes.
print(txt_frost[0])

Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth; 


## Fitting our Models
* The class below estimates each model's transition matrix $A$ and initial-token distribution $\pi$.
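
For reference, the usual add-epsilon estimates of these two quantities can be written as below. This is the textbook form under assumed notation (the symbols $c$, $N$ and $V$ do not appear in the notebook), and the class may differ in how it chooses $V$:

$$
\pi_i = \frac{c(i) + \epsilon}{N + \epsilon V},
\qquad
A_{ij} = \frac{c(i \to j) + \epsilon}{\sum_{k} c(i \to k) + \epsilon V}
$$

Here $c(i)$ counts the strophes that start with token $i$, $N$ is the number of strophes, $c(i \to j)$ counts how often token $j$ immediately follows token $i$, and $V$ is the number of tokens each distribution ranges over. With $\epsilon > 0$, no start token or transition receives probability zero.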

In [3]:
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from typing import List

class MarkovModel:
    '''
        Markov Model with add-epsilon smoothing.

        Parameters
        ----------
        `corpus`: List[str]
            List with the documents to be used.
        `epsilon`: float
            Smoothing constant added to every count.
    '''
    def __init__(self, corpus:List[str], epsilon:float):
        self.corpus = self.split_corpus(corpus)
        self.corpus_length = len(self.corpus)
        self.epsilon = epsilon

    @staticmethod
    def split_corpus(corpus:List[str])->List[List[str]]:
        '''
            Tokenizes the corpus' documents.
            
            Parameter
            ---------
            `corpus`: List[str]
                A list with each of the corpus' documents.
                
            Returns
            -------
            A list of token lists, one per document.
        '''
        return [word_tokenize(document.lower()) for document in corpus]

    def __vocab(self)->set:
        '''
            Collects all corpus tokens, excluding those that only ever appear as the first token of a strophe.
        '''
        vocab = []
        for doc in self.corpus:
            vocab+=doc[1:] # Not including the first tokens.
        return set(vocab)
        

    def __pi(self):
        '''
            Computes the model's pi vector (initial-token probabilities).
        '''
        self._pi = {}
        for doc in self.corpus:
            i = doc[0]
            if i not in self._pi.keys():
                self._pi[i] = 1
            else:
                self._pi[i]+=1
        m = self.a.shape[0]
        self.pi =  (pd.Series(self._pi)+self.epsilon) / (self.corpus_length+self.epsilon*m)
        
    def __a(self):
        '''
            Computes the model's A matrix (transition probabilities).
        '''
        self._a = {j:{} for j in self.__vocab()}
        for doc in self.corpus:
            for idx, j in enumerate(doc[1:], start=1):
                d_j = self._a[j]
                i = doc[idx-1]
                if i not in d_j.keys():
                    d_j[i] = 1
                else:
                    d_j[i] += 1
        
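        # Rows are previous tokens i, columns are next tokens j.
        a = pd.DataFrame(self._a).fillna(0)
        num = (a+self.epsilon)
        # Note: the smoothing term uses the row count a.shape[0]; each row sums
        # to exactly 1 only when that equals the column count a.shape[1].
        denom = a.sum(axis=1, skipna=True)+a.shape[0]*self.epsilon
        self.a =  num.div(denom, axis=0) 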

    def fit(self):
        self.__a()
        self.__pi()

a = MarkovModel(txt_frost, 1)
a.fit()

In [4]:
a.a.sum(axis=1)

was         0.996594
him         0.996519
of          0.996743
a           0.996782
n't         0.996641
              ...   
hog         0.996441
crooked     0.996441
budded      0.996441
drinking    0.996441
unswept     0.996441
Length: 2247, dtype: float64
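
The row sums above fall slightly short of 1. A likely explanation, judging from `__a`, is that the denominator adds `epsilon` once per row of the count matrix while the numerator adds it once per column, so the two only agree when the matrix is square. The sketch below reproduces the effect on a toy matrix; the counts and the helper `smooth_rows` are hypothetical and only illustrate the arithmetic.

```python
import pandas as pd

# Hypothetical toy counts: rows are previous tokens, columns are next tokens.
counts = pd.DataFrame(
    [[2.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
    index=['the', 'road'],
    columns=['road', 'bent', 'wood'],
)
epsilon = 1.0

def smooth_rows(counts, epsilon, n_outcomes):
    """Add-epsilon smoothing of each row, assuming n_outcomes possible next tokens."""
    num = counts + epsilon
    denom = counts.sum(axis=1) + n_outcomes * epsilon
    return num.div(denom, axis=0)

# Using the number of rows (as the class does) leaves rows unnormalized here.
by_rows = smooth_rows(counts, epsilon, counts.shape[0]).sum(axis=1)
# Using the number of columns makes every row sum to 1.
by_cols = smooth_rows(counts, epsilon, counts.shape[1]).sum(axis=1)
```

In the toy case the rows overshoot 1 because there are more columns than rows; in the output above they undershoot, which would be consistent with the fitted matrix having a few more rows than columns.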

<p style='color:red'> Train and test sets for each of the txt files</p>