# Poe x Frost 
* This project builds a classifier that distinguishes Edgar Allan Poe's poems from Robert Frost's. We train two separate Markov models, each fitted on the poems of one poet.
* Then, given the $p(\text{text}|\text{author})$ returned by each model, we apply Bayes' Theorem to compute $p(\text{author}|\text{text})$ and obtain the final prediction.
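
As a minimal sketch of that decision rule, the comparison can be done in log space: the evidence $p(\text{text})$ is the same for both authors, so only $\log p(\text{text}|\text{author}) + \log p(\text{author})$ needs to be compared. The helper name `pick_author` and all numbers below are made up for illustration and are not taken from the notebook.

```python
import math

def pick_author(log_likelihoods, priors):
    """Return the author maximizing log p(text|author) + log p(author)."""
    scores = {
        author: log_likelihoods[author] + math.log(priors[author])
        for author in priors
    }
    return max(scores, key=scores.get), scores

# Hypothetical values: Poe's model likes the text slightly more,
# but Frost has the larger prior.
log_likelihoods = {'frost': -42.0, 'poe': -41.5}
priors = {'frost': 0.65, 'poe': 0.35}

author, scores = pick_author(log_likelihoods, priors)
print(author, scores)
```

With these toy numbers the prior tips the decision toward Frost even though Poe's likelihood is slightly higher.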

## Loading the Documents

In [1]:
from re import sub
from typing import List

def load_text(filename:str)->List[str]:
    '''
        Reads the .txt file.
        
        Parameter
        ---------
        `filename`: str
            The name of the poems file.
            
        Returns
        -------
        A list containing each strophe's content.
    '''
    with open(f'/kaggle/input/poe-vs-frost/{filename}', 'r') as f:
        strophe_delim = '\n\n'
        return sub('\n\u2009\n', strophe_delim, f.read()).split(strophe_delim)
    
txt_frost = load_text('05_robert_frost.txt')
txt_poe = load_text('05_edgar_allan_poe.txt')

In [2]:
# Note that each variable is a list of strophes.
print(txt_frost[0])

Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth; 


## Datasets split
* As in any data science project, we split each set into two partitions: one for training the models and the other for estimating how the algorithm would perform in a deployment scenario.

In [3]:
from sklearn.model_selection import train_test_split
train_poe, test_poe = train_test_split(txt_poe, train_size=.75, random_state=42)
train_frost, test_frost = train_test_split(txt_frost, train_size=.75, random_state=42)

In [4]:
# Note that we are dealing with an imbalanced dataset: Frost accounts for roughly 65% of the strophes.
len(txt_frost)/ (len(txt_poe)+len(txt_frost))

0.6460176991150443

In [5]:
# Let's store each class's prior probability in a dictionary.
proba_frost = len(txt_frost)/ (len(txt_poe)+len(txt_frost))
probas = {'frost':proba_frost, 'poe':(1-proba_frost)}

## Fitting our Models
* The class below estimates each model's transition matrix $A$ and initial-state vector $\pi$.
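
For intuition, here is a small standalone sketch of add-epsilon smoothing on a toy corpus, assuming the usual definitions: $\pi_i$ is the probability that token $i$ starts a document and $A_{ij}$ is the probability of moving from token $i$ to token $j$. The function name `toy_fit` and the two sample lines are hypothetical, and the sketch is simpler than the notebook's class (for instance, it has no `<UNKNOWN>` handling).

```python
from collections import Counter, defaultdict

def toy_fit(docs, epsilon=1.0):
    """Estimate smoothed pi and A from already-tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    v = len(vocab)

    # Raw counts: first tokens and (previous -> next) transitions.
    first_counts = Counter(doc[0] for doc in docs)
    trans_counts = defaultdict(Counter)
    for doc in docs:
        for prev, nxt in zip(doc, doc[1:]):
            trans_counts[prev][nxt] += 1

    # Add epsilon to every count; the denominator grows by epsilon per outcome.
    pi = {t: (first_counts[t] + epsilon) / (len(docs) + epsilon * v) for t in vocab}
    a = {
        i: {
            j: (trans_counts[i][j] + epsilon)
               / (sum(trans_counts[i].values()) + epsilon * v)
            for j in vocab
        }
        for i in vocab
    }
    return pi, a

docs = [['the', 'raven', 'said'], ['the', 'road', 'said']]
pi, a = toy_fit(docs)
print(pi['the'], a['the']['raven'])
```

In this sketch `pi` and every row of `a` sum to 1, because `epsilon` is added once per possible outcome in both the numerator and the denominator.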

In [6]:
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from typing import List, Tuple

class MarkovModel:
    '''
       Markov Model, with Add-Epsilon Smoothing.

        Parameters
        ---------
        `corpus`: List[str]
            List with the documents to be used.
        `epsilon`: float
            Smoothing degree of the probabilities.
        `name`: str
            A name for your model.
            
        Methods
        ------
        `fit`: Generates the model's A and pi.
        `predict_log_proba`: Estimates the log probability of a given sequence.
        
        
        Attributes
        ----------
        `a`: `pd.DataFrame`
            The model's A matrix.
        `_a`: Dict[str, Dict[str, int]]
            A dictionary mapping each target state to the number of times each previous state transitioned into it.
        `pi`: `pd.Series`
            The model's pi vector.
        `_pi`: Dict[str, int]
            A dictionary with the number of times each token started a document.
        `_vocab`: Set[str]
            A set with the whole corpus vocabulary.
    '''
    def __init__(self, corpus:List[str], epsilon:float, name:str):
        self.corpus = self.split_corpus(corpus)
        self.corpus_length = len(self.corpus)
        self.epsilon = epsilon
        self.name = name

    @staticmethod
    def split_corpus(corpus:List[str])->List[List[str]]:
        '''
            Tokenizes the corpus' documents.
            
            Parameter
            ---------
            `corpus`: List[str]
                A list with each of the corpus' documents.
                
            Returns
            -------
            A list of the documents tokens.
        '''
        return [word_tokenize(document.lower()) for document in corpus]
    
    def __vocab(self)->None:
        '''
            Extracts the corpus vocabulary.
            
            Builds one set with all training tokens and another that excludes tokens appearing only as the first word of a strophe.
        '''
        self._vocab, self._a_vocab = [], []
        
        for doc in self.corpus:
            self._vocab += doc
            self._a_vocab+=doc[1:] # Not including the first tokens.
            
        self._vocab, self._a_vocab = set(self._vocab), set(self._a_vocab)        
    
    def __check_pi(self, token:str)->str:
        '''
            Masks a sequence's first token with the '<UNKNOWN>' mark if it never started a document in the training set.
            
            Parameter
            ---------
            `token`: str
                The sentence's first token under scrutiny
            
            Returns
            -------
            The treated token.
        '''
        return token if token in self._pi else '<UNKNOWN>'
    
    def __check_a(self, token1:str, token2:str)->Tuple[str, str]:
        '''
            When querying the model's A matrix, checks whether the provided initial and target states are present. If not,
            the tokens are masked with the flag '<UNKNOWN>'.
            
            Parameters
            ----------
            `token1`: str
                The initial state.
            `token2`: str
                The target state.
            
            Returns
            -------
            The treated tokens inside a tuple.
        '''
        token1 = token1 if token1 in self.a.index else '<UNKNOWN>'
        token2 = token2 if token2 in self.a.columns else '<UNKNOWN>'
        return token1, token2
    
    def __pi(self):
        '''
            Computes the model's pi vector.
        '''
        self._pi = {}
        m = self.a.shape[0]
        
        for doc in self.corpus:
            i = doc[0]
            if i not in self._pi.keys():
                self._pi[i] = 1
            else:
                self._pi[i]+=1
        
        self._pi['<UNKNOWN>'] = 0 # Defining a key for possible tokens of the test set that were unseen during training.
        self.pi =  (pd.Series(self._pi)+self.epsilon) / (self.corpus_length+self.epsilon*m)
        
    def __a(self):
        '''
            Computes the model's A matrix.
        '''
        self._a = {j:{} for j in self._a_vocab}
        for doc in self.corpus:
            for idx, j in enumerate(doc[1:], start=1):
                d_j = self._a[j]
                i = doc[idx-1]
                if i not in d_j.keys():
                    d_j[i] = 1
                else:
                    d_j[i] += 1
        self._a['<UNKNOWN>'] = {'<UNKNOWN>':0}
        a = pd.DataFrame(self._a).fillna(0)
        num = (a+self.epsilon)
        denom = a.sum(axis=1, skipna=True)+a.shape[0]*self.epsilon
        self.a =  num.div(denom, axis=0) 
        

    def fit(self):
        '''
            Fits the algorithm to the provided corpus.
        '''
        self.__vocab()
        self.__a()
        self.__pi()
        return self
    
    def predict_log_proba(self, text:str)->float:
        '''
            Estimates the log probability of a given sequence.
            
            Parameter
            ---------
            `text`: str
                The text whose probability needs to be computed.
            
            Returns
            -------
            The sequence's log probability.
        '''
        text = word_tokenize(text.lower())
        proba_pi = np.log(self.pi[self.__check_pi(text[0])])
        proba_a = np.log([self.a.loc[self.__check_a(text[i], text[i+1])] for i, _ in enumerate(text[:-1])]) 
        return proba_pi + np.sum(proba_a)                                                   
    
    def predict_proba(self, text:str)->float:
        '''
            Estimates the probability of a given sequence.
            
            *Note:* The output may underflow to 0 for long sequences.
            
            Parameter
            ---------
            `text`: str
                The text whose probability needs to be computed.
            
            Returns
            -------
            The sequence's probability.
        '''
        return np.exp(self.predict_log_proba(text))
    
    def predict_log_proba_author(self, text:str)->float:
        '''
            Computes the unnormalized log posterior of the model's author for a given text, i.e. the numerator of Bayes' Theorem in log space.
            
            Parameter
            ---------
            `text`: str
                The text under scrutiny.
            
            Returns
            -------
            log p(text|author) + log p(author); the evidence term log p(text) is omitted.
        '''
        global probas
        return self.predict_log_proba(text) + np.log(probas[self.name])

In [7]:
# Finally fitting our Markov Models.
model_frost = MarkovModel(train_frost, 1, 'frost').fit()
model_poe = MarkovModel(train_poe, 1, 'poe').fit()

## Evaluation Stage
* Now we can use our fitted models to predict whether a given stanza was written by Edgar Allan Poe or Robert Frost.
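
Comparing the two log scores is enough to pick an author. If an actual value of $p(\text{author}|\text{text})$ is wanted, the scores can be normalized with the log-sum-exp trick, which sidesteps the underflow that exponentiating long-sequence log probabilities would cause. A minimal sketch, where the helper `posterior` and the scores are hypothetical:

```python
import math

def posterior(log_scores):
    """Turn log p(text|author) + log p(author) into p(author|text)."""
    top = max(log_scores.values())
    # Subtracting the largest score keeps exp() in a safe numeric range.
    shifted = {k: math.exp(v - top) for k, v in log_scores.items()}
    total = sum(shifted.values())
    return {k: s / total for k, s in shifted.items()}

# Hypothetical scores for a long strophe.
log_scores = {'frost': -800.0, 'poe': -802.0}
probs = posterior(log_scores)

# Exponentiating directly would underflow: math.exp(-800.0) is 0.0.
print(math.exp(-800.0), probs)
```

Only the difference between the scores matters after normalization, so the result stays well defined even when each individual probability is too small to represent.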

In [8]:
# I'm creating this object to automate the assessment.
from sklearn.metrics import f1_score

class Evaluator:
    '''
        This object handles the evaluation of our models' predictions.
        
        Parameters
        ----------
        `model_frost`: MarkovModel
            The Markov Model fitted with Robert Frost's stanzas.
        `model_poe`: MarkovModel
            The Markov Model fitted with Edgar Allan Poe's stanzas.
            
        Methods
        -------
        `predict`: Performs the author prediction for a given text.
        `f1_score`: Evaluates the f1-score for the predictions of a given set of texts.
    '''
    def __init__(self, model_frost:MarkovModel, model_poe:MarkovModel):
        self.model_frost = model_frost
        self.model_poe = model_poe
    
    def predict(self, text:str)->int:
        '''
            Predicts a given text's author (0='Robert Frost'; 1='Edgar Allan Poe')
            
            Parameter
            ---------
            `text`: str
                The strophe whose author will be predicted.
                
            Returns
            -------
            The predicted author's code.
        '''
        predictions = [self.model_frost.predict_log_proba_author(text), self.model_poe.predict_log_proba_author(text)]
        return np.argmax(predictions)
    
    def f1_score(self, texts:List[str], targets:List[int])->float:
        '''
            Measures the f1-score for the predictions of a given batch of texts.
            
            Parameters
            ----------
            `texts`: List[str]
                The batch of texts.
            `targets`: List[int]
                The true labels against which the f1-score is computed.
            
            Returns
            -------
            The f1-score.
        '''
        predictions = list(map(self.predict, texts))
        return f1_score(targets, predictions)

In [9]:
# Training performance. Note that the models have overfit the training sets!
Evaluator(model_frost, model_poe).f1_score(train_frost+train_poe, [0 for _ in train_frost]+[1 for _ in train_poe])

1.0

In [10]:
# The score drops substantially on the test set!
Evaluator(model_frost, model_poe).f1_score(test_frost+test_poe, [0 for _ in test_frost]+[1 for _ in test_poe])

0.5797101449275363

<p style='color:red'>Document our `Evaluator` and review the project! Once that is done, we can go back to the course.</p>