# Naive Bayes - Identifying the probable author of a message from their previous texts

Spam filters basically keep statistics on which words appear in messages classified as ham or spam. For this notebook, I was curious: could I first compile statistics on the word usage of a few people, and then, with these statistics in hand, identify the probable author of a snippet of text?

The success of this exercise depends heavily on the quality of the source texts. Still, it's an interesting toy project.

I know that the subject is polarising, but I went with speeches by the last two US Presidents for content.

Sources:
* Obama: the first few speeches from http://obamaspeeches.com/
* Trump: https://www.tampabay.com/florida-politics/buzz/2018/08/01/heres-a-full-transcript-of-president-trumps-speech-from-his-tampa-rally/ and https://www.politico.com/story/2018/09/25/trump-un-speech-2018-full-text-transcript-840043

And for the snippets of text, I used Twitter:
* tweet 1: https://twitter.com/BarackObama/status/1044690296917962754
* tweet 2: https://twitter.com/realDonaldTrump/status/1045444544068812800
* tweet 3: https://twitter.com/realDonaldTrump/status/1045003711104331776
* tweet 4: https://twitter.com/BarackObama/status/1039512025406349312
* tweet 5: https://twitter.com/BarackObama/status/1034892109868945409
* tweet 6: https://twitter.com/realDonaldTrump/status/1043966388182953984

**Spoiler**: On these 6 examples, we get a 5/6 success rate, which isn't perfect but is a good start. The first tweet comes out negative (<12%) for both authors.

In [1]:
import pandas as pd
import numpy as np
import copy
import re

In [2]:
%load_ext autoreload
%autoreload 2

# I am reusing the same update_probability function used in my previous naive bayes notebooks
from naive_bayes import update_probability

## Loading the source texts

This function loads the files containing the text from each author. The shape of the file doesn't matter much, since we flatten it into a dataframe that initially contains only "author" and "word". Later on, we add the statistics to it.

In [3]:
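def load_data(files):
    rows = []

    for author, file in files.items():
        # Read one row per non-empty line; recent pandas versions reject
        # read_csv(..., delimiter='\n')
        with open(file, encoding='utf-8') as fh:
            data = pd.DataFrame({'text': [line for line in fh if line.strip()]})
        data['text'] = data['text'].str.lower()
        # Keep letters, digits, @, # and apostrophes; everything else becomes a space
        data['text'] = data['text'].str.replace(r"[^A-Za-z0-9@#']", ' ', regex=True)
        data['text'] = data['text'].str.replace(r'\s+', ' ', regex=True)
        data['text'] = data['text'].str.strip()
        data['text'] = data['text'].str.split()
        for ll in data['text']:
            for ww in ll:
                # Collect rows in a list; DataFrame.append was removed in pandas 2.0
                rows.append({'author': author, 'word': ww})
    return pd.DataFrame(rows, columns=['author', 'word'])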

## Cleaning unwanted characters from the lines

This clean_line() function looks a lot like the cleaning part of the load_data() function. However, because we are working with plain strings here instead of dataframes, I kept the loading and the cleaning separate.

TODO: see if I can reuse something like this function inside load_data(), instead of duplicating commands

In [4]:
def clean_line(line):

    line = line.lower()
    line = re.sub(r'[^\w\s\'\@\#]',' ',line)
    line = re.sub(r'\s+', ' ', line)
    line = line.strip()
    line = line.split(" ")
    return line

def clean_lines(lines):
    newlines = []
    for l in lines:
        newlines.append(clean_line(l))
    return newlines
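
One possible way to address the TODO above is a single tokenizer that both the file loader and the tweet cleaner could call. This is only a sketch: `tokenize` is a hypothetical helper that is not part of the notebook, and it uses the character class from clean_line(), which differs slightly from the one in load_data() (`\w` also keeps underscores and accented letters).

```python
import re

def tokenize(line):
    """Lowercase a line, replace punctuation other than ' @ # with spaces, and split it into words."""
    line = line.lower()
    line = re.sub(r"[^\w\s'@#]", ' ', line)
    # split() without arguments collapses runs of whitespace and
    # returns [] for an empty line, whereas split(" ") would return ['']
    return line.split()

print(tokenize("Going to New York. It will all work out!"))
```

With such a helper, load_data() could call it on each line of a file and clean_lines() could call it on each tweet, so both paths would share the same cleaning rules.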

## Loading the training data

This is where we load the training data and assign each text to its author. We can easily add more authors if desired, unlike traditional spam filters that only discriminate between good and bad text.

In [5]:
files = {
    'obama': 'training_obama.txt', 
    'trump': 'training_trump.txt'
}


df = load_data(files)
df.describe()

       author   word
count   28647  28647
unique      2   3374
top     obama    the
freq    17980   1280


## Adding the statistics 

This function calculates the following for each (author, word) pair:
* How many times the author used the word, divided by the total number of times anyone used it (true positives)
* How many times the others used that same word, divided by the total number of times anyone used it (false positives)
* True negatives = 1 - true positives
* False negatives = 1 - false positives

In addition, we keep the count of each word for the author and for the others, which is useful for filtering later on. Each count is incremented by 1 before dividing, so no probability is ever exactly 0 or 1.
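
As a quick worked example of these four ratios, the sketch below plugs in the counts printed for the word "the" in the output further down (828 for obama, 454 for trump, both already including the +1). The `word_rates` helper is hypothetical and only mirrors the arithmetic inside add_statistics().

```python
def word_rates(c_author, c_others):
    """Return the four ratios for one (author, word) pair from its two counts."""
    total = c_author + c_others
    tp = c_author / total   # share of the word's uses that belong to the author
    fp = c_others / total   # share that belongs to everyone else
    return {'tp': tp, 'tn': 1 - tp, 'fp': fp, 'fn': 1 - fp}

rates = word_rates(828, 454)
print({name: round(value, 6) for name, value in rates.items()})
```

Note that with these definitions `tn` always equals `fp` and `fn` always equals `tp`, since the two shares add up to 1.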

In [6]:
def add_statistics(df):
    columns = ['author', 'word', 'c_author', 'c_others', 'tp', 'tn', 'fp', 'fn']
    rows = []
    seen = set()  # (author, word) pairs that already have a metrics row

    for index, row in df.iterrows():
        key = (row['author'], row['word'])
        if key not in seen:
            seen.add(key)
            # +1 on each count so that no probability is ever exactly 0 or 1
            count_word_for_author = len(df.loc[(df['word'] == row['word']) & (df['author'] == row['author'])].index)+1
            count_word_for_others = len(df.loc[(df['word'] == row['word']) & (df['author'] != row['author'])].index)+1
            total = count_word_for_author + count_word_for_others

            # Collect rows in a list; DataFrame.append was removed in pandas 2.0
            rows.append({
                'author': row['author'],
                'word': row['word'],
                'c_author': count_word_for_author,
                'c_others': count_word_for_others,
                'tp': count_word_for_author/total,
                'tn': 1 - (count_word_for_author/total),
                'fp': count_word_for_others/total,
                'fn': 1 - (count_word_for_others/total)
            })

    return pd.DataFrame(rows, columns=columns)

In [7]:
metrics_unfiltered = add_statistics(df)
print(metrics_unfiltered.sort_values(by=['c_author', 'c_others'], ascending=False))

     author           word c_author c_others        tp        tn        fp  \
9     obama            the      828      454  0.645866  0.354134  0.354134   
35    obama            and      756      391  0.659111  0.340889  0.340889   
30    obama             to      536      311  0.632822  0.367178  0.367178   
20    obama             of      471      230  0.671897  0.328103  0.328103   
2353  trump            the      454      828  0.354134  0.645866  0.645866   
101   obama           that      410      118  0.776515  0.223485  0.223485   
2368  trump            and      391      756  0.340889  0.659111  0.659111   
108   obama              a      352      163  0.683495  0.316505  0.316505   
87    obama             we      316      216  0.593985  0.406015  0.406015   
2344  trump             to      311      536  0.367178  0.632822  0.632822   
83    obama             in      272      176  0.607143  0.392857  0.392857   
23    obama            our      253      138  0.647059  0.352941

## Setting the priors

We have two sources and no other information about who wrote a given text, so each source gets a prior of 50%.

In [9]:
prior = {}
users = sorted(metrics_unfiltered['author'].unique())
for u in users:
    prior[u] = 1 / len(metrics_unfiltered['author'].unique())
    print("{}: {:.2f}".format(u, prior[u]))

obama: 0.50
trump: 0.50


## Analysing text snippets

The analyse_texts() function converts the metrics dataframe into a format that the update_probability() function can handle.

The function goes through each word of the text snippet and checks whether there is a probability associated with it (only the probabilities of the target author are passed in as a parameter). If a word isn't found, which means that the author never used that word before, we assign it a probability of 50%, which basically leaves the posterior probability unchanged.

Experimentation notes: I did run some tests with a 49% probability instead of 50% for never-used words, which added a negative bias for new words. I removed it because I also added a filter that keeps only the bottom 80% of the words, removing overused English words.
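
To see why 50% is a neutral value, here is a minimal sketch of a single Bayes update. `bayes_update` is a hypothetical stand-in written for this illustration; the imported update_probability() is not shown in this notebook and may be implemented differently.

```python
def bayes_update(prior, p_word_if_author, p_word_if_others):
    """Return P(author | word) from the prior and the two likelihoods."""
    numerator = p_word_if_author * prior
    evidence = numerator + p_word_if_others * (1 - prior)
    return numerator / evidence

# An unseen word (0.5 / 0.5) leaves the prior where it was
print(bayes_update(0.5, 0.5, 0.5))

# A word the author uses more often than the others pushes it up
print(bayes_update(0.5, 0.8, 0.2))
```

When both likelihoods are equal they cancel out of the fraction, so the posterior equals the prior whatever the prior is. Setting the unseen-word value slightly below 0.5, as in the 49% experiment, would instead lower the posterior a little for every new word.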

In [10]:
def analyse_texts(prior, texts, probabilities, debug):
    posterior = prior
    tests = {}

    for tt in texts:
        prob = probabilities.loc[probabilities['word'] == tt]

        if tt not in tests.keys():
            if not prob.empty:
                tests[tt] = {
                    'True': {
                        'Positive': prob.head(1)['tp'].values[0],
                        'Negative': prob.head(1)['tn'].values[0]
                    },
                    'False': {
                        'Positive': prob.head(1)['fp'].values[0],
                        'Negative': prob.head(1)['fn'].values[0]
                    }
                }
            else:
                tests[tt] = {
                    'True': {
                        'Positive': 0.5,
                        'Negative': 0.5
                    },
                    'False': {
                        'Positive': 0.5,
                        'Negative': 0.5
                    }
                }
        if not prob.empty:
            posterior = update_probability(posterior, tt, tests, 'True', debug)
        else:
            posterior = update_probability(posterior, tt, tests, 'False', debug)

    return posterior

In [11]:
def lookup_text(lines, prior, metrics):
    for ll in lines:
        print('Given the text "{}"...'.format(" ".join(ll)))
        for author in prior.keys():
            posterior = analyse_texts(prior[author], ll, metrics.loc[metrics['author'] == author], False)
            print("\tProbability that {} wrote this text is: {:.3f}%".format(author, 100 * posterior))
        print("\n")

## Filtering metrics to keep only the bottom 80% of used words

I am removing overused words here because they skew the probabilities in the direction of the author who uses the most articles...
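
The effect of a quantile filter like this can be seen on a toy table. The words and counts below are made up for illustration and are not taken from the training data.

```python
import pandas as pd

# Made-up counts: one very frequent word and four rarer ones
toy = pd.DataFrame({
    'word': ['the', 'and', 'change', 'senate', 'riveting'],
    'c_author': [800, 5, 3, 2, 1],
})

# Keep only the rows whose count is strictly below the 80th percentile
threshold = toy['c_author'].quantile(.8)
kept = toy[toy['c_author'] < threshold]
print(kept['word'].tolist())
```

Only the most frequent word is dropped here. On the real metrics the same comparison removes the rows with the highest per-author counts, which are mostly articles and other very common words.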

In [12]:
metrics = metrics_unfiltered[metrics_unfiltered.c_author < metrics_unfiltered.c_author.quantile(.8)]
print(metrics.sort_values(by=['c_author', 'c_others'], ascending=False).head())
print(len(metrics['word'].unique()))

     author        word c_author c_others        tp        tn        fp  \
3705  trump      change        5       68  0.068493  0.931507  0.931507   
2821  trump      cannot        5       31  0.138889  0.861111  0.861111   
1468  obama       doing        5       26  0.161290  0.838710  0.838710   
3266  trump  washington        5       26  0.161290  0.838710  0.838710   
2342  trump          am        5       25  0.166667  0.833333  0.833333   

            fn  
3705  0.068493  
2821  0.138889  
1468  0.161290  
3266  0.161290  
2342  0.166667  
2955


## Querying our model using tweets

We are now ready to throw tweets at our model. The text comes from the following tweets:

* tweet 1: https://twitter.com/BarackObama/status/1044690296917962754
* tweet 2: https://twitter.com/realDonaldTrump/status/1045444544068812800
* tweet 3: https://twitter.com/realDonaldTrump/status/1045003711104331776
* tweet 4: https://twitter.com/BarackObama/status/1039512025406349312
* tweet 5: https://twitter.com/BarackObama/status/1034892109868945409
* tweet 6: https://twitter.com/realDonaldTrump/status/1043966388182953984

This function could also be used to analyse each sentence of an anonymous op-ed against the known speeches of candidate writers, and to evaluate whether the full text is more likely to have been written by a single author or by several.
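
For the op-ed idea, the text would first need to be cut into sentences. A naive way to do that is sketched below; `split_sentences` is a hypothetical helper, and its rule (split after `.`, `!` or `?` followed by whitespace) will mis-split abbreviations such as "Mr. Smith".

```python
import re

def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

sample = "Going to New York. Will be with Prime Minister Abe of Japan tonight. It will all work out!"
for sentence in split_sentences(sample):
    print(sentence)
```

The resulting list could then be passed to clean_lines() and lookup_text() in the same way as the list of tweets.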

In [13]:
lines_of_text = ["The antidote to government by a powerful few is government by the organized, energized many. This National Voter Registration Day, make sure you're registered, vote early if you can, or show up on November 6. This moment is too important to sit out.",
                "Judge Kavanaugh showed America exactly why I nominated him. His testimony was powerful, honest, and riveting. Democrats’ search and destroy strategy is disgraceful and this process has been a total sham and effort to delay, obstruct, and resist. The Senate must vote!",
                "Congressman Lee Zeldin is doing a fantastic job in D.C. Tough and smart, he loves our Country and will always be there to do the right thing. He has my Complete and Total Endorsement!",
                "We will always remember everyone we lost on 9/11, thank the first responders who keep us safe, and honor all who defend our country and the ideals that bind us together. There's nothing our resilience and resolve can’t overcome, and no act of terror can ever change who we are.",
                "Yesterday I met with high school students on Chicago’s Southwest side who spent the summer learning to code some pretty cool apps. Michelle and I are proud to support programs that invest in local youth and we’re proud of these young people.",
                "Going to New York. Will be with Prime Minister Abe of Japan tonight, talking Military and Trade. We have done much to help Japan, would like to see more of a reciprocal relationship. It will all work out!"]

clean_lines_of_texts = clean_lines(lines_of_text)

# lookup_text() prints its results and returns nothing, so there is no value to keep
lookup_text(clean_lines_of_texts, prior, metrics)

Given the text "the antidote to government by a powerful few is government by the organized energized many this national voter registration day make sure you're registered vote early if you can or show up on november 6 this moment is too important to sit out"...
	Probability that obama wrote this text is: 10.917%
	Probability that trump wrote this text is: 11.868%


Given the text "judge kavanaugh showed america exactly why i nominated him his testimony was powerful honest and riveting democrats search and destroy strategy is disgraceful and this process has been a total sham and effort to delay obstruct and resist the senate must vote"...
	Probability that obama wrote this text is: 62.327%
	Probability that trump wrote this text is: 90.909%


Given the text "congressman lee zeldin is doing a fantastic job in d c tough and smart he loves our country and will always be there to do the right thing he has my complete and total endorsement"...
	Probability that obama wrote this text is: 23

## How to improve?

This notebook uses Naive Bayes, which naively considers each word as an independent event. This isn't true in reality, but it allows us to run a simple function over all the words without needing much computing power.

* The accuracy could most probably be improved by using Markov chains. To do in a future notebook...
* Another approach would be through machine learning, whatever the model used... also for a future notebook.