<a href="https://colab.research.google.com/github/ericmcai/Projects/blob/main/Assessing_Pauline_Authorship_Using_Logistic_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Assessing Pauline Authorship of Disputed Texts Using Logistic Regression and NLP

### Background

This project was inspired by [Dr. Wei Hu's study of Pauline epistles in the New Testament using machine learning](https://www.scirp.org/journal/paperinformation.aspx?paperid=30473). Modern scholarly consensus holds that 7 letters are undisputedly the work of the apostle Paul: Romans, 1 & 2 Corinthians, Galatians, Philippians, 1 Thessalonians, and Philemon.

The main reason for questioning the remaining 7 traditional Pauline epistles is their stylistic differences from the undisputed letters, an argument that N.T. Wright ably challenges in his <i>Paul and the Faithfulness of God</i>:

>“I’ve never been very impressed with arguments like that. “Paul couldn’t have said this, because he never says this kind of thing, as far as we know.” But what if he just said it in the passage? Then it would be the kind of thing he would say. I find this to be especially problematic given the contextual nature of these letters. It strikes me as kind of like saying, C.S. Lewis couldn’t have written <i> The Space Trilogy </i> because he never talked about aliens in <i>The Chronicles of Narnia </i> and the former is written for adults and clearly the latter is for children. Or it’s like saying “Oh, Bob could never have talked about that with his girlfriend Gina. I know that because I know what he talks about with his mother”...Arguments from style are clearly important in principle. But they are hard to make in practice.” <i> I, 60.</i>
>

Point well taken. Nevertheless, as an exercise, I want to ask: using the uncontested letters of the apostle Paul as training data, can we discover any relationship between them and the disputed letters on vocabulary and word frequency alone, setting aside the dating, historicity, theology, and provenance of the letters?

### Methods

This project would ideally use the original Greek; however, processing languages other than English for machine learning is a nightmare, so I will be using the NASB, given its idiosyncratically literal, wooden translation, as my source data. The source data can be found here: https://my-bible-study.appspot.com/.

A logistic regression model is trained on a set of the apostle Paul's epistles where his authorship is undisputed, as well as a set of epistles where he is definitely not the author, like the Johannine letters. The sets will be randomly sampled to prevent biasing the model toward any specific part of the epistle, e.g. greeting, body, or closing.

To test the model, a validation set will be used. Each verse from the disputed epistles set will be classified as likely having been written by Paul (authentic, boolean value of True) or not likely (inauthentic, boolean value of False).
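
The approach described above can be sketched on toy data. The four "verses" and their labels below are hypothetical and are not taken from the NASB source; the sketch only illustrates how word counts feed a logistic regression that then labels an unseen verse.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy "verses" (assumption: invented for illustration only)
texts = [
    "justified by faith apart from works of the law",  # labeled Pauline
    "grace and peace through faith",                    # labeled Pauline
    "beloved let us love one another",                  # labeled non-Pauline
    "walk in the light and love the brethren",          # labeled non-Pauline
]
labels = [True, True, False, False]

# Turn each verse into a vector of word counts (bag of words)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Fit a logistic regression on the word counts
model = LogisticRegression(solver="liblinear")
model.fit(X, labels)

# Classify an unseen verse using the vocabulary learned above
new_verse = ["saved by grace through faith"]
prediction = model.predict(vectorizer.transform(new_verse))
print(prediction)
```

Words the vectorizer never saw during fitting (here, "saved") are simply ignored at prediction time, so a verse is judged only on vocabulary it shares with the training data.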

### Overall workflow

<ol>
    <li> Format source data </li>
    <li> Create data sets </li>
    <li> Train logistic model </li>
    <li> Analyze predictions </li>
</ol>

### Format source data

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

The first step is to load the data into a pandas DataFrame and normalize the verse text.

In [11]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

import string

def normalize_string(s):
    # Strip all ASCII punctuation, then lowercase the verse text
    return s.translate(str.maketrans('', '', string.punctuation)).lower()

# Load the verse table (it has no header row) and the book-id lookup table
nasb = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/Logistic Regression Project/NASB_fixed.csv', header=None)
nasb.columns = ['id', 'chapter', 'verse', 'text']
key = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/Logistic Regression Project/key_english.csv')

# Map each numeric book id to its English book name
id_to_book = pd.Series(key['book'].values, index=key['id']).to_dict()
nasb['book'] = nasb['id'].map(id_to_book)
nasb['text'] = nasb['text'].apply(normalize_string)
nasb = nasb[['id', 'book', 'chapter', 'verse', 'text']]

Mounted at /content/gdrive
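
As a quick check of the normalization step, here is a compact form of `normalize_string` that is meant to behave the same way, applied to a hypothetical sample sentence (the sentence is an assumption, not a row from the CSV).

```python
import string

def normalize_string(s):
    # Delete every ASCII punctuation character, then lowercase
    return s.translate(str.maketrans("", "", string.punctuation)).lower()

# Hypothetical input, written to resemble the opening of Romans
sample = "Paul, a bond-servant of Christ Jesus, called as an apostle;"
normalized = normalize_string(sample)
print(normalized)  # paul a bondservant of christ jesus called as an apostle
```

Note that `string.punctuation` covers only ASCII punctuation, so characters such as curly quotes would pass through unchanged if the source file contains them.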


### Create datasets

As mentioned above in Methods, the next step is to create our training and validation sets. The training data comprises an equal number of verses from the undisputed epistles of the apostle Paul and from the non-Pauline epistles. The trained model is then applied to the disputed letters, which serve as the test set.

#### Training Data

In [12]:
# Creating training dataset - Authentic
authentic = nasb.loc[
    (nasb['book'] == 'Romans')|
    (nasb['book'] == '1 Corinthians')|
    (nasb['book'] == '2 Corinthians')|
    (nasb['book'] == 'Galatians')|
    (nasb['book'] == 'Philippians')|
    (nasb['book'] == '1 Thessalonians')|
    (nasb['book'] == 'Philemon')
]

authentic_sample_1 = authentic.sample(n=250, random_state=1)
authentic_sample_2 = authentic.sample(n=250, random_state=2)
authentic_sample_3 = authentic.sample(n=250, random_state=3)

authentic_sample_1['authenticity'] = True
authentic_sample_2['authenticity'] = True
authentic_sample_3['authenticity'] = True

print('Created dataset - Authentic')
print(authentic.head())

Created dataset - Authentic
       id    book  chapter  verse  \
27569  45  Romans        1      1   
27570  45  Romans        1      2   
27571  45  Romans        1      3   
27572  45  Romans        1      4   
27573  45  Romans        1      5   

                                                    text  
27569  paul a bondservant of christ jesus called as a...  
27570  which he promised beforehand through his proph...  
27571  concerning his son who was born of a descendan...  
27572  who was declared the son of god with power by ...  
27573  through whom we have received grace and apostl...  


In [13]:
# Creating training dataset - Inauthentic
inauthentic = nasb.loc[
    (nasb['book'] == 'James')|
    (nasb['book'] == '1 Peter')|
    (nasb['book'] == '2 Peter')|
    (nasb['book'] == '1 John')|
    (nasb['book'] == '2 John')|
    (nasb['book'] == '3 John')|
    (nasb['book'] == 'Jude')
]

inauthentic_sample_1 = inauthentic.sample(n=250, random_state=1)
inauthentic_sample_2 = inauthentic.sample(n=250, random_state=2)
inauthentic_sample_3 = inauthentic.sample(n=250, random_state=3)

inauthentic_sample_1['authenticity'] = False
inauthentic_sample_2['authenticity'] = False
inauthentic_sample_3['authenticity'] = False

print('Created dataset - Inauthentic')

Created dataset - Inauthentic


In [14]:
# Creating validation sets

# Row labels (the DataFrame index) of every verse drawn into a training sample.
# The 'id' column holds the book id, so it cannot identify individual verses.
pauline = pd.concat([
    authentic_sample_1,
    authentic_sample_2,
    authentic_sample_3
]).index

non_pauline = pd.concat([
    inauthentic_sample_1,
    inauthentic_sample_2,
    inauthentic_sample_3
]).index

# Verses that were never sampled for training are held out
validation_pauline = authentic[~authentic.index.isin(pauline)]
validation_nonpauline = inauthentic[~inauthentic.index.isin(non_pauline)]
print('Sets validated')

Sets validated
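
Because the three samples above are drawn independently from the same pool, one verse can land in more than one of them. If unique training rows are preferred, one option (an assumption here, not what the notebook does) is a single draw without replacement, with the remainder held out by row label. The `pool` frame below is a hypothetical stand-in for `authentic` or `inauthentic`.

```python
import pandas as pd

# Hypothetical stand-in for one class of verses; all rows share one book id
pool = pd.DataFrame(
    {"id": [45] * 10, "text": [f"verse {i}" for i in range(10)]},
    index=range(100, 110),
)

# One draw without replacement: every training row is unique
train = pool.sample(n=6, random_state=1)

# Hold out the rest by row label, since 'id' is shared by a whole book
holdout = pool[~pool.index.isin(train.index)]

print(len(train), len(holdout))  # 6 4
```

Keeping the training rows unique also means that a later `train_test_split` cannot place copies of the same verse on both sides of the split.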


#### Testing Data (Disputed Texts)

In [15]:
disputed = nasb.loc[
    (nasb['book'] == 'Ephesians')|
    (nasb['book'] == 'Colossians')|
    (nasb['book'] == '2 Thessalonians')|
    (nasb['book'] == '1 Timothy')|
    (nasb['book'] == '2 Timothy')|
    (nasb['book'] == 'Titus')
]
print('Testing dataset name: disputed')

Testing dataset name: disputed


### Training Model

In [16]:
# Combine
authentic_training_data = pd.concat([authentic_sample_1,
                                     authentic_sample_2,
                                     authentic_sample_3],
                                    ignore_index=True)
inauthentic_training_data = pd.concat([inauthentic_sample_1,
                                       inauthentic_sample_2,
                                       inauthentic_sample_3],
                                      ignore_index=True)
combined = pd.concat([authentic_training_data,
                      inauthentic_training_data],
                     ignore_index=True)

# Split the sampled data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(combined['text'],
                                                      combined['authenticity'],
                                                      test_size=0.2,
                                                      random_state=42)

# Vectorize
vectorizer = CountVectorizer(stop_words=['is',
                                         'in',
                                         'therefore',
                                         'on',
                                         'and',
                                         'by',
                                         'with',
                                         'from',
                                         'to'])
# Learn the vocabulary on the training split only, then reuse it for validation
X_train = vectorizer.fit_transform(X_train)
X_valid = vectorizer.transform(X_valid)


# Create the model and train the logistic regression model
log_model = LogisticRegression(solver='liblinear')
log_model.fit(X_train, y_train)

# Calculate the accuracy of the model
y_valid_pred = log_model.predict(X_valid)
valid_accuracy = accuracy_score(y_valid, y_valid_pred)
print("Validation Accuracy:", valid_accuracy)

# Vectorize test data
X_test = vectorizer.transform(disputed['text'])

# Predict the labels
y_pred = log_model.predict(X_test)
copy = disputed.copy()
copy['authenticity'] = y_pred
print(copy)

Validation Accuracy: 0.85
       id       book  chapter  verse  \
28841  49  Ephesians        1      1   
28842  49  Ephesians        1      2   
28843  49  Ephesians        1      3   
28844  49  Ephesians        1      4   
28845  49  Ephesians        1      5   
...    ..        ...      ...    ...   
29568  56      Titus        3     11   
29569  56      Titus        3     12   
29570  56      Titus        3     13   
29571  56      Titus        3     14   
29572  56      Titus        3     15   

                                                    text  authenticity  
28841  paul an apostle of christ jesus by the will of...          True  
28842  grace to you and peace from god our father and...          True  
28843  blessed be the god and father of our lord jesu...         False  
28844  just as he chose us in him before the foundati...         False  
28845  he predestined us to adoption as sons through ...          True  
...                                                  ..

## Analyzing the results

In [23]:
print('Pauline similarity by letter')
print('--------------------------')
print(copy.groupby('book')['authenticity'].mean())
print('\nAverage Similarity of all Disputed Letters')
print((copy.groupby('book')['authenticity'].mean()).mean())
print('\nPauline similarity by chapter')
print('-------------------------------')
print(copy.groupby(['book', 'chapter'])['authenticity'].mean())
print('\nPauline similarity by "Deutero-Pauline" epistles')
print('--------------------------------------------------')
print(copy.loc[disputed['book'].isin(['2 Thessalonians', 'Colossians', 'Ephesians'])].groupby(['book'])['authenticity'].mean())
print('\nAverage of "Deutero-Pauline" epistles')
print((copy.loc[disputed['book'].isin(['2 Thessalonians', 'Colossians', 'Ephesians'])].groupby(['book'])['authenticity'].mean()).mean())
print('\nPauline similarity by Pastoral epistles')
print('--------------------------------------------------')
print(copy.loc[disputed['book'].isin(['1 Timothy', '2 Timothy', 'Titus'])].groupby(['book'])['authenticity'].mean())
print('\nAverage of Pastoral epistles')
print((copy.loc[disputed['book'].isin(['1 Timothy', '2 Timothy', 'Titus'])].groupby(['book'])['authenticity'].mean()).mean())


Pauline similarity by letter
--------------------------
book
1 Timothy          0.592920
2 Thessalonians    0.574468
2 Timothy          0.698795
Colossians         0.589474
Ephesians          0.645161
Titus              0.565217
Name: authenticity, dtype: float64

Average Similarity of all Disputed Letters
0.6110059976081716

Pauline similarity by chapter
-------------------------------
book             chapter
1 Timothy        1          0.600000
                 2          0.733333
                 3          0.750000
                 4          0.687500
                 5          0.600000
                 6          0.285714
2 Thessalonians  1          0.500000
                 2          0.705882
                 3          0.500000
2 Timothy        1          0.666667
                 2          0.615385
                 3          0.764706
                 4          0.772727
Colossians       1          0.724138
                 2          0.565217
                 3          0.
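
The figures above are group means of a boolean column, so each one is the share of verses in that group labeled `True`. The "average of all letters" is then a mean of per-letter means, which weights every letter equally regardless of its length. A small sketch with hypothetical labels for two made-up letters shows the difference between that and a per-verse average.

```python
import pandas as pd

# Hypothetical labels for five verses in two made-up letters
toy = pd.DataFrame({
    "book": ["A", "A", "A", "B", "B"],
    "authenticity": [True, True, False, True, False],
})

# Mean of booleans per group = share of verses labeled True
share_true = toy.groupby("book")["authenticity"].mean()

# Mean of per-letter means: each letter counts once
per_letter_average = share_true.mean()

# Mean over all verses: longer letters count for more
per_verse_average = toy["authenticity"].mean()

print(share_true)
print(per_letter_average, per_verse_average)
```

Either average is defensible; they answer slightly different questions, so it helps to say which one is being reported.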

## Visualizing the Results

In [18]:
import bokeh.io
import bokeh.plotting
import bokeh.palettes
bokeh.io.output_notebook()

In [20]:
df = copy.groupby('book')['authenticity'].mean().reset_index()
# plt
# df.plot(kind='bar',
#         x='Letter',
#         y='Authenticity',
#         color='blue')
# plt.title('Bar Plot of Pauline Authenticity')
# plt.show()

#bokeh
epistles = df['book'].tolist()
authenticity = df['authenticity'].tolist()

from bokeh.models import ColumnDataSource
from bokeh.palettes import Spectral, BuGn
from bokeh.plotting import figure, show, output_file
from bokeh.transform import factor_cmap

source = ColumnDataSource(data=dict(epistles=epistles, authenticity=authenticity))
p = figure(x_range=epistles,title="Pauline similarity by letter")
p.vbar(x='epistles',
       top='authenticity',
       width=0.9, source=source,
       legend_field = 'epistles',
       line_color='white',
       fill_color=factor_cmap('epistles', palette=Spectral[6], factors=epistles))
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.y_range.end = 1
p.legend.orientation = "horizontal"
p.legend.location = "top_center"
output_file('similarity.html')
show(p)

# sorting
sorted_epistles = sorted(epistles, key=lambda x:authenticity[epistles.index(x)])
p_sorted = figure(x_range=sorted_epistles, title='Similarity (from least to greatest)')
p_sorted.vbar(x='epistles',
       top='authenticity',
       width=0.9, source=source,
       legend_field = 'epistles',
       line_color='white',
       fill_color=factor_cmap('epistles', palette=Spectral[6], factors=epistles))
p_sorted.xgrid.grid_line_color = None
p_sorted.y_range.start = 0
p_sorted.y_range.end = 1
p_sorted.legend.orientation = "horizontal"
p_sorted.legend.location = "top_center"
output_file('most similar.html')
show(p_sorted)

## Summarizing the Results
------
As the results show, the model seems to demonstrate remarkable continuity between the undisputed letters and even the most disputed of the letters ascribed to the apostle Paul. Modern consensus regards the Pastoral Epistles as the most disputed of these. Yet, averaged across the three Pastoral Epistles, about 61.9% of verses are classified as Pauline, that is, as closer in vocabulary to the undisputed letters than to the non-Pauline epistles.
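
The group averages can be recomputed from the per-letter means printed in the analysis output. The sketch below copies those six values by hand; the grouping into Pastoral and "Deutero-Pauline" letters follows the analysis cell.

```python
from statistics import mean

# Per-letter share of verses classified as Pauline, copied from the output above
per_letter = {
    "1 Timothy": 0.592920,
    "2 Thessalonians": 0.574468,
    "2 Timothy": 0.698795,
    "Colossians": 0.589474,
    "Ephesians": 0.645161,
    "Titus": 0.565217,
}

pastoral = mean(per_letter[b] for b in ("1 Timothy", "2 Timothy", "Titus"))
deutero = mean(per_letter[b] for b in ("2 Thessalonians", "Colossians", "Ephesians"))
overall = mean(per_letter.values())

print(round(pastoral, 3), round(deutero, 3), round(overall, 3))  # 0.619 0.603 0.611
```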

A limitation of the project, however, is that it counts individual words only and does not take multi-word phrases into account.
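
One possible way to bring phrases in, offered as a sketch rather than something this notebook does, is the `ngram_range` parameter of `CountVectorizer`, which adds word pairs alongside single words. The sample verse is a hypothetical, already-normalized string.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-normalized verse
verse = ["grace to you and peace from god our father"]

# Single words only (the default behavior)
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(verse)

# Single words plus adjacent word pairs
with_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(verse)

print(len(unigrams.vocabulary_), len(with_bigrams.vocabulary_))  # 9 17
print("grace to" in with_bigrams.vocabulary_)  # True
```

The feature space grows quickly with longer n-grams, so on a corpus this small the extra phrase features may need pruning (for example with `min_df`) to avoid overfitting.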

### Takeaways

1.   Word counts alone probably cannot capture every criterion relevant to authorship. Style, historical setting, and theology must still be weighed to assess authorship holistically.
2.   On the other hand, a word count-focused project demonstrates that, at the very least, the disputed and undisputed letters draw on an overlapping vocabulary: averaged across the six disputed letters, the model labels roughly 61% of verses as Pauline on word counts alone.

