# Translation Exploration & Data Cleaning
Author Brian Tam, 11/02/2020

This notebook cleans the [Bible corpus](https://www.kaggle.com/oswinrh/bible) as an intermediate step to prepare it for modeling.
Specifically, this initial pass explored the different translations and their individual advantages:
1. Total vocabulary (for the purposes of dimensionality reduction)
2. How true the translation is to the original Greek/Hebrew

For a detailed breakdown, see [this chart](https://commonwaychurch.com/wp-content/uploads/2015/11/bibletranslationchart.pdf).

There is a huge variety of weird Bible versions, including [this one](https://www.cnet.com/news/bible-from-a-z-software-rewrites-entire-king-james-version-alphabetically/).

Ultimately, I decided to use the BBE translation for its inherently smaller vocabulary, which leads to natural dimensionality reduction.
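
The vocabulary-size criterion can be sketched in plain Python. This is only an illustration: the two toy corpora below are hypothetical and are not taken from the CSV files, and the helper `vocabulary` is an assumed name, not part of the notebook. The notebook itself measures vocabulary with `TfidfVectorizer`.

```python
import re

def vocabulary(verses):
    """Return the set of distinct lowercase word tokens across all verses."""
    vocab = set()
    for verse in verses:
        vocab.update(re.findall(r"[a-z]+", verse.lower()))
    return vocab

# Hypothetical toy corpora standing in for two translations
corpora = {
    "toy_a": ["God made the light", "God made the earth"],
    "toy_b": ["God created the light", "God fashioned the earth"],
}

# Vocabulary size per corpus, and the corpus with the fewest distinct words
sizes = {name: len(vocabulary(verses)) for name, verses in corpora.items()}
smallest = min(sizes, key=sizes.get)
print(sizes, smallest)
```

A smaller vocabulary means fewer columns in the document-term matrix, which is the sense in which the translation choice acts as dimensionality reduction.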

In [1]:
# Get pandas and postgres to work together
from sqlalchemy import create_engine
import psycopg2 as pg
import pandas as pd
import numpy as np
import pickle 

# Pandas overrides for visuals
pd.set_option('display.max_colwidth', 1)

# Text Preprocessing
import re
import string
# Import spacy to do NLP
import spacy
parser = spacy.load('en_core_web_sm')

# Import sklearn to do CountVectorizing and TF-IDF document-term matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

import matplotlib.pyplot as plt

# Import custom spaCy preprocessing
from utilities.text_cleaning import spacy_tokenizer

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2

# KJV translation

The King James Bible is an English translation of the Christian Bible commissioned for the Church of England in 1604 and completed and published in 1611.
- One of the oldest and most well-respected English versions of the Bible
- Written in Early Modern English (so modern toolkits like spaCy may not process the words correctly)

In [None]:
kjv=pd.read_csv('bible_corpus/bible_databases-master/t_kjv.csv')

In [None]:
kjv['cleaned']=kjv['field.4'].apply(spacy_tokenizer)

In [None]:
# Use the cleaned verses as the input X
X = kjv.cleaned

In [None]:
tfidf1 = TfidfVectorizer()
X_train_tfidf1 = tfidf1.fit_transform(X)
len(pd.DataFrame(X_train_tfidf1.toarray(), columns=tfidf1.get_feature_names()).columns)

In [None]:
pd.DataFrame(X_train_tfidf1.toarray(), columns=tfidf1.get_feature_names())

In [None]:
# Acronyms: Latent Semantic Analysis (LSA) is just another name for
#  Singular Value Decomposition (SVD) applied to Natural Language Processing (NLP)
# The topic model in this cell uses Non-negative Matrix Factorization (NMF) with 10 topics
from sklearn.decomposition import NMF

TopicModel = NMF(10)
doc_topic = TopicModel.fit_transform(pd.DataFrame(X_train_tfidf1.toarray(), columns=tfidf1.get_feature_names()))

In [None]:
topics = display_topics(TopicModel, tfidf1.get_feature_names(), 3)
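
`display_topics` is not defined or imported anywhere in this notebook, so its real implementation is unknown. The following is a minimal sketch of what such a helper might look like, assuming it returns one label per topic built from that topic's highest-weighted words (which is how `topics` is used as an index and as column names later). The toy model and feature names are hypothetical.

```python
from types import SimpleNamespace

def display_topics(model, feature_names, no_top_words):
    """Return one label per topic, built from its highest-weighted words."""
    labels = []
    for weights in model.components_:
        # Indices of the largest weights, highest first
        top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        labels.append(", ".join(feature_names[i] for i in top[:no_top_words]))
    return labels

# Stand-in for a fitted model: any object exposing `components_` works here
toy_model = SimpleNamespace(components_=[[0.1, 0.9, 0.5], [0.7, 0.0, 0.2]])
toy_labels = display_topics(toy_model, ["earth", "god", "light"], 2)
print(toy_labels)
```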

In [None]:
topic_word = pd.DataFrame(TopicModel.components_.round(3),
             index =  topics,
             columns = tfidf1.get_feature_names())
topic_word.head(10)

In [None]:
X_test_topic_array = TopicModel.transform(pd.DataFrame(X_train_tfidf1.toarray(), columns=tfidf1.get_feature_names()))

In [None]:
X_train_topics = pd.DataFrame(doc_topic.round(5),
             index = X.index,
             columns = topics)

In [None]:
X_train_topics

In [None]:
X_train_topics.values.argmax(axis=1)

# BBE translation

The Bible in Basic English (BBE) is a translation by Professor S. H. Hooke that follows the standards of "Basic English"; it was last revised in 1965.
This implies a couple of restrictions:
- Basic English restricts vocabulary to 1000 words
    - 850 base words
    - 100 additional words for poetry
    - 50 additional words related to biblical context
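
The word budget above, and the idea of checking a vocabulary against the Basic English list, can be sketched as follows. The miniature word lists and the helper name `outside_list` are hypothetical; the real list is the 850-word pickle loaded in the next cell.

```python
# Word budget described in the list above
BASE_WORDS, POETRY_WORDS, BIBLICAL_WORDS = 850, 100, 50
budget = BASE_WORDS + POETRY_WORDS + BIBLICAL_WORDS

def outside_list(vocab, allowed):
    """Return the vocabulary words that are not in the allowed word list."""
    allowed_set = {word.lower() for word in allowed}
    return sorted(word for word in vocab if word.lower() not in allowed_set)

# Hypothetical miniature lists for illustration only
toy_basic = ["earth", "light", "water", "make"]
toy_vocab = ["earth", "light", "wrath", "israel"]
extra = outside_list(toy_vocab, toy_basic)
print(budget, extra)
```

Words that fall outside the list (proper nouns, poetic and biblical terms) are the ones the later part-of-speech plots look at.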

In [3]:
# import the default 850 basic english words 
basic_english = pd.read_pickle('data/basic_english_list.pkl')
len(basic_english)

850

In [4]:
# Import BBE translation to df
bbe = pd.read_csv('bible_corpus/bible_databases-master/t_bbe.csv')

In [5]:
# Lemmatize, remove stop words and punctuation
bbe['cleaned']=bbe['field.4'].apply(spacy_tokenizer)

In [6]:
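# Replace any token that contains a digit with a space (defined here but not applied below)
alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', x)
# Lowercase the text and replace punctuation with spaces
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())
bbe['cleaner'] = bbe.cleaned.map(punc_lower)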

In [7]:
bbe.iloc[16102:16106]

Unnamed: 0,field,field.1,field.2,field.3,field.4,cleaned,cleaner
16102,19123004,19,123,4,For long enough have men of pride made sport of our soul.,long man pride sport soul,long man pride sport soul
16103,19124001,19,124,1,&lt;A Song of the going up. Of David.&gt; If it had not been the Lord who was on our side (let Israel now say);,lt;a song david.&gt lord let israel,lt a song david gt lord let israel
16104,19124002,19,124,2,"If it had not been the Lord who was on our side, when men came up against us;",lord man come,lord man come
16105,19124003,19,124,3,"They would have made a meal of us while still living, in the heat of their wrath against us:",meal live heat wrath,meal live heat wrath
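
Row 16103 shows HTML entities from the source text leaking into the cleaned columns as the tokens `lt` and `gt`. One possible fix, sketched here with the standard library only, is to decode the entities and drop the bracketed psalm title before tokenizing. The helper name `strip_markup` is an assumption, and whether the title should be dropped or kept is a judgment call this notebook does not settle.

```python
import html
import re

def strip_markup(text):
    """Decode HTML entities, then drop <...> annotations such as psalm titles."""
    decoded = html.unescape(text)
    return re.sub(r"<[^>]*>", " ", decoded).strip()

# Raw text of row 16103 as it appears in field.4
raw = ("&lt;A Song of the going up. Of David.&gt; If it had not been "
       "the Lord who was on our side (let Israel now say);")
cleaned_text = strip_markup(raw)
print(cleaned_text)
```

Applied to `field.4` before `spacy_tokenizer`, a step like this would keep `lt` and `gt` out of the vocabulary.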


# Adding features to classify the texts

In [22]:
d = {False:'old', True: 'new'}
bbe['testiment']=(bbe['field.1']>39).map(d)

# Map the book number in field.1 to the actual book name
books_of_bible = pd.read_pickle('data/books_of_bible.pkl')
books_dict = dict(zip(range(1,67),books_of_bible))
bbe['book'] = bbe['field.1'].map(books_dict)

# chapters
bbe['chapter'] = bbe['field.2']

# verse number
bbe['verse'] = bbe['field.3']

In [23]:
bbe

Unnamed: 0,field,field.1,field.2,field.3,field.4,cleaned,cleaner,book,chapter,verse,testiment
0,1001001,1,1,1,At the first God made the heaven and the earth.,god heaven earth,god heaven earth,Genesis,1,1,old
1,1001002,1,1,2,And the earth was waste and without form; and it was dark on the face of the deep: and the Spirit of God was moving on the face of the waters.,earth waste form dark face deep spirit god face water,earth waste form dark face deep spirit god face water,Genesis,1,2,old
2,1001003,1,1,3,"And God said, Let there be light: and there was light.",god let light light,god let light light,Genesis,1,3,old
3,1001004,1,1,4,"And God, looking on the light, saw that it was good: and God made a division between the light and the dark,",god look light good god division light dark,god look light good god division light dark,Genesis,1,4,old
4,1001005,1,1,5,"Naming the light, Day, and the dark, Night. And there was evening and there was morning, the first day.",light day dark night evening morning day,light day dark night evening morning day,Genesis,1,5,old
...,...,...,...,...,...,...,...,...,...,...,...
31098,66022017,66,22,17,"And the Spirit and the bride say, Come. And let him who gives ear, say, Come. And let him who is in need come; and let everyone desiring it take of the water of life freely.",spirit bride come let ear come let need come let desire water life freely,spirit bride come let ear come let need come let desire water life freely,Revelation,22,17,new
31099,66022018,66,22,18,"For I say to every man to whose ears have come the words of this prophet's book, If any man makes an addition to them, God will put on him the punishments which are in this book:",man ear come word prophet book man addition god punishment book,man ear come word prophet book man addition god punishment book,Revelation,22,18,new
31100,66022019,66,22,19,"And if any man takes away from the words of this book, God will take away from him his part in the tree of life and the holy town, even the things which are in this book.",man away word book god away tree life holy town thing book,man away word book god away tree life holy town thing book,Revelation,22,19,new
31101,66022020,66,22,20,"He who gives witness to these things says, Truly, I come quickly. Even so come, Lord Jesus.",witness thing truly come quickly come lord jesus,witness thing truly come quickly come lord jesus,Revelation,22,20,new


# Export to sql

In [24]:
engine = create_engine('postgresql://briantam:localhost@localhost/bible')

bbe.to_sql('bbe_alchemy', engine, if_exists='replace', index=False)


In [8]:
bbe.to_csv('data/bbe_cleaned.csv')

# **NO** stop_words **YES** basic_english 

In [None]:
# Define what you'll feed into the vectorizer as X
X = bbe['cleaner']
tfidf = TfidfVectorizer(stop_words='english')
bbe_cleaned_tfidf = tfidf.fit_transform(X)
bbe_cleaned_tfidf_df = pd.DataFrame(bbe_cleaned_tfidf.toarray(), columns=tfidf.get_feature_names())
print('Vocab size: ', len(bbe_cleaned_tfidf_df.columns))

In [None]:
mytolkens = parser(' '.join(list(bbe_cleaned_tfidf_df.columns)))
tolken_list = [tolken.pos_ for tolken in mytolkens]
BBE_POS_df = pd.DataFrame([(x, tolken_list.count(x)) for x in set(tolken_list)]).sort_values(1)
plt.barh(BBE_POS_df[0],BBE_POS_df[1])
plt.title('Vocab Distribution of BBE')
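
The part-of-speech tally in the cell above calls `list.count` once per distinct tag. A `collections.Counter` gives the same counts in one pass. The tag list below is a hypothetical stand-in for the tags produced by spaCy. Note also that tagging a space-joined vocabulary gives the tagger no real sentence context, so the resulting distribution is probably best read as approximate.

```python
from collections import Counter

# Hypothetical POS tags standing in for the tags of the parsed vocabulary
toy_tags = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]

# Count every tag in a single pass
pos_counts = Counter(toy_tags)

# Ascending order by count, matching sort_values(1) before the bar plot
ordered = sorted(pos_counts.items(), key=lambda pair: pair[1])
print(ordered)
```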

# Trying other stop_word filters

### **NO** stop_words **NO** basic_english 

In [None]:
# Define what you'll feed into the vectorizer as X
X = bbe['cleaned']
tfidf = TfidfVectorizer(stop_words = basic_english)
bbe_cleaned_tfidf = tfidf.fit_transform(X)
bbe_cleaned_tfidf = pd.DataFrame(bbe_cleaned_tfidf.toarray(), columns=tfidf.get_feature_names())
print('Vocab size: ', len(bbe_cleaned_tfidf.columns))

In [None]:
parser = spacy.load('en_core_web_sm')

In [None]:
mytolkens = parser(' '.join(list(bbe_cleaned_tfidf.columns)))
tolken_list = [tolken.pos_ for tolken in mytolkens]
BBE_POS_df = pd.DataFrame([(x, tolken_list.count(x)) for x in set(tolken_list)]).sort_values(1)
plt.barh(BBE_POS_df[0],BBE_POS_df[1])
plt.title('Vocab Distribution of BBE')

### **YES** stop_words **NO** basic_english 

In [None]:
# Define what you'll feed into the vectorizer as X
X = bbe['cleaned']
tfidf = TfidfVectorizer(max_df=.9, stop_words=basic_english)
# Fit the vectorizer defined above (not the KJV vectorizer tfidf1)
X_train_tfidf1 = tfidf.fit_transform(X)

len(pd.DataFrame(X_train_tfidf1.toarray(), columns=tfidf.get_feature_names()).columns)

In [None]:
bbe_tfidf = pd.DataFrame(X_train_tfidf1.toarray(), columns=tfidf.get_feature_names())

In [None]:
mytolkens = parser(' '.join(list(bbe_tfidf.columns)))

In [None]:
tolken_list = [tolken.pos_ for tolken in mytolkens]

In [None]:
BBE_POS_df = pd.DataFrame([(x, tolken_list.count(x)) for x in set(tolken_list)]).sort_values(1)
BBE_POS_df

In [None]:
plt.barh(BBE_POS_df[0],BBE_POS_df[1])
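
The `max_df=.9` setting used in this section drops terms that occur in more than 90% of documents (here, verses). The sketch below mimics that rule in plain Python on a hypothetical four-verse corpus; it is an illustration of the idea, not scikit-learn's implementation, and `apply_max_df` is an assumed name.

```python
def apply_max_df(docs, max_df):
    """Keep terms whose document frequency is at most max_df (a proportion)."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # Count each term once per document
        for term in set(doc.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(term for term, df in doc_freq.items() if df / n_docs <= max_df)

# Hypothetical cleaned verses: "god" appears in every one of them
toy_docs = ["god light", "god earth", "god water", "god man"]
kept = apply_max_df(toy_docs, 0.9)
print(kept)
```

In a corpus of short verses few words reach a 90% document frequency, so this filter may remove very little on its own.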

### **YES** stop_words **YES** basic_english 

In [None]:
# Define what you'll feed into the vectorizer as X
X = bbe['field.4']
tfidf = TfidfVectorizer()
bbe_tfidf = tfidf.fit_transform(X)

print('Vocab Size: ', len(pd.DataFrame(bbe_tfidf.toarray(), columns=tfidf.get_feature_names()).columns))

In [None]:
bbe_tfidf = pd.DataFrame(bbe_tfidf.toarray(), columns=tfidf.get_feature_names())
mytolkens = parser(' '.join(list(bbe_tfidf.columns)))
tolken_list = [tolken.pos_ for tolken in mytolkens]
BBE_POS_df = pd.DataFrame([(x, tolken_list.count(x)) for x in set(tolken_list)]).sort_values(1)
plt.barh(BBE_POS_df[0],BBE_POS_df[1])
plt.title('Vocab Distribution of BBE')

# WordClouds

In [None]:
from wordcloud import WordCloud
text = ' '.join([tolken.pos_ for tolken in mytolkens])

# Generate a word cloud image
wordcloud = WordCloud(width = 1000, height = 1000,
                background_color ="rgba(255, 255, 255, 0)", mode="RGBA").generate(text)

# Display the generated image:
# the matplotlib way:
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('POS_BBE_not_in_BE.png', bbox_inches='tight', pad_inches=.25)
plt.show()

# Other Versions

### WEB

In [None]:
web = pd.read_csv('data/bible_databases-master/t_web.csv')
# Assign the verse text as X
X = web['field.4']

In [None]:
# Create TF-IDF of the array of words
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X)
len(pd.DataFrame(X_train_tfidf.toarray(), columns=tfidf.get_feature_names()).columns)

### ASV

### DBY

### WBT

### YLT