# Translation Exploration & Data Cleaning
Author Brian Tam, 11/02/2020

This notebook cleans the [Bible corpus](https://www.kaggle.com/oswinrh/bible) as an intermediate step to prepare it for modeling.
Specifically, this initial pass explores the different translations and weighs each one on two criteria:
1. Total vocabulary size (for the purposes of dimensionality reduction)
2. How faithful the translation is to the original Greek/Hebrew

For a detailed breakdown, see [this translation chart](https://commonwaychurch.com/wp-content/uploads/2015/11/bibletranslationchart.pdf).

There is a huge variety of unusual Bible versions, including [one that rewrites the entire King James Version alphabetically](https://www.cnet.com/news/bible-from-a-z-software-rewrites-entire-king-james-version-alphabetically/).

Ultimately I decided to use the BBE translation because its inherently smaller vocabulary gives a natural dimensionality reduction.
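
The vocabulary criterion can be checked with a quick count of distinct word types per translation. The following is a minimal sketch rather than the method used later in this notebook: `vocab_size` is a hypothetical helper, and the sample strings are toy stand-ins, not verses from any translation.

```python
import re

def vocab_size(verses):
    """Count distinct lowercase word types across an iterable of strings."""
    vocab = set()
    for verse in verses:
        # A word is a run of letters or apostrophes; case is ignored.
        vocab.update(re.findall(r"[a-z']+", verse.lower()))
    return len(vocab)

# Toy stand-ins (assumption): replace with the verse column of each translation.
samples = {
    "translation_a": ["The Lord is good", "the lord IS great"],
    "translation_b": ["The Lord is good", "The Lord is good to all"],
}
sizes = {name: vocab_size(verses) for name, verses in samples.items()}
print(sizes)
```

A smaller count means fewer columns in the document-term matrix built from that translation.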

In [2]:
# Get pandas and postgres to work together
from sqlalchemy import create_engine
import psycopg2 as pg
import pandas as pd
import numpy as np
import pickle 

# pandas overrides for display
pd.set_option('display.max_colwidth', 1)

# Text Preprocessing
import re
import string
# Import spacy to do NLP
import spacy
parser = spacy.load('en_core_web_sm')

# Import sklearn to do CountVectorizing and TF-IDF document-term matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

import matplotlib.pyplot as plt

# Import custom spaCy preprocessing
from utilities.text_cleaning import spacy_tokenizer

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Adding features to classify the texts

In [20]:
def bible_features(df):
    d = {False:'old', True: 'new'}
    df['testiment']=(df['field.1']>39).map(d)

    # mapping the actual book names field.1
    books_of_bible = pd.read_pickle('data/books_of_bible.pkl')
    books_dict = dict(zip(range(1,67),books_of_bible))
    df['book'] = df['field.1'].map(books_dict)

    # chapters
    df['chapter'] = df['field.2']

    # verse number
    df['verse'] = df['field.3']
    return df

# KJV translation

The King James Bible is an English translation of the Christian Bible commissioned for the Church of England in 1604 and completed and published in 1611.
- One of the oldest and most well-respected versions of the Bible
- Written in Early Modern English (so modern toolkits like spaCy may not handle the archaic word forms correctly)

In [3]:
kjv=pd.read_csv('bible_corpus/bible_databases-master/t_kjv.csv')

In [22]:
kjv = bible_features(kjv)

In [15]:
kjv['cleaner']=kjv['field.4'].apply(spacy_tokenizer)

In [17]:
kjv.iloc[16102:16106]

| Unnamed: 0 | field | field.1 | field.2 | field.3 | field.4 | cleaned | cleaner |
|---|---|---|---|---|---|---|---|
| 16102 | 19123004 | 19 | 123 | 4 | Our soul is exceedingly filled with the scorning of those that are at ease, and with the contempt of the proud. | soul exceedingly fill scorning ease contempt proud | soul exceedingly fill scorning ease contempt proud |
| 16103 | 19124001 | 19 | 124 | 1 | If it had not been the LORD who was on our side, now may Israel say; | lord israel | lord israel |
| 16104 | 19124002 | 19 | 124 | 2 | If it had not been the LORD who was on our side, when men rose up against us: | lord man rise | lord man rise |
| 16105 | 19124003 | 19 | 124 | 3 | Then they had swallowed us up quick, when their wrath was kindled against us: | swallow quick wrath kindle | swallow quick wrath kindle |


In [6]:
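# X is not defined in an earlier cell; it is assumed here to be the cleaned KJV text
X = kjv['cleaner']
tfidf1 = TfidfVectorizer()
X_train_tfidf1 = tfidf1.fit_transform(X)
len(pd.DataFrame(X_train_tfidf1.toarray(), columns=tfidf1.get_feature_names()).columns)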

10114
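
For reference, the weights that fill this matrix can be reproduced by hand. The sketch below assumes scikit-learn's default `TfidfVectorizer` settings as I understand them (raw term counts, smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows); `tfidf_rows` is a hypothetical helper that works on pre-tokenised documents, not part of any library.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2-normalised rows, over pre-tokenised docs."""
    n_docs = len(docs)
    # Document frequency: number of docs that contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}
    rows = []
    for doc in docs:
        weights = {term: count * idf[term] for term, count in Counter(doc).items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Tokens taken from two of the cleaned KJV rows shown earlier.
docs = [["lord", "israel"], ["lord", "man", "rise"]]
rows = tfidf_rows(docs)
print(rows[0])
```

A term that appears in every document ("lord" here) ends up with a lower weight than a term unique to one verse.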

In [7]:
pd.DataFrame(X_train_tfidf1.toarray(), columns=tfidf1.get_feature_names())

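| Unnamed: 0 | aaron | aaronites | abaddon | abagtha | abana | abarim | abase | abate | abba | abda | ... | zorathites | zoreah | zorites | zorobabel | zuar | zuph | zur | zuriel | zurishaddai | zuzims |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 31098 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 31099 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 31100 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 31101 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |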
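
The matrix is almost entirely zeros, so converting it with `.toarray()` is expensive. The arithmetic below is a rough estimate, assuming 8-byte float64 values and the shape seen in the output above (rows 0 to 31101, 10114 terms); `dense_size_gib` is a hypothetical helper. If only the vocabulary size is needed, the fitted vectorizer's `vocabulary_` attribute or the sparse matrix's `shape[1]` should give the same number without building the dense frame.

```python
def dense_size_gib(n_rows, n_cols, bytes_per_value=8):
    """Size in GiB of a dense array with the given shape and value width."""
    return n_rows * n_cols * bytes_per_value / 2**30

# 31102 verses by 10114 terms, both read off the notebook output.
size = dense_size_gib(31102, 10114)
print(f"{size:.2f} GiB")
```

Under these assumptions the dense array comes to roughly 2.3 GiB.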


# BBE translation

The Bible in Basic English (BBE) is a translation by Professor S. H. Hooke that follows the standards of "Basic English"; it was last revised in 1965.
This implies a restriction on vocabulary:
- Basic English restricts the vocabulary to 1000 words
    - 850 base words
    - 100 additional words for poetry
    - 50 additional words related to biblical context
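
The word budget above can be turned into a simple coverage check: which tokens in a text fall outside the allowed list? This is a minimal sketch; `outside_basic` is a hypothetical helper, and the short word list and tokens are illustrative only, not the actual Basic English list.

```python
# The three parts of the budget described above.
BASE, POETRY, BIBLICAL = 850, 100, 50
total_words = BASE + POETRY + BIBLICAL

def outside_basic(tokens, basic_words):
    """Return the sorted distinct tokens that are not in the allowed word list."""
    allowed = {word.lower() for word in basic_words}
    return sorted({token.lower() for token in tokens} - allowed)

# Hypothetical mini word list and tokens, for illustration only.
basic_sample = ["man", "go", "water", "good"]
tokens = ["Man", "water", "Israel", "wrath"]
extra = outside_basic(tokens, basic_sample)
print(total_words, extra)
```

Proper nouns such as place and personal names are the kind of token this check would flag.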

In [None]:
# import the default 850 basic english words 
basic_english = pd.read_pickle('data/basic_english_list.pkl')
len(basic_english)

In [None]:
# Import BBE translation to df
bbe = pd.read_csv('bible_corpus/bible_databases-master/t_bbe.csv')

In [None]:
# Lemmatize, remove stop words and punctuation
bbe['cleaned']=bbe['field.4'].apply(spacy_tokenizer)

In [None]:
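# Replace any token that contains a digit with a space
alphanumeric = lambda x: re.sub(r'\w*\d\w*', ' ', x)
# Lowercase the text and replace punctuation with spaces
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())
bbe['cleaner']= bbe.cleaned.map(alphanumeric).map(punc_lower)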
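
To see what these two substitutions do, here they are applied to a single string. This is a small sketch with the same regular expressions written as named functions (`drop_alphanumeric` and `strip_punct_lower` are hypothetical names) and a made-up sample sentence.

```python
import re
import string

def drop_alphanumeric(text):
    # Replace any token that contains a digit with a space.
    return re.sub(r'\w*\d\w*', ' ', text)

def strip_punct_lower(text):
    # Lowercase, then replace every ASCII punctuation character with a space.
    return re.sub('[%s]' % re.escape(string.punctuation), ' ', text.lower())

sample = "In 2nd place: the LORD's house, verse4!"  # made-up example text
cleaned = strip_punct_lower(drop_alphanumeric(sample))
print(cleaned.split())
```

Note that the apostrophe in "LORD's" is replaced by a space, so a stray "s" token is left behind; a vectorizer whose token pattern requires at least two characters would drop it.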

In [None]:
bbe.iloc[16102:16106]

# Export to sql

In [None]:
engine = create_engine('postgresql://briantam:localhost@localhost/bible')

bbe.to_sql('bbe_alchemy', engine, if_exists='replace', index=False)


In [23]:
engine = create_engine('postgresql://briantam:localhost@localhost/bible')

kjv.to_sql('kjv', engine, if_exists='replace', index=False)
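
Both export cells hard-code the connection string. One option is to assemble it from environment variables so credentials stay out of the notebook. This is only a sketch: `postgres_url` and the `BIBLE_DB_*` variable names are hypothetical, and the resulting string would be passed to `create_engine` as before.

```python
import os
from urllib.parse import quote

def postgres_url(env=os.environ):
    """Build a postgresql:// URL from environment variables (names are hypothetical)."""
    user = env.get("BIBLE_DB_USER", "postgres")
    # Percent-encode the password so characters like '@' or '/' cannot break the URL.
    password = quote(env.get("BIBLE_DB_PASSWORD", ""), safe="")
    host = env.get("BIBLE_DB_HOST", "localhost")
    name = env.get("BIBLE_DB_NAME", "bible")
    return f"postgresql://{user}:{password}@{host}/{name}"

# A plain dict stands in for os.environ in this example.
url = postgres_url({"BIBLE_DB_USER": "briantam", "BIBLE_DB_PASSWORD": "p@ss word"})
print(url)
```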


In [None]:
bbe.to_csv('data/bbe_cleaned.csv')

# **NO** stop_words **YES** basic_english 

In [None]:
# Define what you'll feed into the vectorizer as X
X = bbe['cleaner']
tfidf = TfidfVectorizer(stop_words='english')
bbe_cleaned_tfidf = tfidf.fit_transform(X)
bbe_cleaned_tfidf_df = pd.DataFrame(bbe_cleaned_tfidf.toarray(), columns=tfidf.get_feature_names())
print('Vocab size: ', len(bbe_cleaned_tfidf_df.columns))

In [None]:
mytolkens = parser(' '.join(list(bbe_cleaned_tfidf_df.columns)))
tolken_list = [tolken.pos_ for tolken in mytolkens]
BBE_POS_df = pd.DataFrame([(x, tolken_list.count(x)) for x in set(tolken_list)]).sort_values(1)
plt.barh(BBE_POS_df[0],BBE_POS_df[1])
plt.title('Vocab Distribution of BBE')
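
The tag tally above calls `list.count` once per distinct tag, which rescans the whole list each time. `collections.Counter` does the same tally in a single pass. This is a sketch with hypothetical tags standing in for the spaCy output, so it runs without the language model.

```python
from collections import Counter

# Hypothetical tags standing in for [tolken.pos_ for tolken in mytolkens].
tag_list = ["NOUN", "VERB", "NOUN", "PROPN", "ADJ", "NOUN", "VERB"]

# Tally every tag in one pass, then sort ascending by count like sort_values(1).
pos_counts = sorted(Counter(tag_list).items(), key=lambda pair: pair[1])
labels = [tag for tag, _ in pos_counts]
values = [count for _, count in pos_counts]
print(pos_counts)
```

`labels` and `values` can then be passed to `plt.barh` in place of the two DataFrame columns.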

# Trying other stop_word filters

### **NO** stop_words **NO** basic_english 

In [None]:
# Define what you'll feed into the vectorizer as X
X = bbe['cleaned']
tfidf = TfidfVectorizer(stop_words = basic_english)
bbe_cleaned_tfidf = tfidf.fit_transform(X)
bbe_cleaned_tfidf = pd.DataFrame(bbe_cleaned_tfidf.toarray(), columns=tfidf.get_feature_names())
print('Vocab size: ', len(bbe_cleaned_tfidf.columns))

In [None]:
parser = spacy.load('en_core_web_sm')

In [None]:
mytolkens = parser(' '.join(list(bbe_cleaned_tfidf.columns)))
tolken_list = [tolken.pos_ for tolken in mytolkens]
BBE_POS_df = pd.DataFrame([(x, tolken_list.count(x)) for x in set(tolken_list)]).sort_values(1)
plt.barh(BBE_POS_df[0],BBE_POS_df[1])
plt.title('Vocab Distribution of BBE')

### **YES** stop_words **NO** basic_english 

In [None]:
# Define what you'll feed into the vectorizer as X
X = bbe['cleaned']
tfidf = TfidfVectorizer(max_df=.9, stop_words = basic_english)
X_train_tfidf1 = tfidf.fit_transform(X)

len(pd.DataFrame(X_train_tfidf1.toarray(), columns=tfidf.get_feature_names()).columns)

In [None]:
bbe_tfidf = pd.DataFrame(X_train_tfidf1.toarray(), columns=tfidf.get_feature_names())

In [None]:
mytolkens = parser(' '.join(list(bbe_tfidf.columns)))

In [None]:
tolken_list = [tolken.pos_ for tolken in mytolkens]

In [None]:
BBE_POS_df = pd.DataFrame([(x, tolken_list.count(x)) for x in set(tolken_list)]).sort_values(1)
BBE_POS_df

In [None]:
plt.barh(BBE_POS_df[0],BBE_POS_df[1])
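
The `max_df=.9` setting in this section is meant to drop terms that occur in more than 90% of the verses. The sketch below shows that filter on its own, assuming the threshold is read as a proportion of documents; `filter_by_max_df` is a hypothetical helper and the documents are toy token lists.

```python
from collections import Counter

def filter_by_max_df(docs, max_df=0.9):
    """Keep terms whose document frequency does not exceed max_df * len(docs)."""
    n_docs = len(docs)
    # Count each term once per document it appears in.
    df = Counter(term for doc in docs for term in set(doc))
    limit = max_df * n_docs
    return sorted(term for term, count in df.items() if count <= limit)

# Toy documents: "lord" appears in all three, so it exceeds the 90% threshold.
docs = [["lord", "israel"], ["lord", "man", "rise"], ["lord", "wrath"]]
kept = filter_by_max_df(docs)
print(kept)
```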

### **YES** stop_words **YES** basic_english 

In [None]:
# Define what you'll feed into the vectorizer as X
X = bbe['field.4']
tfidf = TfidfVectorizer()
bbe_tfidf = tfidf.fit_transform(X)

print('Vocab Size: ', len(pd.DataFrame(bbe_tfidf.toarray(), columns=tfidf.get_feature_names()).columns))

In [None]:
bbe_tfidf = pd.DataFrame(bbe_tfidf.toarray(), columns=tfidf.get_feature_names())
mytolkens = parser(' '.join(list(bbe_tfidf.columns)))
tolken_list = [tolken.pos_ for tolken in mytolkens]
BBE_POS_df = pd.DataFrame([(x, tolken_list.count(x)) for x in set(tolken_list)]).sort_values(1)
plt.barh(BBE_POS_df[0],BBE_POS_df[1])
plt.title('Vocab Distribution of BBE')

# WordClouds

In [None]:
from wordcloud import WordCloud
text = ' '.join([tolken.pos_ for tolken in mytolkens])

# Generate a word cloud image
wordcloud = WordCloud(width = 1000, height = 1000,
                background_color ="rgba(255, 255, 255, 0)", mode="RGBA").generate(text)

# Display the generated image:
# the matplotlib way:
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.savefig('POS_BBE_not_in_BE.png',bbox_inches = 'tight', pad_inches = .25)
plt.show()

# Other Versions

### WEB

In [None]:
web = pd.read_csv('data/bible_databases-master/t_web.csv')
# Assign the verse text as the vectorizer input
X = web['field.4']

In [None]:
# Create TF-IDF of the array of words
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X)
len(pd.DataFrame(X_train_tfidf.toarray(), columns=tfidf.get_feature_names()).columns)

### ASV

### DBY

### WBT

### YLT