# Text correlations

Creates a correlation matrix for a set of documents, showing how strongly each document relates to, or resembles, every other. It reads all documents from a specified folder and uses the TensorFlow Universal Sentence Encoder to create a vector for each of them.

By default, the folder with the documents is called `my_documents/`, and the resulting files with all correlations are placed in the `correlation/` folder.

The CSV output file `correlation/correlations.csv` can be imported into spreadsheet applications such as Microsoft Excel, LibreOffice and Google Sheets. A correlation close to 1 is the maximum; a value close to zero is minimal, meaning no correlation. The conditional formatting tools of these applications are suggested to make the matrix easier to read.
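
As a rough illustration of how the matrix can be read, the sketch below ranks document pairs by score with pandas. The document names and scores are made-up placeholders, not output of this notebook, and `rank_pairs` is a hypothetical helper; with real data the matrix could instead be loaded with `pd.read_csv("correlation/correlations.csv", index_col=0)`.

```python
import numpy as np
import pandas as pd

# Hypothetical 3 x 3 matrix standing in for correlation/correlations.csv
names = ["doc_a", "doc_b", "doc_c"]
scores = np.array([
    [1.00, 0.82, 0.10],
    [0.82, 1.00, 0.35],
    [0.10, 0.35, 1.00],
])
matrix = pd.DataFrame(scores, index=names, columns=names)

def rank_pairs(matrix):
    """Return (doc_i, doc_j, score) for each unordered pair, highest score first."""
    labels = list(matrix.index)
    pairs = []
    for i in range(len(labels)):
        # Start at i + 1 to skip the diagonal and the mirrored half of the matrix
        for j in range(i + 1, len(labels)):
            pairs.append((labels[i], labels[j], float(matrix.iloc[i, j])))
    return sorted(pairs, key=lambda p: p[2], reverse=True)

ranked = rank_pairs(matrix)
for a, b, score in ranked:
    print(f"{a} ~ {b}: {score:.2f}")
```

Only the upper triangle is visited because the matrix is symmetric and the diagonal compares each document with itself.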

By default, intermediary plain-text files of 500 KiB or more are skipped when creating vectors, since they take too long to process.
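
The size check can be sketched in isolation as follows. This is a minimal example that assumes the same 500 KiB default used in the setup; `small_enough` is a hypothetical helper name, and the two files are temporary placeholders created only for the demonstration.

```python
import os
import tempfile

MAX_FILESIZE = 500 * 1024  # assumed to match the default limit in the setup

def small_enough(path, limit=MAX_FILESIZE):
    """Hypothetical helper: True when the file is strictly smaller than the limit."""
    return os.stat(path).st_size < limit

with tempfile.TemporaryDirectory() as tmp:
    small = os.path.join(tmp, "small.txt")
    large = os.path.join(tmp, "large.txt")
    with open(small, "wb") as fh:
        fh.write(b"x" * 1024)          # 1 KiB placeholder
    with open(large, "wb") as fh:
        fh.write(b"x" * (600 * 1024))  # 600 KiB placeholder
    results = {os.path.basename(p): small_enough(p) for p in (small, large)}

print(results)
```

Raising `MAX_FILESIZE` lets larger documents through at the cost of longer encoding times.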

Two other folders `correlation/texts/` and `correlation/vectors/` are created with intermediary files.

The original documents can be PDF, DOC, EPUB and other formats ([see the Tika documentation](https://tika.apache.org/0.9/formats.html)).

Developed using Python 3.7.7

In [49]:
# Imports
from tika import parser # For parsing PDF and other to TXT
from multiprocessing import Pool
import os
from os import path, stat
import glob # For listing files in folders
import tensorflow_hub as hub
import numpy as np
import tensorflow_text
import dill # For binary files
import pandas as pd 
import altair as alt # for charts
import re # for searches in text

In [50]:
# Setup
# Path for local documents (pdfs, docs, txts, etc.)
PATH_FOR_DOCUMENTS = "my_documents/"
# Path to put plain texts (can be left as it is)
PATH_TEXTS = "correlation/texts/"
# Path to put vectors - texts encoded according to their contexts (can be left as it is)
PATH_VECTORS = "correlation/vectors/"
# Path to put the output csv file
PATH_CORRELATION_FILE = "correlation/"
# Max size for the plain text file; bigger files are skipped since they take too long to process
MAX_FILESIZE = 500*1024
# Plotting configurations
CHART_WIDTH = 600
CHART_HEIGHT = 400

In [51]:
# helpers
def parse_document(file):
    content = parser.from_file(file)
    if 'content' in content:
        text = content['content']
    else:
        return
    text = str(text)
    # Ensure text is utf-8 formatted
    safe_text = text.encode('utf-8', errors='ignore')
    # Escape any \ issues
    safe_text = str(safe_text).replace('\\', '\\\\').replace('"', '\\"')
    return safe_text
    
def encode_text(text):
    return embed(text)

def create_folder(folder):
    if not os.path.exists(folder):
        os.makedirs(folder)    

def save_text(file, text):
    with open(PATH_TEXTS + os.path.basename(file) + ".txt", "w") as text_file:
        text_file.write(text)

def read_text(file):
    with open(PATH_TEXTS + os.path.basename(file), "r") as text_file:
        return text_file.read()

def read_vector(file):
    with open(PATH_VECTORS + os.path.basename(file), "rb") as vector_file:
        return dill.load(vector_file)

def save_vector(file, vector):
    with open(PATH_VECTORS + os.path.basename(file) + ".vec", "wb") as vector_file:
        dill.dump(vector, vector_file)

In [52]:
# Create folders if they do not exist
create_folder(PATH_TEXTS)
create_folder(PATH_VECTORS)

In [53]:
# tensorflow universal-sentence-encoder-multilingual
# 16 languages (Arabic, Chinese-simplified, Chinese-traditional, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, Russian) text encoder.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

In [54]:
# Check if there are old and unnecessary output files
# It will not remove them unless the lines marked below are uncommented
unnecessary_texts = []
files = glob.glob(PATH_FOR_DOCUMENTS + "*")
text_files = glob.glob(PATH_TEXTS + "*")
vector_files = glob.glob(PATH_VECTORS + "*")
for tf in text_files: 
    original_name = os.path.basename(tf[:-4])
    if PATH_FOR_DOCUMENTS + original_name not in files:
        unnecessary_texts.append(tf)
if (unnecessary_texts != []):
    print("The following text files seem to be unnecessary:")
    print(unnecessary_texts)
# Uncomment to remove them
    #for f in unnecessary_texts:
        #os.remove(f)
unnecessary_vectors = []
for vf in vector_files: 
    original_text_name = os.path.basename(vf[:-4])
    if PATH_TEXTS + original_text_name not in text_files:
        unnecessary_vectors.append(vf)
if (unnecessary_vectors != []):
    print("The following vector files seem to be unnecessary:")
    print(unnecessary_vectors)
# Uncomment to remove them
    #for f in unnecessary_vectors:
        #os.remove(f)
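
The check above relies on each intermediary file being named after its source plus a four-character extension (`.txt` or `.vec`). A minimal sketch of that name mapping, using a hypothetical helper `stale_outputs` and made-up file names, could look like this:

```python
import os

def stale_outputs(source_files, output_files, suffix):
    """Hypothetical helper: outputs whose name minus `suffix` matches no source file."""
    source_names = {os.path.basename(s) for s in source_files}
    stale = []
    for out in output_files:
        name = os.path.basename(out)
        # Drop the extension added when the intermediary file was written
        if name.endswith(suffix):
            name = name[: -len(suffix)]
        if name not in source_names:
            stale.append(out)
    return stale

# Made-up names for illustration only
documents = ["my_documents/a.pdf", "my_documents/b.docx"]
texts = ["correlation/texts/a.pdf.txt", "correlation/texts/old.pdf.txt"]
stale = stale_outputs(documents, texts, ".txt")
print(stale)
```

Comparing base names in a set avoids depending on how the folder prefix is written.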

In [55]:
# List of files that will be converted to plain text
files = glob.glob(PATH_FOR_DOCUMENTS + "*")
# Create plain text files from original documents
for f in files: 
    if not path.exists(PATH_TEXTS + os.path.basename(f) + ".txt"):
        parsed_document = parse_document(f)
        # parse_document returns None when no content could be extracted
        if parsed_document is not None:
            save_text(f, parsed_document)

In [56]:
# List of plain text documents that exist in the PATH_TEXTS folder
text_files = glob.glob(PATH_TEXTS + "*")
# Create vectors for the plain texts
for tf in text_files: 
    if os.stat(tf).st_size < MAX_FILESIZE and not path.exists(PATH_VECTORS + os.path.basename(tf) + ".vec"):
        text = read_text(tf)
        vector = encode_text(text)
        save_vector(tf, vector)

In [57]:
# List of vectors that exist in the PATH_VECTORS folder
vector_files = glob.glob(PATH_VECTORS + "*")
# Read existing vectors
vectors = []
for vf in vector_files: 
    vector = read_vector(vf)
    vectors.append(vector)

In [58]:
# List all existing files and create a dataframe with them as rows and columns in an "n x n" structure
file_names = []
for vf in vector_files:
    file_name = os.path.basename(vf)
    file_names.append(file_name.split('.')[0])
# Start the structure with zeros in the correlations
# (a flat list of names is used so that the column labels stay plain strings)
df = pd.DataFrame(np.zeros((len(file_names), len(file_names))),
                  index=file_names, columns=file_names)

In [59]:
# Process correlations across each document with all others and update the "n x n" structure with these correlations
for i in range(len(file_names)):
    for j in range(len(file_names)):
        # df.iloc[row, column] = value
        df.iloc[i, j] = np.inner(vectors[i], vectors[j]).item()
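
The pairwise loop above can also be written as a single matrix product. The sketch below uses random stand-in vectors rather than real encoder output and assumes every vector has the same length and unit norm, which this notebook does not verify. Under that assumption the inner product equals the cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in embeddings: 4 documents, 512 dimensions (random, for illustration only)
embeddings = rng.normal(size=(4, 512))
# Normalise rows to unit length so the inner product equals cosine similarity
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Loop version, mirroring the cell above
n = len(embeddings)
looped = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        looped[i, j] = np.inner(embeddings[i], embeddings[j])

# Vectorised version: one matrix product gives all pairwise inner products
similarity = embeddings @ embeddings.T

print(np.allclose(looped, similarity))
```

For a handful of documents the loop is fast enough; the matrix product mainly pays off as the number of documents grows.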

In [60]:
# Save the correlations into a csv file
df.to_csv(PATH_CORRELATION_FILE + "correlations.csv")

In [61]:
# For plotting, generate a table with the columns paper_a, paper_b, and correlation
file_names = []
for vf in vector_files:
    file_name = os.path.basename(vf)
    file_names.append(file_name.split('.')[0])
dfsingle = pd.DataFrame(columns=["paper_a", "paper_b", "correlation"])
dfsingle.columns = dfsingle.columns.map(str)

In [62]:
# Process correlations for the table with 3 columns (correlation stored as a whole percentage)
for i in range(len(file_names)):
    for j in range(len(file_names)):
        # dfsingle.loc[row] = (paper_a, paper_b, correlation)
        dfsingle.loc[(i*len(file_names))+j] = (file_names[i], 
                                            file_names[j], 
                                            round(np.inner(vectors[i], vectors[j]).item() * 100))
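
The same three-column table can be derived from a square matrix with pandas `melt`. This is only a sketch on a made-up 2 x 2 matrix with hypothetical document names, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

names = ["doc_a", "doc_b"]  # hypothetical document names
matrix = pd.DataFrame(np.array([[1.0, 0.57], [0.57, 1.0]]),
                      index=names, columns=names)

# Turn the row labels into a column, then unpivot the remaining columns
long_form = (
    matrix.rename_axis("paper_a")
    .reset_index()
    .melt(id_vars="paper_a", var_name="paper_b", value_name="correlation")
)
# Express scores as whole percentages
long_form["correlation"] = (long_form["correlation"] * 100).round().astype(int)

print(long_form)
```

Each row of `long_form` holds one ordered pair of documents, which is the layout that Altair expects for a heatmap.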

In [63]:
# Save the correlations into a csv file
dfsingle.to_csv(PATH_CORRELATION_FILE + "single_correlations.csv")

In [65]:

# Build the heatmap data from the 3-column table
dfplot = pd.DataFrame({'x': dfsingle["paper_a"].to_numpy(),
                   'y': dfsingle["paper_b"].to_numpy(),
                   'Correlation': dfsingle["correlation"].to_numpy()})

chart = alt.Chart(dfplot).mark_rect().encode(
    x=alt.X('x:O', axis=alt.Axis(title="")),
    y=alt.Y('y:O', axis=alt.Axis(title="")),
    color='Correlation:Q'
)

text = chart.mark_text(baseline='middle').encode(
    text='Correlation:Q',
    color=alt.condition(
        alt.datum.Correlation > 70,
        alt.value('black'),
        alt.value('white')
    )
)

# Draw the chart
plot = chart.properties(width=CHART_WIDTH, height=CHART_HEIGHT) + text
plot

In [68]:
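# Print the first characters of each document to help identify it
for name, vf in zip(file_names, vector_files):
    # Each vector file is named "<text file>.vec", so dropping ".vec" gives the matching text file
    text = read_text(vf[:-4])
    match = re.search(r"[^b'\\n]", text)
    fchar = match.start() if match else 0
    print(name, ": ", text[fchar:fchar+80].replace("\\n", " ").replace("\\", ""))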

EMAS2019_paper_32 :  On Enactability of Agent Interaction Protocols: Towards a Unified Approach 
EMAS2019_paper_24 :  Jacamo-web is on the fly: an interactive Multi-Agent System IDE?  Cleber J
EMAS2019_paper_22 :  Agents are More Complex than Other Software: An Empirical Investigation  A
EMAS2019_paper_28 :  An Introduction to Engineering Multiagent Industrial Symbiosis Systems: Potent
EMAS2019_paper_33 :  Incorporating social practices in BDI agent systems  Stephen Cranefield1 and
EMAS2019_paper_5 :  From Programming Agents to Educating Agents xe2x80x93 A Jason-based Fram
EMAS2019_paper_26 :  The xe2x80x9cWhy did you do that?xe2x80x9d Button: Answering Why-q
EMAS2019_paper_21 :  From Goals to Organisations: automated organisation generator for MAS?  Cl
EMAS2019_paper_18 :  Using MATSim as a Component in Dynamic Agent-Based Micro-Simulations  Dhir
EMAS2019_paper_23 :  JS-son - A Minimal JavaScript BDI Agent Library  Timotheus Kampik and Juan
EMAS2019_paper_31 :  Whoxe2x80x99s that? - M

### Credits
Developed by [Cleber Jorge Amaral](https://cleberjamaral.github.io/), highly inspired by work presented by Aladdin Shamoug.