# Papers correlations EMAS 2019
> Correlation matrix among the papers accepted at the 7th International Workshop on Engineering Multi-Agent Systems (EMAS 2019)

- toc: false
- badges: true
- comments: true
- author: Cleber Jorge Amaral
- categories: [comparison, altair, jupyter]
- image: images/text-correlations.png

In [32]:
#hide
# Imports
from tika import parser # For parsing PDF and other to TXT
import os
from os import path, stat
import tensorflow_hub as hub
import numpy as np
import tensorflow_text
import pandas as pd 
import altair as alt # for charts
import re # for searches in text
from bs4 import BeautifulSoup # parse html
import requests # http requests
from altair_saver import save
from scipy import stats, spatial

In [33]:
#hide
# Plotting configurations
CHART_WIDTH = 600
CHART_HEIGHT = 400
# Database folder
DB_FOLDER = "../assets/db/"
# File containing the list of papers
PAPERS_LIST_FILE = DB_FOLDER + "EMAS2019_papers_list.txt"
# File containing paper descriptions
PAPERS_DESCRIPTION_FILE = DB_FOLDER + "EMAS2019_papers_description.txt"
# File containing the correlation matrix (for caching)
SIMILARITY_FILE = DB_FOLDER + "EMAS2019_simple_similarity.csv"
CORRELATIONS_FILE = DB_FOLDER + "EMAS2019_cosine_correlations.csv"
PEARSON_FILE = DB_FOLDER + "EMAS2019_pearson_similarity.csv"

In [34]:
#hide
# helpers
def parse_document(file):
    content = parser.from_file(file)
    if 'content' in content:
        text = content['content']
    else:
        return
    text = str(text)
    # Drop characters that cannot be encoded as UTF-8
    safe_text = text.encode('utf-8', errors='ignore').decode('utf-8')
    # Removing special characters
    safe_text = re.sub("\\\\\\\\x..", "", safe_text)
    # Removing returns
    safe_text = safe_text.replace("\\\\n", " ")
    # Removing sequences of spaces
    safe_text = ' '.join(safe_text.split())
    return safe_text
    
def encode_text(text):
    return embed(text)
    
def create_folder(folder):
    if not os.path.exists(folder):
        os.makedirs(folder) 

def save_text(file, text):
    # The context manager closes the file even if writing fails
    with open(file, "w") as text_file:
        text_file.write(text)
    
def read_text_as_strlist(file):
    # Return the file content as a list of lines without line endings
    with open(file, "r") as text_file:
        return text_file.read().splitlines()

In [35]:
#hide
# tensorflow universal-sentence-encoder-multilingual
# 16 languages (Arabic, Chinese-simplified, Chinese-traditional, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, Russian) text encoder.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

In [36]:
#hide
if not os.path.exists(PAPERS_LIST_FILE):
    emas_url = 'https://cgi.csc.liv.ac.uk/~lad/emas2019/accepted/'
    page_as_text = requests.get(emas_url).text
    parsed_html = BeautifulSoup(page_as_text, 'html.parser')
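    # emas_url already ends with '/', so the href is appended directly; links without an href are skipped
    files = [emas_url + node.get('href') for node in parsed_html.find_all('a') if node.get('href', '').endswith('pdf')]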
    create_folder(DB_FOLDER)
    save_text(PAPERS_LIST_FILE, '\n'.join(map(str, files)))
else:
    files = read_text_as_strlist(PAPERS_LIST_FILE)
    descriptions = read_text_as_strlist(PAPERS_DESCRIPTION_FILE)

In [37]:
#hide
if not os.path.exists(CORRELATIONS_FILE):
    
    # Create vectors for documents
    texts = []
    vectors = []
    for f in files: 
        parsed_document = parse_document(f)
        texts.append(parsed_document)
        vector = encode_text(parsed_document)
        vectors.append(vector)
        
    # Generate papers descriptions file
    descriptions = []
    for text in texts:
        # Use the first 80 characters after any leading residue as a short description
        fchar = re.search(r"[^b'\\n]", text).start()
        descriptions.append(text[fchar:fchar+80].replace("\\n", " ").replace("\\", ""))
    # Write the file once, after all descriptions have been collected
    save_text(PAPERS_DESCRIPTION_FILE, '\n'.join(map(str, descriptions)))
        
    # For plotting, generate a table with the columns paper_a, paper_b, and correlation
    file_names = []
    for vf in files:
        file_name = os.path.basename(vf)
        file_names.append(file_name.split('.')[0])
        
    # https://numpy.org/doc/stable/reference/generated/numpy.inner.html
    dfsingle = pd.DataFrame(columns=["paper_a", "paper_b", "correlation"])
    dfsingle.columns = dfsingle.columns.map(str)
    
    # Process correlations for table with 3 columns
    for i in range(len(file_names)):
        for j in range(len(file_names)):
            # df.values[row, column] = value
            dfsingle.loc[(i*len(file_names))+j] = (file_names[i], 
                                            file_names[j], 
                                            round(float(np.inner(vectors[i][0], vectors[j][0])), 2) * 100)
                
    # For plotting, generate a table with the columns paper_a, paper_b, and correlation
    # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
    dfpearson = pd.DataFrame(columns=["paper_a", "paper_b", "correlation"])
    dfpearson.columns = dfpearson.columns.map(str)
    
    # Process correlations for table with 3 columns
    for i in range(len(file_names)):
        for j in range(len(file_names)):
            # df.values[row, column] = value
            dfpearson.loc[(i*len(file_names))+j] = (file_names[i], 
                                            file_names[j], 
                                            round(float(stats.pearsonr(vectors[i][0], vectors[j][0])[0]), 2) * 100)

    # For plotting, generate a table with the columns paper_a, paper_b, and correlation
    # https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html
    dfcosine = pd.DataFrame(columns=["paper_a", "paper_b", "correlation"])
    dfcosine.columns = dfcosine.columns.map(str)
    
    # Process correlations for table with 3 columns
    for i in range(len(file_names)):
        for j in range(len(file_names)):
            # df.values[row, column] = value
            dfcosine.loc[(i*len(file_names))+j] = (file_names[i], 
                                            file_names[j], 
                                            round(float(1 - spatial.distance.cosine(vectors[i][0], vectors[j][0])), 2) * 100)
            
    # Save the correlations into a csv file
    dfsingle.to_csv(SIMILARITY_FILE)
    dfpearson.to_csv(PEARSON_FILE)
    dfcosine.to_csv(CORRELATIONS_FILE)
else:
    # The first CSV column is the index written by to_csv
    dfsingle = pd.read_csv(SIMILARITY_FILE, index_col=0)
    dfpearson = pd.read_csv(PEARSON_FILE, index_col=0)
    dfcosine = pd.read_csv(CORRELATIONS_FILE, index_col=0)


In [38]:
#hide
# hiding these results since pearson, inner and cosine are very close
dfplot = pd.DataFrame({'x': dfsingle["paper_a"].to_numpy(),
                   'y': dfsingle["paper_b"].to_numpy(),
                   'Correlation': dfsingle["correlation"].to_numpy()})

chart = alt.Chart(dfplot).mark_rect().encode(
    x=alt.X('x:O', axis=alt.Axis(title="")),
    y=alt.Y('y:O', axis=alt.Axis(title="")),
    color='Correlation:Q'
).properties(
    title=["Inner product (basic similarity function)"]
)

text = chart.mark_text(baseline='middle').encode(
    text='Correlation:Q',
    color=alt.condition(
        alt.datum.Correlation >= 70,
        alt.value('black'),
        alt.value('white')
    ),
    size=alt.value(14),
    opacity=alt.value(0.85)
)

# Draw the chart
plot = chart.properties(width=CHART_WIDTH, height=CHART_HEIGHT) + text
plot

In [39]:
#hide
# hiding these results since pearson, inner and cosine are very close
dfplot = pd.DataFrame({'x': dfcosine["paper_a"].to_numpy(),
                   'y': dfcosine["paper_b"].to_numpy(),
                   'Correlation': dfcosine["correlation"].to_numpy()})

chart = alt.Chart(dfplot).mark_rect().encode(
    x=alt.X('x:O', axis=alt.Axis(title="")),
    y=alt.Y('y:O', axis=alt.Axis(title="")),
    color='Correlation:Q'
).properties(
    title=["Cosine similarity"]
)

text = chart.mark_text(baseline='middle').encode(
    text='Correlation:Q',
    color=alt.condition(
        alt.datum.Correlation >= 70,
        alt.value('black'),
        alt.value('white')
    ),
    size=alt.value(14),
    opacity=alt.value(0.85)
)

# Draw the chart
plot = chart.properties(width=CHART_WIDTH, height=CHART_HEIGHT) + text
plot

In [40]:
#hide_input
dfplot = pd.DataFrame({'x': dfpearson["paper_a"].to_numpy(),
                   'y': dfpearson["paper_b"].to_numpy(),
                   'Correlation': dfpearson["correlation"].to_numpy()})

chart = alt.Chart(dfplot).mark_rect().encode(
    x=alt.X('x:O', axis=alt.Axis(title="")),
    y=alt.Y('y:O', axis=alt.Axis(title="")),
    color='Correlation:Q'
).properties(
    title=["Pearson correlation"]
)

text = chart.mark_text(baseline='middle').encode(
    text='Correlation:Q',
    color=alt.condition(
        alt.datum.Correlation >= 70,
        alt.value('black'),
        alt.value('white')
    ),
    size=alt.value(14),
    opacity=alt.value(0.85)
)

# Draw the chart
plot = chart.properties(width=CHART_WIDTH, height=CHART_HEIGHT) + text
plot

In [41]:
#hide_input
for i in range(len(files)):
    print(os.path.basename(files[i]), ": ", descriptions[i])

EMAS2019_paper_5.pdf :  From Goals to Organisations: automated organisation generator for MAS? Cleber Jo
EMAS2019_paper_8.pdf :  Jacamo-web is on the fly: an interactive Multi-Agent System IDE? Cleber Jorge Am
EMAS2019_paper_18.pdf :  SAT for Epistemic Logic using Belief Bases Fabián Romero1 and Emiliano Lorini1 
EMAS2019_paper_21.pdf :  JS-son - A Minimal JavaScript BDI Agent Library Timotheus Kampik and Juan Carlos
EMAS2019_paper_22.pdf :  An Architecture for Integrating BDI Agents with a Simulation Environment Alan Da
EMAS2019_paper_23.pdf :  Using MATSim as a Component in Dynamic Agent-Based Micro-Simulations Dhirendra S
EMAS2019_paper_24.pdf :  Plan Library Reconfigurability in BDI Agents? Rafael C. Cardoso, Louise A. Denni
EMAS2019_paper_25.pdf :  Incorporating social practices in BDI agent systems Stephen Cranefield1 and Fran
EMAS2019_paper_26.pdf :  Accountability and Agents for Engineering Business Processes Matteo Baldoni1, Cr
EMAS2019_paper_27.pdf :  The “Why did you do tha

### How does it work?
The notebook downloads the accepted papers listed on the [EMAS 2019](https://cgi.csc.liv.ac.uk/~lad/emas2019/accepted/) page. Each PDF is converted to plain text with [Apache Tika](https://tika.apache.org/) and then [vectorized](https://ai.googleblog.com/2019/07/multilingual-universal-sentence-encoder.html) with the [Google Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3). The resulting vectors are compared pairwise, and each comparison yields a correlation expressed as a similarity between 0 and 100%. These values are presented in an [Altair correlation matrix](https://altair-viz.github.io/gallery/simple_heatmap.html).
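
As a rough illustration of the comparison step, the sketch below computes the three measures used in this notebook (inner product, cosine similarity and Pearson correlation) in plain Python. The vectors `paper_a` and `paper_b` are made-up four-dimensional stand-ins, not real encoder output, and the helper names are only for this example. Because the toy vectors have unit length and zero mean, the three measures coincide, which is one possible reason the three heatmaps look so alike; shifting a vector by a constant shows where cosine and Pearson part ways.

```python
import math

def inner(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Inner product divided by the product of the vector lengths."""
    return inner(u, v) / (math.sqrt(inner(u, u)) * math.sqrt(inner(v, v)))

def pearson(u, v):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine_similarity([a - mean_u for a in u], [b - mean_v for b in v])

# Hypothetical unit-length, zero-mean vectors (assumption: real embeddings are much longer)
s = 1 / math.sqrt(2)
paper_a = [s, -s, 0.0, 0.0]
paper_b = [s, 0.0, -s, 0.0]

for name, measure in [("inner", inner), ("cosine", cosine_similarity), ("pearson", pearson)]:
    # Same percentage scaling idea as in the notebook
    print(name, round(measure(paper_a, paper_b) * 100))

# Adding a constant to every component changes the cosine but not the Pearson value
shifted_b = [x + 1.0 for x in paper_b]
print("cosine (shifted)", round(cosine_similarity(paper_a, shifted_b), 2))
print("pearson (shifted)", round(pearson(paper_a, shifted_b), 2))
```
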
### What else can it do?
I use it to find correlations across the many papers and books I use in my research. Since I use [Mendeley](https://www.mendeley.com/), all of them sit in a single flat folder. The project called [text-correlation](https://github.com/cleberjamaral/text-correlation) reads every document in a local folder and writes an `n x n` correlation matrix to a `.csv` file. This works better for a larger number of documents, and the file can be opened in a spreadsheet application.
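
The `n x n` layout can be sketched as follows. This is not the code of the text-correlation project, only a minimal standard-library illustration of the idea: the document names, the two-dimensional vectors and the helper functions `similarity_matrix` and `matrix_to_csv` are all hypothetical.

```python
import csv
import io
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Return an n x n list of lists with pairwise similarities in percent."""
    return [[round(cosine_similarity(u, v) * 100) for v in vectors] for u in vectors]

def matrix_to_csv(names, matrix):
    """Serialize the matrix with the names as header row and first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([""] + names)
    for name, row in zip(names, matrix):
        writer.writerow([name] + row)
    return buffer.getvalue()

# Hypothetical documents and toy vectors standing in for real embeddings
names = ["doc_a", "doc_b", "doc_c"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

matrix = similarity_matrix(vectors)
csv_text = matrix_to_csv(names, matrix)
print(csv_text)
```

Unlike the three-column `paper_a`, `paper_b`, `correlation` table used for the heatmaps, this wide layout has one row and one column per document, which is what makes it convenient to read in a spreadsheet.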

### Credits
Developed by [Cleber Jorge Amaral](https://cleberjamaral.github.io/). This work was heavily inspired by work presented by Aladdin Shamoug.

In [11]:
#hide
save(plot,"../images/text-correlations.png")

ValueError: No enabled saver found that supports format='png'