# 1. Setup (Installation and importing modules)

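Here we install the libraries required for preprocessing the documents. The following pinned packages are installed from `requirements.txt`; the notebook also imports `lxml` and `nltk`, which need to be available in the environment as well: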

- contractions==0.0.25
- ekphrasis==0.5.1
- tensorflow==2.1.0
- pandas==0.24.2
- numpy==1.18.1
- scikit-learn==0.23.1

In [None]:
!pip install -r requirements.txt

In [130]:
from lxml import etree # for processing xml
import re
import glob # file handling
import os
import pandas as pd
import pickle
from collections import Counter
import contractions # for expanding contractions
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer # For tokenizing documents
from ekphrasis.dicts.emoticons import emoticons # for processing emoticons

from nltk.corpus import wordnet # for nltk corpus
from nltk.stem import WordNetLemmatizer # for lemmatization
import nltk

# 2. Data Loading

First, we load the training label file and extract the document ids it contains.

In [131]:
train_labels_df = pd.read_csv(os.path.join('all_data', 'train_labels.csv'))
train_files = train_labels_df['id'].to_list() # get training data id 

Let's load the labels and extract the ids for the test data as well.

In [132]:
test_labels_df = pd.read_csv(os.path.join('all_data', 'test_labels.csv'))
test_files = test_labels_df['id'].to_list() # get test data id 

We will parse each XML file into a dictionary in which each key is the id of the XML file and the value is the list of document texts found in that file. Training and test files are kept in separate dictionaries.

In [133]:
all_files = train_files+test_files # consists of all ids of entire dataset

In [160]:
base_path = os.path.join('all_data','data')
document_dict = dict()
document_test_dict = dict()

# store content of each xml into dictionary
for file_name in all_files:
    doc_list = []
    file_path = os.path.join(base_path, file_name+'.xml')
    tree = etree.parse(file_path) # parse xml tree
    root = tree.getroot() # get root element of tree
    # store the text of each document element in a list
    for doc in root.iter('document'): 
        doc_list.append(doc.text)
    if file_name in test_files:
        document_test_dict[file_name] = doc_list # store in test data dictionary
    else:
        document_dict[file_name] = doc_list # store in train data dictionary

#### Find all special characters in the document

In [106]:
train_labels_df.head()

Unnamed: 0,id,gender
0,d7d392835f50664fc079f0f388e147a0,male
1,ee40b86368137b86f51806c9f105b34b,female
2,919bc742d9a22d65eab1f52b11656cab,male
3,15b97a08d65f22d97ca685686510b6ae,female
4,affa98421ef5c46ca7c8f246e0a134c1,female


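The regular expression below extracts every run of consecutive non-ASCII characters (anything outside `\u0000`-`\u007F`) from a document, so that the runs can be collected across all documents.

```python
re.findall('([^\u0000-\u007F]+)', doc)
```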
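As a small illustration of what this pattern returns, the sketch below runs it on two made-up strings. The sample texts and the name `NON_ASCII_RUN` are assumptions for illustration, not part of the notebook. Because the pattern matches runs, consecutive emojis are counted as one unit, which is why a doubled emoji is counted separately from a single one.

```python
import re
from collections import Counter

# Same character class as in the notebook: one or more consecutive non-ASCII characters
NON_ASCII_RUN = re.compile(r'[^\u0000-\u007F]+')

# Made-up sample texts (illustrative only)
sample_docs = ["It’s great 😂😂", "café – £5 😂"]

special_chars = []
for doc in sample_docs:
    # findall returns every non-overlapping run of non-ASCII characters
    special_chars.extend(NON_ASCII_RUN.findall(doc))

counts = Counter(special_chars)
print(counts.most_common())
```

Here the doubled emoji and the single emoji end up as two different keys of the counter.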

In [14]:
specialChars = []

# Iterate over each document file to store special characters
for id, doc_list in document_dict.items():
    for doc in doc_list:
        specialChars.extend(re.findall('([^\u0000-\u007F]+)', doc))

The most common special characters and emojis are displayed below.

In [15]:
Counter(specialChars).most_common(200)

[('…', 22137),
 ('’', 7174),
 ('😂', 2411),
 ('“', 1823),
 ('”', 1746),
 ('‘', 956),
 ('😊', 952),
 ('–', 886),
 ('😂😂', 872),
 ('—', 862),
 ('\xa0', 819),
 ('😍', 738),
 ('❤️', 658),
 ('😂😂😂', 652),
 ('😉', 642),
 ('£', 607),
 ('👍', 525),
 ('😭', 491),
 ('❤', 480),
 ('🤔', 471),
 ('🙄', 467),
 ('😘', 396),
 ('é', 389),
 ('•', 386),
 ('😀', 380),
 ('😁', 332),
 ('😳', 329),
 ('🙈', 320),
 ('😢', 305),
 ('€', 291),
 ('😜', 289),
 ('😩', 284),
 ('💕', 262),
 ('😎', 260),
 ('📷', 235),
 ('😄', 233),
 ('⚡️', 231),
 ('😬', 225),
 ('☺️', 222),
 ('👍🏻', 212),
 ('🙃', 211),
 ('👌', 210),
 ('😂😂😂😂', 208),
 ('😃', 199),
 ('😅', 194),
 ('♫', 193),
 ('👀', 185),
 ('😕', 182),
 ('🔥', 172),
 ('😏', 171),
 ('😡', 169),
 ('😱', 166),
 ('😒', 165),
 ('😔', 162),
 ('💜', 159),
 ('🎉', 157),
 ('☺', 154),
 ('✨', 144),
 ('😋', 140),
 ('💖', 139),
 ('👌🏻', 137),
 ('🎄', 136),
 ('👏', 136),
 ('😭😭', 135),
 ('🙂', 132),
 ('🎶', 131),
 ('😆', 131),
 ('😴', 130),
 ('👌🏼', 127),
 ('🙌🏻', 127),
 ('💔', 126),
 ('🇺🇸', 126),
 ('😫', 123),
 ('😝', 122),
 ('✌', 121),
 

As expected, emojis are used heavily in the tweets, alongside typographic punctuation such as ellipses and curly quotes.

# 3. Text Processing

First, we collect the external data needed for preprocessing the documents.

We start with a dictionary for replacing unicode emojis with text descriptions. The dictionary looks as follows:

```python
UNICODE_EMO = {'😜': ':face_with_stuck-out_tongue_&_winking_eye:',
               '😂': ':face_with_tears_of_joy:',
               '🤒': ':face_with_thermometer:',
               '😶': ':face_without_mouth:',
               '🏭': ':factory:',
               '🍂': ':fallen_leaf:',
               '👪': ':family:',
              .....}

```

This was downloaded from [here](https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py)

Some entries of the dictionary were also modified manually to remove the skin colour present in the emoji (just to be anti-racist) :)
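The manual edit itself is not shown in the notebook. As one possible way to get a similar effect programmatically, the sketch below strips the Unicode skin-tone modifiers (U+1F3FB to U+1F3FF) from a string before any dictionary lookup. The helper name `strip_skin_tone` is hypothetical, and this is not necessarily what was done to the pickled dictionary.

```python
import re

# Emoji skin-tone (Fitzpatrick) modifiers occupy the code points U+1F3FB..U+1F3FF
SKIN_TONE = re.compile('[\U0001F3FB-\U0001F3FF]')

def strip_skin_tone(text):
    """Remove skin-tone modifier code points, leaving the base emoji in place."""
    return SKIN_TONE.sub('', text)

thumbs_up_light = '\U0001F44D\U0001F3FB'  # thumbs up followed by a light skin-tone modifier
print(strip_skin_tone(thumbs_up_light) == '\U0001F44D')
```

With the modifiers stripped first, a single dictionary entry per base emoji is enough.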

In [135]:
# collect unicode replacement dictionary
with open('unicode_emo.pickle', 'rb') as handle:
    UNICODE_EMO = pickle.load(handle)

#### A small demo on how it looks

In [136]:
text = "game is on 🔥 :) 😂 and 😂😂😂"

# replacement from the above dictionary for emojis
for emo, equivalent in UNICODE_EMO.items():
    text = text.replace(emo,equivalent).replace('_',' ')
    
# replacement for emoticons with tags
for emot, emot_words in emoticons.items():
    text = text.replace(emot,emot_words)
print(text)

game is on :fire: <happy> :face with tears of joy: and :face with tears of joy::face with tears of joy::face with tears of joy:


After carefully studying the most common special characters, I decided to form the below dictionary for the replacement of several characters

In [137]:
special_dict = {'‘':'\'', 
                '’':'\'',
                '“':'\"',
                '”':'\"',
                '–': '-',
                '—':'-',
                'é':'e',
                'è':'e',
                'ú':'u',
                '🇺🇸':'US',
                'á':'a',
                'í':'i',
                'ā':'a',
                '🇨🇦':'CA',
                '°':'degree ',
                '🇦🇺':'AU',
                'ó':'o',
                'É':'E',
                'ñ':'n',
                '1/3': '',
                '2/3': '',
                '1/2': '',
                '2/2': '',
                '3/3': '',
                '...': '.',
                '..': '.',
                '&': 'and '}
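One ordering detail may be worth checking. If the raw text still contains HTML entities such as `&amp;`, replacing `&` with `and ` at this stage leaves `amp;` behind, and `amp` does appear among the most frequent words later in this notebook. The sketch below is an illustration under that assumption, not the notebook's code: it uses the standard-library `html.unescape` on a made-up string, and the helper `replace_specials` is hypothetical.

```python
import html

# Only the '&' entry of special_dict is needed for this illustration
special_subset = {'&': 'and '}

def replace_specials(text, unescape_first):
    """Apply the character replacements, optionally decoding HTML entities first."""
    if unescape_first:
        text = html.unescape(text)  # '&amp;' becomes '&'
    for char, replacement in special_subset.items():
        text = text.replace(char, replacement)
    return " ".join(text.split())  # collapse the doubled space left by 'and '

raw = "salt &amp; pepper"  # made-up sample
print(replace_specials(raw, unescape_first=False))
print(replace_specials(raw, unescape_first=True))
```

Whether this applies depends on how the entities are stored in the XML files, which the notebook does not show.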

After the initial preprocessing, a small experiment was conducted to find the most frequent words for which no embeddings were found. This experiment can be viewed in the **experiment.ipynb notebook**.

In [163]:
# these words were found by words coverage experiment 
replacement_dict = {'gr8': 'great',
                    'lmao': 'laughing',
                    'auspol': 'Australian politics',
                    'brexit': 'British exit'}
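In `transform_doc` this dictionary is applied with `str.replace`, which also matches inside longer words. If whole-word matching is preferred, one option is a single regular expression anchored on word boundaries. This is a sketch of that alternative, with the hypothetical names `WORD_PATTERN` and `replace_whole_words`, not the approach used in the notebook.

```python
import re

# Same mapping as in the notebook
replacement_dict = {'gr8': 'great',
                    'lmao': 'laughing',
                    'auspol': 'Australian politics',
                    'brexit': 'British exit'}

# One alternation over all keys, anchored on word boundaries
WORD_PATTERN = re.compile(r'\b(' + '|'.join(map(re.escape, replacement_dict)) + r')\b')

def replace_whole_words(text):
    """Replace dictionary keys only where they occur as complete words."""
    return WORD_PATTERN.sub(lambda match: replacement_dict[match.group(1)], text)

print(replace_whole_words("lmao that brexit vote was gr8"))
print(replace_whole_words("brexiteers"))  # left untouched: 'brexit' is only a prefix here
```

The trade-off is that variants such as plurals are no longer caught unless they are added to the dictionary.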

#### Getting some help from ekphrasis

We will use ekphrasis to normalize several kinds of tokens (listed below) and to annotate parts of the text that are useful for preprocessing.

In [164]:
# for removing these converted tags after processing 
replacement_dict1 = {'<url>':'', 
             '<hashtag>':'',
             '</hashtag>':'',
             '<elongated>':'',
             '<repeated>':'',
             '<emphasis>':'',
             '<user>':''}
    
text_processor = TextPreProcessor(
    # terms that will be normalized
    normalize=['url', 'email', 'percent', 'money', 'phone', 'user',
        'time', 'date', 'number', 'censored'],
    
    # terms that will be annotated
    annotate={"hashtag", "elongated", "repeated",
        'emphasis'},
    
    fix_html=True,  # fix HTML tokens like '&lt; &gt; &amp; \xa0
    
    # corpus from which the word statistics are going to be used 
    # for word segmentation 
    segmenter="twitter", 
    
    # corpus from which the word statistics are going to be used 
    # for spell correction
    corrector="twitter", 
    
    unpack_hashtags=True,  # perform word segmentation on hashtags
    unpack_contractions=False,  # Unpack contractions (can't -> can not) (Will do explicitly)
    spell_correct_elong=False,  # spell correction for elongated words
    
    # Social tokenizer, will take a string  as input and return a list of tokens
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    
    # list of dictionaries, for replacing tokens extracted from the text,
    # with other expressions.
    dicts=[emoticons, replacement_dict1] 
)

Reading twitter - 1grams ...
Reading twitter - 2grams ...
Reading twitter - 1grams ...


The function below contains the steps carried out to preprocess the XML files.

Each XML file contains several documents; preprocessing is done document by document and the results are stored back in the dictionary.

In [165]:
# remove special non ascii characters/emojis and some mappings 
def transform_doc(doc):
    '''
    preprocess each document for modelling
    @arg1: doc - document text to be preprocessed, dtype: String
    return: doc - preprocessed document, dtype: String
    '''
    
    # replace special characters from above dictionary 
    for char, replacement in special_dict.items():
        doc = doc.replace(char, replacement)
    
    # emoji replacement
    for emo, equivalent in UNICODE_EMO.items():
        doc = doc.replace(emo,equivalent).replace('_',' ')
    
    doc = doc.encode('ascii', "ignore").decode() # remove unnecessary encoded characters/words
    
    # removal of special urls
    url_pattern2 = re.compile(r'https?/\S+|www\.\S+')
    doc = url_pattern2.sub(r'', doc)
    
    # expand contractions
    doc = contractions.fix(doc)
    
    # preprocess all steps mentioned in above TextPreprocessor using ekphrasis
    doc = " ".join(text_processor.pre_process_doc(doc))
    
    # replace words found by the coverage experiment; this runs before the
    # non-letter characters are stripped, otherwise 'gr8' could never match
    for word, replacement in replacement_dict.items():
        doc = doc.replace(word, replacement)
    
    # remove everything except letters
    doc = re.sub('[^A-Za-z]+', ' ', doc)
    
    # collapse repeated whitespace (including newlines) and trim the ends
    doc = " ".join(doc.split())
    
    # convert to lower case
    doc = doc.lower()
    
    return doc
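To make the final cleanup steps of `transform_doc` concrete, the sketch below applies only the standard-library parts (ASCII filtering, letter-only filtering, whitespace collapsing, lowercasing) to a made-up string. The helper name `finish_cleanup` and the sample text are assumptions; the emoji, contraction and ekphrasis steps are deliberately left out.

```python
import re

def finish_cleanup(doc):
    """Standard-library subset of the cleanup steps, for illustration only."""
    doc = doc.encode('ascii', 'ignore').decode()  # drop any remaining non-ASCII characters
    doc = re.sub('[^A-Za-z]+', ' ', doc)          # keep letters only
    doc = " ".join(doc.split())                   # collapse whitespace and trim
    return doc.lower()

print(finish_cleanup("The façade is OPEN 24/7!!\n"))
```

Note that the unmapped character 'ç' is silently dropped, which is why accented letters that matter are mapped to ASCII in `special_dict` before this point.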

#### Preprocessing train data

We will preprocess the training and test data using the above function.

In [None]:
# iterate over each xml on training data for preprocessing
for xml_file in document_dict.keys(): 
    transformed_docs = []
    # transform each document for xml file using above function
    for doc in document_dict[xml_file]:
        transformed_docs.append(transform_doc(doc))

    document_dict.update({xml_file: transformed_docs})

We will perform similar processing for test data

In [None]:
# iterate over each xml on test data for preprocessing
for xml_file in document_test_dict.keys(): 
    transformed_docs = []
    # transform each document for xml file using above function
    for doc in document_test_dict[xml_file]:
        transformed_docs.append(transform_doc(doc))

    document_test_dict.update({xml_file: transformed_docs})

Now we will convert this dictionary of transformed text into a dataframe.

The dataframe will be in melted (long) form: one row per document, with the id and gender of its XML file repeated on each row.

In [None]:
# converting gender series to dictionary with id as index
label_lookup = pd.Series(train_labels_df.gender.values,index=train_labels_df.id).to_dict()

train_data = list()

# iterate over documents to build a list of tuples
for id in train_files:
    train_data.extend([tuple((id, doc, label_lookup[id])) for doc in document_dict[id]])

In [None]:
# convert list of tuples to dataframe
train_df = pd.DataFrame(train_data, columns=['id', 'document','gender'])

We will perform similar steps on test data 

In [None]:
# converting gender series to dictionary with id as index
label_lookup = pd.Series(test_labels_df.gender.values,index=test_labels_df.id).to_dict()

test_data = list()

# iterate over documents to build a list of tuples
for id in test_files:
    test_data.extend([tuple((id, doc, label_lookup[id])) for doc in document_test_dict[id]])

# convert list of tuples to dataframe
test_df = pd.DataFrame(test_data, columns=['id', 'document','gender'])

In [None]:
train_df.to_csv('training_data_dl.csv', index=False)
test_df.to_csv('testing_data_dl.csv', index=False)

The preprocessing steps carried out so far were designed so that features can be extracted from the text using embeddings for each word or sentence.

It is important to understand that steps such as stopword removal or removal of very frequent or very rare words should not be applied when extracting embedding features, because the context of the word or sentence may be lost.

To preserve syntax, you need to either leave the stop words in or alter your stop list so that you don't lose that information. For instance, cutting out all verbs of being (is, are, should be, ...) can mess up a neural network that depends somewhat on sentence structure.

We will now carry out the remaining text preprocessing steps, which are specific to TF-IDF.

# 4. Pre-processing for TF-IDF

### Stopwords removal

Let's grab the stopwords from nltk. Instead of importing them, I copied the list from the nltk package and adjusted it a bit to suit our needs (for example, 'face' was added).

In [None]:
STOPWORDS = {'a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't",
             'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', 
             "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during',
             'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 
             'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', 
             "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn',
             "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or',
             'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should',
             "should've", 'shouldn', "shouldn't", 'so', 'some', 'such', 't', 'than', 'that', "that'll", 'the', 'their', 'theirs', 'them',
             'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very',
             'was', 'wasn', "wasn't", 'we', 'were', 'weren', "weren't", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will',
             'with', 'won', "won't", 'wouldn', "wouldn't", 'face', 'y', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves'}

In [147]:
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

# remove stopwords from train data
train_df["document"] = train_df["document"].apply(lambda text: remove_stopwords(text))
train_df.head()

Unnamed: 0,id,document,gender
0,d7d392835f50664fc079f0f388e147a0,youch good things know sort stuff repairable,male
1,d7d392835f50664fc079f0f388e147a0,succumbed fomo bought gnr tickets remember ask...,male
2,d7d392835f50664fc079f0f388e147a0,brown eye broom cool number rescue clear broke...,male
3,d7d392835f50664fc079f0f388e147a0,shout auckland tennis fans get sleep morning w...,male
4,d7d392835f50664fc079f0f388e147a0,someone balls come tears joy,male


In [148]:
# remove stopwords from test data
test_df["document"] = test_df["document"].apply(lambda text: remove_stopwords(text))
test_df.head()

Unnamed: 0,id,document,gender
0,d6b08022cdf758ead05e1c266649c393,odds stops whining goes gets proper job like r...,male
1,d6b08022cdf758ead05e1c266649c393,would imagine moderately pleased draw,male
2,d6b08022cdf758ead05e1c266649c393,worth reading blog nick positive must admit tr...,male
3,d6b08022cdf758ead05e1c266649c393,hand take last race beaten number lengths jump...,male
4,d6b08022cdf758ead05e1c266649c393,certainly showing interest reading,male


### Remove frequent words

Here, we first find the most frequent words in the training corpus.

In [149]:
cnt = Counter()

for text in train_df["document"].values:
    for word in text.split():
        cnt[word] += 1

cnt.most_common(15)

[('number', 44605),
 ('amp', 14117),
 ('like', 12256),
 ('one', 12016),
 ('time', 11304),
 ('trump', 10876),
 ('new', 10528),
 ('would', 10338),
 ('get', 10244),
 ('good', 9984),
 ('joy', 9839),
 ('tears', 9802),
 ('smiling', 9547),
 ('day', 9524),
 ('heart', 9211)]

We will remove these words from each document

In [150]:
FREQWORDS = set([w for (w, wc) in cnt.most_common(15)])

def remove_freqwords(text):
    """custom function to remove the frequent words"""
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

# remove freqwords from training data
train_df["document"] = train_df["document"].apply(lambda text: remove_freqwords(text))
train_df.head()

Unnamed: 0,id,document,gender
0,d7d392835f50664fc079f0f388e147a0,youch things know sort stuff repairable,male
1,d7d392835f50664fc079f0f388e147a0,succumbed fomo bought gnr tickets remember ask...,male
2,d7d392835f50664fc079f0f388e147a0,brown eye broom cool rescue clear broken windo...,male
3,d7d392835f50664fc079f0f388e147a0,shout auckland tennis fans sleep morning worth...,male
4,d7d392835f50664fc079f0f388e147a0,someone balls come,male


### Remove Rare words

Let's extract the rare words, i.e. the 20 entries at the tail of the frequency ranking.

In [151]:
set([(w,wc) for (w, wc) in cnt.most_common()[:-20-1:-1]])

{('beardyman', 1),
 ('bulmers', 1),
 ('drumkits', 1),
 ('forbidding', 1),
 ('haptic', 1),
 ('hingmy', 1),
 ('maalouf', 1),
 ('mayzing', 1),
 ('mcx', 1),
 ('moldavia', 1),
 ('pusilanimous', 1),
 ('satpal', 1),
 ('semen', 1),
 ('shitport', 1),
 ('shitstorms', 1),
 ('sozz', 1),
 ('tiddleywinks', 1),
 ('traid', 1),
 ('wighty', 1),
 ('zucks', 1)}

We will now remove these rare words from the documents.

In [152]:
n_rare_words = 20
RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]]) # set of rare_words

def remove_rarewords(text):
    """custom function to remove the rare words"""
    return " ".join([word for word in str(text).split() if word not in RAREWORDS])

# remove rare words from training data
train_df["document"] = train_df["document"].apply(lambda text: remove_rarewords(text))
train_df.head()

Unnamed: 0,id,document,gender
0,d7d392835f50664fc079f0f388e147a0,youch things know sort stuff repairable,male
1,d7d392835f50664fc079f0f388e147a0,succumbed fomo bought gnr tickets remember ask...,male
2,d7d392835f50664fc079f0f388e147a0,brown eye broom cool rescue clear broken windo...,male
3,d7d392835f50664fc079f0f388e147a0,shout auckland tennis fans sleep morning worth...,male
4,d7d392835f50664fc079f0f388e147a0,someone balls come,male


Let's repeat the frequent-word and rare-word removal for the test data.

In [153]:
cnt = Counter()

for text in test_df["document"].values:
    for word in text.split():
        cnt[word] += 1

cnt.most_common(15)

[('number', 7503),
 ('amp', 2216),
 ('trump', 1977),
 ('one', 1951),
 ('like', 1861),
 ('get', 1820),
 ('time', 1756),
 ('new', 1752),
 ('would', 1744),
 ('joy', 1740),
 ('tears', 1738),
 ('smiling', 1526),
 ('good', 1515),
 ('day', 1460),
 ('happy', 1414)]

In [154]:
FREQWORDS = set([w for (w, wc) in cnt.most_common(15)])

# remove most frequent words in testing data
test_df["document"] = test_df["document"].apply(lambda text: remove_freqwords(text))
test_df.head()

Unnamed: 0,id,document,gender
0,d6b08022cdf758ead05e1c266649c393,odds stops whining goes gets proper job rest us,male
1,d6b08022cdf758ead05e1c266649c393,imagine moderately pleased draw,male
2,d6b08022cdf758ead05e1c266649c393,worth reading blog nick positive must admit tr...,male
3,d6b08022cdf758ead05e1c266649c393,hand take last race beaten lengths jumps recen...,male
4,d6b08022cdf758ead05e1c266649c393,certainly showing interest reading,male


In [155]:
set([w for (w, wc) in cnt.most_common()[:-20-1:-1]])

{'aby',
 'adedapo',
 'bellow',
 'boycot',
 'cojones',
 'endment',
 'endmentnow',
 'fooling',
 'gbollmann',
 'imbecile',
 'impeaching',
 'impeachtrumpandpence',
 'julios',
 'kellyann',
 'marxova',
 'montrel',
 'petkanas',
 'santuary',
 'trumperers',
 'wont'}

In [156]:
RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]]) # set of rare_words

# remove rare words from testing data
test_df["document"] = test_df["document"].apply(lambda text: remove_rarewords(text))
test_df.head()

Unnamed: 0,id,document,gender
0,d6b08022cdf758ead05e1c266649c393,odds stops whining goes gets proper job rest us,male
1,d6b08022cdf758ead05e1c266649c393,imagine moderately pleased draw,male
2,d6b08022cdf758ead05e1c266649c393,worth reading blog nick positive must admit tr...,male
3,d6b08022cdf758ead05e1c266649c393,hand take last race beaten lengths jumps recen...,male
4,d6b08022cdf758ead05e1c266649c393,certainly showing interest reading,male
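The cells above recompute `FREQWORDS` and `RAREWORDS` from the test data itself. A common alternative, sketched here as an assumption rather than as this notebook's method, is to derive both word sets from the training documents only and reuse them unchanged on the test documents, so that both splits are filtered with the same vocabulary. The function names and toy documents are hypothetical.

```python
from collections import Counter

def build_word_filters(train_docs, n_freq, n_rare):
    """Return (frequent, rare) word sets computed from the training documents only."""
    counts = Counter(word for doc in train_docs for word in doc.split())
    ranked = counts.most_common()  # sorted from most to least frequent
    freq_words = {word for word, _ in ranked[:n_freq]}
    rare_words = {word for word, _ in ranked[:-n_rare - 1:-1]}  # tail of the ranking
    return freq_words, rare_words

def remove_words(text, words):
    """Drop every token of text that is contained in words."""
    return " ".join(word for word in text.split() if word not in words)

# Toy data standing in for train_df["document"] and test_df["document"]
train_docs = ["good day good", "good night", "rare word day"]
test_docs = ["good rare evening"]

freq_words, rare_words = build_word_filters(train_docs, n_freq=1, n_rare=2)
blocked = freq_words | rare_words
print([remove_words(doc, blocked) for doc in test_docs])
```

This mirrors how a TF-IDF vocabulary is usually fitted on the training split and then applied to the test split.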


### Lemmatizer

We will use lemmatization here.

WordNetLemmatizer performs the lemmatization, and nltk's part-of-speech tagger supplies the POS tag of each word so that it can be reduced to its root form more accurately.

In [157]:
# load wordnet lemmatizer
lemmatizer = WordNetLemmatizer()

# POS mapping
wordnet_map = {"N":wordnet.NOUN, "V":wordnet.VERB, "J":wordnet.ADJ, "R":wordnet.ADV}
def lemmatize_words(text):
    '''custom function to get part of speech tagged text
    and convert it into root form'''
    pos_tagged_text = nltk.pos_tag(text.split())
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])

# all text from training data is converted to lemmatized form 
train_df["document"] = train_df["document"].apply(lambda text: lemmatize_words(text))
train_df.head()

Unnamed: 0,id,document,gender
0,d7d392835f50664fc079f0f388e147a0,youch thing know sort stuff repairable,male
1,d7d392835f50664fc079f0f388e147a0,succumb fomo buy gnr ticket remember ask paren...,male
2,d7d392835f50664fc079f0f388e147a0,brown eye broom cool rescue clear break window...,male
3,d7d392835f50664fc079f0f388e147a0,shout auckland tennis fan sleep morning worth ...,male
4,d7d392835f50664fc079f0f388e147a0,someone ball come,male
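The POS mapping in `lemmatize_words` keys on the first letter of the Penn Treebank tag returned by `nltk.pos_tag` and falls back to noun for everything else. The dependency-free sketch below mirrors that lookup. It uses the single-letter strings 'n', 'v', 'a' and 'r', which as far as I know are the values behind `wordnet.NOUN`, `wordnet.VERB`, `wordnet.ADJ` and `wordnet.ADV`; the helper name `to_wordnet_pos` is hypothetical.

```python
# First letter of a Penn Treebank tag -> WordNet POS letter
WORDNET_POS = {"N": "n",  # nouns (NN, NNS, NNP, ...)
               "V": "v",  # verbs (VB, VBD, VBG, ...)
               "J": "a",  # adjectives (JJ, JJR, JJS)
               "R": "r"}  # adverbs (RB, RBR, RBS)

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return WORDNET_POS.get(treebank_tag[0], "n")

for tag in ["VBD", "JJ", "NNS", "RB", "IN"]:
    print(tag, to_wordnet_pos(tag))
```

Tags outside the four groups, such as prepositions (IN), are lemmatized as nouns, which usually leaves the word unchanged.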


In [158]:
# all text from test data is converted to lemmatized form 
test_df["document"] = test_df["document"].apply(lambda text: lemmatize_words(text))
test_df.head()

Unnamed: 0,id,document,gender
0,d6b08022cdf758ead05e1c266649c393,odds stop whine go get proper job rest u,male
1,d6b08022cdf758ead05e1c266649c393,imagine moderately pleased draw,male
2,d6b08022cdf758ead05e1c266649c393,worth read blog nick positive must admit trust...,male
3,d6b08022cdf758ead05e1c266649c393,hand take last race beat lengths jump recency ...,male
4,d6b08022cdf758ead05e1c266649c393,certainly show interest reading,male


These dataframes are saved as CSV files, which can be used for feature extraction using TF-IDF.

In [159]:
train_df.to_csv('training_data_ml.csv', index=False)
test_df.to_csv('testing_data_ml.csv', index=False)

# References

1) Why you should avoid removing stopwords, Towards Data Science: https://towardsdatascience.com/why-you-should-avoid-removing-stopwords-aa7a353d2a52

2) UNICODE_EMO dictionary by Neel Shah: https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py

3) Contractions, a PyPI library: http://pypi.org/project/contractions/

4) Ekphrasis, a text-preprocessing tool: https://github.com/cbaziotis/ekphrasis
