# Extraction of Subject -   Verb - Object tuples related to Categories and Named Entities of Selected Classes

### Step 1. Loading Spacy models
***

We install Spacy's language library for the first run. Then we can comment-out the download command. Note that we are loading Spacy's "medium" model.


In [1]:
import re
import pandas as pd
import numpy as np
import spacy
import sys

## Run to install the language library, then comment-out
!{sys.executable} -m spacy download en_core_web_md

nlp = spacy.load('en_core_web_md')
print('Finished loading.')


[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
Finished loading.


### Step 2. Pre-processing
***


In [2]:
def clean(x):
    if pd.isnull(x): return x  
    x = x.strip()

    ## parentheses with only +/- digits, dots, spaces, commas, percentage sign, minus sign: replace with space
    x = re.sub(r'\([\d\+\- \.\,%\-]+\)', ' ',x)
    
    ## delete ,000 commas in numbers    
    x = re.sub(r'\b(\d+),(\d+)\b','\\1\\2',x)
    
    ## delete  000 spaces in numbers
    x = re.sub(r'\b(\d+) (\d+)\b','\\1\\2',x)
    
    ## remove more than one spaces
    x = re.sub(r' +', ' ',x)
    
    ## remove start and end spaces
    x = re.sub(r'^ +| +$', '',x,flags=re.MULTILINE) 
    
    ## space-comma -> comma
    x = re.sub(r' \,',',',x)
    
    ## space-dot -> dot
    x = re.sub(r' \.','.',x)
    
    return x



* Read the file _articles_5_23_20_27.xlsx_ with the fresh scraped content from the SE articles, i.e. the titles, URLs, abstracts, context sections, paragraph titles, full contents and related categories into a dataframe _SE_df_. This file was created with the existing spider and is easy to reproduce. In later versions, it will be created from the tables in the database. 
* Discard records with missing or duplicate titles and/or abstracts and/or raw contents (but *not* context sections which are frequently the same and some are missing) and do some data cleansing using function _clean()_. 
* Discard records which have empty strings in any of these columns (titles, abstracts, raw contents) after this data cleansing.


In [3]:
SE_df = pd.read_excel('articles_5_23_20_27.xlsx')
SE_df = SE_df[['title','url','abstract','context','Titles','Raw content','categories']]
SE_df.rename(columns={'Titles':'par titles','Raw content':'raw content'},inplace=True)
SE_df = SE_df.replace('', np.nan) 

SE_df = SE_df.dropna(axis=0,subset=['title','abstract','raw content'],how='any')

SE_df = SE_df.drop_duplicates(subset=["title"])
SE_df = SE_df.drop_duplicates(subset=["abstract"])
SE_df = SE_df.drop_duplicates(subset=["raw content"])

SE_df['raw content'] = SE_df['raw content'].apply(clean)
SE_df['abstract'] = SE_df['abstract'].apply(clean)
SE_df['context'] = SE_df['context'].apply(clean)
SE_df['par titles'] = SE_df['par titles'].apply(clean)
SE_df['title'] = SE_df['title'].apply(lambda x: x.strip()) ## do not change anything else - reference field!

SE_df = SE_df.replace('', np.nan) ## check if empty strings produced and drop records if necessary
SE_df = SE_df.dropna(axis=0,subset=['title','abstract','raw content'],how='any')

SE_df.reset_index(drop=True, inplace=True)

SE_df

Unnamed: 0,title,url,abstract,context,par titles,raw content,categories
0,Adult learning statistics,https://ec.europa.eu/eurostat/statistics-expla...,This article provides an overview of adult lea...,Lifelong learning can take place in a variety ...,Participation rate of adults in learning in th...,Participation rate of adults in learning in th...,"['Education and training', 'Lifelong learning'..."
1,Age of young people leaving their parental hou...,https://ec.europa.eu/eurostat/statistics-expla...,Leaving the parental home is considered as a m...,"In addition to the Labour Force Survey (LFS), ...",Geographical differences. Gender differences. ...,Geographical differences. Map 1 indicates that...,"['Household composition and family situation',..."
2,Administrative and support service statistics ...,https://ec.europa.eu/eurostat/statistics-expla...,This article presents an overview of statistic...,The freedom to provide services and the freedo...,Structural profile. Sectoral analysis. Country...,Structural profile. In 2017 there were 1.4 mil...,"['Services', 'Statistical article', 'Structura..."
3,Adult learning statistics - characteristics of...,https://ec.europa.eu/eurostat/statistics-expla...,This article presents an overview of European ...,Adults with a low level of educational attainm...,Formal and non-formal adult education and trai...,Formal and non-formal adult education and trai...,"['Education and training', 'Participation in e..."
4,Accommodation and food service statistics - NA...,https://ec.europa.eu/eurostat/statistics-expla...,This article presents an overview of statistic...,Tourism plays an important role in Europe and ...,Structural profile. Sectoral analysis. Country...,Structural profile. The accommodation and food...,"['Services', 'Statistical article', 'Structura..."
...,...,...,...,...,...,...,...
587,Ageing Europe - statistics on social life and ...,https://ec.europa.eu/eurostat/statistics-expla...,Ageing Europe — looking at the lives of older ...,,Physical activity of older people. Older peopl...,Physical activity of older people. People at w...,"['Statistical article', 'Poverty and social ex..."
588,Ageing Europe - statistics on working and movi...,https://ec.europa.eu/eurostat/statistics-expla...,Ageing Europe — looking at the lives of older ...,,Employment patterns among older people. Focus ...,Employment patterns among older people. In 201...,"['Statistical article', 'Labour market', 'Acci..."
589,Ageing Europe - statistics on health and disab...,https://ec.europa.eu/eurostat/statistics-expla...,Ageing Europe — looking at the lives of older ...,,Life expectancy and healthy life years among o...,Life expectancy and healthy life years among o...,"['Statistical article', 'Health', 'Mortality a..."
590,Agri-environmental indicator - commitments,https://ec.europa.eu/eurostat/statistics-expla...,This article provides a fact sheet of the Euro...,Agri-environmental instruments are needed to s...,Key messages. Assessment.,Key messages. At the end of the Rural Developm...,"['Agriculture', 'Environment', 'Environment an..."


* Similarly, read file _concepts_5_23_21_55.xlsx_ with the fresh scraped content from the SE Glossary articles, i.e. the titles, URLs, definitions and related categories into a dataframe _GL_df_. This file was created with the existing spider and is easy to reproduce. In later versions, it will be created from the tables in the database.
* Discard records with missing titles and/or URLs and/or definitions and do some data cleansing of the definitions using function _clean()_. 
* Drop records with duplicate URLs. 
* Discard records with definitions which point to redirections ('Redirect to ...) or are the remnants of deleted articles ('The revision #...').
* Discard duplicates in titles and definitions (which point to the same articles).

In [4]:
GL_df = pd.read_excel('concepts_5_23_21_55.xlsx')

GL_df = GL_df[['title','url','definition','categories']]
GL_df = GL_df.replace('', np.nan) 
GL_df = GL_df.dropna(axis=0,subset=['title','url','definition'],how='any')

GL_df['title'] = GL_df['title'].apply(lambda x: x.strip())
GL_df['url'] = GL_df['url'].apply(lambda x: x.strip())
GL_df['definition'] = GL_df['definition'].apply(clean)

GL_df = GL_df.drop_duplicates(subset=["url"])

idx = GL_df[GL_df['definition'].str.startswith('The revision #')].index
GL_df.drop(idx , inplace=True)
idx = GL_df[GL_df['definition'].str.startswith('Redirect to')].index
GL_df.drop(idx , inplace=True)

GL_df = GL_df.drop_duplicates(subset=["title","definition"])

GL_df.reset_index(drop=True, inplace=True)
GL_df


Unnamed: 0,title,url,definition,categories
0,Accrual recording,https://ec.europa.eu/eurostat/statistics-expla...,Accrual recording is the recording of the valu...,"['Glossary', 'Short-term business statistics g..."
1,Accidents to persons caused by rolling stock i...,https://ec.europa.eu/eurostat/statistics-expla...,Accidents to one or more persons that are eith...,"['Glossary', 'Statistical indicator', 'Transpo..."
2,Active enterprises - FRIBS,https://ec.europa.eu/eurostat/statistics-expla...,"<Brief user-oriented definition, one or a few ...",['Under construction']
3,Activation policies,https://ec.europa.eu/eurostat/statistics-expla...,The activation policies are policies designed ...,"['Economy and finance glossary', 'Glossary', '..."
4,Active enterprise,https://ec.europa.eu/eurostat/statistics-expla...,An active enterprise is an enterprise that had...,"['Economy and finance glossary', 'Glossary', '..."
...,...,...,...,...
1273,Aggregate demand,https://ec.europa.eu/eurostat/statistics-expla...,Aggregate demand is the total amount of goods ...,"['Economy and finance glossary', 'Glossary', '..."
1274,Age of vehicle,https://ec.europa.eu/eurostat/statistics-expla...,Age of vehicle is the length of time after the...,"['Glossary', 'Statistical indicator', 'Transpo..."
1275,Adult education,https://ec.europa.eu/eurostat/statistics-expla...,Adult education is specifically targeted at in...,"['Education and training glossary', 'Glossary'..."
1276,Activity rate,https://ec.europa.eu/eurostat/statistics-expla...,Activity rate is the percentage of active pers...,"['Economy and finance glossary', 'Glossary', '..."


* Create a dataframe _Categories_SE_ with:
    * the unique categories met in the SE articles in column _category_,
    * their stemmed tokens, without stop-words, in column _category tokens_. 
    * Stemming is carried out with library _nltk_ because it is not available in Spacy. 
    * Drop the category _Statistical article_.
* Do the same with the categories found in the SE Glossary articles (omitting the "glossary" in the end), drop the categories _Under construction_ and _Glossary_, create a dataframe _Categories_GL_, and 
* Merge the two dataframes into a _Categories_df_ dataframe dropping duplicates.


In [5]:

## create the Categories dataframe
import nltk
from nltk.stem.porter import *
stemmer = PorterStemmer()
all_stopwords = nlp.Defaults.stop_words

import ast
Categories_SE = pd.DataFrame(np.unique([el for i in range(len(SE_df)) 
                                        for el in ast.literal_eval(SE_df.loc[i,'categories'])]),
                                        columns=['category'])
Categories_SE['category tokens'] = Categories_SE['category'].apply(lambda x: 
                                                             [stemmer.stem(w.text.lower()) for w in nlp(str(x)) 
                                                             if not w.is_punct and not w.text.lower() in all_stopwords])                                                       

Categories_SE.drop( Categories_SE[ Categories_SE['category'] == 'Statistical article' ].index, inplace=True)

Categories_SE.reset_index(drop=True, inplace=True)

Categories_SE.to_excel('Categories_SE.xlsx')
Categories_SE

Categories_GL = pd.DataFrame(np.unique([el for i in range(len(GL_df)) 
                                        for el in ast.literal_eval(GL_df.loc[i,'categories'])]),
                                        columns=['category'])
Categories_GL['category'] = Categories_GL['category'].apply(lambda x: re.sub('glossary$','',x)) 
Categories_GL['category tokens'] = Categories_GL['category'].apply(lambda x: 
                                                             [stemmer.stem(w.text.lower()) for w in nlp(str(x)) 
                                                             if not w.is_punct and not w.text.lower() in all_stopwords])                                                       

idx = Categories_GL[ (Categories_GL['category'] == 'Under construction') | (Categories_GL['category'] == 'Glossary') ].index
Categories_GL.drop(idx , inplace=True)

Categories_GL.reset_index(drop=True, inplace=True)


Categories_GL.to_excel('Categories_GL.xlsx')
Categories_GL

Categories_df = pd.concat([Categories_SE,Categories_GL])
Categories_df.drop_duplicates(subset=["category"],inplace=True)
Categories_df.reset_index(drop=True, inplace=True)
del(Categories_SE, Categories_GL)
Categories_df

Unnamed: 0,category,category tokens
0,Accidents at work,"[accid, work]"
1,Acquisition of citizenship,"[acquisit, citizenship]"
2,Africa,[africa]
3,Agricultural performance,"[agricultur, perform]"
4,Agriculture,[agricultur]
...,...,...
237,Statistical method,"[statist, method]"
238,Structural business statistics,"[structur, busi, statist]"
239,Survey,[survey]
240,Tourism,[tourism]


### Step 3. An improved version of a Subject-Verb-Object extraction function using Spacy
***

* By Peter de Vocht, see [GitHub code](https://github.com/peter3125/enhanced-subject-verb-object-extraction/blob/master/subject_verb_object_extract.py).
* Function needs some **DESCRIPTION**.


In [6]:
# Copyright 2017 Peter de Vocht
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#import en_core_web_sm
from collections.abc import Iterable

# use spacy small model
#nlp = en_core_web_sm.load()



##ClearNLP Dependency Labels
## https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md

##https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf

## https://www.mathcs.emory.edu/~choi/doc/cu-2012-choi.pdf

# dependency markers for subjects
SUBJECTS = {"nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"}
## nominal subject, nominal subject passive, clausal subject, clausal subject passive, agent (e.g. killed by the "agent"), 
## expletive - an existential “there”

# dependency markers for objects
OBJECTS = {"dobj", "dative", "attr", "oprd"}
## direct object, dative (indirect object), attr: “to be”, “to seem”, “to appear”, object predicate

# POS tags that will break adjoining items
BREAKER_POS = {"CCONJ", "VERB"}
## coordinating conjunction, verb

# words that are negations
NEGATIONS = {"no", "not", "n't", "never", "none"}


# does dependency set contain any coordinating conjunctions?
def contains_conj(depSet):
    return "and" in depSet or "or" in depSet or "nor" in depSet or \
           "but" in depSet or "yet" in depSet or "so" in depSet or "for" in depSet


# get subs joined by conjunctions
def _get_subs_from_conjunctions(subs):
    more_subs = []
    for sub in subs:
        # rights is a generator
        rights = list(sub.rights)
        rightDeps = {tok.lower_ for tok in rights} 
        if contains_conj(rightDeps):
            more_subs.extend([tok for tok in rights if tok.dep_ in SUBJECTS or tok.pos_ == "NOUN"])
            if len(more_subs) > 0:
                more_subs.extend(_get_subs_from_conjunctions(more_subs))
    return more_subs


# get objects joined by conjunctions
def _get_objs_from_conjunctions(objs):
    more_objs = []
    for obj in objs:
        # rights is a generator
        rights = list(obj.rights)
        rightDeps = {tok.lower_ for tok in rights}
        if contains_conj(rightDeps):
            more_objs.extend([tok for tok in rights if tok.dep_ in OBJECTS or tok.pos_ == "NOUN"])
            if len(more_objs) > 0:
                more_objs.extend(_get_objs_from_conjunctions(more_objs))
    return more_objs


# find sub dependencies
def _find_subs(tok):
    head = tok.head
    while head.pos_ != "VERB" and head.pos_ != "NOUN" and head.head != head:
        head = head.head
    if head.pos_ == "VERB":
        subs = [tok for tok in head.lefts if tok.dep_ == "SUB"] ## !!! CHANGE: not stop-words ?
        if len(subs) > 0:
            verb_negated = _is_negated(head)
            subs.extend(_get_subs_from_conjunctions(subs))
            return subs, verb_negated
        elif head.head != head:
            return _find_subs(head)
    elif head.pos_ == "NOUN":
        return [head], _is_negated(tok)
    return [], False


# is the tok set's left or right negated?
def _is_negated(tok):
    parts = list(tok.lefts) + list(tok.rights)
    for dep in parts:
        if dep.lower_ in NEGATIONS:
            return True
    return False


# get all the verbs on tokens with negation marker
def _find_svs(tokens):
    svs = []
    verbs = [tok for tok in tokens if tok.pos_ == "VERB"]
    for v in verbs:
        subs, verbNegated = _get_all_subs(v)
        if len(subs) > 0:
            for sub in subs:
                svs.append((sub.orth_, "!" + v.orth_ if verbNegated else v.orth_))
    return svs


# get grammatical objects for a given set of dependencies (including passive sentences)
def _get_objs_from_prepositions(deps, is_pas):
    objs = []
    for dep in deps:
        if dep.pos_ == "ADP" and (dep.dep_ == "prep" or (is_pas and dep.dep_ == "agent")):
            objs.extend([tok for tok in dep.rights if tok.dep_  in OBJECTS or
                         (tok.pos_ == "PRON" and tok.lower_ == "me") or
                         (is_pas and tok.dep_ == 'pobj')])
    return objs


# get objects from the dependencies using the attribute dependency
def _get_objs_from_attrs(deps, is_pas):
    for dep in deps:
        if dep.pos_ == "NOUN" and dep.dep_ == "attr":
            verbs = [tok for tok in dep.rights if tok.pos_ == "VERB"]
            if len(verbs) > 0:
                for v in verbs:
                    rights = list(v.rights)
                    objs = [tok for tok in rights if tok.dep_ in OBJECTS]
                    objs.extend(_get_objs_from_prepositions(rights, is_pas))
                    if len(objs) > 0:
                        return v, objs
    return None, None


# xcomp; open complement - verb has no suject
def _get_obj_from_xcomp(deps, is_pas):
    for dep in deps:
        if dep.pos_ == "VERB" and dep.dep_ == "xcomp":
            v = dep
            rights = list(v.rights)
            objs = [tok for tok in rights if tok.dep_ in OBJECTS]
            objs.extend(_get_objs_from_prepositions(rights, is_pas))
            if len(objs) > 0:
                return v, objs
    return None, None


# get all functional subjects adjacent to the verb passed in
def _get_all_subs(v):
    verb_negated = _is_negated(v)
    ## !!! CHANGE: exclude stop-words ?
    subs = [tok for tok in v.lefts if tok.dep_ in SUBJECTS and tok.pos_ != "DET"]
    if len(subs) > 0:
        subs.extend(_get_subs_from_conjunctions(subs))
    else:
        foundSubs, verb_negated = _find_subs(v)
        subs.extend(foundSubs)
    return subs, verb_negated


# find the main verb - or any aux verb if we can't find it
## !!! CHANGE: exclude stop-words ?
def _find_verbs(tokens):
    verbs = [tok for tok in tokens if _is_non_aux_verb(tok)] ### !!!
    if len(verbs) == 0:
        verbs = [tok for tok in tokens if _is_verb(tok)] ### !!!
    
    return verbs


# is the token a verb?  (excluding auxiliary verbs)
def _is_non_aux_verb(tok):
    return tok.pos_ == "VERB" and (tok.dep_ != "aux" and tok.dep_ != "auxpass")


# is the token a verb?  (excluding auxiliary verbs)
def _is_verb(tok):
    return tok.pos_ == "VERB" or tok.pos_ == "AUX"


# return the verb to the right of this verb in a CCONJ relationship if applicable
# returns a tuple, first part True|False and second part the modified verb if True
def _right_of_verb_is_conj_verb(v):
    # rights is a generator
    rights = list(v.rights)

    # VERB CCONJ VERB (e.g. he beat and hurt me)
    if len(rights) > 1 and rights[0].pos_ == 'CCONJ':
        for tok in rights[1:]:
            if _is_non_aux_verb(tok):
                return True, tok

    return False, v


# get all objects for an active/passive sentence
def _get_all_objs(v, is_pas):
    # rights is a generator
    rights = list(v.rights)

    objs = [tok for tok in rights if tok.dep_ in OBJECTS or (is_pas and tok.dep_ == 'pobj')]
    objs.extend(_get_objs_from_prepositions(rights, is_pas))

    #potentialNewVerb, potentialNewObjs = _get_objs_from_attrs(rights)
    #if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
    #    objs.extend(potentialNewObjs)
    #    v = potentialNewVerb

    potential_new_verb, potential_new_objs = _get_obj_from_xcomp(rights, is_pas)
    if potential_new_verb is not None and potential_new_objs is not None and len(potential_new_objs) > 0:
        objs.extend(potential_new_objs)
        v = potential_new_verb
    if len(objs) > 0:
        objs.extend(_get_objs_from_conjunctions(objs))
    return v, objs


# return true if the sentence is passive - at the moment a sentence is assumed passive if 
# it has an auxpass (auxiliary passive) verb
def _is_passive(tokens):
    for tok in tokens:
        if tok.dep_ == "auxpass":
            return True
    return False


# resolve a 'that' where/if appropriate
def _get_that_resolution(toks):
    for tok in toks:
        if 'that' in [t.orth_ for t in tok.lefts]:
            return tok.head
    return None


# simple stemmer using lemmas
def _get_lemma(word: str):
    tokens = nlp(word)
    if len(tokens) == 1:
        return tokens[0].lemma_
    return word


# print information for displaying all kinds of things of the parse tree
def printDeps(toks):
    for tok in toks:
        print(tok.orth_, tok.dep_, tok.pos_, tok.head.orth_, [t.orth_ for t in tok.lefts], [t.orth_ for t in tok.rights])


# expand an obj / subj np using its chunk
def expand(item, tokens, visited):
    if item.lower_ == 'that':
        temp_item = _get_that_resolution(tokens)
        if temp_item is not None:
            item = temp_item

    parts = []

    if hasattr(item, 'lefts'):
        for part in item.lefts:
            if part.pos_ in BREAKER_POS:
                break
            if not part.lower_ in NEGATIONS:
                parts.append(part)

    parts.append(item)

    if hasattr(item, 'rights'):
        for part in item.rights:
            if part.pos_ in BREAKER_POS:
                break
            if not part.lower_ in NEGATIONS:
                parts.append(part)

    if hasattr(parts[-1], 'rights'):
        for item2 in parts[-1].rights:
            if item2.pos_ == "DET" or item2.pos_ == "NOUN":
                if item2.i not in visited:
                    visited.add(item2.i)
                    parts.extend(expand(item2, tokens, visited))
            break

    return parts


# convert a list of tokens to a string
def to_str(tokens):
    if isinstance(tokens, Iterable):
        return ' '.join([item.text for item in tokens])
    else:
        return ''


# find verbs and their subjects / objects to create SVOs, detect passive/active sentences
def findSVOs(tokens):
    svos = []
    is_pas = _is_passive(tokens) ## is an "auxpass" verb contained in the tokens?
    verbs = _find_verbs(tokens) ## get the main verbs (or aux verbs if none) 
    visited = set()  # recursion detection
    for v in verbs:
        subs, verbNegated = _get_all_subs(v)
        # hopefully there are subs, if not, don't examine this verb any longer
        if len(subs) > 0:
            isConjVerb, conjV = _right_of_verb_is_conj_verb(v)
            if isConjVerb:
                v2, objs = _get_all_objs(conjV, is_pas)
                for sub in subs:
                    for obj in objs:
                        objNegated = _is_negated(obj)
                        if is_pas:  # reverse object / subject for passive
                            svos.append((to_str(expand(obj, tokens, visited)),
                                         "!" + v.lemma_ if verbNegated or objNegated else v.lemma_, to_str(expand(sub, tokens, visited))))
                            svos.append((to_str(expand(obj, tokens, visited)),
                                         "!" + v2.lemma_ if verbNegated or objNegated else v2.lemma_, to_str(expand(sub, tokens, visited))))
                        else:
                            svos.append((to_str(expand(sub, tokens, visited)),
                                         "!" + v.lower_ if verbNegated or objNegated else v.lower_, to_str(expand(obj, tokens, visited))))
                            svos.append((to_str(expand(sub, tokens, visited)),
                                         "!" + v2.lower_ if verbNegated or objNegated else v2.lower_, to_str(expand(obj, tokens, visited))))
            else:
                v, objs = _get_all_objs(v, is_pas)
                for sub in subs:
                    if len(objs) > 0:
                        for obj in objs:
                            objNegated = _is_negated(obj)
                            if is_pas:  # reverse object / subject for passive
                                svos.append((to_str(expand(obj, tokens, visited)),
                                             "!" + v.lemma_ if verbNegated or objNegated else v.lemma_, to_str(expand(sub, tokens, visited))))
                            else:
                                svos.append((to_str(expand(sub, tokens, visited)),
                                             "!" + v.lower_ if verbNegated or objNegated else v.lower_, to_str(expand(obj, tokens, visited))))
                    else:
                        # no obj - just return the SV parts
                        ## !!! CHANGE: return 'Object:None' as object
                        svos.append((to_str(expand(sub, tokens, visited)),
                                     "!" + v.lower_ if verbNegated else v.lower_,'Object:None'))
                        #print('just return SV: ',(to_str(expand(sub, tokens, visited)),
                        #             "!" + v.lower_ if verbNegated else v.lower_,))                              
    
    return svos

### Step 4. Apply the SVO function to the various texts and find tuples relevant to Named Entities and Categories 
***


In each dataframe (SE_df and GL_df), create column **NER** which will hold dictionaries with the entities recognized as: 
 * Companies, agencies, institutions, etc. (code ORG), 
 * Countries, cities, states (code GPE), 
 * Nationalities or religious or political groups (code NORP), 
 * Non-GPE locations, mountain ranges, bodies of water (code LOCATION). 
 * Buildings, airports, highways, bridges, etc. (code FACILITY),
 * Named hurricanes, battles, wars, sports events, etc. (code EVENT),
 * Named documents made into laws (code LAW),
 * Any named language (code LANGUAGE),
 * People, including fictional (code PERSON).

In column **NER** in a record, the key is the entity and the values are:
* a list with the tuples of the occurences of the entity (token span's *start* index position, token span's *stop* index position), 
* a list of the corresponding (coded) sources, and 
* the count of occurences in the content of the text processed.

In each dataframe, we also create column **NER_SVOs** which will hold dictionaries with SVOs involving the above entities. In each dictionary in column NER_SVOs in a record, the key is the entity and the values are:
* a list with the SVO tuples, 
* a list with the corresponding coded sources, 
* three lists with the titles, URLs and sentences  where the corresponding SVOs were found (for debugging purposes), 
* the count of occurences in the content of the text processed.

Column **NER_SVOs** will also store keys of the form **"Cat:category_name"** corresponding to the **Categories**, with values **the lists of SVOs whose tokenized and stemmed terms have an [overlap coefficient](https://en.wikipedia.org/wiki/Overlap_coefficient) with some category's stemmed tokens** $\ge$ 0.4. This value was found after experimentation. 

Finally, we also create a separate dictionary **Glob_NER_SVOs** gathering the above SVOs information from all texts.



In [7]:
SE_df['NER'] = [dict() for i in range(len(SE_df))]
SE_df['NER_SVOs'] = [dict() for i in range(len(SE_df))]
GL_df['NER'] = [dict() for i in range(len(GL_df))]
GL_df['NER_SVOs'] = [dict() for i in range(len(GL_df))]
Glob_NER_SVOs = dict() ## a separate dictionary holding all SVOs from all articles
Cat_threshold = 0.4

In [8]:
def Overlap(lst1, lst2):
    return len(set(lst1).intersection(lst2))/min(len(set(lst1)),len(set(lst1)))

def process_texts(dat,source,column):

    nlp.max_length = 1500000
    
    for i in range(len(dat)):
        if (i+1) % 100 == 0: print('article i = ',i+1,' of ',len(dat))
        if all(dat.loc[i,[column]].isna()): continue    
        doc = nlp(dat.loc[i,column]) ## pre-process text
        url = dat.loc[i,'url']

        sents = doc.sents ## segment into sentences
        sents_list = [sent for sent in doc.sents]
        num_sents = len(sents_list)
        if num_sents ==0: 
            print(sents_list)
            raise Exception("Error A!") 

        for (j,sent) in enumerate(sents_list): ## Loop A over sentences #column 8
            #----------------------------------------------------------
            doc_sent = nlp(sent.text) ## pre-process sentence # column 12
            
            entities = doc_sent.ents ## general entities in sentence       
            selected_ents=[]
            if len(entities) > 0: ## otherwise proceed with SVOs vs. the categories only
                for ent in entities: ## just a check to verify the span of each entity IN THE SENTENCE
                    if ent.text != doc_sent.text[ent.start_char: ent.end_char]:
                        raise Exception("Error B!")             
            
                ## continue with selected named entities if any
                selected_ents = [ent for ent in entities if ent.label_ in ['ORG','GPE','NORP','LOCATION','FACILITY','EVENT','LAW','LANGUAGE','PERSON']] ## selected  entities
                ## cut +8.3, -17.4, 31353, etc.
                selected_ents = [ent for ent in selected_ents if not re.search(r'^[\d\+\-\.\,%\-]+$',ent.text) ] 


            svos = findSVOs(doc_sent) 
            for sv in svos: ## loop B1 over SVOs in sentence
            #--------------------------------------------------------------   
                if sv[-1] == 'Object:None': 
                    continue
                if '-' in sv or '%' in sv: 
                    continue
                if any([x.startswith('Figure') or x.startswith('Table') for x in sv]):
                    continue
                if any([re.search(r'(\d|\.|\+|\-)+',x) for x in sv]):
                    continue
                if sum([1 for x in sv if x.lower() in all_stopwords])>=1:
                    continue
                    
                ## open a parenthesis and then a number
                sv = tuple(re.sub(r'(\(|\))$','',x) for x in sv)    
                sv = tuple(re.sub(r'(\(|\))$','',x) for x in sv)  
                #print(sv)
                    
                for s in sv: ## loop C1 over each SVO # column 16
                #------------------------------------------------    
                    #print('searching in: ',s)
                    for e in selected_ents: ## loop D1 over each selected entity in an SVO # column 20
                    #----------------------    
                        #print('searching for ',e.text)
                        if s.find(e.text) != -1:
                            #print(sv,' : found ',e.text)
                            key = e.text.upper()
                            if key in dat.loc[i,'NER'].keys():
                                dat.loc[i,'NER'][key][0].append((e.start,e.end)) 
                                dat.loc[i,'NER'][key][1].append(source) 
                                dat.loc[i,'NER'][key][2] += 1 
                            else:    
                                dat.loc[i,'NER'][key] = [[(e.start,e.end)],[source],1]
                        
                            if key in dat.loc[i,'NER_SVOs'].keys():
                                if sv not in dat.loc[i,'NER_SVOs'][key][0]:
                                    dat.loc[i,'NER_SVOs'][key][0].append(sv) 
                                    dat.loc[i,'NER_SVOs'][key][1].append(source) 
                                    dat.loc[i,'NER_SVOs'][key][2] += 1 
                            else:    
                                dat.loc[i,'NER_SVOs'][key] = [[sv],[source],1] 
                        
                            ## global dictionary - avoid duplicates
                            key = e.text.upper()
                            if key in Glob_NER_SVOs.keys():
                                if sv not in Glob_NER_SVOs[key][0]:
                                    Glob_NER_SVOs[key][0].append(sv) 
                                    Glob_NER_SVOs[key][1].append(source)
                                    Glob_NER_SVOs[key][2].append(dat.loc[i,'title'])
                                    Glob_NER_SVOs[key][3].append(dat.loc[i,'url'])
                                    Glob_NER_SVOs[key][4].append(sent.text)
                                    Glob_NER_SVOs[key][5] += 1     
                            else:    
                                Glob_NER_SVOs[key] = [[sv],[source],[dat.loc[i,'title']],[dat.loc[i,'url']],[sent.text],1] 
                
            
                ## Continue loop C1 over each SVO # column 16, now with the Categories
                sj = ' '.join(sv)
                doc_sj = nlp(sj)
                sj = [w.text.lower() for w in doc_sj if not w.is_punct]
                sj = [w for w in sj if not w in all_stopwords]
                sj = [stemmer.stem(w) for w in sj]
                
                # sj = [stemmer.stem(w.text.lower()) for w in doc_sj if not w.is_punct and not w.text.lower() in all_stopwords]
                if len(sj) == 0: continue
                #print('\n',sv_copy)
                ##print('sj = ',sj)
                for m in range(len(Categories_df)): ## loop C2 over categories vs an SVO in a sentence
                #-----------------------------------------------------------------------------------    
                    ##print('cats:',categories_df.loc[m,'Category tokens'])
                    try:
                        overlap = Overlap(sj,Categories_df.loc[m,'category tokens'])
                    except:
                        print('sj = ',sj)
                        print('m=',m)
                        print('cats:',Categories_df.loc[m,'category tokens'])
                        raise
                    if overlap >= Cat_threshold:
                        ##print('sj = ',sj)
                        ##print(categories_df.loc[m,'Category'])
                        key = 'Cat:'+Categories_df.loc[m,'category'].upper()
                        if key in dat.loc[i,'NER_SVOs'].keys():
                            if sv not in dat.loc[i,'NER_SVOs'][key][0]:
                                dat.loc[i,'NER_SVOs'][key][0].append(sv) 
                                dat.loc[i,'NER_SVOs'][key][1].append(source) 
                                dat.loc[i,'NER_SVOs'][key][2] +=1
                        else:
                            dat.loc[i,'NER_SVOs'][key] = [[sv],[source],1]
                            
                        ## global dictionary 
                        if key in Glob_NER_SVOs.keys():
                            if sv not in Glob_NER_SVOs[key][0]:
                                Glob_NER_SVOs[key][0].append(sv) 
                                Glob_NER_SVOs[key][1].append(source) 
                                Glob_NER_SVOs[key][2].append(dat.loc[i,'title'])
                                Glob_NER_SVOs[key][3].append(dat.loc[i,'url'])
                                Glob_NER_SVOs[key][4].append(sent.text)                                
                                Glob_NER_SVOs[key][5] += 1                             
                        else:
                            Glob_NER_SVOs[key] = [[sv],[source],[dat.loc[i,'title']],[dat.loc[i,'url']],[sent.text],1]                             
                       
    return dat                              
 
                
                
                
                




#PERSON People, including fictional
#NORP Nationalities or religious or political groups
#FACILITY Buildings, airports, highways, bridges, etc.
#ORGANIZATION Companies, agencies, institutions, etc.
#GPE Countries, cities, states
#LOCATION Non-GPE locations, mountain ranges, bodies of water
#PRODUCT Vehicles, weapons, foods, etc. (Not services)
#EVENT Named hurricanes, battles, wars, sports events, etc.
#WORK OF ART Titles of books, songs, etc.
#LAW Named documents made into laws 
#LANGUAGE Any named language
#The following values are also annotated in a style similar to names:
#DATE Absolute or relative dates or periods
#TIME Times smaller than a day
#PERCENT Percentage (including “%”)
#MONEY Monetary values, including unit
#QUANTITY Measurements, as of weight or distance
#ORDINAL “first”, “second”
#CARDINAL Numerals that do not fall under another typ



### Step 5. Apply this  procedure to the various texts
***

* Update column NER in both dataframes.
* Update column NER_SVOs in both dataframes. 
* Update the separate global dictionary Glob_NER_SVOs.

#### SE articles titles.

In [9]:
SE_df = process_texts(SE_df,'SE title','title')

article i =  100  of  592
article i =  200  of  592
article i =  300  of  592
article i =  400  of  592
article i =  500  of  592


#### SE articles paragraph titles.

In [10]:

SE_df = process_texts(SE_df,'SE par. titles','par titles')

article i =  100  of  592
article i =  200  of  592
article i =  300  of  592
article i =  400  of  592
article i =  500  of  592


#### SE articles abstracts.

In [11]:

SE_df = process_texts(SE_df,'SE abstract','abstract')

article i =  100  of  592
article i =  200  of  592
article i =  300  of  592
article i =  400  of  592
article i =  500  of  592


#### SE articles context sections.

In [12]:
SE_df = process_texts(SE_df,'SE context','context')

article i =  100  of  592
article i =  200  of  592
article i =  300  of  592
article i =  400  of  592
article i =  500  of  592


#### SE articles full contents.

In [13]:

SE_df = process_texts(SE_df,'SE content','raw content')
              


article i =  100  of  592
article i =  200  of  592
article i =  300  of  592
article i =  400  of  592
article i =  500  of  592


#### SE Glossary articles titles.

In [14]:
GL_df = process_texts(GL_df,'GL title','title')

article i =  100  of  1278
article i =  200  of  1278
article i =  300  of  1278
article i =  400  of  1278
article i =  500  of  1278
article i =  600  of  1278
article i =  700  of  1278
article i =  800  of  1278
article i =  900  of  1278
article i =  1000  of  1278
article i =  1100  of  1278
article i =  1200  of  1278


#### SE Glossary articles definitions.

In [15]:
GL_df = process_texts(GL_df,'GL definition','definition')

article i =  100  of  1278
article i =  200  of  1278
article i =  300  of  1278
article i =  400  of  1278
article i =  500  of  1278
article i =  600  of  1278
article i =  700  of  1278
article i =  800  of  1278
article i =  900  of  1278
article i =  1000  of  1278
article i =  1100  of  1278
article i =  1200  of  1278


### Step 6. Exporting the dataframes to Excel
***

This is also useful for the manual inspection and the design of rules for the fine-tuning of the NER engine and the SVO extraction. This output can then directly be imported in the database.


In [16]:
import datetime
current_time = datetime.datetime.now() 
outfile1 = 'SE_SVOs_'+str(current_time.month)+ '_' + str(current_time.day) + '_' + str(current_time.hour)+ '_' + str(current_time.minute)  +'.xlsx'
outfile2 = 'GL_SVOs_'+str(current_time.month)+ '_' + str(current_time.day) + '_' + str(current_time.hour)+ '_' + str(current_time.minute)  +'.xlsx'
#SE_df.to_excel(outfile1)
#GL_df.to_excel(outfile2)

SE_df.to_excel('SE_df.xlsx')
GL_df.to_excel('GL_df.xlsx')


### Step 7. Checking the dictionary with all SVOs collected
***
* And write all SVOs to both Excel and text files. The files include all information useful for debugging.

In [17]:
import unidecode
#import pickle


import datetime

def file_name(pre,ext):
    current_time = datetime.datetime.now() 
    return pre + '_'+ str(current_time.month)+ '_' + str(current_time.day) + \
                 '_' + str(current_time.hour)+ '_' + str(current_time.minute)  +'.'+ext
    
outfile3 = file_name('SVOs_all','txt')
outfile3b = file_name('SVOs_all','pkl')
outfile3c = file_name('SVOs_all','xlsx')

#with open(outfile3b, 'wb') as file:
#        pickle.dump(Glob_NER_SVOs, file, pickle.HIGHEST_PROTOCOL)

Glob_NER_SVOs_2 = {k:v for k,v in sorted(Glob_NER_SVOs.items(), key=lambda item: item[0])}

results = pd.DataFrame(index=range(len(Glob_NER_SVOs.items())),columns=['Key','Source','Subject','Verb','Object','Title','URL','Sentence'])
c = -1
with open(outfile3, 'w') as file:
    for key in Glob_NER_SVOs_2.keys():
            print('<'+key+'>',end=' ')
            number = Glob_NER_SVOs_2[key][5]
            print(number, ' entries')
            #Glob_NER_SVOs[key][0].append(sv) 
            #Glob_NER_SVOs[key][1].append(source) 
            #Glob_NER_SVOs[key][2].append(dat.loc[i,'title'])
            #Glob_NER_SVOs[key][3].append(dat.loc[i,'url'])
            #Glob_NER_SVOs[key][4].append(sent.text)                                
            #Glob_NER_SVOs[key][5] += 1    
            phrases, sources, titles, urls, sentences = Glob_NER_SVOs_2[key][0:5]
            for (i,(phrase,source,title,url,sentence)) in enumerate(zip(phrases,sources,titles,urls,sentences)):
                s = unidecode.unidecode(str(phrase))
                s0 = unidecode.unidecode(phrase[0])
                s1 = unidecode.unidecode(phrase[1])
                s2 = unidecode.unidecode(phrase[2])
                st = unidecode.unidecode(title)
                surl=url
                ss = unidecode.unidecode(sentence)
                ## print('{0:70s} / {1:30s} {2:5d}: {3:s}\n'.format(key,source,i,s))
                ##print(ss)
                file.write('{0:70s} / {1:30s} {2:5d}: {3:30s} {4:30s} {5:30s} {6:s} {7:s}\n'.format(unidecode.unidecode(key),source,i,s0,s1,s2,st,ss))
                #file.write('{0:40s} / {1:16s} {2:4d}: {3:s}\n'.format(unidecode.unidecode(key),source,i,s))
                c +=1
                results.loc[c,'Key'] = format(unidecode.unidecode(key))
                results.loc[c,'Source'] = source
                results.loc[c,'Subject'] = s0
                results.loc[c,'Verb'] = s1
                results.loc[c,'Object'] = s2
                results.loc[c,'Title'] = st
                results.loc[c,'URL'] = surl
                results.loc[c,'Sentence'] = ss

#results.to_excel(outfile3c)
results.to_excel('SVOs_all.xlsx')

<A COUNCIL RECOMMENDATION> 1  entries
<AAA> 1  entries
<ACER> 1  entries
<AEA> 8  entries
<AEI> 3  entries
<AES> 2  entries
<AETIOLOGY> 1  entries
<AF> 1  entries
<AFGHAN> 4  entries
<AFGHANS> 1  entries
<AFRICAN> 7  entries
<AGEING WORKING GROUP> 1  entries
<AGENCY> 1  entries
<AGRICULTURAL> 6  entries
<AIC> 1  entries
<AIREN> 1  entries
<ALBANIA> 13  entries
<ALBANIAN> 5  entries
<ALBANIANS> 1  entries
<ALENTEJO> 2  entries
<ALGARVE> 1  entries
<ALGECIRAS> 1  entries
<ALGERIA> 15  entries
<ALGERIAN> 2  entries
<ALL EUROPEAN UNION> 1  entries
<ALPS> 1  entries
<AMIF> 1  entries
<AMSTERDAM> 7  entries
<AMSTERDAM SCHIPHOL> 1  entries
<ANIMAL> 1  entries
<ANTWERPEN> 7  entries
<ARA> 1  entries
<ARABIC> 1  entries
<ARGENTINA> 1  entries
<ARMENIA> 17  entries
<AROPE> 2  entries
<ASEAN> 20  entries
<ASEM> 25  entries
<ASIAN> 13  entries
<ASSOCIATION AGREEMENTS> 1  entries
<ATHENS> 1  entries
<ATTIKI> 1  entries
<AUSTRALIA> 3  entries
<AUSTRALIAN> 1  entries
<AUSTRIA> 62  entries
<AUSTRIAN> 

<Cat:NON-EU COUNTRIES> 40  entries
<Cat:PAGES USING DUPLICATE ARGUMENTS IN TEMPLATE CALLS> 1  entries
<Cat:PAGES WITH BROKEN FILE LINKS> 2  entries
<Cat:PARTICIPATION IN CULTURE> 6  entries
<Cat:PARTICIPATION IN EDUCATION AND TRAINING> 12  entries
<Cat:POPULATION> 1  entries
<Cat:POPULATION > 1  entries
<Cat:POPULATION AGEING> 27  entries
<Cat:POPULATION AND SOCIAL CONDITIONS> 4  entries
<Cat:POPULATION BY AREA AND REGION> 34  entries
<Cat:POPULATION SIZE AND PROJECTIONS> 13  entries
<Cat:POSTAL STATISTICS > 3  entries
<Cat:POVERTY AND SOCIAL EXCLUSION> 16  entries
<Cat:PRICE LEVELS BY CONSUMPTION GROUPS> 21  entries
<Cat:PRODUCTION STATISTICS> 10  entries
<Cat:QUALITY OF LIFE> 21  entries
<Cat:REGIONAL YEARBOOK> 4  entries
<Cat:REGIONS - AGRICULTURE> 5  entries
<Cat:REGIONS - ECONOMY AND FINANCE> 8  entries
<Cat:REGIONS - EDUCATION AND TRAINING> 13  entries
<Cat:REGIONS - HEALTH> 5  entries
<Cat:REGIONS - LABOUR MARKET> 42  entries
<Cat:REGIONS - POPULATION> 13  entries
<Cat:REGIONS -

<LUXEMBOURGIAN> 1  entries
<LUXEMBOURGISH> 1  entries
<MAASTRICHT> 5  entries
<MACINTOSH> 1  entries
<MADEIRA> 3  entries
<MADRID> 2  entries
<MALAYSIA> 3  entries
<MALTA> 122  entries
<MALTESE> 2  entries
<MANAGEMENT BOARD> 1  entries
<MARSEILLE> 2  entries
<MARTINIQUE> 1  entries
<MAYOTTE> 5  entries
<MBT> 1  entries
<MELILLA> 1  entries
<MEMBER STATES> 65  entries
<MERCOSUR> 1  entries
<MESSINA> 1  entries
<MEXICO> 3  entries
<MICROBLOGGING> 1  entries
<MILAN> 1  entries
<MIP> 1  entries
<MMTCDE> 1  entries
<MNC> 1  entries
<MNE> 1  entries
<MOLDOVA> 13  entries
<MONGOLIA> 1  entries
<MONTENEGRO> 29  entries
<MOROCCAN> 1  entries
<MOROCCANS> 1  entries
<MOROCCO> 20  entries
<MOX> 1  entries
<MPI> 2  entries
<MSITS> 1  entries
<MSY> 1  entries
<MUNICH> 2  entries
<MUNICH REINSURANCE COMPANY> 1  entries
<MYANMAR> 4  entries
<NACE> 26  entries
<NACE DIVISIONS> 1  entries
<NACE SECTIONS> 1  entries
<NATIONAL ACTION PLANS> 1  entries
<NATIONAL STATISTICAL INSTITUTES> 3  entries
<NATURA> 

<TURKISH> 2  entries
<UAA> 10  entries
<UCI> 1  entries
<UK> 7  entries
<UKRAINE> 18  entries
<UKRAINIAN> 3  entries
<UKRAINIANS> 2  entries
<ULTIMATE> 1  entries
<UN> 21  entries
<UNECE> 1  entries
<UNEMPLOYED> 1  entries
<UNESCO> 3  entries
<UNESCO S&T> 1  entries
<UNFCCC> 1  entries
<UNION> 1  entries
<UNITED STATES> 1  entries
<UNSC> 1  entries
<UNWTO> 4  entries
<UR> 2  entries
<URBAN> 2  entries
<US> 8  entries
<USA> 1  entries
<USP> 1  entries
<UTRECHT> 3  entries
<UWWTP> 1  entries
<VALENCIA> 1  entries
<VENICE> 1  entries
<VET> 4  entries
<VIETNAM> 4  entries
<VLAAMS GEWEST> 1  entries
<VOLKSWAGEN> 1  entries
<VÝCHODNÉ SLOVENSKO> 1  entries
<WA> 2  entries
<WARSZAWSKI> 2  entries
<WATERWAYS> 1  entries
<WHITE> 4  entries
<WHITE PAPER> 2  entries
<WLTP> 1  entries
<WOLFSBURG> 1  entries
<WORLD WAR II> 1  entries
<WTO> 7  entries
<XG ECO> 1  entries
<YEI> 1  entries
<YOUTH> 4  entries
<YUGOZAPADEN> 1  entries
<YUZHEN> 1  entries
<ZARAGOZA> 6  entries
<ZEEBRUGGE> 1  entries
<ZEEL

* Verify the information written to the file.

In [18]:
%%script false --no-raise-error

with open(outfile3c, 'r') as f:
    count = 0
 
    while True:
        line = f.readline()
        if not line:
            break
        print(line)
 
