# Text extraction using spaCy  

Running this notebook produces a JSON file, `info/text_extraction.json`, which is then processed by [text_extraction_processing](text_extraction_processing.ipynb).  

This notebook extracts dependencies from text. For example, consider the following string, extracted from the wikia page `Link.html`:

 > **Link** is the main protagonist of the [Legend of Zelda series](The_Legend_of_Zelda_series.html). He is the everlasting hero of the setting, having appeared throughout the ages in a neverending line of incarnations. The various heroes who use the name Link are courageous young boys or teenagers in [green clothing](Hero%27s_Clothes.html) who leave their homes to save the world from evil forces threatening it.
 
Processing this fragment produces something like the following:

```
{
 "Link.html": {
  "wikia": {
   "name": "Link",
   "paragraphs": [
    {
     "text": "Link is the main protagonist of the Legend of Zelda series. He is the everlasting hero of the setting, having appeared throughout the ages in a neverending line of incarnations. The various heroes who use the name Link are courageous young boys or teenagers in green clothing who leave their homes to save the world from evil forces threatening it.",
     "links": [
      {
       "href": "The_Legend_of_Zelda_series.html",
       "text": "Legend of Zelda series"
      }, ...
     ],
     "bolds": [
      "Link"
     ],
     "relations": [
      {
       "subject": "Link",
       "relation": "is",
       "attribute": "protagonist"
      }, ...
     ],
     "details": [
      {
       "attribute": "protagonist",
       "relation": "of",
       "subject": "the Legend of Zelda"
      }, ...
     ]
    }
   ]
  }
 }
}
```
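
To show how this structure can be consumed downstream, here is a minimal sketch that flattens the `relations` and `details` of every paragraph into plain triples. The helper name `iter_triples` and the trimmed inline sample are illustrative assumptions, not part of the notebook; both keys are treated as optional here.

```python
import json

# A trimmed copy of the example above, kept inline so the sketch is self-contained.
sample = json.loads("""
{
 "Link.html": {
  "wikia": {
   "name": "Link",
   "paragraphs": [
    {
     "relations": [{"subject": "Link", "relation": "is", "attribute": "protagonist"}],
     "details": [{"attribute": "protagonist", "relation": "of", "subject": "the Legend of Zelda"}]
    }
   ]
  }
 }
}
""")

def iter_triples(extracted):
    """Yield (page, source, left, relation, right) for every relation and detail."""
    for page, by_source in extracted.items():
        for source, entry in by_source.items():
            for paragraph in entry.get("paragraphs", []):
                # Relations read subject > relation > attribute.
                for rel in paragraph.get("relations", []):
                    yield page, source, rel["subject"], rel["relation"], rel["attribute"]
                # Details read attribute > relation > subject.
                for det in paragraph.get("details", []):
                    yield page, source, det["attribute"], det["relation"], det["subject"]

triples = list(iter_triples(sample))
for triple in triples:
    print(" > ".join(triple[2:]))
```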


In [1]:
import spacy
import re
import os
import json

import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
from slugify import slugify

from urllib.parse import urlparse
from urllib.parse import unquote

from nltk.tokenize import sent_tokenize

from extracted_entities import ParsedParagraph, Relation, RelationDetails, ExtractedEncoder
from ie_conf import get_htmls_route
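
The `extracted_entities` module is not shown in this notebook. Judging only from how its classes are used below (four positional arguments, a `subject > relation > attribute` string form, and the JSON keys in the example above), it could look roughly like the following sketch. Every field name here, in particular `position`, is a guess and not the actual module.

```python
import json
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Relation:
    subject: Any       # spaCy tokens in the notebook; anything with str() works here
    relation: Any
    attribute: Any
    position: int = 0  # hypothetical name for the character offset (subject.idx)

    def __str__(self):
        return f"{self.subject} > {self.relation} > {self.attribute}"

@dataclass
class RelationDetails:
    attribute: Any
    relation: Any
    subject: Any
    position: int = 0

    def __str__(self):
        return f"{self.attribute} > {self.relation} > {self.subject}"

@dataclass
class ParsedParagraph:
    text: str
    links: List[dict] = field(default_factory=list)
    bolds: List[str] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)
    details: List[RelationDetails] = field(default_factory=list)

class ExtractedEncoder(json.JSONEncoder):
    """Serialize the classes above; anything else falls back to the default encoder."""
    def default(self, o):
        if isinstance(o, Relation):
            return {"subject": str(o.subject), "relation": str(o.relation),
                    "attribute": str(o.attribute)}
        if isinstance(o, RelationDetails):
            return {"attribute": str(o.attribute), "relation": str(o.relation),
                    "subject": str(o.subject)}
        if isinstance(o, ParsedParagraph):
            return {"text": o.text, "links": o.links, "bolds": o.bolds,
                    "relations": o.relations, "details": o.details}
        return super().default(o)

paragraph = ParsedParagraph("Link is the main protagonist.", bolds=["Link"])
paragraph.relations = [Relation("Link", "is", "protagonist")]
encoded = json.dumps(paragraph, cls=ExtractedEncoder)
print(encoded)
```

Passing the encoder through `cls=` lets `json.dumps` handle nested objects: the lists returned by `default` are walked again, so each `Relation` inside a `ParsedParagraph` is converted in turn.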

Load the entity lists from disk and merge them into a single dataframe indexed by page URL. Also define, for each source, the directory that holds its HTML files.

In [2]:
sources = {
    'gamepedia': get_htmls_route("gamepedia"),
    'wikia': get_htmls_route("wikia")
}

wikia = pd.read_csv("info/entities.wikia.csv", 
                    names=["id", "name", "url"],
                    usecols=["name", "url"], 
                    header=0, index_col=["url"])
gamepedia = pd.read_csv("info/entities.gamepedia.csv", 
                        names=["id", "name", "url"],
                        usecols=["name", "url"], 
                        header=0, index_col=["url"])

grouped = pd.merge(wikia, gamepedia, 
                   left_index=True, right_index=True,
                   suffixes=["_wikia","_gamepedia"], 
                   how='outer')
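
As a rough illustration of what the outer merge does, the sketch below merges two toy frames that stand in for the CSV files. The page names are made up for the example.

```python
import pandas as pd

# Toy stand-ins for the two entity lists, indexed by page URL.
wikia_toy = pd.DataFrame({"name": ["Link"]},
                         index=pd.Index(["Link.html"], name="url"))
gamepedia_toy = pd.DataFrame({"name": ["Link", "Badge"]},
                             index=pd.Index(["Link.html", "Badge.html"], name="url"))

merged = pd.merge(wikia_toy, gamepedia_toy,
                  left_index=True, right_index=True,
                  suffixes=("_wikia", "_gamepedia"),
                  how="outer")
print(merged)

# A page missing from one source gets NaN in that source's column.
print(merged["name_wikia"].isna())
```

With `how='outer'` every URL from either source is kept, which is why the later processing step has to check each `name_*` column for missing values before opening a file.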

Load the spaCy English model. Calling the resulting `nlp` object on a string is the most common way to get a `Doc` object.

In [3]:
nlp = spacy.load('en_core_web_lg')

## Extracting *specific* dependencies

In [7]:
text = "Link is the main protagonist of the Legend of Zelda series. He is the everlasting hero of " + \
"the setting, having appeared throughout the ages in a neverending line of incarnations. " + \
"Veil Springs is a location from The Legend of Zelda"

sentences = sent_tokenize(text)

def debug_token(word, indent=0):
    # ent_type_ is an empty string (never None) when the token is not part of an entity
    print(("\t" * indent) + str(word), 
          word.dep_, 
          word.ent_type_ or "None")
    
def get_dependencies(sent):
    relations =[]
    details = []
    doc = nlp(sent)
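    # Merge each named entity into a single token. Span.merge() was removed
    # in spaCy v3; the retokenizer is available from v2.1 onwards.
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent, attrs={"TAG": ent.root.tag_, "LEMMA": ent.text, "ENT_TYPE": ent.label_})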

    for word in doc: # word is spacy.tokens.token.Token
        if word.dep_ == 'attr': # dep_ is the syntactic dependency relation
            attr = word
            relation = word.head # The syntactic parent, or "governor", of 'attr'.
            for subject in relation.lefts: # The leftward immediate children of the 'parent'
                relations.append(Relation(subject, relation, attr, subject.idx))
        elif word.dep_ == 'pobj':
            subject = word
            relation = word.head
            attr = relation.head
            if attr.dep_ == 'attr':
                details.append(RelationDetails(attr, relation, subject, subject.idx))

    return relations, details

for s in sentences:
    rels, dets = get_dependencies(s)
    for rel in rels:
        print(rel)
    for det in dets:
        print(det)
    print()

Link > is > protagonist
protagonist > of > the Legend of Zelda

He > is > hero
hero > of > setting

Veil Springs > is > location
location > from > The Legend of Zelda
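
The traversal does not depend on spaCy itself, only on each token's dependency label, head and left children. The sketch below replays the same two rules on a hand-built tree. `ToyToken` and the dependency labels assigned to it are assumptions written by hand, not parser output.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)
class ToyToken:
    """Hand-built stand-in for a spaCy token."""
    text: str
    dep_: str
    head: Optional["ToyToken"] = None
    lefts: List["ToyToken"] = field(default_factory=list)

# Hand-written parse of "Link is protagonist of series" (determiners omitted).
is_ = ToyToken("is", "ROOT")
link = ToyToken("Link", "nsubj", head=is_)
protagonist = ToyToken("protagonist", "attr", head=is_)
of = ToyToken("of", "prep", head=protagonist)
series = ToyToken("series", "pobj", head=of)
is_.lefts = [link]
tokens = [link, is_, protagonist, of, series]

relations, details = [], []
for word in tokens:
    if word.dep_ == "attr":
        # The subject sits to the left of the verb governing the attribute.
        for subject in word.head.lefts:
            relations.append((subject.text, word.head.text, word.text))
    elif word.dep_ == "pobj":
        prep = word.head
        # Keep only prepositional objects whose preposition hangs off an attribute.
        if prep is not None and prep.head is not None and prep.head.dep_ == "attr":
            details.append((prep.head.text, prep.text, word.text))

print(relations)
print(details)
```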



## Applying the extraction to our wikis

In [8]:
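# Raw strings avoid invalid-escape warnings in the patterns below.
spac_s = re.compile(r"\s+([,.?!])")          # whitespace before punctuation
spaces = re.compile(r"\s+")                   # any run of whitespace
japs = re.compile(r"\(.*[ぁ-んァ-ン]+.+\)\s")   # parenthesised text containing kana
sqbr = re.compile(r"\[[0-9a-z\s]+\]")         # bracketed markers such as [1]

def clean_string(label):
    """Remove quotes, stray whitespace, kana parentheticals and bracketed markers."""
    st = label
    st = st.replace('"', '')
    st = re.sub(spac_s, r'\g<1>', st)
    st = re.sub(spaces, ' ', st)
    st = re.sub(japs, '', st)
    st = re.sub(sqbr, '', st)
    return st.strip()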

def extract_p_features(p):
    links_ = p.findAll('a')
    links = []
    if links_:
        links = [{'href':anchor.get('href', '#'), 'text':clean_string(anchor.text)} 
                 for anchor in links_ if not anchor.get('href', '#').startswith("../../")]
    bolds_ = p.find_all('b')  # find() would return only the first <b> tag
    bolds = []
    if bolds_:
        bolds = [clean_string(b.get_text()) for b in bolds_]
    txt = clean_string(p.text)
    return ParsedParagraph(txt, links, bolds)

def extract_paragraphs(file, num_paragraphs=1):
    """Parse the first `num_paragraphs` top-level <p> elements of a page."""
    with open(file, "r", encoding="utf8") as r:
        page = BeautifulSoup(r, "lxml")
    content = page.find('div', {'id':'mw-content-text'})
    if content is None:  # page without the expected content container
        return []
    ps = content.findAll('p', recursive=False)
    return [extract_p_features(p) for p in ps[:num_paragraphs]]


In [9]:
sample = grouped.sample(2)
sample.head()

| url | name_wikia | name_gamepedia |
|---|---|---|
| Sacred_Shield_III.html | NaN | Badge |
| Romani_Mask.html | NaN | Romani's Mask |


In [10]:
def get_rels_from_df(dataframe, paragraph_count=1):
    extracted = {}

    for r in dataframe.iterrows():
        resource = r[0]
        extracted[resource] = { }
        for source in sources:
            if pd.notna(r[1]["name_" + source]):
                f = os.path.join(sources[source], resource)
                if not os.path.exists(f):
                    continue
                paragraphs = []
                extracted_paragraphs = extract_paragraphs(f,paragraph_count)
                for paragraph in extracted_paragraphs:
                    sentences = sent_tokenize(paragraph.text)
                    relations = []
                    details = []
                    for s in sentences:
                        rels, dets = get_dependencies(s)
                        if rels:
                            relations.extend(rels)
                        if dets:
                            details.extend(dets)
                            
                    if relations:
                        paragraph.relations = relations
                    if details:
                        paragraph.details = details
                    paragraphs.append(paragraph)
                extracted[resource][source] = {
                    'name': r[1]["name_" + source]
                }
                extracted[resource][source]["paragraphs"] = paragraphs
    return extracted

extracted = get_rels_from_df(sample)
print(json.dumps(extracted, indent=1,cls=ExtractedEncoder))

{
 "Sacred_Shield_III.html": {
  "gamepedia": {
   "name": "Badge",
   "paragraphs": [
    {
     "text": "Badges are objects in Hyrule Warriors and Hyrule Warriors Legends.",
     "links": [
      {
       "href": "Hyrule_Warriors.html",
       "text": "Hyrule Warriors"
      },
      {
       "href": "Hyrule_Warriors_Legends.html",
       "text": "Hyrule Warriors Legends"
      }
     ],
     "bolds": [
      "Badges"
     ],
     "relations": [
      {
       "subject": "Badges",
       "relation": "are",
       "attribute": "objects"
      }
     ],
     "details": [
      {
       "attribute": "objects",
       "relation": "in",
       "subject": "Hyrule Warriors"
      }
     ]
    }
   ]
  }
 },
 "Romani_Mask.html": {
  "gamepedia": {
   "name": "Romani's Mask",
   "paragraphs": [
    {
     "text": "The Romani's Mask is a mask in Majora's Mask.",
     "links": [
      {
       "href": "The_Legend_of_Zelda__Majora%27s_Mask.html",
       "text": "Majora's Mask"
      }
     ],
  

## Now... process all the files

In [11]:
extracted = get_rels_from_df(grouped)
with open("info/text_extraction.json", "w") as w:
    json.dump(extracted, w, indent=1,cls=ExtractedEncoder)