<a href="https://colab.research.google.com/github/antnewman/nlp-infoextract-notebook/blob/main/nlp_infoextract.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Information Extraction Notebook

**Information Extraction**

Text data contains a plethora of information, some of it relevant and some irrelevant. Sometimes one might want to extract the names of entities; at other times, the relationships between certain entities.

# Setup

Check the Python version and install the appropriate Miniconda.

In [1]:
!which python # should return /usr/local/bin/python
!python --version

/usr/local/bin/python
Python 3.6.5 :: Anaconda, Inc.


In [2]:
!echo $PYTHONPATH # If this prints /env/python, unset the variable, because that directory doesn't seem to exist within the Google Colab file system

/env/python


Unset the PYTHONPATH variable before installing Miniconda. If the directories listed in PYTHONPATH contain packages that are incompatible with the version of Python included with Miniconda, they can cause problems.

In [3]:
%env PYTHONPATH=

env: PYTHONPATH=
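
The same check can be sketched in plain Python, which may be useful outside a notebook where the `%env` magic is unavailable. This is a minimal sketch, not part of the original notebook: the helper name `without_pythonpath` is hypothetical, and it removes the variable from a copy of the environment rather than setting it to an empty string as `%env PYTHONPATH=` does.

```python
import os


def without_pythonpath(environ):
    """Return a copy of an environment mapping with PYTHONPATH removed."""
    cleaned = dict(environ)
    cleaned.pop("PYTHONPATH", None)  # no error if the variable is absent
    return cleaned


# Inspect the current value; None means the variable is not set.
print(os.environ.get("PYTHONPATH"))

# The cleaned copy could be passed to subprocess.run(..., env=cleaned_env)
# so that an installer runs without inheriting PYTHONPATH.
cleaned_env = without_pythonpath(os.environ)
print("PYTHONPATH" in cleaned_env)
```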


**Installing Miniconda**

Download the installer script for the appropriate version of Miniconda and install it into /usr/local. 

Installing directly into /usr/local, rather than into the default location ~/miniconda3, ensures that Conda and all its required dependencies are automatically available for use within Google Colab.

In [4]:
%%bash
MINICONDA_INSTALLER_SCRIPT=Miniconda3-4.5.4-Linux-x86_64.sh
MINICONDA_PREFIX=/usr/local
wget https://repo.continuum.io/miniconda/$MINICONDA_INSTALLER_SCRIPT
chmod +x $MINICONDA_INSTALLER_SCRIPT
./$MINICONDA_INSTALLER_SCRIPT -b -f -p $MINICONDA_PREFIX

PREFIX=/usr/local
installing: python-3.6.5-hc3d631a_2 ...
installing: ca-certificates-2018.03.07-0 ...
installing: conda-env-2.6.0-h36134e3_1 ...
installing: libgcc-ng-7.2.0-hdf63c60_3 ...
installing: libstdcxx-ng-7.2.0-hdf63c60_3 ...
installing: libffi-3.2.1-hd88cf55_4 ...
installing: ncurses-6.1-hf484d3e_0 ...
installing: openssl-1.0.2o-h20670df_0 ...
installing: tk-8.6.7-hc745277_3 ...
installing: xz-5.2.4-h14c3975_4 ...
installing: yaml-0.1.7-had09818_2 ...
installing: zlib-1.2.11-ha838bed_2 ...
installing: libedit-3.1.20170329-h6b74fdf_2 ...
installing: readline-7.0-ha6073c6_4 ...
installing: sqlite-3.23.1-he433501_0 ...
installing: asn1crypto-0.24.0-py36_0 ...
installing: certifi-2018.4.16-py36_0 ...
installing: chardet-3.0.4-py36h0f667ec_1 ...
installing: idna-2.6-py36h82fb2a8_1 ...
installing: pycosat-0.6.3-py36h0a5515d_0 ...
installing: pycparser-2.18-py36hf9f622e_1 ...
installing: pysocks-1.6.8-py36_0 ...
installing: ruamel_yaml-0.15.37-py36h14c3975_2 ...
installing: six-1.11

--2021-02-27 13:01:14--  https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
Resolving repo.continuum.io (repo.continuum.io)... 104.18.201.79, 104.18.200.79, 2606:4700::6812:c94f, ...
Connecting to repo.continuum.io (repo.continuum.io)|104.18.201.79|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://repo.anaconda.com/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh [following]
--2021-02-27 13:01:14--  https://repo.anaconda.com/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.131.3, 104.16.130.3, 2606:4700::6810:8303, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.131.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58468498 (56M) [application/x-sh]
Saving to: ‘Miniconda3-4.5.4-Linux-x86_64.sh.2’

     0K .......... .......... .......... .......... ..........  0% 3.04M 18s
    50K .......... .......... .......... .......... .......... 

Verify that:
- the conda executable is available
- the conda version is correct
- the installation has not changed which python executable is used
- the version of Python installed by Miniconda is the expected one

In [5]:
!which conda # should return /usr/local/bin/conda
!conda --version # should return 4.5.4
!which python # still returns /usr/local/bin/python
!python --version # now returns Python 3.6.5 :: Anaconda, Inc.

/usr/local/bin/conda
conda 4.5.4
/usr/local/bin/python
Python 3.6.5 :: Anaconda, Inc.
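
The `which` checks above can also be done from Python with the standard library. The following is a small sketch under that assumption; the helper name `locate_tools` is illustrative and not part of the notebook.

```python
import shutil


def locate_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in locate_tools(["conda", "python"]).items():
    print(name, "->", path if path else "not found on PATH")
```

On a Colab runtime set up as described, both entries would be expected to resolve under /usr/local/bin.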


**Updating Conda**

We need to update Conda and its dependencies to their most recent versions without updating Python beyond 3.6.

In [6]:
%%bash

# Updates Conda to the most recent version, but holds the Python version to <3.7
conda install --channel defaults conda python=3.6 --yes

# Updates all of Conda’s dependencies to their most recent versions.
conda update --channel defaults --all --yes 

Solving environment: ...working... 
  - defaults::packaging-20.9-pyhd3eb1b0_0.conda
  - defaults::packaging-20.9-pyhd3eb1b0_0done

## Package Plan ##

  environment location: /usr/local

  added / updated specs: 
    - conda
    - python=3.6


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    cryptography-1.8.1         |           py36_0         846 KB
    yaml-0.1.6                 |                0         246 KB
    libffi-3.2.1               |                1          38 KB
    readline-6.2               |                2         606 KB
    asn1crypto-1.4.0           |             py_0          77 KB
    pycosat-0.6.2              |           py36_0         197 KB
    cffi-1.10.0                |           py36_0         341 KB
    openssl-1.0.2l             |                0         3.2 MB
    tk-8.5.18                  |                0         1.9 MB
    pyopenssl-17.0.0       



  current version: 4.5.4
  latest version: 4.9.2

Please update conda by running

    $ conda update -n base conda



Check versions of conda and python.

In [7]:
!conda --version # now returns 4.5.2
!python --version # now returns Python 3.6.x :: Anaconda, Inc.

conda 4.5.2
Python 3.6.2 :: Continuum Analytics, Inc.
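
To confirm programmatically that the update kept Python on the 3.6 series, the version line printed above can be parsed. This is an illustrative sketch; the function names are hypothetical, and the regular expression assumes the `Python X.Y.Z` prefix shown in the output.

```python
import re


def parse_python_version(version_line):
    """Extract (major, minor, patch) from a line such as 'Python 3.6.2 :: ...'."""
    match = re.search(r"Python (\d+)\.(\d+)\.(\d+)", version_line)
    if match is None:
        raise ValueError(f"no Python version found in {version_line!r}")
    return tuple(int(part) for part in match.groups())


def is_pinned_to_36(version_line):
    """True when the reported interpreter is still in the 3.6 series."""
    return parse_python_version(version_line)[:2] == (3, 6)


print(is_pinned_to_36("Python 3.6.2 :: Continuum Analytics, Inc."))  # True
```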


**Append to** *sys.path* 

We need to add the directory into which Conda installs packages to the list of directories that Python searches when looking for modules to import.

Check the current list of directories that Python searches by inspecting *sys.path*.

In [8]:
import sys
sys.path

['',
 '/content',
 '/env/python',
 '/usr/lib/python37.zip',
 '/usr/lib/python3.7',
 '/usr/lib/python3.7/lib-dynload',
 '/usr/local/lib/python3.7/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/usr/local/lib/python3.7/dist-packages/IPython/extensions',
 '/root/.ipython']

Pre-installed packages are in dist-packages. Conda installed packages are in site-packages.

In [10]:
!ls /usr/local/lib/python3.7/dist-packages

absl
absl_py-0.10.0.dist-info
alabaster
alabaster-0.7.12.dist-info
albumentations
albumentations-0.1.12.dist-info
altair
altair-4.1.0.dist-info
apiclient
appdirs-1.4.4.dist-info
appdirs.py
apt
apt_inst.cpython-37m-x86_64-linux-gnu.so
apt_inst.pyi
apt_pkg.cpython-37m-x86_64-linux-gnu.so
apt_pkg.pyi
aptsources
argon2
argon2_cffi-20.1.0.dist-info
asgiref
asgiref-3.3.1.dist-info
astor
astor-0.8.1.dist-info
astropy
astropy-4.2.dist-info
astunparse
astunparse-1.6.3.dist-info
async_generator
async_generator-1.10.dist-info
atari_py
atari_py-0.2.6.dist-info
atomicwrites
atomicwrites-1.4.0.dist-info
attr
attrs-20.3.0.dist-info
audioread
audioread-2.1.9.dist-info
autograd
autograd-1.3.dist-info
babel
Babel-2.9.0.dist-info
backcall
backcall-0.2.0.dist-info
beautifulsoup4-4.6.3.dist-info
bin
bleach
bleach-3.3.0.dist-info
blis
blis-0.4.1.dist-info
bokeh
bokeh-2.1.1.dist-info
bottleneck
Bottleneck-1.3.2.dist-info
branca
branca-0.4.2.dist-info
bs4
bs4-0.0.1.dist-info
bson
cachecontrol
CacheControl-0.1

In [11]:
import sys
sys.path.append("/usr/local/lib/python3.7/site-packages")

Note that the dist-packages directory containing the pre-installed Colab packages appears ahead of the site-packages directory where Conda installs packages. As a result, the version of a package available via Colab takes precedence over any version of the same package installed via Conda.
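
The precedence rule can be demonstrated in isolation. The sketch below is not part of the original notebook: it creates two temporary directories that each contain a module with the same (hypothetical) name `precedence_demo`, appends both to `sys.path`, and reports which copy Python imports. The directory that comes first in `sys.path` wins, just as dist-packages wins over site-packages here.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def first_match_wins(module_name="precedence_demo"):
    """Import a module that exists in two sys.path entries and report which copy loaded."""
    with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
        # Same module name in both directories, with different contents.
        Path(first, f"{module_name}.py").write_text("SOURCE = 'first'\n")
        Path(second, f"{module_name}.py").write_text("SOURCE = 'second'\n")
        original_path = list(sys.path)
        try:
            sys.path.extend([first, second])  # 'first' is searched before 'second'
            importlib.invalidate_caches()
            module = importlib.import_module(module_name)
            return module.SOURCE
        finally:
            # Restore the interpreter state so the demo leaves no trace.
            sys.path[:] = original_path
            sys.modules.pop(module_name, None)


print(first_match_wins())  # first
```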

**Installing packages**

When installing packages with conda, remember to include the --yes flag to avoid being prompted to confirm the package plan. The packages below are installed with pip, which does not ask for confirmation.

In [22]:
!pip install urllib3
!pip install py2neo 
!pip install neuralcoref

Collecting urllib3<1.27,>=1.21.1
  Downloading urllib3-1.26.3-py2.py3-none-any.whl (137 kB)
     |████████████████████████████████| 137 kB 7.7 MB/s 
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.22
    Uninstalling urllib3-1.22:
      Successfully uninstalled urllib3-1.22
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
py2neo 4.1.3 requires urllib3[secure]<1.23,>=1.21.1, but you have urllib3 1.26.3 which is incompatible.
Successfully installed urllib3-1.26.3
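
The conflict reported above comes down to a simple range check: py2neo 4.1.3 accepts urllib3 versions from 1.21.1 up to, but not including, 1.23. The sketch below illustrates that arithmetic with plain tuples. It is a deliberate simplification with hypothetical helper names, it handles only dotted numeric versions, and it is not how pip itself evaluates version specifiers.

```python
def as_tuple(version):
    """Turn a dotted numeric version such as '1.26.3' into (1, 26, 3)."""
    return tuple(int(part) for part in version.split("."))


def within_range(version, minimum, below):
    """True when minimum <= version < below, compared component by component."""
    return as_tuple(minimum) <= as_tuple(version) < as_tuple(below)


# The newly installed urllib3 falls outside the range py2neo 4.1.3 asks for.
print(within_range("1.26.3", "1.21.1", "1.23"))  # False
# The version that was uninstalled satisfied it.
print(within_range("1.22", "1.21.1", "1.23"))  # True
```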


**Import**

In [23]:
from google.colab import drive

# Standard library
import itertools
import json
import urllib.parse
import urllib.request
from string import punctuation
from urllib.request import urlopen

# Third-party packages
import neuralcoref
import nltk
import pandas as pd
import spacy
from flask import Flask, request
from neo4j import GraphDatabase

ModuleNotFoundError: ignored

Mount Google Drive.

In [None]:
drive.mount('/content/drive')

# Connect to Neo4J Sandbox

In [None]:
# Graph is provided by py2neo, which was installed above.
from py2neo import Graph

# Replace the Bolt URL and password below with those of your own Sandbox:
# graph = Graph("<Bolt URL>", auth=("neo4j", "<Password>"))
graph = Graph("bolt://34.227.162.188:7687", auth=("neo4j", "library-triangles-comparisons"))

# Load the BBC Data Set 

**Context**

News article datasets, originating from BBC News, provided for use as benchmarks for machine learning research. The original data is processed to form a single CSV file for ease of use; the news title and the related text file name are preserved along with the news content and its category. This dataset is made available for non-commercial and research purposes only.

All rights, including copyright, in the content of the original articles are owned by the BBC.

**Content**

Consists of 2225 documents from the BBC news website corresponding to stories in five topical areas from 2004-2005.
Class Labels: 5 (business, entertainment, politics, sport, tech)

In [None]:
url = ''
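
Once the CSV is available, a quick sanity check is to count the articles per category and confirm that only the five expected labels appear. The sketch below uses the standard library only. The column names (`category`, `filename`, `title`, `content`) are an assumption based on the description above, and the two sample rows are placeholders written for this illustration, not real articles from the data set.

```python
import csv
import io
from collections import Counter

# Placeholder rows for illustration only; the real file has 2225 articles.
SAMPLE_CSV = """category,filename,title,content
business,001.txt,Example business headline,Example business article text.
sport,002.txt,Example sport headline,Example sport article text.
"""

EXPECTED_CATEGORIES = {"business", "entertainment", "politics", "sport", "tech"}


def category_counts(csv_file):
    """Count rows per category, rejecting any label outside the five expected classes."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        category = row["category"]
        if category not in EXPECTED_CATEGORIES:
            raise ValueError(f"unexpected category: {category!r}")
        counts[category] += 1
    return counts


print(category_counts(io.StringIO(SAMPLE_CSV)))
```

With the real file, the same function could be given an open file handle instead of the `io.StringIO` wrapper.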


# IGNORE AFTER HERE FOR NOW 

In [None]:
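import json
# Import the submodules explicitly: a bare "import urllib" does not
# guarantee that urllib.parse and urllib.request are loaded.
import urllib.parse
import urllib.request

import pandas as pd
from neo4j import GraphDatabase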

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'letmein'))

def ie_pipeline(text, relation_threshold=0.9, entities_threshold=0.8):
    # Prepare the URL.
    data = urllib.parse.urlencode([
        ("text", text), ("relation_threshold", relation_threshold),
        ("entities_threshold", entities_threshold)])
    
    url = "http://localhost:5000?" + data
    req = urllib.request.Request(url, data=data.encode("utf8"), method="GET")
    with urllib.request.urlopen(req, timeout=150) as f:
        response = f.read()
        response = json.loads(response.decode("utf8"))
    # Output the annotations.
    return response

import_refactored_query = """
UNWIND $params as value
CREATE (a:Article{content:value.content})
FOREACH (rel in value.ie.relations | 
  MERGE (s:Entity{name:rel.source})
  MERGE (t:Entity{name:rel.target})
  MERGE (s)-[:RELATION]->(r:Relation{type:rel.type})-[:RELATION]->(t)
  MERGE (a)-[:MENTIONS_REL]->(r))
WITH value, a
UNWIND value.ie.entities as entity
MERGE (e:Entity{name:entity.title})
SET e.wikiId = entity.wikiId
MERGE (a)-[:MENTIONS_ENT]->(e)
WITH entity, e
CALL apoc.create.addLabels(e,[entity.label]) YIELD node
RETURN distinct 'done'
"""

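# NOTE: `data` is assumed to be a pandas DataFrame with a 'content' column,
# loaded in an earlier cell.
with driver.session() as session:
    params = []
    for i, article in data.head(500).iterrows():
        content = article['content']
        ie_data = ie_pipeline(content)
        params.append({'content': content, 'ie': ie_data})

        # Send the accumulated articles to Neo4j in batches of 100.
        if len(params) % 100 == 0:
            session.run(import_refactored_query, {'params': params})
            params = []

    # Import whatever is left over from the last partial batch.
    if params:
        session.run(import_refactored_query, {'params': params})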
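
The import loop above accumulates articles and sends them to the database 100 at a time, with one final call for the remainder. That batching pattern can be isolated into a small generator, sketched here with a hypothetical helper name `batches` and no database dependency.

```python
def batches(items, size):
    """Yield consecutive lists of at most `size` items, including the final partial one."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the remainder, if any
        yield batch


print([len(b) for b in batches(range(250), 100)])  # [100, 100, 50]
```

Each yielded list could then be passed as the `params` argument of a single `session.run` call.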

**Set Entity Types**

IGNORE FOR NOW

In [None]:
# ENTITY_TYPES = ["human", "person", "company", "enterprise", "business", "geographic region",
#                "human settlement", "geographic entity", "territorial entity type", "organization"]

# Coreference Resolution

Here we want to find all expressions that refer to the same entity in a text. This is useful not only for our information extraction, but also for other higher-level NLP tasks that involve natural language understanding, such as document summarization and question answering.

In [None]:
# Load SpaCy
nlp = spacy.load('en')

# Add neuralcoref to SpaCy's pipe
neuralcoref.add_to_pipe(nlp)

def coref_resolution(text):
    """Function that executes coreference resolution on a given text"""
    doc = nlp(text)
    # fetches tokens with whitespaces from spacy document
    tok_list = list(token.text_with_ws for token in doc)
    for cluster in doc._.coref_clusters:
        # get tokens from representative cluster name
        cluster_main_words = set(cluster.main.text.split(' '))
        for coref in cluster:
            # Skip the representative mention of the cluster itself.
            if coref != cluster.main:
                # Replace a mention only when its text differs from the representative
                # text and shares no words with it; this handles nested coreferences.
                if coref.text != cluster.main.text and not set(coref.text.split(' ')).intersection(cluster_main_words):
                    tok_list[coref.start] = cluster.main.text + \
                        doc[coref.end-1].whitespace_
                    for i in range(coref.start+1, coref.end):
                        tok_list[i] = ""

    return "".join(tok_list)
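
The replacement step in `coref_resolution` can be illustrated without loading spaCy or neuralcoref. In the sketch below, the token list and the mention spans are written by hand rather than produced by a model, and the helper name `replace_mentions` is hypothetical. It mirrors the logic above: each mention is replaced by the representative text, the trailing whitespace of the mention's last token is preserved, and mentions that share a word with the representative text are left untouched.

```python
def replace_mentions(tokens_with_ws, main_text, mentions):
    """Replace token spans that refer to `main_text` with `main_text` itself.

    tokens_with_ws: token strings, each including its trailing whitespace.
    mentions: (start, end) token index pairs, end exclusive.
    """
    out = list(tokens_with_ws)
    main_words = set(main_text.split(" "))
    for start, end in mentions:
        mention_text = "".join(tokens_with_ws[start:end]).strip()
        # Skip mentions that overlap with the representative text (nested case).
        if mention_text == main_text or set(mention_text.split(" ")) & main_words:
            continue
        last = tokens_with_ws[end - 1]
        trailing_ws = last[len(last.rstrip()):]
        out[start] = main_text + trailing_ws
        for i in range(start + 1, end):
            out[i] = ""
    return "".join(out)


tokens = ["Elizabeth ", "married ", "Philip", ". ", "She ", "has ", "four ", "children", "."]
print(replace_mentions(tokens, "Elizabeth", [(4, 5)]))
# Elizabeth married Philip. Elizabeth has four children.
```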

Make the info extraction pipeline

In [None]:
def iep

In [None]:
sample = """Elizabeth was born in Mayfair, London, as the first child of the Duke and Duchess of York (later King George VI and Queen Elizabeth). Her father ascended the throne on the abdication of his brother King Edward VIII in 1936, from which time she was the heir presumptive. She was educated privately at home and began to undertake public duties during the Second World War, serving in the Auxiliary Territorial Service. In 1947 she married Philip, Duke of Edinburgh, a former prince of Greece and Denmark, with whom she has four children: Charles, Prince of Wales; Anne, Princess Royal; Prince Andrew, Duke of York; and Prince Edward, Earl of Wessex.

When her father died in February 1952, Elizabeth became head of the Commonwealth and queen regnant of seven independent Commonwealth countries: the United Kingdom, Canada, Australia, New Zealand, South Africa, Pakistan, and Ceylon. She has reigned as a constitutional monarch through major political changes, such as devolution in the United Kingdom, accession of the United Kingdom to the European Communities, Brexit, Canadian patriation, and the decolonisation of Africa. Between 1956 and 1992, the number of her realms varied as territories gained independence, and as realms, including South Africa, Pakistan, and Ceylon (renamed Sri Lanka), became republics. Her many historic visits and meetings include a state visit to the Republic of Ireland and visits to or from five popes. Significant events have included her coronation in 1953 and the celebrations of her Silver, Golden, and Diamond Jubilees in 1977, 2002, and 2012, respectively. In 2017, she became the first British monarch to reach a Sapphire Jubilee. She is the longest-lived and longest-reigning British monarch. She is the longest-serving female head of state in world history, and the world's oldest living monarch, longest-reigning current monarch, and oldest and longest-serving current head of state.

Elizabeth has occasionally faced republican sentiments and press criticism of the royal family, in particular after the breakdown of her children's marriages, her annus horribilis in 1992, and the death in 1997 of her former daughter-in-law Diana, Princess of Wales. However, support for the monarchy in the United Kingdom has been and remains consistently high, as does her personal popularity."""

In [None]:
print(sample)