<a href="https://colab.research.google.com/github/antnewman/nlp-infoextract-notebook/blob/main/nlp_infoextract.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Information Extraction Notebook

**Information Extraction**

A great deal of information is contained within text data. Usually only some of it is relevant. Sometimes we want to extract the names of entities; at other times, the relationships between certain entities.

# Setup

Check the Python version and install the appropriate Miniconda.

In [None]:
!which python # should return /usr/local/bin/python
!python --version

In [None]:
!echo $PYTHONPATH # If /env/python, unset the path, because this directory doesn't seem to exist within the Google Colab file system

Unset the PYTHONPATH variable before installing Miniconda. Packages that are reachable through directories on PYTHONPATH can cause problems if they are not compatible with the version of Python included with Miniconda.

In [None]:
%env PYTHONPATH=
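To audit a PYTHONPATH value, each entry can be checked from Python to see whether it exists on disk. The following is a minimal sketch; the helper name `missing_pythonpath_dirs` is hypothetical and not part of the original notebook.

```python
import os


def missing_pythonpath_dirs(value):
    """Return the PYTHONPATH entries in `value` that are not existing directories."""
    # PYTHONPATH entries are separated by os.pathsep (":" on Linux, as in Colab).
    entries = [entry for entry in value.split(os.pathsep) if entry]
    return [entry for entry in entries if not os.path.isdir(entry)]


# Inspect the current value; an unset or empty variable yields an empty list.
print(missing_pythonpath_dirs(os.environ.get("PYTHONPATH", "")))
```

An empty list means that no entry points at a missing directory.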

**Installing Miniconda**

Download the installer script for the appropriate version of Miniconda and install it into /usr/local. 

Installing directly into /usr/local, rather than into the default location ~/miniconda3, ensures that Conda and all its required dependencies will be automatically available for use within Google Colab.

In [None]:
%%bash
MINICONDA_INSTALLER_SCRIPT=Miniconda3-4.5.4-Linux-x86_64.sh
MINICONDA_PREFIX=/usr/local
wget https://repo.continuum.io/miniconda/$MINICONDA_INSTALLER_SCRIPT
chmod +x $MINICONDA_INSTALLER_SCRIPT
./$MINICONDA_INSTALLER_SCRIPT -b -f -p $MINICONDA_PREFIX

Verify that:
- the conda executable is available
- the Conda version is correct
- the installation has not changed which python executable is used
- Miniconda has installed the expected version of Python

In [None]:
!which conda # should return /usr/local/bin/conda
!conda --version # should return 4.5.4
!which python # still returns /usr/local/bin/python
!python --version # now returns Python 3.6.5 :: Anaconda, Inc.
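The same checks can also be collected from Python rather than from shell commands. This is a sketch only; the function name `environment_report` is hypothetical. Note the assumption in the comments: the interpreter running the notebook kernel may differ from the `python` found on PATH, so both are reported.

```python
import shutil
import sys


def environment_report():
    """Collect the facts the checklist asks about, without shelling out."""
    return {
        # Full path of the conda executable, or None when it is not on PATH.
        "conda": shutil.which("conda"),
        # Full path of the python executable found first on PATH.
        "python": shutil.which("python"),
        # Version of the interpreter running this cell (the kernel), which
        # may not be the same interpreter as the one found on PATH.
        "kernel_python_version": ".".join(str(part) for part in sys.version_info[:3]),
    }


for name, value in environment_report().items():
    print(f"{name}: {value}")
```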

**Updating Conda**

We need to update Conda and its dependencies to their most recent versions without updating Python beyond 3.7.

In [None]:
%%bash

# Update Conda to the most recent version, but hold the Python version fixed at 3.7.
conda install --channel defaults conda python=3.7 --yes

# Update all of Conda's dependencies to their most recent versions.
conda update --channel defaults --all --yes

Check versions of conda and python.

In [None]:
!conda --version # now returns 4.9.2
!python --version # now returns Python 3.7.9 :: Anaconda, Inc.
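If the version check should fail loudly instead of being read by eye, the output of the commands above can be parsed. A minimal sketch, assuming version strings shaped like the ones shown in the comments above; `parse_version` and `python_is_pinned` are hypothetical helper names.

```python
import re


def parse_version(output):
    """Extract a (major, minor, patch) tuple from text such as 'conda 4.9.2'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version number found in {output!r}")
    return tuple(int(part) for part in match.groups())


def python_is_pinned(output, series=(3, 7)):
    """Return True when the reported Python version belongs to `series`."""
    return parse_version(output)[:2] == series


print(parse_version("conda 4.9.2"))
print(python_is_pinned("Python 3.7.9 :: Anaconda, Inc."))
```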

**Append to** *sys.path* 

We need to add the directory into which Conda installs packages to the list of directories that Python searches when looking for modules to import.

Check the current list of directories that Python will search by inspecting *sys.path*.

In [None]:
import sys
sys.path

Pre-installed packages are in dist-packages. Conda installed packages are in site-packages.

In [None]:
!ls /usr/local/lib/python3.7/dist-packages

In [None]:
import sys
# Append the Conda site-packages directory to the module search path.
sys.path.append("/usr/local/lib/python3.7/site-packages")

Note that the dist-packages directory containing the pre-installed Colab packages appears ahead of the site-packages directory where Conda installs packages. As a result, the version of a package available via Colab will take precedence over any version of the same package installed via Conda.
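The effect of search order can be reproduced in isolation. In the sketch below, two temporary directories stand in for dist-packages and site-packages, and the module name `precedence_demo` is hypothetical. Python imports the copy found in whichever directory appears first in *sys.path*.

```python
import importlib
import os
import sys
import tempfile

# Two throwaway directories stand in for dist-packages and site-packages.
first_dir = tempfile.mkdtemp(prefix="dist_packages_")
second_dir = tempfile.mkdtemp(prefix="site_packages_")

# Put a module with the same (hypothetical) name in both directories.
for directory, origin in [(first_dir, "dist-packages"), (second_dir, "site-packages")]:
    with open(os.path.join(directory, "precedence_demo.py"), "w") as handle:
        handle.write(f"ORIGIN = {origin!r}\n")

# Appending keeps the order: first_dir is searched before second_dir.
sys.path.append(first_dir)
sys.path.append(second_dir)
importlib.invalidate_caches()

import precedence_demo

# The copy from the directory listed first is the one that gets imported.
print(precedence_demo.ORIGIN)
```

If a Conda-installed version should win instead, one option is `sys.path.insert(0, ...)`, which places the directory at the front of the search order.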

**Installing packages**

Remember to include the --yes flag when installing packages to avoid getting prompted to confirm the package plan.

In [None]:
!conda install --channel conda-forge featuretools --yes
!conda install -c conda-forge py2neo --yes
!conda install -c conda-forge neuralcoref --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json):

**Import**

In [None]:
from py2neo import Graph

import spacy
import neuralcoref
import itertools
import json
from string import punctuation
import nltk

from flask import Flask, request
from urllib.request import urlopen


**Connect to Neo4J Sandbox**

In [None]:
# Change the line of code below to use your Bolt URL, and Password of your Sandbox.
# graph = Graph("<Bolt URL>", auth=("neo4j", "<Password>"))
graph = Graph("bolt://3.84.29.113:7687", auth=("neo4j", "distortions-capability-flower"))

In [None]:
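import pandas as pd

def run_query(query, params=None):
    """Run a Cypher query on the sandbox and return the result as a DataFrame."""
    # No `driver` is defined in this notebook, so use the py2neo `graph` created above.
    cursor = graph.run(query, params or {})
    return pd.DataFrame(cursor.data())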

**Set Entity Types**

IGNORE FOR NOW

In [None]:
# ENTITY_TYPES = ["human", "person", "company", "enterprise", "business", "geographic region",
#                "human settlement", "geographic entity", "territorial entity type", "organization"]

# Coreference Resolution

Here we want to find all expressions in a text that refer to the same entity. This is useful not only for our information extraction, but also for other higher-level NLP tasks that involve natural language understanding, such as document summarization and question answering.

In [None]:
# Load SpaCy
nlp = spacy.load('en')

# Add neuralcoref to SpaCy's pipe
neuralcoref.add_to_pipe(nlp)

def coref_resolution(text):
    """Execute coreference resolution on a given text."""
    doc = nlp(text)
    # Fetch tokens with their trailing whitespace from the spaCy document.
    tok_list = [token.text_with_ws for token in doc]
    for cluster in doc._.coref_clusters:
        # Words of the cluster's representative (main) mention.
        cluster_main_words = set(cluster.main.text.split(' '))
        for coref in cluster:
            # Skip the representative element of the cluster.
            if coref != cluster.main:
                # Replace only when the mention text differs from the main mention
                # and shares no words with it. This handles nested coreference scenarios.
                shares_words = set(coref.text.split(' ')).intersection(cluster_main_words)
                if coref.text != cluster.main.text and not shares_words:
                    tok_list[coref.start] = cluster.main.text + doc[coref.end - 1].whitespace_
                    for i in range(coref.start + 1, coref.end):
                        tok_list[i] = ""

    return "".join(tok_list)
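The token replacement step can be tried without spaCy or neuralcoref. The sketch below is an illustration only: it uses hand-made tokens and a hand-written cluster in place of `doc._.coref_clusters`, the name `replace_mentions` is hypothetical, and the word-overlap check used for nested coreferences is left out.

```python
def replace_mentions(tokens, main_span, mention_spans):
    """Replace each mention span with the main mention's text.

    `tokens` are strings that carry their trailing whitespace, and each span
    is a (start, end) pair of token indices with `end` exclusive.
    """
    resolved = list(tokens)
    main_text = "".join(tokens[main_span[0]:main_span[1]]).strip()
    for start, end in mention_spans:
        last = tokens[end - 1]
        # Keep whatever whitespace followed the mention's last token.
        trailing_ws = last[len(last.rstrip()):]
        resolved[start] = main_text + trailing_ws
        # Blank out the remaining tokens of the mention.
        for i in range(start + 1, end):
            resolved[i] = ""
    return "".join(resolved)


tokens = ["Elizabeth ", "was ", "born ", "in ", "Mayfair", ". ",
          "She ", "was ", "educated ", "at ", "home", "."]
# Hand-written cluster: "Elizabeth" (token 0) is the main mention and
# "She" (token 6) refers back to it.
print(replace_mentions(tokens, (0, 1), [(6, 7)]))
# prints "Elizabeth was born in Mayfair. Elizabeth was educated at home."
```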

Build the information extraction pipeline.

In [None]:
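def iep(text):
    # Placeholder: the body of the information extraction pipeline has not been written yet.
    raise NotImplementedError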

In [None]:
sample = """Elizabeth was born in Mayfair, London, as the first child of the Duke and Duchess of York (later King George VI and Queen Elizabeth). Her father ascended the throne on the abdication of his brother King Edward VIII in 1936, from which time she was the heir presumptive. She was educated privately at home and began to undertake public duties during the Second World War, serving in the Auxiliary Territorial Service. In 1947 she married Philip, Duke of Edinburgh, a former prince of Greece and Denmark, with whom she has four children: Charles, Prince of Wales; Anne, Princess Royal; Prince Andrew, Duke of York; and Prince Edward, Earl of Wessex.

When her father died in February 1952, Elizabeth became head of the Commonwealth and queen regnant of seven independent Commonwealth countries: the United Kingdom, Canada, Australia, New Zealand, South Africa, Pakistan, and Ceylon. She has reigned as a constitutional monarch through major political changes, such as devolution in the United Kingdom, accession of the United Kingdom to the European Communities, Brexit, Canadian patriation, and the decolonisation of Africa. Between 1956 and 1992, the number of her realms varied as territories gained independence, and as realms, including South Africa, Pakistan, and Ceylon (renamed Sri Lanka), became republics. Her many historic visits and meetings include a state visit to the Republic of Ireland and visits to or from five popes. Significant events have included her coronation in 1953 and the celebrations of her Silver, Golden, and Diamond Jubilees in 1977, 2002, and 2012, respectively. In 2017, she became the first British monarch to reach a Sapphire Jubilee. She is the longest-lived and longest-reigning British monarch. She is the longest-serving female head of state in world history, and the world's oldest living monarch, longest-reigning current monarch, and oldest and longest-serving current head of state.

Elizabeth has occasionally faced republican sentiments and press criticism of the royal family, in particular after the breakdown of her children's marriages, her annus horribilis in 1992, and the death in 1997 of her former daughter-in-law Diana, Princess of Wales. However, support for the monarchy in the United Kingdom has been and remains consistently high, as does her personal popularity."""

In [None]:
print(sample)

Elizabeth was born in Mayfair, London, as the first child of the Duke and Duchess of York (later King George VI and Queen Elizabeth). Her father ascended the throne on the abdication of his brother King Edward VIII in 1936, from which time she was the heir presumptive. She was educated privately at home and began to undertake public duties during the Second World War, serving in the Auxiliary Territorial Service. In 1947 she married Philip, Duke of Edinburgh, a former prince of Greece and Denmark, with whom she has four children: Charles, Prince of Wales; Anne, Princess Royal; Prince Andrew, Duke of York; and Prince Edward, Earl of Wessex.

When her father died in February 1952, Elizabeth became head of the Commonwealth and queen regnant of seven independent Commonwealth countries: the United Kingdom, Canada, Australia, New Zealand, South Africa, Pakistan, and Ceylon. She has reigned as a constitutional monarch through major political changes, such as devolution in the United Kingdom, 