<a href="https://colab.research.google.com/github/antnewman/nlp-infoextract-notebook/blob/main/nlp_infoextract.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Information Extraction Notebook

**Information Extraction**

There is a plethora of information contained within text data. Usually, some of it is relevant and some is not. Sometimes one wants to extract the names of entities; at other times, the relationships between those entities.

# Setup

Check the Python version and install the appropriate Miniconda.

In [1]:
!which python # should return /usr/local/bin/python
!python --version

/usr/local/bin/python
Python 3.7.10


In [2]:
!echo $PYTHONPATH # If /env/python, unset the path, because this directory doesn't seem to exist within the Google Colab file system

/env/python


Unset the PYTHONPATH variable before installing Miniconda. Packages reachable through directories listed in PYTHONPATH can cause problems if they are not compatible with the version of Python included with Miniconda.

In [3]:
%env PYTHONPATH=

env: PYTHONPATH=
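The same step can be done from Python itself. The sketch below uses a hypothetical helper, `clear_pythonpath` (not part of the notebook), that removes the variable from an environment mapping and reports what it held. Note that `%env PYTHONPATH=` appears to set the variable to an empty string, whereas this helper removes it entirely; either way, processes started afterwards should see no extra search directories.

```python
import os

def clear_pythonpath(environ):
    """Remove PYTHONPATH from `environ`; return its old value, or None if it was unset."""
    return environ.pop("PYTHONPATH", None)

# Applied to the real environment, this affects processes launched afterwards
# (for example shell commands run with `!`), not the already-running kernel.
previous = clear_pythonpath(os.environ)
print("PYTHONPATH was:", previous)
```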


**Installing Miniconda**

Download the installer script for the appropriate version of Miniconda and install it into /usr/local. 

Installing directly into /usr/local, rather than into the default location ~/miniconda3, ensures that Conda and all its required dependencies will be automatically available for use within Google Colab.

In [4]:
%%bash
MINICONDA_INSTALLER_SCRIPT=Miniconda3-4.5.4-Linux-x86_64.sh
MINICONDA_PREFIX=/usr/local
wget https://repo.continuum.io/miniconda/$MINICONDA_INSTALLER_SCRIPT
chmod +x $MINICONDA_INSTALLER_SCRIPT
./$MINICONDA_INSTALLER_SCRIPT -b -f -p $MINICONDA_PREFIX

PREFIX=/usr/local
installing: python-3.6.5-hc3d631a_2 ...
installing: ca-certificates-2018.03.07-0 ...
installing: conda-env-2.6.0-h36134e3_1 ...
installing: libgcc-ng-7.2.0-hdf63c60_3 ...
installing: libstdcxx-ng-7.2.0-hdf63c60_3 ...
installing: libffi-3.2.1-hd88cf55_4 ...
installing: ncurses-6.1-hf484d3e_0 ...
installing: openssl-1.0.2o-h20670df_0 ...
installing: tk-8.6.7-hc745277_3 ...
installing: xz-5.2.4-h14c3975_4 ...
installing: yaml-0.1.7-had09818_2 ...
installing: zlib-1.2.11-ha838bed_2 ...
installing: libedit-3.1.20170329-h6b74fdf_2 ...
installing: readline-7.0-ha6073c6_4 ...
installing: sqlite-3.23.1-he433501_0 ...
installing: asn1crypto-0.24.0-py36_0 ...
installing: certifi-2018.4.16-py36_0 ...
installing: chardet-3.0.4-py36h0f667ec_1 ...
installing: idna-2.6-py36h82fb2a8_1 ...
installing: pycosat-0.6.3-py36h0a5515d_0 ...
installing: pycparser-2.18-py36hf9f622e_1 ...
installing: pysocks-1.6.8-py36_0 ...
installing: ruamel_yaml-0.15.37-py36h14c3975_2 ...
installing: six-1.11

--2021-02-24 17:21:33--  https://repo.continuum.io/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
Resolving repo.continuum.io (repo.continuum.io)... 104.18.200.79, 104.18.201.79, 2606:4700::6812:c84f, ...
Connecting to repo.continuum.io (repo.continuum.io)|104.18.200.79|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://repo.anaconda.com/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh [following]
--2021-02-24 17:21:33--  https://repo.anaconda.com/miniconda/Miniconda3-4.5.4-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.130.3, 104.16.131.3, 2606:4700::6810:8203, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.130.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58468498 (56M) [application/x-sh]
Saving to: ‘Miniconda3-4.5.4-Linux-x86_64.sh’

     0K .......... .......... .......... .......... ..........  0% 18.7M 3s
    50K .......... .......... .......... .......... ..........  0%
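For reference, the installer invocation above can be assembled in Python. This is only a sketch: `installer_name`, `installer_url` and `installer_command` are hypothetical helpers, and nothing is downloaded or executed here. It uses the `repo.anaconda.com` address that the download log shows the `repo.continuum.io` URL redirecting to, and annotates the three installer flags.

```python
MINICONDA_VERSION = "4.5.4"      # version used in this notebook
MINICONDA_PREFIX = "/usr/local"  # install location used in this notebook

def installer_name(version):
    """File name of the Linux x86_64 Miniconda3 installer for `version`."""
    return f"Miniconda3-{version}-Linux-x86_64.sh"

def installer_url(version):
    """Download URL on repo.anaconda.com (the redirect target seen in the log)."""
    return f"https://repo.anaconda.com/miniconda/{installer_name(version)}"

def installer_command(version, prefix):
    """Argument list equivalent to the shell invocation (run via bash, so no chmod is needed)."""
    return [
        "bash", installer_name(version),
        "-b",          # batch mode: no interactive prompts
        "-f",          # do not fail if the prefix directory already exists
        "-p", prefix,  # install prefix
    ]

print(installer_url(MINICONDA_VERSION))
print(" ".join(installer_command(MINICONDA_VERSION, MINICONDA_PREFIX)))
```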

Verify that:
- the conda executable is available
- the Conda version is correct
- installing Miniconda has not changed the location of the python executable
- the version of Python installed by Miniconda is the expected one

In [5]:
!which conda # should return /usr/local/bin/conda
!conda --version # should return 4.5.4
!which python # still returns /usr/local/bin/python
!python --version # now returns Python 3.6.5 :: Anaconda, Inc.

/usr/local/bin/conda
conda 4.5.4
/usr/local/bin/python
Python 3.6.5 :: Anaconda, Inc.
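The checks above can also be scripted. The following is a minimal sketch, assuming a hypothetical helper `describe_tool` (not part of the notebook) that looks up an executable on PATH and asks it for its version, returning `None` values when the executable is missing.

```python
import shutil
import subprocess

def describe_tool(name):
    """Return {'path': ..., 'version': ...} for an executable found on PATH.

    Both values are None when the executable cannot be found.
    """
    path = shutil.which(name)
    if path is None:
        return {"path": None, "version": None}
    # Some tools print their version to stderr, so merge it into stdout.
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return {"path": path, "version": result.stdout.strip()}

for tool in ("conda", "python"):
    print(tool, describe_tool(tool))
```

The printed paths and version strings can then be compared against the values expected in the cell above.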


**Updating Conda**

We need to update Conda and its dependencies to their most recent versions without updating Python beyond 3.7.

In [6]:
%%bash

# Updates Conda to the most recent version, but holds the Python version fixed at 3.7
conda install --channel defaults conda python=3.7 --yes

# Updates all of Conda’s dependencies to their most recent versions.
conda update --channel defaults --all --yes 

Solving environment: ...working... done

## Package Plan ##

  environment location: /usr/local

  added / updated specs: 
    - conda
    - python=3.7


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.1.1j             |       h27cfd23_0         3.8 MB
    libstdcxx-ng-9.1.0         |       hdf63c60_0         4.0 MB
    cffi-1.14.5                |   py37h261ae71_0         224 KB
    setuptools-52.0.0          |   py37h06a4308_0         921 KB
    tqdm-4.56.0                |     pyhd3eb1b0_0          76 KB
    libffi-3.3                 |       he6710b0_2          54 KB
    idna-2.10                  |     pyhd3eb1b0_0          52 KB
    xz-5.2.5                   |       h7b6447c_0         438 KB
    six-1.15.0                 |     pyhd3eb1b0_0          13 KB
    conda-4.9.2                |   py37h06a4308_0         3.1 MB
    yaml-0.2.5                 |       h7b6447c_0   


Check the versions of Conda and Python.

In [7]:
!conda --version # now returns 4.9.2
!python --version # now returns Python 3.7.9

conda 4.9.2
Python 3.7.9
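One likely reason for holding Python at 3.7 (an inference, not something the notebook states) is that the Colab kernel itself runs Python 3.7, as the `/usr/lib/python3.7` entries in `sys.path` in the next section suggest, and packages built for a different minor version generally cannot be imported by it. The sketch below uses a hypothetical helper, `conda_site_packages`, to derive the Conda package directory that matches a given interpreter version.

```python
import sys

def conda_site_packages(prefix, version_info):
    """Build the site-packages path under `prefix` for a (major, minor, ...) version."""
    major, minor = version_info[0], version_info[1]
    return f"{prefix}/lib/python{major}.{minor}/site-packages"

# For the running interpreter; on the Colab kernel used in this notebook
# this is expected to point at the python3.7 directory.
print(conda_site_packages("/usr/local", sys.version_info))
```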


**Append to** *sys.path* 

We need to add the directory into which Conda installs packages to the list of directories that Python searches when looking for modules to import.

Check the current list of directories that Python will search by inspecting *sys.path*.

In [8]:
import sys
sys.path

['',
 '/content',
 '/env/python',
 '/usr/lib/python37.zip',
 '/usr/lib/python3.7',
 '/usr/lib/python3.7/lib-dynload',
 '/usr/local/lib/python3.7/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/usr/local/lib/python3.7/dist-packages/IPython/extensions',
 '/root/.ipython']

Pre-installed packages are in dist-packages. Conda-installed packages are in site-packages.

In [9]:
!ls /usr/local/lib/python3.7/dist-packages

absl
absl_py-0.10.0.dist-info
alabaster
alabaster-0.7.12.dist-info
albumentations
albumentations-0.1.12.dist-info
altair
altair-4.1.0.dist-info
apiclient
appdirs-1.4.4.dist-info
appdirs.py
apt
apt_inst.cpython-37m-x86_64-linux-gnu.so
apt_inst.pyi
apt_pkg.cpython-37m-x86_64-linux-gnu.so
apt_pkg.pyi
aptsources
argon2
argon2_cffi-20.1.0.dist-info
asgiref
asgiref-3.3.1.dist-info
astor
astor-0.8.1.dist-info
astropy
astropy-4.1.dist-info
astunparse
astunparse-1.6.3.dist-info
async_generator
async_generator-1.10.dist-info
atari_py
atari_py-0.2.6.dist-info
atomicwrites
atomicwrites-1.4.0.dist-info
attr
attrs-20.3.0.dist-info
audioread
audioread-2.1.9.dist-info
autograd
autograd-1.3.dist-info
babel
Babel-2.9.0.dist-info
backcall
backcall-0.2.0.dist-info
beautifulsoup4-4.6.3.dist-info
bin
bleach
bleach-3.3.0.dist-info
blis
blis-0.4.1.dist-info
bokeh
bokeh-2.1.1.dist-info
bottleneck
Bottleneck-1.3.2.dist-info
branca
branca-0.4.2.dist-info
bs4
bs4-0.0.1.dist-info
bson
cachecontrol
CacheControl-0.1

In [10]:
import sys
# Append Conda's site-packages directory to the module search path
sys.path.append("/usr/local/lib/python3.7/site-packages")

Note that the dist-packages directory containing the pre-installed Colab packages appears ahead of the site-packages directory where Conda installs packages. As a result, the version of a package available via Colab takes precedence over any version of the same package installed via Conda.
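To see this precedence in practice, one can check where each directory sits in the search path and which file a given module would actually be loaded from. This is a minimal sketch: `module_origin` and `first_index` are hypothetical helpers, and `example_path` simply reproduces the `sys.path` listing shown earlier with the site-packages directory appended.

```python
import importlib.util

def module_origin(module_name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin

def first_index(path_entries, directory_name):
    """Index of the first entry whose last path component is `directory_name`, else None."""
    for index, entry in enumerate(path_entries):
        if entry.rstrip("/").split("/")[-1] == directory_name:
            return index
    return None

# Search path as listed earlier in the notebook, with site-packages appended.
example_path = [
    "", "/content", "/env/python",
    "/usr/lib/python37.zip", "/usr/lib/python3.7", "/usr/lib/python3.7/lib-dynload",
    "/usr/local/lib/python3.7/dist-packages", "/usr/lib/python3/dist-packages",
    "/usr/local/lib/python3.7/dist-packages/IPython/extensions", "/root/.ipython",
    "/usr/local/lib/python3.7/site-packages",
]

print(first_index(example_path, "dist-packages"))  # 6
print(first_index(example_path, "site-packages"))  # 10
print(module_origin("json"))  # location of the standard-library json package
```

If a Conda-installed version should win instead, one option would be to use `sys.path.insert(0, ...)` rather than `append`, though that may shadow packages that Colab itself depends on.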

**Installing packages**

Remember to include the --yes flag when installing packages to avoid getting prompted to confirm the package plan.

In [11]:
!conda install --channel conda-forge featuretools --yes
!conda install -c conda-forge py2neo --yes
!conda install -c conda-forge neuralcoref --yes

Collecting package metadata (current_repodata.json): done
Solving environment: 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - defaults/linux-64::asn1crypto==0.24.0=py36_0
done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - featuretools


The following p

**Import**

In [13]:
from py2neo import Graph

import spacy
import urllib
import neuralcoref
import itertools
import json
from string import punctuation
import nltk

from flask import Flask, request


**Connect to Neo4J Sandbox**

In [14]:
# Change the line of code below to use the Bolt URL and password of your own Sandbox.
# graph = Graph("<Bolt URL>", auth=("neo4j", "<Password>"))
graph = Graph("bolt://3.84.29.113:7687", auth=("neo4j", "distortions-capability-flower"))
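Rather than editing the URL and password in the cell above, the connection details could be read from environment variables. The sketch below is one possible approach, not the notebook's own method: the helper `sandbox_settings` and the variable names `NEO4J_BOLT_URL`, `NEO4J_USER` and `NEO4J_PASSWORD` are assumptions chosen for illustration. The `Graph(url, auth=(user, password))` call mirrors the one used above and is left commented out so the block runs without a live Sandbox.

```python
import os

def sandbox_settings(environ):
    """Read connection details from an environment mapping.

    Returns (bolt_url, (user, password)). Raises KeyError if the URL or
    password is missing; the user name defaults to "neo4j".
    """
    url = environ["NEO4J_BOLT_URL"]           # hypothetical variable name
    user = environ.get("NEO4J_USER", "neo4j")  # hypothetical variable name
    password = environ["NEO4J_PASSWORD"]      # hypothetical variable name
    return url, (user, password)

# Demonstration with placeholder values instead of the real environment.
demo_env = {"NEO4J_BOLT_URL": "bolt://<host>:7687", "NEO4J_PASSWORD": "<Password>"}
url, auth = sandbox_settings(demo_env)
print(url, auth[0])

# With real values exported, the connection would then be made as above:
# from py2neo import Graph
# url, auth = sandbox_settings(os.environ)
# graph = Graph(url, auth=auth)
```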

**Set Entity Types**

In [20]:
ENTITY_TYPES = ["human", "person", "company", "enterprise", "business", "geographic region",
                "human settlement", "geographic entity", "territorial entity type", "organization"]