Build NLP pipelines the easy way
Disclaimer: This is in alpha stage and a lot of things can go wrong: it could possibly change your Elasticsearch data, the API is not fixed yet, and even the name NLPeasy might change.
- Free software: Apache Software License 2.0
For this example to work completely you need Python installed, at least in version 3.6. You also need to install and start either
- Docker: https://www.docker.com/get-started, with direct download links for Mac (DMG) and Windows (exe), or
- Elasticsearch and Kibana: https://www.elastic.co/downloads/ or https://www.elastic.co/downloads/elasticsearch-oss (pure Apache licensed version)
Then on the terminal issue:
python -m venv venv
source venv/bin/activate
pip install nlpeasy scikit-learn
python -m spacy download en_core_web_md
The package scikit-learn is just used in this example to get the newsgroups data and preprocess it.
The last command downloads a spaCy model for the English language -
for the following you need at least its md
(= medium) version, which has word vectors.
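To check that the model is installed and really ships word vectors, here is a quick sanity check in plain spaCy (nothing NLPeasy-specific):

import spacy

nlp = spacy.load('en_core_web_md')
# the md model ships word vectors, so this shape should be non-zero
print(nlp.vocab.vectors.shape)
# individual tokens then expose a vector
print(nlp('hello world')[0].vector[:5])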
import pandas as pd
import nlpeasy as ne
from sklearn.datasets import fetch_20newsgroups
# connect to a running Elastic, or else start an Open Source stack in your Docker
elk = ne.connect_elastic(docker_prefix='nlp', elk_version='7.10.2', mount_volume_prefix=None)
# If it is started on Docker, the first run will pull the images (1.3 GB)!
# Setting mount_volume_prefix="./elastic-data/" would keep Elastic's data in your
# filesystem, so the data survives container restarts
# read data as Pandas data frame
news_raw = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
news_groups = [news_raw['target_names'][i] for i in news_raw['target']]
news = pd.DataFrame({'newsgroup': news_groups, 'message': news_raw['data']})
# setup NLPeasy pipeline with name for the elastic index and set the text column
pipeline = ne.Pipeline(index='news', text_cols=['message'], tag_cols=['newsgroup'], elk=elk)
pipeline += ne.VaderSentiment('message', 'sentiment')
pipeline += ne.SpacyEnrichment(nlp='en_core_web_md', cols=['message'], vec=True)
# run the pipeline - just for the first 100 messages, the whole dataset would take 10 minutes
news_enriched = pipeline.process(news.head(100), write_elastic=True)
# Create Kibana Dashboard of all the columns
pipeline.create_kibana_dashboard()
# open Kibana in the web browser
elk.show_kibana()
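Before leaving Elastic/Kibana, you can sanity-check the enriched frame. A small peek - assuming the second argument of VaderSentiment above names the output column sentiment, and message_vec is the vector column used in the clustering below:

# inspect a few of the enriched columns
news_enriched[['newsgroup', 'sentiment', 'message_vec']].head()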
Let's have some fun outside of Elastic/Kibana - but this needs pip install matplotlib
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
grouped = news_enriched.loc[~news_enriched.message_vec.isna()].groupby('newsgroup')
group_vec = grouped.apply(lambda z: np.stack(z.message_vec.values).mean(axis=0))
clust = linkage(np.stack(group_vec), 'ward')
# calculate full dendrogram
plt.figure(figsize=(10, 10))
plt.title('Hierarchical Clustering Dendrogram Newsgroups')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    clust,
    leaf_rotation=0.,   # rotates the leaf labels
    leaf_font_size=8.,  # font size for the leaf labels
    labels=group_vec.index,
    orientation='left'
)
plt.show()
Prerequisites:
- Python 3 (we use Python 3.7)
- Elastic: Several possibilities
- Have Docker installed - this additionally needs the Python docker package installed (see below).
- Install and start Elasticsearch and Kibana: https://www.elastic.co/downloads/ or https://www.elastic.co/downloads/elasticsearch-oss (pure Apache licensed version)
- Use any running Elasticsearch and Kibana (on-premise or in the cloud)...
- Pretrained Models: See below for Spacy Language Models and WordVectors
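The docker package referred to above is presumably the Python Docker SDK, which NLPeasy uses to start the containers; install it into the same virtual environment:

pip install docker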
It is recommended to use a virtual environment:
cd $PROJECT_DIR
python -m venv venv
source venv/bin/activate
The source statement has to be repeated whenever you open a new terminal.
Then install:
pip install nlpeasy
Or the development version from GitHub:
pip install --upgrade git+https://github.com/d-one/nlpeasy
If you want to use spaCy language models, download them (90-200 MB), e.g.
python -m spacy download en_core_web_md
# and/or
python -m spacy download de_core_news_md
If you want to use pretrained FastText-Wordvectors (each ~7GB):
curl -O https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.zip
curl -O https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.de.zip
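Independently of NLPeasy, one way to inspect these vectors is gensim (pip install gensim, not a dependency of this package). A minimal sketch, assuming you unzip wiki.en.zip to get the text-format wiki.en.vec:

from gensim.models import KeyedVectors

# loading the text format is slow - the file is several GB
vectors = KeyedVectors.load_word2vec_format('wiki.en.vec')
print(vectors['computer'][:5])           # a 300-dimensional word vector
print(vectors.most_similar('computer'))  # nearest neighbours in vector space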
If you want to use Jupyter, install it to the virtual environment:
pip install jupyterlab
To install this module in dev mode, i.e. so you can change files and reload the module:
git clone https://github.com/d-one/nlpeasy
cd nlpeasy
It is recommended to use a virtual environment:
python -m venv venv
source venv/bin/activate
Install the package in editable mode:
pip install -e .
In Jupyter you can have code reloaded automatically whenever you change the files, as in:
%load_ext autoreload
%autoreload 2
- Pandas based pipeline (see the illustrative sketch after this list)
- Support for custom extensions - currently includes ones for regexes, spaCy, and VaderSentiment
- Write results to ElasticSearch
- Automatic Kibana dashboard generation
- Can start Elastic in Docker if it is not already running locally or remotely
- Apache License 2.0
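To give an idea of what "Pandas based pipeline" means: conceptually, every stage maps a DataFrame to an enriched DataFrame. The following sketch is purely illustrative - the WordCount class and its process interface are invented here and are not the actual NLPeasy extension API:

import pandas as pd

class WordCount:
    """Hypothetical enrichment stage: counts the words of a text column."""
    def __init__(self, text_col, out_col):
        self.text_col = text_col
        self.out_col = out_col

    def process(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        # Series.str.split gives lists of tokens, .str.len their counts
        out[self.out_col] = out[self.text_col].str.split().str.len()
        return out

# conceptual usage, mirroring the += syntax of the example above:
# pipeline += WordCount('message', 'n_words')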
This package was created with Cookiecutter and the [audreyr/cookiecutter-pypackage](https://github.com/audreyr/cookiecutter-pypackage) project template.