Architecture

rduerr edited this page Nov 20, 2018 · 7 revisions

The insights portion of the Polar Deep Insights project has two major components:

  • Insight Generator
  • Insight Visualizer


Insight Generator

The Insight Generator is a Python library that provides an interface for extracting entities, locations, file metadata, and measurements from documents.

The library interfaces with the following extraction tools, each of which supplies one type of meta information.

Tool                      Type
Apache Tika               content and file metadata
Stanford CoreNLP / NER    dates and locations
Python regex              entities
Grobid Quantities         measurements

Set up the extraction libraries as described here, and ensure that each one is running on the port mentioned there.

Python API

cd ./insight-generator

Given a root path as its first argument, the main.py script recurses down the directory tree, extracts the meta information listed above from each file, and saves the results to a local file.

The library can also process files stored on S3 or HDFS, provided they are mounted onto the local file system.

# Syntax
python main.py [ ROOT-PATH ] [ OUTPUT-PATH ]

# Example
python main.py "/tmp/dump" entities.txt
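
The traversal that main.py is described as performing can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: `walk_and_extract` and `extract_placeholder` are hypothetical names, and the placeholder only records the file size where the real library would call Tika, CoreNLP, and Grobid Quantities.

```python
import json
import os


def extract_placeholder(path):
    """Hypothetical stand-in for the real per-file extraction step."""
    return {"path": path, "size_bytes": os.path.getsize(path)}


def walk_and_extract(root_path, output_path, extract=extract_placeholder):
    """Recurse below root_path and write one JSON record per file to output_path."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for dir_path, _dir_names, file_names in os.walk(root_path):
            for name in sorted(file_names):
                record = extract(os.path.join(dir_path, name))
                out.write(json.dumps(record) + "\n")
                count += 1
    return count
```

Keeping the output file outside the root path avoids the output being picked up as an input during the walk.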

Users can build custom implementations to handle the extracted meta information.

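from extractors.base import InformationExtractor
from util.dir_tree import DirTreeTraverser

# Root directory to traverse; replace with your own path
BASE_PATH = "/tmp/dump"

def customProcessor(metaInfo):
  # Do something with the extracted meta information
  pass

def process(path):
  # Extract meta information from a single file and hand it to the custom processor
  metaInfo = InformationExtractor(path).extract()
  customProcessor(metaInfo)

# Apply process() to every file below BASE_PATH
DirTreeTraverser(BASE_PATH).iterateAndPerform(process)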
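
As one possible custom processor, the sketch below tallies how often each entity appears across all processed files. It assumes, purely for illustration, that the extracted meta information is a dictionary with an "entities" list of strings; the real structure may differ, and `EntityTally` is a hypothetical name.

```python
from collections import Counter


class EntityTally:
    """Callable processor that counts entity mentions across files."""

    def __init__(self, key="entities"):
        # `key` is an assumed field name, not one confirmed by the library.
        self.key = key
        self.counts = Counter()
        self.files_seen = 0

    def __call__(self, meta_info):
        self.files_seen += 1
        # Files without the assumed key contribute no entities.
        self.counts.update(meta_info.get(self.key, []))
```

An instance can stand in for the custom processor function, for example `customProcessor = EntityTally()`, and `customProcessor.counts.most_common(10)` then lists the most frequent entities once the traversal finishes.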

Users can also run the extract.py script as a standalone meta information extractor for a single file.

# Syntax
python extract.py [ FILE-PATH ]

# Example
python extract.py /tmp/dump/test.html | json_pp
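
The standalone extractor can also be driven from Python. The sketch below assumes that extract.py prints a single JSON document to standard output, which the `json_pp` pipe above suggests but does not guarantee; `run_extractor` is a hypothetical helper, and it expects to be run from the insight-generator directory.

```python
import json
import subprocess
import sys


def run_extractor(file_path, command=(sys.executable, "extract.py")):
    """Run the extractor command on one file and parse its stdout as JSON."""
    completed = subprocess.run(
        [*command, file_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(completed.stdout)
```

The `command` parameter exists so that a different interpreter or script path can be supplied without changing the helper.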

Insight Visualizer

The Insight Visualizer is an AngularJS single-page application (SPA) that facilitates building an 'ontology-of-interest' using the concept editor interface.

Users can then gain insights from the information extracted by the Insight Generator module through the query interface (see the Guide).

The query interface is extensible: each visualization tab is an Angular component. Users who wish to add custom visualization elements can define their own Angular components as follows.

<!-- Custom visualization component -->
<div polar-analytics-my-custom-visualization data-filters="filters"></div>

// Custom component controller
$scope.data = Document.query($FilterParser($scope.filters), {
  // Custom Elasticsearch aggregate query
});
// Build your visualization with the aggregated data from Elasticsearch that is set on the controller scope.
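
To give a sense of what a custom aggregate query might contain, the sketch below builds a standard Elasticsearch terms aggregation as a plain Python dictionary; the same structure would be written as an object literal in the controller. The field name used in the usage note is hypothetical, and `terms_aggregation` is not part of the project.

```python
def terms_aggregation(field, filters=None, size=10):
    """Build a request body that counts documents per distinct value of `field`."""
    if filters:
        # Filter clauses narrow the document set without affecting scoring.
        query = {"bool": {"filter": list(filters)}}
    else:
        query = {"match_all": {}}
    return {
        "size": 0,  # return only aggregation buckets, no document hits
        "query": query,
        "aggs": {
            "by_value": {"terms": {"field": field, "size": size}},
        },
    }
```

For example, `terms_aggregation("entities.keyword", size=5)` requests the five most common values of an assumed `entities.keyword` field.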

Open an issue for questions about setup or contributions.