Crawl Data Repository (CDR) Implementation

Introduction

This is an implementation of the Memex CDR schema. It provides:

A class representing the CDR schema
Default metadata and content extraction using Apache Tika
A builder class for CDR objects with validation
JSON serialization of objects

Use cases

Generation and serialization of CDR objects in production
Extraction of default metadata and textual content
Validation of objects generated by third-party implementations
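
As a rough illustration of the validation use case, the sketch below checks a parsed object for missing fields. It does not use this library: `CdrFieldCheck`, `ASSUMED_REQUIRED`, and `missingFields` are hypothetical names, and the list of required fields is an assumption based on the fields set through the builder under "API usage", not taken from the CDR schema itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of this library: checks a parsed CDR-like
// object (held as a Map) for fields that are absent or null.
class CdrFieldCheck {

    // Assumption: these are the fields set through the builder in this README;
    // the actual schema may require a different set.
    static final List<String> ASSUMED_REQUIRED = List.of(
            "url", "timestamp", "team", "crawler", "raw_content", "content_type");

    // Returns the assumed-required field names that are absent or null.
    static List<String> missingFields(Map<String, Object> doc) {
        List<String> missing = new ArrayList<>();
        for (String field : ASSUMED_REQUIRED) {
            if (doc.get(field) == null) {
                missing.add(field);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("url", "http://www.darpa.mil/program/memex");
        doc.put("timestamp", 1457053240762L);
        doc.put("team", "DARPA");
        doc.put("crawler", "memex-crawler");
        doc.put("content_type", "text/html");
        // "raw_content" is deliberately left out.
        System.out.println(missingFields(doc)); // prints [raw_content]
    }
}
```

For real validation, the builder's own checks are the reference; this sketch only shows the general shape of a required-field check.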

API usage

Creating a CDR document

CDRDocumentBuilder builder = new CDRDocumentBuilder();

builder.withUrl("http://www.darpa.mil/program/memex")
       .withRawContent("<html><head><title>Sample title</title></head><body>Original text</body></html>")
       .withContentType("text/html")
       .withCrawler("memex-crawler")
       .withTeam("DARPA")
       .withTimestamp(new Date().getTime());

// An object for accessing the CDR document fields
CDRDocument doc = builder.build();

// The same document, already serialized as JSON
String json = builder.buildAsJson();

The json variable will contain a JSON representation of a valid CDR object, including the extracted metadata and textual content.

Sample output in JSON format:

{
  "url": "http://www.darpa.mil/program/memex",
  "timestamp": 1457053240762,
  "team": "DARPA",
  "crawler": "memex-crawler",
  "raw_content": "<html><head><title>Sample title</title></head><body>Original text</body></html>",
  "content_type": "text/html",
  "crawl_data": null,
  "extracted_metadata": {
    "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
    "dc:title": "Sample title",
    "Content-Encoding": "ISO-8859-1",
    "title": "Sample title",
    "Content-Type": "text/html; charset=ISO-8859-1"
  },
  "extracted_text": "Original text"
}
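
The timestamp in the example is produced by `new Date().getTime()`, so it is a count of milliseconds since the Unix epoch. A minimal sketch for turning such a value into a readable UTC instant follows; `CdrTimestamp` and `toInstant` are hypothetical names used only for illustration and are not part of this library.

```java
import java.time.Instant;

// Hypothetical helper, not part of this library: interprets a CDR
// "timestamp" value as milliseconds since the Unix epoch.
class CdrTimestamp {

    // Converts epoch milliseconds to a java.time.Instant (always UTC).
    static Instant toInstant(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis);
    }

    public static void main(String[] args) {
        // The timestamp from the sample output above.
        System.out.println(toInstant(1457053240762L)); // prints 2016-03-04T01:00:40.762Z
    }
}
```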
