.NET API for document file format identification, text extraction, metadata extraction, and embedded object/attachment extraction

COMING SOON - Welcome to Open Discover® SDK for .NET

Open Discover SDK is a .NET application programming interface (API) that allows for:

  • Identifying file formats from internal binary signatures, which is faster and more reliable than trusting file extensions. More than 1,400 file formats are supported for identification.
  • Extracting text from supported file formats and optionally identifying languages present in the extracted text
  • Extracting metadata from supported file formats (over 1,325 known and documented metadata fields in total)
  • Extracting embedded items/attachments from supported document formats
  • Extracting archive container items (7ZIP, ZIP, RAR, TAR, etc.)
  • Extracting mail store container email objects (PST, OST, OST2013, MBOX, etc.)
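
To illustrate why signature-based identification is more reliable than file extensions, here is a minimal sketch that checks a file's leading bytes against a handful of well-known signatures. It is not the Open Discover SDK API: the `SignatureSniffer` class and its methods are hypothetical names chosen for this illustration, and the table covers only four formats.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only; this is NOT the Open Discover SDK API.
public static class SignatureSniffer
{
    // A few well-known leading byte signatures ("magic numbers").
    private static readonly (string Name, byte[] Magic)[] Signatures =
    {
        ("PDF",  new byte[] { 0x25, 0x50, 0x44, 0x46 }),             // "%PDF"
        ("ZIP",  new byte[] { 0x50, 0x4B, 0x03, 0x04 }),             // "PK\x03\x04"
        ("7ZIP", new byte[] { 0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C }), // "7z" + 4 bytes
        ("RAR",  new byte[] { 0x52, 0x61, 0x72, 0x21, 0x1A, 0x07 }), // "Rar!\x1A\x07"
    };

    // Returns the name of the first signature that the header starts with,
    // or "Unknown" if none matches.
    public static string Identify(ReadOnlySpan<byte> header)
    {
        foreach (var (name, magic) in Signatures)
        {
            if (header.StartsWith(new ReadOnlySpan<byte>(magic)))
                return name;
        }
        return "Unknown";
    }

    // Reads up to the first 8 bytes of a file and identifies it by content,
    // ignoring the file extension entirely.
    public static string IdentifyFile(string path)
    {
        var buffer = new byte[8];
        using (var stream = File.OpenRead(path))
        {
            int read = stream.Read(buffer, 0, buffer.Length);
            return Identify(buffer.AsSpan(0, read));
        }
    }
}
```

A PDF renamed to `report.txt` would still be reported as `PDF` by `SignatureSniffer.IdentifyFile`, because only the file's bytes are inspected. Real-world identification needs far more than a prefix table, since many formats share a container signature (Office Open XML documents begin with the ZIP signature, for example).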

The Open Discover SDK API is intended for developers building higher-level document processing applications such as:

  • Full text search using Lucene.NET
  • Machine learning using extracted text and metadata
  • Text analytics and document concept clustering
  • Information governance
  • Website crawling/full-text website search
  • Enterprise search and content management
  • IT departments: identifying and de-duplicating documents on file servers
  • eDiscovery applications
  • And more...

This GitHub repository hosts the following C# examples that illustrate how to use the Open Discover SDK API:

  • DocumentIdentifier Example: shows how to use the SDK to identify the file formats of all files under an input directory and its subdirectories
  • ContentExtraction Example: illustrates the following SDK features:
    • How to extract text and metadata from office documents, PDFs, XPS, raster images, vector images, multimedia, and more
    • How to decrypt password protected office documents, PDFs, and archives
    • How to identify the languages present in extracted text
    • How to compute MD5/SHA-1 binary hashes and content-based hashes for emails and office documents. These hashes are useful for de-duplicating copies of the same document or email, whether saved as .msg, .eml, or .emlx.
    • How to extract items from archives such as 7ZIP, ZIP, RAR, split archives, self-extracting archives, etc.
    • How to extract email objects from PST, OST, and MBOX mail stores
  • Indexing Example: illustrates a simple indexing strategy using the SDK with Lucene.NET, and how to make indexes more useful by:
    • Indexing the document format ID as a field. Users can then limit searches to documents of specific formats.
    • Indexing document format classifications as fields (e.g., WordProcessing and Spreadsheet are file format classifications). Users can limit searches to all "WordProcessing" or all "Spreadsheet" documents, for example.
    • Indexing MD5/SHA-1 binary hashes and content-based hashes as fields. When searching the index, duplicate documents can be flagged and returned as a group.
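
The hash-based de-duplication mentioned above can be sketched roughly as follows, using only the standard .NET cryptography classes. This is not the Open Discover SDK API: `DuplicateFinder`, `Sha1Hex`, and `FindDuplicates` are hypothetical names for this illustration, and the sketch covers binary hashes only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper for illustration only; this is NOT the Open Discover SDK API.
public static class DuplicateFinder
{
    // Upper-case hex encoding of the SHA-1 digest of the raw bytes (a "binary hash").
    public static string Sha1Hex(byte[] content)
    {
        using (var sha1 = SHA1.Create())
        {
            return BitConverter.ToString(sha1.ComputeHash(content)).Replace("-", "");
        }
    }

    // Groups document names by binary hash and keeps only the groups
    // that contain more than one document, i.e. the duplicate sets.
    public static List<List<string>> FindDuplicates(
        IEnumerable<KeyValuePair<string, byte[]>> documents)
    {
        return documents
            .GroupBy(d => Sha1Hex(d.Value), d => d.Key)
            .Where(g => g.Count() > 1)
            .Select(g => g.ToList())
            .ToList();
    }
}
```

A binary hash only matches byte-identical copies. The same email saved as .msg and as .eml has different bytes, which is why the content-based hashes described above are needed for that case; how the SDK computes them is not shown here.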