This repository has been archived by the owner on Nov 9, 2020. It is now read-only.
PunKeel edited this page Mar 29, 2017 · 2 revisions

DocBleach - Introduction

DocBleach is an opinionated Content Disarm and Reconstruction solution, a tool that removes potential threats from your documents.

The word "opinionated" is important here: DocBleach is not perfect, and might never be, but it makes deliberate choices that prevent most of the danger.

We believe in a world where data is valuable, time is precious, and software can make decisions on its own without relying on the user to be perfect. Thus, DocBleach does its best to keep the data intact, be it the layout or the content. DocBleach is fast: it does not depend on virtualization or other time-consuming processes to make a decision. All threats are considered equal, and all are removed. Last but not least, we know the user is not perfect, and we don't want him to be. Preventing threats is our mission; creating value is his responsibility.

Today's malware is very volatile. It changes so fast that the antivirus industry lags behind in delivering the updates that block it. This means that end users have no way to protect themselves from the daily virus variants that reach their email inboxes.

With this in mind, understanding why and how we designed DocBleach is easy.

The user has documents and tools, and wants to use them. If DocBleach prevents him from doing so, he won't use DocBleach. And he's right: who are we to tell him what to do?

DocBleach is a tool that removes potential threats from documents. The document layout is left untouched, so sanitized documents can be used without changing the user's habits.

Using a conversion tool, like Pandoc or OpenOffice's converter, is not an acceptable solution here. It would work from a security point of view: macros are removed when you convert a Word file into an image. But doing so means we also lose some data: sometimes the layout is lost, sometimes the content is scrambled. Instead, we parse the original document and hit specific threat classes in a surgical manner. The content, the layout, and everything around them are left intact. What we don't expect to be there is removed. The file keeps its original format, and everybody's happy.
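To make the "surgical" idea concrete, here is a minimal sketch using only the Java standard library. It is not DocBleach's actual code: the class name `SurgicalZipFilter` and the method `sanitize` are hypothetical. It relies on the fact that modern Office files (.docx, .xlsm, ...) are ZIP packages, and it copies every entry as-is except the ones a caller-supplied rule flags as a threat. A real tool would also have to update the package's content types and relationships so that no dangling references remain; this sketch skips that step.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.function.Predicate;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper for illustration only; not part of DocBleach. */
final class SurgicalZipFilter {

    private SurgicalZipFilter() {}

    /**
     * Rebuilds a ZIP-based document, dropping only the entries whose name
     * matches isThreat. The bytes of every other entry are copied unchanged,
     * so the output is still a ZIP package in the same format.
     */
    static byte[] sanitize(byte[] document, Predicate<String> isThreat) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(document));
             ZipOutputStream zout = new ZipOutputStream(out)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (isThreat.test(entry.getName())) {
                    continue; // unexpected part: leave it out of the rebuilt file
                }
                // Fresh entry with the same name; sizes and CRC are recomputed on write.
                zout.putNextEntry(new ZipEntry(entry.getName()));
                int read;
                while ((read = zin.read(buffer, 0, buffer.length)) != -1) {
                    zout.write(buffer, 0, read);
                }
                zout.closeEntry();
            }
        }
        return out.toByteArray();
    }
}
```

A call such as `SurgicalZipFilter.sanitize(bytes, name -> name.endsWith("vbaProject.bin"))` would drop the VBA macro part and keep everything else. Compare this with a format conversion, which rewrites the whole document and may lose layout or content along the way.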

How does it work?

Well, the sanitization process depends on the file format.

As of today, we do support:

We remove what we consider to be potential threats: macros, JavaScript, and embedded objects.
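As an illustration of what "threat classes" could look like in code, the sketch below maps part names inside an Office Open XML package to a threat class. Everything here is an assumption made for the example: the names `OoxmlThreatRules`, `ThreatClass` and `classify` are hypothetical, and the rules are not DocBleach's actual rule set. It covers only two of the classes listed above, using the conventional `vbaProject.bin` part for macros and the `embeddings` folder for embedded objects. JavaScript is a PDF concern and cannot be detected from part names, so it is left out.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical rule set for illustration only; not DocBleach's actual rules. */
final class OoxmlThreatRules {

    /** Threat classes covered by this sketch. */
    enum ThreatClass { MACRO, EMBEDDED_OBJECT }

    private OoxmlThreatRules() {}

    /**
     * Returns the threat class of a package part, or an empty Optional when
     * the part is expected content that should be kept untouched.
     */
    static Optional<ThreatClass> classify(String partName) {
        String name = partName.toLowerCase(Locale.ROOT);
        if (name.endsWith("vbaproject.bin")) {
            return Optional.of(ThreatClass.MACRO); // e.g. word/vbaProject.bin
        }
        if (name.contains("/embeddings/")) {
            return Optional.of(ThreatClass.EMBEDDED_OBJECT); // e.g. word/embeddings/oleObject1.bin
        }
        return Optional.empty();
    }
}
```

Note that there is no scoring and no attempt to judge whether a given macro is malicious: any part that falls into a threat class is a candidate for removal, and a rule such as `name -> OoxmlThreatRules.classify(name).isPresent()` is all a filter needs.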

Please be aware that this might not work against well-designed 0-days: if an exploit is found in Office's parser and works without macros or embedded objects, we won't remove anything. Using DocBleach might break the exploit anyway, as it partially rewrites the file, but this is a (nice) side effect.

Why Java?

We chose Java for multiple reasons. First of all, Java is great as a language: it has many tools, is commonly used, and is battle-tested.

Then, and this is an important point for us, Java has a great community with powerful projects like Apache POI and Apache PDFBox, both of which we use. Thanks to them, it is possible to read and write Office and PDF files without corrupting them, something that is hard in other languages unless you write the parsers yourself.

What this means for us is that we don't have to write a new parser and maintain it. DocBleach's code should stay stable rather than require constant hard work.

Java is also nice because it runs "everywhere", be it on your desktop or on some Linux server. Being able to target multiple platforms is great: we don't have to adapt DocBleach to each platform, and the user doesn't need to spend time trying to configure Java; it should be easy enough.

If you don't like our code, for some reason, feel free to do it your way. We may be jealous if your solution is prettier or more efficient, but in the end everybody would benefit from it.

What alternatives are there?

As of today, only one open source project (unfortunately abandoned) seems to share our goal: ExeFilter. It depends on Python 2, and on Ruby for its PDF library. Its author has written a lot about weaponized file formats and has been a great source of knowledge for us.

Other solutions exist, but they are paid and/or not open source. To mention a few: Resec, OPSWAT's MetaDefender, Votiro.

We don't believe in security through obscurity, and we do think that using open source software lets you control what happens to your data.