DOM Distiller
DOM Distiller aims to provide a better reading experience by distilling the content of the page. This distilled content can then be used in a variety of ways.
The current efforts that are or will be powered by DOM Distiller:
- Reader mode: a mobile-friendly viewing mode for Chrome mobile
- [Simplify page for print](https://plus.google.com/+FrancoisBeaufort/posts/dDPD2gVThuv)
Report a bug
We use the same bug tracking system Chromium uses (http://crbug.com), and the
DOM Distiller related bugs are filed under the
[component:UI>Browser>ReaderMode](https://bugs.chromium.org/p/chromium/issues/list?q=component%3AUI%3EBrowser%3EReaderMode)
component.
If the extracted contents have missing or extra text or images, it's considered a bug. If a long non-mobile-friendly article doesn't trigger the infobar on Chrome on Android, you can also file a bug.
How to use Reader mode on Chrome on Android
- Open Chrome on your Android phone
- Navigate to chrome://flags and search for "Reader mode" (Menu -> Find in page -> Reader Mode triggering), or directly go to [chrome://flags#reader-mode-heuristics](chrome://flags#reader-mode-heuristics)
- Choose "Appears to be an article" to turn on Reader mode for non-mobile-friendly long articles, or choose "Always" for debugging.
- Click "Relaunch Now" at the bottom of the page
- Next time you're trying to read a page, tap on the "Make page mobile-friendly" infobar to try it out!
Continuous integration
- [![Build Status](https://travis-ci.org/chromium/dom-distiller.svg?branch=master)](https://travis-ci.org/chromium/dom-distiller)
- Travis-CI waterfall
Get the code
In a folder where you want the code (outside of the chromium checkout):
git clone https://github.com/chromium/dom-distiller.git
A dom-distiller folder will be created in the folder where you run that command.
Environment setup
Before you build for the first time, you need to install the build dependencies.
For all platforms, it is required to download and install the [Google Chrome browser](https://www.google.com/chrome/browser/desktop/).
ChromeDriver requires Google Chrome to be installed at a specific location (the default location for the platform). See the [ChromeDriver documentation](https://code.google.com/p/selenium/wiki/ChromeDriver) for details.
Also install the git hooks:
./create-hook-symlinks
Developing on Ubuntu/Debian
Install the dependencies by entering the dom-distiller folder and running:
sudo ./install-build-deps.sh
Ubuntu 14.04 64-bit is recommended.
Developing on Mac OS X
- Install JDK 7 using either your organization's software management tool, or download it from [Oracle](http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html).
- Install Homebrew.
- Install ant and python using Homebrew:
  brew install ant python
- Since both the protocol buffer compiler and Python bindings are needed, install the protobuf package with the --with-python command line parameter:
  brew install protobuf --with-python
- Create a folder named buildtools inside your DOM Distiller checkout.
- Download ChromeDriver (chromedriver_mac32.zip) from the [Download page](https://sites.google.com/a/chromium.org/chromedriver/downloads).
- Unzip chromedriver_mac32.zip and ensure the binary ends up in your buildtools folder.
- Install the PyPI package management tool pip by running:
  sudo easy_install pip
- Install selenium using pip:
  pip install --user selenium
The rest of this guide sometimes refers to a tool called xvfb, specifically
when running shell commands using xvfb-run. When you develop on Mac OS X, you
can remove that part of the command. For example, xvfb-run echo would just
become echo.
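If you share scripts between Linux and Mac OS X, one way to avoid editing each command is a small wrapper that uses xvfb-run only when it is installed. This is a minimal sketch and not part of the project; the function name maybe_xvfb is hypothetical.

```shell
# Hypothetical helper (not part of DOM Distiller): run a command under
# xvfb-run when it is available, otherwise run the command directly.
maybe_xvfb() {
  if command -v xvfb-run >/dev/null 2>&1; then
    xvfb-run "$@"
  else
    "$@"
  fi
}

# Behaves like "xvfb-run echo hello" where xvfb-run exists,
# and like plain "echo hello" elsewhere (for example on Mac OS X).
maybe_xvfb echo hello
```

With this in place, a command such as xvfb-run out/Debug/components_browsertests could be written as maybe_xvfb out/Debug/components_browsertests on either platform.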
Developing with Vagrant
This option could be useful if you want to develop on an unsupported system like Windows or Red Hat Linux. Even if you are on a supported system but would rather not touch the system too much, Vagrant is a viable alternative.
The Vagrant VM is based on Ubuntu 14.04.
- Install Vagrant on your system. Version 1.7.2 or higher is recommended.
- Launch the Vagrant VM instance:
  vagrant up
- SSH to the VM:
  vagrant ssh
Tools for contributing
The DOM Distiller project uses the Chromium tools for collaboration. For code
reviews, the [Chromium Rietveld code review tool](https://codereview.chromium.org/)
is used and the set of tools found in depot_tools is also required.
To get depot_tools, follow the guide at [Chrome infrastructure documentation
for depot_tools](http://commondatastorage.googleapis.com/chrome-infra-docs/flat/depot_tools/docs/html/depot_tools_tutorial.html#_setting_up).
The TL;DR of that is to run this from a folder where you install developer
tools, for example in your $HOME folder:
git clone https://chromium.googlesource.com/chromium/tools/depot_tools
export PATH="/path/to/depot_tools:$PATH"
You must also set up your local checkout to point to the Chromium Rietveld
server. This is a one-time setup for your checkout, so from your dom-distiller
checkout folder, run:
git cl config
- Rietveld server: https://codereview.chromium.org
- You can leave the rest of the fields blank.
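To keep the depot_tools PATH change across shell sessions, you could put something like the following in a file such as ~/.bashrc. This is only a sketch; the helper name prepend_path is hypothetical, and /path/to/depot_tools is the same placeholder used above.

```shell
# Hypothetical helper: prepend a directory to PATH only if it is not
# already listed, so re-sourcing the file does not duplicate entries.
prepend_path() {
  case ":$PATH:" in
    *":$1:"*) ;;               # already present, nothing to do
    *) PATH="$1:$PATH" ;;
  esac
  export PATH
}

prepend_path "/path/to/depot_tools"
```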
Building
Using ant
ant is the tool we use to build, and the available targets can be listed using
ant -p, but the typical targets you might use when you work on this project
are:
- ant test runs all tests.
- ant test -Dtest.filter=$FILTER_PATTERN runs a subset of the tests, where $FILTER_PATTERN is a [gtest_filter pattern](https://code.google.com/p/googletest/wiki/AdvancedGuide#Running_a_Subset_of_the_Tests). For example, *.FilterTest.*:*Foo*-*Bar* would run all tests whose names contain .FilterTest. or Foo, but not those containing Bar.
- ant gwtc compiles .class + .java files to JavaScript. Standalone JavaScript is available at war/domdistiller/domdistiller.nocache.js.
- ant gwtc.jstests creates a standalone JavaScript for the tests.
- ant extractjs creates standalone JavaScript from the output of ant gwtc. The compiled JavaScript file is available at out/domdistiller.js.
- ant extractjs.jstests creates a standalone JavaScript for the tests.
- ant package copies the main build artifacts into the out/package folder, typically the extracted JS and protocol buffer files.
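The filter syntax can be easy to get wrong by hand. As an illustration, the sketch below assembles a pattern following the gtest_filter convention (positive patterns joined by ':', negative patterns listed after a single '-'). The helper name make_test_filter and its --not separator are hypothetical and not part of the build.

```shell
# Hypothetical helper: build a gtest_filter-style pattern.
# Arguments before --not are positive patterns, arguments after it are negative.
make_test_filter() {
  local filter="" sep="" pattern
  for pattern in "$@"; do
    if [ "$pattern" = "--not" ]; then
      sep="-"                  # the negative section starts with a single '-'
      continue
    fi
    filter="${filter}${sep}${pattern}"
    sep=":"                    # further patterns in a section are joined by ':'
  done
  printf '%s\n' "$filter"
}

# Reproduces the example pattern from the list above.
make_test_filter '*.FilterTest.*' '*Foo*' --not '*Bar*'
```

The result could then be passed on, for example as ant test -Dtest.filter="$(make_test_filter '*Foo*' --not '*Bar*')". Quoting the patterns keeps the shell from expanding the asterisks.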
Contributing
You can use regular git commands when developing in this project and use git cl for collaboration.
Uploading a CL for review
On your branch, run git cl upload. The first time you do this, you will have
to provide a username and password.
- For username, use your @chromium.org account.
- For password, get it from the [GoogleCode.com settings page](https://code.google.com/hosting/settings)
  when logged into your @chromium.org account, and add the full
  machine code.google.com login line to your ~/.netrc file.
Committing a CL
- Change upstream to remote master, push the CL, then revert upstream to local:
  git branch -u origin/master
  git cl land
  git branch -u master
- For username, use your GitHub account name (the username, not the full e-mail).
- For password, use your GitHub password.
- If you have two-factor authentication enabled, create a personal access token at your [application settings page](https://github.com/settings/applications) and use that as your password.
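If you land CLs often, the three git commands from the first step can be wrapped in a function so the upstream is switched back even when landing fails. This is a sketch under that assumption; the name land_cl is hypothetical.

```shell
# Hypothetical wrapper around the three git commands used for committing a CL.
land_cl() {
  # Point the branch at the remote master; stop if that fails.
  git branch -u origin/master || return 1
  git cl land
  local rc=$?
  # Restore the local upstream regardless of whether landing succeeded.
  git branch -u master
  return "$rc"
}
```

Run land_cl from the branch that holds the CL. It returns the exit status of git cl land.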
Code formatting
Before uploading a CL it is recommended to run git cl format. However, this
requires adding symbolic links to your chromium checkout.
Inside the buildtools folder of your checkout, add the following symbolic
links:
- clang_format → /path/to/chromium/src/buildtools/clang_format/
- linux64 → /path/to/chromium/src/buildtools/linux64/ (only for Linux 64-bit platform)
- mac → /path/to/chromium/src/buildtools/mac/ (only for Mac platform)
Doing this enables you to run the command git cl format to fix the formatting
of your code.
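Creating the links can be scripted. The sketch below is one possible way to do it; the function name link_format_tools is hypothetical, and it assumes the platform folders sit directly under src/buildtools in the Chromium checkout (linux64 on Linux, mac on Mac OS X).

```shell
# Hypothetical helper: create the buildtools symbolic links needed by git cl format.
# $1: path to chromium/src, $2: path to the DOM Distiller checkout.
link_format_tools() {
  local chromium_src=$1 distiller_dir=$2
  mkdir -p "$distiller_dir/buildtools" || return 1
  ln -sfn "$chromium_src/buildtools/clang_format" "$distiller_dir/buildtools/clang_format"
  # Link only the platform-specific folder that matches the current system.
  case "$(uname -s)" in
    Linux)  ln -sfn "$chromium_src/buildtools/linux64" "$distiller_dir/buildtools/linux64" ;;
    Darwin) ln -sfn "$chromium_src/buildtools/mac" "$distiller_dir/buildtools/mac" ;;
  esac
}

# Usage (placeholder paths):
#   link_format_tools /path/to/chromium/src /path/to/dom-distiller
```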
Run in Chrome for desktop
In this section, the following shell variables are assumed to be set correctly:
export CHROME_SRC=/path/to/chromium/src
export DOM_DISTILLER_DIR=/path/to/dom-distiller
- Pull the generated package (from ant package) into Chrome. You can use this handy bash function to help with that:
  roll-distiller () {
    (
      (cd "$DOM_DISTILLER_DIR" && ant package) && \
      rm -rf "$CHROME_SRC"/third_party/dom_distiller_js/dist/* && \
      cp -rf "$DOM_DISTILLER_DIR"/out/package/* "$CHROME_SRC"/third_party/dom_distiller_js/dist/ && \
      touch "$CHROME_SRC"/components/resources/dom_distiller_resources.grdp
    )
  }
- From $CHROME_SRC, run GYP to set up the ninja build files using build/gyp_chromium.
Running the Chrome browser with distiller support
- For running Chrome, you need to build the chrome target:
  ninja -C out/Debug chrome
- Run chrome with DOM Distiller enabled:
  out/Debug/chrome --enable-dom-distiller
- This adds a menu item Distill page that you can use to distill web pages.
- You can also go to chrome://dom-distiller to access the debug page.
- To have a unique user profile every time you run Chrome, you can also add --user-data-dir=$(mktemp -d) as a command line parameter (mktemp -d already prints the full path of the new folder). On Mac OS X, you can instead write --user-data-dir=$(mktemp -d 2>/dev/null || mktemp -d -t 'chromeprofile').
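The flags above can be combined into one function so every launch gets a fresh profile. This is a sketch only; the function name distiller_chrome and the CHROME_BIN override are hypothetical, and the default binary path assumes you run it from $CHROME_SRC.

```shell
# Hypothetical wrapper: start Chrome with DOM Distiller enabled and a fresh profile.
# CHROME_BIN is an assumed override; it defaults to the debug build from this guide.
distiller_chrome() {
  local profile
  # The fallback after || is the form this guide gives for Mac OS X.
  profile=$(mktemp -d 2>/dev/null || mktemp -d -t 'chromeprofile') || return 1
  "${CHROME_BIN:-out/Debug/chrome}" --enable-dom-distiller --user-data-dir="$profile" "$@"
}
```

Extra arguments are passed through, so distiller_chrome http://www.example.com opens that page in the new profile.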
Running the automated tests in Chromium
- For running the tests, you need to build the components_browsertests target:
  ninja -C out/Debug components_browsertests
- Run the components_browsertests binary to execute the tests. You can prefix the command with xvfb-run to avoid pop-up windows:
  xvfb-run out/Debug/components_browsertests
- To only run tests related to DOM Distiller, run:
  xvfb-run out/Debug/components_browsertests --gtest_filter=\*Distiller\*
- For running tests as isolates, you need to build components_browsertests_run and execute them using the swarming tool:
  ninja -C out/Debug components_browsertests_run
  python tools/swarming_client/isolate.py run -s out/Debug/components_browsertests.isolated
Running the content extractor
To extract the content from a web page directly, you can run:
xvfb-run out/Debug/components_browsertests \
--gtest_filter='*MANUAL_ExtractUrl' \
--run-manual \
--test-tiny-timeout=600000 \
--output-file=./extract.out \
--url=http://www.example.com \
> ./extract.log 2>&1
extract.out has the extracted HTML, extract.log has the console logging.
If you need more logging, you can add the following arguments to the command:
- Chrome browser: --vmodule=*distiller*=2
- Content extractor: --debug-level=99
If this is something you often do, you can put the following function in a bash
file you include (for example ~/.bashrc) and use it for iterative development:
distill() {
  (
    roll-distiller && \
    ninja -C out/Debug components_browsertests && \
    xvfb-run out/Debug/components_browsertests \
      --gtest_filter='*MANUAL_ExtractUrl' \
      --run-manual \
      --test-tiny-timeout=600000 \
      --output-file=./extract.out \
      --url="$1" \
      > ./extract.log 2>&1
  )
}
Usage when running from $CHROME_SRC:
distill http://example.com/article.html
Debug Code
Interactive debugging
You can use the Chrome Developer Tools to debug DOM Distiller:
- Update the test JavaScript by running ant extractjs.jstests or ant test.
- Open war/test.html in Chrome desktop.
- Open the Console panel in Developer Tools (Ctrl-Shift-J). On Mac OS X you can use ⌥-⌘-I (uppercase I) as the shortcut.
- Run all tests by calling:
  org.chromium.distiller.JsTestEntry.run()
- To run only a subset of tests, you can use a regular expression that matches a single test or multiple tests:
  org.chromium.distiller.JsTestEntry.runWithFilter('MyTestClass.testSomething')
The Sources panel contains both the extracted JavaScript and all the Java
source files as long as you haven't disabled JavaScript source maps in Developer
Tools. You can set breakpoints in the Java source files and then inspect the
program state when a breakpoint is hit.
When a test fails, you will see several stack traces. One of these contains clickable links to the corresponding Java source files for the stack frames.
Developer extension
After running ant package, the out/extension folder contains an unpacked
Chrome extension. This can be added to Chrome and used for development.
- Go to chrome://extensions
- Enable developer mode
- Select to load an unpacked extension and point to the out/extension folder.
Features
The extension currently supports profiling the extraction code.
It also adds a panel to the Developer Tools which you can use to trigger
extraction on the inspected page. This can be used to trigger and profile
extraction on a mobile device which you are currently inspecting using
chrome://inspect.
Logging
To add logging, use the Java function LogUtil.logToConsole(). Destination of
logs:
- ant test: Terminal. To get more verbose output, use ant test -Dtest.debug_level=99.
- Chrome browser: the Chrome log file, as set by shell variable $CHROME_LOG_FILE. A release mode build of Chrome will log all JavaScript INFO there if you start Chrome with --enable-logging. You can add --enable-logging=stderr to have the log go to stderr instead of a file.
- Content extractor: See [documentation about extract.log above](#running-the-content-extractor).
For an example, see
$DOM_DISTILLER_DIR/java/org/chromium/distiller/PagingLinksFinder.java.
Use ant package '-Dgwt.custom.args=-style PRETTY' for easier JavaScript
debugging.
Mobile distillation from desktop
1. In the tab with the interesting URL, bring up the Developer Tools emulation panel (the mobile device icon).
2. Select the desired Device and reload the page. Verify that you get what you expect. For example, a Nexus 4 might get a mobile site, whereas a Nexus 7 might get the desktop site.
3. The User-Agent can be copied directly out from the UA field. This field does not even require reload after changing device, but it is good practice to verify that you get what you expect. Copy this to the clipboard.
4. (Re)start chrome with --user-agent="$USER_AGENT_FROM_CLIPBOARD". Remember to also add --enable-dom-distiller.
5. Distill the same URL in viewer by either using the menu Distill page or by going to chrome://dom-distiller and using the input field there.
6. Have fun scrutinizing the Chrome log file.
If you want, you can copy some of these User-Agent strings into normal bash aliases for easy access later. For example, Nexus 4 would be:
--user-agent="Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19"
Steps 1-3 in the guide above can typically be done in a stable version of Chrome, whereas the rest of the steps are typically done in your own build of Chrome (hence the "(Re)" in step 4). Besides speed, this also facilitates side-by-side comparison.
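One possible way to keep such a User-Agent handy is sketched below. A shell function is used instead of an alias so that extra arguments such as a URL can be passed through. The names UA_NEXUS4, chrome_nexus4 and CHROME_BIN are hypothetical; the User-Agent string is the Nexus 4 example from this guide.

```shell
# Hypothetical names; only the User-Agent string itself comes from this guide.
UA_NEXUS4='Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19'

# Start your own Chrome build as a Nexus 4 with DOM Distiller enabled.
# CHROME_BIN is an assumed override; it defaults to the debug build path.
chrome_nexus4() {
  "${CHROME_BIN:-out/Debug/chrome}" --enable-dom-distiller --user-agent="$UA_NEXUS4" "$@"
}
```

From $CHROME_SRC, chrome_nexus4 http://www.example.com would then cover step 4 for that device.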