Commit

fixes
jaanisoe committed Nov 29, 2019
1 parent 448bdbc commit db7656d
Showing 4 changed files with 10 additions and 8 deletions.
@@ -66,6 +66,8 @@ public Database(String database) throws FileNotFoundException {

this.webpages = db.hashMap("webpages", Serializer.STRING, Serializer.JAVA).counterEnable().open();
this.docs = db.hashMap("docs", Serializer.STRING, Serializer.JAVA).counterEnable().open();

+ logger.debug("Opened database {} with {} publications, {} webpages, {} docs", database, publications.sizeLong(), webpages.sizeLong(), docs.sizeLong());
}

@SuppressWarnings("unchecked")
@@ -81,7 +81,7 @@ public boolean isUsable(FetcherArgs fetcherArgs) {

@Override
public boolean isFinal(FetcherArgs fetcherArgs) {
- return !isBroken() && isUsable(fetcherArgs);
+ return !isBroken() && isUsable(fetcherArgs) && !content.isEmpty();
}

public boolean isBroken() {
2 changes: 1 addition & 1 deletion docs/cli.rst
@@ -25,7 +25,7 @@ Parameter Description
``-l`` or ``--log`` The path of the log file
=================== ===========

- PubFetcher-CLI will output its log to the console (to stderr). With the ``--log`` parameter we can specify a text file location where this same log will be output. It will not be coloured as the console output, but will include DEBUG level messages omitted in the console (currently this includes just the first line listing all parameters the program was run with).
+ PubFetcher-CLI will output its log to the console (to stderr). With the ``--log`` parameter we can specify a text file location where this same log will be output. It will not be coloured as the console output, but will include a few DEBUG level messages omitted in the console (this includes the very first line listing all parameters the program was run with).

If the specified file already exists, then new log messages will be appended to its end. In case of a new log file creation, any missing parent directories will be created as necessary.
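The append-or-create behaviour described above can be sketched in plain Java. This is only an illustration under stated assumptions: PubFetcher-CLI logs through Log4j 2, and the class and method names below (``LogFileAppendSketch``, ``appendLine``) are hypothetical, not part of PubFetcher.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;

// Hypothetical sketch, not PubFetcher code.
final class LogFileAppendSketch {

    // Appends one line to logFile. Missing parent directories are created
    // first; an existing file is kept and the new line goes to its end.
    static void appendLine(Path logFile, String line) throws IOException {
        Path parent = logFile.toAbsolutePath().getParent();
        if (parent != null) {
            // Does nothing if the directories already exist.
            Files.createDirectories(parent);
        }
        Files.write(logFile, Collections.singletonList(line), StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        // The nested path "a/b" does not exist before the first call.
        Path logFile = Files.createTempDirectory("log-sketch").resolve("a/b/pubfetcher.log");
        appendLine(logFile, "first run");
        appendLine(logFile, "second run");
        System.out.println(Files.readAllLines(logFile, StandardCharsets.UTF_8));
    }
}
```

Opening the file with both ``CREATE`` and ``APPEND`` covers the two cases in one call: a new file is created when none exists, and earlier log lines are preserved otherwise.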

12 changes: 6 additions & 6 deletions docs/output.rst
@@ -387,27 +387,27 @@ webpages

.. _webpage_title:
title
- The webpage title (as extracted by the corresponding :ref:`scraping rule <scraping>`; or text from the HTML ``<head>`` element if scraping rules were not found)
+ The webpage title (as extracted by the corresponding :ref:`scraping rule <scraping>`; or text from the HTML ``<title>`` element if scraping rules were not found)

.. _webpage_empty:
empty
- ``true``, if the `webpage content`_ is empty; ``false`` otherwise
+ ``true``, if `webpage title`_ and `webpage content`_ are empty; ``false`` otherwise

.. _webpage_usable:
usable
- ``true``, if the webpage can be used as input for other applications (i.e., if it is not broken_, not empty and is final); ``false`` otherwise
+ ``true``, if the length of `webpage title`_ plus the length of `webpage content`_ is large enough (at least :ref:`webpageMinLength <webpageminlength>` characters), that is, the webpage can be used as input for other applications; ``false`` otherwise

.. _webpage_final:
final
- ``true``, if the length of title plus the length of content is large enough (at least :ref:`webpageMinLength <webpageminlength>` characters); ``false`` otherwise
+ ``true``, if the webpage is not broken_ and the webpage is usable_ and the length on the `webpage content`_ is larger than 0; ``false`` otherwise

.. _broken:
broken
``true``, if the webpage with the given URL could not be fetched (based on the values of statusCode_ and finalUrl_); ``false`` otherwise

.. _webpage_content:
content
- The webpage content (as extracted by the corresponding :ref:`scraping rule <scraping>`; or all text parsed from the HTML if scraping rules were not found)
+ The webpage content (as extracted by the corresponding :ref:`scraping rule <scraping>`; or the :ref:`automatically cleaned <cleaning>` content from the entire HTML of the page if scraping rules were not found)

If ``--plain`` is specified, then only startUrl_, `webpage title`_ and `webpage content`_ will be present.
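
The new wording of ``empty``, ``usable`` and ``final`` in this commit can be read as three small predicates. The following is a minimal sketch of that reading, not PubFetcher's actual class: ``WebpageFlagsSketch`` and ``Page`` are hypothetical names, ``broken`` is taken as a ready-made boolean (the real value is said to depend on statusCode_ and finalUrl_), and ``webpageMinLength`` is passed in directly instead of through ``FetcherArgs``.

```java
// Hypothetical sketch, not PubFetcher code.
final class WebpageFlagsSketch {

    static final class Page {
        final String title;
        final String content;
        final boolean broken; // assumed to be computed elsewhere

        Page(String title, String content, boolean broken) {
            this.title = title;
            this.content = content;
            this.broken = broken;
        }

        // empty: both the title and the content are empty.
        boolean isEmpty() {
            return title.isEmpty() && content.isEmpty();
        }

        // usable: title length plus content length reaches webpageMinLength.
        boolean isUsable(int webpageMinLength) {
            return title.length() + content.length() >= webpageMinLength;
        }

        // final: not broken, usable, and the content itself is non-empty.
        boolean isFinal(int webpageMinLength) {
            return !broken && isUsable(webpageMinLength) && !content.isEmpty();
        }
    }

    public static void main(String[] args) {
        // A long title with no content is usable, but not final.
        Page titleOnly = new Page("A sufficiently long webpage title", "", false);
        System.out.println(titleOnly.isUsable(10) + " " + titleOnly.isFinal(10));
    }
}
```

Under this reading, the extra ``!content.isEmpty()`` check is what separates ``final`` from ``usable``: a page whose title alone reaches the minimum length stays usable but is no longer treated as final.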

@@ -447,7 +447,7 @@ Log file

PubFetcher-CLI will log to stderr using the `Apache Log4j 2 <https://logging.apache.org/log4j/2.x/>`_ library. With the ``--log`` parameter (described in :ref:`Logging <logging>`), a text file where the same log will be output to can be specified.

- Each log line will consist of the following: the data and time, log level, log message, the name of the logger that published the logging event and the name of the thread that generated the logging event. The date and time will be the local time in the format "2018-08-24 11:37:20,187". Log level can be DEBUG, INFO, WARN and ERROR. DEBUG level messages are only output to the log file (and not to the console), currently the only DEBUG level message is the very first message listing all parameters the program was run with. Any line breaks in the log message will be escaped, so that each log message can fit on exactly one line. The name of the logger is just the fully qualified Java class (with the prefix "org.edamontology" removed) the logging event is called from (prepended with "@" in the log file), e.g. "@pubfetcher.cli.Cli". The name of the thread will be "main" if the logging event was generated by the main thread, any subsequent thread will be named "Thread-2", "Thread-3", etc. In the log file the thread name will be in square brackets, e.g. "[Thread-2]". Some Java exceptions can also be logged, these will be output with the stack trace on subsequent lines after the logged exception message.
+ Each log line will consist of the following: the data and time, log level, log message, the name of the logger that published the logging event and the name of the thread that generated the logging event. The date and time will be the local time in the format "2018-08-24 11:37:20,187". Log level can be DEBUG, INFO, WARN and ERROR. DEBUG level messages are only output to the log file (and not to the console). Currently, there are only few DEBUG messages, including the very first message listing all parameters the program was run with. Any line breaks in the log message will be escaped, so that each log message can fit on exactly one line. The name of the logger is just the fully qualified Java class (with the prefix "org.edamontology" removed) the logging event is called from (prepended with "@" in the log file), e.g. "@pubfetcher.cli.Cli". The name of the thread will be "main" if the logging event was generated by the main thread, any subsequent thread will be named "Thread-2", "Thread-3", etc. In the log file the thread name will be in square brackets, e.g. "[Thread-2]". Some Java exceptions can also be logged, these will be output with the stack trace on subsequent lines after the logged exception message.
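
Three of the field formats described in this paragraph are concrete enough to sketch: the timestamp pattern, the "@"-prefixed logger name and the bracketed thread name. The helper below is a hypothetical illustration in plain Java (``LogLineFieldsSketch`` and its methods are not PubFetcher or Log4j 2 APIs). It does not assume any particular separator between fields or any escaping scheme for line breaks, since the text does not specify them.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch, not PubFetcher code.
final class LogLineFieldsSketch {

    // Matches timestamps such as "2018-08-24 11:37:20,187".
    static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Assumption: the removed prefix includes the trailing dot.
    static final String PREFIX = "org.edamontology.";

    static LocalDateTime parseTime(String timestamp) {
        return LocalDateTime.parse(timestamp, LOG_TIME);
    }

    // "org.edamontology.pubfetcher.cli.Cli" becomes "@pubfetcher.cli.Cli".
    static String loggerField(String className) {
        String shortName = className.startsWith(PREFIX)
                ? className.substring(PREFIX.length())
                : className;
        return "@" + shortName;
    }

    // "Thread-2" becomes "[Thread-2]".
    static String threadField(String threadName) {
        return "[" + threadName + "]";
    }

    public static void main(String[] args) {
        System.out.println(parseTime("2018-08-24 11:37:20,187"));
        System.out.println(loggerField("org.edamontology.pubfetcher.cli.Cli"));
        System.out.println(threadField(Thread.currentThread().getName()));
    }
}
```

A parser for whole log lines would additionally need the exact field order and separators from the Log4j 2 configuration, which this page does not show.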

Analysing logs
==============