Skip to content
/ sfeed Public

RSS and Atom parser. [copy of git://git.codemadness.org/sfeed master, 05/19/2019]

License

Notifications You must be signed in to change notification settings

dxwc/sfeed

Repository files navigation

sfeed
-----

RSS and Atom parser (and some format programs).

It converts RSS or Atom feeds from XML to a TAB-separated file. There are
formatting programs included to convert this TAB-separated format to various
other formats. There are also some programs and scripts included to import and
export OPML and to fetch, filter, merge and order feed items.


Build and install
-----------------

$ make
# make install


Usage
-----

Initial setup:

	mkdir -p "$HOME/.sfeed/feeds"
	cp sfeedrc.example "$HOME/.sfeed/sfeedrc"

Edit the sfeedrc(5) configuration file and change any RSS/Atom feeds. This file
is included and evaluated as a shellscript for sfeed_update, so it's functions
and behaviour can be overridden:

	$EDITOR "$HOME/.sfeed/sfeedrc"

or you can import existing OPML subscriptions using sfeed_opml_import(1):

	sfeed_opml_import < file.opml > "$HOME/.sfeed/sfeedrc"

an example to export from an other RSS/Atom reader called newsboat and import
for sfeed_update:

	newsboat -e | sfeed_opml_import > "$HOME/.sfeed/sfeedrc"

Update feeds, this script merges the new items, see sfeed_update(1) for more
information what it can do:

	sfeed_update

Format feeds:

Plain-text list:

	sfeed_plain $HOME/.sfeed/feeds/* > "$HOME/.sfeed/feeds.txt"

HTML view (no frames), copy style.css for a default style:

	cp style.css "$HOME/.sfeed/style.css"
	sfeed_html $HOME/.sfeed/feeds/* > "$HOME/.sfeed/feeds.html"

HTML view with the menu as frames, copy style.css for a default style:

	mkdir -p "$HOME/.sfeed/frames"
	cd "$HOME/.sfeed/frames" && sfeed_frames $HOME/.sfeed/feeds/*

To automatically update your feeds periodically and format them in a way you
like you can make a wrapper script and add it as a cronjob.

Most protocols are supported because curl(1) is used by default, therefore
proxy settings from the environment (such as $http_proxy environment variable)
are used.

The sfeed(1) program itself is just a parser that parses XML data from stdin
and is therefore protocol-agnostic. It can be used with HTTP, HTTPs, Gopher,
SSH, etc.

See the section "Usage and examples" below and the man-pages for more
information how to use sfeed(1) and the additional tools.


Dependencies
------------

- C compiler (C99).
- libc (recommended: C99 and POSIX >= 200809).


Optional dependencies
---------------------

- POSIX make(1) (for Makefile).
- POSIX sh(1),
  used by sfeed_update(1) and sfeed_opml_export(1).
- curl(1) binary: https://curl.haxx.se/ ,
  used by sfeed_update(1), can be replaced with any tool like wget(1),
  OpenBSD ftp(1) or hurl(1): https://git.codemadness.org/hurl/
- iconv(1) command-line utilities,
  used by sfeed_update(1). If the text in your RSS/Atom feeds are already UTF-8
  encoded then you don't need this. For a minimal iconv implementation:
  https://git.etalabs.net/cgit/noxcuse/tree/src/iconv.c
- mandoc for documentation: https://mdocml.bsd.lv/


OS tested
---------

- Linux (glibc+gcc, musl+gcc, clang).
- OpenBSD (gcc, clang).
- NetBSD
- FreeBSD
- Windows (cygwin gcc, mingw).
- HaikuOS


Architectures tested
--------------------

amd64, ARM, aarch64, i386, SPARC64.


Files
-----

sfeed             - Read XML RSS or Atom feed data from stdin. Write feed data
                    in TAB-separated format to stdout.
sfeed_atom        - Format feed data (TSV) to an Atom feed.
sfeed_frames      - Format feed data (TSV) to HTML file(s) with frames.
sfeed_gph         - Format feed data (TSV) to geomyidae .gph files.
sfeed_html        - Format feed data (TSV) to HTML.
sfeed_opml_export - Generate an OPML XML file from a sfeedrc config file.
sfeed_opml_import - Generate a sfeedrc config file from an OPML XML file.
sfeed_mbox        - Format feed data (TSV) to mbox.
sfeed_plain       - Format feed data (TSV) to a plain-text list.
sfeed_twtxt       - Format feed data (TSV) to a twtxt feed.
sfeed_update      - Update feeds and merge with old feeds in the directory
                    $HOME/.sfeed/feeds by default.
sfeed_web         - Find urls to RSS/Atom feed from a webpage.
sfeed_xmlenc      - Detect character-set encoding from XML stream.
sfeedrc.example   - Example config file. Can be copied to $HOME/.sfeed/sfeedrc.
style.css         - Example stylesheet to use with sfeed_html(1) and
                    sfeed_frames(1).


Files read at runtime by sfeed_update(1)
----------------------------------------

sfeedrc - Config file. This file is evaluated as a shellscript in
          sfeed_update(1).

Atleast the following functions can be overridden per feed:

- fetch: to use wget(1), OpenBSD ftp(1) or an other download program.
- filter: to filter on fields.
- merge: to change the merge logic.
- order: to change the sort order.

See also the sfeedrc(5) man page documentation for more details.

The feeds() function is called to process the feeds. The default feed()
function is executed concurrently as a background job in your sfeedrc(5) config
file to make updating faster. The variable maxjobs can be changed to limit or
increase the amount of concurrent jobs (8 by default).


Files written at runtime by sfeed_update(1)
-------------------------------------------

feedname     - TAB-separated format containing all items per feed. The
               sfeed_update(1) script merges new items with this file.
feedname.new - Temporary file used by sfeed_update(1) to merge items.


File format
-----------

man 5 sfeed
man 5 sfeedrc
man 1 sfeed


Usage and examples
------------------

Find RSS/Atom feed urls from a webpage:

	url="https://codemadness.org"; curl -L -s "$url" | sfeed_web "$url"

output example:

	https://codemadness.org/blog/rss.xml	application/rss+xml
	https://codemadness.org/blog/atom.xml	application/atom+xml

- - -

Make sure your sfeedrc config file exists, see sfeedrc.example. To update your
feeds (configfile argument is optional):

	sfeed_update "configfile"

Format the feeds files:

	# Plain-text list.
	sfeed_plain $HOME/.sfeed/feeds/* > $HOME/.sfeed/feeds.txt
	# HTML view (no frames), copy style.css for a default style.
	sfeed_html $HOME/.sfeed/feeds/* > $HOME/.sfeed/feeds.html
	# HTML view with the menu as frames, copy style.css for a default style.
	mkdir -p somedir && cd somedir && sfeed_frames $HOME/.sfeed/feeds/*

View formatted output in your browser:

	$BROWSER "$HOME/.sfeed/feeds.html"

View formatted output in your editor:

	$EDITOR "$HOME/.sfeed/feeds.txt"

- - -

Example script to view feeds with dmenu(1), opens selected url in $BROWSER:

	#!/bin/sh
	url=$(sfeed_plain $HOME/.sfeed/feeds/* | dmenu -l 35 -i |
		sed 's@^.* \([a-zA-Z]*://\)\(.*\)$@\1\2@')
	[ ! "$url" = "" ] && $BROWSER "$url"

- - -

Generate a sfeedrc config file from your exported list of feeds in OPML
format:

	sfeed_opml_import < opmlfile.xml > $HOME/.sfeed/sfeedrc

- - -

Export an OPML file of your feeds from a sfeedrc config file (configfile
argument is optional):

	sfeed_opml_export configfile > myfeeds.opml

- - -

The filter function can be overridden in your sfeedrc file. This allows
filtering items per feed. It can be used to shorten urls, filter away
advertisements, strip tracking parameters and more.

	# filter fields.
	# filter(name)
	filter() {
		case "$1" in
		"tweakers")
			LC_LOCALE=C awk -F '\t' 'BEGIN { OFS = "\t"; }
			# skip ads.
			$2 ~ /^ADV:/ {
				next;
			}
			# shorten link.
			{
				if (match($3, /^https:\/\/tweakers\.net\/[a-z]+\/[0-9]+\//)) {
					$3 = substr($3, RSTART, RLENGTH);
				}
				print $0;
			}';;
		"yt BSDNow")
			# filter only BSD Now from channel.
			LC_LOCALE=C awk -F '\t' '$2 ~ / \| BSD Now/';;
		*)
			cat;;
		esac | \
			# replace youtube links with embed links.
			sed 's@www.youtube.com/watch?v=@www.youtube.com/embed/@g' | \

			LC_LOCALE=C awk -F '\t' 'BEGIN { OFS = "\t"; }
			function filterlink(s) {
				# protocol must start with http, https or gopher.
				if (match(s, /^(http|https|gopher):\/\//) == 0) {
					return "";
				}

				# shorten feedburner links.
				if (match(s, /^(http|https):\/\/[^/]+\/~r\/.*\/~3\/[^\/]+\//)) {
					s = substr($3, RSTART, RLENGTH);
				}

				# strip tracking parameters
				# urchin, facebook, piwik, webtrekk and generic.
				gsub(/\?(ad|campaign|pk|tm|wt)_([^&]+)/, "?", s);
				gsub(/&(ad|campaign|pk|tm|wt)_([^&]+)/, "", s);

				gsub(/\?&/, "?", s);
				gsub(/[\?&]+$/, "", s);

				return s
			}
			{
				$3 = filterlink($3); # link
				$8 = filterlink($8); # enclosure

				print $0;
			}'
	}

- - -

The fetch function can be overridden in your sfeedrc file. This allows to
replace the default curl(1) for sfeed_update with any other client to fetch the
RSS/Atom data:

	# fetch a feed via HTTP/HTTPS etc.
	# fetch(name, url, feedfile)
	fetch() {
		hurl -m 1048576 -t 15 "$2" 2>/dev/null
	}

- - -

Aggregate feeds. This filters new entries (maximum one day old) and sorts them
by newest first. Prefix the feed name in the title. Convert the TSV output data
to an Atom XML feed (again):

	#!/bin/sh
	cd ~/.sfeed/feeds/ || exit 1

	LC_ALL=C awk -F '\t' -v "old=$(($(date -j +'%s') - 86400))" '
	BEGIN {
		OFS = "\t";
	}
	{
		if (int($1) >= old) {
			$2 = "[" FILENAME "] " $2;
			print $0;
		}
	}' * | \
	sort -k1,1rn | \
	sfeed_atom

- - -

To have a FIFO stream filtering for new unique feed items and showing them as
plain-text per line similar to sfeed_plain(1):

Create a FIFO:

	fifo="/tmp/sfeed_fifo"
	mkfifo "$fifo"

On the reading side:

	# This keeps track of unique lines so might consume much memory.
	# It tries to reopen the $fifo after 1 second if it fails.
	while :; do cat "$fifo" || sleep 1; done | awk '!x[$0]++'

On the writing side:

	feedsdir="$HOME/.sfeed/feeds/"
	cd "$feedsdir" || exit 1
	test -p "$fifo" || exit 1

	# 1 day is old news, don't write older items.
	LC_ALL=C awk -v "old=$(($(date -j +'%s') - 86400))" '
	BEGIN { FS = OFS = "\t"; }
	{
		if (int($1) >= old) {
			$2 = "[" FILENAME "] " $2;
			print $0;
		}
	}' * | sort -k1,1n | sfeed_plain | cut -b 3- > "$fifo"

cut -b is used to trim the "N " prefix of sfeed_plain(1).

- - -

For some podcast feed the following code can be used to filter the latest
enclosure url (probably some audio file):

	LC_ALL=C awk -F "\t" 'BEGIN { latest = 0; }
	length($8) {
		ts = int($1);
		if (ts > latest) {
			url = $8;
			latest = ts;
		}
	}
	END { if (length(url)) { print url; } }'

- - -

Over time your feeds file might become quite big. You can archive items from a
specific date by doing for example:

File sfeed_archive.c:

	#include <sys/types.h>

	#include <err.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>

	#include "util.h"

	int
	main(int argc, char *argv[])
	{
		char *line = NULL, *p;
		time_t parsedtime, comparetime;
		struct tm tm;
		size_t size = 0;
		int r, c, y, m, d;

		if (argc != 2 || strlen(argv[1]) != 8 ||
		    sscanf(argv[1], "%4d%2d%2d", &y, &m, &d) != 3) {
			fputs("usage: sfeed_archive yyyymmdd\n", stderr);
			exit(1);
		}

		memset(&tm, 0, sizeof(tm));
		tm.tm_isdst = -1; /* don't use DST */
		tm.tm_year = y - 1900;
		tm.tm_mon = m - 1;
		tm.tm_mday = d;
		if ((comparetime = mktime(&tm)) == -1)
			err(1, "mktime");

		while ((getline(&line, &size, stdin)) > 0) {
			if (!(p = strchr(line, '\t')))
				continue;
			c = *p;
			*p = '\0'; /* temporary null-terminate */
			if ((r = strtotime(line, &parsedtime)) != -1 &&
			    parsedtime >= comparetime) {
				*p = c; /* restore */
				fputs(line, stdout);
			}
		}
		return 0;
	}

Now compile and run:

	$ cc -std=c99 -o sfeed_archive util.c sfeed_archive.c
	$ ./sfeed_archive 20150101 < feeds > feeds.new
	$ mv feeds feeds.bak
	$ mv feeds.new feeds

This could also be run weekly in a crontab to archive the feeds. Like throwing
away old newspapers. It keeps the feeds list tidy and the formatted output
small.

- - -

Convert mbox to separate maildirs per feed and filter duplicate messages using
fdm(1): https://github.com/nicm/fdm .

fdm config file (~/.sfeed/fdm.conf):

	set unmatched-mail keep

	account "sfeed" mbox "%[home]/.sfeed/mbox"
		$cachepath = "%[home]/.sfeed/mbox.cache"
		cache "${cachepath}"
		$feedsdir = "%[home]/feeds/"

		# check if in cache by message-id.
		match case "^Message-ID: (.*)" in headers
			action {
				tag "msgid" value "%1"
			}
			continue
			# if in cache, stop.
			match matched and in-cache "${cachepath}" key "%[msgid]"
			action {
				keep
			}

		# not in cache, process it and add to cache.
		match case "^X-Feedname: (.*)" in headers
		action {
			maildir "${feedsdir}%1"
			add-to-cache "${cachepath}" key "%[msgid]"
			keep
		}

Now run:

	$ sfeed_mbox ~/.sfeed/feeds/* > ~/.sfeed/mbox
	$ fdm -f ~/.sfeed/fdm.conf fetch

Now you can view feeds in mutt(1) for example.

- - -

Convert mbox to separate maildirs per feed and filter duplicate messages using
procmail(1).

procmail_maildirs.sh file:

	maildir="$HOME/feeds"
	feedsdir="$HOME/.sfeed/feeds"
	procmailconfig="$HOME/.sfeed/procmailrc"

	# message-id cache to prevent duplicates.
	mkdir -p "${maildir}/.cache"

	if ! test -r "${procmailconfig}"; then
		echo "Procmail configuration file \"${procmailconfig}\" does not exist or is not readable." >&2
		echo "See procmailrc.example for an example." >&2
		exit 1
	fi

	find "${feedsdir}" -type f -exec printf '%s\n' {} \; | while read -r d; do
		name=$(basename "${d}")
		mkdir -p "${maildir}/${name}/cur"
		mkdir -p "${maildir}/${name}/new"
		mkdir -p "${maildir}/${name}/tmp"
		printf 'Mailbox %s\n' "${name}"
		sfeed_mbox "${d}" | formail -s procmail "${procmailconfig}"
	done

Procmailrc(5) file:

	# Example for use with sfeed_mbox(1).
	# The header X-Feedname is used to split into separate maildirs. It is
	# assumed this name is sane.

	MAILDIR="$HOME/feeds/"

	:0
	* ^X-Feedname: \/.*
	{
		FEED="$MATCH"

		:0 Wh: "msgid_$FEED.lock"
		| formail -D 1024000 ".cache/msgid_$FEED.cache"

		:0
		"$FEED"/
	}

Now run:

	$ procmail_maildirs.sh

Now you can view feeds in mutt(1) for example.


License
-------

ISC, see LICENSE file.


Author
------

Hiltjo Posthuma <hiltjo@codemadness.org>

About

RSS and Atom parser. [copy of git://git.codemadness.org/sfeed master, 05/19/2019]

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors 4

  •  
  •  
  •  
  •