Note: if you are using Funnelback 15.10 or newer, use the built-in CSVToXML filter with a web collection instead of this custom gatherer.
Custom gatherer for use with a custom collection for download of csv/tsv files.
Add the URLs of the CSV files to fetch using the data.sources collection.cfg option. Each URL is fetched and its CSV is processed using the configured set of options. Note: only one set of CSV options is supported, so the same settings are applied to every CSV file downloaded.
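As an illustration of how a data.sources value could be split and checked, here is a minimal Python sketch. It is not the gatherer's own code (that is Groovy); the function name parse_data_sources and the whitespace trimming are assumptions made for the example.

```python
from urllib.parse import urlparse

def parse_data_sources(value):
    """Split a data.sources value into URLs, rejecting entries without a scheme."""
    # Assumption: surrounding whitespace and empty entries are ignored.
    urls = [part.strip() for part in value.split(",") if part.strip()]
    for url in urls:
        # Every entry must carry a scheme such as http, https or file.
        if not urlparse(url).scheme:
            raise ValueError(f"not a URL (use file:// for local files): {url}")
    return urls

sources = parse_data_sources(
    "http://examplecsv.com/dvd_csv.txt, "
    "file:///opt/funnelback/data/mycollection/offline/data/mycsvfile.csv"
)
for url in sources:
    print(url)
```

A bare path such as /opt/data/file.csv is rejected here, which mirrors the requirement that local files use the file:// notation.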
Note: you need to set the following collection.cfg option otherwise cache copies will not work:
store.record.type=XmlRecord
The custom-gatherer.groovy requires the org.apache.commons/commons-csv library. This is included in Funnelback 15.
To use with v14, the org.apache.commons/commons-csv jar file must be downloaded and decompressed into the @groovy folder of the collection using the custom gatherer.
The CSV custom gatherer supports the following collection.cfg settings:
Contains a comma-separated list of CSV URLs to index. Must be specified as URLs. Local files should be specified using the file:// URL notation (e.g. file:///opt/funnelback/data/mycollection/offline/data/mycsvfile.csv).
NOTE: only tested with csv and tsv
Values: These map to the available CSVFormat types
- csv (default)
- xls
- rfc4180
- tsv
- mysql
Values: Java character encoding - eg: UTF-8 (default), ISO-8859-1
Values: true (CSV has a header line) (default) / false (CSV does not have a header line)
Note: non-word characters in header fields are converted to underscores when generating the XML field names.
Value: comma-separated list of header titles. The number of items is assumed to match the number of columns in the CSV. Non-word characters are converted to underscores.
e.g. csv.header.custom=field1,field2,field3,field4
Required if csv.header=false
Values: true (print out some additional information to the logs) / false (default)
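The csv.header and csv.header.custom rules above can be sketched roughly as follows. This is a Python illustration, not the gatherer's Groovy code. The function names are hypothetical, and the sketch assumes each non-word character is replaced one-for-one using the ASCII word class.

```python
import re

def sanitise(name):
    # Assumption: each non-word character (ASCII \W) becomes one underscore.
    return re.sub(r"\W", "_", name, flags=re.ASCII)

def resolve_field_names(first_row, has_header=True, custom_header=""):
    """Return XML field names from the header row or from csv.header.custom."""
    if has_header:
        names = first_row
    else:
        # csv.header.custom is required when csv.header=false.
        if not custom_header:
            raise ValueError("csv.header.custom is required when csv.header=false")
        names = custom_header.split(",")
    return [sanitise(name) for name in names]

print(resolve_field_names(["DVD Title", "Studio", "Link_Address"]))
print(resolve_field_names([], has_header=False,
                          custom_header="field1,field2,field3,field4"))
```

How the real gatherer treats csv.header.custom when csv.header=true is not covered here. The sketch simply uses the header row in that case.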
Each row in the CSV file is mapped to an XML document, with the column values nested inside an outer <item> element.
The individual data items are stored as child elements named after the CSV column heading (or the corresponding csv.header.custom value).
<?xml version="1.0" encoding="utf-8"?>
<item>
<FIELDNAME>FIELDVALUE</FIELDNAME>
...
</item>
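To make the row-to-XML mapping concrete, here is a small Python sketch that uses the standard csv module as a stand-in for Commons CSV. The function name rows_to_xml, the trimmed three-column sample, and the escaping of special characters are assumptions for illustration. Field names are expected to be already converted to valid element names.

```python
import csv
import io
from xml.sax.saxutils import escape

def rows_to_xml(csv_text, field_names):
    """Return one <item> document per CSV row, one child element per column."""
    documents = []
    for values in csv.reader(io.StringIO(csv_text)):
        lines = ['<?xml version="1.0" encoding="utf-8"?>', "<item>"]
        for name, value in zip(field_names, values):
            # Assumption: &, < and > in values are escaped for well-formed XML.
            lines.append(f"<{name}>{escape(value)}</{name}>")
        lines.append("</item>")
        documents.append("\n".join(lines))
    return documents

# Header-less sample trimmed to three columns; quoted commas stay in one field.
sample = '"Best Of ICW Wrestling: Vol, 2","Jadat Sports",12.95\n'
docs = rows_to_xml(sample, ["DVD_Title", "Studio", "Price"])
print(docs[0])
```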
If a field contains a unique URL value it can be mapped to the docurl field in the xml.cfg using /item/FIELDNAME
If docurl isn't set then URLs are automatically assigned using the CSV filename with the row appended. e.g. csvfilename.csv/004
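The fallback URL scheme can be pictured with the short Python sketch below. The function name fallback_url and the three-digit zero padding are assumptions inferred from the csvfilename.csv/004 example. Whether rows are counted from zero or one is not stated, so the row number is passed in as-is.

```python
def fallback_url(csv_filename, row_number, width=3):
    # Append the zero-padded row number to the CSV filename.
    return f"{csv_filename}/{row_number:0{width}d}"

print(fallback_url("csvfilename.csv", 4))
```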
For a CSV file that looks like the following:
"DVD Title","Studio","Released","Status","Sound","Versions","Price","Rating","Year","Genre","Aspect","UPC","DVD_ReleaseDate","ID","Timestamp","Link_Address"
"Best Of HypeFest 2003: Spiderweb / Love Chains / Family In Mind / Ritchie's Itch / Blood Hunt / eRATicate / ...","CustomFlix",,"Out","2.0","4:3",19.95,"NR","VAR","VAR","1.33:1","879724005239",2004-01-01 00:00:00,89591,2006-07-21 00:00:00,"http://example.com/video001.html"
"Best Of ICW Wrestling: Vol, 2","Jadat Sports",,"Out","2.0","4:3",12.95,"NR","VAR","Sports","1.33:1","760137867395",2016-08-16 00:00:00,292588,2016-08-16 00:00:00,"http://example.com/video002.html"
"Best Of Jazz In Burghausen, Vol. 3","Double Moon",,"Out","2.0","4:3",24.98,"NR","UNK","Music","1.33:1","608917170498",2009-07-14 00:00:00,161994,2010-08-18 00:00:00,"http://example.com/video003.html"
"Best Of Jazz On TDK 2007","TDK Music DVD",,"Discontinued","5.1/DTS","LBX, 16:9",9.99,"NR","2007","Music","1.85:1","824121002176",2007-02-27 00:00:00,103660,2012-02-23 00:00:00,"http://example.com/video004.html"
"Best Of JDI, Vol. 1","JDI Records",,"Out","2.0","4:3",15.98,"NR","UNK","Music","1.33:1","798321127291",2007-09-18 00:00:00,117900,2014-11-26 00:00:00,"http://example.com/video005.html"
"Best Of JDI, Vol. 2","JDI Records",,"Out","2.0","4:3",15.98,"NR","UNK","Music","1.33:1","798321127390",2007-09-18 00:00:00,117894,2014-11-26 00:00:00,"http://example.com/video006.html"
"Best Of JDI, Vol. 3","JDI Records",,"Out","2.0","4:3",15.98,"NR","UNK","Music","1.33:1","798321127499",2007-09-18 00:00:00,117875,2014-11-26 00:00:00,"http://example.com/video007.html"
"Best Of John Wayne (2-Pack): Dawn Rider / Hurricane Express / McLintock! / Star Packer / Texas Terror / Trail Beyond / ...","GoodTimes Media",,"Discontinued","2.0","4:3",14.98,"NR","VAR","Western","1.33:1","018713833150",2004-06-01 00:00:00,42175,2011-07-23 00:00:00,"http://example.com/video008.html"
"Best Of John Wayne Collection 1: Rio Lobo / El Dorado / True Grit","Paramount",,"Discontinued","2.0","LBX",29.99,"NR","VAR","Western","1.85:1","097360561746",2003-04-29 00:00:00,26655,2007-05-19 00:00:00,"http://example.com/video009.html"
"Best Of John Wayne Collection 1: Rio Lobo / El Dorado / True Grit (Checkpoint)","Paramount",,"Discontinued","2.0","LBX",29.99,"NR","VAR","Western","1.85:1","097360561722",2003-04-29 00:00:00,57003,2013-01-05 00:00:00,"http://example.com/video010.html"
An xml.cfg similar to the following can be used:
PADRE XML Mapping Version: 2
docurl,/item/Link_Address
dvdTitle,1,,//DVD_Title
dvdStudio,1,,//Studio
dvdReleased,0,,//Released
dvdStatus,0,,//Status
dvdSound,0,,//Sound
dvdVersions,0,,//Versions
dvdPrice,3,,//Price
dvdRating,0,,//Rating
dvdYear,0,,//Year
dvdGenre,1,,//Genre
dvdAspect,0,,//Aspect
dvdUpc,0,,//UPC
d,0,,//DVD_ReleaseDate
dvdId,0,,//ID
dvdTimestamp,0,,//Timestamp
A sample collection.cfg might look like:
#
# Filename: /opt/funnelback/conf/showcase-csv/collection.cfg
#
collection=csv
collection_group=Example collections
collection_type=custom
csv.debug=true
data.sources=http://examplecsv.com/dvd_csv.txt
data_report=false
filter=false
gather=custom-gather
query_processor_options=-stem=2 -SM=meta -SF=[dvdTitle,dvdStudio,dvdSound,dvdReleased,dvdStatus,dvdVersions,dvdPrice,dvdRating,dvdYear,dvdGenre,dvdAspect,dvdUpc] -rmc_sensitive=true
service_name=CSV example
spelling.suggestion_sources=[@,dvdTitle,%]
start_url=
store.record.type=XmlRecord
With csv.debug=true set, additional information is written to the log files. In 15.6 this is written to the main collection update log.
e.g.
...
<Timestamp>2016-06-07 00:00:00</Timestamp>
<Versions>4:3</Versions>
<Year>UNK</Year>
<Price>31.95</Price>
<DVD_ReleaseDate>2005-05-10 00:00:00</DVD_ReleaseDate>
<Genre>Music</Genre>
<ID>61702</ID>
</item>
<?xml version="1.0" encoding="utf-8"?>
<item>
<Status>Out</Status>
<Released></Released>
<Rating>NR</Rating>
<UPC>4000127201294</UPC>
<Sound>2.0</Sound>
<DVD_Title>!!!! Beat, Vol. 4: Shows 14 - 17</DVD_Title>
<Aspect>1.33:1</Aspect>
<Studio>Bear Family</Studio>
<Timestamp>2016-06-07 00:00:00</Timestamp>
<Versions>4:3</Versions>
<Year>UNK</Year>
<Price>31.95</Price>
<DVD_ReleaseDate>2005-07-12 00:00:00</DVD_ReleaseDate>
<Genre>Music</Genre>
<ID>65695</ID>
</item>
<?xml version="1.0" encoding="utf-8"?>
<item>
<Status>Out</Status>
<Genre>Animation</Genre>
...