tentative rename, etc

commit 712f3a64e3cfa30c48a98f0c875de8cf2b14b25e (1 parent: 70a7c52)
Kyle Maxwell committed Mar 4, 2009
12 INSTALL
@@ -1,4 +1,4 @@
-Dexter depends on
+Parsley depends on
- the JSON C library from http://oss.metaparadigm.com/json-c/ (I used 0.7)
- argp (standard with Linux, other platforms use argp-standalone package)
- pcre (with dev headers)
@@ -32,17 +32,11 @@ sudo make install
Ruby Binding (via Gems)
------------------------------------------------------------------------
-# install the C version first
-cd ruby
-gem build dexterous.gemspec
-sudo gem install dexterous
+http://github.com/fizx/parsley-ruby
Python Binding
------------------------------------------------------------------------
-# install the C version first
-# Use Python 2.6, as this depends on the json support in Python's stdlib
-cd python
-python setup.py install
+http://github.com/fizx/pyparsley
Other OS/Configurations:
------------------------------------------------------------------------
8 INTRO
@@ -1,6 +1,6 @@
<html><textarea style="width:100%;height:100%">
Towards a universal scraping API
-or, an introduction to dexter
+or, an introduction to parsley
Web scraping is a chore. Scraper scripts are brittle and slow, and everyone writes their own custom implementation, resulting in countless hours of repeated work. Let's work together to make it easier. Let's do what regular expressions did for text processing, and what SQL did for databases. Let's create a universal domain-specific language for web scraping.
@@ -47,8 +47,8 @@ Applying this to http://www.yelp.com/biz/amnesia-san-francisco yields:
You'll note that the output structure mirrors the input structure. In the Ruby binding, you can get both input and output natively:
> require "open-uri"
- > require "dexter"
- > Dexterous.new({"title" => "h1", "links" => ["a"]}).parse(:url => "http://www.yelp.com/biz/amnesia-san-francisco")
+ > require "parsley"
+ > Parsley.new({"title" => "h1", "links" => ["a"]}).parse(:url => "http://www.yelp.com/biz/amnesia-san-francisco")
#=> {"title"=>"Amnesia", "links"=>["Yelp", "Welcome", "About Me"]}
We'll also add both explicit and implicit grouping Here's an extension of the previous example with explicit grouping:
@@ -81,6 +81,4 @@ If you instead wanted to group by date, you could use implicit grouping. It's i
}]
}
-In the next blog article, I'll talk about variables, crawling with dex, dex validations, sharing, and automatic inference of dex scripts from web page structures. Hopefully, you have a taste of what dex scripts can do, and you like it. There's an alpha implementation under active development at []. I'd love to have more collaborators, bug reports, unit tests, docs, encouragement, etc.
-
</textarea></html>
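The INTRO text notes that the output structure mirrors the input structure. The following is a minimal Python sketch of that idea only, not parsley's implementation: it uses the standard library's `html.parser`, supports bare tag names as selectors (no CSS or XPath), and the names `TagTextCollector` and `extract` are hypothetical.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text content of every element, keyed by tag name."""

    def __init__(self):
        super().__init__()
        self.texts = {}   # tag name -> text of each element, in closing order
        self._open = []   # stack of (tag, [text pieces]) for open elements

    def handle_starttag(self, tag, attrs):
        self._open.append((tag, []))

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for _, pieces in self._open:
            pieces.append(data)

    def handle_endtag(self, tag):
        # Close the innermost open element with this name and drop anything
        # left unclosed inside it (for example void elements such as <br>).
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                name, pieces = self._open[i]
                self.texts.setdefault(name, []).append("".join(pieces).strip())
                del self._open[i:]
                break


def extract(template, html):
    """Fill a template shaped like {"title": "h1", "links": ["a"]}."""
    collector = TagTextCollector()
    collector.feed(html)
    collector.close()

    def fill(node):
        if isinstance(node, dict):
            # A dict in the template yields a dict with the same keys.
            return {key: fill(value) for key, value in node.items()}
        if isinstance(node, list):
            # A one-element list means "every match", returned as a list.
            return list(collector.texts.get(node[0], []))
        # A bare string means "first match", or None when nothing matched.
        matches = collector.texts.get(node, [])
        return matches[0] if matches else None

    return fill(template)
```

Calling `extract({"title": "h1", "links": ["a"]}, page_html)` returns a dict with the same two keys as the template, which is the property the Ruby example in the hunk above illustrates.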
View
@@ -1,55 +1,41 @@
AM_YFLAGS = -d
BUILT_SOURCES=parser.h
-lib_LTLIBRARIES = libdexter.la
-libdexter_la_SOURCES = dex_mem.c xml2json.c regexp.c printbuf.c functions.c util.c kstring.c obstack.c scanner.l parser.y dexter.c
-include_HEADERS = dexter.h obstack.h xml2json.h
+lib_LTLIBRARIES = libparsley.la
+libparsley_la_SOURCES = parsley_mem.c xml2json.c regexp.c printbuf.c functions.c util.c kstring.c obstack.c scanner.l parser.y parsley.c
+include_HEADERS = parsley.h obstack.h xml2json.h
-bin_PROGRAMS = dexterc dexter
+bin_PROGRAMS = parsleyc parsley
-dexterc_SOURCES = dexterc_main.c
-dexterc_LDADD = libdexter.la
+parsleyc_SOURCES = parsleyc_main.c
+parsleyc_LDADD = libparsley.la
-dexter_SOURCES = dexter_main.c
-dexter_LDADD = libdexter.la
+parsley_SOURCES = parsley_main.c
+parsley_LDADD = libparsley.la
bisect:
./bootstrap.sh && ./configure && make clean && make check
-port:
- make clean
- rm -rf /tmp/dexter-`cat VERSION`
- cp -R . /tmp/dexter-`cat VERSION`
- tar -C /tmp/ --exclude release --exclude .git -zcf "/tmp/dexter-`cat VERSION`.tar.gz" dexter-`cat VERSION`
- rsync --progress "/tmp/dexter-`cat VERSION`.tar.gz" kylemaxwell.com:/var/www/kylemaxwell_com/dexter/
- cat Portfile.in | sed "s/<VERSION>/`cat VERSION`/" > Portfile
- echo "checksums \
- md5 `md5 /tmp/dexter-\`cat VERSION\`.tar.gz | sed "s/.*= //"` \
- sha1 `openssl sha1 /tmp/dexter-\`cat VERSION\`.tar.gz | sed "s/.*= //"` \
- rmd160 `openssl rmd160 /tmp/dexter-\`cat VERSION\`.tar.gz | sed "s/.*= //"`" \
- >> Portfile
- sudo port build
-
install-all:
./bootstrap.sh && ./configure && make && make install && cd ruby && rake install && cd ../python && python setup.py install
check-am:
- @echo "fictional..."; ./dexter test/fictional.dex test/fictional.html | diff test/fictional.json - && echo " success."
- @echo "fictional-opt..."; ./dexter test/fictional-opt.dex test/fictional-opt.html | diff test/fictional-opt.json - && echo " success."
- @echo "function-magic..."; ./dexter test/function-magic.dex test/function-magic.html | diff test/function-magic.json - && echo " success."
- @echo "malformed-expr..."; ./dexter test/malformed-expr.dex test/malformed-expr.html | diff test/malformed-expr.json - && echo " success."
- @echo "malformed-json..."; ./dexter test/malformed-json.dex test/malformed-json.html | diff test/malformed-json.json - && echo " success."
- @echo "css_attr..."; ./dexter -x test/css_attr.dex test/css_attr.html | diff test/css_attr.json - && echo " success."
- @echo "match..."; ./dexter -x test/match.dex test/match.xml | diff test/match.json - && echo " success."
- @echo "position..."; ./dexter test/position.dex test/position.html | diff test/position.json - && echo " success."
- @echo "replace..."; ./dexter -x test/replace.dex test/replace.xml | diff test/replace.json - && echo " success."
- @echo "scope..."; ./dexter test/scope.dex test/scope.html | diff test/scope.json - && echo " success."
- @echo "test..."; ./dexter -x test/test.dex test/test.xml | diff test/test.json - && echo " success."
- @echo "yelp..."; ./dexter test/yelp.dex test/yelp.html | diff test/yelp.json - && echo " success."
- @echo "optional..."; ./dexter test/optional.dex test/optional.html | diff test/optional.json - && echo " success."
- @echo "malformed-function..."; ./dexter test/malformed-function.dex test/malformed-function.html | diff test/malformed-function.json - && echo " success."
- @echo "empty..."; ./dexter test/empty.dex test/empty.html | diff test/empty.json - && echo " success."
- @echo "trivial..."; ./dexter test/trivial.dex test/trivial.html | diff test/trivial.json - && echo " success."
- @echo "trivial2..."; ./dexter test/trivial2.dex test/trivial2.html | diff test/trivial2.json - && echo " success."
- @echo "craigs-simple..."; ./dexter test/craigs-simple.dex test/craigs-simple.html | diff test/craigs-simple.json - && echo " success."
- @echo "yelp-home..."; ./dexter test/yelp-home.dex test/yelp-home.html | diff test/yelp-home.json - && echo " success."
+ @echo "fictional..."; ./parsley test/fictional.let test/fictional.html | diff test/fictional.json - && echo " success."
+ @echo "fictional-opt..."; ./parsley test/fictional-opt.let test/fictional-opt.html | diff test/fictional-opt.json - && echo " success."
+ @echo "function-magic..."; ./parsley test/function-magic.let test/function-magic.html | diff test/function-magic.json - && echo " success."
+ @echo "malformed-expr..."; ./parsley test/malformed-expr.let test/malformed-expr.html | diff test/malformed-expr.json - && echo " success."
+ @echo "malformed-json..."; ./parsley test/malformed-json.let test/malformed-json.html | diff test/malformed-json.json - && echo " success."
+ @echo "css_attr..."; ./parsley -x test/css_attr.let test/css_attr.html | diff test/css_attr.json - && echo " success."
+ @echo "match..."; ./parsley -x test/match.let test/match.xml | diff test/match.json - && echo " success."
+ @echo "position..."; ./parsley test/position.let test/position.html | diff test/position.json - && echo " success."
+ @echo "replace..."; ./parsley -x test/replace.let test/replace.xml | diff test/replace.json - && echo " success."
+ @echo "scope..."; ./parsley test/scope.let test/scope.html | diff test/scope.json - && echo " success."
+ @echo "test..."; ./parsley -x test/test.let test/test.xml | diff test/test.json - && echo " success."
+ @echo "yelp..."; ./parsley test/yelp.let test/yelp.html | diff test/yelp.json - && echo " success."
+ @echo "optional..."; ./parsley test/optional.let test/optional.html | diff test/optional.json - && echo " success."
+ @echo "malformed-function..."; ./parsley test/malformed-function.let test/malformed-function.html | diff test/malformed-function.json - && echo " success."
+ @echo "empty..."; ./parsley test/empty.let test/empty.html | diff test/empty.json - && echo " success."
+ @echo "trivial..."; ./parsley test/trivial.let test/trivial.html | diff test/trivial.json - && echo " success."
+ @echo "trivial2..."; ./parsley test/trivial2.let test/trivial2.html | diff test/trivial2.json - && echo " success."
+ @echo "craigs-simple..."; ./parsley test/craigs-simple.let test/craigs-simple.html | diff test/craigs-simple.json - && echo " success."
+ @echo "yelp-home..."; ./parsley test/yelp-home.let test/yelp-home.html | diff test/yelp-home.json - && echo " success."
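Each `check-am` line follows one pattern: run the binary on a `.let` file and an input document, then compare standard output with the expected `.json` file. A rough Python sketch of that pattern is below. The helper names `build_command` and `output_matches` are hypothetical and not part of the repository; the sketch only mirrors the command shape visible in the rules above.

```python
import subprocess
from pathlib import Path


def build_command(binary, test_dir, name, input_ext="html", pass_x=False):
    """Build the argument list for one case,
    e.g. ./parsley test/yelp.let test/yelp.html."""
    command = [binary]
    if pass_x:
        command.append("-x")  # some check-am cases pass this flag
    command += [f"{test_dir}/{name}.let", f"{test_dir}/{name}.{input_ext}"]
    return command


def output_matches(command, expected_path):
    """Run the command and compare its stdout with the expected file,
    byte for byte (the role `diff expected -` plays in the Makefile)."""
    result = subprocess.run(command, capture_output=True)
    return result.stdout == Path(expected_path).read_bytes()
```

For example, `output_matches(build_command("./parsley", "test", "yelp"), "test/yelp.json")` corresponds to the `yelp...` rule, and `build_command("./parsley", "test", "match", "xml", pass_x=True)` to the `match...` rule.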
View
@@ -60,7 +60,7 @@ am__installdirs = "$(DESTDIR)$(libdir)" "$(DESTDIR)$(bindir)" \
libLTLIBRARIES_INSTALL = $(INSTALL)
LTLIBRARIES = $(lib_LTLIBRARIES)
libdexter_la_LIBADD =
-am_libdexter_la_OBJECTS = dex_mem.lo xml2json.lo regexp.lo printbuf.lo \
+am_libdexter_la_OBJECTS = parsley_mem.lo xml2json.lo regexp.lo printbuf.lo \
functions.lo util.lo kstring.lo obstack.lo scanner.lo \
parser.lo dexter.lo
libdexter_la_OBJECTS = $(am_libdexter_la_OBJECTS)
@@ -229,7 +229,7 @@ top_srcdir = @top_srcdir@
AM_YFLAGS = -d
BUILT_SOURCES = parser.h
lib_LTLIBRARIES = libdexter.la
-libdexter_la_SOURCES = dex_mem.c xml2json.c regexp.c printbuf.c functions.c util.c kstring.c obstack.c scanner.l parser.y dexter.c
+libdexter_la_SOURCES = parsley_mem.c xml2json.c regexp.c printbuf.c functions.c util.c kstring.c obstack.c scanner.l parser.y dexter.c
include_HEADERS = dexter.h obstack.h xml2json.h
dexterc_SOURCES = dexterc_main.c
dexterc_LDADD = libdexter.la
@@ -348,7 +348,7 @@ mostlyclean-compile:
distclean-compile:
-rm -f *.tab.c
-@AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/dex_mem.Plo@am__quote@
+@AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/parsley_mem.Plo@am__quote@
@AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/dexter.Plo@am__quote@
@AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/dexter_main.Po@am__quote@
@AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/dexterc_main.Po@am__quote@
15 OUTLINE
@@ -1,15 +0,0 @@
-- what is dex?
- - data extraction from xml/html
- - current options: hpricot/nokogiri, XSLT, beautiful soup
- - selectors + structure
- - selectors: xpath + css + functions (xpath+exsl+regex)
-
- h1>a
- substring-after(h1, ':')
- regexp:match(span.rating, '\d+', '')
- //location[obj='some-id']/ancestor::group/@id
- html('http://google.com')//title
- html(//div/a/@href)//title
-
- - structure: json-by-example
- - example: yelp.dex
2 PAPER
@@ -1,6 +1,6 @@
Abstract
================================================================
-A common programming task is data extraction from xml and html documents. I introduce dex, an embedded language (ala SQL, regular expressions) that improves the usability and/or speed of current extraction techniques.
+A common programming task is data extraction from xml and html documents. I introduce parsley, an embedded language (ala SQL, regular expressions) that improves the usability and/or speed of current extraction techniques.
Introduction
================================================================
View
@@ -1,15 +1,15 @@
# -*- coding: utf-8; mode: tcl; tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 4 -*- vim:fenc=utf-8:ft=tcl:et:sw=4:ts=4:sts=4
# $Id$
PortSystem 1.0
-name dexter
+name parsley
version 0.1.5
categories net
maintainers kyle@kylemaxwell.com
description Data extractor
-long_description Dexter is a system to extract data from HTML/XML documents
-homepage http://github.com/fizx/dexter
+long_description Parsley is a system to extract data from HTML/XML documents
+homepage http://github.com/fizx/parsley
platforms darwin
-master_sites http://kylemaxwell.com/dexter/
+master_sites http://parslets.com
depends_lib port:argp-standalone \
port:json-c \
port:libxslt \
View
@@ -1,15 +1,15 @@
# -*- coding: utf-8; mode: tcl; tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 4 -*- vim:fenc=utf-8:ft=tcl:et:sw=4:ts=4:sts=4
# $Id$
PortSystem 1.0
-name dexter
+name parsley
version <VERSION>
categories net
maintainers kyle@kylemaxwell.com
description Data extractor
-long_description Dexter is a system to extract data from HTML/XML documents
-homepage http://github.com/fizx/dexter
+long_description Parsley is a system to extract data from HTML/XML documents
+homepage http://github.com/fizx/parsley
platforms darwin
-master_sites http://kylemaxwell.com/dexter/
+master_sites http://parslets.com
depends_lib port:argp-standalone \
port:json-c \
port:libxslt \
View
@@ -1,45 +1,45 @@
-To use dexter from C, the following functions are available from dexter.h. In
+To use parsley from C, the following functions are available from parsley.h. In
addition, there is a function to convert xml documents of the type returned by
-dexter into json.
+parsley into json.
You will also need passing familiarity with libxml2 and json-c to print, manipulate, and free some of the generated objects.
- http://svn.metaparadigm.com/svn/json-c/trunk
- http://xmlsoft.org/
-From dexter.h
+From parsley.h
=============
-parsedDexPtr -- a struct that contains the following elements:
- - xmlDocPtr xml -- the output of a dex document parse, as a libxml2 document
+parsedParsleyPtr -- a struct that contains the following elements:
+ - xmlDocPtr xml -- the output of a parslet document parse, as a libxml2 document
- char *error -- an error message, or NULL if no error
- - compiled_dex *dex -- reference to the dex that did the parsing
+ - compiled_parsley *parsley -- reference to the parsley that did the parsing
-dexPtr dex_compile(char* dex, char* incl)
+parsleyPtr parsley_compile(char* parsley, char* incl)
Arguments:
- - char* dex -- a string of dex to compile.
+ - char* parsley -- a string of parsley to compile.
- char* incl -- arbitrary XSLT to inject directly into the stylesheet,
outside any templates.
- Returns: A structure that you can pass to dex_parse_* to do the actual
+ Returns: A structure that you can pass to parsley_parse_* to do the actual
parsing. This structure contains the compiled XSLT.
- Notes: This is *NOT* thread-safe. (Usage of the dex via dex_parse_* *IS*
+ Notes: This is *NOT* thread-safe. (Usage of the parslet via parsley_parse_* *IS*
thread-safe, however.)
-void dex_free(dexPtr);
+void parsley_free(parsleyPtr);
- Frees the dexPtr's memory.
+ Frees the parsleyPtr's memory.
-void parsed_dex_free(parsedDexPtr);
+void parsed_parsley_free(parsedParsleyPtr);
- Frees the parsedDexPtr's memory.
+ Frees the parsedParsleyPtr's memory.
-parsedDexPtr dex_parse_file(dexPtr dex, char* file_name, boolean html);
+parsedParsleyPtr parsley_parse_file(parsleyPtr parsley, char* file_name, boolean html);
Arguments:
- - dexPtr dex -- Compiled dex struct
+ - parsleyPtr parsley -- Compiled parsley struct
- char* file_name -- file to parse
- boolean html -- Use the html parser? (instead of xml)
@@ -48,28 +48,28 @@ parsedDexPtr dex_parse_file(dexPtr dex, char* file_name, boolean html);
like xmlSaveFormatFile(). If you want json output, look below for xml2json
docs.
-parsedDexPtr dex_parse_string(dexPtr dex, char* string, size_t len, boolean html);
+parsedParsleyPtr parsley_parse_string(parsleyPtr parsley, char* string, size_t len, boolean html);
- Parses the in-memory string/length combination given. See dex_parse_file
+ Parses the in-memory string/length combination given. See parsley_parse_file
docs.
-parsedDexPtr dex_parse_doc(dexPtr dex, xmlDocPtr doc);
+parsedParsleyPtr parsley_parse_doc(parsleyPtr parsley, xmlDocPtr doc);
- Uses the dex parser to parse a libxml2 document.
+ Uses the parsley parser to parse a libxml2 document.
From xml2json.h
===============
struct json_object * xml2json(xmlNodePtr);
Converts an xml subtree to json. The xml should be in the format returned
- by dexter. Basically, xml attributes get ignored, and if you want an array
+ by parsley. Basically, xml attributes get ignored, and if you want an array
like [a,b], use:
- <dex:groups>
- <dex:group>a</dex:group>
- <dex:group>b</dex:group>
- </dex:groups>
+ <parsley:groups>
+ <parsley:group>a</parsley:group>
+ <parsley:group>b</parsley:group>
+ </parsley:groups>
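The array convention described above can be illustrated in Python with `xml.etree.ElementTree`. This is a sketch of the documented mapping only, not the C `xml2json` function. The namespace URI is a placeholder invented for the example, and treating an element whose only child is `parsley:groups` as an array is an assumption about how the groups element is nested.

```python
import json
import xml.etree.ElementTree as ET

# Placeholder namespace URI, chosen only so the example parses.
PARSLEY_NS = "http://example.com/parsley"
GROUPS = "{%s}groups" % PARSLEY_NS


def to_json_value(element):
    """Convert an element to plain Python data; attributes are never read."""
    children = list(element)
    if element.tag == GROUPS:
        # <parsley:groups> becomes a list with one item per child.
        return [to_json_value(child) for child in children]
    if len(children) == 1 and children[0].tag == GROUPS:
        # Assumption: an element wrapping only a groups element is the array.
        return to_json_value(children[0])
    if not children:
        # Leaf elements become strings.
        return element.text or ""
    # Anything else becomes an object keyed by child tag name.
    return {child.tag: to_json_value(child) for child in children}


doc = ET.fromstring(
    '<root xmlns:parsley="http://example.com/parsley">'
    '<title id="ignored">Amnesia</title>'
    "<links><parsley:groups>"
    "<parsley:group>a</parsley:group><parsley:group>b</parsley:group>"
    "</parsley:groups></links>"
    "</root>"
)
print(json.dumps(to_json_value(doc)))
```

The `id` attribute on `title` does not appear in the result, and `links` comes out as the array `["a", "b"]`.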
To get a null-terminated string out, use: