NAME

HTML::Untemplate - web scraping assistant

VERSION

version 0.011

DESCRIPTION

Suppose you have a set of HTML documents generated by populating the same template with data from some kind of database. HTML::Untemplate is a set of command-line tools ("xpathify", "untemplate") and modules (HTML::Linear and its dependencies) which assist in retrieving the original data.

To achieve this goal, HTML tree nodes are presented as XPath/content pairs. HTML documents linearized this way can be easily inspected manually or with a diff tool. Please refer to "EXAMPLES".

Despite being named similarly to HTML::Template, this distribution is not directly related to it. Instead, it attempts to reverse the templating action, whatever the template agent used.

Why?

Suppose you have a CMS. A typical CMS works roughly like this (data flows from top to bottom):

            RDBMS
      scripting language
             HTML
         HTTP server
            (...)
          HTTP agent
        layout engine
            screen
             user

Consider the first 3 steps: RDBMS => scripting language => HTML

This is "applying a template".

Now, consider this: HTML => scripting language => RDBMS

I would call that "un-applying a template", or "untemplate" :)

The practical application of this set of tools is to assist in the creation of web scrapers.

EXAMPLES

xpathify

The xpathify tool flattens the HTML tree into a key/value list:

    <!DOCTYPE html>
    <html>
        <head>
            <title>Hello HTML</title>
        </head>
        <body>
            <h1>Hello World!</h1>
            <p>This is a sample HTML</p>
            Beware!
            <p>HTML is <b>not</b> XML!</p>
            Have a nice day.
        </body>
    </html>

Becomes:

    /html/head/title/text()     Hello HTML
    /html/body/h1/text()        Hello World!
    /html/body/p[1]/text()      This is a sample HTML
    /html/body/text()           Beware!
    /html/body/p[2]/text()      HTML is
    /html/body/p[2]/b/text()    not
    /html/body/p[2]/text()      XML!
    /html/body/text()           Have a nice day.

The keys are in XPath format, while the values are the respective content from the HTML tree. In theory, the HTML tree could be reassembled from the flat key/value list this tool generates.
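
To make the flattening idea concrete, here is a minimal sketch in Python using only the standard library. It is not the implementation used by xpathify or HTML::Linear; the names `Node`, `TreeBuilder`, `linearize` and `SAMPLE` are made up for illustration. As simplifying assumptions, it ignores attributes, collapses whitespace inside text nodes, and adds a positional index only when an element has same-named siblings.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset, enough for a sketch).
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # Node instances or plain strings (text)


class TreeBuilder(HTMLParser):
    """Builds a very small element tree out of parser events."""

    def __init__(self):
        super().__init__()
        self.root = Node(None)
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace (simplification)
        if text:
            self.current.children.append(text)


def linearize(html):
    """Return a list of (xpath, text) pairs in document order."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    pairs = []

    def walk(node, path):
        tags = [c.tag for c in node.children if isinstance(c, Node)]
        seen = {}
        for child in node.children:
            if isinstance(child, str):
                pairs.append((path + "/text()", child))
                continue
            seen[child.tag] = seen.get(child.tag, 0) + 1
            step = child.tag
            if tags.count(child.tag) > 1:  # index only ambiguous siblings
                step += "[%d]" % seen[child.tag]
            walk(child, path + "/" + step)

    walk(builder.root, "")
    return pairs


SAMPLE = """<!DOCTYPE html>
<html>
    <head><title>Hello HTML</title></head>
    <body>
        <h1>Hello World!</h1>
        <p>This is a sample HTML</p>
        Beware!
        <p>HTML is <b>not</b> XML!</p>
        Have a nice day.
    </body>
</html>"""

if __name__ == "__main__":
    for xpath, text in linearize(SAMPLE):
        print(xpath + "\t" + text)
```

Building the tree first, instead of emitting pairs directly from parser events, is what allows `p[1]` to receive its index: the sketch only knows that an index is needed once it has seen the second `p` sibling.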

untemplate

The untemplate tool flattens a set of HTML documents using the same algorithm as xpathify, then strips the key/value pairs shared by all documents. What remains consists of the original values fed into the template engine.

This is what the result actually looks like for some simple real-world examples (quotes 1839 and 2486 from bash.org):

/html/head/title/text()                                                                             
bash_org_1839   QDB: Quote #1839                                                                    
bash_org_2486   QDB: Quote #2486                                                                    
                                                                                                    
/html/body/form[@name='tsearch']/center/table[1]/tr/td[2]/font/b/text()                             
bash_org_1839   Quote #1839                                                                         
bash_org_2486   Quote #2486                                                                         
                                                                                                    
/html/body/p/center[1]/table/tr/td[1]/p[@class='quote']/a/@href                                     
bash_org_1839   ?1839                                                                               
bash_org_2486   ?2486                                                                               
                                                                                                    
/html/body/p/center[1]/table/tr/td[1]/p[@class='quote']/a/b/text()                                  
bash_org_1839   #1839                                                                               
bash_org_2486   #2486                                                                               
                                                                                                    
/html/body/p/center[1]/table/tr/td[1]/p[@class='quote']/a[@class='qa'][1]/@href                     
bash_org_1839   ./?le=cc8456a913b26eb7364e4e9a94348d04&rox=1839                                     
bash_org_2486   ./?le=cc8456a913b26eb7364e4e9a94348d04&rox=2486                                     
                                                                                                    
/html/body/p/center[1]/table/tr/td[1]/p[@class='quote']/text()                                      
bash_org_1839   (245)                                                                               
bash_org_2486   (230)                                                                               
                                                                                                    
/html/body/p/center[1]/table/tr/td[1]/p[@class='quote']/a[@class='qa'][2]/@href                     
bash_org_1839   ./?le=cc8456a913b26eb7364e4e9a94348d04&sox=1839                                     
bash_org_2486   ./?le=cc8456a913b26eb7364e4e9a94348d04&sox=2486                                     
                                                                                                    
/html/body/p/center[1]/table/tr/td[1]/p[@class='quote']/a[@class='qa'][3]/@href                     
bash_org_1839   ./?le=cc8456a913b26eb7364e4e9a94348d04&sux=1839                                     
bash_org_2486   ./?le=cc8456a913b26eb7364e4e9a94348d04&sux=2486                                     
                                                                                                    
/html/body/p/center[1]/table/tr/td[1]/p[@class='qt']/text()                                         
bash_org_1839   <maff> who needs showers when you've got an assortment of feminine products         
bash_org_2486   <R`:#heroin> Is this for recovery or indulgence?                                    
                                                                                                    
/html/body/p/center[2]/table/tr[2]/td[@class='footertext'][1]/text()                                
bash_org_1839   8.3642                                                                              
bash_org_2486   0.0016                                                                              
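
The stripping step can be sketched in a few lines of Python. This is a naive key-by-key comparison written for illustration, not the algorithm the untemplate tool actually uses; the function name `untemplate`, the `DOCS` structure and the shared "Search" pair are assumptions made up for the example. Only the title and quote-number values are taken from the output above.

```python
def untemplate(docs):
    """docs maps a document name to its list of (xpath, value) pairs.

    Returns a list of (xpath, {name: [values]}) entries, keeping only the
    keys whose values are not identical across all documents.
    """
    keys = []    # first-seen order of keys
    table = {}   # xpath -> {document name -> list of values}
    for name, pairs in docs.items():
        for key, value in pairs:
            if key not in table:
                table[key] = {}
                keys.append(key)
            # The same key may occur several times within one document.
            table[key].setdefault(name, []).append(value)

    result = []
    for key in keys:
        per_doc = [table[key].get(name, []) for name in docs]
        # A pair is treated as template boilerplate when every document
        # carries exactly the same values under this key.
        if all(values == per_doc[0] for values in per_doc):
            continue
        result.append((key, dict(zip(docs, per_doc))))
    return result


DOCS = {
    "bash_org_1839": [
        ("/html/head/title/text()", "QDB: Quote #1839"),
        # Hypothetical shared pair, present only to show the stripping.
        ("/html/body/form/input/@value", "Search"),
        ("/html/body/p/center[1]/table/tr/td[1]/p[@class='quote']/a/b/text()", "#1839"),
    ],
    "bash_org_2486": [
        ("/html/head/title/text()", "QDB: Quote #2486"),
        ("/html/body/form/input/@value", "Search"),
        ("/html/body/p/center[1]/table/tr/td[1]/p[@class='quote']/a/b/text()", "#2486"),
    ],
}

if __name__ == "__main__":
    for xpath, values in untemplate(DOCS):
        print(xpath)
        for name, vals in values.items():
            print(name + "\t" + " ".join(vals))
        print()
```

Values are kept as lists because a flattened document can contain the same key more than once, as `/html/body/text()` does in the xpathify example.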
                                                                                                    

MODULES

HTML::Linear and its dependencies may be used to serialize/flatten HTML documents on your own.

SEE ALSO

AUTHOR

Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2012 by Stanislaw Pusep.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
