Skip to content
Permalink
master
Go to file
 
 
Cannot retrieve contributors at this time
175 lines (141 sloc) 2.86 KB

Scrapemark Examples

Most of the time, you’ll be doing something like this:

from scrapemark import scrape

scrape("""
  your pattern here
  """,
  url='http://someurl.com/')

However, for the sake of these examples, we will be passing the html argument to scrapemark.scrape(). The html argument will have the following string value:

<html>
<head>
<title>The Site Title :: The Page Title</title>
<META HTTP-EQUIV=REFRESH CONTENT="1; URL=http://otherurl.com/">
</head>
<body>

<ul id='nav' class='section'>
<li><span><a href='home.html'>Home</a></span></li>
<li><span><a href='about.html'>About</a></span></li>
<li><span><a href='photos.html'>Photos</a></span></li>
</ul>

<div id='content' class='section'>
Look at these data points
<table>
<tr><th>Day</th><th>Test 1</th><th>Test 2</th></tr>
<tr><td>1</td><td>5.6</td><td>24.5</td></tr>
<tr><td>2</td><td>1.1</td><td>12.8</td></tr>
<tr><td>3</td><td>2.4</td><td>5.67</td></tr>
</table>
</div>

<div id='footer' class='section'>
<a href='disclaimer.html'>Disclaimer</a> | <a href='contact.html'>Contact</a>
</div>

</body>
</html>

scrape some text:

scrape("""
  <title>:: {{ page_title }}</title>
  """,
  html)

# will get...
{'page_title': 'The Page Title'}

scrape some text (quick version):

scrape("""
  <title>:: {{ }}</title>
  """,
  html)

# will get...
'The Page Title'

loop over certain divs, scrape a list:

scrape("""
  <body>
  {*
    <div class='section' id='{{ [section_ids] }}' />
  *}
  </body>
  """,
  html)

# will get...
{'section_ids': ['nav', 'content', 'footer']}

scrape text before a certain element:

scrape("""
  <div id='content'>
  {{ before_table }}
  <table />
  </div>
  """,
  html)

# will get...
{'before_table': 'Look at these data points'}

scrape a column from a table (as a list of ints):

scrape("""
  <table>
  <tr />
  {*
    <tr>
    <td>{{ [day_numbers]|int }}</td>
    </tr>
  *}
  </table>
  """,
  html)

# will get...
{'day_numbers': [1, 2, 3]}

scrape the entire table with nested loops and dot-notation:

scrape("""
  <table>
  <tr />
  {*
    <tr>
    <td>{{ [days].number|int }}</td>
    {*
      <td>{{ [days].[points]|float }}</td>
    *}
    </tr>
  *}
  </table>
  """,
  html)

# will get...
{'days': [
  {'number': 1, 'points': [5.6, 24.5]},
  {'number': 2, 'points': [1.1, 12.8]},
  {'number': 3, 'points': [2.4, 5.67]}
]}

preserve HTML when you scrape:

scrape("""
  <div id='footer'>{{ footer|html }}</div>
  """,
  html)

# will get...
{'footer': "<a href='disclaimer.html'>Disclaimer</a> | <a href='contact.html'>Contact</a>"}

visit another page and scrape it:

scrape("""
  <head>
  <meta http-equiv='refresh' content='url={@
    <title>{{ title }}</title>
  @}'/>
  </head>
  """,
  html)

# will get...
{'title': 'whatever the title of http://otherurl.com/ is'}
You can’t perform that action at this time.