Skip to content
Permalink
master
Switch branches/tags
Go to file
 
 
Cannot retrieve contributors at this time

Scrapemark Examples

Most of the time, you’ll be doing something like this:

from scrapemark import scrape

scrape("""
  your pattern here
  """,
  url='http://someurl.com/')

However, for the sake of these examples, we will be passing the html argument to scrapemark.scrape(). The html argument will have the following string value:

<html>
<head>
<title>The Site Title :: The Page Title</title>
<META HTTP-EQUIV=REFRESH CONTENT="1; URL=http://otherurl.com/">
</head>
<body>

<ul id='nav' class='section'>
<li><span><a href='home.html'>Home</a></span></li>
<li><span><a href='about.html'>About</a></span></li>
<li><span><a href='photos.html'>Photos</a></span></li>
</ul>

<div id='content' class='section'>
Look at these data points
<table>
<tr><th>Day</th><th>Test 1</th><th>Test 2</th></tr>
<tr><td>1</td><td>5.6</td><td>24.5</td></tr>
<tr><td>2</td><td>1.1</td><td>12.8</td></tr>
<tr><td>3</td><td>2.4</td><td>5.67</td></tr>
</table>
</div>

<div id='footer' class='section'>
<a href='disclaimer.html'>Disclaimer</a> | <a href='contact.html'>Contact</a>
</div>

</body>
</html>

scrape some text:

scrape("""
  <title>:: {{ page_title }}</title>
  """,
  html)

# will get...
{'page_title': 'The Page Title'}

scrape some text (quick version):

scrape("""
  <title>:: {{ }}</title>
  """,
  html)

# will get...
'The Page Title'

loop over certain divs, scrape a list:

scrape("""
  <body>
  {*
    <div class='section' id='{{ [section_ids] }}' />
  *}
  </body>
  """,
  html)

# will get...
{'section_ids': ['nav', 'content', 'footer']}

scrape text before a certain element:

scrape("""
  <div id='content'>
  {{ before_table }}
  <table />
  </div>
  """,
  html)

# will get...
{'before_table': 'Look at these data points'}

scrape a column from a table (as a list of ints):

scrape("""
  <table>
  <tr />
  {*
    <tr>
    <td>{{ [day_numbers]|int }}</td>
    </tr>
  *}
  </table>
  """,
  html)

# will get...
{'day_numbers': [1, 2, 3]}

scrape the entire table with nested loops and dot-notation:

scrape("""
  <table>
  <tr />
  {*
    <tr>
    <td>{{ [days].number|int }}</td>
    {*
      <td>{{ [days].[points]|float }}</td>
    *}
    </tr>
  *}
  </table>
  """,
  html)

# will get...
{'days': [
  {'number': 1, 'points': [5.6, 24.5]},
  {'number': 2, 'points': [1.1, 12.8]},
  {'number': 3, 'points': [2.4, 5.67]}
]}

preserve HTML when you scrape:

scrape("""
  <div id='footer'>{{ footer|html }}</div>
  """,
  html)

# will get...
{'footer': "<a href='disclaimer.html'>Disclaimer</a> | <a href='contact.html'>Contact</a>"}

visit another page and scrape it:

scrape("""
  <head>
  <meta http-equiv='refresh' content='url={@
    <title>{{ title }}</title>
  @}'/>
  </head>
  """,
  html)

# will get...
{'title': 'whatever the title of http://otherurl.com/ is'}