## Attribution

These slides were adapted from [the companion notebooks](https://github.com/REMitchell/python-scraping) for [Web Scraping with Python](http://shop.oreilly.com/product/0636920034391.do), which are open source and freely available. If you want a more detailed treatment of web scraping in Python, this book is a great resource.

In [None]:
# Install if needed
!pip install composable
!pip install composablesoup

In [None]:
# Upgrade if already installed
!pip install composable --upgrade
!pip install composablesoup --upgrade

In [90]:
from composable import pipeable
from composable.strict import map, filter
from composablesoup import find, find_all, get_text, has_attr
from composablesoup.soup import find_parent, parents, children, find_previous_sibling, find_previous_siblings, find_next_sibling, find_next_siblings
from composable.sequence import to_list, head
from composable.string import strip
from composable import from_toolz as tlz

### Searching with `lambda` functions

We can pass a lambda function to `find_all` to perform more complicated searches. The function receives each tag and should return `True` for the tags to keep.

* **Syntax:** `bs.find_all(lambda tag: bool_expr)`
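
The syntax above can be sketched with plain BeautifulSoup, which the `bs.find_all` form refers to. This is a minimal illustration, not the notebook's own cell: the HTML snippet and the variable names (`two_attr_ids`, `note_texts`) are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page used in these slides.
html = """
<table>
  <tr id="gift1" class="gift"><td>Vegetable Basket</td>
      <td><span class="excitingNote">Now with bell peppers!</span></td></tr>
  <tr id="gift2" class="gift"><td>Russian Nesting Dolls</td>
      <td><span class="excitingNote">8 entire dolls per set!</span></td></tr>
</table>
"""

bs = BeautifulSoup(html, "html.parser")

# The lambda is called once per tag; tags for which it returns True are kept.
two_attr_tags = bs.find_all(lambda tag: len(tag.attrs) == 2)
two_attr_ids = [tag["id"] for tag in two_attr_tags]

# A predicate can combine several conditions in one boolean expression.
notes = bs.find_all(lambda tag: tag.name == "span" and "dolls" in tag.get_text())
note_texts = [tag.get_text() for tag in notes]

print(two_attr_ids)
print(note_texts)
```

Because the predicate is ordinary Python, anything that can be computed from a tag (its name, attributes, text, or position) can be used as a search condition.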

### Example 1

Let's find all tags with exactly 2 attributes.

In [54]:
(items_for_sale
 >> find_all(lambda tag: len(tag.attrs) == 2)
 >> head(2)
)

[<img src="../img/gifts/logo.jpg" style="float:left;"/>,
 <tr class="gift" id="gift1"><td>
 Vegetable Basket
 </td><td>
 This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
 <span class="excitingNote">Now with super-colorful bell peppers!</span>
 </td><td>
 $15.00
 </td><td>
 <img src="../img/gifts/img1.jpg"/>
 </td></tr>]

### Example 2

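Let's find all tags whose text exactly matches a specific string. The first cell returns the matching tags; the second, which uses the `text` argument, returns the matching strings themselves.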

In [55]:
(items_for_sale
 >> find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?')
)

[<span class="excitingNote">Or maybe he's only resting?</span>]

In [60]:
(items_for_sale
 >> find_all('', text='Or maybe he\'s only resting?')
)


["Or maybe he's only resting?"]
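
The difference between the two results above (a tag versus a bare string) can be reproduced with plain BeautifulSoup. The sketch below assumes a made-up HTML fragment and uses the `string` argument, which is BeautifulSoup's current name for `text`.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical fragment; the surrounding cell text is invented for illustration.
html = """
<table><tr>
  <td>Dead Parrot</td>
  <td>This is an ex-parrot!
      <span class="excitingNote">Or maybe he's only resting?</span></td>
</tr></table>
"""

bs = BeautifulSoup(html, "html.parser")
target = "Or maybe he's only resting?"

# A function filter is applied to tags, so the result is a list of Tag objects.
tags = bs.find_all(lambda tag: tag.get_text() == target)

# A string filter with no tag name matches text nodes, so the result is a
# list of NavigableString objects rather than tags.
strings = bs.find_all(string=target)

print([type(t).__name__ for t in tags])
print([type(s).__name__ for s in strings])

# The enclosing tag is still reachable from a matched string.
print(strings[0].parent.name)
```

Note that the lambda compares the full text of each tag, so an ancestor matches only if it contains no other text.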

## Searching with regular expressions

Regular expressions are the most flexible tool for complex text searches, and they are our next topic of discussion. As a preview, a compiled pattern can be passed as an attribute value to match against.

In [104]:
import re

gift_img = re.compile(r'\.\./img/gifts/img.*\.jpg')

(items_for_sale
 >> find_all('img', attrs={'src': gift_img})
 >> map(tlz.get('src'))  # why is it OK to skip the filter?
)

['../img/gifts/img1.jpg',
 '../img/gifts/img2.jpg',
 '../img/gifts/img3.jpg',
 '../img/gifts/img4.jpg',
 '../img/gifts/img6.jpg']
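
To see why the pattern keeps the product images but not the logo, it can be tried directly against a few `src` values using only the standard `re` module. The list of paths below is a small hand-written sample, not the full page.

```python
import re

# Same pattern as above: a literal "../img/gifts/img", any characters, then ".jpg".
gift_img = re.compile(r'\.\./img/gifts/img.*\.jpg')

# Hand-written sample of src values (assumed for illustration).
srcs = [
    '../img/gifts/logo.jpg',   # no "img" right after "gifts/"
    '../img/gifts/img1.jpg',
    '../img/gifts/img2.jpg',
    'xx/img/gifts/img3.jpg',   # does not start with a literal "../"
]

# search() succeeds if the pattern occurs anywhere in the string.
matches = [src for src in srcs if gift_img.search(src)]
print(matches)
```

The escaped dots (`\.`) match literal periods, while the unescaped `.*` matches any run of characters, which is what lets one pattern cover `img1.jpg`, `img2.jpg`, and so on.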