## Attribution

These slides were adapted from [the companion notebooks](https://github.com/REMitchell/python-scraping) for [Web Scraping with Python](http://shop.oreilly.com/product/0636920034391.do), which are open source and free. If you want a more detailed treatment of web scraping in Python, the book is a great resource.

In [None]:
# Install if needed
!pip install composable
!pip install composablesoup

In [None]:
# Upgrade if already installed
!pip install composable --upgrade
!pip install composablesoup --upgrade

In [90]:
from composable import pipeable
from composable.strict import map, filter
from composablesoup import find, find_all, get_text, has_attr
from composablesoup.soup import find_parent, parents, children, find_previous_sibling, find_previous_siblings, find_next_sibling, find_next_siblings
from composable.sequence import to_list, head
from composable.string import strip
from composable import from_toolz as tlz

## Parents, Children and Siblings

Beautiful Soup objects keep references to all surrounding tags, and we will need to exploit these relationships when we can't find a tag through a direct search. In this section, we will investigate these relationships and use them to access the desired tags.

### Definitions

Tags can have the following relationships.

* **Parent:** The closest surrounding tag
* **Children:** All tags immediately inside a tag
    * EXACTLY one level deep
* **Descendants:** All embedded tags
    * ANY depth
* **Siblings:** All tags on the same level
    * i.e. all other children of the surrounding tag.
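
The four relationships can be seen on a tiny hand-written document. The sketch below uses plain `bs4` attributes and methods (`.parent`, `.children`, `.descendants`, `find_next_siblings`) rather than the piped helpers used in these slides; the HTML snippet and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet, written without whitespace so every node is a tag or text
html = '<div id="outer"><p id="a"><b>bold</b></p><p id="b"></p><p id="c"></p></div>'
soup = BeautifulSoup(html, 'html.parser')
outer = soup.find('div')
first_p = soup.find('p', id='a')

# Parent: the closest surrounding tag
parent_id = first_p.parent['id']

# Children: tags exactly one level below the div
child_ids = [c['id'] for c in outer.children if isinstance(c, Tag)]

# Descendants: tags at any depth below the div
descendant_names = [d.name for d in outer.descendants if isinstance(d, Tag)]

# Siblings: the other children of the same parent
# (first_p is the first child, so all of its siblings come after it)
sibling_ids = [s['id'] for s in first_p.find_next_siblings('p')]

print(parent_id, child_ids, descendant_names, sibling_ids)
```

The `<b>` tag shows the difference between the two middle definitions: it is a descendant of the `div` but not one of its children.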

### Working example

Please visit [this page](http://www.pythonscraping.com/pages/page3.html) and inspect the source.

In [2]:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get('http://www.pythonscraping.com/pages/page3.html')
items_for_sale = BeautifulSoup(r.content, 'html.parser')

### Plotting the DOM

* HTML
    * body
        * div#wrapper
            * img
            * h1
            * div#content
            * table#giftList
                * tr
                    * th
                    * th
                    * th
                    * th
                * tr.gift#gift1
                    * td
                    * td
                        * span.excitingNote
                    * td
                    * td
                        * img
                *  ... table continues ...
            * div#footer

<font color="red"><h2>Exercise 1</h2></font>

Identify the parents of

1. `table#giftList`
2. `span.excitingNote`

> Your answer here

<font color="red"><h2>Exercise 2</h2></font>

Identify the children of

1. `table#giftList`
2. `tr.gift#gift1`

> Your answer here

<font color="red"><h2>Exercise 3</h2></font>

Describe (in words) the descendants of `table#giftList`

> Your answer here

<font color="red"><h2>Exercise 4</h2></font>

Identify the siblings of `tr.gift#gift1`

> Your answer here

### Stepping up a level with `find_parent`

We can access the parent of any tag by piping it into `find_parent`.

In [3]:
(items_for_sale
 >> find('tr', class_ = 'gift')
 >> find_parent
)

<table id="giftList">
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gift

### Stepping up two levels with 2*`find_parent`

Applying `find_parent` twice will step us up two levels.

In [4]:
(items_for_sale
 >> find('tr', class_ = 'gift')
 >> find_parent
 >> find_parent
)

<div id="wrapper">
<img src="../img/gifts/logo.jpg" style="float:left;"/>
<h1>Totally Normal Gifts</h1>
<div id="content">Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.<p>
We haven't figured out how to make online shopping carts yet, but you can send us a check to:<br/>
123 Main St.<br/>
Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.</p></div>
<table id="giftList">
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls


### Searching for a specific parent

We can also use `find_parent` to search for the closest ancestor that fits some description.

In [5]:
(items_for_sale
 >> find('tr', class_ = 'gift')
 >> find_parent(name='div', attrs={'id':'wrapper'})
)

<div id="wrapper">
<img src="../img/gifts/logo.jpg" style="float:left;"/>
<h1>Totally Normal Gifts</h1>
<div id="content">Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.<p>
We haven't figured out how to make online shopping carts yet, but you can send us a check to:<br/>
123 Main St.<br/>
Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.</p></div>
<table id="giftList">
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
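
For comparison, here is a sketch of the same three moves (one level up, two levels up, and a filtered search) written as plain `bs4` method calls. The HTML is a hand-made stand-in for the page structure, not its real source.

```python
from bs4 import BeautifulSoup

# Hand-made stand-in for the page structure (assumption, not the real source)
html = """
<div id="wrapper">
<div id="content">Intro text</div>
<table id="giftList">
<tr class="gift" id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>
</table>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
row = soup.find('tr', class_='gift')

one_up = row.find_parent()                       # closest surrounding tag
two_up = row.find_parent().find_parent()         # step up twice
wrapper = row.find_parent('div', id='wrapper')   # closest ancestor matching a filter

print(one_up.name, one_up['id'])
print(two_up.name, two_up['id'])
print(wrapper is two_up)
```

The filtered search does not depend on how many levels separate the row from the wrapper, so it keeps working if the page gains an extra layer of nesting.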


### Searching for children

Note that we are using `find` (why?) with the `children` attribute

In [92]:
(items_for_sale
 >> find('table',attrs={'id':'giftList'})
 >> children
)

['\n', <tr><th>
 Item Title
 </th><th>
 Description
 </th><th>
 Cost
 </th><th>
 Image
 </th></tr>, '\n', <tr class="gift" id="gift1"><td>
 Vegetable Basket
 </td><td>
 This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
 <span class="excitingNote">Now with super-colorful bell peppers!</span>
 </td><td>
 $15.00
 </td><td>
 <img src="../img/gifts/img1.jpg"/>
 </td></tr>, '\n', <tr class="gift" id="gift2"><td>
 Russian Nesting Dolls
 </td><td>
 Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
 </td><td>
 $10,000.52
 </td><td>
 <img src="../img/gifts/img2.jpg"/>
 </td></tr>, '\n', <tr class="gift" id="gift3"><td>
 Fish Painting
 </td><td>
 If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
 </td><td>
 $10,00
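
The `'\n'` entries in the output above are text nodes: `children` yields every direct child, strings included. Below is a sketch of two plain-`bs4` ways to keep only the tags, using a hand-made table as a stand-in for the page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hand-made stand-in table (assumption); newlines between rows become text nodes
html = """<table id="giftList">
<tr><th>Item Title</th></tr>
<tr class="gift" id="gift1"><td>Vegetable Basket</td></tr>
<tr class="gift" id="gift2"><td>Russian Nesting Dolls</td></tr>
</table>"""
table = BeautifulSoup(html, 'html.parser').find('table', id='giftList')

all_children = list(table.children)   # tags and whitespace strings

# Option 1: keep only Tag objects
tag_children = [c for c in table.children if isinstance(c, Tag)]

# Option 2: search for rows one level deep only
rows = table.find_all('tr', recursive=False)

print(len(all_children), len(tag_children), len(rows))
```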

### Accessing the previous siblings

* `find_previous_sibling` returns the closest previous sibling
* `find_previous_siblings` returns all previous siblings

In [98]:
(items_for_sale 
 >> find('tr', id = 'gift3')
 >> find_previous_sibling
)

<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>

In [99]:
(items_for_sale 
 >> find('tr', id = 'gift3')
 >> find_previous_siblings
)

[<tr class="gift" id="gift2"><td>
 Russian Nesting Dolls
 </td><td>
 Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
 </td><td>
 $10,000.52
 </td><td>
 <img src="../img/gifts/img2.jpg"/>
 </td></tr>, <tr class="gift" id="gift1"><td>
 Vegetable Basket
 </td><td>
 This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
 <span class="excitingNote">Now with super-colorful bell peppers!</span>
 </td><td>
 $15.00
 </td><td>
 <img src="../img/gifts/img1.jpg"/>
 </td></tr>, <tr><th>
 Item Title
 </th><th>
 Description
 </th><th>
 Cost
 </th><th>
 Image
 </th></tr>]
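
Note the order of the result: the closest sibling comes first, so the list runs backwards through the document. Here is a minimal sketch with plain `bs4` and a hand-made table; the `header` id is added here only to make the rows easy to name.

```python
from bs4 import BeautifulSoup

# Hand-made stand-in table (assumption); id="header" is not on the real page
html = """<table>
<tr id="header"><th>Item Title</th></tr>
<tr id="gift1"><td>Vegetable Basket</td></tr>
<tr id="gift2"><td>Russian Nesting Dolls</td></tr>
<tr id="gift3"><td>Fish Painting</td></tr>
</table>"""
soup = BeautifulSoup(html, 'html.parser')
third = soup.find('tr', id='gift3')

# Closest previous row
closest = third.find_previous_sibling('tr')

# All previous rows, nearest first
previous_ids = [row['id'] for row in third.find_previous_siblings('tr')]

# Reverse the list to recover document order
document_order = list(reversed(previous_ids))

print(closest['id'], previous_ids, document_order)
```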

### Accessing the next siblings

* `find_next_sibling` returns the closest following sibling
* `find_next_siblings` returns all following siblings

In [100]:
(items_for_sale 
 >> find('tr', class_ = 'gift')
 >> find_next_sibling
)

<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>

In [101]:
(items_for_sale 
 >> find('tr', class_ = 'gift')
 >> find_next_siblings
)

[<tr class="gift" id="gift2"><td>
 Russian Nesting Dolls
 </td><td>
 Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
 </td><td>
 $10,000.52
 </td><td>
 <img src="../img/gifts/img2.jpg"/>
 </td></tr>, <tr class="gift" id="gift3"><td>
 Fish Painting
 </td><td>
 If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
 </td><td>
 $10,005.00
 </td><td>
 <img src="../img/gifts/img3.jpg"/>
 </td></tr>, <tr class="gift" id="gift4"><td>
 Dead Parrot
 </td><td>
 This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
 </td><td>
 $0.50
 </td><td>
 <img src="../img/gifts/img4.jpg"/>
 </td></tr>, <tr class="gift" id="gift5"><td>
 Mystery Box
 </td><td>
 If you love suprises, this mystery box is for you! Do not place on light-colored s

### Searching the previous and next siblings

We can also pass search filters to these four functions to find specific siblings.

In [102]:
(items_for_sale 
 >> find('tr', class_ = 'gift')
 >> find_next_sibling(attrs={'id':'gift4'})
)

<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>

In [103]:
import re
four_or_five = re.compile('(gift4|gift5)')
(items_for_sale 
 >> find('tr', class_ = 'gift')
 >> find_next_siblings(attrs={'id':four_or_five})
)

[<tr class="gift" id="gift4"><td>
 Dead Parrot
 </td><td>
 This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
 </td><td>
 $0.50
 </td><td>
 <img src="../img/gifts/img4.jpg"/>
 </td></tr>, <tr class="gift" id="gift5"><td>
 Mystery Box
 </td><td>
 If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
 </td><td>
 $1.50
 </td><td>
 <img src="../img/gifts/img6.jpg"/>
 </td></tr>]
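
When an attribute filter is a compiled pattern, Beautiful Soup tests each value with the pattern's `search` method, so the pattern may match anywhere in the value. The sketch below checks a few candidate ids directly with `re`; the id `gift40` and the anchored pattern are hypothetical additions for illustration.

```python
import re

four_or_five = re.compile('(gift4|gift5)')

# Hypothetical stricter variant that must match the whole id
exactly_four_or_five = re.compile(r'^gift[45]$')

candidate_ids = ['gift1', 'gift4', 'gift5', 'gift40']

# search() succeeds if the pattern occurs anywhere in the string
loose = [i for i in candidate_ids if four_or_five.search(i)]
strict = [i for i in candidate_ids if exactly_four_or_five.search(i)]

print(loose)
print(strict)
```

The ids shown in these slides run from `gift1` to `gift5`, so both patterns would select the same rows here; the anchored form only matters when longer ids can occur.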

<font color="red"><h2>Exercise 5</h2></font>

* Look at the site source again, 
    * specifically item prices.
* How can we get to these prices?

### Using relationships to find unlabeled data.


* tr.gift#gift1
    * td
    * td
    * td
        * "$15.00"
    * td
        * `<img src="../img/gifts/img1.jpg"/>`

In [11]:
(items_for_sale
 >> find('tr', id = 'gift1')
)

<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>

In [12]:
(items_for_sale
 >> find('tr', id = 'gift1')
 >> find('img')
)

<img src="../img/gifts/img1.jpg"/>

In [13]:
(items_for_sale
 >> find('tr', id = 'gift1')
 >> find('img')
 >> find_parent
)

<td>
<img src="../img/gifts/img1.jpg"/>
</td>

In [20]:
(items_for_sale
 >> find('tr', id = 'gift1')
 >> find('img')
 >> find_parent
 >> find_previous_sibling
)

<td>
$15.00
</td>

In [24]:
(items_for_sale
 >> find('tr', id = 'gift1')
 >> find('img')
 >> find_parent
 >> find_previous_sibling
 >> get_text
)

'\n$15.00\n'

In [28]:
(items_for_sale
 >> find('tr', id = 'gift1')
 >> find('img')
 >> find_parent
 >> find_previous_sibling
 >> get_text
 >> strip
)

'$15.00'
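
The same walk can be written as a chain of plain `bs4` method calls. This is a sketch on a hand-made copy of one row; passing `'td'` explicitly is a choice made here so that only tags are considered at each step.

```python
from bs4 import BeautifulSoup

# Hand-made copy of one row (assumption standing in for the live page)
html = """<table id="giftList">
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
</table>"""
soup = BeautifulSoup(html, 'html.parser')

price = (soup
         .find('tr', id='gift1')
         .find('img')                    # a tag we can find directly
         .find_parent('td')              # step up to the cell holding the image
         .find_previous_sibling('td')    # step back to the unlabeled price cell
         .get_text()
         .strip())                       # drop the surrounding newlines

print(price)
```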

<font color="red"><h2>Exercise 6</h2></font>

See if you can get all of the prices with one pipe.