# Parsing XML with BeautifulSoup

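[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library for parsing HTML and XML documents into a tree of tags that you can navigate and search. The `'xml'` parser used below relies on the `lxml` package, so that needs to be installed too.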

Let's read some XML! Here's the data:

In [1]:
xml_text = """<?xml version="1.0" ?>
<collection>
<mineral>
<name>Quartz</name>
<hardness>7</hardness>
<colour>colourless</colour>
<info>Common in continental crust</info>
</mineral>
<mineral>
<name>Olivine</name>
<hardness>6.5</hardness>
<colour>green</colour>
<info>Very common in the mantle</info>
</mineral>
</collection>
"""

In [2]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(xml_text, 'xml')

print(soup.prettify())

<?xml version="1.0" encoding="utf-8"?>
<collection>
 <mineral>
  <name>
   Quartz
  </name>
  <hardness>
   7
  </hardness>
  <colour>
   colourless
  </colour>
  <info>
   Common in continental crust
  </info>
 </mineral>
 <mineral>
  <name>
   Olivine
  </name>
  <hardness>
   6.5
  </hardness>
  <colour>
   green
  </colour>
  <info>
   Very common in the mantle
  </info>
 </mineral>
</collection>


## Find all of a particular field

In [3]:
minerals = soup.find_all('name')
minerals

[<name>Quartz</name>, <name>Olivine</name>]

In [4]:
for mineral in minerals:
    print(mineral.get_text())

Quartz
Olivine
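
The same pattern works for any field, and `find()` returns only the first match instead of a list. Here is a minimal sketch of both, using a trimmed-down copy of the data so that it runs on its own. The names `first` and `hardness` are only illustrative.

```python
from bs4 import BeautifulSoup

# A cut-down version of the collection, so this cell is self-contained.
xml_text = """<?xml version="1.0" ?>
<collection>
<mineral><name>Quartz</name><hardness>7</hardness></mineral>
<mineral><name>Olivine</name><hardness>6.5</hardness></mineral>
</collection>
"""

# The 'xml' parser is provided by lxml, which must be installed.
soup = BeautifulSoup(xml_text, 'xml')

# find() returns the first matching tag, or None if nothing matches.
first = soup.find('name')
print(first.get_text())

# Tag text is always a string, so convert it before doing arithmetic.
hardness = [float(tag.get_text()) for tag in soup.find_all('hardness')]
print(hardness)
```

Converting to `float` at this stage is a choice made for the example; the text can equally be left as strings and converted later.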


## Put data in `pandas`

In [5]:
minerals = soup.find_all('mineral')

In [6]:
import pandas as pd

data = []
for mineral in minerals:
    # stripped_strings yields each piece of text inside the tag, minus surrounding whitespace.
    d = list(mineral.stripped_strings)
    data.append(d)

# The column names are typed by hand, in the same order as the tags inside each <mineral>.
df = pd.DataFrame(data, columns=['Name', 'Hardness', 'Color', 'Info'])

In [7]:
df

      Name  Hardness       Color                         Info
0   Quartz         7  colourless  Common in continental crust
1  Olivine       6.5       green    Very common in the mantle
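
One possible way to avoid typing the column names is to read them from the tags themselves. The sketch below builds one dictionary per mineral, keyed by tag name, and lets `pandas` work out the columns. It assumes every `<mineral>` holds only simple, non-nested fields, as in this data; the names `records` and `record` are illustrative.

```python
import pandas as pd
from bs4 import BeautifulSoup

xml_text = """<?xml version="1.0" ?>
<collection>
<mineral>
<name>Quartz</name>
<hardness>7</hardness>
<colour>colourless</colour>
<info>Common in continental crust</info>
</mineral>
<mineral>
<name>Olivine</name>
<hardness>6.5</hardness>
<colour>green</colour>
<info>Very common in the mantle</info>
</mineral>
</collection>
"""

# The 'xml' parser is provided by lxml, which must be installed.
soup = BeautifulSoup(xml_text, 'xml')

records = []
for mineral in soup.find_all('mineral'):
    # find_all(True) matches every tag; recursive=False keeps only the
    # direct children of <mineral> and skips the whitespace between them.
    children = mineral.find_all(True, recursive=False)
    record = {child.name: child.get_text() for child in children}
    records.append(record)

# Column names now come from the tag names: name, hardness, colour, info.
df = pd.DataFrame(records)

# Everything is read as text, so convert the numeric column explicitly.
df['hardness'] = pd.to_numeric(df['hardness'])
print(df.dtypes)
```

With this approach the columns follow the XML's own spelling (`colour`), and a mineral that is missing a field would simply get a missing value in that column.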


## Select

Sometimes you'd like to be more specific about what you're selecting. For that there is `select()`, which takes a CSS selector. At its simplest, it does what `find_all()` does...

In [8]:
soup.select('info')

[<info>Common in continental crust</info>,
 <info>Very common in the mantle</info>]

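...but you can also pass a full path through the tree, where `>` means "direct child of". On an HTML page that might look like `soup.select("div[id=foo] > div > div[class=fee] > span > span > a")`. For our XML: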

In [9]:
soup.select('collection > mineral > info')

[<info>Common in continental crust</info>,
 <info>Very common in the mantle</info>]

In [10]:
[i.text for i in soup.select('collection > mineral > info')]

['Common in continental crust', 'Very common in the mantle']
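
A few other selector forms can be handy. The sketch below, on a trimmed-down copy of the data, shows `select_one()` for the first match, a space for "any descendant", and a comma for "either of these". The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# A cut-down version of the collection, so this cell is self-contained.
xml_text = """<?xml version="1.0" ?>
<collection>
<mineral><name>Quartz</name><colour>colourless</colour></mineral>
<mineral><name>Olivine</name><colour>green</colour></mineral>
</collection>
"""

# The 'xml' parser is provided by lxml, which must be installed.
soup = BeautifulSoup(xml_text, 'xml')

# select_one() returns the first match only, or None if nothing matches.
first_colour = soup.select_one('mineral > colour')
print(first_colour.get_text())

# A space means "any descendant", so intermediate levels can be skipped.
names = [tag.get_text() for tag in soup.select('collection name')]
print(names)

# A comma matches either selector; results come back in document order.
mixed = [tag.name for tag in soup.select('name, colour')]
print(mixed)
```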