![title](Header__0000_10.png)
___
# Chapter 10 - Web Scraping with Beautiful Soup
## Data parsing


In [None]:
# Parsing Data
# - An HTML or XML document is passed to the BeautifulSoup() constructor
# - The constructor converts the document to Unicode and then parses it with an HTML parser
#   (if no parser is named, Beautiful Soup picks the best one that is installed)
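
The notes above can be sketched in a few lines. This is a minimal illustration, not the notebook's own cell: `html_doc`, `raw_bytes`, `soup_builtin`, and `soup_bytes` are made-up names, and it assumes the parser `html.parser` (which ships with Python) rather than `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for illustration.
html_doc = "<html><head><title>Best Books</title></head><body><p>Hello</p></body></html>"

# Naming the parser explicitly keeps results consistent across machines.
# "html.parser" ships with Python; "lxml" is a separate install.
soup_builtin = BeautifulSoup(html_doc, "html.parser")
print(soup_builtin.title.string)

# Bytes are decoded to Unicode during parsing; from_encoding is a hint
# that tells Beautiful Soup which encoding to try.
raw_bytes = "<p>café</p>".encode("utf-8")
soup_bytes = BeautifulSoup(raw_bytes, "html.parser", from_encoding="utf-8")
print(soup_bytes.p.string)
```

Different parsers can build slightly different trees from malformed HTML, so naming one explicitly is usually the safer choice.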

In [1]:
import pandas as pd

from bs4 import BeautifulSoup

# importing the regular expression library
import re

In [2]:
# creating a string, r, that holds an HTML document.
r = '''
<html><head><title>Best Books</title></head>
<body>
<p class='title'><b>DATA SCIENCE FOR DUMMIES</b></p>

<p class='description'>Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe
<br><br>
Edition 1 of this book:
        <br>
 <ul>
  <li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>
  <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
  <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>
  <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>   
  </ul>
<br><br>
What to do next:
<br>
<a href='http://www.data-mania.com/blog/books-by-lillian-pierson/' class = 'preview' id='link 1'>See a preview of the book</a>,
<a href='http://www.data-mania.com/blog/data-science-for-dummies-answers-what-is-data-science/' class = 'preview' id='link 2'>get the free pdf download,</a> and then
<a href='http://bit.ly/Data-Science-For-Dummies' class = 'preview' id='link 3'>buy the book!</a> 
</p>

<p class='description'>...</p>
'''

In [3]:
# converting r to a BeautifulSoup object, parsed with lxml (which must be installed).
soup = BeautifulSoup(r, 'lxml')
type(soup)

bs4.BeautifulSoup

### Parsing your data

In [4]:
# printing the first 100 characters of the prettified (indented) document.
print(soup.prettify()[0:100])

<html>
 <head>
  <title>
   Best Books
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    DA
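
`prettify()` also works on a single tag, which is handy for inspecting one part of a page. A small sketch, assuming `html.parser` and a one-paragraph snippet copied from the document above; `snippet`, `soup_small`, and `pretty_lines` are illustrative names:

```python
from bs4 import BeautifulSoup

snippet = "<p class='title'><b>DATA SCIENCE FOR DUMMIES</b></p>"
soup_small = BeautifulSoup(snippet, "html.parser")

# prettify() puts each tag and each string on its own indented line.
pretty = soup_small.p.prettify()
print(pretty)

# Strip the indentation to look at the line contents alone.
pretty_lines = [line.strip() for line in pretty.splitlines()]
print(pretty_lines)
```

Because `prettify()` adds whitespace, it is best used for reading the tree, not for producing HTML to save back out.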


### Getting data from a parse tree

In [5]:
# get_text() returns all of the text with the HTML tags stripped out.
text_only = soup.get_text()
print(text_only)

Best Books

DATA SCIENCE FOR DUMMIES
Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe

Edition 1 of this book:
        

Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis
Details different data visualization techniques that can be used to showcase and summarize your data
Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques
Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark


What to do next:

See a preview of the book,
get the free pdf download, and then
buy the book!
...
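
The raw `get_text()` output keeps all of the original whitespace. The `separator` and `strip` arguments can tidy it up. A minimal sketch on a made-up list (`list_html` and `text_soup` are illustrative names), assuming `html.parser`:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with uneven whitespace around the text.
list_html = "<ul><li> First item </li><li>Second item</li></ul>"
text_soup = BeautifulSoup(list_html, "html.parser")

# Default: strings are joined as-is, whitespace included.
raw_text = text_soup.get_text()

# strip=True trims each string; separator is placed between them.
clean_text = text_soup.get_text(separator=" | ", strip=True)

print(repr(raw_text))    # ' First item Second item'
print(repr(clean_text))  # 'First item | Second item'
```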



### Searching and retrieving data from a parse tree

#### Retrieving tags by filtering with name arguments

In [6]:
# finding every tag named 'li'
soup.find_all("li")

[<li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>,
 <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>,
 <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>,
 <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>]
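
When only the first match (or the first few) is needed, `find()` and the `limit` argument of `find_all()` avoid building the full list. A short sketch on a made-up list, assuming `html.parser`; `name_soup` and the other variable names are illustrative:

```python
from bs4 import BeautifulSoup

name_soup = BeautifulSoup("<ul><li>A</li><li>B</li><li>C</li></ul>", "html.parser")

# find() returns the first matching tag, or None when nothing matches.
first_item = name_soup.find("li")
missing = name_soup.find("table")

# limit stops the search after the given number of matches.
first_two = [li.get_text() for li in name_soup.find_all("li", limit=2)]

print(first_item.get_text())  # A
print(missing)                # None
print(first_two)              # ['A', 'B']
```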

#### Retrieving tags by filtering with keyword arguments

In [8]:
# retrieving tags with keyword arguments, which filter on tag attributes. here the id of the tag is 'link 3'
soup.find_all(id="link 3")

[<a class="preview" href="http://bit.ly/Data-Science-For-Dummies" id="link 3">buy the book!</a>]
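
Other attributes can be filtered the same way. A sketch with three made-up links (`links_html` and `attr_soup` are illustrative names), assuming `html.parser`. Note that `class` is a reserved word in Python, so the keyword is spelled `class_`:

```python
from bs4 import BeautifulSoup

# Hypothetical links, loosely modelled on the ones in the document above.
links_html = """
<a href="/one" class="preview" id="link 1">one</a>
<a href="/two" class="buy" id="link 2">two</a>
<a id="top">top</a>
"""
attr_soup = BeautifulSoup(links_html, "html.parser")

# Filter on the CSS class with the class_ keyword.
by_class = attr_soup.find_all(class_="preview")

# An attrs dictionary does the same job as keyword arguments.
by_attrs = attr_soup.find_all("a", attrs={"id": "link 2"})

# Passing True keeps every tag that has the attribute at all.
with_href = attr_soup.find_all(href=True)

print([a.get_text() for a in by_class])   # ['one']
print(by_attrs[0]["href"])                # /two
print(len(with_href))                     # 2
```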

#### Retrieving tags by filtering with string arguments

In [11]:
# filtering on an exact string: returns every tag whose name is exactly 'ul'
soup.find_all('ul')

[<ul>\n<li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>\n<li>Details different data visualization techniques that can be used to showcase and summarize your data</li>\n<li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>\n<li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>\n</ul>]

#### Retrieving tags by filtering with list objects


In [12]:
# searching with a list: returns every tag whose name matches any item in the list ('ul' or 'b').
# results come back in document order, not in the order of the list.
soup.find_all(['ul', 'b'])

[<b>DATA SCIENCE FOR DUMMIES</b>,
 <ul>\n<li>Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis</li>\n<li>Details different data visualization techniques that can be used to showcase and summarize your data</li>\n<li>Explains both supervised and unsupervised machine learning, including regression, model validation, and clustering techniques</li>\n<li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>\n</ul>]
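
The output above lists `<b>` before `<ul>` even though the list was `['ul', 'b']`. A small sketch makes the ordering explicit; it assumes `html.parser` and a made-up snippet (`list_soup` and `found_names` are illustrative names):

```python
from bs4 import BeautifulSoup

list_soup = BeautifulSoup("<p><b>bold</b></p><ul><li>x</li></ul>", "html.parser")

# A list filter matches a tag if its name equals ANY item in the list.
found_names = [tag.name for tag in list_soup.find_all(["ul", "b"])]

# Matches follow their position in the document, not the list order.
print(found_names)  # ['b', 'ul']
```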

#### Retrieving tags by filtering with regular expressions


In [13]:
# l is a compiled regular expression object that we pass to find_all().
# the loop prints the name of every tag whose name contains the letter 'l'.
l = re.compile('l')
for tag in soup.find_all(l): print(tag.name)

html
title
ul
li
li
li
li
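
Because the pattern is applied with `search()`, it matches anywhere in the tag name unless it is anchored. A sketch on a made-up snippet, assuming `html.parser`; `regex_soup`, `starts_with_b`, and `exactly_b` are illustrative names:

```python
import re
from bs4 import BeautifulSoup

regex_soup = BeautifulSoup(
    "<html><body><p><b>x</b><br/>y</p></body></html>", "html.parser"
)

# ^ anchors the match to the start of the tag name.
starts_with_b = [t.name for t in regex_soup.find_all(re.compile("^b"))]

# ^ and $ together require the whole name to match.
exactly_b = [t.name for t in regex_soup.find_all(re.compile("^b$"))]

print(starts_with_b)  # ['body', 'b', 'br']
print(exactly_b)      # ['b']
```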


#### Retrieving tags by filtering with a Boolean value

In [14]:
# find_all(True) returns every tag in the parse tree (but none of the text strings).
for tag in soup.find_all(True): print(tag.name)

html
head
title
body
p
b
p
br
br
br
ul
li
li
li
li
br
br
br
a
a
a
p
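
One practical use of `find_all(True)` is a quick summary of which tags a page contains. A sketch using `collections.Counter` on a made-up snippet, assuming `html.parser`; `count_soup` and `tag_counts` are illustrative names:

```python
from collections import Counter
from bs4 import BeautifulSoup

count_soup = BeautifulSoup("<div><p>a</p><p>b</p><br/></div>", "html.parser")

# True matches every tag, so this tallies tag names across the tree.
tag_counts = Counter(tag.name for tag in count_soup.find_all(True))

print(tag_counts["p"])    # 2
print(tag_counts["div"])  # 1
print(tag_counts["br"])   # 1
```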


#### Retrieving weblinks by filtering with string objects


In [16]:
# pass in a string object as a filter. we isolate the weblinks by passing in the tag name 'a'.
# for each 'a' tag that it finds, it gets the value of the 'href' attribute and prints it out.
for link in soup.find_all('a'): print(link.get('href'))

http://www.data-mania.com/blog/books-by-lillian-pierson/
http://www.data-mania.com/blog/data-science-for-dummies-answers-what-is-data-science/
http://bit.ly/Data-Science-For-Dummies
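
Real pages often mix relative links, absolute links, and anchors with no `href` at all. The sketch below is one way to handle that, not the notebook's own method: it assumes `html.parser`, a hypothetical `base_url`, and a made-up snippet, and it uses `urljoin` from the standard library to resolve relative paths.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical page address, used to resolve relative links.
base_url = "http://example.com/books/"

link_html = (
    '<a href="preview.html">See a preview</a> '
    '<a>no link here</a> '
    '<a href="http://bit.ly/Data-Science-For-Dummies">buy the book!</a>'
)
link_soup = BeautifulSoup(link_html, "html.parser")

# href=True skips <a> tags that have no href attribute.
links = [
    (a.get_text(strip=True), urljoin(base_url, a["href"]))
    for a in link_soup.find_all("a", href=True)
]

for text, url in links:
    print(text, "->", url)
```

Using `link.get('href')` as in the cell above returns `None` for anchors without an `href`; filtering with `href=True` avoids having to check for that afterwards.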


#### Retrieving strings by filtering with regular expressions


In [17]:
# returning every string in the document that contains 'data' (the match is case-sensitive).
soup.find_all(string=re.compile("data"))

[u'Jobs in data science abound, but few people have the data science skills needed to fill these increasingly important roles in organizations. Data Science For Dummies is the pe\n',
 u'Provides a background in data science fundamentals before moving on to working with relational databases and unstructured data and preparing your data for analysis',
 u'Details different data visualization techniques that can be used to showcase and summarize your data',
 u'Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark']
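
The search above is case-sensitive, so a string containing only "Data" would be skipped. A minimal sketch of a case-insensitive variant, assuming `html.parser` and a made-up snippet; `string_soup`, `lower_only`, and `any_case` are illustrative names:

```python
import re
from bs4 import BeautifulSoup

string_soup = BeautifulSoup(
    "<p>Data Science</p><p>big data tools</p><p>machine learning</p>",
    "html.parser",
)

# Case-sensitive: only strings containing lowercase 'data' match.
lower_only = string_soup.find_all(string=re.compile("data"))

# re.IGNORECASE also picks up 'Data'.
any_case = string_soup.find_all(string=re.compile("data", re.IGNORECASE))

print(lower_only)  # ['big data tools']
print(any_case)    # ['Data Science', 'big data tools']

# Each result is a string object that still knows its enclosing tag.
print(any_case[0].parent.name)  # p
```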