<p><a name="sections"></a></p>
<br>
<br>
# Sections

- <a href="#intro">Introduction to Beautiful Soup</a><br>
    - <a href="#web">What is web scraping?</a><br>
    - <a href="#html">Introduction to HTML</a><br>
    - <a href="#beautiful">Basics of Beautiful Soup</a><br>

- <a href="#example">Examples</a><br>
    - <a href="#calendar">Python User Group Calendar</a><br>
    - <a href="#cran">Exploring CRAN</a><br>

<p><a name="intro"></a></p>
## Introduction to Beautiful Soup
[[back to top]](#sections)

In [1]:
from IPython.display import HTML
HTML('<iframe src=http://www.crummy.com/software/BeautifulSoup/bs4/doc/ width=800 height=600></iframe>')

<p><a name="web"></a></p>
## What is web scraping?
[[back to top]](#sections)

Web scraping means extracting data from web pages with a program, and most of those pages are written in HTML.

HTML is short for **HyperText Markup Language**. It's a language for presenting content on the Web.

Plain text is turned into an HTML document by **tags** that are then interpreted by a browser.

Using Beautiful Soup, you can easily extract the tag values from HTML source code.

### Beautiful Soup vs. Regular Expressions

In [2]:
# the source code of hi.html
!cat data/hi.html

<!DOCTYPE html>
<html>
    <head>
        <title>Hi</title> <!--Im a comment, ignore me.-->
    </head>
    <body>
        <a href='http://www.crummy.com/software/BeautifulSoup/'>Hello, beautifulsoup!</a>
    </body>
</html>


### Example:
Extract the characters between the title tags. 


In this case it's `Hi` (`<title>Hi</title>`).

- **Solution using Regular Expressions**

In [3]:
import re
hi_path = 'data/hi.html'
hi = open(hi_path).read()
# raw string and a non-greedy group, so each title is captured on its own
re.findall(r'<title>(.*?)</title>', hi)

['Hi']

- **Solution using BeautifulSoup**

In [4]:
from bs4 import BeautifulSoup
hi = BeautifulSoup(open(hi_path), 'html.parser')  # name the parser explicitly to avoid a warning
print hi.title # find the title tag
print hi.title.string  # find the value of the tag

<title>Hi</title>
Hi



**Compared with regular expressions:**
    
- Beautiful Soup's syntax is much simpler, while regular expressions are more flexible.
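
To make the trade-off concrete, here is a small sketch (Python 3 syntax) that uses a made-up HTML string, not `data/hi.html`, in which the title spans several lines. The one-line pattern from the example above finds nothing because `.` does not match a newline by default, while Beautiful Soup still locates the tag.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: same content as hi.html, but the title is split over lines.
html = """<html><head><title>
    Hi
</title></head><body></body></html>"""

# '.' does not match a newline by default, so the one-line pattern fails.
regex_titles = re.findall(r'<title>(.*)</title>', html)
print(regex_titles)

# Beautiful Soup parses the structure, so the line breaks do not matter.
soup = BeautifulSoup(html, 'html.parser')
soup_title = soup.title.string.strip()
print(soup_title)
```

Passing `re.DOTALL` would rescue the regular expression here, but each fix of that kind has to be anticipated by hand.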

<p><a name="html"></a></p>
## Introduction to HTML
[[back to top]](#sections)

### Tag Syntax

- The `<title>` tags in this example designate the enclosed text as the title to be displayed in the head of the browser tab.
![hi](pic/hi.png)

- Tags are always enclosed by `<` and `>` to distinguish them from the content. 
- A pair of tags consists of a start tag and an end tag that carry the same name; the end tag has a slash `/` before the name.

### Values

Values are the content between start tags and end tags.

- **Example**

`<title>Hi</title>`: It's a title tag with a value of `Hi`.

### Attributes
Tags have another feature called attributes.

- **Example**

`<a href='http://www.crummy.com/software/BeautifulSoup/'>Hello, beautifulsoup!</a>`

This is an anchor tag `<a>` with an attribute `href` whose value is the URL http://www.crummy.com/software/BeautifulSoup/. It turns the enclosed text into a hyperlink that points to that address.

### Tree structure
- The first element in the example is the `<html>` element. 


- Between the `<html>` tags of this element, several tags are opened and closed again: `<head>`, `<title>`, `<body>`, and `<a>`.

    - The `<head>` and `<body>` tags are directly enclosed by the `<html>` element. 
    - The `<title>` element is enclosed by the `<head>` tag.
    - The `<a>` element is enclosed by the `<body>` tag.


A good way to describe the multiple layers of an HTML document is the tree analogy. 
![html](pic/html.png)

The `<html>` element is the root element that splits into two branches, `<head>` and `<body>`. `<head>` is followed by another branch called `<title>`; `<body>` is followed by another branch called `<a>`.
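
The tree described above can be walked in code. The following is a minimal sketch (Python 3 syntax) that rebuilds the `hi.html` markup as an inline string and uses a hypothetical helper, `show_tree`, to list each element indented by its depth.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline copy of the hi.html structure (comment and whitespace left out).
html = ("<html><head><title>Hi</title></head>"
        "<body><a href='http://www.crummy.com/software/BeautifulSoup/'>"
        "Hello, beautifulsoup!</a></body></html>")
soup = BeautifulSoup(html, 'html.parser')

def show_tree(tag, depth=0):
    """Return one line per element, indented two spaces per tree level."""
    lines = ['  ' * depth + tag.name]
    for child in tag.children:
        if isinstance(child, Tag):  # skip plain text nodes
            lines.extend(show_tree(child, depth + 1))
    return lines

tree_lines = show_tree(soup.html)
print('\n'.join(tree_lines))

# Every element also knows its parent in the tree.
title_parent = soup.title.parent.name
print(title_parent)
```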

<p><a name="beautiful"></a></p>
## Basics of Beautiful Soup
[[back to top]](#sections)

### Install 

- Anaconda is shipped with Beautiful Soup.

- Command line from Linux :
    - `sudo easy_install beautifulsoup4`

- Command line from Mac :
    - `sudo pip install beautifulsoup4`
    
- Command line from Windows:
    - [Download](http://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/) and run `python setup.py install`
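
After installing, a quick way to confirm that Python can import the package is to print its version. This is only a sanity check (Python 3 syntax); the version string you see depends on your installation.

```python
import bs4

# The package exposes its version as a string.
installed_version = bs4.__version__
print(installed_version)
```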

### Parse HTML

In [5]:
from bs4 import BeautifulSoup

In [6]:
# open a local file 
# parse the plain text by BeautifulSoup directly
hi = BeautifulSoup(open(hi_path), 'html.parser')
print type(hi), '\n' # get a bs4.BeautifulSoup instance
print hi

<class 'bs4.BeautifulSoup'> 

<!DOCTYPE html>

<html>
<head>
<title>Hi</title> <!--Im a comment, ignore me.-->
</head>
<body>
<a href="http://www.crummy.com/software/BeautifulSoup/">Hello, beautifulsoup!</a>
</body>
</html>



The `prettify` method adds indentation so that the nesting of the tags is easy to see.

In [7]:
print hi.prettify()

<!DOCTYPE html>
<html>
 <head>
  <title>
   Hi
  </title>
  <!--Im a comment, ignore me.-->
 </head>
 <body>
  <a href="http://www.crummy.com/software/BeautifulSoup/">
   Hello, beautifulsoup!
  </a>
 </body>
</html>



### Names, Values, and Attributes

Beautiful Soup can extract the name, value, and attributes of a tag. The corresponding attributes of a tag object are:

- `name`
- `string`
- `attrs`

In [8]:
print "The name of the a tag is: ", hi.a.name
print "The value of the a tag is: ", hi.a.string
print "The attributes of the a tag are: ", hi.a.attrs

The name of the a tag is:  a
The value of the a tag is:  Hello, beautifulsoup!
The attributes of the a tag are:  {u'href': u'http://www.crummy.com/software/BeautifulSoup/'}


### get_text() & get()
For a tag that has more than one child, `.string` is ambiguous, so it returns `None`:

In [9]:
print hi.html.string

None


Use the `get_text()` method instead. It returns the text of all the descendants of a tag, joined into a single string.

In [10]:
print hi.html.get_text()



Hi 


Hello, beautifulsoup!
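
The blank lines in the output above come from the whitespace between tags. As a sketch of how to get cleaner text (Python 3 syntax, using an inline copy of the `hi.html` markup), `get_text()` accepts a separator and a `strip` flag, and `stripped_strings` yields the individual pieces:

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>Hi</title></head>"
        "<body><a href='http://www.crummy.com/software/BeautifulSoup/'>"
        "Hello, beautifulsoup!</a></body></html>")
soup = BeautifulSoup(html, 'html.parser')

# <html> has two children, so .string cannot pick one and returns None.
html_string = soup.html.string
print(html_string)

# Join all text pieces with a separator, trimming whitespace around each.
joined = soup.html.get_text(' | ', strip=True)
print(joined)

# Or keep the pieces separate as a list.
pieces = list(soup.html.stripped_strings)
print(pieces)
```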




`get()` returns the value of an attribute of a tag. For example, the following code gets the `href` of the `a` tag.

It is the same as running `hi.a.attrs` first and then looking up the key `href` in the resulting dictionary.

In [11]:
print hi.a.get('href')

http://www.crummy.com/software/BeautifulSoup/
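
One practical difference from plain dictionary indexing shows up when the attribute is missing. The sketch below (Python 3 syntax, with an inline copy of the `a` tag) compares `get()` with square-bracket access:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<a href='http://www.crummy.com/software/BeautifulSoup/'>Hello, beautifulsoup!</a>",
    'html.parser')
link = soup.a

href = link.get('href')             # the attribute exists
missing = link.get('id')            # no id attribute: returns None
fallback = link.get('id', 'no-id')  # or a default of your choice

# Square brackets behave like a dict lookup and raise KeyError instead.
try:
    link['id']
    raised = False
except KeyError:
    raised = True

print(href, missing, fallback, raised)
```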


### find() & find_all()
The methods `find()` and `find_all()` are flexible ways to search for tags.

In [12]:
!cat data/article.html

<!DOCTYPE html>
<html>
    <head>
        <title>Article</title>
    </head>
    <body>
        <h1 id='one'>One</h1>
        	<p>This is the first paragraph.</p>
        <h2 id='two'>Two</h2>
        	<p><a href='www.google.com'>Here is the Google website.</a></p>
        <h3 id='three'>Three</h3>
        	<p>This is the third paragraph.</p>
    </body>
</html>


![article](pic/article.png)

In [13]:
article_path = 'data/article.html'
article = BeautifulSoup(open(article_path), 'html.parser')

`article.p` returns only the first `p` tag:

In [14]:
print article.p

<p>This is the first paragraph.</p>


`find()` also returns the first `p` tag, so it is equivalent to `article.p`:

In [15]:
print article.find('p')

<p>This is the first paragraph.</p>


`find_all()` returns all `p` tags:

In [16]:
print article.find_all('p')

[<p>This is the first paragraph.</p>, <p><a href="www.google.com">Here is the Google website.</a></p>, <p>This is the third paragraph.</p>]


You can also pass a function to `find_all()`. It is called on each tag and should return `True` for the tags you want to keep.

For example, the function below checks whether a tag has the `id` attribute:

In [17]:
print article.find_all(lambda tag: tag.has_attr('id'))

[<h1 id="one">One</h1>, <h2 id="two">Two</h2>, <h3 id="three">Three</h3>]


In [18]:
# the tags whose attribute id equals 'one'
print article.find_all(id='one')
# equivalent
print article.find_all(lambda tag: tag.get('id') == 'one')

[<h1 id="one">One</h1>]
[<h1 id="one">One</h1>]
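
Besides a tag name, a function, or a keyword such as `id`, `find_all()` accepts a few other kinds of filters. The sketch below (Python 3 syntax, with a condensed inline copy of `article.html`) shows a list of names, a compiled regular expression matched against tag names, the `limit` argument, and `True` as an attribute value meaning "the attribute is present":

```python
import re
from bs4 import BeautifulSoup

html = """<html><body>
<h1 id='one'>One</h1><p>This is the first paragraph.</p>
<h2 id='two'>Two</h2><p><a href='www.google.com'>Here is the Google website.</a></p>
<h3 id='three'>Three</h3><p>This is the third paragraph.</p>
</body></html>"""
article = BeautifulSoup(html, 'html.parser')

# A list matches any of the given tag names.
two_headings = [t.name for t in article.find_all(['h1', 'h2'])]

# A compiled regular expression is matched against each tag name.
all_headings = [t.get_text() for t in article.find_all(re.compile(r'^h[1-6]$'))]

# limit stops the search after the given number of matches.
first_two = article.find_all('p', limit=2)

# href=True keeps only <a> tags that have an href attribute at all.
with_href = article.find_all('a', href=True)

print(two_headings, all_headings, len(first_two), len(with_href))
```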


<p><a name="example"></a></p>
## Examples
[[back to top]](#sections)

<p><a name="calendar"></a></p>
### Python User Group Calendar
[[back to top]](#sections)

Let's extract the time, location, and event titles from this web page [Python User Group Calendar](https://www.python.org/events/python-user-group/).

<img src=pic/events.png width=800/>

In [19]:
import requests
text = requests.get('https://www.python.org/events/python-user-group/').text
text = BeautifulSoup(text, 'html.parser')

In [20]:
print text.prettify()

<!DOCTYPE html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" dir="ltr" lang="en">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <link href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js" rel="prefetch"/>
  <meta content="Python.org" name="application-name"/>
  <meta content="The official home of the Python Programming Language" name="msapplication-tooltip"/>
  <meta content="Python.org" name="apple-mobile-web-app-title"/>
  <meta content="yes" name="apple-mobile-web-app-capable"/>
  <meta content="black" name="apple-mobile-web-app-status-bar-style"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="True" name="HandheldFriendly"/>
 

#### Title
Titles are in `h3` tags with an attribute `class="event-title"`.
<img src=pic/title.png width=900/>

In [21]:
titleTags = text.find_all('h3', {'class': "event-title"})
titleTags

[<h3 class="event-title"><a href="/events/python-user-group/436/">Django Girls Guayaquil</a></h3>,
 <h3 class="event-title"><a href="/events/python-user-group/403/">Python Meeting D\xfcsseldorf</a></h3>]

In [22]:
titleString = [tag.get_text() for tag in titleTags]
titleString

[u'Django Girls Guayaquil', u'Python Meeting D\xfcsseldorf']

#### Time
Times are in the `time` tags with the attribute `datetime`.

![time](pic/time.png)

In [23]:
# a narrower alternative: text.find_all('time', attrs={'datetime': True})
timeTags = text.find_all(lambda tag: 'datetime' in tag.attrs)
timeTags

[<time datetime="2016-07-09T00:00:00+00:00">09 July \u2013 10 July <span class="say-no-more"> 2016</span></time>,
 <time datetime="2016-07-06T16:00:00+00:00">06 July<span class="say-no-more"> 2016</span></time>]

In [24]:
timeString = [tag.get_text() for tag in timeTags]
timeString

[u'09 July \u2013 10 July  2016', u'06 July 2016']

#### Location
Locations are in `span` tags with the attribute `class="event-location"`.

<img src=pic/location.png width=900/>

In [25]:
locationTags = text.find_all("span", {"class": "event-location"})
locationTags

[<span class="event-location">Guayaquil, Ecuador</span>,
 <span class="event-location">B\xfcrgerhaus im Stadtteilzentrum Bilk (D\xfcsseldorfer Arcaden), Bachstr. 145, 40217 D\xfcsseldorf, Germany</span>]

In [26]:
locationString = [tag.get_text() for tag in locationTags]
locationString

[u'Guayaquil, Ecuador',
 u'B\xfcrgerhaus im Stadtteilzentrum Bilk (D\xfcsseldorfer Arcaden), Bachstr. 145, 40217 D\xfcsseldorf, Germany']
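
The three lists above only line up if every event has a title, a time, and a location. A possible alternative is to search inside each event separately and build one record per event. The sketch below (Python 3 syntax) assumes each event sits in its own `li` element; the markup is a simplified, hypothetical stand-in for the real page, which may nest these tags differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup reusing the tag names and classes seen above.
html = """
<ul>
  <li>
    <h3 class="event-title"><a href="/events/python-user-group/436/">Django Girls Guayaquil</a></h3>
    <p><time datetime="2016-07-09T00:00:00+00:00">09 July</time>
       <span class="event-location">Guayaquil, Ecuador</span></p>
  </li>
  <li>
    <h3 class="event-title"><a href="/events/python-user-group/403/">Python Meeting</a></h3>
    <p><time datetime="2016-07-06T16:00:00+00:00">06 July</time></p>
  </li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

events = []
for item in soup.find_all('li'):
    title = item.find('h3', class_='event-title')
    when = item.find('time')
    where = item.find('span', class_='event-location')
    if title is None or when is None:
        continue  # not an event entry
    events.append({
        'title': title.get_text(strip=True),
        'start': when['datetime'],  # machine-readable timestamp
        'location': where.get_text(strip=True) if where else None,
    })

for event in events:
    print(event)
```

Here the second event has no location, and its record simply carries `None` instead of shifting the later locations out of step.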

<p><a name="cran"></a></p>
### Exploring CRAN
[[back to top]](#sections)

We want to explore the information provided in the packages on [CRAN](http://cran.r-project.org/web/packages/available_packages_by_name.html).

In [27]:
from IPython.display import HTML
HTML('<iframe src=http://cran.r-project.org/web/packages/available_packages_by_name.html height=600 width=800></iframe>')

In [28]:
import requests
from bs4 import BeautifulSoup
cran_url = 'http://cran.r-project.org/web/packages/available_packages_by_name.html'
cran = requests.get(cran_url).text
cran = BeautifulSoup(cran, 'html.parser')
print cran.prettify()

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>
   CRAN Packages By Name
  </title>
  <link href="../CRAN_web.css" rel="stylesheet" type="text/css"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <style type="text/css">
   table td { vertical-align: top; }
  </style>
 </head>
 <body lang="en">
  <h1>
   Available CRAN Packages By Name
  </h1>
  <p style="text-align: center">
   <a href="#available-packages-A">
    A
   </a>
   <a href="#available-packages-B">
    B
   </a>
   <a href="#available-packages-C">
    C
   </a>
   <a href="#available-packages-D">
    D
   </a>
   <a href="#available-packages-E">
    E
   </a>
   <a href="#available-packages-F">
    F
   </a>
   <a href="#available-packages-G">
    G
   </a>
   <a href="#available-packages-H">
    H
   </a>
   <a href="#available-packages-I">
    I
   </a>
   <a href="#availabl

In [29]:
## all the packages
packages = cran.table.find_all('a')
packages = [i.string for i in packages]
print len(packages) # the number of packages
print packages[0:5]

8815
[u'A3', u'abbyyR', u'abc', u'ABCanalysis', u'abc.data']


#### Details of Packages on CRAN

In [30]:
HTML('<iframe src=http://cran.r-project.org/web/packages/A3/index.html width=800 height=600></iframe>')

**`"Dependencies"` occur when a package is built on top of another package.**

If A depends on B, we can say that B is cited. So let's **find out which packages have the most citations.**

In [31]:
# the index of each package is: ../../web/packages/package_name/index.html
def package_url(name):
    return 'http://cran.r-project.org/web/packages/' + name + '/index.html'

In [32]:
A3 = BeautifulSoup(requests.get(package_url('A3')).text, 'html.parser')

In [33]:
# the second `tr` tag in the first `table` tag
A3_depends = A3.table.find_all('tr')[1]
print A3_depends

<tr>
<td>Depends:</td>
<td>R (≥ 2.15.0), <a href="../xtable/index.html">xtable</a>, <a href="../pbapply/index.html">pbapply</a></td>
</tr>


In [34]:
# the package names
for i in A3_depends.find_all('a'):
    print i.string

xtable
pbapply
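
Picking the row by position works for A3, but it relies on `Depends:` always being the second row. A more defensive variant looks the row up by its label. The sketch below (Python 3 syntax) uses a hypothetical helper, `depends_from_table`, and two made-up tables modeled on the row printed above; the `example` package name is invented for illustration.

```python
from bs4 import BeautifulSoup

def depends_from_table(soup):
    """Return linked package names from the row whose first cell is 'Depends:'."""
    for row in soup.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) >= 2 and cells[0].get_text(strip=True) == 'Depends:':
            return [a.get_text() for a in cells[1].find_all('a')]
    return []  # no Depends row on this page

# Made-up page fragments; only the Depends row mirrors the real A3 output.
with_depends = BeautifulSoup("""
<table>
<tr><td>Version:</td><td>(version omitted)</td></tr>
<tr><td>Depends:</td>
<td>R (&ge; 2.15.0), <a href="../xtable/index.html">xtable</a>, <a href="../pbapply/index.html">pbapply</a></td></tr>
</table>""", 'html.parser')

without_depends = BeautifulSoup("""
<table>
<tr><td>Version:</td><td>(version omitted)</td></tr>
<tr><td>Imports:</td><td><a href="../example/index.html">example</a></td></tr>
</table>""", 'html.parser')

found = depends_from_table(with_depends)
not_found = depends_from_table(without_depends)
print(found)
print(not_found)
```

For the second table, indexing with `find_all('tr')[1]` would return the links of the `Imports:` row instead, which is why matching on the label can be safer.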


#### Function to get Dependencies

In [35]:
def get_depends(name):
    '''
    Given a package name, return the names of the packages it depends on.
    '''
    url = package_url(name)
    text = BeautifulSoup(requests.get(url).text, 'html.parser')
    # assumes the "Depends:" row is the second row of the first table
    text_depends = text.table.find_all('tr')[1]
    return [i.string for i in text_depends.find_all('a')]

In [36]:
## test function 
get_depends('abc')

[u'abc.data', u'nnet', u'quantreg', u'MASS', u'locfit']

#### Summary

Since there are too many packages to fetch quickly, let's take the first 50 packages and collect their dependencies.

In [37]:
# it takes a long time to run
depend_packages = []
for i in range(50):
    if (i+1) % 10 == 0:
        print "Getting %dth package: %s" %(i+1, packages[i])
    depend_packages.append(get_depends(packages[i]))

Getting 10th package: abctools
Getting 20th package: acc
Getting 30th package: ACEt
Getting 40th package: ACSNMineR
Getting 50th package: ada


In [38]:
depend_packages

[[u'xtable', u'pbapply'],
 [],
 [u'abc.data', u'nnet', u'quantreg', u'MASS', u'locfit'],
 [],
 [],
 [u'Rglpk', u'rgl', u'corrplot', u'lattice'],
 [],
 [u'MASS'],
 [],
 [u'abc', u'abind', u'plyr'],
 [u'nlme', u'lattice', u'mosaic'],
 [],
 [u'ggplot2', u'reshape2'],
 [],
 [u'Cairo'],
 [u'cluster'],
 [u'limma'],
 [u'QUIC'],
 [],
 [u'zoo', u'mhsmm', u'PhysicalActivity', u'nleqslv', u'plyr'],
 [],
 [u'mice', u'pscl'],
 [],
 [],
 [u'tcltk2'],
 [],
 [],
 [],
 [],
 [],
 [u'gamlss', u'gamlss.dist', u'Hmisc'],
 [u'MASS'],
 [],
 [u'aroma.affymetrix'],
 [u'R.utils', u'xtable'],
 [u'mvtnorm'],
 [u'tseries', u'quantmod'],
 [u'dummies', u'randomForest', u'kernelFactory', u'ada'],
 [u'stringr', u'plyr', u'XML'],
 [u'ggplot2', u'gridExtra', u'scales'],
 [u'acss.data'],
 [],
 [u'MASS'],
 [u'R.methodsS3'],
 [u'fda'],
 [],
 [],
 [],
 [u'reliaR', u'actuar', u'hypergeo'],
 [u'rpart']]
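
To answer the original question of which packages are cited most often, the nested list can be flattened and counted. The sketch below (Python 3 syntax) uses `collections.Counter` on a sample made of only the first ten lists from the output above, so the counts are for that sample and not for all 50 packages.

```python
from collections import Counter

# Sample: the first ten dependency lists from the output above.
depend_sample = [
    ['xtable', 'pbapply'],
    [],
    ['abc.data', 'nnet', 'quantreg', 'MASS', 'locfit'],
    [],
    [],
    ['Rglpk', 'rgl', 'corrplot', 'lattice'],
    [],
    ['MASS'],
    [],
    ['abc', 'abind', 'plyr'],
]

# Flatten the nested lists and count how often each package is depended on.
citations = Counter(name for deps in depend_sample for name in deps)
top = citations.most_common(1)
print(top)
```

Running the same two lines on the full `depend_packages` list gives the ranking for all 50 packages.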