In [1]:
import requests
import bs4

In [3]:
response = requests.get('http://duspviz.mit.edu/_assets/data/sample.html')
print(response.text)

<html><head><title>Where are the rats?</title></head>
<body>
<p class="title"><b>Rat Incidents in Greater Boston</b></p>

<p class="story">The following is rodent incident data for 
<a href="http://example.com/boston" class="link" id="link1">Boston</a>,
<a href="http://example.com/brookline" class="link" id="link2">Brookline</a>,
<a href="http://example.com/cambridge" class="link" id="link2">Cambridge</a>, and
<a href="http://example.com/somerville" class="link" id="link3">Somerville</a>;
and it only available here.</p>

<table>
	<thead>
		<tr>
			<th>City</th>
			<th># of rats</th>
		</tr>
	</thead>
	<tbody>
		<tr>
			<td class="city">Cambridge</td>
			<td class="number">400</td>
		</tr>
		<tr>
			<td class="city">Boston</td>
			<td class="number">900</td>
		</tr>
		<tr>
			<td class="city">Somerville</td>
			<td class="number">300</td>
		</tr>
		<tr>
			<td class="city">Brookline</td>
			<td class="number">600</td>
		</tr>
	</tbody>
</table>

</body>



# What is Beautiful Soup?

Beautiful Soup is a Python library for parsing data out of HTML and XML files (such as webpages). It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. The major concept with Beautiful Soup is that it lets you access elements of your page by tag, class, or ID, much as you would target them in CSS: all links, all headers, specific classes, and more. Once we grab elements, Python makes it easy to write the elements, or relevant components of them, into other files, such as a CSV, that can be stored in a database or opened in other software.

The sample webpage we are using contains data on 'rodent incidents' in the greater Boston area. Let's use this file to explore the tree, and extract some data.

### Make the Soup

First, we have to turn the website code into a Python object. We have already imported the Beautiful Soup library, so we can start calling some of the methods in the library. Replace print(response.text) with the following. This turns the text into a Python object named soup.

An important note: you need to specify the parser that Beautiful Soup uses to parse your text. This is done in the second argument of the BeautifulSoup constructor. Python's built-in parser needs no extra installation, and we select it with "html.parser". If you omit the argument, Beautiful Soup picks the best parser it finds installed, which can differ between machines.

You can also use lxml or html5lib, which are installed separately. This is nicely described in the documentation. For our purposes, using the built-in parser is fine.

Using Beautiful Soup's prettify() method, we can print the page with one tag per line and indentation that reflects the nesting.

In [5]:
soup = bs4.BeautifulSoup(response.text, "html.parser")
print(soup.prettify())

<html>
 <head>
  <title>
   Where are the rats?
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    Rat Incidents in Greater Boston
   </b>
  </p>
  <p class="story">
   The following is rodent incident data for
   <a class="link" href="http://example.com/boston" id="link1">
    Boston
   </a>
   ,
   <a class="link" href="http://example.com/brookline" id="link2">
    Brookline
   </a>
   ,
   <a class="link" href="http://example.com/cambridge" id="link2">
    Cambridge
   </a>
   , and
   <a class="link" href="http://example.com/somerville" id="link3">
    Somerville
   </a>
   ;
and it only available here.
  </p>
  <table>
   <thead>
    <tr>
     <th>
      City
     </th>
     <th>
      # of rats
     </th>
    </tr>
   </thead>
   <tbody>
    <tr>
     <td class="city">
      Cambridge
     </td>
     <td class="number">
      400
     </td>
    </tr>
    <tr>
     <td class="city">
      Boston
     </td>
     <td class="number">
      900
     </td>
    </tr>
    <tr>
  

At any point, if you need a reference, visit the Beautiful Soup documentation for the official descriptions of its functions. prettify() is a handy one for viewing our document in a clean, indented form.
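If the sample URL is unreachable, the same parsing step can be tried offline. The sketch below is only an illustration: the `html` string is a shortened, hand-copied stand-in for `response.text`, and the names `snippet_soup` and `title_text` are made up for this example.

```python
import bs4

# Shortened stand-in for response.text (assumption: copied by hand from the sample page).
html = (
    '<html><head><title>Where are the rats?</title></head>'
    '<body><p class="title"><b>Rat Incidents in Greater Boston</b></p></body></html>'
)

# "html.parser" ships with Python; "lxml" and "html5lib" must be installed separately.
snippet_soup = bs4.BeautifulSoup(html, "html.parser")

# Read the text inside the <title> tag.
title_text = snippet_soup.title.string
print(title_text)

# prettify() returns the indented markup as a string.
print(snippet_soup.prettify())
```

Swapping "html.parser" for "lxml" or "html5lib" should give the same result on well-formed markup like this, provided those packages are installed.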

### Navigating the Data Structure

With our data from the webpage nicely laid out, Beautiful Soup now allows us to navigate the data structure. We called our Beautiful Soup object soup, so we can call Beautiful Soup methods on this object. Let's explore some ways to do this. Try entering some of the following into your Python session.

In [6]:
# Access the title element
soup.title

<title>Where are the rats?</title>

In [7]:
# Access the content of the title element
soup.title.string

'Where are the rats?'

In [8]:
# Access data in the first 'p' tag
soup.p

<p class="title"><b>Rat Incidents in Greater Boston</b></p>

In [9]:
# Access data in the first 'a' tag
soup.a

<a class="link" href="http://example.com/boston" id="link1">Boston</a>

In [10]:
# Retrieve all links in the document (note it returns a list)
soup.find_all('a')

[<a class="link" href="http://example.com/boston" id="link1">Boston</a>,
 <a class="link" href="http://example.com/brookline" id="link2">Brookline</a>,
 <a class="link" href="http://example.com/cambridge" id="link2">Cambridge</a>,
 <a class="link" href="http://example.com/somerville" id="link3">Somerville</a>]

In [11]:
# Retrieve elements whose class equals "link" using the attrs argument
soup.find_all(attrs={'class' : 'link'})

[<a class="link" href="http://example.com/boston" id="link1">Boston</a>,
 <a class="link" href="http://example.com/brookline" id="link2">Brookline</a>,
 <a class="link" href="http://example.com/cambridge" id="link2">Cambridge</a>,
 <a class="link" href="http://example.com/somerville" id="link3">Somerville</a>]

In [12]:
# Retrieve a specific link by ID
soup.find(id="link3")

<a class="link" href="http://example.com/somerville" id="link3">Somerville</a>

In [13]:
# Access data in the table (note it returns a list)
soup.find_all('td')

[<td class="city">Cambridge</td>,
 <td class="number">400</td>,
 <td class="city">Boston</td>,
 <td class="number">900</td>,
 <td class="city">Somerville</td>,
 <td class="number">300</td>,
 <td class="city">Brookline</td>,
 <td class="number">600</td>]
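Beyond whole elements, it is often the attributes that matter, such as the URL inside each link. The following is a minimal sketch rather than part of the original walkthrough: it parses a hand-copied fragment of the sample page, and the names `links_soup`, `links`, and `missing` are chosen for illustration.

```python
import bs4

# Inline copy of two links from the sample page, so no network request is needed.
html = (
    '<p class="story">'
    '<a href="http://example.com/boston" class="link" id="link1">Boston</a>, '
    '<a href="http://example.com/somerville" class="link" id="link3">Somerville</a>'
    '</p>'
)
links_soup = bs4.BeautifulSoup(html, "html.parser")

# tag["href"] reads an attribute like a dictionary key; get_text() returns the tag's text.
links = [(a.get_text(), a["href"]) for a in links_soup.find_all("a")]
print(links)

# .get() returns None instead of raising KeyError when the attribute is absent.
missing = links_soup.a.get("title")
print(missing)
```

Using `.get()` is the safer choice when some tags may lack the attribute you are reading.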

## Working with Arrays

The easiest way to access elements and then either write them to file or manipulate them is to save them as objects themselves. Note that our data is organized into cities and numbers. Let's save these to lists (Python's array-like type), which are the easiest way to work with the data.

The following gives us a list whose elements we can work with.

In [15]:
data = soup.find_all(attrs={'class':'city'})
print(data[0].string)
print(data[1].string)
print(data[2].string)
print(data[3].string)

Cambridge
Boston
Somerville
Brookline


In [17]:
# turn it into a loop
data = soup.find_all(attrs={'class':'city'})
for i in data:
    print(i.string)

Cambridge
Boston
Somerville
Brookline


This list only gives us cities, though. Let's get all of the data elements that have either class city or class number.

In [19]:
data = soup.find_all(attrs={'class':['city','number']})
print(data)

[<td class="city">Cambridge</td>, <td class="number">400</td>, <td class="city">Boston</td>, <td class="number">900</td>, <td class="city">Somerville</td>, <td class="number">300</td>, <td class="city">Brookline</td>, <td class="number">600</td>]


All of the data that was nested in these tags is now saved in a Python list. Access an element with data[x], where x is its position in the list. Python lists are indexed from 0, so the first element is data[0] and the eighth is data[7].

In [20]:
print(data[0])
print(data[1])

<td class="city">Cambridge</td>
<td class="number">400</td>


Right now, we get the whole element with those commands. To get just the content, use the following.

In [21]:
print(data[0].string)
print(data[1].string)

Cambridge
400
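Because cities and numbers alternate in the list, slicing with a step of 2 is one way to separate them and pair them back up. This is a sketch under an assumption: `values` is a hard-coded stand-in for the `.string` of each element in `data`, copied from the output above, and the names `cities`, `counts`, and `rows` are illustrative.

```python
# Stand-in for [td.string for td in data]; values copied from the printed output.
values = ["Cambridge", "400", "Boston", "900",
          "Somerville", "300", "Brookline", "600"]

cities = values[0::2]  # items at even positions: 0, 2, 4, 6
counts = values[1::2]  # items at odd positions: 1, 3, 5, 7

# Pair each city with its count and convert the count from text to an integer.
rows = [(city, int(count)) for city, count in zip(cities, counts)]

for city, count in rows:
    print(city, count)
```

Converting the counts to integers makes them usable for arithmetic, for example summing the incidents across cities.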


### Write Data to a File using a Simple Loop

Python makes opening a file and writing to it very easy. Let's take this simple dataset and write it to a file saved in our current working directory. An important note: relative file paths are resolved against the working directory, which by default is the directory you started Python from.

Python also has nice iteration features that allow us to iterate through lists, files, and other sequences. In the following example, we manually create a comma-separated document from our data using file-writing operations and a while loop.

In pseudo-code:

Open up a file to write in and append data.

Set up parameters for the while loop

Write headers

Run while loop that will write elements of the array to file

When complete, close the file

Once done, open the file on your machine and see your data. Enter the following code, and note what each line is doing.

In [22]:
f = open('rat_data.txt','a') # open file in append mode (created if missing; re-running adds more lines)

p = 0 # initial place in array
l = len(data)-1 # length of array minus one

f.write("City, Number\n") #write headers

while p < l: # while place is less than length
    f.write(data[p].string + ", ") # write city and add comma
    p = p + 1 # increment
    f.write(data[p].string + "\n") # write number and line break
    p = p + 1 # increment

f.close() # close file
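As an alternative to writing commas and line breaks by hand, Python's standard csv module can handle the formatting. The sketch below is not the tutorial's method: it assumes the city and number pairs have already been pulled out as plain strings (hard-coded here as `rows`), and it writes to a hypothetical file named rat_data.csv.

```python
import csv

# Stand-in for the (city, number) pairs extracted earlier; hard-coded for illustration.
rows = [("Cambridge", "400"), ("Boston", "900"),
        ("Somerville", "300"), ("Brookline", "600")]

# 'w' overwrites any existing file; newline='' lets the csv module control line endings.
with open("rat_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["City", "Number"])  # header row
    writer.writerows(rows)               # one line per (city, number) pair

# Read the file back to confirm what was written.
with open("rat_data.csv", newline="") as f:
    written = list(csv.reader(f))
print(written)
```

The with statement closes the file automatically, and opening in 'w' mode avoids the duplicated headers that append mode produces when the cell is run more than once.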