# Let's scrape a practice table

The latest Mountain Goats album is called [Goths](https://pitchfork.com/reviews/albums/23153-goths/). (It's good!) I made a simple HTML table with the track listing -- let's scrape it into a CSV.

### Import the modules we'll need

In [1]:
from bs4 import BeautifulSoup
import csv

### Read in the file, see what we're working with

We'll use the file object's `read()` method to get the contents of the file as a single string.

In [5]:
# in a with block, open the HTML file
with open('mountain-goats.html', 'r') as html_file:
    
    # .read() the contents of the file -- it'll be a string
    html_code = html_file.read()

    # print the string to see what's there
    print(html_code)

<html>
<table id="empty-table-to-throw-you-off"></table>
<table class="song-table" id="my-cool-table" style="width: 95%;">
  <thead>
    <tr>
      <th>Track Number</th>
      <th>Song Title</th>
      <th>Duration</th>
      <th>Artist</th>
      <th>Album</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>Rain in Soho</td>
      <td>4:47</td>
      <td>The Mountain Goats</td>
      <td>Goths</td>
    </tr>
    <tr>
      <td>2</td>
      <td>Andrew Eldritch is Moving Back to Leeds</td>
      <td>4:19</td>
      <td>The Mountain Goats</td>
      <td>Goths</td>
    </tr>
    <tr>
      <td>3</td>
      <td>The Grey King and the Silver Flame Attunement</td>
      <td>4:55</td>
      <td>The Mountain Goats</td>
      <td>Goths</td>
    </tr>
    <tr>
      <td>4</td>
      <td>We Do it Different on the West Coast</td>
      <td>5:21</td>
      <td>The Mountain Goats</td>
      <td>Goths</td>
    </tr>
    <tr>
      <td>5</td>
      <td>Unicorn Tolerance</td>
      <

### Parse the table with BeautifulSoup

Right now, Python isn't interpreting our table as _data_ -- it's just a string. We need to use BeautifulSoup to parse that string into data objects that Python can understand. Once the string is parsed, we'll be working with a "tree" of data that we can navigate.

In [6]:
with open('mountain-goats.html', 'r') as html_file:
    html_code = html_file.read()
    
    # use the type() function to see what kind of object `html_code` is
    print(type(html_code))
    
    # feed the file's contents (the string of HTML) to BeautifulSoup
    # BeautifulSoup will emit a warning if you don't specify a parser
    soup = BeautifulSoup(html_code, 'html.parser')

    # use the type() function to see what kind of object `soup` is
    print(type(soup))

<class 'str'>
<class 'bs4.BeautifulSoup'>
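
To make the "tree" idea concrete, here is a minimal sketch of navigating a parsed document. The `snippet` string is a made-up, one-row stand-in for the real file, and the variable names are just for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet (not the real file) to show tree navigation
snippet = '<table><tr><td>1</td><td>Rain in Soho</td></tr></table>'
soup = BeautifulSoup(snippet, 'html.parser')

# Dotted access walks down to the first matching descendant tag
first_cell = soup.table.tr.td
print(first_cell.name)     # the tag name: td
print(first_cell.string)   # the text inside the tag: 1

# Every tag knows its parent, so you can move back up the tree as well
print(first_cell.parent.name)   # tr
```

Dotted access only ever returns the first match, which is why the later cells lean on `find()` and `find_all()` instead.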


### Decide how to target the table

BeautifulSoup has several methods for targeting elements -- by position on the page, by attribute, etc. Right now we just want to find the correct table.
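
As a rough sketch of those options, the snippet below targets the same table four ways on a cut-down, made-up copy of the HTML. It uses keyword arguments (`class_`, `id`) and a CSS selector via `select_one()`, which are alternatives to the dictionary form used in the next cell rather than anything this tutorial requires.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for mountain-goats.html (assumed structure, one row only)
snippet = '''
<table id="empty-table-to-throw-you-off"></table>
<table class="song-table" id="my-cool-table" style="width: 95%;">
  <tr><td>1</td><td>Rain in Soho</td></tr>
</table>
'''
soup = BeautifulSoup(snippet, 'html.parser')

# by position: the second table on the page
by_position = soup.find_all('table')[1]

# by class: `class` is a reserved word in Python, so the keyword is `class_`
by_class = soup.find('table', class_='song-table')

# by ID, as a keyword argument
by_id = soup.find(id='my-cool-table')

# by CSS selector
by_css = soup.select_one('table#my-cool-table')

# each approach lands on the same table
for table in (by_position, by_class, by_id, by_css):
    print(table['id'])
```

Matching on the `style` attribute works too, but it compares the whole attribute string, so it would stop matching if the page's inline styles changed at all. An ID is usually the sturdier hook when one exists.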

In [8]:
with open('mountain-goats.html', 'r') as html_file:
    html_code = html_file.read()
    soup = BeautifulSoup(html_code, 'html.parser')
    
    # by position on the page
    # find_all returns a list of matching elements, and we want the second ([1]) one
    # song_table = soup.find_all('table')[1]
    
    # by class name
    # => with `find`, you can pass a dictionary of element attributes to match on
    # song_table = soup.find('table', {'class': 'song-table'})
    
    # by ID
    # song_table = soup.find('table', {'id': 'my-cool-table'})
    
    # by style
    song_table = soup.find('table', {'style': 'width: 95%;'})
    
    print(song_table)

<table class="song-table" id="my-cool-table" style="width: 95%;">
<thead>
<tr>
<th>Track Number</th>
<th>Song Title</th>
<th>Duration</th>
<th>Artist</th>
<th>Album</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Rain in Soho</td>
<td>4:47</td>
<td>The Mountain Goats</td>
<td>Goths</td>
</tr>
<tr>
<td>2</td>
<td>Andrew Eldritch is Moving Back to Leeds</td>
<td>4:19</td>
<td>The Mountain Goats</td>
<td>Goths</td>
</tr>
<tr>
<td>3</td>
<td>The Grey King and the Silver Flame Attunement</td>
<td>4:55</td>
<td>The Mountain Goats</td>
<td>Goths</td>
</tr>
<tr>
<td>4</td>
<td>We Do it Different on the West Coast</td>
<td>5:21</td>
<td>The Mountain Goats</td>
<td>Goths</td>
</tr>
<tr>
<td>5</td>
<td>Unicorn Tolerance</td>
<td>5:25</td>
<td>The Mountain Goats</td>
<td>Goths</td>
</tr>
<tr>
<td>6</td>
<td>Stench of the Unburied</td>
<td>4:30</td>
<td>The Mountain Goats</td>
<td>Goths</td>
</tr>
<tr>
<td>7</td>
<td>Wear Black</td>
<td>4:11</td>
<td>The Mountain Goats</td>
<td>Goths</td>
</tr>
<tr

### Looping over the table rows

Let's print a list of track numbers and song titles. Look at the structure of the table -- a table has rows represented by the tag `tr`, and within each row there are cells represented by `td` tags. The `find_all()` method returns a list. And we know how to iterate over lists: with a for loop. Let's do that.

In [15]:
with open('mountain-goats.html', 'r') as html_file:
    html_code = html_file.read()
    soup = BeautifulSoup(html_code, 'html.parser')
    song_table = soup.find('table', {'style': 'width: 95%;'})
    
    # find the rows in the table
    # slice to skip the header row
    song_rows = song_table.find_all('tr')[1:]
    
    # loop over the rows
    for row in song_rows:

        # get the table cells in the row
        song = row.find_all('td')
        
        # assign them to variables
        track, title, duration, artist, album = song
        
        # use the .string attribute to get the text in the cell
        print(track.string, title.string)

1 Rain in Soho
2 Andrew Eldritch is Moving Back to Leeds
3 The Grey King and the Silver Flame Attunement
4 We Do it Different on the West Coast
5 Unicorn Tolerance
6 Stench of the Unburied
7 Wear Black
8 Paid in Cocaine
9 Rage of Travers
10 Shelved
11 For the Portuguese Goths Metal Bands
12 Abandoned Flesh
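
One thing to watch for with `.string`: it only returns text when a cell holds a single piece of content. The sketch below uses a hypothetical row (not from our file) whose title cell contains an extra `<em>` tag, and shows `get_text()` as a fallback.

```python
from bs4 import BeautifulSoup

# Hypothetical row where the title cell holds extra markup
snippet = '<table><tr><td>1</td><td><em>Rain</em> in Soho</td></tr></table>'
soup = BeautifulSoup(snippet, 'html.parser')

track, title = soup.find('tr').find_all('td')

# a cell with plain text: .string works as expected
print(track.string)

# a cell with a tag plus loose text: .string gives up and returns None
print(title.string)

# get_text() joins all the text inside the cell, however it's nested
print(title.get_text().strip())
```

Our practice table only has plain text in each cell, so `.string` is fine here. On messier real-world pages, `get_text()` tends to be the safer choice.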


### Write data to file

Let's put it all together and open a file to write the data to.

In [16]:
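# newline='' lets the csv module handle line endings itself (avoids blank rows on Windows)
with open('mountain-goats.html', 'r') as html_file, open('mountain-goats.csv', 'w', newline='') as outfile: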
    html_code = html_file.read()
    soup = BeautifulSoup(html_code, 'html.parser')
    song_table = soup.find('table', {'style': 'width: 95%;'})
    
    song_rows = song_table.find_all('tr')[1:]
    
    # set up a writer object
    writer = csv.DictWriter(outfile, fieldnames=['track', 'title', 'duration', 'artist', 'album'])
    
    writer.writeheader()
    
    for row in song_rows:

        # get the table cells in the row
        song = row.find_all('td')
        
        # assign them to variables
        track, title, duration, artist, album = song
        
        # write out the dictionary to file
        writer.writerow({
            'track': track.string,
            'title': title.string,
            'duration': duration.string,
            'artist': artist.string,
            'album': album.string
        })
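
To check that the output looks right, you could read the CSV back in with `csv.DictReader`. This is a self-contained sketch: it writes one row to an in-memory buffer (`io.StringIO`) as a stand-in for `mountain-goats.csv`, then reads it back.

```python
import csv
import io

fieldnames = ['track', 'title', 'duration', 'artist', 'album']

# in-memory stand-in for the CSV file on disk
buffer = io.StringIO(newline='')

writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerow({
    'track': '1',
    'title': 'Rain in Soho',
    'duration': '4:47',
    'artist': 'The Mountain Goats',
    'album': 'Goths'
})

# rewind to the start of the buffer and read the rows back as dictionaries
buffer.seek(0)
rows = list(csv.DictReader(buffer))

print(len(rows))
print(rows[0]['title'], rows[0]['duration'])
```

Note that everything comes back as a string, including the track number, so you'd need `int(rows[0]['track'])` if you wanted to sort or do math on it.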