# Scrape Texas death row inmates, part 2

The table we scraped in the last notebook _probably_ could have been imported directly into Excel without too much trouble. But what if you also wanted to append a few columns of information from each inmate's detail page?

In this section, we're going to supplement the scraper we just wrote with a _function_ that extracts data from inmates' detail pages. We're also going to use the `time.sleep` function from Python's standard library to pause for a couple of seconds after each row to give the government's servers a break.

First, let's import the libraries we'll need.

In [None]:
import requests
from bs4 import BeautifulSoup
import csv
import time

## Let's write a function

We need a function that will take a URL of a detail page and do these things:

- Open the detail page URL using `requests`
- Parse the contents using `BeautifulSoup`
- Isolate the bits of information we're interested in: height, weight, eye color, hair color, native county, native state, link to mugshot
- Return those bits of information to the script that called the function -- let's use a dictionary

We shall call our function `inmateDetails()`.

In [None]:
def inmateDetails(url):

    # create a dictionary with some default values
    # as we go through, we're going to add stuff to it, then *return* it
    # (if you want to explore further, there is actually
    # a special kind of dictionary called a "defaultdict" to
    # handle this use case) =>
    # https://docs.python.org/3/library/collections.html#collections.defaultdict
    out_dict = {
        'Height': None,
        'Weight': None,
        'Eye Color': None,
        'Hair Color': None,
        'Native County': None,
        'Native State': None,
        'mug': None
    }
    
    # partway down the page, the links go to JPGs instead of HTML pages
    # we can't parse images, so we'll just return the empty dictionary
    if not url.endswith('.html'):
        return out_dict
    
    # get the page
    r = requests.get(url)
    
    # soup the HTML
    soup = BeautifulSoup(r.text, 'html.parser')

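    # find the table of info
    table = soup.find('table', {'class': 'tabledata_deathrow_table'})

    # if the page doesn't have the table we expect, return the defaults
    # instead of crashing on the lookups below
    if table is None:
        return out_dict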
    
    # target the mugshot, if it exists
    mug = table.find('img', {'class': 'photo_border_black_right'})
    
    # if there is a mug, grab the src and add it to the dictionary
    if mug:
        out_dict['mug'] = 'http://www.tdcj.state.tx.us/death_row/dr_info/' + mug['src']

    # get a list of the "label" cells
    # on some pages, they're identified by the class 'tabledata_bold_align_right_deathrow'
    # on others, they're identified by the class 'tabledata_bold_align_right_unit'
    # so we pass it a list of possible classes
    label_cells = table.find_all('td', {'class': ['tabledata_bold_align_right_deathrow',
                                                  'tabledata_bold_align_right_unit']})
    
    # a list of the things we're interested in -- should match exactly the text of the cells
    attr_list = ['Height', 'Weight', 'Eye Color', 'Hair Color', 'Native County', 'Native State']

    # loop over the list of label cells we targeted earlier
    for cell in label_cells:
        
        clean_label_cell_text = cell.text.strip()
        
        # check to see if the cell text is in our list of attributes
        if clean_label_cell_text in attr_list:
            
            # if so, find the value -- go up to the tr and search for the other td --
            # and add that attribute to our dictionary
            value_cell_text = cell.parent.find('td', {'class': 'tabledata_align_left_deathrow'}).text.strip()
            out_dict[clean_label_cell_text] = value_cell_text

    # return the dictionary to the script
    return out_dict
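
The comment at the top of the function mentions `defaultdict`. As a side sketch (not part of the scraper, and using a made-up height value), this is roughly how it differs from a regular dictionary: a missing key gets a default value from a factory function instead of raising a `KeyError`.

```python
from collections import defaultdict

# a regular dictionary raises a KeyError for a key that was never set
plain = {'Height': None}
try:
    plain['Weight']
except KeyError:
    print('plain dict: no such key')

# a defaultdict calls its factory function to fill in any missing key;
# here the factory returns None, like the defaults in inmateDetails()
details = defaultdict(lambda: None)
details['Height'] = '5 ft 10 in'  # made-up value for illustration

height = details['Height']
weight = details['Weight']  # never set, so the factory supplies None

print(height, weight)
```

One trade-off: spelling out the keys in a regular dictionary, as the function does, documents exactly which fields to expect, which is arguably easier to follow in a small scraper.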

OK, now we have our function. Let's drop it in the scraper we wrote for the last session.

First, let's get back to the part where we have the rows of the table stored as a variable:

In [None]:
url = 'http://www.tdcj.state.tx.us/death_row/dr_offenders_on_dr.html'

r = requests.get(url)

soup = BeautifulSoup(r.text, 'html.parser')

dr_table = soup.find('table', {'class': 'os'})

dr_rows = dr_table.find_all('tr')[1:]
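
The links in this table, like the mugshot `src` values on the detail pages, are relative paths, and the code in this notebook joins them onto a base URL with string concatenation. As an alternative sketch, `urljoin` from the standard library resolves a relative path against the page it came from. The `href` and `src` values below are hypothetical stand-ins, not real entries from the site.

```python
from urllib.parse import urljoin

# the page the table lives on
list_url = 'http://www.tdcj.state.tx.us/death_row/dr_offenders_on_dr.html'

# hypothetical relative link, standing in for cols[1].a['href']
relative_href = 'dr_info/doejohn.html'

# resolve the link relative to the list page
detail_url = urljoin(list_url, relative_href)

# hypothetical image path, standing in for mug['src'] on the detail page
mug_url = urljoin(detail_url, 'doejohn.jpg')

print(detail_url)
print(mug_url)
```

The appeal of this approach is that it keeps working if a link turns out to be an absolute URL, in which case `urljoin` returns it unchanged.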

Now we're going to loop over the rows of the table again as we write to a file -- let's call it 'tx-death-row-with-details.csv' -- but this time, we're _also_ going to call the function we just wrote, `inmateDetails`, to grab some details from the detail page.

The details will be returned as a dictionary, and we'll add those values to the list that we write out to the file, alongside the link to the detail page.

_Furthermore_, because we're adding an HTTP request to every loop iteration, we're going to use `time.sleep` to pause for two seconds at the end of each iteration.
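
As a minimal sketch of that pause in isolation: the `items` list and the very short `pause` value are placeholders chosen so the example finishes quickly, and no requests are made.

```python
import time

items = ['first', 'second', 'third']  # placeholder for the table rows
pause = 0.05  # seconds; the scraper uses a much longer pause

start = time.monotonic()

for item in items:
    print('Working on', item)

    # wait before moving on to the next item
    time.sleep(pause)

elapsed = time.monotonic() - start
print('Finished in about', round(elapsed, 2), 'seconds')
```

The scraper cell applies the same idea at the end of each pass through the rows.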

In [None]:
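# newline='' is what the csv module's documentation recommends for output files
with open('tx-death-row-with-details.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)

    # the headers need to line up, one for one, with the values
    # we write out for each row below -- including the detail link
    headers = ['id', 'detail_link', 'last', 'first', 'dob', 'sex', 'race',
               'admission_date', 'county', 'offense_date',
               'height', 'weight', 'eye_color', 'hair_color',
               'native_county', 'native_state']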
    
    writer.writerow(headers)
    
    for row in dr_rows:
        cols = row.find_all('td')

        id_number = cols[0].text
        last_name = cols[2].text        
        first_name = cols[3].text

        print('Scraping data for', first_name, last_name)
        
        dob = cols[4].text
        sex = cols[5].text
        race = cols[6].text
        date_received = cols[7].text
        county = cols[8].text
        date_offense = cols[9].text

        detail_link = 'http://www.tdcj.state.tx.us/death_row/' + cols[1].a['href']
        details = inmateDetails(detail_link)
        
        height = details['Height']
        weight = details['Weight']
        eye_color = details['Eye Color']
        hair_color = details['Hair Color']
        native_county = details['Native County']
        native_state = details['Native State']
        
        writer.writerow([id_number, detail_link, last_name, first_name, dob, sex,
                         race, date_received, county, date_offense, height, weight,
                         eye_color, hair_color, native_county, native_state])

        time.sleep(2)

    print('')
    print('Done!')
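
Because `inmateDetails` returns a dictionary, another option is to write rows with `csv.DictWriter`, which matches values to headers by key rather than by position. This is a sketch under assumptions, not the notebook's method: the inmate values are invented, the column list is shortened, and the output goes to an in-memory buffer instead of a file.

```python
import csv
import io

# shortened, illustrative set of columns
fieldnames = ['id', 'last', 'first', 'Height', 'Weight']

# invented values standing in for one table row ...
table_cols = {'id': '000001', 'last': 'Doe', 'first': 'John'}

# ... and for the dictionary returned by inmateDetails()
details = {'Height': '5 ft 10 in', 'Weight': None}

# merge the two dictionaries into a single row
row = {**table_cols, **details}

# write to an in-memory buffer so the example leaves no file behind
outfile = io.StringIO()
writer = csv.DictWriter(outfile, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(row)

csv_text = outfile.getvalue()
print(csv_text)
```

With this approach the header row and the data rows come from the same `fieldnames` list, so a value cannot silently end up under the wrong column; `None` values are written as empty cells.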