# Webscraper
#### TX Death Row Last Words
#### Kwadwo Alfajiri Shah
#### 9 July 2024

Web scraping here involves two main tasks:
* Scraping the main table of [TX Death Row Information](https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html), which includes identifying information for prisoners and links to personal details and last words.
* Programmatically following each link in the main table and scraping the prisoner details and last words from every linked page.
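The links collected from the main table are relative paths such as `dr_info/gonzalesramiro.html`, so the second task needs them turned into full URLs first. A minimal sketch of that step, assuming the links resolve against the main table's URL (that is, the page sets no `<base>` tag); `to_absolute` is a hypothetical helper name, not part of this notebook:

```python
from urllib.parse import urljoin

# URL of the main table, as used in this notebook.
BASE_URL = 'https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html'


def to_absolute(href, base_url=BASE_URL):
    """Resolve a link from the main table against the page URL (hypothetical helper)."""
    # urljoin drops the last path segment of base_url and appends href;
    # an href that is already absolute is returned unchanged.
    return urljoin(base_url, href)


print(to_absolute('dr_info/gonzalesramiro.html'))
```

Using `urljoin` rather than string concatenation keeps the result correct if a link ever turns out to be absolute or root-relative.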

In [58]:
import requests
import bs4
import pandas as pd

### 1. Scraping the main table

#### Access webpage

In [3]:
## set the target URL and request headers
url = 'https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'}

In [5]:
## access
# verify=True raises an SSL error for this site. Skipping TLS verification is not best practice, but it is used here as a workaround.
status = requests.get(url, headers=headers, verify=False)
status



<Response [200]>

In [6]:
status.close()

#### Parse results

In [67]:
## parse
soup = bs4.BeautifulSoup(status.text, features = 'html.parser')
html_table = soup.find_all('tr')

#### Transform

Get column names:

In [77]:
columns_html = html_table.pop(0)
columns_html

In [81]:
## create list of col names
columns = []

for tag in columns_html.find_all('th'):
    column = tag.text
    columns.append(column)

# custom names: the webpage labels both of these columns simply 'Link'
columns[1] = "Inmate Information"
columns[2] = "Last Words"


columns

['Execution',
 'Inmate Information',
 'Last Words',
 'Last Name',
 'First Name',
 'TDCJNumber',
 'Age',
 'Date',
 'Race',
 'County']

Create data table:

In [82]:
## create list of table values
info_list = []

for row in html_table:
    row_values = []
    for cell in row.find_all('td'):
        if cell.a is None:
            cell_value = cell.text
        else:
            cell_value = cell.a['href']
        row_values.append(cell_value)
    info_list.append(row_values)
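Because every `tr` is processed, a row without `td` cells would produce an empty list, and any row whose length differs from the number of column names would make the DataFrame construction fail. A small sketch of a sanity check that could run before conversion; `find_ragged_rows` is a hypothetical helper, and the second sample row is made up to show a mismatch:

```python
def find_ragged_rows(rows, n_columns):
    """Return (index, length) pairs for rows whose cell count differs from n_columns."""
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != n_columns]


sample_rows = [
    # A complete row, copied from the main table.
    ['588', 'dr_info/gonzalesramiro.html', 'dr_info/gonzalesramirolast.html',
     'Gonzales', 'Ramiro', '999513', '41', '6/26/2024', 'Hispanic', 'Medina'],
    # A made-up truncated row, for illustration only.
    ['587', 'dr_info/cantuivan.html'],
]

print(find_ragged_rows(sample_rows, 10))
```

In the notebook this would be called as `find_ragged_rows(info_list, len(columns))`; an empty result means every row lines up with the column names.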

In [83]:
## convert list to df
info = pd.DataFrame(info_list, columns=columns)
info

| | Execution | Inmate Information | Last Words | Last Name | First Name | TDCJNumber | Age | Date | Race | County |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 588 | dr_info/gonzalesramiro.html | dr_info/gonzalesramirolast.html | Gonzales | Ramiro | 999513 | 41 | 6/26/2024 | Hispanic | Medina |
| 1 | 587 | dr_info/cantuivan.html | dr_info/cantuivanlast.html | Cantu | Ivan | 999399 | 50 | 2/28/2024 | Hispanic | Collin |
| 2 | 586 | dr_info/renteriadavid.html | dr_info/renteriadavidlast.html | Renteria | David | 999460 | 53 | 11/16/2023 | Other | El Paso |
| 3 | 585 | dr_info/brewer.jpg | dr_info/brewerbrentlast.html | Brewer | Brent | 999000 | 53 | 11/9/2023 | White | Randall |
| 4 | 584 | dr_info/murphyjedidiah.html | dr_info/murphyjedidiahlast.html | Murphy | Jedidiah | 999392 | 48 | 10/10/2023 | White | Dallas |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 583 | 5 | dr_info/skillerndoyle.jpg | dr_info/skillerndoylelast.html | Skillern | Doyle | 518 | 49 | 01/16/1985 | White | Lubbock |
| 584 | 4 | dr_info/barefootthomas.jpg | dr_info/barefootthomaslast.html | Barefoot | Thomas | 621 | 39 | 10/30/1984 | White | Bell |
| 585 | 3 | dr_info/obryanronald.jpg | dr_info/obryanronaldlast.html | O'Bryan | Ronald | 529 | 39 | 03/31/1984 | White | Harris |
| 586 | 2 | dr_info/autryjames.html | dr_info/no_last_statement.html | Autry | James | 670 | 29 | 03/14/1984 | White | Jefferson |
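The rows above show that not every link leads to a scrapeable HTML page: some Inmate Information links point to `.jpg` files, and at least one Last Words link points to a shared `no_last_statement.html` page. A sketch of how the links might be sorted before the second scraping task, assuming the file name alone is enough to tell these cases apart; `link_kind` and its return labels are hypothetical:

```python
import posixpath

# File name that appears in the table for an inmate with no recorded statement.
NO_STATEMENT_PAGE = 'no_last_statement.html'


def link_kind(href):
    """Classify a link from the main table by its file name (hypothetical helper)."""
    name = posixpath.basename(href)
    if name == NO_STATEMENT_PAGE:
        return 'no_statement'
    ext = posixpath.splitext(name)[1].lower()
    if ext == '.html':
        return 'html'
    if ext in ('.jpg', '.jpeg'):
        return 'image'
    return 'other'


for href in ['dr_info/cantuivan.html', 'dr_info/brewer.jpg',
             'dr_info/no_last_statement.html']:
    print(href, '->', link_kind(href))
```

With the DataFrame in hand, something like `info['Inmate Information'].map(link_kind)` would label each row, so that image links and the shared placeholder page can be handled separately from ordinary HTML pages.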


In [84]:
## save
info.to_csv('../data/executed_inmates.csv', index=False)