# Scraping and Analysing the Texas Death Row Database

Conducted by Dhruv Rachakonda <br>
University of Utah, Salt Lake City, Utah <br>
November 2024

## Introduction

The purpose of this exploratory data analysis (EDA) is to scrape the web tables published on the Texas Department of Criminal Justice website, which maintains an up-to-date list of executed inmates along with information about each execution, and then to analyse the results.

The following questions will be looked into:

<b>Which counties have the most executions?</b> <br> 
<b>What are the most common crimes committed by those on death row?</b> <br>
<b>What are the most common words/phrases in the last words?</b> <br>
<b>What are some common patterns among those who are executed?</b> <br>
<b>How does the number of executions vary by year?</b> <br>
<b>What are the primary reasons one is released from Death Row?</b> <br>
<b>What are the primary demographics of those released from Death Row?</b> 

Let us start off by importing the necessary libraries.

In [2]:
import requests
from bs4 import BeautifulSoup
from time import sleep
import pandas as pd
import numpy as np
import certifi
from random import randrange


## Reading in Data

We will be scraping the Texas Department of Criminal Justice Website: <br>

https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html (already executed) <br>
https://www.tdcj.texas.gov/death_row/dr_offenders_on_dr.html (on death-row, not executed yet) <br>
https://www.tdcj.texas.gov/death_row/dr_list_all_dr_1923-1973.html (death-row 1923-1973) <br>
https://www.tdcj.texas.gov/death_row/dr_offenders_no_longer_on_dr.html (commuted/removed from deathrow) <br>
https://www.tdcj.texas.gov/death_row/dr_citizenship.html (non-citizens)

We can start off by putting all of the raw HTML data into a variable and begin processing it. We need to send a browser-like User-Agent header with each request, or else the protection software of the website will reject our requests. The requests library will fetch the HTML of the page and return it to us as text.

Beautiful Soup then parses that HTML so that we can search it, for example by finding all of the data within a given tag.

### Phase 1 - Main Page

In [3]:
headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}

baseurl = "https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html"

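#verify=False skips TLS certificate verification, so an InsecureRequestWarning is emitted
req = requests.get(baseurl, headers = headers, verify=False)

#Name the parser explicitly so the result does not depend on which parsers are installed
texas_raw_data = BeautifulSoup(req.text, "html.parser")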




Each row here appears to be wrapped in a \<tr> tag, so we can use the find_all function to gather every tr tag on the page into a list.

In [4]:
texas_groups = texas_raw_data.find_all('tr')
print(len(texas_groups))
texas_groups[1] #First row for reference

592


<tr>
<td style="text-align: center">591</td>
<td style="text-align: center"><a href="dr_info/whitegarcia.jpg" title="Inmate Information for Garcia White">Inmate Information</a></td>
<td style="text-align: center"><a href="dr_info/whitegarcialast.html" title="Last Statement of Garcia White">Last Statement</a></td>
<td style="text-align: center">White</td>
<td style="text-align: center">Garcia</td>
<td style="text-align: center">999205</td>
<td style="text-align: center">61</td>
<td style="text-align: center">10/1/2024</td>
<td style="text-align: center">Black</td>
<td style="text-align: center"> Harris</td>
</tr>

We can also notice that in each row, every column value is separated by a td tag. Let us print out the contents of the td tags of the first data row. We're using index 1 here, since index 0 is just the header row. The strip function just removes white space around the edges of the string.

In [5]:
for group in texas_groups[1].find_all("td"):
    print(group.text.strip())
    

591
Inmate Information
Last Statement
White
Garcia
999205
61
10/1/2024
Black
Harris


Let us now try writing a parser for this entire page. I'm initializing a simple list to hold the rows of our final data frame.

On each iteration over a row, I store the contents of that row in a JSON-style payload, and then append that payload to the final Texas list.

In [6]:
#Array to store results
texas_data_array = []


for x in range(1, len(texas_groups)): #For each row
        
    line_number = -1
    
    execution_number = 0
    last_name = 0
    first_name = 0
    tdcj_number = 0
    age = 0
    date = 0
    race = 0
    county = 0
 
    #for each column in the row
    for group in texas_groups[x].find_all("td"):
        
        
        #the line number determines which attribute of the inmate we are looking at
        
        line_number = line_number + 1
        
        if(line_number == 0):
            execution_number = group.text.strip()
            
        elif(line_number == 1):
            continue

        elif(line_number == 2):
            continue

        elif(line_number == 3):
            last_name = group.text.strip()

        
        elif(line_number == 4):
            first_name = group.text.strip()

        
        elif(line_number == 5):
            tdcj_number = group.text.strip()

        
        elif(line_number == 6):
            age = group.text.strip()

        
        elif(line_number == 7):
            date = group.text.strip()

        
        elif(line_number == 8):
            race = group.text.strip()

        elif(line_number == 9):
            county = group.text.strip()
        
    
    #package it all
    row_payload = {'Execution Number': execution_number,
                   'Last Name': last_name,
                   'First Name': first_name,
                   'TDCJ Number': tdcj_number,
                   'Age Executed': age,
                   'Date Executed': date,
                   'Race': race,
                   'County of Offense': county
                  }
    
    texas_data_array.append(row_payload) #add it to the array
    

texas_df = pd.DataFrame(texas_data_array) #make it a dataframe
texas_df.head(10)

| | Execution Number | Last Name | First Name | TDCJ Number | Age Executed | Date Executed | Race | County of Offense |
|---|---|---|---|---|---|---|---|---|
| 0 | 591 | White | Garcia | 999205 | 61 | 10/1/2024 | Black | Harris |
| 1 | 590 | Mullis | Travis | 999563 | 38 | 9/24/2024 | White | Galveston |
| 2 | 589 | Burton | Arthur | 999283 | 44 | 8/7/2024 | Black | Harris |
| 3 | 588 | Gonzales | Ramiro | 999513 | 41 | 6/26/2024 | Hispanic | Medina |
| 4 | 587 | Cantu | Ivan | 999399 | 50 | 2/28/2024 | Hispanic | Collin |
| 5 | 586 | Renteria | David | 999460 | 53 | 11/16/2023 | Other | El Paso |
| 6 | 585 | Brewer | Brent | 999000 | 53 | 11/9/2023 | White | Randall |
| 7 | 584 | Murphy | Jedidiah | 999392 | 48 | 10/10/2023 | White | Dallas |
| 8 | 583 | Brown, Jr. | Arthur | 999110 | 52 | 3/9/2023 | Black | Harris |
| 9 | 582 | Green | Gary | 999561 | 51 | 3/7/2023 | Black | Dallas |
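
The column-by-column loop above can also be expressed as a mapping from column position to field name. The following is a minimal sketch of that idea, not the code that produced the table: `COLUMNS` and `row_to_payload` are hypothetical names, and the sample cells are the values printed for the first data row earlier.

```python
# Field names in table order; None marks the two link columns that are skipped.
COLUMNS = [
    "Execution Number", None, None, "Last Name", "First Name",
    "TDCJ Number", "Age Executed", "Date Executed", "Race", "County of Offense",
]

def row_to_payload(cells):
    """Map the text of one row's <td> cells to a payload dict, stripping whitespace."""
    return {name: text.strip() for name, text in zip(COLUMNS, cells) if name is not None}

# Cell texts of the first data row, as printed earlier (note the leading space in " Harris").
sample_cells = ["591", "Inmate Information", "Last Statement", "White", "Garcia",
                "999205", "61", "10/1/2024", "Black", " Harris"]

payload = row_to_payload(sample_cells)
print(payload["Last Name"], payload["County of Offense"])
```

With the scraped rows, the same helper could be applied as `[row_to_payload([td.text for td in tr.find_all("td")]) for tr in texas_groups[1:]]`. One difference from the loop above: a row with fewer cells simply yields fewer keys, rather than the placeholder value 0.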


### Phase 1.2: Last Statement

In this phase, we want to add the last words of each executed inmate to their respective row.

The State of Texas publishes each last statement at a URL of the following form:
https://www.tdcj.texas.gov/death_row/dr_info/lastnamefirstnamelast.html

For example:
https://www.tdcj.texas.gov/death_row/dr_info/brownarthurlast.html

We will go through each row and scrape the respective page.
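
The URL pattern above can be sketched as a small helper. This is only an illustration under stated assumptions: `clean_last_name` and `last_statement_url` are hypothetical names, the cleaning rules (drop anything after a comma, remove spaces, apostrophes and hyphens) follow the comments in the scraping loop, and lowercasing the result is an assumption based on the lowercase example URL.

```python
BASE_URL = "https://www.tdcj.texas.gov/death_row/dr_info/"

def clean_last_name(last_name):
    """Drop a suffix such as ', Jr.' and remove spaces, apostrophes and hyphens."""
    head, _, _ = last_name.partition(",")
    for ch in (" ", "'", "-"):
        head = head.replace(ch, "")
    return head

def last_statement_url(last_name, first_name):
    """Build the last-statement URL; lowercasing is an assumption, see the text above."""
    return BASE_URL + (clean_last_name(last_name) + first_name).lower() + "last.html"

print(last_statement_url("Brown, Jr.", "Arthur"))
# https://www.tdcj.texas.gov/death_row/dr_info/brownarthurlast.html
```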

In [7]:
for index, row in texas_df.iterrows():
    last_name = str(row['Last Name'])
    first_name = str(row["First Name"])
    
    
    #Texas adds a comma if there is a jr. at the end of a last name
    #However, the jr and sr is not included in the URL
    #Some last names also consist of two words, we strip the text there
    head, sep, tail = last_name.partition(',')
    last_name = head
    last_name = last_name.replace(" ", "")
    last_name = last_name.replace("\'", "")
    last_name = last_name.replace("-", "")

    
    
    # Compile it into the URL
    last_words_url = "https://www.tdcj.texas.gov/death_row/dr_info/" + last_name + first_name + 'last.html'
    #If the URL doesn't exist, then the inmate declined to give a last statement
    
    headers = {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
    }

    baseurl = last_words_url

    req = requests.get(baseurl, headers = headers, verify=False)
    
    #Name the parser explicitly so the result does not depend on which parsers are installed
    words_raw = BeautifulSoup(req.text, "html.parser")
        
    words_groups = words_raw.find_all('p') 
    
    
    #Grab all tags with last statement in it
    try:
        last_statement = None
        for i, element in enumerate(words_groups):
            if element.get_text(strip=True) == "Last Statement:":
                if i + 1 < len(words_groups):
                    last_statement = words_groups[i + 1].get_text(strip=True)
                break
    except:
        last_statement=None
  
    
    sleep(randrange(3))
        
    texas_df.at[index, "Last Statement"] = last_statement 

In [8]:
texas_df.head(10)

| | Execution Number | Last Name | First Name | TDCJ Number | Age Executed | Date Executed | Race | County of Offense | Last Statement |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 591 | White | Garcia | 999205 | 61 | 10/1/2024 | Black | Harris | Yes ma'am, first I would like to apologize for... |
| 1 | 590 | Mullis | Travis | 999563 | 38 | 9/24/2024 | White | Galveston | Yes Warden, I would like to thank everyone, al... |
| 2 | 589 | Burton | Arthur | 999283 | 44 | 8/7/2024 | Black | Harris | Yes. I want to say thank you to all the people... |
| 3 | 588 | Gonzales | Ramiro | 999513 | 41 | 6/26/2024 | Hispanic | Medina | Yes ma'am, to the Townsend Family, I'm sorry ... |
| 4 | 587 | Cantu | Ivan | 999399 | 50 | 2/28/2024 | Hispanic | Collin | I'd like to address the Kitchens and Mosqueda ... |
| 5 | 586 | Renteria | David | 999460 | 53 | 11/16/2023 | Other | El Paso | Yes. I would Warden I call upon peace. To the ... |
| 6 | 585 | Brewer | Brent | 999000 | 53 | 11/9/2023 | White | Randall | Yes Warden, I would like to tell the family of... |
| 7 | 584 | Murphy | Jedidiah | 999392 | 48 | 10/10/2023 | White | Dallas | Yes Warden, To the family of the victim I want... |
| 8 | 583 | Brown, Jr. | Arthur | 999110 | 52 | 3/9/2023 | Black | Harris | What is occurring here tonight is not justice... |
| 9 | 582 | Green | Gary | 999561 | 51 | 3/7/2023 | Black | Dallas | Vetta, Jared, Ray I’m sorry, no I’m not sorry... |


### Phase 1.3: Adding additional information such as crime committed, height, etc.
This information is found on a separate page and is not always available for each inmate, since some records were only documented on paper and uploaded as scans.

We use the same procedure as for the last statements.
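
Both phases share one step: find the element whose text equals a label, then take the text of the element that follows it. A minimal sketch of that step as a reusable helper is given here. `text_after_label` is a hypothetical name, and the sample strings are placeholders standing in for a list such as `[p.get_text(strip=True) for p in words_groups]`, not real page content.

```python
def text_after_label(texts, label):
    """Return the string that follows the first exact match of `label`, or None."""
    for i, text in enumerate(texts):
        if text == label:
            # The label may be the last element, in which case nothing follows it.
            return texts[i + 1] if i + 1 < len(texts) else None
    return None

# Placeholder strings for illustration only.
paragraphs = ["(other text)", "Last Statement:", "(statement text)"]

print(text_after_label(paragraphs, "Last Statement:"))  # (statement text)
print(text_after_label(paragraphs, "Date of Birth"))    # None
```

Because the helper returns None when the label is missing or has nothing after it, the caller does not need a try/except around the lookup.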

The State of Texas publishes each inmate information page at a URL of the following form:
https://www.tdcj.texas.gov/death_row/dr_info/lastnamefirstname.html

For some inmates, the record is a JPG image of a paper document. We cannot scrape those as of now.
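
Each table row already carries a link to the inmate's record (for example the href `dr_info/whitegarcia.jpg` in the first row printed earlier), so scanned records could be recognised by the link's extension before any request is made. The sketch below assumes the hrefs have already been read from the row, for instance with `a.get("href")` on the tags returned by `find_all("a")`. `resolve_info_link` is a hypothetical name.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html"

def resolve_info_link(href):
    """Return (absolute_url, is_html) for a relative link taken from a table row."""
    absolute = urljoin(PAGE_URL, href)
    return absolute, absolute.lower().endswith(".html")

# href values copied from the first table row shown earlier
print(resolve_info_link("dr_info/whitegarcia.jpg"))
print(resolve_info_link("dr_info/whitegarcialast.html"))
```

Rows whose second element is False point at an image and can be skipped, which also avoids one request per scanned record.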


In [20]:
for index, row in texas_df.iterrows(): #iterate over the scraped data frame
    last_name = str(row['Last Name'])
    first_name = str(row["First Name"])
    
    
    #Texas adds a comma if there is a jr. at the end of a last name
    #However, the jr and sr is not included in the URL
    #Some last names also consist of two words, we strip the text there
    head, sep, tail = last_name.partition(',')
    last_name = head
    last_name = last_name.replace(" ", "")
    last_name = last_name.replace("\'", "")
    last_name = last_name.replace("-", "")
    
    info_url = "https://www.tdcj.texas.gov/death_row/dr_info/" + last_name + first_name + '.html'
    
    headers = {
      "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
     }

    baseurl = info_url
    
    req = requests.get(baseurl, headers = headers, verify=False)
    
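    #Name the parser explicitly so the result does not depend on which parsers are installed
    inmate_raw_data = BeautifulSoup(req.text, "html.parser")
    
    words_groups_p = inmate_raw_data.find_all('p') 
    
    #Despite its name, this list holds every <td> cell on the page
    words_groups_tr = inmate_raw_data.find_all('td') 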
    
    try:
        dob = None
        for i, element in enumerate(words_groups_tr):
            if element.get_text(strip=True) == "Date of Birth":
                if i + 1 < len(words_groups_tr):
                    dob = words_groups_tr[i + 1].get_text(strip=True)
                break
    except:
        dob=None
        
    texas_df.at[index, "Date of Birth"] = dob    

169    12/22/1965
86     12/27/1975
180    12/28/1978
150      12/28/71
525       12/5/66
          ...    
584          None
585          None
586          None
587          None
588          None
Name: Date of Birth, Length: 450, dtype: object
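
The output above mixes four-digit years (12/22/1965) with two-digit years (12/28/71), and missing values appear as None. A minimal sketch of one way to normalise such values with the standard library follows. `parse_dob` and its `latest_year` parameter are hypothetical, and the cutoff of 2024 is an assumption taken from the date of this analysis.

```python
from datetime import datetime

def parse_dob(text, latest_year=2024):
    """Parse 'M/D/YYYY' or 'M/D/YY'; return a date, or None if missing or unparseable."""
    if not text:
        return None
    for fmt in ("%m/%d/%Y", "%m/%d/%y"):
        try:
            parsed = datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # try the next format
        # %y maps 00-68 to 2000-2068, so a birth year after latest_year is shifted back a century.
        if parsed.year > latest_year:
            parsed = parsed.replace(year=parsed.year - 100)
        return parsed
    return None

for raw in ["12/22/1965", "12/28/71", "12/5/66", None]:
    print(raw, "->", parse_dob(raw))
```

Without the century check, `12/5/66` would be read as the year 2066. The helper could be applied to the column with `texas_df["Date of Birth"].map(parse_dob)`.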