# Webscraper
#### TX Death Row Last Words
#### Kwadwo Alfajiri Shah
#### 9 July 2024

The web scraping involves two main tasks:
* Scraping the main table of [TX Death Row Information](https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html), which includes identifying information for prisoners and links to their personal details and last words.
* Programmatically following the links in the main table and scraping each prisoner's details and last words.

In [58]:
import requests
import bs4
import pandas as pd

### 1. Scraping the main table

#### Access webpage

In [205]:
## target URL and request headers
url = 'https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'}

In [206]:
## access
status = requests.get(url, headers = headers, verify = False) # verify = True raises an SSL error here; skipping TLS verification is not best practice and is used only as a workaround
status



<Response [200]>

In [207]:
status.close()

#### Parse results

In [208]:
## parse
soup = bs4.BeautifulSoup(status.text, features = 'html.parser')
html_table = soup.find_all('tr')

#### Transform

Get column names:

In [209]:
columns_html = html_table.pop(0)
columns_html

<tr>
<th scope="col" style="text-align: center">Execution</th>
<th scope="col" style="text-align: center; width: 16%">Link</th>
<th scope="col" style="text-align: center; width: 13%">Link</th>
<th scope="col" style="text-align: center">Last Name</th>
<th scope="col" style="text-align: center">First Name</th>
<th scope="col" style="text-align: center; width: 7%">TDCJ<br/>Number</th>
<th scope="col" style="text-align: center">Age</th>
<th scope="col" style="text-align: center">Date</th>
<th scope="col" style="text-align: center">Race</th>
<th scope="col" style="text-align: center">County</th>
</tr>

In [210]:
## create list of col names
columns = []

for tag in columns_html.find_all('th'):
    column = tag.text
    columns.append(column)

# (custom names - webpage names are both simply 'Link')
columns[1] = "Inmate Information"
columns[2] = "Last Words"


columns

['Execution',
 'Inmate Information',
 'Last Words',
 'Last Name',
 'First Name',
 'TDCJNumber',
 'Age',
 'Date',
 'Race',
 'County']
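The `'TDCJNumber'` entry has lost its space because the header cell separates the two words with a `<br/>` tag, which contributes no characters to `tag.text`. The following is a minimal sketch, using only the header cell shown in the output above, of how `get_text` with a separator could keep the words apart. The names `header_html`, `plain` and `spaced` are illustrative and not part of the notebook.

```python
import bs4

# Header cell copied from the table header printed earlier in this notebook.
header_html = '<th scope="col" style="text-align: center; width: 7%">TDCJ<br/>Number</th>'
tag = bs4.BeautifulSoup(header_html, features='html.parser').th

# <br/> adds no text, so .text runs the two words together.
plain = tag.text

# get_text joins the separate text pieces with the given separator instead.
spaced = tag.get_text(separator=' ', strip=True)

print(plain)
print(spaced)
```

Swapping `tag.text` for `tag.get_text(separator=' ', strip=True)` in the loop above would change the resulting column name, so any later code that refers to `'TDCJNumber'` would need the same update.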

Create data table:

In [213]:
## create list of table values
executed_list = []

for row in html_table:
    row_values = []
    for cell in row.find_all('td'):
        if cell.a is None:
            cell_value = cell.text
        else:
            cell_value = cell.a['href']
        row_values.append(cell_value)
    executed_list.append(row_values)

In [214]:
## convert list to df
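# `columns` is the list of names built above; the frame is called `info` because the later cells refer to it by that name
info = pd.DataFrame(executed_list, columns = columns)
info.head(10)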

| | Execution | Inmate Information | Last Words | Last Name | First Name | TDCJNumber | Age | Date | Race | County |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 588 | dr_info/gonzalesramiro.html | dr_info/gonzalesramirolast.html | Gonzales | Ramiro | 999513 | 41 | 6/26/2024 | Hispanic | Medina |
| 1 | 587 | dr_info/cantuivan.html | dr_info/cantuivanlast.html | Cantu | Ivan | 999399 | 50 | 2/28/2024 | Hispanic | Collin |
| 2 | 586 | dr_info/renteriadavid.html | dr_info/renteriadavidlast.html | Renteria | David | 999460 | 53 | 11/16/2023 | Other | El Paso |
| 3 | 585 | dr_info/brewer.jpg | dr_info/brewerbrentlast.html | Brewer | Brent | 999000 | 53 | 11/9/2023 | White | Randall |
| 4 | 584 | dr_info/murphyjedidiah.html | dr_info/murphyjedidiahlast.html | Murphy | Jedidiah | 999392 | 48 | 10/10/2023 | White | Dallas |
| 5 | 583 | dr_info/brownarthur.jpg | dr_info/brownarthurlast.html | Brown, Jr. | Arthur | 999110 | 52 | 3/9/2023 | Black | Harris |
| 6 | 582 | dr_info/greengary.html | dr_info/greengarylast.html | Green | Gary | 999561 | 51 | 3/7/2023 | Black | Dallas |
| 7 | 581 | dr_info/balentinejohn.html | dr_info/balentinejohnlast.html | Balentine | John | 999315 | 54 | 2/9/2023 | Black | Potter |
| 8 | 580 | dr_info/ruizwesley.html | dr_info/ruizwesleylast.html | Ruiz | Wesley | 999536 | 43 | 2/1/2023 | Hispanic | Dallas |
| 9 | 579 | dr_info/frattarobert.jpg | dr_info/frattarobertlast.html | Fratta | Robert | 999189 | 65 | 1/10/2023 | White | Harris |


In [84]:
## save
info.to_csv('../data/executed_inmates.csv', index=False)

### 2. Scraping prisoner info and last words

#### Scraping info

In [178]:
url_base = 'https://www.tdcj.texas.gov/death_row/'
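The links scraped from the main table are relative paths such as `dr_info/gonzalesramiro.html`, so they have to be combined with a base before they can be requested. Concatenating with `url_base` works for these paths. As an alternative sketch, the standard library's `urljoin` can resolve each href against the page it came from. The names `main_url` and `resolve_link` are hypothetical helpers, not part of the notebook.

```python
from urllib.parse import urljoin

# URL of the main table page, as used earlier in this notebook.
main_url = 'https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html'


def resolve_link(href, page_url=main_url):
    """Resolve an href scraped from the main table against the page URL."""
    return urljoin(page_url, href)


info_url = resolve_link('dr_info/gonzalesramiro.html')
print(info_url)
```

One reason to consider this approach is that `urljoin` leaves an already absolute href unchanged, whereas plain string concatenation would produce a malformed URL in that case.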

In [199]:
def scrape_info(path_name):
    '''Scrape prisoner information programmatically.

    Returns a dict mapping table labels to values, or None for non-HTML links.
    '''

    if not path_name.endswith('.html'):
        # some entries link to a .jpg image instead of an HTML page
        print(f'Invalid filetype {path_name}. Continuing...')
        return None

    url = url_base + path_name
    print(f'Scraping {url} ...')

    ## access
    # verify = True raises an SSL error here; skipping TLS verification is not best practice and is used only as a workaround
    status = requests.get(url, headers = headers, verify = False)
    status.close()

    if status.status_code != 200:
        raise ValueError(f'Invalid status code: {status.status_code}')

    ## parse
    soup = bs4.BeautifulSoup(status.text, features = 'html.parser')
    html_table = soup.find_all('tr')

    ## transform: the first text piece in each row is the label, the second is the value
    table_dict = {}
    for row in html_table:
        row_content = [x for x in row.text.replace('\xa0','').split('\n') if x]
        table_dict[row_content[0]] = row_content[1]

    return table_dict


In [202]:
# scraped_info = [scrape_info(pathname) for pathname in info['Inmate Information']]

# a for loop (rather than the list comprehension above) keeps the results collected so far if an error occurs partway through
info_list = []
for pathname in info['Inmate Information']:
    scraped_info = scrape_info(pathname)
    info_list.append(scraped_info)

Scraping https://www.tdcj.texas.gov/death_row/dr_info/gonzalesramiro.html ...
Scraping https://www.tdcj.texas.gov/death_row/dr_info/cantuivan.html ...
Scraping https://www.tdcj.texas.gov/death_row/dr_info/renteriadavid.html ...
Invalid filetype dr_info/brewer.jpg. Continuing...
Scraping https://www.tdcj.texas.gov/death_row/dr_info/murphyjedidiah.html ...
Invalid filetype dr_info/brownarthur.jpg. Continuing...
Scraping https://www.tdcj.texas.gov/death_row/dr_info/greengary.html ...

IndexError: list index out of range
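The run above stops at the first page that raises, so every later page is skipped. One way to keep going is to catch the error for each page and record it. The sketch below assumes a scraping function with the same call shape as `scrape_info`. The names `scrape_all` and `fake_scrape` are hypothetical, and `fake_scrape` is only a stand-in so the example runs without network access.

```python
def scrape_all(paths, scrape):
    """Apply `scrape` to every path, recording failures instead of stopping."""
    results, failures = [], []
    for path in paths:
        try:
            results.append(scrape(path))
        except (IndexError, ValueError) as err:
            # A None placeholder keeps `results` aligned with `paths`.
            results.append(None)
            failures.append((path, repr(err)))
    return results, failures


# Hypothetical stand-in for scrape_info: fails on one path, succeeds on the rest.
def fake_scrape(path):
    if path.endswith('bad.html'):
        raise IndexError('list index out of range')
    return {'Name': path}


results, failures = scrape_all(['a.html', 'bad.html', 'c.html'], fake_scrape)
print(results)
print(failures)
```

Keeping one entry per path, including the failed ones, means the results can later be lined up row by row with the main table.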

In [203]:
info_list

[{'Name': 'Gonzales, Ramiro',
  'TDCJ Number': '999513',
  'Date of Birth': '11/05/1982',
  'Date Received': '09/27/2006',
  'Age (when Received)': '23',
  'Education Level (Highest Grade Completed)': '7th Grade',
  'Date of Offense': '01/15/2001',
  'Age (at the time of Offense)': '18',
  'County': 'Medina',
  'Race': 'Hispanic',
  'Gender': 'Male',
  'Hair Color': 'Black',
  'Height (in Feet and Inches)': '5′ 2″',
  'Weight (in Pounds)': '136',
  'Eye Color': 'Brown',
  'Native County': 'Frio',
  'Native State': 'Texas'},
 {'Name': 'Cantu, Ivan Abner',
  'TDCJ Number': '999399',
  'Date of Birth': '06/14/1973',
  'Date Received': '11/08/2001',
  'Age (when    Received)': '28',
  'Education Level (Highest Grade Completed)': '12',
  'Date of Offense': '11/04/2000',
  'Age (at the time of Offense)': '27',
  'County': 'Collin',
  'Race': 'Hispanic',
  'Gender': 'Male',
  'Hair Color': 'Black',
  'Height (in Feet and Inches)': '5′ 7″',
  'Weight (in Pounds)': '176',
  'Eye Color': 'Brown'
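The two records above use slightly different labels for the same field: `'Age (when Received)'` in the first and `'Age (when    Received)'`, with extra spaces, in the second. If these dicts are later turned into a DataFrame, differing keys end up as separate columns. A minimal sketch of one possible fix is to collapse runs of whitespace in every key. The name `normalize_keys` is a hypothetical helper.

```python
def normalize_keys(record):
    """Collapse runs of whitespace in each key to a single space."""
    return {' '.join(key.split()): value for key, value in record.items()}


# Keys and values copied from the second record in the output above.
raw = {'Age (when    Received)': '28', 'TDCJ Number': '999399'}
clean = normalize_keys(raw)
print(clean)
```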

#### Starting with a single page (index 0)

In [86]:
url_base = 'https://www.tdcj.texas.gov/death_row/'
url = url_base + info['Inmate Information'][0]
url

'https://www.tdcj.texas.gov/death_row/dr_info/gonzalesramiro.html'

In [164]:
## access
status = requests.get(url, headers = headers, verify = False) # verify = True raises an SSL error here; skipping TLS verification is not best practice and is used only as a workaround
status



<Response [200]>

In [167]:
status.close()

In [97]:
## parse
soup = bs4.BeautifulSoup(status.text, features = 'html.parser')
html_table = soup.find_all('tr')

In [160]:
table_dict = {}

for row in html_table:
    row_content = [x for x in row.text.replace('\xa0','').split('\n') if x]
    table_dict[row_content[0]] = row_content[1]

table_dict

{'Name': 'Gonzales, Ramiro',
 'TDCJ Number': '999513',
 'Date of Birth': '11/05/1982',
 'Date Received': '09/27/2006',
 'Age (when Received)': '23',
 'Education Level (Highest Grade Completed)': '7th Grade',
 'Date of Offense': '01/15/2001',
 'Age (at the time of Offense)': '18',
 'County': 'Medina',
 'Race': 'Hispanic',
 'Gender': 'Male',
 'Hair Color': 'Black',
 'Height (in Feet and Inches)': '5′ 2″',
 'Weight (in Pounds)': '136',
 'Eye Color': 'Brown',
 'Native County': 'Frio',
 'Native State': 'Texas'}
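The loop above indexes `row_content[1]` directly, so any row that yields fewer than two non-empty pieces raises `IndexError: list index out of range`. That may be what stopped the full run, although the notebook does not show which row failed, so this is an assumption. The sketch below shows a more defensive version of the same split. The helper name `row_to_pair` and the sample row strings are illustrative, not scraped data.

```python
def row_to_pair(row_text):
    """Return (label, value) from a row's text, or None if it has fewer than two pieces."""
    pieces = [p.strip() for p in row_text.replace('\xa0', '').split('\n') if p.strip()]
    if len(pieces) < 2:
        return None
    return pieces[0], pieces[1]


# Illustrative row texts: a normal row, a row with an empty value, and a blank row.
rows = ['\nName\nGonzales, Ramiro\n', '\nNative County\n\xa0\n', '\n\n']
table = dict(pair for pair in map(row_to_pair, rows) if pair is not None)
print(table)
```

Unlike the original loop, this version also strips surrounding whitespace from each piece, which is a small behavioral change.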

In [162]:
table_dict2 = table_dict  # same dict reused as a second row, only to test the list-of-dicts to DataFrame conversion
table_dict_list = [table_dict, table_dict2]
test = pd.DataFrame(table_dict_list)
test

| | Name | TDCJ Number | Date of Birth | Date Received | Age (when Received) | Education Level (Highest Grade Completed) | Date of Offense | Age (at the time of Offense) | County | Race | Gender | Hair Color | Height (in Feet and Inches) | Weight (in Pounds) | Eye Color | Native County | Native State |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Gonzales, Ramiro | 999513 | 11/05/1982 | 09/27/2006 | 23 | 7th Grade | 01/15/2001 | 18 | Medina | Hispanic | Male | Black | 5′ 2″ | 136 | Brown | Frio | Texas |
| 1 | Gonzales, Ramiro | 999513 | 11/05/1982 | 09/27/2006 | 23 | 7th Grade | 01/15/2001 | 18 | Medina | Hispanic | Male | Black | 5′ 2″ | 136 | Brown | Frio | Texas |
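The test above uses two identical dicts, so every column is filled. With real pages the dicts may not all share the same keys. The sketch below shows how `pd.DataFrame` handles that case. The records are invented placeholders, not scraped values.

```python
import pandas as pd

# Placeholder records; the second one deliberately lacks 'Native County'.
records = [
    {'Name': 'Person A', 'Native County': 'County X'},
    {'Name': 'Person B'},
]
df = pd.DataFrame(records)
print(df)
```

The resulting frame has one column per distinct key, and a key missing from a record shows up as `NaN` in that row.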


In [153]:
tupList = [[('commentID', 'commentText', 'date'), ('123456', 'blahblahblah', '2019')], [('45678', 'hello world', '2018'), ('0', 'text', '2017')]]
tupList


[[('commentID', 'commentText', 'date'), ('123456', 'blahblahblah', '2019')],
 [('45678', 'hello world', '2018'), ('0', 'text', '2017')]]

In [None]:
[t for lst in tupList for t in lst]