# Approximating the number of Hispanic inmates

Loading the libraries to be used:

In [None]:
qtconsole

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import numpy as np

## Getting a list of common Hispanic names in the US

We use the same list as in the Polk County case (see the other notebook in this repo). It is a reduced version of the list published on Mongabay.com, with data taken from the 2010 US Census. We compare it against the inmates' last names to guess whether or not they are Hispanic.

In [2]:
common_names_list = np.loadtxt('common_names_list.txt', dtype=str, unpack=True)
print('Total number of names =', len(common_names_list))

Total number of names = 899
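
The comparison described above can be sketched as a simple surname lookup. This is a minimal sketch, not code from this notebook: `extract_last_name` and `has_common_hispanic_name` are hypothetical helpers, the names are assumed to arrive in `LAST, FIRST` form, and the small set below only stands in for `common_names_list`.

```python
def extract_last_name(full_name):
    """Return the upper-cased surname from a 'LAST, FIRST M' string."""
    return full_name.split(',')[0].strip().upper()


def has_common_hispanic_name(full_name, names):
    """True if the surname, or any part of a hyphenated surname, is in names."""
    last_name = extract_last_name(full_name)
    return any(part in names for part in last_name.split('-'))


# Stand-in for common_names_list (assumption: only a few entries are shown here)
sample_names = {'GARCIA', 'LOPEZ', 'TORRES'}

print(has_common_hispanic_name('LOPEZ, FRANCISCO J', sample_names))
print(has_common_hispanic_name('CARABALLO-TORRES, MIGUEL A', sample_names))
print(has_common_hispanic_name('SMITH, JOHN', sample_names))  # toy name
```

A match only says that the surname is on the list; it remains a guess about the person.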


## Trying with only 100 names

In [None]:
common_names_list = common_names_list[:100]
print('Total number of names =', len(common_names_list))

The next function builds the search URL for a given last name, with parameters such as 'Search Aliases = No' already set.

In [3]:
def make_url(last_name):
    return 'http://www.dc.state.fl.us/OffenderSearch/list.aspx?TypeSearch=AI&Page=List&DataAction=Filter&dcnumber=&LastName={}&FirstName=&SearchAliases=0&OffenseCategory=&CurrentLocation=&CountyOfCommitment=&photosonly=0&nophotos=1&matches=50'.format(last_name)
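
As an alternative sketch, the same query string could be assembled from a dictionary with `urllib.parse.urlencode`, which also escapes characters such as spaces in a last name. The function name `make_url_from_params` and the constant `BASE_URL` are assumptions introduced here for illustration; the parameter names and values are copied from `make_url`.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.dc.state.fl.us/OffenderSearch/list.aspx'


def make_url_from_params(last_name):
    """Build the search URL from a dict of query parameters."""
    params = {
        'TypeSearch': 'AI',
        'Page': 'List',
        'DataAction': 'Filter',
        'dcnumber': '',
        'LastName': last_name,
        'FirstName': '',
        'SearchAliases': 0,      # 'Search Aliases = No'
        'OffenseCategory': '',
        'CurrentLocation': '',
        'CountyOfCommitment': '',
        'photosonly': 0,
        'nophotos': 1,
        'matches': 50,           # results per page
    }
    return BASE_URL + '?' + urlencode(params)


print(make_url_from_params('GARCIA'))
```

For a plain single-word name this should produce the same URL as `make_url`; the dictionary form mainly makes individual parameters easier to read and change.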

Setting up the webdriver, which will simulate accessing the website in a Firefox browser.
We set it to headless so that no actual browser window is opened.

In [4]:
ffoptions = webdriver.firefox.options.Options()
ffoptions.headless = True
driver = webdriver.Firefox(options=ffoptions)
driver.implicitly_wait(10)

Setting up the list of tables which will be collected and then running the collection loop.

In [5]:
%%time
list_of_tables = []

for i, name in enumerate(common_names_list):
    print('i:',i,'; name=',name)
    # number of results per page, and the row number expected at the bottom of
    # the current page (used to check whether the page has finished loading)
    number_of_results = last_elem_page = 50
    try:
        driver.get(make_url(name))
        last_elem_total = driver.find_element_by_id('ctl00_ContentPlaceHolder1_lblgrdListPage')
        last_elem_total = int(last_elem_total.text[last_elem_total.text.find('of') + 3:])
    except Exception:
        # no results and no table found, therefore continue on to the next name
        continue

    while True:
        temp_table = pd.read_html(driver.page_source, attrs={'id':'ctl00_ContentPlaceHolder1_grdList'})[0]
        time.sleep(.5)
        if (str(last_elem_page) in temp_table.iloc[-1,0]) or (str(last_elem_total) in temp_table.iloc[-1,0]):
            #sometimes the next page won't load quickly enough and the same table is reloaded
            #this checks where we are in the collection and avoids duplicate tables
            list_of_tables.append(temp_table)
            last_elem_page += number_of_results
            try:
                #moving on to the next page of results.
                elem = driver.find_element_by_name('ctl00$ContentPlaceHolder1$btnListNext')
                elem.send_keys(Keys.RETURN)
            except Exception:
                # no more results; move on to the next name
                break

i: 0 ; name= GARCIA
i: 1 ; name= RODRIGUEZ
i: 2 ; name= HERNANDEZ
i: 3 ; name= MARTINEZ
i: 4 ; name= LOPEZ
i: 5 ; name= GONZALEZ
i: 6 ; name= PEREZ
i: 7 ; name= SANCHEZ
i: 8 ; name= RAMIREZ
i: 9 ; name= TORRES
i: 10 ; name= FLORES
i: 11 ; name= RIVERA
i: 12 ; name= GOMEZ
i: 13 ; name= DIAZ
i: 14 ; name= CRUZ
i: 15 ; name= MORALES
i: 16 ; name= REYES
i: 17 ; name= GUTIERREZ
i: 18 ; name= ORTIZ
i: 19 ; name= CHAVEZ
i: 20 ; name= RAMOS
i: 21 ; name= RUIZ
i: 22 ; name= MENDOZA
i: 23 ; name= ALVAREZ
i: 24 ; name= JIMENEZ
i: 25 ; name= CASTILLO
i: 26 ; name= VASQUEZ
i: 27 ; name= ROMERO
i: 28 ; name= MORENO
i: 29 ; name= GONZALES
i: 30 ; name= HERRERA
i: 31 ; name= AGUILAR
i: 32 ; name= MEDINA
i: 33 ; name= VARGAS
i: 34 ; name= CASTRO
i: 35 ; name= GUZMAN
i: 36 ; name= MENDEZ
i: 37 ; name= FERNANDEZ
i: 38 ; name= MUNOZ
i: 39 ; name= SALAZAR
i: 40 ; name= GARZA
i: 41 ; name= SOTO
i: 42 ; name= VAZQUEZ
i: 43 ; name= ALVARADO
i: 44 ; name= CONTRERAS
i: 45 ; name= DELGADO
i: 46 ; name= PENA
i: 4

i: 369 ; name= ALANIZ
i: 370 ; name= TOLEDO
i: 371 ; name= CORRAL
i: 372 ; name= TORREZ
i: 373 ; name= DELOSSANTOS
i: 374 ; name= ROBLEDO
i: 375 ; name= MONTANEZ
i: 376 ; name= BUSTOS
i: 377 ; name= ADAME
i: 378 ; name= PALMA
i: 379 ; name= VALDIVIA
i: 380 ; name= ABREU
i: 381 ; name= HOLGUIN
i: 382 ; name= FRIAS
i: 383 ; name= ACUNA
i: 384 ; name= VIDAL
i: 385 ; name= HENRIQUEZ
i: 386 ; name= MONTERO
i: 387 ; name= CARRERA
i: 388 ; name= BUENO
i: 389 ; name= CEBALLOS
i: 390 ; name= BATISTA
i: 391 ; name= DELGADILLO
i: 392 ; name= ESPINO
i: 393 ; name= PANTOJA
i: 394 ; name= OLIVA
i: 395 ; name= NEGRETE
i: 396 ; name= NARANJO
i: 397 ; name= APONTE
i: 398 ; name= ELIZONDO
i: 399 ; name= CHAPA
i: 400 ; name= SOLORZANO
i: 401 ; name= AQUINO
i: 402 ; name= URBINA
i: 403 ; name= GAYTAN
i: 404 ; name= RESENDIZ
i: 405 ; name= ALMANZA
i: 406 ; name= PUENTE
i: 407 ; name= TAMAYO
i: 408 ; name= MONROY
i: 409 ; name= TEJEDA
i: 410 ; name= ARMENTA
i: 411 ; name= TEJADA
i: 412 ; name= PERALES
i: 41

i: 730 ; name= DELTORO
i: 731 ; name= OROPEZA
i: 732 ; name= ARAIZA
i: 733 ; name= MONTEMAYOR
i: 734 ; name= MANZANARES
i: 735 ; name= DESANTIAGO
i: 736 ; name= PINO
i: 737 ; name= CARDOSO
i: 738 ; name= TENORIO
i: 739 ; name= CORONEL
i: 740 ; name= GALAN
i: 741 ; name= VERDUZCO
i: 742 ; name= PUGA
i: 743 ; name= DUQUE
i: 744 ; name= PALOMO
i: 745 ; name= CASAREZ
i: 746 ; name= ALEJANDRO
i: 747 ; name= VERDUGO
i: 748 ; name= PALOMARES
i: 749 ; name= BARCENAS
i: 750 ; name= MURO
i: 751 ; name= CARRIZALES
i: 752 ; name= CARRASQUILLO
i: 753 ; name= VELIZ
i: 754 ; name= RAYA
i: 755 ; name= LUIS
i: 756 ; name= VELARDE
i: 757 ; name= GASTELUM
i: 758 ; name= GARDUNO
i: 759 ; name= CARRENO
i: 760 ; name= PASCUAL
i: 761 ; name= BARELA
i: 762 ; name= DAVALOS
i: 763 ; name= WARD
i: 764 ; name= ARANGO
i: 765 ; name= URRUTIA
i: 766 ; name= BAILEY
i: 767 ; name= PINON
i: 768 ; name= PIZARRO
i: 769 ; name= ARRIOLA
i: 770 ; name= SORTO
i: 771 ; name= RECINOS
i: 772 ; name= BEJARANO
i: 773 ; name= LEIJ
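
The two checks used inside the collection loop can be isolated and tried without a browser. This is a minimal sketch under stated assumptions: `parse_total` and `is_expected_page` are hypothetical helpers, and the label text `'1 - 50 of 137'` is an assumed format (the loop only relies on the total following the word "of").

```python
import re


def parse_total(label_text):
    """Extract the integer that follows 'of' in the pager label."""
    match = re.search(r'of\s+([\d,]+)', label_text)
    if match is None:
        raise ValueError('no total found in: {!r}'.format(label_text))
    return int(match.group(1).replace(',', ''))


def is_expected_page(last_cell, last_elem_page, last_elem_total):
    """Same substring test as the loop: does the last row carry the number
    expected at the end of this page, or the overall total?"""
    return str(last_elem_page) in last_cell or str(last_elem_total) in last_cell


total = parse_total('1 - 50 of 137')          # assumed label format
print(total)
print(is_expected_page('*50', 50, total))     # first page fully loaded
print(is_expected_page('*50', 100, total))    # stale first page, second expected
print(is_expected_page('*137', 150, total))   # final, shorter page
```

Because this is a substring test, `'50'` would also match a cell such as `'*150'`; comparing the stripped number with `==` would be a stricter variant.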

We close the webdriver object and concatenate all DataFrames collected.

In [6]:
driver.quit()  # ends the whole browser session, not just the current window
hispanic_inmates = pd.concat(list_of_tables)
hispanic_inmates.reset_index(inplace=True)
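
A small sketch with toy data shows what the concatenation step does to the row labels. Each per-page table starts its index at 0, so `reset_index()` keeps those old labels in a new `index` column (which is why an `index` column appears in the `info()` output that follows), whereas `ignore_index=True` simply renumbers the rows. The frames `page_1` and `page_2` are made-up placeholders, not scraped data.

```python
import pandas as pd

# Toy stand-ins for two pages of results
page_1 = pd.DataFrame({'Name': ['A, ONE', 'B, TWO']})
page_2 = pd.DataFrame({'Name': ['C, THREE']})

# Approach used in this notebook: old per-page labels end up in an 'index' column
with_old_index = pd.concat([page_1, page_2]).reset_index()
print(with_old_index)

# Alternative: discard the per-page labels and renumber from 0
renumbered = pd.concat([page_1, page_2], ignore_index=True)
print(renumbered)
```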

Table inspection

In [7]:
hispanic_inmates.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12818 entries, 0 to 12817
Data columns (total 9 columns):
index                       12818 non-null int64
Click Number for Details    12818 non-null object
Name                        12818 non-null object
DC Number                   12818 non-null object
Race                        12818 non-null object
Sex                         12818 non-null object
Release Date                12813 non-null object
Current Facility            12818 non-null object
Birth Date                  12818 non-null object
dtypes: int64(1), object(8)
memory usage: 901.4+ KB


In [8]:
hispanic_inmates.sample(3)

| | index | Click Number for Details | Name | DC Number | Race | Sex | Release Date | Current Facility | Birth Date |
|---|---|---|---|---|---|---|---|---|---|
| 1454 | 35 | \*86 | LOPEZ, FRANCISCO J | M85744 | HISPANIC | MALE | 09/15/2024 | SANTA ROSA ANNEX | 12/03/1993 |
| 10485 | 9 | \*10 | BATISTA, RICARDO | 069939 | WHITE | MALE | 12/22/2024 | EVERGLADES RE-ENTRY | 12/09/1960 |
| 10914 | 14 | \*15 | CARABALLO-TORRES, MIGUEL A | X86526 | WHITE | MALE | 12/19/2020 | DADE C.I. | 08/29/1977 |


We can clean up this table a bit by dropping the columns we don't need.

In [9]:
hispanic_inmates = hispanic_inmates[['Name', 'Race', 'Sex']]

In [10]:
hispanic_inmates.sample(3)

| | Name | Race | Sex |
|---|---|---|---|
| 11303 | LUCAS, DELPHONE L | BLACK | MALE |
| 189 | GARCIA, OLDEMAR | WHITE | MALE |
| 12223 | MICHEL, JASLEN | BLACK | MALE |


## Distribution of race assignments

Even though most of the people in the table seem to be Hispanic (some may have a Hispanic name without being Hispanic), not all of them are labeled as such. The race categories in this database are:

In [11]:
hispanic_inmates['Race'].unique()

array(['WHITE', 'BLACK', 'HISPANIC', 'ALL OTHERS/UNKNOWN',
       'AMERICAN INDIAN OR PACIFIC ISL'], dtype=object)

The distribution of races in this database is shown in the following table.

In [12]:
race_distribution = hispanic_inmates.groupby(['Race']).count()[['Name']]
race_distribution.columns = ['Count']
race_distribution

| Race | Count |
|---|---|
| ALL OTHERS/UNKNOWN | 87 |
| AMERICAN INDIAN OR PACIFIC ISL | 6 |
| BLACK | 1508 |
| HISPANIC | 3896 |
| WHITE | 7321 |
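
The counts above can also be expressed as shares of the collected records. This is a minimal sketch that re-enters the counts from the table by hand; the names `counts` and `share` are introduced here only for illustration.

```python
import pandas as pd

# Counts copied from the race distribution table
counts = pd.Series(
    {
        'ALL OTHERS/UNKNOWN': 87,
        'AMERICAN INDIAN OR PACIFIC ISL': 6,
        'BLACK': 1508,
        'HISPANIC': 3896,
        'WHITE': 7321,
    },
    name='Count',
)

# Percentage of all collected records in each race category
share = (counts / counts.sum() * 100).round(1)
print(share.sort_values(ascending=False))
```

In the notebook itself, `hispanic_inmates['Race'].value_counts(normalize=True)` should give the same proportions directly from the DataFrame.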


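A possible estimate of the total Hispanic inmate population is the sum of the counts in the previous table, bearing in mind that it includes inmates who have a common Hispanic name but may not be Hispanic.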

In [13]:
race_distribution.sum()

Count    12818
dtype: int64