In [1]:
from bs4 import BeautifulSoup, SoupStrainer
import bs4
import urllib.request  # a bare `import urllib` does not reliably expose urllib.request
import pathlib

## Get HTML Data From The Heritage Foundation

These are public-facing sites and freely released data. Capturing it and storing it in a pandas DataFrame should be acceptable.

An example URL that we can scrape data from is [`https://www.heritage.org/voterfraud/search?combine=&state=All&year=&case_type=All&fraud_type=All&page=0`](https://www.heritage.org/voterfraud/search?combine=&state=All&year=&case_type=All&fraud_type=All&page=0). This query should return their entire database, and it shows 15 results per page over 63 pages. This makes our lives a little difficult. By this reckoning, there should be 62 * 15 + 7 = 937 records (62 full pages plus 7 records on the last page). 
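As a quick sanity check on the record count above, here is a minimal sketch of the arithmetic. The variable names are illustrative only, and the 7 records on the final page are taken from the formula in the text.

```python
# Expected number of records in the paginated search results.
pages = 63      # total result pages, per the text
per_page = 15   # results shown on each full page
last_page = 7   # records on the final, partial page (from the formula above)

# Every page except the last is full.
expected_records = (pages - 1) * per_page + last_page
print(expected_records)
```

This count can later be compared against the number of rows in the scraped DataFrame.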

This is a little worrying, as in big bold letters on The Heritage Foundation [Voter Fraud report splash page](https://www.heritage.org/voterfraud), they claim they have collected "1,277 proven instances of voter fraud". This is different from their [printable report](https://www.heritage.org/sites/default/files/voterfraud_download/VoterFraudCases_5.pdf), which lists "1,088 proven instances", and different yet again from the [White House document](https://www.whitehouse.gov/sites/whitehouse.gov/files/docs/pacei-voterfraudcases.pdf) saved as part of the now-defunct PACEI effort (deducible from the URL), which lists 1,071. Since the commission ended in 2018, it stands to reason that The Heritage Foundation might have added to their online database since then. It remains to be seen whether that data is public-facing. 

PACEI was the Presidential Advisory Commission on Election Integrity, which was disbanded after a lawsuit from commission member Matthew Dunlap, Secretary of State of Maine, alleging that he and other members were illegally excluded from its work. The generally held consensus is that the commission found no evidence of widespread voter fraud, and that the PACEI was formed for the sole purpose of backing up President Trump's unproven claims of the same. An overview of the situation can be found on [Maine's government website](https://www.maine.gov/sos/news/2018/paceidocs.html) and on a [left-leaning administration watchdog](https://www.americanoversight.org/investigation/dunlap-v-pacei-investigating-voter-fraud-commission). 

Given that The Heritage Foundation was considered a worthy source of data for a presidential commission, it's worth getting a handle on what is actually in their database. Unfortunately, they have not made the database easily available, and we have to resort to scraping individual webpages. 

### Get HTML of The Heritage Foundation Webpages

In [2]:
links = ['https://www.heritage.org/voterfraud/search' +
         '?combine=&state=All&year=&case_type=All&fraud_type=All&page={}'.format(i) for i in range(63)]

In [3]:
def get_table_html(url):
    headers = {'Referer': 'https://www.heritage.org/', 
           'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    html = response.read().decode('utf-8')
    return html

In [4]:
def get_all_table_htmls(check_local=True):
    htmls_path = pathlib.Path('table_htmls')
    if check_local and (htmls_path / 'page_63.txt').exists():
        htmls = []
        for fpath in [x for x in sorted(htmls_path.iterdir(), key=lambda y: y.name) if x.is_file()]: 
            if fpath.name.startswith('page_') and fpath.suffix == '.txt':
                htmls.append(fpath.read_text())
    else:
        htmls = [get_table_html(link) for link in links]
        htmls_path.mkdir(exist_ok=True)
        for i, html in enumerate(htmls):
            (htmls_path / 'page_{:02d}.txt'.format(i+1)).write_text(html)
    return htmls

In [5]:
htmls = get_all_table_htmls()

## Wrangle Data

### Test on 1 HTML File

In [6]:
print(htmls[0])

</h4>
    <ul class="pager__items js-pager__items">
                                                        <li class="pager__item is-active">
                                          <a href="?combine=&amp;state=All&amp;year=&amp;case_type=All&amp;fraud_type=All&amp;page=0" title="Current page">
            <span class="visually-hidden">
              Current page
            </span>1</a>
        </li>
              <li class="pager__item">
                                          <a href="?combine=&amp;state=All&amp;year=&amp;case_type=All&amp;fraud_type=All&amp;page=1" title="Go to page 2">
            <span class="visually-hidden">
              Page
            </span>2</a>
        </li>
              <li class="pager__item">
                                          <a href="?combine=&amp;state=All&amp;year=&amp;case_type=All&amp;fraud_type=All&amp;page=2" title="Go to page 3">
            <span class="visually-hidden">
              Page
            </span>3</a>
        </li>


In [7]:
only_spans = SoupStrainer('span')
span_soup = BeautifulSoup(htmls[0], parse_only=only_spans)
print(span_soup.prettify())

-field-name">
 <span class="views-label views-label-name">
  Name
 </span>
 <span class="field-content">
  Gay Nell Tinker
 </span>
</span>
<span class="views-field views-field-field-case-type">
 <span class="views-label views-label-field-case-type">
  Case Type
 </span>
 <span class="field-content">
  Criminal Conviction
 </span>
</span>
<span class="views-field views-field-field-fraud-type">
 <span class="views-label views-label-field-fraud-type">
  Fraud Type
 </span>
 <span class="field-content">
  Fraudulent Use Of Absentee Ballots
 </span>
</span>
<span class="views-field views-field-field-outcome-of-case">
 <span class="views-label views-label-field-outcome-of-case">
  Details
 </span>
 <span class="field-content">
  <p>
   Gay Nell Tinker, a former circuit clerk for Hale County, pleaded guilty to multiple counts of absentee ballot fraud after her scheme to orchestrate fraudulent absentee ballots for the benefit of multiple candidates was uncovered. She admitted to falsifying th

In [337]:
field_contents = span_soup.find_all(class_='field-content')
field_contents

[<span class="field-content"><a href="/agriculture" hreflang="en">Agriculture</a></span>,
 <span class="field-content"><a href="/education" hreflang="en">Education</a></span>,
 <span class="field-content"><a href="/government-regulation" hreflang="en">Government Regulation</a></span>,
 <span class="field-content"><a href="/housing" hreflang="en">Housing</a></span>,
 <span class="field-content"><a href="/american-founders" hreflang="en">American Founders</a></span>,
 <span class="field-content"><a href="/conservatism" hreflang="en">Conservatism</a></span>,
 <span class="field-content"><a href="/progressivism" hreflang="en">Progressivism</a></span>,
 <span class="field-content"><a href="/public-opinion" hreflang="en">Public Opinion</a></span>,
 <span class="field-content"><a href="/asia" hreflang="en">Asia</a></span>,
 <span class="field-content"><a href="/europe" hreflang="en">Europe</a></span>,
 <span class="field-content"><a href="/global-politics" hreflang="en">Global Politics</a></s

In [338]:
tag_strings = []
for tag in field_contents:
    child = tag.findChild()
    if child:
        if child.name == 'p': 
            tag_strings.append(child.string)
    else:
        tag_strings.append(tag.string)

In [339]:
tag_strings

['Alabama',
 '2016',
 'Daniel W. Reynolds',
 'Criminal Conviction',
 'Fraudulent Use Of Absentee Ballots',
 "Daniel W. Reynolds pleaded guilty to three counts of absentee ballot fraud and was sentenced to two years' probation. Reynolds, the chief campaign volunteer for Commissioner Amos Newsome, participated in falsifying absentee ballots in the Dothan District 2 election between Newsome and his rival Lamesa Danzey in the summer of 2013.",
 'Alabama',
 '2015',
 'Janice Lee Hart',
 'Criminal Conviction',
 'Fraudulent Use Of Absentee Ballots',
 'Janice Lee Hart pleaded guilty to eight misdemeanor counts of attempted absentee ballot fraud in connection with misconduct while working on the 2013 campaign for District 2 City Commissioner Amos Newsome. Prosecutors charged that Hart was not present when absentee ballots were signed even though she was listed as a witness on the ballots. In the election, Newsome defeated his challenger by only 14 votes and received 119 out of the 124 absentee b

Each result page includes definitions that are organized under the same span class as the data we're after, and they appear at the end of the list `tag_strings`. Each record contains 6 fields and starts with a state, so we skip ahead by 6 until we no longer encounter a state. The rest of `tag_strings` holds the defined terms and their definitions, which we save for reference.

In [340]:
states = ["Alabama","Alaska","Arizona","Arkansas","California","Colorado",
    "Connecticut","Delaware","Florida","Georgia","Hawaii","Idaho","Illinois",
    "Indiana","Iowa","Kansas","Kentucky","Louisiana","Maine","Maryland",
    "Massachusetts","Michigan","Minnesota","Mississippi","Missouri","Montana",
    "Nebraska","Nevada","New Hampshire","New Jersey","New Mexico","New York",
    "North Carolina","North Dakota","Ohio","Oklahoma","Oregon","Pennsylvania",
    "Rhode Island","South Carolina","South Dakota","Tennessee","Texas","Utah",
    "Vermont","Virginia","Washington","West Virginia","Wisconsin","Wyoming"]

i = 0
# Check the bound first so that tag_strings[i] can never raise an IndexError
while i < len(tag_strings) and tag_strings[i] in states:
    i += 6

In [341]:
data, def_list = tag_strings[:i], tag_strings[i:]

In [342]:
defs = dict()
for i in range(len(def_list)): 
    if i % 2 == 0: 
        defs[def_list[i]] = def_list[i+1]
defs

{'Criminal Conviction': 'Any case that results in a defendant entering a plea of guilty or no contest, or being found guilty in court of election-related offenses.',
 'Judicial Finding': 'A finding by a court of law that fraud occurred in an election, including judicial orders overturning election results or ordering a new election due to fraud.',
 'Civil Penalty': 'Any civil case resulting in fines or other penalties imposed for a violation of election laws.',
 'Official Finding': 'A finding by a government body that fraud occurred in an election, including orders overturning election results or ordering a new election due to fraud.',
 'Diversion Program': 'Any criminal case in which a judge directs a defendant into a pre-trial diversion program, or stays or defers adjudication with the understanding that the conviction will be cleared upon completion of the program.',
 'Fraudulent Use Of Absentee Ballots': 'Requesting absentee ballots and voting without the knowledge of the actual vo

In [343]:
import pandas as pd
import numpy as np

In [436]:
cols = ['state', 'year', 'name', 'case_type', 'fraud_type', 'description']
df = pd.DataFrame(np.array(data).reshape(15, 6), columns=cols)
print('shape of data is', df.shape)
df.head()

shape of data is (15, 6)


Unnamed: 0,state,year,name,case_type,fraud_type,description
0,Alabama,2016,Daniel W. Reynolds,Criminal Conviction,Fraudulent Use Of Absentee Ballots,Daniel W. Reynolds pleaded guilty to three cou...
1,Alabama,2015,Janice Lee Hart,Criminal Conviction,Fraudulent Use Of Absentee Ballots,Janice Lee Hart pleaded guilty to eight misdem...
2,Alabama,2015,Lesa Coleman,Criminal Conviction,Fraudulent Use Of Absentee Ballots,A Houston County jury found Lesa Coleman guilt...
3,Alabama,2015,Olivia Lee Reynolds,Criminal Conviction,Fraudulent Use Of Absentee Ballots,Olivia Lee Reynolds was convicted of 24 counts...
4,Alabama,2012,Venustiano Hernandez-Hernandez,Criminal Conviction,Ineligible Voting,"Venustiano Hernandez-Hernandez, an illegal imm..."


We've successfully scraped the first result page and loaded the data into a dataframe. Let's do the same for the rest of the data. 

### Scrape all HTML Files

This will be a general restatement of the last section's work in a single function.

In [351]:
def scrape_data_from_html(html):
    only_spans = SoupStrainer('span')
    span_soup = BeautifulSoup(html, parse_only=only_spans)
    field_contents = span_soup.find_all(class_='field-content')
    tag_strings = []
    for tag in field_contents:
        child = tag.findChild()
        if child:
            if child.name == 'p': 
                tag_strings.append(child.string)
        else:
            tag_strings.append(tag.string)
    i = 0
    while i < len(tag_strings) and tag_strings[i] in states:
        i += 6
    data = tag_strings[:i]
    df = pd.DataFrame(np.array(data).reshape(len(data)//6, 6), columns=cols)
    return df

In [367]:
df = pd.concat([scrape_data_from_html(html) for html in htmls], ignore_index=True)

In [368]:
df.sample(5)

Unnamed: 0,state,year,name,case_type,fraud_type,description
600,Pennsylvania,2014,Richard Allan Toney,Criminal Conviction,Fraudulent Use Of Absentee Ballots,"The former police chief of Harmar Township, pl..."
582,Ohio,2005,Chad Staton,Criminal Conviction,False Registrations,Chad Staton pleaded guilty to 10 felony counts...
112,Florida,2013,Christian David Price,Criminal Conviction,False Registrations,"Christian David Price, a campaign worker in Fl..."
49,California,2012,"Vernon, CA",Judicial Finding,Election Overturned,A City Council election (originally decided by...
314,Michigan,2012,"Lorianne O'Brady, Don Yowchuang,…",Criminal Conviction,Ballot Petition Fraud,Former staff members for U.S. Representative T...


In [369]:
df.shape

(937, 6)

Indeed, there are only 937 public-facing records when you click on the "See All Results" button. This immediately calls into question the accuracy of the number on the front page. 

Digging into the HTML of the results page offered this [printable option](https://www.heritage.org/voterfraud-print/search?case_type=All&combine=&fraud_type=All&state=All&year=), which might've saved some time with the scraping. It also includes the source information, which is also nice. I've saved a copy for future inclusion, but what's important right now is that the number of records on the printable version matches the 937 number given earlier. We take it as given that these are legitimate, sourced instances of voter fraud. 

## Data Cleaning

Looks like some incidents are marked as multiple types of fraud. We should break this out somehow. 

In [377]:
df.loc[df.fraud_type.str.split(', ').str.len() > 1, 'fraud_type'].value_counts()

Fraudulent Use Of Absentee Ballots, Duplicate Voting                     5
False Registrations, Fraudulent Use Of Absentee Ballots                  4
False Registrations, Duplicate Voting                                    4
Buying Votes, Fraudulent Use Of Absentee Ballots                         4
Ineligible Voting, False Registrations                                   3
Duplicate Voting, Fraudulent Use Of Absentee Ballots                     3
Fraudulent Use Of Absentee Ballots, Election Overturned                  2
Ineligible Voting, Fraudulent Use Of Absentee Ballots                    2
Fraudulent Use Of Absentee Ballots, False Registrations                  2
False Registrations, Election Overturned                                 2
Duplicate Voting, Ineligible Voting                                      1
Fraudulent Use Of Absentee Ballots, Ineligible Voting                    1
Ballot Petition Fraud, Buying Votes                                      1
Fraudulent Use Of Absente

In [406]:
fraud = df.fraud_type.str.split(', ')
fraud.loc[fraud.str.len() > 1].value_counts().sum()

44

In [400]:
fraud_sets = fraud.apply(set)
fraud_sets.loc[fraud_sets.str.len() > 1].value_counts().sum()

44

44 records have more than one type of fraud associated with them. 

One option is to use this nifty import from `scikit-learn` to make dummy variables. The different types of fraud are listed as column variables: a 1 in that column indicates that fraud type applies to that incident. BIG shout-out to [this Stack Overflow answer](https://stackoverflow.com/a/51420716/7227829).

In [411]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
fraud_dummy_df = pd.DataFrame(mlb.fit_transform(fraud_sets), columns=mlb.classes_, index=df.index)
fraud_dummy_df[fraud_dummy_df.sum(1) > 1].sample(5)

Unnamed: 0,Altering The Vote Count,Ballot Petition Fraud,Buying Votes,Duplicate Voting,Election Overturned,False Registrations,Fraudulent Use Of Absentee Ballots,"Illegal ""Assistance"" At The Polls",Impersonation Fraud At The Polls,Ineligible Voting,Miscellaneous
811,0,0,0,0,1,1,0,0,0,0,0
468,0,0,0,1,0,0,1,0,0,0,0
174,0,0,0,0,0,1,1,0,0,0,0
729,0,0,0,1,0,1,0,0,0,0,0
560,0,1,0,0,0,0,1,0,0,0,0


Another option is just to duplicate the records that have multiple fraud types. Therefore a single incident listed as "False Registrations" and "Ineligible Voting" will be listed twice with only their fraud type changed. 
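A minimal sketch of this duplication approach, assuming pandas 0.25 or later (for `DataFrame.explode`). The toy DataFrame and its placeholder names are made up for illustration and are not records from the database.

```python
import pandas as pd

# Toy stand-in for the scraped DataFrame (hypothetical rows, not real records).
toy = pd.DataFrame({
    'name': ['Person A', 'Person B'],
    'fraud_type': ['False Registrations, Ineligible Voting', 'Buying Votes'],
})

# Split the comma-separated fraud types into lists, then emit one row per
# list element; all other columns are repeated for each emitted row.
long_df = (
    toy.assign(fraud_type=toy['fraud_type'].str.split(', '))
       .explode('fraud_type')
       .reset_index(drop=True)
)
print(long_df)
```

The trade-off is that row counts no longer equal incident counts, so any per-incident tally would need to deduplicate on the other columns first.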

Trying to give every benefit of the doubt to The Heritage Foundation here...

In [434]:
for x in df.year:
    if type(x) != str: print(x, 'is type', type(x))

2019 is type <class 'bs4.element.NavigableString'>
2018 is type <class 'bs4.element.NavigableString'>
2016 is type <class 'bs4.element.NavigableString'>
2016 is type <class 'bs4.element.NavigableString'>
2016 is type <class 'bs4.element.NavigableString'>
2016 is type <class 'bs4.element.NavigableString'>
None is type <class 'NoneType'>
2016 is type <class 'bs4.element.NavigableString'>
2014 is type <class 'bs4.element.NavigableString'>
2014 is type <class 'bs4.element.NavigableString'>
2018 is type <class 'bs4.element.NavigableString'>
2018 is type <class 'bs4.element.NavigableString'>
2016 is type <class 'bs4.element.NavigableString'>
2016 is type <class 'bs4.element.NavigableString'>
2016 is type <class 'bs4.element.NavigableString'>
