# VPAP Webscraper
##### George P. Tryfiates

### *Introduction*
While doing econometric research, I have frequently been disappointed by the lack of data reporting from American institutions.

For instance, the Virginia Department of Elections API for downloading campaign finance information publishes only the individual reports filed during each filing period. Aggregating the data is therefore very time-consuming, which makes analysis of political contributions more difficult and much rarer. The data is also self-reported by the campaigns, which have an incentive to be dishonest. Self-reporting also produces wild variation in the industries listed for donors.

The Virginia Public Access Project (VPAP) has stepped up to compile the individual campaign finance reports and make them more accessible to the public on its website (https://www.vpap.org/). VPAP also removes the street addresses from donors' mailing addresses, which the Dept. of Elections lists. While the data is much easier to find there and VPAP offers useful graphics, it does not provide the aggregate data needed for statistical analysis. I therefore wanted to collect and structure VPAP data into usable, analysis-ready form. I chose the local level for convenience, since this is my first independent webscraper project.

### *Overview*

I webscraped publicly available data from the Virginia Public Access Project website for political contributions to local candidates in Spotsylvania, VA by Spotsylvania donors for the period of January 1, 2015 - December 2, 2020. I exported this information to a CSV file containing the donor, the donor's employer, the donor's industry, the recipient, the amount, and the date of each contribution. However, not every donation had all of this information, and many fields were left as "Unknown."

The following link is the first page of VPAP search results for all Spotsylvania donors to local candidates, 2015 - present (Dec. 2, 2020):
https://www.vpap.org/localities/spotsylvania-county-va/donors/?start_year=2015&end_year=2020&recip_type=local_cands

There are twenty-six pages of results. The results give only each donor's name and the total amount they gave. From the results, I navigated to each donor's page to get the names of the recipients, but that page lists only each recipient and the total donated to them during the 2015-2020 period. From there, I navigated to the donor's recipient page, where I gathered the date and amount of each contribution.

Hopefully, this will serve as a reporting tool that can be expanded to increase civic knowledge. This is my first independently made webscraper, so please let me know of ways to improve the program. Some code chunks take a while to run.

### *The Webscrape*

##### Importing Libraries

In [2]:
import requests
from bs4 import BeautifulSoup
import re
import numpy as np
import pandas as pd

##### Webscraping the Search Results

I built out the program starting on page 1 of the results. Note that the URL for page one does not include a page number. Also, the results show the total amount each donor gave during the time period, but not to whom or when.

In [138]:
URL = 'https://www.vpap.org/localities/spotsylvania-county-va/donors/?start_year=2015&end_year=2020&recip_type=local_cands'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
#scraping the contents of the table of results
results = soup.find("table", class_="table table-striped")
#print(results.prettify()) #you can remove the first hashtag if you would like to see the results from the HTML 'soup'
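
The URL pattern described above, where page one carries no page number, can be sketched as a small helper. This is only a minimal sketch: `build_results_url` is a hypothetical name that does not appear in the notebook, and the `page` query parameter is taken from the URLs used for the later pages of results.

```python
from urllib.parse import urlencode

BASE = "https://www.vpap.org/localities/spotsylvania-county-va/donors/"

def build_results_url(page_number=1, start_year=2015, end_year=2020):
    """Build a search-results URL (hypothetical helper, for illustration only)."""
    params = {}
    # Page one is requested without a "page" parameter, so add it only for later pages.
    if page_number > 1:
        params["page"] = page_number
    params["start_year"] = start_year
    params["end_year"] = end_year
    params["recip_type"] = "local_cands"
    return BASE + "?" + urlencode(params)

print(build_results_url())   # first page, no page number
print(build_results_url(2))  # second page
```

Building the query string from a dictionary keeps the year range and recipient type in one place, which may make it easier to point the scraper at another locality or period.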

From these results, I scraped the donors' names into a list.

In [139]:
nameList = []
def localdonors(results):
    #each donor name sits in a table cell with width="80%"
    namelineList = results.find_all("td", width="80%")
    for cell in namelineList:
        text = cell.text.strip()
        #keep only the first line of the cell text
        name = re.search(r".*", text)
        nameList.append(name[0])
localdonors(results)
nameList

['Cosner, Hugh C',
 'Descano, Dorothy',
 'Descano, Steve',
 'Dead Hand Design',
 'Bird, Travis Duane',
 'King, Alfred M',
 'Berry, Moses',
 'McGyver Group LLC',
 'Curcie, Debbie',
 'Tricord Homes',
 'Boulter, Harvey E',
 'Ross, David',
 'Trivett, Michael',
 'Dechat, Peter',
 'Feaster, Angela',
 'Kingman, Rebecca S',
 'Less Trucking Inc',
 'Gore, Lorrie Jones',
 'Trampe, Paul',
 'Thillmann, John Horst']

I then scraped each donor's URL into a second list.

In [140]:
urlList = []
def localurls(results):
    personal_urls = results.find_all("a", href=True)
    for x in personal_urls:
        urlList.append('https://www.vpap.org'+x["href"])
localurls(results)
urlList

['https://www.vpap.org/donors/673-hugh-c-cosner/?start_year=2015&end_year=2020&recip_type=local_cands',
 'https://www.vpap.org/donors/330558-dorothy-descano/?start_year=2015&end_year=2020&recip_type=local_cands',
 'https://www.vpap.org/donors/330895-steve-descano/?start_year=2015&end_year=2020&recip_type=local_cands',
 'https://www.vpap.org/donors/361928-dead-hand-design/?start_year=2015&end_year=2020&recip_type=local_cands',
 'https://www.vpap.org/donors/216020-travis-duane-bird/?start_year=2015&end_year=2020&recip_type=local_cands',
 'https://www.vpap.org/donors/67126-alfred-m-king/?start_year=2015&end_year=2020&recip_type=local_cands',
 'https://www.vpap.org/donors/233154-moses-berry/?start_year=2015&end_year=2020&recip_type=local_cands',
 'https://www.vpap.org/donors/251360-mcgyver-group-llc/?start_year=2015&end_year=2020&recip_type=local_cands',
 'https://www.vpap.org/donors/181131-debbie-curcie/?start_year=2015&end_year=2020&recip_type=local_cands',
 'https://www.vpap.org/donors/
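
The code above builds absolute URLs by prepending the domain to each `href`. As an alternative sketch, the standard library's `urljoin` does the same for site-relative links and leaves already-absolute links unchanged. `absolute_url` is a hypothetical helper name, and the sketch assumes the `href` values are site-relative paths such as the one shown.

```python
from urllib.parse import urljoin

SITE = "https://www.vpap.org"

def absolute_url(href):
    """Resolve an href against the VPAP domain (hypothetical helper)."""
    return urljoin(SITE, href)

# A site-relative path of the kind the results table is assumed to contain.
relative = "/donors/673-hugh-c-cosner/?start_year=2015&end_year=2020&recip_type=local_cands"
print(absolute_url(relative))

# An already-absolute link passes through unchanged.
print(absolute_url("https://www.vpap.org/donors/673-hugh-c-cosner/"))
```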

I repeated the process with pages 2-26 of the search results. It takes a minute (time varies with user).

In [141]:
#pages 2 through 26; page 1 was scraped above
for page_number in range(2,27):
    URL = "".join(["https://www.vpap.org/localities/spotsylvania-county-va/donors/?page=",str(page_number),"&start_year=2015&end_year=2020&recip_type=local_cands"])
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find("table", class_="table table-striped")
    localdonors(results)
    localurls(results)
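
Because this loop makes one request per page, a short pause between requests is a common courtesy to the server. The sketch below is an assumption and not something the notebook does: `scrape_pages` and `fake_fetch` are hypothetical names, and the fetcher is passed in as a callable so the pattern can be tried offline. In the notebook, the callable would wrap the `requests.get` and `BeautifulSoup` steps.

```python
import time

def scrape_pages(fetch, page_numbers, delay_seconds=1.0):
    """Call fetch(page_number) for each page, pausing between requests."""
    rows = []
    for n, page_number in enumerate(page_numbers):
        if n > 0:
            # Pause between requests, but not before the first one.
            time.sleep(delay_seconds)
        rows.extend(fetch(page_number))
    return rows

# Stand-in fetcher so the sketch runs without network access.
# It returns placeholder strings, not VPAP data.
def fake_fetch(page_number):
    return [f"page-{page_number}-row-{k}" for k in range(2)]

rows = scrape_pages(fake_fetch, range(2, 5), delay_seconds=0)
print(len(rows))
```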

##### Validating the Lists

I printed the first ten observations in each list to check that they were appropriate.

In [142]:
for p in range(0,10):
    print(nameList[p], urlList[p])
    print("")

Cosner, Hugh C https://www.vpap.org/donors/673-hugh-c-cosner/?start_year=2015&end_year=2020&recip_type=local_cands

Descano, Dorothy https://www.vpap.org/donors/330558-dorothy-descano/?start_year=2015&end_year=2020&recip_type=local_cands

Descano, Steve https://www.vpap.org/donors/330895-steve-descano/?start_year=2015&end_year=2020&recip_type=local_cands

Dead Hand Design https://www.vpap.org/donors/361928-dead-hand-design/?start_year=2015&end_year=2020&recip_type=local_cands

Bird, Travis Duane https://www.vpap.org/donors/216020-travis-duane-bird/?start_year=2015&end_year=2020&recip_type=local_cands

King, Alfred M https://www.vpap.org/donors/67126-alfred-m-king/?start_year=2015&end_year=2020&recip_type=local_cands

Berry, Moses https://www.vpap.org/donors/233154-moses-berry/?start_year=2015&end_year=2020&recip_type=local_cands

McGyver Group LLC https://www.vpap.org/donors/251360-mcgyver-group-llc/?start_year=2015&end_year=2020&recip_type=local_cands

Curcie, Debbie https://www.vpap.

I then checked to make sure the lists were the same size.

In [143]:
print(len(nameList), len(urlList))

501 501
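
The length check above can also be made to fail loudly and to keep each name attached to its URL. A minimal sketch, assuming a hypothetical helper `pair_up` and using the first two entries from the output shown earlier:

```python
def pair_up(names, urls):
    """Pair each name with its URL, raising an error if the lists differ in length."""
    if len(names) != len(urls):
        raise ValueError(f"length mismatch: {len(names)} names vs {len(urls)} urls")
    return list(zip(names, urls))

names = ["Cosner, Hugh C", "Descano, Dorothy"]
urls = [
    "https://www.vpap.org/donors/673-hugh-c-cosner/?start_year=2015&end_year=2020&recip_type=local_cands",
    "https://www.vpap.org/donors/330558-dorothy-descano/?start_year=2015&end_year=2020&recip_type=local_cands",
]
pairs = pair_up(names, urls)
print(pairs[0][0], pairs[0][1])
```

Storing the pairs as tuples means a later step cannot accidentally index one list with a position from the other.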


##### Scraping Donor Pages

Here, I scraped each donor's page for the recipients' names and the URL of the donor's recipient page. I also collected the donor's listed employer and industry. This step is slow and takes some time to run.

In [151]:
#initialize the lists
recipientList = []
amounturlList = []
donorList = []
industryList = []
employerList = []

#loop through all the donor URLs
for x in range(len(urlList)):
    #initialize a counter to expand nameList to donorList to match the length of recipientList
    count = 0
    URL = urlList[x]
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find_all("td", class_="right")
    #get the recipients' name
    for namebloc in results:
        recipientList.append(namebloc.find("td").text.strip())
    #get the donor's recipient url
    for urlbloc in results:
        amounturlList.append('https://www.vpap.org'+urlbloc.find("a")["href"])
        count += 1
    #add the donor name for each donation made
    for y in range(0,count):
        donorList.append(nameList[x])
    #find the donor's employer and industry and add them to the appropriate list for each donation made
    results = soup.find_all("ul")
    #flag to track whether an employer was found for this donor
    a = 0
    for ulbloc in results:
        #convert ulbloc to text, strip leading and trailing whitespace, and search for a string that
        #starts with "Industry:"
        industrybloc = re.search("Industry:.*", ulbloc.text.strip())
        #filter out non-matches
        if industrybloc is not None:
            #drop the "Industry:" label and rejoin the remaining words with single spaces
            industryName = " ".join(industrybloc.group().split()[1:])
            #add once for each donation
            for z in range(0,count):
                industryList.append(industryName)
        #repeat the same process for "Employer:", which many donors do not list
        employerbloc = re.search("Employer:.*", ulbloc.text.strip())
        if employerbloc is not None:
            #a match here means we are adding to employerList
            a = 1
            employerName = " ".join(employerbloc.group().split()[1:])
            for zz in range(0,count):
                employerList.append(employerName)
    #if "a" did not change, no employer was listed, so add "Employer Unknown" for each donation made
    if a == 0:
        for zzz in range(0,count):
            employerList.append("Employer Unknown")
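
The label matching in the loop above can be isolated into one small function that also supplies a default when a label is missing. This is a sketch under assumptions: `extract_field` is a hypothetical helper, and the sample text only imitates the "Industry: ..." and "Employer: ..." lines that the regular expressions above look for. The real page layout may differ.

```python
import re

def extract_field(text, label, default):
    """Return the text following 'label:' on the same line, or default if absent."""
    match = re.search(rf"{re.escape(label)}:[ \t]*(.*)", text)
    if match is None:
        return default
    # Collapse runs of whitespace, as the split/join in the loop above does.
    value = " ".join(match.group(1).split())
    return value or default

# Imitation of the list text on a donor page (layout is an assumption).
sample = "Industry: Real Estate Developers\nEmployer: Pizza Hut"
print(extract_field(sample, "Industry", "Industry Unknown"))
print(extract_field(sample, "Employer", "Employer Unknown"))
print(extract_field("Industry: Retired", "Employer", "Employer Unknown"))
```

Giving the industry the same kind of default as the employer would keep the lists the same length even for a donor with no listed industry.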
    

##### *Validating the Lists*

In [44]:
#recipientList

In [45]:
#amounturlList

In [2]:
#donorList

In [152]:
print(len(amounturlList), len(recipientList), len(donorList), len(industryList), len(employerList))

640 640 640 640 640


In [156]:
for x in range(0,640):
    print(donorList[x]+"--" + employerList[x]+"--" +industryList[x])

Cosner, Hugh C--Pizza Hut--Real Estate Developers
Cosner, Hugh C--Pizza Hut--Real Estate Developers
Cosner, Hugh C--Pizza Hut--Real Estate Developers
Cosner, Hugh C--Pizza Hut--Real Estate Developers
Cosner, Hugh C--Pizza Hut--Real Estate Developers
Cosner, Hugh C--Pizza Hut--Real Estate Developers
Cosner, Hugh C--Pizza Hut--Real Estate Developers
Descano, Dorothy--Paragon Autism Services LLC--Mental Health
Descano, Steve--Employer Unknown--Retired
Dead Hand Design--Employer Unknown--Advertising/PR/Direct Mail/Marketing
Bird, Travis Duane--Spotsylvania County--Constitutional Officers
Bird, Travis Duane--Spotsylvania County--Constitutional Officers
King, Alfred M--Valuation Research Corp--Appraisers/Auctioneers
King, Alfred M--Valuation Research Corp--Appraisers/Auctioneers
King, Alfred M--Valuation Research Corp--Appraisers/Auctioneers
King, Alfred M--Valuation Research Corp--Appraisers/Auctioneers
King, Alfred M--Valuation Research Corp--Appraisers/Auctioneers
King, Alfred M--Valuatio

###### *The Final Scrape*

I initialized the lists that will become the DataFrame columns. I then scraped each donor's recipient page for the date and donation amount, and repeated the matching donorList and recipientList entries once per contribution so that every column has the same length.

In [157]:
#initialize DataFrame variables
dollarList = []
dateList = []
donor = []
recipient = []
industry = []
employer = []
url = []

In [158]:
#iterate via the donors' recipient page
for x in range(0, 640):
    #generate a counter for later
    count = 0
    URL = amounturlList[x]
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    #get the donation value
    results = soup.find_all("td", class_="right")
    for y in results:
        num = y.text.strip()
        num = re.sub("[$,]", "", num)
        dollarList.append(num)
    #get the date of the donation
    results = soup.find_all("td", class_="center")
    for z in results:
        dateList.append(z.text.strip())
        #iterate the counter so we know how many times to append the recipient and donor
        count += 1
    for a in range(0, count):
        donor.append(donorList[x])
        recipient.append(recipientList[x])
        employer.append(employerList[x])
        industry.append(industryList[x])
        url.append(amounturlList[x])
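
The loop above strips `$` and `,` from each amount but keeps the result as a string. For analysis, the cleaned strings can be converted to numbers. A minimal sketch, assuming a hypothetical helper `parse_dollars` and made-up sample strings rather than scraped values:

```python
import re
from decimal import Decimal

def parse_dollars(raw):
    """Convert a string such as '$1,250' to a Decimal amount."""
    cleaned = re.sub(r"[$,]", "", raw.strip())
    return Decimal(cleaned)

# Made-up examples of the formats the cleaning step handles.
amounts = [parse_dollars(s) for s in ["$1,250", "$100", "$2,500.50"]]
print(sum(amounts))
```

`Decimal` avoids floating-point rounding on currency. If the values are already in a DataFrame, `pd.to_numeric` on the column is another option.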

###### Validating the Lengths

In [159]:
print(len(donor), len(employer), len(industry), len(recipient), len(dateList),len(dollarList), len(url))

1052 1052 1052 1052 1052 1052 1052


###### *Creating a DataFrame*

In [160]:
df = pd.DataFrame(data={"date":dateList, "donor":donor, "employer":employer, "industry":industry , "recipient": recipient, "donation_amount":dollarList, "url":url}, columns=["date", "donor", "employer", "industry", "recipient", "donation_amount", "url"])

In [161]:
df.index.name="index"

In [162]:
pd.set_option("display.max_rows", None)

In [1]:
#df

###### *Exporting the DataFrame to a CSV File*

In [164]:
df.to_csv("VPAP_SpotsyDonors_2015_2020_Dec_2.csv", index=False)
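
Once exported, the file can be aggregated, which was the original motivation. The sketch below totals donations per recipient using only the standard library. The rows are made-up placeholders that reuse the notebook's column names, the date format is an assumption, and `total_by_recipient` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Made-up rows with the notebook's column names; these are not real VPAP records.
sample_csv = """date,donor,employer,industry,recipient,donation_amount,url
01/15/2019,Donor A,Employer Unknown,Retired,Candidate X,100,https://example.org/a
03/02/2019,Donor A,Employer Unknown,Retired,Candidate X,250,https://example.org/a
04/10/2019,Donor B,Acme Co,Retail,Candidate Y,500,https://example.org/b
"""

def total_by_recipient(csv_file):
    """Sum donation_amount for each recipient in an open CSV file."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row["recipient"]] += float(row["donation_amount"])
    return dict(totals)

totals = total_by_recipient(io.StringIO(sample_csv))
print(totals)
```

To run it on the exported file, pass `open("VPAP_SpotsyDonors_2015_2020_Dec_2.csv", newline="")` in place of the `io.StringIO` object.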