# Scraping data about US Congress members from Wikipedia
In this notebook we start a project that aims to build a machine learning model which can determine which party a US congressional district is most likely to vote for. The first part of every project is to acquire the required data. This can be done by receiving it from a third party, downloading data files or, as in this case, scraping a webpage for it.

In [1]:
from bs4 import BeautifulSoup as bs #For inspecting html webpage in notebook
import pandas as pd #To put data into frames for joining into a final result, also used for writing to csv
import lxml #For parsing html
import requests #For requesting the webpages which we will scrape
import time #To have a wait timer when scraping, for politeness' sake
import os #For saving in folder
import multiprocessing as mp

For this project we will start with the Wikipedia page listing the current (2020-06-21) members of the US House of Representatives. From this base page we can get the representative of each congressional district together with data about their party affiliation, prior experience, education, when they assumed their current office, residence, and which year they were born.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives" #Url to wikipedia page
response = requests.get(url) #The received page when requesting the specified url
soup = bs(response.content, 'lxml') #creating a BeautifulSoup object which we can display in the notebook and inspect

In [3]:
print(soup.prettify()) #Print the parsed html page

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of current members of the United States House of Representatives - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"9197d7e1-9e95-4521-a4e0-3f068ff99264","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_current_members_of_the_United_States_House_of_Representatives","wgTitle":"List of current members of the United States House of Representatives","wgCurRevisionId":968333171,"wgRevisionId":968333171,"wgArticleId":12498224,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"]

In [4]:
tables = soup.find_all('table') #Returns all tables on the webpage
tables #Print all tables in jupyter

[<table class="vertical-navbox nowraplinks plainlist" style="float:right;clear:right;width:22.0em;margin:0 0 1.0em 1.0em;background:#f8f9fa;border:1px solid #aaa;padding:0.2em;border-spacing:0.4em 0;text-align:center;line-height:1.4em;font-size:88%;width:18em;"><tbody><tr><td style="padding-top:0.4em;line-height:1.2em">This article is part of <a href="/wiki/Category:United_States_House_of_Representatives" title="Category:United States House of Representatives">a series</a> on the</td></tr><tr><th style="padding:0.2em 0.4em 0.2em;padding-top:0;font-size:145%;line-height:1.2em;font-size:175%;font-weight:normal;"><a href="/wiki/United_States_House_of_Representatives" title="United States House of Representatives">United States House<br/>of Representatives</a></th></tr><tr><td style="padding:0.2em 0 0.4em"><div class="floatnone"><a class="image" href="/wiki/File:Seal_of_the_United_States_House_of_Representatives.svg" title="Great Seal of the United States House of Representatives"><img alt

In [5]:
#Returns the tables where you can sort the data on the webpage.
members_table = soup.find_all("table", class_ ="wikitable sortable")[2] #The table which we are interested in
print(members_table.prettify()) #Print the table of interest 

<table class="wikitable sortable" id="votingmembers">
 <tbody>
  <tr style="vertical-align:bottom">
   <th>
    District
   </th>
   <th>
    Member
   </th>
   <th colspan="2">
    Party
   </th>
   <th>
    Prior experience
   </th>
   <th>
    Education
   </th>
   <th>
    Assumed office
   </th>
   <th>
    Residence
   </th>
   <th>
    Born
   </th>
  </tr>
  <tr>
   <td>
    <span data-sort-value="Alabama01 !">
     <a href="/wiki/Alabama%27s_1st_congressional_district" title="Alabama's 1st congressional district">
      Alabama 1
     </a>
    </span>
   </td>
   <td data-sort-value="Byrne, Bradley">
    <a class="image" href="/wiki/File:Rep_Bradley_Byrne_(cropped).jpg">
     <img alt="Rep Bradley Byrne (cropped).jpg" data-file-height="2337" data-file-width="1700" decoding="async" height="103" src="//upload.wikimedia.org/wikipedia/commons/thumb/7/71/Rep_Bradley_Byrne_%28cropped%29.jpg/75px-Rep_Bradley_Byrne_%28cropped%29.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/th

pandas has a built-in function, `read_html`, which scrapes all tables on the Wikipedia page directly and puts each of them into a pandas frame.

In [6]:
congress_members_frame = pd.read_html("https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_House_of_Representatives")[6] #Index specifies which table to put into a frame
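Selecting the table by position is fragile, since the index changes whenever a table is added to or removed from the page. As a possible alternative, `pd.read_html` accepts an `attrs` argument that filters tables by their HTML attributes, and the table printed earlier carries `id="votingmembers"`. The sketch below runs on a small hand-written HTML snippet rather than the live page; `sample_html`, `matching_tables` and `members_by_id` are names made up for illustration.

```python
from io import StringIO

import pandas as pd

# Tiny stand-in for the Wikipedia page: two tables, only one has the id we want.
sample_html = """
<table class="wikitable"><tr><th>Other</th></tr><tr><td>ignore me</td></tr></table>
<table class="wikitable sortable" id="votingmembers">
  <tr><th>District</th><th>Member</th></tr>
  <tr><td>Alabama 1</td><td>Bradley Byrne</td></tr>
  <tr><td>Alabama 2</td><td>Martha Roby</td></tr>
</table>
"""

# attrs keeps only tables whose attributes match, so no positional index is needed.
matching_tables = pd.read_html(StringIO(sample_html), attrs={"id": "votingmembers"})
members_by_id = matching_tables[0]
print(members_by_id)
```

Whether the live page still uses this id would have to be checked in the parsed html before relying on it.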

In [7]:
congress_members_frame 

Unnamed: 0,District,Member,Party,Party.1,Prior experience,Education,Assumed office,Residence,Born
0,Alabama 1,Bradley Byrne,,Republican,Alabama SenateAlabama State Board of Education,Duke University (BA)University of Alabama (JD),2014 (special),Fairhope,1955.0
1,Alabama 2,Martha Roby,,Republican,"Montgomery, Alabama City Council",New York University (BM)Samford University (JD),2011,Montgomery,1976.0
2,Alabama 3,Mike Rogers,,Republican,"Calhoun County, Alabama CommissionerAlabama Ho...","Jacksonville State University (BA, MPA)Birming...",2003,Saks,1958.0
3,Alabama 4,Robert Aderholt,,Republican,"Haleyville, Alabama Municipal Judge",University of North AlabamaBirmingham–Southern...,1997,Haleyville,1965.0
4,Alabama 5,Mo Brooks,,Republican,Alabama House of RepresentativesMadison County...,Duke University (BA)University of Alabama (JD),2011,Huntsville,1954.0
...,...,...,...,...,...,...,...,...,...
430,Wisconsin 5,Jim Sensenbrenner,,Republican,Wisconsin State SenateWisconsin State Assembly,Stanford University (BA)University of Wisconsi...,1979,Menomonee Falls,1943.0
431,Wisconsin 6,Glenn Grothman,,Republican,Wisconsin SenateWisconsin State Assembly,"University of Wisconsin–Madison (BA, JD)",2015,Campbellsport,1955.0
432,Wisconsin 7,Tom Tiffany,,Republican,Wisconsin SenateWisconsin State Assembly,University of Wisconsin–River Falls (BS),2020 (special),Minocqua,1957.0
433,Wisconsin 8,Mike Gallagher,,Republican,Political advisorU.S. Marine Corps,Princeton University (BA)National Intelligence...,2017,Green Bay,1984.0


Now we will go through the table of congress members and scrape the links to their Wikipedia pages.

In [8]:
links_to_members = [] #list to store links
for row in members_table.find_all('tr'): #find all rows
    cells = row.find_all('td') #find all cells in the row
    if len(cells) == 9: #rows in the table of interest have 9 cells
        links = cells[1].find_all('a') #Links are <a> tags in the parsed html, so we collect all links in the second cell (Member)
        if links != []: #Make sure that there is a link, vacancies have no links for example 
            link = links[1].get('href') #The cell has a link to an image of the congress member before the link to their page, so we choose the second link
            links_to_members.append('https://en.wikipedia.org' + link) #Add the unique link to the list  
        else: 
            continue #If no link is found continue to next row
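Taking the second link assumes every member cell starts with a picture link, so a member without a picture would raise an `IndexError`. A more defensive variant, sketched here under the assumption that picture links carry `class="image"` as in the html printed earlier, returns the first link that is not an image. The helper `member_link` and the sample row are hypothetical and not part of the original program.

```python
from bs4 import BeautifulSoup

def member_link(cell):
    """Return the href of the first non-image link in a table cell, or None."""
    for anchor in cell.find_all("a"):
        # bs4 returns the class attribute as a list of class names
        if "image" not in anchor.get("class", []):
            return anchor.get("href")
    return None

# Hand-written row: a cell with picture + page link, a cell with only a page link, a vacancy.
sample_row = BeautifulSoup(
    "<table><tr>"
    '<td><a class="image" href="/wiki/File:Photo.jpg"><img alt="photo"/></a>'
    '<a href="/wiki/Bradley_Byrne">Bradley Byrne</a></td>'
    '<td><a href="/wiki/Example_Member">Example Member</a></td>'
    "<td>Vacant</td>"
    "</tr></table>",
    "html.parser",
)
sample_links = [member_link(cell) for cell in sample_row.find_all("td")]
print(sample_links)
```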

Use the list created above to visit each member's page and extract the name of their spouse, if any, and their number of children, if any. Names are scraped to get a unique key for later joining.

In [9]:

names = [] #List to keep the names used as keys.
spouses = [] #List to keep name of spouses
childrens = [] #List to keep the number of children
for member in links_to_members:
    #Reset the items of interest to a base case, so that if we don't find the data we want we don't save the data
    #from the previous candidate again.
    cname = " "
    bname = " "
    spouse = "none"
    children = " "
    url = member #link to specific member
    resp = requests.get(url, params={'action': 'raw'}) #request the page as raw wikitext, which makes the infobox easier to scrape
    page = resp.text
    for line in page.splitlines(): #go through each line
        #We are looking for names, which are most likely found under birth_name, name, or Name, with or without a white space after the '|'.
        if line.startswith('| birth_name'):
            bname = line.partition('=')[-1].strip()
        elif line.startswith('|birth_name'):
            bname = line.partition('=')[-1].strip()
        elif line.startswith('|name'):
            cname = line.partition('=')[-1].strip()
        elif line.startswith('| name'):
            cname = line.partition('=')[-1].strip()
        elif line.startswith('|Name'):
            cname = line.partition('=')[-1].strip()
        elif line.startswith('| Name'):
            cname = line.partition('=')[-1].strip()
        #Spouses are most likely found under spouse or Spouse
        elif line.startswith('|spouse'):
            spouse = line.partition('=')[-1].strip()
        elif line.startswith('|Spouse'):
            spouse = line.partition('=')[-1].strip()
        elif line.startswith('| Spouse'):
            spouse = line.partition('=')[-1].strip()
        elif line.startswith('| spouse'):
            spouse = line.partition('=')[-1].strip()
        #The number of children might be under children, Children, childrens, or Childrens
        elif line.startswith('| children'):
            children = line.partition('=')[-1].strip()
        elif line.startswith('| Children'):
            children = line.partition('=')[-1].strip()
        elif line.startswith('|children'):
            children = line.partition('=')[-1].strip()
        elif line.startswith('|Children'):
            children = line.partition('=')[-1].strip()
        elif line.startswith('|Childrens'):
            children = line.partition('=')[-1].strip()
        elif line.startswith('| Childrens'):
            children = line.partition('=')[-1].strip()
        elif line.startswith('| childrens'):
            children = line.partition('=')[-1].strip()
        elif line.startswith('|childrens'):
            children = line.partition('=')[-1].strip()
        #Website appears to be the last part of the infobox, so when we reach it we stop scanning the page.
        elif line.startswith('|website'):  
            break 
        elif line.startswith('| website'):  
            break
    if cname != " ": #We prefer the name they go by, which should correspond better between tables
        name = cname
    elif bname != " ": #If we only find their birth name we use that instead, to make manual pairing easier when cleaning data
        name = bname 
    else: #If we do not find any name we will fill it in as blank
        name = " "
    names.append(name) #Add the name to the list
    spouses.append(spouse) #Add the name of the spouse to the list
    childrens.append(children) #Add the number of children to the list
    time.sleep(0.5) #Wait this time to be polite
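The long chain of `startswith` checks above could be condensed with a regular expression that ignores case and optional whitespace around the field name. The following is only a sketch of that idea: `infobox_field` is a hypothetical helper and the wikitext is invented for illustration. Unlike the loop above, it returns the first matching line instead of the last one before the website field.

```python
import re

def infobox_field(wikitext, field_names):
    """Return the value of the first infobox line matching any of field_names, or None."""
    names = "|".join(re.escape(name) for name in field_names)
    # Matches e.g. "|spouse = ...", "| Spouse=..." at the start of a line, ignoring case.
    pattern = re.compile(rf"^\|[ \t]*(?:{names})[ \t]*=(.*)$", re.IGNORECASE | re.MULTILINE)
    match = pattern.search(wikitext)
    return match.group(1).strip() if match else None

# Invented infobox text, mimicking what requests returns with params={'action': 'raw'}.
sample_wikitext = """{{Infobox officeholder
| name = Jane Doe
|birth_name = Jane Ann Doe
| Spouse = John Doe
|children=3
| website = example.org
}}"""

print(infobox_field(sample_wikitext, ["name"]))
print(infobox_field(sample_wikitext, ["spouse"]))
print(infobox_field(sample_wikitext, ["children", "childrens"]))
```

Because the field name must be followed directly by `=`, a field such as `birth_name` is not mistaken for `name`.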


In [10]:
member_personal_data = pd.DataFrame(names,columns=['Member'])  #Put the new data into a frame with first column being member.
member_personal_data['Spouse'] = spouses
member_personal_data['Childrens'] = childrens

Join the two tables using the member name as the key. In this case a full outer join is used so that rows for which we fail to find matching keys are still included, e.g. one frame might have the name Joe while the other has the name Joseph. An alternative would be to join on the position in the frames; however, the vacancies would mess up this ordering, so we would need to place them last, or first.

In [11]:
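result = pd.merge(congress_members_frame, member_personal_data,how = 'outer', on = 'Member')
output_dir = os.path.join(os.getcwd(), 'data', 'resultingData') #Folder to save the results in
os.makedirs(output_dir, exist_ok=True) #Create the folder if it does not exist yet
result.to_csv(os.path.join(output_dir, 'congress_members.csv')) #Write the results to a csv file.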
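To see which rows failed to pair up in an outer join, `pd.merge` can add a `_merge` column through its `indicator` argument. The sketch below uses two made-up frames (`table_frame` and `personal_frame`, with fictional names) to show how a Joe/Joseph mismatch would surface; it is an illustration only, not part of the original program.

```python
import pandas as pd

# Made-up data: the same person appears as "Joe" in one frame and "Joseph" in the other.
table_frame = pd.DataFrame({"Member": ["Joe Example", "Ann Sample"], "Party": ["X", "Y"]})
personal_frame = pd.DataFrame(
    {"Member": ["Joseph Example", "Ann Sample"], "Spouse": ["none", "Sam Sample"]}
)

# indicator=True adds a "_merge" column with the values left_only, right_only or both.
merged = pd.merge(table_frame, personal_frame, how="outer", on="Member", indicator=True)

# Rows that exist in only one frame are the ones needing manual pairing.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["Member", "_merge"]])
```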

This concludes the first part of trying to create a model for which party a US congressional district will vote for. The code in this notebook is also written as a runnable program in the file "web_scrapeing_us_congress.py". In the next part we will go through and clean the data which we just scraped.