### To-do
* Grab state information

### Overview
Congressional voting patterns go beyond Democratic vs. Republican party lines. In this project, I "scrape" voting record data from GovTrack, extract all the representatives and their votes, and perform clustering to identify subgroups within the Democratic and Republican parties. Interestingly, both parties show a "mainstream" voting cluster plus distant clusters that represent smaller factions within the party. I ran this analysis for both the House of Representatives and the Senate, but I focus on the Senate here because it has fewer members, and those members are more widely known.
### Goals
My main goal was to build a data extraction pipeline for GovTrack that could be used to run interesting analyses on Congressional voting patterns from any year. 
### Data
Data was "scraped" from https://www.govtrack.us/congress/votes. This page lists every vote from both chambers of Congress and can be filtered by year, chamber, category, etc. Each vote's page links to a .csv file containing the names of all the present representatives and their votes.
### Methods
Note: Many code cells are commented out because they take a long time to run.

First, I "scraped" voting records from https://www.govtrack.us/congress/votes. "Scraped" is in quotation marks because BeautifulSoup could not detect the relevant information on the page for some reason. However, I noticed a pattern in the URLs: each Congress and year (e.g. the 116th Congress in 2020) is represented as "votes/116-2020," the Senate is represented as "s," and the votes are numbered sequentially from first to last (e.g. s1-s239). This way, I could run a simple for loop to download the .csv of each vote from https://www.govtrack.us/congress/votes/116-2020/s[x]/export/csv.
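
The URL pattern described above can be sketched as a small helper. This is only an illustration of the pattern, not the code used in the cells below; the function name `vote_csv_url` and its parameters are hypothetical.

```python
def vote_csv_url(congress, year, number, chamber='s'):
    """Return the CSV export URL for one vote.

    `chamber` defaults to 's' (Senate), the only chamber code described above.
    """
    base = 'https://www.govtrack.us/congress/votes'
    return f'{base}/{congress}-{year}/{chamber}{number}/export/csv'

# URLs for the first three Senate votes of the 116th Congress in 2020
urls = [vote_csv_url(116, 2020, n) for n in range(1, 4)]
```

Looping `n` over the full range of vote numbers for a year gives every file to download.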

### Data retrieval

In [None]:
# Import packages
import pandas as pd
import urllib.request
import os.path
from os import mkdir
import re
import datetime

In [None]:
govtrack1 = pd.read_csv('govtrack1.csv')
govtrack2 = pd.read_csv('govtrack2.csv')

In [None]:
# Download voting records (first batch)
# Note: the loop starts at row 1, so row 0 of govtrack1 is not downloaded here
for x in range(1, len(govtrack1)):
    congress = govtrack1['Congress'][x]
    year = govtrack1['Year'][x]
    # Create the folder relative to the working directory, matching the file paths below;
    # exist_ok avoids an error if the folder already exists
    folder = 'data_senate/' + str(year)
    os.makedirs(folder, exist_ok = True)
    base = 'https://www.govtrack.us/congress/votes/' + str(congress) + '-' + str(year)
    items = govtrack1['Items 1'][x]
    for y in range(1, items + 1):
        url = base + '/s' + str(y) + '/export/csv'
        file = folder + '/' + str(y) + '.csv'
        try:
            urllib.request.urlretrieve(url, file)
        except Exception:
            # Skip votes whose CSV could not be downloaded
            pass

In [None]:
# Download voting records (second batch, different items structure)
for x in range(len(govtrack2)):
    congress = govtrack2['Congress'][x]
    year = govtrack2['Year'][x]
    # Create the folder relative to the working directory, matching the file paths below;
    # exist_ok avoids an error if the folder already exists
    folder = 'data_senate/' + str(year)
    os.makedirs(folder, exist_ok = True)
    base = 'https://www.govtrack.us/congress/votes/' + str(congress) + '-' + str(year)
    if (year % 2) == 0:
        # Even year: numbering continues after the 'Items 2' value in the next row
        items_beg = govtrack2['Items 2'][x + 1] + 1
    else:
        # Odd year: numbering starts at 1
        items_beg = 1
    items_end = govtrack2['Items 2'][x]
    for y in range(items_beg, items_end + 1):
        url = base + '/s' + str(y) + '/export/csv'
        file = folder + '/' + str(y) + '.csv'
        try:
            urllib.request.urlretrieve(url, file)
        except Exception:
            # Skip votes whose CSV could not be downloaded
            pass
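
The even/odd logic in the second batch can be seen in isolation with a toy table. This is a sketch under assumptions: the numbers are made up for illustration (not taken from GovTrack), the helper name `senate_vote_range` is hypothetical, and the rows are assumed to be ordered newest first, which is inferred from the `x + 1` lookup above.

```python
import pandas as pd

# Hypothetical counts for illustration only; not real GovTrack values
toy = pd.DataFrame({'Year': [1942, 1941], 'Items 2': [300, 120]})

def senate_vote_range(table, row):
    """Return (first, last) vote numbers for the year in `row`, mirroring the loop above."""
    year = table['Year'][row]
    last = table['Items 2'][row]
    if year % 2 == 0:
        # Even year: continue after the value stored in the next row
        first = table['Items 2'][row + 1] + 1
    else:
        # Odd year: numbering starts at 1
        first = 1
    return int(first), int(last)

ranges = {int(toy['Year'][i]): senate_vote_range(toy, i) for i in range(len(toy))}
```

With this toy table, 1941 covers votes 1 to 120 and 1942 covers votes 121 to 300.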

In [None]:
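# Define extract bills function
def process_within_year(year, items_beg, items_end, govtrack):
    """Build a representative-by-vote table for one year and save it to data_senate/data/<year>.csv."""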
    print(datetime.datetime.now())
    print('Starting ' + str(year) + ' processing')
    bills = []
    passed = []
    for x in range(items_beg, items_end + 1):
        file = 'data_senate/' + str(year) + '/' + str(x) + '.csv'
        try: 
            bills_data = pd.read_csv(file, nrows = 1)
            congress = govtrack[govtrack['Year'] == year]['Congress'].iloc[0]
            base = 'https://www.govtrack.us/congress/votes/' + str(congress) + '-' + str(year)
            url = base + '/s' + str(x)
            # Label bills with url as opposed to name
            bills.append(url)
            
#             bill = bills_data.columns[0]
#             bill = bill.split(' - ')[1]
#             # Remove commas
#             bill = bill.replace(',', '')
#             bills.append(bill)

        except Exception:
            # File missing or unreadable: record the vote number so its column is dropped later
            passed.append(x)
        
    # Extract votes
    bill_nos = []
    names = []
    votes = []
    party = []
    for x in range(items_beg, items_end + 1):
        file = 'data_senate/' + str(year) + '/' + str(x) + '.csv'
        # Some files have extra bill rows or a blank row or are missing the header...
        try:
            votes_data = None
            # Try skipping 1, 2, then 3 leading rows until the header row is found
            for skip in (1, 2, 3):
                candidate = pd.read_csv(file, skiprows = skip)
                if candidate.columns[0] == 'person':
                    votes_data = candidate
                    break
            if votes_data is None:
                # No header row found: read everything and name the first six columns
                votes_data = pd.read_csv(file, skiprows = 0, header = None)
                votes_data = votes_data.iloc[:,0:6]
                votes_data.columns = ['person', 'state', 'district', 'vote', 'name', 'party']
            # Select the needed columns first so a missing column fails before anything is appended
            cols = votes_data[['name', 'vote', 'party']]
            bill_nos.extend([x] * len(cols))
            names.extend(cols['name'])
            votes.extend(cols['vote'])
            party.extend(cols['party'])
        except Exception:
            # Skip files that are missing or cannot be parsed
            pass
            
    # Get unique representatives
    unique_names = set(names)
    unique_names = list(unique_names)
    
    # Create data frame of all votes
    all_votes = pd.DataFrame({'BillNo': bill_nos, 'Name': names, 'Vote': votes, 'Party': party})
    
    # Create final data frame for analysis: one row per representative, one column per vote number
    data = pd.DataFrame(index = unique_names, columns = range(items_beg, items_end + 1))
    # Drop the votes whose files could not be read
    data = data.drop(passed, axis = 1)
    
    # Fill in final data frame
    for representative in unique_names:
        rep_record = all_votes[all_votes['Name'] == representative]
        # Only visit columns that still exist (unreadable votes were dropped above)
        for bill in data.columns:
            rep_votes = rep_record[rep_record['BillNo'] == bill]['Vote']
            # Assign with a single .loc call; chained indexing may write to a copy
            data.loc[representative, bill] = rep_votes.iloc[0] if len(rep_votes) else None
     
    # Name columns
    data.columns = bills
    
    # Quantify data
    data = data.replace({
        'Nay': 0, 'No': 0, 'Not Guilty': 0,
        'Yea': 1, 'Aye': 1, 'Guilty': 1,
        'Not Voting': 0.5, 'Present': 0.5,
    })
    
    # Drop votes with no data at all, then drop representatives missing any remaining vote
    data = data.dropna(axis = 1, how = 'all')
    data = data.dropna(axis = 0, how = 'any')
    print(str(len(data)) + ' representatives in ' + str(year))
    
    # Extract and insert party
    party = []
    for representative in data.index:
        res = re.findall(r'\[.*?\]', representative)
        res = res[0][1]
        party.append(res)
    data.insert(0, 'Party', party)

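    # Save to csv
    print(datetime.datetime.now())
    print('Saving ' + str(year) + ' data')
    # Make sure the output folder exists before writing
    os.makedirs('data_senate/data', exist_ok = True)
    data.to_csv('data_senate/data/' + str(year) + '.csv')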
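
The reshaping step inside `process_within_year` (long list of votes to a representative-by-vote table) could also be written with `DataFrame.pivot` instead of nested loops. The sketch below is an alternative illustration, not the method used above; the names and votes are made up, and it assumes names carry the party letter in square brackets, as the regular expression in the function implies.

```python
import re
import pandas as pd

# Hypothetical long-format votes; names and votes are made up for illustration
all_votes = pd.DataFrame({
    'BillNo': [1, 1, 2, 2],
    'Name': ['Sen. A Example [D]', 'Sen. B Example [R]',
             'Sen. A Example [D]', 'Sen. B Example [R]'],
    'Vote': ['Yea', 'Nay', 'Not Voting', 'Yea'],
})

# Same vote-to-number mapping as in process_within_year
vote_values = {'Nay': 0, 'No': 0, 'Yea': 1, 'Aye': 1,
               'Not Voting': 0.5, 'Present': 0.5,
               'Not Guilty': 0, 'Guilty': 1}

# One row per name, one column per vote number
wide = all_votes.pivot(index = 'Name', columns = 'BillNo', values = 'Vote')
wide = wide.apply(lambda col: col.map(vote_values))

# Party letter: the first character inside the square brackets
wide.insert(0, 'Party', [re.findall(r'\[.*?\]', name)[0][1] for name in wide.index])
```

Note that `pivot` raises an error if the same name appears twice for one vote number, whereas the loop above keeps the first match.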

In [None]:
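for year in range(1989, 2020 + 1):
    items_beg = 1
    # Take the single matching value explicitly; calling int() on a Series is deprecated in recent pandas
    items_end = int(govtrack1.loc[govtrack1['Year'] == year, 'Items 1'].iloc[0])
    process_within_year(year, items_beg, items_end, govtrack1)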

In [None]:
for year in range(1941, 1988 + 1):
    if (year % 2) == 0:
        # Even year: start after the previous year's 'Items 1' value
        prev_items = govtrack2.loc[govtrack2['Year'] == year - 1, 'Items 1'].iloc[0]
        items_beg = int(prev_items) + 1
        items_end = int(govtrack2.loc[govtrack2['Year'] == year, 'Items 2'].iloc[0])
    else:
        # Odd year: numbering starts at 1
        items_beg = 1
        items_end = int(govtrack2.loc[govtrack2['Year'] == year, 'Items 1'].iloc[0])
    process_within_year(year, items_beg, items_end, govtrack2)