### To-do
* Impute NaN, by row, by column, by party vote?
* Remove columns with >80% NaN
* Save data in separate folder

### Overview
Congressional voting patterns go beyond Democrat vs. Republican party lines. In this project, I "scrape" voting record data from GovTrack, extract all the representatives and their votes, and perform clustering in order to determine subgroups within the Democratic and Republican parties. Interestingly, for both parties, there is a "mainstream" voting cluster and distant clusters that represent smaller factions within the parties. I performed this analysis for both the House of Representatives and the Senate, but I decided to focus my analysis on the Senate because there are fewer members, and those members are more widely known.
### Goals
My main goal was to build a data extraction pipeline for GovTrack that could be used to run interesting analyses on Congressional voting patterns from any year. 
### Data
Data was "scraped" from https://www.govtrack.us/congress/votes. This page lists all the votes for both chambers of Congress, which can be filtered by year, chamber, category, etc. Each vote's page offers a .csv file that contains the names of the members and their votes.
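
Because some of the downloaded files carry extra title rows or a blank row before the column names, it can help to locate the header row explicitly before parsing. The following is only a sketch: `read_vote_rows` is a hypothetical helper, and `SAMPLE` is made-up text that imitates the layout (a title line, then the columns `person, state, district, vote, name, party`), not a real GovTrack record.

```python
import csv
import io

# Made-up sample imitating the layout of a vote export (not real data)
SAMPLE = """Senate Vote #1 - Example Bill Title
person,state,district,vote,name,party
1,AA,,Yea,Sen. Example One [D],Democrat
2,BB,,Nay,Sen. Example Two [R],Republican
"""

def read_vote_rows(text, first_column="person"):
    """Return (title, rows) where rows are dicts keyed by the header names."""
    lines = text.splitlines()
    # Find the first line whose first field is the expected header name
    for i, line in enumerate(lines):
        if line.split(",")[0].strip() == first_column:
            header_index = i
            break
    else:
        raise ValueError("no header row starting with %r found" % first_column)
    # Treat the first line as the title only if it precedes the header
    title = lines[0] if header_index > 0 else None
    reader = csv.DictReader(io.StringIO("\n".join(lines[header_index:])))
    return title, list(reader)

title, rows = read_vote_rows(SAMPLE)
print(title)
print([(row["name"], row["vote"]) for row in rows])
```

Scanning for the header keeps the parsing independent of how many rows precede it, at the cost of failing loudly on files that have no header at all.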

### Methods
Note: Many code cells are commented out because they take a long time to run.
First, I "scraped" voting records from https://www.govtrack.us/congress/votes. "Scraped" is in quotation marks because BeautifulSoup could not detect the relevant information on the page. However, I noticed that the URLs follow a pattern: each Congress and year (e.g. the 116th Congress in 2020) is represented as "votes/116-2020," the Senate is represented as "s," and votes are numbered sequentially from first to last (e.g. s1-s239). This let me run a simple for loop to download the .csv of each vote from https://www.govtrack.us/congress/votes/116-2020/s[x]/export/csv.
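
The URL pattern described above can be captured in a small helper. This is a sketch only: `vote_csv_url` and `BASE_URL` are hypothetical names introduced for illustration, and the pattern is the one observed above rather than a documented API.

```python
BASE_URL = "https://www.govtrack.us/congress/votes"

def vote_csv_url(congress, year, number, chamber="s"):
    """Build the CSV export URL for one vote; "s" is the Senate prefix."""
    return f"{BASE_URL}/{congress}-{year}/{chamber}{number}/export/csv"

# URLs for the first three Senate votes of the 116th Congress in 2020
urls = [vote_csv_url(116, 2020, n) for n in range(1, 4)]
print(urls[0])
# → https://www.govtrack.us/congress/votes/116-2020/s1/export/csv
```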

In [None]:
import pandas as pd
govtrack1 = pd.read_csv('govtrack1.csv')

In [None]:
# Download voting records (first batch)
import os
import urllib.request

for x in range(len(govtrack1)):
    congress = govtrack1['Congress'][x]
    year = govtrack1['Year'][x]
    # Create the per-year folder next to the notebook, matching the paths used below
    folder = 'data_senate/' + str(year)
    os.makedirs(folder, exist_ok = True)
    base = 'https://www.govtrack.us/congress/votes/' + str(congress) + '-' + str(year)
    items = govtrack1['Items 1'][x]
    for y in range(1, items + 1):
        url = base + '/s' + str(y) + '/export/csv'
        file = folder + '/' + str(y) + '.csv'
        try:
            urllib.request.urlretrieve(url, file)
        except Exception:
            # Skip votes that fail to download
            pass

In [None]:
import pandas as pd
govtrack2 = pd.read_csv('govtrack2.csv')

In [None]:
# Download voting records (second batch, different items structure)
import os
import urllib.request

for x in range(1, len(govtrack2)):
    congress = govtrack2['Congress'][x]
    year = govtrack2['Year'][x]
    # Create the per-year folder next to the notebook, matching the paths used below
    folder = 'data_senate/' + str(year)
    os.makedirs(folder, exist_ok = True)
    base = 'https://www.govtrack.us/congress/votes/' + str(congress) + '-' + str(year)
    # In even years the numbering continues after the count in the next row; in odd years it starts at 1
    if (year % 2) == 0:
        items_beg = govtrack2['Items 2'][x + 1] + 1
    else:
        items_beg = 1
    items_end = govtrack2['Items 2'][x]
    for y in range(items_beg, items_end + 1):
        url = base + '/s' + str(y) + '/export/csv'
        file = folder + '/' + str(y) + '.csv'
        try:
            urllib.request.urlretrieve(url, file)
        except Exception:
            # Skip votes that fail to download
            pass

In [None]:
# Extract bills
import re
import datetime
def process_within_year(year, items_beg, items_end):
    print(datetime.datetime.now())
    print('Starting ' + str(year) + ' processing')
    bills = []
    passed = []
    for x in range(items_beg, items_end + 1):
        file = 'data_senate/' + str(year) + '/' + str(x) + '.csv'
        try: 
            bills_data = pd.read_csv(file, nrows = 1)
            bill = bills_data.columns[0]
            bill = bill.split(' - ')[1]
            # Remove commas
            bill = bill.replace(',', '')
            bills.append(bill)
        except:
            passed.append(x)
        
    # Extract votes
    bill_nos = []
    names = []
    votes = []
    party = []
    for x in range(items_beg, items_end + 1):
        file = 'data_senate/' + str(year) + '/' + str(x) + '.csv'
        # Some files have extra bill rows or a blank row or are missing the header...
        try: 
            if pd.read_csv(file, skiprows = 1).iloc[0].index[0] == 'person':
                votes_data = pd.read_csv(file, skiprows = 1)
            elif pd.read_csv(file, skiprows = 2).iloc[0].index[0] == 'person':
                votes_data = pd.read_csv(file, skiprows = 2)
            elif pd.read_csv(file, skiprows = 3).iloc[0].index[0] == 'person':
                votes_data = pd.read_csv(file, skiprows = 3)
            else: 
                # No header row: read the current file and keep its first six columns
                votes_data = pd.read_csv(file, skiprows = 0, header = None)
                votes_data = votes_data.iloc[:,0:6]
                votes_data.columns = ['person', 'state', 'district', 'vote', 'name', 'party']
            length = len(votes_data)
            for y in range (0, length):
                bill_nos.append(x)
                names.append(votes_data.iloc[y]['name'])
                votes.append(votes_data.iloc[y]['vote'])
                party.append(votes_data.iloc[y]['party'])
        except:
            pass
            
    # Get unique representatives
    unique_names = set(names)
    unique_names = list(unique_names)
    
    # Create data frame of all votes
    all_votes = pd.DataFrame({'BillNo': bill_nos, 'Name': names, 'Vote': votes, 'Party': party})
    
    # Create final data frame for analysis
    data = pd.DataFrame(columns = range(items_beg, items_end + 1))
    data.insert(0, 'Representative', unique_names)
    data.index = unique_names
    data = data.drop('Representative', axis = 1)
    data = data.drop(passed, axis = 1)
    
    # Fill in final data frame (only for bills whose title was read successfully)
    for representative in unique_names:
        rep_record = all_votes[all_votes['Name'] == representative]
        for bill in data.columns:
            if bill in list(rep_record['BillNo']):
                rep_voteonbill = rep_record[rep_record['BillNo'] == bill]['Vote'].iloc[0]
                # Single .loc call avoids chained assignment on a copy
                data.loc[representative, bill] = rep_voteonbill
            else:
                data.loc[representative, bill] = None
     
    # Name columns
    data.columns = bills
    
    # Quantify data
    data = data.replace('Nay', 0)
    data = data.replace('No', 0)
    data = data.replace('Yea', 1)
    data = data.replace('Aye', 1)
    data = data.replace('Not Voting', 0.5)
    data = data.replace('Present', 0.5)
    data = data.replace('Not Guilty', 0)
    data = data.replace('Guilty', 1)
    
    # Remove rows with NaN
    data = data.dropna()
    
    # Extract and insert party
    party = []
    for representative in data.index:
        res = re.findall(r'\[.*?\]', representative)
        res = res[0][1]
        party.append(res)
    data.insert(0, 'Party', party)

    # Save to csv
    print(datetime.datetime.now())
    print('Saving ' + str(year) + ' data')
    data.to_csv('data_senate/' + str(year) +'/data.csv')
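
The nested loops and the chain of `replace` calls in `process_within_year` could, as one alternative, be expressed with a pivot and a single lookup table. The following is a minimal sketch, not the method used above: the senators and votes in `all_votes` are made-up placeholders, `VOTE_VALUES` is a hypothetical name, and the numeric codes are the ones from the function.

```python
import pandas as pd

# Made-up long-format votes (one row per senator per bill), not real records
all_votes = pd.DataFrame({
    "BillNo": [1, 1, 2, 2, 3],
    "Name": ["Sen. A [D]", "Sen. B [R]", "Sen. A [D]", "Sen. B [R]", "Sen. A [D]"],
    "Vote": ["Yea", "Nay", "Not Voting", "Aye", "Guilty"],
})

# Same numeric encoding as in process_within_year
VOTE_VALUES = {
    "Nay": 0, "No": 0, "Not Guilty": 0,
    "Yea": 1, "Aye": 1, "Guilty": 1,
    "Not Voting": 0.5, "Present": 0.5,
}

# One row per senator, one column per bill; missing votes become NaN
wide = all_votes.pivot(index="Name", columns="BillNo", values="Vote")

# Map each vote string to its numeric code, column by column
encoded = wide.apply(lambda column: column.map(VOTE_VALUES))
print(encoded)
```

Unlike the loop version, which keeps the first vote when a senator appears twice for the same bill, `pivot` raises an error on duplicate (Name, BillNo) pairs, so duplicates would need to be dropped first.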

In [None]:
for year in range(2003, 2020 + 1):
    items_beg = 1
    # Select the single matching row explicitly before converting to int
    items_end = int(govtrack1.loc[govtrack1['Year'] == year, 'Items 1'].iloc[0])
    process_within_year(year, items_beg, items_end)

In [None]:
for year in range(1972, 1988 + 1):
    if (year % 2) == 0:
        # Even years continue the numbering from the previous year's count
        items_beg = int(govtrack2.loc[govtrack2['Year'] == year - 1, 'Items 1'].iloc[0]) + 1
        items_end = int(govtrack2.loc[govtrack2['Year'] == year, 'Items 2'].iloc[0])
    else:
        items_beg = 1
        items_end = int(govtrack2.loc[govtrack2['Year'] == year, 'Items 1'].iloc[0])
    process_within_year(year, items_beg, items_end)

In [None]:
# Principal component analysis
import pandas as pd
from sklearn.decomposition import PCA
import seaborn as sns

year = 2020
file = 'data_senate/' + str(year) +'/data.csv'
# The first column of data.csv holds the senators' names; use it as the index
data = pd.read_csv(file, index_col = 0)

pca = PCA(n_components = 0.95)
# Column 0 is Party; the remaining columns are the votes
pca.fit(data.iloc[:,1:].T)
# After transposing, each row is a senator and each column a principal component
pca_data = pd.DataFrame(pca.components_)
pca_data = pca_data.T
pca_data.shape
# Scatterplot of first two PCs with party as label
sns.scatterplot(x = pca_data[0], y = pca_data[1], hue = data['Party'].values, palette = 'bright')
print(pca.explained_variance_ratio_[0]*100, '% variance explained by PC1')
print(pca.explained_variance_ratio_[1]*100, '% variance explained by PC2')

In [None]:
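# Filter by Democrats (copy so that columns can be inserted later)
democrats = data[(data['Party'] == 'D') | (data['Party'] == 'I')].copy()
# Most divisive issues within Democratic Party (highest standard deviation of votes)
democrats.std(numeric_only = True).sort_values(ascending = False)[0:5]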

Because the contemporary American political system is so polarized, it is much more interesting to look at clusters within the two parties separately. Below, PCA is run on only Democratic (and Independent) senators.

In [None]:
# Principal component analysis
from sklearn.decomposition import PCA
pca = PCA(n_components = 0.95)
pca.fit(democrats.iloc[:,1:].T)
pca_data = pd.DataFrame(pca.components_)
pca_data = pca_data.T
pca_data.shape

Subjectively, at first glance, there appears to be one large "mainstream" cluster and then two distant subgroups.

In [None]:
# Scatterplot of first two PCs with party as label
import seaborn as sns
sns.scatterplot(x = pca_data[0], y = pca_data[1], hue = democrats['Party'].values, palette = 'bright')

In [None]:
print(pca.explained_variance_ratio_[0]*100, '% variance explained by PC1')
print(pca.explained_variance_ratio_[1]*100, '% variance explained by PC2')

In [None]:
# Insert PCs
democrats.insert(0, 'PC1', pca_data[0].values)
democrats.insert(0, 'PC2', pca_data[1].values)

In [None]:
# Scale PCs before clustering
from sklearn.preprocessing import StandardScaler

democrats_small = democrats.iloc[:,:2]
columns = democrats_small.columns
index = democrats_small.index
scaler = StandardScaler()
democrats_small = scaler.fit_transform(democrats_small)
democrats_small = pd.DataFrame(democrats_small)
democrats_small.columns = columns
democrats_small.index = index

Use the elbow method to guide clustering (although in the end, I will personally choose a k value that makes the most sense to me).
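
For intuition, the elbow method amounts to fitting K-means for a range of k values and looking for the point where the within-cluster sum of squares (scikit-learn's `inertia_`) stops dropping sharply. A minimal sketch on synthetic data follows; the three blobs in `points` are made up for illustration and are not the senators' PCs.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated 2-D blobs of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit K-means for k = 1..6 and record the within-cluster sum of squares
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On data like this the drop in inertia should flatten after k = 3, which is the "elbow"; on real voting data the bend is often less clear, which is why the final k remains a judgment call.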

In [None]:
# Find optimal number of clusters for K-Means
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
from warnings import simplefilter
simplefilter(action = 'ignore', category = FutureWarning)

model = KMeans(random_state = 20201124)
visualizer = KElbowVisualizer(model, k = (2,10), timings = False)
visualizer.fit(democrats_small[['PC1', 'PC2']])
elbow_visualizer = visualizer.show()

In [None]:
# K-means clustering
k_means = KMeans(n_clusters = 4, random_state = 20201124)
k_means.fit(democrats_small[['PC1', 'PC2']])
cluster = k_means.predict(democrats_small[['PC1', 'PC2']])
democrats_small.insert(0, 'Cluster', cluster)

With k = 4, Democratic senators appear to be neatly divided into two groups of "mainstream" senators, and the two subgroups mentioned earlier.

In [None]:
# Scatterplot of first two PCs with cluster as label
import seaborn as sns
sns.scatterplot(x = democrats_small['PC1'], y = democrats_small['PC2'], 
                hue = democrats_small['Cluster'].values, palette = 'bright')

The first subgroup picks out the senators known to be the most progressive.

In [None]:
# Cluster 2 = progressive Democrats
democrats_small[democrats_small['Cluster'] == 2]

The second subgroup appears to pick out the ["Red State Democrats"](https://www.politico.com/news/2020/02/05/doug-jones-impeachment-vote-110818).

In [None]:
# Cluster 3 = Red State Democrats
democrats_small[democrats_small['Cluster'] == 3]

In [None]:
# Insert cluster into main data frame
democrats.insert(0, 'Cluster', cluster)

Okay, now let's try the same thing for Republican senators.

In [None]:
# Filter by Republicans
republicans = data[data['Party'] == 'R'].copy()

The most divisive issue within the Republican Party in 2020 was the Great American Outdoors Act, which permanently funded the Land and Water Conservation Fund and set aside money for deferred maintenance on national parks and other public lands.

In [None]:
# Most divisive issues within Republican Party
republicans.std(numeric_only = True).sort_values(ascending = False)[0:5]

In [None]:
# Principal component analysis
from sklearn.decomposition import PCA
pca = PCA(n_components = 0.95)
pca.fit(republicans.iloc[:,1:].T)
pca_data = pd.DataFrame(pca.components_)
pca_data = pca_data.T
pca_data.shape

As with Democratic senators, there appears to be a "mainstream" cluster and two distant subgroups.

In [None]:
# Scatterplot of first two PCs
import seaborn as sns
sns.scatterplot(x = pca_data[0], y = pca_data[1])

In [None]:
print(pca.explained_variance_ratio_[0]*100, '% variance explained by PC1')
print(pca.explained_variance_ratio_[1]*100, '% variance explained by PC2')

In [None]:
# Insert PCs
republicans.insert(0, 'PC1', pca_data[0].values)
republicans.insert(0, 'PC2', pca_data[1].values)

In [None]:
# Scale PCs before clustering
from sklearn.preprocessing import StandardScaler

republicans_small = republicans.iloc[:,:2]
columns = republicans_small.columns
index = republicans_small.index
scaler = StandardScaler()
republicans_small = scaler.fit_transform(republicans_small)
republicans_small = pd.DataFrame(republicans_small)
republicans_small.columns = columns
republicans_small.index = index

In [None]:
# Find optimal number of clusters for K-Means
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
from warnings import simplefilter
simplefilter(action = 'ignore', category = FutureWarning)

model = KMeans(random_state = 20201124)
visualizer = KElbowVisualizer(model, k = (2,10), timings = False)
visualizer.fit(republicans_small[['PC1', 'PC2']])
elbow_visualizer = visualizer.show()

In [None]:
# K-means clustering
k_means = KMeans(n_clusters = 4, random_state = 20201124)
k_means.fit(republicans_small[['PC1', 'PC2']])
cluster = k_means.predict(republicans_small[['PC1', 'PC2']])
republicans_small.insert(0, 'Cluster', cluster)

With k = 4, K-means picks out the two subgroups and divides the mainstream cluster into two halves.

In [None]:
# Scatterplot of first two PCs with cluster as label
import seaborn as sns
sns.scatterplot(x = republicans_small['PC1'], y = republicans_small['PC2'], 
                hue = republicans_small['Cluster'].values, palette = 'bright')

The first subgroup appears to pick out [the more liberal Republicans who took a firm stance against Trump](https://www.politico.com/news/2020/01/31/alexander-murkowski-collins-romney-impeachment-trial-110138).

In [None]:
# Cluster 0 = liberal Republicans
republicans_small[republicans_small['Cluster'] == 0]

The other subgroup appears to pick out the [radically conservative Tea Party members](https://www.politico.com/story/2011/01/4th-senator-joins-tea-party-caucus-048302).

In [None]:
# Cluster 2 = Tea Party Republicans
republicans_small[republicans_small['Cluster'] == 2]

In [None]:
# Insert cluster into main data frame
republicans.insert(0, 'Cluster', cluster)

### Limitations
One major limitation is that, as much as I tried, I could not find a meaningful interpretation of what PC1 and PC2 represent within either the Democratic or the Republican party. While the components do appear to provide separation and help pick out the more fringe senators (e.g. progressives, Tea Partiers, etc.), it is very difficult to reach a sound conclusion about what they actually represent. This is in contrast to PC1 for the Senate as a whole, which very obviously represents Democrats vs. Republicans.
### Future directions
In this project, I focus on the current year, 2020, but in the future it would be interesting to run the analysis on multiple years and compare the degree of polarization, the most polarizing issues, the cluster structure, etc.
### Conclusion
I saw [this Tweet](https://twitter.com/seanjtaylor/status/1331426161356808192) today, and found it to encapsulate my motivation for pursuing this project: "Perhaps my most controversial opinion: In machine learning education, the focus on supervised learning and particularly on classification problems gives people a totally misguided idea about how to use data to solve real problems." I wanted to avoid the mostly banal Kaggle/UCI datasets and build a pipeline to gather my own data. Congressional data does not lend itself to sophisticated machine learning classification/regression models, but I believe that by applying PCA + clustering to it, I learned a lot about the structure of our political system - data-driven knowledge that could not be attained just by reading the news. This pipeline could be used to gather data on any Congress, all the way back to the first Congress of 1789, and could provide the basis for very interesting quantitative comparisons of how American politics has evolved over time.