### To-do
* Display by party or by cluster

### Data analysis

In [1]:
# Import packages
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import seaborn as sns
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
from warnings import simplefilter
import plotly.express as px

In [63]:
# PCA + clustering
def pca_clustering(year, clusters, color):
    # year = year of the dataset, between 1941 and 2020
    # clusters = number of k-means clusters
    # color = column used to color the points: 'Party' or 'Cluster'
    file = 'data_senate/data/' + str(year) +'.csv'
    data = pd.read_csv(file)
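    # Keep as many components as needed to explain 95% of the variance
    pca = PCA(n_components = 0.95)
    # Fit on the transposed vote matrix: each roll call is a sample, each senator a feature
    pca.fit(data.iloc[:,2:].T)
    # components_ has one row per component; transpose to get one row per senator
    pca_data = pd.DataFrame(pca.components_)
    pca_data = pca_data.T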
    pca_data.insert(0, 'Party', data['Party'].values)
    pca_data.insert(0, 'Senator', data['Unnamed: 0'])
    data.insert(0, 'PC1', pca_data[0].values)
    data.insert(0, 'PC2', pca_data[1].values)
    # Take an explicit copy so the column edits below do not raise SettingWithCopyWarning
    data_small = data.iloc[:,:2].copy()
    data_small.insert(0, 'Party', data['Party'].values)
    data_small.insert(0, 'Senator', data['Unnamed: 0'])
    k_means = KMeans(n_clusters = clusters, random_state = 20210318)
    k_means.fit(data_small[['PC1', 'PC2']])
    cluster = k_means.predict(data_small[['PC1', 'PC2']])
    data_small.insert(0, 'Cluster', cluster)
    data_small['Cluster']= data_small['Cluster'].astype(str)
    data_small = data_small.sort_values('Cluster')
    fig = px.scatter(data_small, x = 'PC1', y = 'PC2', color = color, hover_name = 'Senator')
    fig.show()
    print(str(len(data)) + ' Senators in the ' + str(year) + ' dataset')
    print('-' * 10)
    print(round(pca.explained_variance_ratio_[0]*100, 2), '% variance explained by PC1')
    print(round(pca.explained_variance_ratio_[1]*100, 2), '% variance explained by PC2')
    print('-' * 10)
    print('Most divisive issues (Senate):')
    print(data.iloc[:, 4:].std().sort_values(ascending = False)[0:5])
    print('-' * 10)
    print('Least divisive issues (Senate):')
    print(data.iloc[:, 4:].std().sort_values(ascending = True)[0:5])
    print('-' * 10)
    # Independents are grouped with the Democrats here
    democrats = data[(data['Party'] == 'D') | (data['Party'] == 'I')]
    print('Most divisive issues (Democratic Party):')
    print(democrats.iloc[:,4:].std().sort_values(ascending = False)[0:5])
    print('-' * 10)
    print('Least divisive issues (Democratic Party):')
    print(democrats.iloc[:,4:].std().sort_values(ascending = True)[0:5])
    print('-' * 10)
    republicans = data[data['Party'] == 'R']
    print('Most divisive issues (Republican Party):')
    print(republicans.iloc[:,4:].std().sort_values(ascending = False)[0:5])
    print('-' * 10)
    print('Least divisive issues (Republican Party):')
    print(republicans.iloc[:,4:].std().sort_values(ascending = True)[0:5])

In [64]:
pca_clustering(2020, 8, 'Party')

99 Senators in the 2020 dataset
----------
54.73 % variance explained by PC1
17.46 % variance explained by PC2
----------
Most divisive issues (Senate):
https://www.govtrack.us/congress/votes/116-2020/s29     0.502519
https://www.govtrack.us/congress/votes/116-2020/s30     0.502519
https://www.govtrack.us/congress/votes/116-2020/s27     0.502519
https://www.govtrack.us/congress/votes/116-2020/s132    0.502314
https://www.govtrack.us/congress/votes/116-2020/s51     0.502314
dtype: float64
----------
Least divisive issues (Senate):
https://www.govtrack.us/congress/votes/116-2020/s216    0.000000
https://www.govtrack.us/congress/votes/116-2020/s167    0.050252
https://www.govtrack.us/congress/votes/116-2020/s47     0.050252
https://www.govtrack.us/congress/votes/116-2020/s115    0.070703
https://www.govtrack.us/congress/votes/116-2020/s54     0.086146
dtype: float64
----------
Most divisive issues (Democratic Party):
https://www.govtrack.us/congress/votes/116-2020/s135    0.505291
https:/
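
The "divisive" rankings above are per-column standard deviations of the vote matrix. As a rough illustration of how to read them, here is a minimal sketch on a made-up matrix, assuming a 1 = yea / 0 = nay coding (the actual encoding of the CSV files is not shown here, and the column names are hypothetical). Under that assumption, an evenly split vote has a standard deviation of about 0.5 and a unanimous vote has 0.

```python
import pandas as pd

# Hypothetical vote matrix: one row per senator, one column per roll call.
# Assumed coding: 1 = yea, 0 = nay.
votes = pd.DataFrame({
    'unanimous':   [1, 1, 1, 1, 1, 1],
    'one_dissent': [1, 1, 1, 1, 1, 0],
    'even_split':  [1, 1, 1, 0, 0, 0],
})

# Same ranking rule as in pca_clustering: sample standard deviation per column
divisiveness = votes.std().sort_values(ascending=False)
print(divisiveness)
```

The evenly split column ranks first and the unanimous column last, which matches the pattern in the output above: values near 0.5 at the top and 0.000000 at the bottom.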

### Limitations
One major limitation is that, despite my efforts, I could not find a meaningful interpretation of what PC1 and PC2 represent within either the Democratic or the Republican Party. The components do appear to separate senators and help pick out those on the fringes (e.g., progressives and Tea Partiers), but it is very difficult to arrive at a sound conclusion about what they actually represent. By contrast, PC1 for the Senate as a whole very obviously represents Democrats vs. Republicans.
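
One possible way to probe what a component represents is to list the roll calls that score most strongly on it and then read what those votes were about. The sketch below is only an illustration on random data, not part of the original analysis: the helper `top_votes_for_component` and the `vote_*` column names are hypothetical, and it assumes the same orientation as `pca_clustering` (PCA fit on the transposed matrix, so each roll call gets a score on every component).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def top_votes_for_component(votes, component=0, n_top=5):
    """Rank roll calls by the magnitude of their score on one component.

    votes: DataFrame with one row per senator and one numeric column per
    roll call (hypothetical input; e.g. the rows of a single party).
    """
    pca = PCA(n_components=2)
    # Transposed fit: each roll call is a sample, so it receives a score
    scores = pca.fit_transform(votes.T.to_numpy())  # shape (n_votes, 2)
    ranked = pd.Series(scores[:, component], index=votes.columns)
    order = ranked.abs().sort_values(ascending=False).index
    return ranked.loc[order].head(n_top)

# Random stand-in for one party's vote matrix (20 senators, 12 roll calls)
rng = np.random.default_rng(0)
votes = pd.DataFrame(
    rng.integers(0, 2, size=(20, 12)).astype(float),
    columns=[f'vote_{i}' for i in range(12)],
)
print(top_votes_for_component(votes, component=0))
```

On real data, the returned index would hold the GovTrack URLs used as column names, so the top-ranked roll calls could be looked up directly. Whether they share a recognizable theme is exactly the open question described above.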
### Future directions
In this project, I focus on 2020, but in the future it would be interesting to run the analysis on multiple years and compare the degree of polarization, the most polarizing issues, the cluster structure, and so on.
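
A multi-year comparison could start from something like the following minimal sketch. It treats the share of variance explained by PC1 as a rough polarization proxy, which is an assumption rather than an established result. The function names `load_votes`, `pc1_share_by_year`, and `fake_loader` are hypothetical; `load_votes` reuses the file layout from `pca_clustering`, while the demo runs on random data so that it does not depend on the CSV files.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def load_votes(year):
    """Read one year's CSV using the path layout from pca_clustering."""
    data = pd.read_csv('data_senate/data/' + str(year) + '.csv')
    return data.iloc[:, 2:]  # drop the senator-name and Party columns

def pc1_share_by_year(years, loader=load_votes):
    """Return the fraction of variance explained by PC1 for each year."""
    shares = {}
    for year in years:
        votes = loader(year)
        pca = PCA(n_components=2)
        pca.fit(votes.T.to_numpy())  # same orientation as pca_clustering
        shares[year] = pca.explained_variance_ratio_[0]
    return pd.Series(shares, name='PC1 variance share')

def fake_loader(year):
    """Random stand-in for a year's vote matrix (30 senators, 40 roll calls)."""
    rng = np.random.default_rng(year)
    return pd.DataFrame(rng.integers(0, 2, size=(30, 40)).astype(float))

shares = pc1_share_by_year([2018, 2019, 2020], loader=fake_loader)
print(shares)
```

With the real files in place, calling `pc1_share_by_year(range(1941, 2021))` would give one value per year that could be plotted as a time series.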
### Conclusion
I saw [this Tweet](https://twitter.com/seanjtaylor/status/1331426161356808192) today and found that it encapsulates my motivation for pursuing this project: "Perhaps my most controversial opinion: In machine learning education, the focus on supervised learning and particularly on classification problems gives people a totally misguided idea about how to use data to solve real problems." I wanted to avoid the mostly banal Kaggle/UCI datasets and build a pipeline to gather my own data. Congressional data does not lend itself to sophisticated machine learning classification or regression models, but by applying PCA and clustering to it, I learned a lot about the structure of our political system: data-driven knowledge that could not be attained just by reading the news. This pipeline could be used to gather data on any Congress, all the way back to the first Congress of 1789, and could provide the basis for very interesting quantitative comparisons of how American politics has evolved over time.