# TF-IDF for Sponsored Legislation
The goal is to extract characteristic words and phrases from each legislator's sponsored bills, giving viewers a quick sense of that legislator's work in Congress.

Steps:

1. Connect to either the relational Postgres DB or the Mongo DB and get the bill descriptions

2. Create a single document per legislator by concatenating all of that legislator's bill descriptions

3. Use `sklearn` to calculate TF-IDF for all 1-, 2-, and 3-word n-grams

4. Take the top 10 n-grams by TF-IDF per legislator

5. Create a new table in the Postgres DB with these characteristic words
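
To make step 3 concrete, here is a minimal pure-Python sketch of how 1-, 2-, and 3-word combinations can be enumerated. It assumes, as scikit-learn's word analyzer does, that stop words are removed before the n-grams are assembled, which is why a phrase such as "district columbia safe" can show up in the final results. The `word_ngrams` helper and the two-word stop list are illustrative stand-ins, not part of the notebook.

```python
def word_ngrams(text, stop_words, min_n=1, max_n=3):
    """List all n-grams of min_n..max_n words after dropping stop words."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams

# Tiny stand-in stop list; scikit-learn's built-in English list is much longer
toy_stop_words = {'of', 'the'}
phrases = word_ngrams('District of Columbia Safe', toy_stop_words)
print(phrases)
```

Three remaining tokens yield three unigrams, two bigrams, and one trigram, so the vocabulary grows quickly as documents get longer.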

In [1]:
import os

import dotenv
import pandas as pd
import psycopg  # driver behind the 'postgresql+psycopg' SQLAlchemy URL
from sqlalchemy import create_engine
from sklearn.feature_extraction.text import TfidfVectorizer

dotenv.load_dotenv()
POSTGRES_PASSWORD = os.getenv('POSTGRES_PASSWORD')


In [2]:
dbms = 'postgresql'
package = 'psycopg'
user = 'postgres'
password = POSTGRES_PASSWORD
host = 'localhost'
port = '5432'
db = 'contrans'

engine = create_engine(f'{dbms}+{package}://{user}:{password}@{host}:{port}/{db}')
engine

Engine(postgresql+psycopg://postgres:***@localhost:5432/contrans)
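
If the password ever contains characters such as `@`, `:` or `/`, interpolating it directly into the URL would produce a string that cannot be parsed correctly. One possible workaround, sketched below, is to percent-encode the password with the standard library's `urllib.parse.quote_plus` first. The `build_url` helper and the example password are hypothetical and only serve the illustration.

```python
from urllib.parse import quote_plus

def build_url(dbms, package, user, password, host, port, db):
    """Assemble a SQLAlchemy-style URL, escaping the password (hypothetical helper)."""
    safe_password = quote_plus(password)
    return f"{dbms}+{package}://{user}:{safe_password}@{host}:{port}/{db}"

# Hypothetical password containing characters that are special in URLs
example_url = build_url('postgresql', 'psycopg', 'postgres',
                        'p@ss/word', 'localhost', '5432', 'contrans')
print(example_url)
```

SQLAlchemy's `URL.create` is another option that takes the parts as separate arguments and handles the escaping itself.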

In [3]:
myquery = """
SELECT b.bioguide_id, bv.text
FROM bill_versions bv
INNER JOIN bills b
    ON bv.bill_type = b.bill_type AND bv.bill_number = b.bill_number
"""

bill_description = pd.read_sql_query(myquery, con=engine)

In [5]:
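import re

TAG_PATTERN = re.compile(r'<.*?>')  # non-greedy match of any HTML tag

def strip_html_tag(text):
    """Replace each HTML tag in `text` with a single space."""
    return TAG_PATTERN.sub(' ', text)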

bill_description['text'] = bill_description['text'].apply(strip_html_tag)
bill_description

Unnamed: 0,bioguide_id,text
0,J000293,Shutdown Fairness Act This bill provides a...
1,O000173,Federal Worker Childcare Protection Act of 2...
2,A000380,This bill requires the federal government to ...
3,S001194,Federal Employees Civil Relief Act This b...
4,S001203,Fair Pay for Federal Contractors Act of 2025...
...,...,...
2746,M001233,This joint resolution nullifies the final rul...
2747,C001129,Laken Riley Act This bill requires the Dep...
2748,B001319,Laken Riley Act This bill requires the Dep...
2749,L000598,This resolution disapproves of the Central Bu...


In [6]:
bill_description = bill_description.groupby('bioguide_id').agg({'text': ' '.join}).reset_index()
bill_description

Unnamed: 0,bioguide_id,text
0,A000055,"Departments of Labor, Health and Human Servi..."
1,A000148,Supporting Transit Commutes Act&nbsp; This...
2,A000369,Coin Metal Modification Authorization and Co...
3,A000370,This resolution recognizes (1) the Greensboro...
4,A000371,No Hungry Kids in Schools Act This bill di...
...,...,...
485,W000829,Facility for Runway Operations and Safe Tran...
486,W000830,Chiquita Canyon Tax Relief Act This bill e...
487,Y000064,Synthetic Biology Advancement Act of 2025 ...
488,Y000067,This resolution supports the designation of N...
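
Row 1 above still contains `&nbsp;`, which suggests that HTML entities survive the tag stripping. Under the vectorizer's default token pattern such an entity would be tokenized as the word `nbsp`. A possible extra cleaning step, sketched here with only the standard library, decodes entities and collapses whitespace; `clean_description` is a hypothetical helper, not the notebook's function.

```python
import html
import re

TAG_RE = re.compile(r'<.*?>')
SPACE_RE = re.compile(r'\s+')

def clean_description(text):
    """Strip HTML tags, decode entities such as &nbsp;, and collapse whitespace."""
    no_tags = TAG_RE.sub(' ', text)
    decoded = html.unescape(no_tags)  # '&nbsp;' becomes a non-breaking space
    return SPACE_RE.sub(' ', decoded).strip()

# Made-up input shaped like the bill summaries above
sample = '<p>Supporting Transit Commutes Act&nbsp;</p><p>This bill</p>'
cleaned = clean_description(sample)
print(cleaned)
```

Tags are removed before entities are decoded so that escaped text such as `&lt;b&gt;` is not mistaken for markup.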


In [8]:
tfIdfVectorizer = TfidfVectorizer(stop_words='english',
                                  ngram_range=(1,3),
                                  max_df=0.8)
tfIdf = tfIdfVectorizer.fit_transform(bill_description['text'])
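
For intuition about what the vectorizer computes, the following pure-Python sketch reproduces the weighting that scikit-learn applies by default, as far as its documentation describes it: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document row. It also mimics `max_df` by dropping terms that occur in more than that share of documents. The sketch handles unigrams only, and the toy documents and the `tfidf_rows` helper are invented for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs, max_df=0.8):
    """Return one {term: weight} dict per document (unigrams only)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Keep only terms that appear in at most max_df of the documents
    kept = {t for t, d in df.items() if d <= max_df * n_docs}
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in kept}
    rows = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in kept)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

toy_docs = [
    'tobacco cessation health bill',
    'highway safety health bill',
    'water infrastructure bill',
]
rows = tfidf_rows(toy_docs)
print(rows[0])
```

In this toy corpus "bill" occurs in every document and is removed by the `max_df` filter, while "tobacco" (one document) outweighs "health" (two documents) in the first row.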

In [9]:
tfIdf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 471851 stored elements and shape (490, 258665)>

In [None]:
# Sanity check: top 10 terms by TF-IDF for the first legislator (row 0)

pd.DataFrame(tfIdf[0].T.todense(),
             index=tfIdfVectorizer.get_feature_names_out(),
             columns=["TF-IDF"]).sort_values('TF-IDF', ascending=False).head(10)

Unnamed: 0,TF-IDF
education,0.14023
health,0.12544
labor,0.125079
administration,0.119857
safety health,0.117186
english,0.106067
appropriations,0.102279
provides appropriations department,0.085867
provides appropriations,0.081503
services,0.077087


In [21]:
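def tfidf_one_legislator(index):
    """Return the top 10 TF-IDF terms for the legislator at row `index`."""
    # Densify one sparse row into a column indexed by term, then keep the 10 largest weights
    tfidf_data = pd.DataFrame(tfIdf[index].T.todense(),
                index=tfIdfVectorizer.get_feature_names_out(),
                columns=["TF-IDF"]).sort_values('TF-IDF', ascending=False).head(10)

    # Tag every row with the legislator's ID (positional lookup matches the matrix row order)
    tfidf_data['bioguide_id'] = bill_description['bioguide_id'].iloc[index]

    # Move the terms out of the index into a 'keyword' column
    tfidf_data = tfidf_data.reset_index()
    tfidf_data = tfidf_data.rename({'index': 'keyword'}, axis=1)

    return tfidf_data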

In [16]:
tfidf_one_legislator(37)

Unnamed: 0,TF-IDF,bioguide_id
tobacco,0.209162,B001303
tobacco cessation,0.19247,B001303
cessation,0.18047,B001303
provides medicaid children,0.096235,B001303
specified guidelines applies,0.096235,B001303
applies 90 federal,0.096235,B001303
sharing diagnostic,0.096235,B001303
sharing diagnostic therapy,0.096235,B001303
program chip coverage,0.096235,B001303
cessation services specifically,0.096235,B001303


In [22]:
tfidf_list = [tfidf_one_legislator(i) for i in range(len(bill_description))]
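
Each call in this loop densifies a row with 258,665 entries and sorts all of them, although only a small fraction are non-zero. A lighter alternative, sketched below under the assumption that a CSR row exposes its non-zero column positions and values as `.indices` and `.data`, is to pick the largest weights from those pairs alone. The example uses plain lists as stand-ins for the sparse row, and `top_k_terms` is a hypothetical helper.

```python
import heapq

def top_k_terms(indices, weights, feature_names, k=10):
    """Return the k (term, weight) pairs with the largest weights."""
    pairs = zip(indices, weights)
    best = heapq.nlargest(k, pairs, key=lambda pair: pair[1])
    return [(feature_names[i], w) for i, w in best]

# Stand-in data: a 6-term vocabulary and one row with three non-zero entries
toy_features = ['cessation', 'highways', 'medicaid', 'tobacco', 'vehicle', 'water']
toy_indices = [0, 2, 3]
toy_weights = [0.18, 0.09, 0.21]
top_two = top_k_terms(toy_indices, toy_weights, toy_features, k=2)
print(top_two)
```

With the real matrix, the call would presumably look like `top_k_terms(row.indices, row.data, names)`, where `row = tfIdf[i]` and `names` is fetched once from `get_feature_names_out()`.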

In [23]:
tfidf_fulldata = pd.concat(tfidf_list)
tfidf_fulldata

Unnamed: 0,keyword,TF-IDF,bioguide_id
0,education,0.140230,A000055
1,health,0.125440,A000055
2,labor,0.125079,A000055
3,administration,0.119857,A000055
4,safety health,0.117186,A000055
...,...,...,...
5,belknap,0.132143,Z000018
6,community water,0.132143,Z000018
7,community,0.123087,Z000018
8,fort,0.117122,Z000018


In [24]:
tfidf_fulldata = tfidf_fulldata.rename({'TF-IDF': 'tf_idf'}, axis=1)
tfidf_fulldata.to_csv('../data/thirdNF/tfidf.csv', index=False)

In [25]:
tfidf_fulldata.to_sql('tfidf', con=engine, index=False,
                      chunksize=1000, if_exists='replace')

-5

## Check the results for McGuire (VA-5)

In [26]:
myquery = """
SELECT t.keyword, t.tf_idf
FROM tfidf t
INNER JOIN members m
    ON t.bioguide_id = m.bioguide_id
WHERE m.state_abbrev = 'VA' AND m.district_code = 5
"""
pd.read_sql_query(myquery, con=engine)

Unnamed: 0,keyword,tf_idf
0,virginia,0.178559
1,dc,0.1592
2,highways,0.126308
3,covered agricultural,0.123232
4,agricultural vehicles,0.123232
5,agricultural,0.111185
6,interstate highways,0.105869
7,interstate,0.097632
8,vehicle,0.084546
9,district columbia safe,0.082155
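
The same check can be wrapped in a small function that takes the state and district as query parameters instead of hard-coding them. The sketch below uses an in-memory SQLite database as a stand-in for Postgres so that it runs on its own; the table and column names follow the query above, while the bioguide IDs and the handful of rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # stand-in for the Postgres database
conn.execute('CREATE TABLE members (bioguide_id TEXT, state_abbrev TEXT, district_code INTEGER)')
conn.execute('CREATE TABLE tfidf (keyword TEXT, tf_idf REAL, bioguide_id TEXT)')
# Hypothetical IDs and rows, for illustration only
conn.executemany('INSERT INTO members VALUES (?, ?, ?)',
                 [('X000001', 'VA', 5), ('X000002', 'VA', 6)])
conn.executemany('INSERT INTO tfidf VALUES (?, ?, ?)',
                 [('virginia', 0.178559, 'X000001'),
                  ('highways', 0.126308, 'X000001'),
                  ('water', 0.2, 'X000002')])

def keywords_for_district(conn, state, district):
    """Return (keyword, tf_idf) rows for one district, highest score first."""
    query = """
    SELECT t.keyword, t.tf_idf
    FROM tfidf t
    INNER JOIN members m
        ON t.bioguide_id = m.bioguide_id
    WHERE m.state_abbrev = ? AND m.district_code = ?
    ORDER BY t.tf_idf DESC
    """
    return conn.execute(query, (state, district)).fetchall()

va5 = keywords_for_district(conn, 'VA', 5)
print(va5)
```

SQLite uses `?` placeholders; with psycopg the placeholder style is `%s`, so the query text would need that small change.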
