## Setup 

This section sets up the Python environment.

It loads a few Python libraries we'll be using.

You can just run it and move on! 

In [1]:
# Pandas and NumPy settings for data analysis
import pandas as pd
import numpy as np

# Always show all dataframe columns
pd.set_option('display.max_columns', None)

# Some libraries for better displaying in Jupyter
from IPython.display import display, HTML

# TQDM for progress bars
from tqdm.notebook import tqdm
tqdm.pandas()

In [2]:
import os
import openai
import dotenv
dotenv.load_dotenv()

openai.organization = None
openai.api_key = os.getenv("OPENAI_API_KEY")
# openai.models.list()  # see all OpenAI models
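
If the key is missing, `os.getenv` quietly returns `None` and the problem only surfaces later, at the first API call. A minimal sketch of an earlier check, assuming the key lives in the `OPENAI_API_KEY` variable as above; `require_env` is a hypothetical helper name, not part of `dotenv` or `openai`:

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value

# Usage in the notebook (hypothetical):
# openai.api_key = require_env("OPENAI_API_KEY")
```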

## Load Data

In this section I show some of the data we grabbed from BillTrack50.

In [3]:
# Load bill summaries for Maryland bills
bills = pd.read_csv('bill_search.csv')

# Remove last row, which is a summary of all
bills = bills[0:-1] 

# Display the bills
bills 

Unnamed: 0,State,Bill Number,Name,Summary,Bill Progress,Last Action,Action Date
0,MD,HB1,Maryland Paint Stewardship,Requiring producers of architectural paint or ...,Crossed Over,"Referred Education, Energy, and the Environment",02/26/2024
1,MD,HB2,Baltimore City - Property Taxes - Authority to...,AN ACT concerning Baltimore City - Property Ta...,In Committee,House Ways and Means Hearing (13:00:00 1/25/20...,01/25/2024
2,MD,HB3,Land Use - Expedited Development Review Proces...,AN ACT concerning Land Use - Expedited Develop...,In Committee,House Environment and Transportation Hearing (...,01/30/2024
3,MD,HB6,Public Safety - Law Enforcement - Quotas (Comm...,AN ACT concerning Public Safety - Law Enforcem...,In Committee,House Judiciary Hearing (13:00:00 1/23/2024 ),01/23/2024
4,MD,HB8,Maryland Police Training and Standards Commiss...,AN ACT concerning Maryland Police Training and...,In Committee,House Judiciary Hearing (13:00:00 1/23/2024 ),01/23/2024
...,...,...,...,...,...,...,...
995,MD,SB671,Foreclosure Proceedings - Residential Mortgago...,Requiring that individuals have access to lega...,In Committee,Senate Judicial Proceedings Hearing (13:00:00 ...,02/20/2024
996,MD,SB676,Tax Assistance for Low-Income Marylanders - Fu...,"Requiring the Comptroller, beginning in fiscal...",In Committee,Senate Budget and Taxation Hearing (13:00:00 2...,02/14/2024
997,MD,SB677,Comptroller - Electronic Tax and Fee Return Fi...,"Requiring, beginning in calendar year 2026, th...",In Committee,Senate Budget and Taxation Hearing (13:00:00 2...,02/14/2024
998,MD,SB678,Income Tax - Technical Corrections,Repealing certain obsolete provisions of law c...,Crossed Over,Referred Ways and Means,03/01/2024


## Keywords

This section extracts some keywords from the bills using a library called `yake` (Yet Another Keyword Extractor).

It also uses `pandarallel` to parallelize this process and get it done faster!

In [4]:
from yake import KeywordExtractor
from pandarallel import pandarallel

# Define a function to extract keywords from a text
kw_extractor = KeywordExtractor()
def get_keywords(text):
    keywords = kw_extractor.extract_keywords(text)
    return [x for x,y in keywords]

# Run the function in parallel to speed things up
pandarallel.initialize(progress_bar=True)
bills['keywords'] = bills['Summary'].parallel_apply(get_keywords)

# Display the bills with the keywords attached
bills

INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.



Unnamed: 0,State,Bill Number,Name,Summary,Bill Progress,Last Action,Action Date,keywords
0,MD,HB1,Maryland Paint Stewardship,Requiring producers of architectural paint or ...,Crossed Over,"Referred Education, Energy, and the Environment",02/26/2024,"[Paint Stewardship Program, paint beginning Ja..."
1,MD,HB2,Baltimore City - Property Taxes - Authority to...,AN ACT concerning Baltimore City - Property Ta...,In Committee,House Ways and Means Hearing (13:00:00 1/25/20...,01/25/2024,"[special property tax, property tax rate, Vaca..."
2,MD,HB3,Land Use - Expedited Development Review Proces...,AN ACT concerning Land Use - Expedited Develop...,In Committee,House Environment and Transportation Hearing (...,01/30/2024,"[Expedited Development Review, Development Rev..."
3,MD,HB6,Public Safety - Law Enforcement - Quotas (Comm...,AN ACT concerning Public Safety - Law Enforcem...,In Committee,House Judiciary Hearing (13:00:00 1/23/2024 ),01/23/2024,"[law enforcement quotas, law enforcement offic..."
4,MD,HB8,Maryland Police Training and Standards Commiss...,AN ACT concerning Maryland Police Training and...,In Committee,House Judiciary Hearing (13:00:00 1/23/2024 ),01/23/2024,"[United States armed, Police Officer Certifica..."
...,...,...,...,...,...,...,...,...
995,MD,SB671,Foreclosure Proceedings - Residential Mortgago...,Requiring that individuals have access to lega...,In Committee,Senate Judicial Proceedings Hearing (13:00:00 ...,02/20/2024,"[Foreclosure Proceedings Program, Maryland Leg..."
996,MD,SB676,Tax Assistance for Low-Income Marylanders - Fu...,"Requiring the Comptroller, beginning in fiscal...",In Committee,Senate Budget and Taxation Hearing (13:00:00 2...,02/14/2024,"[Low-Income Marylanders Fund, mobile tax clini..."
997,MD,SB677,Comptroller - Electronic Tax and Fee Return Fi...,"Requiring, beginning in calendar year 2026, th...",In Committee,Senate Budget and Taxation Hearing (13:00:00 2...,02/14/2024,"[Comptroller be filed, beginning in calendar, ..."
998,MD,SB678,Income Tax - Technical Corrections,Repealing certain obsolete provisions of law c...,Crossed Over,Referred Ways and Means,03/01/2024,"[income tax revenue, Repealing certain obsolet..."
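
`extract_keywords` returns `(keyword, score)` pairs, and in yake a lower score means a more relevant keyword; `get_keywords` above throws the scores away. A minimal sketch of keeping only the best few instead, using made-up pairs in place of real yake output (`top_keywords` and the scores are hypothetical, for illustration only):

```python
def top_keywords(scored_keywords, n=5):
    """Keep the n most relevant keywords from (keyword, score) pairs.

    Assumes yake's convention that a LOWER score means MORE relevant.
    """
    ranked = sorted(scored_keywords, key=lambda pair: pair[1])
    return [keyword for keyword, _score in ranked[:n]]

# Made-up pairs shaped like yake output; these are not real scores
example = [
    ("State fruit", 0.02),
    ("State", 0.30),
    ("Designating persimmon", 0.05),
]
print(top_keywords(example, n=2))  # ['State fruit', 'Designating persimmon']
```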


## Remove documents that are too big

OpenAI's embedding models can only take so much text at once: about 8,000 tokens per input (8,191 for `text-embedding-3-small`).

https://platform.openai.com/docs/guides/embeddings/embedding-models

So we'll remove any documents that are longer than that for now.

We can use `tiktoken` to [count the tokens](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb) in each document.

In [5]:
# Import modules
import tiktoken
from openai import OpenAI
client = OpenAI()

# Set embedding model parameters
embedding_model = "text-embedding-3-small"  # the model we will use to make embeddings
embedding_encoding = "cl100k_base"  # the encoding used by text-embedding-3-small
max_tokens = 8000  # the maximum for text-embedding-3-small is 8191

# Get the encoding for the specified model
encoding = tiktoken.get_encoding(embedding_encoding)

# Make a new column with the combined title and summary
bills["combined"] = (
    "Title: " + bills.Name.str.strip() + "; Content: " + bills.Summary.str.strip()
)

# Make a new column with the number of tokens in the combined title and summary
bills["n_tokens"] = bills.combined.apply(lambda x: len(encoding.encode(x)))

# Sort by that column
bills = bills.sort_values(by='n_tokens', ascending=False)

# Display the bills
bills


Unnamed: 0,State,Bill Number,Name,Summary,Bill Progress,Last Action,Action Date,keywords,combined,n_tokens
691,MD,SB218,Physicians and Allied Health Professions - Reo...,AN ACT concerning Physicians and Allied Health...,In Committee,Senate Finance Hearing (13:00:00 1/30/2024 ),01/30/2024,"[Health Occupations Section, Allied Health Pro...",Title: Physicians and Allied Health Profession...,3466
769,MD,SB306,Chesapeake and Atlantic Coastal Bays Critical ...,AN ACT concerning Chesapeake and Atlantic Coas...,In Committee,"Senate Education, Energy, and the Environment ...",01/31/2024,"[Bays Critical Area, Coastal Bays Critical, Cr...",Title: Chesapeake and Atlantic Coastal Bays Cr...,931
205,MD,HB233,Chesapeake and Atlantic Coastal Bays Critical ...,AN ACT concerning Chesapeake and Atlantic Coas...,In Committee,House Environment and Transportation Hearing (...,02/07/2024,"[Bays Critical Area, Coastal Bays Critical, Cr...",Title: Chesapeake and Atlantic Coastal Bays Cr...,930
238,MD,HB273,Real Property - Regulation of Common Ownership...,AN ACT concerning Real Property - Regulation o...,In Committee,House Environment and Transportation Hearing (...,02/06/2024,"[Common Ownership Community, Program Evaluatio...",Title: Real Property - Regulation of Common Ow...,595
767,MD,SB304,Natural Resources - State Boat Act - Alterations,AN ACT concerning Natural Resources - State Bo...,In Committee,"Senate Education, Energy, and the Environment ...",01/31/2024,"[State Boat Act, Natural Resources Section, St...",Title: Natural Resources - State Boat Act - Al...,489
...,...,...,...,...,...,...,...,...,...,...
639,MD,SB156,Port of Baltimore - Renaming,Renaming the Port of Baltimore to be the Helen...,In Committee,Senate Finance Hearing (13:00:00 2/28/2024 ),02/28/2024,"[Helen Delich Bentley, Delich Bentley Port, Po...",Title: Port of Baltimore - Renaming; Content: ...,28
150,MD,HB171,State Board of Pharmacy - Membership - Veterin...,Adding a veterinary pharmacist member to the S...,In Committee,House Health and Government Operations Hearing...,02/08/2024,"[Board of Pharmacy, State Board, veterinary ph...",Title: State Board of Pharmacy - Membership - ...,27
507,MD,HB733,Baltimore City - Alcoholic Beverages - Licensi...,Increasing certain licensing fees for alcoholi...,In Committee,House Economic Matters Hearing (13:00:00 2/19/...,02/19/2024,"[Baltimore City, beverages in Baltimore, Incre...",Title: Baltimore City - Alcoholic Beverages - ...,26
566,MD,SB74,State Designations - State Fruit - Persimmon,Designating persimmon as the State fruit.,In Committee,"Senate Education, Energy, and the Environment ...",03/08/2024,"[State fruit, Designating persimmon, State, De...",Title: State Designations - State Fruit - Pers...,25


In [6]:
# Grab the rows where the text is too big for the context window of the model (>8000 tokens)
too_long = bills.query("n_tokens > @max_tokens") 

# Print how many will be removed
print(f"Removing {len(too_long)} bills that are too long")

# Display the removed bills here in this cell so we can see what we're losing
display(too_long)  

# Remove the rows where the text is too big for the context window of the model
bills = bills.query("n_tokens <= @max_tokens")  

Removing 0 bills that are too long


Unnamed: 0,State,Bill Number,Name,Summary,Bill Progress,Last Action,Action Date,keywords,combined,n_tokens
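
No bills were dropped here, but with longer documents an alternative to removing them is to truncate them to the token budget. A rough sketch, assuming an encoder object with `encode` and `decode` methods like the `tiktoken` encoding above. So that the block runs without `tiktoken`, the demo swaps in a toy whitespace tokenizer; `WhitespaceEncoding` and `truncate_to_tokens` are hypothetical names:

```python
class WhitespaceEncoding:
    """Toy stand-in for a tiktoken encoding: one token per whitespace-separated word."""

    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)


def truncate_to_tokens(text, encoding, max_tokens):
    """Cut text down to at most max_tokens tokens, leaving short texts untouched."""
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])


toy = WhitespaceEncoding()
print(truncate_to_tokens("Designating persimmon as the State fruit.", toy, 3))
# Designating persimmon as
```

In the notebook you would pass `encoding` and `max_tokens` in place of the toy tokenizer. Truncation loses the end of the document, so it is worth checking which bills it affects before relying on it.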


## Embeddings

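Now we take the "combined" column, which contains the title and summary of each bill, and send it to OpenAI in batches to get an embedding for every bill.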

In [7]:
from openai import OpenAI
client = OpenAI()

def get_embeddings(texts, model="text-embedding-3-small"):
    # Replace newlines in each text and ensure it's a list of texts
    texts = [text.replace("\n", " ") for text in texts]
    # OpenAI's embeddings.create can process multiple inputs as a list
    response = client.embeddings.create(input=texts, model=model)
    # Extract embeddings from the response
    embeddings = [item.embedding for item in response.data]
    return embeddings

# Function to process DataFrame in batches and return a list of embeddings
def process_in_batches(df, column_name, batch_size=10):
    # Break the DataFrame into batches of size `batch_size`
    batches = [df[column_name].iloc[i:i + batch_size] for i in range(0, len(df), batch_size)]
    # Process each batch and collect embeddings
    all_embeddings = []
    for batch in tqdm(batches, desc="Processing batches"):
        batch_embeddings = get_embeddings(batch.tolist())
        all_embeddings.extend(batch_embeddings)
    return all_embeddings

# Example usage
batch_size = 100  # Adjust based on your preference and rate limits
bills['embedding'] = process_in_batches(bills, 'combined', batch_size=batch_size)


Processing batches:   0%|          | 0/10 [00:00<?, ?it/s]
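
The batching idea can be sketched without pandas or the API: split the texts into fixed-size chunks, and retry a chunk after a pause if the call fails (for example on a rate limit). This is an illustrative sketch, not the notebook's code; `chunked`, `embed_with_retry`, and `fake_embed` are hypothetical names, and `fake_embed` stands in for the real API call.

```python
import time

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_with_retry(embed_fn, texts, retries=3, base_delay=1.0):
    """Call embed_fn(texts), doubling the wait after each failed attempt."""
    for attempt in range(retries):
        try:
            return embed_fn(texts)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)

def fake_embed(texts):
    """Stand-in for the API: one dummy 1-dimensional vector per text."""
    return [[float(len(text))] for text in texts]

texts = [f"bill {i}" for i in range(999)]
batches = list(chunked(texts, 100))
print(len(batches))  # 10

embeddings = []
for batch in batches:
    embeddings.extend(embed_with_retry(fake_embed, batch))
print(len(embeddings))  # 999
```

Keeping the results in the same order as the input matters here, because the list is assigned straight back to the dataframe as a new column.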

In [8]:
# Drop the combined and n_tokens columns, since they were only needed for making the embeddings
bills = bills.drop(columns=['combined', 'n_tokens'])
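
The `embedding` column now holds one vector per bill. What makes these vectors useful is that texts with similar meaning tend to get vectors pointing in similar directions, which is commonly measured with cosine similarity. A minimal pure-Python sketch with made-up 3-dimensional vectors (the real vectors are much longer; `cosine_similarity` is written here for illustration, not taken from a library):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors: bill_a and bill_b are "about the same thing", bill_c is not
bill_a = [0.2, 0.9, 0.1]
bill_b = [0.25, 0.85, 0.05]
bill_c = [0.9, -0.1, 0.4]

print(cosine_similarity(bill_a, bill_b) > cosine_similarity(bill_a, bill_c))  # True
```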

## Dimensionality Reduction (t-SNE)

Each embedding is a vector with 1,536 dimensions. Viewing data in that many dimensions would break your brain. 🤯

Our brains can only handle 2 or 3 dimensions at a time. We'll use t-SNE to reduce the number of dimensions, flattening the multi-dimensional space into 2 dimensions.

> Here are some things to keep in mind about t-SNE if you use it in the future. You may have to tweak some parameters to fit the needs of your data.
>
> Blog Post: [How to Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/)



In [9]:
# Find any rows where the embedding is NaN
bills[bills.embedding.isna()]

Unnamed: 0,State,Bill Number,Name,Summary,Bill Progress,Last Action,Action Date,keywords,embedding


In [10]:
# Remove rows where the embedding is NaN
bills = bills.dropna(subset=['embedding'])

UMAP seems to work better for bills than t-SNE, so I've commented out the t-SNE code and used UMAP instead.

In [11]:
# from sklearn.manifold import TSNE
# import numpy as np


# # Convert to a list of lists of floats
# matrix = np.array(bills.embedding.to_list())

# # Create a t-SNE model and transform the data
# tsne = TSNE(n_components=2, perplexity=30, random_state=42, init='random', learning_rate=400)
# vis_dims = tsne.fit_transform(matrix)

# # add the 2D coordinates to the dataframe
# bills = bills\
#     .assign(
#         x = vis_dims[:,0], 
#         y = vis_dims[:,1])


In [12]:
from umap import UMAP

matrix = np.array(bills.embedding.to_list())

# Create a UMAP model and transform the data
umap = UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
vis_dims = umap.fit_transform(matrix)

# Add the 2D coordinates to the dataframe
bills = bills\
    .assign(
        x = vis_dims[:,0], 
        y = vis_dims[:,1])

  warn(f"n_jobs value {self.n_jobs} overridden to 1 by setting random_state. Use no seed for parallelism.")


In [13]:
# Write the data to a CSV file
bills.to_csv('../bills-with-embeddings.csv', index=False)

# Display the bills
bills.head()

Unnamed: 0,State,Bill Number,Name,Summary,Bill Progress,Last Action,Action Date,keywords,embedding,x,y
691,MD,SB218,Physicians and Allied Health Professions - Reo...,AN ACT concerning Physicians and Allied Health...,In Committee,Senate Finance Hearing (13:00:00 1/30/2024 ),01/30/2024,"[Health Occupations Section, Allied Health Pro...","[0.0037022163160145283, 0.024315113201737404, ...",9.65366,5.665043
769,MD,SB306,Chesapeake and Atlantic Coastal Bays Critical ...,AN ACT concerning Chesapeake and Atlantic Coas...,In Committee,"Senate Education, Energy, and the Environment ...",01/31/2024,"[Bays Critical Area, Coastal Bays Critical, Cr...","[0.026499615982174873, 0.048990171402692795, 0...",9.374174,7.133152
205,MD,HB233,Chesapeake and Atlantic Coastal Bays Critical ...,AN ACT concerning Chesapeake and Atlantic Coas...,In Committee,House Environment and Transportation Hearing (...,02/07/2024,"[Bays Critical Area, Coastal Bays Critical, Cr...","[0.02629251778125763, 0.0483873076736927, 0.05...",9.374231,7.060091
238,MD,HB273,Real Property - Regulation of Common Ownership...,AN ACT concerning Real Property - Regulation o...,In Committee,House Environment and Transportation Hearing (...,02/06/2024,"[Common Ownership Community, Program Evaluatio...","[0.03730740770697594, 0.057970840483903885, 0....",9.782133,5.881553
767,MD,SB304,Natural Resources - State Boat Act - Alterations,AN ACT concerning Natural Resources - State Bo...,In Committee,"Senate Education, Energy, and the Environment ...",01/31/2024,"[State Boat Act, Natural Resources Section, St...","[0.057250022888183594, 0.09177238494157791, 0....",9.58744,7.679071
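
One thing to watch when reading `bills-with-embeddings.csv` back in: CSV has no list type, so the `keywords` and `embedding` cells come back as strings that merely look like lists. A minimal sketch of the round trip using only the standard library, assuming the lists were written as their Python text representation; the row is made up, and `ast.literal_eval` does the parsing:

```python
import ast
import csv
import io

# Write one made-up row, storing each list as its text representation
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Bill Number", "keywords", "embedding"])
writer.writerow(["XX1", str(["State fruit", "persimmon"]), str([0.25, -0.5])])

# Read it back: every cell is a plain string at this point
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["embedding"]).__name__)  # str

# Parse the list-like strings back into real Python lists
keywords = ast.literal_eval(row["keywords"])
embedding = ast.literal_eval(row["embedding"])
print(keywords)   # ['State fruit', 'persimmon']
print(embedding)  # [0.25, -0.5]
```

With pandas, the same parsing can probably be done at load time by passing `converters={"embedding": ast.literal_eval, "keywords": ast.literal_eval}` to `pd.read_csv`.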
