## Topic Modeling with Bills

Use topic modeling to group bills into coherent topics within large legislation subject areas
(education, health, etc.).

For example, about one-eighth of the ~20,000 bills introduced in the Tennessee General Assembly during the
107th-110th legislative sessions pertain to education. Partition these into more specific topics
like higher education, school funding, and school choice to serve as additional features for
search and recommendation.

Steps:

* Use topic modeling to partition bills into topics
* Evaluate candidate models with coherence and select a well-performing topic model
* Inspect keywords, coherence, and the most typical bills to label each topic
* Inspect bills that don't fit well into any topic and label them manually

In [1]:
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [2]:
import pandas as pd

legacy = pd.read_json('../data/legacy.json')
current = pd.read_json('../data/current.json')

bills = pd.concat([legacy, current], ignore_index=True, sort=False)

bills[['title', 'text', 'subjects']].head()

Unnamed: 0,title,text,subjects
0,"Taxes, Sales - As introduced, extends existing...","An act to amend Tennessee Code Annotated, Titl...","[Business and Consumers, Budget, Spending, and..."
1,"Highway Signs - As enacted, designates ""Benjam...",An act to name a bridge on State Route 386 (Vi...,[Transportation]
2,"Taxes, Exemption and Credits - As introduced, ...","An act to amend Tennessee Code Annotated, Titl...","[Budget, Spending, and Taxes]"
3,"Veterans' Affairs, Dept. of - As enacted, rena...","An act to amend Tennessee Code Annotated, Sect...","[Military, Legislative Affairs]"
4,"Workers Compensation - As introduced, increase...","An act to amend Tennessee Code Annotated, Titl...",[Labor and Employment]


Load the bill titles and text, which have been [preprocessed](https://github.com/alexander-poon/represent/blob/master/py/preprocess.py)
by word tokenizing, lemmatizing, and removing stop words:

In [3]:
tokens = pd.read_json('../data/tokens.json', typ='series')

tokens.head()

0    [taxis, sale, extend, exist, allow, person, re...
1    [highway, sign, designate, Benjamin, Pat, Hart...
2    [taxis, Exemption, Credits, commissioner, stud...
3    [Veterans, Affairs, Dept, rename, service, med...
4    [Workers, Compensation, increase, time, secret...
dtype: object
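
The linked preprocessing script is not reproduced here. As a rough illustration of what tokenizing and stop-word removal do, here is a minimal standard-library sketch. The names `STOP_WORDS` and `simple_tokens` are hypothetical, the stop-word list is deliberately tiny, and lemmatization is left out because it needs an NLP library.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "as", "for", "in", "is"}

def simple_tokens(text):
    """Split text into alphabetic word tokens and drop stop words."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if w.lower() not in STOP_WORDS]

print(simple_tokens("An act to amend Tennessee Code Annotated, Title 49"))
# ['act', 'amend', 'Tennessee', 'Code', 'Annotated', 'Title']
```

Unlike the tokens shown above, this sketch keeps surface forms as they are (no lemmas), so it should be read as an approximation of the idea rather than of the actual pipeline.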

Pick a subject on which to run the topic model (Education is the most common):

In [4]:
# Flatten the per-bill subject lists and count how often each subject appears
print(pd.Series([sub for subs in bills['subjects'] for sub in subs]).value_counts())

subject = 'Education'

Education                               3093
Legal Issues                            2776
Budget, Spending, and Taxes             1881
Health                                  1832
Municipal and County Issues             1474
Transportation                          1366
Crime                                   1138
Drugs                                   1017
Government Reform                        959
Legislative Affairs                      935
Federal, State, and Local Relations      876
Labor and Employment                     822
Business and Consumers                   717
Family and Children Issues               641
Housing and Property                     641
Commerce                                 591
Campaign Finance and Election Issues     485
Insurance                                481
State Agencies                           474
Judiciary                                468
Environmental                            396
Guns                                     336
Other     

Extract the bills and preprocessed tokens corresponding to the selected subject:

In [5]:
# Exact membership test; a joined-string regex match would also catch subjects that merely contain the name
index = bills['subjects'].apply(lambda subs: subject in subs)

tokens = tokens[index].reset_index(drop=True)
bills = bills[index].reset_index(drop=True)

Use [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA) as the topic model.

LDA estimates, for each document, a probability distribution over topics and, for each topic, a probability distribution over words.
We can then use the words most strongly associated with each topic to interpret the topics and label bills with their associated topics.
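
To make the two kinds of distributions concrete, here is a small sketch of the generative story LDA assumes: each word in a document is produced by first drawing a topic from the document's topic distribution, then drawing a word from that topic's word distribution. The topic names and probabilities below are hand-picked assumptions for illustration, not fitted values.

```python
import random

# Toy, hand-picked distributions (assumptions for illustration, not fitted values).
topic_words = {
    "financial_aid": {"scholarship": 0.5, "student": 0.3, "grant": 0.2},
    "governance": {"board": 0.5, "member": 0.3, "appoint": 0.2},
}
doc_topics = {"financial_aid": 0.8, "governance": 0.2}

def generate_document(doc_topics, topic_words, length, rng):
    """Draw each word by first sampling a topic, then a word from that topic."""
    words = []
    for _ in range(length):
        topic = rng.choices(list(doc_topics), weights=list(doc_topics.values()))[0]
        dist = topic_words[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words

rng = random.Random(79)
print(generate_document(doc_topics, topic_words, 8, rng))
```

Fitting LDA runs this story in reverse: given only the observed words, it infers distributions like `doc_topics` and `topic_words` that plausibly generated them.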

In [6]:
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from gensim.models.ldamodel import LdaModel

n_topics = 20

dictionary = Dictionary(tokens)
corpus = [dictionary.doc2bow(b) for b in tokens]

lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=n_topics,
    passes=50,
    alpha='auto',  # Low alpha => documents appear in fewer topics; high alpha => documents distributed across more topics
    eta='auto',    # Low eta => topics will have fewer terms; high eta => topics will have more terms
    random_state=79
)
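
The comments on `alpha` and `eta` describe how a Dirichlet prior's concentration controls sparsity. The sketch below illustrates this for a symmetric prior by sampling topic mixtures with the standard library and measuring how much of each mixture goes to its largest topic. It is an illustration only: with `alpha='auto'`, gensim learns an asymmetric prior from the data, and the helper names here are hypothetical.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one symmetric Dirichlet(alpha) sample over k topics via normalized gammas."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=20, n=500, seed=79):
    """Average share of the single largest topic across n sampled documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(mean_largest_share(alpha), 3))
```

Smaller values of `alpha` should give a larger average top share (documents concentrated in few topics), while larger values spread each document more evenly across all 20 topics.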

Evaluate topic model using [coherence](http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf):

In [7]:
c = CoherenceModel(model=lda_model, texts=tokens, dictionary=dictionary, coherence='c_v')

print('Model Coherence:', c.get_coherence())

Model Coherence: 0.46699194637344477
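
Coherence is most useful for comparing candidate models, for example models with different numbers of topics. Here is one possible sketch of that selection loop. `select_by_coherence` is a hypothetical helper that takes the fitting and scoring steps as callables, so the loop itself does not depend on gensim; the commented wiring reuses the objects defined above, and the candidate values are assumptions.

```python
def select_by_coherence(candidates, fit, score):
    """Fit one model per candidate setting and return the best (setting, model, score)."""
    results = []
    for k in candidates:
        model = fit(k)
        results.append((k, model, score(model)))
    return max(results, key=lambda r: r[2])

# Possible wiring with the objects defined above (not run here):
# best_k, best_model, best_score = select_by_coherence(
#     candidates=[10, 15, 20, 25],
#     fit=lambda k: LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=79),
#     score=lambda m: CoherenceModel(
#         model=m, texts=tokens, dictionary=dictionary, coherence='c_v'
#     ).get_coherence(),
# )
```

Fitting several models with many passes can be slow, so it may be worth starting with a coarse grid of candidates and fewer passes before refining.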


What is the distribution of topics for each bill?

In [8]:
def get_bill_topics(model):
    # Each row is a list of (topic, proportion) pairs; topics that gensim omits for a bill become NaN.
    # Iterating once avoids re-running inference for every row.
    bill_topics = pd.DataFrame([dict(row) for row in model[corpus]]) \
        .reindex(columns=range(model.num_topics))

    # Extract integer index of dominant topic
    dominant_topic = bill_topics.idxmax(axis=1).rename('dominant_topic')

    # Extract proportion of the document attributed to the dominant topic
    max_perc = bill_topics.max(axis=1, skipna=True).rename('max_perc')

    return pd.concat([bills[['session', 'bill_id', 'title', 'text']], dominant_topic, max_perc, bill_topics], axis=1)

bill_topics = get_bill_topics(lda_model)

bill_topics.head(2)

Unnamed: 0,session,bill_id,title,text,dominant_topic,max_perc,0,1,2,3,...,10,11,12,13,14,15,16,17,18,19
0,107,HB 1006,"Education - As introduced, requires the commis...","An act to amend Tennessee Code Annotated, Titl...",10,0.934423,,,,,...,0.934423,,,,,,,,,
1,107,HB 1027,"Education, Higher - As introduced, requires th...","An act to amend Tennessee Code Annotated, Titl...",18,0.37527,,,,,...,0.152441,,,0.09553,0.348938,,,,0.37527,
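
The empty cells in this output are topics that gensim left out for a given bill because their proportion fell below the model's `minimum_probability` cutoff. The dominant topic is simply the largest of the reported proportions, which the following standalone sketch shows using the values from the second bill above. The function name is hypothetical and separate from `get_bill_topics`.

```python
def dominant_topic(pairs):
    """Return (topic_id, proportion) for the largest proportion in a sparse topic list."""
    return max(pairs, key=lambda pair: pair[1])

# Sparse (topic, proportion) pairs mirroring the second bill above.
row = [(10, 0.152441), (13, 0.09553), (14, 0.348938), (18, 0.37527)]
print(dominant_topic(row))  # (18, 0.37527)
```

Note that the runner-up (topic 14) is close behind here, which is why `max_perc` is kept alongside the dominant topic: a low value signals that a bill does not fall cleanly under a single topic.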


(Optional) Write out topic model and output:

In [None]:
# lda_model.save('models/topic_model_' + subject.lower())

# bill_topics.to_csv('../data/' + subject.lower() + '_topics.csv', index=False)

Visualize the topic model using [LDAvis](https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf).

The scatterplot in the left panel indicates how distinct the topics are from one another. The size of each point
reflects how prevalent the topic is in the corpus overall.

The right panel indicates the most representative terms for each topic, as well as their frequency in a given topic
relative to their frequency in the corpus overall. Selecting lower values of $\lambda$ extracts terms that are more
uniquely associated with each topic.
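
The role of $\lambda$ follows from the relevance score defined in the LDAvis paper: $\lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}$. The sketch below computes it for two terms with hypothetical probabilities to show how the ranking flips as $\lambda$ decreases. The numbers are made up for illustration and are not taken from the fitted model.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """LDAvis-style relevance: lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(p_word_given_topic / p_word)

# Hypothetical (p(w|t), p(w)) pairs: "school" is frequent everywhere, "HOPE" is rare but topic-specific.
terms = {"school": (0.08, 0.06), "HOPE": (0.03, 0.004)}

for lam in (1.0, 0.2):
    ranked = sorted(terms, key=lambda t: relevance(*terms[t], lam), reverse=True)
    print(lam, ranked)
# 1.0 ['school', 'HOPE']
# 0.2 ['HOPE', 'school']
```

At $\lambda = 1$ terms are ranked purely by their probability within the topic, so common words dominate; at lower values the lift term rewards words that are unusually concentrated in the topic.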

In [9]:
from pyLDAvis import enable_notebook
from pyLDAvis.gensim import prepare  # Newer pyLDAvis releases name this module pyLDAvis.gensim_models

enable_notebook()

prepare(topic_model=lda_model, corpus=corpus, dictionary=dictionary, sort_topics=False)



Next, assign topic labels by looking at keywords and representative bills associated with each topic, as well as 
coherence per topic to judge whether a topic should be assigned a label:

In [10]:
pd.options.display.max_colwidth = 1000

pd.DataFrame({
    'Coherence': c.get_coherence_per_topic(),
    'Number of Bills': bill_topics['dominant_topic'].value_counts(),
    'Keywords': [', '.join([i[0] for i in lda_model.show_topic(j)]) for j in range(n_topics)],
    'Sample Title': bill_topics.loc[bill_topics.groupby('dominant_topic')['max_perc'].idxmax()].set_index('dominant_topic')['title']
}) \
    .sort_values('Coherence', ascending=False)

Unnamed: 0,Coherence,Number of Bills,Keywords,Sample Title
7,0.763054,50,"bond, District, note, Board, authorize, Acts, tax, provide, issue, time","School Districts, Special - As introduced, pursuant to the request of the Franklin special school district of Williamson County, permits the district to issue bonds or notes in an amount not to exceed $26.5 million and to issue bond anticipation notes."
12,0.599163,107,"school, board, county, education, director, system, superintendent, elect, office, election","Education - As introduced, enacts the ""Local School District Empowerment Act,"" which provides for reestablishment of elected office of school superintendent for county or city school systems upon two-thirds vote of county or city governing body and approval in an election on the question by the voters in 10 LEAs as a pilot program to allow the department to study the relevant procedures of reestablishing the office; provides for qualifications of candidates; adjusts duties of the local board of education in county or city school systems electing superintendents."
19,0.590463,275,"student, scholarship, year, institution, semester, program, receive, HOPE, time, grant","Lottery, Scholarships and Programs - As introduced, sets awards from net lottery proceeds for certain postsecondary scholarships and grants at the amount the student initially received or at the amount awarded for initial recipients in the current semester of enrollment, whichever is greater."
10,0.545988,366,"education, report, committee, representative, house, senate, department, year, study, commissioner","Education - As introduced, requires the director of the office of legislative budget analysis to provide the revised BEP funding formula to the speaker of the senate, the speaker of the house of representatives, and the education committees of the senate and the house of representatives, if the commissioner fails to provide the revised BEP funding formula for the ensuing fiscal year by January 1."
14,0.541283,85,"member, board, term, serve, appoint, student, year, University, trustee, appointment","University of Tennessee - As introduced, reconstitutes the board of trustees of the University of Tennessee system."
17,0.52503,243,"student, test, assessment, school, grade, score, year, education, state, examination","Education, Dept. of - As introduced, requires the department to release certain percentages of test questions and answers from the Tennessee comprehensive assessment program (TCAP) tests and end-of-course examinations to LEAs and public schools."
15,0.502286,103,"college, state, education, technology, university, community, program, fund, course, board","Education, Higher - As enacted, revises various provisions regarding cooperative innovative programs."
6,0.477302,93,"school, student, participate, department, scholarship, program, state, year, parent, provide","School Vouchers - As introduced, enacts the ""Tennessee Choice & Opportunity Scholarship Act."""
1,0.469865,185,"teacher, school, license, board, education, evaluation, year, expectation, Personnel, Principals","Teachers, Principals and School Personnel - As enacted, revises compensation provisions and other provisions regarding substitute teachers."
5,0.424407,184,"education, LEA, department, program, provide, school, training, include, lea, develop","Education - As enacted, allows LEAs to teach the history of traditional winter celebrations; allows students and staff to use traditional greetings of such celebrations; and allows LEAs to display winter celebration scenes or symbols under certain conditions."


Look at most representative bills per topic:

In [11]:
def get_most_typical(topic):
    return bill_topics[bill_topics['dominant_topic'] == topic] \
        .sort_values('max_perc', ascending=False) \
        .loc[:, ['title', 'text', 'max_perc']] \
        .head(10)

get_most_typical(14)

Unnamed: 0,title,text,max_perc
1466,"University of Tennessee - As introduced, reconstitutes the board of trustees of the University of Tennessee system.","An act to amend Tennessee Code Annotated, Section 49201; Section 49202; Section 49203; Section 49204; Section 49205 and Section 49206, relative to the board of trustees of the University of Tennessee system. Tennessee Code Annotated, Section 49202, is amended by deleting the section in its entirety and substituting instead: The board of trustees of the University of Tennessee shall consist of five ex officio members and nineteen additional members. The governor, speaker of the senate, and the speaker of the house of representatives shall each appoint five members to the board of trustees. The governor and the speakers shall each appoint one member from each grand division of the state who shall represent the grand division from which the member is appointed. The governor and the speakers shall each appoint two additional members who may be from any part of the state. Two additional members shall be members of the faculty of the University of Tennessee who served as faculty senate p...",0.996284
1808,"University of Tennessee - As introduced, reconstitutes the board of trustees of the University of Tennessee system.","An act to amend Tennessee Code Annotated, Section 49201; Section 49202; Section 49203; Section 49204; Section 49205 and Section 49206, relative to the board of trustees of the University of Tennessee system. Tennessee Code Annotated, Section 49202, is amended by deleting the section in its entirety and substituting instead: The board of trustees of the University of Tennessee shall consist of five ex officio members and nineteen additional members. The governor, speaker of the senate, and the speaker of the house of representatives shall each appoint five members to the board of trustees. The governor and the speakers shall each appoint one member from each grand division of the state who shall represent the grand division from which the member is appointed. The governor and the speakers shall each appoint two additional members who may be from any part of the state. Two additional members shall be members of the faculty of the University of Tennessee who served as faculty senate p...",0.996284
2571,"Board of Regents - As introduced, changes the status of the student representative on state university boards to a voting member rather than a nonvoting member; establishes a selection process for the student representative.","An act to amend Tennessee Code Annotated, Section 49201, relative to state university board composition. WHEREAS, the state university boards vote on initiatives and policies that directly impact the respective university and student body; and WHEREAS, one of the board members is designated to a student representative who is appointed in order to present a student perspective on how the board's actions will affect the university and its students; and WHEREAS, while the student representative is allowed to discuss the board's initiatives, the student representative may not vote on these issues; and WHEREAS, allowing this student representative to vote will ensure adequate representation of collectively over 75,000 students at the 6 four-year state universities that will begin operating under their respective boards, rather than the Tennessee Board of Regents, beginning in the spring of 2017; now, therefore, Tennessee Code Annotated, Section 49201, is amended by deleting the subdivis...",0.985235
2214,"Board of Regents - As introduced, changes the status of the student representative on state university boards to a voting member rather than a nonvoting member; establishes a selection process for the student representative.","An act to amend Tennessee Code Annotated, Section 49201, relative to state university board composition. WHEREAS, the state university boards vote on initiatives and policies that directly impact the respective university and student body; and WHEREAS, one of the board members is designated to a student representative who is appointed in order to present a student perspective on how the board's actions will affect the university and its students; and WHEREAS, while the student representative is allowed to discuss the board's initiatives, the student representative may not vote on these issues; and WHEREAS, allowing this student representative to vote will ensure adequate representation of collectively over 75,000 students at the 6 four-year state universities that will begin operating under their respective boards, rather than the Tennessee Board of Regents, beginning in the spring of 2017; now, therefore, Tennessee Code Annotated, Section 49201, is amended by deleting the subdivis...",0.985221
1015,"Board of Regents - As introduced, removes a position on the board of regents reserved by statute for a person now deceased.","An act to amend Tennessee Code Annotated, Title 49, relative to higher education. Tennessee Code Annotated, Section 49201, is amended by deleting the subsection in its entirety. Tennessee Code Annotated, Section 49201, is amended by deleting the language ""nineteen members"" and by substituting instead the language ""eighteen members"".",0.940618
732,"Board of Regents - As introduced, removes a position on the board of regents reserved by statute for a person now deceased.","An act to amend Tennessee Code Annotated, Title 49, relative to higher education. Tennessee Code Annotated, Section 49201, is amended by deleting the subsection in its entirety. Tennessee Code Annotated, Section 49201, is amended by deleting the language ""nineteen members"" and by substituting instead the language ""eighteen members"".",0.940618
752,"Board of Regents - As introduced, removes a position on the board of regents reserved by statute for a person now deceased.","An act to amend Tennessee Code Annotated, Title 49, relative to higher education. Tennessee Code Annotated, Section 49201, is amended by deleting the subsection in its entirety. Tennessee Code Annotated, Section 49201, is amended by deleting the language ""nineteen members"" and by substituting instead the language ""eighteen members"".",0.940618
682,"Education, State Board of - As introduced, requires members of the board to be elected by congressional district for four-year term as current terms expire and vacancies arise.","An act to amend Tennessee Code Annotated, Title 49, relative to education. Tennessee Code Annotated, Section 49301, is amended by deleting it in its entirety and substituting instead the following: The state board of education shall be composed of nine elected members, one appointed public high school student member and one ex officio member. One elected member shall be elected from each congressional district. To be eligible to run for election and to serve, the member shall reside within the congressional district from which the member is elected as such district is apportioned at the time of the member's election. No incumbent member shall be removed from the incumbent member's seat prior to the expiration of the incumbent member's current term as a result of changes in congressional districts occasioned by reapportionment. The members of the board shall be elected for a term of four years, and members may succeed themselves for one additional term only. As the terms of the nine...",0.854202
1038,"Education, State Board of - As introduced, requires members of the board to be elected by congressional district for four-year term as current terms expire and vacancies arise.","An act to amend Tennessee Code Annotated, Title 49, relative to education. Tennessee Code Annotated, Section 49301, is amended by deleting it in its entirety and substituting instead the following: The state board of education shall be composed of nine elected members, one appointed public high school student member and one ex officio member. One elected member shall be elected from each congressional district. To be eligible to run for election and to serve, the member shall reside within the congressional district from which the member is elected as such district is apportioned at the time of the member's election. No incumbent member shall be removed from the incumbent member's seat prior to the expiration of the incumbent member's current term as a result of changes in congressional districts occasioned by reapportionment. The members of the board shall be elected for a term of four years, and members may succeed themselves for one additional term only. As the terms of the nine...",0.854198
1532,"Education, State Board of - As introduced, modifies the composition of the members of the board from nine appointed members to nine members elected by the people during the regular November elections; authorizes the governor to fill any vacancies on the board, subject to confirmation by the senate.","An act to amend Tennessee Code Annotated, Title 2 and Title 49, relative to the election of members of the state board of education. Tennessee Code Annotated, Section 49301, is amended by deleting the subsection and substituting instead the following: The state board of education shall be composed of nine elected members, one appointed public high school student member, and one ex officio member. One elected member shall be elected from each congressional district. To be eligible to run for election and to serve, a person shall reside within the congressional district from which the member is to be elected as the district is apportioned at the time of the election. No incumbent member shall be removed from the incumbent member's seat prior to the expiration of the incumbent member's current term as a result of changes in congressional districts occasioned by reapportionment. The terms for all members shall begin January 1, 2017. The members of the board shall be elected for a term ...",0.837828


From these, assign baseline labels where a bill falls substantially under one topic (here, a dominant-topic
proportion above 0.4). Lowering this threshold produces more mislabeled bills but leaves fewer bills to review manually.

In [12]:
bill_topics['label'] = ''

# Minimum dominant-topic proportion required to assign a baseline label
threshold = 0.4

topic_labels = {
    7: 'K-12 Funding; Special School Districts',
    12: 'K-12 Governance',
    19: 'Postsecondary Financial Aid',
    10: 'Reports; Studies',
    14: 'Higher Education Governance',
    # etc.
}

for topic, label in topic_labels.items():
    bill_topics.loc[(bill_topics['dominant_topic'] == topic) & (bill_topics['max_perc'] > threshold), 'label'] = label
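
To get a feel for the trade-off in choosing the cutoff, one option is to check what share of bills would receive a baseline label at several candidate values. The sketch below does this on a hypothetical list of dominant-topic proportions; in the notebook, the `max_perc` column would play that role.

```python
def labeled_share(max_percs, threshold):
    """Fraction of bills whose dominant-topic proportion exceeds the threshold."""
    return sum(p > threshold for p in max_percs) / len(max_percs)

# Hypothetical dominant-topic proportions for six bills.
max_percs = [0.93, 0.38, 0.85, 0.41, 0.55, 0.29]

for threshold in (0.3, 0.4, 0.6):
    print(threshold, round(labeled_share(max_percs, threshold), 2))
# 0.3 0.83
# 0.4 0.67
# 0.6 0.33
```

This only measures coverage; how many of the automatically assigned labels are wrong still has to be judged by reading a sample of bills near the cutoff.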

Finally, iterate through bills to:

- Inspect and correct labels
- Assign additional labels
- Remove incorrect labels
- Inspect bills in topics with low coherence

Depending on how much manual work we want to do, we can just look at bills with missing labels; topics with low 
coherence; bills that didn't fit into any topic; or anything up to and including all bills.

In [13]:
def replace_topic(session, bill_id, label):
    # Build the row selector once; it is reused for the existence check, comparison, and assignment
    match = (bill_topics['session'] == session) & (bill_topics['bill_id'] == bill_id)

    if not match.any():
        print(f'{bill_id} does not exist.')
    elif (bill_topics.loc[match, 'label'] == label).all():
        print(f'{bill_id} is already labeled {label}.')
    else:
        bill_topics.loc[match, 'label'] = label

In [14]:
pd.options.display.max_colwidth = 5000

missing_labels = bill_topics[bill_topics['label'] == ''] \
    .sort_values(['session', 'title', 'bill_id']) \
    .iterrows()

In [15]:
next(missing_labels)

(214,
 session    107
 bill_id

In [16]:
replace_topic(107, 'HB 2999', 'K-12 Funding')
# Etc.
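
If reviewing every unlabeled bill is too much work, one way to narrow the queue is to combine the criteria mentioned above: bills with no label, plus bills whose dominant topic has low coherence. The following is a minimal sketch on plain dictionaries. The bill IDs, the `'Programs'` label, and the 0.5 coherence cutoff are hypothetical; the coherence values are copied from the per-topic table above.

```python
def review_queue(bills, topic_coherence, min_coherence=0.5):
    """Return bills that lack a label or whose dominant topic has low coherence."""
    return [
        b for b in bills
        if b["label"] == "" or topic_coherence[b["dominant_topic"]] < min_coherence
    ]

# Hypothetical records; coherence values come from the per-topic table above.
bills_toy = [
    {"bill_id": "HB 1", "dominant_topic": 7, "label": "K-12 Funding; Special School Districts"},
    {"bill_id": "HB 2", "dominant_topic": 5, "label": "Programs"},
    {"bill_id": "HB 3", "dominant_topic": 12, "label": ""},
]
topic_coherence = {7: 0.763054, 5: 0.424407, 12: 0.599163}

print([b["bill_id"] for b in review_queue(bills_toy, topic_coherence)])  # ['HB 2', 'HB 3']
```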