# Comparing 1.2 million bills to thousands of pieces of model legislation

In our mission to reproduce [this piece on model legislation](https://www.usatoday.com/pages/interactives/asbestos-sharia-law-model-bills-lobbyists-special-interests-influence-state-laws/), we need to find all examples of "cut and paste" legislation in our database.

Our previous approach searched for one piece of model legislation at a time. This time we'll process all of them in one batch.


### Prep work: Downloading necessary files
Before we get started, we need to download all of the data we'll be using.
* **alec-model-policies.csv:** ALEC model legislation, one row per model policy with its title, URL, and text


In [None]:
# Make data directory if it doesn't exist
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/azcentral-text-reuse-model-legislation/data/alec-model-policies.csv -P data

In [9]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import pysolr
import requests
from sqlalchemy import create_engine
import tqdm

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_colwidth", 1000)

## Read in model bills

In [2]:
model_df = pd.read_csv("data/alec-model-policies.csv")
model_df = model_df.rename(columns={'text': 'content'})
model_df.head()

Unnamed: 0,title,url,content
0,Resolution Supporting Congressional Approval of the United States-Mexico-Canada Agreement (USMCA),https://www.alec.org/model-policy/resolution-supporting-congressional-approval-of-the-united-states-mexico-canada-agreement-usmca/,"\n\nDraft\nResolution Supporting Congressional Approval of the United States-Mexico-Canada Agreement (USMCA)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWhereas, the imposition of artificial barriers to free and open trade are harmful to American economic interests; and\nWhereas, together, the United States, Canada and Mexico promote a shared belief in freedom, representative democracy and market principles as recognized in the U.S. Constitution; and\nWhereas, a longstanding, close tri-lateral relationship, codified in the North American Free Trade Agreement (NAFTA), has existed between the United States, Canada, and Mexico for more than 25 years and has proven economically, culturally and strategically important for all parties and this relationship will continue with ratification of USMCA; and\nWhereas, trade with Canada and Mexico supports nearly 12 million American jobs, and nearly 5 million of those jobs are supported by increased trade generated by NAFTA and these benefits will co..."
1,Resolution Supporting the Intellectual Property (IP) Provisions in the United States-Mexico-Canada Agreement (USMCA),https://www.alec.org/model-policy/draft-resolution-supporting-the-intellectual-property-ip-provisions-in-the-united-states-mexico-canada-agreement-usmca/,"\n\nDraft\nResolution Supporting the Intellectual Property (IP) Provisions in the United States-Mexico-Canada Agreement (USMCA)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWhereas, the American Legislative Exchange Council (ALEC) policy on free trade acknowledges that, “the imposition of artificial barriers to free and open trade…are deterrents to American economic interests;” and\nWhereas, the United States, Canada and Mexico share a belief in freedom, representative democracy and market principles as recognized in the U.S. Constitution; and\nWhereas, trade among our North American trading partners is made up predominantly of intellectual property (IP)-intensive goods and services that employ millions of Americans in high paying jobs and generate billions of dollars in economic output; and\nWhereas, many of the IP-intensive goods, services and exchanges through which trade is facilitated in the NAFTA bloc did not exist when the agreement was drafted and this situation has resulted in u..."
2,Victims of Communism Memorial Day Resolution,https://www.alec.org/model-policy/draft-victims-of-communism-memorial-day-resolution/,"\n\nDraft\nVictims of Communism Memorial Day Resolution\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nModel Policy\nWHEREAS, the year 2017 marked 100 years since the Bolshevik Revolution in Russia resulting in the world’s first communist regime under Vladimir Lenin, which led to decades of oppression and violence under communist regimes throughout the world; and\nWHEREAS, based on the philosophy of Karl Marx, communism has proven incompatible with the ideals of liberty, prosperity, and dignity of human life and has given rise to such infamous totalitarian dictators as Joseph Stalin, Mao Zedong, Ho Chi Minh, Pol Pot, Nicolae Ceaușescu, the Castro brothers, and the Kim dynasty; and\nWHEREAS, President Donald Trump declared November 7, 2017 a National Day for the Victims of Communism, condemning communism as a political philosophy “incompatible with liberty, prosperity, and the dignity of human life;” and\nWHEREAS, the bipartisan U.S. Congressional Caucus for the Victims of Communism stated ..."
3,Resolution in Support of the Taiwan Travel Act,https://www.alec.org/model-policy/draft-resolution-in-support-of-the-taiwan-travel-act/,"\n\nDraft\nResolution in Support of the Taiwan Travel Act\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nModel Policy\nWhereas, a longstanding, close bilateral relationship, codified in the Taiwan Relations Act, has existed between the United States and Taiwan and has proven economically, culturally and strategically important to both; and\nWhereas, Taiwan is a robust democracy, significant American trading partner and U.S. ally; and\nWhereas, together, Taiwan and the United States promote a shared belief in freedom, democracy and free market principles; and\nWhereas, Taiwan has consistently ranked among the top 12 U.S. trading partners for more than two decades; and\nWhereas, Taiwan serves as a free market, democratic beacon and protector of the rules-based international order in the region.\nTherefore be it resolved, that ALEC applauds the adoption of the Taiwan Travel Act which will encourage the high-level official to official exchanges facilitated by the Act.\nBe it further resolved, ..."
4,Draft Resolution Urging the Presidential Administration and Congress to Support Continued U.S. Participation in the U.S.-Korea Free Trade Agreement (KORUS FTA),https://www.alec.org/model-policy/draft-resolution-urging-the-presidential-administration-and-congress-to-support-continued-u-s-participation-in-the-u-s-korea-free-trade-agreement-korus-fta/,"\n\nDraft\nDraft Resolution Urging the Presidential Administration and Congress to Support Continued U.S. Participation in the U.S.-Korea Free Trade Agreement (KORUS FTA)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWHEREAS, the American Legislative Exchange Council (ALEC) policy on free trade acknowledges that “the imposition of artificial barriers to free and open trade…are deterrents to American economic interests;” and\nWHEREAS, KORUS FTA was entered into force on March 15, 2012; and\nWHEREAS, KORUS FTA has been the largest U.S. FTA in more than 16 years and is the highest standard trade framework the U.S. currently has in force; and\nWHEREAS, retaining the KORUS FTA at this time would send a strong signal to U.S. trading partners that America’s historic commitment to free trade and economic liberalization remains strong; and\nWHEREAS, the Republic of Korea is the 15th largest economy in the world; and\nWHEREAS, the Republic of Korea is the United States’ seventh largest export marke..."


## Find matches

In [7]:
SOLR_RESULTS = 500

solr = pysolr.Solr('http://localhost:8983/solr/legislation', always_commit=True)
engine = create_engine('postgresql://localhost:5432/legislation')

def find_matches(target):
    # If there are leftovers from a previous match search, remove them
    solr.delete(q='bill_id:0')
    # Temporarily index the model legislation so we can run a MoreLikeThis (MLT) search on it
    solr.add([{ 'content': target['content'], 'bill_id': 0 }])

    # Ask Solr which bills are most similar to the one we just added
    response = requests.get(f'http://localhost:8983/solr/legislation/mlt?q=bill_id:0&rows={SOLR_RESULTS}')
    data = response.json()

    # Extract bill ids, pass to postgres database
    bill_ids = [result['bill_id'] for result in data['response']['docs']]
    query = "select * from bills where bill_id = ANY(ARRAY{})".format(bill_ids)
    matches_df = pd.read_sql_query(query, engine)

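    # Build a vocabulary of the model bill's six-word phrases,
    # then flag which of those phrases appear in each search result
    vectorizer = CountVectorizer(binary=True, ngram_range=(6,6))
    vectorizer.fit([target['content']])
    matrix = vectorizer.transform(matches_df.content)

    # Count how many of the model bill's six-word phrases each result contains
    sums = matrix.sum(axis=1)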

    # Remove the model legislation from the index now that we're done with it
    solr.delete(q='bill_id:0')

    return pd.DataFrame({
        'matches': np.squeeze(np.asarray(sums)),
        'code': matches_df.state_code + "-" + matches_df.basename,
        'matched_with': target['title']
    })
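
To make the counting step in `find_matches` concrete, here is a minimal pure-Python sketch of the same idea: split both texts into lowercase words, build every six-word sequence, and count how many of the model bill's sequences also appear in the candidate bill. The function names and the toy texts are made up for illustration, and the token pattern is assumed to match scikit-learn's default, so treat it as an approximation of what `CountVectorizer(binary=True, ngram_range=(6,6))` computes rather than a replacement for it.

```python
import re

# Assumed to mirror CountVectorizer's default tokenizer:
# lowercase words of two or more characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(text, n=6):
    """Return the set of distinct n-word sequences in text."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shared_ngram_count(model_text, bill_text, n=6):
    """Count distinct n-grams from the model text that also appear in the bill."""
    return len(word_ngrams(model_text, n) & word_ngrams(bill_text, n))

# Toy texts, invented for this example
model_text = "the imposition of artificial barriers to free and open trade is harmful"
copied_bill = "Whereas, the imposition of artificial barriers to free and open trade hurts us"
unrelated_bill = "An act relating to the licensing of barbers and cosmetologists in this state"

print(shared_ngram_count(model_text, copied_bill))     # 5
print(shared_ngram_count(model_text, unrelated_bill))  # 0
```

Because each phrase is counted at most once (that is what `binary=True` does in the vectorizer), a bill that repeats one copied sentence many times does not score higher than a bill that copies it once.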

In [12]:
# We can use iterrows because the speed of the loop itself is really not that important

results = []
# Only process the first 20 model bills for now
model_df = model_df.head(20)
for index, row in tqdm.tqdm_notebook(model_df.iterrows(), total=model_df.shape[0]):
    result = find_matches(row)
    results.append(result)
# Combine the per-model-bill results into a single dataframe
df = pd.concat(results)


SolrError: Connection to server 'http://localhost:8983/solr/legislation/update/?commit=true' timed out: HTTPConnectionPool(host='localhost', port=8983): Read timed out. (read timeout=60)
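
The run above stopped when a Solr update took longer than the 60-second read timeout. One possible workaround, sketched below, is to retry a failed call a few times with a pause in between. The `with_retries` helper, its parameters, and the pause lengths are hypothetical and not part of the original notebook. pysolr's `Solr` constructor also accepts a `timeout` argument, so raising that value is another option to try.

```python
import time

def with_retries(func, attempts=3, delay=5.0, exceptions=(Exception,)):
    """Call func(); if it raises one of `exceptions`, wait and try again.

    Gives up and re-raises the last error after `attempts` tries.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of retries, surface the last error
            time.sleep(delay * attempt)  # wait a little longer each time

# Hypothetical usage inside the loop over model bills:
#     result = with_retries(lambda: find_matches(row),
#                           exceptions=(pysolr.SolrError,))
```

Retrying only helps if the slowdown is temporary. If every commit against the full index is slow, a longer timeout or fewer commits is more likely to help than repeated attempts.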

In [None]:
df.sort_values(by='matches', ascending=False).head(100)