# Using `spaCy` for cosine similarity

Before running this code, install `spaCy` (`pip install spacy`) and download its small English model (`python -m spacy download en_core_web_sm`).

In [18]:
import spacy
import pandas as pd
from spacy.tokens import Doc
from spacy.vocab import Vocab

In [19]:
# Load the pre-defined English model:
nlp = spacy.load('en_core_web_sm')

In [20]:
# Read in a CSV file with a column of text abstracts.
df = pd.read_csv('resources/fedreg.csv')

#### Prepare the text data for processing

In [21]:
df = df[['document_number', 'abstract']]  # Keep only the columns we need
df = df.head(20)  # Trim the dataset down to size, for example purposes
df = df.dropna(how='any')  # Drop any rows with missing data
# Python 3 strings are already Unicode, so no explicit UTF-8 decoding step is needed.
df.head()
# Caution: this overwrites the input file with the trimmed data.
df.to_csv('resources/fedreg.csv', index=False)

In [22]:
# Run each abstract through the spaCy pipeline; each cell in `tokens` holds a parsed `Doc`.
df['tokens'] = df['abstract'].apply(nlp)

In [23]:
# Display the named entities (style='ent') for the abstract at index 1.
spacy.displacy.render(df['tokens'][1], style='ent', jupyter=True)

#### Note that the abstracts at index 4 and index 5 are similar but not identical. We would expect these to have a high cosine similarity score.

In [24]:
print(df['abstract'][0])
print('\n')
print(df['abstract'][3])
print('\n')
print(df['abstract'][4])
print('\n')
print(df['abstract'][5])

The quick brown fox jumps over the lazy dog.


We are extending the expiration date of Endocrine Disorders body system in the Listing of Impairments (listings) in our regulations. We are making no other revisions to the body system in this final rule. This extension ensures that we will continue to have the criteria we need to evaluate impairments in the affected body system at step three of the sequential evaluation processes for initial claims and continuing disability reviews.


The Coast Guard has issued a temporary deviation from the operating schedule that governs the Monmouth County Highway Bridge (alternatively referred to as the ``Sea Bright Bridge'' or the ``S-32 Bridge'') across the Shrewsbury River, mile 4.0, at Sea Bright, New Jersey. This deviation will test a proposed change to the drawbridge operation schedule to determine whether a permanent change to the schedule is warranted. This deviation will allow the bridge to operate under an alternate schedule that seeks to ba

In [25]:
# Assign variable names.
doc0=df['tokens'][0]
doc3=df['tokens'][3]
doc4=df['tokens'][4]
doc5=df['tokens'][5]

In [26]:
# As expected, abstracts 4 and 5 are highly similar.
print(doc4.similarity(doc5)) 

0.9083938093180972
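
The score above is a cosine similarity: the dot product of two vectors divided by the product of their lengths. The following is a minimal pure-Python sketch of that formula, not spaCy's own implementation; the function name `cosine_similarity` and the toy vectors are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical, not taken from the dataset).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Vectors that point the same way score about 1 regardless of their length, and orthogonal vectors score 0, which is why the measure compares direction rather than magnitude.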


In [27]:
# Abstracts 4 and 3 are somewhat similar.
print(doc4.similarity(doc3)) 

0.8385613523655009


In [28]:
# Abstracts 4 and 0 are the least similar of the three pairs.
print(doc4.similarity(doc0)) 

0.7396282776292413
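
Even this unrelated pair scores above 0.7. One likely contributor is how the document vector is built: by default spaCy's `Doc.vector` is an average of the token vectors, and `Doc.similarity` is the cosine of two such averages, so words common to most English text pull every document in a similar direction. In addition, `en_core_web_sm` ships without static word vectors, and spaCy warns that similarity judgements from such a model may be less useful than those from a model with vectors, such as `en_core_web_md`. The sketch below imitates the averaging step in pure Python; the two-dimensional vectors in `toy_vectors` and the helper names are invented for illustration and are not spaCy's data or API.

```python
import math

# Hypothetical 2-d word vectors: the first axis stands for "function word",
# the second for "topic". Real models use hundreds of dimensions.
toy_vectors = {
    "the": [1.0, 0.0],
    "of": [0.9, 0.1],
    "bridge": [0.1, 1.0],
    "fox": [0.1, -1.0],
}

def mean_vector(words):
    """Average the toy vectors of the given words, component by component."""
    vectors = [toy_vectors[w] for w in words]
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Two "documents" that share function words but differ in topic word.
doc_a = mean_vector(["the", "of", "the", "bridge"])
doc_b = mean_vector(["the", "of", "the", "fox"])

topic_only = cosine(toy_vectors["bridge"], toy_vectors["fox"])
averaged = cosine(doc_a, doc_b)
print(topic_only, averaged)
```

In this toy setup the two topic words point in nearly opposite directions, yet the averaged documents still come out fairly similar because the shared function words dominate the mean.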


## Add a new column to the `pandas` DataFrame indicating similarity to a given text

In [36]:
# Deliberately chosen for similarity to a few existing abstracts, especially abstract 5.
test_text='TEST TEST TEST The Coast TEST Guard has TEST issued TEST a temporary deviation TEST from the operating schedule that governs the Loop Parkway Bridge across Long Creek, TEST mile 0.7, and TEST Meadowbrook State Parkway Bridge across TEST Sloop Channel, mile 12.8, both at Hempstead, New York. This deviation is necessary in order to facilitate a motorcycle ride event and allows both bridges to remain in the closed position for two hours. TEST TEST TEST'

In [40]:
# Vectorize the test text and display its similarity to abstract 5.
test_doc = nlp(test_text)
doc5.similarity(test_doc)

0.9601535905458123

In [41]:
# For every abstract in the dataset, compute its similarity to the test document.
df['sim_score']=df['tokens'].apply(lambda x: x.similarity(test_doc))
df.head(3)

Unnamed: 0,document_number,abstract,tokens,sim_score
0,testing12345,The quick brown fox jumps over the lazy dog.,"(The, quick, brown, fox, jumps, over, the, laz...",0.570798
1,2018-10583,We are superseding Airworthiness Directive (AD...,"(We, are, superseding, Airworthiness, Directiv...",0.832701
2,2018-10902,The Commodity Futures Trading Commission (Comm...,"(The, Commodity, Futures, Trading, Commission,...",0.85657


In [47]:
# Show the five most similar abstracts, in descending order of similarity.
pd.set_option('display.max_colwidth', 1000)
df.sort_values(['sim_score'], ascending=False)[['abstract', 'sim_score']].head()

Unnamed: 0,abstract,sim_score
5,"The Coast Guard has issued a temporary deviation from the operating schedule that governs the Loop Parkway Bridge across Long Creek, mile 0.7, and Meadowbrook State Parkway Bridge across Sloop Channel, mile 12.8, both at Hempstead, New York. This deviation is necessary in order to facilitate a motorcycle ride event and allows both bridges to remain in the closed position for two hours.",0.960154
16,"The Coast Guard proposes to modify the Marine Air Terminal, LaGuardia Airport Security Zone. The modification of the security zone would expand the existing security zone boundary north along the Rikers Island Bridge to the intersecting point on the southern tip of Rikers Island then east to the western end of LaGuardia Airport. This expanded security zone is necessary to protect the port, waterfront facilities, and waters of the United States from terrorism, sabotage, or other subversive acts and incidents of a similar nature during visits to New York City by various dignitaries. We invite your comments on this proposed rulemaking.",0.920878
9,"This document corrects technical errors that appeared in the final rule with comment period and interim final rule with comment period published in the Federal Register on November 16, 2017 entitled ``Medicare Program; CY 2018 Updates to the Quality Payment Program; and Quality Payment Program: Extreme and Uncontrollable Circumstance Policy for the Transition Year'' (hereinafter referred to as the ``CY 2018 Quality Payment Program final rule'').",0.916684
6,"The Coast Guard will enforce the temporary safety zone for the 42nd Annual Swim Around Key West, Key West, Florida from 9 a.m. until 6 p.m. on June 16, 2018. Our regulation for Recurring Safety Zones in Captain of the Port Key West Zone identifies the regulated area for this event. This action is necessary to ensure the safety of event participants and spectators. During the enforcement period, no person or vessel may enter, transit through, anchor in, or remain within the regulated area without approval from the Captain of the Port Key West or a designated representative.",0.914215
7,"The Coast Guard will enforce the temporary safety zone for the Annual FKCC Swim Around Key West, in Key West, Florida from 9 a.m. until 6 p.m. on June 30, 2018. Our regulation for Recurring Safety Zones in Captain of the Port Key West Zone identifies the regulated area for this event. This action is necessary to ensure the safety of event participants and spectators. During the enforcement period, no person or vessel may enter, transit through, anchor in, or remain within the regulated area without approval from the Captain of the Port Key West or a designated representative.",0.910621


In [48]:
# And the five least similar abstracts.
pd.set_option('display.max_colwidth', 1000)
df.sort_values(['sim_score'], ascending=False)[['abstract', 'sim_score']].tail()

Unnamed: 0,abstract,sim_score
1,We are superseding Airworthiness Directive (AD) 2017-11-03 for DG Flugzeugbau GmbH Model DG-500MB gliders that are equipped with a Solo 2625 02 engine modified with a fuel injection system following the instructions of Solo Kleinmoteren GmbH Technische Mitteilung 4600-3 and identified as Solo 2625 02i. This AD results from mandatory continuing airworthiness information (MCAI) issued by an aviation authority of another country to identify and correct an unsafe condition on an aviation product. The MCAI describes the unsafe condition as failure of the connecting rod bearing resulting from too much load on the rod bearings from the engine control unit. This AD adds a model to the applicability. We are issuing this AD to require actions to address the unsafe condition on these products.,0.832701
15,"The Coast Guard proposes to establish a temporary safety zone for certain waters of Murrells Inlet, SC. This action is necessary to provide for the safety of the general public, spectators, vessels, and the marine environment from potential hazards during a fireworks display. This proposed rulemaking would prohibit persons and vessels from entering, transiting through, anchoring in, or remaining within the safety zone unless authorized by the Captain of the Port Charleston (COTP) or a designated representative. We invite your comments on this proposed rulemaking.",0.8291
19,"This notice is to invite applications for loans and grants under the Rural Economic Development Loan and Grant (REDLG) Programs for fiscal year (FY) 2018, subject to the availability of funding. This notice is being issued in order to allow applicants sufficient time to leverage financing, prepare and submit their applications, and give the Agency time to process applications within FY 2018. Successful applications will be selected by the Agency for funding and subsequently awarded to the extent that funding may ultimately be made available through appropriations. An announcement on the website at http://www.rd.usda.gov/newsroom/notices-solicitation-applications-nosas will identify the amount received in the appropriations. All applicants are responsible for any expenses incurred in developing their applications.",0.789927
3,We are extending the expiration date of Endocrine Disorders body system in the Listing of Impairments (listings) in our regulations. We are making no other revisions to the body system in this final rule. This extension ensures that we will continue to have the criteria we need to evaluate impairments in the affected body system at step three of the sequential evaluation processes for initial claims and continuing disability reviews.,0.719938
0,The quick brown fox jumps over the lazy dog.,0.570798
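
The two cells above sort the DataFrame twice with the same key. As a small illustration of the ranking logic outside `pandas`, the sketch below sorts a handful of the scores shown in the output once and slices both ends. The dictionary `scores` and the helper `rank_by_score` are hypothetical names, and only five of the scores are copied in.

```python
# Row index -> sim_score, copied from a few rows of the output above.
scores = {5: 0.960154, 16: 0.920878, 9: 0.916684, 3: 0.719938, 0: 0.570798}

def rank_by_score(score_by_index):
    """Return (index, score) pairs sorted from most to least similar."""
    return sorted(score_by_index.items(), key=lambda item: item[1], reverse=True)

ranked = rank_by_score(scores)   # sort once
most_similar = ranked[:2]        # head of the ranking
least_similar = ranked[-2:]      # tail of the ranking
print(most_similar)
print(least_similar)
```

In `pandas` itself, `df.nlargest(5, 'sim_score')` and `df.nsmallest(5, 'sim_score')` express the same idea without an explicit sort.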
