# Using `spaCy` for cosine similarity

Before running this code, install `spaCy` on your machine and download its medium English model, `en_core_web_md`, which is loaded in cell 2.
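
The usual setup is `pip install spacy` followed by `python -m spacy download en_core_web_md`. Because spaCy models are installed as ordinary Python packages, a quick check like the one below can confirm that both are present before the notebook runs. This is a minimal sketch; the helper name `model_installed` is made up for illustration and is not part of spaCy.

```python
import importlib.util


def model_installed(package_name: str) -> bool:
    """Return True if an importable package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


# Report whether spaCy and the medium English model are available.
for name in ("spacy", "en_core_web_md"):
    status = "found" if model_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either name is reported as missing, run the corresponding install command before continuing.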

In [1]:
import spacy
import pandas as pd
from spacy.tokens import Doc
from spacy.vocab import Vocab

In [2]:
# Load the pretrained medium English model, which includes word vectors:
nlp = spacy.load('en_core_web_md')

In [3]:
# Read in a CSV file with a column of text abstracts.
df = pd.read_csv('resources/fedreg.csv')

#### Prepare the text data for processing

In [4]:
df = df[['document_number', 'abstract']]  # Keep only the columns we need
df = df.head(20)  # Trim the dataset down to size, for example purposes
df = df.dropna(how='any')  # Drop any rows with missing data
df.head()
df.to_csv('resources/fedreg.csv', index=False)  # Caution: this overwrites the source CSV with the trimmed data

In [5]:
# Run each abstract through the spaCy pipeline; every cell in 'tokens' holds a Doc object.
df['tokens'] = df['abstract'].apply(nlp)
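
For larger datasets, spaCy's `nlp.pipe` processes texts in batches, which is generally faster than calling `nlp` on one text at a time. The sketch below wraps that call; the function name `vectorize_texts` and the batch size of 50 are illustrative assumptions, not part of the original notebook.

```python
def vectorize_texts(texts, nlp, batch_size=50):
    """Run every text through the pipeline in batches and return the results as a list.

    `nlp` is expected to be a loaded spaCy pipeline (anything exposing `.pipe`).
    """
    return list(nlp.pipe(texts, batch_size=batch_size))


# Possible usage in this notebook (assumes `nlp` and `df` from the cells above):
# df['tokens'] = vectorize_texts(df['abstract'], nlp)
```

The returned list is assigned to the column by position, so it works even when `dropna` has left gaps in the index.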

In [6]:
# Display named entities for the second abstract (index 1).
spacy.displacy.render(df['tokens'][1], style='ent', jupyter=True)

#### What this should look like:
<img src="resources/displacy2.png" alt="airworthiness directive" style="width: 1000px;"/>
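
Outside a notebook, where the rendered markup is less convenient, the same entities can be listed as plain text. The sketch below reads them from the `ents` attribute of a processed `Doc`; the helper name `list_entities` is hypothetical.

```python
def list_entities(doc):
    """Return (text, label) pairs for every named entity found in a spaCy Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


# Possible usage in this notebook (assumes `df` from the cells above):
# for text, label in list_entities(df['tokens'][1]):
#     print(f"{label:10} {text}")
```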

#### Note that the abstracts at index 4 and index 5 are similar but not identical. We would expect these to have a high cosine similarity score.
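
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction rather than magnitude. By default, spaCy's `Doc.similarity` applies this measure to each document's vector, which is the average of its token vectors. The sketch below computes it by hand on small made-up vectors; it is an illustration of the formula, not spaCy's own implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here: a zero vector is similar to nothing.
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1; perpendicular vectors score 0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```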

In [7]:
print(df['abstract'][0])
print('\n')
print(df['abstract'][3])
print('\n')
print(df['abstract'][4])
print('\n')
print(df['abstract'][5])

The quick brown fox jumps over the lazy dog.


We are extending the expiration date of Endocrine Disorders body system in the Listing of Impairments (listings) in our regulations. We are making no other revisions to the body system in this final rule. This extension ensures that we will continue to have the criteria we need to evaluate impairments in the affected body system at step three of the sequential evaluation processes for initial claims and continuing disability reviews.


The Coast Guard has issued a temporary deviation from the operating schedule that governs the Monmouth County Highway Bridge (alternatively referred to as the ``Sea Bright Bridge'' or the ``S-32 Bridge'') across the Shrewsbury River, mile 4.0, at Sea Bright, New Jersey. This deviation will test a proposed change to the drawbridge operation schedule to determine whether a permanent change to the schedule is warranted. This deviation will allow the bridge to operate under an alternate schedule that seeks to ba

In [8]:
# Assign variable names.
doc0 = df['tokens'][0]
doc3 = df['tokens'][3]
doc4 = df['tokens'][4]
doc5 = df['tokens'][5]

In [9]:
# As expected, abstracts 4 and 5 are highly similar.
print(doc4.similarity(doc5)) 

0.9731837737779757


In [10]:
# Abstracts 4 and 3 are less similar, though the score is still high.
print(doc4.similarity(doc3))

0.9263738332219478


In [11]:
# Abstracts 4 and 0 are the least similar of the three pairs.
print(doc4.similarity(doc0))

0.7740400790450174
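
Even the unrelated sentence scores about 0.77. One likely reason is that a document vector is an average over all of its tokens, so common function words shared by most English texts pull the averages toward each other. The sketch below illustrates the effect with three hypothetical 3-dimensional word vectors, invented purely for this example; real spaCy vectors have many more dimensions.

```python
import math

# Hypothetical toy word vectors, invented for illustration only.
toy_vectors = {
    "the":    [0.9, 0.1, 0.0],
    "bridge": [0.1, 0.9, 0.1],
    "fox":    [0.1, 0.1, 0.9],
}


def average_vector(words):
    """Average the toy vectors of the given words, component by component."""
    vecs = [toy_vectors[w] for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


content_only = cosine(toy_vectors["bridge"], toy_vectors["fox"])
with_shared_words = cosine(
    average_vector(["the", "the", "bridge"]),
    average_vector(["the", "the", "fox"]),
)
print(content_only)       # The two content words on their own
print(with_shared_words)  # The same words averaged with shared function words
```

In this toy setup the second score is noticeably higher than the first, which suggests the scores in this notebook are best read relative to one another rather than as absolute measures.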


## Add a new column to the `pandas` DataFrame, indicating similarity to a given text.

In [12]:
# A test text deliberately built to resemble a few existing abstracts, especially the one at index 5, with 'TEST' tokens inserted as noise.
test_text='TEST TEST TEST The Coast TEST Guard has TEST issued TEST a temporary deviation TEST from the operating schedule that governs the Loop Parkway Bridge across Long Creek, TEST mile 0.7, and TEST Meadowbrook State Parkway Bridge across TEST Sloop Channel, mile 12.8, both at Hempstead, New York. This deviation is necessary in order to facilitate a motorcycle ride event and allows both bridges to remain in the closed position for two hours. TEST TEST TEST'

In [14]:
# Vectorize the test text and display its similarity to the abstract at index 5.
test_doc = nlp(test_text)
doc5.similarity(test_doc)

0.9556445091009929

In [15]:
# For every abstract in the dataset, compute its similarity to the test document and store it in a new column.
df['sim_score'] = df['tokens'].apply(lambda x: x.similarity(test_doc))
df.head(3)

Unnamed: 0,document_number,abstract,tokens,sim_score
0,testing12345,The quick brown fox jumps over the lazy dog.,"(The, quick, brown, fox, jumps, over, the, laz...",0.760118
1,2018-10583,We are superseding Airworthiness Directive (AD...,"(We, are, superseding, Airworthiness, Directiv...",0.902931
2,2018-10902,The Commodity Futures Trading Commission (Comm...,"(The, Commodity, Futures, Trading, Commission,...",0.84542


In [16]:
# Show the five most similar abstracts, in descending order of similarity.
pd.set_option('display.max_colwidth', 1000)
df.sort_values(['sim_score'], ascending=False)[['abstract', 'sim_score']].head()

Unnamed: 0,abstract,sim_score
5,"The Coast Guard has issued a temporary deviation from the operating schedule that governs the Loop Parkway Bridge across Long Creek, mile 0.7, and Meadowbrook State Parkway Bridge across Sloop Channel, mile 12.8, both at Hempstead, New York. This deviation is necessary in order to facilitate a motorcycle ride event and allows both bridges to remain in the closed position for two hours.",0.955645
4,"The Coast Guard has issued a temporary deviation from the operating schedule that governs the Monmouth County Highway Bridge (alternatively referred to as the ``Sea Bright Bridge'' or the ``S-32 Bridge'') across the Shrewsbury River, mile 4.0, at Sea Bright, New Jersey. This deviation will test a proposed change to the drawbridge operation schedule to determine whether a permanent change to the schedule is warranted. This deviation will allow the bridge to operate under an alternate schedule that seeks to balance the seasonally high volume of roadway traffic crossing the bridge during peak hours with the existing needs of marine traffic.",0.940076
7,"The Coast Guard will enforce the temporary safety zone for the Annual FKCC Swim Around Key West, in Key West, Florida from 9 a.m. until 6 p.m. on June 30, 2018. Our regulation for Recurring Safety Zones in Captain of the Port Key West Zone identifies the regulated area for this event. This action is necessary to ensure the safety of event participants and spectators. During the enforcement period, no person or vessel may enter, transit through, anchor in, or remain within the regulated area without approval from the Captain of the Port Key West or a designated representative.",0.913895
6,"The Coast Guard will enforce the temporary safety zone for the 42nd Annual Swim Around Key West, Key West, Florida from 9 a.m. until 6 p.m. on June 16, 2018. Our regulation for Recurring Safety Zones in Captain of the Port Key West Zone identifies the regulated area for this event. This action is necessary to ensure the safety of event participants and spectators. During the enforcement period, no person or vessel may enter, transit through, anchor in, or remain within the regulated area without approval from the Captain of the Port Key West or a designated representative.",0.913186
14,"The Coast Guard proposes to establish a temporary safety zone for certain navigable waters of Cooper River at Patriot's Point in Charleston, SC. This action is necessary to provide for the safety of the general public, spectators, vessels, and the marine environment from potential hazards during a fireworks display. This proposed rulemaking would prohibit persons and vessels from entering, transiting through, anchoring in, or remaining within the safety zone unless authorized by the Captain of the Port Charleston (COTP) or a designated representative. We invite your comments on this proposed rulemaking.",0.909708
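
The same ranking can be done without a DataFrame, which may be handy when the documents live in a plain list. The sketch below accepts any objects that expose a `similarity` method, as spaCy `Doc` objects do; the function name `rank_by_similarity` and its parameters are illustrative assumptions.

```python
def rank_by_similarity(labelled_docs, query_doc, top_n=5):
    """Return the top_n (label, score) pairs, highest similarity first.

    `labelled_docs` is an iterable of (label, doc) pairs, where each doc
    has a `.similarity(other)` method.
    """
    scored = [(label, doc.similarity(query_doc)) for label, doc in labelled_docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Possible usage in this notebook (assumes `df` and `test_doc` from the cells above):
# rank_by_similarity(zip(df['document_number'], df['tokens']), test_doc)
```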


In [17]:
# Show the five least similar abstracts (still sorted in descending order, so the lowest score comes last).
pd.set_option('display.max_colwidth', 1000)
df.sort_values(['sim_score'], ascending=False)[['abstract', 'sim_score']].tail()

Unnamed: 0,abstract,sim_score
11,"The Federal Communications Commission (Commission) published a document in the Federal Register on May 14, 2018, announcing that the Office of Management and Budget (OMB) has approved, for a period of three years, the information collection associated with the Commission's discontinuance rules. The document incorrectly referred to the Commission's discontinuance rules rather than its network change disclosure rules.",0.850176
9,"This document corrects technical errors that appeared in the final rule with comment period and interim final rule with comment period published in the Federal Register on November 16, 2017 entitled ``Medicare Program; CY 2018 Updates to the Quality Payment Program; and Quality Payment Program: Extreme and Uncontrollable Circumstance Policy for the Transition Year'' (hereinafter referred to as the ``CY 2018 Quality Payment Program final rule'').",0.850033
2,"The Commodity Futures Trading Commission (Commission or CFTC) is granting an exemption to certain member firms designated by the National Stock Exchange of India Ltd. (NSE) from the application of certain of the Commission's foreign futures and option regulations based upon substituted compliance with certain comparable regulatory and self-regulatory requirements of a foreign regulatory authority consistent with conditions specified by the Commission, as set forth herein. This Order is issued pursuant to Commission Regulation 30.10, which permit persons to file a petition with the Commission for exemption from the application of certain of the Regulations set forth in part 30 and authorizes the Commission to grant such an exemption if such action would not be otherwise contrary to the public interest or to the purposes of the provision from which exemption is sought. The Commission notes that this Order does not pertain to any transaction in swaps, as defined in Section 1a(47) of t...",0.84542
18,"In accordance with the Paperwork Reduction Act of 1995 this notice announces the USDA, NFC's intention to request a review of a currently approved information collection for the Direct Premium Remittance System (DPRS) Form DPRS-2809.",0.843965
0,The quick brown fox jumps over the lazy dog.,0.760118
