# Using spaCy for cosine similarity

Before running this code, install `spaCy` on your machine and download its English language model.

#### Install `spaCy` on a Windows machine, using Anaconda Prompt:
* create an Anaconda virtual environment:   
``` 
conda create --name spacyenv 
```   

* activate the virtual environment:   
```
conda activate spacyenv
```   
* install spaCy in the virtual environment:
```
conda install -c conda-forge spacy 
```
* download the small English model (the same `en_core_web_sm` package that the notebook loads below):
```
python -m spacy download en_core_web_sm
```
Additional documentation is [here](https://spacy.io/usage/).
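
Before opening the notebook, it can help to confirm that both packages are visible from the active environment. The following is a minimal sketch using only the standard library. The helper name `is_installed` is hypothetical, and the check assumes the model was installed as the `en_core_web_sm` package, which is the name loaded later in this notebook.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Assumption: the model is installed under the package name the notebook loads.
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

If either line reports `MISSING`, the most likely cause is that the commands above were run outside the `spacyenv` environment.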

#### Install and activate ipython kernel:

* install the [ipython kernel package](https://ipython.readthedocs.io/en/stable/install/kernel_install.html) - this allows you to use virtual environments in `jupyter` and `atom hydrogen`:
```
conda install ipykernel
```
* create a [new ipython kernel](https://stackoverflow.com/questions/39604271/conda-environments-not-showing-up-in-jupyter-notebook) and give it a name:
```
python -m ipykernel install --user --name spacyenv --display-name "Python (spacyenv)" 
```

#### How to change the kernel in a jupyter notebook:
<img src="changekernel.png" alt="change kernel" style="width: 600px;"/>

In [1]:
import spacy
import pandas as pd
from spacy.tokens import Doc
from spacy.vocab import Vocab

In [2]:
# Load the pre-trained small English model:
nlp = spacy.load('en_core_web_sm')

In [3]:
# Read in a CSV file with a column of text abstracts.
df = pd.read_csv('fedreg_18-05-22-14-45.csv')

#### Prepare the text data for processing

In [4]:
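df = df[['document_number', 'abstract']]  # Keep only the columns we need
df = df.head(20)  # Trim the dataset down to size, for example purposes
df = df.dropna(how='any')  # Drop any rows with missing data
# Decode byte strings to text. On Python 3, read_csv already returns str, which has no
# .decode() method, so those values are passed through unchanged.
df['abstract_utf'] = df['abstract'].apply(lambda x: x.decode('utf-8') if isinstance(x, bytes) else x)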
df.head()

| | document_number | abstract | abstract_utf |
|---|---|---|---|
| 0 | testing12345 | The quick brown fox jumps over the lazy dog. | The quick brown fox jumps over the lazy dog. |
| 1 | 2018-10583 | We are superseding Airworthiness Directive (AD... | We are superseding Airworthiness Directive (AD... |
| 2 | 2018-10902 | The Commodity Futures Trading Commission (Comm... | The Commodity Futures Trading Commission (Comm... |
| 3 | 2018-10867 | We are extending the expiration date of Endocr... | We are extending the expiration date of Endocr... |
| 4 | 2018-10872 | The Coast Guard has issued a temporary deviati... | The Coast Guard has issued a temporary deviati... |


In [5]:
# Run each abstract through the spaCy pipeline, producing one Doc object per row.
df['tokens'] = df['abstract_utf'].apply(nlp)

In [6]:
# Display the named entities found in the abstract at index 1.
spacy.displacy.render(df['tokens'][1], style='ent', jupyter=True)

#### Note that the abstracts at index 4 and 5 are similar but not identical, so we would expect them to have a high cosine similarity score.

In [7]:
print(df['abstract'][0])
print('\n')
print(df['abstract'][3])
print('\n')
print(df['abstract'][4])
print('\n')
print(df['abstract'][5])

The quick brown fox jumps over the lazy dog.


We are extending the expiration date of Endocrine Disorders body system in the Listing of Impairments (listings) in our regulations. We are making no other revisions to the body system in this final rule. This extension ensures that we will continue to have the criteria we need to evaluate impairments in the affected body system at step three of the sequential evaluation processes for initial claims and continuing disability reviews.


The Coast Guard has issued a temporary deviation from the operating schedule that governs the Monmouth County Highway Bridge (alternatively referred to as the ``Sea Bright Bridge'' or the ``S-32 Bridge'') across the Shrewsbury River, mile 4.0, at Sea Bright, New Jersey. This deviation will test a proposed change to the drawbridge operation schedule to determine whether a permanent change to the schedule is warranted. This deviation will allow the bridge to operate under an alternate schedule that seeks to ba

In [11]:
# Give the parsed abstracts short variable names.
doc0=df['tokens'][0]
doc3=df['tokens'][3]
doc4=df['tokens'][4]
doc5=df['tokens'][5]

In [12]:
# As expected, abstracts 4 and 5 are highly similar.
print(doc4.similarity(doc5)) 

0.9083937879900752
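
spaCy's documentation describes `Doc.similarity` as, by default, the cosine similarity of the two documents' vectors, where a document vector defaults to the average of its token vectors. The sketch below reproduces that calculation in plain Python on made-up three-dimensional vectors. The function names and the numbers are illustrative only and are not taken from the dataset or from spaCy's source.

```python
import math


def mean_vector(token_vectors):
    """Average a list of equal-length token vectors into one document vector."""
    count = len(token_vectors)
    return [sum(component) / count for component in zip(*token_vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Convention chosen here for all-zero vectors.
    return dot / (norm_u * norm_v)


# Toy token vectors (invented values, for illustration only).
doc_a = mean_vector([[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]])
doc_b = mean_vector([[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
print(cosine_similarity(doc_a, doc_b))
```

Because cosine similarity depends only on the angle between the two vectors, a long abstract and a short one can still score close to 1. Also note that, according to spaCy's documentation, the small models do not ship with word vectors, so scores from `en_core_web_sm` may discriminate less sharply than scores from a larger model.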


In [13]:
# Abstracts 4 and 3 are somewhat similar.
print(doc4.similarity(doc3)) 

0.8385613380704825


In [15]:
# Abstracts 4 and 0 are the least similar of the three pairs, as expected for an unrelated sentence.
print(doc4.similarity(doc0)) 

0.7396282670874782
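
The comparisons above are made one pair at a time. As a rough sketch of how the same idea extends to every pair at once, the hypothetical helper `rank_pairs` below scores all unordered pairs with whatever similarity function it is given and sorts them from most to least similar. It is demonstrated on stand-in numbers so that it runs without spaCy.

```python
from itertools import combinations


def rank_pairs(items, similarity):
    """Score every unordered pair of items and return them, most similar first.

    `items` maps a label to an object; `similarity` is any function that takes
    two such objects and returns a number.
    """
    scored = [
        (label_a, label_b, similarity(items[label_a], items[label_b]))
        for label_a, label_b in combinations(items, 2)
    ]
    return sorted(scored, key=lambda row: row[2], reverse=True)


# Stand-in data: plain numbers compared by negative distance (illustrative only).
# With the notebook's objects one could instead pass, for example,
# {'doc0': doc0, 'doc3': doc3, 'doc4': doc4} and lambda a, b: a.similarity(b).
toy_items = {"a": 1.0, "b": 1.5, "c": 4.0}
for label_a, label_b, score in rank_pairs(toy_items, lambda x, y: -abs(x - y)):
    print(label_a, label_b, score)
```

Passing the similarity function as an argument keeps the ranking logic independent of spaCy, so the same helper works with any other scoring method.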
