# Natural Language Processing with SpaCy

A quick exploration of SpaCy's capabilities, for potential future use!

In [14]:
%matplotlib inline

from __future__ import unicode_literals

import spacy
import pandas as pd
import numpy

The first step is to construct a natural language processing pipeline using spacy.load():

In [3]:
nlp = spacy.load('en')

Now I want to load the data. Since I have a sequence of documents, I will use .pipe() to process the texts in parallel.

In [21]:
data = pd.read_csv('train.csv', encoding='utf8')

In [22]:
len(data)

404290

In [23]:
data.head()

| Unnamed: 0 | id | qid1 | qid2 | question1 | question2 | is_duplicate |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


## First exploration of SpaCy

SpaCy works by linguistically annotating text, so once you process text with SpaCy, you get access to a whole range of additional information about it. For instance, let's start by processing a single question from the Quora question database.

## Tokenizing a single question

In [48]:
doc = nlp(data.question1[2])
doc

How can I increase the speed of my internet connection while using a VPN?

Now, I can break this question into sentences: 

In [49]:
print(list(doc.sents))

[How can I increase the speed of my internet connection while using a VPN?]


I can also 'tokenize' the sentence. This breaks the sentence down into its components, such as words, punctuation, and whitespace:

In [50]:
for token in doc:
    print(token)

How
can
I
increase
the
speed
of
my
internet
connection
while
using
a
VPN
?


Each token carries information about itself. For instance, I can look up a token's lemma (.lemma_), its part of speech (.pos_), its orthographic shape (.shape_), or its entity type (.ent_type_) if it is part of a named entity such as a name. Here, only VPN is tagged as an entity, so only VPN has a non-empty ent_type_.

Finally, I can also find the log probability (.prob) of a word appearing in a large corpus.
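
To make the idea of an orthographic shape more concrete, here is a minimal pure-Python sketch of how such a shape string could be derived. The function name `approximate_shape` and the `max_run` parameter are hypothetical, and this only approximates the kind of value SpaCy exposes; it is not SpaCy's own implementation.

```python
def approximate_shape(text, max_run=4):
    """Map letters to x/X and digits to d, truncating long runs.

    Hypothetical helper for illustration only; not part of SpaCy.
    """
    shape = []
    run_char, run_len = None, 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and other symbols are kept as-is
        # Track how long the current run of identical shape characters is.
        if mapped == run_char:
            run_len += 1
        else:
            run_char, run_len = mapped, 1
        # Keep at most max_run repeats so long words share one shape.
        if run_len <= max_run:
            shape.append(mapped)
    return "".join(shape)


for word in ["How", "increase", "VPN", "?"]:
    print("{} -> {}".format(word, approximate_shape(word)))
```

Under this sketch, words with the same capitalization pattern collapse to the same shape, which is what makes the attribute useful as a coarse feature.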

In [95]:
for token in doc: 
    print ('{} -> {} -> {} -> {} -> {}'.format(token, token.lemma_, token.pos_, token.ent_type_, token.prob))

How -> how -> ADV ->  -> -7.759141922
can -> can -> VERB ->  -> -5.9138712883
I -> -PRON- -> PRON ->  -> -4.06418085098
increase -> increase -> VERB ->  -> -9.37685871124
the -> the -> DET ->  -> -3.42544579506
speed -> speed -> NOUN ->  -> -9.76106262207
of -> of -> ADP ->  -> -4.12846374512
my -> -PRON- -> ADJ ->  -> -5.91812467575
internet -> internet -> NOUN ->  -> -8.6765203476
connection -> connection -> NOUN ->  -> -10.2381095886
while -> while -> ADP ->  -> -7.63133382797
using -> use -> VERB ->  -> -7.9104590416
a -> a -> DET ->  -> -3.98307538033
VPN -> vpn -> PROPN -> ORG -> -19.5793132782
? -> ? -> PUNCT ->  -> -4.93188428879
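
Log probabilities are easier to interpret once converted back to plain probabilities. The sketch below does this for a few of the values printed above. It assumes that .prob is a natural-log probability, which is an assumption on my part, and the variable names are purely illustrative.

```python
import math

# Log probabilities copied from the output above.
log_probs = {
    "the": -3.42544579506,
    "I": -4.06418085098,
    "speed": -9.76106262207,
    "VPN": -19.5793132782,
}

# Assumption: the values are natural logarithms, so exp() recovers
# the underlying probability estimate.
probabilities = {word: math.exp(lp) for word, lp in log_probs.items()}

# Print from most to least frequent.
for word in sorted(probabilities, key=probabilities.get, reverse=True):
    print("{:>6}: {:.3g}".format(word, probabilities[word]))
```

Read this way, 'the' would account for roughly 3% of tokens in the corpus, while 'VPN' is several orders of magnitude rarer.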


These tokens also allow SpaCy to identify noun chunks: phrases built around a noun, consisting of the noun itself plus the words that describe it.

In [94]:
print ([chunk for chunk in doc.noun_chunks])

[I, the speed, my internet connection, a VPN]


SpaCy also automatically includes 300-dimensional word vectors (these are taken from the [GloVe](https://nlp.stanford.edu/projects/glove/) vectors, which are - conveniently - the vectors I used before).

In [105]:
# token is still bound to the last token from the loop above ('?')
token.vector.shape

(300,)

Using these vectors, I can then compute the similarity between words:

In [103]:
print ("'" + str(doc[2]) + "' has a cosine similarity of " + str(doc[2].similarity(doc[7])) + " with '" + str(doc[7]) + "'")

'I' has a cosine similarity of 0.577479287934 with 'my'
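
For reference, here is a minimal sketch of the cosine similarity calculation itself, assuming (as the printed message suggests) that .similarity() compares the two word vectors by cosine. The function `cosine_similarity` and the toy vectors are made up for illustration and are not real word vectors.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors (illustrative only).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print("same direction: {:.3f}, orthogonal: {:.3f}".format(same_direction, orthogonal))
```

Vectors pointing the same way score close to 1 regardless of their length, and unrelated (orthogonal) vectors score 0, so a value of about 0.58 for 'I' and 'my' indicates moderately similar vectors.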
