# Introduction to [spaCy](https://spacy.io/)
*Industrial-Strength Natural Language Processing in Python*

- spaCy is a Python library for natural language processing (NLP)
- it supports multiple languages and ships pretrained statistical models
- it provides tokenization, word vectors, tagging, parsing, segmentation, and more

Setup Resources:
- [spacy101](https://spacy.io/usage/spacy-101) 
- [Introduction to NLP with spaCy](https://towardsdatascience.com/a-short-introduction-to-nlp-in-python-with-spacy-d0aa819af3ad)

To install, open a terminal and run:
```
pip install -U spacy
```
After installation, you also need to download a language model:
```
python -m spacy download en_core_web_lg
```

To use spaCy with English:
```
import spacy
nlp = spacy.load("en_core_web_lg")
```

In [51]:
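%%capture
# Install spaCy from inside the notebook, using the kernel's own interpreter.
# Calling pip through subprocess is the supported route; pip's internal
# `main` function is not a public API and has moved between pip releases.
import subprocess
import sys

packages = ['spacy']
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-U'] + packages)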

In [57]:
!python3 -m spacy download --user en_core_web_lg

Collecting en_core_web_lg==2.2.5 from https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz#egg=en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9MB)

In [54]:
import spacy
nlp = spacy.load("en_core_web_lg")

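OSError: [E050] Can't find model 'en_core_web_lg'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

This error means spaCy could not find the model in the current environment. It typically appears when `spacy.load` runs before the model download has finished. If the model was installed while the kernel was already running, restarting the kernel may also be needed.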

### Tokenization
- split text into tokens: words, symbols, and punctuation marks

In [25]:
doc = nlp("The hungry, hungry catepillar ate all of the food, and then he became a butterfly!")
doc.text.split() 

['The',
 'hungry,',
 'hungry',
 'catepillar',
 'ate',
 'all',
 'of',
 'the',
 'food,',
 'and',
 'then',
 'he',
 'became',
 'a',
 'butterfly!']

Note that some of the punctuation gets attached to the previous word. We don't want that.

In [26]:
[token.orth_ for token in doc] 

['The',
 'hungry',
 ',',
 'hungry',
 'catepillar',
 'ate',
 'all',
 'of',
 'the',
 'food',
 ',',
 'and',
 'then',
 'he',
 'became',
 'a',
 'butterfly',
 '!']

remove punctuation by using `.is_punct`   
remove whitespace tokens by using `.is_space`   
remove stop words by using `.is_stop`   

In [29]:
[token.orth_ for token in doc if not (token.is_punct or token.is_space or token.is_stop)]

['hungry', 'hungry', 'catepillar', 'ate', 'food', 'butterfly']

Note how all the punctuation, whitespace, and stop words have been removed, leaving only the "important" words.
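
To make the two steps above concrete without spaCy, here is a minimal sketch that splits punctuation off words with a regular expression and then filters stop words. The `STOP_WORDS` set and both function names are hypothetical and hand-picked for this one sentence; spaCy's tokenizer and stop-word list are far more thorough.

```python
import re

# Hypothetical, hand-picked stop-word set for this example only;
# spaCy ships its own, much larger list for each language.
STOP_WORDS = {"the", "all", "of", "and", "then", "he", "became", "a"}


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def content_words(text):
    """Keep tokens that are neither punctuation nor stop words."""
    return [
        tok
        for tok in simple_tokenize(text)
        if re.match(r"\w", tok) and tok.lower() not in STOP_WORDS
    ]


sentence = "The hungry, hungry catepillar ate all of the food, and then he became a butterfly!"
tokens = simple_tokenize(sentence)
filtered = content_words(sentence)
print(tokens)
print(filtered)
```

Unlike `str.split()`, the regular expression emits `','` and `'!'` as their own tokens, which is what makes the later punctuation filter possible.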

### Lemmatization
- reduce a word to its base (root) form
- map the various word forms to a single citation form, the lemma

use spaCy's `.lemma_` attribute

In [32]:
lemma_words = "going gone went goes" 
nlp_lemma_words = nlp(lemma_words) 
[word.lemma_ for word in nlp_lemma_words] 

['go', 'go', 'go', 'go']

This is especially useful for text classification: lemmatising the text helps avoid counting different forms of the same word as separate features when building models such as bag-of-words.
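
The effect on a bag-of-words model can be sketched in a few lines of plain Python. The `LEMMAS` lookup table and the `bag_of_words` helper are hypothetical stand-ins for illustration; in practice the lemmas would come from `.lemma_`.

```python
from collections import Counter

# Hypothetical lemma lookup, standing in for spaCy's `.lemma_` attribute.
LEMMAS = {"going": "go", "gone": "go", "went": "go", "goes": "go"}


def bag_of_words(words, lemmatize=False):
    """Count word occurrences, optionally mapping each word to its lemma first."""
    if lemmatize:
        words = [LEMMAS.get(word, word) for word in words]
    return Counter(words)


words = "going gone went goes".split()
raw_counts = bag_of_words(words)
lemma_counts = bag_of_words(words, lemmatize=True)

# Without lemmatisation every form is its own feature;
# with it, all four forms collapse into a single "go" feature.
print(len(raw_counts), dict(raw_counts))
print(len(lemma_counts), dict(lemma_counts))
```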

### Parts-of-speech (POS) Tagging
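- assign a part-of-speech label (noun, verb, adjective, and so on) to each word
- spaCy's fine-grained English tags follow the [Penn Treebank POS tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

use the `.pos_` attribute for the coarse-grained part of speech and the `.tag_` attribute for the fine-grained tag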

In [36]:
doc2 = nlp("My dog's toy actually belongs to the neighbor's cat.") 
pos_tags = [(i, i.tag_) for i in doc2]
pos_tags

[(My, 'PRP$'),
 (dog, 'NN'),
 ('s, 'POS'),
 (toy, 'NN'),
 (actually, 'RB'),
 (belongs, 'VBZ'),
 (to, 'IN'),
 (the, 'DT'),
 (neighbor, 'NN'),
 ('s, 'POS'),
 (cat, 'NN'),
 (., '.')]

Create a list of owner-possession tuples by taking the tokens immediately before and after each possessive marker (tag `POS`):

In [37]:
[(i[0].nbor(-1), i[0].nbor(+1)) for i in pos_tags if i[1] == "POS"] 

[(dog, toy), (neighbor, cat)]
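
The neighbor lookup can also be written against a plain list of `(word, tag)` pairs, which makes the index arithmetic behind `nbor(-1)` and `nbor(+1)` explicit. This is a sketch only: the `tagged` list is copied by hand from the output above, and `possession_pairs` is a hypothetical helper name.

```python
# (word, tag) pairs copied from the tagged sentence above.
tagged = [
    ("My", "PRP$"), ("dog", "NN"), ("'s", "POS"), ("toy", "NN"),
    ("actually", "RB"), ("belongs", "VBZ"), ("to", "IN"), ("the", "DT"),
    ("neighbor", "NN"), ("'s", "POS"), ("cat", "NN"), (".", "."),
]


def possession_pairs(tagged_tokens):
    """Pair the word before each possessive marker with the word after it."""
    pairs = []
    last_index = len(tagged_tokens) - 1
    for i, (_, tag) in enumerate(tagged_tokens):
        # Skip a marker at either end of the list, where a neighbor is missing.
        if tag == "POS" and 0 < i < last_index:
            owner = tagged_tokens[i - 1][0]
            possession = tagged_tokens[i + 1][0]
            pairs.append((owner, possession))
    return pairs


pairs = possession_pairs(tagged)
print(pairs)
```

This approach only looks one token to each side, so a phrase such as "the neighbor's black cat" would pair "neighbor" with "black" rather than "cat".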

### Word Vectors
- a word embedding represents each word as a vector of real numbers that captures the word's meaning and context
- each word has a unique embedding
- word embeddings are multidimensional
- similar words have similar vectors

Resources:
- [spacy.io: Word Vectors and Semantic Similarity](https://spacy.io/usage/vectors-similarity)
- [Get Busy With Word Embeddings](https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/)
- [Word Embeddings in Python with Spacy and Gensim](https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/)

spaCy provides pre-trained word embeddings, which were downloaded together with the English model. spaCy can parse entire blocks of text and assign word vectors using the loaded model. Use `.vector` to get the vector for a word.

Important Note: spaCy's small models (models that end in `sm`) don't ship with word vectors. You can still use `.similarity` to compare words, but the results won't be as good. To use real word vectors, make sure to download one of the large models:
```
python -m spacy download en_core_web_lg
```

In [50]:
tokens = nlp("cat dog water cloud")
print(tokens[0].text, tokens[0].vector)

cat [ 0.75340027  1.3175309  -1.7617409   0.31828502  1.638488    1.0552995
  0.31595042  5.7819796   0.0343391   4.0019946   5.2300787   0.15606856
  3.4243002  -2.4221869   1.6035937   1.337295   -1.2828892   1.8265244
 -1.3817635  -2.1414158  -0.18950051  0.08884555 -0.27133894 -0.47963017
 -0.2571426  -2.404962    1.0642331  -3.212206    0.37124443  1.3374927
  0.8305371  -0.90479285  0.8499906  -2.0038836  -0.9727297  -1.1461185
  3.1226678   0.9663619  -3.0638602   2.833336    1.0400918   1.2719803
 -1.575497   -3.35216    -0.17291349 -2.689811    0.5845911  -1.7116385
 -0.42257053 -0.7932979  -1.072156   -0.07230203  0.09205103 -0.05305272
 -2.469643    1.3820657   2.0382776   1.9970671  -0.41650915 -0.9046292
  1.887304   -2.9841347  -0.55531263  1.2111204  -1.9578846  -2.7545862
  1.9617162  -4.4497204   1.0900779   2.837864   -1.7545315   2.7429385
  1.5484421  -0.05734076 -1.2939063  -0.29932067  1.5413947  -1.9647777
 -0.3654107  -0.79876906  2.2411642  -1.6222702  -3.13208

Now we can use the word vectors from spaCy to compare how similar the words are, using `.similarity`.

In [47]:
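# Words taken from the output below; "afskfsd" is a made-up, out-of-vocabulary word.
tokens = nlp("dog cat banana afskfsd")
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))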

dog dog 1.0
dog cat 0.48059145
dog banana 0.37596428
dog afskfsd 0.1499375
cat dog 0.48059145
cat cat 1.0
cat banana 0.4528873
cat afskfsd 0.26725474
banana dog 0.37596428
banana cat 0.4528873
banana banana 1.0
banana afskfsd 0.4848014
afskfsd dog 0.1499375
afskfsd cat 0.26725474
afskfsd banana 0.4848014
afskfsd afskfsd 1.0
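
By default, `.similarity` compares two vectors by their cosine similarity, which is why every word scores 1.0 against itself. The sketch below computes the same measure by hand. The three-dimensional `toy_vectors` are made up for illustration and are not real spaCy embeddings, and `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors; real word vectors have hundreds of dimensions.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

for word1, vec1 in toy_vectors.items():
    for word2, vec2 in toy_vectors.items():
        print(word1, word2, round(cosine_similarity(vec1, vec2), 3))
```

Because the measure depends only on the angle between the vectors, it is symmetric, which matches the table above: "dog cat" and "cat dog" get the same score.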




Other Resources
- [PythonForLinguistsTalk Gitlab](https://gitlab.com/andersonh/PythonForLinguistsTalk)