# Introduction to [spaCy](https://spacy.io/)
*Industrial-Strength Natural Language Processing in Python*

- spaCy is a Python library for NLP
- supports multiple languages and statistical models
- provides support for tokenization, word vectors, tagging, parsing, segmentation, and more

Setup Resources:
- [spacy101](https://spacy.io/usage/spacy-101) 
- [Introduction to NLP with spaCy](https://towardsdatascience.com/a-short-introduction-to-nlp-in-python-with-spacy-d0aa819af3ad)

To install, open a terminal and run:
```
pip install -U spacy
```
After installation, you also need to download the language model:
```
python -m spacy download en_core_web_lg
```

To use spaCy with English:
```
import spacy
nlp = spacy.load("en_core_web_lg")
```
Install from the terminal first before trying to install from within this Jupyter notebook.
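
To check whether the terminal install is already visible to this notebook's kernel, a quick standard-library sketch like the one below may help. The helper name `is_installed` is made up for this example, and it assumes the model is installed as a regular Python package named `en_core_web_lg`, which is how `spacy download` normally sets it up.

```python
import importlib.util


def is_installed(package_name):
    """Return True if this Python interpreter can find the package."""
    return importlib.util.find_spec(package_name) is not None


# Report whether the library and the English model can be imported from here.
for name in ("spacy", "en_core_web_lg"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either name is reported as missing, the install cells below are one way to add it from inside the notebook.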

In [1]:
%%capture
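# Install spaCy into the environment used by this notebook's kernel.
# pip does not support being imported as a library, so call it through
# the current interpreter instead.
import subprocess
import sys

packages = ['spacy']
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-U'] + packages)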

In [None]:
%%capture
!python -m spacy download en_core_web_lg

In [3]:
import spacy
nlp = spacy.load('en_core_web_lg')

### Tokenization
- split text into words, symbols, and punctuation, a.k.a. tokens

In [None]:
doc = nlp("The hungry, hungry caterpillar ate all of the food, and then he became a butterfly!")
doc.text.split() 

Note that splitting on whitespace leaves some punctuation attached to the preceding word (for example, `hungry,`). We don't want that, so we iterate over spaCy's tokens instead.

In [None]:
[token.orth_ for token in doc] 
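
The difference between the two cells above can also be seen without a loaded model. The sketch below uses only the standard library, and the regular expression is a deliberately simplified stand-in for a tokenizer, not spaCy's actual tokenization rules.

```python
import re

text = "The hungry, hungry caterpillar ate all of the food, and then he became a butterfly!"

# Whitespace splitting leaves punctuation glued to the preceding word.
whitespace_tokens = text.split()

# Simplified stand-in for a tokenizer: runs of word characters, or a single
# character that is neither a word character nor whitespace (punctuation).
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens[1])   # hungry,
print(regex_tokens[1:3])      # ['hungry', ',']
```

A real tokenizer also handles cases this pattern gets wrong, such as contractions and abbreviations, which is why the notebook uses spaCy's tokens.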

Remove punctuation using `.is_punct`  
Remove whitespace using `.is_space`  
Remove stop words using `.is_stop`  

In [None]:
[token.orth_ for token in doc if not (token.is_punct or token.is_space or token.is_stop)]

Note how all the punctuation, whitespace, and stop words have been removed, leaving only the "important" words.

### Lemmatization
- reduce a word to its base or root form
- map the various forms of a word to their citation form

Use spaCy's `.lemma_` attribute.

In [None]:
lemma_words = "going gone went goes" 
nlp_lemma_words = nlp(lemma_words) 
[word.lemma_ for word in nlp_lemma_words] 

This is especially useful for text classification: lemmatizing the text collapses the different forms of a word into one feature, which avoids duplicate entries when building models such as a bag-of-words model.
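
As a rough illustration of that effect, the sketch below counts words before and after lemmatization using only the standard library. The `LEMMAS` dictionary is a hypothetical hand-written lookup for this example; in the notebook the lemmas would come from `token.lemma_`.

```python
from collections import Counter

# Hypothetical lemma lookup for illustration only; in practice the lemmas
# would come from spaCy's token.lemma_.
LEMMAS = {"going": "go", "gone": "go", "went": "go", "goes": "go"}

words = "going gone went goes".split()

# Without lemmatization, every word form is its own bag-of-words feature.
raw_counts = Counter(words)

# With lemmatization, all four forms collapse into the single feature "go".
lemma_counts = Counter(LEMMAS.get(word, word) for word in words)

print(len(raw_counts))   # 4
print(lemma_counts)      # Counter({'go': 4})
```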

### Parts-of-speech (POS) Tagging
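- assign part-of-speech tags to words
- spaCy exposes two tag sets: `.pos_` holds the coarse-grained Universal POS tag, and `.tag_` holds the fine-grained tag, which for English follows the [Penn Treebank POS tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

Use the `.pos_` and `.tag_` attributes.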

In [None]:
doc2 = nlp("My dog's toy actually belongs to the neighbor's cat.") 
pos_tags = [(i, i.tag_) for i in doc2]
pos_tags

Create a list of owner-possession tuples by pairing the tokens on either side of each possessive marker (tag `POS`).

In [None]:
[(i[0].nbor(-1), i[0].nbor(+1)) for i in pos_tags if i[1] == "POS"] 
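
The cell above relies on `Token.nbor`, which looks up a token relative to the current one. The same neighbor logic can be sketched in plain Python over a list of (word, tag) pairs. The tags below were typed by hand following Penn Treebank conventions. They are an assumption for illustration, not captured model output.

```python
# Hand-written (word, tag) pairs for illustration; assumed, not model output.
tagged = [
    ("My", "PRP$"), ("dog", "NN"), ("'s", "POS"), ("toy", "NN"),
    ("actually", "RB"), ("belongs", "VBZ"), ("to", "IN"), ("the", "DT"),
    ("neighbor", "NN"), ("'s", "POS"), ("cat", "NN"), (".", "."),
]

# For every possessive marker, pair the word before it (owner) with the
# word after it (possession). The bounds check skips a marker that has no
# neighbor on one side.
pairs = [
    (tagged[i - 1][0], tagged[i + 1][0])
    for i, (_, tag) in enumerate(tagged)
    if tag == "POS" and 0 < i < len(tagged) - 1
]

print(pairs)  # [('dog', 'toy'), ('neighbor', 'cat')]
```

Adjacency is only a heuristic: in a phrase like "the dog's red toy" it would pair the owner with "red" rather than "toy".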

### Word Vectors
- the idea behind word embeddings is that every word can be represented as a vector of real numbers that captures the word's meaning and context
- each word has a unique embedding
- word embeddings are multidimensional
- similar words have similar embedding values

Resources:
- [spacy.io: Word Vectors and Semantic Similarity](https://spacy.io/usage/vectors-similarity)
- [Get Busy With Word Embeddings](https://www.shanelynn.ie/get-busy-with-word-embeddings-introduction/)
- [Word Embeddings in Python with Spacy and Gensim](https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/)

spaCy provides pre-trained word embeddings, which were downloaded along with the English model. spaCy can parse entire blocks of text and assign word vectors using the loaded model. Then, use `.vector` to get a token's word vector.

Important Note: spaCy's small models (models that end in `sm`) don't ship with word vectors. You can still use `.similarity` to compare tokens, but the results won't be as good. To use real word vectors, make sure to download a larger model, such as the large English model:
```
python -m spacy download en_core_web_lg
```

In [None]:
tokens = nlp("cat dog water cloud")
print(tokens[0].text, tokens[0].vector)

Now we can use the word vectors we got from spaCy to compare how similar the words are using `.similarity`.

In [None]:
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
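
By default, `.similarity` scores a pair of tokens by the cosine similarity of their vectors. The sketch below shows that computation in plain Python. The 3-dimensional `toy_vectors` are made up for illustration; real spaCy vectors have many more dimensions and different values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-d vectors for illustration only; not taken from any model.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "cloud": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_cloud = cosine_similarity(toy_vectors["cat"], toy_vectors["cloud"])

# Vectors pointing in similar directions score closer to 1.
print(cat_dog > cat_cloud)  # True
```

A score near 1 means the two vectors point in almost the same direction, which is why comparing a word with itself gives roughly 1.0.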

Other Resources
- [PythonForLinguistsTalk Gitlab](https://gitlab.com/andersonh/PythonForLinguistsTalk)